VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

Ultra-high-resolution editing comparison between input, VINS-120K adaptation, and Kontext plus super-resolution.

Abstract

Directly editing ultra-high-resolution (UHR) images is valuable but still underexplored, mainly because high-quality 4K editing data is scarce and high-frequency texture is difficult to model. VINS-120K introduces 120K carefully curated triplets, each consisting of an instruction, an input image, and an edited image. Every image exceeds 4K resolution, with an average size of 4656 × 4138. Built on this dataset, the paper proposes a high-frequency-aware post-adaptation strategy that extends pretrained non-high-resolution (NHR) editors to the UHR regime and evaluates them on VINS-4KEval, a 509-sample 4K benchmark covering 13 edit types.

VINS-120K Dataset

VINS-120K is built from real-world 8K-UHD videos and carefully filtered open-source editing data, preserving native detail while expanding long-tail edit coverage.

Overview of VINS-120K edited triplets across editing types.

120K UHR editing triplets

4656 × 4138 average resolution

13 editing types

509 VINS-4KEval samples

High-Frequency-Aware Post-Adaptation

The method adapts pretrained NHR editors to 4K by stabilizing very long token sequences and explicitly supervising high-frequency detail.

Attention Rescaling

UHR images produce much longer token sequences, which can over-smooth attention. A resolution-aware temperature sharpens attention responses back toward the pretrained regime.
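The page does not state the temperature formula, so the following is only a minimal sketch under an assumption: it multiplies the attention logits by sqrt(log n_test / log n_train), an entropy-motivated factor often used when a model sees more tokens than it was trained on. The helper names attention_scale, softmax, and entropy are hypothetical and are not taken from the authors' code.

```python
import math


def attention_scale(n_train: int, n_test: int) -> float:
    """Assumed resolution-aware factor: sqrt(log(n_test) / log(n_train)).

    Equals 1.0 at the training length and grows slowly with sequence length.
    """
    return math.sqrt(math.log(n_test) / math.log(n_train))


def softmax(logits, scale=1.0):
    """Numerically stable softmax over logits multiplied by `scale`."""
    scaled = [scale * x for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means flatter (smoother) attention."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative token counts (assumptions, not values from the paper).
scale = attention_scale(n_train=4096, n_test=65536)
logits = [0.1 * i for i in range(32)]
flat = entropy(softmax(logits))
sharp = entropy(softmax(logits, scale=scale))
```

With these illustrative lengths the factor is sqrt(16/12), about 1.155, and the rescaled attention has lower entropy than the unscaled one, which is the sharpening effect the paragraph describes.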

RoPE Rescaling

The rotary base is stretched using an NTK-aware principle, keeping unseen UHR positions within a more stable positional encoding range.
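As a rough illustration, the NTK-aware stretch is commonly written for 1D RoPE as base' = base · s^(d / (d − 2)), where s is the length ratio and d the rotary dimension. The sketch below assumes that form; how the paper applies it per image axis, and the values of base, dim, and scale, are not given on this page and are illustrative only.

```python
import math


def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    """Stretch the rotary base by scale ** (dim / (dim - 2)) (assumed NTK-aware form)."""
    return base * scale ** (dim / (dim - 2))


def rope_frequencies(base: float, dim: int):
    """Per-pair rotary frequencies base ** (-2i / dim) for i in [0, dim / 2)."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


# Illustrative settings (assumptions): 64 rotary dims, 4x longer axis.
dim, base, scale = 64, 10000.0, 4.0
original = rope_frequencies(base, dim)
stretched = rope_frequencies(ntk_scaled_base(base, scale, dim), dim)

# Highest frequency is untouched; the lowest is slowed by exactly `scale`.
highest_ratio = stretched[0] / original[0]
lowest_ratio = original[-1] / stretched[-1]
```

The design point is that the fastest-rotating pair keeps its frequency, so local position detail is preserved, while the slowest pair is interpolated by the full factor so that positions beyond the training range map back into angles seen during pretraining.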

Frequency-Focused Loss

A dynamic 2D-DFT loss emphasizes high-frequency bands during later denoising steps, where fine texture and local details are decoded.
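The exact weighting schedule is not given on this page. The sketch below is one possible reading, written in NumPy on a single-channel array: the squared 2D-DFT error is weighted by a factor that grows with radial frequency, and that growth is scaled by a hypothetical progress value in [0, 1] (0 at the start of denoising, 1 at the end). The names frequency_loss, radial_frequency, progress, and max_boost are assumptions for illustration.

```python
import numpy as np


def radial_frequency(height: int, width: int) -> np.ndarray:
    """Radial frequency of each DFT bin, normalized to [0, 1]."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return radius / radius.max()


def frequency_loss(pred, target, progress, max_boost=4.0):
    """Frequency-weighted squared error.

    With progress = 0 the weight is 1 everywhere, and because the DFT is
    orthonormal the loss then equals the plain pixel-space MSE. Larger
    progress increasingly up-weights high-frequency error.
    """
    spectrum = np.fft.fft2(pred - target, norm="ortho")
    weight = 1.0 + max_boost * progress * radial_frequency(*pred.shape)
    return float(np.mean(weight * np.abs(spectrum) ** 2))


# Two errors with the same pixel MSE: a constant offset (pure low frequency)
# and a checkerboard (pure highest frequency).
target = np.zeros((8, 8))
rows, cols = np.indices((8, 8))
offset = np.ones((8, 8))
checker = (-1.0) ** (rows + cols)

late_offset = frequency_loss(offset, target, progress=1.0)
late_checker = frequency_loss(checker, target, progress=1.0)
```

Late in denoising (progress = 1) the constant offset is still penalized like ordinary MSE, while the checkerboard error is penalized 1 + max_boost times as much, so gradient pressure shifts toward fine texture.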

Overview of the high-frequency-aware post-adaptation method.

Results

On VINS-4KEval, post-adapting FLUX.1-Kontext-dev improves UHR detail fidelity while preserving competitive instruction-following behavior.

Fidelity: Native 4K detail is the priority.

The post-adapted model better preserves and synthesizes fine textures that are usually weakened by downsample-edit-upsample pipelines.

Editing Quality: Instruction following stays competitive.

On VINS-4KEval, the adapted editor improves on Kontext-dev in edit quality (4.60 to 4.70) and detail preservation (4.41 to 4.47), while instruction adherence stays on par (4.22 to 4.23). On all three scores it ranks second only to Seedream 4.0.

Benchmark: Quantitative gains support the visual trend.

The adapted model reaches the lowest pFID among the evaluated systems (9.15, versus 12.66 for Kontext-dev and 12.82 for Seedream 4.0), consistent with the clearer high-frequency reconstruction seen in the qualitative results.

Method	Instr. Adh.	Edit Qual.	Detail Pres.	SC	PQ	VIEScore	pFID
Seedream 4.0	4.60	4.79	4.70	7.95	8.12	8.03	12.82
AnyEdit	3.24	3.89	3.57	4.09	7.32	5.71	18.44
ICEdit	3.76	4.42	4.09	5.77	7.81	6.79	16.69
Bagel	4.15	4.39	4.27	7.23	7.76	7.49	15.41
Omnigen2	4.14	4.54	4.34	6.73	7.86	7.29	18.73
Step1X-Edit	4.06	4.50	4.28	6.94	7.79	7.37	15.37
Kontext-dev	4.22	4.60	4.41	6.95	7.92	7.43	12.66
Kontext-dev + Post-Adaptation	4.23	4.70	4.47	6.89	7.98	7.44	9.15

Generalization to Qwen-Image-Edit-2511

The same adaptation idea also transfers to Qwen-Image-Edit-2511, improving local object replacement and global scene transformation while preserving UHR visual details.

Generalization results on Qwen-Image-Edit-2511.

Gallery

Input-output pairs from the dataset and qualitative evaluations, organized as paired scrolling cards for direct before/after comparison.

More Dataset Samples

Input

Output

Change the number of the car to '2023'

Input

Output

Converts pictures to watercolor style

Input

Output

Change the woman's pose to hands pressed together in prayer

Input

Output

Change the colour of the lifeguard tower to red and blue

Input

Output

Place the woman in The Bund in Shanghai

Input

Output

Pan the frame to the left

Input

Output

Slightly zoom in to show more island details

Input

Output

Place the woman on Nanjing Road, Shanghai, keeping her unchanged

Input

Output

Make the overall image warmer and highlight the feeling of autumn

Input

Output

Add a line to the right side of the picture: "It's the evening wind, it's the end of the water"

Input

Output

Replace background with a train station waiting hall, keep the person unchanged

Input

Output

Raise the groom's head to face the bride. Change the expressions of both the groom and the bride to smiles. Reposition the couple's hands so they are holding both of each other's hands

Input

Output

Close the mouth of the man on the left and pucker his lips. Turn the head of the older man on the right to look down and close his eyes

Input

Output

Add a hand wearing a transparent glove from the top of the frame, placing a sesame seed bun onto the burger

Input

Output

Slightly zoom out and lower the robotic arm to place the layer of boxes it is holding onto the stack on the pallet

Input

Output

Remove the 'BROWN OWL' text box from the bottom left. Lift the owlet's head to face forward

Input

Output

Turn the bird's head out from behind the branch to reveal its face and beak

Input

Output

Slightly zoom in, remove the tomato from the upper left, and add a pizza cutter pressing down into the center of the pizza

More Qualitative Results

Input

Output

Change the peacock's green feathers into yellow

Input

Output

Replace the sky in the background with a brilliant starry sky

Input

Output

Zoom out for a wider view

Input

Output

Change the pose of the woman to close her eyes

Input

Output

Replace this sports car with an F1 car

Input

Output

Transform the backdrop into a serene forest

Input

Output

Simulate a sunrise with golden light reflecting off the cabin.

BibTeX

If you find this work useful, please cite it.

@inproceedings{chen2026vins120k,
  title     = {VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset},
  author    = {Chen, Zhizhou and Guan, Shanyan and Gao, Zhanxin and Ci, En and
               Ge, Yanhao and Li, Wei and Zhang, Zhenyu and Yang, Jian and Tai, Ying},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}