CVPR 2026

VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset

The first large-scale 4K instruction-based image editing dataset, paired with a high-frequency-aware post-adaptation strategy for native UHR editing.

Zhizhou Chen1*, Shanyan Guan2*, Zhanxin Gao1, En Ci1, Yanhao Ge2, Wei Li2, Zhenyu Zhang1, Jian Yang1, Ying Tai1†

1Nanjing University, China    2vivo, China    *Equal contribution    †Corresponding author

Ultra-high-resolution editing comparison between input, VINS-120K adaptation, and Kontext plus super-resolution.

Abstract

Directly editing ultra-high-resolution (UHR) images is valuable but still underexplored, mainly because high-quality 4K editing data is scarce and high-frequency texture modeling is difficult. VINS-120K introduces 120K carefully curated triplets of instruction, input image, and edited image. Each image exceeds 4K resolution, with an average size of 4656 × 4138. Built on this dataset, the paper proposes a high-frequency-aware post-adaptation strategy that extends pretrained non-high-resolution (NHR) editors to the UHR regime, and evaluates them on VINS-4KEval, a 509-sample 4K benchmark covering 13 edit types.

VINS-120K Dataset

VINS-120K is built from real-world 8K-UHD videos and carefully filtered open-source editing data, preserving native detail while expanding long-tail edit coverage.

Overview of VINS-120K edited triplets across editing types.
The VINS-120K data filtering pipeline.
120K UHR editing triplets
4656 × 4138 average resolution
13 editing types
509 VINS-4KEval samples

High-Frequency-Aware Post-Adaptation

The method adapts pretrained NHR editors to 4K by stabilizing very long token sequences and explicitly supervising high-frequency detail.

Attention Rescaling

UHR images produce much longer token sequences, which can over-smooth attention. A resolution-aware temperature sharpens attention responses back toward the pretrained regime.
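
The effect described above can be illustrated with a minimal sketch. The page does not give the exact temperature, so the log-ratio factor below (the square root of log N over log N_train) is an assumption borrowed from common length-extrapolation practice, and the names `attention_scale_factor`, `attention_weights`, and `train_tokens` are hypothetical. The sketch multiplies the usual 1/sqrt(d) logit scale by a factor that equals 1 at the training length and grows slowly with the token count, which lowers the entropy of the softmax.

```python
import math


def attention_scale_factor(num_tokens, train_tokens):
    """Hypothetical log-ratio sharpening factor (an assumption, not the paper's formula).

    Equals 1.0 at the training length and grows slowly for longer sequences.
    Equivalent to a softmax temperature of 1 / factor. Both arguments must be > 1.
    """
    return math.sqrt(math.log(num_tokens) / math.log(train_tokens))


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means flatter (smoother) attention."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_weights(query, keys, train_tokens):
    """Attention weights of one query over `keys`, with rescaled logits."""
    factor = attention_scale_factor(len(keys), train_tokens)
    scale = factor / math.sqrt(len(query))  # usual 1/sqrt(d) times the factor
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(logits)
```

Under this assumed formula, a sequence 16 times longer than the training length (65,536 versus 4,096 tokens) gives a factor of sqrt(16/12), roughly 1.155. In a PyTorch model, the same factor could be folded into the `scale` argument of `torch.nn.functional.scaled_dot_product_attention`.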

RoPE Rescaling

The rotary base is stretched using an NTK-aware principle, keeping unseen UHR positions within a more stable positional encoding range.
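
As a rough illustration of the NTK-aware principle, the sketch below uses the widely used rule base' = base · s^(d / (d − 2)), where s is the ratio of target to training positions and d is the rotary dimension. Whether the paper uses exactly this rule, and how it is applied to the editor's multi-axis positional encoding, is not stated here. The 1D setting, the base of 10000, and the function names are assumptions for illustration.

```python
import math


def ntk_scaled_base(base, dim, scale):
    """Stretch the rotary base so the lowest frequency slows down by `scale`."""
    return base * scale ** (dim / (dim - 2))


def rope_frequencies(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2i/dim), i = 0..dim/2-1."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]


def apply_rope(vec, position, freqs):
    """Rotate consecutive (even, odd) pairs of `vec` by position * frequency."""
    out = []
    for i, freq in enumerate(freqs):
        angle = position * freq
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


# Example: positions extended 4x beyond the (assumed) training range.
dim, base, scale = 64, 10000.0, 4.0
original = rope_frequencies(dim, base)
stretched = rope_frequencies(dim, ntk_scaled_base(base, dim, scale))
```

With this rule the highest frequency is left unchanged, which keeps local position resolution, while the lowest frequency is divided by exactly s, so the largest rotation angle at the extended length stays within what the pretrained model saw.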

Frequency-Focused Loss

A dynamic 2D-DFT loss emphasizes high-frequency bands during later denoising steps, where fine texture and local details are decoded.
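
One way such a loss could look is sketched below. The exact band weighting and schedule are not given on this page, so everything specific is an assumption: the weight 1 + max_boost · progress · r, the normalized radial frequency r in [0, 1], and the `progress` variable (0 at the start of denoising, 1 at the final steps) are hypothetical. The sketch uses a naive standard-library DFT so it runs without dependencies; a real training loop would use an FFT routine such as `torch.fft.fft2`.

```python
import cmath
import math


def dft2(img):
    """Naive 2D discrete Fourier transform of a small 2D list (O(N^4))."""
    h, w = len(img), len(img[0])
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for y in range(h):
                for x in range(w):
                    phase = -2j * math.pi * (u * y / h + v * x / w)
                    acc += img[y][x] * cmath.exp(phase)
            out[u][v] = acc
    return out


def radial_frequency(u, v, h, w):
    """Distance of bin (u, v) from DC, normalized so the Nyquist corner is 1."""
    fu = min(u, h - u) / h
    fv = min(v, w - v) / w
    return math.hypot(fu, fv) / math.hypot(0.5, 0.5)


def frequency_focused_loss(pred, target, progress, max_boost=4.0):
    """Spectral squared error with extra weight on high frequencies.

    `progress` and `max_boost` are hypothetical: the high-frequency weight
    grows linearly as denoising advances. At progress = 0 the loss reduces
    to the plain pixel-space MSE (by Parseval's theorem).
    """
    h, w = len(pred), len(pred[0])
    diff = [[p - t for p, t in zip(pr, tr)] for pr, tr in zip(pred, target)]
    spec = dft2(diff)  # DFT is linear, so this is DFT(pred) - DFT(target)
    total = 0.0
    for u in range(h):
        for v in range(w):
            weight = 1.0 + max_boost * progress * radial_frequency(u, v, h, w)
            total += weight * abs(spec[u][v]) ** 2
    return total / (h * w) ** 2
```

With these assumed weights, a checkerboard-pattern error and a constant-offset error of the same pixel MSE are penalized equally early in denoising, while at the final steps the checkerboard error, which sits at the highest frequency, costs up to 1 + max_boost times more.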

Overview of the high-frequency-aware post-adaptation method.

Results

On VINS-4KEval, post-adapting FLUX.1-Kontext-dev improves UHR detail fidelity while preserving competitive instruction-following behavior.

Fidelity: Native 4K detail is the priority.

The post-adapted model better preserves and synthesizes fine textures that are usually weakened by downsample-edit-upsample pipelines.

Editing Quality: Instruction following stays competitive.

On VINS-4KEval, the adapted editor improves edit quality and detail preservation while keeping instruction adherence close to the strongest baselines.

Benchmark: Quantitative gains support the visual trend.

In the results table, the post-adapted model reaches the best (lowest) pFID among the evaluated systems, 9.15 versus 12.66 for the unadapted Kontext-dev, and the qualitative results show clearer high-frequency reconstruction.

Qualitative comparisons on VINS-4KEval.
Method                         Instr. Adh.  Edit Qual.  Detail Pres.  SC    PQ    VIEScore  pFID
Seedream 4.0                   4.60         4.79        4.70          7.95  8.12  8.03      12.82
AnyEdit                        3.24         3.89        3.57          4.09  7.32  5.71      18.44
ICEdit                         3.76         4.42        4.09          5.77  7.81  6.79      16.69
Bagel                          4.15         4.39        4.27          7.23  7.76  7.49      15.41
Omnigen2                       4.14         4.54        4.34          6.73  7.86  7.29      18.73
Step1X-Edit                    4.06         4.50        4.28          6.94  7.79  7.37      15.37
Kontext-dev                    4.22         4.60        4.41          6.95  7.92  7.43      12.66
Kontext-dev + Post-Adaptation  4.23         4.70        4.47          6.89  7.98  7.44      9.15

Generalization to Qwen-Image-Edit-2511

The same adaptation idea also transfers to Qwen-Image-Edit-2511, improving local object replacement and global scene transformation while preserving UHR visual details.

Generalization results on Qwen-Image-Edit-2511.

BibTeX

If you find this work useful, please cite it.

@inproceedings{chen2026vins120k,
  title     = {VINS-120K: Ultra High-Resolution Image Editing with A Large-Scale Dataset},
  author    = {Chen, Zhizhou and Guan, Shanyan and Gao, Zhanxin and Ci, En and
               Ge, Yanhao and Li, Wei and Zhang, Zhenyu and Yang, Jian and Tai, Ying},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2026}
}