Domain-RAG

Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection
Yu Li1,*, Xingyu Qiu1,*, Yuqian Fu2,*†, Jie Chen3, Tianwen Qian4,
Xu Zheng2,5, Danda Pani Paudel2, Yanwei Fu1, Xuanjing Huang1, Luc Van Gool2, Yu-Gang Jiang1
1Fudan University    2INSAIT, Sofia University “St. Kliment Ohridski”    3Fuzhou University
4East China Normal University    5HKUST(GZ)
*These authors contributed equally.    †Corresponding author.
NeurIPS 2025
Figure 1: Teaser. Given images from distinct novel domains, we compare generation results of baseline methods (a–c) and our approach (d), and illustrate the main pipeline of our Domain-RAG (e).

Abstract

Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel objects with only a handful of labeled samples from previously unseen domains. While data augmentation and generative methods have shown promise in few-shot learning, their effectiveness for CD-FSOD remains unclear due to the need for both visual realism and domain alignment. Existing strategies, such as copy-paste augmentation and text-to-image generation, often fail to preserve the correct object category or produce backgrounds coherent with the target domain, making them non-trivial to apply directly to CD-FSOD. To address these challenges, we propose Domain-RAG, a training-free, retrieval-guided compositional image generation framework tailored for CD-FSOD. Domain-RAG consists of three stages: domain-aware background retrieval, domain-guided background generation, and foreground-background composition. Specifically, the input image is first decomposed into foreground and background regions. We then retrieve semantically and stylistically similar images to guide a generative model in synthesizing a new background, conditioned on both the original and retrieved contexts. Finally, the preserved foreground is composed with the newly generated domain-aligned background to form the generated image. Without requiring any additional supervision or training, Domain-RAG produces high-quality, domain-consistent samples across diverse tasks, including CD-FSOD, remote sensing FSOD, and camouflaged FSOD. Extensive experiments show consistent improvements over strong baselines and establish new state-of-the-art results.

Method Overview

We propose Domain-RAG, a novel training-free, retrieval-guided compositional generation framework that enhances support diversity by generating domain-aligned samples. To enable retrieval, we use COCO as the database $ \mathcal{D}_{\text{base}} $, serving as a gallery of candidate backgrounds. Following the core principle of “fix the foreground, adapt the background”, Domain-RAG processes each support image $ x \in \mathcal{S} $ by first decomposing it into foreground object(s) and background. The framework then proceeds through three key stages: (1) domain-aware background retrieval first obtains the inpainted background $ b_{\text{init}} $ from $ x $ and then retrieves $ G $ candidate backgrounds $ b_{\text{re}} $ from $ \mathcal{D}_{\text{base}} $ that are semantically and stylistically similar; (2) domain-guided background generation feeds each pair $ \{ b_{\text{init}}, b_{\text{re}} \} $ into a generative model to synthesize a new domain-aligned background $ b_{\text{dom}} $; (3) foreground-background composition finally produces $ n $ new images $ x^{+} $ by compositing the preserved foreground onto each $ b_{\text{dom}} $ using a mask-guided generative model.
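
To make stage (1) concrete, the minimal sketch below selects the top-$G$ gallery backgrounds by cosine similarity between image embeddings of $ b_{\text{init}} $ and the $ \mathcal{D}_{\text{base}} $ candidates. The choice of encoder, the feature dimension, and the function names are illustrative assumptions; the overview above only states that retrieval is based on semantic and stylistic similarity.

```python
import numpy as np

def retrieve_backgrounds(b_init_feat: np.ndarray,
                         gallery_feats: np.ndarray,
                         G: int = 5) -> np.ndarray:
    """Return indices of the top-G gallery backgrounds most similar to b_init.

    b_init_feat:   (d,)   embedding of the inpainted background b_init
    gallery_feats: (N, d) embeddings of candidate backgrounds from D_base (e.g. COCO)
    Using a CLIP-style image encoder and cosine similarity is an assumption here.
    """
    # L2-normalize so the dot product equals cosine similarity.
    q = b_init_feat / np.linalg.norm(b_init_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    sims = g @ q                      # (N,) cosine similarities
    return np.argsort(-sims)[:G]      # indices of the G most similar backgrounds

# Example with random features standing in for real embeddings.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((1000, 512))   # N=1000 candidate backgrounds, d=512
query = rng.standard_normal(512)             # embedding of b_init
print(retrieve_backgrounds(query, gallery, G=5))
```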

Figure 2: Method Framework. Illustration of our Domain-RAG. Built on our principle of "fix the foreground, adapt the background", we first decompose the image and then process it with three key modules: domain-aware background retrieval, domain-guided background generation, and foreground-background composition.
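
For orientation, the skeleton below sketches how the three stages could chain for a single support image. Every function in it (segment_foreground, inpaint_background, generate_background, compose_with_mask) is a hypothetical placeholder for a pretrained component; the specific models used by Domain-RAG are not fixed by this overview.

```python
from typing import List, Tuple

# Hypothetical stage interfaces; each would wrap a pretrained model in practice.
def segment_foreground(x) -> Tuple[object, object]:
    """Split a support image into (foreground, mask), e.g. with a segmentation model."""
    raise NotImplementedError

def inpaint_background(x, mask) -> object:
    """Remove the foreground and inpaint the hole, giving b_init."""
    raise NotImplementedError

def retrieve(b_init, gallery, G: int) -> List[object]:
    """Stage 1: return G backgrounds from D_base similar to b_init (see sketch above)."""
    raise NotImplementedError

def generate_background(b_init, b_re) -> object:
    """Stage 2: synthesize a domain-aligned background b_dom conditioned on both inputs."""
    raise NotImplementedError

def compose_with_mask(foreground, mask, b_dom) -> object:
    """Stage 3: composite the preserved foreground onto b_dom with a mask-guided generator."""
    raise NotImplementedError

def domain_rag_augment(x, gallery, G: int = 5) -> List[object]:
    """Produce domain-aligned augmented images x+ for one support image x."""
    foreground, mask = segment_foreground(x)
    b_init = inpaint_background(x, mask)
    augmented = []
    for b_re in retrieve(b_init, gallery, G):          # Stage 1
        b_dom = generate_background(b_init, b_re)      # Stage 2
        augmented.append(compose_with_mask(foreground, mask, b_dom))  # Stage 3
    return augmented
```

In this sketch each retrieved background yields one augmented image, i.e., $ n = G $; the exact correspondence between $ G $ and $ n $ in Domain-RAG may differ.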

Experimental Results

We evaluate Domain-RAG on multiple benchmarks, demonstrating consistent improvements.

CD-FSOD Results
Table 1: CD-FSOD Performance. Comparison with state-of-the-art methods on standard Cross-Domain benchmarks.
Camouflage FSOD Results
Table 2: Camouflage FSOD. Results on camouflaged object detection tasks.
Remote Sensing Results
Table 3: Additional Results. Comparison of augmentation methods (mAP) on the CD-FSOD benchmark in the 1-shot setting.

Qualitative Visualization

Figure 3: Qualitative Visualization. (a) Ablation study on modules; results reported on NEU-DET, 1-shot. (b) Visualization of target images (top row) and generated images (second row). (c) Failure cases.

Citation

@inproceedings{li2025domainrag,
  title     = {Domain-RAG: Retrieval-Guided Compositional Image Generation for Cross-Domain Few-Shot Object Detection},
  author    = {Li, Yu and Qiu, Xingyu and Fu, Yuqian and Chen, Jie and Qian, Tianwen and Zheng, Xu and Paudel, Danda Pani and Fu, Yanwei and Huang, Xuanjing and Van Gool, Luc and Jiang, Yu-Gang},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}

@inproceedings{fu2024cross,
  title        = {Cross-domain few-shot object detection via enhanced open-set object detector},
  author       = {Fu, Yuqian and Wang, Yu and Pan, Yixuan and Huai, Lian and Qiu, Xingyu and Shangguan, Zeyu and Liu, Tong and Fu, Yanwei and Van Gool, Luc and Jiang, Xingqun},
  booktitle    = {European Conference on Computer Vision},
  pages        = {247--264},
  year         = {2024},
  organization = {Springer}
}