1 CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor, CVPR 2024, cited 30 times
2 Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero-Shot Medical Image Segmentation, CVPR 2024 (Workshop), cited 14 times
3 LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning, CVPR 2024 (Workshop), cited 19 times
4 Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation, CVPR 2024, cited 15 times
5 Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation, CVPR 2024, cited 12 times
6 MaskPrompt: Open-Vocabulary Affordance Segmentation with Object Shape Mask Prompts, AAAI 2025, cited 0–2 times
7 Improved Visual Grounding through Self-consistent Explanations, CVPR 2024, cited 15 times
8 Prompt Learning via Meta-Regularization, CVPR 2024, cited 15 times
9 Domain-Controlled Prompt Learning, AAAI 2024, cited 18 times
10 Domain Prompt Learning with Quaternion Networks, CVPR 2024, cited 13 times
11 Multi-Prompts Learning with Cross-Modal Alignment for Attribute-Based Person Re-Identification, AAAI 2024, cited 20 times
12 Learning to Prompt with Text Only Supervision for Vision-Language Models, AAAI 2025, cited 13 times
13 WAFFLE: Multimodal Floorplan Understanding in the Wild, CVPR 2025
14 Omnicount: Multi-label Object Counting with Semantic-Geometric Priors, AAAI 2025, cited 2 times
15 Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images, AAAI 2025
16 Personamagic: Stage-regulated High-fidelity Face Customization with Tandem Equilibrium, AAAI 2025, cited 2 times
17 HiClassGen: High-Resolution Image Augmentation with Class and Shape Controllable Diffusion Models, ICCV 2025
Cpt: Colorful prompt tuning for pre-trained vision-language models
Y Yao, A Zhang, Z Zhang, Z Liu, TS Chua, M Sun - AI Open, 2024 - Elsevier
Cape: Corrective actions from precondition errors using large language models
SS Raman, V Cohen, I Idrees, E Rosen… - … on Robotics and …, 2024 - ieeexplore.ieee.org
Extracting knowledge and reasoning from large language models (LLMs) offers a path to
designing intelligent robots. Common approaches that leverage LLMs for planning are unable …
Cited by 29
Woodpecker: Hallucination correction for multimodal large language models
S Yin, C Fu, S Zhao, T Xu, H Wang, D Sui… - Science China …, 2024 - Springer
… In: Proceedings of AAAI, 2024 9 Li Y, Du Y, Zhou K, et al. Evaluating object hallucination in
large vision-… PointCLIP V2: prompting CLIP and GPT for powerful 3D open-world learning. In: …
Cited by 210
Vision-language models in remote sensing: Current progress and future trends
X Li, C Wen, Y Hu, Z Yuan… - IEEE Geoscience and …, 2024 - ieeexplore.ieee.org
… [107] proposed a novel training paradigm called data-efficient CLIP, which improves the
efficiency of learning generic visual features. This approach incorporated both intramodality self-…
Cited by 94
A survey on multimodal large language models for autonomous driving
C Cui, Y Ma, X Cao, W Ye, Y Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
… [145] and ControlNet [204] utilized CLIP and UNet-based diffusion model to generate images
… In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), …
Cited by 359
A survey on open-vocabulary detection and segmentation: Past, present, and future
C Zhu, L Chen - IEEE Transactions on Pattern Analysis and …, 2024 - ieeexplore.ieee.org
… VLMs (eg, CLIP) trained via contrastive learning yield superior zero-shot recognition ability
… CLIP visual-semantic space more effectively. A more detailed framework is given in Fig. 4. …
Cited by 33
Referring camouflaged object detection
X Zhang, B Yin, Z Lin, Q Hou, DP Fan… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
… Then, we feed these 64 text descriptions into the pre-trained CLIP [102] that adopts ResNet-…
/journals, including IEEE TPAMI, CVPR, ICCV, NeurIPS, etc. His research interests include …
Cited by 26
Cross-modal conditioned reconstruction for language-guided medical image segmentation
X Huang, H Li, M Cao, L Chen… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
… the frozen CLIP [42] text encoder to encode text. The reasons for selecting CLIP and its pre-…
Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 3517–3525, 2022. …
Cited by 15
Transformer-based visual segmentation: A survey
X Li, H Ding, H Yuan, W Zhang, J Pang… - IEEE transactions on …, 2024 - ieeexplore.ieee.org
… visual CLIP encoder for visual distillation. To handle this, Forzen-VLM [279] adopts the
frozen visual clip model and combines the scores of both learned visual embedding and CLIP …
Cited by 160
Improved visual grounding through self-consistent explanations
R He, P Cascante-Bonilla, Z Yang… - Proceedings of the …, 2024 - openaccess.thecvf.com
… This CVPR … CLIP [41] model as pseudo-labels during training. We posit that our contribution
is orthogonal and our approach would likely also benefit from similar supervision, since CLIP …
Cited by 15
Ophclip: Hierarchical retrieval-augmented learning for ophthalmic surgical video-language pretraining
M Hu, K Yuan, Y Shen, F Tang, X Xu, L Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
… For these videos, we collect their titles and clip metadata to construct a knowledge base, …
The comparison to the OpenAI CLIP and CLIP pretrained on our dataset. We report Accuracy / …
Cited by 13
iedit: Localised text-guided image editing with weak supervision
R Bodur, E Gundogdu, B Bhattarai… - Proceedings of the …, 2024 - openaccess.thecvf.com
… This CVPR Workshop paper is the Open Access version, provided by the Computer Vision
… , we introduce a global CLIP loss [30] between the edit prompt and the generated image x1: …
Cited by 15
Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding
X Zuo, P Samangouei, Y Zhou, Y Di, M Li - International Journal of …, 2025 - Springer
… By carefully analyzing the properties in both CLIP and DINO embeddings, we design an …
of CLIP features by extracting and aggregating them at multiple resolutions for a hybrid CLIP …
Cited by 40
Learning object state changes in videos: An open-world perspective
Z Xue, K Ashutosh, K Grauman - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
… This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version; the final published version …
Cited by 22
Vision-language models for vision tasks: A survey
J Zhang, J Huang, S Jin, S Lu - IEEE Transactions on Pattern …, 2024 - ieeexplore.ieee.org
… The publications grow exponentially since the pioneer study CLIP [10] in 2021. … Visual
recognition related VLM studies have made great progresses since the development of CLIP [10]. …
Cited by 605
Lidarclip or: How i learned to talk to point clouds
G Hess, A Tonderski, C Petersson… - Proceedings of the …, 2024 - openaccess.thecvf.com
… CLIP embedding space. Using image-lidar pairs, we supervise a point cloud encoder
with the image CLIP … We show the effectiveness of LidarCLIP by demonstrating that lidar-based …
Cited by 21
Exploring vision-language models for imbalanced learning
Y Wang, Z Yu, J Wang, Q Heng, H Chen, W Ye… - International Journal of …, 2024 - Springer
… , in Table 5, we further provide some results using ViT of Laion-CLIP to show the
generalizability of our decoder approach combining VLMs and imbalanced learning methods. Our …
Cited by 39
Domain prompt learning with quaternion networks
Q Cao, Z Xu, Y Chen, C Ma… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
… enhances the adaptability of CLIP for visual recognition tasks. … This CVPR paper is the Open
Access version, provided by … We use the pre-trained ViT-B/16 CLIP model for prompt tuning…
Cited by 13
Prompt learning via meta-regularization
J Park, J Ko, HJ Kim - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
… This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
… CLIP [52] provides a wellaligned image-text joint embedding space. The pretrained CLIP …
Cited by 15
Affordancellm: Grounding affordance from vision language models
S Qian, W Chen, M Bai, X Zhou… - Proceedings of the …, 2024 - openaccess.thecvf.com
… This CVPR Workshop paper is the Open Access version, … The standard LLaVA uses
CLIP image encoder and a linear … In practice, we find that the CLIP image encoder has low …
Cited by 38
Transferring vision-language models for visual recognition: A classifier perspective
W Wu, Z Sun, Y Song, J Wang, W Ouyang - International Journal of …, 2024 - Springer
… We observe that the embeddings from the textual encoder of CLIP significantly improve the
… This result can be explained by the fact that both DistillBERT and CLIP are pre-trained with …
Cited by 22
Llm-seg: Bridging image segmentation and large language model reasoning
J Wang, L Ke - Proceedings of the IEEE/CVF Conference …, 2024 - openaccess.thecvf.com
… This CVPR Workshop paper is the Open Access version, … However, fine-tuning the CLIP
parameters may harm the zero-shot … model upon a frozen CLIP model such as FC-CLIP [34] and …
Cited by 19
Universal and extensible language-vision models for organ segmentation and tumor detection from abdominal computed tomography
J Liu, Y Zhang, K Wang, MC Yavuz, X Chen… - Medical image …, 2024 - Elsevier
… 1, we present the framework of CLIP-Driven Universal Model for multi-organ segmentation
and tumor detection with an integrated dataset, which consists of two main branches: the …
Cited by 18
Rar: Retrieving and ranking augmented mllms for visual recognition
Z Liu, Z Sun, Y Zang, W Li, P Zhang, X Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
… CLIP and MLLM. Our RAR can seamlessly integrate into MLLMs to improve the few-shot/zero-shot
abilities on classification (upper right) and detection (bottom) datasets. CLIP’… that CLIP …
Cited by 14
Clip as rnn: Segment countless visual concepts without training endeavor
S Sun, R Li, P Torr, X Gu, S Li - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
… This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version; the final published version …
Cited by 30
Test-time adaptation with salip: A cascade of sam and clip for zero-shot medical image segmentation
S Aleem, F Wang, M Maniparambil… - Proceedings of the …, 2024 - openaccess.thecvf.com
… tasks across diverse domains while CLIP is renowned for its zero-shot … of integrating SAM
and CLIP into a unified framework for … within the image followed by CLIP to retrieve the mask …
Cited by 14
Vlcounter: Text-aware visual representation for zero-shot object counting
S Kang, WJ Moon, E Kim, JP Heo - Proceedings of the AAAI Conference …, 2024 - ojs.aaai.org
… joint embedding space of CLIP, has provided a clear motivation for utilizing CLIP as a robust
… capability of CLIP to achieve precise and efficient object counting in an end-to-end manner. …
Cited by 26
Foundation Models Defining a New Era in Vision: a Survey and Outlook
M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
… The state-of-the-art performance of CLIP hinges on the large-scale image-text … CLIP
model. Utilizing largescale LAION datasets [46], [47], Open-CLIP [72] train and reproduce CLIP …
Cited by 181
Visual and textual prior guided mask assemble for few-shot segmentation and beyond
S Chen, F Meng, R Zhang, H Qiu, H Li… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
… Due to CLIP’s advantages of aligning visual and textual information, the integration of CLIP
… However, even with the CLIP model, the existing CLIP-based FSS methods are still subject to …
Cited by 12
Learning to prompt with text only supervision for vision-language models
MU Khattak, MF Naeem, M Naseer… - … of the AAAI Conference …, 2025 - ojs.aaai.org
… 2022b) is the pioneering prompt learning method for CLIP which learns text prompts to
fine-tune CLIP. CoCoOp (Zhou et al. 2022a) improves CoOp’s generalization by conditioning text …
Cited by 13
Rethinking prior information generation with clip for few-shot segmentation
J Wang, B Zhang, J Pang, H Chen… - Proceedings of the …, 2024 - openaccess.thecvf.com
… This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version; the final published version …
Cited by 15
Multi-prompts learning with cross-modal alignment for attribute-based person re-identification
Y Zhai, Y Zeng, Z Huang, Z Qin, X Jin… - Proceedings of the AAAI …, 2024 - ojs.aaai.org
… For visual representation and prompt representation, we adopt the image encoder and
the text encoder from CLIP as the backbone for feature extractor respectively. They are all …
Cited by 20
Exploring regional clues in CLIP for zero-shot semantic segmentation
Y Zhang, MH Guo, M Wang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
… This CVPR paper is the Open Access version, provided by the Computer Vision Foundation.
Except for this watermark, it is identical to the accepted version; the final published version …
Cited by 12
Domain-controlled prompt learning
Q Cao, Z Xu, Y Chen, C Ma, X Yang - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Large pre-trained vision-language models, such as CLIP, have shown remarkable generalization
capabilities across various tasks when appropriate text prompts are provided. However…
Cited by 18
-------------------------------------------------------
SGANet: A Siamese Geometry-Aware Network for Remote Sensing Change Detection
J Chen, S Dong, X Meng - IEEE Journal of Selected Topics in …, 2025 - ieeexplore.ieee.org
… In the visual-language domain, the CLIP model [30] has demonstrated the power of multimodal
pre… The CLIP model can closely integrate visual information with language descriptions, …
A comprehensive survey of foundation models in medicine
W Khan, S Leem, KB See, JK Wong… - IEEE Reviews in …, 2025 - ieeexplore.ieee.org
… promising performance compared to CLIP and other methods [58], … , which outperformed the
original CLIP in the biomedical … CLIP-based framework that uses text embeddings from CLIP …
Cited by 14
Detecting Wildfires on UAVs with Real-time Segmentation Trained by Larger Teacher Models
J Pesonen, T Hakala, V Karjalainen… - 2025 IEEE/CVF …, 2025 - ieeexplore.ieee.org
Early detection of wildfires is essential to prevent large-scale fires resulting in extensive
environmental, structural, and societal damage. Uncrewed aerial vehicles (UAVs) can cover …
Cited by 1
Efficient Multi-Task Training with Adaptive Feature Alignment for Universal Image Segmentation
Y Qu, J Kim - Sensors (Basel, Switzerland), 2025 - pmc.ncbi.nlm.nih.gov
Universal image segmentation aims to handle all segmentation tasks within a single model
architecture and ideally requires only one training phase. To achieve task-conditioned joint …
Few-shot Structure-Informed Machinery Part Segmentation with Foundation Models and Graph Neural Networks
M Schwingshackl, FF Oberweger… - 2025 IEEE/CVF Winter …, 2025 - ieeexplore.ieee.org
This paper proposes a novel approach to few-shot semantic segmentation for machinery with
multiple parts that exhibit spatial and hierarchical relationships. Our method integrates the …
Cited by 1
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments
M Elnoor, K Weerakoon, G Seneviratne, J Liang… - arXiv preprint arXiv …, 2025 - arxiv.org
We introduce Vision-Language Attention Distillation (Vi-LAD), a novel approach for distilling
socially compliant navigation knowledge from a large Vision-Language Model (VLM) into a …
An exploration of domain generalisation through vision benchmarking, masking, and pruning
H Riaz - 2025 - doras.dcu.ie
There are many computer vision applications including object segmentation, classification,
object detection, and reconstruction for which Machine Learning (ML) shows state-of-the-art …
Bootstrapping vision–language transformer for monocular 3D visual grounding
Q Lei, S Sun, X Song, H Song, M Feng… - IET Image …, 2025 - Wiley Online Library
In the task of 3D visual grounding using monocular RGB images, it is a challenging problem
to perceive visual features and accurately predict the localization of 3D objects based on …
Application of remote sensing technology in studying the interaction between culture and environment in the Third Pole Region
Z Lu - Frontiers in Environmental Science, 2025 - frontiersin.org
Introduction The interplay between culture and environment in the Third Pole Region holds
profound implications for the region's socio-ecological resilience and long-term sustainability…
CSP-DCPE: Category-Specific Prompt with Deep Contextual Prompt Enhancement for Vision–Language Models
C Wu, Y Wu, Q Xu, X Zi - Electronics, 2025 - mdpi.com
… , CLIP successfully integrates textual and visual information into a unified embedding space.
Consequently, the CLIP … Subsequently, this paper initially presents an overview of the CLIP …
Fine-Grained Image-Text Correspondence with Cost Aggregation for Open-Vocabulary Part Segmentation
J Choi, S Lee, M Lee, S Lee, H Shim - arXiv preprint arXiv:2501.09688, 2025 - arxiv.org
… 57] such as CLIP, allowing both visual and text features to be aligned in a unified semantic …
In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’…
InterMamba: A Visual-Prompted Interactive Framework for Dense Object Detection and Annotation
S Liu, Z Yang, Q Li, Q Wang - IEEE Transactions on Geoscience …, 2025 - ieeexplore.ieee.org
… Foundational work like CLIP [15] introduced a powerful contrastive learning framework, …
Although initially limited to image-level tasks, CLIP inspired a surge of research exploring its …
HiClassGen: High-Resolution Image Augmentation with Class and Shape Controllable Diffusion Models
AM Tayeb, JM Lee, DS Kim - 2025 International Conference on …, 2025 - ieeexplore.ieee.org
… Moreover, the latest advancements in generative models, such as CLIP and Stable
Diffusion, have revolutionized data augmentation by enabling the creation of entirely synthetic …
Referring camouflaged object detection
X Zhang, B Yin, Z Lin, Q Hou, DP Fan… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
… Then, we feed these 64 text descriptions into the pre-trained CLIP [102] that adopts ResNet-…
/journals, including IEEE TPAMI, CVPR, ICCV, NeurIPS, etc. His research interests include …
Cited by 26
When Remote Sensing Meets Foundation Model: A Survey and Beyond.
C Huo, K Chen, S Zhang, Z Wang, H Yan… - remote …, 2025 - search.ebscohost.com
… CLIP-Driven CRIS [50] extended CLIP for the referring image segmentation task by introducing
a visual–language decoder and a text-to-pixel contrastive loss. GLIP [33] reformulated …
Context-Aware Semantic Segmentation: Enhancing Pixel-Level Understanding with Large Language Models for Advanced Vision Applications
B Rahman - arXiv preprint arXiv:2503.19276, 2025 - arxiv.org
Semantic segmentation has made significant strides in pixel-level image understanding, yet
it remains limited in capturing contextual and semantic relationships between objects. …
Computer vision for primate behavior analysis in the wild
R Vogg, T Lüddecke, J Henrich, S Dey, M Nuske… - Nature …, 2025 - nature.com
… For instance, in the vision-language model CLIP 112 , the samples are pairs consisting of
images and corresponding captions. A pretrained CLIP model can be used for transfer …
Cited by 5
Ethical-Lens: Curbing malicious usages of open-source text-to-image models
Y Cai, S Yin, Y Wei, C Xu, W Mao, F Juefei-Xu, S Chen… - Patterns, 2025 - cell.com
… pre-trained CLIP model through linear probing, a technique that involves training a linear
classifier on the outputs of the CLIP image encoder while keeping the original CLIP parameters …
Cited by 3
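The linear-probing setup this snippet describes is a standard way to reuse frozen CLIP features. A minimal sketch, assuming OpenAI's `clip` package; the class count and the single training step shown here are hypothetical placeholders rather than details taken from the paper:

```python
# Illustrative sketch only: train a linear classifier on frozen CLIP image features
# (linear probing). Assumes OpenAI's `clip` package; `num_classes` is hypothetical.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model.eval()  # original CLIP parameters stay frozen throughout

num_classes = 10  # hypothetical label set
probe = torch.nn.Linear(model.visual.output_dim, num_classes).to(device)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

def probe_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One step: frozen CLIP image features -> trainable linear classifier."""
    with torch.no_grad():
        feats = model.encode_image(images.to(device))
        feats = feats / feats.norm(dim=-1, keepdim=True)
    logits = probe(feats.float())
    loss = criterion(logits, labels.to(device))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```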
Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering
Y Zhao, Y Hao, S Gao, Y Wang… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
… We extend our previous studies presented at ICLR 2024 in the following aspects: • Setup:
We … It is important to highlight that ClipSeg employs the CLIP model [33] and is further trained …
Continual Learning for Multiple Modalities
H Jin, E Kim - arXiv preprint arXiv:2503.08064, 2025 - arxiv.org
… learning methods, we adopted ViT-B-16 CLIP [28] as the backbone model for the compared
methods. We used the vision and text encoders of CLIP for the non-text and text modalities, …
Fmgs: Foundation model embedded 3d gaussian splatting for holistic 3d scene understanding
X Zuo, P Samangouei, Y Zhou, Y Di, M Li - International Journal of …, 2025 - Springer
… By carefully analyzing the properties in both CLIP and DINO embeddings, we design an …
of CLIP features by extracting and aggregating them at multiple resolutions for a hybrid CLIP …
Cited by 40
Robust Fusion Controller: Degradation-aware Image Fusion with Fine-grained Language Instructions
H Zhang, Y Zha, Q Zhuang, Z Shao, J Ma - arXiv preprint arXiv …, 2025 - arxiv.org
… CLIP … CLIP, α ∈ RB×512 indicates the obtained functional condition, and B denotes the
batch size. In contrast, ζs is more related to spatial localization, which cannot be handled by CLIP …
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter
J Wang, X Li, J Zhang, Q Xu, Q Zhou… - … on Image Processing, 2025 - ieeexplore.ieee.org
… and all the candidate classes into CLIP model and select the classes with cosine similarity
larger than 0.97. The union of the selected classes by BLIP and CLIP models is used as our …
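For reference, candidate-class filtering by CLIP image-text similarity of the kind this snippet mentions can be sketched as below. This assumes OpenAI's `clip` package; the prompt template and whether the paper thresholds raw cosine similarity or a normalized score are assumptions here, not details confirmed by the source:

```python
# Illustrative sketch only: keep candidate class names whose CLIP text embedding is
# sufficiently similar to the image embedding. Threshold semantics are hypothetical.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def select_classes(image_path: str, candidate_classes: list[str], thresh: float) -> list[str]:
    """Return candidates whose image-text cosine similarity exceeds `thresh`."""
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0).to(device)
    tokens = clip.tokenize([f"a photo of a {c}" for c in candidate_classes]).to(device)
    with torch.no_grad():
        img = model.encode_image(image)
        txt = model.encode_text(tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    sims = (img @ txt.t()).squeeze(0)  # one cosine similarity per candidate class
    return [c for c, s in zip(candidate_classes, sims.tolist()) if s > thresh]
```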
SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation
H Basak, Z Yin - arXiv preprint arXiv:2504.06389, 2025 - arxiv.org
… Whereas SemiVL targets semi-supervised semantic segmentation with a focus on label
efficiency—employing a languageguided decoder that leverages frozen CLIP predictions and …
Dynamic Scene Understanding from Vision-Language Representations
S Pruss, M Alper, H Averbuch-Elor - arXiv preprint arXiv:2501.11653, 2025 - arxiv.org
… Our tests span V&L models of different sizes, including our CLIP [38], BLIP [23], and BLIP-2
[… that the superior performance of large V&L models (CLIP-L, and particularly BLIP-2) …
I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting
N Fanelli, G Vessio, G Castellano - 2025 IEEE/CVF Winter …, 2025 - ieeexplore.ieee.org
… the CLIP Image Quality Assessment score (CLIP-IQA) [36] and measured prompt alignment
using CLIP … the image background before inputting it into CLIP to improve local image-text …
Mapping facade materials utilizing zero-shot segmentation for applications in urban microclimate research
N Tarkhan, M Klimenka, K Fang, F Duarte, C Ratti… - Scientific Reports, 2025 - nature.com
… At the stage of image fragment classification, we explore two approaches: (1) Utilizing
CLIP to classify an image patch (2) Using CLIPSeg to calculate the class triggering the most …
Personalizing Vision-Language Models With Hybrid Prompts for Zero-Shot Anomaly Detection
Y Cao, X Xu, Y Cheng, C Sun, Z Du… - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
… However, since prompts related to “normality” and “abnormality” are rarely present in the
pretrained data [43] of CLIP, the pretrained CLIP may struggle to effectively distinguish between …
Cited by 3
Exp-Vqa: Fine-Grained Facial Expression Analysis Via Visual Question Answering
Y Yuan, J Zeng, S Shan - Available at SSRN 5102484 - papers.ssrn.com
This paper presents a novel task, facial expression visual question answering (FEVQA), for
fine-grained facial expression analysis across multiple scales. FEVQA interprets facial …
Adapter with Textual Knowledge Graph for Zero-Shot Sketch-Based Image Retrieval
J Zhang, J Tang - IEEE Access, 2025 - ieeexplore.ieee.org
… while freezing CLIP parameters and retain the rich knowledge contained in the CLIP model
at … knowledge, we use the text encoder of CLIP to extract the category semantic information of …
BridgeCLIP: Automatic Bridge Inspection by Utilizing Vision-Language Model
P Liao, G Nakano - International Conference on Pattern Recognition, 2025 - Springer
… In this paper, we aim to investigate the capabilities of CLIP for a task that highly requires
professional knowledge, such as bridge inspection. Typically, reading a manual proves helpful …
EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
D Jagpal, X Chen, VP Namboodiri - arXiv preprint arXiv:2504.06861, 2025 - arxiv.org
… Based on these, we obtain a CLIP-based attention mask that controls the timing of
switching the prompts for each grid cell. Earlier switching results in higher variance, while later …
Active learning for vision-language models
B Safaei, VM Patel - 2025 IEEE/CVF Winter Conference on …, 2025 - ieeexplore.ieee.org
… Following that, we provide a brief overview of the training process for the CLIP model, followed
by an explanation of the prompt tuning approach we employed for fine-tuning the …
Cited by 2
Human-Guided Zero-Shot Surface Defect Semantic Segmentation
Y Jin, Y Zhang, D Shan, Z Wu - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
… We use a frozen CLIP text encoder to retain the rich knowledge learned by the CLIP model
from … Knowledge from the pretrained CLIP is used to generate prior guidance for the model, …
ATM-Net: Anatomy-Aware Text-Guided Multi-Modal Fusion for Fine-Grained Lumbar Spine Segmentation
S Lian, D Pan, J Cai, GY Chen, Z Zhong, Z Luo… - arXiv preprint arXiv …, 2025 - arxiv.org
… Cris: Clip-driven referring image segmentation. In CVPR, 2022. 3 [49] James N Weinstein,
Jon D Lurie, Tor D Tosteson, Brett Hanscom, Anna NA Tosteson, Emily A Blood, Nancy JO …
Improved segmentation model to identify object instances based on textual prompts
SV Mashtalir, AR Kovtunenko - Herald of Advanced Information …, 2025 - hait.od.ua
… Like the original model, our model consists of a CLIP encoder (ViT-B/16), which was
adapted for 352x352 resolutions, a prompt encoder, and two decoders for each of the heads – …
3VL: Using Trees to Improve Vision-Language Models' Interpretability
N Yellinek, L Karlinsky, R Giryes - IEEE Transactions on Image …, 2025 - ieeexplore.ieee.org
… to CLIP and CLIP+LoRA … NeurIPS. He organized workshops and tutorials on various aspects
of deep learning both internationally and locally including in ICML, CVPR, ECCV and ICCV …
Open-Vocabulary High-Resolution Remote Sensing Image Semantic Segmentation
Q Cao, Y Chen, C Ma, X Yang - IEEE Transactions on …, 2025 - ieeexplore.ieee.org
… We extend the applicability of vision-language models like CLIP to remote sensing OVS
by introducing newly designed mechanisms that bridge the domain gap, enhancing the …
WAFFLE: Multimodal Floorplan Understanding in the Wild
K Ganon, M Alper, R Mikulinsky… - 2025 IEEE/CVF …, 2025 - ieeexplore.ieee.org
… We use CLIP [28] image embeddings to filter for images … with CLIP text prompt embeddings,
following the use of CLIP for … conference on computer vision (ECCV), pages 201–217, 2018. …
Open-Vocabulary Saliency-Guided Progressive Refinement Network for Unsupervised Video Object Segmentation
Z Han, S Hu, H Song, K Zhang - ICASSP 2025-2025 IEEE …, 2025 - ieeexplore.ieee.org
… Then, we encode It by CLIP to obtain the saliency classes, which are fed into CLIPSeg to
generate the OVS attention map A1 t . Further, we apply a sigmoid function [24] to normalize A1 …
Know" No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
J Park, J Lee, J Song, S Yu, D Jung, S Yoon - arXiv preprint arXiv …, 2025 - arxiv.org
… CLIP architectures validate the effectiveness of our data generation pipelines in enhancing
CLIP’s … [37], the dataset popularly used for training public CLIP models such as OpenCLIP [3]. …
ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
Y Bendou, A Ouasfi, V Gripon, A Boukhayma - arXiv preprint arXiv …, 2025 - arxiv.org
… In this training-based version, instead of using the zero-shot CLIP as the base learner to
regularize our method in the RKHS, we iteratively optimize for a regularizer and the obtained …
Cited by 2
CLIP-TNseg: A Multi-Modal Hybrid Framework for Thyroid Nodule Segmentation in Ultrasound Images
X Sun, B Wei, Y Jiang, L Mao… - IEEE Signal Processing …, 2025 - ieeexplore.ieee.org
… CLIP visual encoder to provide multi-scale features for segmentation. While the input image
X ∈ RW ×H×3 is passed through the CLIP … To incorporate textual supervision, the CLIP Text …
Omnicount: Multi-label object counting with semantic-geometric priors
A Mondal, S Nag, X Zhu, A Dutta - Proceedings of the AAAI Conference …, 2025 - ojs.aaai.org
Object counting is pivotal for understanding the composition of scenes. Previously, this task
was dominated by class-specific methods, which have gradually evolved into more …
Cited by 2
Search3D: Hierarchical Open-Vocabulary 3D Segmentation
A Takmaz, A Delitzas, RW Sumner… - IEEE Robotics and …, 2025 - ieeexplore.ieee.org
… of class-agnostic 3D object instance masks and then compute a feature representation per
object, represented in the joint vision-language embedding space of models such as CLIP [14…
Cited by 6
Freestyle Sketch-in-the-Loop Image Segmentation
S Koley, VR Gajjala, A Sain, PN Chowdhury… - arXiv preprint arXiv …, 2025 - arxiv.org
… global feature for both as in CLIP training [68]) required for dense pixel-level segmentation
task [57]. We however exclude the textual encoder and employ CLIP’s visual encoder to …
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
T Zhang, X Li, Z Huang, Y Li, W Lei, X Deng… - arXiv preprint arXiv …, 2025 - arxiv.org
… On representative work, LLaVA [40], uses the CLIP to encode images into visual tokens and
… VQA tasks, while EVE [14] designs a CLIP supervision to enhance visual token learning. Our …
LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps
Y Wang, R Memmesheimer, S Behnke - arXiv preprint arXiv:2503.12230, 2025 - arxiv.org
… Following CLIP, we initialized the temperature with a value of 0.07 and use 100 as an
upper threshold to clip the temperature value during the training stage, preventing the scaling …
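The temperature handling this snippet mentions follows CLIP's usual recipe: a learnable logit scale initialized from a temperature of 0.07 and capped at 100 during training. A minimal sketch under that assumption (the feature tensors are placeholders for any paired embeddings, not the paper's actual model):

```python
# Illustrative sketch only: CLIP-style learnable temperature, initialized to 0.07
# and with the resulting logit scale clipped to an upper threshold of 100.
import math
import torch

logit_scale = torch.nn.Parameter(torch.tensor(math.log(1.0 / 0.07)))

def contrastive_logits(img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
    """Scaled cosine-similarity logits between L2-normalized image and text features."""
    scale = logit_scale.exp().clamp(max=100.0)  # cap the scale at 100, as in CLIP
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    return scale * img_feats @ txt_feats.t()
```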
Generate, Transduct, Adapt: Iterative Transduction with VLMs
O Saha, L Lawrence, G Van Horn, S Maji - arXiv preprint arXiv:2501.06031, 2025 - arxiv.org
… with CLIP encoders, we demonstrate that GTACLIP, yields … encoders, over CLIP and
transductive CLIP respectively in the … In Proceedings of the AAAI conference on artificial …
Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
CS Lin, CY Wang, YCF Wang… - 2025 IEEE/CVF Winter …, 2025 - ieeexplore.ieee.org
… in the CLIP latent space. In this paper, we aim to fully exploit the CLIP latent space to benefit
… -associated semantic knowledge discovered from the CLIP latent space, as shown in Fig. 1 (…
Colorization Quality Assessment with CLIP
S Shimizu, H Ishikawa - 2025 IEEE International Conference on …, 2025 - ieeexplore.ieee.org
… Then we feed the CLIP model a colorized image … CLIP-IQA [15] is a method that integrates
CLIP into an image quality assessment model. It aims to leverage the prior knowledge of CLIP …
Disentangling CLIP Features for Enhanced Localized Understanding
S Rawlekar, Y Cai, Y Wang, MH Yang… - arXiv preprint arXiv …, 2025 - arxiv.org
… • To address this challenge, we propose Unmix-CLIP, a framework that adapts CLIP features
for fine-… • We show that Unmix-CLIP outperforms SOTA multilabel recognition methods in …
Zero-Shot Industrial Anomaly Segmentation with Image-Aware Prompt Generation
SY Park, H Lee, M Choi, S Han, JR Lee, S Lim… - arXiv preprint arXiv …, 2025 - arxiv.org
… adaptability and efficiency and are further divided into CLIP-based and SAM-based models,
each with distinct strengths and limitations. CLIP-based methods, including WinCLIP [7], …
Neural Assembler: Learning to Generate Fine-Grained Robotic Assembly Instructions from Multi-View Images
H Yan, Y Mu - Proceedings of the AAAI Conference on Artificial …, 2025 - ojs.aaai.org
Image-guided object assembly represents a burgeoning research topic in computer vision.
This paper introduces a novel task: translating multi-view images of a structural 3D model (for …
CRESO: CLIP-Based Referring Expression Segmentation for Object using Text Prompt
S Park, Z Piao, YH Gu - 2025 International Conference on …, 2025 - ieeexplore.ieee.org
… : clip and llm synergy for multimodal question summarization in healthcare," in Proceedings
of the AAAI … Proceedings of the AAAI conference on artificial intelligence, 2018, vol. 32, no. 1. …
Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation
X Chen, W Zhu, P Qiu, H Wang, H Li, H Wu… - arXiv preprint arXiv …, 2025 - arxiv.org
… During training, this framework involves a CLIP with learnable prompts, ie, the adapted
model and a frozen pre-trained CLIP, while during inference, we only need the adapted CLIP. …
Threadshift: Contextual-aware garment replacement using CLIP, segmentation and stable diffusion
P Ghadekar, O Joshi, J Barhate, B Kundu… - AIP Conference …, 2025 - pubs.aip.org
… advanced segmentation and leveraging CLIP's language comprehension capabilities. This
… The system's comprehensive pipeline, incorporating CLIP and CLIPSeg models, facilitates …
Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model
K Ahn, S Han, S Park, J Kim, S Park… - Proceedings of the AAAI …, 2025 - ojs.aaai.org
… We clip the confidence difference maps to be greater than zero to focus on the changes in
destruction rather than construction. Then, we transform the confidence difference maps M̂V …
Cited by 1
Alignclip: navigating the misalignments for robust vision-language generalization
Z Han, G Luo, H Sun, Y Li, B Han, M Gong, K Zhang… - Machine Learning, 2025 - Springer
… Problem setup We start with a pre-trained CLIP model and adapt it using a downstream
training … Our primary objective is to fine-tune the CLIP model such that it performs robustly on …
Learning to prompt with text only supervision for vision-language models
MU Khattak, MF Naeem, M Naseer… - … of the AAAI Conference …, 2025 - ojs.aaai.org
… 2022b) is the pioneering prompt learning method for CLIP which learns text prompts to
fine-tune CLIP. CoCoOp (Zhou et al. 2022a) improves CoOp’s generalization by conditioning text …
Cited by 13
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
M Mistretta, A Baldrati, L Agnolucci, M Bertini… - arXiv preprint arXiv …, 2025 - arxiv.org
… Pre-trained multi-modal Vision-Language Models like CLIP are widely used offthe-shelf for
a … We argue that this is inherently due to the CLIP-style intermodal contrastive loss that does …
Cited by 1
CLIPGaze: Zero-Shot Goal-Directed Scanpath Prediction Using CLIP
Y Lai, R Quan, D Liang, J Qin - ICASSP 2025-2025 IEEE …, 2025 - ieeexplore.ieee.org
… CLIP models, we evaluate the performance of using ViT-B/32 and ViT-B/16 based CLIP …
Additionally, to validate the importance of visualsemantic alignment, we replace CLIP’s target …
VLSM-Ensemble: Ensembling CLIP-based Vision-Language Models for Enhanced Medical Image Segmentation
J Dietlmeier, OG Adegboro, VVV Ganepola… - Medical Imaging with …, 2025 - openreview.net
… CLIP-based image and text encoders and a combined image-text decoder. Both CLIP …
During fine-tuning, the CLIP encoders remain frozen and only decoder weights are updated…
MaskPrompt: Open-Vocabulary Affordance Segmentation with Object Shape Mask Prompts
D Chen, D Kong, J Li, B Yin - Proceedings of the AAAI Conference on …, 2025 - ojs.aaai.org
… Alpha-CLIP is a variant of CLIP that allows you to get information about wherever
you … We use CLIP's text encoder to convert these syndicated text captions into the embedded …
SoundBrush: Sound as a Brush for Visual Scene Editing
K Sung-Bin, K Jun-Seong, J Ko, Y Kim… - Proceedings of the AAAI …, 2025 - ojs.aaai.org
… After generating the samples, we adopt several CLIP-based metrics, including directional
similarity and feature similarity between two images in CLIP space, to ensure the quality of the …
Foundation Models Defining a New Era in Vision: a Survey and Outlook
M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
… The state-of-the-art performance of CLIP hinges on the large-scale image-text … CLIP
model. Utilizing largescale LAION datasets [46], [47], Open-CLIP [72] train and reproduce CLIP …
Cited by 181
Personamagic: Stage-regulated high-fidelity face customization with tandem equilibrium
X Li, J Zhan, S He, Y Xu, J Dong, H Zhang… - Proceedings of the AAAI …, 2025 - ojs.aaai.org
… Notably, we introduce the CLIP image embedding of the training image into the network
to generate the residual embedding. This is because features extracted from images are …
Cited by 2
Feature Design for Bridging SAM and CLIP Toward Referring Image Segmentation
K Ito - 2025 IEEE/CVF Winter Conference on Applications of …, 2025 - ieeexplore.ieee.org
… In the field of computer vision, CLIP and Segment anything model (SAM) have gained sig…
In this paper, we propose a model that integrates CLIP and SAM to enhance RIS. Since SAM …
TSAL: Few-shot Text Segmentation Based on Attribute Learning
C Li, C Liu, Y Fan, X Jin, X Hou, X Qian - arXiv preprint arXiv:2504.11164, 2025 - arxiv.org
… We propose TSAL, which leverages CLIP’s prior knowledge to learn text attributes for
segmentation. To fully utilize the semantic and texture information in the image, a visual-guided …