publications | Na Min An

2025

Aligned but Stereotypical? The Hidden Influence of System Prompts on Social Bias in LVLM-Based Text-to-Image Models

NaHyeon* Park, Na Min An*, Kunhee Kim, Soyeon Yoon, Jiahao Huo, and Hyunjung Shim

arXiv preprint arXiv:2512.04981, 2025

* indicates equal contribution.

Abs arXiv

Large vision-language model (LVLM) based text-to-image (T2I) systems have become the dominant paradigm in image generation, yet whether they amplify social biases remains insufficiently understood. In this paper, we show that LVLM-based models produce markedly more socially biased images than non-LVLM-based models. We introduce a 1,024 prompt benchmark spanning four levels of linguistic complexity and evaluate demographic bias across multiple attributes in a systematic manner. Our analysis identifies system prompts, the predefined instructions guiding LVLMs, as a primary driver of biased behavior. Through decoded intermediate representations, token-probability diagnostics, and embedding-association analyses, we reveal how system prompts encode demographic priors that propagate into image synthesis. To this end, we propose FairPro, a training-free meta-prompting framework that enables LVLMs to self-audit and construct fairness-aware system prompts at test time. Experiments on two LVLM-based T2I models, SANA and Qwen-Image, show that FairPro substantially reduces demographic bias while preserving text-image alignment. We believe our findings provide deeper insight into the central role of system prompts in bias propagation and offer a practical, deployable approach for building more socially responsible T2I systems.
World in a Frame: Understanding Culture Mixing as a New Challenge for Vision-Language Models

Eunsu Kim*, Junyeong* Park, Na Min An*, Junseong Kim, Hitesh Laxmichand Patel, Jiho Jin, Julia Kruk, Amit Agarwal, Srikant Panda, Fenal Ashokbhai Ilasariya, Hyunjung Shim, and Alice Oh

arXiv preprint arXiv:2511.22787, 2025

* indicates equal contribution.

Abs arXiv

In a globalized world, cultural elements from diverse origins frequently appear together within a single visual scene. We refer to these as culture mixing scenarios, yet how Large Vision-Language Models (LVLMs) perceive them remains underexplored. We investigate culture mixing as a critical challenge for LVLMs and examine how current models behave when cultural items from multiple regions appear together. To systematically analyze these behaviors, we construct CultureMix, a food Visual Question Answering (VQA) benchmark with 23k diffusion-generated, human-verified culture mixing images across four subtasks: (1) food-only, (2) food+food, (3) food+background, and (4) food+food+background. Evaluating 10 LVLMs, we find consistent failures to preserve individual cultural identities in mixed settings. Models show strong background reliance, with accuracy dropping 14% when cultural backgrounds are added to food-only baselines, and they produce inconsistent predictions for identical foods across different contexts. To address these limitations, we explore three robustness strategies. We find supervised fine-tuning using a diverse culture mixing dataset substantially improve model consistency and reduce background sensitivity. We call for increased attention to culture mixing scenarios as a critical step toward developing LVLMs capable of operating reliably in culturally diverse real-world environments.
Multi-Objective Task-Aware Predictor for Image-Text Alignment

Eunki Kim*, Na Min An*, James Thorne, and Hyunjung Shim

arXiv preprint arXiv:2510.00766, 2025

* indicates equal contribution.

Abs arXiv

Evaluating image-text alignment while reflecting human preferences across multiple aspects is a significant issue for the development of reliable vision-language applications. It becomes especially crucial in real-world scenarios where multiple valid descriptions exist depending on contexts or user needs. However, research progress is hindered by the lack of comprehensive benchmarks and existing evaluation predictors lacking at least one of these key properties: (1) Alignment with human judgments, (2) Long-sequence processing, (3) Inference efficiency, and (4) Applicability to multi-objective scoring. To address these challenges, we propose a plug-and-play architecture to build a robust predictor, MULTI-TAP (Multi-Objective Task-Aware Predictor), capable of both multi and single-objective scoring. MULTI-TAP can produce a single overall score, utilizing a reward head built on top of a large vision-language model (LVLMs). We show that MULTI-TAP is robust in terms of application to different LVLM architectures, achieving significantly higher performance than existing metrics and even on par with the GPT-4o-based predictor, G-VEval, with a smaller size (7-8B). By training a lightweight ridge regression layer on the frozen hidden states of a pre-trained LVLM, MULTI-TAP can produce fine-grained scores for multiple human-interpretable objectives. MULTI-TAP performs better than VisionREWARD, a high-performing multi-objective reward model, in both performance and efficiency on multi-objective benchmarks and our newly released text-image-to-text dataset, EYE4ALL. Our new dataset, consisting of chosen/rejected human preferences (EYE4ALLPref) and human-annotated fine-grained scores across seven dimensions (EYE4ALLMulti), can serve as a foundation for developing more accessible AI systems by capturing the underlying preferences of users, including blind and low-vision (BLV) individuals.
CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP

Na Min An, Inha Kang, Minhyun Lee, and Hyunjung Shim

arXiv preprint arXiv:2509.23098, 2025

Abs arXiv

Spatial grounding is crucial for referring image segmentation (RIS), where the goal of the task is to localize an object described by language. Current foundational vision-language models (VLMs), such as CLIP, excel at aligning images and text but struggle with understanding spatial relationships. Within the language stream, most existing methods often focus on the primary noun phrase when extracting local text features, undermining contextual tokens. Within the vision stream, CLIP generates similar features for images with different spatial layouts, resulting in limited sensitivity to spatial structure. To address these limitations, we propose \textscCoPatch, a zero-shot RIS framework that leverages internal model components to enhance spatial representations in both text and image modalities. For language, CoPatch constructs hybrid text features by incorporating context tokens carrying spatial cues. For vision, it extracts patch-level image features using our novel path discovered from intermediate layers, where spatial structure is better preserved. These enhanced features are fused into a clustered image-text similarity map, CoMap, enabling precise mask selection. As a result, CoPatch significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+ 2–7 mIoU) without requiring any additional training. Our findings underscore the importance of recovering and leveraging the untapped spatial knowledge inherently embedded in VLMs, thereby paving the way for opportunities in zero-shot RIS.
How Blind and Low-Vision Individuals Prefer Large Vision-Language Model-Generated Scene Descriptions

Na Min An*, Eunki Kim*, Wan Ju Kang, Sangryul Kim, James Thorne, and Hyunjung Shim

arXiv preprint arXiv:2502.14883, 2025

* indicates equal contribution.

Abs arXiv

For individuals with blindness or low vision (BLV), navigating complex environments can pose serious risks. Large Vision-Language Models (LVLMs) show promise for generating scene descriptions, but their effectiveness for BLV users remains underexplored. To address this gap, we conducted a user study with eight BLV participants to systematically evaluate preferences for six types of LVLM descriptions. While they helped to reduce fear and improve actionability, user ratings showed wide variation in sufficiency and conciseness. Furthermore, GPT-4o–despite its strong potential to refine descriptions–was not consistently preferred by participants. We use the insights obtained from the user study to build training data for building our new automatic evaluation metric that can capture BLV preferences effectively. Our findings underscore the urgent need for BLV-centered evaluation metrics and human-in-the-loop feedback to advance LVLM description quality for accessibility.
VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization

Sania Waheed, Na Min An, Michael Milford, Sarvapali D. Ramchurn, and Shoaib Ehsan

ACRA, 2025

Abs arXiv

Geo-localization from a single image at planet scale (essentially an advanced or extreme version of the kidnapped robot problem) is a fundamental and challenging task in applications such as navigation, autonomous driving and disaster response due to the vast diversity of locations, environmental conditions, and scene variations. Traditional retrieval-based methods for geo-localization struggle with scalability and perceptual aliasing, while classification-based approaches lack generalization and require extensive training data. Recent advances in vision-language models (VLMs) offer a promising alternative by leveraging contextual understanding and reasoning. However, while VLMs achieve high accuracy, they are often prone to hallucinations and lack interpretability, making them unreliable as standalone solutions. In this work, we propose a novel hybrid geo-localization framework that combines the strengths of VLMs with retrieval-based visual place recognition (VPR) methods. Our approach first leverages a VLM to generate a prior, effectively guiding and constraining the retrieval search space. We then employ a retrieval step, followed by a re-ranking mechanism that selects the most geographically plausible matches based on feature similarity and proximity to the initially estimated coordinates. We evaluate our approach on multiple geo-localization benchmarks and show that it consistently outperforms prior state-of-the-art methods, particularly at street (up to 4.51%) and city level (up to 13.52%). Our results demonstrate that VLM-generated geographic priors in combination with VPR lead to scalable, robust, and accurate geo-localization systems.
Image Embedding Sampling Method for Diverse Captioning

Sania Waheed*, and Na Min An*

EMNLP Main, 2025

* indicates equal contribution.

Abs arXiv PDF

Image Captioning for state-of-the-art VLMs has significantly improved over time; however, this comes at the cost of increased computational complexity, making them less accessible for resource-constrained applications such as mobile devices and assistive technologies. Alternatively, smaller VLMs prioritize high-level scene descriptions, overlooking finer details that contribute to a richer understanding of an image. In this paper, we introduce a training-free framework that enhances caption diversity and informativeness by explicitly attending to distinct image regions using a comparably small VLM, BLIP, as the backbone. Our approach leverages structured segmentation to produce hierarchical representations that capture both global and localized semantics. Without requiring additional model training, we demonstrate that our method allows smaller VLMs to achieve performance comparable to larger models in terms of image-caption alignment, semantic integrity, and diversity. We evaluate our framework on MSCOCO, Flickr30k, and Nocaps test datasets, achieving a Div-2 score of 0.735, 0.750, and 0.748 for each dataset respectively, while maintaining strong image-caption relevancy and semantic integrity with the human-annotated captions.
I0T: Embedding Standardization Method Towards Zero Modality Gap

Na Min An*, Eunki Kim*, James Thorne, and Hyunjung Shim

In ACL (Outstanding; Top 1%, 26 out of 3,000 accepted papers), 2025

* indicates equal contribution.

Abs arXiv PDF

Contrastive Language-Image Pretraining (CLIP) enables zero-shot inference in downstream tasks such as image-text retrieval and classification. However, recent works extending CLIP suffer from the issue of *modality gap*, which arises when the image and text embeddings are projected to disparate manifolds, deviating from the intended objective of image-text contrastive learning. We discover that this phenomenon is linked to the modality-specific characteristic that each image or text encoder independently possesses. Herein, we propose two methods to address the modality gap: (1) a post-hoc embedding standardization method, I0T_post that reduces the modality gap approximately to zero and (2) a trainable method, I0T_async, to alleviate the modality gap problem by adding two normalization layers for each encoder. Our I0T framework can significantly reduce the modality gap while preserving the original embedding representations of trained models with their locked parameters. In practice, I0T_post can serve as an alternative explainable automatic evaluation metric of widely used CLIPScore (CLIP-S). The code is available in https://github.com/xfactlab/I0T.
Diffusion Models Through a Global Lens: Are They Culturally Inclusive?

Zahra Bayramli*, Ayhan Suleymanzade*, Na Min An, Huzama Ahmad, Eunsu Kim, Junyeong Park, James Thorne, and Alice Oh

In ACL (Oral; Top 8%, 243 out of 3,000 accepted papers), 2025

* indicates equal contribution.

Abs arXiv PDF

Text-to-image diffusion models have recently enabled the creation of visually compelling, detailed images from textual prompts. However, their ability to accurately represent various cultural nuances remains an open question. In our work, we introduce CultDiff benchmark, evaluating state-of-the-art diffusion models whether they can generate culturally specific images spanning ten countries. We show that these models often fail to generate cultural artifacts in architecture, clothing, and food, especially for underrepresented country regions, by conducting a fine-grained analysis of different similarity aspects, revealing significant disparities in cultural relevance, description fidelity, and realism compared to real-world reference images. With the collected human evaluations, we develop a neural-based image-image similarity metric, namely, CultDiff-S, to predict human judgment on real and generated images with cultural artifacts. Our work highlights the need for more inclusive generative AI systems and equitable dataset representation over a wide range of cultures.
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions

Wan Ju Kang, Eunki Kim, Na Min An, Sangryul Kim, Haemin Choi, Ki Hoon Kwak, and James Thorne

In ACL Main, 2025

Abs arXiv PDF

Often, the needs and visual abilities differ between the annotator group and the end user group. Generating detailed diagram descriptions for blind and low-vision (BLV) users is one such challenging domain. Sighted annotators could describe visuals with ease, but existing studies have shown that direct generations by them are costly, bias-prone, and somewhat lacking by BLV standards. In this study, we ask sighted individuals to assess – rather than produce – diagram descriptions generated by vision-language models (VLM) that have been guided with latent supervision via a multi-pass inference. The sighted assessments prove effective and useful to professional educators who are themselves BLV and teach visually impaired learners. We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes and demonstrate their fine-tuning potential in various downstream tasks.
WHEN TOM EATS KIMCHI: Evaluating Cultural Awareness of Multimodal Large Language Models in Cultural Mixture Contexts

Jun Seong Kim*, Kyaw Ye Thu*, Javad Ismayilzada, Junyeong Park, Eunsu Kim, Huzama Ahmad, Na Min An, James Thorne, and Alice Oh

In NAACL C3NLP Workshop (Outstanding Paper), 2025

* indicates equal contribution.

Abs arXiv PDF

In a highly globalized world, it is important for multi-modal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs.For example, a model should correctly identify kimchi (Korean food) in an image both when an Asian woman is eating it, as well as an African man is eating it.However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicity, we introduce MIXCUBE, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbation for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures
Machine Learning Techniques for Simulating Human Psychophysical Testing of Low-Resolution Phosphene Face Images in Artificial Vision

Na Min An, Hyeonhee Roh, Sein Kim, Jae Hun Kim, and Maesoon Im

Advanced Science (Impact Factor 14.3, JCR Ranking 6.5%), 2025

Abs PDF

To evaluate the quality of artificial visual percepts generated by emerging methodologies, researchers often rely on labor-intensive and tedious human psychophysical experiments. These experiments necessitate repeated iterations upon any major/minor modifications in the hardware/software configurations. Here, the capacity of standard machine learning (ML) models is investigated to accurately replicate quaternary match-to-sample tasks using low-resolution facial images represented by arrays of phosphenes as input stimuli. Initially, the performance of the ML models trained to approximate innate human facial recognition abilities across a dataset comprising 3600 phosphene images of human faces is analyzed. Subsequently, due to the time constraints and the potential for subject fatigue, the psychophysical test is limited to presenting only 720 low-resolution phosphene images to 36 human subjects. Notably, the superior model adeptly mirrors the behavioral trend of human subjects, offering precise predictions for 8 out of 9 phosphene quality levels on the overlapping test queries. Subsequently, human recognition performances for untested phosphene images are predicted, streamlining the process and minimizing the need for additional psychophysical tests. The findings underscore the transformative potential of ML in reshaping the research paradigm of visual prosthetics, facilitating the expedited advancement of prostheses.

2024

Stable Language Model Pre-training by Reducing Embedding Variability

Woojin Chung, Jiwoo Hong, Na Min An, James Thorne, and Se-Young Yun

In EMNLP Main, 2024

Abs arXiv PDF

Stable pre-training is essential for achieving better-performing language models. However, tracking pre-training stability by calculating gradient variance at every step is impractical due to the significant computational costs. We explore Token Embedding Variability (TEV) as a simple and efficient proxy for assessing pre-training stability in language models with pre-layer normalization, given that shallower layers are more prone to gradient explosion (section 2.2). Moreover, we propose Multi-head Low-Rank Attention (MLRA) as an architecture to alleviate such instability by limiting the exponential growth of output embedding variance, thereby preventing the gradient explosion (section 3.2). Empirical results on GPT-2 with MLRA demonstrate increased stability and lower perplexity, particularly in deeper models.
Capturing the Relationship Between Sentence Triplets for LLM and Human-Generated Texts to Enhance Sentence Embeddings

Na Min An, Sania Waheed, and James Thorne

In EACL Findings, 2024

Abs PDF

Deriving meaningful sentence embeddings is crucial in capturing the semantic relationship between texts. Recent advances in building sentence embedding models have centered on replacing traditional human-generated text datasets with those generated by LLMs. However, the properties of these widely used LLM-generated texts remain largely unexplored. Here, we evaluate the quality of the LLM-generated texts from four perspectives (Positive Text Repetition, Length Difference Penalty, Positive Score Compactness, and Negative Text Implausibility) and find that there exists an inherent difference between human and LLM-generated datasets. To further enhance sentence embeddings using both human and LLM-generated datasets, we propose a novel loss function that incorporates Positive-Negative sample Augmentation (PNA) within the contrastive learning objective. Our results demonstrate that PNA effectively mitigates the sentence anisotropy problem in Wikipedia corpus (-7% compared to CLHAIF) and simultaneously improves the Spearman’s correlation in standard Semantic Textual Similarity (STS) tasks (+1.47% compared to CLHAIF).

2023

Can Large Language Models Capture Dissenting Human Voices?

Noah Lee*, Na Min An*, and James Thorne

In EMNLP Main, 2023

* indicates equal contribution.

Abs arXiv PDF

Large language models (LLMs) have shown impressive achievements in solving a broad range of tasks. Augmented by instruction fine-tuning, LLMs have also been shown to generalize in zero-shot settings as well. However, whether LLMs closely align with the human disagreement distribution has not been well-studied, especially within the scope of natural language inference (NLI). In this paper, we evaluate the performance and alignment of LLM distribution with humans using two different techniques to estimate the multinomial distribution: Monte Carlo Estimation (MCE) and Log Probability Estimation (LPE). As a result, we show LLMs exhibit limited ability in solving NLI tasks and simultaneously fail to capture human disagreement distribution. The inference and human alignment performances plunge even further on data samples with high human disagreement levels, raising concerns about their natural language understanding (NLU) ability and their representativeness to a larger human population.
Reinforcement Learning Framework to Simulate Short-Term Learning Effects of Human Psychophysical Experiments Assessing the Quality of Artificial Vision

Na Min An, Hyeonhee Roh, Sein Kim, Jae Hun Kim, and Maesoon Im

In IJCNN (Oral), 2023

Abs PDF

The quality of the artificial vision produced by visual prostheses has traditionally been evaluated by human psychophysical tests using images expressed with an array of phosphenes. However, such experiments involving human subjects are considerably time-consuming and labor-intensive. One potential solution may be implementing an efficient approach to assist or replace psychophysical experiments showing a short-term learning effect in human subjects. The present work developed a reinforcement learning (RL)-based feedback framework which built artificial agents to emulate the behavioral changes in the learning process of human subjects over the experimental trials. In our framework, we first trained an agent which can gradually learn to identify 720 faces presented in low-resolution phosphene images with feedback rewards received from the agreement in the perception of nine training human subjects. Then, in the automating stage of the framework, we created nine RL agents. By testing those agents, we found the RL agents mimicked the learning effects of nine test human subjects better than nine instances of a supervised learning (SL) model. Given the similar outcomes with human tests and the time efficiency of RL, our framework may expedite the development of visual prosthetic systems by at least partially replacing laborious human psychophysical experiments.
Artificial Vision Parameter Learning and Automating Method for Improving Visual Prosthetic Systems

Maesoon Im, Hyeonhee Roh, Na Min An, and Jae Hun Kim

2023

US Patent App. 18/075,555

PDF

2021

Machine Learning Approaches as An Alternative to Human Psychophysical Tests of Prosthetic Vision (abstract)

Na Min An, Hyeonhee Roh, Soomin Jung, Eun Ju Kim, and Maesoon Im

EMBC Abstract, 2021

Abs PDF

Microelectronic retinal prostheses can evoke artificial visual percepts to blind individuals. Among several factors, expression of original images using limited resources (i.e., small number of pixels and low gray scales) is critical for improved recognition of artificial vision. To assess effectiveness of new image processing techniques, prosthetic researchers have performed psychophysical tests with normally-sighted subjects. Here, we evaluated facial recognition task performances of two machine learning (ML) models for phosphene images in various conditions, representing several electrical stimulation cases. Both ML models showed correct response ratios with trends expected from psychophysical studies with human subjects. Our results suggest ML approaches would be useful for evaluating newly-developed artificial vision image processing techniques. In particular, the use of ML approaches is expected to substantially reduce time and cost needed for psychophysical testing of phosphene images.