Selected Publications
When and why vision-language models behave like bags-of-words, and what to do about it?
Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, James Zou
Oral @ ICLR 2023 (Top 5% of all accepted papers)
[Paper, Code]
Recent work [and many tweet threads] suggests that Vision-Language Models (VLMs) such as CLIP do not fare well with compositional understanding. We first propose ARO (Attribution, Relation, and Order), a large-scale benchmark to evaluate fine-grained relational, attributive, and order understanding. Why is this bag-of-words-like behavior not reflected in the standard retrieval evaluations (e.g., COCO/Flickr30k), whose datasets contain rich compositional structure? We design experiments showing that models do not need compositional understanding to perform well on these tasks, which also explains why contrastively pretrained models may be exploiting this shortcut and why we should be careful about it. Following this intuition, we propose composition-aware negative mining. Check out our work!
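For intuition, here is a minimal sketch of the composition-aware hard-negative idea (the helper names and loss details are illustrative, not the paper's exact training recipe): shuffled captions are appended as extra text-side negatives, so the contrastive loss penalizes order-insensitive matching.

```python
import random

import torch
import torch.nn.functional as F

def shuffle_caption(caption: str) -> str:
    # Order-perturbed negative: same words, scrambled composition.
    words = caption.split()
    random.shuffle(words)
    return " ".join(words)

def contrastive_loss_with_order_negatives(image_emb, text_emb, neg_text_emb, tau=0.07):
    # image_emb, text_emb, neg_text_emb: (B, D) L2-normalized embeddings,
    # where neg_text_emb encodes the shuffled captions.
    logits = image_emb @ torch.cat([text_emb, neg_text_emb]).T / tau  # (B, 2B)
    targets = torch.arange(image_emb.size(0))  # the i-th caption matches the i-th image
    return F.cross_entropy(logits, targets)
```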
Beyond Confidence: Reliable Models Should Also Quantify Atypicality
Mert Yuksekgonul, Linjun Zhang, James Zou, Carlos Guestrin
NeurIPS 2023; Contributed Talk @ ICLR 2023 Workshop on Trustworthy Machine Learning
[Code Soon]
While most machine learning models can provide confidence in their predictions, confidence alone is insufficient to understand and use a model's uncertainty reliably. In this work, we investigate the relationship between how atypical (i.e., rare) a sample is and how reliable the model's confidence is for that sample. Read the paper for interesting connections between atypicality and uncertainty!
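As a toy illustration of the atypicality idea (a deliberately simple stand-in, not the paper's exact estimator), one can score a test point by its Mahalanobis distance to the training feature distribution and then examine confidence reliability separately for typical and atypical points:

```python
import numpy as np

def fit_gaussian(train_feats):
    # Fit a single Gaussian to training features (per-class fits also work).
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
    return mu, np.linalg.inv(cov)

def atypicality(x, mu, cov_inv):
    # Higher = rarer: squared Mahalanobis distance to the training data.
    d = x - mu
    return float(d @ cov_inv @ d)

# Usage sketch: bin test points by this score, then compare calibration
# (e.g., expected calibration error) within each bin.
```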
Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models
Mert Yuksekgonul, Varun Chandrasekaran, Erik Jones, Suriya Gunasekar, Ranjita Naik, Hamid Palangi, Ece Kamar, Besmira Nushi
Preprint
[Paper, Code Soon]
We investigate the internal behavior of LLMs when they generate factually incorrect text. We propose modeling factual queries as constraint satisfaction problems and use this framework to investigate how the model interacts internally with factual constraints. We find a strong positive relationship between the model's attention to constraint tokens and the factual accuracy of its generations.
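A rough sketch of the kind of signal the paper studies, using GPT-2 as a small stand-in model (the prompt, constraint span, and aggregation here are illustrative): measure how much attention the final position places on the constraint's tokens.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

prompt = "The movie Titanic was directed by"
constraint = " Titanic"  # tokens of the factual constraint we track
ids = tok(prompt, return_tensors="pt").input_ids
c_ids = tok(constraint).input_ids

# Locate the constraint's token positions inside the prompt (naive scan).
pos = next(i for i in range(ids.size(1) - len(c_ids) + 1)
           if ids[0, i:i + len(c_ids)].tolist() == c_ids)

with torch.no_grad():
    attn = model(ids).attentions  # per layer: (1, heads, seq, seq)

# Attention mass from the final position onto the constraint tokens,
# averaged over heads and layers: a crude correlate of factual accuracy.
mass = torch.stack([a[0, :, -1, pos:pos + len(c_ids)].sum(-1).mean() for a in attn]).mean()
print(f"attention to constraint: {mass:.3f}")
```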
Post-hoc Concept Bottleneck Models
Mert Yuksekgonul, Maggie Wang, James Zou
Spotlight @ ICLR 2023 (Top 25% of all accepted papers)
[Paper, Code]
Concept Bottleneck Models (CBMs) are very cool! But it is hard to find large training datasets with concept annotations, and hard to match the performance of unrestricted neural networks through a limited bottleneck. In this work, we address these practical limitations by introducing Post-hoc Concept Bottleneck Models (PCBMs). We also run a user study in which humans improve PCBMs via concept-level feedback.
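A minimal sketch of the post-hoc recipe (the concept-direction source and sparse-head hyperparameters are assumptions; see the paper/code for the real setup): project frozen backbone features onto concept directions, then fit a sparse linear head whose weights read as concept importances.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def fit_pcbm(feats, labels, concept_dirs, alpha=1e-4):
    # feats:        (N, D) embeddings from a frozen backbone
    # concept_dirs: (C, D) one direction per named concept (e.g., CAVs)
    concept_scores = feats @ concept_dirs.T  # (N, C) interpretable bottleneck
    head = SGDClassifier(loss="log_loss", penalty="elasticnet",
                         alpha=alpha, l1_ratio=0.99)
    head.fit(concept_scores, labels)
    return head  # head.coef_[k, c]: weight of concept c for class k
```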
Leveraging medical Twitter to build a visual–language foundation model for pathology AI
Zhi Huang*, Federico Bianchi*, Mert Yuksekgonul, Thomas Montine, James Zou
Nature Medicine
[Preprint, Demo]
We collect data from Twitter (yes, Twitter) and LAION and release OpenPath, the largest public text-image pathology dataset. We also release PLIP, a CLIP variant for pathology that delivers exciting zero-shot, transfer learning, and retrieval performance.
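Since PLIP follows the CLIP interface, zero-shot classification of a pathology patch looks roughly like this (the checkpoint name and label prompts are my assumptions; any CLIP-style checkpoint works the same way):

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("vinid/plip")        # assumed checkpoint name
processor = CLIPProcessor.from_pretrained("vinid/plip")

image = Image.open("patch.png")  # a pathology image patch
labels = ["an H&E image of adenocarcinoma", "an H&E image of benign tissue"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```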
Meaningfully debugging model mistakes using conceptual counterfactual explanations
Abubakar Abid*, Mert Yuksekgonul*, James Zou
ICML 2022
[Paper, Code]
We use human-understandable concepts and counterfactual explanations (which we call Conceptual Counterfactual Explanations) to debug model mistakes and reveal a model's biases.
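A minimal sketch of the counterfactual search (shapes, hyperparameters, and the L1 penalty are illustrative): learn a sparse set of weights over a concept bank that, added to the example's embedding, flips the classifier to the desired label.

```python
import torch
import torch.nn.functional as F

def conceptual_counterfactual(emb, target, classifier, concept_dirs,
                              steps=200, lr=0.1, l1=0.01):
    # emb:          (D,) embedding of the misclassified example
    # concept_dirs: (C, D) bank of concept directions
    w = torch.zeros(concept_dirs.size(0), requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        logits = classifier(emb + w @ concept_dirs)
        loss = F.cross_entropy(logits[None], torch.tensor([target]))
        loss = loss + l1 * w.abs().sum()  # sparsity: few, readable concept edits
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach()  # large +/- entries read as "add/remove concept c"
```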
KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval
Marah Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi
Preprint
[Paper, Dataset]
We view prompts to LLMs as constraint satisfaction problems and evaluate their ability to retrieve information subject to constraints.
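The evaluation reduces to checking model outputs against verifiable constraints; a toy version (the example query and checker are hypothetical, not KITAB's actual metrics):

```python
def satisfaction_rate(items, constraint):
    # Fraction of returned items that satisfy a programmatically checkable constraint.
    return sum(constraint(item) for item in items) / max(len(items), 1)

# e.g., "books by Ian McEwan whose title starts with 'A'"
answer = ["Atonement", "Amsterdam", "Saturday"]
print(satisfaction_rate(answer, lambda t: t.startswith("A")))  # ~0.67
```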
Diversity of Thought Improves Reasoning Abilities of Large Language Models
Ranjita Naik, Varun Chandrasekaran, Mert Yuksekgonul, Hamid Palangi, Besmira Nushi
Preprint
[Paper]
We demonstrate a simple way to promote diversity of thought and improve the reasoning abilities of large language models.
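In spirit (the prompt phrasings and aggregation are my illustrative choices, and `llm` is a hypothetical prompt-to-answer callable), the idea looks like:

```python
from collections import Counter

APPROACHES = [
    "Think step by step.",
    "Work backwards from the answer choices.",
    "Start with a concrete example or analogy.",
]

def diverse_answer(question, llm):
    # Elicit several distinct reasoning styles, then majority-vote the answers.
    answers = [llm(f"{question}\n{approach}\nFinal answer:") for approach in APPROACHES]
    return Counter(answers).most_common(1)[0][0]
```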
Discover and Cure: Concept-aware Mitigation of Spurious Correlation
Shirley Wu, Mert Yuksekgonul, Linjun Zhang, James Zou
ICML 2023
[Paper, Code]
We give a metric to quantify and monitor how spurious individual concepts are, and propose concept-aware mixup to mitigate spurious correlations.
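A toy rendering of the mixup step (the pairing rule is simplified; the paper selects concepts via its learned spuriousness metric): mix each example with one from a different spurious-concept group, so labels decorrelate from the concept.

```python
import torch

def concept_aware_mixup(x, y, concept_ids, alpha=0.4):
    # x: (B, C, H, W) inputs, y: (B,) labels, concept_ids: (B,) spurious-concept ids.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    keep = concept_ids != concept_ids[perm]        # only cross-concept pairs
    x_mix = lam * x[keep] + (1 - lam) * x[perm][keep]
    return x_mix, y[keep], y[perm][keep], lam      # train with mixed CE loss
```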
SkinCon: A skin disease dataset densely annotated by domain experts for fine-grained model debugging and analysis
Roxana Daneshjou*, Mert Yuksekgonul*, Zhuo Ran Cai, Roberto Novoa, James Zou
NeurIPS 2022 Datasets and Benchmarks Track
[Paper, Dataset]
In dermatology, skin disease is described using an established clinical lexicon that allows clinicians to communicate physical exam findings to one another. We release SkinCon, a densely annotated skin lesion dataset for concept-level analysis and debugging of machine learning models in dermatology.
GPT detectors are biased against non-native English writers
Weixin Liang*, Mert Yuksekgonul*, Yining Mao*, Eric Wu*, James Zou
Patterns
[Paper, Code]
We show that models that predict whether a text is human- or AI-written tend to be biased against non-native English writers.
Holistic Evaluation of Language Models (HELM)
with 50+ Collaborators at Stanford HAI / CRFM
TMLR
[Paper, Website, Code]
Language models are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present the Holistic Evaluation of Language Models (HELM) to improve the transparency of language models and to evaluate them holistically.
Pretraining boosts out-of-domain robustness for pose estimation
Alexander Mathis, Thomas Biasi, Steffen Schneider, Mert Yuksekgonul, Byron Rogers, Matthias Bethge, Mackenzie Mathis
WACV 2021
[Paper]
Architectures that perform better on ImageNet also perform better on both within- and out-of-domain pose data when pretrained on ImageNet, and better ImageNet models generalize better across animal species. We also introduce a cool dataset called Horse-C, a new benchmark of common corruptions for pose estimation, and confirm that pretraining increases performance under this kind of domain shift as well.
Learning prototypes for multiple instance learning
Mert Yuksekgonul, Ozgur Emre Sivrikaya, Mustafa Baydogan
NeurIPS 2019 Sets & Partitions Workshop; Turkish Journal of Electrical Engineering & Computer Sciences, 2021
[Paper]
End-to-end learning of prototypes for multiple instance learning.
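A compact sketch of the idea in PyTorch (the distance-to-similarity mapping and pooling are illustrative, not the paper's exact parameterization): learn prototypes, score each instance against them, and pool over the bag.

```python
import torch
import torch.nn as nn

class PrototypeMIL(nn.Module):
    # Bag classifier built on learnable prototypes.
    def __init__(self, feat_dim, n_prototypes, n_classes):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, feat_dim))
        self.head = nn.Linear(n_prototypes, n_classes)

    def forward(self, bag):                        # bag: (n_instances, feat_dim)
        dist = torch.cdist(bag, self.prototypes)   # instance-prototype distances
        sim = torch.exp(-dist)                     # similarity in (0, 1]
        bag_repr = sim.max(dim=0).values           # max-pool over instances
        return self.head(bag_repr)                 # bag-level logits
```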