Letitia Parcalabescu

Letiția Pârcălăbescu

PhD, AI Researcher at Aleph Alpha Research

Deep Learning, LLMs, Vision and Language Models, Explainable AI, Interpretability

Letitia Parcalabescu has an academic background in Physics and Computer Science and holds a PhD in Computational Linguistics. Her doctoral research focused on benchmarking and interpreting the internal processes and explanations of multimodal AI models. Currently, she is an AI researcher at Aleph Alpha Research, working on training reasoning models that are interpretable by design, as well as on curating and synthesizing data for large-scale pre-training.

She created the "AI Coffee Break with Letitia" YouTube channel, where she breaks down complex AI concepts. Topics range from the latest research results in natural language processing and computer vision to the broader societal impact of AI.

Highlights

YouTube Channel

AI Coffee Break with Letitia

Lighthearted bite-sized ML videos for your AI Coffee Break! 📺 Mostly videos about the latest technical advancements in AI, such as large language models (LLMs), text-to-image models and everything cool in natural language processing, computer vision, etc.!

Reviewing

ACL 2025 Area Chair

ICLR, ACL Rolling Review (ARR, monthly), EACL, EMNLP, NAACL, ACL, CVPR, ACMMM, EurNLP

Workshops: MULA, RepL4NLP, LIMO

ACL 2021 (Outstanding Reviewer)

Honors

DAAD Vollstipendium für Absolventen deutscher Auslandsschulen

Full scholarship to study the subject of my choice in Germany after graduating high school in Romania

Young Researcher at the Heidelberg Laureate Forum 2022

Recipient of the Abbe Grant funded by the Carl-Zeiss-Stiftung

Nominated for the GI Dissertationspreis 2024, Dagstuhl

Nominated by Heidelberg University for my PhD thesis

Publications

Visit my Google Scholar page for a complete list. Selection:

Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?

Parcalabescu, L. and Frank, A., 2025. The Thirteenth International Conference on Learning Representations (ICLR)

Vision-and-language models (VLMs) – think GPT-4o – can answer questions about images and explain their reasoning. But how much do these explanations and answers actually rely on the image, and how much is just based on text?

In this project, I take a closer look at how VLMs combine vision and language when reasoning about visual input. Do they rely more on images when generating explanations than when producing answers? Are their explanations internally consistent? And how well do today’s top-performing VLMs handle the VALSE benchmark I previously introduced?

To answer these questions, I apply established tools for measuring explanation faithfulness, including methods I previously introduced, as well as my metric for quantifying modality usage. I compare model behavior across both post-hoc and chain-of-thought (CoT) explanation settings. The findings are striking: text dominates across the board, but the contribution of the image increases when models are asked to explain themselves, especially in CoT setups.

Finally, I present an up-to-date evaluation of modern vision-language decoders on the VALSE benchmark — which we designed a few years ago to test earlier-generation VLMs. The results? Despite recent advances, today’s models still struggle with grounded, image-based reasoning.

On Measuring Faithfulness or Self-Consistency of Natural Language Explanations

Parcalabescu, L. and Frank, A., 2024. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)

Are LLMs just making explanations up?

Large language models can explain their answers — either after the fact (post-hoc) or step by step (Chain-of-Thought). But here’s the catch: sounding reasonable doesn’t mean being truthful. Many so-called “faithfulness tests” don’t actually peek into the model’s reasoning — they just check if the answer and explanation match on the surface.

In this project, we set the record straight:
🧠 We show that most “faithfulness” tests are really just self-consistency checks.
🧪 We introduce the Comparative Consistency Bank: the first large-scale comparison of existing tests across 11 open LLMs and 5 tasks.
🔍 We propose CC-SHAP, a fine-grained metric that dives deeper: it compares how inputs influence both the answer and the explanation, shedding light on the model’s actual reasoning.

Faithfulness starts from the model internals — and CC-SHAP brings us closer to that.
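
To make the intuition concrete, here is a toy sketch (my own illustration, not the paper's code) of the self-consistency idea: measure how much each input token contributes once while the model produces its answer and once while it produces its explanation, then compare the two attribution patterns. I use cosine similarity between normalized contribution vectors as the comparison; the paper defines its own, more careful measure.

    import numpy as np

    def self_consistency_score(contrib_answer, contrib_explanation):
        """Compare two per-token contribution vectors over the same input."""
        a = np.abs(np.asarray(contrib_answer, dtype=float))
        e = np.abs(np.asarray(contrib_explanation, dtype=float))
        a, e = a / a.sum(), e / e.sum()  # normalize to contribution shares
        return float(a @ e / (np.linalg.norm(a) * np.linalg.norm(e)))

    # A model that relies on the same input parts when answering and when explaining
    # scores close to 1; one that "explains" with different reasons scores much lower.
    print(self_consistency_score([0.4, 0.1, 0.5], [0.35, 0.15, 0.5]))    # ~0.99
    print(self_consistency_score([0.9, 0.05, 0.05], [0.05, 0.05, 0.9]))  # ~0.11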

ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models

Kesen, I., Pedrotti, A., Dogan, M., Cafagna, M., Acikgoz, E.C., Parcalabescu, L., Calixto, I., Frank, A., Gatt, A., Erdem, A. and Erdem, E., 2024. The Twelfth International Conference on Learning Representations (ICLR)

Video-language models (VidLMs) are everywhere — but how well do they really understand what they see and hear over the entire video?

Enter ViLMA (Video Language Model Assessment): a task-agnostic benchmark designed to probe the fine-grained reasoning skills of VidLMs beyond surface-level performance. Unlike typical task-based evaluations, ViLMA focuses on temporal understanding and visual grounding, using carefully crafted counterfactuals and controlled setups.

ViLMA also includes proficiency tests to measure core abilities that VidLMs should have before solving more complex reasoning tasks.

The findings? Today’s VidLMs don’t perform any better than models trained on static images — even after accounting for basic proficiency.

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks

Parcalabescu, L. and Frank, A., 2023. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4032–4059, Toronto, Canada. Association for Computational Linguistics.

Multimodal models are supposed to combine information from both vision and language — but often, they cheat. When a unimodal model performs just as well as a multimodal one, it’s a red flag: unimodal collapse.

But relying on accuracy alone doesn’t tell the whole story. What if a model uses the right modality — but still gets the answer wrong?

That’s where MM-SHAP comes in. It’s a performance-agnostic metric based on Shapley values that quantifies how much a model relies on each modality — regardless of whether its prediction is right or wrong.

We use MM-SHAP to: 📊 Compare models by their average degree of multimodality, and 🧪 measure how individual modalities contribute across different tasks and datasets.

Applied to six vision-language models (including LXMERT, CLIP, and ALBEF variants) on four tasks, MM-SHAP reveals: unimodal collapse isn’t just common — it happens in different ways and directions.

💡 MM-SHAP helps you diagnose what’s going wrong and, more importantly, can help build truly multimodal models.
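
For the curious, the core computation is simple once per-token attributions exist. The sketch below is my own toy code, not the paper's implementation: it assumes Shapley values have already been estimated for every text token and image patch (e.g. with a masking-based explainer), and the modality score is then the share of absolute attribution mass falling on each modality.

    def modality_shares(shap_values, text_indices, image_indices):
        """Share of total absolute contribution attributed to each modality."""
        total = sum(abs(shap_values[i]) for i in text_indices + image_indices)
        if total == 0:
            return {"text": 0.5, "image": 0.5}  # degenerate case: no signal at all
        text_share = sum(abs(shap_values[i]) for i in text_indices) / total
        return {"text": text_share, "image": 1.0 - text_share}

    # Toy example: 4 text tokens and 4 image patches with made-up attributions.
    phi = [0.30, -0.10, 0.05, 0.20,   # text tokens
           0.02, -0.01, 0.03, 0.04]   # image patches
    print(modality_shares(phi, text_indices=[0, 1, 2, 3], image_indices=[4, 5, 6, 7]))
    # -> roughly 87% text vs. 13% image: a strong hint of unimodal collapse here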

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena

Parcalabescu, L., Cafagna, M., Muradjan, L., Frank, A., Calixto, I. and Gatt, A., 2022. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8253–8280, Dublin, Ireland. Association for Computational Linguistics.

Pretrained vision-and-language models (VLMs) may shine on standard tasks — but do they really understand the connection between images and language?

VALSE (Vision And Language Structured Evaluation) is a benchmark designed to find out. Rather than testing models on downstream tasks, VALSE zooms in on their ability to ground specific linguistic phenomena in the visual modality — things like spatial relations, counting, and negation.

VALSE includes six targeted tests, each crafted to reveal whether a model is truly connecting vision and language, or just guessing from shortcuts. We ensure high-quality examples through controlled construction methods and valid foils.

We tested five popular VLMs — and the results are sobering: most models struggle with core visio-linguistic reasoning.

With VALSE, we offer a finer lens for evaluating V&L models and tracking real progress — not just accuracy on tasks, but actual grounding and understanding.
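
As a rough illustration of foil-based evaluation: each instance pairs an image with a correct caption and a minimally different foil, and a grounded model should give the caption the higher image-sentence alignment score. The sketch below is a toy example, not the benchmark's evaluation code; alignment_score and the dummy scorer are hypothetical stand-ins.

    def pairwise_accuracy(instances, alignment_score):
        """Fraction of instances where the correct caption beats its foil."""
        correct = sum(
            alignment_score(image, caption) > alignment_score(image, foil)
            for image, caption, foil in instances
        )
        return correct / len(instances)

    # Dummy scorer for illustration only: counts keywords "visible" in the image.
    def dummy_score(image_keywords, sentence):
        return sum(word in sentence for word in image_keywords)

    instances = [
        ({"two", "dogs"}, "two dogs play in the park", "three dogs play in the park"),
        ({"cat", "under", "table"}, "a cat is under the table", "a cat is on the table"),
    ]
    print(pairwise_accuracy(instances, dummy_score))  # 1.0 on this tiny toy set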

Scientific Talks

Science Communication Talks

Talks for broader audiences:

Teaching and Supervision

Teaching

Courses organized and taught independently at Heidelberg University, including lectures, exercises, and exams or practical projects

  • Methods for Learning without Annotated Data

    Master Level Course (in English), every Summer Term from 2020 to 2024 with very good reviews

  • Designing Experiments For Machine Learning

    Bachelor Level Course (in German), every Winter Term from 2021 to 2024 with very good reviews

  • Deep Learning Course for Biologists

    At the HBIGS graduate school in Heidelberg, every term since 2023

  • Programming Exam

    Summer Term 2020, Winter Term 20/21

  • Resource course

    Bachelor Level Course (in German), Summer Term 2020, Winter Term 20/21

  • Integrating Vision and Language: Achievements and Challenges in Multimodal Machine Learning

    Master Level Seminar, Winter Term 19/20

Supervision

Co-supervision of theses with Prof. Anette Frank:

  • Master theses: Phillip Wiesenbach, Julia Suter
  • Bachelor thesis: Lillita Muradjan