Highlights
YouTube Channel
Lighthearted, bite-sized ML videos for your AI Coffee Break! 📺 Mostly videos about the latest technical advancements in AI, such as large language models (LLMs), text-to-image models, and everything cool in natural language processing, computer vision, and beyond!
Reviewing
ACL 2025 Area Chair
ICLR, ACL Rolling Review (ARR, monthly), EACL, EMNLP, NAACL, ACL, CVPR, ACMMM, EurNLP
Workshops: MULA, RepL4NLP, LIMO
ACL 2021 (Outstanding Reviewer)
Honors
DAAD Vollstipendium für Absolventen Deutscher Auslandsschulen
Full scholarship to study the subject of my choice in Germany after graduating from high school in Romania
Young Researcher at the Heidelberg Laureate Forum 2022
Recipient of the Abbe Grant funded by the Carl-Zeiss-Stiftung
Nominated for the GI Dissertationspreis 2024, Dagstuhl
Nominated by Heidelberg University for my PhD thesis
Publications
Visit my Google Scholar page for a complete list. Selection:
Do Vision & Language Decoders use Images and Text equally? How Self-consistent are their Explanations?
Parcalabescu, L. and Frank, A., 2025. International Conference on Learning Representations (ICLR)
Vision-and-language models (VLMs) – think GPT-4o – can answer questions about images and explain their reasoning. But how much do these explanations and answers actually rely on the image, and how much is just based on text?
In this project, I take a closer look at how VLMs combine vision and language when reasoning about visual input. Do they rely more on images when generating explanations than when producing answers? Are their explanations internally consistent? And how well do today’s top-performing VLMs handle the VALSE benchmark I previously introduced?
To answer these questions, we apply established tools for measuring explanation faithfulness — including methods I previously introduced — as well as my metric for quantifying modality usage. I compare model behavior across both post-hoc and chain-of-thought (CoT) explanation settings. The findings are striking: text dominates across the board, but the contribution of the image increases when models are asked to explain themselves — especially in CoT setups.
Finally, I present an up-to-date evaluation of modern vision-language decoders on the VALSE benchmark — which we designed a few years ago to test earlier-generation VLMs. The results? Despite recent advances, today’s models still struggle with grounded, image-based reasoning.
On Measuring Faithfulness or Self-Consistency of Natural Language Explanations
Parcalabescu, L. and Frank, A., 2024. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
Are LLMs just making explanations up?
Large language models can explain their answers — either after the fact (post-hoc) or step by step (Chain-of-Thought). But here’s the catch: sounding reasonable doesn’t mean being truthful. Many so-called “faithfulness tests” don’t actually peek into the model’s reasoning — they just check if the answer and explanation match on the surface.
In this project, we set the record straight:
🧠 We show that most “faithfulness” tests are really just self-consistency checks.
🧪 We introduce the Comparative Consistency Bank: the first large-scale comparison of existing tests across 11 open LLMs and 5 tasks.
🔍 We propose CC-SHAP, a fine-grained metric that dives deeper: it compares how inputs influence both the answer and the explanation, shedding light on the model’s actual reasoning.
Faithfulness starts from the model internals — and CC-SHAP brings us closer to that.
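For the curious, here is a minimal, illustrative sketch (in Python/numpy, not the paper's actual implementation) of the kind of comparison CC-SHAP builds on: take per-input-token contribution scores computed once for the answer and once for the explanation, turn them into distributions, and measure how well they agree. The aggregation and the cosine-similarity measure below are simplifying assumptions for illustration only.

```python
import numpy as np

def consistency_score(answer_contrib, expl_contrib):
    """Illustrative self-consistency check: compare how much each input token
    contributes to the model's answer vs. to its explanation.
    Inputs are 1-D arrays of per-input-token contribution scores
    (e.g., Shapley values), aggregated over the generated output tokens."""
    # Turn the (signed) contribution scores into distributions over input tokens.
    a = np.abs(answer_contrib) / np.abs(answer_contrib).sum()
    e = np.abs(expl_contrib) / np.abs(expl_contrib).sum()
    # Cosine similarity as an illustrative agreement measure:
    # 1.0 means the answer and the explanation rely on the input in the same way.
    return float(a @ e / (np.linalg.norm(a) * np.linalg.norm(e)))

# Toy example: the explanation leans on different input tokens than the answer did,
# so the score drops well below 1.0.
answer_contrib = np.array([0.7, 0.1, 0.1, 0.1])
expl_contrib = np.array([0.1, 0.1, 0.1, 0.7])
print(consistency_score(answer_contrib, expl_contrib))
```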
ViLMA: A Zero-Shot Benchmark for Linguistic and Temporal Grounding in Video-Language Models
Kesen, I., Pedrotti, A., Dogan, M., Cafagna, M., Acikgoz, E.C., Parcalabescu, L., Calixto, I., Frank, A., Gatt, A., Erdem, A. and Erdem, E., 2023. The Twelfth International Conference on Learning Representations (ICLR)
Video-language models (VidLMs) are everywhere — but how well do they really understand what they see and hear over the entire video?
Enter ViLMA (Video Language Model Assessment): a task-agnostic benchmark designed to probe the fine-grained reasoning skills of VidLMs beyond surface-level performance. Unlike typical task-based evaluations, ViLMA focuses on temporal understanding and visual grounding using carefully crafted counterfactuals and controlled setups.
ViLMA also includes proficiency tests to measure core abilities that VidLMs should have before solving more complex reasoning tasks.
The findings? Today’s VidLMs don’t perform any better than models trained on static images — even after accounting for basic proficiency.
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Parcalabescu, L. and Frank, A., 2023. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023, Volume 1: Long Papers), pages 4032–4059, Toronto, Canada. Association for Computational Linguistics.
Multimodal models are supposed to combine information from both vision and language — but often, they cheat. When a unimodal model performs just as well as a multimodal one, it’s a red flag: unimodal collapse.
But relying on accuracy alone doesn’t tell the whole story. What if a model uses the right modality — but still gets the answer wrong?
That’s where MM-SHAP comes in. It’s a performance-agnostic metric based on Shapley values that quantifies how much a model relies on each modality — regardless of whether its prediction is right or wrong.
We use MM-SHAP to: 📊 Compare models by their average degree of multimodality, and 🧪 measure how individual modalities contribute across different tasks and datasets.
Applied to six vision-language models (including LXMERT, CLIP, and ALBEF variants) on four tasks, MM-SHAP reveals: unimodal collapse isn’t just common — it happens in different ways and directions.
💡 MM-SHAP helps you diagnose what’s going wrong — and, more importantly, can help build truly multimodal models.
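To make the idea concrete, here is a minimal numpy sketch of how such a modality share can be computed from per-token Shapley values. It glosses over how the attributions themselves are obtained and how tokens are grouped per modality, so read it as an illustration of the principle rather than the paper's exact procedure.

```python
import numpy as np

def mm_shap(text_shap, image_shap):
    """Sketch of the MM-SHAP idea: the share of total (absolute) Shapley-value
    mass that falls on each modality's input tokens, independent of whether
    the prediction was right or wrong.
    text_shap, image_shap: 1-D arrays of Shapley values for the text tokens
    and the image patches/regions of one input sample."""
    text_mass = np.abs(text_shap).sum()
    image_mass = np.abs(image_shap).sum()
    total = text_mass + image_mass
    t_shap = text_mass / total   # textual degree
    v_shap = image_mass / total  # visual degree
    return t_shap, v_shap

# Toy example: most of the attribution mass sits on the text tokens,
# hinting at unimodal collapse towards language.
t, v = mm_shap(np.array([0.4, -0.3, 0.2]), np.array([0.05, -0.05]))
print(f"textual share = {t:.2f}, visual share = {v:.2f}")  # 0.90 vs. 0.10
```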
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
Parcalabescu, L., Cafagna, M., Muradjan, L., Frank, A., Calixto, I. and Gatt, A., 2022. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL 2022, Volume 1: Long Papers), pages 8253–8280, Dublin, Ireland. Association for Computational Linguistics.
Pretrained vision-and-language models (VLMs) may shine on standard tasks — but do they really understand the connection between images and language?
VALSE (Vision And Language Structured Evaluation) is a benchmark designed to find out. Rather than testing models on downstream tasks, VALSE zooms in on their ability to ground specific linguistic phenomena in the visual modality — things like spatial relations, counting, and negation.
VALSE includes six targeted tests, each crafted to reveal whether a model is truly connecting vision and language, or just guessing from shortcuts. We ensure high-quality examples through controlled construction methods and valid foils.
We tested five popular VLMs — and the results are sobering: most models struggle with core visio-linguistic reasoning.
With VALSE, we offer a finer lens for evaluating V&L models and tracking real progress — not just accuracy on tasks, but actual grounding and understanding.
Scientific Talks
- Keynote at the Heidelberg Postdoc Symposium hosted by the DKFZ (2025)
- Keynote at the National Conference on AI Transformations: Language, Technology, and Society, Utrecht (2025)
- Invited talk at the University of Sheffield NLP Group (2024)
- Invited talk at Datafest Yerevan Conference (2024)
- Invited talk at Aleph Alpha - Heidelberg (2024)
- Invited talk at cogsys-group, CLASP, Gothenburg, Sweden (2024)
- Podcast Interview Deep Learning with Letitia Parcalabescu - Weaviate Podcast #96! (2024)
- Invited to talk about my work at heidelberg.ai at the DKFZ (2023)
- Invited talk about vision and language (VL) models at the LIMO 2023 Workshop (2023)
- Invited talk about my own work at the ICDM Workshop Foundation Models in Vision and Language (2022)
- Invited talk about my own work “Multimodal Learning: Integrating Vision and Language” at StuTS 2020
Science Communication Talks
Talks for broader audiences:
- Invited on the AlumNode Panel Discussion "Social Media for Scientists – Opportunities and Challenges" (2025)
- Participated as a science communicator in the first Romanian-language Native Scientists Workshop Heidelberg (2025)
- Panelist at the VOICES festival in Zagreb, Croatia (2025)
- Talk and discussion "Artificial Intelligence: Which skills do I need?" at E-engAGEd organized by EAVI Media Literacy for Citizenship (2025)
- Panelist at the VOICES festival in Florence, Italy (2024)
- Invited to talk (in German) about AI for a general audience at ARD MixTalk (2023)
- Podcast (in German) about AI for the Handelsblatt (2023)
- Invited to talk in a panel about Digital Tools & AI in Research at the To be honest Conference (2023)
- Panelist on the “Popularization in ML Research” panel at the ML in PL conference (2022)
- “AI for good” at the EAVI Conversations (2021)
- Guest on the Transformative Ideas Podcast
- Guest on the MLST YouTube channel and podcast
- “Why Multi-Modality is the Future of Machine Learning” at the ML Engineered Podcast
Teaching and Supervision
Teaching
Courses organized and taught independently at Heidelberg University, including lectures, exercises, and the exam / practical project:
- Methods for Learning without Annotated Data
  Master-level course (in English), every Summer Term from 2020 to 2024, with very good reviews
- Designing Experiments for Machine Learning
  Bachelor-level course (in German), every Winter Term from 2021 to 2024, with very good reviews
- Deep Learning Course for Biologists
  At the HBIGS graduate school Heidelberg, every term since 2023
- Programming Exam
  Summer Term 2020, Winter Term 20/21
- Resource course
  Bachelor-level course (in German), Summer Term 2020, Winter Term 20/21
- Integrating Vision and Language: Achievements and Challenges in Multimodal Machine Learning
  Master-level seminar, Winter Term 19/20
Supervision
Co-supervision of theses with Prof. Anette Frank:
- Master's theses: Phillip Wiesenbach, Julia Suter
- Bachelor's thesis: Lillita Muradjan