DS-2025-07: Zhang, Zhi (2025) Advancing Vision and Language Models through Commonsense Knowledge, Efficient Adaptation and Transparency. Doctoral thesis, Universiteit van Amsterdam.
Text: DS-2025-07.text.pdf - Accepted Version (20MB)
Text (Samenvatting / Dutch summary): DS-2025-07.samenvatting.txt (4kB)
Abstract
The area of multimodal learning has seen substantial advances in recent years, both in terms of performance and the variety of tasks tackled. However, many important challenges remain, including integrating external factual knowledge into multimodal models, enabling their fast and efficient adaptation to new tasks, and reducing negative interference in joint learning. Furthermore, despite the wide applicability of these models, we still lack an understanding of their internal mechanisms, in particular of how different modalities interact within multimodal systems. This dissertation investigates these challenges from the perspectives of two modalities, language and vision, as well as their combination.
First, we develop a method to enhance multimodal reasoning by integrating commonsense knowledge into the models. Specifically, we incorporate commonsense knowledge about objects in the image into the model's representations of those objects. We evaluate this technique on referring expression comprehension, where the aim is to locate an object in an image given a natural-language description of it. By incorporating external knowledge, models can infer object functions and contextual relationships, enabling them to reason about complex scenes rather than merely perceive visual and spatial cues, and improving their applicability to real-world scenarios that require commonsense knowledge.
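As a rough illustration of this idea, and not the architecture used in the dissertation, the sketch below fuses visual region features with commonsense knowledge embeddings before scoring candidate objects against an encoded referring expression. All module names, dimensions, and the fusion scheme are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the thesis code): enrich visual object
# features with commonsense knowledge embeddings, then match candidate
# objects against a referring expression.
import torch
import torch.nn as nn

class KnowledgeEnhancedObjectEncoder(nn.Module):
    def __init__(self, visual_dim=2048, knowledge_dim=300, hidden_dim=512):
        super().__init__()
        # Project concatenated visual + knowledge features into a joint
        # space used for expression-object matching.
        self.fuse = nn.Sequential(
            nn.Linear(visual_dim + knowledge_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, visual_feats, knowledge_embeds):
        # visual_feats:     (num_objects, visual_dim)    region features
        # knowledge_embeds: (num_objects, knowledge_dim) commonsense facts
        #                   about each object (e.g. averaged triple embeddings)
        return self.fuse(torch.cat([visual_feats, knowledge_embeds], dim=-1))

# Toy usage: score 5 candidate objects against an encoded referring expression.
encoder = KnowledgeEnhancedObjectEncoder()
objects = encoder(torch.randn(5, 2048), torch.randn(5, 300))
sentence = torch.randn(512)            # encoded referring expression
scores = objects @ sentence            # higher score = better match
print(scores.argmax().item())          # index of predicted object
```

Concatenation followed by a small MLP is only one of several possible fusion schemes; attention over retrieved knowledge facts is another common choice.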
Second, we propose a novel parameter-efficient fine-tuning (PEFT) method for efficient adaptation of pre-trained models to new tasks. Instead of updating the entire model, our approach fine-tunes only a small, relevant subset of parameters, effectively modulating neuron activations to adapt to new tasks. This selective tuning reduces computational demands, potentially mitigates gradient conflicts between tasks, and preserves useful knowledge acquired during pretraining. We demonstrate that our method improves performance across various vision and language tasks. In addition, to enable pretrained models to handle multiple tasks efficiently at the same time, we propose a novel sparse training approach. This method facilitates the sharing of relevant information across tasks during learning while mitigating the gradient conflicts that commonly arise in multi-task learning. Empirical evaluations show that our approach significantly improves performance on dense vision prediction tasks, indicating robustness and wide applicability.
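To make the notion of selective tuning concrete, here is a minimal sketch of parameter-efficient fine-tuning in general, not the specific parameter-selection criterion proposed in the dissertation: everything is frozen except a small, hand-chosen subset of parameters, in this toy case the bias terms and the task head.

```python
# Minimal sketch (assumptions, not the method from the thesis): selective
# fine-tuning that updates only a small subset of parameters while the
# rest of the pretrained model stays frozen.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 768), nn.ReLU(),
    nn.Linear(768, 10),                 # task head
)

# Freeze everything, then re-enable gradients for the chosen subset only.
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name or name.startswith("4.")

trainable = [p for p in model.parameters() if p.requires_grad]
total = sum(p.numel() for p in model.parameters())
print(f"training {sum(p.numel() for p in trainable)} / {total} parameters")

# One toy optimization step on the trainable subset.
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
x, y = torch.randn(8, 768), torch.randint(0, 10, (8,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```

A real PEFT method would select the subset with a task-relevance criterion rather than by parameter name; the sketch only shows how few parameters need gradients once such a subset is chosen.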
Lastly, we investigate the inner working mechanisms of multimodal large language models (MLLMs) when performing multimodal tasks, especially how linguistic and visual information interact within these models. Specifically, given an image-question pair as input, we investigate where in the model and how the visual and linguistic information are combined to generate the final prediction. Extensive experiments show that the two modalities are integrated in two distinct stages. In the lower layers, the model first transfers the more general visual features of the whole image into the representations of the (linguistic) question tokens. In the middle layers, it then transfers visual information about specific objects relevant to the question to the respective question-token positions. Finally, in the higher layers, the resulting multimodal representation is propagated to the last position of the input sequence for the final prediction.
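One common way to localize such information flow, sketched below on a small untrained toy transformer rather than a real MLLM, is to block attention from image-token positions to question-token positions in a chosen band of layers and measure how much the prediction degrades. The model, dimensions, and layer ranges here are illustrative assumptions, not the experimental setup of the dissertation.

```python
# Minimal sketch (toy model, illustrative only): "attention knockout" --
# block attention from question positions to image positions in selected
# layers and observe the change in the model's prediction confidence.
import torch
import torch.nn as nn

torch.manual_seed(0)
num_layers, d_model, n_img, n_txt = 12, 64, 4, 6
seq_len = n_img + n_txt
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
    for _ in range(num_layers)
)
head = nn.Linear(d_model, 100)          # toy vocabulary head

def forward(x, blocked_layers=()):
    # Prevent question positions (n_img..seq_len) from attending to image
    # positions (0..n_img), but only inside `blocked_layers`.
    block = torch.zeros(seq_len, seq_len)
    block[n_img:, :n_img] = float("-inf")
    for i, layer in enumerate(layers):
        mask = block if i in blocked_layers else None
        x = layer(x, src_mask=mask)
    return head(x[:, -1])               # predict from the last position

tokens = torch.randn(1, seq_len, d_model)   # [image tokens | question tokens]
with torch.no_grad():
    probs = forward(tokens).softmax(-1)
    top = probs.argmax(dim=-1).item()       # original predicted class
    blocked_probs = forward(tokens, blocked_layers=range(4)).softmax(-1)
    drop = (probs[0, top] - blocked_probs[0, top]).item()
print(f"probability drop when blocking the lower layers: {drop:.3f}")
```

With a trained model, repeating this knockout over sliding layer windows yields a profile of where visual information enters the question representation, which is the kind of evidence behind layer-wise findings such as those summarized above.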
Overall, our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in MLLMs, thereby facilitating future research into multimodal information localization and editing.
In summary, this dissertation contributes to advancing vision and language models by addressing fundamental challenges across these domains. The proposed solutions and insights provide a foundation for developing more knowledge-aware, efficient, and interpretable multimodal AI systems applicable to diverse tasks in real-world settings.
Item Type: Thesis (Doctoral)
Report Nr: DS-2025-07
Series Name: ILLC Dissertation (DS) Series
Year: 2025
Subjects: Computation; Language
Depositing User: Dr Marco Vervoort
Date Deposited: 10 Aug 2025 15:00
Last Modified: 26 Sep 2025 00:10
URI: https://eprints.illc.uva.nl/id/eprint/2376