MoL-2023-33: Beyond Perplexity: Examining Temporal Generalization of Large Language Models via Definition Generation

MoL-2023-33: Luden, Iris (2023) Beyond Perplexity: Examining Temporal Generalization of Large Language Models via Definition Generation. [Report]


The emergence of large language models (LLMs) has significantly improved performance across various Natural Language Processing (NLP) tasks. However, the field of NLP predominantly follows a static language modeling paradigm, resulting in performance deterioration of LLMs over time. This indicates a lack of temporal generalization, i.e., the ability of models to adjust their capabilities to data beyond their training period. In real-life NLP applications, models are often pre-trained on data from one time period and then deployed for tasks which inherently involve temporally shifted data. So far, performance deterioration of LLMs has primarily been attributed to factual changes over time, leading to attempts to update an LLM's factual knowledge to avoid performance deterioration. However, not only the facts of the world, but also the language we use to describe it constantly changes. Recent studies have indicated a relationship between performance deterioration and semantic change. Performance deterioration is typically measured using perplexity scores and relative performance on downstream tasks. But such dry comparisons of perplexity and accuracy do not explain the effects of temporally shifted data on LLMs in practice. Given the potential societal impact of NLP applications, it is crucial to gain insight into how performance deterioration, particularly that caused by semantic change, is reflected in the output of LLMs. This thesis investigates how semantic change in temporally shifted data impacts the performance of an LLM on the downstream task of contextualized word definition generation. This approach offers a dual perspective: quantitative measurement of performance deterioration, as well as human-interpretable output through the generated definitions. First, I construct two diachronic corpora of Twitter and Reddit data, such that one overlaps in time with the pre-training period, and the other is temporally shifted. Second, I use a lexical semantic change system to collect a set of semantically changed target words, a set of stable words, and a set of emerging new words. Third, I evaluate the performance of the definition generation model in both time periods, and analyze whether semantic change impacts performance. Fourth, I compare the results with cross-entropy and perplexity scores for the same inputs. The results indicate that (i) the model's performance deteriorates more for semantically changing words than for semantically stable words, (ii) the model exhibits significantly lower performance and potential bias for emerging new words, and (iii) the performance does not correlate with loss or (pseudo-)perplexity scores.
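
As a rough illustration of the quantities compared in the fourth step, the sketch below computes cross-entropy and perplexity from per-token log-probabilities and correlates a per-input score with perplexity. It is not taken from the thesis: the function names, the use of Pearson correlation, and the idea of passing in a list of per-input quality scores are all assumptions made for the example, and the thesis may use a different correlation measure or a pseudo-perplexity variant.

```python
import math


def cross_entropy(token_log_probs):
    """Mean negative log-likelihood (in nats) over the tokens of one input."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_log_probs) / len(token_log_probs)


def perplexity(token_log_probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_log_probs))


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers.

    Assumption for this sketch: xs are per-input quality scores for the
    generated definitions and ys are perplexities of the same inputs.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

A correlation near zero between the two lists would correspond to finding (iii): inputs with low perplexity are not necessarily the ones that receive good definitions.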

Item Type: Report
Report Nr: MoL-2023-33
Series Name: Master of Logic Thesis (MoL) Series
Year: 2023
Subjects: Computation
Depositing User: Dr Marco Vervoort
Date Deposited: 25 Jan 2024 22:27
Last Modified: 25 Jan 2024 22:27
