26 February 2025
10:00 Master's Defense IC3 Auditorium
Topic
Story classification and textual coherence: An approach with inclusion of rhetorical and syntactic structure in language models
Student
Luiz Fellipe Machi Pereira
Advisor
Sandra Eliza Fontes de Avila - Co-supervisors: Nadia Felix Felipe da Silva and Helena de Almeida Maia
Brief summary
The emergence of more sophisticated language models, such as GPT-3, BERT and their derivatives, has revolutionized the interactions between computer systems and humans. Over time, systems with larger models, better responses, and user-friendly interfaces, such as ChatGPT and Copilot, have made these models even more popular. They are widely used in applications ranging from virtual assistants to automated content generation, offering fluent and contextualized responses. However, a persistent challenge lies in ensuring that the generated texts are not only grammatically correct but also semantically coherent. Textual incoherence (such as internal contradictions, breaks in thematic progression, or flaws in logical structure) can compromise the usefulness and reliability of these systems, especially in critical scenarios such as customer service, education, or information dissemination.
Identifying inconsistencies in generated texts before making them available to users is a complex problem. The superficial fluency of language models often masks structural deficiencies, creating the illusion of quality in narratives that, in reality, lack logic or cohesion. This limitation becomes even more relevant in applications that demand narrative precision, such as the generation of journalistic texts, scripts, or educational materials. Furthermore, the scarcity of datasets annotated with information on textual coherence makes it difficult to train and evaluate automated systems for this task. Manually annotating texts for coherence requires linguistic expertise and time, since coherence involves multiple layers, such as the organization of arguments, the definition of theme, and world context, none of which is trivially quantifiable.
Given this scenario, this study proposes a methodology for classifying stories as coherent or incoherent using language models, and compares the performance of a baseline model with that of a model in which syntactic and rhetorical information is integrated. The central approach is based on the incorporation of special symbols derived from linguistic theory. To validate the proposal, we built a corpus of stories, called H.IAAC CommonStories, automatically annotated with rhetorical relations and syntactic categories and containing coherent narratives together with incoherent versions of them. This corpus was used to train and evaluate an adapted language model, whose robustness was boosted by extending the model's knowledge.
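As a rough illustration of what incorporating special symbols can look like, the sketch below prefixes each text segment with a marker for its rhetorical relation before the text would be passed to a language model. It is a minimal sketch under stated assumptions, not the method used in the study: the marker names, the relation labels, and the functions `mark_segments` and `special_tokens` are all hypothetical, and this summary does not specify the actual symbol inventory or how segments are obtained.

```python
from typing import List, Tuple

# Hypothetical marker inventory (an assumption for illustration only);
# the symbols actually used in the study are not given in this summary.
RHETORICAL_MARKERS = {
    "elaboration": "[RST_ELABORATION]",
    "contrast": "[RST_CONTRAST]",
    "cause": "[RST_CAUSE]",
}
UNKNOWN_MARKER = "[RST_UNKNOWN]"


def mark_segments(segments: List[Tuple[str, str]]) -> str:
    """Prefix each (relation, text) segment with its special symbol.

    Relations missing from the inventory fall back to UNKNOWN_MARKER.
    """
    parts = []
    for relation, text in segments:
        marker = RHETORICAL_MARKERS.get(relation, UNKNOWN_MARKER)
        parts.append(f"{marker} {text.strip()}")
    return " ".join(parts)


def special_tokens() -> List[str]:
    """Return the sorted list of markers that would need to be added
    to a model's vocabulary before fine-tuning."""
    return sorted(set(RHETORICAL_MARKERS.values()) | {UNKNOWN_MARKER})


if __name__ == "__main__":
    story = [
        ("elaboration", "The dog barked all night."),
        ("contrast", "Nobody in the house woke up."),
    ]
    print(mark_segments(story))
    print(special_tokens())
```

In a setup like this, the markers would typically be registered as new vocabulary entries so that the model treats each one as a single unit rather than splitting it into subwords.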
In addition to the evaluation on the developed corpus, we performed zero-shot tests on a Brazilian disinformation dataset (FakeTrue.BR), aiming to explore the hypothesis that textual coherence can serve as an indirect indicator for detecting disinformation in offline scenarios.
Examination Board
Full members:
Sandra Eliza Fontes de Avila | IC / UNICAMP |
Fabíola Souza Fernandes Pereira | FACOM / UFU |
Marcos Medeiros Raimundo | IC / UNICAMP |
Substitutes:
Leandro Aparecido Villas | IC / UNICAMP |
Tiago Timponi Torrent | FALE / UFJF |