Analyzing Embedding Space Trajectories to Augment Document Link Prediction

Year
2025
Author(s)
Anibely Torres Polanco - Oak Ridge National Laboratory
Abstract
Knowledge extraction from documents is a key challenge for both analysts and researchers who need to navigate and explore large volumes of information. Document graphs, in which passages of text in one document link to other documents, can act as a knowledge base similar to Wikipedia, adding structure, relevance, and context to originally disconnected pieces of information. In this talk we explore constructing document graphs via link prediction, validating against a Wikipedia dump as our baseline corpus. Encoder-only transformer models give us a tool for analyzing text within an embedding space that captures the subtle and highly context-dependent semantics of documents. Furthermore, when a document is split into smaller chunks, connecting the individual chunk embeddings in sequence defines a trajectory through embedding space. We propose several trajectory characterization metrics that capture common patterns of document trajectories. We then seek to reconstruct the document links by building proximity graphs around individual text chunks or queries and analyzing nearby documents for relevance based on their trajectories. Finally, we discuss potential nuclear-domain use cases for this form of document graph construction.
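
To make the trajectory idea concrete, here is a minimal sketch in Python with NumPy. The abstract does not specify which metrics or proximity-graph construction the talk uses, so everything below is an assumption for illustration: the function names `trajectory_metrics` and `nearest_chunks` are hypothetical, the three metrics (path length, net displacement, straightness) are generic geometric descriptors rather than the author's, and the cosine-similarity neighbor lookup stands in for whatever proximity graph is actually built. Chunk embeddings are assumed to be already computed and stacked in document order.

```python
import numpy as np


def trajectory_metrics(chunks: np.ndarray) -> dict:
    """Describe the path traced by a document's chunk embeddings.

    chunks: array of shape (n_chunks, dim), rows in document order.
    The metrics here are illustrative assumptions, not the talk's own.
    """
    # Displacement vectors between consecutive chunk embeddings.
    steps = np.diff(chunks, axis=0)
    step_lengths = np.linalg.norm(steps, axis=1)

    # Total distance travelled along the trajectory.
    path_length = float(step_lengths.sum())
    # Straight-line distance from the first chunk to the last.
    net_displacement = float(np.linalg.norm(chunks[-1] - chunks[0]))
    # 1.0 for a perfectly straight path, smaller for a wandering one.
    straightness = net_displacement / path_length if path_length > 0 else 0.0

    return {
        "path_length": path_length,
        "net_displacement": net_displacement,
        "straightness": straightness,
    }


def nearest_chunks(query: np.ndarray, chunks: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k chunks most cosine-similar to the query.

    Assumes no embedding is the zero vector.
    """
    q = query / np.linalg.norm(query)
    c = chunks / np.linalg.norm(chunks, axis=1, keepdims=True)
    similarities = c @ q
    # Sort by descending similarity and keep the top k.
    return np.argsort(-similarities)[:k]
```

In a fuller pipeline, the indices returned by `nearest_chunks` would identify candidate documents, and their trajectory metrics could then be compared to judge whether a link is plausible. That final scoring step is left out because the abstract does not describe it.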