Applying Natural Language Processing Techniques to International Safeguards

Year
2023
Author(s)
Scott Stewart - Oak Ridge National Laboratory, Oak Ridge, TN, USA
Carlos Soto - Brookhaven National Laboratory, Upton, NY, USA
Alejandro Michel Zuniga - Pacific Northwest National Laboratory, Richland, WA, USA
Nathan Martindale - Oak Ridge National Laboratory, Oak Ridge, TN, USA
Abstract

The International Atomic Energy Agency (IAEA) faces a significant and growing challenge in collecting and analyzing safeguards-relevant data, and as the IAEA brings more facilities and material under safeguards, that challenge will only grow. The IAEA collects and analyzes safeguards-relevant information primarily through data streams from open-source collection (text-based), in-field instrumentation (signal-based), surveillance (image- or video-based), and satellite imagery. This process is currently largely manual; however, intelligent automation will likely be required to process the growing volume of safeguards-relevant information in the near future.

This special session demonstrated five natural language processing techniques relevant to text processing workflows in open-source collection. These techniques are currently in development at Brookhaven National Laboratory, Oak Ridge National Laboratory, and Pacific Northwest National Laboratory and include text classification with NukeLM, the Transformer eXplainability and eXploration tool, author disambiguation with S2AND, machine learning–based table extraction built as part of the Evaluated Nuclear Structure Data File (ENSDF) effort, and the Interactive Corpus Analysis Tool. NukeLM is a BERT-style transformer model pretrained on 1.5 million abstracts from the US Department of Energy's Office of Scientific and Technical Information (OSTI) database to provide more relevant document classification for the nuclear domain. The Transformer eXplainability and eXploration tool helps users better understand the behavior of language models used for sequence classification tasks. The S2AND algorithm, developed by the Allen Institute for AI, is particularly useful for disambiguating authors in a collection of publications.
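The document-classification role that NukeLM plays can be illustrated with a much simpler stand-in. The sketch below uses a TF-IDF bag-of-words model with logistic regression rather than a pretrained transformer, and every training example and label is invented for illustration; it shows only the general shape of a text-classification workflow (documents in, labels out), not NukeLM's actual architecture, training data, or label scheme.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training corpus -- invented examples, not NukeLM's OSTI training data.
train_texts = [
    "uranium enrichment cascade performance at a centrifuge facility",
    "reactor core fuel burnup and plutonium isotopics",
    "spent fuel reprocessing and nuclear material accountancy",
    "IAEA safeguards inspection of a fuel fabrication plant",
    "local football league results from the weekend",
    "a simple recipe for sourdough bread at home",
    "weekly weather forecast for the coastal region",
    "review of a new jazz album released this month",
]
train_labels = ["nuclear"] * 4 + ["other"] * 4

# TF-IDF features + logistic regression fill the same role
# (document -> relevance label) as a fine-tuned transformer classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

print(model.predict(["uranium enrichment levels at the declared facility"])[0])
```

In a transformer-based setup such as NukeLM, the TF-IDF pipeline would be replaced by a pretrained language model fine-tuned on labeled documents, but the surrounding workflow is the same.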
The ENSDF machine learning–based table extraction approach automatically extracts information from tables in non-machine-readable documents. Finally, the Interactive Corpus Analysis Tool allows a subject matter expert who is not a machine learning expert to build a text processing workflow around their domain knowledge while still leveraging machine learning.
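The author-disambiguation task that S2AND addresses can be sketched in a deliberately simplified form: treat each author-name mention as a record, apply a pairwise compatibility rule, and group compatible mentions with union-find. S2AND itself instead learns a pairwise similarity model over many richer signals (affiliations, coauthors, venues) and clusters on its scores; the compatibility rule and the example records below are invented simplifications, not S2AND's API or features.

```python
def compatible(a, b):
    """Crude pairwise rule: same surname and matching first initial."""
    surname_a, first_a = a
    surname_b, first_b = b
    if surname_a.lower() != surname_b.lower():
        return False
    return first_a[0].lower() == first_b[0].lower()

def disambiguate(mentions):
    """Cluster author-name mentions via union-find over pairwise compatibility."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if compatible(mentions[i], mentions[j]):
                parent[find(i)] = find(j)  # union the two clusters

    clusters = {}
    for i in range(len(mentions)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

mentions = [
    ("Smith", "John"),   # 0
    ("Smith", "J."),     # 1: same surname, matching initial -> merges with 0
    ("Smith", "Mary"),   # 2: same surname, different initial -> separate
    ("Jones", "Alice"),  # 3: different surname -> separate
]
print(disambiguate(mentions))  # -> [[0, 1], [2], [3]]
```

Replacing the hand-written `compatible` rule with a learned pairwise model is what turns this toy into an S2AND-style system.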