Topic Modeling with Natural Language Processing for Identification of Nuclear Proliferation-Relevant Scientific and Technical Publications

Year
2020
Author(s)
Daniel M. Dunlavy - Sandia National Laboratories, Albuquerque
Jonathan Bisila - Sandia National Laboratories, Albuquerque
Zoe Gastelum - Sandia National Laboratories
Craig Ulmer - Sandia National Laboratories, Albuquerque
Abstract

Scientific and technical publications can provide relevant information regarding the technical capabilities of a state, the location of nuclear materials and related research activities within that state, and international partnerships and collaborations. Nuclear proliferation analysts monitor scientific and technical publications using complex word searches defined by fuel cycle experts as part of their collection and analysis of all potentially relevant information. These search strings have been refined over time by fuel cycle experts and other analysts but represent a top-down approach that is inherently defined by the requirement of term presence. In contrast, we are developing a bottom-up approach in which we develop topic models from a small number of expert refereed source documents to search similar topic space, with the hope that we can use this method to identify publications that are relevant to the proliferation detection problems space without necessarily conforming to the expert-derived rule base. We are comparing our results of various topic modeling and clustering techniques to a traditional analyst search strings to determine how well our methods work to find seed documents. We also present how our methods provide added benefit over traditional search by organizing the retrieved documents into topic-oriented clusters. Finally, we present distributions of author institutions to facilitate a broader perspective of the content of interest for analysts.