A Large Safeguards-informed Hybrid Imagery Dataset For Computer Vision Research And Development

Zoe Gastelum - Sandia National Laboratories
Timothy Shead - Sandia National Laboratories
Ahmad Rushdi - Sandia National Laboratories
The scarcity of large, labelled datasets is a barrier to the development of computer vision models in many domains. In international nuclear safeguards, images of relevant equipment and technologies may be rare due to commercial or proprietary concerns, limited historical examples of proliferation-relevant technology, and sensitivity concerns for otherwise relevant examples. Labelling even this limited data is expensive, requires subject matter expertise, and is prone to human error and disagreement. In previous work, we demonstrated that synthetic two-dimensional images rendered from three-dimensional computer-aided design (CAD) models can be used to train deep learning models when real-world data is limited. This paper describes our current work to develop a large labelled dataset containing 1 million real-world, synthetic, and adversarial images, including distractor examples, of 30B and 48 containers used to store and transport uranium hexafluoride (UF6). We describe how we will validate the synthetic images using multiple deep learning algorithm types and models, and how we will use explainability measures to identify biases and re-render images to counter those biases. The resulting dataset will support a range of computer vision research topics, some of which are proposed here.