Improving Computer System Integrity Resilience With Agent-managed Decoupled Components

Year
2021
Author(s)
William Horsthemke - Argonne National Laboratory
Nathaniel Evans - Argonne National Laboratory
Dan Harkness - Argonne National Laboratory
File Attachment
a508.pdf (287.73 KB)
Abstract
We depend upon continuous operation of trustworthy, resilient computer systems, but ensuring their integrity and availability poses substantial challenges. These challenges come from ever-increasing cyber attacks by dedicated, sophisticated adversaries as well as unintentional faults caused by the complexity of the software and hardware operating these systems. This paper proposes reducing the complexity of traditional multi-function computer systems by separating the functions and deploying those functions on simpler, dedicated-function components. Simple dedicated-function components are easier to verify and replace. To ensure high availability, this paper proposes deploying multiple instances of each component in redundant arrays and using autonomous agents to continuously verify their integrity and availability and replace them as necessary. The motivation for decoupling complex, multi-function systems into dedicated-function components appeared in two fields: nuclear verification and electricity grid protection. The nuclear verification community wants trustworthy measurement systems that can be confidently verified to ensure that they correctly perform their functions and only the required functions. The complexity of systems increases the opportunities for adversaries to tamper with the system and decreases the ability of inspectors to verify the integrity of the system. To decrease overall system complexity, researchers are developing modular systems, deploying each function on a dedicated component, and interconnecting the components with function-specific communication. These dedicated-function components permit inspectors to independently test and verify each functional component and its interfaces before inspecting and verifying the combined system. This builds mutual trust in the verification process. The manufacturer of protective relays used to protect the electricity grid proposes a similar approach.
They attribute many of the faults of their relays to the complexity of multi-function, integrated systems, where a fault in a support function cascades into a failure of their control system. They observe that the software required to perform support functions becomes more complex and requires more frequent updates than the software that performs control and protective functions. They propose separating their system into dedicated-function components so that faults in one function do not cause faults in other functions.