Software is a means of communicating human intent to a machine. When developers write software code, they are providing precise instructions to the machine in a language the machine is designed to understand and act on. For complex tasks, these instructions can become lengthy and difficult to check for correctness and security. Artificial intelligence (AI) offers the alternative possibility of interacting with machines in ways that are native to humans: plain-language descriptions of goals, spoken words, even gestures or references to physical objects visible to both the human and the machine. Because it is so much easier to describe complex goals to an AI system than it is to develop millions of lines of software code, it is not surprising that many people see the possibility that AI systems could consume greater and greater portions of the software world. However, greater reliance on AI systems may expose mission owners to novel risks, necessitating new approaches to test and evaluation.
SEI researchers and others in the software community have spent decades studying the behavior of software systems and their developers. This research has advanced software development and testing practices, increasing our confidence in complex software systems that perform critical functions for society. In contrast, there has been far less opportunity to study and understand the potential failure modes and vulnerabilities of AI systems, particularly those AI systems that employ large language models (LLMs) to match or exceed human performance at difficult tasks.
In this blog post, we introduce System Theoretic Process Analysis (STPA), a hazard analysis technique uniquely suited to dealing with the complexity of AI systems. From preventing outages at Google to improving safety in the aviation and automotive industries, STPA has proven to be a versatile and powerful method for analyzing complex sociotechnical systems. In our work, we have also found that applying STPA clarifies the safety and security objectives of AI systems. Based on our experiences applying it, we describe four specific ways that STPA has reliably provided insights to enhance the safety and security of AI systems.
The Rationale for System Theoretic Process Analysis (STPA)
If we were to treat a system with AI components like any other system, common practice would call for following a systematic analysis process to identify hazards. Hazards are conditions within a system that could lead to mishaps in its operation resulting in death, injury, or damage to equipment. System Theoretic Process Analysis (STPA) is a recent innovation in hazard analysis that stands out as a promising approach for AI systems. The four-step STPA workflow leads the analyst to identify unsafe interactions between the components of complex systems, as illustrated by the basic security-related example in Figure 1. In the example, an LLM agent has access to a sandbox computer and a search engine, which are tools the LLM can employ to better address user needs. The LLM can use the search engine to retrieve information relevant to a user's request, and it can write and execute scripts on the sandbox computer to run calculations or generate data plots. However, giving the LLM the ability to autonomously search and execute scripts on the host system potentially exposes the system owner to security risks, as in this example from the GitHub blog. STPA offers a structured way to define these risks and then identify, and ultimately prevent, the unsafe system interactions that give rise to them.

Figure 1. STPA Steps and LLM Agent with Tools Example
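To make the Figure 1 setup concrete, the sketch below shows one way such an agent-with-tools loop might be wired together. It is a minimal illustration, not a real API: the names call_llm, web_search, and Sandbox are hypothetical placeholders.

```python
# Minimal sketch of the Figure 1 agent: an LLM that can call a search tool or
# execute scripts on a sandbox computer. Every name here (call_llm, web_search,
# Sandbox) is a hypothetical placeholder, not a real library.

class Sandbox:
    """Stand-in for the sandboxed computer the LLM can run scripts on."""

    def run_script(self, script: str) -> str:
        # A real implementation would execute inside an isolated environment;
        # the strength of that isolation is exactly what STPA scrutinizes.
        raise NotImplementedError

def web_search(query: str) -> str:
    """Stand-in for the search-engine tool."""
    raise NotImplementedError

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for the model call; returns a final answer or a tool request."""
    raise NotImplementedError

def agent_turn(user_query: str, sandbox: Sandbox) -> str:
    """One user turn: the LLM may search or run scripts before answering."""
    messages = [{"role": "user", "content": user_query}]
    while True:
        reply = call_llm(messages)
        if reply.get("tool") == "search":
            messages.append({"role": "tool", "content": web_search(reply["query"])})
        elif reply.get("tool") == "execute_script":
            # The control action STPA flags: script execution is permitted even
            # when the script was shaped by untrusted search results.
            messages.append({"role": "tool", "content": sandbox.run_script(reply["script"])})
        else:
            return reply["content"]
```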
Historically, hazard analysis techniques have focused on identifying and preventing unsafe conditions that arise from component failures, such as a cracked seal or a valve stuck in the open position. These types of hazards typically call for greater redundancy, maintenance, or inspection to reduce the likelihood of failure. A failure-based accident framework is not a good fit for AI (or software, for that matter), because AI hazards are not the result of the AI component failing in the same way a seal or a valve might fail. AI hazards arise when fully functioning programs faithfully follow flawed instructions. Adding redundancy of such components would do nothing to reduce the likelihood of failure.
STPA posits that, in addition to component failures, complex systems enter hazardous states because of unsafe interactions among imperfectly controlled components. This foundation is a better fit for systems that have software components, including components that rely on AI. Instead of pointing to redundancy as a solution, STPA emphasizes constraining system interactions to prevent the software and AI components from taking certain normally allowable actions at times when those actions would lead to a hazardous state. Research at MIT comparing STPA and traditional hazard analysis methods reported that, "In all of these evaluations, STPA found all the causal scenarios found by the more traditional analyses, but it also identified many more, often software-related and non-failure, scenarios that the traditional methods did not find." Past SEI research has also applied STPA to analyze the safety and security of software systems. Recently, we have used this technique to analyze AI systems. Every time we apply STPA to AI systems, even ones in widespread use, we discover new system behaviors that could lead to hazards.
Introduction to System Theoretic Process Analysis (STPA)
STPA begins by identifying the set of harms, or losses, that system developers must prevent. In Figure 1 above, system developers must prevent a loss of privacy for their customers, which could result in the customers becoming victims of criminal activity. A safe and secure system is one that cannot cause customers to lose control over their personal information.
Next, STPA considers hazards: system-level states or conditions that could cause losses. The example system in Figure 1 could cause a loss of customer privacy if any of its component interactions cause it to become unable to protect customers' private information from unauthorized users. The harm-inducing states give developers a target. If the system design always maintains its ability to protect customers' information, then the system cannot cause a loss of customer privacy.
At this point, systems theory becomes more prominent. STPA considers the relationships between the components as control loops, which compose the control structure. A control loop specifies the goals of each component and the commands it can issue to other parts of the system to achieve those goals. It also considers the feedback available to the component, enabling it to know when to issue different commands. In Figure 1, the user enters queries to the LLM and reviews its responses. Based on the user queries, the LLM decides whether to search for information and whether to execute scripts on the sandbox computer, each of which produces results that the LLM can use to better address the user's needs.
This control structure is a powerful lens for viewing safety and security. Designers can use control loops to identify unsafe control actions: combinations of control actions and conditions that could create one of the hazardous states. For example, if the LLM executes a script that enables access to private information and transmits it outside of the session, this could leave the system unable to protect sensitive information.
Finally, given these potentially unsafe commands, STPA prompts designers to ask, what are the scenarios in which the component would issue such a command? For example, what combination of user inputs and other conditions could lead the LLM to execute commands that it should not? These scenarios form the basis of safety fixes that constrain the commands so the system operates within a safe envelope.
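As a rough illustration of how these artifacts relate, the following sketch records the chain from losses to hazards to unsafe control actions to causal scenarios for the running example as traceable data. The identifiers and wording are our own paraphrase, not output from a formal analysis or an STPA tool.

```python
# Illustrative encoding of the STPA artifact chain (losses -> hazards -> unsafe
# control actions -> causal scenarios) for the Figure 1 example.
from dataclasses import dataclass

@dataclass(frozen=True)
class Loss:
    id: str
    description: str

@dataclass(frozen=True)
class Hazard:
    id: str
    description: str
    leads_to: tuple[str, ...]      # ids of the losses this hazard can cause

@dataclass(frozen=True)
class UnsafeControlAction:
    id: str
    controller: str
    action: str
    context: str                   # the condition that makes the action unsafe
    hazards: tuple[str, ...]

@dataclass(frozen=True)
class Scenario:
    id: str
    uca: str                       # id of the unsafe control action
    description: str

L1 = Loss("L-1", "Customers lose control over their private information")
H1 = Hazard("H-1", "System is unable to protect private data from unauthorized users", ("L-1",))
UCA1 = UnsafeControlAction(
    "UCA-1",
    controller="LLM agent",
    action="execute script on sandbox computer",
    context="the script can transmit private data outside the session",
    hazards=("H-1",),
)
S1 = Scenario(
    "S-1",
    uca="UCA-1",
    description="Adversarial content in search results leads the LLM to write and "
                "run a script that exfiltrates customer data.",
)
```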
STPA scenarios can also be applied to system security. In the same way that a safety analysis develops scenarios in which a controller in the system might issue unsafe control actions on its own, a security analysis considers how an adversary could exploit those flaws. What if the adversary intentionally tricks the LLM into executing an unsafe script by requesting that the LLM test it before responding?
In sum, safety scenarios point to new requirements that prevent the system from causing hazards, and security scenarios point to new requirements that prevent adversaries from bringing hazards upon the system. If those requirements prevent unsafe control actions from causing the hazards, the system is safe and secure from the losses.
Four Ways STPA Produces Actionable Insights in AI Systems
We discussed above how STPA can contribute to better system safety and security. In this section we describe how STPA reliably produces insights when our team performs hazard analyses of AI systems.
1. STPA produces a clear definition of safety and security for a system. The NIST AI Risk Management Framework identifies 14 AI-specific risks, while the NIST Generative Artificial Intelligence Profile outlines 12 additional categories that are unique to or amplified by generative AI. For example, generative AI systems may confabulate, reinforce harmful biases, or produce abusive content. These behaviors are widely considered undesirable, and mitigating them remains an active focus of academic and industry research.
However, from a system-safety perspective, AI risk taxonomies can be both overly broad and incomplete. Not all risks apply to every use case. Moreover, new risks may emerge from interactions between the AI and other system components (e.g., a user might submit an out-of-scope request, or a retrieval agent might rely on outdated information from an external database).
STPA offers a more direct approach to assessing safety in systems, including those incorporating AI components. It begins by identifying potential losses, defined as the loss of something valued by system stakeholders, such as human life, property, environmental integrity, mission success, or organizational reputation. In the case of an LLM integrated with a code interpreter on an organization's internal infrastructure, potential losses could include damage to property, wasted time, or mission failure if the interpreter executes code with effects beyond its sandbox. Additionally, it could lead to reputational harm or exposure of sensitive information if the code compromises system integrity.
These losses are context specific and depend on how the system is used. This definition aligns closely with standards such as MIL-STD-882E, which defines safety as freedom from conditions that can cause death, injury, occupational illness, damage to or loss of equipment or property, or damage to the environment. The definition also aligns with the foundational concepts of system security engineering.
Losses, and therefore safety and security, are determined by the system's purpose and context of use. By shifting focus from mitigating general AI risks to preventing specific losses, STPA offers a clearer and more actionable definition of system safety and security.
2. STPA steers the design toward ensuring safety and security. Accidents can result from component failures: instances where a component no longer operates as intended, such as a disk crash in an information system. Accidents can also arise from errors: cases where a component operates as designed but still produces incorrect or unexpected behavior, such as a computer vision model returning the wrong object label. Unlike failures, errors are not resolved through reliability or redundancy but through changes in system design.
A responsibility table is an STPA artifact that lists the controllers that make up a system, along with the responsibilities, control actions, process models, and inputs and feedback associated with each. Table 1 defines these terms and gives examples using an LLM integrated with tools, including a code interpreter running on an organization's internal infrastructure.

Table 1. Notional Responsibility Table for LLM Agent with Tools Example
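As a rough, purely illustrative stand-in for one row of such a table, the information for the LLM controller might be captured as structured data like the sketch below. The first responsibility is quoted from the discussion that follows; the second responsibility, the process-model entries, and the feedback entries are hypothetical paraphrases.

```python
# Illustrative (not authoritative) responsibility-table entry for the LLM
# controller in the agent-with-tools example.
from dataclasses import dataclass

@dataclass(frozen=True)
class Controller:
    name: str
    responsibilities: tuple[str, ...]
    control_actions: tuple[str, ...]
    process_model: tuple[str, ...]       # what the controller believes about system state
    inputs_and_feedback: tuple[str, ...]

llm_agent = Controller(
    name="LLM agent",
    responsibilities=(
        "Never produce code that exposes the system to compromise",        # aspirational
        "Only issue the execute-script command in response to user needs",  # hypothetical wording
    ),
    control_actions=("issue search query", "execute script on sandbox", "respond to user"),
    process_model=(
        "Conversation history and tool results",
        "Probabilistic next-token predictions, not a reliable model of code security",
    ),
    inputs_and_feedback=("user queries", "search results", "script output and errors"),
)
```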
Accidents in AI systems can, and have, occurred because of design errors in specifying each of the elements in Table 1. The box below contains examples of each. In all of these examples, none of the system components failed: each behaved exactly as designed. Yet the systems were still unsafe because their designs were flawed.
The responsibility table provides an opportunity to evaluate whether the responsibilities of each controller are appropriate. Returning to the example of the LLM agent, Table 1 leads the analyst to consider whether the control actions, process model, and feedback for the LLM controller enable it to fulfill its responsibilities. The first responsibility, never producing code that exposes the system to compromise, is unsupportable. To fulfill this responsibility, the LLM's process model would need a high level of awareness of when generated code is not secure, so that it could correctly decide when not to issue the execute script command because of a security risk. An LLM's actual process model is limited to probabilistically completing token sequences. Though LLMs are trained to refuse some requests for insecure code, these steps reduce, but do not eliminate, the risk that the LLM will produce and execute a harmful script. Thus, the second responsibility represents a more modest and appropriate goal for the LLM controller, while other system design choices, such as security constraints for the sandbox computer, are necessary to fully prevent the hazard.

Figure 2: Examples of accidents in AI systems that have occurred because of design errors in specifying each of the elements defined in Table 1.
By shifting the focus from individual components to the system, STPA provides a framework for identifying and addressing design flaws. We have found that glaring omissions are often revealed by even the simple step of designating which component is responsible for each aspect of safety and then evaluating whether that component has the information inputs and available actions it needs to fulfill its responsibilities.
3. STPA helps developers consider holistic mitigation of risks. Generative AI models can contribute to hundreds of different types of harm, from helping malware coders to promoting violence. To combat these potential harms, AI alignment research seeks to develop better model guardrails, either directly teaching models to refuse harmful requests or adding other components to screen inputs and outputs.
Continuing the example from Figure 1/Table 1, system designers should include alignment tuning of their LLM so that it refuses requests to generate scripts that resemble known patterns of cyberattack. However, it might not be possible to create an AI system that is simultaneously capable of solving the most difficult problems and incapable of producing harmful content. Alignment tuning can contribute to preventing the hazard, but it cannot accomplish the task on its own. In these cases, STPA steers developers to leverage all of the system's components to prevent the hazards, under the assumption that the behavior of the AI component cannot be fully guaranteed.
Consider the potential mitigations for a security risk, such as the one from the scenario in Figure 1. STPA helps developers consider a wider range of options by revealing ways to adapt the system control structure to reduce or, ideally, eliminate hazards. Table 2 contains some example mitigations grouped according to the DoD's system safety design order of precedence categories. The categories are ordered from most effective to least effective. While an LLM-centric safety approach would focus on aligning the LLM to prevent it from generating harmful commands, STPA suggests a collection of options for preventing the hazard even when the LLM does attempt to run a harmful script. The order of precedence first points to architecture choices that eliminate the problematic behavior as the most effective mitigations. Table 2 describes ways to harden the sandbox to prevent the private information from escaping, such as applying and enforcing principles of least privilege. Moving down through the order of precedence categories, developers could consider reducing the risk by limiting the tools available within the sandbox, screening inputs with a guardrail component, and monitoring activity on the sandbox computer to alert security personnel to potential attacks. Even signage and procedures, such as instructions in the LLM system prompt or user warnings, can contribute to a holistic mitigation of this risk. However, the order of precedence presumes that these mitigations are likely to be the least effective, pushing developers not to rely solely on human intervention to prevent the hazard.
Categories and examples for the LLM agent with tools:

Scenario: An attacker leaves an adversarial prompt on a commonly searched website that gets pulled into the search results. The LLM agent adds all search results to the system context, follows the adversarial prompt, and uses the sandbox to transmit the user's sensitive information to a website controlled by the attacker.

1. Eliminate hazard through design selection: Harden the sandbox to prevent external communication. Steps include applying and enforcing principles of least privilege for LLM agents and the infrastructure supporting and surrounding them when provisioning and configuring the sandboxed environment and allocating resources (CPU, memory, storage, networking, etc.).

2. Reduce risk through design alteration:
- Limit LLM access within the sandbox, for example, to Python interpreters running in virtual environments with a restricted set of packages. Encrypt data at rest and control it using appropriately configured permissions for read, write, and execute actions, applying principles of least privilege.
- Segment, if not isolate, network access and close unused ports to limit lateral movement and/or external resources that the LLM could leverage.
- Restrict all network traffic except explicitly allowed source and destination addresses (and ports) for inbound and outbound traffic.
- Avoid open-ended extensions and employ extensions with granular functionality.
- Enforce strict sandboxing to limit model exposure to unverified data sources. Use anomaly detection techniques to filter out adversarial data.
- During inference, integrate retrieval-augmented generation (RAG) and grounding techniques to reduce risks of hallucinations (OWASP LLM04:2025).

3. Incorporate engineered features or devices: Incorporate host, container, network, and data guardrails by leveraging stateful firewalls, IDS/IPS, host-based monitoring, data-loss prevention software, and user-access controls that limit the LLM using rules and heuristics.

4. Provide warning devices: Automatically notify security, interrupt sessions, or execute preconfigured rules in response to unauthorized or unexpected resource usage or actions. These could include:
- flagging packages or methods in the Python script that attempt OS, memory, or network manipulation
- attempts at privilege escalation
- attempts at network modification
- attempts at data access or manipulation
- attempts at data exfiltration, detected via network traffic community deviation (D3FEND D3-NTCD), per-host download-upload ratio analysis (D3FEND D3-PHDURA), and network traffic filtering (D3FEND D3-NTF)

5. Incorporate signage, procedures, training, and protective equipment:
- Add warnings against unauthorized behaviors to the LLM's system prompt.
- Require user approval for high-impact actions (OWASP LLM06:2025).

Table 2: Design Order of Precedence and Example Mitigations
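As a small illustration of the warning-device category in Table 2, the sketch below shows one way a sandbox wrapper might flag generated scripts that touch OS, process, or network facilities before execution. The pattern list and the function and parameter names are hypothetical, and string matching is easy to evade, so this is a warning device rather than a primary control; real deployments would rely on the layered mitigations above.

```python
import re
from typing import Callable, Optional

# Hypothetical deny-list of patterns suggesting OS, process, or network
# manipulation in a generated script. Purely illustrative.
SUSPICIOUS_PATTERNS = [
    r"\bimport\s+(os|subprocess|socket|ctypes)\b",
    r"\brequests\.(get|post)\b",
    r"\bsetuid\b|\bchmod\b",
]

def flag_generated_script(script: str) -> list[str]:
    """Return the suspicious patterns found in an LLM-generated script."""
    return [p for p in SUSPICIOUS_PATTERNS if re.search(p, script)]

def guarded_execute(
    script: str,
    run_in_sandbox: Callable[[str], str],
    notify_security: Callable[[dict], None],
) -> Optional[str]:
    """Interrupt the session and alert security instead of running a flagged script."""
    findings = flag_generated_script(script)
    if findings:
        notify_security({"event": "flagged_script", "patterns": findings})
        return None  # do not execute; wait for review or approval
    return run_in_sandbox(script)
```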
Because of their flexibility and capability, controlling the behavior of AI systems in all possible cases remains an open problem. Determined users can often find tricks to bypass sophisticated guardrails despite the best efforts of system designers. Further, guardrails that are too strict can limit the model's functionality. STPA allows analysts to think outside of the AI components and consider holistic ways to mitigate potential hazards.
4. STPA points to the tests that are necessary to confirm safety. For traditional software, system testers create tests based on the context and inputs the systems will face and the expected outputs. They run each test once, leading to a pass/fail result depending on whether the system produced the correct behavior. The scope for testing is helpfully limited by the duality between system development and assurance (i.e., design the system to do things, and assure that it does them).
Safety testing faces a different problem. Instead of confirming that the system achieves its goals, safety testing must determine which of all possible system behaviors must be avoided. Identifying these behaviors for AI components presents even greater challenges because of the vast space of potential inputs. Modern LLMs can accept up to 10 million tokens representing input text, images, and potentially other modes, such as audio. Autonomous vehicles and robotic systems have even more potential sensors (e.g., light detection and ranging, or lidar), further expanding the range of potential inputs.
In addition to the impossibly large space of potential inputs, there is rarely a single expected output. The utility of outputs depends heavily on the system user and context. It is difficult to know where to begin testing AI systems like these, and, consequently, there is an ever-proliferating ecosystem of benchmarks that measure different facets of their performance.
STPA shouldn’t be an entire resolution to those and different challenges inherent in testing AI programs. Nonetheless, simply as STPA enhances security by limiting the scope of potential losses to these explicit to the system, it additionally helps outline the mandatory set of security assessments by limiting the scope to the situations that produce the hazards explicit to the system. The construction of STPA ensures analysts have alternative to evaluate how every command might end in a hazardous system state, leading to a doubtlessly giant, but finite, set of situations. Builders can hand this checklist of situations off to the check crew, who can then choose the suitable check circumstances and information to analyze the situations and decide whether or not mitigations are efficient.
As illustrated in Table 3 below, STPA clarifies specific safety attributes, including proper placement of responsibility for that safety, holistic risk mitigation, and the link to testing. This yields a more complete approach to evaluating and improving the safety of the notional use case. A secure system, for example, will protect customer privacy based on design choices made to protect sensitive customer information. This design ensures that all components work together to prevent a misdirected or rogue LLM from leaking private information, and it identifies the scenarios that testers must examine to confirm that the design will enforce safety constraints.
Benefit: Creates an actionable definition of safety/security.
Application to example: A secure system will not result in a loss of customer privacy. To prevent this loss, the system must protect sensitive customer information at all times.

Benefit: Ensures the right structure to enforce safety/security responsibilities.
Application to example: Responsibility for protecting sensitive customer data is broader than the LLM and includes the sandbox computer.

Benefit: Mitigates risks through control structure specification.
Application to example: Since even an alignment-tuned LLM might leak information or generate and execute a harmful script, ensure other system components are designed to protect sensitive customer information.

Benefit: Identifies tests necessary to confirm safety.
Application to example: In addition to testing LLM vulnerability to adversarial prompts, test sandbox controls on privilege escalation, communication outside the sandbox, warnings tied to prohibited commands, and data encryption in the event of unauthorized access. These tests should include routine security scans using up-to-date signatures and plugins relevant to the system for the host and container/VM. Security frameworks (e.g., RMF) or guides (e.g., STIG checklists) can assist in verifying that appropriate controls are in place using scripts and manual checks.

Table 3. Summary of STPA Benefits on the Notional Example of Customer Data Management
Preserving Safety in the Face of Increasing AI Complexity
The long-standing trend in AI, and software in general, is to continually expand capabilities to meet growing user expectations. This often results in increasing complexity, driving more advanced approaches such as multimodal models, reasoning models, and agentic AI. An unfortunate consequence is that confident assurances of safety and security have become increasingly difficult to make.
We have found that applying STPA provides clarity in defining the safety and security goals of AI systems, yielding valuable design insights, innovative risk mitigation strategies, and improved development of the tests necessary to build assurance. Systems thinking proved effective for addressing the complexity of industrial systems in the past, and, through STPA, it remains an effective approach for managing the complexity of current and future information systems.