MITRE has unveiled the Offensive Cyber Functionality Unified LLM Testing (OCCULT) framework, a groundbreaking methodology designed to guage dangers posed by massive language fashions (LLMs) in autonomous cyberattacks.
Introduced on February 26, 2025, the initiative responds to rising issues that AI techniques may democratize offensive cyber operations (OCO), enabling malicious actors to scale assaults with unprecedented effectivity.
Cybersecurity specialists have lengthy warned that LLMs’ potential to generate code, analyze vulnerabilities, and synthesize technical information may decrease obstacles to executing subtle cyberattacks.
Conventional OCOs require specialised expertise, sources, and coordination, however LLMs threaten to automate these processes—doubtlessly enabling speedy exploitation of networks, knowledge exfiltration, and ransomware deployment.
MITRE’s analysis highlights that newer fashions like DeepSeek-R1 already reveal alarming proficiency, scoring over 90% on offensive cybersecurity information exams.
Contained in the OCCULT Framework
OCCULT introduces a standardized method to evaluate LLMs throughout three dimensions:
- OCO Functionality Areas: Checks align with real-world ways from frameworks like MITRE ATT&CK®, overlaying credential theft, lateral motion, and privilege escalation.
- Use Instances: Evaluations measure if an LLM acts as a information assistant, collaborates with instruments (co-orchestration), or operates autonomously.
- Reasoning Energy: Eventualities take a look at planning, environmental notion, and flexibility—key indicators of an AI’s potential to navigate dynamic networks.
The framework’s rigor lies in its avoidance of simplistic benchmarks.
As an alternative, OCCULT emphasizes multi-step, sensible simulations the place LLMs should reveal strategic pondering, equivalent to pivoting via firewalls or evading detection.


Key Evaluations and Findings
MITRE’s preliminary exams in opposition to main LLMs revealed vital insights:
- TACTL Benchmark: DeepSeek-R1 aced a 183-quency evaluation of offensive ways, reaching 91.8% accuracy, whereas Meta’s Llama 3.1 and GPT-4o trailed carefully. The benchmark consists of dynamic variables to stop memorization, forcing fashions to use conceptual information.
- BloodHound Equivalency: Fashions analyzed artificial Energetic Listing knowledge to determine assault paths. Whereas Mixtral 8x22B achieved 60% accuracy in easy duties, efficiency dropped in complicated situations, exposing gaps in contextual reasoning1.
- CyberLayer Simulations: In a simulated enterprise community, Llama 3.1 70B excelled at lateral motion utilizing living-off-the-land strategies, finishing targets in 8 steps—far outpacing random brokers (130 steps).
Cybersecurity professionals have praised OCCULT for bridging a vital hole. “Present benchmarks usually miss the mark by testing slim expertise,” stated Marissa Dotter, OCCULT co-author.
“Our framework contextualizes dangers by mirroring how attackers use AI.” The method has drawn comparisons to MITRE’s ATT&CK framework, which revolutionized menace modeling by cataloging actual adversary behaviors.
Nevertheless, some specialists warning in opposition to overestimating LLMs. Preliminary exams present fashions wrestle with superior duties like zero-day exploitation or operationalizing novel vulnerabilities.
“AI isn’t changing hackers but, however it’s a power multiplier,” famous moral hacker Alex Stamos. “OCCULT helps us pinpoint the place defenses should evolve.”
MITRE plans to open-source OCCULT’s take a look at instances, together with TACTL and BloodHound evaluations, to foster collaboration.
The workforce additionally introduced a 2025 growth of the CyberLayer simulator, including cloud and IoT assault situations.
Crucially, MITRE urges neighborhood participation to broaden OCCULT’s protection. “No single workforce can replicate each assault vector,” stated lead investigator Michael Kouremetis.
“We want collective experience to construct benchmarks for AI-driven social engineering, provide chain assaults, and extra.”
As AI turns into a double-edged sword in cybersecurity, frameworks like OCCULT present important instruments to anticipate and mitigate dangers.
By rigorously evaluating LLMs in opposition to real-world assault patterns, MITRE goals to arm defenders with actionable insights—making certain AI’s transformative potential isn’t overshadowed by its perils.
Acquire Risk Intelligence on the Newest Malware and Phishing Assaults with ANY.RUN TI Lookup -> Strive free of charge