Can Autonomous LLM Agents Exploit One-Day Vulnerabilities?

Fara Hain CMO

March 3, 2025

Introduction – Explaining the arXiv 2404.08144 research on LLM agents and one-day exploits

When generative AI first emerged, the cybersecurity community primarily focused on two promising benefits:

AI for remediation: Using AI to remediate security vulnerabilities before they could be exploited
AI for investigation: Leveraging AI to analyze security incidents more efficiently, making security teams more productive

However, a concerning “third angle” has now been demonstrated: AI as an attacker – powerful AI systems in the hands of malicious actors, autonomously exploiting vulnerabilities with minimal human guidance.

Large Language Models (LLMs) such as GPT-4 have dramatically advanced in capabilities over the past few years, achieving near or even superhuman performance on a range of tasks. But what happens when these powerful models are used for malicious purposes, such as exploiting cybersecurity vulnerabilities? A recent preprint titled “LLM Agents can Autonomously Exploit One-day Vulnerabilities” by Richard Fang, Rohan Bindu, Akul Gupta, and Daniel Kang (2024) explores this very possibility by demonstrating how GPT-4, when combined with a simple tool-using “agent” framework, can autonomously hack real-world systems with critical or high-severity known vulnerabilities.

What Are One-Day Vulnerabilities?

“One-day vulnerabilities” are security flaws in software that are already disclosed publicly (often assigned a CVE number) but remain unpatched in certain deployments. This period between disclosure and patching can provide a window of opportunity for attackers.

“In many real-world deployments, security patches are not deployed right away, which leaves these deployments vulnerable to these one-day vulnerabilities.” (Fang et al., 2024, p. 3)
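To make the disclosure-to-patch window concrete, here is a minimal sketch that measures how long a deployment stays exposed. The function name and the dates are hypothetical and chosen only for illustration; they do not come from the paper.

```python
from datetime import date
from typing import Optional


def exposure_days(disclosed: date, patched: Optional[date], today: date) -> int:
    """Days a deployment stayed exposed after public disclosure.

    If the patch has not been deployed yet, the window is still open and is
    measured up to `today`.
    """
    end = patched if patched is not None else today
    # A patch applied before disclosure means there was no one-day exposure.
    return max((end - disclosed).days, 0)


# Hypothetical dates, for illustration only.
disclosed = date(2024, 1, 10)
print(exposure_days(disclosed, date(2024, 1, 24), today=date(2024, 3, 1)))  # 14
print(exposure_days(disclosed, None, today=date(2024, 3, 1)))               # 51
```

Every day counted here is a day on which the flaw is publicly documented but still exploitable in that deployment.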

The authors emphasize that while open-source vulnerability scanners can discover or exploit some known issues, they frequently fail to handle more complex scenarios—especially those disclosed recently or requiring multi-step exploitation paths.

The LLM Agent Exploit Research Paper in Brief

Benchmark of 15 Real-World Vulnerabilities

Fang et al. collected a dataset of 15 one-day vulnerabilities from the Common Vulnerabilities and Exposures (CVE) database and a highly cited academic paper. These vulnerabilities are not trivial “capture-the-flag” (CTF) or toy examples; they include:

Website vulnerabilities (e.g., WordPress SQL injection)
Container management software (e.g., runc)
Vulnerable Python packages
A concurrency attack known as ACIDRain (Warszawski & Bailis, 2017)

Many of these CVEs are classified as high or critical severity.
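For readers unfamiliar with severity labels, the sketch below maps a numeric score to a rating. It assumes the labels follow the CVSS v3.x qualitative scale (None, Low, Medium, High, Critical), which this article does not state explicitly. The sample scores are hypothetical and are not taken from the paper's benchmark.

```python
def cvss_v3_severity(score: float) -> str:
    """Map a CVSS v3.x base score (0.0 to 10.0) to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS base score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"       # 0.1 to 3.9
    if score < 7.0:
        return "Medium"    # 4.0 to 6.9
    if score < 9.0:
        return "High"      # 7.0 to 8.9
    return "Critical"      # 9.0 to 10.0


# Hypothetical scores, not taken from the paper's benchmark.
for score in (3.9, 6.5, 8.8, 9.8):
    print(score, cvss_v3_severity(score))
```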

GPT-4 Outperforms Other Models and Tools

The authors tested ten different models and two open-source vulnerability scanners (ZAP and Metasploit) on their dataset of 15 vulnerabilities. The only model capable of autonomously exploiting any vulnerabilities was GPT-4.

“GPT-4 is capable of exploiting 87% of these vulnerabilities compared to 0% for every other model we test…and open-source vulnerability scanners.” (Fang et al., 2024, p. 1)

Interestingly, GPT-3.5, a widely used predecessor to GPT-4, failed to exploit even one of these vulnerabilities, just like the open-source models.

Importance of Providing the CVE Description

While GPT-4 achieved an 87% success rate when given CVE details, its success rate dropped drastically (to only 7%) without the CVE descriptions. This indicates that locating the vulnerability is much more challenging than exploiting it once the nature of the flaw is known.

“Our findings show that GPT-4’s success rate drops to 7% without the CVE description, suggesting that our agent is much more capable of exploiting vulnerabilities than finding vulnerabilities.” (Fang et al., 2024, p. 1)
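As a quick sanity check on these percentages, the sketch below finds which success counts out of 15 round to 87% and 7%. It assumes each reported rate is simply the number of exploited vulnerabilities divided by the 15 benchmark entries, rounded to a whole percent.

```python
TOTAL = 15  # benchmark size stated in the article


def pct(successes: int, total: int = TOTAL) -> int:
    """Success rate as a whole-number percentage."""
    return round(100 * successes / total)


# Which success counts round to the reported percentages?
with_description = [k for k in range(TOTAL + 1) if pct(k) == 87]
without_description = [k for k in range(TOTAL + 1) if pct(k) == 7]
print(with_description, without_description)  # [13] [1]
```

Under that assumption, the gap is 13 exploited vulnerabilities with the CVE description versus 1 without it.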

Cost Analysis

The researchers also examined the economic implications of using GPT-4 for hacking versus hiring a human expert. They found that GPT-4-based attacks can already be 2.8 times cheaper than human-based exploitation. Though these are rough estimates, they demonstrate how automated approaches could make attacks more scalable.

“Using an LLM agent is already 2.8× cheaper than human labor…we expect costs to drop for GPT-4, as costs have dropped by GPT-3.5 by over 3× in a span of a year.” (Fang et al., 2024, p. 7)
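One way to arrive at a ratio like this is to compare the expected spend per successful agent run with the cost of an expert's time. The sketch below shows that calculation. All four input values are illustrative assumptions for this sketch, not figures quoted in this article, and were chosen so that the ratio lands near the reported 2.8×.

```python
def cost_per_successful_exploit(cost_per_run: float, success_rate: float) -> float:
    """Expected spend per success when each run succeeds independently.

    With success probability p, the expected number of runs is 1 / p.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_run / success_rate


def human_cost(hourly_rate: float, hours: float) -> float:
    """Cost of an expert working for the given number of hours."""
    return hourly_rate * hours


# Illustrative inputs (assumptions, not figures quoted in this article).
agent = cost_per_successful_exploit(cost_per_run=3.52, success_rate=0.40)
human = human_cost(hourly_rate=50.0, hours=0.5)
print(f"agent ${agent:.2f}, human ${human:.2f}, ratio {human / agent:.1f}x")
```

The ratio is sensitive to the per-run success rate: halving it doubles the agent's expected cost per exploit.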

How the LLM Agent for One-Days Works

The authors used a simple ReAct agent framework, as implemented in LangChain, to allow GPT-4 to:

Read the vulnerability description (where provided).

Use tools, such as:

A web browser (to retrieve and parse HTML).
A command-line interface.
A code interpreter.

Interact with the target system step-by-step.

“Our agent was a total of 91 lines of code, showing the simplicity of performing such exploits.” (Fang et al., 2024, p. 4)
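The general shape of a ReAct-style loop (think, call a tool, read the observation, repeat) can be sketched in a few dozen lines. This is not the authors' agent, whose prompts are unpublished, and it does not use LangChain. The model here is a hypothetical scripted stand-in rather than a real LLM call, and both tools are harmless stubs that only return canned strings.

```python
from typing import Callable, Dict, List, Tuple


# Benign stand-in tools (assumptions): they return canned text and touch nothing.
def fetch_page(url: str) -> str:
    return f"<html><title>stub page for {url}</title></html>"


def run_command(cmd: str) -> str:
    return f"(stub) would run: {cmd}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "fetch_page": fetch_page,
    "run_command": run_command,
}


def scripted_model(transcript: List[str]) -> str:
    """Hypothetical stand-in for an LLM call: replays a fixed script."""
    script = [
        "Thought: I should look at the target page.\n"
        "Action: fetch_page[http://localhost:8080]",
        "Thought: I have what I need.\n"
        "Final Answer: page title retrieved",
    ]
    # Advance one script step per observation seen so far.
    step = sum(1 for entry in transcript if entry.startswith("Observation:"))
    return script[min(step, len(script) - 1)]


def parse_action(reply: str) -> Tuple[str, str]:
    """Extract (tool, argument) from the line of the form 'Action: tool[argument]'."""
    line = next(ln for ln in reply.splitlines() if ln.startswith("Action:"))
    name, _, rest = line[len("Action:"):].strip().partition("[")
    return name, rest.rstrip("]")


def react_loop(task: str,
               model: Callable[[List[str]], str] = scripted_model,
               max_steps: int = 5) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        reply = model(transcript)                 # reason
        transcript.append(reply)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        tool, arg = parse_action(reply)
        observation = TOOLS[tool](arg)            # act
        transcript.append(f"Observation: {observation}")  # feed the result back
    return "step limit reached"


print(react_loop("Summarize the stub page"))
```

The point of the sketch is how little scaffolding the loop needs: the reasoning lives in the model, while the surrounding code only dispatches tool calls and appends observations.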

Even though the base GPT-4 model has a knowledge cutoff date of November 6th, 2023, it still managed to exploit new CVEs disclosed after that date when given the necessary CVE descriptions.

Key Takeaways & Implications

Emergent Capabilities

GPT-4’s success on real-world exploits could be viewed as an emergent capability. Unlike simpler or purely knowledge-based tasks, vulnerability exploitation requires multi-step reasoning, code generation, and tool usage.

Need for Defensive Measures

With LLMs automating complex exploitation tasks, there is a growing urgency for improved defensive strategies. Traditional scanners like ZAP or Metasploit were no match for GPT-4’s flexible, tool-using approach. Organizations must ensure patches are deployed quickly and that security defenses keep up with modern AI capabilities.

“Our findings highlight the need for the wider cybersecurity community and LLM providers to think carefully about how to integrate LLM agents in defensive measures.” (Fang et al., 2024, p. 9)

Ethical Concerns and Responsible Disclosure

The authors emphasize that their work was conducted in controlled environments, with the aim of better understanding and preventing malicious uses. They disclosed their findings to OpenAI and have not publicly released their exact prompts, to avoid facilitating black-hat exploitation.

“Like many technologies, these results can be used in a black-hat manner, which is both immoral and illegal…we took precautions to ensure that we only used sandboxed environments to prevent harm.” (Fang et al., 2024, p. 9)

Future Directions

Planning & Subagents: GPT-4 performed well even without a separate planning module, but the authors suggest that advanced agent frameworks could improve success rates further.

Tool Integration: Incorporating more specialized tools for tasks like advanced network reconnaissance may bolster an LLM’s hacking capabilities.

Model Alignment: As LLMs grow more powerful, alignment strategies to curb malicious usage become increasingly critical.

Conclusion

This study is an eye-opening look at how cutting-edge AI can be used to exploit real-world vulnerabilities autonomously, highlighting both the remarkable capabilities of GPT-4 and the significant security risks that come with it. While it remains an open question how (or if) other models will close the performance gap, one fact is clear: automated hacking is no longer limited to static, script-based scanners. Instead, an LLM with a set of simple tools can reason, plan, and exploit vulnerabilities—signaling that organizations and researchers alike must adapt swiftly to this new threat landscape.

“Our findings raise questions around the widespread deployment of highly capable LLM agents.” (Fang et al., 2024, p. 1)

References

The paper discussed in this article is listed first, followed by key references it cites:

Fang, R., Bindu, R., Gupta, A., & Kang, D. (2024). LLM agents can autonomously exploit one-day vulnerabilities. arXiv:2404.08144v2 [cs.CR]. Retrieved from https://arxiv.org/abs/2404.08144

Bennetts, S. (2013). ZAP (Zed Attack Proxy). In OWASP.

Brown, T., Mann, B., Ryder, N., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems.

Engebretson, P. (2013). The basics of hacking and penetration testing: ethical hacking and penetration testing made easy. Syngress.

Halfond, W. G., Viegas, J., & Orso, A. (2006). A classification of SQL-injection attacks and countermeasures. In Proceedings of the IEEE International Symposium on Secure Software Engineering.

Kang, D., Fang, R., Gupta, A., et al. (2023). SpamGPT: Large language models can generate massive amounts of targeted phishing emails. arXiv preprint arXiv:2301.01234.

Kennedy, D., O’Gorman, J., Kearns, D., & Aharoni, M. (2011). Metasploit: the penetration tester’s guide. No Starch Press.

Schaeffer, R., Mirman, M., Madaan, A., et al. (2024). Theory-based predictions of emergent abilities in large language models. In International Conference on Learning Representations.

Warszawski, T., & Bailis, P. (2017). ACIDRain: Concurrency attacks on database-backed web applications. In SIGMOD/PODS ’17: Proceedings of the 2017 ACM International Conference on Management of Data.

(Additional citations are provided in the original manuscript by Fang et al. (2024).)