Anthropic recently scanned open source software and found 1.6K vulnerabilities. As of May 22nd, only 97 of these have been patched. Discovery is now the easy part; verification, triage, and patching are the new bottleneck. The author of this post wanted to measure that bottleneck directly by having models attempt fixes across twenty real-world CVEs, with five models and three prompts.
They tried this with three different levels of information. First, they gave the model the exploit report but no code reference. Next, they gave it the exact location but only a hint about what was wrong. Finally, they used a prompt that provided the exact issue and the affected code. All three are real-world scenarios, so they wanted to know what would work in practice.
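The three conditions can be pictured as three ways of assembling a prompt from the same CVE record. The sketch below is illustrative only: the `CveCase` fields, the `Condition` names, and the prompt wording are all assumptions, not the author's actual templates.

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    REPORT_ONLY = "report_only"      # exploit report, no code reference
    LOCATION_HINT = "location_hint"  # exact location, only a hint about the flaw
    FULL_DETAIL = "full_detail"      # exact issue plus the affected code


@dataclass
class CveCase:
    # Hypothetical record layout; field names are assumptions for illustration.
    report: str
    location: str
    hint: str
    issue: str
    affected_code: str


def build_prompt(case: CveCase, condition: Condition) -> str:
    """Assemble only the information a model is allowed to see under one condition."""
    if condition is Condition.REPORT_ONLY:
        parts = [f"Exploit report:\n{case.report}"]
    elif condition is Condition.LOCATION_HINT:
        parts = [f"Location: {case.location}", f"Hint: {case.hint}"]
    else:
        parts = [f"Issue: {case.issue}", f"Affected code:\n{case.affected_code}"]
    parts.append("Patch the vulnerability in the project source.")
    return "\n\n".join(parts)
```

Keeping the case data separate from the condition means the same twenty CVEs can be replayed under each prompt, so differences in solve rate can be attributed to the information given rather than to the bug itself.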
The sandbox contained the vulnerable project's source code and a limited toolset. They decided NOT to allow bash because of the likelihood of cheating on the benchmark, even though this likely hindered the more advanced models. Each run got at most 20 turns. Afterwards, a test file was added to the sandbox to check whether the bug was completely fixed.
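A setup like this can be sketched as a capped tool loop followed by a test the agent never sees. This is a minimal sketch, not the author's harness: `run_episode`, the tool interface, and the choice to run the test even when the turn cap is hit are all assumptions. Only the 20-turn cap, the restricted toolset, and the after-the-fact test come from the post.

```python
from typing import Callable, Dict, Optional, Tuple

MAX_TURNS = 20  # from the post: each run gets at most 20 turns

# A tool takes a string argument and returns a string result.
Tool = Callable[[str], str]
# An agent step sees the last tool output and returns (tool_name, argument),
# or None once it believes the fix is done.
AgentStep = Callable[[str], Optional[Tuple[str, str]]]


def run_episode(
    agent_step: AgentStep,
    tools: Dict[str, Tool],
    hidden_test: Callable[[], bool],
    max_turns: int = MAX_TURNS,
) -> Tuple[bool, bool]:
    """Run one patch attempt and return (declared_done, test_passed)."""
    observation = ""
    declared_done = False
    for _ in range(max_turns):
        action = agent_step(observation)
        if action is None:  # agent says it is finished
            declared_done = True
            break
        name, arg = action
        if name not in tools:  # e.g. "bash" is deliberately not offered
            observation = f"error: unknown tool {name!r}"
            continue
        observation = tools[name](arg)
    # The test only enters the picture after the agent has stopped,
    # so the agent cannot read it or tune the patch against it.
    return declared_done, hidden_test()
```

Returning both flags separately matters for the analysis later: a run that declares itself done but fails the test is a different kind of failure from one that simply ran out of turns.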
Across the tests, only around 10/20 of the vulnerabilities actually get fixed correctly. In one condition, gpt-5.4 nano fixed 11/20 where gpt-5.5 fixed only 8/20. Even under the most favorable conditions, gpt-5.5 solves 12/20, which still isn't great. Although gpt-5.5 costs 12x as much as gpt-5.4 mini, it performs about the same. More tokens don't mean more solves here.
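To make the numbers concrete, the snippet below turns the quoted scores into solve rates and works out a relative cost per solve. The scores and the 12x price ratio come from the post; treating the two models as having an identical solve count is a simplifying assumption standing in for "about the same", and the function names are made up for illustration.

```python
TOTAL_CVES = 20


def solve_rate(solved: int, total: int = TOTAL_CVES) -> float:
    """Fraction of CVEs fixed correctly."""
    return solved / total


def relative_cost_per_solve(
    cost_ratio: float, solved_expensive: int, solved_cheap: int
) -> float:
    """How many times more the expensive model pays per solved CVE.

    cost per solve = run cost / solves, so the ratio of the two is
    cost_ratio * solved_cheap / solved_expensive.
    """
    return cost_ratio * solved_cheap / solved_expensive


nano_rate = solve_rate(11)      # gpt-5.4 nano in the one condition quoted
big_rate = solve_rate(8)        # gpt-5.5 in that same condition
best_rate = solve_rate(12)      # gpt-5.5 under the most favorable conditions

# Assumption: equal solve counts for both models (placeholder value 10).
premium = relative_cost_per_solve(12, solved_expensive=10, solved_cheap=10)
```

Under that equal-solves assumption the premium is the full 12x: every fixed CVE costs twelve times as much with the larger model, with nothing extra to show for it.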
There were several failure modes. First, wrong-search drift: the model would keep searching for something that didn't exist. Once it got sidetracked, it couldn't find its way back. Second, the budget would run out while the model was still in the middle of the fix.
More interestingly, the model would also edit the right part of the code without fixing the entire issue. It was satisfied with the fix, but the fix was insufficient. Finally, it would sometimes patch the wrong part of the vulnerability, leaving the actual exploitable bug alive.
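The four failure modes can be told apart from a handful of facts about each run. The classifier below is a rough sketch under stated assumptions: the `RunRecord` fields are hypothetical things a harness might log, and using file overlap to separate an incomplete fix from a misplaced one is a crude proxy, not the author's method.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class Outcome(Enum):
    FIXED = "fixed"
    SEARCH_DRIFT = "wrong-search drift"
    BUDGET_EXHAUSTED = "budget exhausted mid-fix"
    PARTIAL_FIX = "right location, incomplete fix"
    WRONG_TARGET = "patched the wrong part"


@dataclass
class RunRecord:
    # Hypothetical fields; none of these names come from the post.
    test_passed: bool
    declared_done: bool
    edited_files: Set[str]
    vulnerable_files: Set[str]


def classify(run: RunRecord) -> Outcome:
    """Map one run onto the failure modes described above."""
    if run.test_passed:
        return Outcome.FIXED
    if not run.declared_done:
        # Ran out of budget: either it never edited anything (drift)
        # or it was cut off partway through a fix.
        if run.edited_files:
            return Outcome.BUDGET_EXHAUSTED
        return Outcome.SEARCH_DRIFT
    # The model signed off on a patch that the test rejects.
    if run.edited_files & run.vulnerable_files:
        return Outcome.PARTIAL_FIX
    return Outcome.WRONG_TARGET
```

The split on `declared_done` is the useful one in practice: the first two outcomes are visible to whoever runs the agent, while the last two look like success until an independent test says otherwise.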
False confidence creates its own attack surface. Two of these failure modes came down to token limits, while the other two were patches that the model had given the thumbs-up but that were completely wrong. A good article and a truthful look into the problems with LLMs fixing bugs.