Monad created an automated bugfinder using LLMs. This is the story of what they built and why.
Why? LLMs are getting good at finding bugs; specifically, better than simple static analyzers. They are not going to replace existing tooling, but they will augment it. The more bugs you can kill, the better. Monad decided to build a bugfinder rather than buy one for two reasons. First, they wanted a baseline to compare other tools against. Second, they wanted to address specific gaps in their application, such as cross-language boundaries and their unusual architectural challenges.
Two main pieces of prior art shaped their system. First, Mythos Preview claimed to use a very simple harness, so they decided to use minimal prompting while giving the agent major freedom to inspect, instrument, and test. Second, TxRay provides a structured way to identify vulnerabilities and their causes. They used the Mythos approach for finding bugs and the TxRay shape for being skeptical and proving the issues.
The system has two main services: a bug finder and a bug triager. The finder first ranks files using static analysis to prioritize which ones to examine. It then chooses a target and runs inside a container with the source, build tools, and instrumentation available. When it has a potential issue, the finder tries to prove that the issue exists through crashes, invariant violations, or several other types of test cases. This produces a unit test and then a full E2E POC. Only issues with a full E2E POC are promoted to the next step, triage.
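The finder's flow (rank files, try to prove a lead, promote only full E2E POCs) can be sketched roughly as follows. This is a minimal illustration, not Monad's code: the `Proof` levels, the `Lead` type, the risk scores, and the file paths are all hypothetical stand-ins for whatever the real system uses.

```python
from dataclasses import dataclass
from enum import IntEnum


class Proof(IntEnum):
    """How far a lead has been proven (hypothetical levels)."""
    NONE = 0
    UNIT_TEST = 1
    E2E_POC = 2


@dataclass
class Lead:
    file: str
    summary: str
    proof: Proof = Proof.NONE


def rank_files(scores: dict[str, float]) -> list[str]:
    """Order files by an assumed static-analysis risk score, highest first."""
    return sorted(scores, key=lambda path: scores[path], reverse=True)


def promote(leads: list[Lead]) -> list[Lead]:
    """Gate: only leads backed by a full E2E POC move on to triage."""
    return [lead for lead in leads if lead.proof == Proof.E2E_POC]


# Hypothetical example data.
scores = {"rpc/server.rs": 0.4, "consensus/vote.cpp": 0.9, "util/log.py": 0.1}
targets = rank_files(scores)
leads = [
    Lead(targets[0], "possible double vote", Proof.E2E_POC),
    Lead(targets[1], "unchecked length field", Proof.UNIT_TEST),
]
promoted = promote(leads)
```

Here only the first lead reaches triage; the second stops at the unit-test stage until an end-to-end POC exists.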
At the triage step, a verifier first validates that the bug is legitimate by reading the report. This phase is designed to resist a confident report, which is where most low-quality triaging fails. Allowing the prompt to push back and demand evidence made a noticeable difference. If the verifier is satisfied, the full POC is run to ensure that the bug is valid. Finally, an adjudicator reviews the POC, hypotheses, and much more, and determines the final state of the report.
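The triage stages can be pictured as a chain of gates, each able to reject the report. The sketch below is only an assumption about the shape: in the real system the verifier and adjudicator are LLM agents, whereas here plain functions stand in for them, and the `Report` fields and status strings are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Report:
    claim: str
    evidence: list[str]      # e.g. crash logs or failing assertions
    poc: Callable[[], bool]  # returns True if the POC reproduces the bug


def verify(report: Report) -> bool:
    """Skeptical read: a confident claim with no evidence is rejected."""
    return len(report.evidence) > 0


def triage(report: Report, adjudicate: Callable[[Report], str]) -> str:
    """Run the gates in order: verifier, POC execution, adjudicator."""
    if not verify(report):
        return "rejected: no evidence"
    if not report.poc():
        return "rejected: POC did not reproduce"
    return adjudicate(report)


# Hypothetical reports exercising each gate.
confident = Report("definitely critical", [], lambda: True)
flaky = Report("crash in parser", ["stack trace"], lambda: False)
solid = Report("invariant violation", ["failing assertion"], lambda: True)

results = [triage(r, lambda _r: "confirmed") for r in (confident, flaky, solid)]
```

The ordering is the point: the cheap skeptical read happens before the POC is run, and the adjudicator only sees reports that survived both.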
They learned a few things along the way. First, have very strict validation gates. For consensus-level bugs, for instance, a full POC within a small cluster is useless. Second, separate agents redo much of the work that the previous one did; reducing the number of specialized agents and trusting the previous steps improved this. Third, lead generation should be separate from triaging, because vulnerabilities can be chained together.
To improve it in the future, they need an evaluation framework to properly compare versions. Going forward, they will put a lot of work into the harness to find more bugs. They also want to keep iterating and running it continuously, similar to a traditional fuzzer, to find more bugs.
For every reported finding, there are 10 confirmed issues. For every confirmed issue, there are 10 bug leads. This reduces the human cost by 10x, and that factor may grow in the future. Each confirmed issue costs about $100 in API credits. Compared with the cost of audit reports or bug bounties, this is a small price to pay! Overall, a great post on the realities of creating a hackbot.
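Taking those ratios literally, the funnel works out as in the small calculation below. It assumes the 10:1 ratios are exact and that API cost scales linearly with confirmed issues, neither of which the post guarantees; the function and constant names are made up for this sketch.

```python
LEADS_PER_CONFIRMED = 10      # from the stated ratio
CONFIRMED_PER_REPORTED = 10   # from the stated ratio
COST_PER_CONFIRMED_USD = 100  # "about $100 in API credits"


def funnel(reported: int) -> dict[str, int]:
    """Back out lead and confirmed counts, plus rough API cost, from reported findings."""
    confirmed = reported * CONFIRMED_PER_REPORTED
    leads = confirmed * LEADS_PER_CONFIRMED
    return {
        "leads": leads,
        "confirmed": confirmed,
        "reported": reported,
        "api_cost_usd": confirmed * COST_PER_CONFIRMED_USD,
    }


one_finding = funnel(1)
```

Under these assumptions, one reported finding corresponds to about 100 leads, 10 confirmed issues, and roughly $1,000 in API credits.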