The author of this post had watched the development of AI tools for many months and decided to see whether LLMs could find bugs for them. Early on, they realized that the scaffolding around the model would matter more than anything else. At the time of the post, they were ranked #1 in Korea by HackerOne reputation.
The most impactful bug they found was a privilege escalation in Grafana. Permissions can be configured per dashboard, so a user may have admin access on one dashboard but not another. However, the permissions API did not enforce this per-dashboard scoping. With admin control over one dashboard, a user could modify the permissions of other dashboards. This was rated with a CVSS score of 8.1.
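To make the bug class concrete, here is a minimal sketch of an authorization check that is not scoped to the target resource. It is not Grafana's code; the `Dashboard` type, the function names, and the sample users are all hypothetical, and the exact flaw in the real permissions API may differ from this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dashboard:
    uid: str
    # Role per user on this dashboard only, e.g. {"alice": "admin"}.
    acl: dict = field(default_factory=dict)


def make_dashboards():
    """Two dashboards, each with a different admin (illustrative data)."""
    return {
        "team-a": Dashboard("team-a", {"alice": "admin"}),
        "team-b": Dashboard("team-b", {"bob": "admin"}),
    }


def set_permission_unscoped(dashboards, caller, target_uid, user, role):
    # Flawed check (assumed for illustration): the caller only needs
    # admin on *some* dashboard, not on the one being modified.
    if not any(d.acl.get(caller) == "admin" for d in dashboards.values()):
        raise PermissionError("caller is not a dashboard admin")
    dashboards[target_uid].acl[user] = role


def set_permission_scoped(dashboards, caller, target_uid, user, role):
    # Scoped check: the caller must be admin on the target dashboard.
    if dashboards[target_uid].acl.get(caller) != "admin":
        raise PermissionError("caller is not admin on the target dashboard")
    dashboards[target_uid].acl[user] = role
```

With the unscoped check, `alice` (admin on `team-a` only) can grant herself admin on `team-b`. The scoped version raises `PermissionError` for the same call.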
They tried and discarded several versions of their workflow. Eventually, they ended up with three stages: target preparation, vulnerability hypothesis, and candidate validation. Target preparation was all about dividing the codebase into reasonably sized chunks. Too big and the model misses things. Too small and the number of tool calls explodes. This stage existed only to produce inputs of a suitable size for the next stage to analyze.
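The post does not say how the chunks were cut, so the following is only a sketch of one simple approach: greedily packing lines into chunks under a character budget. The function name `chunk_lines` and the `max_chars` parameter are assumptions for illustration; a real pipeline would more likely split on file or function boundaries.

```python
def chunk_lines(lines, max_chars):
    """Greedily group lines into chunks of at most max_chars characters.

    A single line longer than max_chars becomes its own oversized chunk
    rather than being split mid-line.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        # Start a new chunk if adding this line would exceed the budget.
        if current and size + len(line) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append(current)
    return chunks
```

The budget is the knob the author describes tuning: raising `max_chars` gives fewer, larger inputs, while lowering it multiplies the number of model calls.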
Each code chunk was sent to the LLM to extract potential candidates. False positives were acceptable here, since later stages would filter them out. The aim was to enumerate every possible attack point rather than to confirm a real vulnerability. They used GLM 4.7 instead of GLM 5.1 because of cost. In practice, generating vulnerability candidates did not require much deep reasoning. Quantity over quality at this stage.
After identifying candidate issues, it was time to validate whether the bugs were real. This was about removing false positives. A piece of code may look vulnerable in isolation, but there are many ways an attack could have been blocked. Validation had two phases: a lower-powered model first removed the obvious false positives, then a higher-powered model examined the candidates more likely to be real bugs.
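The two-phase validation can be sketched as a pair of filters, with the expensive judge seeing only what the cheap one lets through. This is an assumed shape, not the author's implementation: `validate_candidates`, `cheap_judge`, and `strong_judge` are hypothetical names, and each judge stands in for a call to a lower- or higher-powered model.

```python
from typing import Callable, Iterable, List


def validate_candidates(
    candidates: Iterable[str],
    cheap_judge: Callable[[str], bool],
    strong_judge: Callable[[str], bool],
) -> List[str]:
    """Return candidates that both judges consider plausible.

    Each judge returns True to keep a candidate. The strong judge is
    invoked only on candidates that survived the cheap judge, which is
    where the cost saving comes from.
    """
    survivors = [c for c in candidates if cheap_judge(c)]
    return [c for c in survivors if strong_judge(c)]
```

In use, the judges would wrap model calls. For a dry run they can be plain functions, such as `lambda c: "sanitized" not in c` for the cheap pass.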
They ran into some limitations. First, it's still really expensive to run this tooling when the goal is to make money, since bounties are not guaranteed to be paid out. Their system worked originally because GLM models were so cheap at the time. The next issue was the gray zone of vulnerabilities. Whether something counts as a vulnerability can depend on policy rather than on code alone. They give an example of this in their report too.
LLMs usually analyze a single repository. In practice, real systems consist of multiple connected components, with security controls provided at different locations. This produces false positives, because code that looks vulnerable in one repository may be protected by a control elsewhere. An interesting article, and another way that LLMs can help us find bugs.