Given any freedom in what counts as acceptable, Claude will do a terrible job. For instance, the C compiler it wrote passed the C test suite, yet it still had 34 horrible bugs and really bad optimization passes. JustHTML was tested well, but had architectural issues. With some targeted rewrites, it improved significantly. So, how do we improve the code generation? We force it to make the proper choices with execution oracles. The simplest one is a test case, but we can do better than that.
What are execution oracles? They are anything that can objectively rate the quality of code. Test suites, fuzzers, static analyzers, linters, and formal verifiers all qualify. Performance oracles such as speed and memory use are also useful. Finally, lines of code is a good metric too, with fewer being better. These metrics need to be robust. Otherwise, the LLM will game the system.
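To make the idea concrete, here is a minimal sketch of three oracles (a test suite, a timing measurement, and a line count) combined into one report. Everything in it is my own illustration rather than anything from the post: `OracleReport`, `run_oracles`, and `count_code_lines` are hypothetical names, and a real setup would more likely call out to an existing test runner, fuzzer, or linter.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class OracleReport:
    tests_passed: int
    tests_total: int
    seconds: float      # wall-clock time spent running the test cases
    source_lines: int   # non-blank, non-comment lines in the candidate


def count_code_lines(source: str) -> int:
    """Count lines that are neither blank nor pure '#' comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def run_oracles(
    func: Callable[..., Any],
    cases: Sequence[Tuple[tuple, Any]],
    source: str,
) -> OracleReport:
    """Run every (args, expected) case and collect objective measurements."""
    passed = 0
    start = time.perf_counter()
    for args, expected in cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed case, not as a harness error.
            pass
    elapsed = time.perf_counter() - start
    return OracleReport(passed, len(cases), elapsed, count_code_lines(source))


# Hypothetical candidate, as if produced by the model.
CANDIDATE_SOURCE = "def add(a, b):\n    # sum two numbers\n    return a + b\n"


def add(a, b):
    return a + b


# The second case is deliberately wrong to show a failure being counted.
report = run_oracles(add, [((1, 2), 3), ((2, 2), 5)], CANDIDATE_SOURCE)
print(report)
```

The line count here is deliberately naive, which is exactly the kind of metric that can be gamed (for example by cramming statements onto one line), so a robust version would need something stricter.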
There are some degrees of freedom that we can't control. Software architecture, modularity, and maintainability are important yet difficult to rate objectively. GUI polish, security, and duplicated or overly complex code are some other ones.
In practice, there are some other things to consider. First, some requirements are hard, while others are soft. A failing test case is a hard failure, but a small performance loss may be fine. Next, LLMs will not perform actions unless they are specified exactly; if a tool isn't in the proper spot, the model just gives up. It can also be lazy and skip steps. The best way to handle these complexities is to just watch what it's doing and intervene when necessary.
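The hard versus soft distinction can be sketched as an acceptance gate. This is only an illustration under my own assumptions: `Measurement`, `accept`, and the 5% default tolerance are hypothetical and not taken from the post.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Measurement:
    tests_failed: int
    runtime_seconds: float


def accept(
    candidate: Measurement,
    baseline: Measurement,
    max_slowdown: float = 0.05,  # assumed tolerance: up to 5% slower is fine
) -> Tuple[bool, str]:
    """Decide whether a candidate change is acceptable."""
    # Hard requirement: any failing test rejects the change outright.
    if candidate.tests_failed > 0:
        return False, "hard: failing tests"
    # Soft requirement: a small performance loss is tolerated.
    allowed = baseline.runtime_seconds * (1 + max_slowdown)
    if candidate.runtime_seconds > allowed:
        return False, "soft: slowdown beyond tolerance"
    return True, "ok"


baseline = Measurement(tests_failed=0, runtime_seconds=1.00)
print(accept(Measurement(0, 1.04), baseline))  # slightly slower, within tolerance
print(accept(Measurement(1, 0.50), baseline))  # faster, but a test fails
print(accept(Measurement(0, 1.20), baseline))  # passes tests, but too slow
```

Checking the hard requirement first means a faster but broken candidate can never be accepted on the strength of its speed.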
This post is a good look at how to make LLMs more deterministic in their workflows and reach the points we want. Great write-up!