Current state of virtual developers
CTO's deep dive to understand where software engineering is heading
This piece summarises the current state of “virtual developer” tooling, based on my experience of experimenting with building it myself.
The idea
The idea of virtual developer tooling is simple:
Use LLMs to generate code based on a human prompt (sketched in code after this list).
Specifically, there are two quite different settings to solve for:
generating a standalone codebase from scratch
fixing a bug / inserting a new feature in an existing codebase
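In both settings, the core primitive is a single model call. Here is a minimal sketch of the from-scratch case, assuming the 2023-era `openai` Python client (pre-1.0 API) and an API key in the environment; the prompts are illustrative, not any particular project’s:

```python
import openai  # assumes the pre-1.0 openai client, e.g. openai==0.27

def generate_code(prompt: str) -> str:
    """Turn a natural-language prompt into code with one model call."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a senior software engineer. Reply with code only."},
            {"role": "user", "content": prompt},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(generate_code("Write the game Snake in Python, using pygame."))
```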
Early success
Since GPT-4 came out, it has been quite clear that it is capable of generating the code for entire applications, for example the game Snake, from just one prompt.
Caveats include:
It mostly works well on prompts for which there are many publicly available implementations
For most applications, the code will not run correctly on the first attempt
This is quite impressive. To be fair, applications that humans write in one go will very seldom run on the first attempt either.
The naive approach: chat dialog
The naive approach to addressing these caveats is to simply point out the problems in the generated code and ask the model to fix them, in the ChatGPT UI or equivalent, and then let a human “take over” once only minor things remain, by pasting the code into the editor.
Pros:
It’s very flexible: easy to experiment with prompting strategies, etc.
Cons:
A human needs to wait for the slow GPT-4 to generate code; it cannot “run in the background”
Once code is copied out of the chat dialog, the “AI assistance” is gone from that point on
YouTube creator Marko explains quite clearly how he was able to greatly accelerate his workflow for a new project with this approach.
Innovations to make virtual developers reliable
There are many tricks to improve LLM performance in general; some are particularly applicable to code generation. I’ll highlight the most impressive and useful ones below.
Self-reflection and chain-of-thought prompts
In short, chain-of-thought prompting corresponds to asking the model to “think step by step”. This improves its performance on most tasks.
How it can be applied in virtual developer tooling is a rich topic. A distillation of the main innovation, sketched in code after the list:
Ask the model to write a specification for the code that needs to be written to accomplish the goal
Ask the model to, one step at a time, write/edit each file necessary to accomplish the goal, based on the specification
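A hedged sketch of that two-step flow; the `chat` helper, the prompts, and the file-listing format are illustrative assumptions, not any particular project’s API:

```python
import openai

def chat(prompt: str) -> str:
    """One illustrative round-trip to the model."""
    response = openai.ChatCompletion.create(
        model="gpt-4", messages=[{"role": "user", "content": prompt}]
    )
    return response["choices"][0]["message"]["content"]

goal = "A CLI tool that counts lines of code in a directory"

# Step 1: write the specification before writing any code ("think step by step").
spec = chat(f"Write a short, step-by-step technical specification for: {goal}")

# Step 2: derive the file list from the specification, then write each file
# one at a time, always conditioning on the specification.
file_list = chat(f"Specification:\n{spec}\n\nList the file paths needed, one per line.")

files = {}
for path in (p.strip() for p in file_list.splitlines()):
    if path:
        files[path] = chat(
            f"Specification:\n{spec}\n\n"
            f"Write the complete contents of {path}. Reply with code only."
        )
```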
There is published work, Reflexion, that takes chain of thought a few steps further: asking the model to also “critique itself” before committing to an answer, cutting the error rate on coding tasks from 20% (plain GPT-4) to 9%. Anyone can play with their code here.
Self-healing code
Another big innovation for virtual developer tooling is to let the LLM, just like a human engineer would, attempt to run the generated code, read the errors (stack traces), and fix the code itself.
An example can be seen here by @bio_bootloader.
Self-healing code is also a rich topic with lots of room for innovation: for example, the endless variations in how chain-of-thought prompting can be used to first write tests, and then self-heal the code based on the results of any failing tests.
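A minimal sketch of the self-healing loop, assuming the generated program is a single Python script and reusing the illustrative `chat` helper from the previous sketch:

```python
import subprocess

MAX_ATTEMPTS = 3

def self_heal(path: str) -> bool:
    """Run the generated script; on failure, feed the stack trace back to the model."""
    for _ in range(MAX_ATTEMPTS):
        result = subprocess.run(
            ["python", path], capture_output=True, text=True, timeout=60
        )
        if result.returncode == 0:
            return True  # ran cleanly, nothing to heal
        with open(path) as f:
            code = f.read()
        fixed = chat(  # illustrative helper from the chain-of-thought sketch above
            f"This code:\n{code}\n\nfailed with this stack trace:\n{result.stderr}\n\n"
            "Return the complete corrected file. Reply with code only."
        )
        with open(path, "w") as f:
            f.write(fixed)
    return False  # give up after MAX_ATTEMPTS and hand back to a human
```

The same loop generalises to running a test suite instead of the program itself, and healing on any failing test.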
Retrieval augmentation
The idea of retrieval augmentation is to search a database for relevant data (for example documentation, or “good code examples” relating to the code that should be written), and then append it to the LLM prompt used for generating the code.
Retrieval augmentation is very powerful, and a simple library to play with it is llama index.
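llama index wraps this pattern (chunking, indexing, querying) behind a convenient API; a bare-bones version of the idea, using the 2023-era `openai` embeddings endpoint and cosine similarity over a toy in-memory “database”, looks roughly like this:

```python
import numpy as np
import openai

# Toy "database" of reference snippets we might want to retrieve from.
documents = [
    "Use requests.get(url, timeout=10) and call response.raise_for_status().",
    "pathlib.Path('.').glob('**/*.py') lists Python files recursively.",
    "argparse.ArgumentParser supports subcommands via add_subparsers().",
]

def embed(texts):
    """Embed a list of texts with the 2023-era embeddings endpoint."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return [np.array(item["embedding"]) for item in resp["data"]]

doc_vectors = embed(documents)

def retrieve(query: str, k: int = 2):
    """Return the k documents most similar to the query (cosine similarity)."""
    q = embed([query])[0]
    scores = [q @ d / (np.linalg.norm(q) * np.linalg.norm(d)) for d in doc_vectors]
    best = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)[:k]
    return [documents[i] for i in best]

task = "Write a function that downloads a file over HTTP."
context = "\n".join(retrieve(task))
prompt = f"Relevant references:\n{context}\n\nTask: {task}"
# `prompt` is then passed to the code-generating model call as before.
```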
Open-source projects
Below I list the most influential projects I’ve played with myself, each with around 1k GitHub stars:
Dev GPT – Focused on making Python microservices from a prompt. Utilises self-healing code, chain-of-thought prompting, and more
English Compiler – Takes a “technical specification”, chunks it up, and generates code for each chunk
AutoPR – Automatically generates pull requests based on GitHub issues. Not as advanced prompt-innovation-wise, but focused on integrating with GitHub.
Smol Developer – I just found this before posting. Focused on generating entire codebases. A bit overengineered, but simple to understand and run.
Recent addition
GPT Engineer – Generates arbitrary codebases. Focused on being quick to get started with, easy to experiment with, and ultimately able to generate the code for itself (“bootstrap”)
Since I couldn’t find any project providing sufficient interactivity, I actually wrote the last one, GPT Engineer, myself over the last few weekends. It has given me the flexibility needed for rapid experimentation.
Present day limitations
The limiting factor for all the projects mentioned is reliability.
A human would generally either not understand what to do and say so, or create something correct, whereas an LLM will always confidently generate something, even though it might miss an important part of the specification.
Furthermore, if a generation run were instant, this wouldn’t be as big a factor; but spending a minute or so waiting for something that might not work at all is not a great value proposition.
From better models, towards clever heuristics?
A meta-observation on this topic is this:
OpenAI was able to change the game of AI by releasing access to better foundation models.
What we are witnessing now is that there are many smart tricks that can be used to make products and tools built on the foundation models more reliable in the real world. Perhaps as much of the AI innovation in the coming years will come from heuristics and normal software engineering as from model training? It remains to be seen.
The ongoing Cambrian explosion
Things are moving fast. I expect there are already fast-growing projects that I did not know about when writing this.
One example is Microsoft/GitHub’s work on a “copilot for pull requests”, with a preview video of how it will be able to suggest pull requests based on a short prompt. This, or one of the many startups currently claiming to be working on the same thing (sweep.dev, magic.dev), might succeed in making it a successful product by solving what all initiatives thus far lack: reliability.
Finally, if you read this far:
To keep us all up to date with the fast pace of development, please share in the comments, or tweet at @antonosika, if you know of a project I missed.