The author starts out with an excellent observation:
> Lately, I've been playing around with LLMs to write code. I find
> that they're great at generating small self-contained snippets.
> Unfortunately, anything more than that requires a human...
I have been working on this problem quite a bit lately. I put together a writeup describing the solution that's been working well for me:
The problem I am trying to solve is that it’s difficult to use GPT-4 to modify or extend a large, complex pre-existing codebase. To modify such code, GPT needs to understand the dependencies and APIs which interconnect its subsystems. Somehow we need to provide this “code context” to GPT when we ask it to accomplish a coding task. Specifically, we need to:
1. Help GPT understand the overall codebase, so that it can decipher the meaning of code with complex dependencies and generate new code that respects and utilizes existing abstractions.
2. Convey all of this “code context” to GPT in an efficient manner that fits within the 8k-token context window.
To address these issues, I send GPT a concise map of the whole codebase. The map includes all declared variables and functions with call signatures. This "repo map" is built automatically using ctags and enables GPT to better comprehend, navigate and edit code in larger repos.
The writeup linked above goes into more detail, and provides some examples of the actual map that I send to GPT as well as examples of how well it can work.
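For the curious, here's a minimal sketch of how a map like this can be built. The JSON output and signature field are universal-ctags features, but the map format below is illustrative and simpler than what aider actually sends:

```python
import json
import subprocess

# Ask universal-ctags for machine-readable tags, including signatures (+S).
out = subprocess.run(
    ["ctags", "-R", "--output-format=json", "--fields=+S", "."],
    capture_output=True, text=True, check=True,
).stdout

# Build a compact per-file map of "kind name(signature)" entries.
repo_map: dict[str, list[str]] = {}
for line in out.splitlines():
    tag = json.loads(line)
    if tag.get("_type") != "tag":
        continue
    entry = f"{tag.get('kind', '?')} {tag['name']}{tag.get('signature', '')}"
    repo_map.setdefault(tag["path"], []).append(entry)

# Print the map, which is what gets included in GPT's context.
for path, entries in sorted(repo_map.items()):
    print(path)
    for e in entries:
        print(f"  {e}")
```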
I've been hearing executives and "tech leaders" recently saying that 80% of new code is now written by ChatGPT, and that it will "10x" a developer, but that sure doesn't match my experience. I suspect there will be a lot of managers with much higher expectations than is reasonable, which won't be good.
It's actually 1.2x productivity. Writing code does not take most of the day. If GPT can be just as good at debugging as it is at writing code, maybe the speedup would increase a bit. The ultimate AI speed == human reading + thinking speed, not AI generation speed.
I’ve found it’s very useful for understanding snippets of code, like a 200-line function. But systems are more than “lots of 200-line functions” - there’s a lot of context hidden in flags, conditional blocks, data, Git histories.
Maybe one day, we’ll be able to run it over all the Git histories, Jira tickets, Confluence documentation, Slack conversations, emails, meeting transcripts, presentations, etc. Until then, the humans will need to stitch it all together as best they can.
I can't think of a reason you couldn't specifically train an AI on your own large code base. After all, current LLMs are trained on effectively the entire internet.
Unless your documentation is 20~100x the size of your codebase and written in a conversational tone, you won't be able to ask the LLM questions about it in English.
If your only aim is to use it like Copilot, sure, it's useful.
You might be able to fine-tune a model on pull requests if they have really high-quality descriptions, high-quality commit messages, and the code is well documented and organized.
I haven't yet seen anything that can scan an entire codebase and build, say, a data lineage to understand how a value in the UI was calculated. I'm sure it's coming, though.
but not all human thinking is worthwhile. I had it do a simple chrome extension to let me reply inline on HN, and it coughed out a manifest.json that worked first try. I didn't have to poke around the Internet to find a reference and then debug that via stack overflows. Easily saved me half an hour and gave me more mental bandwidth for the futzing with the DOM that I did need to do. (to your point tho, I didn't try feeding the html to it to see if it could do that part for me.)
so it's somewhere between 1.2x and 10x for me, depending on what I'm doing that day. Maybe 3x on a good day?
I don't mean to pick on you specifically, but this kind of approach doesn't fit the way I like to work.
For example, just because the manifest.json worked doesn't mean it is correct - is it free of issues (security or otherwise)?
I would argue that every system in production today seemed to "just work" when it was built and initially tested, and yet how many serious issues are in the wild today (security or otherwise)?
I prefer to take a little more time solving problems, gaining an understanding of WHY things are done certain ways, and some insight into potential problems that may arise in the future even if something seems to work at first glance.
Now I get that you are just talking about a small chrome extension that maybe you are only using for yourself... but scaling that up to anything beyond that seems like a ticking time bomb to me.
I feel like you would get more benefit out of GPT. You could ask it if it finds any vulnerabilities, common mistakes, other inconsistencies. Please provide comments on what each line does. What are some other common ways to write this line of code, etc.
What are some ways to handle this XYZ problem? I see you might have missed SQL injection attacks. Would that apply here?
Same goes for code you find on the internet.
I got this output for this line of code; what do you think the problem is?
Even OpenAI says clearly: you should not, by any means, ask the AI any questions you don't already know the answer to!
> more benefit out of GPT. You could ask it if it finds any vulnerabilities, common mistakes, other inconsistencies. Please provide comments on what each line does. What are some other common ways to write this line of code, etc.
And then it spits out some completely made-up bullshit…
How would you know if you don't understand what you're actually doing?
Every time I've tried chatgpt I've been shocked at the mistakes. It isn't a good tool to use if you care about correctness, and I care about correctness.
It may be able to regurgitate code for simple tasks, but that's all I've seen it get right.
Most managers don't care for people like you. Companies sell their product. Another successful fiscal year. Regardless of the absolute shit code base, wasteful architecture, and gaping security vulnerabilities.
Maybe, but I've always had lucrative jobs and my work has always been appreciated. Maybe you just have to find the right employer. I think longer term, employers that value high-quality work will have the upper hand.
And to be honest, I don't care for managers like that, so the feeling is mutual.
I think it is worthwhile to consider the multiplier relative to the alternatives.
That is, SERPs providing relevant discussion and exact or largely turnkey solutions.
On “easy” tasks in technical niches I’m not familiar with, I would take GPT over DDG + SO the majority of the time.
I’ve had situations where I wanted a tutorial or walkthrough on something and had to mix together various Substack and independent blog posts.
There is no consistency in quality or even correctness from those sources, while you must also deal with format and styling variation.
I block adtech within reason, but the problem of greyhat or SEO-focused filler also isn’t really a thing in GPT. You just get the fat of the land.
The biggest problem with GPT-4 is the cutoff date. The LLM needs to be updated regularly, the way Google initially seemed to crawl all the things and make them available to queries as they appeared.
Data recency and the ability to process excess tokens are going to show OP’s test to be but a toy example of what these systems can do.
In the absence of constant updates of a massive model and all that entails, I foresee companies temporarily dominating attention by providing pretrained LoRA-like add-ons to LLMs at key events.
For example, Apple could release a new model trained on all of the updated Swift libraries coming to market at WWDC shortly after the keynote.
It could contain all the docs and example code, and allow devs to really, really experiment and get the warm and fuzzies about the newest stuff.
It could even include the details on product announcements and largely handle questions of compatibility.
If the companies hosted the topic focused chatgpt-like bots, they could also own all the unexpected questions, and both clarify and retrain on what the most enthusiastic users want to know.
This is going in a bit of a different direction, but I think all of this is very exciting and will hasten the delivery of software for brand new SDKs.
> The biggest problem with GPT-4 is the cutoff date. The LLM needs to be updated regularly, the way Google initially seemed to crawl all the things and make them available to queries as they appeared.
Have you tried out Phind? It's essentially GPT-4 + internet access. It hasn't been perfect, but it's been a very useful tool for me.
Also, chances are great that the AI just spit out some code from that extension… (Of course without attribution. Which would make it a copyright violation.)
Definitely this. I spend about 10-15% of my time writing code, so a 20% increase really doesn't save me a lot of time. Also, AI-generated code requires more reading of code, which is harder and more cognitively expensive than writing code.
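A quick back-of-the-envelope check (Amdahl's law) bears this out; the 15% share and the 2x coding speedup below are assumptions, not measurements:

```python
# Amdahl's law: overall speedup when only part of the work is accelerated.
coding_fraction = 0.15  # assumed share of the day spent writing code
coding_speedup = 2.0    # assumed AI speedup on just that share

overall = 1 / ((1 - coding_fraction) + coding_fraction / coding_speedup)
print(f"{overall:.2f}x overall")  # ~1.08x, nowhere near 10x
```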
Depends on the code. If AI can quickly write even mediocre quality automated tests, that's a tremendous speedup, and, if I'm being totally honest, morale boost.
Throwing away and rewriting tests is work (== negative value).
Testing implementation details is always counterproductive. Have you watched the video? (I don't often recommend videos, as I think writing has more value per time unit, but this talk is a kind of classic on that topic.)
Sorry, I can't accept that a one hour video is the cost to participate in this conversation. I don't disagree that there is a cost associated with doing this, but the cost is so much smaller than it used to be that it can be economical now.
Implementation details are in the eye of the beholder IMO. I'm open to reasons why that's not the case here.
I use it pretty regularly for things like “write me a typed retry decorator in python, and write tests for it” or “parallelize this for loop using a ThreadPoolExecutor”
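For reference, a minimal sketch of the kind of typed retry decorator that first prompt tends to yield (names and defaults are illustrative; Python 3.10+ for ParamSpec):

```python
import functools
import time
from typing import Callable, ParamSpec, TypeVar

P = ParamSpec("P")
T = TypeVar("T")

def retry(tries: int = 3, delay: float = 0.5) -> Callable[[Callable[P, T]], Callable[P, T]]:
    """Retry a function up to `tries` times, sleeping `delay` seconds between attempts."""
    def decorator(func: Callable[P, T]) -> Callable[P, T]:
        @functools.wraps(func)
        def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise
                    time.sleep(delay)
            raise AssertionError("unreachable")  # satisfies the type checker
        return wrapper
    return decorator

@retry(tries=5, delay=0.1)
def flaky_fetch() -> str:
    """Stand-in for a call that sometimes fails."""
    raise TimeoutError("simulated flakiness")
```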
I work with people who have used it almost daily since it came out. None of them have 10x’d anything. Even their open source ChatGPT related projects are stalled and not going anywhere. Not to say it can’t help but I’ve not yet encountered this mythical 10x boost.
As others have observed, maybe most have stopped using it.
Maybe Ray Kurzweil is kind of on the money about computer brain interfaces, that’s when the fun really starts.
No worry. As soon as 80%+ of those managers are replaced by ChatGPT, such nonsense claims will stop, as ChatGPT knows things better than most managers. /s
I did not know what ctags were; here is the explanation from Exuberant Ctags:
>Ctags generates an index (or tag) file of language objects found in source files that allows these items to be quickly and easily located by a text editor or other utility. A tag signifies a language object for which an index entry is available (or, alternatively, the index entry created for that object).
I'm a big fan of ctags. My old Emacs setup utilized Ctags when all else failed. So if I wanted to find a reference for something it would use LSP, then ctags if LSP returned nothing.
I've been thinking that there needs to be a LangChain or similar in-prompt Tool that allows the model to automatically query a Language Server Protocol server. Maybe your tool does everything that could be done that way anyhow. aider looks interesting, I will try it out.
Absolutely, LSP is strictly more capable than ctags for this purpose. I mention it in the future work section of the writeup I linked above.
The main reason I started with ctags is because the `universal ctags` tool supports a ton of languages right out of the box. Each LSP server implementation tends to only support a single language. So it would be more work for users to find, install and stand up the LSP server(s) they need for their particular projects.
I plan to try some experiments with LSP in the future.
I keep reading that you can upload data to one of the GPT interfaces, I'm not sure if it's a ChatGPT plugin or one of the OpenAI APIs, but I wonder if you can include some source code that you can then prompt based on...
I've been kind of tricking GPT, admittedly I'm currently limited to gpt-3.5-turbo, because with the API you can submit whole conversations and ask for the next reply from the assistant. I don't know how much of that whole conversation it re-reads, but I'll do things like:
- Give it a system prompt with guidelines.
- Give it my user prompt, which contains a description of what I want it to do.
- Give it an "assistant response" which I've faked which is like: "Can you tell me more about the database schema?"
- Give it my answer, which is the database schema.
- Give it another faked assistant response where it is asking me about the API endpoints I need.
- Give it my endpoints...
At this point, I then ask it to generate a response.
I've found that if I try to combine all the above information into a single prompt, I often run into token limits.
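In code, the trick looks roughly like this with the pre-1.0 openai Python package; the model, prompts, and the SCHEMA_SQL/ENDPOINT_SPECS placeholders are all illustrative:

```python
import openai  # pre-1.0 API with openai.ChatCompletion

# Hypothetical context blobs; in practice these are your real schema and specs.
SCHEMA_SQL = "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);"
ENDPOINT_SPECS = "POST /users creates a user; GET /users/{id} fetches one."

# The assistant turns below are faked: we script the "questions" so each
# chunk of context arrives as an answer instead of one giant prompt.
messages = [
    {"role": "system", "content": "You are a careful coding assistant."},
    {"role": "user", "content": "Write a handler for creating users."},
    {"role": "assistant", "content": "Can you tell me more about the database schema?"},
    {"role": "user", "content": SCHEMA_SQL},
    {"role": "assistant", "content": "Which API endpoints do you need?"},
    {"role": "user", "content": ENDPOINT_SPECS},
]

# Now ask for the next assistant reply, with all that context in place.
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```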
How much time did it take to prepare all of that for ChatGPT? Won’t you have to redo all of that work every time you ask for more help since code bases are not static? Would it take less time and effort to just write the code on your own?
Ya, it would be tedious to do all of this manually. I guess it wasn't clear, but all of this is part of my open source GPT coding tool called "aider".
aider is a command-line chat tool that allows you to write and edit code with GPT-4.
As you are chatting, aider automatically manages all of this context and provides it to GPT as part of the chat. It all happens transparently behind the scenes.
The onus is on you to prove that they do understand.
One may not simply claim that a machine ‘understands’ because it appears to. Now we’re actually in the territory of the much-misunderstood Occam’s razor: the theory which introduces the fewest new assumptions is the most likely to be correct.
The assumption that a machine ‘understands’ — is capable of such a thing — is new, and requires extreme justification.
Ya, that’s not how proof works. People could demand the same of humans. Failing to understand something is not, in humans, a sign of lacking the capacity for understanding.
> Convey all of this “code context” to GPT in an efficient manner that fits within the 8k-token context window.
> To address these issues, I send GPT a concise map of the whole codebase. The map includes all declared variables and functions with call signatures.
Even if you could encode a function signature in only one token (which isn't possible, of course), this would only work for small single-developer programs, not large projects.
Mid-sized frameworks/libs often have tens of millions of lines of code and hundreds of thousands of function signatures. With 8k tokens to spend, you get nowhere close to a large code base… at, say, 10-20 tokens per signature, 8k tokens covers only a few hundred signatures.
Ya, vector search is certainly the most common hammer to reach for when you're trying to figure out what subset of your full dataset to share with an LLM. And you're right, it's probably a piece of the puzzle for coding with GPT against a large codebase.
But I think code has such a semantically useful structure that we should probably try and exploit that as much as possible before falling back to "just search for stuff that seems similar".
Check out the "future work" section near the end of the writeup I linked above. I have a few possible improvements to this basic "repo map" concept that I'm experimenting with now.
No, I agree the distilled map is most useful in context. However, I wonder if providing a vector store of the total code base amplifies the effect. You could also pull in vector stores of all dependencies as well. Regardless, amazing work, and I'm looking forward to seeing your future work as outlined.
LLMs can put their embeddings into a vector database, extending their short-term memory to encompass an entire codebase, every support ticket, PR, etc...
I haven't seen any turnkey solutions yet good enough to use for development, but it's coming within months, not years.
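The core loop is simple enough to sketch; here it uses the pre-1.0 OpenAI embeddings endpoint and brute-force cosine similarity over hypothetical file-sized chunks (a real setup would chunk smarter and use an actual vector database):

```python
import numpy as np
import openai  # pre-1.0 API with openai.Embedding

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# Index: embed every chunk. These chunks are placeholders for real code.
chunks = {
    "auth.py": "def login(user): ...",
    "db.py": "def connect(): ...",
}
index = {path: embed(code) for path, code in chunks.items()}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query: embed the question and rank chunks by similarity.
q = embed("Where is the login flow implemented?")
best = max(index, key=lambda path: cosine(q, index[path]))
print(best)  # the chunk worth stuffing into the LLM's context window
```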
https://aider.chat/docs/ctags.html