Implementing a Basic LLM Agent in Python

The following are some notes I took on the process of implementing a simple LLM-powered agent from scratch in Python. The core of my implementation was taken from this post by Simon Willison. My goal was to better understand the capabilities and limitations of an agent in its most basic form.

  • When you look at how simple the code / prompt is, it's really amazing that this works at all.
  • You really just tell the model that it has some tools, tell it to think about which one it needs to answer the question, then set up some code that executes the tool the model picks and feeds the result back, and repeat (a minimal sketch of this loop is included after these notes).
  • Yet the difference between the simplest and the most advanced implementations isn't all that big. If you look at this alleged leaked ChatGPT system prompt, it tells the model about the tools it has at hand in exactly the same way.
  • The implementation could be much more robust. Some things that would be helpful (a small hardening sketch also follows these notes):
    • Parsing of inputs to and outputs from each tool to ensure consistent formatting
    • Error handling
    • Better retry logic. Right now, if it hits the maximum number of iterations, it simply fails.
  • It seems obvious that if you had a set of tools you planned to stick with, you could fine-tune the LLM and make it much better at reasoning about which tool to select at query time
  • The model matters a lot
    • If you use GPT-3.5 versus GPT-4 you will get completely different results. You can see how their reasoning about what information they need diverges when asked the same question.
    • I’ve experienced the same thing when building more complex agents. GPT-4 would get to the right answer a much higher percentage of the time.
  • Maybe better models make agents just work
    • Following the above, it's reasonable to ask whether agents as designed now might just work once something like GPT-5 comes along. Seems likely. When you inspect the ways in which they get things wrong, it's usually a reasoning error that would be obvious to a human, which indicates that a better model might be all you need.
  • It's interesting to compare which types of questions the GPT-4 API gets wrong that even a simple agent architecture can get right.
    • Example: How old is the US and what’s its age raised to the 4th power? The GPT-4 API gets the age right from its knowledge of the world and then proceeds to completely make up what that number raised to the 4th power is. The agent, by contrast, looks up the date the country was founded, calculates the elapsed time from then until now in days, and then raises that to the 4th power (spelled out in a short snippet below).
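
Here is a minimal sketch of the loop described in the second bullet. This is not Simon Willison's exact code; it assumes the `openai` Python package (v1 client interface), and the tool names, prompt wording, and `run_agent` helper are all illustrative stand-ins:

```python
import re
from openai import OpenAI  # assumes the openai package, v1+ client interface

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Two toy tools the model is told about (illustrative stand-ins only).
def calculate(expression: str) -> str:
    return str(eval(expression))  # fine for a toy; never eval untrusted input in real code

def us_founding_date(_: str) -> str:
    return "1776-07-04"

TOOLS = {"calculate": calculate, "us_founding_date": us_founding_date}

SYSTEM_PROMPT = """You run in a loop of Thought, Action, PAUSE, Observation.
Use Thought to reason about the question. Use Action to run one of your tools,
then output PAUSE and wait for the Observation. Available actions:
calculate: <python expression>  - evaluates the expression
us_founding_date: <anything>    - returns the founding date of the United States
When you know the final answer, output: Answer: <your answer>"""

ACTION_RE = re.compile(r"^Action: (\w+): (.*)$", re.MULTILINE)

def run_agent(question: str, model: str = "gpt-4", max_turns: int = 5) -> str:
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": question}]
    for _ in range(max_turns):
        response = client.chat.completions.create(model=model, messages=messages)
        reply = response.choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "Answer:" in reply:          # the model decided it is done
            return reply
        match = ACTION_RE.search(reply)
        if not match:                   # no tool requested and no answer: give up
            return reply
        tool_name, tool_input = match.groups()
        tool = TOOLS.get(tool_name)
        observation = tool(tool_input.strip()) if tool else f"Unknown tool: {tool_name}"
        # Feed the tool's result back in and let the model keep reasoning.
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "Gave up: hit max_turns without an answer"

if __name__ == "__main__":
    print(run_agent("How old is the US and what is its age raised to the 4th power?"))
```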
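
On the robustness bullets: one way to handle the first two points (and to turn hard failures into something the model can recover from) is to wrap every tool call so that malformed arguments come back as a structured observation rather than an exception. The `run_tool_safely` helper and the JSON-argument convention below are assumptions of mine, not part of the original implementation:

```python
import json

def run_tool_safely(tool, raw_input: str) -> str:
    """Parse the model's tool arguments as JSON, run the tool, and return
    errors as structured observations instead of raising."""
    try:
        args = json.loads(raw_input)                 # enforce a consistent input format
        return json.dumps({"result": tool(**args)})  # ...and a consistent output format
    except (json.JSONDecodeError, TypeError) as exc:
        # Returning the error lets the loop show it to the model as an Observation,
        # so the model can retry with corrected arguments instead of the run failing.
        return json.dumps({"error": str(exc)})
```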
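
For concreteness, the tool chain the agent runs for that last example boils down to the short calculation below (the exact numbers depend on the day you run it):

```python
from datetime import date

founded = date(1776, 7, 4)                   # Declaration of Independence
age_in_days = (date.today() - founded).days  # the intermediate "observation"
print(age_in_days, age_in_days ** 4)         # the final calculation step
```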