
What I learned building an agentic ant colony.

  • Writer: Cynthia Unwin
  • 6 days ago
  • 7 min read

Over the past week or so I have been working on coding an agentic ant colony that restores service to a running application. The agents don't do complex RCA, log tickets, or interact with engineers; they just keep the application up and running. It was definitely fun, and I learned some interesting and useful things.


What I built:

  1. A simple Python-based web ordering application with a front end supported by two micro-services, each with its own database. Pretty simple CRUD stuff.

  2. A simple monitoring display that reports on micro-service health, showing things like total calls, error rate, latency, CPU usage, memory usage, and database health.

  3. A fault injection system that allows me to impose error rates, latency, memory leakage, database failures, and pod crashes (a rough sketch of the idea follows this list).

  4. A simple load generator that pushes traffic to the application.

  5. A colony of five ants that are responsible for maintaining the availability of the application despite my constant fault injection.

  6. Agent tracing using Phoenix.
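
To make the fault injection idea concrete, here's a rough sketch of the flag-based approach I'm describing. The flag names and the middleware shape are illustrative only, not the actual code (which I'll post later):

# Hypothetical sketch of a flag-based fault injector, not the actual code.
# A wrapper checks the active faults on every request and misbehaves accordingly.
import random
import time

ACTIVE_FAULTS = {
    "error_rate": 0.0,      # fraction of requests that return HTTP 500
    "latency_ms": 0,        # artificial delay added to every request
    "db_down": False,       # pretend the database is unreachable
}

def apply_faults(handler):
    """Wrap a request handler so injected faults take effect."""
    def wrapped(request):
        if ACTIVE_FAULTS["latency_ms"]:
            time.sleep(ACTIVE_FAULTS["latency_ms"] / 1000)
        if random.random() < ACTIVE_FAULTS["error_rate"]:
            return {"status": 500, "body": "injected failure"}
        if ACTIVE_FAULTS["db_down"]:
            raise ConnectionError("injected database outage")
        return handler(request)
    return wrapped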


The ants themselves are simple, and all extend a base class with the same directives:


from abc import ABC


class BaseAnt(ABC):
    """
    Base class for all ant agents.
    
    Each ant:
    1. Monitors a specific metric/threshold
    2. Detects when threshold is breached
    3. Formulates a hypothesis about the fix
    4. Takes action using tools
    5. Validates the hypothesis against actual results
    6. Logs everything to shared state
    """

I used GPT-4 as the model for the agents, solely because I am cheap and I have access to a sandbox where it is available. I will run a set of tests next week using different models (e.g. Sonnet) and see what that does to the outcomes.


The agents themselves are built using LangChain. When they get a bit fancier they may become LangGraph, but the whole point here was simple, so simple we have.
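
Roughly, wiring an ant's brain up in LangChain looks something like the sketch below. The model name, prompt, and the restart_pod tool are placeholders, not the actual code:

# Rough sketch of how an ant's brain is wired up with LangChain.
# Model name, prompt text, and the restart_pod tool are placeholders here.
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool

@tool
def restart_pod(pod_name: str) -> str:
    """Restart a pod using a pre-tested script (proxy for Ansible etc.)."""
    return f"restarted {pod_name}"

llm = ChatOpenAI(model="gpt-4", temperature=0)   # swap the model here for the Sonnet tests
ant_brain = llm.bind_tools([restart_pod])        # the ant can only call approved tools

response = ant_brain.invoke(
    "p95 latency is 900ms and shared memory shows no recent actions. What should you do?"
)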


I manually set the thresholds that the ants would react to, so there is no nuance here. Smart ants would be more adaptable to the environment than the on/off we have; however, smart ants also wouldn't be deployed using a monitoring system that I built on my laptop during a meeting about something else, so I figured fair is fair.
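
For what it's worth, the thresholds really are just hard-coded values along these lines (the numbers are illustrative, not my real settings):

# Illustrative hard-coded thresholds; the real values live in my monitoring hack.
THRESHOLDS = {
    "error_rate": 0.05,       # fraction of failed calls before the error ant wakes up
    "p95_latency_ms": 500,    # latency ceiling for the latency ant
    "memory_pct": 85,         # memory usage that triggers the memory ant
    "cpu_pct": 90,            # CPU usage that triggers the CPU ant
    "db_healthy": True,       # the database ant reacts to any unhealthy report
}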


I wrote all the code using IBM Bob. Like every line. I have used other coding tools like Claude Code, but I genuinely prefer Bob. This isn't the Blue Kool-Aid talking either. Bob is slick. It does mean that some parts of the code (for better or worse) don't resemble how I would have built them myself, but the overall architecture and design principles remain true to my original intention. On occasion that required some extended discussion with Bob, but I'm generally happy with where we got to in the end. I will post the code after I complete the next generation.



Architectural Principles:

  • Choreography, not orchestration. Agents have skills and respond to events, but there is no central controller that tells the ants what to do.

  • The only form of inter-ant communication in the system is access to shared memory.

  • Actions taken by ants in the environment can only be taken through the use of pre-tested tools (my proxy for Ansible scripts etc.) or a list of allowed commands. Ants do not create new ant behavior on the fly; they can make novel use of existing ant instincts, but they aren't learning new ant tricks. (I'm considering giving them the ability to request new ant skills if they find they need something they aren't allowed to do, as a form of controlled ant evolution.)

  • Ants monitor their specific thresholds; they are not making complex decisions about which thresholds are relevant in each scenario. They see a breach, they read the shared memory, they do their thing.

  • Each ant is responsible for recording what it thinks the result of any action it takes will be, and then evaluating whether that is actually what happened, as part of every action cycle (a sketch of such a record follows this list).
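
To make that last principle concrete, here's a hypothetical shape for a shared-memory record and its validation. The field names are illustrative, not the actual schema:

# Hypothetical shape of a shared-memory entry; field names are illustrative.
# Every ant appends a record like this when it acts, and validates it on a later cycle.
import time

def record_action(shared_memory, ant, action, target, hypothesis, expected_metric, expected_value):
    shared_memory.append({
        "timestamp": time.time(),
        "ant": ant,
        "action": action,            # must be one of the pre-tested tools
        "target": target,
        "hypothesis": hypothesis,
        "expected": {expected_metric: expected_value},
        "validated": None,           # filled in later: True / False
    })

def validate_last_action(shared_memory, ant, current_metrics):
    """Check whether the ant's most recent prediction actually came true."""
    for entry in reversed(shared_memory):
        if entry["ant"] == ant and entry["validated"] is None:
            metric, expected = next(iter(entry["expected"].items()))
            entry["validated"] = current_metrics.get(metric, float("inf")) <= expected
            return entry["validated"]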


The Outcome:


While my test system is reasonably simple, the ant colony is perfectly capable of keeping it available as long as the ants have the tools needed to do what they need to do. For example, if I artificially exceed the space available for the database, the current iteration of ants can't help us because they can't expand the database. However, when they do have the tools required, they can resolve both simple and cascading problems. The ants have no problem restarting pods with memory leaks, removing artificial network latency, etc., but they can also deal with response latency in a service that is caused by an external database being unavailable. While I wouldn't let these guys loose on a production system yet, they actually exceeded my expectations for a first try and have proven themselves competent and consistent within their defined scope.


What I learned:


Agents (at least built on GPT-4) aren't good at thinking about the passage of time unless they are directed to consider it. My first ants easily identified that a database had been restored, but they struggled with what "recently" means. The latency ant would report that the observed latency could be caused by the database not responding, and it could see that the database ant had restored the database one second ago, but since that wasn't "recently", it should look for another cause. This was probably the most persistent problem I faced, as I really didn't want to specify how long to wait after another ant had acted; that really is something that requires judgement. In the end I had to give the ant a rule that said it should hold off on taking an action if the trend for the metric it was watching was improving. If I didn't, the latency and error rate ants would obsessively try different actions for the 2-3 minutes it took for these average rates to return to normal after a database failure. I had originally tried having the ant look at current values to see if they were normal, but that required a more complex understanding of "normal" than my hacked-together monitoring system was capable of providing. I will do more work on average metrics in the next version, as I fear there are still flaws in my existing implementation.
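
The rule I ended up with is essentially a trend check before acting. Here's a minimal sketch; the window size and the half-window comparison are my simplification rather than the exact logic:

# Minimal sketch of the "hold off if the trend is improving" rule.
# The window length and half-window comparison are illustrative simplifications.
def trend_is_improving(samples, window=6):
    """Return True if the newer half of the window is better than the older half."""
    recent = samples[-window:]
    if len(recent) < window:
        return False                      # not enough data to judge a trend
    half = window // 2
    older_avg = sum(recent[:half]) / half
    newer_avg = sum(recent[half:]) / half
    return newer_avg < older_avg          # lower latency / error rate is better

def should_act(samples, threshold):
    latest = samples[-1]
    return latest > threshold and not trend_is_improving(samples)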


Because the agents are choreographed, they will do things at the same time and conflict with each other. For example, the error rate ant will try to clear the error rate on a pod while the memory ant is restarting that same pod. The end state is a working system, but the error ant itself will report failures (and pollute the shared memory with that failure), making it difficult to track which ants are being effective and which aren't. I haven't decided if the problem is the division of ant responsibility, a shared memory challenge, or a problem with how I am measuring success. This type of race condition is common in event-based systems, and I suspect the answer to how we should deal with it can be informed by what we have learned building any sort of choreographed architecture. For now I have decided not to try to stop this from happening, but to make the ants smart enough to understand that it can happen: they need to be aware that there are other ants without getting hung up on coordinating with them. We will see how this goes.
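
In code terms, the "aware but not coordinating" idea looks roughly like this; the field names, helpers, and the transient-noise flag are illustrative, not the real implementation:

# Sketch of the "be aware of other ants without coordinating with them" idea.
# Instead of locking, an ant just notes that another ant acted on the same
# target very recently and tempers its own conclusions. Names are illustrative.
import time

def recent_actions_on(shared_memory, target, within_seconds=120):
    now = time.time()
    return [
        entry for entry in shared_memory
        if entry["target"] == target and now - entry["timestamp"] <= within_seconds
    ]

def act_with_awareness(ant, shared_memory, target):
    others = [e for e in recent_actions_on(shared_memory, target) if e["ant"] != ant.name]
    if others:
        # Another ant (e.g. the memory ant restarting this pod) is already in flight;
        # expect some transient errors and don't count them as our own failure.
        ant.expect_transient_noise = True
    ant.act_on(target)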


The current version of the ants doesn't learn from past mistakes. As mentioned above, there is no ant evolution going on here. The colony will put an ant to sleep if it consistently fails to get expected results, but ants don't learn anything from their failures. For example, if I disable the override for the latency injection feature, the latency ant will continue to reset the latency setting indefinitely, with no result, without thinking "maybe I should try something else". It has a reset hammer and it is going to use it. I want the ant to see the repeated failures and think "something else must be wrong". To get the behavior I wanted, I had to actually tell it to stop and think about another fix if it has seen multiple failures. While I don't think this will cause me problems down the road, I worry that I'm somewhat orchestrating the ants' actions in my instructions, and that defeats the whole purpose of the experiment. Perhaps I'm over-thinking this one. Interestingly, as noted above, the ants seem better at recognizing that they need to wait for another ant to do something than they are at seeing that trying their own failed action over and over again won't fix a problem.
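
Expressed as code rather than prompt text, the instruction amounts to something like this (the counter and limit are illustrative; in practice it lives in the ant's prompt):

# Sketch of the "stop repeating a failed action" instruction, expressed as code.
MAX_REPEATED_FAILURES = 3

def choose_action(history, default_action):
    """history is the ant's own past (action, succeeded) pairs from shared memory."""
    recent_same = [h for h in history[-MAX_REPEATED_FAILURES:] if h[0] == default_action]
    if len(recent_same) == MAX_REPEATED_FAILURES and not any(ok for _, ok in recent_same):
        # The reset hammer has missed three times in a row; something else must be wrong.
        return "escalate_and_rethink"
    return default_action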


Finally, because the goal of the system is only to restore service (not to correct underlying problems), the ants will get into loops of maniacally restarting pods, etc. Much like automation that clears out a log directory when it gets full will merrily delete your Tomcat logs every 12 minutes for 3 months without ever raising a flag that says "hey, why is this Tomcat log directory filling up every 12 minutes?", the current iteration of ants isn't smart enough to say "I've fixed a memory leak in this pod every 7 minutes all day; there must be something that needs an actual fix here". To me this is the next step in their growth. To do it they will need better situational awareness than I have provided, more advanced memory, and probably the capacity to hand off tasks to specialist ants that deal with more complex investigation (an L2 investig-ant?). I want to push the simple model as far as it can go before I complicate it with multiple layers, but there is a point where we will lose the simplicity if we try to make the individual ants more sophisticated. Perhaps I should think of it as passing problems on to a different ant specialization: the restoration-of-service ant can pass a problem on to a code-fix ant? We will need to see how it comes together.
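
The missing check is essentially a "how often have I applied this same fix lately?" counter, something along these lines (the names and limits are illustrative):

# Sketch of the missing "why do I keep fixing this?" check: count how often the
# same fix has been applied to the same target recently. Names and limits are illustrative.
import time

def looks_chronic(shared_memory, action, target, window_seconds=3600, limit=5):
    now = time.time()
    repeats = [
        e for e in shared_memory
        if e["action"] == action and e["target"] == target
        and now - e["timestamp"] <= window_seconds
    ]
    return len(repeats) >= limit   # time to hand off to the investig-ant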


All in all the experiment has been an adventure and I look forward to sharing what I learn next.




