
Agents: How do we know they work?

  • Writer: Cynthia Unwin
  • Jan 15
  • 7 min read

Photo by Dean Pugh on Unsplash

Agentic platforms are everywhere, and we are pushing forward to use more and more AI-driven software. As Site Reliability Engineers we need to think hard about what it means to run diverse agent platforms at scale, and about what needs to be in place to make them manageable. How do we know right now that our agents are working? What do we need to see in the logs to troubleshoot when they don't? What data needs to be gathered at the time of failure? What needs to change in our deployment pipelines? What is an observable agent? What does an agent sanity test look like? What new playbooks do we need to be prepared to support agents? What does agentic operational readiness look like? These are the questions we need to dig into over the next weeks.


So let's start at the beginning and talk about how we know our agents are working. This starts far before production:

  • what do agent unit tests look like?

  • how do we functionally test an agent or agentic software?

  • what does non-functional testing look like?

  • what should be included in a sanity test, and how often should it run?

  • what does an agent health check look like?


Perversely, one of the issues I see currently is that teams building agents are not always rigorous about unit testing the deterministic code in their agent systems. Tools, for example, need unit tests and health checks; tools are just deterministic code. We can't expect our non-deterministic code to work if the basics don't.


Moving past that, we have a fundamental problem with non-deterministic testing: it can be exceedingly hard to either define or evaluate what working looks like from a functional perspective. How good is good enough? We can talk about precision and recall, and there are testing methods we will discuss below, but it can be difficult to assess whether a summarizing agent pulls out the right key points, or whether a search agent selects the top results based on the right, arguably ambiguous, criteria. From the Site Reliability Engineer's perspective, however, it is up to the designer to define what functionally accurate means, and up to us to notice if the software drifts away from meeting that definition.


Pulling this discussion back from more abstract topics, we can look at how agents fail in real life. Unsurprisingly, a certain amount of research has been done on this topic. What we see in practice is a set of common failure patterns that occur independently of the task the agent is performing. [paper]


  1. Incorrect tool use: The agent used the wrong tool, or used the tool wrong.

  2. Premature action without grounding: The agent had access to the correct information but didn't go look it up and instead made up something plausible.

  3. Over-helpfulness under uncertainty: Instead of admitting it didn't know the answer, the agent made up something plausible.

  4. Context failure: The agent made a determination based on the information available, but its context or situational awareness was incomplete.

  5. Context pollution: Although the agent could have identified the extraneous data in its context, it used irrelevant data in making its determination, skewing the usefulness or accuracy of its results.

  6. Fragile execution under cognitive load: Failure to recover from errors, coherence collapse, generation loops, and so on.

  7. Precision and recall: The response provided is incorrect or incomplete.


Agents will fall prey to these issues with varying degrees of damage. If one of the above failures causes a functional issue in a human-in-the-loop process, it makes the agent less useful; but if an AIOps agent suffers the same failure, the results could be operationally catastrophic. We can design and test to avoid these issues during the build, but we cannot assume that even a carefully designed agent that performs well in design and test will avoid these problems once it moves to production. Additionally, different failure types require different solutions. For example, we can test tools and design agents to use them effectively and handle errors when they don't, but context failures or context pollution can infect a previously capable agent with little warning, driven by external influences we don't control.


Moving away from functional issues, we also need to think about a completely different set of non-functional problems. In addition to our regular non-functional concerns, like response time, resource usage, logging, and error handling, we need to consider:

  • response style and whether responses are of an appropriate length;

  • token efficiency;

  • whether the context window is being used effectively;

  • chain-of-thought traceability (it's not enough to trace what happened; we also need to know why an agent made a decision, as in the sketch after this list);

  • response consistency;

  • effectiveness over time (especially in operational processes, failures can be part of complex chains that manifest over time);

  • etc.
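As a sketch of what chain-of-thought traceability can look like at the logging level, the snippet below records not just what an agent did but the rationale it gave and the tokens it spent. The field names and the JSON-lines format are purely illustrative assumptions, not any particular tracing standard.

```python
# Minimal sketch of a structured decision trace, assuming a simple JSON-lines log.
# Field names here are illustrative, not from any specific tracing standard.
import json, time, uuid

def log_decision(agent: str, step: str, decision: str, rationale: str,
                 tokens_used: int, log_path: str = "agent_trace.jsonl") -> None:
    """Append one trace record capturing what the agent did and why."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "agent": agent,
        "step": step,
        "decision": decision,       # what the agent chose to do
        "rationale": rationale,     # why it chose it (chain-of-thought traceability)
        "tokens_used": tokens_used, # feeds token-efficiency reporting
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Example usage with made-up values:
log_decision(
    agent="triage-agent",
    step="select_tool",
    decision="called restart_service instead of reboot_host",
    rationale="service-level restart resolves the alert with lower blast radius",
    tokens_used=412,
)
```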


Not surprisingly, many teams are very focused on functional testing, while non-functional testing lags behind in the race to show value. As agents scale, it will become more and more critical.


Finally, agents are a class of tool, but they do many different types of things using many different patterns. Test requirements for orchestrated processes differ from those for event-driven systems; human-in-the-loop (HITL) and autonomous processes have different test requirements; retrieval systems are different from analysis systems. Each of these will need to be carefully thought through, but if we start with a standard we can expand and grow. Here are the practical recommendations I have, based on experience to date:


Basics:

Tools and other deterministic code need standard unit tests run by developers and in pipelines.
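As a minimal sketch of what this looks like in practice, here is a pytest-style unit test for a hypothetical disk-usage tool; the tool name, signature, and thresholds are illustrative, not from any particular agent framework.

```python
# test_disk_tool.py
# Minimal sketch: a standard pytest unit test for a deterministic agent tool.
# check_disk_usage is a hypothetical example tool, not an API from any real framework.
import pytest


def check_disk_usage(used_bytes: int, total_bytes: int) -> dict:
    """Deterministic tool: returns utilisation and a simple criticality flag."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    pct = round(100 * used_bytes / total_bytes, 1)
    return {"percent_used": pct, "critical": pct >= 90.0}


def test_normal_usage():
    assert check_disk_usage(used_bytes=450, total_bytes=1000) == {
        "percent_used": 45.0,
        "critical": False,
    }


def test_critical_threshold():
    assert check_disk_usage(used_bytes=950, total_bytes=1000)["critical"] is True


def test_invalid_input_is_rejected():
    with pytest.raises(ValueError):
        check_disk_usage(used_bytes=1, total_bytes=0)
```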


Static code analysis and security scanning need to be run and gated. Flagging a problem is a start, but errors need to be fixed before code moves to production. It always surprises me how many enterprise processes include code scanning but do not mandate that action be taken based on the scans.


Functional Testing:


Functional testing of non-deterministic code is hard, but there are strategies to address it. No single strategy will provide a complete solution.


  1. Deterministic scenarios: Tests where only one answer and format is acceptable. For example, processes that load data from different places and use a tool to provide a numeric answer that can be known in advance, or processes that write SQL to return a specific database field based on known input.

  2. Golden questions: More complex questions that require a conversational answer or information processing. Teams identify representative questions and provide a correct answer, and the agent's answer is compared against that standard (see the sketch after this list). This can test for format, precision, recall, and so on. Examples include chatbots that retrieve information from RAG stores and RCA agents that provide suggestions to engineers.

  3. LLM as a judge: A secondary process uses an LLM to judge the quality of an answer against pre-defined criteria (also shown in the sketch after this list). This is good for ensuring that real-world use of the agent is meeting the expected results.

  4. Hypothesis testing: Useful for agents whose actions change the state of their environment. The hypothesis test involves having the agent state the anticipated outcome of a proposed action, which can then be compared to the actual outcome.

  5. Correct action testing: Useful for agents that take action in an environment where the next best action is not orchestrated. An agent may know that deleting the data on a drive will resolve a disk space problem, but it also needs to know that this is not an appropriate action. This can include ensuring that the agent escalates logically: rebooting a server may fix a problem, but the lowest-impact action with the same result may be restarting a service. These tests can use LLM as a judge, rule systems, or a combination of both.

  6. Outcome-based testing: Useful in systems where agents perform different tasks in combination with the aim of maintaining the state of a complex system, these tests assess whether the agents are effective as a group. For example, does the agent team maintain 99.999% availability for the k8s cluster?
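To make a couple of these concrete, here is a minimal sketch of a golden-question harness with an LLM-as-a-judge scorer. The `call_agent` and `call_llm` functions, the question set, and the thresholds are all placeholders you would wire to your own stack.

```python
# Minimal sketch of a golden-question harness with an LLM-as-a-judge scorer.
# call_agent and call_llm are placeholders for whatever clients your stack provides.

GOLDEN_SET = [
    {
        "question": "Which service owns the /checkout endpoint?",
        "golden_answer": "The payments service owns /checkout.",
        "must_mention": ["payments"],
    },
]

def call_agent(question: str) -> str:
    raise NotImplementedError("wire this to the agent under test")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to the judge model")

def keyword_recall(answer: str, must_mention: list[str]) -> float:
    """Cheap deterministic check: fraction of required key points present."""
    hits = sum(1 for kw in must_mention if kw.lower() in answer.lower())
    return hits / len(must_mention) if must_mention else 1.0

def judge_score(question: str, golden: str, answer: str) -> int:
    """LLM as a judge: ask a second model to grade the answer 1-5 against the golden answer."""
    prompt = (
        "Score the candidate answer from 1 (wrong) to 5 (matches the reference).\n"
        f"Question: {question}\nReference answer: {golden}\nCandidate answer: {answer}\n"
        "Reply with a single digit."
    )
    return int(call_llm(prompt).strip()[0])

def run_golden_suite(recall_threshold: float = 1.0, judge_threshold: int = 4) -> bool:
    """Run every golden question and report whether the suite passed."""
    passed = True
    for case in GOLDEN_SET:
        answer = call_agent(case["question"])
        recall = keyword_recall(answer, case["must_mention"])
        score = judge_score(case["question"], case["golden_answer"], answer)
        if recall < recall_threshold or score < judge_threshold:
            passed = False
            print(f"FAIL: {case['question']} (recall={recall:.2f}, judge={score})")
    return passed
```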


Non-functional testing:


Most non-functional testing involves establishing targets and ensuring the system falls within them. These tests are fundamentally deterministic and answer questions like:

  • Was my average token usage below the threshold?

  • How many times did my agent team lose coherence and fail to produce a result within the agreed maximum number of turns?

  • How long does it take to get a response from the LLM?

  • Are the cache statistics within threshold (or am I asking the LLM for everything)?

  • Are the agents making multiple tool calls when one would suffice? What is the error rate for tool calls?

While each test depends on the requirement, it is strongly suggested that these tests be performed using deterministic code and thresholds, just as you would perform non-functional testing on any other software platform.
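As a sketch of that approach, the following shows deterministic threshold checks over metrics collected from agent runs; the metric names and threshold values are illustrative assumptions, not a standard.

```python
# Minimal sketch: deterministic threshold checks over metrics collected from agent runs.
# Metric names and thresholds are illustrative assumptions.

THRESHOLDS = {
    "avg_tokens_per_request": 4000,   # token efficiency
    "max_turns_exceeded_rate": 0.02,  # coherence: share of runs that hit the turn cap
    "p95_llm_latency_seconds": 8.0,   # response time
    "tool_call_error_rate": 0.05,     # tool reliability
    "cache_hit_rate_min": 0.30,       # are we asking the LLM for everything?
}

def check_nonfunctional(metrics: dict) -> list[str]:
    """Return a list of violated thresholds; an empty list means the run is within limits."""
    failures = []
    if metrics["avg_tokens_per_request"] > THRESHOLDS["avg_tokens_per_request"]:
        failures.append("token usage above threshold")
    if metrics["max_turns_exceeded_rate"] > THRESHOLDS["max_turns_exceeded_rate"]:
        failures.append("too many runs losing coherence (max turns exceeded)")
    if metrics["p95_llm_latency_seconds"] > THRESHOLDS["p95_llm_latency_seconds"]:
        failures.append("LLM latency above threshold")
    if metrics["tool_call_error_rate"] > THRESHOLDS["tool_call_error_rate"]:
        failures.append("tool call error rate above threshold")
    if metrics["cache_hit_rate"] < THRESHOLDS["cache_hit_rate_min"]:
        failures.append("cache hit rate below threshold")
    return failures

# Example usage with made-up numbers:
sample = {
    "avg_tokens_per_request": 3650,
    "max_turns_exceeded_rate": 0.01,
    "p95_llm_latency_seconds": 6.2,
    "tool_call_error_rate": 0.08,
    "cache_hit_rate": 0.41,
}
print(check_nonfunctional(sample))  # -> ['tool call error rate above threshold']
```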


Frequency:


Functional and non-functional testing for agent systems cannot be an activity performed only before release. Because of the dynamic nature of agent systems, this testing needs to happen on an ongoing basis, which means it needs to be:

  • Fully automated.

  • Properly provisioned for, from both a token and an ongoing compute standpoint.


The appropriate frequency for retesting or sampling behavior on production systems will differ depending on system complexity and criticality. As a baseline, however, the recommendation is to test or sample question-answering, analytics, and other non-action-taking systems at least once a week, and closed-loop automation (AIOps) daily. While full regression testing may not be warranted, some form of sanity testing that will detect drift is critical; a minimal sketch follows.
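This is one way drift detection on scheduled sanity runs might look, assuming we keep a history of sanity-suite pass rates; the window size and tolerance are illustrative choices, not recommendations for any specific system.

```python
# Minimal sketch of drift detection for scheduled sanity runs: compare today's
# pass rate on a fixed sample of golden questions against a rolling baseline.
from statistics import mean

def drift_detected(history: list[float], current: float,
                   window: int = 7, tolerance: float = 0.10) -> bool:
    """Flag drift if the current pass rate drops more than `tolerance`
    below the mean of the last `window` runs."""
    if len(history) < window:
        return False  # not enough history to judge yet
    baseline = mean(history[-window:])
    return current < baseline - tolerance

# Example: a week of stable runs, then today's sample dips sharply.
history = [0.96, 0.95, 0.97, 0.94, 0.96, 0.95, 0.96]
print(drift_detected(history, current=0.81))  # -> True: raise an alert or open a ticket
```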


Health Checking:


What does it mean for an agent to be available and ready? I suspect this discussion will evolve, but at this time:

  1. Available: I can reach the agent and get an alive response. (Software only; this tests that the agent is reachable and accepting requests.)

  2. Ready: I can send a test question to the agent and receive a response generated by the associated LLM. (This tests that the agent is producing responses that invoke the LLM, not just the "I can't reach my LLM" error path.)

  3. Tools are available: I can deterministically invoke the tools in use and get a response. (This checks tool availability, MCP availability, and so on.)

  4. Agent can invoke tools: Given a test question, the agent can select and use each tool to provide a response.

  5. Hypothesis threshold: The agent reports the number of actions it has taken and the number of actions that produced the expected outcome.
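As a rough sketch of how these levels might translate into code, the snippet below assumes a hypothetical `agent_client` with ping, ask, and tool-invocation methods; none of these names come from a real framework, and the canary questions are placeholders.

```python
# Rough sketch of the five health-check levels above. `agent_client` is a placeholder
# for whatever client your platform provides; the method names (ping, ask, invoke_tool,
# ask_with_trace, action_stats) are assumptions, not a real API.

def check_available(agent_client) -> bool:
    """Level 1: the agent process is reachable and accepting requests."""
    return agent_client.ping() == "alive"

def check_ready(agent_client) -> bool:
    """Level 2: a canary question returns a response generated by the backing LLM,
    not just the "I can't reach my LLM" error path."""
    reply = agent_client.ask("Health check: reply with the single word READY.")
    return "READY" in reply.upper()

def check_tools_available(agent_client, tool_names: list[str]) -> bool:
    """Level 3: each tool can be invoked deterministically and returns a response."""
    return all(agent_client.invoke_tool(name, dry_run=True) is not None
               for name in tool_names)

def check_agent_uses_tools(agent_client, tool_names: list[str]) -> bool:
    """Level 4: given a test question, the agent selects and uses its tools."""
    trace = agent_client.ask_with_trace("Health check: report disk usage on host test-01.")
    return any(call.tool in tool_names for call in trace.tool_calls)

def check_hypothesis_threshold(agent_client, minimum: float = 0.9) -> bool:
    """Level 5: the share of actions whose predicted outcome matched the actual
    outcome stays above an agreed floor."""
    stats = agent_client.action_stats()
    return stats.expected_outcome_rate >= minimum
```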


Conclusion:


Establishing a full testing protocol for agent systems will take work, and it will continue to evolve. However, as we move as an industry towards the pervasive use of non-deterministic software, this discussion will need to be central. As SREs we need to understand what operational readiness for agent systems really means, and the first step is understanding what working looks like.

 
 
 
