[Emerging Standards] A2A Protocol: New Era of Agent Collaboration or Just APIs in a Trench Coat?
When announcing new things, we tend to get a bit too excited. This is an attempt to bring us back down to earth. It doesn't mean that the idea of a protocol for agent communication and collaboration is bad.
On April 9th, in a blog post with appropriate levels of fanfare, several senior Google figures announced the Agent2Agent protocol that would usher in “a new era of Agent Interoperability”.
I am not exactly sure what the old era of Agent Interoperability was, but rest assured a new era is on its way. Excellent.
So let’s dive in. What exactly does the A2A protocol give us:
“The A2A protocol will allow AI agents to communicate with each other, securely exchange information, and coordinate actions on top of various enterprise platforms or applications.” - Google Blog Post
Ok. That sounds interesting. Why do we need it?
“To maximize the benefits from agentic AI, it is critical for these agents to be able to collaborate in a dynamic, multi-agent ecosystem across siloed data systems and applications. Enabling agents to interoperate with each other, even if they were built by different vendors or in a different framework, will increase autonomy and multiply productivity gains, while lowering long-term costs.” - Google Blog Post
Got it. So Google (and several other providers) are envisioning a future where (LLM-powered) applications will be reaching across silos and asking other (LLM-powered) applications to perform tasks for them. An application-to-application interface, if you will - an Application Programming Interface even, or API for short. Fascinating.
Anybody else trying this? Here is something:
the essential infrastructure to integrate AI agents with your entire IT ecosystem, allowing them to autonomously interact with APIs and drive efficient, high-quality customer interactions at scale. This ensures that AI agents can operate effectively within a secure and governed environment, enhancing overall business agility and performance.
This is Mulesoft Anypoint - from Salesforce. Salesforce also supports the A2A protocol. That’s a bit confusing. So which one do we use, when? Products like Mulesoft have been around for ages. Why exactly do we need a specific Agent 2 Agent protocol? It seems we’ve been grappling with integration issues for a while here.
I know! A2A is when we want to get two Agents to interoperate. We’ve figured out that an API or data integration could not possibly solve the problem. The methods that we have decades of experience on simply will not cut it. We need to apply agentic capabilities and we need LLM-powered decision making. Got it.
Then I am sure that A2A will be solving tough, agent-specific issues, not the age-old API and data integration issues. This is about a new agentic era! Luckily, we even know what the issues are. After all, we’ve been doing multi-agent systems research for several decades. We know what the challenges are around discovery, trustworthiness, reliability and so on. So let’s not rush to conclusions about A2A. Let’s dive into the protocol and figure out how it addresses the really tough issues. I am pumped!
A2A design principles
Let’s have a look at the design principles for A2A.
Embrace Agentic Capabilities. A2A focuses on enabling agents to collaborate in their natural, unstructured modalities, even when they don’t share memory, tools and context. We are enabling true multi-agent scenarios without limiting an agent to a “tool.”
Ok. This is good. You’d hardly want an Agent 2 Agent protocol to not embrace agentic capabilities. It would really, truly be just an API then, I guess. I am not sure what a “natural, unstructured” modality is, but it sounds great. Ok. What next?
Build on existing standards: The protocol is built on top of existing, popular standards including HTTP, SSE, JSON-RPC, which means it’s easier to integrate with existing IT stacks businesses already use daily.
Cool, cool, cool. Imagine if we had to put in place entirely new standards - like a REPLACEMENT FOR HTTP. No, this is wise, very wise.
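To see just how mundane that plumbing is, here is a sketch of the JSON-RPC 2.0 envelope that rides over plain HTTP. The method name and parameter shapes below are my own illustration, not quoted from the spec:

```python
import json

def make_jsonrpc_request(method: str, params: dict, request_id: int) -> str:
    """Build a JSON-RPC 2.0 envelope - one of the existing web standards A2A reuses."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": method,
        "params": params,
    })

# Hypothetical A2A-style call: the "tasks/send" method name and the
# params shape are illustrative assumptions, not the spec verbatim.
req = make_jsonrpc_request(
    "tasks/send",
    {"id": "task-123",
     "message": {"role": "user",
                 "parts": [{"type": "text", "text": "Source three candidates"}]}},
    request_id=1,
)
```

Nothing exotic, in other words: the same request/response machinery every web API has used for years.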
Secure by default: A2A is designed to support enterprise-grade authentication and authorization, with parity to OpenAPI’s authentication schemes at launch.
Authentication and authorisation. Perfect. Good thing this didn’t slip by. I am sure this will pretty much handle any and all security issues with having two LLM-powered applications hallucinate their way through complex tasks. Fantabulous.
Support for long-running tasks: We designed A2A to be flexible and support scenarios where it excels at completing everything from quick tasks to deep research that may take hours or even days when humans are in the loop. Throughout this process, A2A can provide real-time feedback, notifications, and state updates to its users.
Interesting. Yup - sometimes applications, sorry, agents do take a long time. We need a way to keep in touch and update other applications, sorry, agents on what is happening. Looking forward to diving into the detail of this one. I wonder if I could subscribe to some sort of event and then be notified. That would be revolutionary.
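And indeed, the subscription mechanism is the venerable Server-Sent Events standard. Here is a minimal sketch of parsing an SSE stream of task-status updates - the event name and JSON payload shape are my own illustration, not the spec's:

```python
def parse_sse_events(stream: str) -> list[tuple[str, str]]:
    """Parse a Server-Sent Events stream into (event, data) pairs.

    SSE events are separated by blank lines; each event carries
    'event:' and one or more 'data:' fields."""
    events = []
    for block in stream.strip().split("\n\n"):
        event, data = "message", []
        for line in block.splitlines():
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data.append(line[len("data:"):].strip())
        events.append((event, "\n".join(data)))
    return events

# A hypothetical stream of task-status updates, as an agent might push them:
stream = (
    "event: task-status\n"
    'data: {"id": "task-123", "state": "working"}\n'
    "\n"
    "event: task-status\n"
    'data: {"id": "task-123", "state": "completed"}\n'
)
updates = parse_sse_events(stream)
```

Subscribe, get notified, update state: exactly the long-running-job pattern the IT world has been using for decades.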
Modality agnostic: The agentic world isn’t limited to just text, which is why we’ve designed A2A to support various modalities, including audio and video streaming.
Nice. This clears things up about that natural, unstructured modality mentioned in the first principle. I thought that might mean different modalities but actually this modality agnostic thing means that. So that must mean something else. Super clear.
So we want to support agents, securely, using existing standards, as they work on tasks and notify each other. These are all good requirements. There are some light design principles in there somewhere, but they read mostly as specific decisions rather than guidelines on how to make decisions. Anyway, let’s not get caught up in definitions. How about a bit more detail on how all this works?
How A2A actually works
The GitHub repo and this video provide some good insight into the detail.
Discovery of Agents and Skills
The first bit is discovery. My agent will identify a need for a skill or a task that it (presumably) cannot perform itself and will look into a “tool registry” for agents that might be able to perform this task. Agents describe their capabilities through AgentCards. These cards describe both practical information to facilitate communication and information exchange and, crucially, the skills of the agent - the things it is able to do.
One interesting question here is how we unambiguously describe a skill without relying on prior shared knowledge or assumptions. How do we know that the advertised skill is exactly what we need? When we hire people we put out a job ad, we look at a CV, but then we run an interview to make sure the candidate is a good fit. With agents, if they say they can “run HR activities such as candidate sourcing”, are we just going to assume that this will do everything we need, or is there an implicit assumption that it will be good enough? If we have implicit interoperability assumptions in place, where does that leave the dynamic, autonomous agentic future?
Also, it’s not clear how my agent decides that it cannot do the task itself but I assume that this is similar to choosing a tool, only in this case the tool is a whole other agent and the protocol is A2A rather than standard tool calling.
How agents are dynamically discovered is also not tackled. For the time being, I assume that if this were used in a production setting it would simply be a case of letting all the agents involved know about all the agents they might use.
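To make the discovery discussion concrete, here is a sketch of an AgentCard as a plain dictionary plus a naive registry lookup. The field names loosely follow Google's published examples but should be treated as assumptions rather than the exact schema; the lookup illustrates the ambiguity problem - matching an advertised skill string tells you nothing about whether the agent is actually fit for your purpose:

```python
# Hypothetical AgentCard - field names are an assumption, not the exact schema.
agent_card = {
    "name": "HR Sourcing Agent",
    "description": "Sources and screens job candidates.",
    "url": "https://agents.example.com/hr-sourcing",  # hypothetical endpoint
    "version": "1.0.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "skills": [
        {
            "id": "candidate-sourcing",
            "name": "Candidate sourcing",
            "description": "Run HR activities such as candidate sourcing.",
        }
    ],
}

def find_agents_with_skill(registry: list[dict], skill_id: str) -> list[dict]:
    """Naive 'tool registry' lookup: exact match on an advertised skill id.

    This is the CV-without-an-interview step: a string match says nothing
    about whether the skill does what we actually need."""
    return [card for card in registry
            if any(s["id"] == skill_id for s in card.get("skills", []))]

matches = find_agents_with_skill([agent_card], "candidate-sourcing")
```

The hard part - deciding whether "candidate sourcing" as advertised matches "candidate sourcing" as needed - happens entirely outside this lookup.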
Tasks
The chosen agent will then be assigned a task. I am genuinely disappointed that the protocol designers went with the idea of a task rather than a goal. The whole thing about agents is goal-directed behavior. This was the one chance to make it clearly conceptually different. A goal is a description of a desirable state of affairs, a task is something a bit lower level. Oh well. This is where we are. We assign a task to an agent and a task can be in one of various states. Much like a job. I wonder if there are decades of tools helping us to manage long-running jobs in the IT world?
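The task lifecycle can be sketched as a small state machine. The state names below mirror the ones the protocol documentation talks about; the set of allowed transitions is my own assumption:

```python
# Hypothetical state machine for an A2A task. State names follow the
# protocol docs; the legal transitions are an assumption for illustration.
TASK_TRANSITIONS = {
    "submitted": {"working", "canceled"},
    "working": {"input-required", "completed", "failed", "canceled"},
    "input-required": {"working", "canceled"},
    "completed": set(),  # terminal
    "failed": set(),     # terminal
    "canceled": set(),   # terminal
}

def advance(state: str, new_state: str) -> str:
    """Move a task to a new state, rejecting illegal transitions."""
    if new_state not in TASK_TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

# A task behaves much like any long-running job:
state = advance("submitted", "working")
state = advance(state, "input-required")
state = advance(state, "working")
state = advance(state, "completed")
```

Which is to say: it is a job queue with a pause-for-input state, a pattern every workflow engine already implements.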
Collaboration
The other agent will then review the task and may have some questions. There is a specific ‘input-required’ state for this. That’s it. There is not really much to say about the collaboration mechanism other than that it is very rudimentary right now. It is geared to some quite specific tasks that seem to be making some strong implicit assumptions. It’s not as if there are decades of research in speech act theory, and in agent communication languages based on speech-act theory, going back to the 1970s.
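A minimal simulation of that ‘input-required’ round-trip shows how thin the mechanism is. The task and message shapes here are entirely illustrative, not the spec's schema:

```python
# A pretend remote agent that needs exactly one clarifying input.
# All field names ("inputs", "question", "result") are hypothetical.
def remote_agent(task: dict) -> dict:
    if "deadline" not in task["inputs"]:
        task["state"] = "input-required"
        task["question"] = "What is the hiring deadline?"
    else:
        task["state"] = "completed"
        task["result"] = f"Sourced candidates by {task['inputs']['deadline']}"
    return task

task = {"id": "task-123", "state": "submitted", "inputs": {}}
task = remote_agent(task)                      # agent asks its one question
if task["state"] == "input-required":
    task["inputs"]["deadline"] = "2025-06-01"  # hypothetical answer
    task = remote_agent(task)                  # task resumes and completes
```

Ask a question, get an answer, resume: that is the entire negotiation model, with none of the richer dialogue structure the older agent-communication-language work explored.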
Conclusions
It might not feel like it given this post but I do think work on agent-to-agent protocols is very useful. However, this feels rushed. It feels like a proof-of-concept of some technical capabilities without the deep thought that needs to go into an actual protocol. It feels like Google rushing to occupy some space because they have to win everywhere and Anthropic got a head start with MCP.
The thing that gets me is what will inevitably happen: some people will get excited, something will be implemented and will not work, and then the assumption will be that agent-to-agent collaboration is not useful, rather than that the thinking behind this particular protocol was too rushed.
This protocol does not solve any of the hard conceptual problems. It provides an interesting point of reference to think about them. An organisation as big as Google, however, should be driving standards-setting in a more organised way. We have the W3C and the IETF; there are bodies that can support standards in a much more structured way. It is clear that international organisations are not popular these days, but that doesn’t mean they are not useful. It seems that we just prefer to barge straight ahead, push out some half-baked idea, generate the hype that comes with it (a new era!) and then deal with the consequences later.