
Ajwad Ali on Building Agentic AI Inside Insurance

In this conversation, we spoke with Ajwad Ali, Head of AI Engineering at AXA UK, about how agentic AI systems are architected, governed, and deployed inside large insurance organizations.

1. You’ve spoken about AXA’s shift from conversational AI toward more agentic systems. In insurance, workflows like claims can span weeks and involve many state transitions. How do you architect long-lived agent memory without relying on fragile LLM context windows? Do you treat state as a first-class deterministic system outside the model, or is retrieval still doing most of the work?

Ajwad Ali: Managing state and context engineering is a key challenge within our Agentic systems. The approach we’ve taken at AXA is to develop our own framework for Agentic systems that standardises how agents communicate with each other. This was done before developing any agents, to ensure they are built consistently and share a common understanding of context. The framework is based on an event-driven architecture, which enables context engineering as well as stability for the Agentic platform.
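
To make the idea concrete, here is a minimal sketch of what a standardised, event-driven contract between agents could look like. The event fields, topic names, and in-memory bus are illustrative assumptions, not AXA’s internal framework.

```python
# Illustrative sketch only: a minimal event-driven contract between agents.
# Field names, topics, and the in-memory bus are assumptions, not AXA's framework.
from dataclasses import dataclass, field
from collections import defaultdict
from typing import Callable
import time
import uuid


@dataclass
class AgentEvent:
    topic: str                 # e.g. "claim.document.extracted"
    payload: dict              # structured output from the emitting agent
    correlation_id: str        # ties every event to one claim/workflow
    emitted_by: str            # agent name, for auditability
    timestamp: float = field(default_factory=time.time)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class EventBus:
    """Shared bus so agents exchange context via events, not raw prompts."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[AgentEvent], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[AgentEvent], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, event: AgentEvent) -> None:
        for handler in self._subscribers[event.topic]:
            handler(event)


# Usage: an orchestrator listens for extraction results without knowing
# anything about how the extraction agent was implemented.
bus = EventBus()
bus.subscribe("claim.document.extracted",
              lambda e: print(f"[{e.emitted_by}] {e.payload}"))
bus.publish(AgentEvent(topic="claim.document.extracted",
                       payload={"policy_number": "ABC-123"},
                       correlation_id="claim-42",
                       emitted_by="doc-extraction-agent"))
```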

For state management, it very much depends on the use case. Our AI solutions at AXA serve multiple business areas such as health, commercial, and retail. Some scenarios, such as policy document lookup, can be handled predominantly using retrieval, whereas claims workflows require more complex state management that augments deterministic logic with LLM-based decisioning.
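
One way to read “deterministic logic augmented with LLM-based decisioning” is a claim state machine held outside the model, where an LLM may only suggest transitions the machine already allows. The states and transitions below are hypothetical, chosen purely for illustration.

```python
# Illustrative sketch: claim state is held in a deterministic machine outside the
# model; an LLM may *suggest* a transition, but only allowed edges are applied.
# States and transitions here are hypothetical, not AXA's actual workflow.
from enum import Enum


class ClaimState(Enum):
    SUBMITTED = "submitted"
    DOCUMENTS_VERIFIED = "documents_verified"
    ASSESSED = "assessed"
    APPROVED = "approved"
    REJECTED = "rejected"


ALLOWED_TRANSITIONS = {
    ClaimState.SUBMITTED: {ClaimState.DOCUMENTS_VERIFIED, ClaimState.REJECTED},
    ClaimState.DOCUMENTS_VERIFIED: {ClaimState.ASSESSED},
    ClaimState.ASSESSED: {ClaimState.APPROVED, ClaimState.REJECTED},
}


def apply_transition(current: ClaimState, proposed: ClaimState) -> ClaimState:
    """Deterministic gate: an LLM-proposed next state is applied only if legal."""
    if proposed in ALLOWED_TRANSITIONS.get(current, set()):
        return proposed
    raise ValueError(f"Illegal transition {current.value} -> {proposed.value}")


# An agent's suggestion is validated before the claim record is updated.
print(apply_transition(ClaimState.SUBMITTED, ClaimState.DOCUMENTS_VERIFIED))
```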

2. Insurance rules are binary, but large language models are probabilistic. When an agent’s reasoning suggests flexibility, but a hard underwriting or compliance rule forbids it, how is that conflict resolved in your architecture? Where exactly does deterministic code override model inference?

Ajwad Ali: This is an area where we get a lot of challenge and questions about what value AI brings when the outcome is binary. The answer here is to have a good orchestrator within AI systems. The way we’ve developed Agentic systems is to be multi-agent, with specialised domain and task agents whose outputs go to an orchestrator. The orchestrator is the top layer and is a hybrid agent that combines LLM-driven decisioning with hard-coded rules. It’s at the orchestrator layer that we can build in rules and logic for overriding model output.

Another aspect is that we are very much implementing a human-in-the-loop approach at AXA, where outputs are verified and reviewed by humans. That gives us a little more freedom to experiment with how much decisioning is appropriate for the orchestrators.
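
A minimal sketch of that layering might look like the following: deterministic compliance rules take precedence over whatever the model proposes, and low-confidence cases fall back to a human reviewer. The rule names, thresholds, and agent interface are assumptions for illustration, not AXA’s actual rulebook.

```python
# Illustrative sketch: an orchestrator where hard rules override model output
# and uncertain cases are routed to a human reviewer. Rules and thresholds
# are assumptions for illustration only.
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentOutput:
    decision: str        # e.g. "approve" / "refer"
    confidence: float
    rationale: str


def compliance_rules(claim: dict) -> Optional[str]:
    """Deterministic rules that take precedence over any model reasoning."""
    if claim.get("policy_lapsed"):
        return "reject"          # hard underwriting rule: no cover, no payout
    if claim.get("amount", 0) > 50_000:
        return "human_review"    # hard threshold: always escalate large claims
    return None


def orchestrate(claim: dict, model_output: AgentOutput) -> str:
    override = compliance_rules(claim)
    if override is not None:
        return override                   # deterministic code overrides inference
    if model_output.confidence < 0.8:
        return "human_review"             # human-in-the-loop fallback
    return model_output.decision


print(orchestrate({"amount": 12_000}, AgentOutput("approve", 0.92, "low risk")))
```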

3. As agent autonomy increases, failure attribution becomes harder. If a multi-agent system produces an incorrect claims or underwriting outcome, what level of observability do you require to trace whether the failure came from document extraction, reasoning, orchestration, or guardrail logic? Is this forensic traceability embedded into your MLOps standards today?

Ajwad Ali: Observability and auditability aren’t my favourite topics, but they consistently come up. We’re working closely with our dedicated AI Governance team to ensure that systems are built with this in mind. There has to be a risk acceptance with Generative AI models: yes, failures will happen, and in insurance, a heavily regulated industry, there must be an audit trail for any decisions or failures. We are ensuring right from the off that Agentic systems output logs at every relevant step, but that’s the easy bit. The tricky bit is making those logs readable and consumable for wider business stakeholders so we can quickly make sense of erroneous outputs; that’s something we are still trying to work through. It’s something I look forward to seeing examples of in the AI Awards, to see what solutions have been tried and tested for monitoring AI at scale.
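
The “logs at every relevant step” pattern can be as simple as one structured record per agent step, keyed by a workflow correlation id, so a failure can later be attributed to extraction, orchestration, or guardrail logic. The field names below are assumptions, not AXA’s logging schema.

```python
# Illustrative sketch: structured, correlated logs emitted at each agent step.
# Field names are assumptions, not AXA's logging schema.
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("agent_trace")


def log_step(correlation_id: str, component: str, inputs: dict, outputs: dict) -> None:
    """One JSON record per step, keyed by the workflow's correlation id."""
    log.info(json.dumps({
        "trace_id": correlation_id,
        "component": component,          # e.g. "doc_extraction", "orchestrator"
        "inputs": inputs,
        "outputs": outputs,
        "timestamp": time.time(),
    }))


claim_trace = str(uuid.uuid4())
log_step(claim_trace, "doc_extraction", {"document": "claim_form.pdf"},
         {"policy_number": "ABC-123", "confidence": 0.97})
log_step(claim_trace, "orchestrator", {"policy_number": "ABC-123"},
         {"decision": "human_review", "reason": "amount_above_threshold"})
```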

4. AXA operates in a deeply brownfield environment. How do your AI agents interact with legacy core systems such as mainframe-based policy or claims platforms without inheriting their latency and batch constraints? What does the anti-corruption layer between cloud-native AI and legacy systems look like in practice?

Ajwad Ali: The truth is they don’t! We have an internal function called the AI Hub which is responsible for validating and delivering AI use cases. In some cases we have managed to make sense of older systems, but some legacy systems are far too archaic and complex to integrate effectively with AI; feasibility is part of our prioritisation process.

All of this has resulted in a strategy and vision at AXA UK to be AI-native. We want our core systems to be rebuilt and modernised with AI in mind, going right down to the data fabric. A lot of the focus for 2026 will be on foundational work to enable AI at scale going forward. We’re investing heavily in this space and won’t settle for clunky solutions that put AI on top of legacy systems where we don’t foresee optimal results. We are actively working on modernising alongside our AI innovation work, and using AI to accelerate that modernisation through reverse engineering and understanding how and where we need to improve systems and processes.

5. Secure GPT scaled rapidly across a large employee base. At that level of concurrency, cost and latency become architectural problems. How do you decide which tasks warrant large foundation models versus smaller or specialized models, and is that routing logic static or adaptive at runtime?

Ajwad Ali: Yes, SecureGPT is great! It’s given us so much value at AXA and the team continue to add new features to ensure AI at scale for colleagues.

My personal view is that model selection has become an underutilised practice, and by default too much traffic is going to large foundation models. Large models are great for general-purpose work and do well in insurance use cases where external data enriches internal data. In a multi-agent architecture, the user’s first point of contact, such as the orchestrator I mentioned earlier, is usually best served by a large model, whereas we may choose a smaller model for task agents that perform narrow, specialised tasks. For example, we use a small model for identifying PII that’s trained on the fields we classify as PII within AXA.

Smaller models are great for cost because they can be open source and self-hosted, which also tends to make them safer to implement: you can see exactly what’s under the hood, unlike many of the consumption-based LLMs such as GPT-5, which are essentially a black box.

As for routing logic, we are experimenting with adaptive resource management but have yet to combine the two approaches, so routing is static for now.
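
Static routing of this kind can be as plain as a lookup table mapping each task agent to a model, with the large general-purpose model as the default. The task names, model identifiers, and hosting labels below are placeholders, not AXA’s configuration.

```python
# Illustrative sketch: a static routing table mapping task agents to models,
# along the lines described above. Names are placeholders, not AXA's config.
MODEL_ROUTES = {
    "orchestrator": {"model": "large-foundation-model", "hosted": "vendor"},
    "policy_lookup": {"model": "large-foundation-model", "hosted": "vendor"},
    "pii_detection": {"model": "small-pii-classifier", "hosted": "self"},
    "field_extraction": {"model": "small-doc-model", "hosted": "self"},
}


def route(task: str) -> dict:
    """Static lookup today; an adaptive policy could replace this later."""
    try:
        return MODEL_ROUTES[task]
    except KeyError:
        # Unknown tasks fall back to the general-purpose model.
        return MODEL_ROUTES["orchestrator"]


print(route("pii_detection"))  # {'model': 'small-pii-classifier', 'hosted': 'self'}
```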

6. You advocate for human-centred AI, but excessive automation can lead to automation bias. In high-stakes workflows like claims or underwriting, do you intentionally design friction into the user experience to force human judgment? From an engineering perspective, how do you prevent AI from turning reviewers into rubber stamps?

Ajwad Ali: Our thinking here is still primitive, and I’m sure it will evolve as the relationship between AI and humans matures. We don’t simply put a review request in place; we add context for the human to give the decision point a bit of a story and reasoning. The context usually involves key highlights from the workflow up to the review and any suggested next steps. The suggestions are important as they force the user to reflect on the decision and potential next steps before reviewing.
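
As a rough illustration, the packet handed to a reviewer might carry workflow highlights and suggested next steps alongside the model’s rationale, rather than a bare approve/reject prompt. All fields and values below are hypothetical.

```python
# Illustrative sketch: the kind of context packet a human reviewer might receive
# at a decision point. Fields and values are hypothetical, not AXA's design.
from dataclasses import dataclass


@dataclass
class ReviewRequest:
    claim_id: str
    decision_point: str               # what the human is being asked to decide
    workflow_highlights: list[str]    # key events leading up to this review
    suggested_next_steps: list[str]   # prompts reflection rather than rubber-stamping
    model_rationale: str


request = ReviewRequest(
    claim_id="claim-42",
    decision_point="Approve payout of £8,200?",
    workflow_highlights=[
        "Policy active and premiums up to date",
        "Invoice total matches extracted treatment costs",
    ],
    suggested_next_steps=[
        "Confirm treatment dates against the policy cover period",
        "Check prior claims history for this policyholder",
    ],
    model_rationale="All extracted fields consistent; no fraud indicators.",
)
print(request.decision_point)
```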

I’m interested to see how interfaces and design change as we all get more accustomed to AI systems. Currently AI is being built on top of existing systems and squeezed into interfaces that weren’t necessarily built for AI interactions. I’m not sure what the long-term answer is here, but I suspect it opens the door to a lot of innovation. It’s a factor I think is often overlooked, and there’s a lot of opportunity to build better user experiences in areas such as accessibility; it’s another area where I hope to see progress in the AI Awards. I’ll use my favourite term again, ‘AI native’: interfaces and human interaction are a huge piece of that puzzle.

7. Many emerging insurance risks have no historical data, including deepfake fraud and synthetic identity attacks. Is AXA engineering synthetic data pipelines to train systems for these low-frequency, high-impact scenarios, and how do you validate that agents trained on synthetic distributions behave safely in the real world?

Ajwad Ali: AXA is using synthetic data where necessary to augment and simulate real-world scenarios. Validation relies on various traditional data science techniques, such as k-fold cross-validation and Monte Carlo testing, to understand how we expect a system to behave in the real world, but ultimately there’s only so much validation that can be done with synthetic data. It’s more critical to have secure and robust guardrails in place and to deploy these agents initially on highly monitored, low-traffic use cases to collect enough data to build confidence.

8. You’ve emphasised portability and reuse, yet Secure GPT is tightly integrated with Azure’s proprietary safety and governance tooling. How do you balance the speed gains of deep cloud integration against the risk of vendor lock-in if alternative or open models outperform in specific insurance tasks?

Ajwad Ali: Abstraction. Our architecture teams put a lot of thought and effort into sustainable technology. We de-couple our AI solutions as much as possible from technical dependencies on third-party technology. We have also developed our own model gateway layer, which acts as a connector between all UK AI use cases and models and simplifies the process of selecting and switching the models that power our solutions. The model gateway has integrations with a wide range of models and is configurable to enable more. By doing this we can also simplify observability and keep it consistent even when we swap out models and components, as it’s a plug-and-play implementation.
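
In outline, a gateway like this is a thin interface that use cases depend on instead of any one vendor SDK, so swapping models becomes a configuration change. The provider classes and call signature below are assumptions for illustration, not AXA’s gateway.

```python
# Illustrative sketch: a thin model-gateway interface decoupling use cases from
# vendor SDKs. Provider names and the call signature are assumptions only.
from abc import ABC, abstractmethod
from typing import Optional


class ModelProvider(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class VendorProvider(ModelProvider):
    def complete(self, prompt: str) -> str:
        # A real vendor SDK call would go here; stubbed for illustration.
        return f"[vendor] {prompt[:30]}..."


class SelfHostedProvider(ModelProvider):
    def complete(self, prompt: str) -> str:
        return f"[self-hosted] {prompt[:30]}..."


class ModelGateway:
    """Single entry point: swapping models is a configuration change."""

    def __init__(self, providers: dict[str, ModelProvider], default: str) -> None:
        self._providers = providers
        self._default = default

    def complete(self, prompt: str, model: Optional[str] = None) -> str:
        return self._providers[model or self._default].complete(prompt)


gateway = ModelGateway({"large": VendorProvider(), "local": SelfHostedProvider()},
                       default="large")
print(gateway.complete("Summarise this policy document.", model="local"))
```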

9. Document intelligence remains central to insurance operations. Are you moving toward end-to-end multimodal models that reason directly over images and documents, or do you still prefer modular DocAI pipelines? In sensitive domains like medical claims, how do you engineer against hallucinations in multimodal systems?

Ajwad Ali: I think we are shifting to multimodal and world models. It’s probably not one for 2026, but it’s in our ambitions for the future as these models mature and the technology becomes more capable. I personally think world models will have a big impact on the insurance industry. As these models grow and develop a better understanding of the physical world, they may have much more input into processes around injuries, claims and more. It’s something I’m keeping tabs on!

A strong output validation layer built into the guardrails is what’s needed to manage hallucinations, and that has to be real time. Equally important is a longer window of monitoring services that assess model behaviour, such as drift or general changes over a sustained period, to try to predict hallucinations. The reactive, real-time approach can be costly and adds latency, but it can be relaxed and minimised over time as confidence in the AI systems grows. The cold analytics of model monitoring, looking for trends over time and proactively avoiding hallucinations, is something we put equal importance on.
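
A rough sketch of those two layers: a cheap real-time grounding check on each output, plus a slower monitor that watches a quality score over a rolling window. The overlap heuristic, window size, and threshold below are assumptions for illustration, not AXA’s guardrails.

```python
# Illustrative sketch: a real-time output check paired with a slower drift monitor
# over a rolling window. Checks and thresholds are assumptions for illustration.
from collections import deque
from statistics import mean


def validate_output(answer: str, source_snippets: list[str]) -> bool:
    """Cheap real-time guardrail: reject answers with no grounding overlap."""
    answer_terms = set(answer.lower().split())
    return any(answer_terms & set(s.lower().split()) for s in source_snippets)


class DriftMonitor:
    """Slower check: track a quality score over a long window and flag decline."""

    def __init__(self, window: int = 500, alert_threshold: float = 0.85) -> None:
        self.scores: deque = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, score: float) -> None:
        self.scores.append(score)

    def drifting(self) -> bool:
        return (len(self.scores) == self.scores.maxlen
                and mean(self.scores) < self.alert_threshold)


monitor = DriftMonitor(window=3, alert_threshold=0.9)
for score in (0.95, 0.88, 0.80):
    monitor.record(score)

print(validate_output("Cover excludes pre-existing conditions",
                      ["This policy excludes pre-existing conditions."]),
      monitor.drifting())
```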

10. Secure GPT was deployed quickly, but trust develops more slowly than software. What engineering metrics do you use to measure internal trust in AI systems? Do override or rejection signals feed back automatically into model retraining or policy adjustment?

Ajwad Ali: The most important metric for us in measuring trust is returning users, which impressively is quite high for SecureGPT, and work is being done to push that number higher. A lot of our metrics now are based around new users and returning users, and both are important. New users indicate what percentage of the organisation with access to AI is willing to try it in their role, and returning users give a broad view of the percentage who see value in using AI. We’ve had over 50% of colleagues use SecureGPT and over 30% are return users, so we’ve already seen good results but are working towards even better.

We also encourage users to give qualitative and quantitative feedback which gives us a more direct answer to understanding trust.

Editor’s note: Ajwad Ali serves on the judging panel for the AI Awards 2026.
