
Daniel Keinrath on Voice AI as Infrastructure

Voice AI is moving beyond call handling into core business operations. In this interview, Fonio founder Daniel Keinrath explains why latency, orchestration, and system integration have pushed voice from interface to infrastructure, and why reliability matters more than model novelty as adoption accelerates.

1. 2025 is increasingly described as the inflection point for enterprise voice agents. From your perspective, what structural changes made voice AI viable as core infrastructure for SMEs only in the past one to two years, rather than earlier?

Daniel Keinrath: Two things changed in the last one to two years. First, speech-to-text and text-to-speech crossed a quality threshold where conversations stopped sounding robotic and started feeling natural. Second, latency dropped massively because models got faster and orchestration got smarter.
Before that, voice AI technically functioned, but it wasn’t reliable enough to use with real customers. Once it could respond in under a second, understand messy, real-world language (including pauses, inflections, and nuances) and integrate seamlessly with business systems, it became truly valuable for SMEs.

2. Traditional phone systems failed SMEs long before AI arrived. What specifically broke in the post-pandemic period that made human-only phone handling economically unsustainable for small businesses?

Daniel Keinrath: Of course, costs went up after the pandemic and hiring became more difficult, and that definitely accelerated things. But honestly, I believe companies would have moved toward solutions like ours much earlier if the technology had been mature enough.

Customer service and call centre work has always been one of the hardest areas to staff. These roles are tough, not particularly attractive long term, and extremely hard to scale. Finding and retaining people has been a challenge for years. Once reliable voice AI solutions became available, they finally offered a realistic way to deal with an issue that businesses had been struggling with for a very long time.

3. Fonio positions its AI not as a voice bot, but as an “intelligent employee” that can read from and write to business systems. At what point does voice AI stop being an interface and become part of the operational stack itself?

Daniel Keinrath: Voice AI stops being just an interface once it actually starts acting, rather than just delivering pre-set answers.

The moment it can autonomously check availability, book appointments, update a CRM, or trigger workflows on its own, it stops being a tacky phone bot. At that point, it behaves like an employee that’s directly plugged into your systems.

The phone call is really just the entry point. The real thing happens in the background, when the AI takes work off people’s plates and actually moves processes forward without someone having to double-check or clean things up afterward.
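To make the "read from and write to business systems" idea concrete, here is a minimal sketch of a booking step. It is an illustration under stated assumptions, not Fonio's implementation: the `Calendar` class, the `handle_booking_request` function, and the slot format are all hypothetical, and an in-memory dictionary stands in for a real booking system.

```python
from dataclasses import dataclass, field


@dataclass
class Calendar:
    # Hypothetical in-memory stand-in for a real booking system.
    booked: dict[str, str] = field(default_factory=dict)

    def is_free(self, slot: str) -> bool:
        # "Read" side: look up availability.
        return slot not in self.booked

    def book(self, slot: str, caller: str) -> None:
        # "Write" side: record the appointment.
        self.booked[slot] = caller


def handle_booking_request(calendar: Calendar, slot: str, caller: str) -> str:
    """Check availability first, then write the booking if the slot is free."""
    if not calendar.is_free(slot):
        return f"Sorry, {slot} is already taken."
    calendar.book(slot, caller)
    return f"Booked {slot} for {caller}."
```

The point of the sketch is the order of operations: the agent reads the system state before it writes, so a second caller asking for the same slot gets a refusal instead of a double booking.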

4. Many voice AI startups rely heavily on U.S. APIs and telecom layers. You chose to build a Europe-native orchestration layer instead. What technical or regulatory constraints made this unavoidable in the DACH market?

Daniel Keinrath: It’s less a regulatory issue and more a technical one. A lot of voice AI startups basically white-label US services that run on servers in the States. That means a German phone call gets transcribed, sent to the US, processed there, translated back, and then returned to Germany. That round trip adds significant latency, which kills conversation flow and customer experience.
By hosting everything in Europe and building our own orchestration layer, we get latency down to an almost human level. That massively improves understanding of the German language and results in much higher call quality overall.

5. German is often cited as a hard language for speech systems, especially in domains that require exact spelling such as healthcare or automotive services. What did you learn about language-specific failure modes that general-purpose voice models consistently miss?

Daniel Keinrath: German breaks a lot of general-purpose voice models because it’s very nuanced. Things like spelling out email addresses, license plates, names, or compound words are extremely hard if the system isn’t built for it.
What made the difference for us was building our own orchestration layer and fine-tuning specifically for German. We also had to teach the system real speech patterns: how people actually talk, hesitate, correct themselves, or switch context mid-sentence. That’s where generic models usually fail.

6. GDPR compliance is often framed as a legal requirement. In practice, how did data sovereignty become a go-to-market strategy for Fonio rather than a box to check? Which SME segments simply cannot adopt voice AI without it?

Daniel Keinrath: GDPR is non-negotiable in Europe. Of course, you can use it as a strategy and an added value, but the product still has to be good. What we noticed early on is that certain industries simply won’t adopt voice AI unless data stays in Europe and is handled properly.
Medical practices, legal offices, and other businesses dealing with sensitive personal data don’t even consider solutions that route calls or recordings through non-European infrastructure. So GDPR is a prerequisite, really.

7. In voice-driven workflows, errors are not cosmetic. How do you design guardrails so that an AI agent can operate autonomously while still knowing when to stop, escalate, or hand over to a human without breaking user trust?

Daniel Keinrath: From a technical point of view, the AI itself doesn’t really make random mistakes. It only acts based on the prompts and the information it’s given. So when something goes wrong, it’s usually because the input wasn’t clear or the underlying information was incomplete or wrong.

On top of that, we’ve built in general guardrails to prevent misuse. For example, it’s intentionally very hard to use fonio.ai for cold calling. By default, we only support inbound calls. If a company wants to use outbound calling, we first need to understand the use case and verify that it’s legitimate. That way, we reduce the risk of scams and protect both businesses and end customers.
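The inbound-by-default rule described above can be sketched as a simple policy check. This is a hedged illustration only: the `Account` type, the `outbound_verified` flag, and the `is_call_allowed` function are hypothetical names, not part of fonio.ai's actual product or API.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    INBOUND = "inbound"
    OUTBOUND = "outbound"


@dataclass
class Account:
    name: str
    # Hypothetical flag: set only after the outbound use case has been reviewed.
    outbound_verified: bool = False


def is_call_allowed(account: Account, direction: Direction) -> bool:
    """Inbound calls are allowed by default; outbound needs a verified use case."""
    if direction is Direction.INBOUND:
        return True
    return account.outbound_verified
```

Keeping the default at "deny outbound" means a new account cannot place cold calls until someone has explicitly reviewed and approved its use case.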

8. The acquisition of fluently was unusually early for a company at your stage. Beyond customer consolidation, what technical or capability gaps did that acquisition allow you to close faster than building internally?

Daniel Keinrath: It was a strategic decision. Fluently was building technology very similar to ours, so instead of duplicating work, we integrated their tech and brought their customers onto our platform. 

9. Fonio reached thousands of customers and high call volumes before raising a significant round. From your view, what mattered more to investors: architectural scalability, vertical-specific performance, or proof that SMEs were willing to trust AI with real customer interactions?

Daniel Keinrath: It was really a mix of all three. With a large number of customers and consistently strong reviews, we showed that the technology scales and can be rolled out very quickly. At the same time, real usage proved that SMEs are actually willing to trust AI with customer interactions.
That’s especially meaningful in the German market, which is usually pretty sceptical toward new technologies. So seeing that level of adoption was both surprising and very validating for us.

10. Expanding from German-speaking markets into France is not just translation. Which parts of your system were language-agnostic, and which assumptions had to be re-examined when entering a new linguistic and cultural environment?

Daniel Keinrath: Our core technology is language-agnostic, but it was originally fine-tuned for German because most voice models already work extremely well in English. We deliberately focused on a non-English language first, because that’s where the real challenges are.
Once you’ve solved things like latency, orchestration, and real-world speech patterns for German, adapting the system to other languages becomes much easier. That said, every market has its own communication style and expectations, so some assumptions around tone, phrasing, and interaction flow had to be re-examined.

11. Large platforms now offer increasingly capable voice APIs. In the long run, where do you believe vertical, region-specific voice systems maintain defensibility as foundation models continue to improve?

Daniel Keinrath: In the enterprise segment, large companies will increasingly buy voice capabilities directly from big platform providers. They have the budgets, teams, and scale to buy and implement them.
For SMEs, it’s a completely different story. They won’t build these systems themselves, and even if the cost comes down over time, it still won’t make sense for them to invest heavily in setup, maintenance, and development. The complexity alone is a barrier.
That’s where specialized providers like us stay relevant. We package very advanced technology into something SMEs understand, are not scared of, can actually use, afford, and benefit from without needing an internal AI team.

12. If voice AI becomes a default layer across phone, messaging, and chat, how do you expect this to reshape labor inside European SMEs? Which human roles gain importance rather than disappear in a world of AI “employees”?

Daniel Keinrath: The founding idea of fonio.ai was born out of the reality that customer service eats up a huge amount of time, especially for SMEs that can’t afford dedicated support teams or call centres. With fonio.ai, we’re trying to give those companies their time back.
That said, as these technologies become more widespread, there will obviously be changes in the workforce. We believe new roles will emerge, like AI agent managers: people who prompt, train, supervise, and continuously improve AI systems like ours.

Editor’s Note

This interview reflects a broader shift in enterprise software. As real-time AI systems become reliable enough to act autonomously, voice is no longer a channel but an operational layer that reads from and writes to business systems.
