In this conversation, we spoke with Chuck Lee, Head of Business Division at ESTsoft, about building AI humans for real-world interaction, why orchestration matters more than individual models in conversational systems, and what changes when AI avatars move from controlled demos into physical environments.
Perso Interactive is often described as a real-time AI human rather than an avatar product. What has to go right for the interaction to feel genuinely usable, not just visually convincing?
What matters most is the combination of visual fidelity, advanced technology for delivering appropriate feedback, and well-designed user experience scenarios.
Perso Interactive is not limited to a specific industry. It is a platform designed for real-world interaction across retail, exhibitions, mobility, and customer service environments.
The challenge is not just generating a realistic AI human, but identifying where AI-driven interaction actually solves a meaningful operational problem.
To do that, ESTsoft deploys Perso Interactive in real-world environments such as exhibitions and offline stores, working closely with partners to build scenarios and gather feedback.
Through this process, the system evolves using real-world interaction data rather than controlled demo environments.
A lot of digital human systems still perform best in controlled screen-based settings. What changes technically when the product has to work in public physical environments with noise, movement, and unpredictable behavior?
The biggest difference is that input is no longer controlled.
In screen-based environments, users provide clear inputs. In public spaces, however, noise, multiple users, and movement introduce constant uncertainty.
As a result, it becomes more important to determine when to initiate interaction and who to engage with, rather than simply improving STT accuracy.
To address this, Perso Interactive integrates technologies such as vision, VAD, and distance sensing, enabling natural interaction in real-world environments.
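As an illustration of the "when and who" decision described above, here is a minimal sketch that combines the three signals mentioned (vision, VAD, and distance) into one engagement choice. This is not ESTsoft's implementation: the `Person` fields, the `choose_target` function, and both thresholds are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Person:
    """One person observed in a single frame (all fields are assumed inputs)."""
    person_id: str
    distance_m: float     # e.g. from a distance sensor or depth estimate
    facing_screen: bool   # e.g. from a vision model
    speech_prob: float    # VAD speech probability attributed to this person


def choose_target(people: list[Person],
                  max_distance_m: float = 1.5,
                  min_speech_prob: float = 0.6) -> Optional[Person]:
    """Return the person to engage with, or None if nobody qualifies."""
    # Only people who are close enough and facing the screen are candidates.
    candidates = [p for p in people
                  if p.distance_m <= max_distance_m and p.facing_screen]
    if not candidates:
        return None
    # Prefer someone who is already speaking; otherwise take any candidate.
    speaking = [p for p in candidates if p.speech_prob >= min_speech_prob]
    pool = speaking or candidates
    # Among the remaining pool, pick the closest person.
    return min(pool, key=lambda p: p.distance_m)
```

The point of the sketch is that the decision uses no transcript at all: it is settled before speech recognition accuracy becomes relevant.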
ESTsoft has put unusual emphasis on kiosks, signage, and in-vehicle deployment. What did the product need to solve before those environments became serious commercial targets?
The most critical challenge was enabling interaction that begins without a button.
Traditional systems require user input to start, but an AI human must proactively initiate conversation. This required designing a natural flow from user approach, to wait time, to speech initiation.
Once this was achieved, Perso Interactive moved from being a functional tool to delivering a meaningful customer experience, and from that point real business expansion became possible.
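The approach, wait, and speech-initiation flow can be pictured as a small state machine. The sketch below is only an assumption about how such a flow might be structured: the state names, the `ProactiveGreeter` class, and the two-second wait are hypothetical and do not come from the product.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # nobody in front of the device
    WAITING = auto()   # someone approached; hold back briefly
    GREETING = auto()  # the AI human starts speaking


class ProactiveGreeter:
    """Buttonless start: approach -> wait -> speech initiation."""

    def __init__(self, wait_seconds: float = 2.0):
        self.wait_seconds = wait_seconds  # assumed value, tune per venue
        self.state = State.IDLE
        self._approached_at = 0.0

    def update(self, person_present: bool, now: float) -> State:
        if not person_present:
            # The person left (or never arrived): reset.
            self.state = State.IDLE
        elif self.state is State.IDLE:
            # First frame with a person present: start the wait timer.
            self.state = State.WAITING
            self._approached_at = now
        elif (self.state is State.WAITING
              and now - self._approached_at >= self.wait_seconds):
            # The person stayed long enough: initiate speech.
            self.state = State.GREETING
        return self.state
```

In use, `update` would be called on every sensor tick with a presence flag and a timestamp. The wait period keeps the system from greeting people who are only walking past.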
Multilingual conversation gets much harder once it moves from translation to live interaction. Where do you see the gap most clearly between language coverage and actual conversational quality?
The key challenges lie in proper nouns, pronouns, and contextual understanding.
While translation technology itself has reached a high level of accuracy, real conversations require maintaining continuity across names, brands, and situational context. This is where quality gaps become most evident.
Ultimately, it is not just about translation, but about preserving meaning within the flow of conversation.
As mentioned earlier, ESTsoft continuously enhances Perso Interactive by connecting its LLM to real-world data, enabling more usable and context-aware interactions.
In addition, voice-based lip-sync generation that naturally supports multiple languages is critically important.
Perso Interactive sits close to the moment of customer interaction. How do you think about what needs to happen in real time for the experience to work?
It is essential to achieve both system stability and a natural response experience.
Perso Interactive is not powered by a single model, but by a comprehensive AI system. STT, intent analysis, agent-based LLMs, RAG, MCP, TTS, and STF are all connected within a single pipeline.
Therefore, more than individual model performance, orchestration across the entire system is what truly matters.
At the same time, it is impossible to have pre-built data for every possible query. The system must be able to respond naturally in any situation without causing user discomfort.
This kind of refinement comes from deployment experience rather than controlled demos.
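To make the orchestration point concrete, here is a minimal sketch of a staged pipeline that still produces a reply when one stage fails or has no answer. Everything in it is hypothetical: the stub stages stand in for real STT, intent analysis, and LLM components, the TTS and STF stages are omitted for brevity, and the fallback wording is invented for the example.

```python
from typing import Callable

Stage = Callable[[dict], dict]

FALLBACK_REPLY = "Sorry, could you say that again?"  # assumed wording


def run_pipeline(stages: list[tuple[str, Stage]], context: dict) -> dict:
    """Run stages in order; if any stage raises, return a safe fallback reply."""
    for name, stage in stages:
        try:
            context = stage(context)
        except Exception as exc:
            # Record which stage failed so operators can inspect it later.
            return {**context, "reply": FALLBACK_REPLY,
                    "failed_stage": name, "error": str(exc)}
    return context


# Stub stages standing in for real models (all hypothetical).
def stt(ctx: dict) -> dict:
    return {**ctx, "text": ctx["audio"].strip().lower()}


def intent(ctx: dict) -> dict:
    label = "greeting" if "hello" in ctx["text"] else "unknown"
    return {**ctx, "intent": label}


def respond(ctx: dict) -> dict:
    if ctx["intent"] == "unknown":
        raise LookupError("no answer available")
    return {**ctx, "reply": "Hello! How can I help?"}


STAGES = [("stt", stt), ("intent", intent), ("respond", respond)]
```

Calling `run_pipeline(STAGES, {"audio": "Hello there"})` returns a context with a normal reply, while an unrecognized query returns the fallback together with the name of the stage that failed. The design choice illustrated is that the failure policy lives in the orchestrator rather than in any single model.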
The taxi deployment in Japan is one of the more operationally demanding use cases in this category. What did that environment expose about the product that a lab setup or trade show never would?
From a technical standpoint, there were no major issues. Perso Interactive operated reliably in real taxi environments, just as it did in testing.
The unexpected challenge was making the system easy for drivers in their 50s and 60s to understand and use.
This led us to focus on end-turn recognition, determining when a conversation has naturally concluded.
Interestingly, we observed that over 40% of drivers in Japan initiated conversations with passengers first.
This experience reinforced our belief that AI humans are not just a technology, but a product capable of creating new forms of everyday communication.
Vision can make an AI human feel more aware, but it can also make behavior less predictable. Where has visual understanding proved most useful in the product?
Vision plays its most important role in timing rather than recognition.
Understanding when a person approaches, leaves, or intends to interact has a significant impact on the overall experience.
In this sense, the role of vision in Perso Interactive is not to make AI smarter, but to make its behavior feel more natural.
Many enterprise AI systems become much harder to manage once they move from one deployment to many. What has mattered most in making Perso Interactive repeatable across different partners, hardware formats, and operating conditions?
The biggest challenge is handling fragmented deployment requirements across different industries and hardware environments.
Different partners often require different interaction flows, hardware setups, operational logic, and deployment conditions.
To manage that complexity, ESTsoft built dedicated teams and internal processes to structure those requirements and translate them into reusable platform features.
The repeatability of Perso Interactive comes less from a single model and more from turning deployment experience into productized system behavior.
When a new partnership opportunity appears, what tells you early that it is built around a real customer need rather than interest in the novelty of AI?
We primarily respond to inbound inquiries.
In sectors such as retail, exhibition, and travel, we are already working with top-tier partners. When we share the problems we aim to solve, the depth of the discussion quickly reveals alignment.
If a partner demonstrates a clear intent to solve problems using Perso Interactive, it indicates a genuine need rather than simple curiosity.
Additionally, since AI solutions are often demonstrated directly to decision-makers, their reactions during demos provide an immediate signal of real demand.
ESTsoft is now active across retail, mobility, telecom, and customer experience. How do you decide where to focus when several categories all look commercially promising?
We prioritize areas where interaction occurs repeatedly and where AI can create real value.
As a result, our core focus has been on retail, exhibition, travel, and living environments, with rapid expansion into education and mobility in recent years.
You joined during a period when ESTsoft was pushing its AI business outward much more aggressively. What has changed most in the way you make decisions now?
Decision-making at ESTsoft is centered around the field, data, and the end user.
Because Perso Interactive is a platform, it can be applied across multiple domains. As a result, we give the most weight to real-world environments and end users when making decisions.
This has made domain expertise increasingly important. We rely heavily on field deployment feedback when prioritizing product decisions.
When AI humans become routine rather than remarkable, where do you think they will matter most in everyday life?
The greatest impact will be in areas where humans repeatedly handle interactions.
In retail, this could enable fully autonomous stores operating 24/7. In exhibitions, companies will be able to deliver consistent multilingual sales narratives anywhere in the world, enhanced by tool calling for rich media delivery.
In travel, AI humans can respond to visitors in their native languages, creating seamless global experiences.
The common thread across all these scenarios is clear: AI can provide consistent, always-on, and fully attentive interaction.
Ultimately, AI humans will not be seen as a novel technology, but as infrastructure that performs repetitive human tasks more reliably.

