
Anton Vice on Why AI Still Depends on Human Data | Interview

AI has made rapid progress, but at the pretraining layer, it still depends on real human data rather than synthetic substitutes. In this interview, Anton Vice, CTO and Co-founder of Grably, explains why access to high-quality human behavior data remains a structural constraint for AI development, how Grably approaches large-scale human data collection, and where synthetic data still falls short.

1. Your previous company, Speechki, focused on using synthetic voices to scale content production. Grably takes almost the opposite approach by prioritizing raw, real human data. Was there a specific moment where you realized synthetic data had reached a ceiling, and that progress toward more capable AI would require returning to messy, unstructured human interaction itself?

Anton Vice: Speechki focused mainly on inference with synthetic voices to turn books into audiobooks; however, we still used a significant amount of diverse, labeled human voice data to train our base model.

This is when the first bells rang and we encountered the problem we are solving now: human data is the cornerstone of AI pretraining, and the available sources are fragmented. That is why we started Grably – to unlock hard-to-access AI training data sources and make it easy for AI labs to acquire the data by embedding compliance and governance directly into the workflow, so ML teams can move faster within approved and auditable boundaries.

To your broader question, as we can observe, synthetic data is definitely on the rise, especially with advances in reinforcement learning and self-play. In practice, we see it working best when anchored in real human data, which provides the initial distribution and signal that synthetic generation can then amplify more efficiently. We want to exhaust all human data sources while we wait for a significant jump in synthetic data generation research, since human data may be cheaper to acquire and the tools for collecting it are already in place – they are just not optimized.

2. You describe Grably as building a “human layer of intelligence” spanning physical, physiological, communicative, and cognitive signals. From an engineering standpoint, which of these dimensions has proven the hardest to capture using consumer-grade devices, and what technical compromises were required to make this feasible at scale?

Anton Vice: The biggest limiting factor in data sourcing is perhaps diversity. It is easy to ask someone in the US to use a high-quality audio or video recording device, but when we start diversifying, especially into highly underrepresented areas of the world, we start facing issues with the accessibility of suitable devices for data acquisition.

Consumer devices vary widely in microphones, cameras, sensors, and connectivity, particularly outside well-resourced regions. To make this feasible at scale, we had to design systems that assume imperfect inputs by default: lower sampling rates, inconsistent frame quality, missing modalities, and significant background noise.

The key technical compromise was shifting complexity out of hardware and into software. Instead of requiring specialized devices, we focus on robust preprocessing, normalization, and metadata capture so that heterogeneous inputs can still be aligned into a usable training distribution. That allows us to prioritize global diversity without enforcing unrealistic device standards.

And it is a crucial step, because the wider the distribution of the training data, the better an ML model can generalize.
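
To make “assume imperfect inputs by default” concrete, here is a minimal sketch in Python of the kind of normalization step described above: heterogeneous clips are resampled to a common rate, downmixed to mono, and paired with capture metadata. The `ClipRecord` structure and its fields are hypothetical, not Grably’s actual pipeline.

```python
from dataclasses import dataclass

import librosa   # standard audio loading/resampling library
import numpy as np

TARGET_SR = 16_000  # common sample rate the pipeline normalizes to

@dataclass
class ClipRecord:
    """Hypothetical record pairing normalized audio with capture metadata."""
    audio: np.ndarray
    sample_rate: int
    source_sr: float     # the device's original sample rate
    duration_s: float
    device_hint: str     # e.g. a device model string, if the client reports one

def normalize_clip(path: str, device_hint: str = "unknown") -> ClipRecord:
    # librosa resamples and downmixes to mono on load, so clips captured on
    # very different consumer devices land in one consistent format.
    source_sr = librosa.get_samplerate(path)
    audio, sr = librosa.load(path, sr=TARGET_SR, mono=True)
    return ClipRecord(audio=audio, sample_rate=sr, source_sr=source_sr,
                      duration_s=len(audio) / sr, device_hint=device_hint)
```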

3. Large language models have become highly capable at processing text, yet still struggle with non-verbal context such as intent, hesitation, or embodied intuition. What kinds of human signals does Grably collect today that you believe current models fundamentally lack, and why are those signals so difficult to infer synthetically?

Anton Vice: Grably focuses on capturing implicit human signals that don’t reliably surface in text or audio alone – such as hesitation before an action, subtle shifts in engagement over time, confidence drift, and context-dependent decision timing. These signals are often embodied or longitudinal. They emerge from patterns of behavior across situations rather than from what someone explicitly states in a single moment.
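
As a toy illustration of one such signal, the sketch below derives a hesitation proxy from a timestamped interaction log: the latency between a prompt appearing and the user acting. The event names and log format are hypothetical, not Grably’s schema.

```python
from dataclasses import dataclass

@dataclass
class Event:
    name: str   # e.g. "options_shown", "option_selected"
    t: float    # seconds since session start

def hesitation_latency(events: list[Event]) -> list[float]:
    """Latency between each prompt and the following action.

    Long or highly variable latencies are one crude proxy for hesitation;
    in practice this would be combined with other behavioral signals.
    """
    latencies = []
    shown_at = None
    for e in events:
        if e.name == "options_shown":
            shown_at = e.t
        elif e.name == "option_selected" and shown_at is not None:
            latencies.append(e.t - shown_at)
            shown_at = None
    return latencies

# Example: the 4.9 s gap before the second choice suggests hesitation.
log = [Event("options_shown", 0.0), Event("option_selected", 0.8),
       Event("options_shown", 5.0), Event("option_selected", 9.9)]
print(hesitation_latency(log))  # [0.8, 4.9]
```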

Today’s models tend to struggle with these signals because they’re underspecified, sparse, and path-dependent. They’re rarely labeled, highly individual, and tightly coupled to situational context and lived experience. While some aspects may become more tractable as multimodal systems improve, synthetically inferring them at scale often collapses nuance into population averages. In practice, these human signals tend to matter precisely because they deviate from the norm.

In that sense, the gap isn’t just about model scale or architecture. It’s about access to the right kinds of real-world, behavior-level data and the ability to interpret it over time, rather than within a single interaction or modality.

4. Grably introduces the idea of “living datasets” that continuously evolve rather than remaining static. In AI research, reproducibility depends on stable benchmarks. How do you reconcile continuously updating datasets with the need for version control, reproducibility, and fair model comparison?

Anton Vice: Grably is a human-interaction data research lab focused on understanding and measuring real-world human signals. We approach data with the same level of rigor, versioning, and evaluation discipline that leading AI labs apply to model development.

We think of “living datasets” as operational data streams layered on top of a disciplined versioning and evaluation framework, not as a replacement for stable benchmarks.

In practice, every snapshot of a living dataset is versioned, immutable, and fully auditable. Models are always trained, evaluated, and compared against explicitly defined dataset versions, with metadata capturing time range, collection conditions, and known shifts. That allows results to be reproduced exactly, even as the underlying system continues to evolve.
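
A minimal sketch of what a versioned, immutable snapshot could look like, assuming content-addressed files and a JSON manifest; the field names are illustrative, not Grably’s actual format:

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def snapshot_manifest(data_dir: str, version: str, time_range: str,
                      collection_conditions: str) -> dict:
    """Freeze a directory of data files into an auditable snapshot manifest.

    Every file is pinned by content hash, so a model trained against
    "version" can be reproduced exactly even after the living dataset
    has moved on.
    """
    files = {p.name: file_sha256(p)
             for p in sorted(Path(data_dir).iterdir()) if p.is_file()}
    manifest = {
        "version": version,                   # e.g. "2025-06-01.r3"
        "time_range": time_range,             # when the data was collected
        "collection_conditions": collection_conditions,
        "files": files,
    }
    # Hash the manifest itself so the snapshot ID is tamper-evident.
    manifest["snapshot_id"] = hashlib.sha256(
        json.dumps(manifest, sort_keys=True).encode()).hexdigest()
    return manifest
```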

Static benchmarks remain essential for controlled comparison. However, many real-world failures occur because models overfit to frozen snapshots. Living datasets let us study adaptation, drift, and generalization explicitly, while still preserving the guarantees that reproducible research and enterprise deployment require.

5. Video generation models often fail on causality and physical consistency. You’ve emphasized that Grably’s video data includes rich metadata beyond raw pixels. What kinds of causal or contextual annotations matter most for teaching models how the real world actually behaves?

Anton Vice: One interesting thing we’ve seen with video models recently, and something researchers at Google have shown in the “Video models are zero-shot learners and reasoners” paper, is that as video generation models scale, they can develop emergent internal representations of how the world works. For example, aspects of physical dynamics, such as fluid behavior, can be learned implicitly rather than explicitly programmed, which is quite compelling.

At the same time, video generation remains difficult to interpret at the level of internal representations, and much of the recent progress has come from scale – larger datasets and longer training – rather than from explicit causal supervision.

Our hypothesis is that adding contextual signals, such as aligned human commentary or situational context, helps models reason at a higher level. These annotations provide causal and contextual grounding that raw pixels alone often miss, allowing models to better understand why events happen, not just what happens, and to explore a broader space of emergent behaviors.
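
One lightweight way to picture “aligned human commentary” is as time-stamped annotation segments attached to a clip. The schema below is a hypothetical sketch, not Grably’s format:

```python
from dataclasses import dataclass

@dataclass
class CommentarySegment:
    """Hypothetical causal/contextual annotation aligned to a video span."""
    start_s: float
    end_s: float
    text: str   # the human commentary itself
    kind: str   # e.g. "causal", "intent", "situational"

@dataclass
class AnnotatedClip:
    video_uri: str
    segments: list[CommentarySegment]

clip = AnnotatedClip(
    video_uri="clip_0142.mp4",
    segments=[
        CommentarySegment(2.0, 4.5,
                          "the cup tips because the table was bumped", "causal"),
        CommentarySegment(5.0, 7.0,
                          "he reaches early, anticipating the spill", "intent"),
    ],
)
```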

6. Grably relies on token-based incentives to motivate large-scale data contribution. Economic incentives can attract volume, but also gaming and low-quality submissions. What mechanisms does Grably use to ensure that data collected for financial reward remains reliable, human, and useful for serious AI research?

Anton Vice: Grably initially explored token-based incentives as a way to study participation and data quality at scale, but we no longer rely on volume-driven rewards as our primary collection mechanism.

Today, we work with a combination of human contributors and private companies or enterprises. Incentives, where used, are tied to quality-weighted and longitudinal participation rather than one-off tasks, and enterprise data is collected within clearly defined, consented workflows.

Across all sources, data only enters research pipelines after passing versioned validation, consistency checks over time, and task-specific quality filters. In that sense, incentives help align effort, but reliability comes from sustained engagement, clear context, and rigorous evaluation rather than financial reward alone.
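
As a toy sketch of quality-weighted, longitudinal participation, the function below admits a contributor’s data only once their submission history shows sustained quality; the thresholds and scoring are illustrative, not Grably’s actual filters:

```python
def quality_weight(scores: list[float], min_history: int = 5,
                   min_mean: float = 0.7) -> float:
    """Weight a contributor by sustained quality, not one-off volume.

    "scores" are per-submission quality scores in [0, 1] produced by
    task-specific validators; thresholds here are purely illustrative.
    """
    if len(scores) < min_history:
        return 0.0  # not enough longitudinal signal yet
    mean = sum(scores) / len(scores)
    if mean < min_mean:
        return 0.0  # consistently low quality: filtered out
    # Reward consistency: penalize erratic score variance.
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean * (1.0 - min(var, 1.0))
```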

7. By operating through Telegram Mini Apps and the TON blockchain, Grably effectively turns millions of users into decentralized data nodes. Do you see this as the early formation of a new labor market for AI, and how do you respond to critiques that this model risks becoming a form of digital exploitation rather than empowerment?

Anton Vice: Telegram has been an incredible platform for user acquisition because you can get access to over 1 billion users in a matter of days. Contributors choose when and how they engage, understand what data is collected, and are compensated within defined, consented contexts. That distinction matters.

We take concerns about exploitation seriously. For us, the guardrails are transparency, the ability to opt out, and continuous feedback from participants. If incentives or collection mechanisms start to feel misaligned, that shows up quickly in user behavior and retention. Our responsibility is to respond to that signal, not to optimize blindly for scale.

AI is reshaping how value is created, and there’s real risk in getting these dynamics wrong. Our goal is to push toward models where participation feels voluntary, understandable, and fair rather than extractive, while staying open to course correction as the ecosystem evolves.

8. Grably has publicly referenced access to tens of millions of medical research records. Medical data is among the most regulated categories globally. Can you clarify how these datasets are sourced, how consent is obtained, and how Grably navigates HIPAA and GDPR constraints in a decentralized data collection model?

Anton Vice: When we reference large-scale medical research data, we’re referring to lawfully sourced, secondary research datasets, not raw patient records collected directly from individuals through decentralized consumer channels.

Grably does not collect identifiable personal health information from users. Medical datasets we work with are obtained through licensed providers, research institutions, or enterprise partners, and are de-identified or anonymized in accordance with applicable regulations before they enter any research workflow.

Consent and lawful basis are handled at the source level, under the governance frameworks of the originating institutions. Our role is to operate downstream as a research and data infrastructure layer, with strict controls around access, use, and purpose limitation.

From a compliance standpoint, decentralized participation mechanisms are not used for regulated medical data. Medical datasets are handled separately, within controlled environments that align with HIPAA, GDPR, and regional data protection requirements, including data minimization, auditability, and contractual safeguards.

In short, consumer-scale data collection and regulated medical research data are intentionally separated, both technically and operationally, to ensure compliance and reduce risk.

9. Copyright and data provenance have become central issues as lawsuits challenge how AI models are trained. Beyond payments, do you see blockchain primarily as a legal infrastructure for traceability and auditability, and could Grably’s on-chain records become a compliance layer for enterprise AI teams?

Anton Vice: We don’t see blockchain as a replacement for legal frameworks, but as a foundation for building more transparent and inspectable data governance systems.

One of Grably’s longer-term goals is to develop a clear, standardized framework for data evaluation, provenance, and governance that makes it easier for teams to understand where data comes from, how it can be used, and how it has evolved over time. Today, much of that work is manual, fragmented, and slow, often stretching compliance cycles over months.

On-chain records can help by providing a shared source of truth for data lineage and usage history, which can support audits, licensing reviews, and internal compliance processes. We see this as a way to reduce friction and ambiguity for enterprise AI teams, not to replace legal judgment or regulatory review.
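
To make the idea concrete, here is a minimal sketch of a hash-chained lineage entry whose digest could be anchored on-chain or in any append-only store; nothing here is tied to a specific chain or to Grably’s actual schema:

```python
import hashlib
import json
import time
from typing import Optional

def lineage_entry(dataset_hash: str, action: str, actor: str,
                  prev_entry_hash: Optional[str]) -> dict:
    """One link in a hash-chained data-lineage log.

    Anchoring "entry_hash" in an append-only store (on-chain or otherwise)
    makes a dataset's sourcing and usage history tamper-evident and auditable.
    """
    entry = {
        "dataset_hash": dataset_hash,  # content hash of the dataset version
        "action": action,              # e.g. "licensed", "de-identified", "accessed"
        "actor": actor,
        "timestamp": int(time.time()),
        "prev": prev_entry_hash,       # links entries into a chain
    }
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()).hexdigest()
    return entry
```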

In that sense, blockchain is useful as an enabling layer that helps teams reason about data more efficiently, while formal compliance decisions remain grounded in contracts, policy, and applicable law.

10. High-fidelity human data is also the foundation of deepfakes and other misuse. How does Grably decide who can access the most sensitive datasets, and where do you draw the ethical line between open research and preventing downstream harm?

Anton Vice: Misuse risk is real, and high-fidelity human data raises the stakes. We don’t think this is something any single company can fully solve, but we do believe data providers have a responsibility to act thoughtfully.

At Grably, access to sensitive datasets is gated. We work with known counterparties, require clarity on intended use, and avoid distributing high-risk data to entities that lack transparency or appropriate safeguards. That doesn’t mean we can control downstream use perfectly, but it does mean we’re selective about who we work with and under what terms.

We draw the line by separating open research from sensitive data access. Lower-risk datasets can support broader research, while higher-fidelity or more sensitive data is restricted, reviewed, and governed through contractual, technical, and operational controls.

Our role is to reduce obvious misuse risk at the data layer, not to claim authority over how AI is used globally.

11. Some researchers argue that increasingly powerful models will eventually generate synthetic data indistinguishable from real human data, making original human data less valuable over time. Do you believe there is something irreducibly human that synthetic data cannot replicate, and if so, what is it?

Anton Vice: Most synthetic data today is ultimately sampled from distributions learned from human-generated data. That is why recursive training on synthetic data can lead to model collapse: the distribution narrows with each generation until the model degrades and becomes unusable.
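
That narrowing effect can be shown with a toy simulation: a Gaussian “model” refit each generation on its own samples. With the biased MLE estimator, the expected variance shrinks by a factor of (n-1)/n per generation, so the distribution slowly collapses. This is a minimal illustration of the dynamic, not a claim about any production system:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0  # generation 0: the "human" distribution
n = 50                # samples drawn per generation

for gen in range(201):
    samples = rng.normal(mu, sigma, size=n)    # "train" on the previous model's output
    mu, sigma = samples.mean(), samples.std()  # refit (biased MLE: variance shrinks)
    if gen % 50 == 0:
        print(f"generation {gen:3d}: sigma = {sigma:.3f}")
# sigma drifts toward 0: later "models" cover an ever narrower distribution
```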

Novel approaches to synthetic data generation show potential in areas such as generating reasoning traces for reinforcement learning and, perhaps, simulated driving environments for autonomous driving models.

Over time, synthetic data will likely play a larger role, and in some domains it may become unavoidable. Our view is simply that we’re not there yet, and that there is still substantial value in collecting high-quality human data before relying more heavily on synthetic substitutes.

12. Looking five years ahead, if Grably’s vision succeeds, how does the structure of the internet change? Do we move from an attention economy to a true data ownership economy, and what role does Grably play in mediating value between humans and increasingly autonomous AI systems?

Anton Vice: Looking ahead, we think the internet gradually shifts from an attention-first model toward one where data is treated as a first-class asset with clearer ownership, consent, and value attribution.

AI systems will continue to require data to improve, but the terms under which that data is accessed and used matter. A healthier structure makes participation explicit: individuals understand how their data is used and are compensated when it creates value.

Grably’s goal is to reduce friction between original data owners and increasingly autonomous systems while keeping those owners visible and represented in that exchange.

That doesn’t eliminate risk, but it creates a more balanced foundation for how value flows between humans and machines as AI becomes more deeply embedded in everyday life.

Editor’s Note

This interview surfaces a pretraining-level structural constraint. No matter how advanced synthetic data becomes, it continues to sample from distributions rooted in real human behavior. Until that dependency fundamentally changes, access to high-quality, diverse, and compliant human data remains a foundational bottleneck for AI progress.
