
Georgii Vysotskii on Making 4D Gaussian Splatting Streamable

In this interview, Georgii Vysotskii, Co-founder and CEO of Gracia, explains how 4D Gaussian Splatting is moving from studio-bound capture toward web-native streaming. He details the compression, rendering, and distribution breakthroughs required to make volumetric media practical at consumer bandwidth.

You’ve moved from a physics background at Moscow State University to building fintech infrastructure at Netmonet, and later into generative video. When you design Gracia’s compression and rendering systems today, do you think of them primarily as computer graphics problems, or as probabilistic systems that need to remain stable across time?

When we design graphics compression and rendering, we face a couple of key challenges. And the answer is basically both: we think about it as a computer graphics problem and as a probabilistic system that needs to be stable over time.

We need to make sure the reconstruction quality is there, and the temporal stability you’re referring to is probably the most important part. You need to ensure temporal consistency between frames, and we do that by training frames together, rather than using a flipbook approach where you train each frame separately. That’s how we solve temporal consistency.

The second challenge — one we’re very proud to solve, and where we likely have some of the best results — is real-time rendering on a headset. We can render up to half a million splats per frame on a standalone headset, starting from Quest 3, at around 72 FPS, so there’s no motion sickness when you move your head around.
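The figures above imply a fairly concrete per-frame budget. As a back-of-the-envelope illustration (these are the quoted numbers, not details of Gracia's actual renderer):

```python
# Rough frame-time budget implied by the figures above:
# half a million splats per frame at ~72 FPS on a standalone headset.
fps = 72
frame_budget_ms = 1000 / fps            # ~13.9 ms to sort, cull, and draw each frame
splats_per_second = 500_000 * fps       # ~36M splats rasterized per second

print(round(frame_budget_ms, 1))        # 13.9
print(splats_per_second)                # 36000000
```

In other words, the renderer has under 14 milliseconds per frame to process the full splat set, which is why per-frame efficiency dominates the engineering effort.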

So, it’s both: achieving proper reconstruction quality, and making sure you can render it in real time on a headset.


By moving away from meshes and toward Gaussian representations, Gracia gains realism but loses some traditional affordances like rigging and physics-driven interaction. Do you believe 4D Gaussian Splatting characters will eventually be driven by game engine physics systems, or is this format better understood as a high-fidelity playback medium rather than a fully interactive one?

Eventually, I’m pretty sure Gaussian splatting will be animatable — or at least it will be possible to animate it. The technology is still very early. Right now, we’re solving problems like temporal consistency, and Gaussian splatting is mostly about reconstructing something as close as possible to real life. Sometimes it’s genuinely indistinguishable from real footage of a particular event. That could be a single character, multiple characters, or an entire scene captured together.

But today, it’s not really animated in the traditional sense. For static Gaussian splats, relighting is kind of a solved problem now. For dynamic splats, we’re working in parallel with one of our partners on relighting for 4D Gaussian splats, and we already have very promising early results. For the VFX projects we’re working on, we can provide relighting — which, in a way, is a form of “animating” the splats.

The way Gaussian splatting works is that it basically bakes how light behaves in the scene. So when you change the lighting, in some sense you’re animating the splats. I see that as a first step toward making splats editable and truly animatable in the future.

So, over roughly the next two years, the way we see it is that motion capture and volumetric capture will merge: you’ll capture characters volumetrically, represent and render them with Gaussian splats for photorealism, and then use mocap or rigging to animate them to some degree. It’s definitely something we’re thinking about, but it’s a bit down the line.


You’ve spoken about temporal consistency and infinite frame interpolation in 4DGS. From a technical perspective, is this interpolation purely mathematical and continuous in nature, or does it involve elements of generative inference when sampling moments that were never directly captured?

On temporal consistency and “infinite” frame interpolation — I’d say these are basically byproducts of how our pipeline works.

Temporal consistency was obviously one of our targets for 4DGS. That’s why we train frames together in batches, with overlap between batches. Each frame is aware of its neighbors, and because of that there’s no flickering.

But it’s not generative in any way. It’s pure math.

Imagine you capture at 30 FPS. You have two frames, and when you train frames together so each frame is aware of its neighbors, you can interpolate how the splats move from one frame to the next. That’s essentially how temporal consistency works, and it’s also related to how our compression works.

To be clear, we don’t interpolate frames in the original footage. But once the model is trained, you can render it at any frame rate. So even if it was captured at 30 FPS, you can still play it back at 90 FPS. Two out of three frames “in between” will effectively be interpolated at render time.

So these aren’t generated frames in the sense of generative inference — they’re frames you get by interpolation when you render. And we don’t really render just keyframes one after another; we render the whole sequence together — these splats and their trajectories over time. So overall: it’s a byproduct of the pipeline, not generative, and it’s pure math.
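The render-time interpolation described above can be sketched in a few lines. This is a toy illustration, assuming simple linear interpolation of per-splat parameters between trained keyframes; the function name `lerp_splats` and the array shapes are hypothetical, and the real pipeline renders learned trajectories rather than straight lines:

```python
import numpy as np

def lerp_splats(params_a, params_b, t):
    """Linearly interpolate per-splat parameters (e.g. centers) between
    two trained keyframes: t=0 gives frame A, t=1 gives frame B."""
    return (1.0 - t) * params_a + t * params_b

# Captured at 30 FPS, rendered at 90 FPS: two of every three rendered
# frames fall between captured keyframes and are sampled at render time.
capture_fps, render_fps = 30, 90
steps = render_fps // capture_fps                 # 3 rendered frames per keyframe pair
centers_a = np.zeros((1000, 3))                   # splat centers at keyframe k
centers_b = np.ones((1000, 3))                    # splat centers at keyframe k + 1

frames = [lerp_splats(centers_a, centers_b, i / steps) for i in range(steps)]
```

Because the in-between frames are deterministic functions of the trained keyframes, no generative sampling is involved, which matches the "pure math" characterization above.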


Some researchers see Gaussian Splatting as a transitional step rather than a final representation. In your view, is 4DGS an end-state for volumetric media, or a bridge toward an even more expressive future format?

I don’t know — the honest answer is I don’t know if it’s going to be the end state for volumetric media.

To me, it could end up being a final destination in terms of the underlying technology, but it’s definitely not a final destination in terms of creative capabilities — coming back to animation, relighting, and things like that. There’s still a lot to do and a lot of challenges to solve if you want Gaussian splatting to truly be the “final format” for volumetric media.

But at least today, we don’t really have many other options. From what we know right now, Gaussian splatting could become the final destination and the final file format — especially if more people adopt it, which seems to be happening. At the same time, there’s still a lot to be done around standardization: agreeing on file format conventions, how you store the original capture, and how you train it.

If the industry converges on those conventions, progress may stabilize around Gaussian splatting as the main representation, and then everyone will focus more on solving the creative challenges.


Even with significant compression gains, volumetric video remains extremely data-heavy. In practical terms, do you see Gracia’s near-term future as primarily download-first experiences, or are you actively pursuing progressive or streaming-based transmission of Gaussian data?

You’re absolutely right: file size has always been the biggest restriction for splats, and one of the biggest concerns for our clients — along with the fact that you typically needed a native application for rendering. But now I think we’ve basically solved both.

We’re very close to releasing WebXR support together with streaming. Let me elaborate.

When we started, it was about 100 GB per minute of volumetric video if you follow a flipbook approach — where you train and store each frame separately. That’s obviously insane and completely unusable for any consumer-facing use case.

A year ago, we were at around 10 GB per minute with compression already implemented. That was workable for some use cases — mostly professional ones like VFX, where large assets are normal and storage isn’t as much of a constraint — but still not really consumer-facing.

Recently, we achieved about 1 GB per minute on the same footage. So that’s roughly 100× compression compared to the original naive approach of storing frames separately.

Our target — already achieved internally — is around 75 Mbps, and now we’re implementing that on the product side. In a few weeks, we’ll release a streamable format. Practically, that means: you press play, and instead of downloading the entire piece of content upfront, you get progressive streaming — you don’t need to download the whole sequence before it starts rendering.

That’s a major milestone for us, because it makes the content much easier to try. And 75 Mbps is already within consumer-level bandwidth, so we can say we’ve addressed this major restriction.
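The quoted figures line up under a simple unit conversion (decimal gigabytes, 8 bits per byte) — a quick sanity check, not part of Gracia's pipeline:

```python
# Sanity-check the quoted figures: convert per-minute file size to a
# streaming bitrate (decimal GB, 8 bits per byte, 60 seconds per minute).
def gb_per_min_to_mbps(gb_per_min):
    return gb_per_min * 8_000 / 60      # GB/min -> megabits per second

print(round(gb_per_min_to_mbps(1.0)))   # ~133 Mbps for the 1 GB/min format
print(round(75 * 60 / 8_000, 2))        # 75 Mbps is ~0.56 GB per minute
```

So the 75 Mbps streaming target is a further reduction beyond the 1 GB-per-minute download format, comfortably inside typical consumer broadband.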

The other restriction was needing a native application. We’ve managed to integrate our rendering pipeline with WebXR — and specifically WebGPU — and we’re releasing these two things together because it makes sense as a package.

So now you won’t need to download an app. You’ll just go to a website — whether from a headset, a mobile device, or a laptop — see the content, press a button, and it plays. No app download, and no waiting for the full download either — you press play and it renders.

We’re very proud of it and super excited to release it and see what people say. In that sense, I think this is solved now.

And what’s especially important for us: yes, you can reduce file size by just reducing the number of splats — basically lowering the quality bar. But we’ve managed to get to streaming and WebXR without major compromises in quality, keeping the resolution and fidelity high while still reaching consumer-level file size and web-based distribution.


High-quality volumetric capture today still requires large multi-camera rigs. Do you see this as a temporary engineering constraint that will shrink over time, or as a fundamental physical cost of reconstructing reality at this level of fidelity?

Yes, at the moment, we rely on ground-truth information from cameras. That’s why the most straightforward way to get higher reconstruction quality is simply to add more cameras, which is obviously a dead end from a content production perspective.

Right now, the sweet spot is roughly 50 cameras for full 360-degree capture — an “outside-in” rig where cameras are distributed evenly around the subject. You can get similar results with fewer cameras if you reposition more of them toward the front to capture more detail there and compromise the back a bit. In that case, you can get down to around 35 cameras, for example.

But obviously, to make this scalable — and to enable semi-professional creators — you need to get to something like 4 to 8 cameras. And we’re actively working on that. I can’t disclose the exact approach, but what I can say is that we will definitely use diffusion models, because they’ve finally reached the point where they can be geometrically consistent and efficient enough to render images quickly.

So by leveraging diffusion models and the recent progress there, we plan to reduce the number of cameras required to four to eight. We’ve already started working on it, and we’ll keep you posted.

So no, it’s not going to be forever that ~50 cameras is the minimum to get really high-quality results without artifacts or major compromises. I can definitely see that in one to two years, it will be more like four to eight cameras, which will unlock semi-professionals to start producing this kind of content themselves.


Gracia already runs close to the performance limits of current standalone headsets. To what extent is your roadmap implicitly betting on continued advances in mobile and XR chip performance, and what happens if that curve slows down?

We don’t rely on any future improvements in XR chip performance. The idea is the following: Quest 3 already renders our content at native resolution, without any quality drop compared to PC VR.

Of course there’s still a bottleneck — you can’t render millions of splats. But half a million splats is usually more than enough to render multiple characters simultaneously, composited accurately with a 3D environment.

We’ve worked on many consumer-facing projects, and it’s almost never been the case that the splat budget was the main limitation. So from our perspective, headset performance is basically solved — it’s there. It’s not something we feel super constrained by right now.

So I wouldn’t say our roadmap is betting on chips getting better, because we don’t see that as the bottleneck at the moment.


As Gracia builds its own SDK and pipeline, platform risk becomes unavoidable. If operating systems from companies like Apple or Meta natively support their own volumetric or Gaussian-based formats, how does an independent infrastructure layer remain relevant?

We understand the risk, of course. Gaussian splatting has become a very hot topic, and a lot of big tech companies — not only XR players like Meta and Apple, but also companies like Netflix, Disney, and Paramount — are building their own Gaussian splatting pipelines.

The idea is this: to build a pipeline that actually delivers high-quality results, you need to do three things at the same time.

First is capture. You need to understand capturing deeply — how the content should be captured, what matters, what doesn’t. This is offline expertise. You need to roll up your sleeves and go to studios, work with volumetric stages, or build your own rig to do the capture properly.

Second is processing — basically training the models. This is the hardcore R&D part: solving problems like temporal consistency, file size and compression, efficiency, parallel training, and so on. Third is rendering: how you render on-device, how you make it efficient, and how you build integrations so your proprietary format is supported in video editing toolkits and in platforms like Unity or Unreal.

To get great results, you have to tie all three together and make sure every stage is aware of the previous one. That’s hard for big corporations, because you need three very different teams — with very different expertise — to collaborate tightly and deliver efficiently. That’s often a problem for large companies, and it’s usually something startups are better at.

For us, from day zero — and given our background — we’ve seen our ability to work across both offline and online as part of our competitive moat. It’s actually quite rare to find a team that combines operational, offline capture expertise with online R&D and engineering. And Gaussian splatting itself is still very niche, so the engineering and research profiles you need are extremely specialized. There simply aren’t that many people who can solve these challenges, and we’re proud to have them on board.

So yes, we understand the risk. But we think the way to win here is to move first and move fast — so that others end up adopting your approach. Speed is our major competitive advantage.


Investors have said they have been waiting a decade for this kind of content. In your view, what actually unlocks mass adoption of volumetric media: a hardware inflection point, a networking breakthrough, or the emergence of a widely accepted file and distribution standard?

On mass adoption, I think there are three challenges — three milestones you need to hit — and I’d say one of them is already solved.

The first is file size and compatibility with WebXR: making it super easy for consumers to launch and try content. I think we’ve solved that.

The other two things we believe will enable mass adoption of splats — of volumetric video in the form of splats — are:

  1. The number of cameras required. The fewer cameras you need, the cheaper content production becomes. So that’s clearly the cost side of the equation.
  2. Mass adoption of headsets. And it’s becoming clear that it’s not really about VR headsets — it’s about AR glasses. Over time, VR headsets will likely merge with AI glasses into a wearable device where people can see your eyes, and you can wear it outside — on the street, to the office — so essentially AR glasses.

It feels like it’s a few years away. But I do see the trend toward wearable AR glasses, and I think this is the most important point. Once the hardware shifts from something you only use at home — something you can’t really use in public — to something like Ray-Ban-style glasses that you can wear walking down the street without anyone pointing at you, that’s when you get true mass adoption of wearables.

So those are the three things: distribution in the sense of “how easy it is to try” (which I think is solved), the cost of production (driven by the number of cameras), and the devices that can render it.

And here I’m mainly talking about consumer-facing use cases. A large part of what we’re working on is also B2B for professionals — VFX, sports, and similar industries. In those cases, device adoption isn’t really the issue; it’s much more about capture and reducing the cost of capture. That’s essentially what we’re focused on right now.


As 4D Gaussian representations approach photographic realism, volumetric deepfakes become a real concern. Does Gracia see a role in authenticity verification or provenance systems to distinguish captured reality from edited or synthetic volumetric content?

4DGS is already highly photorealistic, but the deepfake concern is really about generating content — and that’s not what this technology does.

As I mentioned before, even animation isn’t solved yet. Not solved — honestly, not even close. It will take a few years.

And when it comes to “generating” 4DGS: without actually capturing someone, you can’t generate them volumetrically in 4DGS today. It’s also not realistic right now to just prompt a diffusion model with something like, “Generate me 60 camera views of a public figure in a volumetric setup,” and then train a 4DGS from that. It’s simply not geometrically consistent enough.

So I don’t think the concern about faking things — specifically for 4DGS — is really a factor at the moment. It might become relevant later down the road, but it’s not something on the table right now, because the technology is still early and it just doesn’t allow you to fake things in that way.


If text has compilers and video is beginning to have compilers, is Gracia ultimately trying to become a compiler for reality itself, and do you imagine future human memory being stored as flat recordings or as spaces people can re-enter and experience again?

I’m not sure I fully get the first part of the question, but on the second part — yes, I can definitely imagine memories being preserved and kept in a volumetric way.

One of the great applications of this tech is capturing your parents or your children — preserving family moments volumetrically. Right now you can really only do this in a studio, but soon enough you’ll be able to put a few iPhones together at home, capture your grandparents, and have that moment preserved volumetrically — so you can go back to it later and relive that particular moment in your life. That’s fascinating.

I’m not sure we’re talking about preserving all of reality in 4DGS in a volumetric way, but preserving specific moments you’d want to return to — definitely. Even today, that’s already possible in a controlled setup, and I think the technology will be widely used for it.

People are always looking for a 10× improvement over what was possible before. Today you can take photos, and maybe with AI you can animate them a bit or add some depth. Even with something like Apple Vision Pro, you can make a spatial photo — but it’s still not the same as capturing a moment volumetrically and being able to walk around it and see it from all angles.

That’s where you get the real 10× improvement in experience: volumetric capture and motion splatting. So yes — partially. I don’t think everything will be captured and preserved volumetrically, but the most important moments in life definitely will be.

Editor’s Note

This interview signals a shift in volumetric media: realism is no longer the bottleneck. Distribution is. As compression, WebXR rendering, and streaming converge, 4DGS begins to function as infrastructure rather than experiment.
