Ground Truth: How We Create Geometrically Accurate Reconstructions

Date: 5/27/2026

The core problem: raw imagery is not geometry

Whether imagery comes from a drone, a phone, or a 360° camera, the starting point is a collection of overlapping photos or video frames. While these images often carry location information in their metadata, they contain no depth information. A photo of a rooftop can tell you what the roof looks like; it can't tell you how high or how long it is.
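To make the missing-depth point concrete, here is a minimal sketch of an idealised pinhole camera. The focal length and image centre are illustrative values, not parameters of any particular device. Two points at different distances along the same viewing ray land on the same pixel, so the pixel alone cannot tell them apart.

```python
def project(point, focal_length=1000.0, cx=960.0, cy=540.0):
    """Project a 3D point (camera coordinates, metres) to pixel coordinates.

    focal_length, cx and cy are illustrative values (assumptions).
    """
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (focal_length * x / z + cx, focal_length * y / z + cy)


# Two points on the same viewing ray: one 10 m away, one 20 m away.
near = (1.0, 0.5, 10.0)
far = (2.0, 1.0, 20.0)

pixel_near = project(near)
pixel_far = project(far)

# Both points map to the same pixel: depth is lost in a single image.
print(pixel_near, pixel_far)
```

Recovering the lost distance needs at least one more view of the same point from a different, known position.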

Converting raw frames to 3D geometry requires solving two problems: figuring out where the camera was positioned when it captured each frame, and using those positions to reconstruct the physical structure of the scene.

Both problems get harder with scale. At city scale, with tens or hundreds of thousands of frames from multiple flights, the computational and alignment challenges mount quickly.

Working out where each camera was

The first stage is Structure-from-Motion (SfM): identifying matching visual features across overlapping images and using those correspondences to recover the precise position and orientation of the camera at the moment each frame was captured.

Effectively, it’s working backwards from "what did the camera see" to "where must the camera have been standing." If this step goes poorly, the resulting reconstruction can be distorted or unusable, regardless of how good the subsequent steps are.
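One common way to find correspondences between images is nearest-neighbour descriptor matching with a ratio test. The sketch below is a toy illustration with hand-written 3-element descriptors and an assumed threshold of 0.8. It is not Niantic Spatial's matcher, and real systems use learned or much higher-dimensional descriptors.

```python
import math


def match_features(desc_a, desc_b, ratio=0.8):
    """Return (i, j) pairs where desc_a[i] matches desc_b[j] unambiguously.

    A match is kept only if the best candidate is clearly closer than the
    second best (ratio test). The 0.8 threshold is an assumed value.
    """
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        if len(dists) < 2:
            continue
        (best, j), (second, _) = dists[0], dists[1]
        if best < ratio * second:
            matches.append((i, j))
    return matches


# Distinctive descriptors, as you might get from a richly textured surface.
textured_a = [(0.9, 0.1, 0.0), (0.0, 0.8, 0.2), (0.1, 0.1, 0.9)]
textured_b = [(0.1, 0.1, 0.8), (0.85, 0.15, 0.0), (0.0, 0.75, 0.25)]

# Nearly identical descriptors, as you might get from a blank wall.
blank_a = [(0.5, 0.5, 0.5), (0.5, 0.5, 0.51)]
blank_b = [(0.5, 0.51, 0.5), (0.51, 0.5, 0.5)]

textured_matches = match_features(textured_a, textured_b)
blank_matches = match_features(blank_a, blank_b)
print(textured_matches, blank_matches)
```

In this toy data every distinctive descriptor finds a partner, while the near-identical ones are all rejected as ambiguous, leaving nothing from which to estimate a camera pose.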

SfM is well established, but it breaks down on challenging scans. Surfaces with little visual texture, like plain concrete, glass, or reflective metal, give the algorithm few features to match across frames. Variable lighting, motion blur, large distances between capture positions (where two frames of the same feature were taken so far apart that the algorithm struggles to match them), and gaps in coverage create the same problem. These failure modes matter particularly for customers working with drone surveys, archival footage, or subsea video.

Niantic Spatial has an extensive history of training on, inferring from, and reconstructing from noisy data: non-expert capture, uneven coverage, and variable conditions. Building on that experience has produced a system that can handle challenging input data and large outdoor spaces. Published work underpinning this includes ACE (CVPR 2023), ACEZero and Scene Coordinate Reconstruction (ECCV 2024), MicKey (CVPR 2024), and ACE-G (ICCV 2025).

Turning camera positions into geometry

Once the position and orientation of every camera are known, the pipeline builds a 3D reconstruction. The output formats are point clouds, polygonal meshes, and 3D Gaussian splats (3DGS).
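The basic geometric step behind turning known cameras into 3D points is triangulation. The following is a minimal sketch using the textbook midpoint method for two viewing rays, with made-up camera centres and a made-up target. It illustrates the principle only and is not the pipeline's actual reconstruction method.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate a 3D point from two rays (camera centre c, direction d).

    Finds the closest points on each ray and returns their midpoint.
    """
    w = tuple(p - q for p, q in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; depth cannot be recovered")
    s = (b * e - c * d) / denom  # distance parameter along ray 1
    t = (a * e - b * d) / denom  # distance parameter along ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))


# Two hypothetical cameras one metre apart, both observing the same point.
camera_1 = (0.0, 0.0, 0.0)
camera_2 = (1.0, 0.0, 0.0)
ray_1 = (2.0, 1.0, 10.0)  # direction from camera_1 towards the point
ray_2 = (1.0, 1.0, 10.0)  # direction from camera_2 towards the point

point = triangulate_midpoint(camera_1, ray_1, camera_2, ray_2)
print(point)
```

If the camera positions or ray directions are wrong, the rays no longer meet near the true surface, which is why pose errors from the first stage carry through into the geometry.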

3DGS represents a scene as a large collection of small, semi-transparent, Gaussian (blob-like) shapes, each with a defined position, size, orientation, and color. In effect, it is a large 3D point cloud in which each point has volume and color rather than just position. Together, millions of these 3D Gaussians form a representation that can be rendered photorealistically from any viewpoint.
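As a rough illustration of that representation, the sketch below defines one Gaussian as a small record and blends a few of them along a single viewing ray, front to back. The field names and the `render_ray` helper are hypothetical, and the spatial falloff of each Gaussian is deliberately ignored to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Gaussian3D:
    position: tuple  # (x, y, z) centre; z is distance from the camera here
    scale: tuple     # per-axis size
    rotation: tuple  # orientation as a unit quaternion (w, x, y, z)
    color: tuple     # (r, g, b), each in [0, 1]
    opacity: float   # in [0, 1]; below 1 means semi-transparent


def render_ray(gaussians):
    """Front-to-back alpha blending of Gaussians hit by one ray.

    Simplification (assumption): every Gaussian contributes its full
    opacity, with no Gaussian falloff away from its centre.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for g in sorted(gaussians, key=lambda g: g.position[2]):
        weight = transmittance * g.opacity
        for k in range(3):
            pixel[k] += weight * g.color[k]
        transmittance *= 1.0 - g.opacity
    return tuple(pixel), transmittance


identity = (1.0, 0.0, 0.0, 0.0)
blue_wall = Gaussian3D((0.0, 0.0, 5.0), (1.0, 1.0, 0.1), identity, (0.0, 0.0, 1.0), 1.0)
red_glass = Gaussian3D((0.0, 0.0, 2.0), (0.5, 0.5, 0.1), identity, (1.0, 0.0, 0.0), 0.5)

pixel, remaining = render_ray([blue_wall, red_glass])
print(pixel, remaining)
```

The half-transparent red Gaussian in front lets half of the opaque blue one behind it show through, so the pixel comes out as an even mix of the two.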

The standard approach to building these representations optimises purely for visual quality: it produces images that look right, but doesn't enforce that the underlying geometry is physically accurate. In practice, this means surfaces can bleed into each other, low-texture areas develop holes, and the representation becomes unstable at object edges.

These artefacts are often invisible in rendered images but cause problems the moment someone tries to take a measurement, extract a mesh for simulation, or use the reconstruction as training data for a robot.

Niantic Spatial's pipeline applies depth and surface constraints derived from its depth estimation research, including SimpleRecon (ECCV 2022), DoubleTake (ECCV 2024), and MVSAnywhere (CVPR 2025). The result is a reconstruction where the underlying geometry is physically consistent, not just visually plausible.
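The idea of constraining geometry as well as appearance can be sketched as an extra term in the optimisation objective. Everything here is an assumption made for illustration: the L1 form of both terms, the `depth_weight` value, and the function names. The source does not describe the actual loss used in the pipeline.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def total_loss(rendered, target, rendered_depth, prior_depth, depth_weight=0.1):
    """Photometric term plus a weighted depth-consistency term.

    depth_weight is a hypothetical value chosen for this example.
    """
    return l1(rendered, target) + depth_weight * l1(rendered_depth, prior_depth)


# Two candidate reconstructions that render exactly the same pixels.
target_pixels = [0.2, 0.4, 0.6]
rendered_pixels = [0.2, 0.4, 0.6]

prior_depth = [5.0, 5.0, 5.0]   # metres, from a depth estimator
depth_flat = [5.0, 5.0, 5.0]    # agrees with the prior
depth_spiky = [5.0, 9.0, 5.0]   # one sample floats 4 m off the surface

loss_flat = total_loss(rendered_pixels, target_pixels, depth_flat, prior_depth)
loss_spiky = total_loss(rendered_pixels, target_pixels, depth_spiky, prior_depth)
print(loss_flat, loss_spiky)
```

An image-only objective would score both candidates identically. With the depth term included, the candidate whose geometry drifts from the prior is penalised even though its rendering looks the same.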

Reconstructions are also georeferenced: each point in the model is assigned real-world coordinates, tied to the same coordinate system used by mapping and geospatial infrastructure globally. This means you can drop it into a map, compare captures of the same location taken months apart, or hand the asset to another system that needs to know where the capture was taken.
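As an example of what real-world coordinates involve, the sketch below converts latitude, longitude, and height into Earth-centred Cartesian coordinates using the standard WGS 84 ellipsoid formulas. The choice of WGS 84 is an assumption made for illustration; the source does not name the coordinate system the pipeline uses.

```python
import math

# WGS 84 ellipsoid constants (assumed reference system for this example).
WGS84_A = 6378137.0                # semi-major axis, metres
WGS84_F = 1.0 / 298.257223563      # flattening
WGS84_E2 = WGS84_F * (2.0 - WGS84_F)  # first eccentricity squared


def geodetic_to_ecef(lat_deg, lon_deg, height_m):
    """Convert geodetic coordinates to Earth-centred, Earth-fixed metres."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    # Radius of curvature in the prime vertical.
    n = WGS84_A / math.sqrt(1.0 - WGS84_E2 * math.sin(lat) ** 2)
    x = (n + height_m) * math.cos(lat) * math.cos(lon)
    y = (n + height_m) * math.cos(lat) * math.sin(lon)
    z = (n * (1.0 - WGS84_E2) + height_m) * math.sin(lat)
    return x, y, z


equator = geodetic_to_ecef(0.0, 0.0, 0.0)
pole = geodetic_to_ecef(90.0, 0.0, 0.0)
print(equator, pole)
```

Once two captures are expressed in one shared frame like this, the same physical point has the same coordinates in both, which is what makes overlaying and comparing them straightforward.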

Compression and streaming at scale

A city-scale Gaussian splat reconstruction in its raw form can run to hundreds of gigabytes. That is unworkable for storage and delivery, and it makes the asset difficult to share across teams or integrate into existing workflows.

Niantic Spatial developed and open-sourced SPZ, a compression format for Gaussian splat data that reduces file size by approximately 90% compared to uncompressed PLY, with negligible loss of visual quality. SPZ is supported across Cesium 3D Tiles, glTF, ESRI ArcGIS, and SuperSplat without specialist decoders. The recently released SPZ 4 brings parallel compression streams, 3-5x faster encoding, and roughly half the load time for large scenes.
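A back-of-envelope calculation shows how these figures fit together. The sketch assumes 62 32-bit floats per Gaussian, which matches a commonly used uncompressed 3DGS PLY layout, and a hypothetical scene of one billion Gaussians. Neither number comes from the source; only the approximate 90% reduction does.

```python
FLOATS_PER_GAUSSIAN = 62  # assumption: position, normals, colour/SH, opacity, scale, rotation
BYTES_PER_FLOAT = 4


def raw_size_gb(num_gaussians):
    """Approximate uncompressed size in gigabytes (10**9 bytes)."""
    return num_gaussians * FLOATS_PER_GAUSSIAN * BYTES_PER_FLOAT / 1e9


def compressed_size_gb(raw_gb, reduction=0.90):
    """Size after the approximate 90% reduction quoted for SPZ."""
    return raw_gb * (1.0 - reduction)


num_gaussians = 1_000_000_000  # hypothetical city-scale scene
raw_gb = raw_size_gb(num_gaussians)
spz_gb = compressed_size_gb(raw_gb)
print(f"raw: {raw_gb:.1f} GB, compressed: {spz_gb:.1f} GB")
```

Under these assumptions, a scene that would occupy roughly 250 GB uncompressed comes down to about 25 GB.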

This matters most as splats get larger. Today we announced a partnership with Spexi Geospatial that puts this pipeline to work at city scale, combining Spexi's aerial data network of 10,000-plus drone pilots with Niantic Spatial's reconstruction technology.

What the output is used for

The value of our reconstructions comes from their geometric accuracy and the real-world coordinates embedded in the output, not just their visual fidelity.

Current use cases include infrastructure inspection, where the measurement tools support remote assessment of assets without site visits; insurance risk assessment, where high-resolution geometry captures structural detail that satellite imagery might miss; energy site analysis and asset management, where multi-temporal captures can support change detection over time; and robotics, where geometry-accurate reconstructions of real environments provide training data that can reduce the sim-to-real gap in autonomous navigation.

For physical AI, the ability to produce geometrically accurate, real-world-grounded models at city scale is what makes reconstructions useful as training data. Synthetic simulation environments are easier to generate but don't capture the complexity of real physical spaces.

The pipeline also produces Visual Positioning System (VPS) maps as an output, supporting downstream field-worker and robot localisation in the same environments where reconstruction has been run.
