
How Depth Maps Turn Flat Photos into 3D

A photograph is a flat, two-dimensional grid of pixels. Yet, when you look at a photo of a mountain range or a city street, your brain instantly understands which objects are close enough to touch and which are miles away. This process of recovering the third dimension from a 2D image is called depth estimation.


What is a depth map?

A depth map is a specialized image where each pixel represents the distance from the camera to the object at that point. Unlike a standard photo that stores color (Red, Green, Blue), a depth map typically stores a single value per pixel, often visualized as a grayscale image.

  • Bright pixels (white) represent objects closest to the camera.
  • Dark pixels (black) represent objects furthest away.
  • Grays represent everything in between.

(Note that this is only a visualization convention. Some tools and file formats invert it, with bright meaning far, so always check which convention a given depth map uses.)
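Turning raw distance values into that grayscale visualization is a simple normalization. Here is a minimal NumPy sketch, assuming the bright-means-near convention described above (`depth_to_grayscale` is a hypothetical helper name, not from any library):

```python
import numpy as np

def depth_to_grayscale(depth):
    """Normalize a raw depth array (in meters) to an 8-bit grayscale image
    where near objects are bright and far objects are dark."""
    depth = np.asarray(depth, dtype=np.float64)
    near, far = depth.min(), depth.max()
    # Invert so that small distances map to high (bright) values.
    normalized = (far - depth) / (far - near)
    return (normalized * 255).astype(np.uint8)

# A toy 2x2 depth map: distances in meters.
depth = np.array([[1.0, 2.0],
                  [3.0, 5.0]])
gray = depth_to_grayscale(depth)
# The closest pixel (1.0 m) maps to 255; the farthest (5.0 m) maps to 0.
```

Flipping the subtraction (`depth - near` instead of `far - depth`) gives the opposite convention.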
Depth vs. Disparity: In technical contexts, you might hear the term “disparity.” While depth is the actual distance in meters, disparity is the apparent shift of an object between two viewpoints (such as a stereo camera pair). The two are inversely related: the closer an object, the larger its disparity.
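For a calibrated stereo pair, the inverse relationship is the standard pinhole formula Z = f·B / d, where f is the focal length in pixels, B the baseline between the cameras, and d the disparity in pixels. A small sketch (the rig parameters below are made-up illustration values):

```python
def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth in meters from stereo disparity: Z = f * B / d."""
    return focal_length_px * baseline_m / disparity_px

# Hypothetical stereo rig: 700 px focal length, 12 cm baseline.
depth_m = disparity_to_depth(disparity_px=35.0,
                             focal_length_px=700.0,
                             baseline_m=0.12)
# 700 * 0.12 / 35 = 2.4 meters
```

Note how halving the disparity doubles the estimated depth, which is exactly the inverse relation described above.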

How a single camera “sees” depth

Humans with two eyes use stereopsis to triangulate distance. But how does a single camera perceive depth? This is known as monocular depth estimation, and it relies on several visual cues:

[Diagram: visual cues for depth — perspective, occlusion, relative size]
  • Occlusion — If one object blocks another, the blocker is closer.
  • Relative Size — Familiar objects that appear smaller are perceived as further away.
  • Linear Perspective — Parallel lines appear to converge in the distance.
  • Texture Gradients — Fine details blur into uniform texture far away.
  • Atmospheric Perspective — Distant objects appear paler and bluer.

How AI models learn depth

Modern AI models, like Depth Anything, are trained on millions of images where the “ground truth” depth is known from LiDAR or stereo setups.

The Encoder-Decoder Architecture

The model converts a color image into a depth map through a specialized pipeline:

[Diagram: RGB input → Encoder (features) → Decoder (upsampling) → grayscale depth map]
  1. Encoder — A neural network compresses the image into abstract feature maps.
  2. Decoder — Upsamples those features back to full resolution, predicting a distance value for each pixel.
The magic of modern AI is its ability to understand context. It knows that a person standing on a sidewalk is likely closer than the building behind them.
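To make the encoder-decoder shape concrete, here is a toy NumPy sketch. It only mimics the data flow (downsample to coarse features, upsample back to a single-channel map); a real model like Depth Anything uses a deep learned encoder and decoder, not pooling and nearest-neighbor upsampling:

```python
import numpy as np

def encode(image, factor=4):
    """Toy 'encoder': average-pool the RGB image into a coarse feature grid.
    Real encoders are deep CNNs or vision transformers."""
    h, w, c = image.shape
    pooled = image.reshape(h // factor, factor,
                           w // factor, factor, c).mean(axis=(1, 3))
    return pooled.mean(axis=-1)  # collapse RGB into one abstract "feature"

def decode(features, factor=4):
    """Toy 'decoder': upsample features back to input resolution
    as a single-channel depth map (nearest-neighbor)."""
    return np.kron(features, np.ones((factor, factor)))

image = np.random.rand(64, 64, 3)   # stand-in RGB input
features = encode(image)            # (16, 16) coarse feature grid
depth_map = decode(features)        # (64, 64) single-channel output
```

The important point is the shape transformation: a 3-channel image goes in, a smaller abstract representation sits in the middle, and a full-resolution 1-channel map comes out.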

What can you do with depth maps?

Once you have a depth map, you can manipulate a 2D photo as if it were a 3D scene:

  • 3D Parallax — Shifting layers at different speeds for a wiggle effect.
  • Portrait Mode — Using the map as a mask to apply background blur.
  • VFX Compositing — Placing digital objects behind real-world ones.
  • 3D Reconstruction — Creating point clouds or meshes from depth values.
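Portrait-mode blur is the easiest of these to sketch: use the depth map as a mask, blur the whole image once, and composite the blurred version back in only where the depth exceeds a threshold. A minimal NumPy version with a naive box blur (the function names are illustrative, and real implementations use smooth, depth-weighted blending rather than a hard threshold):

```python
import numpy as np

def box_blur(image, r):
    """Naive box blur: average each pixel's (2r+1) x (2r+1) neighborhood."""
    padded = np.pad(image, ((r, r), (r, r), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros_like(image, dtype=np.float64)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2

def depth_bokeh(image, depth, threshold, blur_radius=3):
    """Keep pixels nearer than `threshold` sharp; blur everything beyond it."""
    blurred = box_blur(image, blur_radius)
    background = (depth > threshold)[..., None]  # boolean mask per pixel
    return np.where(background, blurred, image)

# Synthetic example: left half near (1 m), right half far (10 m).
rng = np.random.default_rng(0)
photo = rng.random((32, 32, 3))
depth = np.full((32, 32), 10.0)
depth[:, :16] = 1.0
result = depth_bokeh(photo, depth, threshold=5.0, blur_radius=2)
# Near pixels are returned untouched; far pixels are averaged with neighbors.
```

The 3D parallax effect works on the same principle, except the depth map drives per-layer horizontal shifts instead of a blur mask.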

Limitations and challenges

  • Relative vs. Metric Depth — Most models predict relative ordering (“A is closer than B”), not exact distances in meters.
  • Transparency — Glass and mirrors often confuse the model.
  • Edge Artifacts — Sharp boundaries can cause “halo” effects.
  • Repeated Patterns — Uniform areas provide few visual cues.
Depth estimation is the bridge between the 2D world of images and the 3D world we inhabit.

Try it yourself

Put what you learned into practice with our Depth Map Generator.