Diff Algorithms and How Computers Compare Text
How diff tools find the minimum set of changes between two files: the longest common subsequence problem, Myers' O(ND) algorithm, patience diff, and the unified output format.
Every time you run git diff, review a pull request, or compare two versions of a document, an algorithm is solving one of computer science's classic problems: given two sequences of text, what is the minimum set of changes that transforms one into the other? The answer is more subtle than it sounds — and the algorithms behind it shaped the tools developers use every day.
At the heart of every diff tool is the Longest Common Subsequence (LCS) problem. Given two sequences, find the longest subsequence present in both. Unlike a substring, a subsequence doesn't need to be contiguous — it just needs to appear in the same relative order.
Sequence A: A B C D F
Sequence B: A C B C F
LCS: A B C F (length 4)
The characters A, B, C, F appear in both
sequences in the same relative order.

Once you know the LCS, everything not in it represents a change. Characters in A but not in the LCS were deleted. Characters in B but not in the LCS were inserted. This is why diff output shows lines prefixed with - (removed) and + (added).
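That relationship can be sketched directly: given any common subsequence, walking both inputs against it yields the - and + lines. A minimal illustration (the helper name edits_from_lcs is our own, not from any diff tool):

```python
def edits_from_lcs(a, b, common):
    """Turn a known common subsequence into an edit script.

    '-' items exist only in a (deleted), '+' only in b (inserted),
    ' ' items belong to the common subsequence.
    """
    out, i, j = [], 0, 0
    for c in common:
        while i < len(a) and a[i] != c:   # everything before the next match in a
            out.append("-" + a[i]); i += 1
        while j < len(b) and b[j] != c:   # everything before the next match in b
            out.append("+" + b[j]); j += 1
        out.append(" " + c); i += 1; j += 1
    out.extend("-" + x for x in a[i:])    # trailing deletions
    out.extend("+" + x for x in b[j:])    # trailing insertions
    return out

# The example from above: LCS("ABCDF", "ACBCF") = "ABCF"
print(edits_from_lcs("ABCDF", "ACBCF", "ABCF"))
# → [' A', '+C', ' B', ' C', '-D', ' F']
```

Applying that script to "ABCDF" (keep A, insert C, keep B, keep C, delete D, keep F) reproduces "ACBCF".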
The textbook approach uses dynamic programming. Build an (m+1) × (n+1) table where m and n are the lengths of the two sequences. Each cell dp[i][j] stores the length of the LCS of the first i characters of A and the first j characters of B. The time complexity is O(mn), and so is the space.
For two files with 10,000 lines each, that's a 100-million-cell table. It works, but it's slow. Real-world diff tools need something faster.
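As a sketch, the textbook table plus a backtrack to recover one LCS might look like this in Python (illustrative only; real diff tools operate on lines rather than characters):

```python
def lcs(a, b):
    m, n = len(a), len(b)
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack through the table to recover one LCS
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("ABCDF", "ACBCF"))  # → ABCF
```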
In 1986, Eugene Myers published “An O(ND) Difference Algorithm and Its Variations” — the algorithm that powers diff, Git, and most modern diff tools. Instead of filling an entire m × n grid, Myers' algorithm works in terms of the edit distance D (the number of insertions plus deletions).
Myers reframes diffing as a shortest-path problem on an edit graph. Imagine a grid where moving right is a deletion, moving down is an insertion, and moving diagonally (when characters match) is free. The goal is to reach the bottom-right corner with the fewest non-diagonal moves.
The key insight: Myers' algorithm is O(ND) where D is the edit distance. When two files are mostly similar (small D), the algorithm is nearly linear. Only when files are completely different does it degrade to O(N²). Since most diffs in practice involve small changes to large files, this is exactly the right trade-off.
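A compact sketch of Myers' greedy forward pass, following the pseudocode in the 1986 paper with a dictionary standing in for the V array. This version returns only the edit distance D; a full tool would also record the path taken in order to emit the edit script:

```python
def myers_distance(a, b):
    """Return the minimum number of insertions plus deletions (D)."""
    n, m = len(a), len(b)
    v = {1: 0}  # v[k]: furthest x reached on diagonal k = x - y
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Extend from whichever neighbouring diagonal reaches further
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]          # move down: an insertion
            else:
                x = v[k - 1] + 1      # move right: a deletion
            y = x - k
            # Follow the "snake": diagonal moves on matches are free
            while x < n and y < m and a[x] == b[y]:
                x += 1; y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m

# The worked example from Myers' 1986 paper:
print(myers_distance("ABCABBA", "CBABAC"))  # → 5
```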
Myers' algorithm finds a shortest edit script, but not necessarily the most readable one. When two functions are reordered in a file, Myers might match closing braces from one function with opening braces of another, producing a confusing diff.
Patience diff (invented by Bram Cohen of BitTorrent fame) takes a different approach:
1. Find the lines that appear exactly once in both files; these unique lines make reliable anchor points.
2. Match those anchors by computing the longest increasing subsequence of their positions, using the patience sorting technique that gives the algorithm its name.
3. Recurse on the regions between matched anchors, falling back to a standard diff for what remains.
The result is structurally cleaner. Git supports it via git diff --patience, and many developers set it as their default because the output aligns with how humans think about code changes: function-by-function, not line-by-line.
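The core of patience diff, matching lines that are unique to both files via patience sorting, can be sketched as follows (the helper name patience_anchors is our own):

```python
import bisect
from collections import Counter

def patience_anchors(a_lines, b_lines):
    """Match lines unique to both files, keeping them in order.

    Returns (index_in_a, index_in_b) pairs: the longest increasing
    subsequence of unique-line positions, found via patience sorting.
    """
    ca, cb = Counter(a_lines), Counter(b_lines)
    b_pos = {line: i for i, line in enumerate(b_lines)}
    pairs = [(i, b_pos[line]) for i, line in enumerate(a_lines)
             if ca[line] == 1 and cb.get(line) == 1]
    piles, tops, back = [], [], []  # pile tails, pair index atop each pile, backpointers
    for idx, (_, bj) in enumerate(pairs):
        p = bisect.bisect_left(piles, bj)
        if p == len(piles):
            piles.append(bj); tops.append(idx)
        else:
            piles[p] = bj; tops[p] = idx
        back.append(tops[p - 1] if p > 0 else -1)
    # Walk backpointers from the top card of the last pile
    out, idx = [], tops[-1] if tops else -1
    while idx != -1:
        out.append(pairs[idx]); idx = back[idx]
    return out[::-1]

# Reusing the earlier sequences as "files" of one-character lines;
# C is excluded because it repeats in the second sequence.
print(patience_anchors(list("ABCDF"), list("ACBCF")))  # → [(0, 0), (1, 2), (4, 4)]
```

A real patience diff would then recurse on the gaps between those anchor pairs.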
The most common format, used by git diff and patch files. Lines starting with - are removed, + are added, and a space prefix means unchanged context. The header shows line ranges:
--- a/hello.txt
+++ b/hello.txt
@@ -1,4 +1,4 @@
Hello
-World
+Universe
This is
a test

The @@ -1,4 +1,4 @@ means: starting at line 1 of the old file (showing 4 lines) and line 1 of the new file (showing 4 lines). This compact format is what makes git apply and patch workflows possible.
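Python's standard library can reproduce this format: difflib.unified_diff emits the same headers, hunk markers, and line prefixes for the example above.

```python
import difflib

# The two versions of hello.txt from the example above
old = ["Hello\n", "World\n", "This is\n", "a test\n"]
new = ["Hello\n", "Universe\n", "This is\n", "a test\n"]

diff_text = "".join(difflib.unified_diff(
    old, new, fromfile="a/hello.txt", tofile="b/hello.txt"))
print(diff_text)
```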
Shows the old and new versions in parallel columns. Easier for humans to read when reviewing large changes, but uses twice the horizontal space. Most code review tools (GitHub, GitLab) offer both views because neither is universally better — unified is more compact, side-by-side is more visual.
Computing the exact minimum edit distance is fundamentally expensive. The key complexity classes: the textbook dynamic-programming solution costs O(mn) in both time and space, while Myers' algorithm costs O(ND), which is near-linear when the edit distance D is small but degrades to O(N²) when the files share little in common.
Histogram diff, available via git diff --histogram, is an optimisation of patience diff that handles repeated lines better. As of 2023, the Bram Cohen and Myers approaches remain dominant. No fundamentally faster general algorithm has been found: the fastest known exact LCS algorithms run in roughly O(N² / log N) time, and lower bounds under certain computational models indicate that dramatically faster exact solutions are unlikely.
Diff algorithms are a beautiful example of theory meeting practice. The textbook solution is too slow, the fast solution only works because real-world changes are small, and the “best” algorithm depends on whether you optimise for speed or readability.