
Text Manipulation: Cases, Encoding, and Character Sets

Why does JavaScript use camelCase while Python uses snake_case? Why does toLowerCase() break Turkish text? And what does a Roman orator from 45 BC have to do with placeholder text? The answers reveal surprising depth in something we take for granted: manipulating text.


Case conventions and why they matter

Naming conventions aren't cosmetic. They signal which language or framework you're working in, distinguish variables from classes, and make code scannable. Using the wrong convention in a codebase is like using British spelling in an American newspaper — technically correct, but conspicuously out of place.

Convention        Example           Used in
camelCase         getUserName       JavaScript, Java, TypeScript
PascalCase        GetUserName       C#, React components
snake_case        get_user_name     Python, Ruby, Rust
kebab-case        get-user-name     CSS, HTML attributes, URLs
SCREAMING_SNAKE   MAX_RETRY_COUNT   Constants (all languages)
dot.case          get.user.name     Java properties, configs

The names are descriptive: camelCase has humps in the middle, snake_case crawls along the ground, and kebab-case is skewered by hyphens. The choices aren't entirely arbitrary, either: they are partly shaped by each language's syntax. Python can't use hyphens in identifiers (a hyphen would parse as subtraction), so it uses underscores. CSS properties can't use underscores in the traditional syntax, so they use hyphens.
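Converting between these conventions comes down to two steps: split an identifier into words, then rejoin the words with the target style. The sketch below is one way to do that for ASCII identifiers. The helper names (splitWords, toSnakeCase, and so on) are made up for this example and do not come from any library.

```javascript
// Split an identifier written in any of the conventions above into lowercase words.
// Assumes ASCII identifiers; hypothetical helper, not a library function.
function splitWords(identifier) {
  return identifier
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")    // camelCase boundary: "getUser" -> "get User"
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1 $2") // acronym boundary: "HTMLParser" -> "HTML Parser"
    .split(/[\s_.\-]+/)                        // spaces, underscores, dots, hyphens
    .filter(Boolean)
    .map((word) => word.toLowerCase());
}

const capitalize = (word) => word[0].toUpperCase() + word.slice(1);

const toSnakeCase = (id) => splitWords(id).join("_");
const toKebabCase = (id) => splitWords(id).join("-");
const toScreamingSnake = (id) => splitWords(id).join("_").toUpperCase();
const toPascalCase = (id) => splitWords(id).map(capitalize).join("");
const toCamelCase = (id) =>
  splitWords(id)
    .map((word, index) => (index === 0 ? word : capitalize(word)))
    .join("");

console.log(toSnakeCase("getUserName"));     // get_user_name
console.log(toCamelCase("MAX_RETRY_COUNT")); // maxRetryCount
console.log(toKebabCase("HTMLParser"));      // html-parser
```

Because every converter goes through the same word list, adding another convention only needs a new join rule.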


ASCII: the first 128 characters

ASCII (American Standard Code for Information Interchange) was first published in 1963 and gained its lowercase letters in the 1967 revision. It assigns a number to 128 characters: the English alphabet (upper and lowercase), digits 0–9, punctuation, and 33 control characters like tab and newline. That's it. No accents, no Chinese characters, no emoji.

Character → Code point
'A'       → 65
'a'       → 97
'0'       → 48
' '       → 32
'\n'      → 10

Upper → lower: add 32  (65 + 32 = 97)
Lower → upper: subtract 32
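
The add-32 rule is easy to try out. The following is a minimal sketch with two hypothetical helpers, asciiToLower and asciiToUpper, that shift only the letters A–Z and a–z and pass every other character through unchanged.

```javascript
// Look up a character's code with charCodeAt, and go back with String.fromCharCode.
console.log("A".charCodeAt(0));       // 65
console.log(String.fromCharCode(97)); // a

// ASCII-only lowercase: shift codes 65-90 ('A'-'Z') up by 32.
function asciiToLower(text) {
  let out = "";
  for (const ch of text) {
    const code = ch.charCodeAt(0);
    out += code >= 65 && code <= 90 ? String.fromCharCode(code + 32) : ch;
  }
  return out;
}

// ASCII-only uppercase: shift codes 97-122 ('a'-'z') down by 32.
function asciiToUpper(text) {
  let out = "";
  for (const ch of text) {
    const code = ch.charCodeAt(0);
    out += code >= 97 && code <= 122 ? String.fromCharCode(code - 32) : ch;
  }
  return out;
}

console.log(asciiToLower("Hello, World 42!")); // hello, world 42!
console.log(asciiToUpper("caf\u00e9"));        // CAFé (the accented letter is not ASCII)
```

The offset is 32 because upper and lowercase letters differ in a single bit (0x20), which made case changes cheap on early hardware.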

ASCII worked fine for American English. But the world has thousands of writing systems. ASCII couldn't even handle French accents, let alone Japanese or Arabic script.

Unicode and UTF-8: text for the whole world

Unicode aims to catalog every character used in human writing, past and present, along with emoji, musical notation, mathematical symbols, and ancient scripts. Each character gets a unique code point. As of 2024 (Unicode 16.0), the standard defines over 154,000 characters.

UTF-8 is the encoding that turns those code points into actual bytes in a file. It's backwards-compatible with ASCII (every ASCII file is valid UTF-8) and uses 1–4 bytes per character:

'A'   → 1 byte   (U+0041)  — same as ASCII
'é'   → 2 bytes  (U+00E9)
'中'  → 3 bytes  (U+4E2D)
'🎉'  → 4 bytes  (U+1F389)

UTF-8 dominates the web. Over 98% of all web pages use UTF-8 encoding. It won because it's compact for English text (1 byte per character, same as ASCII) while still supporting every language on Earth.
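
The byte counts listed above can be checked with the standard TextEncoder API, which always encodes to UTF-8. In this sketch, utf8Bytes and utf8Length are hypothetical helper names. The range boundaries in utf8Length are the ones the UTF-8 encoding defines.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Return the UTF-8 bytes of a string as a plain array of numbers.
function utf8Bytes(text) {
  return Array.from(encoder.encode(text));
}

// Number of UTF-8 bytes needed for one code point, based on its numeric range.
function utf8Length(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII range
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;                          // up to U+10FFFF
}

// "A", "é", "中", "🎉" written as escapes so the source file's own encoding doesn't matter.
for (const ch of ["A", "\u00e9", "\u4e2d", "\u{1F389}"]) {
  const hex = utf8Bytes(ch)
    .map((byte) => byte.toString(16).padStart(2, "0"))
    .join(" ");
  console.log(ch, utf8Length(ch.codePointAt(0)), "byte(s):", hex);
}
```

Note that "🎉".length is 2 in JavaScript, because strings count UTF-16 code units rather than characters or UTF-8 bytes.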

Why toLowerCase() is harder than it looks

In English, converting case is simple: shift each letter's ASCII code by 32. But many languages have special rules. The most famous is the Turkish İ problem.

In Turkish, the uppercase of i is İ (I with a dot), not I. And the lowercase of I is ı (dotless i), not i. A naive toLowerCase() implementation breaks Turkish text:

// Default mapping (locale-independent)
"FILE".toLowerCase()              // → "file"

// Turkish locale
"FILE".toLocaleLowerCase('tr')    // → "fıle" (dotless ı)

// The bug: toLowerCase() ignores the language of the text
"TITLE".toLowerCase()             // → "title", but Turkish text needs "tıtle"
"TITLE".toLocaleLowerCase('tr')   // → "tıtle"
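
One common way to avoid this class of bug is to decide up front whether a string is text for people or a key for machines. The sketch below assumes that split. lowerForDisplay and lowerForKey are hypothetical names, and the locale argument is a BCP 47 language tag that the caller is assumed to know.

```javascript
// Text shown to people: lowercase according to the language of the text.
function lowerForDisplay(text, locale) {
  return text.toLocaleLowerCase(locale);
}

// Machine-readable keys (file extensions, header names, enum values):
// use the locale-independent mapping so results never vary with user settings.
function lowerForKey(text) {
  return text.toLowerCase();
}

console.log(lowerForDisplay("TITLE", "tr"));     // tıtle
console.log(lowerForDisplay("TITLE", "en-US"));  // title
console.log(lowerForKey("TITLE"));               // title
console.log("istanbul".toLocaleUpperCase("tr")); // İSTANBUL
```

Keeping the two paths separate means a Turkish user's settings can't change how a key such as "TITLE" is normalized, while Turkish prose still gets the correct dotless ı.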

Other case-conversion surprises

  • German ß (sharp s) uppercases to SS — one character becomes two
  • Greek Σ has two lowercase forms: σ (mid-word) and ς (end of word)
  • Some Unicode characters change length when case-converted, breaking assumptions about string length
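
These surprises are easy to reproduce with the built-in string methods. The snippet below is a small sketch. caseLengthReport is a hypothetical helper that compares a string's length before and after conversion, measured in UTF-16 code units as JavaScript's length property does.

```javascript
// German sharp s: one character becomes two when uppercased.
console.log("straße".toUpperCase()); // STRASSE

// Greek sigma: the lowercase form depends on position in the word.
console.log("ΟΔΟΣ".toLowerCase());   // οδος (final sigma ς at the end)
console.log("Σ".toLowerCase());      // σ

// Compare string lengths before and after case conversion.
function caseLengthReport(text) {
  return {
    original: text.length,
    upper: text.toUpperCase().length,
    lower: text.toLowerCase().length,
  };
}

console.log(caseLengthReport("straße")); // upper is one longer than original
console.log(caseLengthReport("\u0130")); // İ: lower becomes "i" plus a combining dot
```

Code that allocates a fixed-size buffer or reuses character indexes after changing case should measure the converted string instead of assuming the length is unchanged.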

Lorem Ipsum: placeholder text from ancient Rome

Lorem ipsum dolor sit amet... — every designer and developer has seen this text. It looks like Latin, and it almost is. The text is a scrambled excerpt from De Finibus Bonorum et Malorum (On the Ends of Good and Evil) by the Roman philosopher Cicero, written in 45 BC.

A typesetter in the 1500s scrambled the original text to create a block of content that looked like natural language without being readable. The goal was to let designers focus on visual layout without being distracted by meaningful content. The same principle applies today: Lorem Ipsum fills a page so you can evaluate typography, spacing, and structure.

The key phrase Lorem ipsum itself comes from a broken sentence. The original reads “dolorem ipsum” (“pain itself”). The typesetter sliced it mid-word, creating “Lorem ipsum” — which is not actual Latin.

Why not use real text? Because real content draws attention to itself. Stakeholders start editing copy instead of reviewing layout. Nonsense text keeps the focus on design. That's why Lorem Ipsum has survived for over 500 years of print and digital publishing.

Text seems simple until you zoom in. Case conversion, character encoding, and even placeholder text carry centuries of history and surprising technical depth.
