Why a single emoji breaks string length
Text is not a single thing — it's four layers (bytes, code units, code points, graphemes), and almost every encoding bug is confusion between two of them. This post walks through how computers actually represent text, why a JavaScript emoji has length 2, when to use UTF-8 vs UTF-16, what base64 actually solves, how to fix mojibake, and a field guide for debugging encoding problems.
A form on a billing page shipped with maxLength="100" and a helpful placeholder that said "Up to 100 characters." A QA engineer in Tokyo tested it with a message containing emoji and reported the form cut off at 50 characters. The same form on the same message on Safari accepted all 100. Same HTML. Same JavaScript. Different results.
The bug was a 21-bit Unicode code point trying to fit into a 16-bit code unit. The thumbs-up emoji 👍 takes 2 "characters" in JavaScript's string length because JavaScript internally counts UTF-16 code units, and emoji in the supplementary planes require two of them. Chrome's maxLength implementation counted code units; Safari counted code points. Safari was technically wrong (the spec says code units), but users in Safari got the expected experience. In Chrome, a user typing emoji hit the limit twice as fast.
That's the kind of bug that happens when you think "text" is one thing. It isn't. Text is at least four layers: bytes on disk, code units in memory, code points in Unicode, and glyphs on screen. Every encoding topic — ASCII, UTF-8, UTF-16, binary representations, hex, base64, URL encoding, HTML entities — is about the boundaries between those layers and the bugs that happen when you cross them wrong.
This post walks through all of it. Not because you need to know Unicode history to ship, but because the moment an international user pastes Cyrillic into your app, or a customer name comes out of the database as æ–°æµªæ–°é—» instead of 新浪新闻, you need to know where to look. It threads together how computers represent text as bytes, why UTF-8 won, what surrogate pairs are, when base64 is the right choice, the classic mojibake bug and how to fix it, and a field guide for debugging encoding problems when you see them.
Text is not a thing
Strip away all the abstractions and there are four distinct layers whenever a character goes from input to display:
1. Bytes. The raw 8-bit values stored on disk, in memory, or on the wire. 72 is a byte. 0x48 is the same byte. 01001000 is the same byte. None of those say "H" yet.
2. Code units. The smallest unit of an encoding. UTF-8 code units are 1 byte. UTF-16 code units are 2 bytes. UTF-32 code units are 4 bytes. Characters that fit in one code unit don't need anything special. Characters that don't — like most emoji in UTF-16 — need multiple code units.
3. Code points. The abstract Unicode number assigned to a character. U+0041 is 'A'. U+1F44D is 👍. Code points are independent of how they're stored.
4. Glyphs. What you see on screen. The font on your device converts code points into visual shapes. A code point without a font is a square box.
Most text bugs happen because two parts of a system disagree on which layer they're operating in. The form-field story at the top is a code-units-vs-code-points disagreement. Mojibake ("Ã©" instead of "é") is a bytes-vs-code-units disagreement: the bytes are fine, but they're decoded with the wrong encoding. MySQL's original utf8 accepting 3-byte characters but not 4-byte emoji is the same kind of disagreement. The debugging skill is knowing which layer is which.
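To make the layers concrete, here is a small Python sketch that shows one string at three of the four layers. The helper name `layers` is made up for illustration, and glyphs are left out because rendering depends on the font.

```python
def layers(text: str) -> dict:
    """Return one string viewed at the byte, code-unit, and code-point layers."""
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")  # little-endian, no BOM
    return {
        "utf8_bytes": [f"{b:02X}" for b in utf8],
        # Each UTF-16 code unit is 2 bytes, so read the bytes in pairs.
        "utf16_code_units": [
            f"{int.from_bytes(utf16[i:i + 2], 'little'):04X}"
            for i in range(0, len(utf16), 2)
        ],
        "code_points": [f"U+{ord(ch):04X}" for ch in text],
    }


view = layers("A\u00e9\U0001F44D")  # "A", "é", thumbs-up
for name, values in view.items():
    print(name, values)
```

Three characters come out as 7 UTF-8 bytes, 4 UTF-16 code units, and 3 code points, which is exactly the kind of mismatch the bugs in this post grow from.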
Our text to binary, text to hex, and Unicode converter tools make these layers visible side by side — paste a string and see the bytes, the code units, the code points, and the glyph all at once.
ASCII and the 128-character world
Before Unicode, there was ASCII. 7 bits per character. 128 possible values. Defined in the 1960s by a committee of American telecommunications and computing companies for telegraph-era needs. The full table covers:
- Uppercase letters A-Z (codes 65-90)
- Lowercase letters a-z (codes 97-122)
- Digits 0-9 (codes 48-57)
- Common punctuation (codes 32-47, 58-64, 91-96, 123-126)
- Control characters for early terminals (codes 0-31 and 127 — things like tab, newline, backspace, bell)
The range fits in 7 bits because 2^7 = 128. A byte is 8 bits though, so every ASCII character has a spare high bit that was historically used for parity (error detection on noisy phone lines) or later repurposed by various "extended ASCII" encodings that defined characters for codes 128-255.
ASCII's lasting influence is that its 128 characters sit at the same numeric values in every modern encoding. A capital A is code point 65 in ASCII, Unicode, UTF-8, UTF-16, and UTF-32. That compatibility is why you can take a random byte stream, interpret it as ASCII, and often get English text out. It's also why UTF-8 was designed to preserve ASCII compatibility: an ASCII-only file is byte-identical in UTF-8.
ASCII in binary and hex
"Hello" is five ASCII characters. Same string, four different representations depending on what layer you're looking at:
As characters: H e l l o
As decimal: 72 101 108 108 111
As hex: 48 65 6C 6C 6F
As binary: 01001000 01100101 01101100 01101100 01101111
Every one of those lines is the same 5 bytes. Hex is the most useful human-readable form because each byte is exactly 2 hex characters (since 2^8 = 256 = 16^2). Binary is useful for teaching and debugging but no one writes 01001000 01100101 01101100 01101100 01101111 in production code.
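The same four views can be produced in a few lines of Python. This is a minimal sketch that assumes pure ASCII input; the variable names are only illustrative.

```python
text = "Hello"
data = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input

decimal = " ".join(str(b) for b in data)
hexa = " ".join(f"{b:02X}" for b in data)
binary = " ".join(f"{b:08b}" for b in data)

print(decimal)
print(hexa)
print(binary)

# Going back: parse the hex form into bytes, then decode them as ASCII.
round_trip = bytes.fromhex(hexa).decode("ascii")
print(round_trip)
```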
Our text to binary converter produces the binary form, text to hex produces the hex form, and both work in reverse — paste 48 65 6C 6C 6F into hex to text to get "Hello" back. The per-character breakdown in those tools shows character, binary, hex, and decimal simultaneously, which is the fastest way to build intuition for what's happening.
One catch: those tools use JavaScript's charCodeAt() which returns UTF-16 code units. For pure ASCII text that's identical to the byte value, but for non-ASCII characters you'll see UTF-16 code unit values, not UTF-8 bytes. The difference matters when you hit the Unicode section below.
Unicode is not an encoding
This is the single most confused concept in the whole topic.
Unicode is a catalog. It assigns a unique number (a code point) to every character in every writing system. Latin A is U+0041. Greek alpha is U+03B1. CJK ideograph 猫 is U+732B. The thumbs-up emoji is U+1F44D. As of Unicode 16.0 (released September 2024), 154,998 characters have been assigned. The code space goes up to U+10FFFF, giving room for about 1.1 million code points total (1,114,112 to be exact).
Unicode is not how text is stored. A code point is an abstract number. To put text on a disk or send it over a network, you need an encoding — a rule that converts code points into bytes. UTF-8, UTF-16, and UTF-32 are three such encodings. They all represent the same Unicode characters, just with different tradeoffs.
This distinction matters because "my file is in Unicode" is not a meaningful statement. A file contains bytes. The bytes might encode Unicode code points using UTF-8, UTF-16, UTF-32, or something else. Each produces different bytes for the same text.
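A quick way to see this is to encode one code point several ways and compare the bytes. A small sketch using only Python's built-in codecs:

```python
text = "\u732b"  # the CJK ideograph U+732B from the example above

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le"]
encoded = {name: text.encode(name).hex(" ") for name in encodings}

for name, hex_bytes in encoded.items():
    print(f"{name:10} {hex_bytes}")
```

One code point, four different byte sequences. "The file is Unicode" tells you nothing about which of these is on disk.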
Our Unicode converter shows the same text simultaneously as code points (U+0048 U+0065 U+006C U+006C U+006F), UTF-8 bytes, UTF-16 code units, and various escape formats. Paste any emoji or non-Latin character to see why encoding choice matters.
UTF-8 as the smart encoding
Of the three Unicode encodings, UTF-8 is the clear winner for almost every use case. It's the default on every modern system except for Windows/Java/JavaScript internal memory (which use UTF-16 for historical reasons).
UTF-8 is variable-width. A single code point uses 1 to 4 bytes depending on the code point's value:
| Code point range | Bytes | Example |
|---|---|---|
| U+0000 – U+007F | 1 | A → 0x41 |
| U+0080 – U+07FF | 2 | é → 0xC3 0xA9 |
| U+0800 – U+FFFF | 3 | € → 0xE2 0x82 0xAC |
| U+10000 – U+10FFFF | 4 | 👍 → 0xF0 0x9F 0x91 0x8D |
The design is clever. For code points in the ASCII range, UTF-8 emits exactly one byte with the same value. That means a file containing only English text is byte-identical in ASCII and UTF-8. Legacy software that doesn't understand UTF-8 can still process ASCII-only UTF-8 files correctly. The rest of Unicode costs more bytes, but since most text on the web is mostly ASCII (even non-English text has plenty of spaces, digits, and punctuation in ASCII), UTF-8's average overhead is small.
The bit patterns are also self-synchronizing. A multi-byte UTF-8 sequence always starts with a leading byte (top bits 11) followed by continuation bytes (top bits 10). You can drop into the middle of a UTF-8 stream, skip to the next byte that isn't a continuation byte, and resume decoding. UTF-16 doesn't have this property at the byte level: lose a single byte and every code unit that follows is misaligned.
Specifically, the bit layout:
Single byte: 0xxxxxxx (ASCII range)
Two bytes: 110xxxxx 10xxxxxx (covers U+0080–U+07FF)
Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (covers U+0800–U+FFFF)
Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (covers U+10000+)
Continuation bytes always start with 10, so any byte with that prefix cannot be a first byte. First bytes have a count of leading 1s that indicates how many bytes are in the sequence. The redundancy is why UTF-8 is robust.
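The bit layout translates almost directly into code. The following is a teaching sketch, not a replacement for `str.encode`; the function names `utf8_encode` and `is_continuation` are made up here.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the bit layout above."""
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0x80:          # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:         # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:       # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:     # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range")


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx, which can never start a sequence."""
    return (byte & 0xC0) == 0x80


thumbs_up = utf8_encode(0x1F44D)
print(thumbs_up.hex(" "))
```

Checking the result against Python's own encoder (`chr(cp).encode("utf-8")`) is a cheap way to confirm the bit shifts are right.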
RFC 3629 specifies UTF-8 formally. Google, WHATWG, W3C, Unicode Consortium, and every modern language default to UTF-8 for I/O. If you're picking an encoding for a new file format or protocol in 2026, pick UTF-8.
UTF-16 and the surrogate pair problem
UTF-16 uses 2 bytes (one 16-bit code unit) for characters in the Basic Multilingual Plane (BMP), which spans code points U+0000 through U+FFFF. That covers most common characters including Latin, Cyrillic, Greek, most CJK, Hebrew, Arabic, and most other scripts in everyday use.
Code points above U+FFFF (supplementary planes — emoji, historical scripts, some mathematical symbols, additional CJK) don't fit in 16 bits. UTF-16 represents them with a surrogate pair: two 16-bit code units that together encode one code point.
The mechanism: a range of code points (U+D800 to U+DFFF, 2048 values total) is permanently reserved in Unicode as "surrogates." Real characters never live there. When UTF-16 needs to encode U+10000 or higher, it picks one "high surrogate" (U+D800-U+DBFF) and one "low surrogate" (U+DC00-U+DFFF) whose values, when combined, encode the supplementary plane code point.
Example: 👍 is code point U+1F44D. Too high for a single UTF-16 code unit. Encoded as a surrogate pair:
Code point: U+1F44D (decimal 128077)
Subtract 0x10000: 0xF44D (decimal 62541)
Split into high 10 bits and low 10 bits: 0x3D (61) and 0x04D (77)
Add 0xD800 to high: 0xD83D (high surrogate)
Add 0xDC00 to low: 0xDC4D (low surrogate)
UTF-16 encoding: 0xD83D 0xDC4D (two 16-bit code units)
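The same arithmetic as a Python sketch. The function names are hypothetical; the constants come straight from the steps above.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary-plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # a 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F44D)
print(f"{high:04X} {low:04X}")
```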
Now the form field bug from the opening anecdote makes sense:
"👍".length // 2 (JavaScript counts UTF-16 code units)
JavaScript's string.length returns the number of UTF-16 code units, not the number of characters a human would count. For ASCII and BMP characters, that matches intuition. For supplementary plane characters — emoji, ancient scripts, some math symbols — it doubles.
The fix is counting code points, not code units. Several patterns work:
// Spread operator iterates code points
[..."👍"].length // 1
// Array.from does the same
Array.from("👍").length // 1
// Or explicitly iterate with for...of
let count = 0;
for (const char of "👍") count++;
// count === 1
But even that isn't enough for every case. Some characters are built from multiple code points combined. A woman construction worker emoji with a skin tone is construction worker + skin tone modifier + zero-width joiner + female sign + variation selector: 5 code points (7 UTF-16 code units) for one visible glyph. For true "character" counting that matches what a human sees, you need the Intl.Segmenter API:
const segmenter = new Intl.Segmenter();
// construction worker + medium skin tone + ZWJ + female sign + variation selector
const worker = "\u{1F477}\u{1F3FD}\u200D\u2640\uFE0F";
[...segmenter.segment(worker)].length // 1 (correctly)
Intl.Segmenter segments text into graphemes — visible user-perceived characters. It's the right tool for user-facing character counts on real international input.
Why a single emoji breaks string length
Returning to the opening anecdote with the full mechanism.
The form had maxLength="100". In the HTML spec, maxLength refers to the length of the form's value as a string. JavaScript strings count UTF-16 code units. A user typing "Hello" sees length 5 (5 ASCII characters, 5 code units). A user typing 👍👍👍 sees length 6 (3 emoji, 2 code units each).
Chrome followed the spec strictly: length = code units. The form cut off after 50 emoji.
Safari implemented maxLength using code points, not code units. A user typing 👍👍👍 saw length 3. The form accepted 100 emoji.
Both browsers are "correct" from different perspectives. Chrome matches the HTML spec text. Safari matches user intuition. The spec is slowly being updated to clarify, and some specs now explicitly use code points rather than UTF-16 length because the latter has been such a source of bugs. But browsers that have already shipped still carry the existing behavior.
The lesson: if your form needs to count "characters" for a user-facing limit, don't rely on maxLength. Write the length check in JavaScript using [...str].length or Intl.Segmenter, and make the limit based on graphemes rather than code units.
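If the limit is also enforced on a Python backend, the layers line up a little differently: `len()` counts code points, so matching the browser's `.length` takes an explicit conversion. A minimal sketch, assuming well-formed text (no lone surrogates); the function names are made up, and grapheme counting is left out because the standard library has no segmenter.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units, i.e. what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


def code_point_length(text: str) -> int:
    """Number of Unicode code points, i.e. what [...str].length reports."""
    return len(text)


thumbs = "\U0001F44D" * 3
# construction worker + skin tone + ZWJ + female sign + variation selector
worker = "\U0001F477\U0001F3FD\u200D\u2640\uFE0F"

print(utf16_length(thumbs), code_point_length(thumbs))
print(utf16_length(worker), code_point_length(worker))
```

The second string is one visible glyph, 5 code points, and 7 code units, so neither count matches what the user sees. That gap is why a grapheme-aware check is the safer basis for a user-facing limit.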
Mojibake — when bytes get decoded wrong
Mojibake (文字化け, Japanese for "character change") is the general term for text that looks garbled because bytes encoded in one system were interpreted using another. The specific name depends on the pair, but the mechanism is the same.
The classic case is UTF-8 interpreted as Latin-1 (ISO-8859-1) or its near-twin Windows-1252, which differs only in the 0x80-0x9F range. The word "café" in UTF-8:
c: 0x63
a: 0x61
f: 0x66
é: 0xC3 0xA9 (two bytes in UTF-8)
If you read these bytes as Latin-1 (which maps every byte 0-255 to one character), you get:
c: 0x63 → c
a: 0x61 → a
f: 0x66 → f
é (byte 1): 0xC3 → Ã
é (byte 2): 0xA9 → ©
So "café" becomes "cafÃ©". The tell is that Latin-accented and European characters show up as pairs of weird symbols, usually starting with Ã.
Double-encoding is the same thing twice. The text "café" is written to disk as UTF-8 bytes. A script reads those bytes, thinks they're Latin-1, and sees "cafÃ©". It then "fixes" the text by encoding those Latin-1 characters as UTF-8, writing Ã as 0xC3 0x83 and © as 0xC2 0xA9. A correct UTF-8 reader now shows "cafÃ©", and a Latin-1 reader shows "cafÃ\u0083Â©". Three rounds in and the damage is very hard to undo without knowing the full history.
The fix depends on how bad the damage is:
Single-encoded mojibake (one wrong decode) is reversible. Python example:
# You see: "cafÃ©" but expected "café"
broken = "cafÃ©"
bytes_ = broken.encode("latin-1")
fixed = bytes_.decode("utf-8")
# fixed == "café"
Double-encoded mojibake requires applying the fix twice. Anything beyond double-encoding is usually data loss — the information about which character originally produced each byte is lost if the same confusion happened repeatedly.
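Both the damage and the repair can be reproduced end to end. The sketch below assumes the wrong decode really was Latin-1; if it was Windows-1252, swap in "cp1252". The helper name `undo_latin1_mojibake` is made up.

```python
original = "caf\u00e9"  # "café"

# One wrong decode: UTF-8 bytes read as Latin-1.
once = original.encode("utf-8").decode("latin-1")
# The same mistake applied again to the already-broken text.
twice = once.encode("utf-8").decode("latin-1")


def undo_latin1_mojibake(text: str, rounds: int = 1) -> str:
    """Reverse `rounds` passes of UTF-8-read-as-Latin-1 damage."""
    for _ in range(rounds):
        text = text.encode("latin-1").decode("utf-8")
    return text


print(repr(once))
print(repr(twice))
print(undo_latin1_mojibake(twice, rounds=2))
```

If a round raises UnicodeDecodeError or UnicodeEncodeError, the guess about the encoding pair (or the number of rounds) is wrong, which is a useful signal in itself.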
The MySQL utf8 trap deserves its own mention. MySQL's utf8 character set is an alias for utf8mb3, which stores at most 3 bytes per character, so it can't store emoji (which need 4 bytes in UTF-8). Depending on the SQL mode, storing a 4-byte character into a utf8 column either fails with an error or truncates the value at that character with only a warning, corrupting the data. MySQL 5.5.3 introduced utf8mb4 (UTF-8, 4 bytes max) as the proper UTF-8 variant. Every new MySQL database should use utf8mb4, not utf8. If you're debugging an old MySQL database where emoji disappear on insert, that's exactly this bug.
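While a legacy column is waiting to be migrated, it can help to detect the problem input before it reaches the database. A minimal sketch; `needs_utf8mb4` is a hypothetical helper name, and the check relies only on the fact that 4-byte UTF-8 sequences are exactly the code points above U+FFFF.

```python
def needs_utf8mb4(text: str) -> bool:
    """True if any character takes 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("caf\u00e9 \u732b"))       # accented Latin and CJK fit in 3 bytes
print(needs_utf8mb4("thanks \U0001F44D"))      # the emoji does not
```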
Base64 — when you can't send bytes
Sometimes you need to send binary data through a channel that only accepts text. Early SMTP email carried only 7-bit ASCII text, so email clients couldn't send attachments directly and needed a way to represent an image or PDF as text characters. HTTP headers are text-only, so authentication tokens can't embed raw bytes. JSON is text-only, so binary data in an API response needs to be encoded somehow.
Base64 solves this. It's a binary-to-text encoding that maps every 3 bytes of input to 4 ASCII characters of output. The characters are a fixed alphabet: A-Z, a-z, 0-9, +, /, with = used as padding when the input length isn't a multiple of 3. Specified in RFC 4648.
Every base64 character represents 6 bits of input (since 2^6 = 64). 3 bytes = 24 bits = 4 base64 characters. The efficiency ratio is 3/4 = 75%, meaning base64 output is about 33% larger than the input (4 output bytes for every 3 input bytes).
Example:
Input bytes: H i !
72 105 33
01001000 01101001 00100001
Split into 6-bit groups: 010010 000110 100100 100001
Decimal: 18 6 36 33
Base64 alphabet: S G k h
Output: "SGkh"
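The worked example above can be checked by doing the 6-bit split by hand and comparing against the standard library. This sketch handles only a full 3-byte block (no padding), and `b64_encode_block` is a made-up name.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def b64_encode_block(three_bytes: bytes) -> str:
    """Encode exactly 3 bytes as 4 base64 characters."""
    if len(three_bytes) != 3:
        raise ValueError("this sketch only handles a full 3-byte block")
    n = int.from_bytes(three_bytes, "big")  # one 24-bit integer
    # Peel off four 6-bit groups, most significant first.
    indices = [(n >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)


encoded = b64_encode_block(b"Hi!")
print(encoded)
print(base64.b64encode(b"Hi!").decode("ascii"))
```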
Base64 shows up everywhere:
- Data URIs in HTML/CSS: data:image/png;base64,iVBORw0KGgo... embeds a PNG directly in the markup
- HTTP Basic Authentication: the Authorization: Basic YWxpY2U6c2VjcmV0 header is base64 of alice:secret
- JWT payloads: each JWT segment is base64url-encoded JSON (see our JWT decoder guide for the full story)
- Email MIME attachments: most email attachments are base64-encoded in transit
- Webhook signatures: HMAC signatures are often base64-encoded in headers
Base64URL is a variant that substitutes - for + and _ for / (those characters have meaning in URLs) and optionally drops padding. It's what JWTs use. Hit our base64 encoder/decoder for a live tool that handles both variants.
Base64 is not encryption. It's trivially reversible. Anyone who has the base64 string can decode it. Use it for transport, not secrecy.
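Both points are easy to demonstrate with Python's `base64` module. The helpers `b64url_encode` and `b64url_decode` are illustrative names for the unpadded URL-safe variant; the three sample bytes are chosen only because they produce + and / in the standard alphabet.

```python
import base64


def b64url_encode(data: bytes) -> str:
    """URL-safe base64 with the trailing '=' padding dropped."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def b64url_decode(text: str) -> bytes:
    """Restore the padding the decoder expects, then decode."""
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


raw = b"\xfb\xff\xfe"
standard = base64.b64encode(raw).decode("ascii")
url_safe = b64url_encode(raw)
print(standard, url_safe)

# "Not encryption": anyone can reverse a Basic auth header value.
credentials = base64.b64decode("YWxpY2U6c2VjcmV0")
print(credentials)
```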
Hex, when base64 feels like overkill
Hex (base-16) is the other common binary-to-text encoding. Same goal as base64, worse efficiency, better readability.
Every byte is 2 hex characters. No padding issues. Efficiency is 50%: each hex character carries 4 bits, so it takes 2 characters (16 bits of ASCII) to represent 8 bits of data. That's worse than base64's 75% but hex has advantages:
- Fixed byte boundaries. Every byte is exactly 2 chars. You can index into hex strings with simple byte math.
- Readable. Each byte is identifiable at a glance.
- Universal support. Every language has hex encode/decode in the standard library.
Hex shows up in:
- Cryptographic hashes: SHA-256 output is 64 hex characters (32 bytes)
- Color codes: #FF5733 is three bytes expressing RGB values
- MAC addresses: 00:1B:44:11:3A:B7 is six hex bytes
- Memory dumps and debugger output: hex is easier to read than decimal in memory contexts
- Unicode code point notation: U+1F44D is always hex
Use hex when readability matters more than compactness. Use base64 when the goal is minimum size.
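The size tradeoff is easy to measure on a real value. A short sketch that encodes one SHA-256 digest both ways; the input string is arbitrary.

```python
import base64
import hashlib

digest = hashlib.sha256(b"hello").digest()  # 32 raw bytes

as_hex = digest.hex()
as_b64 = base64.b64encode(digest).decode("ascii")

print(len(digest), len(as_hex), len(as_b64))

# Fixed byte boundaries: byte i is always characters 2*i and 2*i + 1.
i = 5
fifth_byte = int(as_hex[2 * i: 2 * i + 2], 16)

# Hex round-trips back to the same bytes.
restored = bytes.fromhex(as_hex)
```

The 32-byte digest takes 64 characters in hex and 44 in base64, and only the hex form lets you pick out a single byte with simple index math.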
Text to hex converts text into its byte sequence. Hex to text reverses it. Our hash algorithm guide goes deeper on hex's role in cryptographic hash output.
URL encoding, HTML entities — the other escape layers
Two more encoding schemes you'll run into constantly. Neither is "binary to text" exactly — they're "text with problem characters to text without problem characters."
Percent encoding (URL encoding) replaces characters that have meaning in URLs with %XX where XX is the hex byte value. Spaces become %20 (or + in query strings). Special characters like ?, &, =, # get encoded when they appear in URL segments where they're not meant as delimiters. UTF-8 bytes above 127 are always encoded. The character € (three bytes in UTF-8: E2 82 AC) becomes %E2%82%AC in a URL.
Our URL encoder/decoder does this round-trip.
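In Python the same round-trip lives in `urllib.parse`. A brief sketch; note that `quote` leaves "/" unescaped by default, so `safe=""` is passed when encoding a single path segment or query value.

```python
from urllib.parse import quote, quote_plus, unquote

euro = quote("\u20ac")                   # UTF-8 bytes E2 82 AC, percent-encoded
value = quote("a b&c=d", safe="")        # delimiters escaped inside a value
form_style = quote_plus("a b&c")         # query-string style: space becomes +
decoded = unquote("%E2%82%AC")

print(euro, value, form_style, decoded)
```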
HTML entities escape characters that have meaning in HTML. < becomes &lt; so it doesn't start a tag. & becomes &amp; so it doesn't start an entity. " becomes &quot; in attribute values. There's also a numeric form: &#233; for é or &#x1F44D; for 👍 (entities can use decimal or hex code point references).
Our HTML entity encoder/decoder handles both forms.
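Python's standard `html` module covers both directions. A small sketch with a made-up snippet of markup:

```python
import html

escaped = html.escape('<a href="x">Tom & Jerry</a>')
print(escaped)

# Numeric references work in decimal (&#233;) and hex (&#x1F44D;).
unescaped = html.unescape("&#233; &#x1F44D;")
print(unescaped)
```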
The key insight: percent encoding, HTML entities, and base64 all solve the same abstract problem — "I need to put these bytes in a context where some bytes are dangerous" — but each one works on a different context. Percent encoding for URLs. HTML entities for HTML. Base64 for anywhere that needs ASCII-safe transport of arbitrary bytes.
Bytes vs characters in code
The string.length === bytes confusion shows up in almost every language. Quick reference table:
| Language | length/len returns | How to get the other count |
|---|---|---|
| JavaScript | UTF-16 code units | bytes: new TextEncoder().encode(s).length |
| Python | code points | bytes: len(s.encode("utf-8")) |
| Go | bytes | code points: utf8.RuneCountInString(s) |
| Java | UTF-16 code units | bytes: s.getBytes(StandardCharsets.UTF_8).length |
| Rust | bytes | code points: s.chars().count() |
| C (strlen) | bytes until null | no built-in code point counting |
Three places this matters in production code:
1. Database column sizes. VARCHAR(100) in MySQL utf8mb4 is 100 characters (code points), but on disk it can occupy up to 400 bytes (4 bytes per character). Make sure your schema limits account for this. PostgreSQL varchar(n) is characters. SQLite TEXT has no explicit limit.
2. HTTP Content-Length. Always bytes, never characters. Computing Content-Length from str.length in JavaScript is wrong for non-ASCII content.
3. Log size limits. Log aggregators often limit per-message size in bytes. A 1000-character log message in English might fit, but the same in Chinese is 3000 bytes. Design log schemas in terms of bytes, not characters, for multilingual safety.
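For byte-based limits like these, the two operations that usually matter are measuring the encoded size and cutting to a byte budget without splitting a character. A minimal sketch, assuming UTF-8 output; `utf8_length` and `truncate_utf8` are hypothetical helper names.

```python
def utf8_length(text: str) -> int:
    """Size in bytes once encoded as UTF-8 (what Content-Length needs)."""
    return len(text.encode("utf-8"))


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a code point."""
    data = text.encode("utf-8")[:max_bytes]
    # The slice can only damage the final sequence; "ignore" drops that partial tail.
    return data.decode("utf-8", errors="ignore")


english = "a" * 1000
chinese = "\u732b" * 1000
print(utf8_length(english), utf8_length(chinese))
print(truncate_utf8("\u732b" * 3, 7))
```

This keeps code points intact but can still cut a multi-code-point grapheme in half, so it suits log lines better than user-visible text.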
Our byte converter is handy when you need to convert between representations of data size (bytes, KB, MB, KiB, MiB — binary vs decimal prefixes are their own can of worms).
Debugging encoding bugs — a field guide
When text looks wrong, follow this workflow.
1. Look at the raw bytes. Every other debugging step is guessing without this. On Linux/macOS:
xxd file.txt | head
# or
od -c file.txt | head
In Python:
with open("file.txt", "rb") as f:
print(f.read()[:100])
In the browser, use our hex to text tool backward — paste the broken text and see what bytes produced it.
2. Identify the encoding. Look for a BOM (byte order mark):
- UTF-8 BOM: EF BB BF
- UTF-16LE BOM: FF FE
- UTF-16BE BOM: FE FF
- UTF-32 LE/BE BOMs also exist but are rare
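The BOM check is simple to script. The sketch below uses the BOM constants from Python's `codecs` module; `sniff_bom` is a made-up name. One known ambiguity: a UTF-16LE file whose first character after the BOM is U+0000 would be reported as UTF-32LE, which is rare enough to accept in a debugging helper.

```python
import codecs

# Longer BOMs first: the UTF-32LE BOM begins with the UTF-16LE BOM bytes.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


sample = b"\xef\xbb\xbfhello"
encoding, skip = sniff_bom(sample)
if encoding is not None:
    print(encoding, sample[skip:].decode(encoding))
```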
No BOM, guess by heuristics. file -i on Linux shows the detected encoding:
file -i file.txt
# text/plain; charset=utf-8
For programmatic detection, use chardet (Python) or jschardet (JavaScript). Both use statistical models to guess the encoding. They're not perfect, and short inputs are especially easy to misidentify, so treat the result as a hint rather than an answer.
3. Reinterpret the bytes. Once you know the wrong encoding and can guess the right one, round-trip:
broken = "Ã©"
real = broken.encode("latin-1").decode("utf-8")
# real == "é"
4. Fix the source. The broken text is a symptom. The source is wherever the bytes get misinterpreted. Common sources:
- Database connection uses wrong charset
- HTTP response missing Content-Type: text/html; charset=utf-8 header
- File opened in Python without encoding="utf-8" argument
- JavaScript reading a text file with a BOM and getting an extra character at position 0
5. Stop the bleeding. Set encoding explicitly everywhere. For web:
response["Content-Type"] = "text/html; charset=utf-8"
For Python file I/O:
open("file.txt", encoding="utf-8")
For MySQL:
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
For HTML:
<meta charset="utf-8">
UTF-8 everywhere is the simple rule. UTF-16 in memory where languages require it (JavaScript, Java). UTF-8 for every I/O boundary, every storage system, every network protocol. Any deviation needs a specific reason.
FAQ
What's the difference between ASCII and Unicode?
ASCII is a 128-character set covering English letters, digits, punctuation, and control characters. Unicode is a universal catalog of all characters in all writing systems, currently about 155,000 code points. ASCII is a small subset of Unicode — the first 128 Unicode code points are identical to ASCII.
What's the difference between Unicode and UTF-8?
Unicode is a catalog of characters (code points). UTF-8 is an encoding that converts code points to bytes for storage and transmission. Other encodings of Unicode exist (UTF-16, UTF-32) but UTF-8 is dominant on the web.
Why does a single emoji have length 2 in JavaScript?
JavaScript strings are sequences of UTF-16 code units. Most characters fit in one 16-bit code unit. Emoji and other supplementary plane characters need two code units (a surrogate pair). "👍".length counts code units, so you see 2. Use [..."👍"].length or Array.from("👍").length to count code points (1).
What's a surrogate pair?
A pair of 16-bit UTF-16 code units that together encode a single Unicode code point from outside the Basic Multilingual Plane. Required for code points above U+FFFF. Each surrogate has a specific value range (high surrogates U+D800-U+DBFF, low surrogates U+DC00-U+DFFF) that identifies it as part of a pair.
What's the difference between UTF-8 and UTF-16?
Both encode Unicode but differently. UTF-8 uses 1-4 bytes per character (1 byte for ASCII, more for others). UTF-16 uses 2 or 4 bytes (2 for BMP, 4 for supplementary). UTF-8 is ASCII-compatible; UTF-16 is not. UTF-8 is the default on web, Linux, macOS. UTF-16 is internal to Windows, Java, JavaScript.
Is ASCII still used?
ASCII is still meaningful as the subset of Unicode that covers codes 0-127. A file containing only ASCII characters is byte-identical in ASCII and UTF-8. Standards still reference "ASCII-safe" characters to mean the 128-character subset.
What is mojibake?
Text that looks wrong because bytes encoded in one character set were decoded using a different one. Classic case: UTF-8 bytes interpreted as Latin-1, where "é" becomes "Ã©". Usually fixable by identifying the wrong decode and reversing it.
How do I fix mojibake in Python?
Re-encode and re-decode with the correct encoding pair. For UTF-8-as-Latin-1: broken.encode("latin-1").decode("utf-8"). Double-encoded mojibake requires applying the fix twice.
What's the difference between utf8 and utf8mb4 in MySQL?
MySQL's utf8 is an alias for utf8mb3, which stores only UTF-8 characters of up to 3 bytes and so excludes emoji and some CJK characters. utf8mb4, added in MySQL 5.5.3, is real 4-byte UTF-8. Always use utf8mb4 for new MySQL databases.
What is base64 used for?
Encoding binary data as ASCII-safe text so it can be transmitted through text-only channels. Common uses: email attachments (MIME), HTTP Basic Auth headers, data URIs in HTML/CSS, JWT payloads, API responses with binary data.
Is base64 encryption?
No. Base64 is trivially reversible and requires no key. It's a transport encoding, not a security mechanism. Anyone with a base64 string can decode it.
When should I use hex vs base64?
Base64 when size matters (33% overhead). Hex when readability matters (100% overhead but easy to read byte-by-byte). Hex for cryptographic hashes, color codes, memory dumps. Base64 for attachments, JWTs, embedded assets.
What's the difference between URL encoding and base64?
URL encoding (percent encoding) replaces only characters that have meaning in URLs with their %XX form — most characters pass through unchanged. Base64 encodes arbitrary bytes as a fixed alphabet regardless of whether any given byte is "safe" in the destination context. They solve different problems.
How do I count visible characters in a string?
Use Intl.Segmenter in JavaScript: [...new Intl.Segmenter().segment(str)].length. For languages without segmenter APIs, use a grapheme cluster library. Counting code points with Array.from(str).length works for most cases but misses combining characters and emoji modifier sequences.
What's a grapheme cluster?
One or more code points that combine to form one visible character. Examples: an emoji with a skin tone modifier (base emoji + modifier = 1 grapheme, 2 code points). A complex emoji like the family emoji (man, woman, girl, and boy joined by three zero-width joiners) is one grapheme but 7 code points.
Can I just use UTF-8 everywhere?
Yes for I/O and storage. JavaScript, Java, Windows APIs will still use UTF-16 internally, but that's invisible to most application code. UTF-8 at every boundary (file I/O, database, network) is the right default.
What's the BOM?
Byte Order Mark. A special character (U+FEFF) at the start of a file that indicates the encoding's byte order. UTF-8 BOM is EF BB BF, UTF-16LE is FF FE. Not required by UTF-8 but sometimes written by Windows tools. Can cause bugs when software reads the BOM as actual content (a mysterious first character). When opening UTF-8 files, use encoding="utf-8-sig" in Python to strip a BOM if present.
The takeaway
Text is four layers, and every encoding bug is confusion between two of them:
- Bytes are what's on disk and over the wire.
- Code units are the atoms of a specific encoding (UTF-8 has 1-byte code units, UTF-16 has 2-byte code units).
- Code points are Unicode's abstract character numbers.
- Graphemes are what a human reads as a character.
The four most common bugs:
- Counting code units as characters.
"👍".length === 2in JavaScript. UseIntl.Segmenterfor user-facing counts. - Mojibake. UTF-8 bytes decoded as Latin-1. Fix by re-encoding and re-decoding with the correct pair.
- MySQL
utf8silently dropping emoji. Always useutf8mb4instead. - Byte-vs-character confusion in sizes. HTTP Content-Length is bytes. Database VARCHAR is characters. They're not the same for non-ASCII content.
For hands-on debugging, our text to binary, text to hex, and Unicode converter tools show the same text at every layer simultaneously. The base64 encoder/decoder handles the standard and URL-safe variants. The HTML entity encoder and URL encoder round-trip their respective escape formats. The character map browses Unicode with code points visible. All run client-side so text never leaves your browser.
For the deeper stories these topics connect to, our JWT decoder guide covers base64url in JWT payloads, our hash algorithm guide covers hex as the standard output for SHA-256 and friends, and our UUID v4 vs v7 guide covers hex representation of UUIDs.
If you remember nothing else: UTF-8 everywhere for I/O. Intl.Segmenter for user-facing character counts. And when something looks wrong, look at the raw bytes before anything else.