UTF-8 collations in SQL Server: half the storage, one nasty trap

Since SQL Server 2019 you can give a VARCHAR column a UTF-8 collation (..._UTF8) and it will store the full Unicode repertoire — encoded as UTF-8 instead of UTF-16. The pitch is seductive: ASCII costs 1 byte per character instead of NVARCHAR’s 2, so you halve your strings, your indexes and your backups “for free,” with no application change. The reality is more interesting: it’s a Western-text optimisation with a sharp edge that can quietly turn your seeks into scans.

I benchmarked it on SQL Server 2025 — one 100-character string column, 200,000 rows, three storage types (NVARCHAR, VARCHAR UTF-8, legacy VARCHAR CP1252) across three data distributions: pure English (ascii), Lithuanian with its diacritics (mixed, ~⅓ of characters multibyte), and CJK ideographs (cjk). Timing methodology is the same as the rest of the series. The harness is in the repo. Let’s start with the good news, then the trap.

Storage: it depends entirely on your alphabet

UTF-8 is variable width — 1 byte for ASCII, 2 for Latin/Cyrillic/Greek, 3 for CJK, 4 for emoji and other supplementary characters. UTF-16 (NVARCHAR) is a flat 2 bytes for almost everything, 4 for supplementary. So the win is real exactly until your data stops being mostly-ASCII. Measured footprint (data and index) as a percentage of the NVARCHAR equivalent:

Data	Bytes/char (UTF-8 vs NV)	Data pages	NC index	Total on disk
English (`ascii`)	1.00 vs 2.00	53.6%	52.3%	52.9%
Lithuanian (`mixed`)	1.27 vs 2.00	70.5%	67.6%	68.8%
CJK (`cjk`)	3.00 vs 2.00	148.0%	148.3%	148.1%

Three things to read off that table:

English halves. 97 MB → 52 MB, indexes included.
European text still wins comfortably. Lithuanian is about as diacritic-heavy as Western European languages get, and UTF-8 still lands near 69%. German, French, Spanish, Polish — all win.
CJK loses. Three bytes per character beats UTF-16’s two, so a Chinese/Japanese/ Korean column is ~1.5× bigger under UTF-8. If your data is predominantly CJK, NVARCHAR is the smaller type.

The nonclustered index tracks the data exactly — because the index key is the string. UTF-8 doesn’t just shrink your table; it shrinks (or grows) every index on that column in the same proportion.

The trap: an NVARCHAR parameter turns your seek into a scan

Here’s the one that will bite you in production. You migrate a column to UTF-8 VARCHAR, you have a perfectly good nonclustered index on it, and your app does:

-- the parameter is NVARCHAR (the default for most drivers / N'...' literals)
SELECT COUNT(*) FROM dbo.t_cjk_utf8 WHERE sval = @needle;   -- @needle NVARCHAR(400)

NVARCHAR has higher data-type precedence than VARCHAR, so SQL Server doesn’t convert your parameter down — it converts the column up, wrapping every row in CONVERT_IMPLICIT(nvarchar, sval). A converted column is not sargable, so the index seek collapses into a full scan. From the benchmark’s STATISTICS IO, same query, same data, only the parameter type changed:

Predicate on `t_cjk_utf8`	Plan	Logical reads	Time
`VARCHAR` needle (native)	Index Seek	7	0.12 ms
`NVARCHAR` needle	Index Scan	1066	435 ms

That’s ~3500× slower, on a table that fits in memory. On a real CJK table that doesn’t, it’s the difference between a query and an outage. The NVARCHAR table, for comparison, has no penalty at all — its native type is NVARCHAR, so the parameter matches.

The nuance that surprised me: for single-byte (ASCII) data the same NVARCHAR parameter still seeks — 3 logical reads, not 1066. When every character is one byte, SQL Server can fold the conversion into a range predicate. So the catastrophe is specific to genuinely multibyte data. That’s a trap, not a feature: your ASCII-heavy test environment will look fine and your CJK (or heavily-accented) production data will fall off a cliff. Don’t rely on it — match your parameter type to the column.

Writes and reads: roughly track bytes, with caveats

Write throughput follows the byte count, plus a small UTF-8 encode/validate tax on the way in. Inserts of ASCII came in around 86% of NVARCHAR’s time (fewer bytes, less log); CJK around 189% (more bytes, plus encoding). The middle ground sat inside the noise of a shared 4-core box — if you need precise write numbers, measure on a quiet host.

Reads had the least intuitive result. A full-scan aggregate over a cached table was basically a tie for ASCII and Lithuanian (99–102% of NVARCHAR). When the data is already in the buffer pool, scanning it is bound by row count and CPU, not by bytes on disk — so UTF-8’s footprint advantage shows up as cache density (you fit ~2× the ASCII rows per GB of RAM, which matters enormously at scale) rather than as a faster individual query. The headline storage win is a memory and IO win, realised under pressure, not a per-query speedup on warm data.

Three sizing facts you have to internalise

VARCHAR(n) is n bytes; NVARCHAR(n) is n characters. This is the one that breaks migrations. Assign 10 CJK characters to each:

Target	Characters kept	Bytes
`NVARCHAR(10)`	10	20
`VARCHAR(10)` UTF-8	3	9

VARCHAR(10) UTF-8 holds whatever fits in 10 bytes — three CJK characters, then it truncates. If you migrate NVARCHAR(n) → VARCHAR(n) UTF-8 by copying the number n, you silently chop every multibyte value. Size UTF-8 columns in bytes (often n × 2–3 the character count you actually need).

The nonclustered index key limit is 1700 bytes — bytes, not characters. A NVARCHAR(900) key is 1800 bytes and SQL Server warns you at create time; the same VARCHAR(900) UTF-8 key is at most 900 bytes and creates clean. UTF-8 lets you index a character-wider column — but remember the worst case is 4 bytes/char, so a VARCHAR(425) UTF-8 key can still blow the limit on supplementary characters.

LEN counts characters, DATALENGTH counts bytes — and now they differ. Under NVARCHAR DATALENGTH was always 2 × LEN. Under UTF-8 it’s 1–4× depending on content (measured: 1.00 for ASCII, 1.27 for Lithuanian, 3.00 for CJK). Any code that derived byte sizes from LEN, or assumed the 2× relationship, needs revisiting.

Verdict

_UTF8 collations are a genuine win for the common case — mostly-ASCII or European-language string data — and the win is large: roughly half the storage and half the index size, which becomes half the buffer-pool pressure and half the backup. For CJK-dominant data, keep NVARCHAR; UTF-8 makes it bigger.

Before you flip a column to UTF-8:

Profile the alphabet first. AVG(DATALENGTH(col)) / AVG(LEN(col)) on real data tells you whether you’re in the 1.0–1.3 (win) or ~3.0 (lose) regime.
Audit your predicates. Make sure parameters and literals comparing to the column are VARCHAR, not NVARCHAR/N'...', or you’ll convert the column and lose the seek — silently, and worst on exactly your multibyte data.
Re-size in bytes, not characters, and re-check index key widths against 1700 B.
Don’t expect a warm-cache query speedup — the payoff is footprint: density, IO and memory. Which, at scale, is usually the payoff you actually wanted.

Numbers from SQL Server 2025 (Developer), Latin1_General_100_CI_AS[_SC[_UTF8]] collations, 200k rows × 100 chars, on a shared host — trust the ratios over the absolute milliseconds. Reproduce or poke holes: the harness builds the data, runs the matrix and prints the receipts.