UTF-8 collations in SQL Server: half the storage, one nasty trap
A VARCHAR column with a _UTF8 collation stores all of Unicode in UTF-8 — and roughly halves an ASCII column versus NVARCHAR. Benchmarked on SQL Server 2025 across English, Lithuanian and CJK data: where it wins, where it makes things 1.5× bigger, and the implicit-conversion trap that turns an index seek into a 1000-read scan.
Since SQL Server 2019 you can give a VARCHAR column a UTF-8 collation
(..._UTF8) and it will store the full Unicode repertoire — encoded as UTF-8
instead of UTF-16. The pitch is seductive: ASCII costs 1 byte per character instead
of NVARCHAR’s 2, so you halve your strings, your indexes and your backups “for free,”
with no application change. The reality is more interesting: it’s a Western-text
optimisation with a sharp edge that can quietly turn your seeks into scans.
I benchmarked it on SQL Server 2025 — one 100-character string column, 200,000 rows,
three storage types (NVARCHAR, VARCHAR UTF-8, legacy VARCHAR CP1252) across
three data distributions: pure English (ascii), Lithuanian with its diacritics
(mixed, ~⅓ of characters multibyte), and CJK ideographs (cjk). Timing methodology
is the same as the rest of the series. The
harness is in the repo. Let’s start with the good
news, then the trap.
Storage: it depends entirely on your alphabet
UTF-8 is variable width — 1 byte for ASCII, 2 for Latin/Cyrillic/Greek, 3 for CJK, 4 for emoji and other supplementary characters. UTF-16 (NVARCHAR) is a flat 2 bytes for almost everything, 4 for supplementary. So the win is real exactly until your data stops being mostly-ASCII. Measured footprint (data and index) as a percentage of the NVARCHAR equivalent:
| Data | Bytes/char (UTF-8 vs NV) | Data pages | NC index | Total on disk |
|---|---|---|---|---|
English (ascii) | 1.00 vs 2.00 | 53.6% | 52.3% | 52.9% |
Lithuanian (mixed) | 1.27 vs 2.00 | 70.5% | 67.6% | 68.8% |
CJK (cjk) | 3.00 vs 2.00 | 148.0% | 148.3% | 148.1% |
Three things to read off that table:
- English halves. 97 MB → 52 MB, indexes included.
- European text still wins comfortably. Lithuanian is about as diacritic-heavy as Western European languages get, and UTF-8 still lands near 69%. German, French, Spanish, Polish — all win.
- CJK loses. Three bytes per character beats UTF-16’s two, so a Chinese/Japanese/ Korean column is ~1.5× bigger under UTF-8. If your data is predominantly CJK, NVARCHAR is the smaller type.
The nonclustered index tracks the data exactly — because the index key is the string. UTF-8 doesn’t just shrink your table; it shrinks (or grows) every index on that column in the same proportion.
The trap: an NVARCHAR parameter turns your seek into a scan
Here’s the one that will bite you in production. You migrate a column to UTF-8
VARCHAR, you have a perfectly good nonclustered index on it, and your app does:
-- the parameter is NVARCHAR (the default for most drivers / N'...' literals)
SELECT COUNT(*) FROM dbo.t_cjk_utf8 WHERE sval = @needle; -- @needle NVARCHAR(400)
NVARCHAR has higher data-type precedence than VARCHAR, so SQL Server doesn’t
convert your parameter down — it converts the column up, wrapping every row in
CONVERT_IMPLICIT(nvarchar, sval). A converted column is not sargable, so the index
seek collapses into a full scan. From the benchmark’s STATISTICS IO, same query, same
data, only the parameter type changed:
Predicate on t_cjk_utf8 | Plan | Logical reads | Time |
|---|---|---|---|
VARCHAR needle (native) | Index Seek | 7 | 0.12 ms |
NVARCHAR needle | Index Scan | 1066 | 435 ms |
That’s ~3500× slower, on a table that fits in memory. On a real CJK table that doesn’t, it’s the difference between a query and an outage. The NVARCHAR table, for comparison, has no penalty at all — its native type is NVARCHAR, so the parameter matches.
The nuance that surprised me: for single-byte (ASCII) data the same NVARCHAR parameter still seeks — 3 logical reads, not 1066. When every character is one byte, SQL Server can fold the conversion into a range predicate. So the catastrophe is specific to genuinely multibyte data. That’s a trap, not a feature: your ASCII-heavy test environment will look fine and your CJK (or heavily-accented) production data will fall off a cliff. Don’t rely on it — match your parameter type to the column.
Writes and reads: roughly track bytes, with caveats
Write throughput follows the byte count, plus a small UTF-8 encode/validate tax on the way in. Inserts of ASCII came in around 86% of NVARCHAR’s time (fewer bytes, less log); CJK around 189% (more bytes, plus encoding). The middle ground sat inside the noise of a shared 4-core box — if you need precise write numbers, measure on a quiet host.
Reads had the least intuitive result. A full-scan aggregate over a cached table was basically a tie for ASCII and Lithuanian (99–102% of NVARCHAR). When the data is already in the buffer pool, scanning it is bound by row count and CPU, not by bytes on disk — so UTF-8’s footprint advantage shows up as cache density (you fit ~2× the ASCII rows per GB of RAM, which matters enormously at scale) rather than as a faster individual query. The headline storage win is a memory and IO win, realised under pressure, not a per-query speedup on warm data.
Three sizing facts you have to internalise
VARCHAR(n) is n bytes; NVARCHAR(n) is n characters. This is the one that
breaks migrations. Assign 10 CJK characters to each:
| Target | Characters kept | Bytes |
|---|---|---|
NVARCHAR(10) | 10 | 20 |
VARCHAR(10) UTF-8 | 3 | 9 |
VARCHAR(10) UTF-8 holds whatever fits in 10 bytes — three CJK characters, then it
truncates. If you migrate NVARCHAR(n) → VARCHAR(n) UTF-8 by copying the number n,
you silently chop every multibyte value. Size UTF-8 columns in bytes (often n ×
2–3 the character count you actually need).
The nonclustered index key limit is 1700 bytes — bytes, not characters. A
NVARCHAR(900) key is 1800 bytes and SQL Server warns you at create time; the same
VARCHAR(900) UTF-8 key is at most 900 bytes and creates clean. UTF-8 lets you index a
character-wider column — but remember the worst case is 4 bytes/char, so a
VARCHAR(425) UTF-8 key can still blow the limit on supplementary characters.
LEN counts characters, DATALENGTH counts bytes — and now they differ. Under
NVARCHAR DATALENGTH was always 2 × LEN. Under UTF-8 it’s 1–4× depending on content
(measured: 1.00 for ASCII, 1.27 for Lithuanian, 3.00 for CJK). Any code that derived
byte sizes from LEN, or assumed the 2× relationship, needs revisiting.
Verdict
_UTF8 collations are a genuine win for the common case — mostly-ASCII or
European-language string data — and the win is large: roughly half the storage and
half the index size, which becomes half the buffer-pool pressure and half the backup.
For CJK-dominant data, keep NVARCHAR; UTF-8 makes it bigger.
Before you flip a column to UTF-8:
- Profile the alphabet first.
AVG(DATALENGTH(col)) / AVG(LEN(col))on real data tells you whether you’re in the 1.0–1.3 (win) or ~3.0 (lose) regime. - Audit your predicates. Make sure parameters and literals comparing to the column
are
VARCHAR, notNVARCHAR/N'...', or you’ll convert the column and lose the seek — silently, and worst on exactly your multibyte data. - Re-size in bytes, not characters, and re-check index key widths against 1700 B.
- Don’t expect a warm-cache query speedup — the payoff is footprint: density, IO and memory. Which, at scale, is usually the payoff you actually wanted.
Numbers from SQL Server 2025 (Developer), Latin1_General_100_CI_AS[_SC[_UTF8]]
collations, 200k rows × 100 chars, on a shared host — trust the ratios over the
absolute milliseconds. Reproduce or poke holes: the harness builds the data, runs the
matrix and prints the receipts.
Need help with something like this?
I take on focused SQL Server engagements — performance, migrations, and CI/CD.