Skip to content
dbm.lt
Open menu
Back to blog
performance t-sql benchmarks methodology

How to benchmark something that takes a microsecond — in pure T-SQL

The methodology behind the variable-assignment benchmark: duration-based timing loops, forced recompiles, interleaved passes, min-of-trials, and a no-op baseline — with the actual harness code.

In the previous post I compared seven ways of declaring and assigning T-SQL variables. This post is about the harder problem hiding underneath: how do you trustworthily measure an operation that costs less than a microsecond, using a clock that ticks once a millisecond, on a machine you don’t fully control?

Get this wrong and the benchmark measures your harness, your plan cache, or your neighbours’ containers — and the conclusions are fiction. Here’s the full methodology, with the actual code. Everything runs in pure T-SQL against a disposable SQL Server container; no external profilers.

Problem 1: the operation is too small to time

A single DECLARE @x INT = 1 costs well under a microsecond. SYSUTCDATETIME() on this host only resolves to about a millisecond. Time one execution and you’ll measure zero.

The classic fix is a loop: run the batch N times, divide. But a fixed N has a trap — pick N too small for a fast method and the whole loop finishes inside a single clock tick, and you still measure a bogus zero. Pick N large enough for the fast methods and the slow configs (forced recompiles are ~100× dearer) take forever.

So the harness is duration-based instead: run until a wall-time budget elapses, count the iterations, divide.

WHILE DATEDIFF(MILLISECOND, @t0, @t1) < @target_ms
BEGIN
    SET @j = 0;
    WHILE @j < @CHECK_EVERY      -- don't check the clock every op
    BEGIN
        EXEC sys.sp_executesql @exec_sql;
        SET @j += 1;
    END
    SET @count += @CHECK_EVERY;
    SET @t1 = SYSUTCDATETIME();
END
-- ns/iter = elapsed_microseconds * 1000 / @count

Two details earn their keep here:

  • The clock is checked once per @CHECK_EVERY operations, not per operation — otherwise SYSUTCDATETIME() calls would dominate the very thing being measured.
  • @CHECK_EVERY adapts to the mode: 2,048 for cached ops, 64 for recompile ops, because recompiles are so much more expensive that a 2,048-op inner loop would blow way past the time budget before the first clock check.

Every measurement now spans hundreds of clock ticks regardless of how cheap or expensive the operation is. The full run used a 400 ms budget per configuration per pass.

Problem 2: generating 374 test batches without typos

Seven methods × variable counts from 1 to 64 × a dozen types × five string lengths — hand-writing those batches isn’t happening. A table-valued one-liner with GENERATE_SERIES + STRING_AGG builds each batch from its config:

DECLARE @n TABLE (i INT PRIMARY KEY);
INSERT INTO @n (i)
    SELECT value FROM GENERATE_SERIES(0, @var_count - 1);

-- e.g. the "SET per variable" shape:
DECLARE @decl_comb NVARCHAR(MAX) = 'DECLARE ' + (
    SELECT STRING_AGG('@v' + CAST(i AS VARCHAR(10)) + ' ' + @type_decl, ', ')
           WITHIN GROUP (ORDER BY i) FROM @n) + ';';

DECLARE @set_block NVARCHAR(MAX) = (
    SELECT STRING_AGG('SET @v' + CAST(i AS VARCHAR(10)) + ' = ' + @val + ';', ' ')
           WITHIN GROUP (ORDER BY i) FROM @n);

RETURN @decl_comb + ' ' + @set_block;

For set_assign, 2 variables, int, that returns exactly:

DECLARE @v0 int, @v1 int; SET @v0 = 12345; SET @v1 = 12345;

and for select_combined_assign:

DECLARE @v0 int, @v1 int; SELECT @v0 = 12345, @v1 = 12345;

The batch builder runs once per configuration, never inside the timed loop — string concatenation cost must not leak into the measurement. One subtlety: the methods that assign on the DECLARE can’t be separated from their declaration, so to stay apples-to-apples, every method’s timed batch goes from nothing to all-variables-assigned — the SET/SELECT methods include their DECLARE in the timed batch too.

Problem 3: cached and recompiled plans answer different questions

Executing a batch whose plan is cached measures steady-state execution — what a hot stored procedure pays per call. Executing a batch that must be parsed and compiled first measures the ad-hoc / recompile path. These differ by an order of magnitude, so a benchmark that doesn’t separate them measures a meaningless blend.

Cached mode is free: keep the batch text constant and sp_executesql reuses the plan after the first call. Recompile mode forces a cache miss by appending a globally unique comment each iteration:

-- per-call salt: no iteration can ever hit a plan cached by an earlier pass
DECLARE @salt NVARCHAR(16) =
    LEFT(REPLACE(CAST(NEWID() AS NVARCHAR(36)), '-', ''), 12);

SET @exec_sql = @sql + N' /*' + @salt + CAST(@count + @j AS NVARCHAR(20)) + N'*/';
EXEC sys.sp_executesql @exec_sql;

The comment changes the batch text, the text hash misses the cache, SQL Server compiles from scratch — which is exactly the cost we want on the clock.

The plan-cache trap that ruined the first run

Every one of those unique batches leaves a plan in the cache. Run a few hundred thousand of them and the cache bloats, so it must be cleared between configs. My first version used DBCC FREEPROCCACHE — and it quietly wrecked the benchmark: it flushes the entire instance’s cache, so every subsequent cached-mode config started cold and paid a compile that was billed to execution. The numbers looked plausible and were wrong.

The fix is the database-scoped clear, which only evicts this database’s plans:

IF @run_mode = 'recompile'
    ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;

If you take one tactical trick from this post, take that one — it applies any time you use plan-cache clearing in tests on a shared instance.

Problem 4: the machine is lying to you

This ran on a shared 4-core VM hosting ~25 other containers. CPU bursts from neighbours land randomly on top of the timings. Three defences, in increasing order of importance:

Warm-up. 100,000 untimed executions before any measurement, so the first real configs don’t pay a cold-instance tax (cold start dominated the very first matrix run before this was added).

Interleaved, shuffled passes. The naive layout — run config A’s samples back-to-back, then config B’s — is fragile: one 10-second CPU burst from a neighbour corrupts all of one config’s samples at once. Instead, the trial loop is on the outside, and the config order is reshuffled every pass:

WHILE @trial <= @trials
BEGIN
    DECLARE cfg CURSOR LOCAL FAST_FORWARD FOR
        SELECT method, var_count, type_decl, value_len, run_mode
        FROM #configs ORDER BY CHECKSUM(NEWID());   -- reshuffle each pass
    ...
END

Each config’s 8 samples are now spread across the whole multi-hour run in random order. A burst of noise hits a random subset of configs once, instead of one config eight times.

MIN across trials, not average. For a micro-benchmark on a contended host, noise is strictly additive — the OS can only ever make an iteration slower, never faster. So the fastest observed pass is the best estimate of the true, uncontended cost, and the average is just “true cost plus however busy the box happened to be.” Min-of-trials plus interleaving means each config very likely caught at least one quiet window. (Median, average, and stdev are recorded alongside — if the stdev is huge, the min is to be trusted less, and differences smaller than the trial-to-trial spread are noise, not signal.)

Problem 5: the harness itself costs something

The timed loop isn’t free: each iteration pays the WHILE bookkeeping and an sp_executesql dispatch (~1 µs combined). That overhead is constant across methods, so it cancels in method-to-method comparisons — but it pollutes absolute numbers.

So one more “method” runs through the identical pipeline: noop, the smallest batch that can exist (DECLARE @z BIT;). Its measured time is the harness overhead, plus the measurement floor of the box. Every number in the results is reported net of the noop baseline — what the method itself costs, with the scaffolding subtracted.

The noop floor also tells you what you can’t measure: on this box it was ~42 µs of loop + dispatch + ambient contention, which is exactly why the data-type and string-length sweeps (whose real effects are smaller than that) came back inconclusive, and the variable-count and recompile sweeps (whose effects are 10–50× the floor) came back clean. Knowing the noise floor isn’t a footnote — it’s what separates “we measured no effect” from “we can’t measure an effect this small here.”

The checklist

If you’re building a T-SQL micro-benchmark of your own:

  1. Duration-based loops, not fixed N — your clock resolves ~1 ms; never let a measurement span fewer than hundreds of ticks.
  2. Check the clock every K iterations, with K scaled to the operation’s cost.
  3. Generate batches, never hand-write themGENERATE_SERIES + STRING_AGG, built outside the timed loop.
  4. Measure cached and recompile separately; unique-comment markers force misses.
  5. Clear the plan cache database-scoped (ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE), never DBCC FREEPROCCACHE on anything shared.
  6. Warm up untimed, interleave and shuffle the passes, take the min of trials.
  7. Run a no-op through the same pipeline and report everything net of it — and let the no-op’s floor tell you which effects are below your ability to measure.
  8. Publish the spread, not just the point estimate: a difference smaller than the trial-to-trial stdev is a coin flip wearing a lab coat.

The whole harness is ~650 lines of T-SQL plus a shell script that spins up a disposable SQL Server 2025 container, runs the matrix, and exports CSVs — ./run.sh full reproduces the entire dataset from scratch. If you’d like the scripts, or you want a claim about your workload measured this carefully, get in touch.


Need help with something like this?

I take on focused SQL Server engagements — performance, migrations, and CI/CD.

Book a 30-min consult