3.3 Dilithium / ML-DSA Implementation
From the specification to working code — the key algorithmic components of FIPS 204.
ML-DSA (FIPS 204), formerly CRYSTALS-Dilithium, is the primary post-quantum digital signature standard. This article covers the implementation-level details: the arithmetic substrate, the key operations, the hint compression scheme, and the side-channel considerations that matter when building a production implementation.
The Arithmetic Ring
Dilithium operates in $R_q = \mathbb{Z}_q[x]/(x^{256}+1)$ with $q = 8{,}380{,}417 = 2^{23} - 2^{13} + 1$. This specific prime was chosen because:
Why q = 8,380,417
$q \equiv 1 \pmod{2 \cdot 256}$, making a 256-point NTT possible. The primitive 512th root of unity exists modulo $q$. The factored form $2^{23} - 2^{13} + 1$ enables efficient Montgomery reduction using just shifts and additions.
Coefficient ranges
Dilithium uses centered representation: coefficients in $[-(q-1)/2, (q-1)/2]$ rather than $[0, q-1]$. This simplifies norm computations and makes the "small" condition for secret/error polynomials equivalent to small absolute value.
Secret key distribution
Secret vectors $s_1 \in R_q^\ell, s_2 \in R_q^k$ have coefficients sampled uniformly from $\{-\eta, \dots, \eta\}$ where $\eta \in \{2, 4\}$ depending on security level. This gives $2\eta+1$ possible values per coefficient.
NTT Implementation
All polynomial multiplications in Dilithium use the NTT. The primitive 256th root of unity modulo $q$ is $\zeta = 1{,}753$ (a generator satisfying $\zeta^{256} \equiv -1 \pmod{q}$). The NTT algorithm used is a negacyclic NTT — the ring $R_q$ factors as a product of 256 degree-1 rings via the Chinese Remainder Theorem, enabling coefficient-wise multiplication in the NTT domain.
NTT Structure in Dilithium
The forward NTT maps $f(x) \in R_q$ to its evaluations at $\zeta^{2i+1}$ for $i = 0, \dots, 255$:
$\hat{f}_i = f(\zeta^{2i+1}) = \sum_{j=0}^{255} f_j \cdot \zeta^{j(2i+1)} \pmod{q}$
In-place Cooley-Tukey butterfly: 8 stages of 128 butterflies each = 1,024 butterfly operations total. Each butterfly is: $(a, b) \leftarrow (a + \zeta^k b, a - \zeta^k b)$ modulo $q$.
With AVX2: compute 8 butterflies in parallel using 256-bit SIMD registers with Montgomery multiplication for the modular reduction. Full NTT on one polynomial: ≈1μs on modern x86.
Expanding the Matrix A
The public matrix $A \in R_q^{k \times \ell}$ is generated deterministically from a seed $\rho$ using SHAKE-128. It is never stored — always regenerated from the seed when needed.
ExpandA Algorithm
For each entry $A[i][j]$ (row $i$, column $j$):
1. Feed $\rho \| i \| j$ (seed concatenated with indices) to SHAKE-128.
2. Stream output bytes and use rejection sampling to extract coefficients in $[0, q-1]$: discard any 3-byte group whose value $\geq q$; accept otherwise.
3. Each coefficient uses 3 bytes; expected ~1.25× overhead from rejection sampling.
The matrix is generated in NTT domain directly — no need to apply NTT afterward. Total cost for ML-DSA-65 ($6 \times 5 = 30$ ring elements): roughly 30 SHAKE-128 squeezes of 256 coefficients each.
The HighBits / LowBits Decomposition
Dilithium uses a clever bit-decomposition trick to compress signatures. Instead of sending the full commitment $w = Ay \pmod{q}$, the signer sends only the high bits of $w$, saving significant bandwidth. A small "hint" vector corrects any rounding errors during verification.
Definition 33 — Bit Decomposition
For coefficient $r \in \mathbb{Z}_q$ and parameter $\gamma_2$:
$r = r_1 \cdot 2\gamma_2 + r_0$
where $r_0 = r \bmod^+ 2\gamma_2$ (centered modulo) and $r_1 = (r - r_0) / 2\gamma_2$.
$\mathrm{HighBits}(r) = r_1$, $\mathrm{LowBits}(r) = r_0$.
In the signature, only $r_1$ (high bits) is transmitted. The hint bits $h$ in the signature encode corrections to the high bits that occur during the verification computation due to the term $-cs_2$. This keeps the signature compact while allowing exact verification.
Sampling Algorithms
Dilithium needs to sample three types of small polynomials during key generation and signing:
Secret key coefficients (±η)
Use SHAKE-256 in "uniform" mode, rejection-sampling coefficients in $\{0, \dots, 2\eta\}$ and centering. For $\eta = 2$: each byte provides two coefficients via nibbles (high 4 bits, low 4 bits), rejecting values $> 14$. Rejection rate ~7% — efficient.
Mask vector y (±γ₁)
Coefficients uniform in $[-\gamma_1 + 1, \gamma_1]$. For $\gamma_1 = 2^{17}$: each coefficient uses 18 bits; pack 8 coefficients into 18 bytes. For $\gamma_1 = 2^{19}$: each uses 20 bits; pack 8 coefficients into 20 bytes.
Challenge polynomial c
A sparse polynomial with exactly $\tau \in \{39, 49, 60\}$ nonzero coefficients of value $\pm 1$ out of 256. Generated by SHAKE-256 of the commitment hash: use Fisher-Yates to place $\tau$ nonzero positions, then sign each based on a bit.
Constant-Time Implementation
Dilithium is designed for straightforward constant-time implementation. The main areas of concern:
Side-Channel Considerations
Rejection sampling: The abort-and-restart loop in signing is not secret-dependent — the test is whether $\|z\|_\infty \geq \gamma_1 - \beta$, which can be computed in constant time. The number of restarts leaks only statistical information (not the secret), and is bounded in expectation.
Coefficient arithmetic: All reductions modulo $q$ use Barrett or Montgomery reduction — no secret-dependent branches.
Hint computation: The MakeHint and UseHint operations compare coefficients to a threshold — implemented with branchless conditional selection (e.g., using arithmetic right shift tricks).
Key expansion: Generate $A$ from seed — SHAKE-128 output is data-independent. No secret key touches A expansion.
Performance Benchmarks
ML-DSA-65 (AVX2, Skylake)
Key generation: ~50,000 cycles (~17 μs)
Signing: ~110,000 cycles (~37 μs) average
Verification: ~45,000 cycles (~15 μs)
For comparison: ECDSA-256 signing ≈70,000 cycles. Dilithium is comparable in speed with much stronger post-quantum security.
Microcontroller (Cortex-M4)
Key generation: ~2.5M cycles
Signing: ~5.8M cycles
Verification: ~2.2M cycles
At 168 MHz, this is ~15 ms for signing — fast enough for IoT authentication use cases. NTT can be optimized with Cortex-M4's DSP multiply-accumulate instructions.
Memory footprint
Public key: 1,952 bytes (ML-DSA-65)
Secret key: 4,032 bytes
Signature: 3,293 bytes
Stack usage during signing: ~8 KB for intermediate values. Acceptable for constrained embedded systems that typically have ≥16 KB RAM.
Deterministic vs Randomized Signing
ML-DSA supports both randomized signing (fresh random mask $y$ each time) and deterministic signing (derive $y$ from the secret key and message using a PRF). The NIST standard uses randomized signing by default. Deterministic signing is provided as an option for environments lacking a good entropy source — but it requires careful implementation to avoid fault injection attacks that can recover the secret key if the same mask is used twice.
References
Implementation resources
FIPS 204 specification — the authoritative ML-DSA standard document from NIST. doi.org ↗
Reference implementation and test vectors: github.com/pq-crystals/dilithium ↗
Micciancio & Regev survey for the mathematical foundations: PDF ↗