This page uses content from Wikipedia and is licensed under CC BYSA.
The Salsa quarterround function. Four parallel copies make a round.


General  

Designers  Daniel J. Bernstein 
First published  2007 (designed 2005)^{[1]} 
Related to  Rumba20, ChaCha 
Certification  eSTREAM portfolio 
Cipher detail  
Key sizes  256 bits 
State size  512 bits 
Structure  ARX 
Rounds  20 
Speed  3.91 cpb on an Intel Core 2 Duo^{[2]} 
Best public cryptanalysis  
2008 cryptanalysis breaks 8 out of 20 rounds to recover the 256bit secret key in 2^{251} operations, using 2^{31} keystream pairs.^{[3]} 
Salsa20 and the closely related ChaCha are stream ciphers developed by Daniel J. Bernstein. Because the two ciphers are very similar, this article will some times discuss them together. Salsa20 is the original cipher, and it was submitted to eSTREAM by Daniel J. Bernstein. ChaCha is a modification of Salsa20 published by Bernstein in 2008. It uses a new round function that increases diffusion and increases performance on some architectures.^{[4]}
Both ciphers are built on a pseudorandom function based on addrotatexor (ARX) operations — 32bit addition, bitwise addition (XOR) and rotation operations. The core function maps a 256bit key, a 64bit nonce, and a 64bit counter to a 512bit block of the key stream (a Salsa version with a 128bit key also exists). This gives Salsa20 and ChaCha the unusual advantage that the user can efficiently seek to any position in the key stream in constant time. Salsa20 offers speeds of around 4–14 cycles per byte in software on modern x86 processors,^{[5]} and reasonable hardware performance. It is not patented, and Bernstein has written several public domain implementations optimized for common architectures.^{[6]}
Internally, the cipher uses bitwise addition ⊕ (exclusive OR), 32bit addition mod 2^{32} ⊞, and constantdistance rotation operations (<<<) on an internal state of sixteen 32bit words. Using only addrotatexor operations avoids the possibility of timing attacks in software implementations. The internal state is made of sixteen 32bit words arranged as a 4x4 matrix.
0  1  2  3 
4  5  6  7 
8  9  10  11 
12  13  14  15 
The initial state is made up of 8 words of key, 2 words of stream position, 2 words of nonce (essentially additional stream position bits), and 4 fixed words:
Cons  Key  Key  Key 
Key  Cons  Nonce  Nonce 
Pos  Pos  Cons  Key 
Key  Key  Key  Cons 
The constant words spell "expand 32byte k" in ASCII (i.e. the 4 words are "expa", "nd 3", "2by", and "te k"). This is an example of a nothing up my sleeve number. The core operation in Salsa20 is the quarterround QR(a,b,c,d)
that takes a fourword input and produces a fourword output:
b ⊕= (a ⊞ d) <<< 7; c ⊕= (b ⊞ a) <<< 9; d ⊕= (c ⊞ b) <<< 13; a ⊕= (d ⊞ c) <<< 18;
Oddnumbered rounds apply QR(a,b,c,d)
to each of the four columns in the 4x4 matrix, and evennumbered rounds apply it to each of the four rows. Two consecutive rounds (columnround and rowround) together are called a doubleround:
// Odd round. QR( 0, 4, 8, 12) // column 1 QR( 5, 9, 13, 1) // column 2 QR(10, 14, 2, 6) // column 3 QR(15, 3, 7, 11) // column 4 // Even round. QR( 0, 1, 2, 3) // row 1 QR( 5, 6, 7, 4) // row 2 QR(10, 11, 8, 9) // row 3 QR(15, 12, 13, 14) // row 4
An implementation in C/C++ appears below.
#define ROT(a,b) (((a) << (b))  ((a) >> (32  (b)))) #define QR(a, b, c, d) \ b ^= ROT(a + d, 7); \ c ^= ROT(b + a, 9); \ d ^= ROT(c + b,13); \ a ^= ROT(d + c,18); void salsa20_block(uint32 out[16],uint32 in[16]) { int i; uint32 x[16]; for (i = 0;i < 16;++i) x[i] = in[i]; // 10 loops x 2 rounds/loop = 20 rounds. for (int i = 0; i < 10; i++) { // Odd round. QR(x[ 0], x[ 4], x[ 8], x[12]) // column 1 QR(x[ 5], x[ 9], x[13], x[ 1]) // column 2 QR(x[10], x[14], x[ 2], x[ 6]) // column 3 QR(x[15], x[ 3], x[ 7], x[11]) // column 4 // Even round. QR(x[ 0], x[ 1], x[ 2], x[ 3]) // row 1 QR(x[ 5], x[ 6], x[ 7], x[ 4]) // row 2 QR(x[10], x[11], x[ 8], x[ 9]) // row 3 QR(x[15], x[12], x[13], x[14]) // row 4 } for (i = 0;i < 16;++i) out[i] = x[i] + in[i]; }
In the last line, the mixed array is added, word by word, to the original array to obtain its 64byte key stream block. This is important because the mixing rounds on their own are invertible. In other words, applying the reverse operations would produce the original 4x4 matrix, including the key. Adding the mixed array to the original resolves that problem.
Salsa20 performs 20 rounds of mixing on its input.^{[1]} However, reduced round variants Salsa20/8 and Salsa20/12 using 8 and 12 rounds respectively have also been introduced. These variants were introduced to complement the original Salsa20, not to replace it, and perform even better^{[note 1]} in the eSTREAM benchmarks than Salsa20, though with a correspondingly lower security margin.
In 2008 Bernstein proposed a variant of Salsa20 with 192bit nonces called XSalsa20.^{[7]}^{[8]} XSalsa20 is provable secure if Salsa20 is secure, but is more suitable for applications where longer nonces are desired.
Salsa20 has been selected as a Phase 3 design for Profile 1 (software) by the eSTREAM project, receiving the highest weighted voting score of any Profile 1 algorithm at the end of Phase 2.^{[9]} Salsa20 had previously been selected as Phase 2 Focus design for Profile 1 (software) and as a Phase 2 design for Profile 2 (hardware) by the eSTREAM project,^{[10]} but was not advanced to Phase 3 for Profile 2 because eSTREAM felt that it was probably not a good candidate for extremely resource constrained hardware environments.^{[11]}
As of 2015^{[update]}, there are no published attacks on Salsa20/12 or the full Salsa20/20; the best attack known^{[3]} breaks 8 of the 12 or 20 rounds.
In 2005, Paul Crowley reported an attack on Salsa20/5 with an estimated time complexity of 2^{165}, and won Bernstein's US$1000 prize for "most interesting Salsa20 cryptanalysis".^{[12]} This attack, and all subsequent attacks are based on truncated differential cryptanalysis. In 2006, Fischer, Meier, Berbain, Biasse, and Robshaw reported an attack on Salsa20/6 with estimated time complexity of 2^{177}, and a relatedkey attack on Salsa20/7 with estimated time complexity of 2^{217}.^{[13]}
In 2007, Tsunoo et al. announced a cryptanalysis of Salsa20 which breaks 8 out of 20 rounds to recover the 256bit secret key in 2^{255} operations, using 2^{11.37} keystream pairs.^{[14]} However, this attack does not seem to be competitive with the brute force attack.
In 2008, Aumasson, Fischer, Khazaei, Meier, and Rechberger reported a cryptanalytic attack against Salsa20/7 with a time complexity of 2^{153}, and they reported the first attack against Salsa20/8 with an estimated time complexity of 2^{251}. This attack makes use of the new concept of probabilistic neutral key bits for probabilistic detection of a truncated differential. The attack can be adapted to break Salsa20/7 with a 128bit key.^{[3]}
In 2012 the attack by Aumasson et al. was improved by Shi et al. against Salsa20/7 (128bit key) to a time complexity of 2^{109} and Salsa20/8 (256bit key) to 2^{250}.^{[15]}
In 2013, Mouha and Preneel published a proof^{[16]} that 15 rounds of Salsa20 was 128bit secure against differential cryptanalysis. (Specifically, it has no differential characteristic with higher probability than 2^{−130}, so differential cryptanalysis would be more difficult than 128bit key exhaustion.)
In 2008, Bernstein published the closely related "ChaCha" family of ciphers, which aim to increase the diffusion per round while achieving the same or slightly better performance.^{[17]} The Aumasson et al. paper also attacks ChaCha, achieving one round fewer: for 256 bits ChaCha6 with complexity 2^{139} and ChaCha7 with complexity 2^{248}. 128 bits ChaCha6 within 2^{107}, but claims that the attack fails to break 128 bits ChaCha7.^{[3]}
Like Salsa20, ChaCha's initial state includes a 128bit constant, a 256bit key, a 64bit counter, and a 64bit nonce, arranged as a 4x4 matrix of 32bit words. But ChaCha rearranges some of the words in the initial state:
Cons  Cons  Cons  Cons 
Key  Key  Key  Key 
Key  Key  Key  Key 
Pos  Pos  Nonce  Nonce 
The constant is the same as Salsa20 ("expand 32byte k"). ChaCha replaces the Salsa20 quarterround QR(a,b,c,d)
with
a ⊞= b; d ⊕= a; d <<<= 16; c ⊞= d; b ⊕= c; b <<<= 12; a ⊞= b; d ⊕= a; d <<<= 8; c ⊞= d; b ⊕= c; b <<<= 7;
Notice that this version updates each word twice, while Salsa20's quarter round updates each word only once. In addition, the ChaCha quarterround diffuses changes more quickly. In average, after changing 1 input bit the Salsa20 quarterround will change 8 output bits while the ChaCha one will change 12.5 output bits. ^{[18]}.
Note also that the ChaCha quarter round has the same number of adds, xors, and bit rotates as the Salsa20 quarterround, but the fact that two of the rotates are multiples of 8 allows for a small optimization on some architectures including x86.^{[19]} Additionally, the input formatting has been rearranged to support an efficient SSE implementation optimization discovered for Salsa20. Rather than alternating rounds down columns and across rows, they are performed down columns and along diagonals.^{[20]} Like Salsa20, ChaCha arranges the sixteen 32bit words in a 4x4 matrix. If we index the matrix elements from 0 to 15
0  1  2  3 
4  5  6  7 
8  9  10  11 
12  13  14  15 
Then a double round in ChaCha is
// Odd round. QR(0, 4, 8, 12) // 1st column QR(1, 5, 9, 13) // 2nd column QR(2, 6, 10, 14) // 3rd column QR(3, 7, 11, 15) // 4th column // Even round. QR(0, 5, 10, 15) // diagonal 1 (main diagonal) QR(1, 6, 11, 12) // diagonal 2 QR(2, 7, 8, 13) // diagonal 3 QR(3, 4, 9, 14) // diagonal 4
ChaCha20 uses 10 iterations of the double round.^{[21]} An implementation in C/C++ appears below.
#define ROT(a,b) (((a) << (b))  ((a) >> (32  (b)))) #define QR(a, b, c, d) \ a += b; d ^= a; d = ROT(d,16); \ c += d; b ^= c; b = ROT(b,12); \ a += b; d ^= a; d = ROT(d, 8); \ c += d; b ^= c; b = ROT(b, 7) void chacha_block(uint32 out[16],uint32 in[16]) { int i; uint32 x[16]; for (i = 0;i < 16;++i) x[i] = in[i]; // 10 loops x 2 rounds/loop = 20 rounds. for (int i = 0; i < 10; i++) { // Odd round. QR(x[0], x[4], x[ 8], x[12]); // column 0 QR(x[1], x[5], x[ 9], x[13]); // column 1 QR(x[2], x[6], x[10], x[14]); // column 2 QR(x[3], x[7], x[11], x[15]); // column 3 // Even round. QR(x[0], x[5], x[10], x[15]); // diagonal 1 (main diagonal) QR(x[1], x[6], x[11], x[12]); // diagonal 2 QR(x[2], x[7], x[ 8], x[13]); // diagonal 3 QR(x[3], x[4], x[ 9], x[14]); // diagonal 4 } for (i = 0;i < 16;++i) out[i] = x[i] + in[i]; }
ChaCha is the basis of the BLAKE hash function, a finalist in the NIST hash function competition, and BLAKE2 successor tuned for even higher speed. It also defines a variant using sixteen 64bit words (1024 bits of state), with correspondingly adjusted rotation constants.
Google has selected ChaCha20 along with Bernstein's Poly1305 message authentication code as a replacement for RC4 in TLS, which is used for Internet security.^{[22]} Google's implementation secures https (TLS/SSL) traffic between the Chrome browser on Android phones and Google's websites.^{[23]}
Shortly after Google's adoption for TLS, both the ChaCha20 and Poly1305 algorithms were also used for a new [email protected] cipher in OpenSSH.^{[24]}^{[25]} Subsequently, this made it possible for OpenSSH to avoid any dependency on OpenSSL, via a compiletime option.^{[26]}
ChaCha20 is also used for the arc4random random number generator in FreeBSD^{[27]}, OpenBSD^{[28]}, and NetBSD^{[29]} operating systems, instead of the broken RC4, and in DragonFly BSD^{[30]} for the CSPRNG subroutine of the kernel.^{[31]}^{[32]} Starting from version 4.8, the Linux kernel uses the ChaCha20 algorithm to generate data for the nonblocking /dev/urandom device.^{[33]}^{[34]}^{[35]}
An implementation reference for ChaCha20 has been published in RFC 7539. The IETF's implementation modified Bernstein's published algorithm by changing 64bit nonce and 64bit block counter to 96bit nonce and 32bit block counter^{[36]}. The name was not changed when the algorithm was modified and it could be the source of confusion for developers. Because of the reduced block counter, maximum message length that can be safely encrypted by the IETF's variant is 2^{32}1 blocks of 64 bytes (about 256 GiB). For usages where this is not enough, such as file or disk encryption, RFC 7539 proposes using the original algorithm with 64bit nonce.
Use of ChaCha20 in IKE and IPsec have been proposed for standardization in RFC 7634. Proposed standardization of its use in TLS is published as RFC 7905.
two of these constants are multiples of 8; this allows for a 1 instruction rotation in Core2 and later Intel CPUs using the pshufb instruction
Replace the RC4 algorithm for generating inkernel secure random numbers with Chacha20
ChaCha based random number generator for OpenBSD.
Legacy arc4random(3) API from OpenBSD reimplemented using the ChaCha20 PRF, with perthread state.
chacha_encrypt_bytes
random: replace nonblocking pool with a Chacha20based CRNG
Changes from regular ChaCha. The nonce: block sequence number split was changed from 64:64 to 96:32