Parity
1. Introduction
The parity of a word is defined as 1 if there are an odd number of 1-bits in the word, and 0 otherwise.
Examples:
Word | Number of 1-bits | Even or odd | Parity |
---|---|---|---|
1011 | 3 | odd | 1 |
1000010000 | 2 | even | 0 |
11 | 2 | even | 0 |
11111 | 5 | odd | 1 |
2. Problem statement
Design an algorithm for computing the parity of a large number of 64-bit words (Aziz et al., 2018, p. 27).
3. Insights
Consider the simplest case of a single bit. There is nothing to compute: the parity of a 1-bit is 1, and the parity of a 0-bit is 0.
Now consider the next-simplest case of 2-bit words. There are 4 possible 2-bit words: `00`, `11`, `01`, and `10`. We can compute the parity by looking at a 2-bit word as 2 separate 1-bit words. So in this case we can split up the 2-bit word into A and B, the high-order and low-order bits, respectively. Then comes the trick: we can use the bitwise XOR (exclusive OR) instruction, computing `A XOR B`, to get the parity of the original 2-bit word.
2-bit word | Parity | A | B | A XOR B |
---|---|---|---|---|
00 | 0 | 0 | 0 | 0 |
11 | 0 | 1 | 1 | 0 |
01 | 1 | 0 | 1 | 1 |
10 | 1 | 1 | 0 | 1 |
Seen another way, the XOR instruction cancels out (zeroes out) pairs of 1-bits in a larger word. As you can imagine, we can use this property to compute the parity of 4-bit words, 8-bit words, and so on.
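Here is a tiny check of this insight (assuming Python), exhaustively confirming the 2-bit table above:

```python
# The parity of a 2-bit word equals the XOR of its two bits.
for word in (0b00, 0b11, 0b01, 0b10):
    a, b = word >> 1, word & 1  # high-order bit A, low-order bit B
    assert a ^ b == bin(word).count("1") % 2
```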
4. Solutions
4.1. Brute force
This approach simply computes the parity of a single word by examining every bit in the word. It also uses bitwise XOR to store only 1 or 0 (instead of storing the actual count of bits). Because of the way XOR works, when the number of 1-bits seen so far is odd (starting with the very first 1-bit), the parity is set to 1. If another 1-bit comes along (making the count even), it is XOR-ed against the previously-calculated parity bit 1 to give 0. If yet another 1-bit comes along (odd again), the 0 is XOR-ed against it to become 1 again, and so forth.
The space complexity is \(O(1)\) because we only store 0 or 1 for the parity bit. The time complexity is \(O(n)\) where \(n\) is the word size.
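A minimal sketch of this approach (assuming Python; the loop shifts the word down one bit at a time):

```python
def parity_brute_force(word):
    # Walk every bit, XOR-ing each low-order bit into a single result bit.
    result = 0
    while word:
        result ^= word & 1
        word >>= 1
    return result
```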
4.1.1. Clear lowest set bit
Here we use the `word &= word - 1` trick to clear the lowest set bit (the least significant 1-bit) in a word. This is a classic bitwise trick. It is an improvement over `word >>= 1` because we no longer have to examine every single bit. So our time complexity drops to \(O(k)\) where \(k\) is the number of bits set.
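A sketch of this variant (assuming Python):

```python
def parity_drop_lowest_set_bit(word):
    # Each pass through the loop clears the lowest set bit, so the loop
    # runs once per 1-bit rather than once per bit position.
    result = 0
    while word:
        result ^= 1
        word &= word - 1
    return result
```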
Our worst-case time complexity is still the same as the brute-force approach, though, because we'd run the `while` loop above 64 times if all 64 bits are set.
4.2. XOR-folding
The key to this solution (Aziz et al., 2018, pp. 29–30) is that we are essentially "folding" the word into itself over and over again until we just get a single bit left. Every "fold" is a XOR instruction, which is used to cancel out even pairs of 1-bits.
Basically, what we want to do is get rid of pairs of 1-bits set in the word. So imagine there are 57 1-bits set in a 64-bit word. If we keep subtracting 2 over and over again (removing pairs of 1-bits), we'll eventually arrive at just 1, which is the parity. Or, if there are 48 bits set, continually subtracting 2 at a time will make us arrive at 0, which again is the parity. The interesting thing to note here is that we can perform multiple "subtract 2" operations in parallel, because of the way XOR-ing works.
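Here is the folding routine as a short sketch (assuming Python and a 64-bit word); the walkthrough below steps through it:

```python
def parity_xor_fold(word):
    # Fold the word in half repeatedly; each XOR cancels matched pairs of 1-bits.
    word ^= word >> 32
    word ^= word >> 16
    word ^= word >> 8
    word ^= word >> 4
    word ^= word >> 2
    word ^= word >> 1
    return word & 1
```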
Let's use a real example below to see what this folding actually does, step by step. The first line, `word ^= word >> 32`, takes the XOR of the top 32 bits and the lower 32 bits using a bit shift. Consider the following 64-bit word as an example, which has 30 1-bits set:
0100100001001110010001111000110010010111000010011110001010111101 (30 1-bits)
Let's stack it on top of itself, shifted down 32 bits. Label the first word A and the shifted-down word as B.
```
0100100001001110010001111000110010010111000010011110001010111101 = A
0100100001001110010001111000110010010111000010011110001010111101 = B
```
For visual simplicity, let's "chop off" the bottom 32 bits of B and fill in the left side with 0's to complete the shift. We use the underscore instead of 0 to make it easier for us to track.
```
0100100001001110010001111000110010010111000010011110001010111101 = A
________________________________01001000010011100100011110001100 = B
```
Now take the XOR of these 2 words. For the top half, because B has all zeroes (underscores), we get the same bits as in the top half of A. However, this top half does not matter as we will soon see.
For the bottom half, we end up doing the equivalent of 32 1-bit XOR operations, but in parallel. The most important thing here to see is that the 1-bits in B that happen to line up with the 1-bits in A are canceled out. We can see this in C below. The 'x' represents garbage bits that are ignored for our folding operation.
```
0100100001001110010001111000110010010111000010011110001010111101 = A
________________________________01001000010011100100011110001100 = B
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx11011111010001111010010100110001 = C (A XOR B, or 30 - (2 * 6) = 18 bits)
```
So in summary, what we've done here is taken the top 32 bits and bottom 32 bits of the word and used the XOR operation to get rid of matched pairs of 1-bits. We want to do this because even numbers (pairs) of 1-bits are essentially ignored for purposes of calculating parity. The `A XOR B` operation resulted in 6 pairs of 1-bits being canceled out, so we now have 18 bits set.
The pseudocode below shows what happens with the rest of the shift and XOR operations. The main thing to keep in mind is that we "fold" the word into itself over and over again to get rid of pairs of 1-bits at each fold. Meanwhile, the region of bits we ignore keeps on growing.
```
# Shift 16.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx11011111010001111010010100110001 = C
________________xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx1101111101000111 = D
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx0111101001110110 = E (C XOR D, or 18 - (2 * 4) = 10 bits)

# Shift 8.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx0111101001110110 = E
________xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx01111010 = F
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx00001100 = G (E XOR F, or 10 - (2 * 4) = 2 bits)

# Shift 4.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx00001100 = G
____xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx0000 = H
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx1100 = I (G XOR H, or 2 bits)

# Shift 2.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx1100 = I
__xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx11 = J
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx11 = K (I XOR J, or 2 bits)

# Shift 1.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx11 = K
_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx1 = L
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx0 = M (K XOR L, or 2 - 2 = 0 (parity))
```
It should be obvious now why we do `word & 1` at the end: we really want to ignore all the garbage "x" bits on the left and only see if M has the lowest bit turned on.
The time complexity is reduced to \(O(\log{}n)\) where \(n\) is the word size. This makes sense because whatever the word size, we repeatedly "fold" it in half until we get down to just 1 bit we care about. This is about 20% faster on random input than the previous version, although on sparse inputs the previous one is faster (Aziz et al., 2018, p. 30).
4.3. XOR-folding with in-register table lookup
This is a small improvement over the previous version, noted in (Warren, 2013, p. 97).
The use of the `0x6996` constant is called an "in-register table lookup". Basically, once we get down to 4 bits, there are only 16 possibilities left, because 4 bits can only represent \(2^{4} = 16\) unique numbers, 0 to 15. The `0x6996` constant is simply computed by looking at all possible numbers 0 to 15, computing their parity, and assigning this to a new bit string. We start with the 15th bit though and go down to 0, because we shift this constant to the right in `0x6996 >> (word & 0xf)`. The `0x6996` constant is `0110 1001 1001 0110` in binary, and you can see how we got these binary bits by looking at the Parity column in the table below (a short sketch of the whole routine follows the table).
Number (Decimal) | Number (binary) | Parity |
---|---|---|
15 | 1111 | 0 |
14 | 1110 | 1 |
13 | 1101 | 1 |
12 | 1100 | 0 |
11 | 1011 | 1 |
10 | 1010 | 0 |
9 | 1001 | 0 |
8 | 1000 | 1 |
7 | 0111 | 1 |
6 | 0110 | 0 |
5 | 0101 | 0 |
4 | 0100 | 1 |
3 | 0011 | 0 |
2 | 0010 | 1 |
1 | 0001 | 1 |
0 | 0000 | 0 |
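Here is a sketch of the whole routine under this scheme (assuming Python and a 64-bit word): fold down to 4 bits, then index into the `0x6996` constant.

```python
def parity_in_register_lookup(word):
    # Fold the 64-bit word down to 4 bits, then treat 0x6996 as a
    # 16-entry, 1-bit-wide lookup table indexed by those 4 bits.
    word ^= word >> 32
    word ^= word >> 16
    word ^= word >> 8
    word ^= word >> 4
    return (0x6996 >> (word & 0xF)) & 1
```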
4.4. XOR-fold by nibbles
This method is also from (Warren, 2013, pp. 97–98). It uses XOR-folding to get the parity of each nibble (4-bit word) with the first 3 lines of the routine, sketched below. Then it uses a multiplication trick to get the sum of these parity-of-nibble bits into the high-order nibble, before finally AND-ing that nibble with 1 to check whether it is even or odd.
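A sketch of the routine (assuming Python; the final shift by 60, which brings the high-order nibble down before the AND, is implied rather than spelled out in the prose):

```python
def parity_nibble_fold(word):
    word ^= word >> 1
    word ^= word >> 2           # low bit of each nibble now holds that nibble's parity
    word &= 0x1111111111111111  # keep only the low bit of each nibble
    word *= 0x1111111111111111  # sum the 16 nibble bits into the high-order nibble
    word &= 0xFFFFFFFFFFFFFFFF  # emulate 64-bit overflow (Python ints are unbounded)
    return (word >> 60) & 1
```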
Let's break it down. The first 2 lines compute the parity of every nibble (4-bit word) in the 64-bit word. Here's an example, again using the same word from the XOR-folding section above, but with the bits grouped by nibble boundaries:
```
# word ^= word >> 1
0100 1000 0100 1110 0100 0111 1000 1100 1001 0111 0000 1001 1110 0010 1011 1101 = A
_010 0100 0010 0111 0010 0011 1100 0110 0100 1011 1000 0100 1111 0001 0101 1110 = B
x110 1100 0110 1001 0110 0100 0100 1010 1101 1100 1000 1101 0001 0011 1110 0011 = C (A XOR B)

# word ^= word >> 2
x110 1100 0110 1001 0110 0100 0100 1010 1101 1100 1000 1101 0001 0011 1110 0011 = C
__x1 1011 0001 1010 0101 1001 0001 0010 1011 0111 0010 0011 0100 0100 1111 1000 = D
xxx1 xxx1 xxx1 xxx1 xxx1 xxx1 xxx1 xxx0 xxx0 xxx1 xxx0 xxx0 xxx1 xxx1 xxx1 xxx1 = E (C XOR D)
```
The E word has lots of "x" bits in it because we treat each nibble boundary as its own independent word, in a sense. So instead of having 1 long string of "x" garbage bits like in "XOR-folding" above, we instead have 16 groups of garbage bits. But to restate, the low bit in each of the 16 nibbles in E holds the parity of the corresponding nibble in A.
The next line, `word &= 0x1111111111111111`, is pretty clear: it zeroes out the garbage bits in each nibble (a `0x1` in hex is the same as `0b0001`):
```
# word &= 0x1111111111111111
xxx1 xxx1 xxx1 xxx1 xxx1 xxx1 xxx1 xxx0 xxx0 xxx1 xxx0 xxx0 xxx1 xxx1 xxx1 xxx1 = E
0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 0001 = F (same as 0x1111111111111111)
0001 0001 0001 0001 0001 0001 0001 0000 0000 0001 0000 0000 0001 0001 0001 0001 = G (E AND F)
```
It should now be obvious that we simply want to tally up the total number of 1 bits in G. The interesting thing about G is that it has these 16 nibbles, and each nibble is either 0001 or 0000 in binary. What we want to do is just add these 16 nibbles together, like this:
```
0001 0001 0001 0001 0001 0001 0001 0000 0000 0001 0000 0000 0001 0001 0001 0001 (G)
0001 0001 0001 0001 0001 0001 0000 0000 0001 0000 0000 0001 0001 0001 0001 ____ (G << 4)
0001 0001 0001 0001 0001 0000 0000 0001 0000 0000 0001 0001 0001 0001 ____ ____ (G << 8)
0001 0001 0001 0001 0000 0000 0001 0000 0000 0001 0001 0001 0001 ____ ____ ____ (G << 12)
0001 0001 0001 0000 0000 0001 0000 0000 0001 0001 0001 0001 ____ ____ ____ ____ (G << 16)
0001 0001 0000 0000 0001 0000 0000 0001 0001 0001 0001 ____ ____ ____ ____ ____ (G << 20)
0001 0000 0000 0001 0000 0000 0001 0001 0001 0001 ____ ____ ____ ____ ____ ____ (G << 24)
0000 0000 0001 0000 0000 0001 0001 0001 0001 ____ ____ ____ ____ ____ ____ ____ (G << 28)
0000 0001 0000 0000 0001 0001 0001 0001 ____ ____ ____ ____ ____ ____ ____ ____ (G << 32)
0001 0000 0000 0001 0001 0001 0001 ____ ____ ____ ____ ____ ____ ____ ____ ____ (G << 36)
0000 0000 0001 0001 0001 0001 ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ (G << 40)
0000 0001 0001 0001 0001 ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ (G << 44)
0001 0001 0001 0001 ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ (G << 48)
0001 0001 0001 ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ (G << 52)
0001 0001 ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ (G << 56)
0001 ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ ____ (G << 60)
 \
  `- Tally up this first column with plain addition.
```
Tallying up the first column above is exactly what we want, because doing a plain addition on these bits gives us the total number of 1-bits in G. Then we can just AND this total with 1 to check if it is odd, which gives the parity.
Note that the partial sum of the second column can go up to a maximum of 15 (because the `G << 60` row contributes only a shifted-in zero to that column), so there is no fear of a carry from the second column contaminating the first column. Similarly, note that none of the other columns can sum to more than 15, so again there is no fear of contamination from any carries.
The naive way to sum up the first column is to literally write out the shifts and additions. However, we can do better, because the arithmetic operation of multiplication does precisely the same thing! This is why we multiply G by the same `0x1111111111111111` constant. This ends up adding the 1-bits in the first column together, putting the sum into the high-order hex digit. The sum is anything from 0 to 16 (it counts how many nibbles had odd parity), so we just have to shift this high-order digit down and AND it with 1 to check if the sum is even or odd.
The multiplication by the constant `0x1111111111111111` can be thought of as 16 different shifts and additions. Below is an illustration:
```
0x1111111111111111 * G is the same as ...
------------------
0x0000000000000001 * G, or G << 0,  plus
0x0000000000000010 * G, or G << 4,  plus
0x0000000000000100 * G, or G << 8,  plus
0x0000000000001000 * G, or G << 12, plus
0x0000000000010000 * G, or G << 16, plus
0x0000000000100000 * G, or G << 20, plus
0x0000000001000000 * G, or G << 24, plus
0x0000000010000000 * G, or G << 28, plus
0x0000000100000000 * G, or G << 32, plus
0x0000001000000000 * G, or G << 36, plus
0x0000010000000000 * G, or G << 40, plus
0x0000100000000000 * G, or G << 44, plus
0x0001000000000000 * G, or G << 48, plus
0x0010000000000000 * G, or G << 52, plus
0x0100000000000000 * G, or G << 56, plus
0x1000000000000000 * G, or G << 60
```
The above works because multiplying by a power of 2 is the same as shifting the number to the left. Just to make sure, let's use a smaller example to illustrate the point. Consider the 16-bit binary number `0001 0001 0000 0001 = S`. Let's multiply this by `0x1111` (4 separate multiplications by `0x1`, `0x10`, `0x100`, and `0x1000`).
```
# Multiply by 0x1 (same as S << 0).
0001000100000001 = S
x 1 = 0x1
------------------
= 0001000100000001 (same as S)

# Multiply by 0x10 (same as S << 4).
0001000100000001 = S
x 10000 = 0x10 = 16
------------------
0000000000000000
0000000000000000_
0000000000000000__
0000000000000000___
0001000100000001____ (same as S << 4)
= 0001000000010000 (lost top 4 bits because we can only hold 16 bits)

# Multiply by 0x100 (same as S << 8).
0001000100000001 = S
x 100000000 = 0x100 = 256
------------------
0000000000000000
0000000000000000_
0000000000000000__
0000000000000000___
0000000000000000____
0000000000000000_____
0000000000000000______
0000000000000000_______
0001000100000001________ (same as S << 8)
= 0000000100000000 (lost top 8 bits because we can only hold 16 bits)

# Multiply by 0x1000 (same as S << 12).
0001000100000001 = S
x 1000000000000 = 0x1000 = 4096
------------------
0000000000000000
0000000000000000_
0000000000000000__
0000000000000000___
0000000000000000____
0000000000000000_____
0000000000000000______
0000000000000000_______
0000000000000000________
0000000000000000_________
0000000000000000__________
0000000000000000___________
0001000100000001____________ (same as S << 12)
= 0001000000000000 (lost top 12 bits because we can only hold 16 bits)
```
We can now add these 4 subtotals together. The bottom 12 bits don't matter (we only care about the high-order nibble), but we still do the full addition for the sake of illustration.
```
  0001000100000001 (same as S)
  0001000000010000
  0000000100000000
+ 0001000000000000
------------------
  0011001000010001
```
The high-order nibble is `0011 = 3`, so we have 3 bits. If we AND it with 1, we get 1, which is our parity.
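A quick check of this small example (assuming Python, truncating to 16 bits as the hand calculation does):

```python
S = 0b0001000100000001
product = (S * 0x1111) & 0xFFFF  # keep only 16 bits
assert product == 0b0011001000010001
assert (product >> 12) & 1 == 1  # high-order nibble is 3, so the parity is 1
```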
Going back to our 64-bit example, we can see that multiplying by `0x1111111111111111` will similarly end up summing the number of bits in each nibble into the high-order nibble. In our example, G has 12 bits set, so the high-order nibble becomes `1100` (12), and AND-ing it with 1 gives 0, matching the even parity of our original word (which has 30 1-bits). Note that if the sum is 16, we'll end up getting `0000` in the nibble because the 1 will carry over into the 65th bit, out of range for our 64-bit word. However, it doesn't matter, because AND-ing it with 1 will still get us 0 (parity 0), which is the correct answer.
4.5. 16-bit caching
Because there are \(2^{64}\) possible values, we cannot build a lookup table (cache) keyed directly on 64-bit inputs. Instead we can cache 16-bit inputs (\(2^{16} = 65536\) values), and just do 4 16-bit lookups (because there are 4 16-bit words in a 64-bit word). Then we just take the XOR of these lookups to get the overall parity. Because the keys for the lookups can just be the raw 16-bit words, we can use these keys as indices into a list, instead of using a dictionary.
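A sketch of this approach (assuming Python; the `PARITY` table and the chunk names `a`, `b`, `c`, and `d` are illustrative):

```python
# Precompute the parity of every 16-bit value once, indexed by the raw value.
PARITY = [bin(i).count("1") & 1 for i in range(1 << 16)]

def parity_cached(word):
    a = PARITY[word >> 48]             # top 16 bits; no mask needed
    b = PARITY[(word >> 32) & 0xFFFF]
    c = PARITY[(word >> 16) & 0xFFFF]
    d = PARITY[word & 0xFFFF]
    return a ^ b ^ c ^ d
```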
Note that we don't have to mask the lower 16 bits for calculating `a` above, because shifting down by 48 bits only leaves 16 bits of information (everything else is cleared to 0). For the others, we have to mask with `0xffff` (16 bits) to grab only the relevant 16-bit areas.
The time complexity is just \(O(n/L)\), where \(L\) is the width of the cached results and \(n\) is the word size. This assumes that the shift operations take \(O(1)\) time. In our case, \(L\) is 16 and \(n\) is 64, so there are \(\frac{64}{16} = 4\) terms to look up.
4.6. XOR-fold with caching
This is an approach that combines XOR-folding with caching to achieve an even greater speedup.
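One way this combination can look (a sketch, assuming Python and the same hypothetical `PARITY` table from the previous section): fold the 64-bit word down to 16 bits with two XORs, then finish with a single cached lookup.

```python
def parity_fold_cached(word):
    # Fold 64 bits down to 16, then do one table lookup.
    word ^= word >> 32
    word ^= word >> 16
    return PARITY[word & 0xFFFF]
```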