Intersection of two sorted arrays
1. Problem statement
Compute the intersection of two sorted arrays (Aziz et al., 2018, p. 194). Deduplicate the output.
Input: [1, 2] [3, 4] Output: [] Input: [1, 1, 1, 2] [1, 2, 2, 3] Output: [1, 2]
2. Insights
2.1. Sorted inputs
This problem is begging us to exploit the fact that the inputs are already sorted.
3. Solution
3.1. Hash tables
We could use hash tables. We simply create two set()
objects (essentially hash
tables that only have keys, without associated values), and then get the
intersection between them.
The downside is the space complexity, which is \(O(mn)\) to store the additional
sets for each of the arrays, not to mention the sort()
we have to run because
the set()
function does not preserve order.
Also, using a hash table here is wasteful, because we are not taking advatage of the fact that the inputs are sorted.
We should try to find a solution that tries to avoid prohibitively expensive space complexity.
3.2. Brute force
We can compare all letters with overy other letter, by looping over twice.
This has time complexity \(O(xy)\) where \(x\) and \(y\) are lengths of the xs
and
ys
arrays. It's actually worse than this because we're assuming that the
expressions x in result
and y in result
run in constant time, which is
simply not true. But as the time complexity is already bad, we leave it at that.
3.3. Binary search
Instead of a nested loop, we can remove the inner loop with explicit binary
search using the bisect
module.
The condition (i == 0 or x != xs[i - 1])
avoids using duplicate x
entries,
by ensuring that the current x
is not the same as the previous element in
xs
. The i == 0
merely bypasses this filtering of duplicates because there's
no point in checking for duplicates for the very first element under
consideration.
The time complexity is \(O(m \log_{n})\) where \(m\) is the length of the outer
array we iterate over (xs
). The \(\log_{n}\) comes from the time complexity of
binary search over the searched array, ys
.
We can squeeze more performance out of this by using the shorter of the two
arrays, xs
and ys
, as the array to loop over. This way, we can iterate a
small number of times and then use the logarithmic power of binary search
against the larger array. Otherwise we'd be iterating over a large number of
items while binary searching over a small (tiny) array, where the performance
benefits of binary search won't be as apparent.
3.4. Optimal
In merged linked lists together, we saw that we only needed to traverse through
the two linked lists once because they were already sorted. We can apply the
same principle here and only traverse through the xs
and ys
lists once. The
trick is to have a generic loop and within this loop, have two separate indices
for each array. Then we advance the index for xs
or ys
depending on how we
process the current item pointed to by each of these indices.
Time complexity is \(O(m+n)\), because we spend \(O(1)\) time per element across both inputs.