Find first occurrence from sorted array
1. Problem statement
Given a sorted array, find the first occurrence of an item k in the array
(Aziz et al., 2018, p. 155). If the item does not exist, return None
.
Input: [1, 2, 4, 7, 8], k=7 Output: 3 (index 3) Input: [1], k=2 Output: None
2. Insights
2.1. Duplicate keys
If there are no duplicate keys (best-case scenario), then we can do binary search to get to the first occurrence.
If there are duplicates though (what if all items are equal to k?), then a single binary search iteration will not be enough.
Hint: if the middle element of the input is equal to k, we can eliminate all candidates to the right.
3. Solution
3.1. Brute force
The most obvious solution is to just iterate through the list, checking each item to see if it equals k.
def find_first_occurrence_brute_force(haystack: List[int], needle: int) -> Optional[int]: for i, item in enumerate(haystack): if item == needle: return i return None
3.2. Optimal
We can do a binary search once to figure out if the item exists in the array. Then we can do a linear search to the left in case the one we binary-searched-to is a duplicate, to find the first occurrence.
Obviously, the worst case complexity is quite bad. If there are 1 million items all with the same item k, then binary search will terminate after a single pass at the middle (500,000th) element. Then we'd still need to iterate 500,000 times to the left to find the first occurrence at index 0. As the number of items \(n\) in the array grow, the worst-case time complexity also grows linearly in proportion to \(n\). So our worst-case time complexity is \(O(n)\).
We can instead continually do a binary search until we find the first occurrence of the element. Whereas normal binary search stops when it finds the item, in our modified version we can eliminate everything to the right, because at best they will contain non-first duplicates of k or non-k items.
def find_first_occurrence_optimal(haystack: List[int], needle: int) -> Optional[int]: left = 0 right = len(haystack) - 1 result = None # Search the entire haystack to find the first occurrence. while left <= right: mid = (left + right) // 2 # Reduce the search space by 1/2 on every iteration. If the midpoint is # lower than the needle, move the left bound to just past the middle. # Otherwise (including the case where the midpoint is equal to or # greater than the needle), move the right bound to just below the # middle. This way we can continue to search even if we get a search # hit. if haystack[mid] < needle: left = mid + 1 else: right = mid - 1 # We got a search hit; record the location for now, but don't break the # loop because we want to continue searching until the left/right # pointers cross. if haystack[mid] == needle: result = mid return result
3.3. Optimal (Pythonic)
The Pythonic solution is to use the built-in index()
method for arrays, or the
bisect
module. Both of these solutions delegate the implementation details, so
there's not much work for us to do.
def find_first_occurrence_pythonic(haystack: List[int], needle: int) -> Optional[int]: result = None try: i = haystack.index(needle) except ValueError: pass else: result = i return result def find_first_occurrence_pythonic_bisect(haystack: List[int], needle: int) -> Optional[int]: result = None i = bisect.bisect_left(haystack, needle) if i != len(haystack) and haystack[i] == needle: result = i return result
4. Tests
import bisect from hypothesis import given, strategies as st from typing import List, Optional import unittest brute_force optimal optimal_pythonic class Test(unittest.TestCase): test_cases if __name__ == "__main__": unittest.main(exit=False)
4.1. Basic tests
def test_basic(self): # Empty search space. haystack = [] needle = 1 result = find_first_occurrence_brute_force(haystack, needle) self.assertEqual(result, None) # Simplest search hit. haystack = [1] needle = 1 result = find_first_occurrence_brute_force(haystack, needle) self.assertEqual(result, 0) # Basic examples, as described in the problem statement. haystack = [1, 2, 4, 7, 8] needle = 7 result = find_first_occurrence_brute_force(haystack, needle) self.assertEqual(result, 3) haystack = [1] needle = 2 result = find_first_occurrence_brute_force(haystack, needle) self.assertEqual(result, None) # Pathological case. haystack = [1] * 100 needle = 1 result = find_first_occurrence_brute_force(haystack, needle) self.assertEqual(result, 0)
4.2. Property-based tests
@given(st.lists(st.integers(min_value=1, max_value=50), min_size=1, max_size=50), st.integers(min_value=1, max_value=50)) def test_random(self, haystack: List[int], needle: int): # The input has to be sorted, per the rules of the problem. haystack.sort() result_bf = find_first_occurrence_brute_force(haystack, needle) result_optimal = find_first_occurrence_optimal(haystack, needle) result_pythonic = find_first_occurrence_pythonic(haystack, needle) result_pythonic_bisect = find_first_occurrence_pythonic_bisect(haystack, needle) # Do the solutions agree with each other? self.assertEqual(result_bf, result_optimal) self.assertEqual(result_optimal, result_pythonic) self.assertEqual(result_pythonic, result_pythonic_bisect)