Binary search tree
1. Problem statement
Implement a binary search tree (BST) (Sedgewick & Wayne, 2011, p. 396).
2. Insights
2.1. Sorted binary trees
Binary search trees are binary trees with one major difference: instead of manually controlling exactly where nodes should go in the structure of the tree during insertion, we give up this control in exchange for predictable (and typically very fast) performance.
How do we give up control? We assign a key to every value we want to store in the BST, and let the BST decide where to store the key/value pair. The key's type must be comparable (one key can be greater than or less than another). Whenever we want to add a key (and its associated value) to the BST, we start at the root node and compare the new key against the root's key. If the new key is smaller, we go down to the left subtree; if greater, we go down to the right subtree; if equal, we overwrite the root's value with the new value. We repeat this comparison at each node we visit, until we either overwrite an existing node's value or arrive at an empty spot below a leaf node (when no more comparisons can be made against an existing key), where the new node is attached.
If you perform an in-order traversal of a BST, you get back all the keys in sorted order.
In a large BST, each time we go down one level into the tree we eliminate about half of the remaining search space (assuming the tree is roughly balanced). As a result, we only need about \(\log_2N\) comparisons to find a key in a BST with \(N\) nodes. This is a powerful property, essentially the same one that makes binary search fast for sorted arrays. Indeed, the BST is already sorted: the insertion procedure described in the paragraph above ensures that the BST maintains its sorted nature.
2.2. Insertion order determines structure
Let's consider the keys [3, 4, 1, 5, 2]. The shape of the final BST will depend on the insertion order.
```python
t = BinarySearchTree()
t.insert(3)  # Empty tree, so 3 becomes root node.
t.insert(4)  # 4 is greater than 3, so insert to right child of 3.
t.insert(1)  # 1 becomes 3's left child.
t.insert(5)  # 5 becomes 4's right child.
t.insert(2)  # 2 becomes 1's right child.
```
Pictorially it will look like this:

```
      3
     / \
    1   4
     \   \
      2   5
```
If we insert them in sorted order, we only end up using the right child at each level of the tree:
```python
t = BinarySearchTree()
t.insert(1)  # Empty tree, so 1 becomes root node.
t.insert(2)  # 2 is greater than 1, so insert to right child of 1.
t.insert(3)  # 3 becomes 2's right child (3 is greater than both 1 and 2).
t.insert(4)  # 4 becomes 3's right child (4 is greater than 1, 2, and 3).
t.insert(5)  # 5 becomes 4's right child (5 is greater than 1, 2, 3, and 4).
```
This means that if we want to take advantage of the \(\log_2N\) lookup speed, we have to make sure that the keys are not inserted in sorted order.
If the input is known to be random, this pathological case can of course be ignored. If the input may be non-random (and we don't want to bother controlling how the keys are fed into the BST), then self-balancing binary search trees may be more appropriate: they are resistant to sorted input because they move nodes around (if necessary) during each insertion, ensuring that we don't end up with a structure resembling a linked list.
3. Solution
We use two classes: one for the nodes (Node) and another for the overall API around binary trees (BinarySearchTree). Splitting things up this way makes the (recursive) insertion method easy.
The algorithms presented here are taken largely from Algorithms (Sedgewick & Wayne, 2011, p. 396).
3.1. Initialization
3.1.1. Node initialization
The initializer purposely leaves out the left and right subtrees as parameters. This way, we force users to go through our API, which guarantees that we construct the tree correctly.
Note that earlier we said that each key is comparable and is associated with a value. Like a hashmap, our binary search tree cannot hold duplicate keys: when we store a new value for an existing key, we overwrite the value stored at that key.
We set the left and right links to None. These point to other Node instances; later, when we add more nodes into this tree, these links will become populated.

We set the count to 1. We'll update this field when we insert new nodes into this tree.
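As a minimal sketch of this initialization (the field name val is an assumption; left, right, and count come from the discussion above):

```python
from typing import Any, Optional

class Node:
    def __init__(self, key: int, val: Any) -> None:
        self.key = key
        self.val = val
        # Links to child Node instances; these get populated as we
        # insert more nodes into the tree.
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        # Number of nodes in the subtree rooted at this node.
        self.count = 1
```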
3.1.2. BST initialization
A new binary tree is empty, so it doesn't require much. The only optional input is what the new root node will look like (if the caller has already constructed such a node).
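A sketch of what this initializer might look like, assuming the optional prebuilt root node described above (the Node class here is just a minimal stand-in):

```python
from typing import Any, Optional

class Node:
    # Minimal stand-in for the Node class from the previous section.
    def __init__(self, key: int, val: Any):
        self.key, self.val = key, val
        self.left = self.right = None
        self.count = 1

class BinarySearchTree:
    def __init__(self, root: Optional[Node] = None) -> None:
        # A new tree is empty unless the caller supplies a root node.
        self.root = root
```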
3.2. Insertion
Insertion requires us to compare the given key with what we already have in the tree. If the key does not exist, we have to add it to the correct spot. If it exists, we overwrite the existing entry.
The "correct spot" works like this: the key/value pair goes into the left subtree if the key is smaller than the current (root) key, and into the right subtree if it is greater. If the key is the same as the current key, we replace the value. This selection process is repeated recursively as many times as necessary until we either replace an existing node or add a new one (where the left or right subtree is empty).
For simplicity, we use integers for the keys.
Note that we have two methods, insert() and _insert(). The second (private) one is recursive and calls itself as many times as necessary to check for the existence of the given key. The first (public) one simply starts off this recursive search using the current root node of the tree.
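A sketch of how this insert()/_insert() pair might look; the subtree-count bookkeeping follows the count field introduced earlier, and the exact type hints are assumptions:

```python
from typing import Any, Optional

class Node:
    def __init__(self, key: int, val: Any):
        self.key, self.val = key, val
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        self.count = 1

class BinarySearchTree:
    def __init__(self, root: Optional[Node] = None):
        self.root = root

    def insert(self, key: int, val: Any) -> None:
        # Public method: kick off the recursion at the root.
        self.root = self._insert(self.root, key, val)

    def _insert(self, node: Optional[Node], key: int, val: Any) -> Node:
        if node is None:
            # We fell off the tree: this empty spot is the correct spot.
            return Node(key, val)
        if key < node.key:
            node.left = self._insert(node.left, key, val)
        elif key > node.key:
            node.right = self._insert(node.right, key, val)
        else:
            # Key already exists: overwrite its value.
            node.val = val
        # Recompute the subtree size on the way back up.
        node.count = (1 + (node.left.count if node.left else 0)
                        + (node.right.count if node.right else 0))
        return node
```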
3.2.1. Convenient integer-only insertion
While our default insert() implementation above accounts for having distinct keys and values, for our purposes of demonstrating how the various binary tree algorithms behave we don't really care what the values look like (everything is based on the keys). To that end, we define an insert_int() method that only allows insertion of integers; it sets the value to a string representation of the key with a val= prefix, to drive home the point that the values are distinct from the keys.
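A sketch of insert_int() on top of a compact iterative insert (the iterative form and the omitted count bookkeeping are simplifications for brevity):

```python
from typing import Any, Optional

class Node:
    def __init__(self, key: int, val: Any):
        self.key, self.val = key, val
        self.left = self.right = None

class BinarySearchTree:
    def __init__(self):
        self.root: Optional[Node] = None

    def insert(self, key: int, val: Any) -> None:
        # Compact iterative insert (subtree counts omitted for brevity).
        if self.root is None:
            self.root = Node(key, val)
            return
        cur = self.root
        while True:
            if key < cur.key:
                if cur.left is None:
                    cur.left = Node(key, val)
                    return
                cur = cur.left
            elif key > cur.key:
                if cur.right is None:
                    cur.right = Node(key, val)
                    return
                cur = cur.right
            else:
                cur.val = val
                return

    def insert_int(self, key: int) -> None:
        # Derive the value from the key, with a "val=" prefix to make
        # it obvious that values are distinct from keys.
        self.insert(key, f"val={key}")
```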
3.3. Size
Just like for binary trees, getting the size is a simple field lookup. During insertion and deletion we make sure that the count field is kept up to date.
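As a small sketch, assuming the count field on each node described earlier:

```python
from typing import Any, Optional

class Node:
    def __init__(self, key: int, val: Any):
        self.key, self.val = key, val
        self.left = self.right = None
        self.count = 1  # subtree size, maintained on insert/delete

class BinarySearchTree:
    def __init__(self, root: Optional[Node] = None):
        self.root = root

    def size(self) -> int:
        # Just a field lookup; an empty tree has size 0.
        return self.root.count if self.root else 0
```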
3.4. Lookup
Lookup is almost identical to insertion — we recursively check for the existence of the given key. And just like for insertion, we have to start off the recursive search with the root of the tree.
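A sketch of the lookup()/_lookup() pair, mirroring the two-method structure used for insertion (method names and the None-on-miss behavior are assumptions):

```python
from typing import Any, Optional

class Node:
    def __init__(self, key: int, val: Any):
        self.key, self.val = key, val
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None

class BinarySearchTree:
    def __init__(self, root: Optional[Node] = None):
        self.root = root

    def lookup(self, key: int) -> Optional[Any]:
        # Public method: start the recursive search at the root.
        return self._lookup(self.root, key)

    def _lookup(self, node: Optional[Node], key: int) -> Optional[Any]:
        if node is None:
            return None          # key is not in the tree
        if key < node.key:
            return self._lookup(node.left, key)
        if key > node.key:
            return self._lookup(node.right, key)
        return node.val          # found it
```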
3.5. Deletion
Deletion is tricky, for the same reasons as for plain binary trees.
When we want to delete a node, we first have to search for it. Once we find the node we want to delete, there are again 3 cases:
- the node to delete has 0 children (leaf node),
- the node to delete has 1 child, or
- the node to delete has 2 children.
If the node we want to delete is a leaf node, there's no extra work. If it has a single child, we just make that child the replacement. If it has 2 children, we have to pick a replacement node.
The "successor" s of a key k in a BST is the smallest key that is still greater than k. In other words, if we were to print all the keys in order, s would immediately follow k.
Similarly, the "predecessor" p is the largest key that is still smaller than k, and would be printed just before k.
For plain binary trees, we decided to choose the leftmost leaf node as the replacement. For BSTs, we choose the node with the smallest key in the right subtree. This node is known as the "successor", because it holds the next-largest key in the BST after the to-be-deleted node's key. Because BSTs are already ordered, choosing this node as the replacement preserves the order in the BST (after the deletion, the tree remains a proper BST): the smallest key in the right subtree is still larger than every key in the left subtree.
When deleting a node with 2 children, Sedgewick & Wayne (2011, p. 410) advise choosing randomly between the successor and the predecessor as the replacement (it's more "balanced" and thus leads to better tree structure over time). For our purposes we stick to choosing the successor because it is predictable, which makes unit testing easier.
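A sketch of successor-based deletion; count maintenance is omitted here for brevity, and the method names are assumptions:

```python
from typing import Any, Optional

class Node:
    def __init__(self, key: int, val: Any):
        self.key, self.val = key, val
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None

class BinarySearchTree:
    def __init__(self, root: Optional[Node] = None):
        self.root = root

    def insert(self, key: int, val: Any) -> None:
        self.root = self._insert(self.root, key, val)

    def _insert(self, node, key, val):
        if node is None:
            return Node(key, val)
        if key < node.key:
            node.left = self._insert(node.left, key, val)
        elif key > node.key:
            node.right = self._insert(node.right, key, val)
        else:
            node.val = val
        return node

    def delete(self, key: int) -> None:
        self.root = self._delete(self.root, key)

    def _delete(self, node, key):
        if node is None:
            return None              # key not in tree; nothing to do
        if key < node.key:
            node.left = self._delete(node.left, key)
        elif key > node.key:
            node.right = self._delete(node.right, key)
        else:
            # 0 or 1 child: splice in the other child (possibly None).
            if node.left is None:
                return node.right
            if node.right is None:
                return node.left
            # 2 children: copy in the successor (the smallest key in the
            # right subtree), then delete it from the right subtree.
            succ = node.right
            while succ.left:
                succ = succ.left
            node.key, node.val = succ.key, succ.val
            node.right = self._delete(node.right, succ.key)
        return node
```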
3.6. Traversal
The code here is identical to the code for binary trees. See the discussion around traversal there.
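As a sketch of the in-order case (the function name and visitor-callback style are assumptions):

```python
from typing import Callable, Optional

class Node:
    def __init__(self, key, val):
        self.key, self.val = key, val
        self.left = self.right = None

def traverse_inorder(node: Optional[Node],
                     visit: Callable[[Node], None]) -> None:
    # Visit the left subtree, then the node itself, then the right subtree.
    if node is None:
        return
    traverse_inorder(node.left, visit)
    visit(node)
    traverse_inorder(node.right, visit)
```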
3.6.1. BFS
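A sketch of level-order (BFS) traversal using a FIFO queue; the function name is an assumption:

```python
from collections import deque
from typing import Callable, Optional

class Node:
    def __init__(self, key, val):
        self.key, self.val = key, val
        self.left = self.right = None

def traverse_bfs(root: Optional[Node],
                 visit: Callable[[Node], None]) -> None:
    # Visit nodes level by level, left to right within each level.
    queue = deque([root] if root else [])
    while queue:
        node = queue.popleft()
        visit(node)
        if node.left:
            queue.append(node.left)
        if node.right:
            queue.append(node.right)
```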
4. Tests
4.1. Initialization
4.2. Insertion
4.3. Lookup
4.4. Deletion
4.4.1. Property based tests
Randomly generate keys to insert, and randomly delete some of them.
If we delete every node, then the tree should have size 0.
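A sketch of this property using the standard-library random module (rather than a dedicated property-testing library such as Hypothesis); the compact BST here maintains count so that size() stays a field lookup:

```python
import random
from typing import Any, Optional

class Node:
    def __init__(self, key: int, val: Any):
        self.key, self.val = key, val
        self.left = self.right = None
        self.count = 1

class BinarySearchTree:
    def __init__(self):
        self.root: Optional[Node] = None

    def size(self) -> int:
        return self.root.count if self.root else 0

    def insert(self, key, val):
        self.root = self._insert(self.root, key, val)

    def _insert(self, node, key, val):
        if node is None:
            return Node(key, val)
        if key < node.key:
            node.left = self._insert(node.left, key, val)
        elif key > node.key:
            node.right = self._insert(node.right, key, val)
        else:
            node.val = val
        self._recount(node)
        return node

    def delete(self, key):
        self.root = self._delete(self.root, key)

    def _delete(self, node, key):
        if node is None:
            return None
        if key < node.key:
            node.left = self._delete(node.left, key)
        elif key > node.key:
            node.right = self._delete(node.right, key)
        else:
            if node.left is None:
                return node.right
            if node.right is None:
                return node.left
            succ = node.right
            while succ.left:
                succ = succ.left
            node.key, node.val = succ.key, succ.val
            node.right = self._delete(node.right, succ.key)
        self._recount(node)
        return node

    def _recount(self, node):
        node.count = (1 + (node.left.count if node.left else 0)
                        + (node.right.count if node.right else 0))

def check_delete_all(num_keys: int = 50) -> bool:
    # Property: insert random distinct keys, delete them all (in a
    # different random order), and the tree must end up with size 0.
    keys = random.sample(range(10 * num_keys), num_keys)
    t = BinarySearchTree()
    for k in keys:
        t.insert(k, f"val={k}")
    random.shuffle(keys)
    for k in keys:
        t.delete(k)
    return t.size() == 0
```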
4.5. Traversal
For these traversals, we construct the following binary tree (the keys are integers and the values are just the string representations of the keys; these redundant values are omitted from the illustration).
4.5.1. Property based tests
When we do an in-order traversal, we should get back the elements in sorted order, regardless of which keys were inserted in which order.
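A sketch of this property, again with the stdlib random module; the function-style insert here is a simplification of the class-based API:

```python
import random
from typing import Any, List, Optional

class Node:
    def __init__(self, key, val):
        self.key, self.val = key, val
        self.left = self.right = None

def insert(node: Optional[Node], key: int, val: Any) -> Node:
    # Function-style insert, to keep the sketch short.
    if node is None:
        return Node(key, val)
    if key < node.key:
        node.left = insert(node.left, key, val)
    elif key > node.key:
        node.right = insert(node.right, key, val)
    else:
        node.val = val
    return node

def inorder_keys(node: Optional[Node]) -> List[int]:
    if node is None:
        return []
    return inorder_keys(node.left) + [node.key] + inorder_keys(node.right)

def check_inorder_sorted(num_keys: int = 50) -> bool:
    # Property: whatever the insertion order, an in-order traversal
    # returns the keys in sorted order.
    keys = random.sample(range(10 * num_keys), num_keys)
    root = None
    for k in keys:
        root = insert(root, k, f"val={k}")
    return inorder_keys(root) == sorted(keys)
```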
5. Export
```python
from __future__ import annotations

from typing import Any, Callable, Optional

code
```