Hands-on Tutorials

Where’s Waldo?

Search Algorithms In The Physical World

Vishesh Khemani, Ph.D.
Towards Data Science
9 min read · Apr 6, 2021


Photo by Karlie Mitchell on Unsplash

Imagine you want to read (or re-read) your copy of Infinite Jest. You have thousands of books arrayed across bookcases around your house. How would you find that particular book?

Brute-force

You could walk around the house scanning each and every book on each and every shelf until you find Infinite Jest. This would work but would be quite tiresome and time-consuming. Such a strategy is known as a brute-force algorithm in Computer Science. It establishes that a solution exists and sets a worst-case baseline. There’s nowhere to go but up from the brute-force strategy.

This algorithm is clearly cumbersome, but how much so? In the best case, Infinite Jest will happen to be among the first few books that you look through and you find it quickly. In the worst case, you only come across it after having looked at all the other books. As you go through life using the brute-force search repeatedly, you’ll find that on average you have to go through half the books to find a particular book. In Computer Science parlance, the run-time of this algorithm is N, where N is the total number of books. Roughly speaking it means that on average it would take an amount of time proportional to the number of books to find a particular book. If you double the number of books, it will take you twice the time to find a specific one.
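
To make the idea concrete, here is a minimal Python sketch of the brute-force scan. The shelf layout and titles are hypothetical; the point is simply that, in the worst case, every book gets examined.

```python
def brute_force_find(shelves, target_title):
    """Scan each and every book on each and every shelf until the target turns up."""
    for shelf in shelves:
        for book in shelf:
            if book == target_title:
                return book
    return None  # not in the collection

# Hypothetical collection: a list of shelves, each holding a list of titles.
shelves = [
    ["Moby Dick", "Frog And Toad Are Friends"],
    ["The Trial", "Infinite Jest"],
]
print(brute_force_find(shelves, "Infinite Jest"))
```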

How can you do better than a brute-force search? By organizing the books! In Computer Science, an organization of data to solve a problem is known as using a data structure.

Binary Search

Let’s start with a common and familiar organization strategy: arrange the books in increasing alphabetical order of the titles.

Books arranged in alphabetical order (image by author)

Obviously, you’ll be able to find Infinite Jest much faster than having to check each and every book. You know how to find something in an alphabetically arranged list without even thinking about the algorithm you’re using. The key property you implicitly use is that if you check a book and its title is alphabetically greater (lesser) than the book you’re trying to find, then your desired book has to be before (after) that book. In this way, you exploit the ordering of the books to repeatedly narrow the range of books where your target book can be.

What precisely is the algorithm? It’s called binary search. Here’s how it goes. You check a book roughly at the halfway location among the set of arranged books and you find that it’s Moby Dick. Infinite Jest comes before Moby Dick in an alphabetic ordering, so you know that it must be somewhere between the first book and Moby Dick. You’ve just reduced the set of books to search by half. You then check the book at the halfway location in the new range (between the first book and Moby Dick). It turns out to be Frog And Toad Are Friends. So you know that Infinite Jest must be between Frog And Toad Are Friends and Moby Dick. Again you’ve further reduced the range of books by half. Continuing in this way you eventually find Infinite Jest.

Binary search for Infinite Jest (image by author)

How much faster is binary search compared to a brute-force full-scan search? If there are N books to search among, after the first comparison the number of books to search reduces to N/2. After the next comparison the set is of size N/4, then N/8, and so on, until there is only one book left: Infinite Jest. If K is the number of comparisons needed to find a book, then roughly 2^K = N, or equivalently K = log N (using a base-2 logarithm). In other words, the run-time of binary search is log N. So if you double the number of books, you need to do just one extra comparison to find a particular one. That is indeed a lot better than the brute-force run-time that scales linearly with the number of books.
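
Here is a minimal Python sketch of binary search over an alphabetized shelf, assuming the titles live in a sorted list; Python's string comparison stands in for comparing titles alphabetically.

```python
def binary_search(sorted_titles, target_title):
    """Repeatedly halve the range of shelf positions where the target can be."""
    lo, hi = 0, len(sorted_titles) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_titles[mid] == target_title:
            return mid        # found it at this shelf position
        if sorted_titles[mid] < target_title:
            lo = mid + 1      # target must come after the middle book
        else:
            hi = mid - 1      # target must come before the middle book
    return -1                 # not in the collection

# Hypothetical alphabetized shelf.
titles = ["Frog And Toad Are Friends", "Infinite Jest", "Moby Dick", "The Trial"]
print(binary_search(titles, "Infinite Jest"))  # 1
```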

A sorted array with binary search seems like a good solution, but it has two significant problems:

  1. It’s difficult to sort a large number of books. There are clever sorting algorithms that can help with this, but there’s no getting around the fact that you’ll have to do a muscle-numbing number of book swaps to get them ordered right.
  2. It’s difficult to add or remove books from a sorted collection. For example, if you bought a new book that belonged at a particular location on a bookshelf based on the title sorting, and if there was no room on that shelf, you’d have to shift many (or all) of the books after that location to make room for it, as sketched in the snippet below.
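
The second problem is easy to see in code. In this Python sketch (an in-memory stand-in for the bookshelf), bisect.insort finds the right spot in about log N comparisons, but the list still has to shift every later title one slot over, just as you would have to shift books along the shelves.

```python
import bisect

titles = ["Frog And Toad Are Friends", "Moby Dick", "The Trial"]

# Finding the insertion point is cheap (about log N comparisons), but making
# room is not: every title after the insertion point gets shifted one slot.
bisect.insort(titles, "Infinite Jest")
print(titles)
# ['Frog And Toad Are Friends', 'Infinite Jest', 'Moby Dick', 'The Trial']
```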

Binary Search Tree

Here’s a different way to organize the books to address the problems of sorting and adding/removing books. Leave the books in whatever order they are on the bookshelves. Insert an index card in each book. Each index card will contain a few pieces of information that encode the data structure.

The first piece of information on an index card is the position of the book. The first book’s index card will note its position as 1, the next one as 2, and so on.

Each index card will also contain the positions of its left child and right child. This structure of pointers defines a binary tree. The first book is the root of the binary tree. The second book is the left or right child of the root, depending on whether its title is alphabetically lower or higher. And so on. The ordering property (every title in a book’s left subtree is alphabetically lower than the book’s title, and every title in its right subtree is alphabetically greater) is what makes the binary tree a binary search tree.

To add a new book, start at the first book (the root), follow the left or right child depending on whether the new book’s title is before or after the root’s title, keep repeating the traversal down the tree, and finally when the child to follow does not exist, make the new book the child. Note that for simplicity we’re including any leading article (the, a, an) in the book’s title. This is sketched out with an example in the following figure:

Adding books in a binary search tree (image by author)

To find a book, start at the first book (the root) and follow the appropriate child (depending on the ordering of titles) until you reach the desired book. Here’s an illustration of how to find Infinite Jest in the binary search tree we constructed above:

Finding Infinite Jest in a binary search tree (image by author)
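
Here is a minimal Python sketch of the index-card scheme, assuming each card is a small record holding a title plus the positions of its left and right children. Positions are just indices into a list of cards; in the physical version, each hop to a position is itself a small search, a detail discussed below.

```python
# Each index card records a title and the positions (list indices here)
# of its left and right children. Position 0 is the root.

def add_book(cards, title):
    """Insert a new title into the binary search tree of index cards."""
    cards.append({"title": title, "left": None, "right": None})
    new_pos = len(cards) - 1
    if new_pos == 0:
        return                           # the first book becomes the root
    pos = 0                              # start at the root
    while True:
        side = "left" if title < cards[pos]["title"] else "right"
        if cards[pos][side] is None:
            cards[pos][side] = new_pos   # hang the new book here
            return
        pos = cards[pos][side]           # keep walking down the tree

def find_book(cards, title):
    """Follow left/right children from the root until the title matches."""
    pos = 0 if cards else None
    while pos is not None:
        card = cards[pos]
        if card["title"] == title:
            return pos
        pos = card["left"] if title < card["title"] else card["right"]
    return None                          # not in the collection

cards = []
for t in ["Moby Dick", "Frog And Toad Are Friends", "The Trial", "Infinite Jest"]:
    add_book(cards, t)
print(find_book(cards, "Infinite Jest"))  # 3
```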

What is the run-time of finding a book using this binary-search-tree algorithm? Since we traverse down the tree picking either the left child or the right child, we’re essentially reducing the search space by half after every step, similar to the binary search algorithm. So, just like the previous result, it takes log N comparisons to find a book.

There are a couple of details we glossed over:

  1. To reduce the search space by half after every comparison, the tree should be balanced, i.e., at every node the left and right subtrees should have roughly the same height. Generally, when titles are inserted into the tree in a random order, the tree should be sufficiently balanced. Failing that, there are algorithmic techniques (e.g. tree rotation) to balance a tree (but that’s out of scope for this article).
  2. To find a child at a particular position, you have to rely on the books being sorted by position and use a binary search to find the child at the indicated position. So, each of the log N hops down the tree itself requires log N comparisons of the position on the index cards to find the book. That leads to a (log N)² run-time.

The binary search tree solution is correct, feasible, and efficient. But it is overkill. It led us down the path of tracking metadata in index cards so that the book titles can be linked together in a sorted order using a search tree data structure. That would be fine if we actually had a need to list the books in sorted order. But sorting just for the sake of searching is not required. Is there a simpler way that is at least as efficient?

Hashing

If we jettison the effort to sort the books (either directly, or indirectly via a search tree), we can indeed come up with a simpler and more efficient solution. It involves a Computer Science technique known as hashing.

What is hashing? You divide up the shelving into hundreds of smaller buckets and label the buckets 0, 1, 2, 3, and so on. You hash each book into one bucket, by computing the bucket number from the title in a repeatable way. For example, use the T9 system on your phone’s keypad to generate a 4-digit number from the last 4 letters of the book’s title, divide the 4-digit number by the number of buckets and take the remainder as the book’s bucket. So, “Infinite Jest” corresponds to the T9 number 5378. If there are 101 buckets, it hashes to bucket 25 (since 5378 = 25 + 101*53).

You organize the books by placing each book in the bucket it hashes to. Given thousands of books roughly equally distributed among the hundreds of buckets (which will happen if your hashing scheme is robust enough), each bucket should contain only tens of books.

Organizing books by hashing (image by author)

How do you find Infinite Jest? You hash the title to its bucket (using exactly the same hashing scheme used to organize the books), go to that bucket, and pick the book out from among the tens of other books there.

Finding Infinite Jest by hashing (image by author)
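
Here is a minimal Python sketch of this scheme, using the T9 keypad mapping and the 101 buckets from the worked example above; the titles loaded into the buckets are hypothetical.

```python
# Standard phone-keypad (T9) mapping of letters to digits.
T9 = {}
for digit, letters in {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
                       "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}.items():
    for letter in letters:
        T9[letter] = digit

NUM_BUCKETS = 101

def bucket_for(title):
    """T9-encode the last 4 letters of the title, then take the remainder."""
    last_letters = [c for c in title.upper() if c.isalpha()][-4:]
    number = int("".join(T9[c] for c in last_letters))
    return number % NUM_BUCKETS

def shelve_book(buckets, title):
    """Place a book in the bucket its title hashes to."""
    buckets.setdefault(bucket_for(title), []).append(title)

def find_book(buckets, title):
    """Go straight to the title's bucket and scan the handful of books there."""
    return title if title in buckets.get(bucket_for(title), []) else None

buckets = {}
for t in ["Infinite Jest", "Moby Dick", "The Trial", "Frog And Toad Are Friends"]:
    shelve_book(buckets, t)

print(bucket_for("Infinite Jest"))         # 25, matching the example above
print(find_book(buckets, "Infinite Jest"))
```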

How fast is finding a book by hashing? Extremely fast. It essentially takes a single step, regardless of the total number of books! All you have to do is compute the hash of the title you’re trying to find, and that narrows your search to a handful of books in a bucket. This is known as a constant time algorithm. If you double the total number of books, you’ll end up roughly doubling the number of books in each bucket. But if there are enough buckets, each should contain only tens of books, so it remains easy to find a particular book in constant time.

Hashing seems like a pretty good solution to finding a book among a large set of books. It’s not perfect though. It has a few relatively minor drawbacks, including the following:

  1. Suppose you grow your book collection and start running out of shelf space. You buy more shelves. But now your hashing scheme has a larger number of buckets and you’ll have to rehash each and every book into its new bucket. What a pain! If changing the shelving is a rare event, perhaps you can live with it. But what if it’s not, or what if you can’t?
  2. What if you wanted your buckets to be of unequal sizes, i.e., have more books hash to larger buckets and fewer books to smaller buckets? For example, each shelf could be a bucket and you could have different bookcases with different sizes of shelves. The hashing algorithm as presented here doesn’t seem to accommodate different-sized buckets.

Both these issues can be solved. But that requires using more sophisticated hash-based algorithms like consistent hashing or rendezvous hashing. I’ll keep this article grounded in the basics and refrain from discussing these fascinating techniques. If you’re interested in learning about them, leave a comment, and I’ll write a follow-up article.

