By Jason Orendorff | August 16, 2012
If you ever want to screw over a library, just walk up to any shelf, pick up any book, and put it on another shelf where it doesn’t belong.
Eventually a librarian will stumble across it, see that it’s out of place, and pull it off the shelf. Until then, that book is hopelessly lost. It might as well be on the surface of the moon. Finding a needle in a haystack is easy. When it comes down to it, needles are not hay. Finding one book lost among a million other books? That’s hard.
The only reason such a thing as a library is possible is that it is a gigantic, life-sized, walk-in data structure, tuned for fast lookup.
This post is about searching and sorting, two fundamental aspects of data processing, and what the library has to teach us about them.
Suppose you have a sorted array,
books = [ "001.4 Graphs", "001.56 Symbols", ...millions of strings... "998.9 Antarctica" ]
and you want to find the position of a given value in it.
How do you do that? Well, there’s a method for that:
But it checks every item in the array, one by one. That could take a while. Is there a better way?
Binary search is the right answer.
And this algorithm is proven optimal.
But if the sorted array looks like this
and the item you’re looking for is something like this
then it doesn’t feel optimal.
The orange path is the path you would take doing binary search. The purple path is what you want to do.
What’s going on here? Why is this optimal algorithm so bad in real life?
Is it because humans are dumb and slow and make mistakes, whereas computers are smart and fast and dependable, so what works for a computer won’t necessarily work for a human?
I don’t think so, because if I were programming a robot to go get books off the shelves, I wouldn’t have the robot do a binary search either. A robot doing binary search to find a book would look just as dumb as I would. It really is the wrong algorithm somehow.
Is it because we’re not actually searching a sorted array using comparisons? Well, no, that’s exactly what we’re doing.
So, I thought about this, and the best answer I could come up with is that the cost model is wrong.
Binary search does an optimal number of comparisons. It’s only the best algorithm if comparisons are the most significant cost.
In a library, the books are physically spread out. Most of the time it takes to find a book is not comparisons. It’s walking. And this is exactly why binary search feels so dumb: the algorithm makes no effort to minimize your walking.
There is another factor, which is that librarians do something very clever. They cheat. Or rather, they change the problem.
They put up these signs.
It’s just a little extra information. You can look at these signs and before you even start walking, you know which aisle to go to. So before you even begin, you’re already doing better than binary search.
Well, real-life programs do this exact thing in software, and for exactly the same reason.
Here’s a picture of a B+ tree. Filesystems and databases basically all use this data structure or something very similar. You can see there’s a hierarchy. The actual sorted data is across the bottom; all this other stuff is just a set of signs telling what’s stored where. And all of that is a fraction of the size of the actual data. When you create an index in a database, that tree is what gets created on disk.
To find something, you start at the top, follow the signs, and in a very small number of hops, you can get to the data you’re looking for.
Now, each hop involves looking a chunk of data and determining which way to go next. It’s some code. And if you just ignored the B+ tree and did binary search on the data itself, that would be simpler. It’d be less work, fewer CPU instructions. But it touches more pages of memory. Or disk. More walking around.
It’s easy to forget, but reading data from memory is one of the slowest things a CPU does. A modern CPU can execute hundreds of instructions in the time it takes to load anything from main memory. Going to disk is thousands of times worse.
If there’s a theme here, it is that the library is doing things that are recognizable as algorithms from computer science. But not the obvious textbook algorithms! I guess when an algorithm is going to take up a human being’s time, it tends to get optimized. These algorithms seem to be tuned very well indeed. But perhaps if you looked inside any organization you’d find something similar.
I wanted to share this because thinking about these things gave me just a slightly different perspective on programming. I just gave the library two hours a week this summer, as a volunteer, putting books on the shelves, and it turned out to be unexpectedly rewarding.
And humbling. When I started at the library, there was this moment, embarrassing now in hindsight, when I thought, hey, I know information systems. I could teach the library a thing or two! But of course, predictably, it turned out the library had something to teach me.
I don’t feel too bad about it though. Librarians have been in the IT game for a long, long time.
One more little thing. The Nashville Public Library has a great volunteer program. And they have interesting projects for web developers. They could use your talents.
Or you could just put books on the shelves. Like me. It’s about as exciting as it sounds, but I promise you this: you will read a book that you wouldn’t read otherwise.