Bookworm

Contents:

________________________________________

Search Tips

  • Bookworm does not distinguish between a query "all of these words" and a query "at least one word" - instead, matches containing all of the search terms will be listed first, followed by those containing fewer and fewer search terms, or fragments thereof.
  • There are no escape sequences or similar: Everything you type is searched for in just the way you typed. This makes it easy to search for the unusual character sequences (like delete[]() or \cmd{}) in books on programming languages.
  • Matches are considered better if two adjacent words in the query are also adjacent in the match.
  • Similarly, matches are also considered better if characters at the beginning/end of the query match the beginning/end of the match.
  • In fact, the Bookworm search is not word-oriented at all: If you're too lazy to type operator precedence, just try op prec or similar!
  • Bookworm is semi-case-sensitive - matches with correct case are considered better, but matches with differing case will also be found.
  • If two matches score equal after all the rules above have been considered, the match whose length is closer to that of the query is listed first.
  • Queries which consist of only 1 or 2 characters are not handled too well - if you get too few matches, add another character (not space).
  • The threshold for a match to be listed is set very low. Thus, if the search results seem to bear no resemblance at all to your query, this means there is no real match.

All these properties, which may seem unusual for a search engine, are the result of the "fuzzy search" algorithm used by Bookworm. Unlike with other search engines, the query is not subdivided into words to search for. Instead, it is divided into n-grams, chunks of n characters. For example, when tri-grams (i.e. chunks of length 3) are used, the term "a query" is turned into the tri-grams "a q", " qu", "que", "uer" and "ery". Each line of the text to be searched is then checked for any occurrances of those tri-grams.

If you only counted the number of matching tri-grams in the text to be searched, the results would sometimes be wrong, simply because the match may contain all the tri-grams, but in a different order. In these cases, the quality of the results can be increased by repeating the process with longer n-grams. Bookworm performs checks with 2-, 3-, 4- and 5-grams to measure the similarity between query and match.

The fuzzy search algorithm is based on an article in the 4/1997 issue of c't magazine. Many thanks!

________________________________________

How to Contribute to Bookworm

I need your help!

Currently, Bookworm only offers searches of the index data for two books. I hope to be able to extend this to a collection of all the most popular technical reference books. Unfortunately, preparing the data is quite a lot of work, which is why I need you to help me!

To turn the index of Stroustrup's book into something useful, the following steps were necessary:

  • Scan and OCR the index.
  • Look through the text and correct things. Unfortunately, OCR programs are not usually designed for recognizing this kind of text. :-/
  • Transfer the text into a spreadsheet, with one column for page numbers and two for the text; one plain version of the text, and one HTML version for displaying matches.
  • Create the HTML version of the text, e.g. making keywords appear in a typewriter font.
  • Sort by page number and eliminate duplicate references (e.g. "return by value" and "value, return by"), which are not needed for an automated search. Amazingly, this reduced the Stroustrup index to about a third of its original size!
  • Add the book's Contents section to the data. Stroustrup kindly omitted the lower-level headings like "Exercises", so this meant going through the whole book and typing in the headings manually.

If you would like your book's index to be available via Bookworm, you will have to do at least some of this work for me. As a bare minimum, mail me a text file to richard ätt 2008.atterer.net with tab-separated entries (no Excel please) containing plain text and page numbers in two columns, with all the duplicates still present. The entries and page numbers must be correct. Of course, it would be nicer if you could also provide the HTML version of the index entries, and/or eliminate the duplicates.

By sending me book index data, you agree to place the data in the public domain.

________________________________________

Dedication

Bookworm is dedicated to Bjarne Stroustrup, who has managed to add to his most excellent book "The C++ Programming Language" an index of very poor quality. This search engine would never have been written had the index been better!

Thanks to Matt Kraai for the index data of Kernighan & Ritchie!


To top - Unless noted otherwise, all graphics, programs and text © Copyright 2008 Richard Atterer. All trademarks are the property of their respective owners.