Ben Isherwood

Search: The Inverted Index

Blog Post created by Ben Isherwood Employee on Apr 6, 2017

Traditional Databases

In the enterprise, systems requiring a fast lookup of many results matching a query (especially at large scale) have typically required the use of a database. These database systems provide for data modeling around rows of table data in which there is one or more primary key combination of fields which uniquely identifies each row.

 

Consider the following sets of keyword containing Documents:

Doc1: The quick brown fox jumped over the lazy dog 
Doc2: Quick brown fox leap over lazy dogs in summer
Doc3: The fox quickly jumped over the bridge


A traditional database would typically store the information about each document based on the unique ID of each document:

id       keywords 
----------------------------------------------------------
Doc1   | quick brown fox jump over lazy dog
Doc2   | quick brown fox leap over lazy dog summer
Doc3   | fox quick jump over bridge
----------------------------------------------------------

 

Performance is gained by querying against the primary key directly, or by building efficient "indexes" for traversing these database tables by other criteria. The goal for efficient querying is to limit the database traversal to only the possible subset of rows in the table that could actually satisfy your query.

 

Search engines, on the other hand, are optimized for efficient ranked, keyword based query across billions of documents in milliseconds. Because queries against these indexes cannot be typically predicated ahead of time, an index cannot typically be determined in advance. These search engines make heavy use of the inverted index.

 

Inverted Index

Instead of traditional forward indexes, search engines use an inverted index:

Term      Doc1     Doc2    Doc3 
---------------------------------
quick   |       |   X   |   X   |
brown   |   X   |   X   |       |
dog     |   X   |   X   |       |
fox     |   X   |   X   |   X   |
jump    |   X   |       |   X   |
lazy    |   X   |   X   |       |
leap    |       |   X   |       |
over    |   X   |   X   |   X   |
quick   |   X   |       |       |
summer  |       |   X   |       |
bridge  |       |       |   X   |
---------------------------------

With the inverted index, queries can be resolved by jumping to the searched word (via random access) to deliver query results quickly. The resulting data structure is very similar to a hash map, providing the same performance benefits.


If the user queries for keywords "lazy" and "summer", the inverted index provides the results of "Doc1,Doc2" for the first term and "Doc2" for the second. Because the second Document appears in the results more frequently, it gains a higher relevancy ranking. The query results will then be returned as "Doc2" and then "Doc1".

 

Search Index vs. Database

If your use case allows a person to type words into a search box to initiate any sort of processing, you want a search engine vs. a database.

 

Databases and Search Indexes have complementary strengths and weaknesses. SQL supports very simple wildcard-based text search with some simple normalization like matching upper case to lower case. The problem is that these are full table scans. In search engines, all searchable words are stored in the inverted index, which searches orders of magnitude faster.


From a database perspective, a search index can be thought of as one DB table with very fast lookups and interesting enhancements for text search. This index is relatively expensive in space and creation time. The engine typically wraps this API with a full-featured front end, providing these additions:

  • Schema design and text processing features
  • Clean deployment as a web service for indexing and searching
  • Convenient scalability across multiple servers
  • Learning curve & adoption improvement of ~2 orders of magnitude

 

 

Index Processing

Search engines generally provide a form of document processing on ingest into the search engine index for full content keyword streams. These parsers generally provide 3 forms of index driven document processing:

  • Stopword removal - Stop words (terms that do not improve relevance for document discovery, or that appear frequently) are removed from the data set at both index and query time in order to maximize query matches and improve efficiency and relevancy of the index data
  • Stemming/Lemmatization - Words may be stored in the search engine index in their normalized form to maximize query discovery. For example, "walk" is the base form for word "walking", "walked", and "walker" and can be stored instead of those values. On query time, the same process is executed (e.g. "walker" is converted to "walk") and will match a larger number of similarities in the index. Note that this approach is in direct conflict with "exact phrase" query, so steps need to be taken to support one or both types of query.
  • Synonym Matching - When querying for "used car" you may want the text of a relevant document to hit documents mentioning “passenger vehicle” or “automobile.” If the indexing system associated these alternate terms as synonyms of “car,” the relevant documents would also be attached to the index term “car.”


Documents are ingested into the the index with stopwords removed & synonyms added:

 

1.jpg


At query time, the same text processing is performed on the input rank/relevance is determined accordingly:

 

2.jpg

 

Index Querying

Before populating an inverted index, support for specific types of queries is taken into account by defining index fields of certain types. Each field type can be configured to support a different type of query against that information field.


This process looks like the following:

 

3.gif

 

Here is a demonstration of results for specific query types against the inverted index:

 

Query TypeDescriptionQuery and Results
SimpleKeyword query

t1.jpg

PrefixMatch keywords by prefix

t2.jpg

BooleanMatch terms whether they must (AND), should not (NOT), or may exist (OR)

t3.jpg

PhraseIdentifies a set of tokens at these positions in the index

t4.jpg

 

Conclusion

As the number of documents within a system grows larger and larger, search engine indexes begin to provide benefits over traditional database systems.

 

In a world of Big Data, the ability to discover a set of documents for finite processing becomes increasingly important. Running a discovery query against a set of inverted indexes can help to identify a smaller subset of content for processing, effectively replacing the need to build, optimize, and refine the large, purpose specific indexes traditionally associated with a database.

 

Thanks for reading,

-Ben

Outcomes