In the enterprise, systems that need fast lookup of many results matching a query (especially at large scale) have typically been built on a database. These database systems model data as rows of tables, where one or more fields combine to form a primary key that uniquely identifies each row.
Consider the following set of keyword-containing documents:
Doc1: The quick brown fox jumped over the lazy dog
Doc2: Quick brown fox leap over lazy dogs in summer
Doc3: The fox quickly jumped over the bridge
A traditional database would typically store the information about each document based on the unique ID of each document:
Doc1 | quick brown fox jump over lazy dog
Doc2 | quick brown fox leap over lazy dog summer
Doc3 | fox quick jump over bridge
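As a rough sketch, this forward layout can be modeled as a simple map from document ID to its terms (the variable names below are illustrative, not from any particular engine). Note that finding every document containing a given term requires scanning every row:

```python
# Forward index: document ID -> its normalized terms, as in the rows above.
forward_index = {
    "Doc1": ["quick", "brown", "fox", "jump", "over", "lazy", "dog"],
    "Doc2": ["quick", "brown", "fox", "leap", "over", "lazy", "dog", "summer"],
    "Doc3": ["fox", "quick", "jump", "over", "bridge"],
}

# Finding every document containing "summer" requires a full scan:
matches = [doc for doc, terms in forward_index.items() if "summer" in terms]
print(matches)  # ['Doc2']
```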
Performance is gained by querying against the primary key directly, or by building efficient "indexes" for traversing these tables by other criteria. The goal of efficient querying is to limit the traversal to only the subset of rows that could actually satisfy the query.
Search engines, on the other hand, are optimized for efficient, ranked, keyword-based queries across billions of documents in milliseconds. Because the queries against these indexes typically cannot be predicted ahead of time, purpose-built indexes cannot be defined in advance. Instead of the traditional forward index shown above, search engines make heavy use of the inverted index:
Term   | Doc1 | Doc2 | Doc3
bridge |      |      |  X
brown  |  X   |  X   |
dog    |  X   |  X   |
fox    |  X   |  X   |  X
jump   |  X   |      |  X
lazy   |  X   |  X   |
leap   |      |  X   |
over   |  X   |  X   |  X
quick  |  X   |  X   |  X
summer |      |  X   |
With the inverted index, a query can be resolved by jumping directly to each searched term (via random access), delivering results quickly. The resulting data structure is very similar to a hash map and provides the same performance benefits.
If the user queries for the keywords "lazy" and "summer", the inverted index yields "Doc1, Doc2" for the first term and "Doc2" for the second. Because Doc2 matches more of the query terms, it receives a higher relevancy ranking, and the results are returned as "Doc2" followed by "Doc1".
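A minimal sketch of this build-and-query flow in Python, using a simple match count as a stand-in for real relevance ranking:

```python
from collections import defaultdict

# Toy corpus from the example above (terms already normalized).
docs = {
    "Doc1": "quick brown fox jump over lazy dog",
    "Doc2": "quick brown fox leap over lazy dog summer",
    "Doc3": "fox quick jump over bridge",
}

# Build the inverted index: term -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    """Rank documents by how many of the query terms they contain."""
    scores = defaultdict(int)
    for term in terms:
        for doc_id in index.get(term, set()):
            scores[doc_id] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))

print(search("lazy", "summer"))  # ['Doc2', 'Doc1']
```

Doc2 matches both terms while Doc1 matches only one, so Doc2 is ranked first.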
Search Index vs. Database
If your use case allows a person to type words into a search box to initiate any sort of processing, you want a search engine rather than a database.
Databases and search indexes have complementary strengths and weaknesses. SQL supports only very simple wildcard-based text search, with some simple normalization such as matching upper case to lower case; the problem is that such queries require full table scans. In a search engine, all searchable words are stored in the inverted index, which can be searched orders of magnitude faster.
Feature                            | Search Index                              | Relational Database
Text Search                        | Fast and sophisticated                    | Minimal and slow
Features                           | Few, targeted to text search              | Many
Deployment Complexity              | Medium                                    | Medium
Administration Tools               | Minimal open source projects              | Many open source & commercial
Monitoring Tools                   | Weak                                      | Very strong
Scaling Tools                      | Automated, medium to large scale          | Large scale
Support Availability               | Weak                                      | Strong
Schema Flexibility                 | Must in general rebuild                   | Changes immediately visible
Indexing Speed                     | Slow                                      | Faster and adjustable
Query Speed                        | Text search is fast & predictable         | Very dependent on design & use case
Row Addition/Extraction Speed      | Slow                                      | Fast
Partial Record Modification        | No                                        | Yes
Time to visibility after addition  | Slow                                      | Immediate
Access to internal data structures | High                                      | None
Technical knowledge required       | Java (minimal), web server deployment, IT | SQL, DB-specific factors, IT
From a database perspective, a search index can be thought of as one database table with very fast lookups and interesting enhancements for text search. This index is relatively expensive in space and creation time. A search engine typically wraps this low-level index with a full-featured front end, providing these additions:
- Schema design and text processing features
- Clean deployment as a web service for indexing and searching
- Convenient scalability across multiple servers
- A learning curve and adoption improvement of roughly two orders of magnitude
Search engines generally perform document processing on ingest, converting full content into keyword streams for the index. These parsers generally provide three forms of index-driven document processing:
- Stopword removal - Stop words (terms that appear so frequently, or carry so little meaning, that they do not improve relevance for document discovery) are removed at both index and query time. This improves the efficiency of the index and the relevancy of query matches.
- Stemming/Lemmatization - Words may be stored in the index in a normalized base form to maximize query discovery. For example, "walk" is the base form of "walking", "walked", and "walker" and can be stored in their place. At query time, the same process is applied (e.g. "walker" is converted to "walk"), so the query matches a larger number of entries in the index. Note that this approach is in direct conflict with "exact phrase" queries, so steps must be taken if both types of query are to be supported.
- Synonym Matching - A query for "used car" should ideally also hit relevant documents mentioning "passenger vehicle" or "automobile." If the indexing system associates these alternate terms as synonyms of "car," those relevant documents are also attached to the index term "car."
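A toy analysis pipeline combining these three steps might look like the following. The stopword list, suffix-stripping rules, and synonym map are illustrative stand-ins; a real engine would use a proper stemmer (e.g. Porter) and curated stopword and synonym files:

```python
STOPWORDS = {"the", "in", "a"}          # illustrative list, not a standard one
SYNONYMS = {"automobile": "car"}        # toy single-word synonym mapping

def stem(term):
    """Crude suffix-stripping stemmer, for illustration only.
    It over-stems some words (e.g. "summer" -> "summ")."""
    for suffix in ("ing", "ed", "er", "ly", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def analyze(text):
    """Stopword removal, stemming, and synonym normalization."""
    terms = []
    for raw in text.lower().split():
        if raw in STOPWORDS:
            continue                    # stopword removal
        term = stem(raw)                # stemming
        term = SYNONYMS.get(term, term) # synonym normalization
        terms.append(term)
    return terms

print(analyze("The quick brown fox jumped over the lazy dog"))
# ['quick', 'brown', 'fox', 'jump', 'over', 'lazy', 'dog']
```

The same `analyze` function is applied to both documents at index time and user input at query time, so both sides of the match are normalized identically.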
Documents are ingested into the index with stopwords removed & synonyms added.
At query time, the same text processing is performed on the input, and rank/relevance is determined accordingly.
Before populating an inverted index, support for specific types of queries is taken into account by defining index fields of particular types. Each field type can be configured to support a different type of query against that field.
Each query type is then resolved against the appropriate field of the inverted index.
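As a sketch, the same source value can be indexed under two hypothetical field types: an analyzed field for keyword queries and a verbatim field for exact matches. The field names "title" and "title_exact" are made up for this example and are not any real engine's schema:

```python
# (field, term) -> set of matching document IDs
index = {}

def index_document(doc_id, title):
    # Analyzed field: lowercased and tokenized, for keyword queries.
    for term in title.lower().split():
        index.setdefault(("title", term), set()).add(doc_id)
    # Verbatim field: stored whole, for exact-match queries.
    index.setdefault(("title_exact", title), set()).add(doc_id)

index_document("Doc3", "The fox quickly jumped over the bridge")

print(index.get(("title", "bridge")))  # keyword query hits: {'Doc3'}
print(index.get(("title_exact", "The fox quickly jumped over the bridge")))
```

A keyword query on "bridge" matches via the analyzed field, while the exact-match field only matches the full original string.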
As the number of documents within a system grows larger and larger, search engine indexes begin to provide benefits over traditional database systems.
In a world of Big Data, the ability to discover a set of documents for finite processing becomes increasingly important. Running a discovery query against a set of inverted indexes can help identify a smaller subset of content for processing, effectively replacing the need to build, optimize, and refine the large, purpose-specific indexes traditionally associated with a database.
Thanks for reading,