HCI: Optimizing an index with stopwords

Blog Post created by Ben Isherwood on Jul 12, 2017

One of the simplest ways to further optimize a search engine index is to register stopwords.

 

Stopwords are terms that are typically irrelevant in searches, like "a", "and", and "the". Removing these terms while indexing can significantly reduce index size without adversely impacting user query results.

 

Stopwords can affect the index in three ways: relevance, performance, and resource utilization.

 

  • From a relevance perspective, these high-frequency terms tend to throw off the scoring algorithm, and you won't get the best possible matching results if you leave them in. At the same time, if you remove them, you can return bad results when the stopword is actually important. Choose stopwords wisely!

 

  • From a performance perspective, if you don’t specify stopwords, some queries (especially phrase queries) can be much slower, because more terms must be compared against each indexed document.

 

  • From a resource utilization perspective, if you don’t specify stopwords, the index is much larger than if you remove them. Larger indexes require more memory and disk resources.
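
To get a rough feel for the resource effect, here is a minimal Python sketch that counts how many token occurrences a stopword list would drop from a sample sentence. The tokenizer, the short stopword set, the sample text, and the function name stopword_share are all illustrative assumptions and not part of HCI. Real savings depend on your content and on the analyzer in use.

```python
import re

# Small illustrative stopword set (an assumption for this sketch).
STOPWORDS = {"a", "and", "the", "of", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def stopword_share(text, stopwords=STOPWORDS):
    """Return (total tokens, tokens removed by the stopword filter)."""
    tokens = tokenize(text)
    removed = sum(1 for t in tokens if t in stopwords)
    return len(tokens), removed

sample = "The size of the index is tied to the number of terms in a document."
total, removed = stopword_share(sample)
print(f"{removed} of {total} token occurrences would not be indexed")
```

Running the same count over a representative slice of your own documents gives a better estimate than any toy sentence.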

 

Because they are effectively filtered from the index, stopwords are not considered when matching query terms with index terms. For example, when using stopwords {do, me, a, this}, a query for “do me a favor” would match a document containing the phrase “this favor”, because "favor" is the only term left to match on.

 

This is typically the desired behavior, because the same processing is performed at index time and at query time, which “normalizes” the user input so that it lines up with the indexed terms. The best matches get the highest relevancy score and appear higher in query results.
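
The idea of running one filter at both index time and query time can be sketched in a few lines of Python. This is a simplified illustration, not how HCI or its underlying index is implemented: the functions analyze and matches are hypothetical names, tokens are split on whitespace only, and a match here just means every remaining query term occurs in the document.

```python
STOPWORDS = {"do", "me", "a", "this"}

def analyze(text, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]

def matches(query, document, stopwords=STOPWORDS):
    """True if every remaining query term appears in the analyzed document."""
    query_terms = analyze(query, stopwords)
    doc_terms = set(analyze(document, stopwords))
    return bool(query_terms) and all(t in doc_terms for t in query_terms)

print(analyze("do me a favor"))                # ['favor']
print(matches("do me a favor", "this favor"))  # True
```

Because both sides pass through the same analyze step, the query and the document are reduced to the single term "favor" and therefore match.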

 

However, if literal exact phrases that include these terms are important, fewer stopwords can be better. For example, removing “do” from the stopword list in the example above would cause the phrase query “do me a favor” to NOT match “this favor”, but the query would still match a document containing “do this favor”.
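
The effect of a shorter stopword list on phrase queries can be tried out with the following Python sketch. It assumes a deliberately simple model: a phrase matches when the filtered query tokens appear as a contiguous run in the filtered document tokens, and the gaps left by removed stopwords are ignored. A real engine may track token positions differently, so treat phrase_matches (a hypothetical helper) as an illustration only.

```python
def analyze(text, stopwords):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in stopwords]

def phrase_matches(phrase, document, stopwords):
    """True if the filtered phrase occurs as a contiguous run in the filtered document."""
    q = analyze(phrase, stopwords)
    d = analyze(document, stopwords)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))

with_do = {"do", "me", "a", "this"}
without_do = {"me", "a", "this"}

print(phrase_matches("do me a favor", "this favor", with_do))        # True
print(phrase_matches("do me a favor", "this favor", without_do))     # False
print(phrase_matches("do me a favor", "do this favor", without_do))  # True
```

With “do” kept as a real term, the document must also contain “do” next to “favor” (after filtering) for the phrase to match.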

 

The HCI index stopwords file (see "Index > Advanced > stopwords.txt") is used by the HCI_text and HCI_snippet fields. This file is empty by default for newly created indexes in 1.1.X releases, but will be populated with defaults in future releases.  It is highly recommended that you add relevant stopwords to this file prior to indexing!

 

A conservative example English stopword list that can satisfy the majority of use cases would be the following:

 

a
an
and
are
as
at
be
but
by
for
if
in
into
is
it
no
not
of
on
or
such
that
the
their
then
there
these
they
this
to
was
will
with

 

Example stopwords files for other languages are also available in the product. See the "stopword_<country/language>.txt" files under the "Index > Advanced > lang" folder in the Admin application. The above list comes from the default English stopwords_en.txt file, taken from Lucene's StopAnalyzer.
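
If you want to inspect or compare stopword lists outside the product, a small Python loader is enough. The sketch below assumes the file holds one stopword per line and treats lines starting with "#" as comments. That layout is an assumption made for this example, so check it against the actual files in your installation. The function name load_stopwords and the sample file contents are hypothetical.

```python
import tempfile
from pathlib import Path

def load_stopwords(path):
    """Read one stopword per line, ignoring blank lines and '#' comments."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        term = line.strip().lower()
        if term and not term.startswith("#"):
            words.add(term)
    return words

# Write a tiny sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "stopwords.txt"
    sample.write_text("# sample list\na\nan\n\nand\nThe\n", encoding="utf-8")
    stopwords = load_stopwords(sample)

print(sorted(stopwords))
```

Loading two lists this way makes it easy to diff them with ordinary set operations before deciding what to paste into stopwords.txt.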

 

Happy optimizing!

 

Thanks,

-Ben
