Jon Chinitz

How to use a custom language parser with Content Intelligence

Blog post created by Jon Chinitz on Feb 8, 2017

[Editorial Note: This is a re-posting of an email from one of our internal dlists. Kudos to Team China!]

 

Every language has its own rules: how and where to tokenize words, which stop words apply, how spelling is handled, and so on. Fortunately, Solr (the index and search technology used by Content Intelligence) lets you upload your own language analyzer (parser) to meet these challenges head on. Content Intelligence exposes the internal Solr configuration files, so you can make these changes, integrate the analyzer into the product, and use it when you define your index schema.

 

The local HDS team in China has been experimenting with this since Content Intelligence was released. The local developer and ISV community in China has developed many Chinese analyzers, tokenizers and filters that can be integrated into Solr. The integration is quite simple:

  1. Upload your analyzer JAR to the Solr configuration directories.
  2. Change the Solr configuration files to use the new analyzer.

 

The following is an example of integrating the popular IK-Analyzer into Content Intelligence:

 

  • Create the directory “/opt/hci/data/solr-service/IK” and upload “IK-Analyzer.jar” to it.
  • Choose the index collection that you want to modify. On the Overview/Details page select "Advanced" to expose the list of configuration files. Modify "solrconfig.xml" so that it references the new “IK-Analyzer.jar”:

solrconfig.png
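
For reference, a change of this kind typically comes down to a single <lib> directive in "solrconfig.xml". The following is a minimal sketch, not a transcription of the screenshot: it assumes the JAR sits in the directory created in the first step and that the Solr service sees that directory under the same path. Adjust the path if your deployment maps it differently.

```xml
<!-- Sketch only: load every JAR found in the IK directory created above.
     The directory comes from the first step; that Solr sees it under this
     exact path is an assumption. -->
<lib dir="/opt/hci/data/solr-service/IK" regex=".*\.jar" />
```

Using a directory plus a regex, instead of naming one file, means a renamed or updated JAR in the same directory is picked up without another edit.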

  • Modify "managed-schema.xml" to create the new text field "text_ik" and associate it with the new analyzer:

managed-schema.png
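
As an illustration of what such a schema edit can look like, the sketch below assumes "text_ik" is defined as a Solr field type (in the same way as the stock "text_cjk") and adds one field that uses it. The analyzer class name org.wltea.analyzer.lucene.IKAnalyzer is the one commonly found in IK Analyzer builds, but treat it as an assumption and check it against the JAR you uploaded. The field name content_ik is purely hypothetical.

```xml
<!-- Sketch only. The analyzer class name is an assumption:
     confirm it against the IK JAR you uploaded. -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

<!-- Hypothetical field name; any field or dynamicField can reference the new type. -->
<field name="content_ik" type="text_ik" indexed="true" stored="true"/>
```

Declaring a single <analyzer> element applies the same analysis at index time and at query time, which keeps the tokens produced by both sides consistent.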

  • Reload the index:

reload index.png

Now you can execute a query against the new text field "text_ik" instead of, for example, "text_cjk". (Note that this uses the internal Solr user interface, which is available under System Configuration --> Services --> Advanced Services --> Index Management.)

 

This is the result using the "text_cjk" analyzer:

text_cjk.png

This is the same query using the "text_ik" analyzer:

text_ik.png

Final note: Content Intelligence already supports a wide range of language parsers/analyzers. By default, Content Intelligence indexes expose configuration files for the following languages:

 

Arabic

Brazilian Portuguese

Bulgarian

Catalan

Chinese

Simplified Chinese

CJK

Czech

Danish

Dutch

Finnish

French

Galician

German

Greek

Hebrew, Lao, Myanmar, Khmer

Hindi

Indonesian

Italian

Irish

Japanese

Latvian

Norwegian

Persian

Polish

Portuguese

Romanian

Russian

Scandinavian

Serbian

Spanish

Swedish

Thai

Turkish

 

None of these languages are configured in the Solr index by default – users would need to edit the Solr configuration files directly using our provided UI. I will give an example of how to do this in a separate post.
