[Editorial Note: This is a re-posting of an email from one of our internal dlists. Kudos to Team China!]
Every language has its own rules: how and where to tokenize words, which stop words are appropriate, how spelling works, and so on. Fortunately, Solr (the index and search technology used by Content Intelligence) lets you upload your own language analyzer (parser) to meet these challenges head on. Content Intelligence exposes the internal Solr configuration files, so you can make these changes, integrate them into the product, and use them when you define your index schema.
The local HDS team in China has been experimenting with this since Content Intelligence was released. The local developer and ISV community in China has developed many Chinese analyzers, tokenizers, and filters that can be integrated into Solr. The integration is quite simple:
- Upload your analyzer JAR to the Solr configuration directories.
- Change the Solr configuration files to use the new analyzer.
The following is an example of integrating the popular IK-Analyzer into Content Intelligence:
- Create the directory “/opt/hci/data/solr-service/IK” and upload “IK-Analyzer.jar” to it.
- Choose the index collection that you want to modify. On the Overview/Details page, select "Advanced" to expose the list of configuration files. Modify "solrconfig.xml" so that it loads the new “IK-Analyzer.jar”.
- Modify "managed-schema.xml" to create the new text field "text_ik" and associate it with the new analyzer.
- Reload the index so that the changes take effect.
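As a rough sketch, the two configuration edits in the steps above might look like the following. The "lib" directive and "solr.TextField" are standard Solr. The rest are assumptions to check against your own setup: the analyzer class name "org.wltea.analyzer.lucene.IKAnalyzer" is taken from common IK-Analyzer builds and may differ in your JAR, the "dir" path assumes the Solr process sees the directory at the same path that was created above, "text_ik" is assumed to be defined as a field type (as Solr's stock "text_cjk" is), and the field name "content_ik" is purely hypothetical.

```xml
<!-- In solrconfig.xml: load every JAR found in the IK directory.
     Assumption: Solr sees the directory at this same path. -->
<lib dir="/opt/hci/data/solr-service/IK" regex=".*\.jar" />

<!-- In managed-schema.xml: define a field type backed by the IK analyzer.
     Assumption: the class name matches the one shipped in your JAR. -->
<fieldType name="text_ik" class="solr.TextField">
  <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>

<!-- Hypothetical field that uses the new type. -->
<field name="content_ik" type="text_ik" indexed="true" stored="true"/>
```

If the analyzer class cannot be found when the index reloads, the JAR path or the class name is the first thing to verify.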
Now you can run a query with the new text field "text_ik" instead of, e.g., "text_cjk". (Note: this uses the internal Solr user interface, which is available under System Configuration --> Services --> Advanced Services --> Index Management.)
This is the result using the "text_cjk" analyzer:
This is the same query using the "text_ik" analyzer:
Final note: Content Intelligence already supports a wide range of language parsers/analyzers. By default, Content Intelligence indexes expose configuration files that support the following languages:
Hebrew, Lao, Myanmar, Khmer
None of these languages are configured in the Solr index by default – users would need to edit the Solr configuration files directly using our provided UI. I will give an example of how to do this in a separate post.