Using Solr and Sunspot to Search Within Words
This post is part of our Exploring Solr and Sunspot series.
In my previous post I mentioned that, out of the box, Solr breaks up indexed text on whitespace. So if you have the string “the quick brown fox”, you can search for “quick” or “brown”, but not “quic” or “uic”. Fortunately, you can configure Solr to break up the sentence with a finer-grained strategy.
If you have set up Solr and Sunspot according to my previous article, you can open up RAILS_ROOT/solr/conf/schema.xml. Yeah, there is quite a bit of XML in there, but don’t be disheartened. You’ll find an XML snippet like the following:
You can edit that snippet to use the NGramFilterFactory filter, which will make Solr break up words and index each bit:
Now this will take a word such as “quick” and index smaller pieces of it (or “grams”), such as “quic”, “qui”, “qu”, “uick”, “ick”, and “ck”. Make sure you restart your Solr instance and re-index your models, and you should be good to go.
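To make the idea of “grams” concrete, here is a minimal Ruby sketch of the kind of splitting described above. It is only an illustration and not Solr’s actual implementation: the method name ngrams and the min_size and max_size parameters are hypothetical, and the grams Solr really produces depend on the gram sizes configured on the filter.

```ruby
# Illustration only: lists every substring of a word between min_size and
# max_size characters long. The method name and both parameters are
# hypothetical and exist just for this example.
def ngrams(word, min_size: 2, max_size: 15)
  upper = [max_size, word.length].min
  (min_size..upper).flat_map do |size|
    # Slide a window of `size` characters across the word.
    (0..word.length - size).map { |start| word[start, size] }
  end
end

p ngrams("quick")
# => ["qu", "ui", "ic", "ck", "qui", "uic", "ick", "quic", "uick", "quick"]
```

A five-letter word already yields ten grams under these assumed sizes, which gives a feel for why the index grows as words get longer.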
If you want to break up words from the front of the string only, you can edit that XML snippet to use the EdgeNGramFilterFactory filter instead:
This will only break up a word from the front, turning “quick” into “quic”, “qui”, and “qu”.
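For comparison, a front-only split can be sketched in Ruby as follows. Again, this is just an illustration under assumed settings: edge_ngrams, min_size, and max_size are hypothetical names, and Solr’s real output depends on how the filter is configured.

```ruby
# Illustration only: lists the prefixes of a word between min_size and
# max_size characters long. The method name and both parameters are
# hypothetical and exist just for this example.
def edge_ngrams(word, min_size: 2, max_size: 15)
  upper = [max_size, word.length].min
  # Every gram starts at the first character and grows by one each time.
  (min_size..upper).map { |size| word[0, size] }
end

p edge_ngrams("quick")
# => ["qu", "qui", "quic", "quick"]
```

Because every gram is anchored at the first character, a search for “qui” would match but a search for “uic” would not, and far fewer grams are stored per word.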
Now when I started this post, I was thinking that there would be a noticeable performance hit from breaking up words into smaller “grams”. But when I tried to find the performance hit, I could barely find one! I uploaded a sample codebase, and within it I create 30,000 Company records, each with a Title and a Description. Each strategy fully indexes all records in a little over 2 minutes on my machine. That was recorded very unscientifically using the “time” utility.
However, what I did find was a significant disk space increase. With the Default indexing strategy, the indexes stored in RAILS_ROOT/solr/data weighed in at 6.2MB. With the EdgeNGramFilterFactory strategy they were 29MB, roughly 4.7 times larger. And with the NGramFilterFactory strategy, the indexes came in at 115MB, roughly 18.5 times the Default size, which is not a trivial increase. However, with disk space as cheap as it is, I think it’s perfectly acceptable.
It’s possible that I didn’t run enough records through the indexer, so perhaps the EdgeNGramFilterFactory and NGramFilterFactory strategies are slower than the Default at sufficiently high numbers. If you can find a flaw in my logic and can reproduce a significant performance hit (either CPU or memory), please comment below.