Using Solr and Sunspot to Search Within Words

This post is part of our Exploring Solr and Sunspot series.

In my previous post I mentioned that, out of the box, Solr breaks up the search indexes on whitespace. So if you have the string “the quick brown fox”, you can search for “quick” or “brown”, but not “quic” or “uic”. Fortunately, you can configure Solr to break up the sentence with a finer-grained strategy.

If you have set up Solr and Sunspot according to my previous article, you can open up RAILS_ROOT/solr/conf/schema.xml. Yes, there is quite a bit of XML in there, but don’t be disheartened. You’ll find an XML snippet like the following:
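As a sketch, the "text" field type in a Sunspot-generated schema.xml typically looks something like this. The exact filters and attribute values are an assumption and may differ between Sunspot versions, so match it against your own file:

```xml
<!-- Assumed default: tokenizes on word boundaries and lowercases, nothing more -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```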

You can edit that snippet so that Solr breaks up words and indexes each bit:
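A minimal sketch of that edit, assuming the "text" field type from a stock Sunspot schema.xml. The only change is the added NGramFilterFactory line. The gram sizes are illustrative assumptions, not values confirmed by the post: a minimum of 2 yields two-letter grams, and the maximum caps how long a gram can get.

```xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Added: index every substring between minGramSize and maxGramSize characters.
         The sizes 2 and 15 are assumed values; tune them for your data. -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
</fieldType>
```

A larger minGramSize or a smaller maxGramSize produces fewer grams per word, which should keep the index smaller at the cost of not matching very short or very long fragments.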

Now this will take a word such as “quick” and index smaller words (or grams) “quic”, “qui”, “qu”, “uick”, “ick”, “ck”, etc. Make sure you restart your Solr instance and re-index your models and you should be good to go.

If you want to break up words from the front of the string only, you can edit that XML snippet:
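A minimal sketch of the front-only variant, again assuming the "text" field type from a stock Sunspot schema.xml. It uses EdgeNGramFilterFactory in place of NGramFilterFactory. The gram sizes are assumed values, and the side attribute applies to the older Solr releases that Sunspot bundled at the time; newer Solr versions may not accept it, so check the documentation for your version.

```xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Added: index only the prefixes of each word.
         The sizes 2 and 15 are assumed values; side="front" anchors grams at the start. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
</fieldType>
```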

This will only break up a word from the front, so “quick” becomes “quic”, “qui”, “qu”.

Performance Concerns

When I started this post, I expected a noticeable performance hit from breaking up words into smaller “grams”. But when I tried to find the performance hit, I could barely find one! I uploaded a sample codebase, and within it I created 30,000 Company records, each with a Title and a Description. Each strategy fully indexes all records in about 2 minutes and some change on my machine. That was recorded very unscientifically using the “time” utility.

However, what I did find was a significant increase in disk space. With the default indexing strategy, the indexes stored in RAILS_ROOT/solr/data weighed in at 6.2MB. With the EdgeNGramFilterFactory strategy they were 29MB. And with the NGramFilterFactory strategy, the indexes came in at 115MB, about 18 times the default, which is not a trivial increase. However, with disk space as cheap as it is, I think it’s perfectly acceptable.

It’s possible that I didn’t run enough records through the indexer, so perhaps the EdgeNGramFilterFactory and NGramFilterFactory strategies are slower than the default at sufficiently high numbers. If you can find a flaw in my logic and can reproduce a significant performance hit (either CPU or memory), please comment below.

zach@collectiveidea.com

Comments

  1. March 25, 2011 at 21:42 PM

    Thanks for sharing. I’ve always wondered how to do this, but never took time to investigate.

  2. March 27, 2011 at 15:41 PM

    Great series, Zach… this is going to come in handy for me in the next few weeks.

    I will point out that the ~20x increase in storage requirements is way beyond non-trivial though.  Disk space is cheap, but not that cheap!  I think playing with the parameters of NGramFilterFactory might yield better results—if I have a chance to experiment in the near future I’ll post some results.

  3. March 27, 2011 at 18:41 PM

    @Brandon: glad to help.

    @Keith thanks.  Check out my codebase on github and the seed.rb. 30k records with Lorem Ipsums. IMO that’s not bad for 115MB. Obviously your mileage will vary, but we’re not talking about GBs yet.

    My reservation isn’t on raw space but IO throughput, which is always tricky to deal with. Luckily search indexes usually have more flexibility when it comes to eventual consistency.

  4. mcha226@gmail.com
    mark
    December 29, 2011 at 0:37 AM

    May I ask if I can set EdgeNGramFilterFactory on a particular column or model only instead of the whole Sunspot? Cheers

  5. francordie@gmail.com
    Franco
    February 12, 2012 at 11:16 AM

    Thanks for sharing! That was exactly what I was looking for … by the way, I have the same question as @mark. I have a db with more than 100k records and a lot of columns, and that setup for just a few columns would be very useful for me. It takes ~40min indexing with normal parameters … now I’ll try with your strategy.

    THANKS!

  6. February 12, 2012 at 13:44 PM

    mark and franco - I’m sorry, but I actually don’t know how to do that. But it could still be possible. If either of you guys find out how to accomplish that, would you mind sharing the voodoo here?

  7. March 28, 2012 at 17:55 PM

    This just came in handy!  Thanks so much for being awesome!!!

  8. anitsirc1@gmail.com
    Cristina
    September 24, 2012 at 5:52 AM

    Hey Zach, first of all great post, this actually comes in really handy! But… is this still working with the newer version of sunspot_rails (1.3.3)? I’ve checked your demo and cannot make it work; I don’t know if I’m missing something or it’s just not working.

  9. anitsirc1@gmail.com
    Cristina
    September 24, 2012 at 7:51 AM

    Never mind, for some reason the Solr server went crazy and didn’t want to read the changes… restarted the computer and it works OK. Great post!

  10. 127ajr@gmail.com
    Albert
    July 05, 2013 at 14:26 PM

    Friggin awesome. Thanks a million!

  11. mathieubourgeois25@gmail.com
    Mathieu
    March 10, 2015 at 8:46 AM

    Hello,

    I’ve just tried your solution for searching within words, but it doesn’t work. I’ve done all the operations that you explain in your 2 tutorials; the first one works, but this one does not.
    I get no error, and I’ve restarted and reindexed, but nothing comes back when I search with part of a word.

    I use Rails 4.1.0, and for the rest, I did exactly what you explained.

  12. April 30, 2015 at 12:42 PM

    Hi all, thanks for the help.
    I used your code snippet, but the ngram mode is not activated. I have this code:

         




         




    and:

    searchable do
    text :title
    end

    Is there any error?
    Thanks.