Solr 1.4 Offers Richer Document Indexing and Speed

Any day now, if it hasn't already, the newest version of the Apache Solr project hits the streets. A year in the making since the previous release, Solr 1.4, the Java-based, open source enterprise search server, offers some exciting feature and performance improvements.

Faster, Better, Stronger

It's nearly impossible to talk about Solr without talking about the Lucene search library it relies on. With the release of Lucene 2.9, Solr receives the boost of many under-the-hood enhancements through its partner in crime.

But Solr received plenty of attention for itself as well. Given how critical search is to helping users find exactly what they're looking for within a site, and to keeping them there by responding quickly as well as accurately, it's little surprise that improvements to Solr itself break down into performance and features.

Performance-based improvements to Solr 1.4 include:

  • Streamlined caching through a change to the Java class ConcurrentLRUCache, which minimizes the overhead of synchronization.
  • Scalable concurrent file access through a switch to the Java platform's New I/O (NIO) API to speed index file access.
  • Smarter handling of index changes through re-use of unchanged index segments.
  • Faster faceting through a new implementation, UnInvertedField, for multi-value fields, which in some cases performs 50 times faster.
  • Streaming updates for SolrJ through an optimized implementation of StreamingUpdateSolrServer, which is useful for indexing many documents at one time, in some cases producing dramatic document indexing speed improvements.
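
To give a feel for the idea behind "un-inverting" a multi-value field for faceting, here is a toy, in-memory sketch. It is not Solr's UnInvertedField implementation; the field values, document IDs, and function names are all hypothetical. It only illustrates the general technique of turning a term-to-documents index into a document-to-terms view and then counting terms over the matching documents.

```python
from collections import Counter, defaultdict

def uninvert(inverted_index):
    """Turn a term -> doc IDs mapping into a doc ID -> terms mapping."""
    doc_terms = defaultdict(list)
    for term, doc_ids in inverted_index.items():
        for doc_id in doc_ids:
            doc_terms[doc_id].append(term)
    return doc_terms

def facet_counts(doc_terms, matching_docs):
    """Count how often each term occurs among the documents that matched a search."""
    counts = Counter()
    for doc_id in matching_docs:
        counts.update(doc_terms.get(doc_id, ()))
    return counts

# Hypothetical multi-value "author" field: each term lists the documents containing it.
inverted = {
    "smith": [1, 2],
    "jones": [1],
    "lee": [2, 3],
}

doc_terms = uninvert(inverted)
counts = facet_counts(doc_terms, matching_docs=[1, 2])
print(dict(counts))
```

The un-inverted view is built once and reused, so each search only walks the documents it matched rather than every term in the field.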

New Features

Perhaps the most exciting feature improvement to Solr 1.4 is the ability to index non-XML documents through the addition of Solr Cell, which uses the Apache Tika project to convert various documents to XHTML. Supported formats include PDF, OpenDocument (OpenOffice), Microsoft OLE 2 Compound Document (Microsoft Office), HTML, RTF, gzip, ZIP, and Java Archive (JAR) files. Solr can now also detect duplicate documents by using unique signatures, with a configurable response for how those duplicates are handled.
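
As a rough illustration of how a rich document might be handed to Solr Cell, the sketch below builds (but does not send) the request URL for the extracting request handler. The host, port, `/update/extract` handler path, and document ID are assumptions based on a default example setup; check your own solrconfig.xml for the path the handler is actually registered under.

```python
from urllib.parse import urlencode

# Assumed base URL of a local Solr instance; adjust for your deployment.
SOLR_BASE = "http://localhost:8983/solr"

def build_extract_url(doc_id, commit=True, handler="/update/extract"):
    """Return the URL for posting a rich document (PDF, Word, ...) to Solr Cell."""
    # literal.id supplies the unique key, since the file itself does not carry one.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return SOLR_BASE + handler + "?" + urlencode(params)

url = build_extract_url("doc1")
print(url)
```

The file itself would then be sent as the body of an HTTP POST to that URL, for example as a multipart form upload from a tool such as curl.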

An addition that will thrill Windows administrators, and anyone who doesn't enjoy filing a ticket with IT, is a much smoother index replication process. Rather than requiring administrator access and rsync on a Unix box, Solr 1.4 now offers replication in its Java platform layer, so you can perform backups the same way on any Solr instance on any operating system without having to go to IT.
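
Because the new replication runs inside Solr, it can be driven over plain HTTP. The sketch below only constructs the request URLs and does not contact a server. The base URL and the `/replication` handler path are assumptions that depend on how the handler is registered in solrconfig.xml, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed Solr URL and handler path; both depend on your configuration.
SOLR_BASE = "http://localhost:8983/solr"
REPLICATION_PATH = "/replication"

def replication_url(command, **extra):
    """Build a URL that asks the replication handler to run one command."""
    params = {"command": command}
    params.update(extra)
    return SOLR_BASE + REPLICATION_PATH + "?" + urlencode(params)

backup_url = replication_url("backup")    # snapshot the current index
details_url = replication_url("details")  # report replication status
print(backup_url)
print(details_url)
```

Since these are ordinary HTTP GET requests, the same script works unchanged whether Solr runs on Windows, Linux, or anything else with a Java runtime.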

Another feature was inspired at Lucid Imagination, a company dedicated to commercial-grade support, training, development, and consulting based on Apache Lucene and Solr--as well as a major contributor to the project. After running Solr 1.4 on their own site for months to test it and spot other areas for improvement, they noticed that their combination of Drupal for the CMS portion of the site and WordPress for the blog posed some challenges.

Solr uses a technique called faceting to group search results by field values and show a count for each group. With a site combination such as this, multi-select faceting, where users can select several values of a facet at once and still see counts for the values they have not selected, became an obvious new addition for Solr 1.4. There is a broad range of use cases where this feature adds great power to Solr search. For example, you can:

  • Use multi-select faceting by dates to search on the last modified date presented in year range increments with counts for each year.
  • Facet by query for arbitrary search queries and get specific counts for each.
  • Ensure that search results across multiple sites appear simultaneously, rather than narrowing and excluding through facets, as Lucid needed to do with their CMS and blog.
  • Allow support desk personnel in the middle of triage on a case to look down multiple possible issue paths at the same time. (Is the problem related to the OS? If so, the kernel? Filesystem?)
  • Eliminate results from unstructured data that don't match the parameters you're looking for.
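
The multi-select idea above can be sketched as a set of request parameters. This is a minimal example, assuming a hypothetical `source` field with values such as `cms` and `blog`; the tag name and helper function are likewise made up for illustration. It relies on Solr's local-params syntax for tagging a filter and excluding it when counting a facet.

```python
from urllib.parse import urlencode

def multiselect_facet_params(query, field, selected_values):
    """Filter on `field` while still returning counts for every value of `field`."""
    params = [("q", query), ("facet", "true")]
    tag = field + "_tag"  # hypothetical tag name linking the filter to the facet
    if selected_values:
        clause = " OR ".join('%s:"%s"' % (field, v) for v in selected_values)
        # {!tag=...} labels this filter so it can be referred to elsewhere.
        params.append(("fq", "{!tag=%s}%s" % (tag, clause)))
        # {!ex=...} ignores the labeled filter when counting this facet.
        params.append(("facet.field", "{!ex=%s}%s" % (tag, field)))
    else:
        params.append(("facet.field", field))
    return params

params = multiselect_facet_params("solr", "source", ["cms", "blog"])
query_string = urlencode(params)
print(query_string)
```

With the filter excluded from its own facet, a visitor who ticks both "cms" and "blog" still sees counts for every other source rather than a list narrowed to what is already selected.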

Should You Rush to Upgrade?

Erik Hatcher, a member of the technical team at Lucid Imagination and an active committer for Lucene and Solr, says that if your current search works fine, there's no reason to rush into the upgrade. However, he also says that there are so many improvements to Solr that doing so can be well worth it. Solr 1.4 is backwards compatible with 1.3, so the Solr upgrade should be trivial--though he recommends that you always test just in case.

Those who might want to run rather than walk to Solr 1.4 include:

  • People who had problems deploying replication in Solr 1.3
  • Libraries that were having issues with faceting on author across millions of books; performance for this case is much better in Solr 1.4

On the other hand, if you're updating Lucene as well, check the readme files for discussion of any caveats you might need to consider before upgrading this aspect of the solution.

What's Coming Next?

David M. Fishman, Director of Marketing for Lucid Imagination, points out that CMS and social technologies are lowering the barrier to creating content. "More content means search must be faster and smarter," says Fishman, and many who want to close the gap between content and the users themselves are turning to Solr and Lucene so they don't have to build solutions from scratch.

While it's difficult to think too much about 1.5 while 1.4 isn't quite out the door, Hatcher says that items on the to-do list include further data import handler improvements to support even more diverse content, along with more performance improvements.