In our application, we need to be able to search across multiple local indexes. We need this not so much for performance reasons as for the particular needs of our project. The indexes share the same schema, but they can be very different in size and in the distribution of documents. By that I mean that some indexes may have many more documents about one topic, while others have more documents about other topics. We also want to be able to add documents to the individual indexes. I can provide more detail about our project if necessary. The Distributed Search feature, with shards in different cores, therefore seems to be an obvious solution, except for the limitation of distributed idf.

First, I want to make sure my understanding of the distributed idf limitation is correct: if your documents are spread evenly across your shards, then the distribution of terms across the individual shards can be assumed to be even enough not to matter. If, as in our case, the shards are not very uniform, then this limitation is magnified. Simplistic as that is, do I have the basic idea?
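To make the concern concrete, here is a minimal sketch of how I picture the skew. It assumes the classic Lucene DefaultSimilarity formula, idf = 1 + ln(numDocs / (docFreq + 1)). The class name IdfSkewDemo and all document counts are made up for illustration; they are not taken from our indexes.

```java
// Illustration only: the counts below are hypothetical, not measured.
final class IdfSkewDemo {

    // Classic Lucene DefaultSimilarity idf: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Shard A: a small index where the term is common (made-up counts).
        long docsA = 1_000, dfA = 499;
        // Shard B: a large index where the same term is rare (made-up counts).
        long docsB = 100_000, dfB = 99;

        // What each shard computes on its own, from local statistics only.
        double idfA = idf(dfA, docsA);
        double idfB = idf(dfB, docsB);
        // What a single merged index would compute from combined statistics.
        double idfGlobal = idf(dfA + dfB, docsA + docsB);

        System.out.printf("shard A local idf: %.3f%n", idfA);
        System.out.printf("shard B local idf: %.3f%n", idfB);
        System.out.printf("global idf:        %.3f%n", idfGlobal);
    }
}
```

If I have this right, the same term gets a low weight on the shard where it is common and a high weight on the shard where it is rare, so scores from the two shards are not directly comparable when the results are merged.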

We have hacked together something that allows us to read from multiple indexes, but it isn't really a long-term solution. It's just sort of shoe-horned in there. Here are some notes from the programmer who worked on this:

Two custom files: EgranaryIndexReaderFactory.java and EgranaryIndexReader.java

  EgranaryIndexReader.java
  No real work is done here. This class extends lucene.index.MultiReader and overrides the directory() and getVersion() methods inherited from IndexReader. These methods don't make sense for a MultiReader, as they only return a single value. However, Solr expects Readers to have these methods. directory() was overridden to return a call to directory() on the first reader in the subreader list. The same was done for getVersion(). This hack makes any use of these methods by Solr somewhat pointless.

  EgranaryIndexReaderFactory.java
  Overrides the newReader(Directory indexDir, boolean readOnly) method.
  The expected behavior of this method is to construct a Reader from the index at indexDir. However, this method ignores indexDir and reads a list of indexDirs from the solrconfig.xml file. These indexes are used to create a list of lucene.index.IndexReader instances. This list is then used to create the EgranaryIndexReader.

So the second question is: does anybody have other ideas about how we might solve this problem? Is distributed search still our best bet?

Thanks for your thoughts!
Brent
