On 8/20/2010 8:56 PM, Lance Norskog wrote:
> The first question is about your use cases. How many words are in the
> eventual 3GB spelling index? Do you really need that many?
> Spell-checking is a more controllable UI if you make it from a
> dictionary.

It's built from an index-only field that combines four other fields. The data we are indexing is metadata from photos, text articles, and videos, with most of it being photos. On a single shard, the schema browser shows 23612208 distinct terms in the catchall field, from 7305684 documents. If each distinct term maps one-to-one to a word in the spelling index, there you go.

Perhaps I need to make another catchall field that leaves out the "full" text field. I'll have to experiment, because my index is already bigger than I want it to be. I have no budget for throwing more hardware at the problem. We are in the process of rewriting our application so that we can reduce our index size, but that is still a few months out.

Aside from the index itself, I'm not sure where I'd get an appropriate dictionary for photo metadata that would not require major manual work. Is there any easy way to get the full list of distinct terms and their counts? I'd imagine that if I could filter out those with only a handful of occurrences, the list would be dramatically smaller. Other filters might be useful as well, such as removing terms longer than, say, 15 or 20 characters. Normally I'd go to the facet feature for this sort of information, but I'm not sure my servers could handle that.
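
To make the filtering idea concrete, here is a rough sketch in Python. It assumes the distinct terms and their counts have already been pulled out of the index into a dict by some means, which is the part I don't have an answer for yet. The function name and the default thresholds are placeholders I made up for illustration, not anything Solr provides.

```python
def filter_terms(term_counts, min_count=5, max_length=20):
    """Keep only terms that occur often enough and are not too long.

    term_counts: dict mapping each distinct term to its occurrence count
                 (assumed to have been exported from the index already).
    min_count:   hypothetical threshold; rarer terms are dropped.
    max_length:  hypothetical threshold; longer terms are dropped.
    """
    return {
        term: count
        for term, count in term_counts.items()
        if count >= min_count and len(term) <= max_length
    }


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the input.
    sample = {"sunset": 120, "sunest": 2, "x" * 25: 50}
    print(filter_terms(sample))
```

The surviving terms could then be written out one per line and used as a dictionary file for the spellchecker, which would keep the rare misspellings and overlong tokens out of the spelling index.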

> What you're talking about is effectively promoting the spellcheck
> index to a first-class Solr index, instead of an appendage bolted on
> the side of an existing core. Given sharding and distributed search,
> this may be a better design.

Can you elaborate on what "this" refers to above? Are you saying that you think promoting it to a full Solr index is a good idea? I saw a Jira issue with the idea of building the spellcheck index at the same time as the rest of the index, and storing it in the same directory. This sounds like a very good way to go, especially if the filtering I mentioned above were a part of the configuration.

Thanks,
Shawn
