I don't think that will solve the relevance issues, given that the IDF
(described at http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html)
is per document, not per field. In the end, though, it may be
negligible. Can you test it out fairly quickly?
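
To make the concern concrete, here is a rough sketch of how index-wide document counts feed into IDF. It assumes the classic Lucene DefaultSimilarity formula, idf = 1 + ln(numDocs / (docFreq + 1)); the document counts are invented for illustration and are not measurements from any real index.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """IDF as in Lucene's classic DefaultSimilarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical counts: the word "engine" is rare in news but common in autos ads.
news_only = classic_idf(doc_freq=200, num_docs=100_000)
combined = classic_idf(doc_freq=40_200, num_docs=500_000)

print(f"news-only index: idf = {news_only:.2f}")
print(f"combined index:  idf = {combined:.2f}")
```

As far as I understand it, filtering by vertical at query time does not change these counts, so in a combined index a term's weight in a news search still reflects how often it appears in the other verticals.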
One other thing to think about with multiple indexes is whether
keeping them separate affords you some extra flexibility at the cost
of some more up-front work. For instance, news is probably updated
much more frequently than classifieds, so you may want to tune it
for frequent updates and possibly even give it more hardware, whereas
classifieds may not be as critical (or vice versa; I'm not in the news
biz). Naturally, the tradeoff is that you need to develop tools to manage
these various indexes, whereas the single-index approach is already
pretty well understood.
I would expect that as the multicore patch
(https://issues.apache.org/jira/browse/SOLR-350) evolves, it is going to
bring in more management tools for working with various indexes (perhaps
you can donate your expertise if you go this route?).
-Grant
On Nov 5, 2007, at 10:27 AM, Tim Archambault wrote:
Good points, Grant. I'm envisioning my front end working so that a user
would never be able to search across all the verticals at once.
EVERY query would inject "vertical:jobs" or "vertical:news" or
"vertical:Autos", etc.
This may detrimentally affect my faceted result sets, so I'll have to
think about this more.
Wouldn't this approach overcome my relevancy and scoring issues?
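
One way to do the injection described above is Solr's fq (filter query) parameter rather than appending the clause to q. The sketch below only builds the request URL; the host, path, and the "category" facet field are assumptions for illustration, while "vertical" is the field proposed in this thread.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/select"  # assumed host and path


def build_search_url(user_query: str, vertical: str) -> str:
    """Return a select URL that restricts user_query to one vertical via fq."""
    params = [
        ("q", user_query),
        ("fq", f"vertical:{vertical}"),  # filter only; adds nothing to the score
        ("facet", "true"),
        ("facet.field", "category"),  # hypothetical facet field
    ]
    return SOLR_SELECT + "?" + urlencode(params)


print(build_search_url("city council budget", "news"))
```

Facet counts are computed over the filtered result set, so they should cover only the chosen vertical. The filter narrows which documents match; it does not change the index-wide term statistics used for scoring.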
On 11/5/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
One reason to consider separate indexes is in terms of relevance. Do
you want content from classifieds affecting the rankings of your news
searches? It may not be an issue for you depending on your term
distributions, but it might be something to consider. As you suspect,
though, having multiple indexes will require more management of the
various instances. Perhaps you can logically group things to only
have a couple of indexes? For instance, maybe home, auto, and
classifieds are similar in content and structure, and news and
community-generated content are similar?
-Grant
On Nov 5, 2007, at 9:34 AM, Tim Archambault wrote:
Typical newspaper site with: news, jobs, homes, autos, classifieds,
community-generated content; guesstimate of 0.5 million documents.
Do I really need to create a different Solr index for each vertical?
How inefficient is it to add a few additional fields for each content
type?
Thinking of having a string field named "vertical" that would be used
to segment by the verticals above.
My intuition is that most of the additional fields would be numbers:
integers, prices, decimals.
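
A minimal sketch of what the single-index layout could look like on the indexing side: each document carries the shared "vertical" field plus only the extra fields its content type needs, serialized as a Solr XML add message. All field names other than "vertical" (id, title, salary, price, mileage) and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_add_xml(doc: dict) -> str:
    """Serialize one document as a Solr <add><doc>...</doc></add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical documents: same core fields, different vertical-specific extras.
job = {"id": "job-1", "vertical": "jobs", "title": "Press operator", "salary": 42000}
auto = {"id": "auto-1", "vertical": "autos", "title": "2004 Subaru Outback",
        "price": 8995.00, "mileage": 61000}

for doc in (job, auto):
    print(to_add_xml(doc))
```

Documents simply omit the fields that do not apply to them, so the schema would need those extra fields declared as optional.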
Thanks,
Tim
--
True innovation is not just about changing a product, a service or
even a marketplace; it's also about recognizing and relishing the need
to change yourself.
--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007. Sign up now! http://www.apachecon.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ