Hi,

I have a core with about 20M documents; its size on disk is about
50GB. It is running on a single EC2 instance. Once the core is warmed
up, everything runs fine. The problem is the following:

We assign categories (similar to tags) to documents. Those are stored in
a multivalued string field. After a commit, query times are
unacceptably slow.

Those categories are the only field that is ever changed, so I was
thinking about a way to keep the information outside SOLR. I had some
ideas, but my knowledge of SOLR internals would need some improvement to
implement them. Looking for other solutions, I stumbled upon this
comment in a JIRA issue:

https://issues.apache.org/jira/browse/LUCENE-4258?focusedCommentId=13423159&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13423159

The following words sound quite good to me:

"People could instead solve this by putting their apps primary key into
a docvalues field, allowing them to keep these scoring factors
completely external to lucene (e.g. their own array or whatever),
indexed by their own primary key. But the problem is I think people want
lucene to manage this, they don't want to implement themselves whats
necessary to make it consistent with commits etc."

It sounds like there is an obvious way to keep data outside SOLR but
still make it accessible via DocValues. However, I have no idea what
kind of solution he is talking about.
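
My best guess at the general shape is the toy below, in plain Python
(this is not Lucene or SOLR API; all names and the sample data are made
up for illustration). The index only exposes each document's primary
key, and the categories live in an external map keyed by that primary
key. What I don't see is how to hook something like this into SOLR so
that it stays consistent with commits.

```python
from collections import Counter

# Stand-in for a per-document DocValues column: internal doc id -> primary key.
# Assumption: in a real index this would be read from the segments.
primary_key_by_docid = ["a1", "b2", "c3", "d4"]

# Categories kept entirely outside the index, keyed by primary key.
external_categories = {
    "a1": {"news", "sports"},
    "b2": {"news"},
    "c3": set(),
    "d4": {"sports"},
}


def filter_by_category(matching_docids, category):
    """Keep the matching docs whose external category set contains `category`."""
    return [
        docid
        for docid in matching_docids
        if category in external_categories.get(primary_key_by_docid[docid], ())
    ]


def facet_counts(matching_docids):
    """Count how often each category occurs among the matching docs."""
    counts = Counter()
    for docid in matching_docids:
        counts.update(external_categories.get(primary_key_by_docid[docid], ()))
    return counts
```

In this toy, changing a category touches only the external map, e.g.
external_categories["c3"].add("news"), and the next call to
filter_by_category sees it without any reindexing.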

Could somebody give me a starting point? I would need to filter on that
field and facet over it.

cheers,
Achim
