Hi,

I have a core with about 20M documents; its size on disk is about 50GB. It runs on a single EC2 instance. Once the core is warmed up, everything runs fine. The problem is the following:
We assign categories (similar to tags) to documents. These are stored in a multivalued string field. After a commit, query times are unacceptably slow. The categories are the only field that is ever changed, so I was thinking about a way to keep that information outside Solr. I had some ideas, but my knowledge of Solr internals would need some improvement before I could implement them.

While looking for other solutions, I stumbled upon this comment in a JIRA issue:

https://issues.apache.org/jira/browse/LUCENE-4258?focusedCommentId=13423159&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13423159

The following words sound quite good to me:

"People could instead solve this by putting their apps primary key into a docvalues field, allowing them to keep these scoring factors completely external to lucene (e.g. their own array or whatever), indexed by their own primary key. But the problem is I think people want lucene to manage this, they don't want to implement themselves whats necessary to make it consistent with commits etc."

It sounds like there is an obvious way to keep data outside Solr and still make it accessible via DocValues, but I have no idea what kind of solution he is talking about. Could somebody give me a starting point? I would need to filter on that field and facet over it.

cheers,
Achim