Hello! First time poster so {insert ignorance disclaimer here ;)}. I'm building a web application backed by an Oracle database, and we're using Lucene Solr to index various lists of "entities" (via the DataImportHandler, DIH). We then use Solr's faceting to let users filter their search results.
One aspect we're having trouble modeling is the concept of data availability. A dataset holds a data value for various entity pairs. To generalize, say we have two entities: Apples and Oranges. There is then a data value for various Apple and Orange pairs (e.g. apple1 & orange5 have value 6.566). The question we want to answer is "which Apples have data for a specific set of Oranges?" The problem is that the list of Oranges can contain ~2000 entries.

Our first (admittedly ugly) approach was to create a dataAvailability field in each Apple document. It's a multi-valued field that holds the list of Oranges (actually a list of Orange IDs) that have data for that specific Apple. Our facet query then becomes:

  ...facet.query=dataAvailability:(1 OR 2 OR 4 OR 45 OR 200 OR ...)...

With more than 1000 Oranges, the query takes a long time to run the first time a user performs it (afterwards it is cached, so it runs fairly quickly).

Any thoughts on how to speed this up? Is there a better model to use?

One idea was to use the autowarming features. However, the list of Oranges will always be built dynamically by the user, and it's not feasible to autowarm all possible permutations of ~2000 Oranges =).

Hope the generalization isn't too stupid, and thanks in advance!

Cheers,
Luis
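
P.S. To make the query shape concrete, here is a minimal Python sketch of how a facet.query like the one above could be assembled from a user's selection of Orange IDs. It is only an illustration, not our actual code: the function name build_facet_params, the q=*:* and rows=0 settings, and the example IDs are all assumptions made up for this post.

```python
from urllib.parse import urlencode


def build_facet_params(orange_ids, field="dataAvailability"):
    """Build Solr request parameters with one facet.query over a set of Orange IDs.

    Hypothetical helper for illustration; only the field name and the
    facet.query shape come from the description above.
    """
    if not orange_ids:
        raise ValueError("need at least one Orange ID")
    # De-duplicate and sort so the same selection always yields the same string.
    clause = " OR ".join(str(i) for i in sorted(set(orange_ids)))
    return {
        "q": "*:*",      # assumed: match all Apple documents
        "rows": 0,       # assumed: only the facet count is of interest
        "facet": "true",
        "facet.query": f"{field}:({clause})",
    }


params = build_facet_params([45, 2, 1, 200, 4])
print(params["facet.query"])  # dataAvailability:(1 OR 2 OR 4 OR 45 OR 200)
print(urlencode(params))      # URL-encoded form to append to the select URL
```

With ~2000 selected Oranges the clause simply grows to ~2000 OR terms, which is the slow case I'm asking about.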