Hi Yonik,

Thanks for the response. Our use case is perhaps a little unusual. The actual domain is bioinformatics, but I'll try to generalize.

We have two types of entities, call them A's and B's. For a given pair of entities (a_i, b_j) we may or may not have an associated data value z. Standard many-to-many stuff in a DB. Users can select an arbitrary set of entities from A. What we'd then like to ask of Solr is: which entities of type B have a data value for any of the A's I've selected?
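To make the question concrete, here is a minimal Python sketch of the semantics we're after, not of how Solr would evaluate it. The `availability` mapping, the entity IDs, and the function name are all made up for illustration.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical toy data: for each B entity, the set of A IDs
# for which a data value z exists.
availability: Dict[str, Set[str]] = {
    "b1": {"a1", "a3"},
    "b2": {"a2", "a9"},
    "b3": {"a4"},
    "b4": set(),
}


def b_with_data_for_any(
    selected_a: Iterable[str],
    availability: Dict[str, Set[str]],
) -> List[str]:
    """Return the B entities that have a data value for at least one selected A."""
    selected = set(selected_a)
    return sorted(
        b for b, a_ids in availability.items()
        if not selected.isdisjoint(a_ids)  # True when the two sets share an ID
    )


matches = b_with_data_for_any(["a1", "a2", "a5", "a9"], availability)
```

With the toy data above, only the B's that share at least one ID with the selection are returned; a B with no data values at all never matches.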
The way we've approached this to date is to index the set of B, such that each document has a multivalued field containing the IDs of all entities A that have a data value. If I select a set of A (a1, a2, a5, a9), then I would query data availability across B as dataAvailabilityField:(a1 OR a2 OR a5 OR a9). The sets of A and B are fairly large (~10k-30k). This was working OK, but our datasets have grown and now the giant OR is getting too slow.

As an alternative approach, we developed a ValueParser plugin that took advantage of our ability to sort the list of entity IDs and do some clever things, like binary searches and short circuits on the results. For this to work, we concatenated all the IDs into a single comma-delimited value. So the data availability field is now single-valued, but has a term that looks like "a1,a3,a6,a7....". Our function query then takes the list of A IDs that we're interested in and searches for documents whose term contains any of them.

This worked great and was quite fast when the ID list was short enough. But then we tried it on the full data set, and the indexed terms of IDs are HUGE.

I know it's a bit of an odd use case, but have you seen anything like this before? Do you have any thoughts on how we might better accomplish this functionality?

Thanks!

On Wed, Feb 5, 2014 at 1:42 PM, Yonik Seeley <yo...@heliosearch.com> wrote:

> On Wed, Feb 5, 2014 at 1:04 PM, Luis Lebolo <luis.leb...@gmail.com> wrote:
> > Update: It seems I get the bad behavior (no documents returned) when the
> > length of a value in the StrField is greater than or equal to 32,767
> > (2^15). Is this some type of bit overflow somewhere?
>
> I believe that's the maximum size of an indexed token.
> Can you share your use-case? Why are you trying to index such large
> values as a single token?
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr