Re: Problem querying large StrField?

Luis Lebolo Sun, 09 Feb 2014 21:43:34 -0800

Hi Yonik,

Thanks for the response. Our use case is perhaps a little unusual. The
actual domain is in bioinformatics, but I'll try to generalize. We have two
types of entities, call them A's and B's. For a given pair of entities
(a_i, b_j) we may or may not have an associated data value z. Standard
many-to-many stuff in a DB. Users can select an arbitrary set of entities
from A. What we'd then like to ask of Solr is: Which entities of type B
have a data value for any of the A's I've selected?

The way we've approached this to date is to index the set of B, such that
each document has a multivalued field containing the IDs of all entities A
that have a data value for that B. If I select a set of A (a1, a2, a5, a9),
then I would query data availability across B as
dataAvailabilityField:(a1 OR a2 OR a5 OR a9).
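
For illustration, here is a minimal sketch of how such a query string could
be assembled from a selected set of A IDs. The helper name is hypothetical,
and it assumes the IDs are plain tokens that need no query-syntax escaping.

```python
def build_availability_query(field, selected_ids):
    """Build a 'field:(id1 OR id2 ...)' query string (hypothetical helper).

    Assumes every ID is a plain token with no characters that would need
    escaping in the query syntax.
    """
    if not selected_ids:
        raise ValueError("at least one ID is required")
    # Join the IDs with OR inside a single fielded group.
    return f"{field}:({' OR '.join(selected_ids)})"


print(build_availability_query("dataAvailabilityField", ["a1", "a2", "a5", "a9"]))
```

With tens of thousands of selected IDs, the resulting string contains one
OR clause per ID, which is the "giant OR" referred to in this message.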

The sets of A and B are fairly large (~10 - 30k). This was working ok, but
our datasets have increased and now the giant OR is getting too slow.

As an alternative approach, we developed a ValueSourceParser plugin that
took advantage of our ability to sort the list of entity IDs and do some
clever things, like binary searches and short circuits on the results. For
this to work, we concatenated all the IDs into a single comma-delimited
value. So the data availability field is now single-valued, but has a term
that looks like "a1,a3,a6,a7....". Our function query then takes the list
of A IDs that we're interested in and searches the documents for ones that
match any value. It worked great and was quite fast when the ID list was
short enough. But then we tried it on the full data set and the indexed
terms of IDs are HUGE.
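
To make the idea concrete, here is a rough sketch of the sorted-list
matching described above. It is plain Python rather than the actual plugin
code, the function name is hypothetical, and it assumes the IDs in each
document value are already sorted in plain string order.

```python
from bisect import bisect_left


def has_any(doc_value, selected_ids):
    """Return True if any selected ID occurs in a comma-delimited value.

    Assumes the IDs inside doc_value are sorted in plain string order.
    """
    doc_ids = doc_value.split(",")
    lo = 0
    for wanted in sorted(selected_ids):
        # Binary search, resuming from where the previous search ended.
        lo = bisect_left(doc_ids, wanted, lo)
        if lo == len(doc_ids):
            # Short circuit: all remaining wanted IDs sort after every doc ID.
            return False
        if doc_ids[lo] == wanted:
            # Short circuit: one match is enough.
            return True
    return False


print(has_any("a1,a3,a6,a7", ["a2", "a6"]))
print(has_any("a1,a3,a6,a7", ["a2", "a5"]))
```

Because both lists are sorted, each search can start at the position where
the previous one stopped, so the document's ID list is never rescanned from
the beginning.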

I know it's a bit of an odd use case, but have you seen anything like this
before? Do you have any thoughts on how we might better accomplish this
functionality?

Thanks!

On Wed, Feb 5, 2014 at 1:42 PM, Yonik Seeley <yo...@heliosearch.com> wrote:

> On Wed, Feb 5, 2014 at 1:04 PM, Luis Lebolo <luis.leb...@gmail.com> wrote:
> > Update: It seems I get the bad behavior (no documents returned) when the
> > length of a value in the StrField is greater than or equal to 32,767
> > (2^15). Is this some type of bit overflow somewhere?
>
> I believe that's the maximum size of an indexed token.
> Can you share your use-case?  Why are you trying to index such large
> values as a single token?
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr
>
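
As a small illustration of the limit discussed in the quoted reply, a
concatenated value could be checked before indexing. This is only a sketch:
it assumes the maximum term size is 32,766 bytes of UTF-8, which is
consistent with the failures observed at lengths of 32,767 and above, and
the constant and function names are hypothetical.

```python
# Assumed maximum size of a single indexed term, in UTF-8 bytes.
MAX_TERM_BYTES = 32766


def fits_in_one_term(value):
    """Return True if value is small enough to be indexed as a single term."""
    return len(value.encode("utf-8")) <= MAX_TERM_BYTES


ids = ",".join(f"a{i}" for i in range(10000))
print(len(ids.encode("utf-8")), fits_in_one_term(ids))
```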
