What does each document represent? What concept is holding all these entities together?

The standard approach to a true many-to-many relationship in Solr is to denormalize: each document represents one relationship and carries ID fields that link it back to the entities on either side (i.e., to whatever each of your current Solr documents represents).
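As a sketch of what that denormalized index could look like (the field names a_id/b_id and the sample documents are hypothetical, not from the thread), each (A, B) pair that has a data value becomes its own small document, and the giant OR collapses into a simple filter on a_id plus a facet or group on b_id:

```python
# Hypothetical denormalized index: one tiny document per (A, B) pair
# that actually has a data value.
docs = [
    {"id": "rel-1", "a_id": "a1", "b_id": "b7"},
    {"id": "rel-2", "a_id": "a1", "b_id": "b9"},
    {"id": "rel-3", "a_id": "a2", "b_id": "b7"},
    {"id": "rel-4", "a_id": "a5", "b_id": "b3"},
]

selected_a = {"a1", "a5"}

# Equivalent of q=a_id:(a1 OR a5) with faceting (or grouping) on b_id:
# the distinct b_id values in the result set are the B's that have data.
b_with_data = sorted({d["b_id"] for d in docs if d["a_id"] in selected_a})
print(b_with_data)  # ['b3', 'b7', 'b9']
```

The point is that no single document ever holds a huge id list; the join happens at query time over many small documents.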

Multivalued fields, large string fields, and dynamic fields are all powerful tools in Lucene/Solr, but only when used in moderation. The way to scale in Lucene/Solr is more documents and sharding, not massive documents with lots of large multivalued or string fields.

That said, given Lucene/Solr's rich support for large tokenized fields, a tokenized field may be a better choice for representing a large list of entities if denormalization is not practical.
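For example (a sketch only; the field and type names are made up), a schema fragment that tokenizes the comma-delimited id list so each id is indexed as its own term, rather than as one giant token, might look like:

```xml
<!-- Each comma-separated id becomes its own indexed term, so no
     single token ever approaches Lucene's per-term size limit. -->
<fieldType name="idList" class="solr.TextField" positionIncrementGap="0">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern=","/>
  </analyzer>
</fieldType>
<field name="dataAvailability" type="idList" indexed="true" stored="false"/>
```

A query like dataAvailability:(a1 OR a2 OR a5) then still matches per-id, without hitting the single-token size ceiling discussed below.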

-- Jack Krupansky

-----Original Message----- From: Luis Lebolo
Sent: Monday, February 10, 2014 12:42 AM
To: solr-user
Subject: Re: Problem querying large StrField?

Hi Yonik,

Thanks for the response. Our use case is perhaps a little unusual. The
actual domain is in bioinformatics, but I'll try to generalize. We have two
types of entities, call them A's and B's. For a given pair of entities
(a_i, b_j) we may or may not have an associated data value z. Standard many
to many stuff in a DB. Users can select an arbitrary set of entities from
A. What we'd then like to ask of Solr is: which entities of type B have a
data value for any of the A's I've selected?

The way we've approached this to date is to index the set of B, such that
each document has a multivalued field containing the id's of all entities A
that have a data value with that B. If I select a set of A (a1, a2, a5,
a9), then I would query data availability across B as
dataAvailabilityField:(a1 OR a2 OR a5 OR a9).
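Building that filter is just a string join (a sketch; the helper name is ours, the field name is from above). Note that the clause count grows with the number of selected A's, which is why the query degrades at 10k+ ids (and Solr's maxBooleanClauses, 1024 by default, would also need raising for queries this wide):

```python
def availability_query(selected_a, field="dataAvailabilityField"):
    """Build the boolean OR query described above.

    With tens of thousands of selected ids this produces an equally
    large boolean clause list, which is what eventually gets too slow.
    """
    return f"{field}:({' OR '.join(selected_a)})"

print(availability_query(["a1", "a2", "a5", "a9"]))
# dataAvailabilityField:(a1 OR a2 OR a5 OR a9)
```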

The sets of A and B are fairly large (~10-30k). This was working OK, but
our datasets have grown and now the giant OR is getting too slow.

As an alternative approach, we developed a ValueSourceParser plugin that
took advantage of our ability to sort the list of entity id's and do some
clever things, like binary searches and short-circuits on the results. For
this to work, we concatenated all the id's into a single comma-delimited
value. So the data availability field is now single-valued, but holds a
term that looks like "a1,a3,a6,a7....". Our function query then takes the
list of A id's that we're interested in and searches the documents for ones
that match any value. It worked great and was quite fast when the id list
was short enough. But then we tried it on the full data set and the indexed
terms of id's are HUGE.
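The lookup inside that function query presumably amounts to something like the following (a sketch under our own assumptions; the real plugin works against Lucene's field APIs rather than Python strings):

```python
import bisect

def has_any(field_value, wanted):
    """Return True if any wanted id appears in the sorted,
    comma-delimited id list stored for a document.

    Each probe is a binary search (O(log n) per wanted id), and we
    short-circuit on the first hit -- the two tricks described above.
    """
    ids = field_value.split(",")  # already sorted at index time
    for w in wanted:
        i = bisect.bisect_left(ids, w)
        if i < len(ids) and ids[i] == w:
            return True
    return False

print(has_any("a1,a3,a6,a7", ["a2", "a6"]))  # True
print(has_any("a1,a3,a6,a7", ["a2", "a4"]))  # False
```

This is fast per document, but it only works if the whole concatenated value actually makes it into the index, which is where the term-size limit below bites.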

I know it's a bit of an odd use case, but have you seen anything like this
before? Do you have any thoughts on how we might better accomplish this
functionality?

Thanks!


On Wed, Feb 5, 2014 at 1:42 PM, Yonik Seeley <yo...@heliosearch.com> wrote:

On Wed, Feb 5, 2014 at 1:04 PM, Luis Lebolo <luis.leb...@gmail.com> wrote:
> Update: It seems I get the bad behavior (no documents returned) when the
> length of a value in the StrField is greater than or equal to 32,767
> (2^15). Is this some type of bit overflow somewhere?

I believe that's the maximum size of an indexed token.
Can you share your use-case?  Why are you trying to index such large
values as a single token?

-Yonik
http://heliosearch.org - native off-heap filters and fieldcache for solr
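(For reference: the limit Yonik alludes to is, if I'm reading the Lucene source right, IndexWriter.MAX_TERM_LENGTH = 32766 bytes of UTF-8 per indexed term, a documented cap rather than a bit overflow, so a 32,767-character ASCII value is exactly the first length that no longer fits:)

```python
# Lucene caps a single indexed term at 32766 bytes of UTF-8
# (org.apache.lucene.index.IndexWriter.MAX_TERM_LENGTH).
MAX_TERM_LENGTH = 32766

value = "a" * 32767          # the first failing length from the thread
print(len(value.encode("utf-8")) > MAX_TERM_LENGTH)  # True
```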

