Have you tried modeling it with multivalued fields?

Also, why do you think Solr is a good solution? What is the problem?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Sep 11, 2018, at 7:35 PM, John Smith <localde...@gmail.com> wrote:
> 
> On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 9/11/2018 7:07 PM, John Smith wrote:
>>> header:      223,580
>>> 
>>> child1:      124,978
>>> child2:      254,045
>>> child3:      127,917
>>> child4:    1,009,030
>>> child5:      225,311
>>> child6:      381,561
>>> child7:      438,315
>>> child8:       18,850
>>> 
>>> 
>>> Trying to index that into solr with a flatfile schema, blows up into
>>> 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
>> 
>> I think you're not getting what I'm suggesting.  Or maybe there's an
>> aspect of your data that I'm not understanding.
>> 
>> If we add up all those numbers for the child docs, there are 2.5 million
>> of them.  So you would have 2.5 million docs in Solr.  I have created
>> Solr indexes far larger than this, and I do not consider my work to be
>> "big data".  Solr can handle 2.5 million docs easily, as long as the
>> hardware resources are sufficient.
>> 
>> Where the data duplication will come in is in additional fields in those
>> 2.5 million docs.  Each one will contain some (or maybe all) of the data
>> that WOULD have been in the parent document.  The amount of data
>> balloons, but the number of documents (rows) doesn't.
>> 
>> That kind of arrangement is usually enough to accomplish whatever is
>> needed.  I cannot assume that it will work for your use case, but it
>> does work for most.
>> 
>> Thanks,
>> Shawn
>> 
>> 
> The problem is that the math isn't a simple case of adding up all the row
> counts. These are "left outer join"s. In sql, it would be this query:
> 
> select * from header h
> left outer join child1 c1 on c1.hid = h.id
> left outer join child2 c2 on c2.hid = h.id
> ...
> left outer join child8 c8 on c8.hid = h.id
> 
> 
> If there are 10 rows in child1 linked to 1 header with id "abc", and 10
> rows in child2 linked to that same header, then we end up with 10 * 10 rows
> in solr, not 20. Considering there are 8 child tables in this example,
> there is simply an explosion of data.
> 
> I can't describe it much better than that (abstractly), though perhaps I
> could put together a simple example with live data. Suffice it to say, in
> my example row counts above, that is all "live data" in a relatively small
> database of ours, the row counts are real, and the final row count of 5.5
> billion was calculated inside sql using that query above:
> 
> select count(*) from (
>    select id from header h
>    left outer join child1 c1 on c1.hid = h.id
>    left outer join child2 c2 on c2.hid = h.id
>    ...
>    left outer join child8 c8 on c8.hid = h.id
> ) tmp;

Reply via email to