On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey <apa...@elyograg.org> wrote:

> On 9/11/2018 7:07 PM, John Smith wrote:
> > header:      223,580
> >
> > child1:      124,978
> > child2:      254,045
> > child3:      127,917
> > child4:    1,009,030
> > child5:      225,311
> > child6:      381,561
> > child7:      438,315
> > child8:       18,850
> >
> >
> > Trying to index that into solr with a flatfile schema, blows up into
> > 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
>
> I think you're not getting what I'm suggesting.  Or maybe there's an
> aspect of your data that I'm not understanding.
>
> If we add up all those numbers for the child docs, there are 2.5 million
> of them.  So you would have 2.5 million docs in Solr.  I have created
> Solr indexes far larger than this, and I do not consider my work to be
> "big data".  Solr can handle 2.5 million docs easily, as long as the
> hardware resources are sufficient.
>
> Where the data duplication will come in is in additional fields in those
> 2.5 million docs.  Each one will contain some (or maybe all) of the data
> that WOULD have been in the parent document.  The amount of data
> balloons, but the number of documents (rows) doesn't.
>
> That kind of arrangement is usually enough to accomplish whatever is
> needed.  I cannot assume that it will work for your use case, but it
> does work for most.
>
> Thanks,
> Shawn
>
>
The problem is that the math isn't a simple case of adding up all the row
counts. These are "left outer join"s. In sql, it would be this query:

select * from header h
left outer join child1 c1 on c1.hid = h.id
left outer join child2 c2 on c2.hid = h.id
...
left outer join child8 c8 on c8.hid = h.id


If there are 10 rows in child1 linked to 1 header with id "abc", and 10
rows in child2 linked to that same header, then we end up with 10 * 10 rows
in solr, not 20. Considering there are 8 child tables in this example,
there is simply an explosion of data.

I can't describe it much better than that (abstractly), though perhaps I
could put together a simple example with live data. Suffice it to say, in
my example row counts above, that is all "live data" in a relatively small
database of ours, the row counts are real, and the final row count of 5.5
billion was calculated inside sql using that query above:

select count(*) from (
    select id from header h
    left outer join child1 c1 on c1.hid = h.id
    left outer join child2 c2 on c2.hid = h.id
    ...
    left outer join child8 c8 on c8.hid = h.id
) tmp;

Reply via email to