On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey <apa...@elyograg.org> wrote:
> On 9/11/2018 7:07 PM, John Smith wrote: > > header: 223,580 > > > > child1: 124,978 > > child2: 254,045 > > child3: 127,917 > > child4: 1,009,030 > > child5: 225,311 > > child6: 381,561 > > child7: 438,315 > > child8: 18,850 > > > > > > Trying to index that into solr with a flatfile schema, blows up into > > 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a > > I think you're not getting what I'm suggesting. Or maybe there's an > aspect of your data that I'm not understanding. > > If we add up all those numbers for the child docs, there are 2.5 million > of them. So you would have 2.5 million docs in Solr. I have created > Solr indexes far larger than this, and I do not consider my work to be > "big data". Solr can handle 2.5 million docs easily, as long as the > hardware resources are sufficient. > > Where the data duplication will come in is in additional fields in those > 2.5 million docs. Each one will contain some (or maybe all) of the data > that WOULD have been in the parent document. The amount of data > balloons, but the number of documents (rows) doesn't. > > That kind of arrangement is usually enough to accomplish whatever is > needed. I cannot assume that it will work for your use case, but it > does work for most. > > Thanks, > Shawn > > The problem is that the math isn't a simple case of adding up all the row counts. These are "left outer join"s. In sql, it would be this query: select * from header h left outer join child1 c1 on c1.hid = h.id left outer join child2 c2 on c2.hid = h.id ... left outer join child8 c8 on c8.hid = h.id If there are 10 rows in child1 linked to 1 header with id "abc", and 10 rows in child2 linked to that same header, then we end up with 10 * 10 rows in solr, not 20. Considering there are 8 child tables in this example, there is simply an explosion of data. I can't describe it much better than that (abstractly), though perhaps I could put together a simple example with live data. Suffice it to say, in my example row counts above, that is all "live data" in a relatively small database of ours, the row counts are real, and the final row count of 5.5 billion was calculated inside sql using that query above: select count(*) from ( select id from header h left outer join child1 c1 on c1.hid = h.id left outer join child2 c2 on c2.hid = h.id ... left outer join child8 c8 on c8.hid = h.id ) tmp;