Have you tried modeling it with multivalued fields? Also, why do you think Solr is a good solution? What is the problem?
wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Sep 11, 2018, at 7:35 PM, John Smith <localde...@gmail.com> wrote: > > On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey <apa...@elyograg.org> wrote: > >> On 9/11/2018 7:07 PM, John Smith wrote: >>> header: 223,580 >>> >>> child1: 124,978 >>> child2: 254,045 >>> child3: 127,917 >>> child4: 1,009,030 >>> child5: 225,311 >>> child6: 381,561 >>> child7: 438,315 >>> child8: 18,850 >>> >>> >>> Trying to index that into solr with a flatfile schema, blows up into >>> 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a >> >> I think you're not getting what I'm suggesting. Or maybe there's an >> aspect of your data that I'm not understanding. >> >> If we add up all those numbers for the child docs, there are 2.5 million >> of them. So you would have 2.5 million docs in Solr. I have created >> Solr indexes far larger than this, and I do not consider my work to be >> "big data". Solr can handle 2.5 million docs easily, as long as the >> hardware resources are sufficient. >> >> Where the data duplication will come in is in additional fields in those >> 2.5 million docs. Each one will contain some (or maybe all) of the data >> that WOULD have been in the parent document. The amount of data >> balloons, but the number of documents (rows) doesn't. >> >> That kind of arrangement is usually enough to accomplish whatever is >> needed. I cannot assume that it will work for your use case, but it >> does work for most. >> >> Thanks, >> Shawn >> >> > The problem is that the math isn't a simple case of adding up all the row > counts. These are "left outer join"s. In sql, it would be this query: > > select * from header h > left outer join child1 c1 on c1.hid = h.id > left outer join child2 c2 on c2.hid = h.id > ... > left outer join child8 c8 on c8.hid = h.id > > > If there are 10 rows in child1 linked to 1 header with id "abc", and 10 > rows in child2 linked to that same header, then we end up with 10 * 10 rows > in solr, not 20. Considering there are 8 child tables in this example, > there is simply an explosion of data. > > I can't describe it much better than that (abstractly), though perhaps I > could put together a simple example with live data. Suffice it to say, in > my example row counts above, that is all "live data" in a relatively small > database of ours, the row counts are real, and the final row count of 5.5 > billion was calculated inside sql using that query above: > > select count(*) from ( > select id from header h > left outer join child1 c1 on c1.hid = h.id > left outer join child2 c2 on c2.hid = h.id > ... > left outer join child8 c8 on c8.hid = h.id > ) tmp;