> > On 9/7/2018 7:44 PM, John Smith wrote:
> > Thanks Shawn, for your comments. The reason why I don't want to go flat
> > file structure, is due to all the wasted/duplicated data. If a department
> > has 100 employees, then it's very wasteful in terms of disk space to repeat
> > the header data over and over again, 100 times. In this example there is
> > only a few doc types, but my real-life data is much larger, and the problem
> > is a "scaling" problem; with just a little bit of data, no problem in
> > duplicating header fields, but with massive amounts of data it's a large
> > problem.
>
> If your goal is data storage, then you are completely correct. All that
> data duplication is something to avoid for a data storage situation.
> Normalizing your data so it's relational makes perfect sense, because
> most database software is designed to efficiently deal with those
> relationships.
>
> Solr is not designed as a data storage platform, and does not handle
> those relationships efficiently. Solr's design goals are all about
> *search*. It often gets touted as filling a NoSQL role ... but it's not
> something I would personally use as a primary data repository. Search
> is a space where data duplication is expected and completely normal.
> This is something that people often have a hard time accepting.

I'm not actually trying to use Solr as a data storage platform. All our data is stored in an SQL database; we are using Solr strictly for its search features, not for storage.
Here is a good example from a test I ran today. I have a header table and 8 child tables which link directly to the header table. Each child row links to exactly 1 header row, and the children do not link to other children. So it is a 1:many relationship between the header and each child.

Some row counts:

header: 223,580
child1: 124,978
child2: 254,045
child3: 127,917
child4: 1,009,030
child5: 225,311
child6: 381,561
child7: 438,315
child8: 18,850

Indexing that into Solr with a flat-file schema blows up into 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a left outer join between the header and each child and getting a row count in the database.

That's not going to scale at all, considering the small size of the source input tables. Some of our indexes would require 50 million header rows alone, never mind the child tables.

So Solr has no way of indexing something like this? I can't believe I would be the first person to run into this issue. I have a feeling I'm missing something obvious somewhere.
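
To illustrate why the flattened count grows so much faster than the source tables, here is a minimal Python sketch. It assumes that, because the children are independent of each other, a left outer join of the header against every child produces, for each header row, the product of that header's per-child match counts (with a child that has no matches still contributing one NULL-padded row). The function names and the toy data are made up for illustration; the real per-header distribution is not shown here, so this does not reproduce the 5,475,316,072 figure, only the mechanism behind it.

```python
from math import prod

def flattened_row_count(child_counts_per_header):
    """Rows produced by left-outer-joining the header to every child.

    child_counts_per_header: one tuple per header row, giving how many rows
    match that header in each child table. A child with zero matches still
    yields one (NULL-padded) row, hence max(1, n).
    """
    return sum(prod(max(1, n) for n in counts)
               for counts in child_counts_per_header)

def normalized_row_count(child_counts_per_header):
    """Rows stored when header and child rows are kept separately."""
    return sum(1 + sum(counts) for counts in child_counts_per_header)

# Hypothetical toy data: 3 header rows, 3 child tables.
toy = [(2, 3, 0), (1, 1, 1), (4, 0, 5)]
print("flattened: ", flattened_row_count(toy))
print("normalized:", normalized_row_count(toy))

# Total source rows, using the row counts quoted above.
source_rows = {
    "header": 223_580, "child1": 124_978, "child2": 254_045,
    "child3": 127_917, "child4": 1_009_030, "child5": 225_311,
    "child6": 381_561, "child7": 438_315, "child8": 18_850,
}
print("source rows:", sum(source_rows.values()))
```

The normalized count grows with the sum of the child rows, while the flattened count grows with their per-header product, which is why a few headers with many rows in several children can dominate the total.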