>
> On 9/7/2018 7:44 PM, John Smith wrote:
> > Thanks Shawn, for your comments. The reason why I don't want to go flat
> > file structure, is due to all the wasted/duplicated data. If a department
> > has 100 employees, then it's very wasteful in terms of disk space to
> > repeat the header data over and over again, 100 times. In this example
> > there is only a few doc types, but my real-life data is much larger, and
> > the problem is a "scaling" problem; with just a little bit of data, no
> > problem in duplicating header fields, but with massive amounts of data
> > it's a large problem.
>
> If your goal is data storage, then you are completely correct.  All that
> data duplication is something to avoid for a data storage situation.
> Normalizing your data so it's relational makes perfect sense, because
> most database software is designed to efficiently deal with those
> relationships.
>
> Solr is not designed as a data storage platform, and does not handle
> those relationships efficiently.  Solr's design goals are all about
> *search*.  It often gets touted as filling a NoSQL role ... but it's not
> something I would personally use as a primary data repository.  Search
> is a space where data duplication is expected and completely normal.
> This is something that people often have a hard time accepting.
>
>
I'm not actually trying to use Solr as a data storage platform. All our
data is stored in an SQL database; we are using Solr strictly for its
search features, not for storage.

Here is a good example from a test I ran today. I have a header table and
8 child tables which link directly to the header table. Each child row
links to exactly 1 header row and never to another child, so there is a
1:many relationship between header and each child. Some row counts:

header:      223,580

child1:      124,978
child2:      254,045
child3:      127,917
child4:    1,009,030
child5:      225,311
child6:      381,561
child7:      438,315
child8:       18,850


Indexing that into Solr with a flat-file schema blows up into
5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a
left outer join between header and each child and getting a row count in
the database; each header row gets repeated once for every combination of
its child rows. That's not going to scale at all, considering the small
size of the source input tables. Some of our indexes would require 50
million header rows alone, never mind the child tables.
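
For illustration only, here is a minimal Python sketch of why the joined row
count grows so quickly. It is not the real schema or data: the table names,
the `header_id` column, and the per-header child counts are all made up, and
it uses an in-memory SQLite database with only 3 child tables. It compares
the row count of a chained left outer join with the product of the per-child
counts for each header row.

```python
import sqlite3
from math import prod


def flattened_row_count(child_counts):
    """Rows one header row produces when left-outer-joined to every child.

    A child table with no matching rows still contributes one
    (NULL-padded) row, hence max(1, n).
    """
    return prod(max(1, n) for n in child_counts)


def demo():
    # Hypothetical toy data: {child table: {header id: number of child rows}}.
    data = {
        "child1": {1: 10, 2: 1},
        "child2": {1: 10, 2: 0},
        "child3": {1: 10, 2: 5},
    }
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE header (id INTEGER PRIMARY KEY)")
    cur.executemany("INSERT INTO header (id) VALUES (?)", [(1,), (2,)])
    for table, per_header in data.items():
        cur.execute(
            f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, header_id INTEGER)"
        )
        for header_id, n in per_header.items():
            cur.executemany(
                f"INSERT INTO {table} (header_id) VALUES (?)",
                [(header_id,)] * n,
            )

    # Same style of query as described above: header joined to each child.
    cur.execute(
        """
        SELECT COUNT(*)
        FROM header h
        LEFT OUTER JOIN child1 c1 ON c1.header_id = h.id
        LEFT OUTER JOIN child2 c2 ON c2.header_id = h.id
        LEFT OUTER JOIN child3 c3 ON c3.header_id = h.id
        """
    )
    joined = cur.fetchone()[0]

    # Per-header product of child counts, summed over all header rows.
    predicted = sum(
        flattened_row_count([data[t][h] for t in data]) for h in (1, 2)
    )
    # Total rows actually stored across header and child tables.
    source = 2 + sum(sum(per_header.values()) for per_header in data.values())
    conn.close()
    return joined, predicted, source


if __name__ == "__main__":
    joined, predicted, source = demo()
    print(f"source rows: {source}, joined rows: {joined}, predicted: {predicted}")
```

In this toy case a single header row with 10 rows in each of 3 children
accounts for 1,000 of the joined rows, which is the same multiplicative
effect as in the counts above, just on a much smaller scale.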

So does Solr have no way of indexing something like this? I can't believe I
would be the first person to run into this issue; I have a feeling I'm
missing something obvious somewhere.
