Re: How to define my data in schema.xml

Jack Krupansky Tue, 18 Jun 2013 07:56:56 -0700

You can in fact have multiple collections in Solr and do a limited amount of joining, and Solr has multivalued fields as well, but none of those techniques should be used to avoid the process of flattening and denormalizing a relational data model. It is hard work, but yes, it is required to use Solr effectively.

Again, start with the queries - what problem are you trying to solve? Nobody stores data just for the sake of storing it - how will the data be used?


-- Jack Krupansky

-----Original Message----- From: Mysurf Mail
Sent: Tuesday, June 18, 2013 9:58 AM
To: [email protected]
Subject: Re: How to define my data in schema.xml

Hi Jack,
Thanks for your kind comment.

I am truly at the beginning of modeling my schema over an existing,
working DB.
I have used the school-teachers-students DB as an example scenario.
(a. I wrote that as a disclaimer in my first post. b. I really do not
know anyone who has 300 hobbies either.)

In real life my DB is obviously quite different.
I just used this as an example of the pitfalls that will occur if I
apply my old DB data modeling notions.
Obviously, the old relational modeling idioms do not apply here.

Now, my question referred to the fact that I would really like to
avoid a flat table/join/view, for the reasons I listed earlier.
My scenario is answering plain, user-generated text searches over an
MSSQL DB that contains a few 1:n relationships (and a few 1:n:n relationships).

So, I came here for tips. Should I use one combined index (treating it as a
NoSQL source), separate indices, or something else? Are there any other ways
to define relational data?
Thanks.

On Tue, Jun 18, 2013 at 4:30 PM, Jack Krupansky <[email protected]> wrote:

It sounds like you still have a lot of work to do on your data model. No
matter how you slice it, 8 billion rows/fields/whatever is still way too
much for any engine to search on a single server. If you have 8 billion of
anything, a heavily sharded SolrCloud cluster is probably warranted. Don't
plan ahead to put more than 100 million rows on a single node; plan on a
proof of concept implementation to determine that number.

When we in Solr land say "flattened" or "denormalized", we mean in an
intelligent, "smart", thoughtful sense, not a mindless, mechanical
flattening. It is an opportunity for you to reconsider your data models,
both old and new.
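
As a rough illustration of that difference (not something posted in this thread), the Python sketch below contrasts a mechanical cross-join of every child table with a query-driven flattening that produces one document per student. The toy data, the field names, and the choice of "student" as the unit being searched are all assumptions made for the example.

```python
from itertools import product

# Hypothetical toy data standing in for the school example; none of these
# names or values come from a real schema.
school = {"id": 1, "name": "Springfield High"}
teachers = ["Ms. A", "Mr. B"]
students = [
    {"id": 10, "name": "Ann", "hobbies": ["chess", "judo"]},
    {"id": 11, "name": "Bob", "hobbies": ["piano"]},
]

# Mechanical flattening: one row per combination of teacher, student and hobby.
mechanical_rows = [
    {"school": school["name"], "teacher": t, "student": s["name"], "hobby": h}
    for t, s in product(teachers, students)
    for h in s["hobbies"]
]

# Query-driven flattening: one document per student (the assumed search unit),
# with the parent school copied in and the 1:n hobbies kept as a list, which
# would map onto a multivalued field.
student_docs = [
    {
        "id": f"student-{s['id']}",
        "school_name": school["name"],
        "student_name": s["name"],
        "hobbies": list(s["hobbies"]),
    }
    for s in students
]

print(len(mechanical_rows))  # 6 rows: 2 teachers x 3 (student, hobby) pairs
print(len(student_docs))     # 2 documents: one per student
```

Which entity becomes the document depends on what users search for, which is why the queries need to be pinned down first.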

Maybe data modeling is beyond your skill set. If so, have a chat with your
boss and ask for some assistance, training, whatever.

Actually, I am suspicious of your 8 billion number - change each of those
300's to realistic, average numbers. Each teacher teaches 300 courses?
Right. Each Student has 300 hobbies? If you say so, but...

Don't worry about schema.xml until you get your data model under control.

For an initial focus, try envisioning the use cases for user queries. That
will guide you in thinking about how the data would need to be organized to
satisfy those user queries.

-- Jack Krupansky

-----Original Message----- From: Mysurf Mail
Sent: Tuesday, June 18, 2013 2:20 AM
To: [email protected]
Subject: Re: How to define my data in schema.xml

Thanks for your reply.
I have tried the simplest approach and it works absolutely fantastic.
Huge table - 0s to result.

Two problems, as I described earlier, are what I am trying to solve:
1. I create a flat table just for Solr. This requires maintenance and
development. Can I run Solr over my regular tables?
   That would be my simplest approach: working over my relational tables.
2. When you query a flat table by school name, as I described, if the
school has 300 students and 300 teachers, with 300 teacherCourses and 300
studentHobbies each,
   you get 8.1 billion rows (300*300*300*300). I am sure this will work
great on Solr, but searching for the school name will retrieve 8.1 billion rows.
3. Let's say all my searches are user-generated free-text searches over
the name and comments columns.
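
The row count in item 2 can be checked with a few lines of Python. The second figure is only an illustration under an assumed alternative layout (one record per entity instead of one row per combination); that layout is not something proposed in this thread.

```python
# Count taken from the example: 300 of each child per parent.
n = 300

# A flat table with one row per (student, teacher, teacherCourse, studentHobby)
# combination for a single school.
cross_join_rows = n ** 4

# Assumed alternative: one record per entity rather than per combination.
students = n
teachers = n
teacher_courses = teachers * n   # 300 courses for each teacher
student_hobbies = students * n   # 300 hobbies for each student
per_entity_records = 1 + students + teachers + teacher_courses + student_hobbies

print(f"{cross_join_rows:,}")     # 8,100,000,000
print(f"{per_entity_records:,}")  # 180,601
```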
Thanks.

On Tue, Jun 18, 2013 at 7:32 AM, Gora Mohanty <[email protected]> wrote:

On 18 June 2013 01:10, Mysurf Mail <[email protected]> wrote:

> Thanks for your quick reply. Here are some notes:
>
> 1. Consider that all tables in my example have two columns: Name &
> Description which I would like to index and search.
> 2. I have no reason to create a flat table other than for Solr. So I
> would like to see if I can avoid it.
> 3. If in my example I will have a flat table then obviously it will hold a
> lot of rows for a single school.
>     By searching the exact school name I will likely receive a lot of rows.
> (my flat table has its own pk)

Yes, all of this is definitely the case, but in practice
it does not matter. Solr can efficiently search through
millions of rows. To start with, just try the simplest
approach, and only complicate things as and when
needed.

>     That is something I would like to avoid and I thought I can avoid this
> by defining teachers and students as multiple value or something like this
> and then teacherCourses and studentHobbies as 1:n respectively.
> This is quite similar to my real life demand, so I came here to get
> some tips as a Solr noob.

You have still not described the searches that
you would want to do. Again, I would suggest starting
with the most straightforward approach.

Regards,
Gora
