Jack,

     When you say "large number of values in a single document", do you also
mean a block in a block join? Would you treat them as exactly the same thing?
     In my case, I have just one insert and no updates. Even in this case, do
you think a large document or block would be a really bad idea? I am mostly
worried about the search time.
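
     Just to make sure we mean the same thing, here is a rough sketch of the
block I have in mind, written as SolrJ-style code for the block-join support
we have been discussing (the doc_type field, the field names and the server
URL are only illustrative, nothing I actually have running):

    // index one user together with its webpages as a single block
    SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

    SolrInputDocument user = new SolrInputDocument();
    user.addField("id", "1");
    user.addField("doc_type", "user");   // discriminator field, my own convention
    user.addField("name", "Joan of Arc");
    user.addField("age", 27);

    SolrInputDocument page = new SolrInputDocument();
    page.addField("id", "page-1");
    page.addField("doc_type", "webpage");
    page.addField("url", "http://wiki.apache.org/solr/Join");
    page.addField("category", "Technical");
    user.addChildDocument(page);         // child goes into the same block as the parent

    server.add(user);
    server.commit();

    // search: return the users whose child webpages match the filter
    SolrQuery q = new SolrQuery("{!parent which='doc_type:user'}category:Technical");

     With millions of webpages per user, that block gets huge, which is what
worries me.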

Best regards,
Marcelo.


2013/7/10 Jack Krupansky <j...@basetechnology.com>

> Simple answer: avoid "large number of values in a single document". There
> should only be a modest to moderate number of fields in a single document.
>
> Is the data relatively static, or subject to frequent updates? To update
> any field of a single document, even with atomic update, requires Solr to
> read and rewrite every field of the document. So, lots of smaller documents
> are best for a frequent update scenario.
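
If I understand correctly, that is a partial update like the one below (SolrJ
sketch; the age value is made up, and "server" is the same kind of
HttpSolrServer as in my sketch above):

    // atomic "set": Solr still fetches the stored document and rewrites every field
    SolrInputDocument partial = new SolrInputDocument();
    partial.addField("id", "1");
    partial.addField("age", Collections.singletonMap("set", 28));
    server.add(partial);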
>
> Multivalued fields are great for storing a relatively small list of
> values. You can add to the list easily, but under the hood, Solr must read
> and rewrite the full list as well as the full document. And, there is no
> way to address or synchronize individual elements of multivalued fields.
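
And for a multivalued field, the "add" form, which as you say still rewrites
the whole list and the whole document under the hood (again just a sketch, the
field name is made up):

    // append one value to a multivalued field; the full document is read and rewritten
    SolrInputDocument partial = new SolrInputDocument();
    partial.addField("id", "1");
    partial.addField("visited_urls", Collections.singletonMap("add", "http://stackoverflow.com"));
    server.add(partial);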
>
> Joins are great... if used in moderation. Heavy use of joins is not a
> great idea.
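
For reference, the query-time join we have been discussing would look roughly
like this against my original two-document layout (sketch only):

    // query-time join: users that own at least one "Technical" webpage
    SolrQuery join = new SolrQuery("{!join from=user_id to=id}category:Technical");
    QueryResponse rsp = server.query(join);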
>
> -- Jack Krupansky
>
> -----Original Message----- From: Marcelo Elias Del Valle
> Sent: Wednesday, July 10, 2013 5:37 PM
> To: solr-user@lucene.apache.org
> Subject: amount of values in a multi value field - is denormalization
> always the best option?
>
>
> Hello,
>
>    I recently asked a question about Solr limitations and another one about
> joins. It turns out this question is about both at the same time.
>    I am trying to figure out how to denormalize my data so that I need just
> one document in my index instead of performing a join. I figure one way of
> doing this is storing an entity in multivalued fields, instead of storing it
> as a separate document.
>    Let me give an example. Consider the entities:
>
> User:
>    id: 1
>    name: Joan of Arc
>    age: 27
>
> Webpage:
>    id: 1
>    url: http://wiki.apache.org/solr/Join
>    category: Technical
>    user_id: 1
>
>    id: 2
>    url: http://stackoverflow.com
>    category: Technical
>    user_id: 1
>
>    Instead of creating one document for the user, one for webpage 1 and one
> for webpage 2 (one parent and two children), I could store the webpages in
> multivalued fields of the user, as follows:
>
> User:
>    id: 1
>    name: Joan of Arc
>    age: 27
>    webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join", "category: Technical"]
>    webpage2: ["id:2", "url: http://stackoverflow.com", "category: Technical"]
>
>    It would probably perform better than the join, right? However, it made
> me think about Solr limitations again. What if I have 200 million webpages
> (200 million fields) per user? Or imagine a case where I could have 200
> million values in a single field, for instance if I need to index every HTML
> DOM element (div, a, etc.) of each web page the user visited.
>    I mean, if I need to run this query and it is a business requirement no
> matter what, then even though denormalizing could be better than using
> query-time joins, I wonder whether distributing the data from this single
> document across the cluster wouldn't give me better performance. And that is
> something I won't get with block joins or multivalued fields...
>    I guess there is probably no right answer to this question (at least not
> a known one), and I know I should build a POC to check how each option
> performs... But do you think such a large number of values in a single
> document could make denormalization unfeasible in an extreme case like this?
> Would you agree with me if I said denormalization is not always the right
> option?
>
> Best regards,
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
>



-- 
Marcelo Elias Del Valle
http://mvalle.com - @mvallebr
