I also have a similar scenario, where fundamentally I have to retrieve all URLs where a user id has been found. So, in my schema, I designed the URL as the (string) key, with a (possibly huge) list of attributes automatically mapped to strings. For example:
Url1 (key):
- language: en
- content:userid1
- content:userid1
- content:userid1 (i.e. 3 times, actually, for user 1)
- content:userid2
- content:userid3
- author:userid4

and so on and so forth. So, if I understood correctly, you're saying that this is a bad design? How should I fix my schema in that case, in your opinion?

Best,
Flavio

On Wed, Jul 10, 2013 at 11:53 PM, Jack Krupansky <j...@basetechnology.com> wrote:

> Simple answer: avoid a "large number of values in a single document". There
> should only be a modest to moderate number of fields in a single document.
>
> Is the data relatively static, or subject to frequent updates? To update
> any field of a single document, even with atomic update, Solr must
> read and rewrite every field of the document. So lots of smaller documents
> are best for a frequent-update scenario.
>
> Multivalued fields are great for storing a relatively small list of
> values. You can add to the list easily, but under the hood Solr must read
> and rewrite the full list as well as the full document. And there is no
> way to address or synchronize individual elements of multivalued fields.
>
> Joins are great... if used in moderation. Heavy use of joins is not a
> great idea.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Marcelo Elias Del Valle
> Sent: Wednesday, July 10, 2013 5:37 PM
> To: solr-user@lucene.apache.org
> Subject: amount of values in a multi value field - is denormalization
> always the best option?
>
> Hello,
>
> I have asked a question recently about Solr limitations and another about
> joins. It turns out this question is about both at the same time.
> I am trying to figure out how to denormalize my data so that I need just 1
> document in my index instead of performing a join. One way of doing this
> is storing an entity as a multivalued field, instead of storing it as
> separate documents.
> Let me give an example.
> Consider the entities:
>
> User:
> id: 1
> name: Joan of Arc
> age: 27
>
> Webpage:
> id: 1
> url: http://wiki.apache.org/solr/Join
> category: Technical
> user_id: 1
>
> id: 2
> url: http://stackoverflow.com
> category: Technical
> user_id: 1
>
> Instead of creating 1 document for the user, 1 for webpage 1, and 1 for
> webpage 2 (1 parent and 2 children), I could store the webpages in
> multivalued fields on the user, as follows:
>
> User:
> id: 1
> name: Joan of Arc
> age: 27
> webpage1: ["id:1", "url: http://wiki.apache.org/solr/Join", "category: Technical"]
> webpage2: ["id:2", "url: http://stackoverflow.com", "category: Technical"]
>
> It would probably perform better than the join, right? However, it made
> me think about Solr limitations again. What if I have 200 million webpages
> (200 million fields) per user? Or imagine a case where I could have 200
> million values in a single field, as in the case where I need to index every
> HTML DOM element (div, a, etc.) of each web page a user visited.
> I mean, if I need to do the query and this is a business requirement no
> matter what, then although denormalizing could be better than using
> query-time joins, I wonder whether distributing the data present in this
> single document across the cluster wouldn't give me better performance.
> And that is something I won't get with block joins or multivalued fields...
> I guess there is probably no right answer to this question (at least
> not a known one), and I know I should create a POC to check how each
> approach performs... But do you think such a large number of values in a
> single document could make denormalization infeasible in an extreme case
> like this? Would you agree with me if I said denormalization is not always
> the right option?
>
> Best regards,
> --
> Marcelo Elias Del Valle
> http://mvalle.com - @mvallebr
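For concreteness, here is a minimal sketch of the two document shapes being compared, written as plain Python dicts in the JSON style Solr's update handler accepts. This is only an illustration of the modeling trade-off, not Solr client code; the field names `webpage_url` / `webpage_category` and the `denormalize` helper are my own inventions, not from the thread.

```python
# Option 1: normalized -- one parent document plus one document per
# webpage, linked by user_id and combined at query time with a join.
user = {"id": "1", "name": "Joan of Arc", "age": 27}
webpages = [
    {"id": "w1", "url": "http://wiki.apache.org/solr/Join",
     "category": "Technical", "user_id": "1"},
    {"id": "w2", "url": "http://stackoverflow.com",
     "category": "Technical", "user_id": "1"},
]


def denormalize(user, webpages):
    """Option 2: fold the webpages into parallel multivalued fields on
    a single user document, so no join is needed at query time.

    The values at index i of each list describe the same webpage, but
    the whole document must be rewritten on every update -- which is
    exactly Jack's warning about large documents above.
    """
    doc = dict(user)
    doc["webpage_url"] = [w["url"] for w in webpages]
    doc["webpage_category"] = [w["category"] for w in webpages]
    return doc


flat = denormalize(user, webpages)
print(flat["webpage_url"])
```

Adding a third webpage to the normalized model is one small new document; in the denormalized model it means rereading and rewriting the entire user document, which is why the frequent-update and 200-million-values cases argue against flattening.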