Does anyone have any idea?

On Sun, Jan 15, 2017 at 9:54 PM, map reduced <k3t.gi...@gmail.com> wrote:
> I may have used the wrong terminology; by complex types I meant
> non-primitive types. Multivalued can be conceptualized as a list of values,
> for instance in your example myint = [ 32, 77 ] etc., which you can
> possibly analyze and query upon. What I was trying to ask is whether a
> complex type can be multi-valued, or something along those lines that can
> be supported by range queries.
>
> For instance: the rows below will have to be individual docs in Solr (to
> my knowledge). If I want to range query from ts=Jan 12 to ts=Jan 15, give
> me the sum of 'unique' where 'contentId=1,product=mobile':
>
> contentId=1,product=mobile  ts=Jan15  total=12,unique=5
> contentId=1,product=mobile  ts=Jan14  total=10,unique=3
> contentId=1,product=mobile  ts=Jan13  total=15,unique=2
> contentId=1,product=mobile  ts=Jan12  total=17,unique=4
> ......
>
> This increases the number of documents in Solr by a lot. If only there
> were a way to do something like:
>
> {
>   contentId=1
>   product=mobile
>   ts = [
>     {
>       time = Jan15
>       total = 12
>       unique = 5
>     },
>     {
>       time = Jan14
>       total = 10
>       unique = 3
>     },
>     ..
>     ..
>   ]
> }
>
> Of course the above isn't allowed, but some way to squeeze the timestamps
> into a single document so that it doesn't increase the number of documents
> by a lot and I am still able to range query on 'ts'.
>
> For some (combinations of fields) rows the timestamps may go up to the
> last 3-6 months!
>
> Let me know if I am still being unclear.
>
> On Sun, Jan 15, 2017 at 8:04 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> bq: I know multivalued fields don't support complex data types
>>
>> Not sure what you're talking about here. multiValued actually has
>> nothing to do with data types. You can have text fields which
>> are analyzed and produce multiple tokens and are multiValued.
>> You can have primitive types (string, int/long/float/double,
>> boolean etc.) that are multiValued, or they can be single-valued.
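For concreteness, the range-sum described above can be sketched in plain Python over the flattened per-timestamp rows (a standalone illustration of the query semantics, not Solr syntax; the `docs` list is hypothetical data mirroring the rows in the message):

```python
from datetime import date

# Hypothetical flattened docs: one document per (contentId, product, ts) row.
docs = [
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 15), "total": 12, "unique": 5},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 14), "total": 10, "unique": 3},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 13), "total": 15, "unique": 2},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 12), "total": 17, "unique": 4},
]

def sum_unique(docs, content_id, product, start, end):
    """Sum 'unique' over docs matching the filters with ts in [start, end]."""
    return sum(
        d["unique"]
        for d in docs
        if d["contentId"] == content_id
        and d["product"] == product
        and start <= d["ts"] <= end
    )

# Sum of 'unique' from Jan 12 through Jan 15: 5 + 3 + 2 + 4 = 14
total_unique = sum_unique(docs, 1, "mobile", date(2017, 1, 12), date(2017, 1, 15))
```

The point of the question is that Solr stores each of these rows as a separate document, whereas the poster would like all four timestamps folded into one.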
>>
>> All "multiValued" means is that the _input_ can have the same field
>> repeated, i.e.
>>
>> <doc>
>>   <field name="mytext">some stuff</field>
>>   <field name="mytext">more stuff</field>
>>   <field name="myint">32</field>
>>   <field name="myint">77</field>
>> </doc>
>>
>> This doc would fail if mytext or myint were multiValued=false but
>> succeed if multiValued=true at index time.
>>
>> There are some subtleties with text (analyzed) multivalued fields having
>> to do with token offsets, but that's not germane.
>>
>> Does that change your problem? Your document could have a dozen
>> timestamps....
>>
>> However, there isn't a good way to query across multiple multivalued
>> fields in parallel. That is, given a doc like
>>
>> myint=1
>> myint=2
>> myint=3
>> mylong=4
>> mylong=5
>> mylong=6
>>
>> there's no good way to say "only match this document if myint=1 AND
>> mylong=4 AND they_are_both_in_the_same_position".
>>
>> That is, asking for myint=1 AND mylong=6 would match the above. Is
>> that what you're wondering about?
>>
>> ------------------
>> I expect you're really asking to do the second above, in which case you
>> might want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x
>>
>> Best,
>> Erick
>>
>> On Sun, Jan 15, 2017 at 7:31 PM, map reduced <k3t.gi...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am trying to fit the following data in Solr to support flexible
>> > queries and would like to get your input on the same. I have data
>> > about users, say:
>> >
>> > contentID (assume uuid),
>> > platform (e.g. website, mobile, etc.),
>> > softwareVersion (e.g. sw1.1, sw2.5, etc.),
>> > regionId (e.g. us144, uk123, etc.)
>> > ....
>> >
>> > and a few more other such fields.
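Erick's point about parallel multivalued fields can be illustrated with a small Python sketch, where plain lists stand in for the multivalued Solr fields (an illustration of the matching semantics, not Solr internals):

```python
# Plain lists standing in for two multivalued fields on one document.
doc = {"myint": [1, 2, 3], "mylong": [4, 5, 6]}

def field_match(doc, field, value):
    # A multivalued match succeeds if the value appears *anywhere* in the
    # field's list; the position is not part of the match.
    return value in doc[field]

# myint=1 AND mylong=6 matches the doc even though 1 and 6 occupy
# different positions in their respective lists:
matches = field_match(doc, "myint", 1) and field_match(doc, "mylong", 6)

# A position-correlated match would have to zip the two lists explicitly,
# which is exactly what the query language gives no good way to express:
aligned = any(i == 1 and l == 4 for i, l in zip(doc["myint"], doc["mylong"]))
misaligned = any(i == 1 and l == 6 for i, l in zip(doc["myint"], doc["mylong"]))
```

Here `matches` is True (per-field matching ignores position), `aligned` is True (1 and 4 share position 0), and `misaligned` is False (1 and 6 never co-occur at the same position), which is the distinction Erick is drawing.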
>> > This data is partially pre-aggregated
>> > (read: Hadoop jobs), so let's assume for "contentID = uuid123 and
>> > platform = mobile and softwareVersion = sw1.2 and regionId = ANY" I
>> > have data in the format:
>> >
>> > timestamp    pre-aggregated data [uniques, total]
>> > Jan 15       [12, 4]
>> > Jan 14       [4, 3]
>> > Jan 13       [8, 7]
>> > ...          ...
>> >
>> > And then I also have less granular data, say "contentID = uuid123 and
>> > platform = mobile and softwareVersion = ANY and regionId = ANY" (these
>> > values will be larger than the above table since granularity is
>> > reduced):
>> >
>> > timestamp    pre-aggregated data [uniques, total]
>> > Jan 15       [100, 40]
>> > Jan 14       [45, 30]
>> > ...          ...
>> >
>> > I'll get queries like "contentID = uuid123 and platform = mobile",
>> > give the sum of 'uniques' for Jan 15 - Jan 13, or "contentID = uuid123
>> > and platform = mobile and softwareVersion = sw1.2", give the sum of
>> > 'total' for Jan 15 - Jan 01.
>> >
>> > I was thinking of a simple schema where documents will be like (first
>> > example above):
>> >
>> > {
>> >   "contentID": "uuid12349789",
>> >   "platform": "mobile",
>> >   "softwareVersion": "sw1.2",
>> >   "regionId": "ANY",
>> >   "ts": "2017-01-15T01:01:21Z",
>> >   "unique": 12,
>> >   "total": 4
>> > }
>> >
>> > second example from above:
>> >
>> > {
>> >   "contentID": "uuid12349789",
>> >   "platform": "mobile",
>> >   "softwareVersion": "ANY",
>> >   "regionId": "ANY",
>> >   "ts": "2017-01-15T01:01:21Z",
>> >   "unique": 100,
>> >   "total": 40
>> > }
>> >
>> > Possible optimization:
>> >
>> > {
>> >   "contentID": "uuid12349789",
>> >   "platform.mobile.softwareVersion.sw1.2.region.us12": {
>> >     "unique": 12,
>> >     "total": 4
>> >   },
>> >   "platform.mobile.softwareVersion.sw1.2.region.ANY": {
>> >     "unique": 100,
>> >     "total": 40
>> >   },
>> >   "ts": "2017-01-15T01:01:21Z"
>> > }
>> >
>> > Challenges: The number of such rows is very large and it'll grow
>> > exponentially with every new field. For instance, if I go with the
>> > above suggested schema, I'll
end up storing a new document for each combination of
>> > contentID, platform, softwareVersion, regionId. Now if we throw
>> > another field into this document, the number of combinations increases
>> > exponentially. I have more than a billion such combination rows
>> > already.
>> >
>> > I am hoping to find advice from experts on whether
>> >
>> > 1. Multiple such fields can be fit in the same document for different
>> >    'ts' such that range queries are possible on it.
>> > 2. The time range (ts) can be fit in the same document as a list(?)
>> >    (to reduce the number of rows). I know multivalued fields don't
>> >    support complex data types, but perhaps something else can be done
>> >    with the data/schema to reduce query time and the number of rows.
>> >
>> > The number of these rows is very large, certainly more than 1 billion
>> > (if we go with the schema I was suggesting). What schema would you
>> > suggest for this that'll fit the query requirements?
>> >
>> > FYI: All queries will be exact matches on fields (no partial or
>> > tokenized matches), so no analysis on the fields is necessary. And
>> > almost all queries are range queries.
>> >
>> > Thanks,
>> >
>> > KP
>> >
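The sizing concern in the thread can be made concrete with a back-of-the-envelope sketch. The cardinalities below are entirely hypothetical (the thread gives none); the point is only that the worst-case document count is the product of the per-field cardinalities times the number of retained timestamps, so every added field multiplies it:

```python
from math import prod

# Hypothetical per-field cardinalities -- not figures from the thread.
cardinalities = {
    "contentID": 1_000_000,
    "platform": 3,
    "softwareVersion": 10,
    "regionId": 50,
}

# "last 3-6 months" of daily timestamps; 180 is the upper bound.
days_retained = 180

# Worst-case count with one document per (combination, timestamp):
doc_count = prod(cardinalities.values()) * days_retained
```

Real counts would be lower, since not every combination actually occurs, but the multiplicative growth is why the poster is already past a billion rows and why folding the timestamps into fewer documents is attractive.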