I may have used the wrong terminology: by complex types I meant
non-primitive types. Multivalued can be conceptualized as a list of
values, for instance in your example myint = [32, 77], which you can
then analyze and query on. What I was trying to ask is whether a
complex type can be multi-valued, or something along those lines that
can be supported by range queries.

For instance: the rows below would each have to be individual docs in
Solr (as far as I know). Say I want to range query from ts=Jan 12 to
ts=Jan 15 and get the sum of 'unique' where contentId=1 and
product=mobile:

contentId=1,product=mobile    ts=Jan15     total=12,unique=5
contentId=1,product=mobile    ts=Jan14     total=10,unique=3
contentId=1,product=mobile    ts=Jan13     total=15,unique=2
contentId=1,product=mobile    ts=Jan12     total=17,unique=4
......
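
For reference, with one doc per row I believe that query would look
something like this (a sketch, assuming 'ts' is indexed as a date
field, 'unique' is a numeric field, and the JSON Facet API is
available):

q=contentId:1 AND product:mobile
&fq=ts:[2017-01-12T00:00:00Z TO 2017-01-15T23:59:59Z]
&rows=0
&json.facet={"sum_unique":"sum(unique)"}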

This increases the number of documents in Solr by a lot. If only there
were a way to do something like:

{
  contentId = 1,
  product = mobile,
  ts = [
    { time = Jan15, total = 12, unique = 15 },
    { time = Jan16, total = 10, unique = 3 },
    ..
    ..
  ]
}

Of course the above isn't allowed, but I'm looking for some way to
squeeze the timestamps into a single document, so that the document
count doesn't grow by a lot and I am still able to range query on 'ts'.
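
The closest thing I've found is Solr's nested child documents, where
the update would look something like this (a sketch; 'time' is my own
field name, and the child docs would need their own ids):

{
  "id": "1-mobile",
  "contentId": "1",
  "product": "mobile",
  "_childDocuments_": [
    {"id": "1-mobile-0115", "time": "2017-01-15T00:00:00Z", "total": 12, "unique": 5},
    {"id": "1-mobile-0114", "time": "2017-01-14T00:00:00Z", "total": 10, "unique": 3}
  ]
}

But as far as I understand, each child is still a separate Lucene
document under the hood, so this wouldn't actually reduce the document
count -- it only changes how the data is grouped.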

For some rows (combinations of fields), the timestamps may go back as
far as the last 3-6 months!

Let me know if I am still being unclear.

On Sun, Jan 15, 2017 at 8:04 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: I know multivalued fields don't support complex data types
>
> Not sure what you're talking about here. multiValued actually has
> nothing to do with data types. You can have text fields which
> are analyzed and produce multiple tokens and are multiValued.
> You can have primitive types (string, int/long/float/double,
> boolean, etc.) that are multiValued, or they can be single-valued.
>
> All "multiValued" means is that the _input_ can have the same field
> repeated, i.e.
> <doc>
>   <field name="mytext">some stuff</field>
>   <field name="mytext">more stuff</field>
>   <field name="myint">32</field>
>   <field name="myint">77</field>
> </doc>
>
> This doc would fail if mytext or myint were multiValued=false, but
> succeed if multiValued=true at index time.
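>
> In schema.xml that's just a flag on the field definition, something
> like this (a sketch; the types and names here are only examples):
>
> <field name="mytext" type="text_general" indexed="true" stored="true"
>        multiValued="true"/>
> <field name="myint" type="int" indexed="true" stored="true"
>        multiValued="true"/>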
>
> There are some subtleties with text (analyzed) multivalued fields having
> to do with token offsets, but that's not germane.
>
> Does that change your problem? Your document could have a dozen
> timestamps....
>
> However, there isn't a good way to query across multiple multivalued fields
> in parallel. That is, a doc like
>
> myint=1
> myint=2
> myint=3
> mylong=4
> mylong=5
> mylong=6
>
> there's no good way to say "only match this document if myint=1 AND
> mylong=4 AND they_are_both_in_the_same_position".
>
> That is, asking for myint=1 AND mylong=6 would match the above. Is
> that what you're
> wondering about?
>
> ------------------
> I expect you're really asking to do the second above, in which case you
> might want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x.
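>
> A rollup might look something like the following (an untested sketch;
> the collection name is a placeholder, and it assumes docValues on the
> fields so the /export handler can be used):
>
> rollup(
>   search(yourCollection,
>          q="contentID:uuid123 AND platform:mobile
>             AND ts:[2017-01-13T00:00:00Z TO 2017-01-15T23:59:59Z]",
>          fl="contentID,unique",
>          sort="contentID asc",
>          qt="/export"),
>   over="contentID",
>   sum(unique))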
>
> Best,
> Erick
>
> On Sun, Jan 15, 2017 at 7:31 PM, map reduced <k3t.gi...@gmail.com> wrote:
> > Hi,
> >
> > I am trying to fit the following data in Solr to support flexible queries
> > and would like to get your input on the same. I have data about users,
> > say:
> >
> > contentID (assume uuid),
> > platform (eg. website, mobile etc),
> > softwareVersion (eg. sw1.1, sw2.5, ..etc),
> > regionId (eg. us144, uk123, etc..)
> > ....
> >
> > and a few more such fields. This data is partially pre-aggregated (read:
> > Hadoop jobs): so let's assume for "contentID = uuid123 and platform =
> > mobile and softwareVersion = sw1.2 and regionId = ANY" I have data in
> > this format:
> >
> > timestamp  pre-aggregated data [ uniques, total]
> >  Jan 15    [ 12, 4]
> >  Jan 14    [ 4, 3]
> >  Jan 13    [ 8, 7]
> >  ...        ...
> >
> > And then I also have less granular data, say "contentID = uuid123 and
> > platform = mobile and softwareVersion = ANY and regionId = ANY" (these
> > values will be larger than in the table above since granularity is
> > reduced):
> >
> > timestamp : pre-aggregated data [uniques, total]
> >  Jan 15    [ 100, 40]
> >  Jan 14    [ 45, 30]
> >  ...           ...
> >
> > I'll get queries like "contentID = uuid123 and platform = mobile": give
> > the sum of 'uniques' for Jan15 - Jan13; or "contentID=uuid123 and
> > platform=mobile and softwareVersion=sw1.2": give the sum of 'total' for
> > Jan15 - Jan01.
> >
> > I was thinking of a simple schema where documents will look like this
> > (first example above):
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform" : "mobile",
> >   "softwareVersion": "sw1.2",
> >   "regionId": "ANY",
> >   "ts" : "2017-01-15T01:01:21Z",
> >   "unique": 12,
> >   "total": 4
> > }
> >
> > second example from above:
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform" : "mobile",
> >   "softwareVersion": "ANY",
> >   "regionId": "ANY",
> >   "ts" : "2017-01-15T01:01:21Z",
> >   "unique": 100,
> >   "total": 40
> > }
> >
> > Possible optimization:
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform.mobile.softwareVersion.sw1.2.region.us12" : {
> >       "unique": 12,
> >       "total": 4
> >   },
> >  "platform.mobile.softwareVersion.sw1.2.region.ANY" : {
> >       "unique": 100,
> >       "total": 40
> >   },
> >   "ts" : "2017-01-15T01:01:21Z"
> > }
> >
> > Challenges: the number of such rows is very large, and it grows
> > combinatorially with every new field. For instance, with the schema
> > suggested above, I'll end up storing a new document for each combination
> > of contentID, platform, softwareVersion, and regionId. If we throw
> > another field into this document, the number of combinations is
> > multiplied by that field's cardinality. I already have more than a
> > billion such combination rows.
> >
> > I am hoping for advice from experts on whether:
> >
> >    1. Multiple such fields can fit in the same document for different 'ts'
> >    values, such that range queries are possible on them.
> >    2. The time range (ts) can fit in the same document as a list(?) (to
> >    reduce the number of rows). I know multivalued fields don't support
> >    complex data types, but perhaps something else can be done with the
> >    data/schema to reduce query time and the number of rows.
> >
> > The number of these rows is very large, certainly more than 1 billion (if
> > we go with the schema I was suggesting). What schema would you suggest
> > that will fit these query requirements?
> >
> > FYI: all queries will be exact matches on fields (no partial or tokenized
> > matching), so no analysis on the fields is necessary. And almost all
> > queries include a range on 'ts'.
> >
> > Thanks,
> >
> > KP
>
