I may have used the wrong terminology; by complex types I meant non-primitive types. A multivalued field can be conceptualized as a list of values, for instance in your example myint = [32, 77], which you can then analyze and query on. What I was trying to ask is whether a complex type can be multi-valued, or something along those lines that would support range queries.
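To make the distinction concrete, here is a small sketch in plain Python (just dicts for illustration, not actual Solr schema; the field names are made up):

```python
# A multivalued *primitive* field is just a repeated scalar value:
doc_multivalued = {"id": "doc1", "myint": [32, 77]}

# What I am asking about is a multivalued *non-primitive* value,
# i.e. a list of structured entries that a range filter could still reach:
doc_wished = {
    "contentId": 1,
    "product": "mobile",
    "ts": [
        {"time": "2017-01-15", "total": 12, "unique": 5},
        {"time": "2017-01-14", "total": 10, "unique": 3},
    ],
}

# The query I'd want: sum 'unique' over entries whose time falls in a range.
in_range = [e for e in doc_wished["ts"]
            if "2017-01-12" <= e["time"] <= "2017-01-15"]
print(sum(e["unique"] for e in in_range))  # -> 8
```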
For instance, the rows below would have to be individual docs in Solr (to my knowledge) if I want to range query from ts=Jan12 to ts=Jan15 and get the sum of 'unique' where contentId=1,product=mobile:

contentId=1,product=mobile ts=Jan15 total=12,unique=5
contentId=1,product=mobile ts=Jan14 total=10,unique=3
contentId=1,product=mobile ts=Jan13 total=15,unique=2
contentId=1,product=mobile ts=Jan12 total=17,unique=4
......

This increases the number of documents in Solr by a lot. If only there were a way to do something like:

{
  contentId=1
  product=mobile
  ts = [
    { time = Jan15 total = 12 unique = 5 },
    { time = Jan14 total = 10 unique = 3 },
    ..
    ..
  ]
}

Of course the above isn't allowed, but I am looking for some way to squeeze the timestamps into a single document so that the number of documents doesn't increase by a lot and I am still able to range query on 'ts'. For some combinations of fields the timestamps may go back up to the last 3-6 months! Let me know if I am still being unclear.

On Sun, Jan 15, 2017 at 8:04 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> bq: I know multivalued fields don't support complex data types
>
> Not sure what you're talking about here. multiValued actually has
> nothing to do with data types. You can have text fields which
> are analyzed and produce multiple tokens and are multiValued.
> You can have primitive types (string, int/long/float/double,
> boolean etc.) that are multiValued, or they can be single valued.
>
> All "multiValued" means is that the _input_ can have the same field
> repeated, i.e.
>
> <doc>
>   <field name="mytext">some stuff</field>
>   <field name="mytext">more stuff</field>
>   <field name="myint">32</field>
>   <field name="myint">77</field>
> </doc>
>
> This doc would fail if mytext or myint were multiValued=false, but
> succeed if multiValued=true at index time.
>
> There are some subtleties with text (analyzed) multivalued fields having
> to do with token offsets, but that's not germane.
>
> Does that change your problem?
> Your document could have a dozen timestamps....
>
> However, there isn't a good way to query across multiple multivalued
> fields in parallel. That is, given a doc like
>
> myint=1
> myint=2
> myint=3
> mylong=4
> mylong=5
> mylong=6
>
> there's no good way to say "only match this document if myint=1 AND
> mylong=4 AND they_are_both_in_the_same_position".
>
> That is, asking for myint=1 AND mylong=6 would match the above. Is
> that what you're wondering about?
>
> ------------------
> I expect you're really asking to do the second above, in which case you
> might want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x
>
> Best,
> Erick
>
> On Sun, Jan 15, 2017 at 7:31 PM, map reduced <k3t.gi...@gmail.com> wrote:
> > Hi,
> >
> > I am trying to fit the following data in Solr to support flexible
> > queries and would like to get your input on the same. I have data
> > about users, say:
> >
> > contentID (assume uuid),
> > platform (eg. website, mobile etc),
> > softwareVersion (eg. sw1.1, sw2.5, ..etc),
> > regionId (eg. us144, uk123, etc..)
> > ....
> >
> > and a few more such fields. This data is partially pre-aggregated
> > (read Hadoop jobs): so let's assume for "contentID = uuid123 and
> > platform = mobile and softwareVersion = sw1.2 and regionId = ANY"
> > I have data in the format:
> >
> > timestamp   pre-aggregated data [uniques, total]
> > Jan 15      [12, 4]
> > Jan 14      [4, 3]
> > Jan 13      [8, 7]
> > ...         ...
> >
> > And then I also have less granular data, say "contentID = uuid123 and
> > platform = mobile and softwareVersion = ANY and regionId = ANY" (these
> > values will be larger than the above table since granularity is
> > reduced):
> >
> > timestamp   pre-aggregated data [uniques, total]
> > Jan 15      [100, 40]
> > Jan 14      [45, 30]
> > ...         ...
> >
> > I'll get queries like "contentID = uuid123 and platform = mobile":
> > give the sum of 'uniques' for Jan15 - Jan13; or for "contentID=uuid123
> > and platform=mobile and softwareVersion=sw1.2": give the sum of
> > 'total' for Jan15 - Jan01.
> >
> > I was thinking of a simple schema where documents will be like (first
> > example above):
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform": "mobile",
> >   "softwareVersion": "sw1.2",
> >   "regionId": "ANY",
> >   "ts": "2017-01-15T01:01:21Z",
> >   "unique": 12,
> >   "total": 4
> > }
> >
> > second example from above:
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform": "mobile",
> >   "softwareVersion": "ANY",
> >   "regionId": "ANY",
> >   "ts": "2017-01-15T01:01:21Z",
> >   "unique": 100,
> >   "total": 40
> > }
> >
> > Possible optimization:
> >
> > {
> >   "contentID": "uuid12349789",
> >   "platform.mobile.softwareVersion.sw1.2.region.us12": {
> >     "unique": 12,
> >     "total": 4
> >   },
> >   "platform.mobile.softwareVersion.sw1.2.region.ANY": {
> >     "unique": 100,
> >     "total": 40
> >   },
> >   "ts": "2017-01-15T01:01:21Z"
> > }
> >
> > Challenges: The number of such rows is very large and it'll grow
> > exponentially with every new field. For instance, if I go with the
> > above suggested schema, I'll end up storing a new document for each
> > combination of contentID, platform, softwareVersion, regionId. Now if
> > we throw in another field to this document, the number of combinations
> > increases exponentially. I have more than a billion such combination
> > rows already.
> >
> > I am hoping to get advice from experts on whether
> >
> > 1. Multiple such fields can fit in the same document for different
> >    'ts' such that range queries are possible on it.
> > 2. The time range (ts) can fit in the same document as a list(?) (to
> >    reduce the number of rows). I know multivalued fields don't support
> >    complex data types, but perhaps something else can be done with the
> >    data/schema to reduce query time and the number of rows.
> >
> > The number of these rows is very large, for sure more than 1 billion
> > (if we go with the schema I was suggesting). What schema would you
> > suggest for this that'll fit the query requirements?
> >
> > FYI: All queries will be exact match on fields (no partial or
> > tokenized matching), so no analysis on fields is necessary. And almost
> > all queries are range queries.
> >
> > Thanks,
> >
> > KP
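(For reference, here is a minimal self-contained sketch, in plain Python, of the flat one-document-per-timestamp model discussed above and the range-sum query over it. The Solr equivalent would be exact-match filters plus a range filter on ts plus a sum aggregation; the helper name and data values here are illustrative only.)

```python
from datetime import date

# Flat model from the thread: one document per (field combination, timestamp).
# Simulated as Python dicts; in Solr each dict would be one indexed doc.
docs = [
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 15), "total": 12, "unique": 5},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 14), "total": 10, "unique": 3},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 13), "total": 15, "unique": 2},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 12), "total": 17, "unique": 4},
]

def sum_field(docs, field, start, end, **filters):
    """Sum `field` over docs matching the exact-match `filters`
    with start <= ts <= end (the range-query part)."""
    return sum(
        d[field]
        for d in docs
        if start <= d["ts"] <= end
        and all(d.get(k) == v for k, v in filters.items())
    )

# "From ts=Jan12 to ts=Jan15, give me the sum of 'unique'
#  where contentId=1, product=mobile":
print(sum_field(docs, "unique", date(2017, 1, 12), date(2017, 1, 15),
                contentId=1, product="mobile"))  # -> 14
```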