Does anyone have any idea?

On Sun, Jan 15, 2017 at 9:54 PM, map reduced <k3t.gi...@gmail.com> wrote:
> I may have used the wrong terminology; by complex types I meant
> non-primitive types. Multivalued can be conceptualized as a list of values,
> for instance in your example myint = [ 32, 77 ] etc., which you can
> possibly analyze and query upon. What I was trying to ask is whether a
> complex type can be multi-valued, or something along those lines that can
> be supported by range queries.
>
> For instance: the rows below will have to be individual docs in Solr (to
> my knowledge). If I want to range query from ts=Jan 12 to ts=Jan 15, give
> me the sum of 'unique' where 'contentId=1,product=mobile':
>
> contentId=1,product=mobile  ts=Jan15  total=12,unique=5
> contentId=1,product=mobile  ts=Jan14  total=10,unique=3
> contentId=1,product=mobile  ts=Jan13  total=15,unique=2
> contentId=1,product=mobile  ts=Jan12  total=17,unique=4
> ......
>
> This increases the number of documents in Solr by a lot. If only there
> were a way to do something like:
>
> {
>   contentId=1
>   product=mobile
>   ts = [
>     {
>       time = Jan15
>       total = 12
>       unique = 5
>     },
>     {
>       time = Jan14
>       total = 10
>       unique = 3
>     },
>     ..
>     ..
>   ]
> }
>
> Of course the above isn't allowed, but some way to squeeze the timestamps
> into a single document so that it doesn't increase the number of documents
> by a lot and I am still able to range query on 'ts'.
>
> For some (combinations of fields) rows the timestamps may go up to the
> last 3-6 months!
>
> Let me know if I am still being unclear.
>
> On Sun, Jan 15, 2017 at 8:04 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
>> bq: I know multivalued fields don't support complex data types
>>
>> Not sure what you're talking about here. multiValued actually has
>> nothing to do with data types. You can have text fields which
>> are analyzed and produce multiple tokens and are multiValued.
>> You can have primitive types (string, int/long/float/double,
>> boolean etc.) that are multiValued, or they can be single-valued.
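For concreteness, the range-sum described above can be sketched in plain Python over the flattened per-timestamp rows (a standalone illustration of the query semantics, not Solr syntax; the `docs` list is hypothetical data mirroring the rows in the message):

```python
from datetime import date

# Hypothetical flattened docs: one document per (contentId, product, ts) row.
docs = [
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 15), "total": 12, "unique": 5},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 14), "total": 10, "unique": 3},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 13), "total": 15, "unique": 2},
    {"contentId": 1, "product": "mobile", "ts": date(2017, 1, 12), "total": 17, "unique": 4},
]

def sum_unique(docs, content_id, product, start, end):
    """Sum 'unique' over docs matching the filters with ts in [start, end]."""
    return sum(
        d["unique"]
        for d in docs
        if d["contentId"] == content_id
        and d["product"] == product
        and start <= d["ts"] <= end
    )

# Sum of 'unique' from Jan 12 through Jan 15: 5 + 3 + 2 + 4 = 14
total_unique = sum_unique(docs, 1, "mobile", date(2017, 1, 12), date(2017, 1, 15))
```

The point of the question is that Solr stores each of these rows as a separate document, whereas the poster would like all four timestamps folded into one.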
>>
>> All "multiValued" means is that the _input_ can have the same field
>> repeated, i.e.
>>
>> <doc>
>>   <field name="mytext">some stuff</field>
>>   <field name="mytext">more stuff</field>
>>   <field name="myint">32</field>
>>   <field name="myint">77</field>
>> </doc>
>>
>> This doc would fail if mytext or myint were multiValued=false but
>> succeed if multiValued=true at index time.
>>
>> There are some subtleties with text (analyzed) multivalued fields having
>> to do with token offsets, but that's not germane.
>>
>> Does that change your problem? Your document could have a dozen
>> timestamps....
>>
>> However, there isn't a good way to query across multiple multivalued
>> fields in parallel. That is, given a doc like
>>
>> myint=1
>> myint=2
>> myint=3
>> mylong=4
>> mylong=5
>> mylong=6
>>
>> there's no good way to say "only match this document if myint=1 AND
>> mylong=4 AND they_are_both_in_the_same_position".
>>
>> That is, asking for myint=1 AND mylong=6 would match the above. Is
>> that what you're wondering about?
>>
>> ------------------
>> I expect you're really asking to do the second above, in which case you
>> might want to look at StreamingExpressions and/or ParallelSQL in Solr 6.x
>>
>> Best,
>> Erick
>>
>> On Sun, Jan 15, 2017 at 7:31 PM, map reduced <k3t.gi...@gmail.com> wrote:
>> > Hi,
>> >
>> > I am trying to fit the following data in Solr to support flexible
>> > queries and would like to get your input on the same. I have data
>> > about users, say:
>> >
>> > contentID (assume uuid),
>> > platform (e.g. website, mobile, etc.),
>> > softwareVersion (e.g. sw1.1, sw2.5, etc.),
>> > regionId (e.g. us144, uk123, etc.)
>> > ....
>> >
>> > and a few more other such fields.
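Erick's point about parallel multivalued fields can be illustrated with a small Python sketch, where plain lists stand in for the multivalued Solr fields (an illustration of the matching semantics, not Solr internals):

```python
# Plain lists standing in for two multivalued fields on one document.
doc = {"myint": [1, 2, 3], "mylong": [4, 5, 6]}

def field_match(doc, field, value):
    # A multivalued match succeeds if the value appears *anywhere* in the
    # field's list; the position is not part of the match.
    return value in doc[field]

# myint=1 AND mylong=6 matches the doc even though 1 and 6 occupy
# different positions in their respective lists:
matches = field_match(doc, "myint", 1) and field_match(doc, "mylong", 6)

# A position-correlated match would have to zip the two lists explicitly,
# which is exactly what the query language gives no good way to express:
aligned = any(i == 1 and l == 4 for i, l in zip(doc["myint"], doc["mylong"]))
misaligned = any(i == 1 and l == 6 for i, l in zip(doc["myint"], doc["mylong"]))
```

Here `matches` is True (per-field matching ignores position), `aligned` is True (1 and 4 share position 0), and `misaligned` is False (1 and 6 never co-occur at the same position), which is the distinction Erick is drawing.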
>> > This data is partially pre-aggregated
>> > (read: Hadoop jobs), so let's assume for "contentID = uuid123 and
>> > platform = mobile and softwareVersion = sw1.2 and regionId = ANY" I
>> > have data in the format:
>> >
>> > timestamp    pre-aggregated data [uniques, total]
>> > Jan 15       [12, 4]
>> > Jan 14       [4, 3]
>> > Jan 13       [8, 7]
>> > ...          ...
>> >
>> > And then I also have less granular data, say "contentID = uuid123 and
>> > platform = mobile and softwareVersion = ANY and regionId = ANY" (these
>> > values will be larger than the above table since granularity is
>> > reduced):
>> >
>> > timestamp    pre-aggregated data [uniques, total]
>> > Jan 15       [100, 40]
>> > Jan 14       [45, 30]
>> > ...          ...
>> >
>> > I'll get queries like "contentID = uuid123 and platform = mobile",
>> > give the sum of 'uniques' for Jan 15 - Jan 13, or "contentID = uuid123
>> > and platform = mobile and softwareVersion = sw1.2", give the sum of
>> > 'total' for Jan 15 - Jan 01.
>> >
>> > I was thinking of a simple schema where documents will be like (first
>> > example above):
>> >
>> > {
>> >   "contentID": "uuid12349789",
>> >   "platform": "mobile",
>> >   "softwareVersion": "sw1.2",
>> >   "regionId": "ANY",
>> >   "ts": "2017-01-15T01:01:21Z",
>> >   "unique": 12,
>> >   "total": 4
>> > }
>> >
>> > second example from above:
>> >
>> > {
>> >   "contentID": "uuid12349789",
>> >   "platform": "mobile",
>> >   "softwareVersion": "ANY",
>> >   "regionId": "ANY",
>> >   "ts": "2017-01-15T01:01:21Z",
>> >   "unique": 100,
>> >   "total": 40
>> > }
>> >
>> > Possible optimization:
>> >
>> > {
>> >   "contentID": "uuid12349789",
>> >   "platform.mobile.softwareVersion.sw1.2.region.us12": {
>> >     "unique": 12,
>> >     "total": 4
>> >   },
>> >   "platform.mobile.softwareVersion.sw1.2.region.ANY": {
>> >     "unique": 100,
>> >     "total": 40
>> >   },
>> >   "ts": "2017-01-15T01:01:21Z"
>> > }
>> >
>> > Challenges: The number of such rows is very large and it'll grow
>> > exponentially with every new field. For instance, if I go with the
>> > above suggested schema, I'll
end up storing a new document for each combination of
>> > contentID, platform, softwareVersion, regionId. Now if we throw
>> > another field into this document, the number of combinations increases
>> > exponentially. I have more than a billion such combination rows
>> > already.
>> >
>> > I am hoping to find advice from experts on whether
>> >
>> > 1. Multiple such fields can be fit in the same document for different
>> >    'ts' such that range queries are possible on it.
>> > 2. The time range (ts) can be fit in the same document as a list(?)
>> >    (to reduce the number of rows). I know multivalued fields don't
>> >    support complex data types, but perhaps something else can be done
>> >    with the data/schema to reduce query time and the number of rows.
>> >
>> > The number of these rows is very large, certainly more than 1 billion
>> > (if we go with the schema I was suggesting). What schema would you
>> > suggest for this that'll fit the query requirements?
>> >
>> > FYI: All queries will be exact matches on fields (no partial or
>> > tokenized matches), so no analysis on the fields is necessary. And
>> > almost all queries are range queries.
>> >
>> > Thanks,
>> >
>> > KP
>> >
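The sizing concern in the thread can be made concrete with a back-of-the-envelope sketch. The cardinalities below are entirely hypothetical (the thread gives none); the point is only that the worst-case document count is the product of the per-field cardinalities times the number of retained timestamps, so every added field multiplies it:

```python
from math import prod

# Hypothetical per-field cardinalities -- not figures from the thread.
cardinalities = {
    "contentID": 1_000_000,
    "platform": 3,
    "softwareVersion": 10,
    "regionId": 50,
}

# "last 3-6 months" of daily timestamps; 180 is the upper bound.
days_retained = 180

# Worst-case count with one document per (combination, timestamp):
doc_count = prod(cardinalities.values()) * days_retained
```

Real counts would be lower, since not every combination actually occurs, but the multiplicative growth is why the poster is already past a billion rows and why folding the timestamps into fewer documents is attractive.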