On 11 October 2011 03:21, Chris Hostetter <hossman_luc...@fucit.org> wrote:

>
> : Conceptually
> : the Join-approach looks like it would work from paper, although I'm not a
> : big fan of introducing a lot of complexity to the frontend / querying
> part
> : of the solution.
>
> you lost me there -- i don't see how using join would impact the front end
> / query side at all.  your query clients would never even know that a join
> had happened (your indexing code would certainly have to know about
> creating those special case docs to join against, obviously)
>
> : As an alternative, what about using your fieldMaskingSpanQuery-approach
> : solely (without the JOIN-approach)  and encode open/close on a per day
> : basis?
> : I didn't mention it, but I 'only' need 100 days of data, which would lead
> to
> : 100 open and 100 close values, not counting the pois with multiple
>         ...
> : Data then becomes:
> :
> : open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
> : close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...
>
> aw hell ... i assumed you needed to support an arbitrarily large number
> of special case open+close pairs per doc.
>

I didn't express myself well. A POI can have multiple open+close pairs per
day, but each night I only index the coming 100 days. So MOST POIs will have
100 open+close pairs (one set of opening hours per day), but some have more.


>
> if you only have to support a fixed number (N=100) of open+close values you
> could just have N*2 date fields and a BooleanQuery containing N 2-clause
> BooleanQueries, each containing range queries against one pair of your date
> fields. ie...
>
>  ((+open00:[* TO NOW] +close00:[NOW+3HOURS TO *])
>   (+open01:[* TO NOW] +close01:[NOW+3HOURS TO *])
>   (+open02:[* TO NOW] +close02:[NOW+3HOURS TO *])
>   ...etc...
>   (+open99:[* TO NOW] +close99:[NOW+3HOURS TO *]))
>
> ...for a lot of indexes, 100 clauses is small potatoes as far as number of
> boolean clauses go, especially if many of them are going to short circuit
> out because there won't be any matches at all.
>

Given that I need multiple open+close pairs per day, this can't be used
directly.

However, when setting a logical upper bound on the maximum number of opening-hours
slots per day (say 3), which would be possible, this could be extended to:
open00 = day 0 --> open00-0 = day 0 timeslot 0, open00-1 = day 0 timeslot 1,
etc.

So,

 ((+open00-0:[* TO NOW] +close00-0:[NOW+3HOURS TO *])
  (+open00-1:[* TO NOW] +close00-1:[NOW+3HOURS TO *])
  (+open00-2:[* TO NOW] +close00-2:[NOW+3HOURS TO *])
  (+open01-0:[* TO NOW] +close01-0:[NOW+3HOURS TO *])
  (+open01-1:[* TO NOW] +close01-1:[NOW+3HOURS TO *])
  (+open01-2:[* TO NOW] +close01-2:[NOW+3HOURS TO *])
  ...etc...
  (+open99-2:[* TO NOW] +close99-2:[NOW+3HOURS TO *]))

This would need 2*3*100 = 600 dynamic fields to cover the opening hours. You
mention this is peanuts for constructing a BooleanQuery, but how about
memory consumption?
I'm particularly concerned about the Lucene FieldCache getting populated for
each of the 600 fields, since I had some nasty OOM experiences with that in
the past. (2-3 years ago the memory consumption of the Lucene FieldCache
couldn't be controlled; I'm not sure how that is now, to be honest.)

I will not be sorting on any of the 600 dynamic fields, by the way. Instead I
will only use them as part of the above BooleanQuery, which I will likely
define as a filter query.
Just to be sure: in this situation the Lucene FieldCache won't be touched,
correct? If so, this will probably be a good, workable solution!
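
For completeness, here is a rough sketch (assuming SolrJ) of how I would build
that filter query programmatically rather than writing the 300 clause pairs out
by hand. The openDD-S / closeDD-S field-naming pattern and the 3-slots-per-day
bound are just the assumptions from above, nothing final:

import org.apache.solr.client.solrj.SolrQuery;

public class OpeningHoursFq {

    // Builds the OR-of-ANDs filter over the per-day/per-slot date fields.
    // Field names like open00-0 / close00-0 follow the naming sketched above
    // and are only an assumption; adjust to whatever dynamicField pattern is used.
    static String buildOpeningHoursFq(int days, int slotsPerDay,
                                      String visit, String depart) {
        StringBuilder fq = new StringBuilder();
        for (int d = 0; d < days; d++) {
            for (int s = 0; s < slotsPerDay; s++) {
                String suffix = String.format("%02d-%d", d, s);
                fq.append("(+open").append(suffix).append(":[* TO ").append(visit).append("]");
                fq.append(" +close").append(suffix).append(":[").append(depart).append(" TO *]) ");
            }
        }
        return fq.toString().trim();
    }

    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("*:*");
        // 100 days x 3 slots = 300 two-clause groups, used purely as a filter
        // query (no sorting or faceting on these fields).
        q.addFilterQuery(buildOpeningHoursFq(100, 3, "NOW", "NOW+3HOURS"));
        System.out.println(q);
    }
}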


> : Alternatively, how would you compare your suggested approach with the
> : approach by David Smiley using either SOLR-2155 (Geohash prefix query
> : filter) or LSP:
> :
> https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244
> .
> : That would work right now, and the LSP-approach seems pretty elegant to
> me.
>
> I'm afraid i'm totally ignorant of how the LSP stuff works so i can't
> really comment there.
>
> If i understand what you mean about mapping the open/close concepts to
> lat/lon concepts, then i can see how it would be useful for multiple
> pairwise (absolute) date ranges, but i'm not really sure how you would deal
> with the diff open+close pairs per day (or on diff days of the week, or
> special days of the year) using the lat+lon conceptual model ... I guess
> if the LSP stuff supports arbitrary N-dimensional spaces then you could
> model day or week as a dimension .. but it still seems like you'd need
> multiple fields for the special case days, right?
>

I planned to do the following using LSP (with help from David):

Each <open,close> tuple would be modeled as a point (x,y), with x = open and
y = close. So a POI can have many (100 or more) points, each representing
an <open,close> tuple.

Given a 100-day lookahead and a granularity of 5 minutes, we can map
dimensions x and y to [0,30000].

E.g.:
- indexing starts at / baseline is: 2011-11-01:0000
- POI open: 2011-11-08:1800 - POI close: 2011-11-09:0300
- (query) user visit: 2011-11-08:2300 - user depart: 2011-11-09:0200

Would map to:
- POI open: 2520 - POI close: 2628 = point (x,y) = (2520,2628)
- (query) user visit: 2580 - user depart: 2616 = bbox filter with the
  ranges x:[0 TO 2580], y:[2616 TO 30000]

All POIs are returned that have one or more points within the bbox.
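
To make the mapping concrete, here is a small sketch of the unit conversion
(plain Java, nothing LSP-specific). The baseline convention (counting
2011-11-01 itself as day 1) and all class/method names are just my own
illustration, chosen so that it reproduces the numbers above:

import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class OpenCloseUnits {

    // Baseline: the first indexed day (2011-11-01), counted as day 1.
    static final LocalDate BASELINE = LocalDate.of(2011, 11, 1);
    static final int UNITS_PER_DAY = 24 * 60 / 5; // 5-minute granularity -> 288 units/day

    // Maps a timestamp to 5-minute units relative to the baseline.
    static long toUnits(LocalDateTime t) {
        long day = ChronoUnit.DAYS.between(BASELINE, t.toLocalDate()) + 1;
        long slotOfDay = (t.getHour() * 60 + t.getMinute()) / 5;
        return day * UNITS_PER_DAY + slotOfDay;
    }

    public static void main(String[] args) {
        long open   = toUnits(LocalDateTime.of(2011, 11, 8, 18, 0)); // 2520
        long close  = toUnits(LocalDateTime.of(2011, 11, 9, 3, 0));  // 2628
        long visit  = toUnits(LocalDateTime.of(2011, 11, 8, 23, 0)); // 2580
        long depart = toUnits(LocalDateTime.of(2011, 11, 9, 2, 0));  // 2616

        // Index the POI as the point (open,close); at query time filter with
        // the bbox x:[0 TO visit], y:[depart TO 30000] to find POIs that are
        // open for the whole visit.
        System.out.printf("point = (%d,%d), bbox: x:[0 TO %d] y:[%d TO 30000]%n",
                open, close, visit, depart);
    }
}

How the points and the bbox filter are actually expressed at index and query
time depends on how LSP / SOLR-2155 is configured, so I left that part out.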

Both approaches seem pretty good to me. I'll be testing both soon.

Thanks!
Geert-Jan




> How it would compare performance wise: no idea.
>
>
> -Hoss
>
