Re: Searching across multiple repeating fields

Mark Allan Tue, 29 Jun 2010 08:26:48 -0700

In case anyone's interested (and I know at least one person is because they asked me where to find the solr.TemporalCoverage class - sorry, that was my fault, I shouldn't have used the default package name), here's how I got around the problem.

It's not the neatest solution in the world, but it does work and performance doesn't seem to take a hit when I do it this way. That said, I've only tested it with approximately 55,000 documents, so your mileage may vary.

I'm defining daterange as a dynamic field with the pattern "daterange*". If any document should have more than one daterange field, my script which generates appropriately formatted XML will append a counter to each subsequent field name like so:


  <field name="daterange">19820402,19820614</field>
  <field name="daterange1">1990,2000</field>
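A minimal sketch of how such a generator script might number the fields. The original script is not shown in this thread, so the function name `daterange_fields` and its arguments are hypothetical; only the naming scheme (daterange, daterange1, daterange2, ...) comes from the description above.

```python
from xml.sax.saxutils import escape, quoteattr


def daterange_fields(ranges, base="daterange"):
    """Return one <field> line per (start, end) pair.

    The first range uses the bare field name; later ranges get a counter
    suffix (daterange1, daterange2, ...), as described in the post.
    """
    lines = []
    for index, (start, end) in enumerate(ranges):
        name = base if index == 0 else f"{base}{index}"
        value = escape(f"{start},{end}")
        lines.append(f"  <field name={quoteattr(name)}>{value}</field>")
    return lines


if __name__ == "__main__":
    for line in daterange_fields([("19820402", "19820614"), ("1990", "2000")]):
        print(line)
```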

However, the problem with this approach is that the subfields end up getting called daterange_0_i and daterange_1_i and these in turn also match the dynamicField pattern for the main daterange field. So to avoid this, I modified a copy of AbstractSubTypeFieldType.java to use a substring of the main fieldname when naming the internal subfields. They now come out as aterange_0_i and aterange_1_i.
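The name clash can be illustrated with a small sketch. It assumes that a dynamicField pattern is a simple glob with a single leading or trailing `*`, and that subfields are named `<field>_<index><suffix>`, which is what the names quoted above suggest. The helper names are hypothetical, and this is not the modified AbstractSubTypeFieldType code.

```python
def matches_dynamic_field(name, pattern="daterange*"):
    """Approximate dynamicField matching for a single leading or trailing '*'."""
    if pattern.endswith("*"):
        return name.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return name.endswith(pattern[1:])
    return name == pattern


def subfield_name(field, index, suffix="_i", drop_prefix=0):
    """Build a subfield name, optionally dropping leading characters of the field name."""
    return f"{field[drop_prefix:]}_{index}{suffix}"


if __name__ == "__main__":
    for drop in (0, 1):
        name = subfield_name("daterange", 0, drop_prefix=drop)
        print(name, matches_dynamic_field(name))
```

Dropping the first character is enough here because the shortened name no longer starts with "daterange", so it falls outside the "daterange*" pattern.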

Next, in order to ensure that all daterange fields (eg daterange, daterange1, daterange2 etc) get used in a search, I implemented a crude query parser which expands the user's query to include all daterange* fields. It uses a "maxtempcoveragefields" default setting in solrconfig.xml to determine at runtime how many times the user's query should be expanded before passing it on to the default parser.
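The expansion step might look roughly like the sketch below. The real parser is a Java QParserPlugin whose code is not shown in this thread, so everything here is an assumption: the `daterange:<value>` query syntax, the OR-joined output, and the reading of maxtempcoveragefields as the number of extra numbered fields (daterange1 up to daterangeN) to add.

```python
import re


def expand_daterange_query(query, max_fields, base="daterange"):
    """Rewrite each 'daterange:<value>' clause to cover the numbered fields too.

    max_fields is assumed to be the count of extra fields (daterange1..N).
    """
    names = [base] + [f"{base}{i}" for i in range(1, max_fields + 1)]
    clause = re.compile(rf"\b{re.escape(base)}:([^\s()]+)")

    def expand(match):
        value = match.group(1)
        return "(" + " OR ".join(f"{name}:{value}" for name in names) + ")"

    return clause.sub(expand, query)


if __name__ == "__main__":
    print(expand_daterange_query("title:news AND daterange:1985", 1))
```

The rewritten string would then be handed to the default parser, so the numbered fields stay invisible to the user.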


Here are snippets of how everything looks:
solrconfig.xml

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <int name="maxtempcoveragefields">1</int>
    ....

<queryParser name="temporalcoverageqparser" class="uk.ac.edina.solr.search.TemporalCoverageQParserPlugin" />


schema.xml

<fieldType name="temporal" class="uk.ac.edina.solr.schema.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
<dynamicField name="daterange*" type="temporal" indexed="true" stored="true" />


update.xml
<doc>
  ...
  <field name="daterange">19820402,19820614</field>
  <field name="daterange1">1990,2000</field>
</doc>

If anyone wants the code as it is just now, I can happily provide it. Alternatively, if you think it might be of use to others, I can roll it back into the org.apache.solr packages and submit it to the repository so that those with more Solr experience than I can see if it could be better implemented another way.


Cheers,

Mark

On 23 Jun 2010, at 9:52 am, Mark Allan wrote:

Cheers, Geert-Jan, that's very helpful.
We won't always be searching with dates and we wouldn't want duplicates to show up in the results, so your second suggestion looks like a good workaround if I can't solve the actual problem. I didn't know about FieldCollapsing, so I'll definitely keep it in mind.
Thanks
Mark

On 22 Jun 2010, at 3:44 pm, Geert-Jan Brits wrote:
Perhaps my answer is useless, bc I don't have an answer to your direct question, but:
You *might* want to consider whether your concept of a solr-document is at the right level of granularity, i.e. the problem you posted could be tackled (afaik) by defining a document as a 'sub-event' with only 1 daterange.
So for each event-doc you have now, this is replaced by several sub-event
docs in this proposed situation.
Additionally each sub-event doc gets an additional field 'parent-eventid' which maps to something like an event-id (which you're probably using).
So several sub-event docs can point to the same event-id.
Lastly, all sub-event docs belonging to a particular event implement all the other fields that you may have stored in that particular event-doc.
Now you can query for events based on date ranges like you envisioned, but instead of returning events you return sub-event-docs. However, since all data of the original event (except the multiple dateranges) is available in the subevent-doc this shouldn't really bother the client. If you need to display all dates of an event (the only info missing from the returned solr-doc) you could easily store it in a RDB and fetch it using the defined parent-eventid.
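The split described above could be sketched like this. The dictionary layout, the `dateranges` key and the `<event-id>-<n>` id scheme are assumptions made for illustration; only the 'parent-eventid' field and the one-daterange-per-document idea come from the suggestion itself.

```python
def split_into_sub_events(event):
    """Turn one event doc with several dateranges into one sub-event doc per range.

    Each sub-event copies the event's other fields, carries exactly one
    daterange, and points back to the event through 'parent-eventid'.
    """
    shared = {k: v for k, v in event.items() if k not in ("id", "dateranges")}
    sub_events = []
    for index, daterange in enumerate(event["dateranges"]):
        doc = dict(shared)
        doc["id"] = f"{event['id']}-{index}"  # hypothetical id scheme
        doc["parent-eventid"] = event["id"]
        doc["daterange"] = daterange
        sub_events.append(doc)
    return sub_events


if __name__ == "__main__":
    event = {
        "id": "evt42",
        "title": "News broadcast",
        "dateranges": ["19820402,19820614", "1990,2000"],
    }
    for doc in split_into_sub_events(event):
        print(doc)
```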
The only caveat I see, is that possibly multiple sub-events with the same 'parent-eventid' might get returned for a particular query.
This however depends on the type of queries you envision. i.e:
1) If you always issue queries with date-filters, and *assuming* that sub-events of a particular event don't temporally overlap, you will never get multiple sub-events returned.
2) if 1) doesn't hold and assuming you *do* mind multiple sub-events of
the same actual event, you could try to use Field Collapsing on
'parent-eventid' to only return the first sub-event per parent-eventid that matches the rest of your query. (Note however, that FieldCollapsing is a
patch at the moment. http://wiki.apache.org/solr/FieldCollapsing)
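To make the intended effect of collapsing concrete, here is a client-side emulation in Python. It is not the Solr FieldCollapsing patch and makes no claim about how that patch works internally; it only shows "keep the first sub-event per parent-eventid" on an already ranked result list. The function name is hypothetical.

```python
def collapse_on(docs, field="parent-eventid"):
    """Keep only the first document for each distinct value of `field`.

    Documents that lack the field are kept as they are.
    """
    seen = set()
    kept = []
    for doc in docs:
        key = doc.get(field)
        if key is None:
            kept.append(doc)
            continue
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    results = [
        {"id": "evt42-0", "parent-eventid": "evt42"},
        {"id": "evt42-1", "parent-eventid": "evt42"},
        {"id": "evt7-0", "parent-eventid": "evt7"},
    ]
    print([doc["id"] for doc in collapse_on(results)])
```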
Not sure if this helped you at all, but at the very least it was a nice conceptual exercise ;-)

Cheers,
Geert-Jan


2010/6/22 Mark Allan <mark.al...@ed.ac.uk>
Hi all,
Firstly, I apologise for the length of this email but I need to describe properly what I'm doing before I get to the problem!
I'm working on a project just now which requires the ability to store and search on temporal coverage data - ie. a field which specifies a date range during which a certain event took place.
I hunted around for a few days and couldn't find anything which seemed to fit, so I had a go at writing my own field type based on solr.PointType.
It's used as follows:
schema.xml
     <fieldType name="temporal" class="solr.TemporalCoverage" dimension="2" subFieldSuffix="_i"/>
     <field name="daterange" type="temporal" indexed="true" stored="true" multiValued="true"/>
data.xml
     <add>
     <doc>
     ...
     <field name="daterange">1940,1945</field>
     </doc>
     </add>

Internally, this gets stored as:
 <arr name="daterange"><str>1940,1945</str></arr>
 <int name="daterange_0_i">19400000</int>
 <int name="daterange_1_i">19450000</int>
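The mapping from the field value to the two integer subfields appears to be a right-pad with zeros to eight digits (YYYYMMDD). That rule is inferred from the examples in this thread (1940 becomes 19400000, 19820402 stays as it is) rather than taken from the TemporalCoverage source, so treat this as a sketch with hypothetical function names.

```python
def to_date_int(value):
    """Right-pad a YYYY, YYYYMM or YYYYMMDD string with zeros to 8 digits."""
    value = value.strip()
    if not value.isdigit() or len(value) > 8:
        raise ValueError(f"unsupported date value: {value!r}")
    return int(value.ljust(8, "0"))


def parse_daterange(field_value):
    """Split 'start,end' into the two integers stored in the subfields."""
    start, end = field_value.split(",")
    return to_date_int(start), to_date_int(end)


if __name__ == "__main__":
    print(parse_daterange("1940,1945"))
    print(parse_daterange("19820402,19820614"))
```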
In due course, I'll declare the subfields as a proper date type, but in the meantime, this works absolutely fine. I can search for an individual date and Solr will check (queryDate > daterange_0 AND queryDate < daterange_1) and the correct documents are returned. My code also allows the user to input a date range in the query but I won't complicate matters with that just now!
The problem arises when a document has more than one "daterange" field (imagine a news broadcast which covers a variety of topics and hence time periods).

A document with two daterange fields
     <doc>
     ...
     <field name="daterange">19820402,19820614</field>
     <field name="daterange">1990,2000</field>
     </doc>
gets stored internally as
 <arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
 <arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
 <arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>
In this situation, searching for 1985 should yield zero results as it is contained within neither daterange, however, the above document is returned in the result set. What Solr is doing is checking that the queryDate (1985) is greater than *any* of the values in daterange_0 AND queryDate is less than *any* of the values in daterange_1.
How can I get Solr to respect the positions of each item in the daterange_0 and _1 arrays? Ideally I'd like the search to use the following logic, thus preventing the above document from being returned in a search for 1985:
     (queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
(queryDate > daterange_0[1] AND queryDate < daterange_1[1])
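The difference between the two behaviours can be reproduced outside Solr with the values from the example document. This is plain Python over two lists, not Solr code; `matches_any` mimics the behaviour described above for multiValued subfields, and `matches_pairwise` is the position-respecting logic being asked for. The query year 1985 is written as 19850000 to match the stored integers.

```python
STARTS = [19820402, 19900000]  # values of daterange_0_i
ENDS = [19820614, 20000000]    # values of daterange_1_i


def matches_any(query, starts, ends):
    """Behaviour described above: each bound is tested against *any* value."""
    return any(query > s for s in starts) and any(query < e for e in ends)


def matches_pairwise(query, starts, ends):
    """Desired behaviour: start and end are taken from the same position."""
    return any(s < query < e for s, e in zip(starts, ends))


if __name__ == "__main__":
    for query in (19850000, 19950000):
        print(query, matches_any(query, STARTS, ENDS), matches_pairwise(query, STARTS, ENDS))
```

For 19850000 the first check passes because 1985 is after the start of the first range and before the end of the second, while the pairwise check rejects it.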
Someone else had a very similar problem recently on the mailing list with a multiValued PointType field but the thread went cold without a final solution.
While I could filter the results when they get back to my application layer, it seems like it's not really the right place to do it.
Any help getting Solr to respect the positions of items in arrays would be very gratefully received.

Many thanks,
Mark



--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
