In case anyone's interested (and I know at least one person is because
they asked me where to find the solr.TemporalCoverage class - sorry
that was my fault, I shouldn't have used the default package name),
here's how I got around the problem.
It's not the neatest solution in the world, but it does work and
performance doesn't seem to take a hit when I do it this way. That
said, I've only tested it with approximately 55,000 documents, so your
mileage may vary.
I'm defining daterange as a dynamic field with the pattern
"daterange*". If a document has more than one daterange field, the
script which generates my appropriately formatted XML appends a
counter to each subsequent field name, like so:
<field name="daterange">19820402,19820614</field>
<field name="daterange1">1990,2000</field>
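As a rough illustration of that naming scheme (this is not the actual generator script; the function name and its arguments are made up for the example), the counter could be appended like this:

```python
from xml.sax.saxutils import escape, quoteattr


def daterange_fields(ranges, base="daterange"):
    """Build one <field> element per range.

    The first range uses the bare field name; later ranges get a
    counter suffix (daterange1, daterange2, ...).
    """
    lines = []
    for n, value in enumerate(ranges):
        name = base if n == 0 else f"{base}{n}"
        lines.append(f"<field name={quoteattr(name)}>{escape(value)}</field>")
    return lines


fields = daterange_fields(["19820402,19820614", "1990,2000"])
for line in fields:
    print(line)
```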
However, the problem with this approach is that the subfields end up
getting called daterange_0_i and daterange_1_i and these in turn also
match the dynamicField pattern for the main daterange field. So to
avoid this, I modified a copy of AbstractSubTypeFieldType.java to use
a substring of the main fieldname when naming the internal subfields.
They now come out as aterange_0_i and aterange_1_i.
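To make the name clash concrete, here is a small sketch. It is not the modified AbstractSubTypeFieldType code: `fnmatchcase` merely stands in for Solr's dynamicField matching, and dropping the first character is an assumption inferred from the names quoted above.

```python
from fnmatch import fnmatchcase

DYNAMIC_FIELD_PATTERN = "daterange*"


def subfield_names(field_name, suffix="_i", dimension=2, strip_first_char=False):
    """Return the internal subfield names for a field with `dimension` parts."""
    # Assumption: the workaround simply drops the leading character.
    base = field_name[1:] if strip_first_char else field_name
    return [f"{base}_{i}{suffix}" for i in range(dimension)]


default_names = subfield_names("daterange")
renamed = subfield_names("daterange", strip_first_char=True)

# Names that would be claimed by the "daterange*" dynamicField pattern.
default_clashes = [n for n in default_names if fnmatchcase(n, DYNAMIC_FIELD_PATTERN)]
renamed_clashes = [n for n in renamed if fnmatchcase(n, DYNAMIC_FIELD_PATTERN)]

print(default_names, "clashing:", default_clashes)
print(renamed, "clashing:", renamed_clashes)
```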
Next, in order to ensure that all daterange fields (eg daterange,
daterange1, daterange2 etc) get used in a search, I implemented a
crude query parser which expands the user's query to include all
daterange* fields. It uses a "maxtempcoveragefields" default setting
in solrconfig.xml to determine at runtime how many times the user's
query should be expanded before passing it on to the default parser.
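The expansion itself is simple enough to sketch outside Solr. The following is only an illustration of the idea, not the actual TemporalCoverageQParserPlugin: the `field:value` form, the OR-joining, and reading "maxtempcoveragefields" as the highest counter suffix are all assumptions.

```python
def expand_daterange_query(value, max_fields, base_field="daterange"):
    """Repeat one date query across daterange, daterange1, ..., daterangeN."""
    # Assumption: max_fields is the highest counter suffix in use,
    # so max_fields=1 covers "daterange" and "daterange1".
    fields = [base_field] + [f"{base_field}{n}" for n in range(1, max_fields + 1)]
    return " OR ".join(f"{name}:{value}" for name in fields)


expanded = expand_daterange_query("1985", max_fields=1)
print(expanded)
```

The expanded string would then be handed on to the default parser, as described above.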
Here are snippets of how everything looks:
solrconfig.xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <int name="maxtempcoveragefields">1</int>
    ....
<queryParser name="temporalcoverageqparser"
    class="uk.ac.edina.solr.search.TemporalCoverageQParserPlugin" />
schema.xml
<fieldType name="temporal"
    class="uk.ac.edina.solr.schema.TemporalCoverage" dimension="2"
    subFieldSuffix="_i"/>
<dynamicField name="daterange*" type="temporal" indexed="true"
    stored="true" />
update.xml
<doc>
...
<field name="daterange">19820402,19820614</field>
<field name="daterange1">1990,2000</field>
</doc>
If anyone wants the code as it is just now, I can happily provide it.
Alternatively, if you think it might be of use to others, I can roll
it back into the org.apache.solr packages and submit it to the
repository so that those with more Solr experience than I can see if
it could be better implemented another way.
Cheers,
Mark
On 23 Jun 2010, at 9:52 am, Mark Allan wrote:
Cheers, Geert-Jan, that's very helpful.
We won't always be searching with dates and we wouldn't want
duplicates to show up in the results, so your second suggestion
looks like a good workaround if I can't solve the actual problem. I
didn't know about FieldCollapsing, so I'll definitely keep it in mind.
Thanks
Mark
On 22 Jun 2010, at 3:44 pm, Geert-Jan Brits wrote:
Perhaps my answer is useless, bc I don't have an answer to your direct
question, but:
You *might* want to consider whether your concept of a solr-document is
at the correct level of granularity, i.e. the problem you posted could
be tackled (afaik) by defining a document as a 'sub-event' with only 1
daterange.
So each event-doc you have now is replaced by several sub-event docs in
this proposed situation.
Additionally, each sub-event doc gets an extra field 'parent-eventid'
which maps to something like an event-id (which you're probably using).
So several sub-event docs can point to the same event-id.
Lastly, all sub-event docs belonging to a particular event implement all
the other fields that you may have stored in that particular event-doc.
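A minimal sketch of that restructuring, assuming each event is held as a plain dict before indexing (the key names `id` and `dateranges`, the `title` field, and the sub-event id format are invented for the example):

```python
def split_into_sub_events(event):
    """Turn one event with several dateranges into one doc per daterange."""
    # Every sub-event doc repeats all the other fields of the event.
    shared = {k: v for k, v in event.items() if k not in ("id", "dateranges")}
    docs = []
    for n, daterange in enumerate(event["dateranges"]):
        doc = dict(shared)
        doc["id"] = f"{event['id']}-{n}"      # hypothetical sub-event id
        doc["parent-eventid"] = event["id"]   # points back at the event
        doc["daterange"] = daterange          # exactly one range per doc
        docs.append(doc)
    return docs


event = {
    "id": "broadcast42",
    "title": "Evening news",
    "dateranges": ["19820402,19820614", "1990,2000"],
}
sub_events = split_into_sub_events(event)
for doc in sub_events:
    print(doc)
```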
Now you can query for events based on date ranges like you envisioned,
but instead of returning events you return sub-event docs. However,
since all data of the original event (except the multiple dateranges) is
available in the sub-event doc, this shouldn't really bother the client.
If you need to display all dates of an event (the only info missing from
the returned solr-doc) you could easily store it in an RDB and fetch it
using the defined parent-eventid.
The only caveat I see is that multiple sub-events with the same
'parent-eventid' might get returned for a particular query.
This, however, depends on the type of queries you envision, i.e.:
1) If you always issue queries with date-filters, and *assuming* that
sub-events of a particular event don't temporally overlap, you will
never get multiple sub-events returned.
2) If 1) doesn't hold, and assuming you *do* mind multiple sub-events of
the same actual event, you could try to use Field Collapsing on
'parent-eventid' to only return the first sub-event per parent-eventid
that matches the rest of your query. (Note, however, that Field
Collapsing is a patch at the moment:
http://wiki.apache.org/solr/FieldCollapsing)
Not sure if this helped you at all, but at the very least it was a nice
conceptual exercise ;-)
Cheers,
Geert-Jan
2010/6/22 Mark Allan <mark.al...@ed.ac.uk>
Hi all,
Firstly, I apologise for the length of this email but I need to describe
properly what I'm doing before I get to the problem!
I'm working on a project just now which requires the ability to store
and search on temporal coverage data, i.e. a field which specifies a
date range during which a certain event took place.
I hunted around for a few days and couldn't find anything which seemed
to fit, so I had a go at writing my own field type based on
solr.PointType.
It's used as follows:
schema.xml
<fieldType name="temporal" class="solr.TemporalCoverage"
    dimension="2" subFieldSuffix="_i"/>
<field name="daterange" type="temporal" indexed="true" stored="true"
    multiValued="true"/>
data.xml
<add>
<doc>
...
<field name="daterange">1940,1945</field>
</doc>
</add>
Internally, this gets stored as:
<arr name="daterange"><str>1940,1945</str></arr>
<int name="daterange_0_i">19400000</int>
<int name="daterange_1_i">19450000</int>
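The stored integers above are consistent with padding each date on the right with zeros to eight digits (YYYYMMDD). A small sketch of that conversion follows; it is inferred from the example values only, not taken from the field type's code:

```python
def to_subfield_ints(raw):
    """Split "start,end" and pad each part to an 8-digit YYYYMMDD integer."""
    start, end = (part.strip() for part in raw.split(","))
    return int(start.ljust(8, "0")), int(end.ljust(8, "0"))


year_only = to_subfield_ints("1940,1945")
full_dates = to_subfield_ints("19820402,19820614")
print(year_only, full_dates)
```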
In due course, I'll declare the subfields as a proper date type, but in
the meantime, this works absolutely fine. I can search for an individual
date and Solr will check (queryDate > daterange_0 AND queryDate <
daterange_1) and the correct documents are returned. My code also allows
the user to input a date range in the query but I won't complicate
matters with that just now!
The problem arises when a document has more than one "daterange" field
(imagine a news broadcast which covers a variety of topics and hence
time periods).
A document with two daterange fields
<doc>
...
<field name="daterange">19820402,19820614</field>
<field name="daterange">1990,2000</field>
</doc>
gets stored internally as
<arr name="daterange"><str>19820402,19820614</str><str>1990,2000</str></arr>
<arr name="daterange_0_i"><int>19820402</int><int>19900000</int></arr>
<arr name="daterange_1_i"><int>19820614</int><int>20000000</int></arr>
In this situation, searching for 1985 should yield zero results as it is
contained within neither daterange. However, the above document is
returned in the result set. What Solr is doing is checking that the
queryDate (1985) is greater than *any* of the values in daterange_0 AND
queryDate is less than *any* of the values in daterange_1.
How can I get Solr to respect the positions of each item in the
daterange_0 and _1 arrays? Ideally I'd like the search to use the
following logic, thus preventing the above document from being returned
in a search for 1985:
(queryDate > daterange_0[0] AND queryDate < daterange_1[0]) OR
(queryDate > daterange_0[1] AND queryDate < daterange_1[1])
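The difference between the two behaviours can be reproduced outside Solr with the values from the example document. This is only a sketch of the logic, assuming the query 1985 is padded to 19850000 in the same way as the stored values:

```python
starts = [19820402, 19900000]  # daterange_0_i values
ends = [19820614, 20000000]    # daterange_1_i values


def matches_any_value(query_date, starts, ends):
    """Current behaviour: each bound is checked against *any* stored value."""
    return any(query_date > s for s in starts) and any(query_date < e for e in ends)


def matches_positionally(query_date, starts, ends):
    """Desired behaviour: start and end are compared pair by pair."""
    return any(s < query_date < e for s, e in zip(starts, ends))


query_date = 19850000  # assumed padded form of "1985"
print(matches_any_value(query_date, starts, ends))
print(matches_positionally(query_date, starts, ends))
```

The first check passes because 1985 is after the 1982 start and before the 2000 end, even though those two values belong to different ranges; the pairwise check rejects the document.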
Someone else had a very similar problem recently on the mailing list
with a multiValued PointType field but the thread went cold without a
final solution.
While I could filter the results when they get back to my application
layer, it seems like it's not really the right place to do it.
Any help getting Solr to respect the positions of items in arrays would
be very gratefully received.
Many thanks,
Mark