Querying multiple fields with the MoreLikeThis handler and mlt.fl

2010-03-17 Thread Alf Eaton
I'm wondering if there's been any progress on an issue described a
year or so ago in "More details on my MoreLikeThis mlt.qf boosting
problem", where it was pointed out that the MoreLikeThis handler only
queries one field for each of the "interesting terms" that it finds in
the input text.

I was hoping that using
/mlt?mlt.fl=title+text&mlt.qf=title^2+text^0.5&mlt.interestingTerms=details&stream.body=tony+blair
would produce
title:tony^2 title:blair^2 text:tony^0.5 text:blair^0.5
but it actually produces just
text:tony^0.5 text:blair^0.5

i.e. despite including the "title" field in both mlt.qf and mlt.fl, it
only searches the "text" field.

If I set mlt.fl=title, it produces
title:tony^2 title:blair^2
so it is having an effect, just not the one I'm hoping for...

As it stands, in Solr 1.4, the MoreLikeThis result set from the query
above doesn't rank the document titled "Tony Blair" first, even though
that would seem the appropriate top result given the input text "tony
blair" and a boost on the title field.

alf
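
A minimal client-side sketch of the expansion hoped for above, assuming the
interesting terms have already been fetched. It pairs every term with every
boosted field. The name expand_terms and its inputs are hypothetical, and this
does not change what the MoreLikeThis handler itself does.

```python
def expand_terms(terms, field_boosts):
    """Pair every term with every field, attaching that field's boost.

    terms        -- list of interesting terms, e.g. ["tony", "blair"]
    field_boosts -- mapping of field name to boost, e.g. {"title": 2}
    """
    clauses = []
    for field, boost in field_boosts.items():
        for term in terms:
            # ":g" prints 2 as "2" and 0.5 as "0.5"
            clauses.append(f"{field}:{term}^{boost:g}")
    return " ".join(clauses)


query = expand_terms(["tony", "blair"], {"title": 2, "text": 0.5})
print(query)
```

The resulting string could be sent as an ordinary query in a second request,
at the cost of an extra round trip.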


Payloads for multiValued fields?

2007-08-16 Thread Alf Eaton
When searching a multiValued field, is it possible to know which of  
the multiple fields the match was in?


For example if I have an index of documents, each of which has  
multiple image captions stored in separate fields, I'd like to be  
able to link from the search results to the caption in the original  
document.


One possibility could be attaching metadata to a field, similar to  
payloads for terms. At the moment all I can think of is adding  
metadata inside the stored field and stripping that out when it's  
indexed and displayed, but that's not ideal.


alf.


Re: Payloads for multiValued fields?

2007-08-16 Thread Alf Eaton


On 16 Aug 2007, at 17:20, Alf Eaton wrote:

> When searching a multiValued field, is it possible to know which of
> the multiple fields the match was in?
>
> For example if I have an index of documents, each of which has
> multiple image captions stored in separate fields, I'd like to be
> able to link from the search results to the caption in the original
> document.
>
> One possibility could be attaching metadata to a field, similar to
> payloads for terms. At the moment all I can think of is adding
> metadata inside the stored field and stripping that out when it's
> indexed and displayed, but that's not ideal.


Actually on reflection all this would need would be for the  
Highlighter to add a field to the response, saying which item of the  
multiValued field the match was in. Is that possible?


alf.




Re: Payloads for multiValued fields?

2007-08-16 Thread Alf Eaton

On 16 Aug 2007, at 17:34, Yonik Seeley wrote:

> On 8/16/07, Alf Eaton <[EMAIL PROTECTED]> wrote:
>> On 16 Aug 2007, at 17:20, Alf Eaton wrote:
>>
>>> When searching a multiValued field, is it possible to know which of
>>> the multiple fields the match was in?
>>>
>>> For example if I have an index of documents, each of which has
>>> multiple image captions stored in separate fields, I'd like to be
>>> able to link from the search results to the caption in the original
>>> document.
>>>
>>> One possibility could be attaching metadata to a field, similar to
>>> payloads for terms. At the moment all I can think of is adding
>>> metadata inside the stored field and stripping that out when it's
>>> indexed and displayed, but that's not ideal.
>> Actually on reflection all this would need would be for the
>> Highlighter to add a field to the response, saying which item of the
>> multiValued field the match was in. Is that possible?
>
> Could you perhaps index the captions as
> #1 this is the first caption
> #2 this is the second caption
>
> And then just look for #n in the highlighted results?
> For display, you could also strip out the #n in the captions.


I think that would probably work, yes - '#1' wouldn't be indexed so  
wouldn't affect the search results.


Thanks,
alf.
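
A rough sketch of the "#n" marker idea, assuming the marker is prepended to
each caption before indexing and that the highlighted snippet starts at the
beginning of the caption. The helper names are hypothetical.

```python
import re


def add_markers(captions):
    """Prefix each caption with its 1-based position, e.g. '#1 ...'."""
    return [f"#{n} {caption}" for n, caption in enumerate(captions, start=1)]


def caption_number(snippet):
    """Return the marker number at the start of a snippet, or None."""
    match = re.match(r"\s*#(\d+)\s", snippet)
    return int(match.group(1)) if match else None


def strip_marker(text):
    """Remove a leading '#n ' marker before displaying a caption."""
    return re.sub(r"^\s*#\d+\s+", "", text)


marked = add_markers(["this is the first caption", "this is the second caption"])
print(marked)
print(caption_number("#2 this is the <em>second</em> caption"))
```

caption_number returns None whenever the snippet does not begin with a
marker, so callers need a fallback for that case.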


Re: solr + carrot2

2007-08-16 Thread Alf Eaton

Pieter Berkel wrote:
> In a similar vein, I'm also looking at methods of term extraction and
> automatic keyword generation from indexed documents.  I've been
> experimenting with MoreLikeThis and values returned by the
> "mlt.interestingTerms" parameter, which has potential but needs a bit of
> refinement before it can be truly useful.  Has anybody else discovered
> clever or useful methods of term extraction using solr?


I've been using mlt.interestingTerms and it works well. For example,
I fetch the mlt.interestingTerms for each of the top 20 search results,
combine the scores and use the top-scoring words to suggest additional
search terms.


alf.
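
The score-combining step described above might look something like this. It
is only a sketch: it assumes the interesting terms for each result have
already been parsed into a term-to-score mapping, and the function name and
sample scores are made up for illustration.

```python
from collections import Counter


def suggest_terms(per_document_terms, limit=5):
    """Sum each term's score across documents and return the top terms."""
    totals = Counter()
    for terms in per_document_terms:
        # Counter.update adds the scores to any existing totals
        totals.update(terms)
    return [term for term, _ in totals.most_common(limit)]


# Hypothetical scores for three search results
results = [
    {"blair": 1.0, "labour": 0.5},
    {"blair": 0.8, "iraq": 0.7},
    {"labour": 0.4},
]
print(suggest_terms(results, limit=2))
```

Summing is the simplest way to combine scores; averaging or weighting by
result rank would be equally plausible choices.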


FunctionQuery, DisMax and Highlighting

2007-10-18 Thread Alf Eaton
I'm currently using the standard request handler for queries, because it
provides highlighting (unlike DisMax). I'd also like to be able to use
FunctionQuery to boost certain fields.

From looking through the lists and JIRA it looks like there has been
some work to add highlighting to DisMax queries, but that things seem to
be stalled waiting for a more modular approach (Search Components). Is
that a fair assessment, and, if so, can anyone suggest the best way to
get both highlighting and FunctionQuery at the same time?

Thanks,
alf


Re: FunctionQuery, DisMax and Highlighting

2007-10-19 Thread Alf Eaton
Mike Klaas wrote:
> On 18-Oct-07, at 8:47 AM, Alf Eaton wrote:
> 
>> I'm currently using the standard request handler for queries, because it
>> provides highlighting (unlike DisMax). I'd also like to be able to use
>> FunctionQuery to boost certain fields.
>>
>> From looking through the lists and JIRA it looks like there has been
>> some work to add highlighting to DisMax queries, but that things seem to
>> be stalled waiting for a more modular approach (Search Components). Is
>> that a fair assessment, and, if so, can anyone suggest the best way to
>> get both highlighting and FunctionQuery at the same time?
> 
> I'm pleased to inform you that DisMax already provides highlighting, in
> exactly the same way as does StandardRequestHandler.

Good, thanks Mike - I must have been getting confused with the
wildcards/DisMax/highlighting situation.

alf


Re: Payloads for multiValued fields?

2007-10-24 Thread Alf Eaton
Yonik Seeley wrote:
> On 8/16/07, Alf Eaton <[EMAIL PROTECTED]> wrote:
>> On 16 Aug 2007, at 17:20, Alf Eaton wrote:
>>
>>> When searching a multiValued field, is it possible to know which of
>>> the multiple fields the match was in?
>>>
>>> For example if I have an index of documents, each of which has
>>> multiple image captions stored in separate fields, I'd like to be
>>> able to link from the search results to the caption in the original
>>> document.
>>>
>>> One possibility could be attaching metadata to a field, similar to
>>> payloads for terms. At the moment all I can think of is adding
>>> metadata inside the stored field and stripping that out when it's
>>> indexed and displayed, but that's not ideal.
>> Actually on reflection all this would need would be for the
>> Highlighter to add a field to the response, saying which item of the
>> multiValued field the match was in. Is that possible?
> 
> Could you perhaps index the captions as
> #1 this is the first caption
> #2 this is the second caption
> 
> And then just look for #n in the highlighted results?
> For display, you could also strip out the #n in the captions.
> 

This was working ok for a while, but there's a problem: the highlighter
doesn't return the whole caption - just the highlighted part - so
sometimes the #n at the start of the caption field doesn't get returned
and isn't available. Any other ideas? Perhaps there's a way for the
response to say which fields of each document were matched?

alf



Re: Payloads for multiValued fields?

2007-10-24 Thread Alf Eaton
Yonik Seeley wrote:
> On 10/24/07, Alf Eaton <[EMAIL PROTECTED]> wrote:
>> Yonik Seeley wrote:
>>> Could you perhaps index the captions as
>>> #1 this is the first caption
>>> #2 this is the second caption
>>>
>>> And then just look for #n in the highlighted results?
>>> For display, you could also strip out the #n in the captions.
>>>
>> This was working ok for a while, but there's a problem: the highlighter
>> doesn't return the whole caption - just the highlighted part - so
>> sometimes the #n at the start of the caption field doesn't get returned
>> and isn't available. Any other ideas? Perhaps there's a way for the
>> response to say which fields of each document were matched?
> 
> Perhaps try hl.fragsize=0
> 
> http://wiki.apache.org/solr/HighlightingParameters

Yes, I was just trying that this morning and it's an improvement, though
not ideal if the field contains a lot of text (in other words it's still
a suboptimal workaround).

I do think it might be useful for the response to contain an element
saying which fields were matched by the query, including which
sub-sections of a multi-valued field were matched.

alf
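
For reference, a request using hl.fragsize=0 could be assembled roughly as
below. The base URL and the field name "caption" are assumptions for
illustration; hl, hl.fl and hl.fragsize are the highlighting parameters
discussed in this thread and on the wiki page linked above.

```python
from urllib.parse import urlencode


def highlight_url(base_url, query, field):
    """Build a select URL that asks for whole-field highlighting."""
    params = {
        "q": query,
        "hl": "true",        # switch highlighting on
        "hl.fl": field,      # field to highlight
        "hl.fragsize": 0,    # 0 means do not fragment the field
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical host and field name
print(highlight_url("http://localhost:8983/solr/select", "tony blair", "caption"))
```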


Empty field error when boosting a dismax query using bf

2007-10-24 Thread Alf Eaton
I'm trying to use the bf parameter to boost a dismax query based on the value 
of a certain (integer) field. The trouble is that for some of the documents 
this field is empty (rather than zero), which means that there's an error when 
using the bf parameter:

-
select?q=query+string&qf=body&qt=dismax&bf=intfield
-

java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
    at java.lang.Integer.parseInt(Integer.java:468)
    at java.lang.Integer.parseInt(Integer.java:497)
    at org.apache.lucene.search.FieldCacheImpl$3.parseInt(FieldCacheImpl.java:148)
    at org.apache.lucene.search.FieldCacheImpl$7.createValue(FieldCacheImpl.java:261)
    at org.apache.lucene.search.FieldCacheImpl$Cache.get(FieldCacheImpl.java:72)
    at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:244)
    at org.apache.lucene.search.FieldCacheImpl.getInts(FieldCacheImpl.java:238)
    at org.apache.solr.search.function.IntFieldSource.getValues(IntFieldSource.java:50)
    at org.apache.solr.search.function.FunctionQuery$AllScorer.<init>(FunctionQuery.java:103)
    at org.apache.solr.search.function.FunctionQuery$FunctionWeight.scorer(FunctionQuery.java:81)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight.scorer(BooleanQuery.java:233)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:143)
    at org.apache.lucene.search.Searcher.search(Searcher.java:118)
    at org.apache.lucene.search.Searcher.search(Searcher.java:97)
    at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:894)
    at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:810)
    at org.apache.solr.search.SolrIndexSearcher.getDocList(SolrIndexSearcher.java:701)
    at org.apache.solr.handler.DisMaxRequestHandler.handleRequestBody(DisMaxRequestHandler.java:319)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:78)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:811)
    at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:64)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:200)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
    at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
    at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
    at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:80)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
    at java.lang.Thread.run(Thread.java:595)

-

Is there a way round this?

alf
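
The trace shows Integer.parseInt failing on an empty string, so one possible
way round it is to keep empty strings out of the index in the first place.
The sketch below cleans a document before it is sent for indexing. It assumes
that treating an empty value as 0 is acceptable for the boost; the function
name and document layout are hypothetical.

```python
def normalize_int_field(doc, field, default=0):
    """Return a copy of doc with an empty or missing integer field defaulted."""
    cleaned = dict(doc)  # leave the caller's document untouched
    value = cleaned.get(field)
    if value is None or str(value).strip() == "":
        cleaned[field] = default
    else:
        cleaned[field] = int(value)
    return cleaned


print(normalize_int_field({"id": "1", "intfield": ""}, "intfield"))
```

Documents already in the index with an empty value would still need to be
reindexed for this to help.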


Re: Payloads for multiValued fields?

2007-10-24 Thread Alf Eaton
Mike Klaas wrote:
> On 24-Oct-07, at 7:10 AM, Alf Eaton wrote:
>> Yes, I was just trying that this morning and it's an improvement, though
>> not ideal if the field contains a lot of text (in other words it's still
>> a suboptimal workaround).
>>
>> I do think it might be useful for the response to contain an element
>> saying which fields were matched by the query, including which
>> sub-sections of a multi-valued field were matched.
> 
> This isn't readily-accessible information.   Text search engines work by
> storing a list of documents and occurrence frequency for each document
> _per term_.  At that point, the information about the structure of the
> document is not available.

The highlighting engine seems to know which fields were matched by the
query though - enough to be able to use hl.requireFieldMatch to only
return snippets from matched fields. The highlighter seems to have a
small problem with snippets reaching across multivalued fields, but if
that was sorted out then in theory the highlighter should be able to
tell you which field, and which of the multiple values, was matched, no?

> Have you considered storing each section as a separate Solr Document?

I have considered this - in theory it would be easy enough to create a
separate index just for these items, but it adds an extra lump of
complexity to the search engine that I'd rather avoid. The workaround of
adding a marked-up value to the indexed field, setting hl.fragsize to 0
and parsing out the marked-up value from the highlighted fragment should
be good enough for now.

alf
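
For comparison, the separate-document approach Mike suggests could be
prepared along these lines: each caption becomes its own document that points
back to its parent, so a hit identifies the caption directly. All field names
and the id scheme here are assumptions, not anything from an existing schema.

```python
def caption_documents(doc_id, captions):
    """Turn one document's captions into one small document per caption."""
    return [
        {
            "id": f"{doc_id}-caption-{n}",  # hypothetical id scheme
            "parent_id": doc_id,            # link back to the full document
            "caption_number": n,            # 1-based position in the parent
            "caption": text,
        }
        for n, text in enumerate(captions, start=1)
    ]


for doc in caption_documents("article42", ["first caption", "second caption"]):
    print(doc)
```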


Re: Payloads for multiValued fields?

2007-10-25 Thread Alf Eaton
Alf Eaton wrote:
> Mike Klaas wrote:
>> On 24-Oct-07, at 7:10 AM, Alf Eaton wrote:
>>> Yes, I was just trying that this morning and it's an improvement, though
>>> not ideal if the field contains a lot of text (in other words it's still
>>> a suboptimal workaround).
>>>
>>> I do think it might be useful for the response to contain an element
>>> saying which fields were matched by the query, including which
>>> sub-sections of a multi-valued field were matched.
>> This isn't readily-accessible information.   Text search engines work by
>> storing a list of documents and occurrence frequency for each document
>> _per term_.  At that point, the information about the structure of the
>> document is not available.
> 
> The highlighting engine seems to know which fields were matched by the
> query though - enough to be able to use hl.requireFieldMatch to only
> return snippets from matched fields. The highlighter seems to have a
> small problem with snippets reaching across multivalued fields, but if
> that was sorted out then in theory the highlighter should be able to
> tell you which field, and which of the multiple values, was matched, no?
> 
>> Have you considered storing each section as a separate Solr Document?
> 
> I have considered this - in theory it would be easy enough to create a
> separate index just for these items, but it adds an extra lump of
> complexity to the search engine that I'd rather avoid. The workaround of
> adding a marked-up value to the indexed field, setting hl.fragsize to 0
> and parsing out the marked-up value from the highlighted fragment should
> be good enough for now.
> 

Actually this is still a problem: with hl.fragsize set to 0 the highlighter
returns the whole of the multi-valued field, with all of the items
lumped together, so there really is no way to know reliably which of the
multiple values was matched.

Maybe it will be necessary to build a separate index after all.

alf