breadcrumb in Solr

2007-08-16 Thread Jae Joo
Hi,

I am looking for a way to have "breadcrumbs".
Is there any way to get that kind of information from a Solr search result?


Thanks,

Jae Joo


Function Queries

2007-08-16 Thread Yakn

Hello all, I am struggling with FunctionQueries in Solr. Thanks in advance
for taking the time to read this and answer my questions. There doesn't seem
to be a "how to" page anywhere. I have been to these sites:

http://wiki.apache.org/solr/FunctionQuery

http://wiki.apache.org/solr/DisMaxRequestHandler#head-14b9ca618089829d139e6f3d6f52ff63e22a80d1

I have also searched through the forums for the answer, but have found none
on what I am asking below. 

What I have come to realize is that depending on whether certain
parameters are present, dismax will either error out or the function query
will not work. One example: if mm is left blank in solrconfig.xml rather
than commented out, it throws a NumberFormatException. Another example:
without something in qf, a query using qt=dismax in the request string
returns no results.

So what I am really looking for here is the proper way to set up the whole
solrconfig.xml for the dismax request handler. It seems that I am somehow
missing something. The way I understand it right now is this: all the
fields that will be searched on, and on which a function query will be
used, need to be in the qf parameter. For the function query itself, I have
just a field called importancerank, which is a float type field. I do not
use ord() or rord() or linear() etc., because I just want to take the value
of that field and add it to the score. I also have tie set to 0.01 and
echoParams set to explicit. These are the only parameters I have set up;
the rest, such as pf, ps, q.alt, and mm, are commented out. Also, what is
fl? I could not find any documentation on it.

What currently happens for me is that when I put the qt=dismax parameter in
my query request string, I get exactly the same results as if I didn't,
meaning it didn't appear to affect the ranking at all. What other
parameters do I have to fill out in the request handler to make this work?
What might be wrong in my understanding of how things work? Thanks again
for reading through this and replying to my questions. Another thing that
would be helpful is to see a complete solrconfig.xml for the dismax request
handler. I have only read about bits of it, and I think seeing a full one
that actually works would be very helpful. Thanks again.

Mike
-- 
View this message in context: 
http://www.nabble.com/Function-Queries-tf4280039.html#a12182596
Sent from the Solr - User mailing list archive at Nabble.com.



Payloads for multiValued fields?

2007-08-16 Thread Alf Eaton
When searching a multiValued field, is it possible to know which of  
the multiple fields the match was in?


For example if I have an index of documents, each of which has  
multiple image captions stored in separate fields, I'd like to be  
able to link from the search results to the caption in the original  
document.


One possibility could be attaching metadata to a field, similar to  
payloads for terms. At the moment all I can think of is adding  
metadata inside the stored field and stripping that out when it's  
indexed and displayed, but that's not ideal.


alf.


Re: Payloads for multiValued fields?

2007-08-16 Thread Alf Eaton


On 16 Aug 2007, at 17:20, Alf Eaton wrote:

When searching a multiValued field, is it possible to know which of  
the multiple fields the match was in?


For example if I have an index of documents, each of which has  
multiple image captions stored in separate fields, I'd like to be  
able to link from the search results to the caption in the original  
document.


One possibility could be attaching metadata to a field, similar to  
payloads for terms. At the moment all I can think of is adding  
metadata inside the stored field and stripping that out when it's  
indexed and displayed, but that's not ideal.


Actually, on reflection, all this would need is for the Highlighter to add
a field to the response saying which item of the multiValued field the
match was in. Is that possible?


alf.




Re: Payloads for multiValued fields?

2007-08-16 Thread Yonik Seeley
On 8/16/07, Alf Eaton <[EMAIL PROTECTED]> wrote:
>
> On 16 Aug 2007, at 17:20, Alf Eaton wrote:
>
> > When searching a multiValued field, is it possible to know which of
> > the multiple fields the match was in?
> >
> > For example if I have an index of documents, each of which has
> > multiple image captions stored in separate fields, I'd like to be
> > able to link from the search results to the caption in the original
> > document.
> >
> > One possibility could be attaching metadata to a field, similar to
> > payloads for terms. At the moment all I can think of is adding
> > metadata inside the stored field and stripping that out when it's
> > indexed and displayed, but that's not ideal.
>
> Actually on reflection all this would need would be for the
> Highlighter to add a field to the response, saying which item of the
> multiValued field the match was in. Is that possible?

Could you perhaps index the captions as
#1 this is the first caption
#2 this is the second caption

And then just look for #n in the highlighted results?
For display, you could also strip out the #n in the captions.

-Yonik
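Yonik's numbered-caption scheme could be sketched as follows (illustrative only; the marker format and helper names are assumptions, not Solr APIs):

```python
import re

def add_marker(position, caption):
    """Prefix a caption with its 1-based position before indexing."""
    return f"#{position} {caption}"

def strip_marker(stored_caption):
    """Remove the #n marker again for display."""
    return re.sub(r"^#\d+\s+", "", stored_caption)

def marker_position(highlighted_caption):
    """Recover which caption a highlighted snippet came from."""
    match = re.match(r"^#(\d+)\s", highlighted_caption)
    return int(match.group(1)) if match else None

captions = ["this is the first caption", "this is the second caption"]
indexed = [add_marker(i + 1, c) for i, c in enumerate(captions)]
```

Whether the #n token is searchable depends on how the field's analyzer treats it, as Alf notes below.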


Re: breadcrumb in Solr

2007-08-16 Thread Matthew Runo

What do you mean by "breadcrumbs"?

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Aug 16, 2007, at 7:03 AM, Jae Joo wrote:


Hi,

I am looking for a way to have "breadcrumbs".
Is there any way to get that kind of information from a Solr search
result?



Thanks,

Jae Joo




Re: Payloads for multiValued fields?

2007-08-16 Thread Alf Eaton

On 16 Aug 2007, at 17:34, Yonik Seeley wrote:


On 8/16/07, Alf Eaton <[EMAIL PROTECTED]> wrote:


On 16 Aug 2007, at 17:20, Alf Eaton wrote:


When searching a multiValued field, is it possible to know which of
the multiple fields the match was in?

For example if I have an index of documents, each of which has
multiple image captions stored in separate fields, I'd like to be
able to link from the search results to the caption in the original
document.

One possibility could be attaching metadata to a field, similar to
payloads for terms. At the moment all I can think of is adding
metadata inside the stored field and stripping that out when it's
indexed and displayed, but that's not ideal.


Actually, on reflection, all this would need is for the Highlighter to add
a field to the response saying which item of the multiValued field the
match was in. Is that possible?


Could you perhaps index the captions as
#1 this is the first caption
#2 this is the second caption

And then just look for #n in the highlighted results?
For display, you could also strip out the #n in the captions.


I think that would probably work, yes - '#1' wouldn't be indexed so  
wouldn't affect the search results.


Thanks,
alf.


String collapsing

2007-08-16 Thread Lance Norskog
Does Solr have a processing tool that collapses, say, "E L V I S" to
"Elvis", or "D.N.A." to "DNA"?





Re: String collapsing

2007-08-16 Thread Yonik Seeley
On 8/16/07, Lance Norskog <[EMAIL PROTECTED]> wrote:
> Does Solr have a processing tool that collapses, say, "E L V I S" to
> "Elvis", or "D.N.A." to "DNA"?

WordDelimiterFilter can be configured to collapse things
like "D.N.A." to "DNA", but not if the letters are space-separated like
"D N A".

-Yonik
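For reference, a schema.xml field type along these lines should handle the "D.N.A." case (the exact attribute values are a sketch to experiment with, not a tested recipe):

```xml
<fieldType name="text_collapse" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- catenateWords joins the parts split on "." so D.N.A. -> DNA;
         space-separated letters arrive as separate tokens, so they
         are out of reach of this filter, as Yonik notes -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```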


Replacing existing documents in the index

2007-08-16 Thread Lance Norskog
Hi-
 
We recrawl the same places and update blindly without checking if a document
is already in the index.   We have a use case where we would like to delete
documents (porn) and have them stay deleted. To implement this use case now,
we would need to check the existence of the document and check for a
'deleted' flag. Or, we would maintain a separate database of deleted
documents that we check against.
 
A more efficient way to do this would be to have a 'do not delete' flag in
the document. Delete failures are currently ignored and they would continue
to be ignored.
 
Is this a worthwhile addition to 1.3 or 1.4?
 
Thanks for your time,
 
Lance


Re: Replacing existing documents in the index

2007-08-16 Thread Yonik Seeley
It sounds like it might be more efficient to implement this at the
crawler level, to short-circuit crawling whole sites.  Barring that, a
separate database sounds more flexible.
Non-deletable docs don't sound like something that should be a
general feature.

However, one would probably be able to implement custom logic to do
this using an update-processor plugin (which should be in the next version
of Solr).

-Yonik

On 8/16/07, Lance Norskog <[EMAIL PROTECTED]> wrote:
> Hi-
>
> We recrawl the same places and update blindly without checking if a document
> is already in the index.   We have a use case where we would like to delete
> documents (porn) and have them stay deleted. To implement this use case now,
> we would need to check the existence of the document and check for a
> 'deleted' flag. Or, we would maintain a separate database of deleted
> documents that we check against.
>
> A more efficient way to do this would be to have a 'do not delete' flag in
> the document. Delete failures are currently ignored and they would continue
> to be ignored.
>
> Is this a worthwhile addition to 1.3 or 1.4?
>
> Thanks for your time,
>
> Lance
>
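The separate-database approach Yonik suggests could be as lightweight as a persisted blocklist consulted before re-adding recrawled documents (a sketch only; the file format and names are assumptions):

```python
def load_blocklist(path):
    """Read one banned document id per line; a missing file means empty."""
    try:
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def filter_recrawled(docs, blocklist):
    """Drop crawled docs whose id is permanently banned, before indexing."""
    return [d for d in docs if d["id"] not in blocklist]

blocklist = {"doc-13", "doc-99"}
docs = [{"id": "doc-12"}, {"id": "doc-13"}]
```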


Re: Function Queries

2007-08-16 Thread Pieter Berkel
Hi Yakn,

On 17/08/07, Yakn <[EMAIL PROTECTED]> wrote:

> One example is that if you have mm being blank in the solrConfig.xml
> and not commented out, then it will throw a NumberFormatException.


The required format of the mm field is described in more detail here:
http://lucene.apache.org/solr/api/org/apache/solr/util/doc-files/min-should-match.html
The parser is pretty fussy about how this field is formatted; when a value
is not specified, the default is "100%", which means "match documents that
contain every term in the query".
Perhaps it would be a good idea to add a simple sanity check to mm, testing
for an empty string?
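As a concrete illustration, an mm default in solrconfig.xml might look like this (the value is just an example; note that the < in the conditional format must be XML-escaped):

```xml
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <!-- with 2 or fewer optional clauses, all must match;
         above 2, at least 75% must match -->
    <str name="mm">2&lt;75%</str>
  </lst>
</requestHandler>
```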

> Another example is that without something in qf, then the query, using
> qt=dismax in the query request string, does not return any results.


qf is a required parameter and is needed (in combination with the q param)
to construct a Lucene query; it won't work without it (as you've
discovered). The default values in solrconfig.xml serve as an example, and
you'll most probably need to change them to match your schema (the value in
solrconfig.xml is only used if qf is not set in your request query
string).

> So, what I am really looking for here is the proper way to do the whole
> solrconfig.xml for the dismax request handler. It seems that I am somehow
> missing something.


I think the whole point of the values set in the solrconfig.xml included
with the distribution is to serve as a guide for you to try with the
examples provided.  In general, most of these default values (those that
don't refer to specific fields) can be left unmodified and dismax requests
will still work fine; however, you can change and tweak these parameters to
suit your particular requirements if necessary.


> The way that I understand it right now is this, for all
> the fields that will be searched on and a function query will be used,
> they need to be in the qf parameter.


Only fields in which you want to match terms from "q" need to be listed in
"qf"; it is not necessary to list fields used in a function query there.


> For the function query itself, I have just a
> field called importancerank which is a float type field. I do not use
> ord() or rord() or linear() etc., because I just want to take the value
> of that field and add it to the score.


I haven't tried this myself but it should be as simple as adding the
following to your query string: bf=importancerank
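Assuming the handler is registered as "dismax", a request along these lines would exercise it (the parameter values here are illustrative, not from the original thread):

```python
from urllib.parse import urlencode

# Build a query string for Solr's dismax handler; field names are examples.
params = {
    "qt": "dismax",          # select the dismax request handler
    "q": "laptop",           # user query, matched against the qf fields
    "qf": "name description",
    "bf": "importancerank",  # boost function: add this field's value to the score
    "fl": "id,name,score",   # fl = which stored fields to return
    "debugQuery": "on",      # show score explanations, useful to verify bf
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
```

Turning on debugQuery is a handy way to confirm that the bf term is actually contributing to the score.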


> I also have a 0.01 in tie. I have echoParams
> set to explicit. These are the only parameters that I have set up. I have
> the rest commented out such as pf, ps, q.alt, and mm. Also, what is fl? I
> could not find any documentation on that.


A lot of parameters (including fl) are common and used by both the
Standard and DisMax request handlers, so you should take a look at:

> What happens currently for me is that when I put the qt=dismax parameter
> in my query request string, I get exactly the same results as if I
> didn't, meaning it didn't appear to sort it at all. What other parameters
> do I have to fill out in the request handler to make this work? What
> might I have done wrong in my thinking of how things work?


You'll have to provide more information about your query (e.g. query
string parameters, field definitions from schema.xml, the contents of your
dismax <requestHandler> definition in solrconfig.xml) in order to see
what's going on.

> Another thing that would be helpful is to see a whole solrConfig schema
> for the dismax request handler. I have only read about bits of it and I
> think that to get a view of a full one that actually works would be very
> helpful. Thanks again.


This is the solrconfig.xml that I mentioned earlier; it is provided with
the Solr distribution (in /example/solr/conf/):
http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solrconfig.xml

Hope this helps,
Piete


Re: solr + carrot2

2007-08-16 Thread Pieter Berkel
Any updates on this?  It certainly would be quite interesting to see how
well carrot2 clustering can be integrated with solr; I suppose it's a
fairly similar concept to simple faceting (maybe another candidate for a
SOLR-281 component?).

One concern I have is that the additional processing required at query time
would make the whole operation significantly slower (which is something I'd
like to avoid).  I've been wondering if it might be possible to calculate
(and store) clustering information at index time; however, since carrot2
seems to use the query term and result set to create the clustering info,
this doesn't appear to be a practical approach.

In a similar vein, I'm also looking at methods of term extraction and
automatic keyword generation from indexed documents.  I've been
experimenting with MoreLikeThis and the values returned by the
"mlt.interestingTerms" parameter, which has potential but needs a bit of
refinement before it can be truly useful.  Has anybody else discovered
clever or useful methods of term extraction using solr?

Piete



On 02/08/07, Burkamp, Christian <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> In my opinion the results from carrot2 clustering could be used in the
> same way that facet results are used.
> That's the way I'm planning to use them.
> The user of the search application can narrow the search by selecting one
> of the facets presented in the search result presentation. These facets
> could come from metadata (classic facets) or from dynamically computed
> categories which are results from carrot2.
>
> From this point of view it would be most convenient to have the
> integration for carrot2 directly in the StandardRequestHandler. This leaves
> questions open like "how should filters for categories from carrot2 be
> formulated".
>
> Is anybody already using carrot2 with solr?
>
> -- Christian
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto: [EMAIL PROTECTED] On Behalf Of
> Stanislaw Osinski
> Sent: Wednesday, 1 August 2007 14:01
> To: solr-user@lucene.apache.org
> Subject: Re: solr + carrot2
>
>
> >
> > Has anyone looked into using carrot2 clustering with solr?
> >
> > I know this is integrated with nutch:
> >
> > http://lucene.apache.org/nutch/apidocs/org/apache/nutch/clustering/car
> > rot2/Clusterer.html
> >
> > It looks like carrot has support to read results from a solr index:
> >
> > http://demo.carrot2.org/head/api/org/carrot2/input/solr/package-summar
> > y.html
> >
> > But I'm hoping for something that returns clustered results from solr.
> >
> > Carrot also has something to read lucene indexes:
> >
> > http://demo.carrot2.org/head/api/org/carrot2/input/lucene/package-summ
> > ary.html
> >
> > Any pointers or experience before I (may) delve into this?
> >
>
> First of all, apologies for a delayed response. I'm one of Carrot2
> developers and indeed we did some Solr integration, but from Carrot2's
> perspective, which I guess will not be directly useful in this case. If you
> have any ideas for integration, questions or requests for changes/patches,
> feel free to post on Carrot2 mailing list or file an issue for us.
>
> Thanks,
>
> Staszek
>


synchronizing slave indexes in distributing collections

2007-08-16 Thread Yu-Hui Jin
Hi, there,

We want to use Solr's Collection Distribution. Here's a question regarding
recovery from failures of the scripts.  To my understanding:

* If the snappuller fails on a slave, we could implement something like
this: the master examines the status messages from all slaves and notifies
all slaves to execute snapinstaller only if all statuses are success.

* However, if snapinstaller then fails on a slave, there is really no
simple operation to roll back so that all slaves can still keep the same
old index. Besides, it is usually some hardware, network, or Solr problem
causing snapinstaller to fail, and that same problem may prevent any
rollback operation from executing, even if such an operation existed.

It seems possible to implement a two-phase-commit-like protocol to provide
automatic recovery and keep all slave indexes consistent at all times.
However, for one, I don't see a rollback operation for snapinstaller; and
two, this would definitely complicate the system.

So it looks like all we can do is monitor the logs and alert people to fix
the issue and rerun the scripts whenever failures occur. Is that the
correct understanding?


Thanks,

-Hui


Re: how to retrieve all the documents in an index?

2007-08-16 Thread Chris Hostetter

: Any of you know whether the new "q=*:*" query performs better than the
: get-around solutions like using a range query?  I would guess so, but I
: haven't looked into the Lucene implementation.

it's faster -- it has almost no work to do relative to the range query
version.



-Hoss



Re: synchronizing slave indexes in distributing collections

2007-08-16 Thread Chris Hostetter

: So it looks like all we can do is monitor the logs and alert people to
: fix the issue and rerun the scripts whenever failures occur. Is that the
: correct understanding?

I have *never* seen snappuller or snapinstaller fail (except during an
initial rollout of Solr when I forgot to set up the necessary ssh keys).

I suppose we could add an option to snapinstaller to support explicitly
installing a snapshot by name ... then if you detect that slave Z didn't
load the latest snapshot, you could always tell the other slaves to
snapinstall whatever older version slave Z is still using -- but frankly
that seems a little silly -- not to mention that if you couldn't load the
snapshot into Z, odds are Z isn't responding to queries either.

a better course of action might just be to have an automated system which
monitors the distribution status info on the master, takes any slaves that
don't update properly out of your load balancer's rotation, and notifies
people to look into it.



-Hoss



Re: solr + carrot2

2007-08-16 Thread Alf Eaton

Pieter Berkel wrote:
> In a similar vein, I'm also looking at methods of term extraction and
> automatic keyword generation from indexed documents.  I've been
> experimenting with MoreLikeThis and the values returned by the
> "mlt.interestingTerms" parameter, which has potential but needs a bit of
> refinement before it can be truly useful.  Has anybody else discovered
> clever or useful methods of term extraction using solr?


I've been using mlt.interestingTerms and it works well. For example,
fetching the mlt.interestingTerms for each of the top 20 search results,
combining the scores, and using the top words to suggest additional
search terms.


alf.
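Alf's score-combining step might look roughly like this, assuming the mlt.interestingTerms=details output has already been parsed into (term, score) pairs per result (the function name and sample data are assumptions for illustration):

```python
from collections import Counter

def suggest_terms(interesting_terms_per_doc, top_n=5):
    """Sum interestingTerms scores across the top search results and
    return the highest-scoring terms as additional query suggestions."""
    combined = Counter()
    for pairs in interesting_terms_per_doc:
        for term, score in pairs:
            combined[term] += score
    return [term for term, _ in combined.most_common(top_n)]

# e.g. parsed from the top results' mlt.interestingTerms=details output
sample = [
    [("genome", 2.0), ("dna", 1.5)],
    [("dna", 1.0), ("protein", 0.5)],
]
```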