solr index reusable with nutch?

2006-12-13 Thread Thorsten Scherler
Hi all,

is it possible to directly use the solr index in nutch?

My client is creating a portal search based on nutch. In this portal
there is as well my project and ATM I prefer to go with solr instead of
nutch since it is much better for my use case.

Now the question is whether the portal search engine could use the solr
index for my part of the portal.

Can somebody point me to related documentation?

TIA

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)



Re: solr index reusable with nutch?

2006-12-13 Thread Otis Gospodnetic
Hi,

Solr should be able to search any Lucene index, not just those created by Solr 
itself, as long as you configure it properly via schema.xml.  Thus, you should 
be able to use Solr to search an index created by Nutch.  Haven't tried it.  It 
would be nice if you could contribute the configuration for doing this.

Otis







'New' "Date Math" parsing code in Solr

2006-12-13 Thread Chris Hostetter

(I have this nasty habit of committing cool things to Solr that should be
announced on solr-user, and then deciding I'll wait until they are in a
nightly snapshot before I send an email about them -- and then forgetting
that I never sent the mail).

A while back I added some functionality to the DateField class which
extends its parsing abilities so that it can recognize strings like
"NOW", "NOW+1DAY", "NOW+1DAY-3HOURS", and even "NOW/MONTH+3MONTHS" which
means round down to the nearest month, then add three months.
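
To make the left-to-right evaluation concrete, here is a minimal Python sketch of how such expressions could be evaluated. It is not Solr's DateMathParser: the function names, the handful of units supported, and the day-clamping rule for month arithmetic are all assumptions made for illustration.

```python
import calendar
import re
from datetime import datetime, timedelta

# One operation: "+N" / "-N" followed by a unit, or "/" followed by a unit.
# A trailing "S" (DAYS, MONTHS, ...) is accepted. Hypothetical grammar.
_TOKEN = re.compile(r"([+-]\d+|/)(YEAR|MONTH|DAY|HOUR|MINUTE|SECOND)S?")
_ORDER = ["year", "month", "day", "hour", "minute", "second", "microsecond"]
_FLOOR = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0, "microsecond": 0}


def _round_down(t, unit):
    """Reset every field smaller than `unit` to its minimum value."""
    keep = _ORDER.index(unit.lower()) + 1
    return t.replace(**{name: _FLOOR[name] for name in _ORDER[keep:]})


def _add(t, amount, unit):
    """Add `amount` units; month/year math clamps the day (sketch assumption)."""
    if unit in ("YEAR", "MONTH"):
        months = t.year * 12 + (t.month - 1) + amount * (12 if unit == "YEAR" else 1)
        year, month0 = divmod(months, 12)
        last_day = calendar.monthrange(year, month0 + 1)[1]
        return t.replace(year=year, month=month0 + 1, day=min(t.day, last_day))
    seconds = {"DAY": 86400, "HOUR": 3600, "MINUTE": 60, "SECOND": 1}[unit]
    return t + timedelta(seconds=amount * seconds)


def eval_date_math(expr, now):
    """Apply each operation after NOW in order, left to right."""
    if not expr.startswith("NOW"):
        raise ValueError("expression must start with NOW")
    rest, pos, t = expr[3:], 0, now
    while pos < len(rest):
        m = _TOKEN.match(rest, pos)
        if m is None:
            raise ValueError("cannot parse %r at offset %d" % (rest, pos))
        op, unit = m.groups()
        t = _round_down(t, unit) if op == "/" else _add(t, int(op), unit)
        pos = m.end()
    return t


now = datetime(2006, 12, 13, 8, 26, 51)
print(eval_date_math("NOW/MONTH+3MONTHS", now))  # rounds to Dec 1, then adds 3 months
```

Because operations are applied in order, "NOW/MONTH+3MONTHS" rounds first and adds second; swapping the two operations would generally give a different result.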

You can play with this syntax and see exactly what it does with various
inputs by looking at the "parsedquery" debug info for any date field. If
you are running the example config from Solr, something like this will
work...

http://localhost:8983/solr/select?version=2.1&q=field_dt%3A%5BNOW+TO+NOW%2FDAY%2B1MONTH%5D&start=0&rows=0&debugQuery=on
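
If you would rather not URL-encode the range query by hand, the same request can be assembled with Python's standard library. This is only a sketch: the helper name `debug_query_url` is made up, and the host, port, and parameters are simply those of the URL above.

```python
from urllib.parse import urlencode


def debug_query_url(query, base="http://localhost:8983/solr/select"):
    """Build a select URL that returns no rows, only the debug info."""
    params = [
        ("version", "2.1"),
        ("q", query),
        ("start", 0),
        ("rows", 0),
        ("debugQuery", "on"),
    ]
    # urlencode escapes ':', '[', ']', '/', '+' and turns spaces into '+'
    return base + "?" + urlencode(params)


url = debug_query_url("field_dt:[NOW TO NOW/DAY+1MONTH]")
print(url)
```

Note that the literal "+" in the date math has to be escaped as %2B, otherwise it is decoded as a space on the server side.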

This syntax was added to not only make it easier to run quick data
inspection queries like "startDate:[NOW TO *]" but also so that *relative*
date based queries can be included directly in the default query options
for request handlers configured in solrconfig.xml.  For example, if you
only want to let people search articles less than a month old, you could
put...

 pubDate:[NOW/DAY-1MONTH]

...in your requestHandler config, and that cached filter query will be
reusable for 24 hours.
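
The 24-hour reuse follows from the rounding: every request made on the same day resolves NOW/DAY to the same instant. A small Python illustration of just that rounding step (the helper name is hypothetical, and how Solr actually keys its filter cache is not shown here):

```python
from datetime import datetime


def round_to_day(t):
    """Equivalent of the "/DAY" step: drop the time-of-day part."""
    return t.replace(hour=0, minute=0, second=0, microsecond=0)


morning = datetime(2006, 12, 13, 8, 26, 51)
evening = datetime(2006, 12, 13, 23, 32, 11)
next_day = datetime(2006, 12, 14, 0, 0, 1)

# Same day: identical boundary, so the resolved range is identical too.
same_boundary = round_to_day(morning) == round_to_day(evening)
# Next day: the boundary moves, so a new filter has to be computed.
moved_boundary = round_to_day(next_day) != round_to_day(morning)
print(same_boundary, moved_boundary)
```

Without the rounding, NOW would resolve to a different instant on every request and no two requests would share a boundary.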

This syntax is supported anywhere Solr parses a DateField, so it can even
be used in  values when sending  messages.

More info can be found in the javadocs...

http://incubator.apache.org/solr/docs/api/org/apache/solr/schema/DateField.html
http://incubator.apache.org/solr/docs/api/org/apache/solr/util/DateMathParser.html


-Hoss



Re: automatic index time field?

2006-12-13 Thread Chris Hostetter

: Is there a way to automatically set a field when a document is indexed?
: Specifically, I'd like to have a date field updated to the current time when
: a document is indexed.

Your message reminded me that i never announced the new "Date Math"
parsing code, which does let you say something like...

  NOW

...in your  calls, but there is currently no way to have
"default" values for fields in your schema ... it's on the wishlist, but
no one is currently pursuing it as far as i know.

: I have a bunch of stuff stored in SQL, my plan is to:
:  * note the current time

...the gist of your plan is sound, but to eliminate possible headaches
from clock sync issues, instead of getting the "current time" from
somewhere, i would query your index for all docs (of the type
you are interested in) sorted by date desc, then note the date of the
newest doc and later delete all docs with dates up to and including that
one.
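
A rough Python sketch of that cutoff approach, using an in-memory list in place of the index. The field names ("type", "date") and both helper functions are hypothetical; against a real index the first step would be a sorted query and the second a delete-by-query.

```python
from datetime import datetime


def newest_date(docs, doc_type):
    """Date of the newest indexed doc of this type, or None if there is none."""
    dates = sorted((d["date"] for d in docs if d["type"] == doc_type), reverse=True)
    return dates[0] if dates else None


def purge_up_to(docs, doc_type, cutoff):
    """Drop docs of this type dated at or before the cutoff."""
    if cutoff is None:
        return list(docs)
    return [d for d in docs if not (d["type"] == doc_type and d["date"] <= cutoff)]


index = [
    {"id": 1, "type": "article", "date": datetime(2006, 12, 1)},
    {"id": 2, "type": "article", "date": datetime(2006, 12, 10)},
    {"id": 3, "type": "image", "date": datetime(2006, 12, 5)},
]

cutoff = newest_date(index, "article")  # taken from the index, not a local clock
index.append({"id": 4, "type": "article", "date": datetime(2006, 12, 13)})  # re-indexed
index = purge_up_to(index, "article", cutoff)
print([d["id"] for d in index])
```

The cutoff and the new documents' dates both come from the same clock (the indexer's), which is what avoids the sync problem.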

: My options are:
: 1) Send the index time along with the document.
: 2) extend UpdateHandler (DirectUpdateHandler2) to do this automatically
:
: 1) is the easiest but requires that everyone sending data sends a valid
: "index_time" field.
: 2) more complicated, but then we know everything has a valid "index_time"
: field.

As i said, you could just put "NOW" in all of your docs, but if you are
interested in pursuing option#2, the most general purpose and reusable
approach might be to add an optional default="value" attribute to the
 declarations in the schema.xml (relevant classes are SchemaField
and IndexSchema) and then modify the DocumentBuilder.getDoc method to
check for any default values of fields the Document doesn't already have
values for and add them .. then your timestamp field becomes...



..but you can also have other default fields...




...etc.
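
The default-filling step proposed above is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not the DocumentBuilder code: the function name, the field names, and the default values are all made up.

```python
def apply_defaults(doc, field_defaults):
    """Return a copy of doc with defaults added for fields it does not set."""
    completed = dict(doc)
    for name, default in field_defaults.items():
        if name not in completed:
            completed[name] = default
    return completed


# Hypothetical schema defaults: field name -> default value
field_defaults = {"index_time": "NOW", "popularity": "0"}

doc = {"id": "42", "popularity": "7"}
completed = apply_defaults(doc, field_defaults)
print(completed)
```

Values the client did send are left alone; only absent fields pick up the default, so a sender can still override the timestamp if needed.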


-Hoss



Re: Strange Sorting results on a Text Field

2006-12-13 Thread Tracey Jaquith

Despite considerations of stemming and such for "text"
type fields, is it the case that 
if we have a single value "text" type field,

will sorting work, though?

--tracey

On 9/11/06, Tom Weber <[EMAIL PROTECTED]> wrote:

  Thanks also for the "multiValued" explanation, this is useful for
my current application. But then, if I use this field and I ask for
sorting, how will the sorting be done, alphanumeric on the first
entry for this field ? Until now, I entered more than one entry by
separating them with a space in the same field, like text1 text2 text3.
 


Sorting is currently only supported when there is at most one value
(or token) per document.  This is a lucene restriction.

-Yonik




Re: automatic index time field?

2006-12-13 Thread ryan mckinley

thanks for the advice.  I implemented option #2, followed the directions on:
http://wiki.apache.org/solr/HowToContribute

and made:
 http://issues.apache.org/jira/browse/SOLR-82

The only change I might make is to have the schema store if it has fields
with default values so that DocumentBuilder.getDoc() does not cycle through
all fields if there aren't any.
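
For illustration, that optimization could look roughly like this in Python. It is a sketch of the idea only, not the SOLR-82 patch; the class and attribute names are hypothetical.

```python
class SchemaDefaults:
    """Collects fields that declare a default once, when the schema is loaded."""

    def __init__(self, fields):
        # fields: mapping of field name -> default value, or None for "no default"
        self._defaults = {name: v for name, v in fields.items() if v is not None}
        self.has_defaults = bool(self._defaults)

    def complete(self, doc):
        if not self.has_defaults:
            return doc  # fast path: nothing to scan per document
        # Values already present in the doc win over the defaults.
        return {**self._defaults, **doc}


plain = SchemaDefaults({"id": None, "title": None})
stamped = SchemaDefaults({"id": None, "index_time": "NOW"})

doc = {"id": "1"}
print(plain.complete(doc), stamped.complete(doc))
```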

Thanks
ryan







Case sensitivity on hostnames and email addresses

2006-12-13 Thread Wade Leftwich
I've run into some unexpected case sensitivity on searches, at least
unexpected by me.

If you index a text field containing this sentence:

A sentence containing CamelCase words by [EMAIL PROTECTED] is found
at StudlyCaps.org

The document will be found by searching for "camelcase" but not for
"[EMAIL PROTECTED]" or "studlycaps.org".

This happens with the Standard or the DisMax query handler.

A bit of a problem for me, because I'm indexing a bunch of business
magazines, and domain names are frequently capitalized, often in CamelCase.

Is this maybe a bug? Or a WAD?

-- Wade Leftwich
Ithaca, NY



Re: Case sensitivity on hostnames and email addresses

2006-12-13 Thread Otis Gospodnetic
When indexing (and searching), make sure you are using an Analyzer that 
lower-cases (or upper-cases) tokens.
These are from Lucene, so Solr has them, too:
  ./src/java/org/apache/lucene/analysis/LowerCaseTokenizer.java
  ./src/java/org/apache/lucene/analysis/LowerCaseFilter.java
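
The important part is that the same normalization runs on both sides. A toy Python sketch of why, assuming a plain whitespace split (the functions are hypothetical stand-ins, not the Lucene classes listed above):

```python
def analyze(text, lowercase=True):
    """Whitespace-split, optionally lower-casing each token."""
    tokens = text.split()
    return [t.lower() for t in tokens] if lowercase else tokens


def matches(doc_text, query_text, lowercase=True):
    """True if every query token appears among the document's tokens."""
    indexed = set(analyze(doc_text, lowercase))
    return all(t in indexed for t in analyze(query_text, lowercase))


doc = "CamelCase words are found at StudlyCaps.org"
print(matches(doc, "studlycaps.org"))                   # both sides lower-cased
print(matches(doc, "studlycaps.org", lowercase=False))  # case-sensitive comparison
```

Lower-casing only at index time or only at query time leaves the two token streams in different forms, which is the usual cause of "unexpected" case sensitivity.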

Otis







Re: Case sensitivity on hostnames and email addresses

2006-12-13 Thread Walter Underwood
Also, avoid stemming URLs. I used a stemmer that turned my
"best.com" URL into "good.com". The Lucene StandardAnalyzer
works pretty hard to avoid that. --wunder




Re: Case sensitivity on hostnames and email addresses

2006-12-13 Thread Yonik Seeley

On 12/13/06, Wade Leftwich <[EMAIL PROTECTED]> wrote:

I've run into some unexpected case sensitivity on searches, at least
unexpected by me.

If you index a text field containing this sentence:

A sentence containing CamelCase words by [EMAIL PROTECTED] is found
at StudlyCaps.org

The document will be found by searching for "camelcase" but not for
"[EMAIL PROTECTED]" or "studlycaps.org".

This happens with the Standard or the DisMax query handler.

A bit of a problem for me, because I'm indexing a bunch of business
magazines, and domain names are frequently capitalized, often in CamelCase.


It's your text analysis configuration.
The WordDelimiterFilter is doing this... it's so "CamelCase" can be
found searching for "camelcase", "camel-case" or "camel case".
It does this by detecting all the word parts and then indexing them
separately as well as all catenated.  So "CamelCase" is indexed as
both "camelcase" and "camel case".
When searching, the WordDelimiterFilter is configured to split only,
so "camelcase", "camel-case", and "camel case" will all match.

When it hits something like [EMAIL PROTECTED], it would index it as
"upanddownmysitecom" and "up and down mysite com"
On the search side, a search of "[EMAIL PROTECTED]" is broken into
"upanddown mysite com" which doesn't match anything indexed.

There are a number of options, not limited to:
- create a new fieldtype and throw out the WordDelimiterFilter... the
  current "text" field type is for demonstration purposes only anyway.
  Solr, like Lucene, is meant to be customized.
- If you want to keep the camel-case flexibility, but not across "."
  and "-", then try using a letter tokenizer to throw away the
  non-letter characters first.
- create a specific filter for email or website addresses if no
  combination of existing filters does what you want.

Play around with the analysis tool on the admin page, it will help you
understand what's going on.
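
The index-side versus query-side mismatch described above can be reproduced with a toy model in Python. This is a deliberate simplification, not the WordDelimiterFilter itself: it ignores token positions and phrase matching, the function names are made up, and "UpAndDown@mysite.com" is a hypothetical stand-in for the redacted address.

```python
import re


def word_parts(token):
    """Split on non-alphanumerics and on case changes, then lower-case."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", token):
        parts.extend(re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+", chunk))
    return [p.lower() for p in parts]


def index_tokens(token):
    """Index side: the separate parts plus all parts catenated."""
    parts = word_parts(token)
    return set(parts) | {"".join(parts)}


def query_tokens(text):
    """Query side: split only, no catenation."""
    return [p for tok in text.split() for p in word_parts(tok)]


def matches(indexed_token, query):
    indexed = index_tokens(indexed_token)
    return all(q in indexed for q in query_tokens(query))


print(sorted(index_tokens("UpAndDown@mysite.com")))
print(query_tokens("upanddown@mysite.com"))
print(matches("CamelCase", "camel-case"), matches("UpAndDown@mysite.com", "upanddown@mysite.com"))
```

The query part "upanddown" is neither one of the indexed parts nor the full catenation, so the lower-cased address finds nothing even though "camelcase" style queries work.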

-Yonik


Re: Case sensitivity on hostnames and email addresses

2006-12-13 Thread Yonik Seeley

Oh, and yet another way to get around it (with its own trade-offs) is
to use something like fieldtype textTight in the example schema.xml,
which catenates all word parts in both the index analyzer and query
analyzer.

This would index as "upanddownmysitecom" and allow the following
queries to match:
"[EMAIL PROTECTED]", "[EMAIL PROTECTED]/com", "[EMAIL PROTECTED]"

The downside is that it would *not* allow "upanddown" or "UpAndDown" to match.
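
A toy Python model of that trade-off, assuming catenation simply means dropping every delimiter and lower-casing. The real textTight analyzer chain is not reproduced here, and the address is a hypothetical stand-in for the redacted one.

```python
import re


def catenate(text):
    """Drop all non-alphanumeric characters and lower-case what is left."""
    return re.sub(r"[^A-Za-z0-9]+", "", text).lower()


def matches(indexed_text, query):
    """Both sides are catenated, so only whole-value matches succeed."""
    return catenate(indexed_text) == catenate(query)


indexed = "UpAndDown@mysite.com"
print(catenate(indexed))
print(matches(indexed, "upanddown@mysite.com"), matches(indexed, "upanddown"))
```

Delimiter and case variants of the full address all collapse to the same token, while any partial query collapses to a different one.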

-Yonik




Re: Strange Sorting results on a Text Field

2006-12-13 Thread Chris Hostetter

: Despite considerations of stemming and such for "text"
: type fields, is it the case that
: if we have a single value "text" type field,
: will sorting work, though?

correct ... KeywordTokenizer with Filters of your choice should produce a
sortable string of whatever form you desire.
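
As a rough picture of what that produces: KeywordTokenizer emits the whole field value as a single token, so each document ends up with exactly one sort term. The Python below mimics that with trimming and lower-casing as the filters; those two filters and the sample titles are assumptions picked for the example.

```python
def sort_key(value):
    """Treat the entire field value as one token, trimmed and lower-cased."""
    return value.strip().lower()


titles = ["solr in Action", "  Lucene Basics", "nutch Crawling"]
ordered = sorted(titles, key=sort_key)
print(ordered)
```

A tokenized "text" field would instead yield several terms per document ("solr", "in", "action"), which is the case the sorting restriction rules out.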



-Hoss



Re: solr index reusable with nutch?

2006-12-13 Thread Thorsten Scherler
On Wed, 2006-12-13 at 07:45 -0800, Otis Gospodnetic wrote:
> Hi,
> 
> Solr should be able to search any Lucene index,

ok, good to know. :) 

So can I guess that the same is true for nutch? Meaning the index solr
is creating could be used by a nutch searcher.

>  not just those created by Solr itself, as long as you configure it properly 
> via schema.xml.  

http://wiki.apache.org/solr/SchemaXml?highlight=%28schema%29

> Thus, you should be able to use Solr to search an index created by Nutch. 

In my use case I need the reverse. Nutch searches the index created by
my solr application. The application is just one component in the portal
and the portal will provide a "global" search engine which should use
the index from solr.

>  Haven't tried it.  It would be nice if you could contribute the 
> configuration for doing this.
> 

As I figure it out I will keep you informed.

Thanks for the feedback.

salu2