Re: Nutch with SOLR

2007-09-26 Thread Doğacan Güney
On 9/26/07, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
> > Sami has a patch in there which used an older version of the solr
> > client. With the current solr client in the SVN tree, his patch
> > becomes much easier.
> > your job would be to upgrade the patch and mail it back to him so
> > he can update his blog, or post it as a patch for inclusion in
> > nutch/contrib (if sami is ok with that). If you have issues with
> > how to use the solr client api, solr-user is here to help.
> >
>
> I've done this. Apparently someone else has taken on the solr-nutch
> job and made it a bit more complicated (which is good for the long
> term) than sami's original patch -- https://issues.apache.org/jira/
> browse/NUTCH-442

That someone else is me :)

NUTCH-442 is one of the issues that I really want to see resolved.
Unfortunately, I haven't received many (as in, none) comments, so I
haven't made further progress on it.

The patch at NUTCH-442 tries to integrate Solr as a "first-class"
citizen (so to speak): you can index to Solr or to Lucene (or both)
within the same Indexer job, and retrieve search results from a Solr
server, from Nutch's home-grown index servers, or a combination of both
in Nutch's web UI. I also think the patch lays the groundwork for
generating summaries from Solr.

>
> But we still use a version of Sami's patch that works on both trunk
> nutch and trunk solr (solrj.) I sent my changes to sami when we did
> it, if you need it let me know...
>
>
> -b


-- 
Doğacan Güney


custom sorting

2007-09-26 Thread Sandeep Shetty
> Hi Guys,
> 
> this question has been asked before but I was unable to find an answer
> that's good for me, so I hope you guys can help again.
> I am working on a website where we need to sort the results by distance
> from the location entered by the user. I have indexed the lat and long
> info for each record in Solr, and I can also get the lat and long of the
> location input by the user.
> Previously we were using Lucene to do this: with a custom
> SortComparatorSource we could sort the documents returned by distance
> nicely. We are now switching over to Solr because of the features it
> provides, however I am not able to see a way to do this in Solr. 
> 
> If someone can point me in the right direction i would be very grateful!
> 
> Thanks in advance,
> Sandeep

This email is confidential and may also be privileged. If you are not the 
intended recipient please notify us immediately by telephoning +44 (0)20 7452 
5300 or email [EMAIL PROTECTED] You should not copy it or use it for any 
purpose nor disclose its contents to any other person. Touch Local cannot 
accept liability for statements made which are clearly the sender's own and are 
not made on behalf of the firm.

Touch Local Limited
Registered Number: 2885607
VAT Number: GB896112114
Cardinal Tower, 12 Farringdon Road, London EC1M 3NN
+44 (0)20 7452 5300
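None of the replies in this thread include code, but the Lucene-side approach Sandeep mentions (a custom SortComparatorSource) usually boils down to a great-circle distance calculation such as the haversine formula. A self-contained sketch, not taken from this thread; the city coordinates below are purely illustrative:

```java
public class GeoDistance {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Haversine formula: great-circle distance in km between two
    // latitude/longitude points given in degrees.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return EARTH_RADIUS_KM * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        // London -> Paris: roughly 340-345 km great-circle
        System.out.println(distanceKm(51.5074, -0.1278, 48.8566, 2.3522));
    }
}
```

In Lucene this math would sit inside the comparator; in Solr the same calculation could back a custom function-query-style component.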



Result grouping options

2007-09-26 Thread Thomas

Hello,

For the project I'm working on now it is important to group the results 
of a query by a "product" field. Documents belong to only one product,
and there will never be more than 10 different products altogether.


When searching through the archives I identified 3 options:

1) Client-side XSLT
2) Faceting and querying all possible product facets
3) Field collapsing on the product field (SOLR-236)

Option 1 is not feasible.
Option 2 would be possible, but 10 queries for every single initial 
query is not really a good idea either.
Option 3 seems like the best option as far as I understand it, but it
only exists as a patch.


Is it possible to use faceting to not only get the facet count but also 
the top-n documents for every facet directly? If not, how hard would it
be to implement this as an extension?

If it's not possible at all, would field collapsing really be a solution 
here, and can it somehow be used with Solr 1.2?

Thanks a lot!

Thomas
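Option 2 above can at least be automated: one facet query to learn the products, then one filtered query per product for its top-n docs. A sketch of the request URLs such a client would issue (the host, the "product" field from the question, and the exact parameter set are assumptions about a Solr 1.2-era setup):

```java
import java.util.List;

public class ProductFacetQueries {
    // First request: facet on "product" with rows=0 to learn which
    // products actually occur in the result set.
    static String facetUrl(String base, String q) {
        return base + "/select?q=" + q
             + "&facet=true&facet.field=product&rows=0";
    }

    // Then one follow-up query per product, filtered with fq, asking
    // only for that product's top-n documents.
    static String perFacetUrl(String base, String q, String product, int n) {
        return base + "/select?q=" + q + "&fq=product:" + product + "&rows=" + n;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8983/solr";
        System.out.println(facetUrl(base, "camera"));
        for (String p : List.of("electronics", "books")) {
            System.out.println(perFacetUrl(base, "camera", p, 5));
        }
    }
}
```

With at most 10 products this is a bounded fan-out, and the per-product queries can be fired in parallel, which softens the "10 queries per initial query" objection somewhat.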


[JOB] Full-time opportunity in Paris, France

2007-09-26 Thread nicolas . dessaigne
Arisem is a French ISV delivering best-of-breed text analytics software. We
have been using Lucene in our products since 2001 and are in search of a
Lucene expert to complement our R&D team.

 

Required skills:

- Master's degree in computer science

- 2+ years of experience working with Lucene

- Strong design and coding skills in Java on Linux platforms

- Strong desire to work in an environment combining development and research

- Innovation and excellent communication skills

 

Fluency in French is a plus.

Ideal candidates will also have experience in research and skills in text
mining and NLP. Familiarity with C++, SOLR and Eclipse is also desired.

 

If you are available and interested, please contact me directly at
nicolas.dessaigne_at_arisem.com

 

Nicolas Dessaigne

Chief Technical Officer

ARISEM

 

 



Re: Nutch with SOLR

2007-09-26 Thread Brian Whitman


On Sep 26, 2007, at 4:04 AM, Doğacan Güney wrote:


NUTCH-442 is one of the issues that I want to really see resolved.
Unfortunately, I haven't received many (as in, none) comments, so I
haven't made further progress on it.



I am probably your target customer, but to be honest all we care about
is using Solr to index, not any of the searching or summary stuff in
Nutch. Is there a way to get Sami's SolrIndexer into nutch trunk (now
that it's working OK) sooner rather than later, and keep working on
NUTCH-442 as well? Do they conflict? -b





dataset parameters suitable for lucene application

2007-09-26 Thread Law, John
I am new to the list and new to lucene and solr. I am considering Lucene
for a potential new application and need to know how well it scales. 

Following are the parameters of the dataset.

Number of records: 7+ million
Database size: 13.3 GB
Index Size:  10.9 GB 

My questions are simply:

1) Approximately how long would it take Lucene to index these documents?
2) What would the approximate retrieval time be (i.e. search response
time)?

Can someone provide me with some informed guidance in this regard?

Thanks in advance,
John

__
John Law
Director, Platform Management
ProQuest
789 Eisenhower Parkway
Ann Arbor, MI 48106
734-997-4877
[EMAIL PROTECTED]
www.proquest.com
www.csa.com

ProQuest... Start here.





Re: dataset parameters suitable for lucene application

2007-09-26 Thread Walter Underwood
That seems well within Solr's capabilities, though you should come up
with a desired queries/sec figure.

Solr's query rate varies widely with the configuration -- how many
fields, fuzzy search, highlighting, facets, etc.

Essentially, Solr uses Lucene, a modern search core. It has performance
and scaling comparable to the commercial products I know about, and I was
building enterprise search for nine years. If you need to search over
100M docs or over 1000 queries/second, you may need fancier distributed
search than is available in Solr or commercially.

Solr's big weaknesses are the quality of the stemmers, parsing document
formats (PDF, MS Word), and access control on queries. If you can live
with the stemmers, Solr will probably do the job.

I worked at Infoseek, Inktomi, Verity, and Autonomy, and I'm using
Solr here at Netflix.

wunder

On 9/26/07 7:27 AM, "Law, John" <[EMAIL PROTECTED]> wrote:

> I am new to the list and new to lucene and solr. I am considering Lucene
> for a potential new application and need to know how well it scales.
> 
> Following are the parameters of the dataset.
> 
> Number of records: 7+ million
> Database size: 13.3 GB
> Index Size:  10.9 GB
> 
> My questions are simply:
> 
> 1) Approximately how long would it take Lucene to index these documents?
> 2) What would the approximate retrieval time be (i.e. search response
> time)?
> 
> Can someone provide me with some informed guidance in this regard?
> 
> Thanks in advance,
> John
> 
> __
> John Law
> Director, Platform Management
> ProQuest
> 789 Eisenhower Parkway
> Ann Arbor, MI 48106
> 734-997-4877
> [EMAIL PROTECTED]
> www.proquest.com
> www.csa.com
> 
> ProQuest... Start here.




2 indexes

2007-09-26 Thread philguillard

Hi,

I'm new to Solr; sorry if I missed my answer in the docs somewhere...

I need 2 different Solr indexes.
Should I create 2 webapps? In that case I have Tomcat contexts solr and 
solr2, but then I can't start solr2; I get this error:


Sep 26, 2007 6:07:25 PM org.apache.catalina.core.StandardContext filterStart
SEVERE: Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError

Regards,
Phil


Re: 2 indexes

2007-09-26 Thread philguillard

Oops. I forgot to set the solr home in the 2 context files:
/opt/tomcat/conf/Catalina/localhost/solr.xml
/opt/tomcat/conf/Catalina/localhost/solr2.xml

Phil

philguillard wrote:

Hi,

I'm new to solr, sorry if i missed my answer in the docs somewhere...

I need 2 different solr indexes.
Should i create 2 webapps? In that case i have tomcat contexts solr and 
solr2, then i can't start solr2, i get this error:


Sep 26, 2007 6:07:25 PM org.apache.catalina.core.StandardContext 
filterStart

SEVERE: Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError

Regards,
Phil
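For the archives, the fix amounts to giving each Tomcat context its own solr/home JNDI entry. A minimal sketch of the two context files (the war path and home directories are assumptions; adjust to your layout):

```xml
<!-- /opt/tomcat/conf/Catalina/localhost/solr.xml -->
<Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/opt/solr/home1" override="true"/>
</Context>

<!-- /opt/tomcat/conf/Catalina/localhost/solr2.xml -->
<Context docBase="/opt/solr/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/opt/solr/home2" override="true"/>
</Context>
```

Each home directory needs its own conf/ with schema.xml and solrconfig.xml; the two webapps then run fully independent indexes.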



RE: dataset parameters suitable for lucene application

2007-09-26 Thread Charlie Jackson
My experiences so far with this level of data have been good.

Number of records: Maxed out at 8.8 million
Database size: friggin huge (100+ GB)
Index size: ~24 GB

1) It took me about a day to index 8 million docs using a non-optimized
program I wrote. It's non-optimized in the sense that it's not
multi-threaded. It batched together groups of about 5,000 docs at a time
to be indexed.

2) Search times for a basic search are almost always sub-second. If we
toss in some faceting, it takes a little longer, but I've hardly ever
seen it go above 1-2 seconds even with the most advanced queries. 

Hope that helps.


Charlie



-Original Message-
From: Law, John [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 9:28 AM
To: solr-user@lucene.apache.org
Subject: dataset parameters suitable for lucene application

I am new to the list and new to lucene and solr. I am considering Lucene
for a potential new application and need to know how well it scales. 

Following are the parameters of the dataset.

Number of records: 7+ million
Database size: 13.3 GB
Index Size:  10.9 GB 

My questions are simply:

1) Approximately how long would it take Lucene to index these documents?
2) What would the approximate retrieval time be (i.e. search response
time)?

Can someone provide me with some informed guidance in this regard?

Thanks in advance,
John

__
John Law
Director, Platform Management
ProQuest
789 Eisenhower Parkway
Ann Arbor, MI 48106
734-997-4877
[EMAIL PROTECTED]
www.proquest.com
www.csa.com

ProQuest... Start here.
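As a rough sanity check on Charlie's figures above (the "about a day" is assumed to mean 24 hours, so these numbers are approximate): 8 million docs in a day is on the order of 90 docs/second from a single non-threaded client.

```java
public class IndexingRate {
    // Rough docs/second given a total count and elapsed seconds.
    static double docsPerSecond(long docs, long seconds) {
        return docs / (double) seconds;
    }

    public static void main(String[] args) {
        // 8M docs over an assumed 24 hours -> roughly 92 docs/sec,
        // a plausible rate for batched, single-threaded posting.
        System.out.println(docsPerSecond(8_000_000L, 24L * 3600L));
    }
}
```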





searching for non-empty fields

2007-09-26 Thread Brian Whitman
I have a large index with a field for a URL. For some reason or
another, sometimes a doc will get indexed with that field blank. This
is fine, but I want a query to return only the docs with the URL field set...


If I do a query like:

q=URL:[* TO *]

I get a lot of empty fields back, like:

<str name="URL"/>
<str name="URL">http://thing.com</str>

What can I query for to remove the empty fields?
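One pragmatic fix (an assumption on my part, not an answer from this thread) is to stop blank values from being indexed in the first place, so that URL:[* TO *] matches exactly the docs that really have a URL. A client-side sketch of cleaning a document before posting it to Solr:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SkipEmptyFields {
    // Copy only non-blank field values into the document that gets sent
    // to Solr; blank string fields would otherwise be indexed as empty
    // terms and still match open-ended range queries.
    static Map<String, String> clean(Map<String, String> raw) {
        Map<String, String> doc = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            if (e.getValue() != null && !e.getValue().trim().isEmpty()) {
                doc.put(e.getKey(), e.getValue());
            }
        }
        return doc;
    }
}
```

Docs already indexed with a blank URL would still need a reindex (or a delete-and-readd) to disappear from the results.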





Re: dataset parameters suitable for lucene application

2007-09-26 Thread Chris Harris
By "maxed out" do you mean that Solr's performance became unacceptable
beyond 8.8M records, or that you only had 8.8M records to index? If
the former, can you share the particular symptoms?

On 9/26/07, Charlie Jackson <[EMAIL PROTECTED]> wrote:
> My experiences so far with this level of data have been good.
>
> Number of records: Maxed out at 8.8 million
> Database size: friggin huge (100+ GB)
> Index size: ~24 GB
>
> 1) It took me about a day to index 8 million docs using a non-optimized
> program I wrote. It's non-optimized in the sense that it's not
> multi-threaded. It batched together groups of about 5,000 docs at a time
> to be indexed.
>
> 2) Search times for a basic search are almost always sub-second. If we
> toss in some faceting, it takes a little longer, but I've hardly ever
> seen it go above 1-2 seconds even with the most advanced queries.
>
> Hope that helps.
>
>
> Charlie
>
> 
>
> -Original Message-
> From: Law, John [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 26, 2007 9:28 AM
> To: solr-user@lucene.apache.org
> Subject: dataset parameters suitable for lucene application
>
> I am new to the list and new to lucene and solr. I am considering Lucene
> for a potential new application and need to know how well it scales.
>
> Following are the parameters of the dataset.
>
> Number of records: 7+ million
> Database size: 13.3 GB
> Index Size:  10.9 GB
>
> My questions are simply:
>
> 1) Approximately how long would it take Lucene to index these documents?
> 2) What would the approximate retrieval time be (i.e. search response
> time)?
>
> Can someone provide me with some informed guidance in this regard?
>
> Thanks in advance,
> John
>
> __
> John Law
> Director, Platform Management
> ProQuest
> 789 Eisenhower Parkway
> Ann Arbor, MI 48106
> 734-997-4877
> [EMAIL PROTECTED]
> www.proquest.com
> www.csa.com
>
> ProQuest... Start here.
>
>
>
>


RE: dataset parameters suitable for lucene application

2007-09-26 Thread Xuesong Luo
My experience so far:
200K documents were indexed in 90 mins (including DB time); the index
size is 200 MB. Querying a keyword across all 30 string fields takes
0.3-1 sec; querying a keyword on a single field takes tens of milliseconds.



-Original Message-
From: Charlie Jackson [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 8:53 AM
To: solr-user@lucene.apache.org
Subject: RE: dataset parameters suitable for lucene application

My experiences so far with this level of data have been good.

Number of records: Maxed out at 8.8 million
Database size: friggin huge (100+ GB)
Index size: ~24 GB

1) It took me about a day to index 8 million docs using a non-optimized
program I wrote. It's non-optimized in the sense that it's not
multi-threaded. It batched together groups of about 5,000 docs at a time
to be indexed.

2) Search times for a basic search are almost always sub-second. If we
toss in some faceting, it takes a little longer, but I've hardly ever
seen it go above 1-2 seconds even with the most advanced queries. 

Hope that helps.


Charlie



-Original Message-
From: Law, John [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 9:28 AM
To: solr-user@lucene.apache.org
Subject: dataset parameters suitable for lucene application

I am new to the list and new to lucene and solr. I am considering Lucene
for a potential new application and need to know how well it scales. 

Following are the parameters of the dataset.

Number of records: 7+ million
Database size: 13.3 GB
Index Size:  10.9 GB 

My questions are simply:

1) Approximately how long would it take Lucene to index these documents?
2) What would the approximate retrieval time be (i.e. search response
time)?

Can someone provide me with some informed guidance in this regard?

Thanks in advance,
John

__
John Law
Director, Platform Management
ProQuest
789 Eisenhower Parkway
Ann Arbor, MI 48106
734-997-4877
[EMAIL PROTECTED]
www.proquest.com
www.csa.com

ProQuest... Start here.




How to get debug information while indexing?

2007-09-26 Thread Urvashi Gadi
Hi,

I am trying to create my own application using SOLR, and while trying to
index my data I get:

Server returned HTTP response code: 400 for URL:
http://localhost:8983/solr/update or
Server returned HTTP response code: 500 for URL:
http://localhost:8983/solr/update

Is there a way to get more debug information than this (any logs, which
file is wrong, whether it's schema.xml, etc.)?

I have modified schema.xml and have my own xml file for indexing.

Thanks for help.

Urvashi


RE: dataset parameters suitable for lucene application

2007-09-26 Thread Charlie Jackson
Sorry, I meant that it maxed out in the sense that my maxDoc field on
the stats page was 8.8 million, which indicates that the most docs it
has ever had was around 8.8 million. It's down to about 7.8 million
currently. I have seen no signs of a "maximum" number of docs Solr can
handle. 


-Original Message-
From: Chris Harris [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 11:49 AM
To: solr-user@lucene.apache.org
Subject: Re: dataset parameters suitable for lucene application

By "maxed out" do you mean that Solr's performance became unacceptable
beyond 8.8M records, or that you only had 8.8M records to index? If
the former, can you share the particular symptoms?

On 9/26/07, Charlie Jackson <[EMAIL PROTECTED]> wrote:
> My experiences so far with this level of data have been good.
>
> Number of records: Maxed out at 8.8 million
> Database size: friggin huge (100+ GB)
> Index size: ~24 GB
>
> 1) It took me about a day to index 8 million docs using a
non-optimized
> program I wrote. It's non-optimized in the sense that it's not
> multi-threaded. It batched together groups of about 5,000 docs at a
time
> to be indexed.
>
> 2) Search times for a basic search are almost always sub-second. If we
> toss in some faceting, it takes a little longer, but I've hardly ever
> seen it go above 1-2 seconds even with the most advanced queries.
>
> Hope that helps.
>
>
> Charlie
>
> 
>
> -Original Message-
> From: Law, John [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 26, 2007 9:28 AM
> To: solr-user@lucene.apache.org
> Subject: dataset parameters suitable for lucene application
>
> I am new to the list and new to lucene and solr. I am considering
Lucene
> for a potential new application and need to know how well it scales.
>
> Following are the parameters of the dataset.
>
> Number of records: 7+ million
> Database size: 13.3 GB
> Index Size:  10.9 GB
>
> My questions are simply:
>
> 1) Approximately how long would it take Lucene to index these
documents?
> 2) What would the approximate retrieval time be (i.e. search response
> time)?
>
> Can someone provide me with some informed guidance in this regard?
>
> Thanks in advance,
> John
>
> __
> John Law
> Director, Platform Management
> ProQuest
> 789 Eisenhower Parkway
> Ann Arbor, MI 48106
> 734-997-4877
> [EMAIL PROTECTED]
> www.proquest.com
> www.csa.com
>
> ProQuest... Start here.
>
>
>
>


RE: dataset parameters suitable for lucene application

2007-09-26 Thread Law, John
Thanks all! One last question...

If I had a collection of 2.5 billion docs and a demand averaging 200
queries per second, what's the confidence that Solr/Lucene could handle
this volume and execute search with sub-second response times?


-Original Message-
From: Charlie Jackson [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 1:32 PM
To: solr-user@lucene.apache.org
Subject: RE: dataset parameters suitable for lucene application

Sorry, I meant that it maxed out in the sense that my maxDoc field on
the stats page was 8.8 million, which indicates that the most docs it
has ever had was around 8.8 million. It's down to about 7.8 million
currently. I have seen no signs of a "maximum" number of docs Solr can
handle. 


-Original Message-
From: Chris Harris [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 11:49 AM
To: solr-user@lucene.apache.org
Subject: Re: dataset parameters suitable for lucene application

By "maxed out" do you mean that Solr's performance became unacceptable
beyond 8.8M records, or that you only had 8.8M records to index? If
the former, can you share the particular symptoms?

On 9/26/07, Charlie Jackson <[EMAIL PROTECTED]> wrote:
> My experiences so far with this level of data have been good.
>
> Number of records: Maxed out at 8.8 million
> Database size: friggin huge (100+ GB)
> Index size: ~24 GB
>
> 1) It took me about a day to index 8 million docs using a
non-optimized
> program I wrote. It's non-optimized in the sense that it's not
> multi-threaded. It batched together groups of about 5,000 docs at a
time
> to be indexed.
>
> 2) Search times for a basic search are almost always sub-second. If we
> toss in some faceting, it takes a little longer, but I've hardly ever
> seen it go above 1-2 seconds even with the most advanced queries.
>
> Hope that helps.
>
>
> Charlie
>
> 
>
> -Original Message-
> From: Law, John [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 26, 2007 9:28 AM
> To: solr-user@lucene.apache.org
> Subject: dataset parameters suitable for lucene application
>
> I am new to the list and new to lucene and solr. I am considering
Lucene
> for a potential new application and need to know how well it scales.
>
> Following are the parameters of the dataset.
>
> Number of records: 7+ million
> Database size: 13.3 GB
> Index Size:  10.9 GB
>
> My questions are simply:
>
> 1) Approximately how long would it take Lucene to index these
documents?
> 2) What would the approximate retrieval time be (i.e. search response
> time)?
>
> Can someone provide me with some informed guidance in this regard?
>
> Thanks in advance,
> John
>
> __
> John Law
> Director, Platform Management
> ProQuest
> 789 Eisenhower Parkway
> Ann Arbor, MI 48106
> 734-997-4877
> [EMAIL PROTECTED]
> www.proquest.com
> www.csa.com
>
> ProQuest... Start here.
>
>
>
>


Re: dataset parameters suitable for lucene application

2007-09-26 Thread Walter Underwood
No one can answer that, because it depends on how you configure Solr.
How many fields do you want to search? Are you using fuzzy search?
Facets? Highlighting?

We are searching a much smaller collection, about 250K docs, with
great success. We see 80 queries/sec on each of four servers, and
response times under 100ms. Each query searches against seven
fields and we don't use any of the features I listed above.

wunder

On 9/26/07 10:50 AM, "Law, John" <[EMAIL PROTECTED]> wrote:

> Thanks all! One last question...
> 
> If I had a collection of 2.5 billion docs and a demand averaging 200
> queries per second, what's the confidence that Solr/Lucene could handle
> this volume and execute search with sub-second response times?
> 
> 
> -Original Message-
> From: Charlie Jackson [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 26, 2007 1:32 PM
> To: solr-user@lucene.apache.org
> Subject: RE: dataset parameters suitable for lucene application
> 
> Sorry, I meant that it maxed out in the sense that my maxDoc field on
> the stats page was 8.8 million, which indicates that the most docs it
> has ever had was around 8.8 million. It's down to about 7.8 million
> currently. I have seen no signs of a "maximum" number of docs Solr can
> handle. 
> 
> 
> -Original Message-
> From: Chris Harris [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 26, 2007 11:49 AM
> To: solr-user@lucene.apache.org
> Subject: Re: dataset parameters suitable for lucene application
> 
> By "maxed out" do you mean that Solr's performance became unacceptable
> beyond 8.8M records, or that you only had 8.8M records to index? If
> the former, can you share the particular symptoms?
> 
> On 9/26/07, Charlie Jackson <[EMAIL PROTECTED]> wrote:
>> My experiences so far with this level of data have been good.
>> 
>> Number of records: Maxed out at 8.8 million
>> Database size: friggin huge (100+ GB)
>> Index size: ~24 GB
>> 
>> 1) It took me about a day to index 8 million docs using a
> non-optimized
>> program I wrote. It's non-optimized in the sense that it's not
>> multi-threaded. It batched together groups of about 5,000 docs at a
> time
>> to be indexed.
>> 
>> 2) Search times for a basic search are almost always sub-second. If we
>> toss in some faceting, it takes a little longer, but I've hardly ever
>> seen it go above 1-2 seconds even with the most advanced queries.
>> 
>> Hope that helps.
>> 
>> 
>> Charlie
>> 
>> 
>> 
>> -Original Message-
>> From: Law, John [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, September 26, 2007 9:28 AM
>> To: solr-user@lucene.apache.org
>> Subject: dataset parameters suitable for lucene application
>> 
>> I am new to the list and new to lucene and solr. I am considering
> Lucene
>> for a potential new application and need to know how well it scales.
>> 
>> Following are the parameters of the dataset.
>> 
>> Number of records: 7+ million
>> Database size: 13.3 GB
>> Index Size:  10.9 GB
>> 
>> My questions are simply:
>> 
>> 1) Approximately how long would it take Lucene to index these
> documents?
>> 2) What would the approximate retrieval time be (i.e. search response
>> time)?
>> 
>> Can someone provide me with some informed guidance in this regard?
>> 
>> Thanks in advance,
>> John
>> 
>> __
>> John Law
>> Director, Platform Management
>> ProQuest
>> 789 Eisenhower Parkway
>> Ann Arbor, MI 48106
>> 734-997-4877
>> [EMAIL PROTECTED]
>> www.proquest.com
>> www.csa.com
>> 
>> ProQuest... Start here.
>> 
>> 
>> 
>> 



Re: How to get debug information while indexing?

2007-09-26 Thread Yonik Seeley
On 9/26/07, Urvashi Gadi <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am trying to create my own application using SOLR and while trying to
> index my data i get
>
> Server returned HTTP response code: 400 for URL:
> http://localhost:8983/solr/update or
> Server returned HTTP response code: 500 for URL:
> http://localhost:8983/solr/update
>
> Is there a way to get more debug information than this (any logs, which file
> is wrong, schema.xml? etc)

Both the HTTP reason and response body should contain more information.
What are you using to communicate with Solr?
Try a bad request with curl and you can see the info that comes back:

[EMAIL PROTECTED] /cygdrive/f/code/lucene
$ curl -i http://localhost:8983/solr/select?q=foo:bar
HTTP/1.1 400 undefined_field_foo
Content-Type: text/html; charset=iso-8859-1
Content-Length: 1398
Server: Jetty(6.1.3)




<html>
<head><title>Error 400</title></head>
<body>
<h2>HTTP ERROR: 400</h2>
<pre>undefined field foo</pre>
<p>RequestURI=/solr/select</p>
<p><i><small><a href="http://jetty.mortbay.org/">Powered by Jetty://</a></small></i></p>
</body>
</html>

Errors should also be logged.

-Yonik
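If the client is raw HttpURLConnection (a guess about the poster's setup), the response body Yonik mentions is easy to miss: on a 400/500, getInputStream() throws, and the server's explanation sits on getErrorStream() instead. A sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;

public class SolrErrorBody {
    // Read a stream fully into a String (fine for ASCII error pages).
    static String readBody(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) sb.append((char) c);
        return sb.toString();
    }

    // On 4xx/5xx, getInputStream() throws; the server's explanation
    // (e.g. "undefined field foo") is on getErrorStream() instead.
    static String fetch(HttpURLConnection conn) throws IOException {
        int code = conn.getResponseCode();
        InputStream in = (code >= 400) ? conn.getErrorStream() : conn.getInputStream();
        return readBody(in);
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate readBody without a live server:
        System.out.println(readBody(
            new ByteArrayInputStream("undefined field foo".getBytes())));
    }
}
```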


Geographical distance searching

2007-09-26 Thread Lance Norskog
It is a "best practice" to store the master copy of this data in a
relational database and use Solr/Lucene as a high-speed cache.
MySQL has a geographical database option, so maybe that is a better option
than Lucene indexing.

Lance

(P.s. please start new threads for new topics.)

-Original Message-
From: Sandeep Shetty [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 5:15 AM
To: 'solr-user@lucene.apache.org'
Subject: custom sorting

> Hi Guys,
> 
> this question as been asked before but i was unable to find an answer 
> thats good for me, so hope you guys can help again i am working on a 
> website where we need to sort the results by distance from the 
> location entered by the user. I have indexed the lat and long info for 
> each record in solr and also i can get the lat and long of the 
> location input by the user.
> Previously we were using lucene to do this. by using the 
> SortComparatorSource we could sort the documents returned by distance 
> nicely. we are now switching over to lucene because of the features it 
> provides, however i am not able to see a way to do this in Solr.
> 
> If someone can point me in the right direction i would be very grateful!
> 
> Thanks in advance,
> Sandeep

This email is confidential and may also be privileged. If you are not the
intended recipient please notify us immediately by telephoning +44 (0)20
7452 5300 or email [EMAIL PROTECTED] You should not copy it or use
it for any purpose nor disclose its contents to any other person. Touch
Local cannot accept liability for statements made which are clearly the
sender's own and are not made on behalf of the firm.

Touch Local Limited
Registered Number: 2885607
VAT Number: GB896112114
Cardinal Tower, 12 Farringdon Road, London EC1M 3NN
+44 (0)20 7452 5300



RE: dataset parameters suitable for lucene application

2007-09-26 Thread Lance Norskog
My limited experience with larger indexes is:
1) the logistics of copying around and backing up this much data become a
problem, and
2) indexing is disk-bound: we're on SAS disks and it makes no difference
between one indexing thread and a dozen (we have small records).

Smaller result sets are faster. You need to limit the search results via as
many parameters as you can, and filters are the way to do this.

-Original Message-
From: Walter Underwood [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: dataset parameters suitable for lucene application

No one can answer that, because it depends on how you configure Solr.
How many fields do you want to search? Are you using fuzzy search?
Facets? Highlighting?

We are searching a much smaller collection, about 250K docs, with great
success. We see 80 queries/sec on each of four servers, and response times
under 100ms. Each query searches against seven fields and we don't use any
of the features I listed above.

wunder

On 9/26/07 10:50 AM, "Law, John" <[EMAIL PROTECTED]> wrote:

> Thanks all! One last question...
> 
> If I had a collection of 2.5 billion docs and a demand averaging 200 
> queries per second, what's the confidence that Solr/Lucene could 
> handle this volume and execute search with sub-second response times?
> 
> 
> -Original Message-
> From: Charlie Jackson [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 26, 2007 1:32 PM
> To: solr-user@lucene.apache.org
> Subject: RE: dataset parameters suitable for lucene application
> 
> Sorry, I meant that it maxed out in the sense that my maxDoc field on 
> the stats page was 8.8 million, which indicates that the most docs it 
> has ever had was around 8.8 million. It's down to about 7.8 million 
> currently. I have seen no signs of a "maximum" number of docs Solr can 
> handle.
> 
> 
> -Original Message-
> From: Chris Harris [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 26, 2007 11:49 AM
> To: solr-user@lucene.apache.org
> Subject: Re: dataset parameters suitable for lucene application
> 
> By "maxed out" do you mean that Solr's performance became unacceptable 
> beyond 8.8M records, or that you only had 8.8M records to index? If 
> the former, can you share the particular symptoms?
> 
> On 9/26/07, Charlie Jackson <[EMAIL PROTECTED]> wrote:
>> My experiences so far with this level of data have been good.
>> 
>> Number of records: Maxed out at 8.8 million Database size: friggin 
>> huge (100+ GB) Index size: ~24 GB
>> 
>> 1) It took me about a day to index 8 million docs using a
> non-optimized
>> program I wrote. It's non-optimized in the sense that it's not 
>> multi-threaded. It batched together groups of about 5,000 docs at a
> time
>> to be indexed.
>> 
>> 2) Search times for a basic search are almost always sub-second. If 
>> we toss in some faceting, it takes a little longer, but I've hardly 
>> ever seen it go above 1-2 seconds even with the most advanced queries.
>> 
>> Hope that helps.
>> 
>> 
>> Charlie
>> 
>> 
>> 
>> -Original Message-
>> From: Law, John [mailto:[EMAIL PROTECTED]
>> Sent: Wednesday, September 26, 2007 9:28 AM
>> To: solr-user@lucene.apache.org
>> Subject: dataset parameters suitable for lucene application
>> 
>> I am new to the list and new to lucene and solr. I am considering
> Lucene
>> for a potential new application and need to know how well it scales.
>> 
>> Following are the parameters of the dataset.
>> 
>> Number of records: 7+ million
>> Database size: 13.3 GB
>> Index Size:  10.9 GB
>> 
>> My questions are simply:
>> 
>> 1) Approximately how long would it take Lucene to index these
> documents?
>> 2) What would the approximate retrieval time be (i.e. search response 
>> time)?
>> 
>> Can someone provide me with some informed guidance in this regard?
>> 
>> Thanks in advance,
>> John
>> 
>> __
>> John Law
>> Director, Platform Management
>> ProQuest
>> 789 Eisenhower Parkway
>> Ann Arbor, MI 48106
>> 734-997-4877
>> [EMAIL PROTECTED]
>> www.proquest.com
>> www.csa.com
>> 
>> ProQuest... Start here.
>> 
>> 
>> 
>> 



Re: dataset parameters suitable for lucene application

2007-09-26 Thread Mike Klaas

On 26-Sep-07, at 10:50 AM, Law, John wrote:

> Thanks all! One last question...
>
> If I had a collection of 2.5 billion docs and a demand averaging 200
> queries per second, what's the confidence that Solr/Lucene could handle
> this volume and execute search with sub-second response times?

No search software can search 2.5 billion docs (assuming web-sized
documents) in 5ms on a single server.

You certainly could build such a system with Solr distributed over
100's of nodes, but this is not built into Solr currently.

-Mike
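The many-node setup Mike describes boils down to fanning each query out to every shard and merging the per-shard top hits by score. A minimal sketch of just that merge step, assuming each shard has already returned `(doc_id, score)` pairs sorted by descending score (an illustration only, not Solr code, since distributed search was not built in at the time):

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, k=10):
    """Merge per-shard hit lists (each sorted by descending score)
    into a global top-k without materializing the full union."""
    merged = heapq.merge(*shard_results, key=lambda hit: -hit[1])
    return list(islice(merged, k))
```

Each shard only needs to return its own top k, so per-query work on any one node stays bounded as the corpus is spread over more machines.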


Re: custom sorting

2007-09-26 Thread Mike Klaas

On 26-Sep-07, at 5:14 AM, Sandeep Shetty wrote:

> Hi Guys,
>
> this question has been asked before but I was unable to find an answer
> that's good for me, so I hope you guys can help again.
> I am working on a website where we need to sort the results by distance
> from the location entered by the user. I have indexed the lat and long
> info for each record in solr and also I can get the lat and long of the
> location input by the user.
> Previously we were using Lucene to do this: by using the
> SortComparatorSource we could sort the documents returned by distance
> nicely. We are now switching over to Solr because of the features it
> provides, however I am not able to see a way to do this in Solr.
>
> If someone can point me in the right direction I would be very grateful!
>
> Thanks in advance,
> Sandeep

> This email is confidential and may also be privileged. If you are not
> the intended recipient please notify us immediately by telephoning
> +44 (0)20 7452 5300 or email [EMAIL PROTECTED]. You should not copy it
> or use it for any purpose nor disclose its contents to any other person.
> Touch Local cannot accept liability for statements made which are clearly
> the sender's own and are not made on behalf of the firm.


Sorry, I'm afraid the above email is already irrevocably publicly
archived.


-Mike


What is facet?

2007-09-26 Thread Teruhiko Kurosaka
Could someone tell me what a facet is?
I have a vague idea but I am not too clear.
A pointer to a sample web site that uses Solr faceting
would be very good.

Thanks.
-Kuro


RE: Geographical distance searching

2007-09-26 Thread Will Johnson
With the new/improved value source functions it should be pretty easy to
develop a new best practice.  You should be able to pull in the lat/lon
values from valuesource fields and then do your great circle calculation.

- will
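The great circle calculation itself is the standard haversine formula; a custom function or ValueSource would evaluate something like this per document (a plain-Python sketch, with the per-document lat/lon lookup from the index omitted):

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in km between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))
```

Sorting by distance then means evaluating this against the user's point for each candidate document.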

-Original Message-
From: Lance Norskog [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 3:15 PM
To: solr-user@lucene.apache.org
Subject: Geographical distance searching

It is a "best practice" to store the master copy of this data in a
relational database and use Solr/Lucene as a high-speed cache.
MySQL has a geographical database option, so maybe that is a better option
than Lucene indexing.

Lance

(P.s. please start new threads for new topics.)

-Original Message-
From: Sandeep Shetty [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 5:15 AM
To: 'solr-user@lucene.apache.org'
Subject: custom sorting

> Hi Guys,
> 
> this question has been asked before but I was unable to find an answer 
> that's good for me, so I hope you guys can help again. I am working on a 
> website where we need to sort the results by distance from the 
> location entered by the user. I have indexed the lat and long info for 
> each record in solr and also I can get the lat and long of the 
> location input by the user.
> Previously we were using Lucene to do this: by using the 
> SortComparatorSource we could sort the documents returned by distance 
> nicely. We are now switching over to Solr because of the features it 
> provides, however I am not able to see a way to do this in Solr.
> 
> If someone can point me in the right direction I would be very grateful!
> 
> Thanks in advance,
> Sandeep

This email is confidential and may also be privileged. If you are not the
intended recipient please notify us immediately by telephoning +44 (0)20
7452 5300 or email [EMAIL PROTECTED] You should not copy it or use
it for any purpose nor disclose its contents to any other person. Touch
Local cannot accept liability for statements made which are clearly the
sender's own and are not made on behalf of the firm.

Touch Local Limited
Registered Number: 2885607
VAT Number: GB896112114
Cardinal Tower, 12 Farringdon Road, London EC1M 3NN
+44 (0)20 7452 5300



Converting German special characters / umlaute

2007-09-26 Thread Matthias Eireiner
Dear list,

I have two questions regarding German special characters (umlauts).

Is there an analyzer which automatically converts all German special
characters to their dissected form, such as ü to ue and ä to ae, etc.?

I would also like the search to always run against the dissected data,
but when the results are returned, the original, unmodified data should
be returned.

Does Lucene's GermanAnalyzer do this job? I ran across it, but I could
not figure out from the documentation whether it does.

thanks a lot in advance.

Matthias
 


--
Matthias Eireiner
Web Reisen GmbH
Amalienstr. 45
80799 München
+49 (89) 289-22920
[EMAIL PROTECTED] 

Geschäftsführung: Gabriel Graf Matuschka - Sitz der Gesellschaft:
München
Registergericht: Amtsgericht München, HRB 167305





Re: What is facet?

2007-09-26 Thread Ezra Ball
Faceted search is an approach to search where a taxonomy or categorization
scheme is visible in addition to document matches.
 
http://www.searchtools.com/info/faceted-metadata.html

--Ezra.


On 9/26/07 3:47 PM, "Teruhiko Kurosaka" <[EMAIL PROTECTED]> wrote:

> Could someone tell me what facet is?
> I have a vague idea but I am not too clear.
> A pointer to a sample web site that uses Solr facet
> would be very good.
> 
> Thanks.
> -Kuro



Re: Converting German special characters / umlaute

2007-09-26 Thread Thomas Traeger

Try the SnowballPorterFilterFactory described here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You should use the German2 variant that converts ä and ae to a, ö and oe 
to o and so on. More details:

http://snowball.tartarus.org/algorithms/german2/stemmer.html

Every document in Solr can have any number of fields which might share
the same source data but have different field types and are therefore
handled differently (stored as-is, analyzed in different ways, ...). Use
copyField in your schema.xml to feed your data into multiple fields. At
search time you decide which fields to search on (usually the analyzed
ones) and which to retrieve when getting the document back.


Tom
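The folding Tom describes can be illustrated with a toy sketch: both the umlaut and its two-letter transcription collapse to the same base vowel, so "Müller" and "Mueller" index identically. The real Snowball German2 stemmer has additional rules (e.g. it keeps "ue" after "q") that this deliberately ignores:

```python
# Simplified German2-style folding; NOT the full Snowball algorithm.
FOLDS = [("ae", "a"), ("oe", "o"), ("ue", "u"),
         ("\u00e4", "a"), ("\u00f6", "o"), ("\u00fc", "u"),  # ä, ö, ü
         ("\u00df", "ss")]                                   # ß

def fold_german(term):
    """Lowercase a term and collapse umlauts/transcriptions."""
    term = term.lower()
    for src, dst in FOLDS:
        term = term.replace(src, dst)
    return term
```

With this folding, `fold_german("Müller")` and `fold_german("Mueller")` both yield `"muller"`, which is why a query in either spelling matches both forms.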

Matthias Eireiner schrieb:

Dear list,

I have two questions regarding German special characters or umlaute.

is there an analyzer which automatically converts all german special
characters to their specific dissected form, such as ü to ue and ä to
ae, etc.?!

I also would like to have, that the search is always run against the
dissected data. But when the results are returned the initial data with
the non modified data should be returned. 


Does Lucene's GermanAnalyzer do this job? I ran across it, but I could not
figure out from the documentation whether it does the job or not.

thanks a lot in advance.

Matthias
  


Re: What is facet?

2007-09-26 Thread Chris Hostetter

: Faceted search is an approach to search where a taxonomy or categorization
: scheme is visible in addition to document matches.

My ApacheConUS2006 talk went into a little more detail, including the best 
definition of faceted searching/browsing I've ever seen...
http://people.apache.org/~hossman/apachecon2006us/

"Interaction style where users filter a set 
 of items by progressively selecting from 
 only valid values of a faceted 
 classification system"
   - Keith Instone, SOASIS&T, July 8, 2004

Specifically regarding the term "facet" ... there we tend to find some 
ambiguity.  Lots of people can describe Faceted Searching, but most 
people's concept of a "Facet" tends to be very narrow.  Since I wrote most 
of the Solr Faceting documentation, it tends to follow my bias (also from 
my 2006 talk) ...

  Explaining My Terms
* Facet: A distinct feature or aspect of a 
  set of objects; "a way in which a 
  resource can be classified"
* Constraint: A viable method of limiting a 
  set of objects

In this regard, "color" is a facet, "blue" is a constraint on the color 
facet which may be expressed as the query "color:blue".  Likewise 
"popularity" is a facet, and a constraint query on the popularity facet 
might be "popularity:high" or it might be "popularity:[100 TO *]" 
depending on the specifics of how you manage your data.  A more 
complicated example is that you might define a high level conceptual facet 
of "coolness" which does not directly relate to a specific concrete field, 
but instead relates to a complex query on many fields (hence: Solr's 
facet.query options) such that the "coolness" facet has constraints...

  cool => (popularity:[100 TO *] (+numFeatures:[10 TO *] +price:[0 TO 10]))
  lame => (+popularity:[* TO 99] +numFeatures:[* TO 9] +price:[11 TO *])
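In request terms, each such conceptual constraint becomes one facet.query parameter alongside the main query. A hedged sketch of assembling that request (parameter names are from Solr's faceting support; the two query strings are just the "cool"/"lame" constraints above):

```python
from urllib.parse import urlencode

# One facet.query per conceptual constraint on the "coolness" facet.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.query",
     "(popularity:[100 TO *] (+numFeatures:[10 TO *] +price:[0 TO 10]))"),
    ("facet.query",
     "(+popularity:[* TO 99] +numFeatures:[* TO 9] +price:[11 TO *])"),
]
query_string = urlencode(params)  # append after /select? on the Solr URL
```

Solr then reports a count per facet.query, i.e. how many matching docs are "cool" and how many are "lame".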




-Hoss


Re: Converting German special characters / umlaute

2007-09-26 Thread Chris Hostetter

: is there an analyzer which automatically converts all german special
: characters to their specific dissected from, such as ü to ue and ä to
: ae, etc.?!

See also the ISOLatin1AccentFilter, which does this regardless of language.

: I also would like to have, that the search is always run against the
: dissected data. But when the results are returned the initial data with
: the non modified data should be returned. 

stored fields (returned to clients) always contain the original field 
value, regardless of which analyzer/tokenizer/tokenfilters you use.


PS: http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss

Re: Geographical distance searching

2007-09-26 Thread Ian Holsman

Have you guys seen Local Lucene ?
http://www.nsshutdown.com/projects/lucene/whitepaper/*locallucene*.htm

no need for mysql if you don't want to.

rgrds
Ian

Will Johnson wrote:

With the new/improved value source functions it should be pretty easy to
develop a new best practice.  You should be able to pull in the lat/lon
values from valuesource fields and then do your great circle calculation.

- will

-Original Message-
From: Lance Norskog [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 3:15 PM

To: solr-user@lucene.apache.org
Subject: Geographical distance searching

It is a "best practice" to store the master copy of this data in a
relational database and use Solr/Lucene as a high-speed cache.
MySQL has a geographical database option, so maybe that is a better option
than Lucene indexing.

Lance

(P.s. please start new threads for new topics.)

-Original Message-
From: Sandeep Shetty [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 5:15 AM

To: 'solr-user@lucene.apache.org'
Subject: custom sorting

  

Hi Guys,

this question as been asked before but i was unable to find an answer 
thats good for me, so hope you guys can help again i am working on a 
website where we need to sort the results by distance from the 
location entered by the user. I have indexed the lat and long info for 
each record in solr and also i can get the lat and long of the 
location input by the user.
Previously we were using lucene to do this. by using the 
SortComparatorSource we could sort the documents returned by distance 
nicely. we are now switching over to lucene because of the features it 
provides, however i am not able to see a way to do this in Solr.


If someone can point me in the right direction i would be very grateful!

Thanks in advance,
Sandeep





  




Re: searching for non-empty fields

2007-09-26 Thread Pieter Berkel
I've experienced a similar problem before. Assuming the field type is
"string" (i.e. not tokenized), there is a subtle yet important difference
between a field that is null (i.e. not contained in the document) and one
that is an empty string (in the document but with no value). See
http://www.nabble.com/indexing-null-values--tf4238702.html#a12067741 for a
previous discussion of the issue.

Your query will work if you make sure the URL field is omitted from the
document at index time when the field is blank.

cheers,
Piete
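On the client side, that fix amounts to pruning blank values before the add document is built; a minimal sketch, assuming documents are plain dicts of field name to value:

```python
def prune_empty_fields(doc):
    """Drop fields whose value is None or only whitespace, so a blank
    URL is absent from the document instead of indexed as ""."""
    return {name: value for name, value in doc.items()
            if value is not None and str(value).strip() != ""}
```

With blank fields omitted at index time, `URL:[* TO *]` then matches only documents that genuinely have a URL.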



On 27/09/2007, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
> I have a large index with a field for a URL. For some reason or
> another, sometimes a doc will get indexed with that field blank. This
> is fine but I want a query to return only the set URL fields...
>
> If I do a query like:
>
> q=URL:[* TO *]
>
> I get a lot of empty fields back, like:
>
> <str name="URL"/>
> <str name="URL"/>
> <str name="URL">http://thing.com</str>
>
> What can I query for to remove the empty fields?
>
>
>
>


Re: Geographical distance searching

2007-09-26 Thread patrick o'leary




Might want to remove the *'s around that url
http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm

There's actually a downloadable demo:
http://www.nsshutdown.com/solr-example_s1.3_ls0.2.tgz
start it up as you would a normal solr example
$ cd solr-example/apache-solr*/example
$ java -jar start.jar

Open up firefox (sorry demo ui was quick and dirty so firefox only) and
go to http://localhost:8983/localcinema/
Make sure you specify localhost, there's a google maps key based upon
the url's domain, and click 'Go' at the bottom of the page.

The demo comes with some sample data already indexed for the NY region,
so have a play.

P.S. After a little tidy-up I'll be adding this to both Lucene's and
Solr's repositories if folks feel that it's a useful addition.

Thanks
Patrick

Ian Holsman wrote:
> Have you guys seen Local Lucene ?
> http://www.nsshutdown.com/projects/lucene/whitepaper/*locallucene*.htm
>
> no need for mysql if you don't want too.
>
> rgrds
> Ian
>
> [remainder of the quoted thread snipped; it repeats the messages above]


-- 
Patrick O'Leary


You see, wire telegraph is a kind of a very, very long cat. You pull his
tail in New York and his head is meowing in Los Angeles. Do you understand
this? And radio operates exactly the same way: you send signals here, they
receive them there. The only difference is that there is no cat.
  - Albert Einstein






anyone can send me jetty-plus

2007-09-26 Thread James liu
i can't download it from http://jetty.mortbay.org/jetty5/plus/index.html

-- 
regards
jl


Re: searching for non-empty fields

2007-09-26 Thread Ryan McKinley


Your query will work if you make sure the URL field is omitted from the
document at index time when the field is blank.



adding something like:
  

to the schema field should do it without needing to ensure it is not 
null or "" on the client side.


ryan


Re: searching for non-empty fields

2007-09-26 Thread Chris Hostetter


: Date: Thu, 27 Sep 2007 00:12:48 -0400
: From: Ryan McKinley <[EMAIL PROTECTED]>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Re: searching for non-empty fields
: 
: > 
: > Your query will work if you make sure the URL field is omitted from the
: > document at index time when the field is blank.
: > 
: 
: adding something like:
:   
: 
: to the schema field should do it without needing to ensure it is not null or
: "" on the client side.

...and to work around the problem until you reindex...

q=(URL:[* TO *] -URL:"")

...at least: I'm 97% certain that will work.  It won't help if your
"empty" values are really " " or "  " or ...



-Hoss