Re: Nutch with SOLR
On 9/26/07, Brian Whitman <[EMAIL PROTECTED]> wrote:
> > Sami has a patch in there which used an older version of the solr
> > client. With the current solr client in the SVN tree, his patch
> > becomes much easier. Your job would be to upgrade the patch and mail
> > it back to him so he can update his blog, or post it as a patch for
> > inclusion in nutch/contrib (if Sami is ok with that). If you have
> > issues with how to use the solr client api, solr-user is here to help.
>
> I've done this. Apparently someone else has taken on the solr-nutch
> job and made it a bit more complicated (which is good for the long
> term) than Sami's original patch --
> https://issues.apache.org/jira/browse/NUTCH-442

That someone else is me :) NUTCH-442 is one of the issues that I really want to see resolved. Unfortunately, I haven't received many (as in, none) comments, so I haven't made further progress on it.

The patch at NUTCH-442 tries to integrate Solr as a "first-class" citizen (so to speak), so that you can index to Solr or to Lucene within the same Indexer job (or both), and retrieve search results from a Solr server, from Nutch's home-grown index servers, or a combination of both in Nutch's web UI. And I think the patch lays the groundwork for generating summaries from Solr.

> But we still use a version of Sami's patch that works on both trunk
> nutch and trunk solr (solrj). I sent my changes to Sami when we did
> it; if you need it let me know...
>
> -b

--
Doğacan Güney
custom sorting
> Hi Guys,
>
> this question has been asked before but i was unable to find an answer
> thats good for me, so hope you guys can help again.
> i am working on a website where we need to sort the results by distance
> from the location entered by the user. I have indexed the lat and long
> info for each record in solr, and also i can get the lat and long of the
> location input by the user.
> Previously we were using lucene to do this: by using a
> SortComparatorSource we could sort the documents returned by distance
> nicely. we are now switching over to solr because of the features it
> provides, however i am not able to see a way to do this in Solr.
>
> If someone can point me in the right direction i would be very grateful!
>
> Thanks in advance,
> Sandeep

This email is confidential and may also be privileged. If you are not the intended recipient please notify us immediately by telephoning +44 (0)20 7452 5300 or email [EMAIL PROTECTED] You should not copy it or use it for any purpose nor disclose its contents to any other person. Touch Local cannot accept liability for statements made which are clearly the sender's own and are not made on behalf of the firm. Touch Local Limited Registered Number: 2885607 VAT Number: GB896112114 Cardinal Tower, 12 Farringdon Road, London EC1M 3NN +44 (0)20 7452 5300
Result grouping options
Hello,

For the project I'm working on now it is important to group the results of a query by a "product" field. Documents belong to only one product, and there will never be more than 10 different products altogether. When searching through the archives I identified 3 options:

1) Client-side XSLT
2) Faceting and then querying each possible product facet (see the sketch after this message)
3) Field collapsing on the product field (SOLR-236)

Option 1 is not feasible. Option 2 would be possible, but 10 queries for every single initial query is not really a good idea either. Option 3 seems like the best option as far as I understand it, but it only exists as a patch.

Is it possible to use faceting to not only get the facet count but also the top-n documents for every facet directly? If not, how hard would it be to implement this as an extension? If it's not possible at all, would field collapsing really be a solution here, and can it somehow be used with Solr 1.2?

Thanks a lot!
Thomas
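To make option 2 concrete: it means one faceting request followed by a filtered query per product, something like the following sketch (host, field and product values are hypothetical, not from Thomas's index):

  http://localhost:8983/solr/select?q=foo&rows=0&facet=true&facet.field=product
  http://localhost:8983/solr/select?q=foo&fq=product:alpha&rows=5
  http://localhost:8983/solr/select?q=foo&fq=product:beta&rows=5
  ... one fq query per product value returned by the first request

The first request returns only the per-product counts (rows=0); each follow-up fetches the top-n documents for one product, which is where the up-to-10-extra-queries cost comes from.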
[JOB] Full-time opportunity in Paris, France
Arisem is a French ISV delivering best-of-breed text analytics software. We have been using Lucene in our products since 2001 and are in search of a Lucene expert to complement our R&D team.

Required skills:
- Master's degree in computer science
- 2+ years of experience working with Lucene
- Strong design and coding skills in Java on Linux platforms
- Strong desire to work in an environment combining development and research
- Innovation and excellent communication skills

Fluency in French is a plus. Ideal candidates will also have experience in research and skills in text mining and NLP. Familiarity with C++, SOLR and Eclipse is also desired.

If you are available and interested, please contact me directly at nicolas.dessaigne_at_arisem.com

Nicolas Dessaigne
Chief Technical Officer
ARISEM
Re: Nutch with SOLR
On Sep 26, 2007, at 4:04 AM, Doğacan Güney wrote:
> NUTCH-442 is one of the issues that I really want to see resolved.
> Unfortunately, I haven't received many (as in, none) comments, so I
> haven't made further progress on it.

I am probably your target customer, but to be honest all we care about is using Solr to index, not any of the searching or summary stuff in Nutch. Is there a way to get Sami's SolrIndexer into nutch trunk (now that it's working OK) sooner rather than later and keep working on NUTCH-442 as well? Do they conflict?

-b
dataset parameters suitable for lucene application
I am new to the list and new to lucene and solr. I am considering Lucene for a potential new application and need to know how well it scales.

Following are the parameters of the dataset.

Number of records: 7+ million
Database size: 13.3 GB
Index Size: 10.9 GB

My questions are simply:

1) Approximately how long would it take Lucene to index these documents?
2) What would the approximate retrieval time be (i.e. search response time)?

Can someone provide me with some informed guidance in this regard?

Thanks in advance,
John

__
John Law
Director, Platform Management
ProQuest
789 Eisenhower Parkway
Ann Arbor, MI 48106
734-997-4877
[EMAIL PROTECTED]
www.proquest.com
www.csa.com

ProQuest... Start here.
Re: dataset parameters suitable for lucene application
That seems well within Solr's capabilities, though you should come up with a desired queries/sec figure. Solr's query rate varies widely with the configuration -- how many fields, fuzzy search, highlighting, facets, etc.

Essentially, Solr uses Lucene, a modern search core. It has performance and scaling comparable to the commercial products I know about, and I was building enterprise search for nine years. If you need to search over 100M docs or over 1000 queries/second, you may need fancier distributed search than is available in Solr or commercially.

Solr's big weaknesses are the quality of the stemmers, parsing document formats (PDF, MS Word), and access control on queries. If you can live with the stemmers, Solr will probably do the job.

I worked at Infoseek, Inktomi, Verity, and Autonomy, and I'm using Solr here at Netflix.

wunder

On 9/26/07 7:27 AM, "Law, John" <[EMAIL PROTECTED]> wrote:
> I am new to the list and new to lucene and solr. I am considering Lucene
> for a potential new application and need to know how well it scales.
>
> Following are the parameters of the dataset.
>
> Number of records: 7+ million
> Database size: 13.3 GB
> Index Size: 10.9 GB
> [snip]
2 indexes
Hi,

I'm new to solr, sorry if I missed my answer in the docs somewhere...

I need 2 different solr indexes. Should I create 2 webapps? In that case I have tomcat contexts solr and solr2, but then I can't start solr2; I get this error:

Sep 26, 2007 6:07:25 PM org.apache.catalina.core.StandardContext filterStart
SEVERE: Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError

Regards,
Phil
Re: 2 indexes
Oops. I forgot to set the solr home in the 2 context files:

/opt/tomcat/conf/Catalina/localhost/solr.xml
/opt/tomcat/conf/Catalina/localhost/solr2.xml

Phil

philguillard wrote:
> Hi, I'm new to solr, sorry if i missed my answer in the docs somewhere...
> I need 2 different solr indexes. Should i create 2 webapps?
> [snip]
RE: dataset parameters suitable for lucene application
My experiences so far with this level of data have been good.

Number of records: Maxed out at 8.8 million
Database size: friggin huge (100+ GB)
Index size: ~24 GB

1) It took me about a day to index 8 million docs using a non-optimized program I wrote. It's non-optimized in the sense that it's not multi-threaded. It batched together groups of about 5,000 docs at a time to be indexed.

2) Search times for a basic search are almost always sub-second. If we toss in some faceting, it takes a little longer, but I've hardly ever seen it go above 1-2 seconds even with the most advanced queries.

Hope that helps.

Charlie

-----Original Message-----
From: Law, John [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 9:28 AM
To: solr-user@lucene.apache.org
Subject: dataset parameters suitable for lucene application

I am new to the list and new to lucene and solr. I am considering Lucene for a potential new application and need to know how well it scales.

Following are the parameters of the dataset.

Number of records: 7+ million
Database size: 13.3 GB
Index Size: 10.9 GB
[snip]
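A rough solrj sketch of that kind of single-threaded, batched indexing loop, assuming trunk solrj's CommonsHttpSolrServer and made-up field names (this is not Charlie's actual program):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexer {
      public static void main(String[] args) throws Exception {
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
          List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
          for (int i = 0; i < 8000000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", i);
              doc.addField("body", "record " + i);   // your real fields here
              batch.add(doc);
              if (batch.size() >= 5000) {            // send ~5,000 docs per request
                  server.add(batch);
                  batch.clear();
              }
          }
          if (!batch.isEmpty()) server.add(batch);   // flush the final partial batch
          server.commit();                           // make everything searchable
      }
  }

A single-threaded loop like this is exactly the "non-optimized" shape described above; running several such loops in parallel is the usual next optimization.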
searching for non-empty fields
I have a large index with a field for a URL. For some reason or another, sometimes a doc will get indexed with that field blank. This is fine, but I want a query to return only the set URL fields...

If I do a query like:

q=URL:[* TO *]

I get a lot of empty fields back alongside real values like http://thing.com

What can I query for to remove the empty fields?
Re: dataset parameters suitable for lucene application
By "maxed out" do you mean that Solr's performance became unacceptable beyond 8.8M records, or that you only had 8.8M records to index? If the former, can you share the particular symptoms? On 9/26/07, Charlie Jackson <[EMAIL PROTECTED]> wrote: > My experiences so far with this level of data have been good. > > Number of records: Maxed out at 8.8 million > Database size: friggin huge (100+ GB) > Index size: ~24 GB > > 1) It took me about a day to index 8 million docs using a non-optimized > program I wrote. It's non-optimized in the sense that it's not > multi-threaded. It batched together groups of about 5,000 docs at a time > to be indexed. > > 2) Search times for a basic search are almost always sub-second. If we > toss in some faceting, it takes a little longer, but I've hardly ever > seen it go above 1-2 seconds even with the most advanced queries. > > Hope that helps. > > > Charlie > > > > -Original Message- > From: Law, John [mailto:[EMAIL PROTECTED] > Sent: Wednesday, September 26, 2007 9:28 AM > To: solr-user@lucene.apache.org > Subject: dataset parameters suitable for lucene application > > I am new to the list and new to lucene and solr. I am considering Lucene > for a potential new application and need to know how well it scales. > > Following are the parameters of the dataset. > > Number of records: 7+ million > Database size: 13.3 GB > Index Size: 10.9 GB > > My questions are simply: > > 1) Approximately how long would it take Lucene to index these documents? > 2) What would the approximate retrieval time be (i.e. search response > time)? > > Can someone provide me with some informed guidance in this regard? > > Thanks in advance, > John > > __ > John Law > Director, Platform Management > ProQuest > 789 Eisenhower Parkway > Ann Arbor, MI 48106 > 734-997-4877 > [EMAIL PROTECTED] > www.proquest.com > www.csa.com > > ProQuest... Start here. > > > >
RE: dataset parameters suitable for lucene application
My experience so far: 200k documents were indexed in 90 mins (including db time); the index size is 200 MB; a keyword query across all 30 string fields takes 0.3-1 sec; a keyword query on one field takes tens of milliseconds.

-----Original Message-----
From: Charlie Jackson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 8:53 AM
To: solr-user@lucene.apache.org
Subject: RE: dataset parameters suitable for lucene application

My experiences so far with this level of data have been good.

Number of records: Maxed out at 8.8 million
Database size: friggin huge (100+ GB)
Index size: ~24 GB
[snip]
How to get debug information while indexing?
Hi,

I am trying to create my own application using SOLR, and while trying to index my data I get:

Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/update
or
Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update

Is there a way to get more debug information than this (any logs, which file is wrong, schema.xml? etc.)? I have modified schema.xml and have my own xml file for indexing.

Thanks for help.
Urvashi
RE: dataset parameters suitable for lucene application
Sorry, I meant that it maxed out in the sense that my maxDoc field on the stats page was 8.8 million, which indicates that the most docs it has ever had was around 8.8 million. It's down to about 7.8 million currently. I have seen no signs of a "maximum" number of docs Solr can handle.

-----Original Message-----
From: Chris Harris [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 11:49 AM
To: solr-user@lucene.apache.org
Subject: Re: dataset parameters suitable for lucene application

By "maxed out" do you mean that Solr's performance became unacceptable beyond 8.8M records, or that you only had 8.8M records to index? If the former, can you share the particular symptoms?
[snip]
RE: dataset parameters suitable for lucene application
Thanks all! One last question...

If I had a collection of 2.5 billion docs and a demand averaging 200 queries per second, what's the confidence that Solr/Lucene could handle this volume and execute search with sub-second response times?

-----Original Message-----
From: Charlie Jackson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 1:32 PM
To: solr-user@lucene.apache.org
Subject: RE: dataset parameters suitable for lucene application

Sorry, I meant that it maxed out in the sense that my maxDoc field on the stats page was 8.8 million, which indicates that the most docs it has ever had was around 8.8 million. It's down to about 7.8 million currently. I have seen no signs of a "maximum" number of docs Solr can handle.
[snip]
Re: dataset parameters suitable for lucene application
No one can answer that, because it depends on how you configure Solr. How many fields do you want to search? Are you using fuzzy search? Facets? Highlighting?

We are searching a much smaller collection, about 250K docs, with great success. We see 80 queries/sec on each of four servers, and response times under 100ms. Each query searches against seven fields and we don't use any of the features I listed above.

wunder

On 9/26/07 10:50 AM, "Law, John" <[EMAIL PROTECTED]> wrote:
> Thanks all! One last question...
>
> If I had a collection of 2.5 billion docs and a demand averaging 200
> queries per second, what's the confidence that Solr/Lucene could handle
> this volume and execute search with sub-second response times?
> [snip]
Re: How to get debug information while indexing?
On 9/26/07, Urvashi Gadi <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am trying to create my own application using SOLR and while trying to
> index my data i get
>
> Server returned HTTP response code: 400 for URL:
> http://localhost:8983/solr/update or
> Server returned HTTP response code: 500 for URL:
> http://localhost:8983/solr/update
>
> Is there a way to get more debug information than this (any logs, which
> file is wrong, schema.xml? etc)

Both the HTTP reason and response body should contain more information. What are you using to communicate with Solr?

Try a bad request with curl and you can see the info that comes back:

[EMAIL PROTECTED] /cygdrive/f/code/lucene
$ curl -i http://localhost:8983/solr/select?q=foo:bar
HTTP/1.1 400 undefined_field_foo
Content-Type: text/html; charset=iso-8859-1
Content-Length: 1398
Server: Jetty(6.1.3)

Error 400
HTTP ERROR: 400
undefined field foo
RequestURI=/solr/select
Powered by Jetty://

Errors should also be logged.

-Yonik
Geographical distance searching
It is a "best practice" to store the master copy of this data in a relational database and use Solr/Lucene as a high-speed cache. MySQL has a geographical database option, so maybe that is a better option than Lucene indexing. Lance (P.s. please start new threads for new topics.) -Original Message- From: Sandeep Shetty [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 26, 2007 5:15 AM To: 'solr-user@lucene.apache.org' Subject: custom sorting > Hi Guys, > > this question as been asked before but i was unable to find an answer > thats good for me, so hope you guys can help again i am working on a > website where we need to sort the results by distance from the > location entered by the user. I have indexed the lat and long info for > each record in solr and also i can get the lat and long of the > location input by the user. > Previously we were using lucene to do this. by using the > SortComparatorSource we could sort the documents returned by distance > nicely. we are now switching over to lucene because of the features it > provides, however i am not able to see a way to do this in Solr. > > If someone can point me in the right direction i would be very grateful! > > Thanks in advance, > Sandeep This email is confidential and may also be privileged. If you are not the intended recipient please notify us immediately by telephoning +44 (0)20 7452 5300 or email [EMAIL PROTECTED] You should not copy it or use it for any purpose nor disclose its contents to any other person. Touch Local cannot accept liability for statements made which are clearly the sender's own and are not made on behalf of the firm. Touch Local Limited Registered Number: 2885607 VAT Number: GB896112114 Cardinal Tower, 12 Farringdon Road, London EC1M 3NN +44 (0)20 7452 5300
RE: dataset parameters suitable for lucene application
My limited experience with larger indexes turned up two issues: 1) the logistics of copying around and backing up this much data, and 2) indexing being disk-bound. We're on SAS disks and it makes no difference between one indexing thread and a dozen (we have small records).

Smaller result sets return faster. You need to limit the search results via as many parameters as you can, and filters are the way to do this.

-----Original Message-----
From: Walter Underwood [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: dataset parameters suitable for lucene application

No one can answer that, because it depends on how you configure Solr. How many fields do you want to search? Are you using fuzzy search? Facets? Highlighting?

We are searching a much smaller collection, about 250K docs, with great success. We see 80 queries/sec on each of four servers, and response times under 100ms. Each query searches against seven fields and we don't use any of the features I listed above.
[snip]
Re: dataset parameters suitable for lucene application
On 26-Sep-07, at 10:50 AM, Law, John wrote:
> Thanks all! One last question...
>
> If I had a collection of 2.5 billion docs and a demand averaging 200
> queries per second, what's the confidence that Solr/Lucene could handle
> this volume and execute search with sub-second response times?

No search software can search 2.5 billion docs (assuming web-sized documents) on a single server; 200 queries/second works out to a budget of 5ms per query on one box. You certainly could build such a system with Solr distributed over 100's of nodes, but this is not built into Solr currently.

-Mike
Re: custom sorting
On 26-Sep-07, at 5:14 AM, Sandeep Shetty wrote:
> Hi Guys,
>
> this question has been asked before but i was unable to find an answer
> thats good for me, so hope you guys can help again
> [snip]
>
> This email is confidential and may also be privileged. If you are not
> the intended recipient please notify us immediately by telephoning +44
> (0)20 7452 5300 or email [EMAIL PROTECTED] You should not copy it or
> use it for any purpose nor disclose its contents to any other person.
> Touch Local cannot accept liability for statements made which are
> clearly the sender's own and are not made on behalf of the firm.

Sorry, I'm afraid the above email is already irrevocably publicly archived.

-Mike
What is facet?
Could someone tell me what a facet is? I have a vague idea but I am not too clear. A pointer to a sample web site that uses Solr facets would be very good.

Thanks.
-Kuro
RE: Geographical distance searching
With the new/improved value source functions it should be pretty easy to develop a new best practice. You should be able to pull in the lat/lon values from value source fields and then do your great-circle calculation.

- will

-----Original Message-----
From: Lance Norskog [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 3:15 PM
To: solr-user@lucene.apache.org
Subject: Geographical distance searching

It is a "best practice" to store the master copy of this data in a relational database and use Solr/Lucene as a high-speed cache. MySQL has a geographical database option, so maybe that is a better option than Lucene indexing.
[snip]
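For reference, a minimal sketch of the great-circle (haversine) math such a value source would evaluate per document; the lat/lon field names and the wiring into Solr's ValueSource API are deliberately left out, so treat this as the calculation only, not a drop-in Solr component:

  public class GreatCircle {
      static final double EARTH_RADIUS_KM = 6371.0;

      // Haversine formula: great-circle distance in km between two
      // points given in degrees of latitude and longitude.
      public static double distanceKm(double lat1, double lon1,
                                      double lat2, double lon2) {
          double dLat = Math.toRadians(lat2 - lat1);
          double dLon = Math.toRadians(lon2 - lon1);
          double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                   + Math.cos(Math.toRadians(lat1))
                   * Math.cos(Math.toRadians(lat2))
                   * Math.sin(dLon / 2) * Math.sin(dLon / 2);
          return EARTH_RADIUS_KM * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
      }
  }

Distance sorting then amounts to computing distanceKm(docLat, docLon, userLat, userLon) for each candidate document and ordering ascending.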
Converting German special characters / umlaute
Dear list,

I have two questions regarding German special characters, or umlaute.

Is there an analyzer which automatically converts all German special characters to their dissected form, such as ü to ue and ä to ae, etc.? I would also like the search to always run against the dissected data, but when the results are returned, the initial, unmodified data should come back.

Does the Lucene GermanAnalyzer do this job? I ran across it, but I could not figure out from the documentation whether it does.

thanks a lot in advance.

Matthias

--
Matthias Eireiner
Web Reisen GmbH
Amalienstr. 45
80799 München
+49 (89) 289-22920
[EMAIL PROTECTED]
Geschäftsführung: Gabriel Graf Matuschka - Sitz der Gesellschaft: München
Registergericht: Amtsgericht München, HRB 167305
Re: What is facet?
Faceted search is an approach to search where a taxonomy or categorization scheme is visible in addition to document matches.

http://www.searchtools.com/info/faceted-metadata.html

--Ezra.

On 9/26/07 3:47 PM, "Teruhiko Kurosaka" <[EMAIL PROTECTED]> wrote:
> Could someone tell me what a facet is? I have a vague idea but I am not
> too clear. A pointer to a sample web site that uses Solr facets would
> be very good.
>
> Thanks.
> -Kuro
Re: Converting German special characters / umlaute
Try the SnowballPorterFilterFactory described here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You should use the German2 variant, which converts ä and ae to a, ö and oe to o, and so on. More details:
http://snowball.tartarus.org/algorithms/german2/stemmer.html

Every document in solr can have any number of fields which might have the same source but different field types and are therefore handled differently (stored as-is, analyzed in different ways...). Use copyField in your schema.xml to feed your data into multiple fields. During searching you decide which fields you like to search on (usually the analyzed ones) and which you retrieve when getting the document back.

Tom

Matthias Eireiner schrieb:
> Dear list, I have two questions regarding German special characters or
> umlaute. is there an analyzer which automatically converts all german
> special characters to their dissected form, such as ü to ue and ä to
> ae, etc.?
> [snip]
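A sketch of how that might look in schema.xml -- the type and field names here are made up for illustration, not taken from Tom's setup:

  <fieldType name="text_de" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    </analyzer>
  </fieldType>

  <!-- stored as-is for display; never searched directly -->
  <field name="title" type="string" indexed="false" stored="true"/>
  <!-- analyzed for search; not stored -->
  <field name="title_de" type="text_de" indexed="true" stored="false"/>

  <copyField source="title" dest="title_de"/>

Queries go against title_de, while the returned document carries the untouched title.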
Re: What is facet?
: Faceted search is an approach to search where a taxonomy or categorization
: scheme is visible in addition to document matches.

My ApacheConUS2006 talk went into a little more detail, including the best definition of faceted searching/browsing I've ever seen...

http://people.apache.org/~hossman/apachecon2006us/

"Interaction style where users filter a set of items by progressively selecting from only valid values of a faceted classification system"
Keith Instone, SOASIS&T, July 8, 2004

Specifically regarding the term "facet" ... there we tend to find some ambiguity. Lots of people can describe faceted searching, but most people's concept of a "facet" tends to be very narrow. Since I wrote most of the Solr faceting documentation, it tends to follow my bias (also from my 2006 talk) ...

Explaining My Terms
* Facet: A distinct feature or aspect of a set of objects; "a way in which a resource can be classified"
* Constraint: A viable method of limiting a set of objects

In this regard, "color" is a facet, and "blue" is a constraint on the color facet which may be expressed as the query "color:blue". Likewise "popularity" is a facet, and a constraint query on the popularity facet might be "popularity:high" or it might be "popularity:[100 TO *]" depending on the specifics of how you manage your data.

A more complicated example is that you might define a high-level conceptual facet of "coolness" which does not directly relate to a specific concrete field, but instead relates to a complex query on many fields (hence: Solr's facet.query options), such that the "coolness" facet has constraints...

cool => (popularity:[100 TO *] (+numFeatures:[10 TO *] +price:[0 TO 10]))
lame => (+popularity:[* TO 99] +numFeatures:[* TO 9] +price:[11 TO *])

-Hoss
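To see both styles in one request, a query like the following (against a hypothetical index; line breaks added for readability) asks Solr for counts per color value plus counts for two hand-built popularity constraints:

  http://localhost:8983/solr/select?q=ipod&rows=0
      &facet=true
      &facet.field=color
      &facet.query=popularity:[100 TO *]
      &facet.query=popularity:[* TO 99]

The facet.field line enumerates all color constraints automatically, while each facet.query line is one explicit constraint of the facet.query kind Hoss describes above.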
Re: Converting German special characters / umlaute
: is there an analyzer which automatically converts all german special
: characters to their specific dissected form, such as ü to ue and ä to
: ae, etc.?!

See also the ISOLatin1AccentFilter, which folds such characters (though to "u" rather than "ue") regardless of language.

: I also would like to have, that the search is always run against the
: dissected data. But when the results are returned the initial data with
: the non modified data should be returned.

stored fields (returned to clients) always contain the original field value, regardless of which analyzer/tokenizer/tokenfilters you use.

PS: http://people.apache.org/~hossman/#threadhijack
When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is "hidden" in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/Thread_hijacking

-Hoss
Re: Geographical distance searching
Have you guys seen Local Lucene?

http://www.nsshutdown.com/projects/lucene/whitepaper/*locallucene*.htm

no need for mysql if you don't want to.

rgrds
Ian

Will Johnson wrote:
> With the new/improved value source functions it should be pretty easy
> to develop a new best practice. You should be able to pull in the
> lat/lon values from value source fields and then do your great-circle
> calculation.
> [snip]
Re: searching for non-empty fields
I've experienced a similar problem before. Assuming the field type is "string" (i.e. not tokenized), there is a subtle yet important difference between a field that is null (i.e. not contained in the document) and one that is an empty string (in the document but with no value). See
http://www.nabble.com/indexing-null-values--tf4238702.html#a12067741
for a previous discussion of the issue.

Your query will work if you make sure the URL field is omitted from the document at index time when the field is blank.

cheers,
Piete

On 27/09/2007, Brian Whitman <[EMAIL PROTECTED]> wrote:
> I have a large index with a field for a URL. For some reason or
> another, sometimes a doc will get indexed with that field blank.
> [snip]
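A minimal solrj-flavored sketch of that index-time check (the field names are illustrative only):

  import org.apache.solr.common.SolrInputDocument;

  public class UrlDocBuilder {
      // Leave the URL field out entirely when there is no value, rather
      // than indexing an empty string, so that URL:[* TO *] matches only
      // docs which really have a URL.
      public static SolrInputDocument build(String id, String url) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", id);
          if (url != null && url.trim().length() > 0) {
              doc.addField("URL", url);
          }
          return doc;
      }
  }

The same guard works in any client: the point is simply never to send the field with an empty value.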
Re: Geographical distance searching
Might want to remove the *'s around that url:
http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene.htm

There's also a downloadable demo:
http://www.nsshutdown.com/solr-example_s1.3_ls0.2.tgz

Start it up as you would a normal solr example:

$ cd solr-example/apache-solr*/example
$ java -jar start.jar

Open up firefox (sorry, the demo ui was quick and dirty so firefox only) and go to http://localhost:8983/localcinema/ -- make sure you specify localhost, there's a google maps key based upon the url's domain -- and click 'Go' at the bottom of the page. The demo comes with some sample data already indexed for the NY region, so have a play.

P.S. After a little tidy-up I'll be adding this to both lucene's and solr's repositories if folks feel that it's a useful addition.

Thanks
Patrick

Ian Holsman wrote:
> Have you guys seen Local Lucene?
> http://www.nsshutdown.com/projects/lucene/whitepaper/*locallucene*.htm
> no need for mysql if you don't want to.
> [snip]

--
Patrick O'Leary

You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles. Do you understand this? And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that there is no cat.
- Albert Einstein
anyone can send me jetty-plus
I can't download it from http://jetty.mortbay.org/jetty5/plus/index.html

--
regards
jl
Re: searching for non-empty fields
> Your query will work if you make sure the URL field is omitted from the
> document at index time when the field is blank.

adding something like: to the schema field should do it without needing to ensure it is not null or "" on the client side.

ryan
Re: searching for non-empty fields
: > Your query will work if you make sure the URL field is omitted from the
: > document at index time when the field is blank.
:
: adding something like: to the schema field should do it without needing
: to ensure it is not null or "" on the client side.

...and to work around the problem until you reindex...

q=(URL:[* TO *] -URL:"")

...at least: I'm 97% certain that will work. It won't help if your "empty" values are really " " or " " or ...

-Hoss