Re: delete snapshot??
How can I remove them from time to time? For the snapcleaner script I only seem to have the option to delete by day.
Thanks a lot Noble, and sorry again for all these questions.

Noble Paul നോബിള് नोब्ळ् wrote:
>
> The hardlinks will prevent the unused files from getting cleaned up.
> So the diskspace is consumed for unused index files also. You may need
> to delete unused snapshots from time to time
> --Noble
Re: dealing with logs - feature advice based on a use case
Marc,

I don't have a multicore setup that's itching for better logging, but I think what you are suggesting is good. If I had a multicore setup I might want either separate logs or the option to log the core name. Perhaps an Enhancement-type JIRA entry is in order?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

From: Marc Sturlese
To: solr-user@lucene.apache.org
Sent: Wednesday, January 14, 2009 11:54:09 PM
Subject: dealing with logs - feature advice based on a use case

Hey there,
Just want to explain a feature I think would be really useful for the future. In my use case I need a log per core. I spoke about this feature before. My idea was to separate the logs with log4j, but I saw it was not that easy. In the other thread we spoke about passing the core name to the loggers. Doing that would require so much hacking that I decided against it (otherwise it would be almost impossible to upgrade to new releases). I think it would be great to have this in Solr.

To work around it, what I have done is use log4j and log all messages to the syslog. Once there, I have bash scripts that redirect the messages depending on the core name they contain. Apparently this would solve my problem, but there are lots of messages that don't contain the core name, so I can't redirect them to the right log file.

So, another possible solution would be to have the core name in all log messages. Don't you think that would be useful in many use cases?

Thanks in advance
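For illustration, one way to get the core name into every log line without patching Solr's own log calls is to put it in the logging MDC around the work being done and reference it from the log4j pattern. This is only a rough sketch assuming an SLF4J/log4j setup (it is not something Solr does out of the box), and the surrounding request-handling code is hypothetical:

    // log4j pattern that prints the MDC value, for example:
    //   log4j.appender.FILE.layout.ConversionPattern=%d %p [%X{core}] %c - %m%n
    org.slf4j.MDC.put("core", core.getName());   // "core" is the SolrCore in scope
    try {
        // handle the request / run the import; log lines emitted here carry the core name
    } finally {
        org.slf4j.MDC.remove("core");
    }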
Re: delete snapshot??
Hi, snapcleaner lets you delete snapshots by one of the following two criteria: - delete all but last N snapshots - delete all snapshots older than N days Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch From: sunnyfr To: solr-user@lucene.apache.org Sent: Tuesday, February 17, 2009 4:17:39 PM Subject: Re: delete snapshot?? How can I remove from time to time, because for the script snapcleaner I just have the option to delete last day ??? thanks a lot Noble and sorry again for all this question, Noble Paul നോബിള് नोब्ळ् wrote: > > The hardlinks will prevent the unused files from getting cleaned up. > So the diskspace is consumed for unused index files also. You may need > to delete unused snapshots from time to time > --Noble > > On Tue, Feb 17, 2009 at 5:24 AM, sunnyfr wrote: >> >> Hi Noble, >> >> I maybe don't get something >> Ok if it's hard link but how come i've not space left on device error and >> 30G shown on the data folder ?? >> sorry I'm quite new >> >> 6.0G /data/solr/book/data/snapshot.20090216214502 >> 35M /data/solr/book/data/snapshot.20090216195003 >> 12M /data/solr/book/data/snapshot.20090216195502 >> 12K /data/solr/book/data/spellchecker2 >> 36M /data/solr/book/data/snapshot.20090216185502 >> 37M /data/solr/book/data/snapshot.20090216203502 >> 6.0M /data/solr/book/data/index >> 12K /data/solr/book/data/snapshot.20090216204002 >> 5.8G /data/solr/book/data/snapshot.20090216172020 >> 12K /data/solr/book/data/spellcheckerFile >> 28K /data/solr/book/data/snapshot.20090216200503 >> 40K /data/solr/book/data/snapshot.20090216194002 >> 24K /data/solr/book/data/snapshot.2009021622 >> 32K /data/solr/book/data/snapshot.20090216184502 >> 20K /data/solr/book/data/snapshot.20090216191004 >> 1.1M /data/solr/book/data/snapshot.20090216213502 >> 1.1M /data/solr/book/data/snapshot.20090216201502 >> 1.1M /data/solr/book/data/snapshot.20090216213005 >> 24K /data/solr/book/data/snapshot.20090216191502 >> 1.1M /data/solr/book/data/snapshot.20090216212503 >> 107M /data/solr/book/data/snapshot.20090216212002 >> 14M /data/solr/book/data/snapshot.20090216190502 >> 32K /data/solr/book/data/snapshot.20090216201002 >> 2.3M /data/solr/book/data/snapshot.20090216204502 >> 28K /data/solr/book/data/snapshot.20090216184002 >> 5.8G /data/solr/book/data/snapshot.20090216181425 >> 44K /data/solr/book/data/snapshot.20090216190001 >> 20K /data/solr/book/data/snapshot.20090216183401 >> 1.1M /data/solr/book/data/snapshot.20090216203002 >> 44K /data/solr/book/data/snapshot.20090216194502 >> 36K /data/solr/book/data/snapshot.20090216185004 >> 12K /data/solr/book/data/snapshot.20090216182720 >> 12K /data/solr/book/data/snapshot.20090216214001 >> 5.8G /data/solr/book/data/snapshot.20090216175106 >> 1.1M /data/solr/book/data/snapshot.20090216202003 >> 5.8G /data/solr/book/data/snapshot.20090216173224 >> 12K /data/solr/book/data/spellchecker1 >> 1.1M /data/solr/book/data/snapshot.20090216202502 >> 30G /data/solr/book/data >> thanks a lot, >> >> >> Noble Paul നോബിള് नोब्ळ् wrote: >>> >>> they are just hardlinks. they do not consume space on disk >>> >>> On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr wrote: Hi, Ok but can I use it more often then every day like every three hours, because snapshot are quite big. Thanks a lot, Bill Au wrote: > > The --delete option of the rsync command deletes extraneous files from > the > destination directory. It does not delete Solr snapshots. To do that > you > can use the snapcleaner on the master and/or slave. 
> > Bill > > On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr > wrote: > >> >> root 26834 16.2 0.0 19412 824 ? S 16:05 0:08 >> rsync >> -Wa >> --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ >> /data/solr/books/data/snapshot.20090213160051-wip >> >> Hi obviously it can't delete them because the adress is bad it >> shouldnt >> be >> : >> rsync://##.##.##.##:18180/solr/snapshot.20090213160051/ >> but: >> rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/ >> >> Where should I change this, I checked my script.conf on the slave >> server >> but >> it seems good. >> >> Because files can be very big and my server in few hours is getting >> full. >> >> So actually snapcleaner is not necessary on the master ? what about >> the >> slave? >> >> Thanks a lot, >> Sunny >> -- >> View this message in context: >> http://www.nabble.com/delete-snapshot---tp21998333p21998333.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/delete-snapshot---tp21
Re: Outofmemory error for large files
On Tue, Feb 17, 2009 at 1:10 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > Right. But I was trying to point out that a single 150MB Document is not > in fact what the o.p. wants to do. For example, if your 150MB represents, > say, a whole book, should that really be a single document? Or should > individual chapters be separate documents, for example? > > Yes, a 150MB document is probably not a good idea. I am only trying to point out that even if he writes multiple documents in a 150MB batch, he may still hit the OOME because all the XML is written to memory first and then out to the server. -- Regards, Shalin Shekhar Mangar.
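To make the batching idea concrete, here is a rough SolrJ sketch of sending documents in smaller batches instead of one huge request; the server URL, field names and the "chapters" list are invented for illustration, and the snippet is a fragment (it uses org.apache.solr.client.solrj and org.apache.solr.common):

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    int i = 0;
    for (String chapterText : chapters) {          // "chapters" is a hypothetical List<String>
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "book1_chapter" + (++i));
        doc.addField("text", chapterText);
        batch.add(doc);
        if (batch.size() == 100) {                 // send every 100 docs instead of all at once
            server.add(batch);
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        server.add(batch);
    }
    server.commit();

Each add() call then carries only a small slice of the data, so the client never has to hold the whole 150MB payload in memory at once.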
Facet search on Multi-Valued Fields
Hi all,

I have been experimenting with Solr faceted search for 2 weeks, but I am hitting a performance limitation on facet search. My Solr index contains 4,000,000 documents. Normal searching is fairly fast, but faceted search is extremely slow.

I am trying to do a facet search on 3 fields (all multivalued) in one query. field1 has 2 million distinct values, field2 has 1.5 million distinct values, and field3 has 50,000 distinct values.

I have already set the filterCache size to 3,000,000, but searching is still very slow. Each query normally takes 5 minutes or more. As I narrow down the search, the speed increases dramatically.

Is there any way to optimize the faceted search? Any help is appreciated. Thanks in advance.

Regards

GC
Re: Facet search on Multi-Valued Fields
Have you tried a nightly build with the new facet algorithm (it is activated by default)?
http://www.nabble.com/new-faceting-algorithm-td20674902.html
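For reference, a request against a recent nightly build exercising the new per-field faceting could look roughly like the following; the field names come from the earlier message, and availability of facet.method depends on the build you are running:

    http://localhost:8983/solr/select?q=*:*&rows=0&facet=true
        &facet.field=field1&facet.field=field2&facet.field=field3
        &facet.limit=20&facet.mincount=1&facet.method=fc

With millions of distinct values per field, keeping facet.limit small also matters for response size.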
Re: Multilanguage
I was looking for such a tool and haven't found it yet.

Using StandardAnalyzer one can obtain some form of token stream which can be used for language-agnostic analysis. Clearly, then, something that matches words against per-language dictionaries and decides on the language of the majority could do a decent job of choosing the analyzer.

Does such a tool exist? It doesn't seem too hard for Lucene.

paul

On 17 Feb 2009, at 04:44, Otis Gospodnetic wrote:

> The best option would be to identify the language after parsing the PDF and then index it using an appropriate analyzer defined in schema.xml.
Re: Facet search on Multi-Valued Fields
Nope, I am using the latest stable version, Solr 1.3.0.

Thanks for your tips. Besides this, is there anything else I should do? I am reading some previous threads about index optimization
(http://www.mail-archive.com/solr-user@lucene.apache.org/msg05290.html). Will it improve the facet search speed?

GC
Re: Multilanguage
Paul Libbrecht wrote:

> Clearly, then, something that matches words in a dictionary and decides on the language based on the language of the majority could do a decent job to decide the analyzer.
>
> Does such a tool exist?

I once played around with http://ngramj.sourceforge.net/ for language guessing. It did a good job. It doesn't use dictionaries for language identification but a statistical approach using ngrams. I don't have any precise numbers, but out of about 1 documents in different languages (most in English, German and French, few in other European languages like Polish) there were only some 10 not identified correctly.

Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
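To make the n-gram idea concrete, here is a small self-contained Java toy that guesses a language by comparing character trigram profiles. It only illustrates the statistical approach Till describes; it is not ngramj's API, and real profiles would be trained on a sizable corpus per language:

    import java.util.HashMap;
    import java.util.Map;

    public class TrigramLanguageGuesser {
        private final Map<String, Map<String, Integer>> profiles = new HashMap<String, Map<String, Integer>>();

        /** Build a trigram profile for a language from sample text. */
        public void train(String language, String sampleText) {
            profiles.put(language, trigramCounts(sampleText));
        }

        /** Return the trained language whose profile overlaps the text best. */
        public String guess(String text) {
            Map<String, Integer> target = trigramCounts(text);
            String best = null;
            double bestScore = -1.0;
            for (Map.Entry<String, Map<String, Integer>> entry : profiles.entrySet()) {
                double score = overlap(target, entry.getValue());
                if (score > bestScore) {
                    bestScore = score;
                    best = entry.getKey();
                }
            }
            return best;
        }

        // count character trigrams of the lower-cased text
        private Map<String, Integer> trigramCounts(String text) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            String t = text.toLowerCase();
            for (int i = 0; i + 3 <= t.length(); i++) {
                String gram = t.substring(i, i + 3);
                Integer c = counts.get(gram);
                counts.put(gram, c == null ? 1 : c + 1);
            }
            return counts;
        }

        // fraction of the text's trigram occurrences that the profile also contains
        private double overlap(Map<String, Integer> target, Map<String, Integer> profile) {
            long matched = 0, total = 0;
            for (Map.Entry<String, Integer> e : target.entrySet()) {
                total += e.getValue();
                if (profile.containsKey(e.getKey())) {
                    matched += e.getValue();
                }
            }
            return total == 0 ? 0.0 : (double) matched / total;
        }
    }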
Re: Facet search on Multi-Valued Fields
Well, doing an optimize after indexing will always improve your search speed a little bit, but with the new facet algorithm you will see a huge improvement. Other things to consider are to index and store only the necessary fields, and to set omitNorms wherever possible... there are many tips around... keep reading ;)
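As a small example of that kind of schema trimming, a facet-only field can usually skip storage and norms entirely; the field name here is just illustrative:

    <field name="category" type="string" indexed="true" stored="false"
           omitNorms="true" multiValued="true"/>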
Re: Facet search on Multi-Valued Fields
Thank you very much.
Finding total range of dates for date faceting
Hi, I'm trying to write some code to build a facet list for a date field, but I don't know what the first and last available dates are. I would adjust the gap param accordingly. If there is a 10yr stretch between min(date) and max(date) I'd want to facet by year. If it is a 1 month gap, I'd want to facet by day. Is there a way to do this? Thanks, Jacob -- +1 510 277-0891 (o) +91 33 7458 (m) web: http://pajamadesign.com Skype: pajamadesign Yahoo: jacobsingh AIM: jacobsingh gTalk: jacobsi...@gmail.com
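One way to discover the endpoints (a sketch only; "server" is an existing SolrServer and the field name "created" is made up) is to issue two cheap queries sorted on the date field, one ascending and one descending, each returning a single row, and size the gap from the difference:

    SolrQuery minQ = new SolrQuery("*:*");
    minQ.setRows(1);
    minQ.addSortField("created", SolrQuery.ORDER.asc);
    Date min = (Date) server.query(minQ).getResults().get(0).getFieldValue("created");

    SolrQuery maxQ = new SolrQuery("*:*");
    maxQ.setRows(1);
    maxQ.addSortField("created", SolrQuery.ORDER.desc);
    Date max = (Date) server.query(maxQ).getResults().get(0).getFieldValue("created");

    long spanMs = max.getTime() - min.getTime();   // pick facet.date.gap from this, e.g. +1YEAR vs +1DAY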
Re: Multilanguage
Does Apache Tika help find the language of the given document?
DIH full-import with clean=true fails and rollback empties index
Hi there,
I've got a pretty simple question regarding the DIH full-import command.
I have a Solr server running that has a full index with lots of documents in it. Once a day, a full-import is run, which uses the default parameters (clean=true, because it's not an incremental index). When I run a full-import, the first step is cleaning up the whole index:

Feb 7, 2009 2:12:01 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX

After that, suppose the import suddenly fails for one reason or another (e.g. an SQL error), which initiates a rollback:

Feb 7, 2009 2:12:02 AM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
[...]
Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)

Unfortunately, this rollback does not "refill" the index with the old data, and it does not keep the old index from being overwritten with the new, erroneous one. Now my question is: is there anything I can do to keep Solr from trashing my index on a full-import when there is a problem with the database?
Or should I use clean=false, even though 99% of the imported documents are not incremental but the same documents that were already in the index, only with new data?
Any tips will be greatly appreciated! :)
- Steffen
Query regarding setTimeAllowed(Integer) and setRows(Integer)
Hi,

I am trying to avoid queries which take a lot of server time. For this I plan to use the setRows(Integer) and setTimeAllowed(Integer) methods while creating the SolrQuery. I would like to know the following:

1. If I set SolrQuery.setRows(5000), will the processing of the query stop once 5000 results are found, or will the query be completely processed, the result set sorted based on rank boosting, and the top 5000 results returned?

2. If I set SolrQuery.setTimeAllowed(2000), will this kill query processing after 2 seconds? (I know this question sounds silly but I just want a confirmation from the experts :) )

Is there anything else I can do to get the desired results?

Thanks,
Kumar
Re: DIH full-import with clean=true fails and rollback empties index
On Tue, Feb 17, 2009 at 4:42 PM, Steffen B. wrote: > > Unfortunately, this rollback does not "refill" the index with the old data, > and neither keeps the old index from being overwritten with the new, > erroneous index. Now my question is: is there anything I can do to keep > Solr > from trashing my index on a full-import when there is a problem with the > database? This is not good. I'll try to write some tests and try to find the cause. > > Or should I use clean=false, even though 99% of the imported documents are > not incremental but the same documents that already were in the index, only > with new data? Use clean=false for the time being. The old documents will be replaced with the new ones (old and new must have same uniqueKey). -- Regards, Shalin Shekhar Mangar.
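For the time being, the full-import can be invoked with the clean flag turned off via the request parameters, along these lines (host, port and path are placeholders):

    http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true

With clean=false the existing documents stay in place and incoming documents overwrite them by uniqueKey, so a run that fails partway through leaves the previous data intact.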
Re: DIH full-import with clean=true fails and rollback empties index
Maybe you can try "postImportDeleteQuery" (not yet documented, SOLR-801) on a root entity. You can keep a timestamp field in each document holding the value of ${dataimporter.index_start_time}, and use that to remove old docs which were in the index before the indexing started.
--Noble Paul
2 strange behaviours with DIH full-import.
Hey,
I have 2 problems that I think are really important and may be useful to other users:

1.) I am running 3 cores in a single Solr instance. Each core contains about a million and a half docs. Once a full-import has run in a core, only a little of the Java memory is freed. Once that first full-import is done and I run another full-import on another core, the memory used by the first full-import is never freed. Once the second full-import is done I run the third... and I run out of memory! Is this a Solr bug in freeing memory, or am I missing something? Is there any way to tell Solr to free all memory after a full-import? It's a really severe problem in my case, as I can't keep restarting the Tomcat server (I have other cron actions synchronized with it).

2.) I run a full-import and everything works fine... I run another full-import on the same core and everything seems to work fine. But I have noticed that the index in the /data/index dir is two times bigger. I have seen that Solr uses this IndexWriter constructor when it executes a deleteAll at the beginning of the full-import:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/index/IndexWriter.html#IndexWriter(org.apache.lucene.store.Directory,%20org.apache.lucene.analysis.Analyzer,%20boolean,%20org.apache.lucene.index.IndexDeletionPolicy,%20org.apache.lucene.index.IndexWriter.MaxFieldLength)
Why is Lucene not deleting the data of the old index if the "create" boolean of the constructor is set to true? (The results are not duplicated, but physically the /index directory is double the size.) Has this something to do with the deletionPolicy saving commits, or with a Lucene 2.9-dev bug, or something like that?

I am running a nightly build (from the beginning of January, with some patches that have appeared for concurrent-indexing problems) with Lucene 2.9-dev.

I would appreciate any advice, as these two problems are really driving me crazy and I don't know how to sort them out... especially the first one.
Thanks in advance!!
Re: Multilanguage
Hi,

No, Tika doesn't do LangID. I haven't used ngramj, so I can't speak for its accuracy or speed (but I know the code has been around for years). Another LangID implementation is at the URL below my name.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: indexing Chinese language
CharFilter can normalize (convert) traditional Chinese to simplified Chinese or vice versa, if you define a mapping.txt. Here is a sample of Chinese character normalization:
https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

See SOLR-822 for the details:
https://issues.apache.org/jira/browse/SOLR-822

Koji

revathy arun wrote:
> Hi,
>
> When I index Chinese content using the Chinese tokenizer and analyzer in Solr 1.3, some of the Chinese text files are getting indexed but others are not.
>
> Since Chinese has many different language subtypes, such as standard Chinese, simplified Chinese etc., which of these does the Chinese tokenizer support, and is there any method to find the type of Chinese language from the file?
>
> Rgds
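For illustration, an analyzer using the SOLR-822 CharFilter might be wired up roughly like this in schema.xml; it assumes a build that includes SOLR-822, and "mapping-zh.txt" is a mapping file you would create yourself, with lines such as "乾" => "干":

    <fieldType name="text_zh" class="solr.TextField">
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-zh.txt"/>
        <tokenizer class="solr.CJKTokenizerFactory"/>
      </analyzer>
    </fieldType>

The char filter rewrites characters before tokenization, so traditional and simplified variants end up as the same indexed terms.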
Re: Word Locations & Search Components
Hmm, Otis, very nice!

Koji

Otis Gospodnetic wrote:
> Hi,
>
> Wouldn't this be as easy as:
> - split email into "paragraphs"
> - for each paragraph compute signature (MD5 or something fuzzier, like in SOLR-799)
> - for each signature look for other emails with this signature
> - when you find an email with an identical signature, you know you've found the "banner"
>
> I'd do this in a pre-processing phase. You may have to add special logic for ">" and other email-quoting characters. Perhaps you can make use of the assumption that banners always come at the end of emails. Perhaps you can make use of situations where the banner appears multiple times in a single email (the one with lots of back-and-forth replies, for example).
>
> This is similar to MoreLikeThis on paragraph level.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
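A minimal sketch of the pre-processing step Otis describes, in plain Java; how the email bodies are obtained and what is done with the frequent signatures is left out:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ParagraphSignatures {
        /** Count how many times each paragraph signature occurs across the corpus;
         *  paragraphs whose signature recurs across many emails are banner candidates. */
        public static Map<String, Integer> signatureFrequencies(List<String> emailBodies) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            Map<String, Integer> freq = new HashMap<String, Integer>();
            for (String body : emailBodies) {
                // naive paragraph split on blank lines; ">" quoting chars would need stripping first
                for (String paragraph : body.split("\\n\\s*\\n")) {
                    String normalized = paragraph.replaceAll("\\s+", " ").trim();
                    if (normalized.length() == 0) continue;
                    String sig = new BigInteger(1, md5.digest(normalized.getBytes("UTF-8"))).toString(16);
                    Integer c = freq.get(sig);
                    freq.put(sig, c == null ? 1 : c + 1);
                }
            }
            return freq;
        }
    }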
Re: Finding total range of dates for date faceting
It *looks* as though Solr supports returning the results of arbitrary calculations: http://wiki.apache.org/solr/SolrQuerySyntax However, I am so far unable to get any example working except in the context of a dismax bf. It seems like one ought to be able to write a query to return the doc matching the max OR the min of a particular field. -Peter On Tue, Feb 17, 2009 at 5:33 AM, Jacob Singh wrote: > Hi, > > I'm trying to write some code to build a facet list for a date field, > but I don't know what the first and last available dates are. I would > adjust the gap param accordingly. If there is a 10yr stretch between > min(date) and max(date) I'd want to facet by year. If it is a 1 month > gap, I'd want to facet by day. > > Is there a way to do this? > > Thanks, > Jacob > > -- > > +1 510 277-0891 (o) > +91 33 7458 (m) > > web: http://pajamadesign.com > > Skype: pajamadesign > Yahoo: jacobsingh > AIM: jacobsingh > gTalk: jacobsi...@gmail.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: DIH transformers - sect 2
>On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie wrote: >> >> 2) Having used TemplateTransformer to assign a value to an >> entity column that column cannot be used in other >> TemplateTransformer operations. In my project I am >> attempting to reuse "x.fileWebPath". To fix this, the >> last line of transformRow() in TemplateTransformer.java >> needs replaced with the following which as well as >> 'putting' the templated-ed string in 'row' also saves it >> into the 'resolver'. >> >> **originally** >> row.put(column, resolver.replaceTokens(expr)); >> } >> >> **new** >> String columnName = map.get(DataImporter.COLUMN); >> expr=resolver.replaceTokens(expr); >> row.put(columnName, expr); >> resolverMapCopy.put(columnName, expr); >> } > >isn't it better to write a custom transformer to achieve this. I did >not want a standard component to change the state of the >VariableResolver . > >I am not sure what is the best way. > Noble, (Good to have email working :-) Hmm not sure why this requires a custom transformer. Why is this not more in the nature of a bug fix? Also the current behavior temporarily adds all the column names into the resolver for the duration of the TemplateTransformer's operation, removing them again at the end. I do not think there is any permanent change to the state of the VariableResolver. Surely if we have defined a value for a column, that value should be temporarily available in subsequent template or regexp operations? Fergus. >> >> >> >> >> >>> processor="FileListEntityProcessor" >> fileName="^.*\.xml$" >> newerThan="'NOW-1000DAYS'" >> recursive="true" >> rootEntity="false" >> dataSource="null" >> baseDir="/Volumes/spare/ts/solr/content" >> > >>> dataSource="myfilereader" >> processor="XPathEntityProcessor" >> url="${jc.fileAbsolutePath}" >> rootEntity="true" >> stream="false" >> forEach="/record | /record/mediaBlock" >> >> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer"> >> >> >> > replaceWith="/ford$1" sourceColName="fileAbsolutePath"/> >> >> >> >> > xpath="/record/metadata/da...@qualifier='pubDate']" >> dateTimeFormat="MMdd" /> >> >> > xpath="/record/mediaBlock/mediaObject/@vurl" /> >> > template="${dataimporter.request.fordinstalldir}" /> >> >> >> > template="${dataimporter.request.contentinstalldir}" /> >> >> > replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/> >> > replaceWith="$1/imagery/${x.vurl}.jpg" sourceColName="fileWebPath"/> >> > template="${jc.fileAbsolutePath}#${x.vurl}" /> >> >> >> >> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: snapshot created if there is no documente updated/new?
A snapshot is created every time snapshooter is invoked, even if there is no change in the index. However, since snapshots are created using hard links, no additional space is used if there are no changes to the index. It does use up one directory entry in the data directory.

Bill

On Mon, Feb 16, 2009 at 5:03 AM, sunnyfr wrote:
> Hi
>
> I would like to know if a snapshot is automatically created even if there is no document updated or added?
>
> Thanks a lot,
Re: snapshot as big as the index folder?
Snapshots are created using hard links. So even though a snapshot is as big as the index, it is not taking up any more space on the disk. The size of the snapshot will change as the size of the index changes.

Bill

On Mon, Feb 16, 2009 at 9:50 AM, sunnyfr wrote:
> It changes a lot in a few minutes?? Is that normal? Thanks.
>
> 5.8G  book/data/snapshot.20090216153346
> 4.0K  book/data/index
> 5.8G  book/data/
> r...@search-07:/data/solr# du -h book/data/
> 5.8G  book/data/snapshot.20090216153346
> 3.7G  book/data/index
> 4.0K  book/data/snapshot.20090216153759
> 9.4G  book/data/
> r...@search-07:/data/solr# du -h book/data/
> 5.8G  video/data/snapshot.20090216153346
> 4.4G  book/data/index
> 4.0K  book/data/snapshot.20090216153759
> 11G   book/data/
> r...@search-07:/data/solr# du -h book/data/
> 5.8G  book/data/snapshot.20090216153346
> 5.8G  book/data/index
> 4.0K  book/data/snapshot.20090216154819
> 4.0K  book/data/snapshot.20090216154820
> 15M   book/data/snapshot.20090216153759
> 12G   book/data/
>
> sunnyfr wrote:
> >
> > Hi,
> >
> > Is it normal, or did I miss something?
> > 5.8G  book/data/snapshot.20090216153346
> > 12K   book/data/spellchecker2
> > 4.0K  book/data/index
> > 12K   book/data/spellcheckerFile
> > 12K   book/data/spellchecker1
> > 5.8G  book/data/
> >
> > Last update?
> > 92562
> > 45492
> > 0
> > 2009-02-16 15:20:01
> > 2009-02-16 15:20:01
> > 2009-02-16 15:20:42
> > 2009-02-16 15:20:42
> > 13223
> > -
> > Indexing completed. Added/Updated: 13223 documents. Deleted 0 documents.
> > 2009-02-16 15:33:50
> > 2009-02-16 15:33:50
> > 0:13:48.853
> >
> > Thanks a lot,
Re: delete snapshot??
usage: snapcleaner -D <days> | -N <num> [-d dir] [-u username] [-v]
       -D <days>   cleanup snapshots more than <days> days old
       -N <num>    keep the most recent <num> snapshots and clean up the
                   remaining ones that are not being pulled
       -d          specify directory holding index data
       -u          specify user to sudo to before running script
       -v          increase verbosity
       -V          output debugging info

Bill

On Tue, Feb 17, 2009 at 3:24 AM, Otis Gospodnetic wrote:
> Hi,
>
> snapcleaner lets you delete snapshots by one of the following two criteria:
> - delete all but the last N snapshots
> - delete all snapshots older than N days
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
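As a concrete illustration of the two options (the paths are placeholders for your own installation and data directory):

    # keep only the 5 most recent snapshots for this index
    /opt/solr/bin/snapcleaner -N 5 -d /data/solr/book/data

    # or: remove all snapshots older than 2 days
    /opt/solr/bin/snapcleaner -D 2 -d /data/solr/book/data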
Re: delete snapshot??
I run snapcleaner from cron. That cleans up old snapshots once each day. Here is a crontab line that runs it at 30 minutes past the hour, every hour:

30 * * * * /apps/wss/solr_home/bin/snapcleaner -N 3

wunder
Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)
Requesting 5000 rows will use a lot of server time, because it has to fetch the information for 5000 results when it makes the response. It is much more efficient to request only the results you will need, usually 10 at a time. wunder On 2/17/09 3:30 AM, "Jana, Kumar Raja" wrote: > Hi, > > > > I am trying to avoid queries which take a lot of server time. For this I > plan to use setRows(Integer) and setTimeAllowed(Integer) methods while > creating the SolrQuery. I would like to know the following: > > > > 1. I set SolrQuery.setRows(5000) Will the processing of the query > stop once 5000 results are found or the query will be completely > processed and then the result set is sorted out based on Rank Boosting > and the top 5000 results are returned? > > 2. If I set SolrQuery.setTimeAllowed(2000) Will this kill query > processing after 2 secs? (I know this question sounds silly but I just > want a confirmation from the experts J ) > > > > Is there anything else I can do to get the desired results? > > > > Thanks, > > Kumar >
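A short SolrJ sketch of that pattern (the query string is a placeholder and "server" is an existing SolrServer):

    SolrQuery q = new SolrQuery("ipod");
    q.setRows(10);           // fetch only what will actually be displayed
    q.setStart(0);           // for page 2 use setStart(10), and so on
    q.setTimeAllowed(2000);  // stop collecting matches after roughly 2 seconds
    QueryResponse rsp = server.query(q);

The total match count is still reported in the response, so paging does not lose any information; it just avoids building and transferring 5000 stored documents per request.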
Store content out of solr
Hello,

We are indexing information from different sources, so we would like to centralize the content itself somewhere I can retrieve it from using the ID stored in Solr.

Has anyone done something like this, and do you have any advice? I am thinking of storing the content in a database like MySQL.

Thanks,
--
"Without love, we are birds with broken wings."
Morrie
Re: Multilanguage
Hi Otis,

But this is not freeware, right?
Re: Store content out of solr
Sure, we are doing essentially that with our Drupal integration module - each search result contains a link to the "real" content, which is stored in MySQL, etc, and presented via the Drupal CMS. http://drupal.org/project/apachesolr -Peter On Tue, Feb 17, 2009 at 11:57 AM, roberto wrote: > Hello, > > We are indexing information from diferent sources so we would like to > centralize the information content so i can retrieve using the ID > provided buy solr? > > Does anyone did something like this, and have some advices ? I > thinking in store the information into a database like mysql ? > > Thanks, > -- > "Without love, we are birds with broken wings." > Morrie > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
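In generic Java terms the pattern looks roughly like this: keep just the ID (plus the searchable fields) in Solr, then look the displayable content up in the external store by that ID. This is only a fragment using SolrJ and plain JDBC, and the URLs, table, column and field names are invented:

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrDocumentList hits = server.query(new SolrQuery("title:morrie")).getResults();

    Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/content", "user", "pass");
    PreparedStatement ps = conn.prepareStatement("SELECT body FROM documents WHERE id = ?");
    for (SolrDocument hit : hits) {
        ps.setString(1, (String) hit.getFieldValue("id"));
        ResultSet rs = ps.executeQuery();
        if (rs.next()) {
            String body = rs.getString("body");   // hand the full content to the presentation layer
        }
        rs.close();
    }
    ps.close();
    conn.close();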
Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)
Jana, Kumar Raja wrote: 2. If I set SolrQuery.setTimeAllowed(2000) Will this kill query processing after 2 secs? (I know this question sounds silly but I just want a confirmation from the experts J That is the idea, but only some of the code is within the timer. So, there are cases where a query could exceed the timeAllowed specified because the bulk of the work for that particular query is not in the actual collect, for example, an expensive range query. -Sean
Re: Store content out of solr
A common approach (for web search engines) is to use HBase [1] as a "Document Repository". Each document indexed inside Solr will have an entry (row, identified by the document URL) in the HBase table. This works great when you deal with a large data collection (it scales better than a SQL database). The counterpart is that it is slightly slower than a local database. [1] http://hadoop.apache.org/hbase/ -- Renaud Delbru roberto wrote: Hello, We are indexing information from diferent sources so we would like to centralize the information content so i can retrieve using the ID provided buy solr? Does anyone did something like this, and have some advices ? I thinking in store the information into a database like mysql ? Thanks,
Re: Multilanguage
There are a number of options for freeware here; just do some searching on your favorite Internet search engine. TextCat is one of the more popular, as I seem to recall: http://odur.let.rug.nl/~vannoord/TextCat/

I believe Karl Wettin submitted a Lucene patch for a language guesser: http://issues.apache.org/jira/browse/LUCENE-826 but it is marked as won't fix.

Nutch has a Language Identification plugin as well (the document in the link below) that probably isn't too hard to extract the source from for your needs.

Also see http://www.lucidimagination.com/search/?q=multilingual+detection and http://www.lucidimagination.com/search/?q=language+detection for help.

If purchasing, several companies offer solutions, but I don't know that their quality is any better than what you can get through open source; generally speaking, the problem is solved with a high degree of accuracy through n-gram analysis.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene: http://www.lucidimagination.com/search
Re: Multilanguage
On 2/17/09 12:26 PM, "Grant Ingersoll" wrote: > If purchasing, several companies offer solutions, but I don't know > that their quality is any better than what you can get through open > source, as generally speaking, the problem is solved with a high > degree of accuracy through n-gram analysis. The expensive part of the problem is getting a good corpus in each language, tuning the classifier, and QA. The commercial ones usually recognize encoding and language, which is more complicated. Sorting out the ISO-2022 codes is a real mess, for example. Pre-Unicode PDF files are also a horror. To do it right, you need to recognize which fonts are Central European, and so on. wunder
making changes to solr schema
Preface: This is my first attempt at using Solr. What happens if I need to make a change to a Solr schema that's already in production? Can fields be added or removed? Can a type change from an integer to a float? Thanks in advance, Jon -- Jonathan Haddad http://www.rustyrazorblade.com
embedded wildcard search not working?
This is a straightforward question, but I haven't been able to figure out what is going on in my application. I seem to be able to search on trailing wildcards just fine. For example, fieldName:a* returns documents with apple, aardvark, etc. in them. But if I try to search a field containing 'apple' with 'a*e', nothing comes back. My gut is telling me that I should be using a different data type or a different filter option. Here is how my text type is defined: Thanks for your help.
Reading Core-Specific Config File in a Row Transformer
I'm using the DataImportHandler to load data. I created a custom row transformer, and inside of it I'm reading a configuration file. I am using the system's solr.solr.home property to figure out which directory the file should be in. That works for a single-core deployment, but not for multi-core deployments (since I'm always looking in solr.solr.home/conf/file.txt). Is there a clean way to resolve the actual conf directory path from within a custom row transformer so that it works for both single-core and multi-core deployments? Thanks, Wojtek -- View this message in context: http://www.nabble.com/Reading-Core-Specific-Config-File-in-a-Row-Transformer-tp22069449p22069449.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Reading Core-Specific Config File in a Row Transformer
On Wed, Feb 18, 2009 at 5:53 AM, wojtekpia wrote: > > Is there a clean way to resolve the actual > conf directory path from within a custom row transformer so that it works > for both single-core and multi-core deployments? > You can use Context.getSolrCore().getInstanceDir() -- Regards, Shalin Shekhar Mangar.
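Putting Shalin's suggestion into a transformer might look roughly like this. Transformer and Context are the real DataImportHandler classes, and getSolrCore().getInstanceDir() is the accessor named above (newer Solr versions expose the instance path differently); the file name and the field written back into the row are made-up placeholders:

    import java.io.File;
    import java.util.Map;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;

    public class ConfigAwareTransformer extends Transformer {
        @Override
        public Object transformRow(Map<String, Object> row, Context context) {
            // Resolve the conf directory of whichever core is running this import,
            // instead of assuming solr.solr.home/conf.
            String instanceDir = context.getSolrCore().getInstanceDir();
            File configFile = new File(new File(instanceDir, "conf"), "file.txt");
            // ... read configFile and use it to adjust fields in the row ...
            row.put("configPath", configFile.getAbsolutePath());
            return row;
        }
    }

Because the path is derived from the core at runtime, the same jar works unchanged in single-core and multi-core deployments.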
Re: making changes to solr schema
On Wed, Feb 18, 2009 at 3:37 AM, Jonathan Haddad wrote: > Preface: This is my first attempt at using solr. > > What happens if I need to do a change to a solr schema that's already > in production? Can fields be added or removed? You may need a core reload or a server restart. Fields can be added, and subsequent document additions take advantage of them. Fields can be removed if you are no longer going to use them in queries. > > Can a type change from an integer to a float? In general, type changes may require re-indexing the data. > > Thanks in advance, > Jon > > -- > Jonathan Haddad > http://www.rustyrazorblade.com > -- --Noble Paul
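If I recall the SolrJ helper correctly, the core reload Noble mentions can be triggered along these lines; the core name and URL are placeholders, and a type change still means re-indexing by resubmitting the documents afterwards:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class ReloadCore {
        public static void main(String[] args) throws Exception {
            // Point at the CoreAdmin handler (the root Solr URL, not a specific core).
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // Ask Solr to reload the core so the edited schema.xml is picked up.
            CoreAdminRequest.reloadCore("core0", server);
        }
    }

The same RELOAD action is also available as a plain HTTP request to the CoreAdmin handler, which may be simpler if no SolrJ client is at hand.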
Data Normalization in Solr.
Hi, I want to store normalized data in Solr. For example, I am splitting personal information (fname, lname, mname) into one Solr record and addresses (personal, office) into another record, with different ids such as 123212_name and 123212_add. In some cases I need fields from both the personal and the address records (say fname, lname and office address only) in a single XML response, with a single HTTP request. Is that possible? Thanks, kalidoss.m,
RE: Query regarding setTimeAllowed(Integer) and setRows(Integer)
Thanks, wunder, for the response. What I would like to know is this: if I limit the result set from Solr to 10 and my query actually matches, say, 1000 documents, will query processing stop the moment the search finds the first 10 documents? Or will the entire search be carried out, the results sorted by score, and the top 10 returned? -Kumar

-Original Message- From: Walter Underwood [mailto:wunderw...@netflix.com] Sent: Tuesday, February 17, 2009 10:06 PM To: solr-user@lucene.apache.org Subject: Re: Query regarding setTimeAllowed(Integer) and setRows(Integer) Requesting 5000 rows will use a lot of server time, because it has to fetch the information for 5000 results when it makes the response. It is much more efficient to request only the results you will need, usually 10 at a time. wunder

On 2/17/09 3:30 AM, "Jana, Kumar Raja" wrote: > Hi, > I am trying to avoid queries which take a lot of server time. For this I plan to use the setRows(Integer) and setTimeAllowed(Integer) methods while creating the SolrQuery. I would like to know the following: > 1. If I set SolrQuery.setRows(5000), will the processing of the query stop once 5000 results are found, or will the query be completely processed, the result set sorted by rank, and the top 5000 results returned? > 2. If I set SolrQuery.setTimeAllowed(2000), will this kill query processing after 2 secs? (I know this question sounds silly, but I just want a confirmation from the experts :) ) > Is there anything else I can do to get the desired results? > Thanks, > Kumar >
RE: Query regarding setTimeAllowed(Integer) and setRows(Integer)
Thanks, Sean. That clears up the timer concept. Is there any other way I can make sure that server time is not wasted? -Original Message- From: Sean Timm [mailto:tim...@aol.com] Sent: Wednesday, February 18, 2009 1:00 AM To: solr-user@lucene.apache.org Subject: Re: Query regarding setTimeAllowed(Integer) and setRows(Integer) Jana, Kumar Raja wrote: > 2. If I set SolrQuery.setTimeAllowed(2000), will this kill query processing after 2 secs? (I know this question sounds silly, but I just want a confirmation from the experts :) ) That is the idea, but only some of the code is within the timer. So there are cases where a query could exceed the specified timeAllowed because the bulk of the work for that particular query is not in the actual collect phase, for example an expensive range query. -Sean
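In practice the usual way to avoid wasted server time is to page through results with a small rows value instead of asking for thousands at once. A minimal SolrJ sketch, with the server URL and query string as placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class PagedQuery {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("title:solr");
            query.setRows(10);            // fetch only what will actually be displayed
            query.setStart(0);            // increase in steps of 10 to page further
            query.setTimeAllowed(2000);   // best-effort cap, as Sean describes above

            QueryResponse rsp = server.query(query);
            System.out.println("total matches: " + rsp.getResults().getNumFound());
            System.out.println("returned this page: " + rsp.getResults().size());
        }
    }

The numFound value still reports the full match count, so small page sizes cost nothing in terms of information while keeping response building cheap.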
Re: Data Normalization in Solr.
Hi, There are no entity relationships in Solr and there are no joins, so the simplest thing to do in this case is to issue two requests. You could also write a custom SearchComponent that internally does two requests and returns a single unified response. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

From: Kalidoss MM To: solr-user@lucene.apache.org Sent: Wednesday, February 18, 2009 2:44:15 PM Subject: Data Normalization in Solr. Hi, I want to store normalized data in Solr. For example, I am splitting personal information (fname, lname, mname) into one Solr record and addresses (personal, office) into another record, with different ids such as 123212_name and 123212_add. In some cases I need fields from both the personal and the address records (say fname, lname and office address only) in a single XML response, with a single HTTP request. Is that possible? Thanks, kalidoss.m,
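A rough SolrJ sketch of the "two requests" option Otis describes; the ids and field names come from the question and are otherwise assumptions, and merging the two documents into one response is left to the client (or to a custom SearchComponent if it must happen inside Solr):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class TwoRequestLookup {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrDocument name = first(server, "id:123212_name");
            SolrDocument address = first(server, "id:123212_add");

            // Client-side "join": pull the requested fields from each record.
            System.out.println(name.getFieldValue("fname") + " "
                    + name.getFieldValue("lname") + " / "
                    + address.getFieldValue("officeaddress"));
        }

        // Fetch the first (and only expected) document matching the id query.
        private static SolrDocument first(SolrServer server, String q) throws Exception {
            SolrQuery query = new SolrQuery(q);
            query.setRows(1);
            return server.query(query).getResults().get(0);
        }
    }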
Re: embedded wildcard search not working?
Jim, Does app*l or even a*p* work? Perhaps "apple" gets stemmed to something that doesn't end in "e", such as "appl"? Regarding your config, you probably want to lowercase before removing stop words, so you'll want to change the order of those filters a bit; that's not related to your wildcard question, though. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

From: Jim Adams To: solr-user@lucene.apache.org Sent: Wednesday, February 18, 2009 6:30:22 AM Subject: embedded wildcard search not working? This is a straightforward question, but I haven't been able to figure out what is going on in my application. I seem to be able to search on trailing wildcards just fine. For example, fieldName:a* returns documents with apple, aardvark, etc. in them. But if I try to search a field containing 'apple' with 'a*e', nothing comes back. My gut is telling me that I should be using a different data type or a different filter option. Here is how my text type is defined: Thanks for your help.
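If stemming turns out to be the culprit, one common workaround is to run wildcard searches against a copy of the field indexed with a lighter analysis chain. A hedged schema.xml sketch; the type name and filter choices are illustrative, not Jim's actual definition (which was stripped from the mail):

    <!-- A text type with no stemming, so wildcard queries like a*e can match
         the indexed tokens literally. The lowercase filter runs before
         stop-word removal, as suggested above. -->
    <fieldType name="text_unstemmed" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>

A copyField from the original field into a field of this type keeps normal searches stemmed while giving wildcards something literal to match against.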