Russian stopwords
I am trying to filter Russian stopwords but have not been successful with that. I am using the following schema entry - . .. Interestingly, Russian synonyms are working fine. English and Russian synonyms get searched correctly. Also, if I add an English-language word to stopwords.txt it gets filtered correctly. It's the Russian words that are not getting filtered as stopwords. Can someone explain this behaviour? Thanks, Tushar.
Re: new faceting algorithm
Yonik Seeley schrieb: We'd love some feedback on how it works to ensure that it actually is a win for the majority and should be the default. I just did a quick test using Solr nightly 2008-11-30. I have an index of about 2.9 million bibliographic records, size: 16G. I tested faceting author names; each index document may contain multiple author names, so author names go into a multivalued field (not analyzed). Queries used for testing were extracted from log files of a prototype application. With facet.method=enum, 50 request threads, I get an average response time of about 19000(!) ms, no cache evictions. With 1 request thread: about 1800 ms. With facet.method=fc, 50 threads, I get an average response time of around 300 ms. 1 thread: 16 ms. Seems to be a major improvement at first sight :-) Regards, Till -- Till Kinstler Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG) Platz der Göttinger Sieben 1, D 37073 Göttingen [EMAIL PROTECTED], +49 (0) 551 39-13431, http://www.gbv.de
JSONResponseWriter bug ? (solr-1.3)
Hi, I think I've discovered a bug with the JSONResponseWriter: starting from the following query - http://127.0.0.1:8080/solr-urbamet/select?q=(tout:1)&rows=0&sort=TITRE+desc&facet=true&facet.query=SUJET:b*&facet.field=SUJET&facet.prefix=b&facet.limit=1&facet.missing=true&wt=json&json.nl=arrarr - which produced a NullPointerException (see the stacktrace below), I played with the parameters and obtained the following results:

##PAGINATION
rows: starting from 0, the exception occurs until we pass a certain threshold => rows implicated

##SORTING
the rows threshold mentioned above seems to be influenced by the presence/absence of the sort parameter

##FACETS
facet=false => OK while facet=true => NullPointerException => facets implicated
--
facet.field absent => OK while facet.field=whatever => NullPointerException => facet.field implicated
--
facet.missing=false => OK while facet.missing=true => NullPointerException => facet.missing implicated
--
facet.limit=-1 or 0 => OK while facet.limit>0 => NullPointerException => facet.limit implicated
--
facet.query absent or facet.query = whatever => NullPointerException => facet.query not implicated
--
facet.offset=(several values or absent) => NullPointerException => facet.offset not implicated
--
=> facet.sort not implicated (true or false => NullPointerException)
--
=> facet.mincount not implicated (several values or absent => NullPointerException)

#ResponseWriter
wt=standard => ok while wt=json => NullPointerException => jsonwriter implicated
json.nl=flat or map => ok => jsonwriter 'arrarr' format implicated

I hope this debugging is readable and will help. -- Grégoire Neuville

Stacktrace :
GRAVE: java.lang.NullPointerException
at org.apache.solr.request.JSONWriter.writeStr(JSONResponseWriter.java:607)
at org.apache.solr.request.JSONWriter.writeNamedListAsArrArr(JSONResponseWriter.java:245)
at org.apache.solr.request.JSONWriter.writeNamedList(JSONResponseWriter.java:294)
at org.apache.solr.request.TextResponseWriter.writeVal(TextResponseWriter.java:151)
at org.apache.solr.request.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:175)
at org.apache.solr.request.JSONWriter.writeNamedList(JSONResponseWriter.java:288)
at org.apache.solr.request.TextResponseWriter.writeVal(TextResponseWriter.java:151)
at org.apache.solr.request.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:175)
at org.apache.solr.request.JSONWriter.writeNamedList(JSONResponseWriter.java:288)
at org.apache.solr.request.TextResponseWriter.writeVal(TextResponseWriter.java:151)
at org.apache.solr.request.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:175)
at org.apache.solr.request.JSONWriter.writeNamedList(JSONResponseWriter.java:288)
at org.apache.solr.request.JSONWriter.writeResponse(JSONResponseWriter.java:88)
at org.apache.solr.request.JSONResponseWriter.write(JSONResponseWriter.java:49)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:257)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:595)
multiValued multiValued fields
Hello, I want to index a field with an array of arrays, is that possible in Solr? I.e. I have one multi-valued field with persons and would like one multi-valued field with their employers, but sometimes there is more than one employer per person and therefore it would've been good to use a multi-valued multi-valued field: Person-field: ["Andersson, John","Svensson, Marcus"] Employer-field: [ [ "Volvo","Saab" ] , [ "Ericsson", "Nokia", "Motorola" ] ] From these fields I could easily retrieve which companies are associated with which person. Thanks in advance // Joel
Can Solr follow links?
Hello, Is there any way for Solr to follow links stored in my database and index the content of these files and HTTP-resources? Thanks in advance! // Joel
Re: new faceting algorithm
Till Kinstler schrieb: > I just did a quick test using Solr nightly 2008-11-30. [...] > Seems to be a major improvement at first sight :-) same here: multi-valued author fields were the bottleneck with 1.3 for us, too. I'm currently testing with 1.5 million records, ~1.2 million of which have values for the author field, but with ~2 million distinct values. With Solr 1.3 we had average response times of 15000-25000 ms for 10 parallel requests (depending on cache settings), with 1.4 they are now down to 230 ms... Regards, Andre -- Andre Hagenbruch Projekt "Integriertes Bibliotheksportal" Universitaetsbibliothek Bochum, Etage 4/Raum 6 Fon: +49 234 3229346, Fax: +49 234 3214736
Re: Is there a clean way to determine whether a core exists?
Wow -- thanks for all the help!! With everyone's help, I did end up in a *much* better place:

private static boolean solrCoreExists(String coreName, String solrRootUrl) throws IOException, SolrServerException {
    CommonsHttpSolrServer adminServer = new CommonsHttpSolrServer(solrRootUrl);
    CoreAdminResponse status = CoreAdminRequest.getStatus(coreName, adminServer);
    return status.getCoreStatus(coreName).get("instanceDir") != null;
}

On Dec 5, 2008, at 1:09 AM, Ryan McKinley wrote: yes: http://localhost:8983/solr/admin/cores?action=STATUS will give you a list of running cores. However that is not easy to check with a simple status != 404 see: http://wiki.apache.org/solr/CoreAdmin On Dec 4, 2008, at 11:46 PM, Chris Hostetter wrote: : Subject: Is there a clean way to determine whether a core exists? doesn't the CoreAdminHandler's STATUS feature make this easy? -Hoss
Re: new faceting algorithm
Hi Yonik, May I ask in which class(es) this improvement was made? I've been using the DocSet, DocList, BitDocSet, HashDocSet from Solr from a few years ago with a Lucene based app. to do faceting. Thanks, Peter On Mon, Nov 24, 2008 at 11:12 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > A new faceting algorithm has been committed to the development version > of Solr, and should be available in the next nightly test build (will > be dated 11-25). This change should generally improve field faceting > where the field has many unique values but relatively few values per > document. This new algorithm is now the default for multi-valued > fields (including tokenized fields) so you shouldn't have to do > anything to enable it. We'd love some feedback on how it works to > ensure that it actually is a win for the majority and should be the > default. > > -Yonik >
DataImportHandler - time stamp format in
In the dataimport.properties file, there is the timestamp:

#Thu Dec 04 15:36:22 EST 2008
last_index_time=2008-12-04 15\:36\:20

I am using Oracle (10g) and would like to know which timestamp format I have to use in Oracle. Thanks, Jae
Re: JSONResponseWriter bug ? (solr-1.3)
Thanks for the report Grégoire, it definitely looks like a bug. Would you mind opening a JIRA issue for this? -Yonik On Fri, Dec 5, 2008 at 6:26 AM, Grégoire Neuville <[EMAIL PROTECTED]> wrote: > Hi, > > I think I've discovered a bug with the JSONResponseWriter : starting > from the following query - > > http://127.0.0.1:8080/solr-urbamet/select?q=(tout:1)&rows=0&sort=TITRE+desc&facet=true&facet.query=SUJET:b*&facet.field=SUJET&facet.prefix=b&facet.limit=1&facet.missing=true&wt=json&json.nl=arrarr > > - which produced a NullPointerException [...]
Re: Solr on Solaris
I have the same experience. What is the CPU in the Solaris box? It does not depend on the operating system (Linux or Solaris); it depends on the CPU (Intel or SPARC). I don't know why, but based on my performance tests, a SPARC machine requires MORE memory for a Java application. Jae On Thu, Dec 4, 2008 at 10:40 PM, Kashyap, Raghu <[EMAIL PROTECTED]>wrote: > We are running solr on a solaris box with 4 CPU's(8 cores) and 3GB Ram. > When we try to index sometimes the HTTP Connection just hangs and the > client which is posting documents to solr doesn't get any response back. > We since then have added timeouts to our http requests from the clients. > > > > I then get this error. > > > > java.lang.OutOfMemoryError: requested 239848 bytes for Chunk::new. Out > of swap space? > > java.lang.OutOfMemoryError: unable to create new native thread > > Exception in thread "JmxRmiRegistryConnectionPoller" > java.lang.OutOfMemoryError: unable to create new native thread > > > > We are running JDK 1.6_10 on the solaris box. . The weird thing is we > are running the same application on linux box with JDK 1.6 and we > haven't seen any problem like this. > > > > Any suggestions? > > > > -Raghu > >
RE: Solr on Solaris
Jon, What do you mean by off a "Zone"? Please clarify -Raghu -Original Message- From: Jon Baer [mailto:[EMAIL PROTECTED] Sent: Thursday, December 04, 2008 9:56 PM To: solr-user@lucene.apache.org Subject: Re: Solr on Solaris Just curious, is this off a "zone" by any chance? - Jon On Dec 4, 2008, at 10:40 PM, Kashyap, Raghu wrote: > [...]
RE: Solr on Solaris
Hi Jae, It's an Intel-based CPU. -Raghu -Original Message- From: Jae Joo [mailto:[EMAIL PROTECTED] Sent: Friday, December 05, 2008 9:53 AM To: solr-user@lucene.apache.org Subject: Re: Solr on Solaris [...]
Re: new faceting algorithm
very similar situation to those already reported. 2.9M bibliographic records, with authors being the (previous) bottleneck, and the one we're starting to test with the new algorithm. so far, no load tests, but just in single requests i'm seeing the same improvements... phenomenal improvements, btw, with most example queries taking less than 1/100th of the time. always very impressed with this project/product, and just thought i'd add a "me-too" to the list... cheers, and have a great weekend, rob On Mon, Nov 24, 2008 at 11:12 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > A new faceting algorithm has been committed to the development version > of Solr, and should be available in the next nightly test build (will > be dated 11-25). [...] > -Yonik
Re: new faceting algorithm
Peter, It is the UnInvertedField class. See also: https://issues.apache.org/jira/browse/SOLR-475 Peter Keegan wrote: Hi Yonik, May I ask in which class(es) this improvement was made? I've been using the DocSet, DocList, BitDocSet, HashDocSet from Solr from a few years ago with a Lucene-based app to do faceting. Thanks, Peter
RE: Russian stopwords
Hi Tushar, On 12/05/2008 at 5:18 AM, tushar kapoor wrote: > I am trying to filter russian stopwords but have not been > successful with that. [...] > words="stopwords.txt"/> >ignoreCase="true" expand="false"/> [...] > Intrestingly, Russian synonyms are working fine. English and russian > synonyms get searched correctly. > > Also,If I add an English language word to stopwords.txt it > gets filtered correctly. Its the russian words that are not > getting filtered as stopwords. It might be an encoding issue - StopFilterFactory delegates stopword file reading to SolrResourceLoader.getLines(), which uses an InputStreamReader instantiated with the UTF-8 charset. Is your stopwords.txt encoded as UTF-8? It's strange that synonyms are working fine, though - SynonymFilterFactory reads in the synonyms file using the same mechanism as StopFilterFactory - is it possible that your synonyms file is encoded as UTF-8, but your stopwords file is encoded with a different encoding, perhaps KOI8-R? Like UTF-8, KOI8-R includes the entirety of 7-bit ASCII, so English words would be properly decoded under UTF-8. Steve
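For illustration, one quick way to verify what Steve describes (whether stopwords.txt really is valid UTF-8) outside of Solr, using only the JDK; the file path is an assumption, so point it at wherever your schema's stopwords.txt actually lives:

import java.io.*;
import java.nio.charset.*;

public class CheckStopwordsEncoding {
    public static void main(String[] args) throws IOException {
        // Path is an assumption -- point it at the stopwords.txt referenced by your schema.
        File f = new File("solr/conf/stopwords.txt");
        // Decode strictly so that any non-UTF-8 bytes are reported instead of silently replaced.
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(f), decoder));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // Cyrillic stopwords should print intact
            }
            System.out.println("stopwords.txt decodes cleanly as UTF-8");
        } catch (CharacterCodingException e) {
            // KOI8-R or windows-1251 bytes typically trip this
            System.out.println("stopwords.txt is not valid UTF-8 -- re-save it as UTF-8: " + e);
        } finally {
            reader.close();
        }
    }
}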
Re: Solr on Solaris
Are you running Solr in a container? More specifically, I've had a few issues w/ zones in the past with Solr (I believe there are some networking issues w/ older Solaris versions) ... They are basically where you can slice ("virtualize") your resources and divide a box up into something similar to a VPS ... http://www.sun.com/bigadmin/content/zones/ - Jon On Dec 5, 2008, at 10:58 AM, Kashyap, Raghu wrote: Jon, What do you mean by off a "Zone"? Please clarify -Raghu [...]
Re: Merging Indices
On Fri, Dec 5, 2008 at 5:09 AM, ashokc <[EMAIL PROTECTED]> wrote: > > The SOLR wiki says > > >>3. Make sure both indexes you want to merge are closed. > > What exactly does 'closed' mean? I think that would mean that the IndexReader and IndexWriter on that index are closed. > 1. Do I need to stop SOLR search on both indexes before running the merge > command? So a brief downtime is required? I think so. > Or do I simply prevent any 'updates/deletes' to these indices during the > merge time so they can still serve up results (read only?) while I am > creating a new merged index? > > 2. Before the new index replaces the old index, do I need to stop SOLR for > that instance? Or can I simply move the old index out and place the new > index in the same place, without having to stop SOLR The rsync-based replication in Solr uses a similar scheme. It creates hardlinks to the new index files over the old ones. > 3. If SOLR has to be stopped during the merge operation, can we work with a > redundant/failover instance and stagger the merge so the search service > will > not go down? Any guidelines here are welcome. It is not very clear as to what you are actually trying to do. Why do you even need to merge indices? Are you creating your index outside of Solr? Just curious to know your use-case. -- Regards, Shalin Shekhar Mangar.
Re: DataImportHandler - time stamp format in
I guess you are trying to pass it in the SQL query. Try it as it is. If Oracle does not take it, you can format the date according to what Oracle likes: http://wiki.apache.org/solr/DataImportHandler#head-5675e913396a42eb7c6c5d3c894ada5dadbb62d7 On Fri, Dec 5, 2008 at 8:09 PM, Jae Joo <[EMAIL PROTECTED]> wrote: > In the dataimport.properties file, there is the timestamp. > > #Thu Dec 04 15:36:22 EST 2008 > last_index_time=2008-12-04 15\:36\:20 > > I am using Oracle (10g) and would like to know which format of timestamp > I have to use in Oracle. > > Thanks, > > Jae > -- --Noble Paul
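If the value does need reformatting for Oracle, the idea is just to parse the dataimport.properties format and hand Oracle a matching format mask. A minimal sketch (the table and column names are invented; inside a DIH deltaQuery the value would normally come from ${dataimporter.last_index_time}):

import java.text.SimpleDateFormat;
import java.util.Date;

public class LastIndexTimeDemo {
    public static void main(String[] args) throws Exception {
        // The format DataImportHandler writes into dataimport.properties
        // (the backslashes before ':' in the file are just properties-file escaping).
        String lastIndexTime = "2008-12-04 15:36:20";
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date parsed = fmt.parse(lastIndexTime);
        System.out.println("parsed: " + parsed);

        // One way to hand it to Oracle: wrap the raw string in TO_DATE with a matching mask.
        // Table and column names are made up for illustration.
        String deltaSql = "SELECT id FROM docs WHERE updated_at > "
                + "TO_DATE('" + lastIndexTime + "', 'YYYY-MM-DD HH24:MI:SS')";
        System.out.println(deltaSql);
    }
}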
Re: Can Solr follow links?
Look at http://wiki.apache.org/solr/DataImportHandler You may use an outer entity with SqlEntityProcessor and an inner entity with XPathEntityProcessor On Fri, Dec 5, 2008 at 5:35 PM, Joel Karlsson <[EMAIL PROTECTED]> wrote: > Hello, > > Is there any way for Solr to follow links stored in my database and index > the content of these files and HTTP-resources? > > Thanks in advance! // Joel > -- --Noble Paul
Re: Merging Indices
On Thu, Dec 4, 2008 at 6:39 PM, ashokc <[EMAIL PROTECTED]> wrote: > > The SOLR wiki says > >>>3. Make sure both indexes you want to merge are closed. > > What exactly does 'closed' mean? If you do a commit, and then prevent updates, the index should be closed (no open IndexWriter). > 1. Do I need to stop SOLR search on both indexes before running the merge > command? So a brief downtime is required? > Or do I simply prevent any 'updates/deletes' to these indices during the > merge time so they can still serve up results (read only?) while I am > creating a new merged index? Preventing updates/deletes should be sufficient. > 2. Before the new index replaces the old index, do I need to stop SOLR for > that instance? Or can I simply move the old index out and place the new > index in the same place, without having to stop SOLR Yes, simply moving the index should work if you are careful to avoid any updates since the last commit. > 3. If SOLR has to be stopped during the merge operation, can we work with a > redundant/failover instance and stagger the merge so the search service will > not go down? Any guidelines here are welcome. > > Thanks > > - ashok > -- > View this message in context: > http://www.nabble.com/Merging-Indices-tp20845009p20845009.html > Sent from the Solr - User mailing list archive at Nabble.com. > >
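For reference, the merge itself can also be done directly with Lucene while both indexes are quiescent. A rough sketch, assuming the Lucene 2.4-era API bundled with Solr 1.3 (all paths are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        // Placeholders -- point these at the two closed source indexes and an empty target.
        Directory merged = FSDirectory.getDirectory("/data/solr/merged/index");
        Directory[] sources = new Directory[] {
            FSDirectory.getDirectory("/data/solr/core1/index"),
            FSDirectory.getDirectory("/data/solr/core2/index")
        };

        IndexWriter writer = new IndexWriter(merged, new StandardAnalyzer(),
                true, IndexWriter.MaxFieldLength.UNLIMITED);
        writer.addIndexesNoOptimize(sources);  // copies segments without re-analyzing documents
        writer.optimize();
        writer.close();
    }
}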
Re: Merging Indices
Thanks for the help Yonik & Shalin.It really makes it easy for me if I do not have to stop/start the SOLR app during the merge operations. The reason I have to do this many times a day, is that I am implementing a simple-minded entity-extraction procedure for the content I am indexing. I have a user defined taxonomy into which the current documents, and any new documents should be classified under. The taxonomy defines the nested facet fields for SOLR. When a new document is posted, the user expects to have it available in the right facet right away. My classification procedure is as follows when a new document is added. 1. Create a new temporary index with that document (no taxonomy fields at this time) 2. Search this index with each of the taxonomy terms (synonyms are employed as well through synonyms.txt) and find out which of these categories is a hit for this document. 3. Add a new " > On Thu, Dec 4, 2008 at 6:39 PM, ashokc <[EMAIL PROTECTED]> wrote: >> >> The SOLR wiki says >> 3. Make sure both indexes you want to merge are closed. >> >> What exactly does 'closed' mean? > > If you do a commit, and then prevent updates, the index should be > closed (no open IndexWriter). > >> 1. Do I need to stop SOLR search on both indexes before running the merge >> command? So a brief downtime is required? >> Or do I simply prevent any 'updates/deletes' to these indices during the >> merge time so they can still serve up results (read only?) while I am >> creating a new merged index? > > Preventing updates/deletes should be sufficient. > >> 2. Before the new index replaces the old index, do I need to stop SOLR >> for >> that instance? Or can I simply move the old index out and place the new >> index in the same place, without having to stop SOLR > > Yes, simply moving the index should work if you are careful to avoid > any updates since the last commit. > >> 3. If SOLR has to be stopped during the merge operation, can we work with >> a >> redundant/failover instance and stagger the merge so the search service >> will >> not go down? Any guidelines here are welcome. >> >> Thanks >> >> - ashok >> -- >> View this message in context: >> http://www.nabble.com/Merging-Indices-tp20845009p20845009.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Merging-Indices-tp20845009p20859513.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr on Solaris
Jon, We are running under Tomcat. Thanks for the link, I will check it out. -Raghu -Original Message- From: Jon Baer [mailto:[EMAIL PROTECTED] Sent: Friday, December 05, 2008 10:57 AM To: solr-user@lucene.apache.org Subject: Re: Solr on Solaris [...]
Re: IOException: Mark invalid while analyzing HTML
Was this one ever addressed? I'm seeing it in some small percentage of the documents that I index in 1.4-dev 708596M. I don't see a corresponding JIRA issue. James Brady-3 wrote: > > Hi, > I'm seeing a problem mentioned in Solr-42, Highlighting problems with > HTMLStripWhitespaceTokenizerFactory: > https://issues.apache.org/jira/browse/SOLR-42 > > I'm indexing HTML documents, and am getting reams of "Mark invalid" > IOExceptions: > SEVERE: java.io.IOException: Mark invalid > at java.io.BufferedReader.reset(Unknown Source) > at > org > .apache > .solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:171) > at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java: > 728) > at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java: > 742) > at java.io.Reader.read(Unknown Source) > at org.apache.lucene.analysis.CharTokenizer.next(CharTokenizer.java:56) > at org.apache.lucene.analysis.StopFilter.next(StopFilter.java:118) > at > org > .apache > .solr.analysis.WordDelimiterFilter.next(WordDelimiterFilter.java:249) > at > org.apache.lucene.analysis.LowerCaseFilter.next(LowerCaseFilter.java:33) > at > org > .apache > .solr > .analysis.EnglishPorterFilter.next(EnglishPorterFilterFactory.java:92) > at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:45) > at > org > .apache > .solr.analysis.BufferedTokenStream.read(BufferedTokenStream.java:94) > at > org > .apache > .solr > .analysis > .RemoveDuplicatesTokenFilter.process(RemoveDuplicatesTokenFilter.java: > 33) > at > org > .apache > .solr.analysis.BufferedTokenStream.next(BufferedTokenStream.java:82) > at org.apache.lucene.analysis.TokenStream.next(TokenStream.java:79) > at org.apache.lucene.index.DocumentsWriter$ThreadState > $FieldData.invertField(DocumentsWriter.java:1518) > at org.apache.lucene.index.DocumentsWriter$ThreadState > $FieldData.processField(DocumentsWriter.java:1407) > at org.apache.lucene.index.DocumentsWriter > $ThreadState.processDocument(DocumentsWriter.java:1116) > at > org > .apache > .lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:2440) > at > org > .apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java: > 2422) > at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java: > 1445) > > > This is using a ~1 week old version of Solr 1.3 from SVN. > > One workaround mentioned in that Jira issue was to move HTML stripping > outside of Solr; can anyone suggest a better approach than that? > > Thanks > James > > > -- View this message in context: http://www.nabble.com/IOException%3A-Mark-invalid-while-analyzing-HTML-tp17052153p20859862.html Sent from the Solr - User mailing list archive at Nabble.com.
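The workaround mentioned in the quoted report, moving HTML stripping outside of Solr, can be as simple as a crude client-side cleanup before the document is built. A minimal sketch (a regex strip is not a real HTML parser, and the field names are assumptions):

import org.apache.solr.common.SolrInputDocument;

public class HtmlPreStripper {
    // Very crude tag/entity removal: enough to keep markup away from
    // HTMLStripWhitespaceTokenizerFactory, not a substitute for a proper HTML parser.
    public static String stripHtml(String html) {
        String text = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        text = text.replaceAll("<[^>]+>", " ");
        text = text.replaceAll("&nbsp;", " ").replaceAll("&amp;", "&");
        return text.replaceAll("\\s+", " ").trim();
    }

    public static SolrInputDocument buildDoc(String id, String rawHtml) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);                     // field names are assumptions
        doc.addField("body", stripHtml(rawHtml));   // indexed with a plain text analyzer
        return doc;
    }
}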
Re: Solr on Solaris
You're out of memory :). For each instance of an application server you can technically only allocate something like 1024 MB to the JVM; to take advantage of the memory you need to run multiple instances of the application server. Are you using RAMDirectory with SOLR? On Thu, Dec 4, 2008 at 10:40 PM, Kashyap, Raghu <[EMAIL PROTECTED]> wrote: > [...] -- Jeryl Cook /^\ Pharaoh /^\ http://pharaohofkush.blogspot.com/ "Whether we bring our enemies to justice, or bring justice to our enemies, justice will be done." --George W. Bush, Address to a Joint Session of Congress and the American People, September 20, 2001
getting xml out of a SolrDocument ?
I am using solrj to query solr and the QueryResponse.getResults() returns a SolrDocumentList. There is a SolrDocument in the list with the results I want. The problem is that I want to view these results as XML. How can I get the SolrDocument to give me XML? Thanks in advance. -Dan
creating cores on demand
Our application processes RSS feeds. Its search activity is heavily concentrated on the most recent 24 hours, with modest searching across the past few days, and rare (but important) searching across months or more. So we create a Solr core for each day, and then search the appropriate set of cores for any given date range.

We used to pile up zillions of cores in solr.xml, and open them on every Solr restart. But we kept running out of things: memory, open file descriptors, and threads. So I think I have a better solution. Now, any time we need a core, we create it on the fly. We have solr.xml set up to *not* persist new cores. But of course their data directories are persistent. So far this appears to work great in QA. I've only done limited testing yet, but I believe each core that we create will either "reconnect" to an existing data directory or create a new data directory, as appropriate. Anyone know of problems with this approach?

Here is some of the most important source code (using Solrj), in case someone else finds this approach useful, or in case someone feels motivated to study it for problems. Dean

/**
 * Keeps track of the names of cores that are known to exist, so we don't have to keep checking.
 */
private Set<String> knownCores = new HashSet<String>(20);

/**
 * Returns the {@link SolrServer} for the specified {@code prefix} and {@code day}.
 */
private SolrServer getSolrServer(String prefix, int day) throws SolrServerException, IOException {
    String coreName = prefix + day;
    String serverUrl = solrRootUrl + "/" + coreName;
    try {
        makeCoreAvailable(coreName);
        return new CommonsHttpSolrServer(serverUrl);
    } catch (MalformedURLException e) {
        String message = "Invalid Solr server URL (misconfiguration of solrRootUrl) " + serverUrl + ": " + ExceptionUtil.getMessage(e);
        LOGGER.error(message, e);
        reportError();
        throw new SolrMisconfigurationException(message, e);
    }
}

private synchronized void makeCoreAvailable(String coreName) throws SolrServerException, IOException {
    if (knownCores.contains(coreName)) {
        return;
    }
    if (solrCoreExists(coreName, solrRootUrl)) {
        knownCores.add(coreName);
        return;
    }
    CommonsHttpSolrServer adminServer = new CommonsHttpSolrServer(solrRootUrl);
    LOGGER.info("Creating new Solr core " + coreName);
    CoreAdminRequest.createCore(coreName, coreName, adminServer, solrConfigFilename, solrSchemaFilename);
    LOGGER.info("Successfully created new Solr core " + coreName);
}

private static boolean solrCoreExists(String coreName, String solrRootUrl) throws IOException, SolrServerException {
    CommonsHttpSolrServer adminServer = new CommonsHttpSolrServer(solrRootUrl);
    CoreAdminResponse status = CoreAdminRequest.getStatus(coreName, adminServer);
    return status.getCoreStatus(coreName).get("instanceDir") != null;
}
Re: Solr on Solaris
When you are saying "application server" do you mean tomcat? If yes, I have allocated >8GB of heap to tomcat and it uses it all no problem (64 bit Intel/64 bit Java). -glen 2008/12/5 Jeryl Cook <[EMAIL PROTECTED]>: > your out of memory :). > > each instance of an application server you can technically only > allocate like 1024mb to the JVM, to take advantage of the memory you > need to run multiple instances of the application server. > > are you using RAMDirectory with SOLR? > > On Thu, Dec 4, 2008 at 10:40 PM, Kashyap, Raghu > <[EMAIL PROTECTED]> wrote: >> We are running solr on a solaris box with 4 CPU's(8 cores) and 3GB Ram. >> When we try to index sometimes the HTTP Connection just hangs and the >> client which is posting documents to solr doesn't get any response back. >> We since then have added timeouts to our http requests from the clients. >> >> >> >> I then get this error. >> >> >> >> java.lang.OutOfMemoryError: requested 239848 bytes for Chunk::new. Out >> of swap space? >> >> java.lang.OutOfMemoryError: unable to create new native thread >> >> Exception in thread "JmxRmiRegistryConnectionPoller" >> java.lang.OutOfMemoryError: unable to create new native thread >> >> >> >> We are running JDK 1.6_10 on the solaris box. . The weird thing is we >> are running the same application on linux box with JDK 1.6 and we >> haven't seen any problem like this. >> >> >> >> Any suggestions? >> >> >> >> -Raghu >> >> > > > > -- > Jeryl Cook > /^\ Pharaoh /^\ > http://pharaohofkush.blogspot.com/ > "Whether we bring our enemies to justice, or bring justice to our > enemies, justice will be done." > --George W. Bush, Address to a Joint Session of Congress and the > American People, September 20, 2001 > -- -
Re: Stemmer vs. exact match
On Dec 4, 2008, at 8:19 PM, Jonathan Ariel wrote: Hi! I'm wondering what Solr is really doing with the exact word vs. the stemmed word. So for example I have 2 documents. The first one has the word "convertible" in the title; the second one has "convert". When Solr stems the titles, both will be the same since convertible -> convert. Then when I search "convertible" both documents seem to have the same relevancy... is that right, or does Solr keep track of the original word and give extra score to the fact that I am actually looking for the same exact word that I have in a document? I might be wrong, but it seems to me that it should score that better. Solr doesn't keep track of the original word, unless you tell it to. So, if you are stemming, then you are losing the original word. A common way to solve what you are doing is to actually have two fields, where one is stemmed and one is exact (you can do this with the copyField mechanism in the Schema). Thus, if you want exact match, you search the exact match field, otherwise you search the stemmed field. -Grant
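To make the two-field approach concrete on the query side, a minimal sketch; title_exact and title_stemmed are hypothetical field names populated from the same source (for example via a copyField in the schema), and the boost value is arbitrary:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExactVsStemmedQuery {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Boosting the unstemmed field ranks true "convertible" matches above
        // documents that only match through the stemmed form "convert".
        SolrQuery q = new SolrQuery("title_exact:convertible^10 OR title_stemmed:convertible");
        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}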
Re: getting xml out of a SolrDocument ?
I'd somehow pass through Solr's XML response, or perhaps consider using Solr's XSLT response writer to convert to the format you want. I don't have the magic incantation handy, but it should be possible to make a request through SolrJ and get the raw response string back in whatever format you want. Erik On Dec 5, 2008, at 3:02 PM, Dan Robin wrote: I am using solrj to query solr and the QueryResponse.getResults() returns a SolrDocumentList. There is a SolrDocument in the list with the results I want. The problem is that I want to view these results as XML. How can I get the SolrDocument to give me XML? Thanks in advance. -Dan -- View this message in context: http://www.nabble.com/getting-xml-out-of-a-SolrDocument---tp20861491p20861491.html Sent from the Solr - User mailing list archive at Nabble.com.
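One low-tech way to get the raw response without going through SolrJ's object binding at all is to fetch it over HTTP yourself. A minimal sketch using only the JDK (host, core and query are placeholders; wt=xslt&tr=example.xsl would apply a stylesheet from conf/xslt/ instead of returning the native XML):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class RawXmlFetch {
    public static void main(String[] args) throws Exception {
        String q = URLEncoder.encode("title:convertible", "UTF-8");
        URL url = new URL("http://localhost:8983/solr/select?q=" + q + "&wt=xml");

        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        StringBuilder xml = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            xml.append(line).append('\n');
        }
        in.close();
        System.out.println(xml);   // the raw <response>...</response> document
    }
}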
Smaller filterCache giving better performance
I've seen some strange results in the last few days of testing, but this one flies in the face of everything I've read on this forum: Reducing filterCache size has increased performance. I have posted my setup here: http://www.nabble.com/Throughput-Optimization-td20335132.html. My original filterCache was 700,000. Reducing it to 20,000, I found:
- Average response time decreased by 85%
- Average throughput increased by 250%
- CPU time used by the garbage collector decreased by 85%
- The system showed two weird GC issues (reported yesterday at: http://www.nabble.com/new-faceting-algorithm-td20674902.html)
Further reducing the filterCache to 10,000:
- Average response time decreased by another 27%
- Average throughput increased by another 30%
- GC CPU usage also dropped
- System behavior changed after ~30 minutes, with a slight performance degradation
These results came from a load test. I'm running trunk code from Dec 2 with Yonik's faceting improvement turned on. Any thoughts?
Re: Smaller filterCache giving better performance
On 5-Dec-08, at 2:24 PM, wojtekpia wrote: I've seen some strange results in the last few days of testing, but this one flies in the face of everything I've read on this forum: Reducing filterCache size has increased performance. This isn't really unexpected behaviour. The problem with a huge filter cache is that it is fighting with the OS disk cache--the latter of which can be much much more important. Reducing the size of the filter cache gives more to the OS. Try giving 17GB to java, and letting the OS cache the entire index. Increase the filter cache as much as you can without OOM'ing. That should give optimal performance. Note that you don't always need the _whole_ index in the OS cache to get acceptable performance, but if you can afford it, it is a good idea. It is also possible that you are experiencing contention in the filterCache code--have you tried the concurrent filter cache impl? -Mike
Re: getting xml out of a SolrDocument ?
On Fri, Dec 5, 2008 at 5:24 PM, Erik Hatcher <[EMAIL PROTECTED]> wrote: > I'd somehow pass through Solr's XML response, or perhaps consider using > Solr's XSLT response writer to convert to the format you want. I don't have > the magic incantation handy, but it should be possible to make a request > through SolrJ and get the raw response string back in whatever format you > want. One could subclass ResponseParser (or XMLResponseParser) and do nothing but put the entire response body in a String. -Yonik
Re: Smaller filterCache giving better performance
On Fri, Dec 5, 2008 at 5:24 PM, wojtekpia <[EMAIL PROTECTED]> wrote: > > I've seen some strange results in the last few days of testing, but this one > flies in the face of everything I've read on this forum: Reducing > filterCache size has increased performance. > [...] Old faceting used the filterCache exclusively. New faceting only uses it for terms that cover ~5% of the index, so you can reduce the filterCache quite a bit potentially, save more RAM, and increase the amount of memory you can give to the OS cache. -Yonik
Re: Smaller filterCache giving better performance
Reducing the amount of memory given to java slowed down Solr at first, then quickly caused the garbage collector to behave badly (same issue as I referenced above). I am using the concurrent cache for all my caches.
Re: Dealing with field values as key/value pairs
: So i'm basically looking for design pattern/best practice for that scenario : based on people's experience. I've taken two approaches in the past... 1) encode the "id" and the "label" in the field value; facet on it; require clients to know how to decode. This works really well for simple things where the id=>label mappings don't ever change, and are easy to encode (i.e. "01234:Chris Hostetter"). This is a horrible approach when id=>label mappings do change with any frequency. 2) have a separate type of "metadata" document, one per "thing" that you are faceting on, containing fields for the id and the label (and probably a doc_type field so you can tell it apart from your main docs). Then once you've done your main query and gotten the results back faceted on id, you can query for those ids to get the corresponding labels. This works really well if the labels ever change (just reindex the corresponding metadata document) and has the added bonus that you can store additional metadata in each of those docs. In many use cases, for presenting an initial "browse" interface, you can sometimes get away with a cheap search for all metadata docs (or all metadata docs meeting certain criteria) instead of an expensive facet query across all of your main documents. -Hoss
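A tiny sketch of approach 1, encoding and decoding the value client-side; the separator and the sample values are illustrative only:

public class FacetLabelCodec {
    private static final char SEP = ':';

    public static String encode(String id, String label) {
        return id + SEP + label;                    // e.g. "01234:Chris Hostetter"
    }

    public static String[] decode(String facetValue) {
        int i = facetValue.indexOf(SEP);            // split on the FIRST separator only,
        return new String[] {                       // so labels may themselves contain ':'
            facetValue.substring(0, i),
            facetValue.substring(i + 1)
        };
    }

    public static void main(String[] args) {
        String v = encode("01234", "Chris Hostetter");
        String[] parts = decode(v);
        System.out.println(parts[0] + " -> " + parts[1]);   // 01234 -> Chris Hostetter
    }
}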
Re: Ordering updates
On Fri, Dec 5, 2008 at 5:40 AM, Laurence Rowe <[EMAIL PROTECTED]> wrote: > 2008/12/4 Shalin Shekhar Mangar <[EMAIL PROTECTED]>: > > > I think we have a slight misunderstanding here. Because there are many > CMS processes it is possible that the same document will be updated > concurrently (from different web requests). In this case two updates > are sent (one by each process). The problem arises when the two update > requests are processed in a different order to the original database > transactions. Ok, I think I understand your problem now. If multiple processes send update requests, they will overwrite each other, which is not what you want. > I guess the only way to achieve consistency is to stage my indexed > data in a database table and trigger a DataImportHandler to perform > delta imports after each transaction. I agree. You need a transactional mechanism to ensure consistency, so you should use a database. Periodically, you can index this particular table into Solr. However, if you have multi-valued fields, you may run into problems. One more thing that you can think about, depending on your use-case, is whether a small amount of stale data is OK. Do you really need things consistent and up to date all the time in Solr? I also know of cases where people have removed frequently changing fields from Solr and fetched them from the DB at the time of page render. Of course, that doesn't work when you need to sort by that frequently changing field. > >> From what I can tell this conditional indexing feature is not > >> supported by Solr. Might it be supported by Lucene but not exposed by > >> Solr? > >> > > > > No this is not supported by either of Lucene/Solr. > > > This is a pity, eventual consistency is a nice model. > > Regards, > > Laurence -- Regards, Shalin Shekhar Mangar.