A problem of tracking the commits of Lucene using SHA num
Thanks for your patience and help. Recently I acquired a batch of commit SHA data for Lucene, spanning 2010 to 2015. To recover the original commit information, I tried to use these SHAs to look up the commits. First, I cloned the Lucene repository to my local host with git clone https://github.com/apache/lucene-solr.git. Then I ran git show [commit SHA] to get each commit's history record, but it failed like this: >> git show be5672c0c242d658b7ce36f291b74c344de925c7 >> fatal: bad object be5672c0c242d658b7ce36f291b74c344de925c7 After that, I cloned another mirror of Apache Lucene & Solr (https://github.com/mdodsworth/lucene-solr, last updated 2014/08/30), and there the lookup returned the right record. I also tried to track a commit using its title message. However, for the same commit, e.g. "LUCENE-5909: Fix stupid bug", I found different SHAs in the two mirror repositories (https://github.com/apache/lucene-solr/commit/3c0d111d07184e96a73ca6dc05c6227d839724e2 and https://github.com/mdodsworth/lucene-solr/commit/4bc8dde26371627d11c299f65c399ecb3240a34c), which confused me. In summary: 1) has the method used to generate a commit's SHA changed at some point? 2) since the second mirror repository stopped updating in 2014, how can I track all the commits in my dataset? Thanks so much!
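For what it's worth, a minimal sketch of locating a commit by its message rather than by SHA in the current GitHub mirror (the exact output naturally depends on the state of that repository):

    git clone https://github.com/apache/lucene-solr.git
    cd lucene-solr
    # search the full history of all branches for the commit message
    git log --all --oneline --grep="LUCENE-5909"
    # then inspect a matching commit
    git show 3c0d111d07184e96a73ca6dc05c6227d839724e2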
Re: A problem of tracking the commits of Lucene using SHA num
Dear Shawn and Chris, Thanks very much for your replies and help. And sorry for the mistakes in my first use of the mailing lists. On 11/9/2017 5:13 PM, Shawn wrote: > Where did this information originate? My SHA data come from the paper On the Naturalness of Buggy Code (Baishakhi Ray, et al., ICSE '16), downloaded from http://odd-code.github.io/Data.html. On 11/9/2017 6:10 PM, Chris wrote: > Also -- What exactly are you trying to do? what is your objective? I want to analyze the statistical properties of buggy code with some learning models on Ray's experimental dataset. Because of its large size, Ray did not put the entire dataset online. What I could acquire is a batch of commit SHAs and some other info. So I need to pick out the old commits that correspond to these SHAs. On 17/9/2017 1:47 PM, Shawn wrote: > The commit data you're using is nearly useless, because the repository > where it originated has been gone for nearly two years. If you can find > out how it was generated, you can build a new version from the current > repository -- either on github or from Apache's official servers. Thanks for all of your suggestions and help, I am going to try other ways. Thanks so much. Best, Xian
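If the dataset has to be re-keyed against the current repository, one hedged way to do it is to join on commit subject lines; a small sketch (file names are made up, and it assumes the subjects were preserved when the history was regenerated):

    # Dump every commit in the current clone as "<new SHA><TAB><subject>"
    git log --all --format='%H%x09%s' > current_shas.tsv
    # Look up the new SHA for a subject taken from the old dataset
    grep -F "LUCENE-5909: Fix stupid bug" current_shas.tsv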
dataimport handler
Hi, I am trying to use the DataImportHandler (Solr 4.6) against an Oracle database, but I have some issues mapping the data. I have 3 columns in test_table: column1, column2, id. dataconfig.xml: The issue is: - if I remove the id column from the table, indexing fails; Solr looks for an id column even though it is not mapped in dataconfig.xml. - if I add it, Solr maps the id column from the db directly to the Solr id and ignores column1, even though that is the one mapped. My problem is that I don't have an ID in every table; I should be able to map whichever column I choose from the table to the Solr id. Any solution will be greatly appreciated. Tom -- View this message in context: http://lucene.472066.n3.nabble.com/dataimport-handler-tp4112830.html Sent from the Solr - User mailing list archive at Nabble.com.
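A minimal sketch of a data-config entity that maps an arbitrary database column onto the Solr uniqueKey (table and column names are the hypothetical ones from the question):

    <entity name="test" query="select column1, column2 from test_table">
      <!-- Oracle usually reports column labels in upper case, so match that here -->
      <field column="COLUMN1" name="id"/>
      <field column="COLUMN2" name="column2"/>
    </entity>

Aliasing in SQL ("select column1 as id, column2 from test_table") is the other common route; either way the schema's uniqueKey field still has to receive a value for every document.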
TemplateTransformer returns null values
Hi, I am trying a simple transformer on data input using DIH, Solr 4.6. When I run the import with the query below, I get null values for new_url. What is wrong? I even tried with "${document_solr.id}" for the name. data-config.xml: The log output is below: 8185946 [Thread-29] INFO org.apache.solr.search.SolrIndexSearcher - Opening Searcher@5a5f4cb7 realtime 8185960 [Thread-29] INFO org.apache.solr.handler.dataimport.JdbcDataSource - Creating a connection for entity document_solr with URL: jdbc:oracle:thin:@vluedb01:1521:iedwdev 8186225 [Thread-29] INFO org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for getConnection(): 265 8186226 [Thread-29] DEBUG org.apache.solr.handler.dataimport.JdbcDataSource - Executing SQL: select DOC_IDN as id, BILL_IDN as bill_id from document_solr 8186291 [Thread-29] TRACE org.apache.solr.handler.dataimport.JdbcDataSource - Time taken for sql :64 8186301 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer - The name is 8186303 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer - The name is 8186303 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer - The name is Tom -- View this message in context: http://lucene.472066.n3.nabble.com/TemplateTransformer-returns-null-values-tp4114539.html Sent from the Solr - User mailing list archive at Nabble.com.
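A minimal sketch of the kind of entity this describes (the URL prefix is hypothetical). One thing worth checking with Oracle is the case of the column labels the driver returns (often upper case), since TemplateTransformer silently skips a field when any variable in the template fails to resolve against the row as DIH sees it:

    <entity name="document_solr"
            transformer="TemplateTransformer,LogTransformer"
            query="select DOC_IDN as id, BILL_IDN as bill_id from document_solr"
            logTemplate="The name is ${document_solr.ID}" logLevel="debug">
      <field column="new_url" template="http://example.com/docs/${document_solr.ID}"/>
    </entity>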
Re: TemplateTransformer returns null values
Thanks Alexandre for the quick response. I tried both ways but still no luck - null values. Is there anything I am doing fundamentally wrong? query="select DOC_IDN, BILL_IDN from document_fact" and query="select DOC_IDN as id, BILL_IDN as bill_id from document_fact" -- View this message in context: http://lucene.472066.n3.nabble.com/TemplateTransformer-returns-null-values-tp4114539p4114544.html Sent from the Solr - User mailing list archive at Nabble.com.
Faceting return value of a function query?
Hi, I'm new to Solr, and I'm having a problem with faceting. I would really appreciate it if you could help :) I have a set of documents in JSON format, which I could post to my Solr core using the post.jar tool. Each document contains two fields, namely "startDate" and "endDate", both of which are of type "date". Conceptually, I would like to have a third field "timeSpan" that is automatically generated from the return value of the function query "ms(endDate, startDate)", and to do a range facet on it, i.e. compute the distribution of "timeSpan" among either all of the documents or a filtered subset of them. I have tried to find ways of both directly faceting on the function return values and automatically generating the "timeSpan" field during indexing, but without luck yet. Suggestions are greatly appreciated! Best, Yubing
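For reference, a minimal sketch of the range-facet request this is aiming for, assuming a numeric "timeSpan" field (in milliseconds) already exists on each document; the start/end/gap values are made up, with the gap set to one day in milliseconds:

    http://localhost:8983/solr/collection1/select?q=*:*&rows=0
        &facet=true
        &facet.range=timeSpan
        &facet.range.start=0
        &facet.range.end=31536000000
        &facet.range.gap=86400000

The open question in the thread is how to get timeSpan populated in the first place; the end of the thread settles on computing it at index time.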
Re: Faceting return value of a function query?
Hi Erik, Thanks for the reply! Do you mean parse and modify the documents before sending them to Solr? Cheers, Yubing On Mon, Nov 3, 2014 at 8:48 PM, Erick Erickson wrote: > Wouldn't it be easiest to compute the span at index time? Then it's > very straight-forward. > > Best, > Erick > > On Mon, Nov 3, 2014 at 8:18 PM, Yubing (Tom) Dong 董玉冰 > wrote: > > Hi, > > > > I'm new to Solr, and I'm having a problem with faceting. I would really > > appreciate it if you could help :) > > > > I have a set of documents in JSON format, which I could post to my Solr > > core using the post.jar tool. Each document contains two fields, namely > > "startDate" and "endDate", both of which are of type "date". > > > > Conceptually, I would like to have a third field "timeSpan" that is > > automatically generated from the return value of function query > > "ms(endDate, startDate)", and do range facet on it, i.e. compute the > > distribution of "timeSpan", among either all of or a filtered subset of > the > > documents. > > > > I have tried to find ways of both directly faceting the function return > > values and automatically generate the "timeSpan" field during indexing, > but > > without luck yet. > > > > Suggestions are greatly appreciated! > > > > Best, > > Yubing >
Re: Faceting return value of a function query?
I see. Thank you! :-) Sent from my Android phone On Nov 3, 2014 9:35 PM, "Erick Erickson" wrote: > Yep. It's almost always easier and faster if you can pre-compute as > much as possible during indexing time. It'll take longer to index of > course, but the ratio of writing to the index to searching is usually > hugely in favor of doing the work during indexing. > > Best, > Erick > > On Mon, Nov 3, 2014 at 8:52 PM, Yubing (Tom) Dong 董玉冰 > wrote: > > Hi Erik, > > > > Thanks for the reply! Do you mean parse and modify the documents before > > sending them to Solr? > > > > Cheers, > > Yubing > > > > On Mon, Nov 3, 2014 at 8:48 PM, Erick Erickson > > wrote: > > > >> Wouldn't it be easiest to compute the span at index time? Then it's > >> very straight-forward. > >> > >> Best, > >> Erick > >> > >> On Mon, Nov 3, 2014 at 8:18 PM, Yubing (Tom) Dong 董玉冰 > >> wrote: > >> > Hi, > >> > > >> > I'm new to Solr, and I'm having a problem with faceting. I would > really > >> > appreciate it if you could help :) > >> > > >> > I have a set of documents in JSON format, which I could post to my > Solr > >> > core using the post.jar tool. Each document contains two fields, > namely > >> > "startDate" and "endDate", both of which are of type "date". > >> > > >> > Conceptually, I would like to have a third field "timeSpan" that is > >> > automatically generated from the return value of function query > >> > "ms(endDate, startDate)", and do range facet on it, i.e. compute the > >> > distribution of "timeSpan", among either all of or a filtered subset > of > >> the > >> > documents. > >> > > >> > I have tried to find ways of both directly faceting the function > return > >> > values and automatically generate the "timeSpan" field during > indexing, > >> but > >> > without luck yet. > >> > > >> > Suggestions are greatly appreciated! > >> > > >> > Best, > >> > Yubing > >> >
Re: Faceting return value of a function query?
Turns out that update processors perfectly suit me needs. I ended up using the StatelessScriptUpdateProcessor with a simple js script :-) On Mon Nov 03 2014 at 下午10:40:52 Yubing (Tom) Dong 董玉冰 < tom.tung@gmail.com> wrote: > I see. Thank you! :-) > > Sent from my Android phone > On Nov 3, 2014 9:35 PM, "Erick Erickson" wrote: > >> Yep. It's almost always easier and faster if you can pre-compute as >> much as possible during indexing time. It'll take longer to index of >> course, but the ratio of writing to the index to searching is usually >> hugely in favor of doing the work during indexing. >> >> Best, >> Erick >> >> On Mon, Nov 3, 2014 at 8:52 PM, Yubing (Tom) Dong 董玉冰 >> wrote: >> > Hi Erik, >> > >> > Thanks for the reply! Do you mean parse and modify the documents before >> > sending them to Solr? >> > >> > Cheers, >> > Yubing >> > >> > On Mon, Nov 3, 2014 at 8:48 PM, Erick Erickson > > >> > wrote: >> > >> >> Wouldn't it be easiest to compute the span at index time? Then it's >> >> very straight-forward. >> >> >> >> Best, >> >> Erick >> >> >> >> On Mon, Nov 3, 2014 at 8:18 PM, Yubing (Tom) Dong 董玉冰 >> >> wrote: >> >> > Hi, >> >> > >> >> > I'm new to Solr, and I'm having a problem with faceting. I would >> really >> >> > appreciate it if you could help :) >> >> > >> >> > I have a set of documents in JSON format, which I could post to my >> Solr >> >> > core using the post.jar tool. Each document contains two fields, >> namely >> >> > "startDate" and "endDate", both of which are of type "date". >> >> > >> >> > Conceptually, I would like to have a third field "timeSpan" that is >> >> > automatically generated from the return value of function query >> >> > "ms(endDate, startDate)", and do range facet on it, i.e. compute the >> >> > distribution of "timeSpan", among either all of or a filtered subset >> of >> >> the >> >> > documents. >> >> > >> >> > I have tried to find ways of both directly faceting the function >> return >> >> > values and automatically generate the "timeSpan" field during >> indexing, >> >> but >> >> > without luck yet. >> >> > >> >> > Suggestions are greatly appreciated! >> >> > >> >> > Best, >> >> > Yubing >> >> >> >
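A minimal sketch of that setup, assuming the startDate/endDate field names from earlier in the thread (whether the values arrive in the script as java.util.Date objects or still as strings depends on how the documents are fed in, so treat the script as illustrative). In solrconfig.xml:

    <updateRequestProcessorChain name="timespan">
      <processor class="solr.StatelessScriptUpdateProcessorFactory">
        <str name="script">timespan.js</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

and timespan.js in the conf directory:

    function processAdd(cmd) {
      var doc = cmd.solrInputDocument;
      var start = doc.getFieldValue("startDate");
      var end = doc.getFieldValue("endDate");
      // assumes the values are java.util.Date at this point in the chain
      if (start != null && end != null) {
        doc.setField("timeSpan", end.getTime() - start.getTime());
      }
    }
    function processDelete(cmd) { }
    function processCommit(cmd) { }
    function processRollback(cmd) { }
    function processMergeIndexes(cmd) { }
    function finish() { }

The chain is picked per update request with update.chain=timespan, and the resulting numeric timeSpan field can then be range-faceted as sketched under the original question.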
possible spellcheck bug in 3.5 causing erroneous suggestions
hi folks, i think i found a bug in the spellchecker but am not quite sure: this is the query i send to solr: http://lh:8983/solr/CompleteIndex/select? &rows=0 &echoParams=all &spellcheck=true &spellcheck.onlyMorePopular=true &spellcheck.extendedResults=no &q=a+bb+ccc++ and this is the result: 0 4 all true all no a bb ccc 0 true 1 2 4 abb 1 5 8 ccc 1 5 8 ccc 1 10 14 dvd now, i know this is just an artificial query that i ran as a test regarding suggestions; i discovered the oddity by chance and it is not related to the test itself. my question is how the suggestions 1 and 2 come about. from what i understand from the wiki, the entries in spellcheck/suggestions should only be (misspelled) substrings from the user query. the setup/context is this: - the words a and ccc exist in the index (11 times for ccc) but suggestions 1 and 2 don't: http://lh:8983/solr/CompleteIndex/terms?terms=on&terms.fl=spell&terms.prefix=ccc&terms.mincount=0 <int name="status">0</int> <int name="QTime">1</int> ... <int name="ccc">11</int> - the analyzer for the spellchecker yields the terms as entered, i.e. a|bb|ccc| - the config is thus textSpell default spell ./spellchecker does anyone have a clue what's going on?
Re: possible spellcheck bug in 3.5 causing erroneous suggestions
same On 22.03.2012 10:00, Markus Jelsma wrote: Can you try spellcheck.q ? On Thu, 22 Mar 2012 09:57:19 +0100, tom wrote: hi folks, i think i found a bug in the spellchecker but am not quite sure: this is the query i send to solr: http://lh:8983/solr/CompleteIndex/select? &rows=0 &echoParams=all &spellcheck=true &spellcheck.onlyMorePopular=true &spellcheck.extendedResults=no &q=a+bb+ccc++ and this is the result: 0 4 all true all no a bb ccc 0 true 1 2 4 abb 1 5 8 ccc 1 5 8 ccc 1 10 14 dvd now, i know this is just a technical query and i have done it for a test regarding suggestions and i discovered the oddity just by chance and was not regarding the test i did: my question is regarding, how the suggestions 1 and 2 come about. from what i understand from the wiki, that the entries in spellcheck/suggestions are only (misspelled) substrings from the user query. the setup/context is thus: - the words a ccc exists 11 times in the index but 1 and 2 dont http://lh:8983/solr/CompleteIndex/terms?terms=on&terms.fl=spell&terms.prefix=ccc&terms.mincount=0 0111 - analyzer for the spellchecker yields the terms as entered, i.e. a|bb|ccc| - the config is thus textSpell default spell ./spellchecker does anyone have a clue what's going on?
Re: possible spellcheck bug in 3.5 causing erroneous suggestions
so any one has a clue what's (might be) going wrong ? or do i have to debug and myself and post a jira issue? PS: unfortunately i cant give anyone the index for testing due to NDA. cheers On 22.03.2012 10:17, tom wrote: same On 22.03.2012 10:00, Markus Jelsma wrote: Can you try spellcheck.q ? On Thu, 22 Mar 2012 09:57:19 +0100, tom wrote: hi folks, i think i found a bug in the spellchecker but am not quite sure: this is the query i send to solr: http://lh:8983/solr/CompleteIndex/select? &rows=0 &echoParams=all &spellcheck=true &spellcheck.onlyMorePopular=true &spellcheck.extendedResults=no &q=a+bb+ccc++ and this is the result: 0 4 all true all no a bb ccc 0 true 1 2 4 abb 1 5 8 ccc 1 5 8 ccc 1 10 14 dvd now, i know this is just a technical query and i have done it for a test regarding suggestions and i discovered the oddity just by chance and was not regarding the test i did: my question is regarding, how the suggestions 1 and 2 come about. from what i understand from the wiki, that the entries in spellcheck/suggestions are only (misspelled) substrings from the user query. the setup/context is thus: - the words a ccc exists 11 times in the index but 1 and 2 dont http://lh:8983/solr/CompleteIndex/terms?terms=on&terms.fl=spell&terms.prefix=ccc&terms.mincount=0 0111 - analyzer for the spellchecker yields the terms as entered, i.e. a|bb|ccc| - the config is thus textSpell default spell ./spellchecker does anyone have a clue what's going on?
solrj and replication
hi, i was just wondering if i need to do something special to get replication working for an embedded slave? my setup is like so: - my clustered application uses embedded solr(j) (for performance); the cores are configured as slaves that should connect to a master which runs in a jetty. - the embedded setup doesn't expose any of the solr servlets. note: the same slave config, if started in jetty, does proper replication, while when embedded it doesn't. using solr 3.5 thx tom
Re: solrj and replication
ok tested it myself and a slave running embedded works, just not within my application -- yet... On 20.06.2012 18:14, tom wrote: hi, i was just wondering if i need to do smth special if i want to have an embedded slave to get replication working ? my setup is like so: - in my clustered application that uses embedded solr(j) (for performance). the cores are configured as slaves that should connect to a master which runs in a jetty. - the embedded codes dont expose any of the solr servlets note: that the slave config, if started in jetty, does proper replication, while when embedded it doesnt. using solr 3.5 thx tom
suggester/autocomplete locks file preventing replication
hi, i'm using the suggester with a file like so: suggest name="classname">org.apache.solr.spelling.suggest.Suggester name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup content 0.05 true 100 autocomplete.dictionary when trying to replicate i get the following error message on the slave side: 2012-06-21 14:34:50,781 ERROR [pool-3-thread-1 ] handler.ReplicationHandler- SnapPull failed org.apache.solr.common.SolrException: Unable to rename: autocomplete.dictionary.20120620120611 at org.apache.solr.handler.SnapPuller.copyTmpConfFiles2Conf(SnapPuller.java:642) at org.apache.solr.handler.SnapPuller.downloadConfFiles(SnapPuller.java:526) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:299) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268) at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) so i dug around it and found out that the solr's java process holds a lock on the autocomplete.dictionary file. any reason why this is so? thx, running: solr 3.5 win7
Re: suggester/autocomplete locks file preventing replication
BTW: a core unload doesnt release the lock either ;( On 21.06.2012 14:39, tom wrote: hi, i'm using the suggester with a file like so: suggest name="classname">org.apache.solr.spelling.suggest.Suggester name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup content 0.05 true 100 autocomplete.dictionary when trying to replicate i get the following error message on the slave side: 2012-06-21 14:34:50,781 ERROR [pool-3-thread-1 ] handler.ReplicationHandler- SnapPull failed org.apache.solr.common.SolrException: Unable to rename: autocomplete.dictionary.20120620120611 at org.apache.solr.handler.SnapPuller.copyTmpConfFiles2Conf(SnapPuller.java:642) at org.apache.solr.handler.SnapPuller.downloadConfFiles(SnapPuller.java:526) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:299) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268) at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) so i dug around it and found out that the solr's java process holds a lock on the autocomplete.dictionary file. any reason why this is so? thx, running: solr 3.5 win7
Re: suggester/autocomplete locks file preventing replication
pocking into the code i think the FileDictionary class is the culprit: It takes an InputStream as a ctor argument but never releases the stream. what puzzles me is that the class seems to allow a one-time iteration and then the stream is useless, unless i'm missing smth. here. is there a good reason for this or rather a bug? should i move the topic to the dev list? On 21.06.2012 14:49, tom wrote: BTW: a core unload doesnt release the lock either ;( On 21.06.2012 14:39, tom wrote: hi, i'm using the suggester with a file like so: suggest name="classname">org.apache.solr.spelling.suggest.Suggester name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup content 0.05 true 100 autocomplete.dictionary when trying to replicate i get the following error message on the slave side: 2012-06-21 14:34:50,781 ERROR [pool-3-thread-1 ] handler.ReplicationHandler- SnapPull failed org.apache.solr.common.SolrException: Unable to rename: autocomplete.dictionary.20120620120611 at org.apache.solr.handler.SnapPuller.copyTmpConfFiles2Conf(SnapPuller.java:642) at org.apache.solr.handler.SnapPuller.downloadConfFiles(SnapPuller.java:526) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:299) at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268) at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907) at java.lang.Thread.run(Thread.java:619) so i dug around it and found out that the solr's java process holds a lock on the autocomplete.dictionary file. any reason why this is so? thx, running: solr 3.5 win7
Fwd: suggester/autocomplete locks file preventing replication
FYI matter has been dealt with on the dev list Original Message Subject:Re: Re: suggester/autocomplete locks file preventing replication Date: Fri, 22 Jun 2012 12:16:35 +0200 From: Simon Willnauer Reply-To: d...@lucene.apache.org, simon.willna...@gmail.com To: d...@lucene.apache.org here is the issue https://issues.apache.org/jira/browse/SOLR-3570 On Fri, Jun 22, 2012 at 11:55 AM, Simon Willnauer mailto:simon.willna...@googlemail.com>> wrote: On Fri, Jun 22, 2012 at 11:47 AM, Simon Willnauer mailto:simon.willna...@googlemail.com>> wrote: On Fri, Jun 22, 2012 at 10:37 AM, tom mailto:dev.tom.men...@gmx.net>> wrote: cross posting this issue to the dev list in the hope to get a response here... I think you are right. Closing the Stream / Reader is the responsibility of the caller not the FileDictionary IMO but solr doesn't close it so that might cause your problems. Are you running on windows by any chance? I will create an issue and fix it. hmm I just looked at it and I see a IOUtils.close call in FileDictionary https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/lucene/contrib/spellchecker/src/java/org/apache/lucene/search/suggest/FileDictionary.java are you using solr 3.6? simon Original Message Subject: Re: suggester/autocomplete locks file preventing replication Date:Thu, 21 Jun 2012 17:11:40 +0200 From:tom <mailto:dev.tom.men...@gmx.net> Reply-To:solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org> To: solr-user@lucene.apache.org <mailto:solr-user@lucene.apache.org> pocking into the code i think the FileDictionary class is the culprit: It takes an InputStream as a ctor argument but never releases the stream. what puzzles me is that the class seems to allow a one-time iteration and then the stream is useless, unless i'm missing smth. here. is there a good reason for this or rather a bug? should i move the topic to the dev list? On21.06.2012 14 :49, tom wrote: > BTW: a core unload doesnt release the lock either ;( > > > On21.06.2012 14 :39, tom wrote: >> hi, >> >> i'm using the suggester with a file like so: >> >> >> >> suggest >> > name="classname">org.apache.solr.spelling.suggest.Suggester >> > name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup >> >> >> >> content >> 0.05 >> true >> 100 >> autocomplete.dictionary >> >> >> >> when trying to replicate i get the following error message on the >> slave side: >> >> 2012-06-21 14:34:50,781 ERROR >> [pool-3-thread-1 ] >> handler.ReplicationHandler- SnapPull failed >> org.apache.solr.common.SolrException: Unable to rename: >> autocomplete.dictionary.20120620120611 >> at >> org.apache.solr.handler.SnapPuller.copyTmpConfFiles2Conf(SnapPuller.java:642) >> at >> org.apache.solr.handler.SnapPuller.downloadConfFiles(SnapPuller.java:526) >> at >> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:299) >> at >> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268) >> at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159) >> at >> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >> at >> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317) >> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) >> at >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98) >> at >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolE
Re: Restrict access to localhost
If you are using another app to create the index, I think you can remove the update servlet mapping in the web.xml. -- View this message in context: http://lucene.472066.n3.nabble.com/Restrict-access-to-localhost-tp2004475p2014129.html Sent from the Solr - User mailing list archive at Nabble.com.
copyField for big indexes
Is it a good rule of thumb that, when dealing with large indexes, copyField should not be used? It seems to duplicate the indexing of data. You don't need copyField to be able to search on multiple fields. For example, if I have two fields, title and post, and I want to search on both, I could just query title:<term> OR post:<term>. So it seems to me that if you have lots of data and a large index, copyField should be avoided. Any thoughts? -- View this message in context: http://lucene.472066.n3.nabble.com/copyField-for-big-indexes-tp3275712p3275712.html Sent from the Solr - User mailing list archive at Nabble.com.
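For reference, a minimal sketch of the two alternatives being weighed, with hypothetical field and type names; the catch-all field is extra index data, which is exactly the cost the question is about:

    <!-- copyField route: index title and post a second time into one catch-all field -->
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="post"  type="text" indexed="true" stored="true"/>
    <field name="text"  type="text" indexed="true" stored="false" multiValued="true"/>
    <copyField source="title" dest="text"/>
    <copyField source="post"  dest="text"/>

    <!-- multi-field route: no extra field, spell out both fields in the query -->
    q=title:(ipod) OR post:(ipod)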
Re: copyField for big indexes
Thanks Erick -- View this message in context: http://lucene.472066.n3.nabble.com/copyField-for-big-indexes-tp3275712p3275816.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: copyField for big indexes
Bill, I was using it as a simple default search field. I realise now that's not a good reason to use copyField. As I see it now, it should be used if you want to search in a way that is different: use different analyzers, etc; not for just searching on multiple fields in a single query. Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/copyField-for-big-indexes-tp3275712p3276994.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr indexing process: keep a persistent Mysql connection throu all the indexing process
10K documents. Why not just batch them? You could read in 10K from your database, load them into an array of SolrDocuments, and then post them all at once to the Solr server. Or do them in 1K increments if they are really big. -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-indexing-process-keep-a-persistent-Mysql-connection-throu-all-the-indexing-process-tp3278608p3279708.html Sent from the Solr - User mailing list archive at Nabble.com.
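A minimal sketch of that batching idea in SolrJ (the core URL, field names and the JDBC source are placeholders, and the exact client class depends on the SolrJ version - HttpSolrServer here is from the 4.x line):

    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Stream rows from a JDBC ResultSet into Solr in 1K batches instead of one doc at a time.
    public class BatchIndexer {
        public static void index(ResultSet rs) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));          // field names are hypothetical
                doc.addField("title", rs.getString("title"));
                batch.add(doc);
                if (batch.size() == 1000) {                      // post in 1K increments
                    solr.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);                                 // flush the remainder
            }
            solr.commit();
            solr.shutdown();
        }
    }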
Trimming the list of docs returned.
Hi - I'd like to be able to limit the number of documents returned from any particular group of documents, much as Google only shows a max of two results from any one website. The docs are all marked as to which group they belong to. There will probably be multiple groups returned from any search. Documents belong to only one group. I could just examine each returned document, and discard documents from groups I have seen before, but that seems slow (though I'm not sure there is a better alternative). The number of groups is a fairly high percentage of the number of documents (maybe 5% of all documents), so building something like a filter for each group doesn't seem feasible. A custom HitCollector of some sort could work, but there is the comment in the javadoc that one "should not call Searcher.doc(int) or IndexReader.document(int) on every document number encountered," which would seem to be necessary to get the group id. Does Solr add anything to Lucene in this regard? Thanks, Tom
Re: Trimming the list of docs returned.
Hi - On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > Yes, a custom hit collector would work. Searcher.doc() would be > deadly... but since each doc has at most one category, the FieldCache > could be used (it quickly maps id to field value and was historically > used for sorting). Not to be dense, but how do I use a custom HitCollector with Solr? I've checked the wiki, and searched the mailing list, and don't see anything. Is there a way to configure this, or do I just build a custom version of Solr? I have no problems doing this in Lucene, but I'm not quite sure where to configure/code this in Solr. Thanks, Tom On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > Hi Tom, I moderated your email in... you need to subscribe to prevent > your emails being blocked in the future. Thanks. That's fixed, I hope. I was using the wrong address. > http://incubator.apache.org/solr/mailing_lists.html > > On 10/30/06, Tom <[EMAIL PROTECTED]> wrote: > > I'd like to be able to limit the number of documents returned from > > any particular group of documents, much as Google only shows a max of > > two results from any one website. > > You bring up an interesting problem that may be of general use. > Solr doesn't currently do this, but it should be possible (with some > work in the internals). > > > The docs are all marked as to which group they belong to. There will > > probably be multiple groups returned from any search. Documents > > belong to only one group > > Documents belonging to only one group does make things easier. > > > I could just examine each returned document, and discard documents > > from groups I have seen before, but that seems slow (but I'm not sure > > there is a better alternative). > > > > The number of groups is fairly high percentage of the number of > > documents (maybe 5% of all documents), so building something like a > > filter for each group doesn't seem feasible. > > > > CustomHitCollector of some sort could work, but there is the comment > > in the javadoc about "should not call Searcher.doc(int) > > or IndexReader.document(int) on every document number encountered." > > which would seem to be necessary to get the group id. > > Yes, a custom hit collector would work. Searcher.doc() would be > deadly... but since each doc has at most one category, the FieldCache > could be used (it quickly maps id to field value and was historically > used for sorting). > > It might be useful to see what Nutch does in this regard too. > > -Yonik >
Re: Trimming the list of docs returned.
Hi - Recap: > > I'd like to be able to limit the number of documents returned from > > any particular group of documents, much as Google only shows a max of > > two results from any one website. > > > > The docs are all marked as to which group they belong to. There will > > probably be multiple groups returned from any search. Documents > > belong to only one group It looks like that for trimming, the places I want to modify are in ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to return the top item in the group that matches, whether by score or sort, not just the first one that goes through the HitCollector. But since I want to enable this on a per-request basis, I need some way to get the parameters from the original request and pass them down to my implementation of ScorePriorityQueue. I'm trying to minimize the number of changes I'd have to make, so I've defined another flag (like SolrIndexHandler.GET_SCORES), and I check and set it in a modified version of StandardRequestHandler. This seems to work, and doesn't require me to change any method signatures. Suggestions for other implementations welcome!

    Index: src/java/org/apache/solr/request/StandardRequestHandler.java
    ===================================================================
    --- src/java/org/apache/solr/request/StandardRequestHandler.java (revision 470495)
    +++ src/java/org/apache/solr/request/StandardRequestHandler.java (working copy)
    @@ -97,6 +97,10 @@
         // find fieldnames to return (fieldlist)
         String fl = p.get(SolrParams.FL);
         int flags = 0;
    +    String trim = p.get("trim");
    +    if ((trim == null) || !trim.equals("0"))
    +      flags |= SolrIndexSearcher.TRIM_RESULTS;
    +
         if (fl != null) {
           flags |= U.setReturnFields(fl, rsp);
         }

But, unsurprisingly, trimming vs. not trimming is being ignored with regard to caching. How would I indicate that a query with trim=0 is not the same as trim=1? I do still want to cache. But obviously, my implementation won't work at the moment, since all queries will cache the value generated using the results generated by the value of trim on the initial query. Any suggestions for where to go poking around to fix this vs. caching? Thanks, Tom At 11:10 AM 11/8/2006, you wrote: On 11/8/06, Tom <[EMAIL PROTECTED]> wrote: On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > Yes, a custom hit collector would work. Searcher.doc() would be > deadly... but since each doc has at most one category, the FieldCache > could be used (it quickly maps id to field value and was historically > used for sorting). Not to be dense, but how do I use a custom HitCollector with Solr? You would need a custom request handler, then just use the SolrIndexSearcher you get with a request... it exposes all of the Lucene IndexSearcher methods. -Yonik On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > Hi Tom, I moderated your email in... you need to subscribe to prevent > your emails being blocked in the future. Thanks. That's fixed, I hope. I was using the wrong address. > http://incubator.apache.org/solr/mailing_lists.html > > On 10/30/06, Tom <[EMAIL PROTECTED]> wrote: > > I'd like to be able to limit the number of documents returned from > > any particular group of documents, much as Google only shows a max of > > two results from any one website. > > You bring up an interesting problem that may be of general use. > Solr doesn't currently do this, but it should be possible (with some > work in the internals). > > > The docs are all marked as to which group they belong to. There will > > probably be multiple groups returned from any search.
Documents > > belong to only one group > > Documents belonging to only one group does make things easier. > > > I could just examine each returned document, and discard documents > > from groups I have seen before, but that seems slow (but I'm not sure > > there is a better alternative). > > > > The number of groups is fairly high percentage of the number of > > documents (maybe 5% of all documents), so building something like a > > filter for each group doesn't seem feasible. > > > > CustomHitCollector of some sort could work, but there is the comment > > in the javadoc about "should not call Searcher.doc(int) > > or IndexReader.document(int) on every document number encountered." > > which would seem to be necessary to get the group id. > > Yes, a custom hit collector would work. Searcher.doc() would be > deadly... but since each doc has at most one category, the FieldCache > could be used (it quickly maps id to field value and was historically > used for sorting). > > It might be useful to see what Nutch does in this regard too. > > -Yonik
Re: Trimming the list of docs returned.
At 01:35 PM 11/15/2006, you wrote: On 11/15/06, Tom <[EMAIL PROTECTED]> wrote: It looks like that for trimming, the places I want to modify are in ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to return the top item in the group that matches, whether by score or sort, not just the first one that goes through the HitCollector. Wouldn't you actually need a priority queue per group? I'm still playing with implementations, but I think you just need a max score for each group. You can't just do a PriorityQueue (of either max, or PriorityQueues) since I don't think the Lucene PriorityQueue handles entries whose value changes after insertion. But, unsurprisingly, trimming vs. not trimming is being ignored with regard to caching. How would I indicate that a query with trim=0 is not the same as trim=1? I do still want to cache. One hack: implement a simple query that delegates to another query and encapsulates the trim value... that way hashCode/equals won't match unless the trim does. Not sure what you mean by "delegates to another query". Could you clarify or give me a pointer? I was thinking in terms of just adding some guaranteed true clause to the end when trimming, is that similar to what you were talking about? Thanks, Tom -Yonik But obviously, my implementation won't work at the moment, since all queries will cache the value generated using the results generated by the value of trim on the initial query. Any suggestions for where to go poking around to fix this vs. caching? Thanks, Tom
MatchAllDocsQuery in solr?
Is there a way to do a match all docs query in solr? I mean is there something I can put in a solr URL that will get recognized by the SolrQueryParser as meaning a "match all"? Why? Because I'm porting unit tests from our internal Lucene container to Solr, and the tests usually run such a query, upon completion, to make sure the index is in the expected state (nothing missing, nothing extra). Yes, I can create a query that will match all my docs, there are a few fields that have a relatively small range of values. I was just looking for a standard way to do it first. Thanks, Tom
Re: MatchAllDocsQuery in solr?
Thanks for the quick response. I thought about a range query on the ID, but was wondering what the implications were for a large range query. (e.g. Number of docs > maxBooleanClauses). But this approach will work for me, as my test indicies are generally small. For a large data set, would it be faster to do that on a field with fewer values (but the same number of documents) e.g. type:[* TO *] where the type field has a small number of values. Or does that not matter? Thanks, Tom At 02:49 PM 11/21/2006, you wrote: : > I mean is there something I can put in a solr URL that will get : > recognized by the SolrQueryParser as meaning a "match all"? : : No, but there should be. if you use the uniqueKey feature, then you can do id:[* TO *] ... that acctually works on any field to find all docs that have "a" value, but on a uniqueKey field it by definition returns all docs since all docs have a uniequeKey. -Hoss
Re: MatchAllDocsQuery in solr?
At 03:18 PM 11/21/2006, Hoss wrote: It would would be really cool is if you could say something like... field:[low TO high]^0 other clauses XXX^0 ...and SolrIndexSearcher recognised that teh score contributions from the range query and the XXX TermQuery weren't going to contribute to the score, so it pulled the DocSets for them explicitly, and replaced their spots in the orriginal query with ConstantScoreQueries containing their DocSets ... that way they could be cached independently and reused. Just checking my understanding here. Right now, if I have ranges that I don't want to affect the score, but I would like to have cached, I should use Filter Queries, right? (SolrParams.FQ) Thanks, Tom
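For reference, a minimal sketch of what that looks like on a request (values are illustrative, and the spaces would be URL-encoded in practice); each fq clause is cached in the filter cache on its own and contributes nothing to the score:

    http://localhost:8983/solr/select?q=ipod&fq=date:[2006 TO 2007]&fq=inStock:true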
Cache stats
Hi - I'm starting to try to tune my installation a bit, and I'm looking for cache statistics. Is there a way to peek into a running installation, and see what my cache stats are? I'm looking for the usual cache hits/cache misses sort of things. Also, on a related note, I was looking for solr info via mbeans. I fired up jconsole, and I can see all sort of tomcat mbeans, but nothing for solr. Is there something extra I have to do to turn this on? I see things implementing SolrInfoMBean, so I'm assuming there is something there. (Off topic, but suggestions for anything better than JConsole also welcome). Thanks, Tom
boosts?
Hi - I'm having a problem getting boosts to work the way I think they are supposed to. What I want is for documents to be returned in doc boost order, when all the queries are constant scoring range queries. (e.g. date:[2006 TO 2007]) I believe (but am not certain) that this is supposed to be what happens. If that's not the case, you can probably skip the rest :-) As an example, I grabbed solr-1.1, and ran it (java -jar start.jar). Then I modified the hd.xml example doc, to add a boost on the first document (SP2514N) Then I loaded monitor.xml, and hd.xml ./post.sh monitor.xml ./post.sh hd.xml I then went to the solr admin interface and queried on id:[* TO *] Which I believe gets mapped to a ConstantScoreRangeQuery. So, given http://fred:8983/solr/select/?q=id%3A%5B*+TO+*%5D&version=2.2&start=0&rows=10&indent=on&debugQuery=1 I get the result below. Note that all the results list "boost=1.0" I would expect to see a boost of 100 on the SP2514N, in the explanation. Should I get that? I would also expect it to be at the head of the list, but I think I'm seeing the docs in insertion order. (if I insert xd.xml before monitor.xml, I get them in insertion order in that case as well.) Please let me know if my assumptions or my methods aren't correct. Thanks, Tom 0 4 10 0 on id:[* TO *] 1 2.2 electronicsmonitor 30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast 3007WFP true USB cable Dell, Inc. Dell Widescreen UltraSharp 3007WFP 6 2199.0 3007WFP 401.6 electronicshard drive 7200RPM, 8MB cache, IDE Ultra ATA-133NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor SP2514N true Samsung Electronics Co. Ltd. Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133 6 92.0 SP2514N electronicshard drive SATA 3.0Gb/s, NCQ8.5ms seek16MB cache 6H500F0 true Maxtor Corp. Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300 6 350.0 6H500F0 id:[* TO *] id:[* TO *] id:[* TO *] id:[* TO *] 1.0 = (MATCH) ConstantScoreQuery(id:[-}), product of: 1.0 = boost 1.0 = queryNorm 1.0 = (MATCH) ConstantScoreQuery(id:[-}), product of: 1.0 = boost 1.0 = queryNorm 1.0 = (MATCH) ConstantScoreQuery(id:[-}), product of: 1.0 = boost 1.0 = queryNorm
Re: boosts?
Hi Yonik, Thanks for the quick response. At 07:45 AM 12/28/2006, you wrote: On 12/27/06, Tom <[EMAIL PROTECTED]> wrote: I'm having a problem getting boosts to work the way I think they are supposed to. Do you have a specific relevance problem you are trying to solve, or just testing things out? Specific problem. Frequently our users will start by specifying a facet, such a date range, geo location, etc. At this point I don't have any positive query terms, just constant score range queries that are used to eliminate things the user is not interested in. So at this point, there's nothing to be relevant to, so I need to pick some ordering. Since I have information about which results tend to be more interesting in the general case, I've set boosts on the documents. I'd like to order by that, until the user gives me more information. For an example, think of amazon ordering by "best selling", when the user asks for books published since Dec. 1st. You don't yet know what is relevant to this user's query, since all you have is "since Dec 1st", but you want to give an order more reasonable than "doc number", or "date published". What I want is for documents to be returned in doc boost order, when all the queries are constant scoring range queries. (e.g. date:[2006 TO 2007]) They are *constant scoring* range queries :-) Index-time boosts currently don't factor in. Gotcha. I think I misinterpreted an earlier post (which did say "query boost"). I was thinking it would include index time boost, too. I'd recommend only using index-time boosting when you can't get the relevance you want with query boosting and scoring. I'm not sure how I'd do it that way. What I want (what I _think_ I want :-) is a way to specify a default order for results, for the cases where the user has only provided exclusion information. In this case, I'm doing a match all docs, with filter queries. Tom
Re: boosts?
At 12:03 PM 12/28/2006, you wrote: On 12/28/06, Tom <[EMAIL PROTECTED]> wrote: Could you index your documents in the desired order? This is the default sort order. I don't think I can control document order, as documents may get edited after creation. If not, you can add a field that is present in all documents, and add this as part of the query. Then you can fiddle with the index-time field boost to alter the results (without skewing queries that have a meaningful relevancy score as using document boosts would do). That seems to work. Thanks! I'll probably do it that way, but... :-) I was looking at how I would write a modified version of MatchAllDocsQuery that would simply return the documents boost as the score. But I haven't really figured out Lucene scoring. Could someone explain how one would do something like this? I'm just trying to understand how one might do custom scoring in Lucene, so I'm more looking for concepts than code. Thanks! Tom
Re: boosts?
At 06:03 PM 12/28/2006, you wrote: maybe i'm missing something, but it sounds like what you want is a simple sort on a numeric field -- whatever value you are tyring to use as the index time boost, you can just set as a field value instead and then sort on it right? Yes. I had been just been thinking about it in terms of how to use the info I already had in the index. But making another field works, too, and is probably simpler. : I was looking at how I would write a modified version of : MatchAllDocsQuery that would simply return the documents boost as the : score. But I haven't really figured out Lucene scoring. document boosts aren't maintained in the index ... they are multiplied by the various field boosts and lengthNorms and stored on a per field basis. Thanks! I had seen comments that the doc boost wasn't stored, but didn't know how it worked. Tom
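A minimal sketch of that approach, using a hypothetical popularity field (the sortable type name is from the 1.x example schema, and spaces would be URL-encoded): store the desired default ordering as a plain numeric field and sort on it when the query carries no meaningful relevance score:

    <field name="popularity" type="sfloat" indexed="true" stored="true"/>

    http://localhost:8983/solr/select?q=id:[* TO *]&fq=date:[2006 TO 2007]&sort=popularity desc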
SolrCloud, DIH, and XPathEntityProcessor
Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having some problems with a DIH config that attempts to load an XML file and iterate through the nodes in that file, it trys to load the file from disk instead of from zookeeper. The file exists in zookeeper, adjacent to the data_import.conf in the lookups_config conf folder. The exception: 2016-01-12 12:59:47.852 ERROR (Thread-44) [c:lookups s:shard1 r:core_node6 x:lookups_shard1_replica2] o.a.s.h.d.DataImporter Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:417) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462) Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233) ... 3 more Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:62) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:287) at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225) at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:202) at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415) ... 5 more Caused by: java.lang.RuntimeException: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:127) at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:86) at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:48) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:284) ... 10 more Caused by: java.io.FileNotFoundException: Could not find file: lookup_conf.xml (resolved to: /mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:123) ... 13 more Any hints gratefully accepted Cheers Tom
Re: SolrCloud, DIH, and XPathEntityProcessor
On Tue, Jan 12, 2016 at 2:32 PM, Shawn Heisey wrote: > On 1/12/2016 6:05 AM, Tom Evans wrote: >> Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having >> some problems with a DIH config that attempts to load an XML file and >> iterate through the nodes in that file, it trys to load the file from >> disk instead of from zookeeper. >> >> > dataSource="lookup_conf" >> rootEntity="false" >> name="lookups" >> processor="XPathEntityProcessor" >> url="lookup_conf.xml" >> forEach="/lookups/lookup"> >> >> The file exists in zookeeper, adjacent to the data_import.conf in the >> lookups_config conf folder. > > SolrCloud puts all the *config* for Solr into zookeeper, and adds a new > abstraction for indexes (the collection), but other parts of Solr like > DIH are not really affected. The entity processors in DIH cannot > retrieve data from zookeeper. They do not know how. That makes no sense whatsoever. DIH loads the data_import.conf from ZK just fine, or is that provided to DIH from another module that does know about ZK? Either way, it is entirely sub-optimal to have SolrCloud store "all" its configuration in ZK, but still require manually storing and updating files on specific nodes in order to influence DIH. If a server is mistakenly not updated, or manually modified locally on disk, that node would start indexing documents differently than other replicas, which sounds dangerous and scary! If there is not a ZkFileDataSource, it shouldn't be too tricky to add one... I'll see how much I dislike having config files on the host... Cheers Tom
Re: SolrCloud, DIH, and XPathEntityProcessor
On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey wrote: > On 1/12/2016 7:45 AM, Tom Evans wrote: >> That makes no sense whatsoever. DIH loads the data_import.conf from ZK >> just fine, or is that provided to DIH from another module that does >> know about ZK? > > This is accomplished indirectly through a resource loader in the > SolrCore object that is responsible for config files. Also, the > dataimport handler is created by the main Solr code which then hands the > configuration to the dataimport module. DIH itself does not know about > zookeeper. ZkPropertiesWriter seems to know a little.. > >> Either way, it is entirely sub-optimal to have SolrCloud store "all" >> its configuration in ZK, but still require manually storing and >> updating files on specific nodes in order to influence DIH. If a >> server is mistakenly not updated, or manually modified locally on >> disk, that node would start indexing documents differently than other >> replicas, which sounds dangerous and scary! > > The entity processor you are using accesses files through a Java > interface for mounted filesystems. As already mentioned, it does not > know about zookeeper. > >> If there is not a ZkFileDataSource, it shouldn't be too tricky to add >> one... I'll see how much I dislike having config files on the host... > > Creating your own DIH class would be the only solution available right now. > > I don't know how useful this would be in practice. Without special > config in multiple places, Zookeeper limits the size of the files it > contains to 1MB. It is not designed to deal with a large amount of data > at once. This is not large amounts of data, it is a 5kb XML file containing configuration of what tables to query for what fields and how to map them in to the document. > > You could submit a feature request in Jira, but unless you supply a > complete patch that survives the review process, I do not know how > likely an implementation would be. We've already started implementation, basing around FileDataSource and using SolrZkClient, which we will deploy as an additional library whilst that process is ongoing or doesn't survive it. Cheers Tom
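A rough sketch of the kind of DataSource being described - the class name and the zkPath attribute are ours, not an existing Solr API, and error handling is minimal. It resolves the url of an XPathEntityProcessor entity against the collection's config set in ZooKeeper instead of the local disk:

    import java.io.Reader;
    import java.io.StringReader;
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    import org.apache.solr.common.cloud.SolrZkClient;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.DataImportHandlerException;
    import org.apache.solr.handler.dataimport.DataSource;

    // Sketch of a DIH DataSource that reads entity files (e.g. lookup_conf.xml)
    // from the collection's config set in ZooKeeper instead of the local disk.
    public class ZkFileDataSource extends DataSource<Reader> {
        private SolrZkClient zkClient;
        private String zkPath;

        @Override
        public void init(Context context, Properties initProps) {
            // znode directory holding the config set, passed on the <dataSource> element,
            // e.g. zkPath="/configs/lookups_config"
            zkPath = initProps.getProperty("zkPath");
            zkClient = context.getSolrCore().getCoreDescriptor()
                    .getCoreContainer().getZkController().getZkClient();
        }

        @Override
        public Reader getData(String query) {
            try {
                byte[] data = zkClient.getData(zkPath + "/" + query, null, null, true);
                return new StringReader(new String(data, StandardCharsets.UTF_8));
            } catch (Exception e) {
                throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
                        "Could not read " + query + " from ZooKeeper", e);
            }
        }

        @Override
        public void close() {
            // the ZK client belongs to the core, so nothing to release here
        }
    }

The data_import.conf entity would then point at it with something like <dataSource name="lookup_conf" type="com.example.ZkFileDataSource" zkPath="/configs/lookups_config"/> - again a sketch under those naming assumptions, not the final patch.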
Shard allocation across nodes
Hi all We're setting up a solr cloud cluster, and unfortunately some of our VMs may be physically located on the same VM host. Is there a way of ensuring that all copies of a shard are not located on the same physical server? If they do end up in that state, is there a way of rebalancing them? Cheers Tom
Re: Shard allocation across nodes
Thank you both, those are exactly what I was looking for! If I'm reading it right, if I specify a "-Dvmhost=foo" when starting SolrCloud, and then specify a snitch rule like this when creating the collection: sysprop.vmhost:*,replica:<2 then this would ensure that on each vmhost there is at most one replica. I'm assuming that a shard leader and a replica are both treated as replicas in this scenario. Thanks Tom On Mon, Feb 1, 2016 at 8:34 PM, Erick Erickson wrote: > See the createNodeset and node parameters for the Collections API CREATE and > ADDREPLICA commands, respectively. That's more a manual process, there's > nothing OOB but Jeff's suggestion is sound. > > Best, > Erick > > > > On Mon, Feb 1, 2016 at 11:00 AM, Jeff Wartes wrote: >> >> You could write your own snitch: >> https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement >> >> Or, it would be more annoying, but you can always add/remove replicas >> manually and juggle things yourself after you create the initial collection. >> >> >> >> >> On 2/1/16, 8:42 AM, "Tom Evans" wrote: >> >>>Hi all >>> >>>We're setting up a solr cloud cluster, and unfortunately some of our >>>VMs may be physically located on the same VM host. Is there a way of >>>ensuring that all copies of a shard are not located on the same >>>physical server? >>> >>>If they do end up in that state, is there a way of rebalancing them? >>> >>>Cheers >>> >>>Tom
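A minimal sketch of how that fits together (host, ZooKeeper and collection names are made up; %3C is the URL-encoded '<' from the rule above):

    # start each Solr node tagged with the VM host it runs on
    bin/solr start -c -z zk1:2181 -Dvmhost=vmhost01

    # create the collection so no VM host holds more than one replica of any shard
    curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycoll&numShards=2&replicationFactor=2&rule=sysprop.vmhost:*,replica:%3C2"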
Change in EXPLAIN info since Solr 5
Hi group, While exploring Solr 5.4.0, I noticed a subtle difference in the EXPLAIN debug information, compared to the version we currently use (4.10.1). Solr 4.10.1: 2.0739748 = (MATCH) max plus 1.0 times others of: 2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of: 2.0739748 = score(doc=30,freq=3.0), product of: 0.3556181 = queryWeight, product of: 3.3671236 = idf(docFreq=17, maxDocs=192) 0.105614804 = queryNorm 5.832029 = fieldWeight in 30, product of: 1.7320508 = tf(freq=3.0), with freq of: 3.0 = termFreq=3.0 3.3671236 = idf(docFreq=17, maxDocs=192) 1.0 = fieldNorm(doc=30) Solr 5.4.0: 2.0739748 = max plus 1.0 times others of: 2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of: 2.0739748 = score(doc=30,freq=3.0), product of: 0.3556181 = queryWeight, product of: 3.3671236 = idf(docFreq=17, maxDocs=192) 0.105614804 = queryNorm 5.832029 = fieldWeight in 30, product of: 1.7320508 = tf(freq=3.0), with freq of: 3.0 = termFreq=3.0 3.3671236 = idf(docFreq=17, maxDocs=192) 1.0 = fieldNorm(doc=30) The difference is the removal of (MATCH) in some of the EXPLAIN lines. That is causing issues for us since we have developed an EXPLAIN parser that leans on the presence of (MATCH) in the EXPLAIN. Does anyone have a suggestion how to insert back (MATCH) in the explain info (like which file should we patch)? Thanks, Tom
fq in SolrCloud
I have a small question about fq in cloud mode that I couldn't find an explanation for in confluence. If I specify a query with an fq, where is that cached, is it just on the nodes/replicas that process that specific query, or will it exist on all replicas? We have a sub type of queries that specify an expensive join condition that we specify in the fq, so that subsequent requests with the same fq won't have to do the same expensive query, and was wondering whether we needed to ensure that the query goes to the same node when we move to cloud. Cheers Tom
Re: Json faceting, aggregate numeric field by day?
On Wed, Feb 10, 2016 at 10:21 AM, Markus Jelsma wrote: > Hi - if we assume the following simple documents: > > > 2015-01-01T00:00:00Z > 2 > > > 2015-01-01T00:00:00Z > 4 > > > 2015-01-02T00:00:00Z > 3 > > > 2015-01-02T00:00:00Z > 7 > > > Can i get a daily average for the field 'value' by day? e.g. > > > 3.0 > 5.0 > > > Reading the documentation, i don't think i can, or i am missing it > completely. But i just want to be sure. Yes, you can facet by day, and use the stats component to calculate the mean average. This blog post explains it: https://lucidworks.com/blog/2015/01/29/you-got-stats-in-my-facets/ Cheers Tom
Re: Json faceting, aggregate numeric field by day?
On Wed, Feb 10, 2016 at 12:13 PM, Markus Jelsma wrote: > Hi Tom - thanks. But judging from the article and SOLR-6348 faceting stats > over ranges is not yet supported. More specifically, SOLR-6352 is what we > would need. > > [1]: https://issues.apache.org/jira/browse/SOLR-6348 > [2]: https://issues.apache.org/jira/browse/SOLR-6352 > > Thanks anyway, at least we found the tickets :) > No problem - as I was reading this I was thinking "But wait, I *know* we do this ourselves for average price vs month published". In fact, I was forgetting that we index the ranges that we will want to facet over as part of the document - so a document with a date_published of "2010-03-29T00:00:00Z" also has a date_published.month of "201003" (and a bunch of other ranges that we want to facet by). The frontend then converts those fields in to the appropriate values for display. This might be an acceptable solution for you guys too, depending on how many ranges that you require, and how much larger it would make your index. Cheers Tom
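To illustrate the workaround, a rough sketch with the JSON facet API, assuming a pre-indexed bucket field (here date_published.month, holding values like 201003) and a numeric field named value - both names are only examples:

# average of "value" per pre-computed month bucket
curl http://localhost:8983/solr/mycollection/query -d 'q=*:*&rows=0&
 json.facet={
   by_month : {
     type : terms,
     field : "date_published.month",
     limit : -1,
     sort : "index asc",
     facet : { avg_value : "avg(value)" }
   }
 }'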
Solr and Nutch integration
I am having problems configuring Solr to read Nutch data or integrate with Nutch. Has anyone been able to get Solr 5.4.x to work with Nutch? I went through a lot of Google articles and still am not able to get Solr 5.4.1 to search Nutch content. Any howto or working configuration sample that you can share would be greatly appreciated. Thanks, Toom
Display entire string containing query string
Hello, I am working on a project using Solr to search data retrieved from Nutch. I have successfully integrated Nutch with Solr, and Solr is able to search Nutch's data. However, I am having a bit of a problem. If I query Solr, it will bring back the numFound and which document the query string was found in, but it will not display the string that contains the query string. Can anyone help on how to display the entire string that contains the query? I appreciate your time and guidance. Thank you so much! -T
Re: Display entire string containing query string
Hello, Thank you for your reply. I am wondering if you can clarify a bit more for me. Is field_where_string_may_be_present something that I have to specify? I am searching HTML pages. For example, if I search for the word "name" I am trying to display the entire sentence containing "name = T" or maybe "name: T". Ultimately, by searching for the string "name" I am trying to find the value of name. Thanks for your time. I appreciate your help -T On Feb 18, 2016 1:18 AM, "Binoy Dalal" wrote: > Append &fl=<field_where_string_may_be_present> > > On Thu, 18 Feb 2016, 11:35 Tom Running wrote: > > > Hello, > > > > I am working on a project using Solr to search data from retrieved from > > Nutch. > > > > I have successfully integrated Nutch with Solr, and Solr is able to > search > > Nutch's data. > > > > However I am having a bit of a problem. If I query Solr, it will bring > back > > the numfound and which document the query string was found in, but it > will > > not display the string that contains the query string. > > > > Can anyone help on how to display the entire string that contains the > > query. > > > > > > I appreciate your time and guidance. Thank you so much! > > > > -T > > > -- > Regards, > Binoy Dalal >
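For what it's worth, a rough sketch of the kind of request being discussed, assuming the Nutch page text ends up in a stored field called content (all field names here are guesses for your schema):

curl 'http://localhost:8983/solr/nutch/select' -G \
  --data-urlencode 'q=content:name' \
  --data-urlencode 'fl=id,title,content' \
  --data-urlencode 'hl=true' \
  --data-urlencode 'hl.fl=content' \
  --data-urlencode 'hl.fragsize=200' \
  --data-urlencode 'wt=json'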
Re: docValues error
On Mon, Feb 29, 2016 at 11:43 AM, David Santamauro wrote: > You will have noticed below, the field definition does not contain > multiValues=true What version of the schema are you using? In pre 1.1 schemas, multiValued="true" is the default if it is omitted. Cheers Tom
Separating cores from Solr home
Hi all I'm struggling to configure solr cloud to put the index files and core.properties in the correct places in SolrCloud 5.5. Let me explain what I am trying to achieve: * solr is installed in /opt/solr * the user who runs solr only has read only access to that tree * the solr home files - custom libraries, log4j.properties, solr.in.sh and solr.xml - live in /data/project/solr/releases/, which is then the target of a symlink /data/project/solr/releases/current * releasing a new version of the solr home (eg adding/changing libraries, changing logging options) is done by checking out a fresh copy of the solr home, switching the symlink and restarting solr * the solr core.properties and any data live in /data/project/indexes, so they are preserved when new solr home is released Setting core specific dataDir with absolute paths in solrconfig.xml only gets me part of the way, as the core.properties for each shard is created inside the solr home. This is obviously no good, as when releasing a new version of the solr home, they will no longer be in the current solr home. Cheers Tom
Re: Separating cores from Solr home
Hmm, I've worked around this by setting the directory where the indexes should live to be the actual solr home, and symlink the files from the current release in to that directory, but it feels icky. Any better ideas? Cheers Tom On Thu, Mar 3, 2016 at 11:12 AM, Tom Evans wrote: > Hi all > > I'm struggling to configure solr cloud to put the index files and > core.properties in the correct places in SolrCloud 5.5. Let me explain > what I am trying to achieve: > > * solr is installed in /opt/solr > * the user who runs solr only has read only access to that tree > * the solr home files - custom libraries, log4j.properties, solr.in.sh > and solr.xml - live in /data/project/solr/releases/, which > is then the target of a symlink /data/project/solr/releases/current > * releasing a new version of the solr home (eg adding/changing > libraries, changing logging options) is done by checking out a fresh > copy of the solr home, switching the symlink and restarting solr > * the solr core.properties and any data live in /data/project/indexes, > so they are preserved when new solr home is released > > Setting core specific dataDir with absolute paths in solrconfig.xml > only gets me part of the way, as the core.properties for each shard is > created inside the solr home. > > This is obviously no good, as when releasing a new version of the solr > home, they will no longer be in the current solr home. > > Cheers > > Tom
mergeFactor/maxMergeDocs is deprecated
Hi all Updating to Solr 5.5.0, and getting these messages in our error log: Beginning with Solr 5.5, <mergeFactor> is deprecated, configure it on the relevant <mergePolicyFactory> instead. Beginning with Solr 5.5, <maxMergeDocs> is deprecated, configure it on the relevant <mergePolicyFactory> instead. However, mergeFactor is only mentioned in a commented-out section of our solrconfig.xml files, and maxMergeDocs is not mentioned at all. > $ ack -B 1 -A 1 '10 212- --> > $ ack --all maxMergeDocs > $ Any ideas? Cheers Tom
Ping handler in SolrCloud mode
Hi all I have a cloud setup with 8 nodes and 3 collections, products, items and skus. All collections have just one shard, products has 6 replicas, items has 2 replicas, skus has 8 replicas. No node has both products and items, all nodes have skus Some of our queries join from sku to either products or items. If the query is directed at a node without the appropriate shard on them, we obviously get an error, so we have separate balancers for products and items. The problem occurs when we attempt to query a node to see if products or items is active on that node. The balancer (haproxy) requests the ping handler for the appropriate collection, however all the nodes return OK for all the collections(!) Eg, on node01, it has replicas for products and skus, but the ping handler for /solr/items/admin/ping returns 200! This means that as far as the balancer is concerned, node01 is a valid destination for item queries, and inevitably it blows up as soon as such a query is made to it. As I understand it, this is because the URL we are checking is for the collection ("items") rather than a specific core ("items_shard1_replica1") Is there a way to make the ping handler only check local shards? I have tried with distrib=false&preferLocalShards=false, but it still returns a 200. The option I'm trying now is to make two ping handler for skus that join to one of items/products, which should fail on the servers which do not support it, but I am concerned that this is a little heavyweight for a status check to see whether we can direct requests at this server or not. Cheers Tom
Re: Ping handler in SolrCloud mode
On Wed, Mar 16, 2016 at 2:14 PM, Tom Evans wrote: > Hi all > > [ .. ] > > The option I'm trying now is to make two ping handler for skus that > join to one of items/products, which should fail on the servers which > do not support it, but I am concerned that this is a little > heavyweight for a status check to see whether we can direct requests > at this server or not. This worked, I would still be interested in a lighter-weight approach that doesn't involve joins to see if a given collection has a shard on this server. I suspect that might require a custom ping handler plugin however. Cheers Tom
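For reference, the join-based check described above might look roughly like this (field names are invented; the join errors out unless a replica of the joined collection is present on the queried node):

# succeeds on nodes that host both "skus" and "products", fails elsewhere
curl 'http://node01:8983/solr/skus/select' -G \
  --data-urlencode 'q={!join from=id to=product_id fromIndex=products}*:*' \
  --data-urlencode 'rows=0'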
Re: Ping handler in SolrCloud mode
On Wed, Mar 16, 2016 at 4:10 PM, Shawn Heisey wrote: > On 3/16/2016 8:14 AM, Tom Evans wrote: >> The problem occurs when we attempt to query a node to see if products >> or items is active on that node. The balancer (haproxy) requests the >> ping handler for the appropriate collection, however all the nodes >> return OK for all the collections(!) >> >> Eg, on node01, it has replicas for products and skus, but the ping >> handler for /solr/items/admin/ping returns 200! > > This returns OK because as long as one replica for every shard in > "items" is available somewhere in the cloud, you can make a request for > "items" on that node and it will work. Or at least it *should* work, > and if it's not working, that's a bug. I remember that one of the older > 4.x versions *did* have a bug where queries for a collection would only > work if the node actually contained shards for that collection. Sorry, this is Solr 5.5, I should have said. Yes, we can absolutely make a request of "items", and it will work correctly. However, we are making requests of "skus" that join to "products", and the query is routed to a node which has only "skus" and "items", and the request fails because joins can only work over local replicas. To fix this, we now have two additional balancers: solr: has all the nodes, all nodes are valid backends solr-items: has all the nodes in the cluster, but nodes are only valid backends if it has "items" and "skus" replicas. solr-products: has all the nodes in the cluster, but nodes are only valid backends if it has "products" and "skus" replicas (I'm simplifying things a bit, there are another 6 collections that are on all nodes, hence the main balancer.) The new balancers need a cheap way of checking what nodes are valid, and ideally I'd like that check to not involve a query with a join clause! Cheers Tom
Paging and cursorMark
Hi all With Solr 5.5.0, we're trying to improve our paging performance. When we are delivering results using infinite scrolling, cursorMark is perfectly fine - one page is followed by the next. However, we also offer traditional paging of results, and this is where it gets a little tricky. Say we have 10 results per page, and a user wants to jump from page 1 to page 20, and then wants to view page 21, there doesn't seem to be a simple way to get the nextCursorMark. We can make an inefficient request for page 20 (start=190, rows=10), but we cannot give that request a cursorMark=* as it contains start=190. Consequently, if the user clicks to page 21, we have to continue along using start=200, as we have no cursorMark. The only way I can see to get a cursorMark at that point is to omit the start=200, and instead say rows=210, and ignore the first 200 results on the client side. Obviously, this gets more and more inefficient the deeper we page - I know that internally to Solr, using start=200&rows=10 has to do the same work as rows=210, but less data is sent over the wire to the client. As I understand it, the cursorMark is a hash of the sort values of the last document returned, so I don't really see why it is forbidden to specify start=190&rows=10&cursorMark=* - why is it not possible to calculate the nextCursorMark from the last document returned? I was also thinking a possible temporary workaround would be to request start=190&rows=10, note the last document returned, and then make a subsequent query for q=id:""&rows=1&cursorMark=*. This seems to work, but means an extra Solr query for no real reason. Is there any other problem to doing this? Is there some other simple trick I am missing that we can use to get both the page of results we want and a nextCursorMark for the subsequent page? Cheers Tom
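A rough sketch of that workaround (collection, sort fields and LAST_ID are placeholders; the sort must be identical in both requests and end on the uniqueKey field):

# 1) the expensive jump straight to page 20
curl 'http://localhost:8983/solr/products/select' -G \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'sort=price asc,id asc' \
  --data-urlencode 'start=190' \
  --data-urlencode 'rows=10'

# 2) re-fetch only the last document of that page to obtain a nextCursorMark for page 21
curl 'http://localhost:8983/solr/products/select' -G \
  --data-urlencode 'q=id:"LAST_ID"' \
  --data-urlencode 'sort=price asc,id asc' \
  --data-urlencode 'rows=1' \
  --data-urlencode 'cursorMark=*'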
Re: Re: Paging and cursorMark
On Wed, Mar 23, 2016 at 12:21 PM, Vanlerberghe, Luc wrote: > I worked on something similar a couple of years ago, but didn’t continue work > on it in the end. > > I've included the text of my original mail. > If you're interested, I could try to find the sources I was working on at the > time > > Luc > Thanks both Luc and Steve. I'm not sure if we will have time to deploy patched versions of things to production - time is always the enemy :( , and we're not a Java shop so there is a non-trivial time investment in just building replacement jars, let alone getting that integrated into our RPMs - but I'll definitely try it out on my dev server. The change seems excessively complex imo, but maybe I'm not seeing the use cases for skip. To my mind, calculating a nextCursorMark is cheap and only relies on having a strict sort ordering, which is also cheap to check. If that condition is met, you should get a nextCursorMark in your response regardless of whether you specified a cursorMark in the request, to allow you to efficiently get the next page. This would still leave slightly pathological performance if you skip to page N, and then iterate back to page 0, which Luc's idea of a previousCursorMark can solve. cursorMark is easy to implement: you can ignore docs which sort lower than that mark. Can you do something similar with previousCursorMark? Would it not require keeping a buffer of rows documents, and stopping when a document which sorts higher than the supplied mark appears? Seems more complex, but maybe I'm not understanding the internals correctly. Fortunately for us, 90% of our users prefer infinite scroll, and 97% of them never go beyond page 3. Cheers Tom
Re: Creating new cluster with existing config in zookeeper
On Wed, Mar 23, 2016 at 3:43 PM, Robert Brown wrote: > So I setup a new solr server to point to my existing ZK configs. > > When going to the admin UI on this new server I can see the shards/replica's > of the existing collection, and can even query it, even tho this new server > has no cores on it itself. > > Is this all expected behaviour? > > Is there any performance gain with what I have at this precise stage? The > extra server certainly makes it appear i could balance more load/requests, > but I guess the queries are just being forwarded on to the servers with the > actual data? > > Am I correct in thinking I can now create a new collection on this host, and > begin to build up a new cluster? and they won't interfere with each other > at all? > > Also, that I'll be able to see both collections when using the admin UI > Cloud page on any of the servers in either collection? > I'm confused slightly: SolrCloud is a (singular) cluster of servers, storing all of its state and configuration underneath a single zookeeper path. The cluster contains collections. Collections are tied to a particular config set within the cluster. Collections are made up of 1 or more shards. Each shard is a core, and there are 1 or more replicas of each core. You can add more servers to the cluster, and then create a new collection with the same config as an existing collection, but it is still part of the same cluster. Of course, you could think of a set of servers within a cluster as a "logical" cluster if it just serves particular collection, but "cluster" to me would be all of the servers within the same zookeeper tree, because that is where cluster state is maintained. Cheers Tom
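For reference, creating a new collection against a config set already in ZK, pinned to the new servers, is a single Collections API call (all names are examples):

curl 'http://localhost:8983/solr/admin/collections' -G \
  --data-urlencode 'action=CREATE' \
  --data-urlencode 'name=newcollection' \
  --data-urlencode 'numShards=1' \
  --data-urlencode 'replicationFactor=2' \
  --data-urlencode 'collection.configName=existingconfig' \
  --data-urlencode 'createNodeSet=newhost1:8983_solr,newhost2:8983_solr'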
SolrCloud no leader for collection
Hi all, I have an 8 node SolrCloud 5.5 cluster with 11 collections, most of them in a 1 shard x 8 replicas configuration. We have 5 ZK nodes. During the night, we attempted to reindex one of the larger collections. We reindex by pushing json docs to the update handler from a number of processes. It seemed this overwhelmed the servers, and caused all of the collections to fail and end up in either a down or a recovering state, often with no leader. Restarting and rebooting the servers brought a lot of the collections back online, but we are left with a few collections for which all the nodes hosting those replicas are up, but the replica reports as either "active" or "down", and with no leader. Trying to force a leader election has no effect, it keeps choosing a leader that is in "down" state. Removing all the nodes that are in "down" state and forcing a leader election also has no effect. Any ideas? The only viable option I see is to create a new collection, index it and then remove the old collection and alias it in. Cheers Tom
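For anyone landing here, the forced election mentioned above is the FORCELEADER Collections API action (available since Solr 5.4; collection and shard names are examples):

curl 'http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=mycollection&shard=shard1'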
Anticipated Solr 5.5.1 release date
Hi all We're currently using Solr 5.5.0 and converting our regular old style facets into JSON facets, and are running in to SOLR-8155 and SOLR-8835. I can see these have already been back-ported to 5.5.x branch, does anyone know when 5.5.1 may be released? We don't particularly want to move to Solr 6, as we have only just finished validating 5.5.0 with our original queries! Cheers Tom
Re: Anticipated Solr 5.5.1 release date
Awesome, thanks :) On Fri, Apr 15, 2016 at 4:19 PM, Anshum Gupta wrote: > Hi Tom, > > I plan on getting a release candidate out for vote by Monday. If all goes > well, it'd be about a week from then for the official release. > > On Fri, Apr 15, 2016 at 6:52 AM, Tom Evans wrote: > >> Hi all >> >> We're currently using Solr 5.5.0 and converting our regular old style >> facets into JSON facets, and are running in to SOLR-8155 and >> SOLR-8835. I can see these have already been back-ported to 5.5.x >> branch, does anyone know when 5.5.1 may be released? >> >> We don't particularly want to move to Solr 6, as we have only just >> finished validating 5.5.0 with our original queries! >> >> Cheers >> >> Tom >> > > > > -- > Anshum Gupta
Re: Verifying - SOLR Cloud replaces load balancer?
On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff wrote: > Thanks all - very helpful. > > @Shawn - your reply implies that even if I'm hitting the URL for a single > endpoint via HTTP - the "balancing" will still occur across the Solr Cloud > (I understand the caveat about that single endpoint being a potential point > of failure). I just want to verify that I'm interpreting your response > correctly... > > (I have been asked to provide IT with a comprehensive list of options prior > to a design discussion - which is why I'm trying to get clear about the > various options) > > In a nutshell, I think I understand the following: > > a. Even if hitting a single URL, the Solr Cloud will "balance" across all > available nodes for searching > Caveat: That single URL represents a potential single point of > failure and this should be taken into account > > b. SolrJ's CloudSolrClient API provides the ability to distribute load -- > based on Zookeeper's "knowledge" of all available Solr instances. > Note: This is more robust than "a" due to the fact that it > eliminates the "single point of failure" > > c. Use of a load balancer hitting all known Solr instances will be fine - > although the search requests may not run on the Solr instance the load > balancer targeted - due to "a" above. > > Corrections or refinements welcomed... With option a), although queries will be distributed across the cluster, all queries will be going through that single node. Not only is that a single point of failure, but you risk saturating the inter-node network traffic, possibly resulting in lower QPS and higher latency on your queries. With option b), as well as SolrJ, recent versions of pysolr have a ZK-aware SolrCloud client that behaves in a similar way. With option c), you can use the preferLocalShards so that shards that are local to the queried node are used in preference to distributed shards. Depending on your shard/cluster topology, this can increase performance if you are returning large amounts of data - many or large fields or many documents. Cheers Tom
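A minimal sketch of option c's shard preference - it is just a request parameter (host and collection names are examples):

curl 'http://solr-node01:8983/solr/mycollection/select?q=*:*&preferLocalShards=true'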
Re: Indexing 700 docs per second
On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson wrote: > Hi, > > I have a requirement to index (mainly updation) 700 docs per second. > Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260 > byes (6 fields out of which only 2 will undergo updation at the above > rate). This collection has around 122Million docs and that count is pretty > much a constant. > > 1. Can I manage this updation rate with a non-sharded ie single Solr > instance set up? > 2. Also is atomic update or a full update (the whole doc) of the changed > records the better approach in this case. > > Could some one please share their views/ experience? Try it and see - everyone's data/schemas are different and can affect indexing speed. It certainly sounds achievable enough - presumably you can at least produce the documents at that rate? Cheers Tom
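For what it's worth, a rough way to try it from the command line (collection and field names are invented; note that atomic updates require all fields to be stored so the unchanged ones can be reconstructed):

# full-document update
curl 'http://localhost:8983/solr/mycollection/update?commitWithin=10000' \
  -H 'Content-Type: application/json' \
  --data-binary '[{"id":"doc1","field_a":"x","counter_i":42}]'

# atomic update touching only the two changing fields
curl 'http://localhost:8983/solr/mycollection/update?commitWithin=10000' \
  -H 'Content-Type: application/json' \
  --data-binary '[{"id":"doc1","field_a":{"set":"y"},"counter_i":{"set":43}}]'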
User Authentication
Hi Solr Community I have been trying to add user authentication to our Solr 5.3.1 RedHat install. I’ve found some examples on user authentication on the Jetty side. But they have failed. Does anyone have a step-by-step example on authentication for the admin screen? And a core? Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830
Re: User Authentication
Alex I got a super secret release of Solr 5.3.1, wasn’t supposed to say anything. Yes I’m running 5.2.1, I will check out the release notes for 5.3. Was looking for three types of user authentication, I guess. 1. the Admin Console 2. User auth for each Core (and select and update) on a server. 3. HTML interface access (example: ajax-solr<https://github.com/evolvingweb/ajax-solr>) Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830 On Aug 24, 2015, at 10:05 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote: Thanks for the email from the future. It is good to start to prepare for 5.3.1 now that 5.3 is nearly out. Joking aside (and assuming Solr 5.2.1), what exactly are you trying to achieve? Solr should not actually be exposed to the users directly. It should be hiding in a backend only visible to your middleware. If you are looking for a HTML interface that talks directly to Solr after authentication, that's not the right way to set it up. That said, some security features are being rolled out and you should definitely check the release notes for the 5.3. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 24 August 2015 at 10:01, LeZotte, Tom wrote: Hi Solr Community I have been trying to add user authentication to our Solr 5.3.1 RedHat install. I’ve found some examples on user authentication on the Jetty side. But they have failed. Does any one have a step by step example on authentication for the admin screen? And a core? Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830
Re: User Authentication
Bosco, We use CAS for user authentication, not sure if we have Kerberos working anywhere. Also we are not using ZooKeeper, because we are only running one server currently. thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830 On Aug 24, 2015, at 3:12 PM, Don Bosco Durai <bo...@apache.org> wrote: Just curious, is Kerberos an option for you? If so, mostly all your 3 use cases will addressed. Bosco On 8/24/15, 12:18 PM, "Steven White" <swhite4...@gmail.com> wrote: Hi Noble, Is everything in the link you provided applicable to Solr 5.2.1? Thanks Steve On Mon, Aug 24, 2015 at 2:20 PM, Noble Paul <noble.p...@gmail.com> wrote: did you manage to look at the reference guide? https://cwiki.apache.org/confluence/display/solr/Securing+Solr On Mon, Aug 24, 2015 at 9:23 PM, LeZotte, Tom wrote: Alex I got a super secret release of Solr 5.3.1, wasn’t suppose to say anything. Yes I’m running 5.2.1, I will check out the release notes for 5.3. Was looking for three types of user authentication, I guess. 1. the Admin Console 2. User auth for each Core ( and select and update) on a server. 3. HTML interface access (example: ajax-solr <https://github.com/evolvingweb/ajax-solr>) Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830 On Aug 24, 2015, at 10:05 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote: Thanks for the email from the future. It is good to start to prepare for 5.3.1 now that 5.3 is nearly out. Joking aside (and assuming Solr 5.2.1), what exactly are you trying to achieve? Solr should not actually be exposed to the users directly. It should be hiding in a backend only visible to your middleware. If you are looking for a HTML interface that talks directly to Solr after authentication, that's not the right way to set it up. That said, some security features are being rolled out and you should definitely check the release notes for the 5.3. Regards, Alex. Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 24 August 2015 at 10:01, LeZotte, Tom <tom.lezo...@vanderbilt.edu> wrote: Hi Solr Community I have been trying to add user authentication to our Solr 5.3.1 RedHat install. I’ve found some examples on user authentication on the Jetty side. But they have failed. Does any one have a step by step example on authentication for the admin screen? And a core? Thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830 -- - Noble Paul
Solr 5.3 Faceting on Children with Block Join Parser
Apologies for cross posting a question from SO here. I am very interested in the new faceting on child documents feature of Solr 5.3 and would like to know if somebody has figured out how to do it as asked in the question on http://stackoverflow.com/questions/32212949/solr-5-3-faceting-on-children-with-block-join-parser Thanks for any hints, Tom The question is: Solr 5.3 supports faceting on nested documents [1], with a great tutorial from Yonik [2]. In the tutorial example, the query to get the documents for faceting is directly performed on the child documents: $ curl http://localhost:8983/solr/demo/query -d ' q=author_s:yonik&fl=id,comment_t& json.facet={ genres : { type: terms, field: cat_s, domain: { blockParent : "type_s:book" } } }' What I do not know is how to facet on child documents returned from a Block Join Parent Query Parser [3] and provided through ExpandComponent [4]. What I have working so far is the same as in the example from the ExpandComponent [4]: Query the child fields to return the parent documents (see 1.), then expand the result to get the relevant child documents (see 2.) 1. q={!parent which="type_s:parent" v='text_t:solr'} 2. &expand=true&expand.field=ISBN_s&expand.q=*:* What I need: Having steps 1.) and 2.) already working, how can we facet on some field (does not matter which) of the returned child documents from (2.) ? [1]: http://yonik.com/solr-5-3/ [2]: http://yonik.com/solr-nested-objects/ [3]: https://cwiki.apache.org/confluence/display/solr/Other+Parsers [4]: http://heliosearch.org/expand-block-join/
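One possibly related sketch: the same 5.3 JSON facet API also offers a blockChildren domain, which facets over the children of the parents matched by a block join query; whether this composes with the ExpandComponent flow above is exactly the open question (the child facet field below is invented):

curl http://localhost:8983/solr/demo/query -d 'q={!parent which="type_s:parent"}text_t:solr&
 json.facet={
   child_cats : {
     type : terms,
     field : "some_child_field_s",
     domain : { blockChildren : "type_s:parent" }
   }
 }'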
tmp directory over load
HI Solr/Tika uses the /tmp directory to process documents. At times the directory hits 100%. This causes alarms from Nagios for us. Is there a way in Solr/Tika to limit the amount of space used in /tmp? Value could be 80% or 570MB. thanks Tom LeZotte Health I.T. - Senior Product Developer (p) 615-875-8830
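One knob that may help, though it relocates rather than caps the usage, is pointing the JVM temp directory at a larger volume via solr.in.sh (the path is only an example and must be writable by the solr user):

SOLR_OPTS="$SOLR_OPTS -Djava.io.tmpdir=/data/solr/tmp"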
Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id
On Mon, Nov 2, 2015 at 1:38 PM, fabigol wrote: > Thank > All works. > I have 2 last questions: > How can i put 0 by defaults " clean" during a indexation? > > To conclure, i wand to understand: > > > Requests: 7 (1/s), Fetched: 452447 (45245/s), Skipped: 0, Processed: 17433 > (1743/s) > > What is the "requests"? > What is 'Fetched"? > What is "Processed"? > > Thank again for your answer > Depends upon how DIH is configured - different things return different numbers. For a SqlEntityProcessor, "Requests" is the number of SQL queries, "Fetched" is the number of rows read from those queries, and "Processed" is the number of documents processed by SOLR. > For the second question, i try: > > false > > > and > true > false > Putting things in "invariants" overrides whatever is passed for that parameter in the request parameters. By putting "false" in invariants, you are making it impossible to clean + index as part of DIH, because "clean" is always false. Cheers Tom
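For reference, the per-request override looks like this (core name is an example); it only works as long as clean is not pinned in the handler's invariants - putting the value under defaults instead keeps it overridable:

curl 'http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=false&commit=true'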
Best way to track cumulative GC pauses in Solr
Hi all We have some issues with our Solr servers spending too much time paused doing GC. From turning on gc debug, and extracting numbers from the GC log, we're getting an idea of just how much of a problem. I'm currently doing this in a hacky, inefficient way: grep -h 'Total time for which application threads were stopped:' solr_gc* \ | awk '($11 > 0.3) { print $1, $11 }' \ | sed 's#:.*:##' \ | sort -n \ | sum_by_date.py (Yes, I really am using sed, grep and awk all in one line. Just wrong :) The "sum_by_date.py" program simply adds up all the values with the same first column, and remembers the largest value seen. This is giving me the cumulative GC time for extended pauses (over 0.5s), and the maximum pause seen in a given time period (hourly), eg: 2015-11-13T11 119.124037 2.203569 2015-11-13T12 184.683309 3.156565 2015-11-13T13 65.934526 1.978202 2015-11-13T14 63.970378 1.411700 This is fine for seeing that we have a problem. However, really I need to get this in to our monitoring systems - we use munin. I'm struggling to work out the best way to extract this information for our monitoring systems, and I think this might be my naivety about Java, and working out what should be logged. I've turned on JMX debugging, and looking at the different beans available using jconsole, but I'm drowning in information. What would be the best thing to monitor? Ideally, like the stats above, I'd like to know the cumulative time spent paused in GC since the last poll, and the longest GC pause that we see. munin polls every 5 minutes, are there suitable counters exposed by JMX that it could extract? Thanks in advance Tom
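For what it's worth, the JVM's GarbageCollector MXBeans (java.lang:type=GarbageCollector,name=*) expose cumulative CollectionCount and CollectionTime, but no maximum pause length; a cheap command-line alternative for cumulative GC time is jstat (the pid is a placeholder):

# YGCT/FGCT/GCT are cumulative GC times in seconds since JVM start; sample every 5000 ms
jstat -gcutil <solr-pid> 5000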
Re: Best way to track cumulative GC pauses in Solr
On Fri, Nov 13, 2015 at 4:50 PM, Walter Underwood wrote: > Also, what GC settings are you using? We may be able to make some suggestions. > > Cumulative GC pauses aren’t very interesting to me. I’m more interested in > the longest ones, 90th percentile, 95th, etc. > Any advice would be great, but what I'm primarily interested in is how people are monitoring these statistics in real time, for all time, on production servers. Eg, for looking at the disk or RAM usage of one of my servers, I can look at the historical usage in the last week, last month, last year and so on. I need to get these stats in to the same monitoring tools as we use for monitoring every other vital aspect of our servers. Looking at log files can be useful, but I don't want to keep arbitrarily large log files on our servers, nor extract data from them, I want to record it for posterity in one system that understands sampling. We already use and maintain our own munin systems, so I'm not interested in paid-for equivalents of munin - regardless of how simple to set up they are, they don't integrate with our other performance monitoring stats, and I would never get budget anyway. So really: 1) Is it OK to turn JMX monitoring on on production systems? The comments in solr.in.sh suggest not. 2) What JMX beans and attributes should I be using to monitor GC pauses, particularly maximum length of a single pause in a period, and the total length of pauses in that period? Cheers Tom
Re: Defining SOLR nested fields
On Sun, Dec 13, 2015 at 6:40 PM, santosh sidnal wrote: > Hi All, > > I want to define nested fileds in SOLR using schema.xml. we are using Apache > Solr 4.7.0. > > i see some links which says how to do, but not sure how can i do it in > schema.xml > https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers > > > any help over here is appreciable. > With nested documents, it is better to not think of them as "children", but as related documents. All the documents in your index will follow exactly the same schema, whether they are "children" or "parents", and the nested aspect of a a document simply allows you to restrict your queries based upon that relationship. Solr is extremely efficient dealing with sparse documents (docs with only a few fields defined), so one way is to define all your fields for "parent" and "child" in the schema, and only use the appropriate ones in the right document. Another way is to use a schema-less structure, although I'm not a fan of that for error checking reasons. You can also define a suffix or prefix for fields that you use as part of your methodology, so that you know what domain it belongs in, but that would just be for your benefit, Solr would not complain if you put a "child" field in a parent or vice-versa. Cheers Tom PS: I would not use Solr 4.7 for this. Nested docs are a new-ish feature, you may encounter bugs that have been fixed in later versions, and performance has certainly been improved in later versions. Faceting on a specific domain (eg, on children or parents) is only supported by the JSON facet API, which was added in 5.2, and the current stable version of Solr is 5.4.
Moving to SolrCloud, specifying dataDir correctly
Hi all We're currently in the process of migrating our distributed search running on 5.0 to SolrCloud running on 5.4, and setting up a test cluster for performance testing etc. We have several cores/collections, and in each core's solrconfig.xml, we were specifying an empty <dataDir/>, and specifying the same core.baseDataDir in core.properties. When I tried this in SolrCloud mode, specifying "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine for the first collection, but then the second collection tried to use the same directory to store its index, which obviously failed. I fixed this by changing solrconfig.xml in each collection to specify a specific directory, like so: <dataDir>${solr.data.dir:}products</dataDir> Looking back after the weekend, I'm not a big fan of this. Is there a way to add a core.properties to ZK, or a way to specify core.baseDataDir on the command line, or just a better way of handling this that I'm not aware of? Cheers Tom
Re: Moving to SolrCloud, specifying dataDir correctly
On Mon, Dec 14, 2015 at 1:22 PM, Shawn Heisey wrote: > On 12/14/2015 10:49 AM, Tom Evans wrote: >> When I tried this in SolrCloud mode, specifying >> "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine >> for the first collection, but then the second collection tried to use >> the same directory to store its index, which obviously failed. I fixed >> this by changing solrconfig.xml in each collection to specify a >> specific directory, like so: >> >> ${solr.data.dir:}products >> >> Looking back after the weekend, I'm not a big fan of this. Is there a >> way to add a core.properties to ZK, or a way to specify >> core.baseDatadir on the command line, or just a better way of handling >> this that I'm not aware of? > > Since you're running SolrCloud, just let Solr handle the dataDir, don't > try to override it. It will default to "data" relative to the > instanceDir. Each instanceDir is likely to be in the solr home. > > With SolrCloud, your cores will not contain a "conf" directory (unless > you create it manually), therefore the on-disk locations will be *only* > data, there's not really any need to have separate locations for > instanceDir and dataDir. All active configuration information for > SolrCloud is in zookeeper. > That makes sense, but I guess I was asking the wrong question :) We have our SSDs mounted on /data/solr, which is where our indexes should go, but our solr install is on /opt/solr, with the default solr home in /opt/solr/server/solr. How do we change where the indexes get put so they end up on the fast storage? Cheers Tom
Search over a multiValued field
Hi, I am running Solr 5.0.0 and have a question about proximity search and multiValued fields. I am indexing xml files of the following form, with foundField being a field defined as multiValued and text_en in my schema.xml:
<doc>
  <field name="id">8</field>
  <field name="foundField">"Oranges from South California - ordered"</field>
  <field name="foundField">"Green Apples - available"</field>
  <field name="foundField">"Black Report Books - ordered"</field>
</doc>
There are several such documents, and for instance, I would like to query all documents having in the foundField "Oranges" and "ordered". The following proximity query takes care of it: q=foundField:("oranges AND ordered"~2) However, a field could have more words, and I also cannot know the proximity of the desired query words in advance. Setting the proximity value too high results in false positives, the following query also returns the document (although "available" was in the entry about Apples): foundField:("oranges AND available"~200) I do not think that tweaking a proximity value is the correct approach. How can I search to match contents in a multiValued field per Value as described above, without running into the problem? Many thanks for any help
Re: Search over a multiValued field
Jack, This is exactly what I was looking for, thanks. I found the positionIncrementGap attribute in the schema.xml for the text_en I was putting in "AND" because I read in the Solr documentation that "The OR operator is the default conjunction operator." Does it mean that words between " symbols, such as "Orange ordered" are treated as a single term, with (implicitly) AND conjunction between them? Where could I found more info about this? I am currently reading https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser Thanks again On Tue, Mar 3, 2015 at 3:58 PM, Jack Krupansky wrote: > Just set the positionIncrementGap for the multivalued field to a much > higher value, like 1000 or 5000. That's the purpose of this attribute, to > assure that reasonable proximity matches don't match across multiple > values. > > Also, leave "AND" out of the query phrases - you're just trying to match > the product name and availability. > > > -- Jack Krupansky > > On Tue, Mar 3, 2015 at 4:51 PM, Tom Devel wrote: > > > Hi, > > > > I am running Solr 5.0.0 and have a question about proximity search and > > multiValued fields. > > > > I am indexing xml files of the following form with foundField being a > field > > defined as multiValued and text_en my in schema.xml. > > > > > > > > 8 > > "Oranges from South California - > ordered" > > "Green Apples - available" > > "Black Report Books - ordered" > > > > > > There are several such documents, and for instance, I would like to query > > all documents having in the foundField "Oranges" and "ordered". The > > following proximity query takes care of it: > > > > q=foundField:("oranges AND ordered"~2) > > > > However, a field could have more words, and I also cannot know the > > proximity of the desired query words in advance. Setting the proximity > > value too high results in false positives, the following query also > returns > > the document (although "available" was in the entry about Apples): > > > > foundField:("oranges AND available"~200) > > > > I do not think that tweaking a proximity value is the correct approach. > > > > How can I search to match contents in a multiValued field per Value as > > described above, without running into the problem? > > > > Many thanks for any help > > >
Re: Search over a multiValued field
Erick, Thanks a lot for the explanation, makes sense now. Tom On Tue, Mar 3, 2015 at 5:54 PM, Erick Erickson wrote: > bq: Does it mean that words between " symbols, such as "Orange ordered" are > treated as a single term, with (implicitly) AND conjunction between them? > > not at all. When you quote things, you're getting a "phrase query", > perhaps one > with slop. So something like > "a b" means that 'a' must appear right next to 'b'. This is something > like an AND > in the sense that both terms must appear, but it is far more > restrictive since it takes into > account the position of the terms in the field. > > "a b"~10 means that both words must appear within 10 transpositions in > the same field. > You can think of "transposition" as how many intervening terms there > are, so something > like "a b"~2 would match docs with "a x b", but not "a x y z b". > > And this is where positionIncrementGap comes in. By putting 1000 in > for it, you guarantee > "a b"~999 won't match 'a' in one field and 'b' in another. > > whereas a AND b would match across successive MV entries no matter what the > gap. > > HTH, > Erick > > On Tue, Mar 3, 2015 at 2:22 PM, Tom Devel wrote: > > Jack, > > > > This is exactly what I was looking for, thanks. I found the > > positionIncrementGap attribute in the schema.xml for the text_en > > > > I was putting in "AND" because I read in the Solr documentation that "The > > OR operator is the default conjunction operator." > > > > Does it mean that words between " symbols, such as "Orange ordered" are > > treated as a single term, with (implicitly) AND conjunction between them? > > > > Where could I found more info about this? > > > > I am currently reading > > > https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser > > > > Thanks again > > > > On Tue, Mar 3, 2015 at 3:58 PM, Jack Krupansky > > > wrote: > > > >> Just set the positionIncrementGap for the multivalued field to a much > >> higher value, like 1000 or 5000. That's the purpose of this attribute, > to > >> assure that reasonable proximity matches don't match across multiple > >> values. > >> > >> Also, leave "AND" out of the query phrases - you're just trying to match > >> the product name and availability. > >> > >> > >> -- Jack Krupansky > >> > >> On Tue, Mar 3, 2015 at 4:51 PM, Tom Devel wrote: > >> > >> > Hi, > >> > > >> > I am running Solr 5.0.0 and have a question about proximity search and > >> > multiValued fields. > >> > > >> > I am indexing xml files of the following form with foundField being a > >> field > >> > defined as multiValued and text_en my in schema.xml. > >> > > >> > > >> > > >> > 8 > >> > "Oranges from South California - > >> ordered" > >> > "Green Apples - available" > >> > "Black Report Books - ordered" > >> > > >> > > >> > There are several such documents, and for instance, I would like to > query > >> > all documents having in the foundField "Oranges" and "ordered". The > >> > following proximity query takes care of it: > >> > > >> > q=foundField:("oranges AND ordered"~2) > >> > > >> > However, a field could have more words, and I also cannot know the > >> > proximity of the desired query words in advance. Setting the proximity > >> > value too high results in false positives, the following query also > >> returns > >> > the document (although "available" was in the entry about Apples): > >> > > >> > foundField:("oranges AND available"~200) > >> > > >> > I do not think that tweaking a proximity value is the correct > approach. 
> >> > > >> > How can I search to match contents in a multiValued field per Value as > >> > described above, without running into the problem? > >> > > >> > Many thanks for any help > >> > > >> >
Order of defining fields and dynamic fields in schema.xml
Hi, I am running solr 5 using basic_configs and have a question about the order of defining fields and dynamic fields in the schema.xml file. For example, there is a field "hierarchy.of.fields.Project" I am capturing as "text_en_splitting", but the rest of the fields in this hierarchy I would like as "text_en". Since the dynamicField with * is technically spanning over the Project field, should its definition go above, or below, the Project field? Or in this case, I have a hierarchy where currently only one field should be captured, "another.hierarchy.of.fields.Description", and the rest for now should just be ignored. Is there any significance to which definition comes first? Thanks for any hints, Tom
Re: Order of defining fields and dynamic fields in schema.xml
Thats good to know. On http://wiki.apache.org/solr/SchemaXml it also states about dynamicFields that "you can create field rules that Solr will use to understand what datatype should be used whenever it is given a field name that is not explicitly defined, but matches a prefix or suffix used in a dynamicField. " Thanks On Fri, Mar 6, 2015 at 10:43 AM, Alexandre Rafalovitch wrote: > I don't believe the order in file matters for anything apart from > initParams section. The longer - more specific one - matches first. > > > Regards, >Alex. > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 6 March 2015 at 11:21, Tom Devel wrote: > > Hi, > > > > I am running solr 5 using basic_configs and have a questions about the > > order of defining fields and dynamic fields in the schema.xml file? > > > > For example, there is a field "hierarchy.of.fields.Project" I am > capturing > > as below as "text_en_splitting", but the rest of the fields in this > > hierarchy, I would like as "text_en" > > > > Since the dynamicField with * is technically spanning over the Project > > field, should its definition go above, or below the Project field? > > > > > indexed="true" stored="true" multiValued="true" required="false" /> > > > indexed="true" stored="true" multiValued="true" required="false" /> > > > > > > Or this case, I have a hierarchy where currently only one field should be > > captured "another.hierarchy.of.fields.Description", the rest for now > should > > be just ignored. Is here any significance of which definition comes > first? > > > > > indexed="false" stored="false" multiValued="true" required="false" /> > > > type="text_en"indexed="true" stored="true" multiValued="true" > > required="false" /> > > > > Thanks for any hints, > > Tom >
Setting up SOLR 5 from an RPM
Hi all We're migrating to SOLR 5 (from 4.8), and our infrastructure guys would prefer we installed SOLR from an RPM rather than extracting the tarball where we need it. They are creating the RPM file themselves, and it installs an init.d script and the equivalent of the tarball to /opt/solr. We're having problems running SOLR from the installed files, as SOLR wants to (I think) extract the WAR file and create various temporary files below /opt/solr/server. We currently have this structure: /data/solr - root directory of our solr instance /data/solr/{logs,run} - log/run directories /data/solr/cores - configuration for our cores and solr.in.sh /opt/solr - the RPM installed solr 5 The user running solr can modify anything under /data/solr, but nothing under /opt/solr. Is this sort of configuration supported? Am I missing some variable in our solr.in.sh that sets where temporary files can be extracted? We currently set: SOLR_PID_DIR=/data/solr/run SOLR_HOME=/data/solr/cores SOLR_LOGS_DIR=/data/solr/logs Cheers Tom
Re: Setting up SOLR 5 from an RPM
On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans wrote: > Hi all > > We're migrating to SOLR 5 (from 4.8), and our infrastructure guys > would prefer we installed SOLR from an RPM rather than extracting the > tarball where we need it. They are creating the RPM file themselves, > and it installs an init.d script and the equivalent of the tarball to > /opt/solr. > > We're having problems running SOLR from the installed files, as SOLR > wants to (I think) extract the WAR file and create various temporary > files below /opt/solr/server. From the SOLR 5 reference guide, section "Managing SOLR", sub-section "Taking SOLR to production", it seems changing the ownership of the installed files to the user that will run SOLR is an explicit requirement if you do not wish to run as root. It would be better if this was not required. With most applications you do not normally require permission to modify the installed files in order to run the application, eg I do not need write permission to /usr/share/vim to run vim, it is a shame I need write permission to /opt/solr to run solr. Cheers Tom
Re: Setting up SOLR 5 from an RPM
On Wed, Mar 25, 2015 at 2:40 PM, Shawn Heisey wrote: > I think you will only need to change the ownership of the solr home and > the location where the .war file is extracted, which by default is > server/solr-webapp. The user must be able to *read* the program data, > but should not need to write to it. If you are using the start script > included with Solr 5 and one of the examples, I believe the logging > destination will also be located under the solr home, but you should > make sure that's the case. Thanks Shawn, this sort of makes sense. The thing which I cannot seem to do is change the location where the war file is extracted. I think this is probably because, as of solr 5, I am not supposed to know or be aware that there is a war file, or that the war file is hosted in jetty, which makes it tricky to specify the jetty temporary directory. Our use case is that we want to create a single system image that would be usable for several projects, each project would check out its solr home and run solr as their own user (possibly on the same server). Eg, /data/projectA being a solr home for one project, /data/projectB being a solr home for another project, both running solr from the same location. Also, on a dev server, I want to install solr once, and each member of my team run it from that single location. Because they cannot change the temporary directory, and they cannot all own server/solr-webapp, this does not work and they must each have their own copy of the solr install. I think the way we will go for this is in production to run all our solr instance as the "solr" user, who will own the files in /opt/solr, and have their solr home directory wherever they choose. In dev, we will just do something... Cheers Tom
Confusing SOLR 5 memory usage
Hi all I have two SOLR 5 servers, one is the master and one is the slave. They both have 12 cores, fully replicated and giving identical results when querying them. The only difference between configuration on the two servers is that one is set to slave from the other - identical core configs and solr.in.sh. They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are setting the heap size identically: SOLR_JAVA_MEM="-Xms512m -Xmx7168m" The two servers are balanced behind haproxy, and identical numbers and types of queries flow to both servers. Indexing only happens once a day. When viewing the memory usage of the servers, the master server's JVM has 8.8GB RSS, but the slave only has 1.2GB RSS. Can someone hit me with the cluebat please? :) Cheers Tom
Re: Confusing SOLR 5 memory usage
We monitor them with munin, so I have charts if attachments are acceptable? Having said that, they have only been running for a day with this memory allocation.. Describing them, the master consistently has 8GB used for apps, the 8GB used in cache, whilst the slave consistently only uses ~1.5GB for apps, 14GB used in cache. We are trying to use our SOLR servers to do a lot more facet queries, previously we were mainly doing searches, and the SolrPerformanceProblems wiki page mentions that faceting (amongst others) require a lot of JVM heap, so I'm confused why it is not using the heap we've allocated on one server, whilst it is on the other server. Perhaps our master server needs even more heap? Also, my infra guy is wondering why I asked him to add more memory to the slave server, if it is "just" in cache, although I did try to explain that ideally, I'd have even more in cache - we have about 35GB of index data. Cheers Tom On Tue, Apr 21, 2015 at 11:25 AM, Markus Jelsma wrote: > Hi - what do you see if you monitor memory over time? You should see a > typical saw tooth. > Markus > > -Original message- >> From:Tom Evans >> Sent: Tuesday 21st April 2015 12:22 >> To: solr-user@lucene.apache.org >> Subject: Confusing SOLR 5 memory usage >> >> Hi all >> >> I have two SOLR 5 servers, one is the master and one is the slave. >> They both have 12 cores, fully replicated and giving identical results >> when querying them. The only difference between configuration on the >> two servers is that one is set to slave from the other - identical >> core configs and solr.in.sh. >> >> They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are >> setting the heap size identically: >> >> SOLR_JAVA_MEM="-Xms512m -Xmx7168m" >> >> The two servers are balanced behind haproxy, and identical numbers and >> types of queries flow to both servers. Indexing only happens once a >> day. >> >> When viewing the memory usage of the servers, the master server's JVM >> has 8.8GB RSS, but the slave only has 1.2GB RSS. >> >> Can someone hit me with the cluebat please? :) >> >> Cheers >> >> Tom >>
Re: Confusing SOLR 5 memory usage
I do apologise for wasting anyone's time on this, the PEBKAC (my keyboard and chair unfortunately). When adding the new server to haproxy, I updated the label for the balancer entry to the new server, but left the host name the same, so the server that wasn't using any RAM... wasn't getting any requests. Again, sorry! Tom On Tue, Apr 21, 2015 at 11:54 AM, Tom Evans wrote: > We monitor them with munin, so I have charts if attachments are > acceptable? Having said that, they have only been running for a day > with this memory allocation.. > > Describing them, the master consistently has 8GB used for apps, the > 8GB used in cache, whilst the slave consistently only uses ~1.5GB for > apps, 14GB used in cache. > > We are trying to use our SOLR servers to do a lot more facet queries, > previously we were mainly doing searches, and the > SolrPerformanceProblems wiki page mentions that faceting (amongst > others) require a lot of JVM heap, so I'm confused why it is not using > the heap we've allocated on one server, whilst it is on the other > server. Perhaps our master server needs even more heap? > > Also, my infra guy is wondering why I asked him to add more memory to > the slave server, if it is "just" in cache, although I did try to > explain that ideally, I'd have even more in cache - we have about 35GB > of index data. > > Cheers > > Tom > > On Tue, Apr 21, 2015 at 11:25 AM, Markus Jelsma > wrote: >> Hi - what do you see if you monitor memory over time? You should see a >> typical saw tooth. >> Markus >> >> -Original message- >>> From:Tom Evans >>> Sent: Tuesday 21st April 2015 12:22 >>> To: solr-user@lucene.apache.org >>> Subject: Confusing SOLR 5 memory usage >>> >>> Hi all >>> >>> I have two SOLR 5 servers, one is the master and one is the slave. >>> They both have 12 cores, fully replicated and giving identical results >>> when querying them. The only difference between configuration on the >>> two servers is that one is set to slave from the other - identical >>> core configs and solr.in.sh. >>> >>> They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are >>> setting the heap size identically: >>> >>> SOLR_JAVA_MEM="-Xms512m -Xmx7168m" >>> >>> The two servers are balanced behind haproxy, and identical numbers and >>> types of queries flow to both servers. Indexing only happens once a >>> day. >>> >>> When viewing the memory usage of the servers, the master server's JVM >>> has 8.8GB RSS, but the slave only has 1.2GB RSS. >>> >>> Can someone hit me with the cluebat please? :) >>> >>> Cheers >>> >>> Tom >>>
Re: Checking of Solr Memory and Disk usage
On Fri, Apr 24, 2015 at 8:31 AM, Zheng Lin Edwin Yeo wrote: > Hi, > > So has anyone knows what is the issue with the "Heap Memory Usage" reading > showing the value -1. Should I open an issue in Jira? I have solr 4.8.1 and solr 5.0.0 servers, on the solr 4.8.1 servers the core statistics have values for heap memory, on the solr 5.0.0 ones I also see the value -1. This is with CentOS 6/Java 1.7 OpenJDK on both versions. I don't see this issue in the fixed bugs in 5.1.0, but I only looked at the headlines of the tickets.. http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.bug_fixes Cheers Tom
Block Join Query update documents, how to do it correctly?
I am using the Block Join Query Parser with success, following the example on: https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers As this example shows, each parent document can have a number of documents embedded, and each document, be it a parent or a child, has its own unique identifier. Now I would like to update some of the parent documents, and have read that there are horror stories with duplicate documents, scrambled data etc.; the two prominent JIRA entries for this are: https://issues.apache.org/jira/browse/SOLR-6700 https://issues.apache.org/jira/browse/SOLR-6096 My question is, how do you usually update such documents, for example to update a value for the parent or a value for one of its children? I tried to repost the whole modified document (the parent and ALL of its children as one file), and it seems to work on a small toy example, but of course I cannot be sure for a larger instance with thousands of documents, and I would like to know if this is the correct way to go or not. To make it clear, if originally I used bin/solr post on the following file: 1 Solr has block join support parentDocument 2 SolrCloud supports it too! Now I could do bin/solr post on a file: 1 Updated field: Solr has block join support parentDocument 2 Updated field: SolrCloud supports it too! Will this avoid the inconsistent, scrambled or duplicate data on Solr instances discussed in the JIRAs? How do you usually do this? Thanks for any help or hints. Tom
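For what it's worth, a sketch of what such a whole-block repost file might look like, using the field names from the cwiki example linked above (id, title, content_type as the parent flag, comments on the child); the actual field names depend on your schema:

<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Updated field: Solr has block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">Updated field: SolrCloud supports it too!</field>
    </doc>
  </doc>
</add>

The essential property, as I read the JIRAs above, is that the parent and all of its children are always posted together as one block; Lucene stores a block contiguously, and updating only the parent or only a child is what leads to the duplicated or scrambled documents described there.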
Solr
Hello, I have customized my Solr results so that they display only 3 fields: the document ID, name and last_modified date. The results are in JSON. This is a sample of my Javascript function to execute the query:

var query = "";
//set user input to query
query = window.document.box.input.value;
//solr URL
var sol = "http://localhost:8983/solr/gettingstarted_shard1_replica1/select?q=";
var sol2 = "&wt=json&fl=title,id,category,last_modified&rows=1000&indent=true";
//redirect
window.location.href = sol + query + sol2;

The output example would look like:

{ "id":"/solr/docs/ISO/Employee Benefits Information/BCN.doc", "title":["BCN Auto Policy Verbiage:"], "last_modified":["2014-01-07T15:19:00Z"]},

I want to format my Solr results so that the document ID will be displayed as a link that users can click on and load the BCN.doc file. Any tips on how to do this? I am stuck. All help is appreciated! Thanks, T
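One possible approach, sketched below: rather than redirecting the browser to the raw Solr response, fetch the JSON with fetch() (or XMLHttpRequest) and build anchor tags from each document's id. The fileBaseUrl value and the "results" element are placeholders for this sketch - they assume the .doc files are served by some web server and that the page has a container element to render into; adjust both to your actual setup.

// Sketch: fetch Solr results as JSON and render each hit as a clickable link.
// fileBaseUrl is a hypothetical server that maps the stored ids to downloadable files.
var fileBaseUrl = "http://fileserver.example.com";

function renderResults(query) {
  var url = "http://localhost:8983/solr/gettingstarted_shard1_replica1/select?q="
          + encodeURIComponent(query)
          + "&wt=json&fl=title,id,last_modified&rows=1000";
  fetch(url)
    .then(function (response) { return response.json(); })
    .then(function (data) {
      var container = document.getElementById("results");
      container.innerHTML = "";
      data.response.docs.forEach(function (doc) {
        var link = document.createElement("a");
        link.href = fileBaseUrl + doc.id;  // id looks like /solr/docs/ISO/.../BCN.doc
        link.textContent = (doc.title && doc.title[0]) || doc.id;
        container.appendChild(link);
        container.appendChild(document.createElement("br"));
      });
    });
}

Note that calling Solr directly from the browser like this may need CORS enabled on the Solr side, or the page to be served from the same host.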
changing web context and port for SolrCloud Zookeeper
I need to change the web context and the port for a SolrCloud installation. Example, change: host:8080/some-api-here/ to this: host:8983/solr/ Does anyone know how to do this with SolrCloud? There are values stored in clusterstate.json and /leader/elect and I could change them but that seems a little messy. Thanks
Re: changing web context and port for SolrCloud Zookeeper
My Solr installation is running on Tomcat on port 8080 with a web context name that is different than /solr. We want to move to a basic jetty setup with all the defaults. I haven’t found a clean way to do this. A lot of the values like baseurl and /leader/elect/shard1 have values that need to be updated. If I try shutting down the servers, change the zookeeper settings and then restart Solr in Jetty I get issues - like Solr thinks they are replicas. So I’m looking to see if anyone knows what is the cleanest way to move from a Tomcat/8080 install to a Jetty/8983 one. Thanks > On May 11, 2016, at 1:59 PM, John Bickerstaff > wrote: > > I may be answering the wrong question - but SolrCloud goes in by default on > 8983, yes? Is yours currently on 8080? > > I don't recall where, but I think I saw a config file setting for the port > number (In Solr I mean) > > Am I on the right track or are you asking something other than how to get > Solr on host:8983/solr ? > > On Wed, May 11, 2016 at 11:56 AM, Tom Gullo wrote: > >> I need to change the web context and the port for a SolrCloud installation. >> >> Example, change: >> >> host:8080/some-api-here/ >> >> to this: >> >> host:8983/solr/ >> >> Does anyone know how to do this with SolrCloud? There are values stored >> in clusterstate.json and /leader/elect and I could change them >> but that seems a little messy. >> >> Thanks
Re: changing web context and port for SolrCloud Zookeeper
That helps. I ended up updating the solr.in.sh file in /etc/default and that was then getting picked up. Thanks > On May 11, 2016, at 2:05 PM, Tom Gullo wrote: > > My Solr installation is running on Tomcat on port 8080 with a web context > name that is different than /solr. We want to move to a basic jetty setup > with all the defaults. I haven’t found a clean way to do this. A lot of the > values like baseurl and /leader/elect/shard1 have values that need to be > updated. If I try shutting down the servers, change the zookeeper settings > and then restart Solr in Jetty I get issues - like Solr thinks they are > replicas. So I’m looking to see if anyone knows what is the cleanest way to > move from a Tomcat/8080 install to a Jetty/8983 one. > > Thanks > >> On May 11, 2016, at 1:59 PM, John Bickerstaff >> wrote: >> >> I may be answering the wrong question - but SolrCloud goes in by default on >> 8983, yes? Is yours currently on 8080? >> >> I don't recall where, but I think I saw a config file setting for the port >> number (In Solr I mean) >> >> Am I on the right track or are you asking something other than how to get >> Solr on host:8983/solr ? >> >> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo wrote: >> >>> I need to change the web context and the port for a SolrCloud installation. >>> >>> Example, change: >>> >>> host:8080/some-api-here/ >>> >>> to this: >>> >>> host:8983/solr/ >>> >>> Does anyone know how to do this with SolrCloud? There are values stored >>> in clusterstate.json and /leader/elect and I could change them >>> but that seems a little messy. >>> >>> Thanks >
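For reference, the kind of lines involved in solr.in.sh look like this (values are illustrative):

SOLR_PORT=8983
SOLR_HOST=solr1.example.com

SOLR_PORT sets the Jetty port and SOLR_HOST the hostname the node registers in ZooKeeper; with the standard Solr 5+ Jetty setup the web context stays at /solr and is not normally changed.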
Re: Creating a collection with 1 shard gives a weird range
On Tue, May 17, 2016 at 9:40 AM, John Smith wrote: > I'm trying to create a collection starting with only one shard > (numShards=1) using a compositeID router. The purpose is to start small > and begin splitting shards when the index grows larger. The shard > created gets a weird range value: 80000000-7fffffff, which doesn't look > effective. Indeed, if I try to import some documents using a DIH, none > gets added. > > If I create the same collection with 2 shards, the ranges seem more > logical (0-7fffffff & 80000000-ffffffff). In this case documents are > indexed correctly. > > Is this behavior by design, i.e. is a minimum of 2 shards required? If > not, how can I create a working collection with a single shard? > > This is Solr-6.0.0 in cloud mode with zookeeper-3.4.8. > I believe this is as designed, see this email from Shawn: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c570d0a03.5010...@elyograg.org%3E Cheers Tom
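As an aside, the range may look less alarming once read as signed 32-bit integers: 80000000-7fffffff is the entire hash space, running from the most negative int to the most positive, so a single shard owns the whole ring rather than an inverted or empty range. A quick illustration in a JavaScript console:

// Interpret the hex endpoints as signed 32-bit integers:
console.log(0x80000000 | 0);  // -2147483648, the start of the single shard's hash range
console.log(0x7fffffff | 0);  //  2147483647, the end of the range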
Re: SolrCloud increase replication factor
On Mon, May 23, 2016 at 10:37 AM, Hendrik Haddorp wrote: > Hi, > > I have a SolrCloud 6.0 setup and created my collection with a > replication factor of 1. Now I want to increase the replication factor > but would like the replicas for the same shard to be on different nodes, > so that my collection does not fail when one node fails. I tried two > approaches so far: > > 1) When I use the collections API with the MODIFYCOLLECTION action [1] I > can set the replication factor but that did not result in the creation > of additional replicas. The Solr Admin UI showed that my replication > factor changed but otherwise nothing happened. A reload of the > collection did also result in no change. > > 2) Using the ADDREPLICA action [2] from the collections API I have to > add the replicas to the shard individually, which is a bit more > complicated but otherwise worked. During testing this did however at > least once result in the replica being created on the same node. My > collection was split in 4 shards and for 2 of them all replicas ended up > on the same node. > > So is the only option to create the replicas manually and also pick the > nodes manually or is the perceived behavior wrong? > > regards, > Hendrik > > [1] > https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-modifycoll > [2] > https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica With ADDREPLICA, you can specify the node to create the replica on. If you are using a script to increase/remove replicas, you can simply incorporate the logic you desire in to your script - you can also use CLUSTERSTATUS to get a list of nodes/collections/shards etc in order to inform the logic in the script. This is the approach we took, we have a fabric script to add/remove extra nodes to/from the cluster, it works well. The alternative is to put the logic in to Solr itself, using what Solr calls a "snitch" to define the rules on where replicas are created. The snitch is specified at collection creation time, or you can use MODIFYCOLLECTION to set it after the fact. See this wiki patch for details: https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement Cheers Tom
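To illustrate the manual route (collection, shard and node names below are made up):

http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json
http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=10.0.0.12:8983_solr

CLUSTERSTATUS reports which nodes already hold a replica of each shard, so the script can pick a node without one before issuing the ADDREPLICA call.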
Re: Import html data in mysql and map schemas using only SolrCELL+TIKA+DIH [scottchu]
On Tue, May 24, 2016 at 3:06 PM, Scott Chu wrote: > p.s. There're really many many extensive, worthy stuffs in Solr. If the > project team can provide some "dictionary" of them, It would be a "Santa > Claus" > for we solr users. Ha! Just a X'mas wish! Sigh! I know it's quite not > possbile. > I really like to study them one after another, to learn about all of them. > However, Internet IT goes too fast to have time to congest all of the great > stuffs in Solr. The reference guide is both extensive and broadly informative. Start from the top page and browse away! https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide It's also handy to keep the glossary nearby for any terms that you don't recognise: https://cwiki.apache.org/confluence/display/solr/Solr+Glossary Cheers Tom
Re: result grouping in sharded index
Do you have to group, or can you collapse instead? https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results Cheers Tom On Tue, Jun 14, 2016 at 4:57 PM, Jay Potharaju wrote: > Any suggestions on how to handle result grouping in sharded index? > > > On Mon, Jun 13, 2016 at 1:15 PM, Jay Potharaju > wrote: > >> Hi, >> I am working on a functionality that would require me to group documents >> by a id field. I read that the ngroups feature would not work in a sharded >> index. >> Can someone recommend how to handle this in a sharded index? >> >> >> Solr Version: 5.5 >> >> >> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats >> >> -- >> Thanks >> Jay >> >> > > > > -- > Thanks > Jay Potharaju
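For reference, a minimal sketch of the collapsing approach, assuming the grouping field is called group_id (substitute your own id field):

fq={!collapse field=group_id}
expand=true

With expand=true the response gains an "expanded" section containing the collapsed members of each group, should you need them alongside the group heads.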
Strange highlighting on search
Hi all I'm investigating a bug where by every term in the highlighted field gets marked for highlighting instead of just the words that match the fulltext portion of the query. This is on Solr 5.5.0, but I didn't see any bug fixes related to highlighting in 5.5.1 or 6.0 release notes. The query that affects it is where we have a not clause on a specific field (not the fulltext field) and also only include documents where that field has a value: q: cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223) This returns the correct results, but the highlighting has matched every word in the results (see below for debugQuery output). If I change the query to put the exclusion in to an fq, the highlighting is correct again (and the results are correct): q: cosmetics_packaging_fulltext:(Mist) fq: {!cache=false} ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223) Is there any way I can make the query and highlighting work as expected as part of q? Is there any downside to putting the exclusion part in the fq in terms of performance? We don't use score at all for our results, we always order by other parameters. Cheers Tom Query with strange highlighting: { "responseHeader":{ "status":0, "QTime":314, "params":{ "q":"cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)", "hl":"true", "hl.simple.post":"", "indent":"true", "fl":"id,product", "hl.fragsize":"0", "hl.fl":"product", "rows":"5", "wt":"json", "debugQuery":"true", "hl.simple.pre":""}}, "response":{"numFound":10132,"start":0,"docs":[ { "id":"2403841-1498608", "product":"Mist"}, { "id":"2410603-1502577", "product":"Mist"}, { "id":"5988531-3882415", "product":"Ao + Mist"}, { "id":"6020805-3904203", "product":"UV Mist Cushion SPF 50+ PA+++"}, { "id":"2617977-1629335", "product":"Ultra Radiance Facial Re-Hydrating Mist"}] }, "highlighting":{ "2403841-1498608":{ "product":["Mist"]}, "2410603-1502577":{ "product":["Mist"]}, "5988531-3882415":{ "product":["Ao + Mist"]}, "6020805-3904203":{ "product":["UV Mist Cushion SPF 50+ PA+++"]}, "2617977-1629335":{ "product":["Ultra Radiance Facial Re-Hydrating Mist"]}}, "debug":{ "rawquerystring":"cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)", "querystring":"cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)", "parsedquery":"+cosmetics_packaging_fulltext:mist +ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223", "parsedquery_toString":"+cosmetics_packaging_fulltext:mist +ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223", "explain":{ "2403841-1498608":"\n40.082462 = sum of:\n 39.92971 = weight(cosmetics_packaging_fulltext:mist in 13983) [ClassicSimilarity], result of:\n39.92971 = score(doc=13983,freq=39.0), product of:\n 0.9882648 = queryWeight, product of:\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n0.15275055 = queryNorm\n 40.40386 = fieldWeight in 13983, product of:\n6.244998 = tf(freq=39.0), with freq of:\n 39.0 = termFreq=39.0\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n1.0 = fieldNorm(doc=13983)\n 0.15275055 = ingredient_tag_id:[0 TO *], product of:\n1.0 = boost\n0.15275055 = queryNorm\n", "2410603-1502577":"\n40.082462 = sum of:\n 39.92971 = weight(cosmetics_packaging_fulltext:mist in 14023) [ClassicSimilarity], result of:\n39.92971 = score(doc=14023,freq=39.0), product of:\n 0.9882648 = queryWeight, product of:\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n0.15275055 = queryNorm\n 40.40386 = 
fieldWeight in 14023, product of:\n6.244998 = tf(freq=39.0), with freq of:\n 39.0 = termFreq=39.0\n6.469795 = idf(docFreq=22502, maxDocs=5342472)\n1.0 = fieldNorm(doc=14023)\n 0.15275055 = ingredient_tag
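One parameter that may be worth experimenting with here (a suggestion, not verified against 5.5.0): hl.q lets you hand the highlighter a query different from q, so the full query can stay in q while only the fulltext clause drives highlighting, e.g.:

q=cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)
hl.q=cosmetics_packaging_fulltext:(Mist)
hl.fl=product

On the performance question: filter queries are not scored and are cached independently of q, so when ordering never depends on score, moving the range and exclusion clauses into fq is normally a gain rather than a cost.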
Node not recovering, leader elections not occurring
Hi all - problem with a SolrCloud 5.5.0, we have a node that has most of the collections on it marked as "Recovering" or "Recovery Failed". It attempts to recover from the leader, but the leader responds with: Error while trying to recover. core=iris_shard1_replica1:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://172.31.1.171:3/solr: We are not the leader at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:192) at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://172.31.1.171:3/solr: We are not the leader at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576) at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284) at org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280) ... 5 more and recovery never occurs. Each collection in this state has plenty (10+) of active replicas, but stopping the server that is marked as the leader doesn't trigger a leader election amongst these replicas. REBALANCELEADERS did nothing. FORCELEADER complains that there is already a leader. FORCELEADER with the purported leader stopped took 45 seconds, reported status of "0" (and no other message) and kept the down node as the leader (!) Deleting the failed collection from the failed node and re-adding it has the same "Leader said I'm not the leader" error message. Any other ideas? Cheers Tom
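For anyone retracing this, the collections API calls referred to above take the form below, using the iris collection from the log (shard name illustrative):

http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=iris
http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=iris&shard=shard1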
Re: Node not recovering, leader elections not occurring
There are 11 collections, each only has one shard, and each node has 10 replicas (9 collections are on every node, 2 are just on one node). We're not seeing any OOM errors on restart. I think we're being patient waiting for the leader election to occur. We stopped the troublesome "leader that is not the leader" server about 15-20 minutes ago, but we still have not had a leader election. Cheers Tom On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson wrote: > How many replicas per Solr JVM? And do you > see any OOM errors when you bounce a server? > And how patient are you being, because it can > take 3 minutes for a leaderless shard to decide > it needs to elect a leader. > > See SOLR-7280 and SOLR-7191 for the case > where lots of replicas are in the same JVM, > the tell-tale symptom is errors in the log as you > bring Solr up saying something like > "OutOfMemory error unable to create native thread" > > SOLR-7280 has patches for 6x and 7x, with a 5x one > being added momentarily. > > Best, > Erick > > On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans wrote: >> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most >> of the collections on it marked as "Recovering" or "Recovery Failed". >> It attempts to recover from the leader, but the leader responds with: >> >> Error while trying to recover. >> core=iris_shard1_replica1:java.util.concurrent.ExecutionException: >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: >> Error from server at http://172.31.1.171:3/solr: We are not the >> leader >> at java.util.concurrent.FutureTask.report(FutureTask.java:122) >> at java.util.concurrent.FutureTask.get(FutureTask.java:192) >> at >> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596) >> at >> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353) >> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224) >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at >> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >> at java.lang.Thread.run(Thread.java:745) >> Caused by: >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: >> Error from server at http://172.31.1.171:3/solr: We are not the >> leader >> at >> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576) >> at >> org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284) >> at >> org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280) >> ... 5 more >> >> and recovery never occurs. >> >> Each collection in this state has plenty (10+) of active replicas, but >> stopping the server that is marked as the leader doesn't trigger a >> leader election amongst these replicas. >> >> REBALANCELEADERS did nothing. >> FORCELEADER complains that there is already a leader. >> FORCELEADER with the purported leader stopped took 45 seconds, >> reported status of "0" (and no other message) and kept the down node >> as the leader (!) >> Deleting the failed collection from the failed node and re-adding it >> has the same "Leader said I'm not the leader" error message. >> >> Any other ideas? >> >> Cheers >> >> Tom
Re: Node not recovering, leader elections not occurring
On the nodes that have the replica in a recovering state we now see: 19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms , collection: lookups slice: shard1 at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607) at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 19-07-2016 16:18:28 INFO RecoveryStrategy:444 - Replay not started, or was not successful... still buffering updates. 19-07-2016 16:18:28 ERROR RecoveryStrategy:481 - Recovery failed - trying again... (164) 19-07-2016 16:18:28 INFO RecoveryStrategy:503 - Wait [12.0] seconds before trying to recover again (attempt=165) This is with the "leader that is not the leader" shut down. Issuing a FORCELEADER via collections API doesn't in fact force a leader election to occur. Is there any other way to prompt Solr to have an election? Cheers Tom On Tue, Jul 19, 2016 at 5:10 PM, Tom Evans wrote: > There are 11 collections, each only has one shard, and each node has > 10 replicas (9 collections are on every node, 2 are just on one node). > We're not seeing any OOM errors on restart. > > I think we're being patient waiting for the leader election to occur. > We stopped the troublesome "leader that is not the leader" server > about 15-20 minutes ago, but we still have not had a leader election. > > Cheers > > Tom > > On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson > wrote: >> How many replicas per Solr JVM? And do you >> see any OOM errors when you bounce a server? >> And how patient are you being, because it can >> take 3 minutes for a leaderless shard to decide >> it needs to elect a leader. >> >> See SOLR-7280 and SOLR-7191 for the case >> where lots of replicas are in the same JVM, >> the tell-tale symptom is errors in the log as you >> bring Solr up saying something like >> "OutOfMemory error unable to create native thread" >> >> SOLR-7280 has patches for 6x and 7x, with a 5x one >> being added momentarily. >> >> Best, >> Erick >> >> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans wrote: >>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most >>> of the collections on it marked as "Recovering" or "Recovery Failed". >>> It attempts to recover from the leader, but the leader responds with: >>> >>> Error while trying to recover. 
>>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException: >>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: >>> Error from server at http://172.31.1.171:3/solr: We are not the >>> leader >>> at java.util.concurrent.FutureTask.report(FutureTask.java:122) >>> at java.util.concurrent.FutureTask.get(FutureTask.java:192) >>> at >>> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596) >>> at >>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353) >>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224) >>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) >>> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >>> at >>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231) >>> at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >>> at java.lang.Thread.run(Thread.java:745) >>> Caused by: >>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: >>> Error from server at http://172.31.1.171:3/solr: We are not the >>> leader >>> at >>> org.apache.solr.client.solrj.impl.HttpSolrClient.execu
min()/max() on date fields using JSON facets
Hi all I'm trying to replace a use of the stats module with JSON facets in order to calculate the min/max date range of documents in a query. For the same search, "stats.field=date_published" returns this: {u'date_published': {u'count': 86760, u'max': u'2016-07-13T00:00:00Z', u'mean': u'2013-12-11T07:09:17.676Z', u'min': u'2011-01-04T00:00:00Z', u'missing': 0, u'stddev': 50006856043.410477, u'sum': u'3814570-11-06T00:00:00Z', u'sumOfSquares': 1.670619719649826e+29}} For the equivalent JSON facet - "{'date.max': 'max(date_published)', 'date.min': 'min(date_published)'}" - I'm returned this: {u'count': 86760, u'date.max': 146836800.0, u'date.min': 129409920.0} What do these numbers represent - I'm guessing it is milliseconds since epoch? In UTC? Is there any way to control the output format or TZ? Is there any benefit in using JSON facets to determine this, or should I just continue using stats? Cheers Tom
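A quick sanity check from a browser console - 1468368000 is the epoch-seconds value of 2016-07-13T00:00:00Z, the max reported by stats above (whether the facet value is seconds or milliseconds is exactly the open question here):

// Interpreting a value as milliseconds since the epoch (always UTC):
new Date(1468368000000).toISOString();      // "2016-07-13T00:00:00.000Z"
// Interpreting a value as seconds since the epoch:
new Date(1468368000 * 1000).toISOString();  // "2016-07-13T00:00:00.000Z"

At least in the output shown here the aggregate comes back as a bare number, so any timezone handling or date formatting would have to happen client-side.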