Re: dataimporthandler: nested query is called multiple times
alex, thank you for the link. i enabled the trace for 'org.apache.solr.handler.dataimport' and it seems as if the database is only called once: 2013-03-21T09:40:43 1363855243889 50 org.apache.solr.handler.dataimport.JdbcDataSource FINE org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator <init> 11 Executing SQL: select * from doc_properties where DOCID='0u3xouyscdhye61o' therefore i assume the output shown in the dataimporthandler UI is incorrect. i could double-check with the database logs. cheerio, patrick On 20.03.2013 12:07, Alexandre Rafalovitch wrote: There was something like this on Stack Overflow: http://stackoverflow.com/questions/15164166/solr-filelistentityprocessor-is-executing-sub-entities-multiple-times Upgrading Solr helped partially, but the conclusion was not fully satisfactory. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Wed, Mar 20, 2013 at 6:48 AM, patrick wrote: hi, the dataimport-config-file i'm using with solr3.6.2 uses a nested select statement. the first query retrieves the documents while the nested one retrieves the corresponding properties. when running the dataimporthandler with the verbose/debug flag turned on the output lists more than one query for 'entity:attributes' - this list is increased for each 'entity:item': select DOCID from documents 0:0:0.50 --- row #1- 000emnslnbh88hdd - select * from doc_properties where DOCID='000emnslnbh88hdd' select * from doc_properties where DOCID='000emnslnbh88hdd' 0:0:0.37 0:0:0.37 --- row #1- I message_direction - --- row #2- heb@test message_event_source --- row #1- 000hsjunnbh7weq8 - select * from doc_properties where DOCID='000hsjunnbh7weq8' select * from doc_properties where DOCID='000hsjunnbh7weq8' select * from doc_properties where DOCID='000hsjunnbh7weq8' select * from doc_properties where DOCID='000hsjunnbh7weq8' 0:0:0.1 0:0:0.1 0:0:0.1 0:0:0.1 --- row #1- I message_direction - --- row #2- heb@test message_event_source ... i was wondering if there's something wrong with my configuration - thank you for clarifying, patrick
dataimporthandler: nested query is called multiple times
hi, the dataimport-config-file i'm using with solr3.6.2 uses a nested select statement. the first query retrieves the documents while the nested one retrieves the corresponding properties. when running the dataimporthandler with the verbose/debug flag turned on the output lists more than one query for 'entity:attributes' - this list is increased for each 'entity:item': select DOCID from documents 0:0:0.50 --- row #1- 000emnslnbh88hdd - select * from doc_properties where DOCID='000emnslnbh88hdd' select * from doc_properties where DOCID='000emnslnbh88hdd' 0:0:0.37 0:0:0.37 --- row #1- I message_direction - --- row #2- heb@test message_event_source --- row #1- 000hsjunnbh7weq8 - select * from doc_properties where DOCID='000hsjunnbh7weq8' select * from doc_properties where DOCID='000hsjunnbh7weq8' select * from doc_properties where DOCID='000hsjunnbh7weq8' select * from doc_properties where DOCID='000hsjunnbh7weq8' 0:0:0.1 0:0:0.1 0:0:0.1 0:0:0.1 --- row #1- I message_direction - --- row #2- heb@test message_event_source ... i was wondering if there's something wrong with my configuration - thank you for clarifying, patrick
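For reference, a nested parent/child DIH setup of the kind described above typically looks like the following minimal sketch (driver, URL and column/field names here are illustrative placeholders, not the actual configuration):

  <dataConfig>
    <dataSource type="JdbcDataSource" driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@//dbhost:1521/db" user="user" password="pass"/>
    <document>
      <entity name="item" query="select DOCID from documents">
        <field column="DOCID" name="id"/>
        <entity name="attributes"
                query="select * from doc_properties where DOCID='${item.DOCID}'">
          <field column="PROP_NAME" name="prop_name"/>
          <field column="PROP_VALUE" name="prop_value"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

With this structure the 'attributes' query should run once per 'item' row, which is consistent with the single SQL statement seen in the JdbcDataSource trace above.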
how to recover when indexing with proxy & shards
hi, i'm considering using more than 3 solr shards and assigning a (separate) proxy to do the load balancing when indexing. i'm using SolrJ to do the indexing. the question is whether i get any information about which shard the document ends up being stored in. this information would be helpful in case a specific shard has to be re-indexed (no indexing downtime, isolated recovery). i assume the HTTP response only contains the IP address of the proxy. thank you for any hints! cheerio, patrick
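For reference, a minimal SolrJ sketch of the indexing path described above (the proxy URL and the shard_s field are illustrative assumptions, not an established convention). As far as I know the UpdateResponse does not carry any shard information when updates go through a proxy, so one workaround is to record the intended shard in the document itself:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.client.solrj.response.UpdateResponse;
  import org.apache.solr.common.SolrInputDocument;

  public class ProxyIndexer {
      public static void main(String[] args) throws Exception {
          // All updates go through the load-balancing proxy, not to a shard directly.
          SolrServer solr = new HttpSolrServer("http://proxy.example.com:8983/solr/core");

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "doc-1");
          // Illustrative workaround: record the intended shard ourselves, since the
          // update response only reflects the proxy endpoint we talked to.
          doc.addField("shard_s", "shard-2");

          UpdateResponse rsp = solr.add(doc);
          System.out.println("status=" + rsp.getStatus() + " qtime=" + rsp.getQTime());
          solr.commit();
      }
  }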
numFound inconsistent for different rows-param
hi, i'm running two solr v3.6 instances: rdta01:9983/solr/msg-core : 8 documents rdta01:28983/solr/msg-core : 4 documents the following two queries with rows=10 resp rows=0 return different numFound results which confuses me. i hope someone can clarify this behaviour. URL with rows=10: - http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01%3A9983%2Fsolr%2Fmsg-core%2Crdta01%3A28983%2Fsolr%2Fmsg-core&indent=on&start=0&rows=10 numFound=8 (incorrect, second shard is missing) URL with rows=0: http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01%3A9983%2Fsolr%2Fmsg-core%2Crdta01%3A28983%2Fsolr%2Fmsg-core&indent=on&start=0&rows=0 numFound=12 (correct) cheerio, patrick
Re: numFound inconsistent for different rows-param
i resolved my confusion and discovered that the documents of the second shard contained the same 'unique' id. rows=0 displayed the 'correct' numFound since (as i understand) there was no merge of the results. cheerio, patrick On 25.07.2012 17:07, patrick wrote: hi, i'm running two solr v3.6 instances: rdta01:9983/solr/msg-core : 8 documents rdta01:28983/solr/msg-core : 4 documents the following two queries with rows=10 resp rows=0 return different numFound results which confuses me. i hope someone can clarify this behaviour. URL with rows=10: - http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01%3A9983%2Fsolr%2Fmsg-core%2Crdta01%3A28983%2Fsolr%2Fmsg-core&indent=on&start=0&rows=10 numFound=8 (incorrect, second shard is missing) URL with rows=0: http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01%3A9983%2Fsolr%2Fmsg-core%2Crdta01%3A28983%2Fsolr%2Fmsg-core&indent=on&start=0&rows=0 numFound=12 (correct) cheerio, patrick
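For anyone hitting the same confusion: with rows>0 the distributed merge collapses documents that share the same uniqueKey across shards, while rows=0 simply adds up the per-shard counts. A quick, hedged way to spot clashing ids (assuming 'id' is the uniqueKey field) is to facet on it across the shards and look for counts greater than 1:

  http://rdta01:9983/solr/msg-core/select?q=*:*&shards=rdta01:9983/solr/msg-core,rdta01:28983/solr/msg-core&rows=0&facet=true&facet.field=id&facet.mincount=2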
Re: High Cpu sys usage
Hi, >From the sar output you supplied, it looks like you might have a memory issue >on your hosts. The memory usage just before your crash seems to be *very* >close to 100%. Even the slightest increase (Solr itself, or possibly by a >system service) could caused the system crash. What are the specifications of >your hosts and how much memory are you allocating? Cheers, -patrick On 16/03/2016, 14:52, "YouPeng Yang" wrote: >Hi > It happened again,and worse thing is that my system went to crash.we can >even not connect to it with ssh. > I use the sar command to capture the statistics information about it.Here >are my details: > > >[1]cpu(by using sar -u),we have to restart our system just as the red font >LINUX RESTART in the logs. >-- >03:00:01 PM all 7.61 0.00 0.92 0.07 0.00 >91.40 >03:10:01 PM all 7.71 0.00 1.29 0.06 0.00 >90.94 >03:20:01 PM all 7.62 0.00 1.98 0.06 0.00 >90.34 >03:30:35 PM all 5.65 0.00 31.08 0.04 0.00 >63.23 >03:42:40 PM all 47.58 0.00 52.25 0.00 0.00 > 0.16 >Average:all 8.21 0.00 1.57 0.05 0.00 >90.17 > >04:42:04 PM LINUX RESTART > >04:50:01 PM CPU %user %nice %system %iowait%steal >%idle >05:00:01 PM all 3.49 0.00 0.62 0.15 0.00 >95.75 >05:10:01 PM all 9.03 0.00 0.92 0.28 0.00 >89.77 >05:20:01 PM all 7.06 0.00 0.78 0.05 0.00 >92.11 >05:30:01 PM all 6.67 0.00 0.79 0.06 0.00 >92.48 >05:40:01 PM all 6.26 0.00 0.76 0.05 0.00 >92.93 >05:50:01 PM all 5.49 0.00 0.71 0.05 0.00 >93.75 >-- > >[2]mem(by using sar -r) >-- >03:00:01 PM 1519272 196633272 99.23361112 76364340 143574212 >47.77 >03:10:01 PM 1451764 196700780 99.27361196 76336340 143581608 >47.77 >03:20:01 PM 1453400 196699144 99.27361448 76248584 143551128 >47.76 >03:30:35 PM 1513844 196638700 99.24361648 76022016 143828244 >47.85 >03:42:40 PM 1481108 196671436 99.25361676 75718320 144478784 >48.07 >Average: 5051607 193100937 97.45362421 81775777 142758861 >47.50 > >04:42:04 PM LINUX RESTART > >04:50:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit >%commit >05:00:01 PM 154357132 43795412 22.10 92012 18648644 134950460 >44.90 >05:10:01 PM 136468244 61684300 31.13219572 31709216 134966548 >44.91 >05:20:01 PM 135092452 63060092 31.82221488 32162324 134949788 >44.90 >05:30:01 PM 133410464 64742080 32.67233848 32793848 134976828 >44.91 >05:40:01 PM 132022052 66130492 33.37235812 33278908 135007268 >44.92 >05:50:01 PM 130630408 67522136 34.08237140 33900912 135099764 >44.95 >Average:136996792 61155752 30.86206645 30415642 134991776 >44.91 >-- > > >As the blue font parts show that my hardware crash from 03:30:35.It is hung >up until I restart it manually at 04:42:04 >ALl the above information just snapshot the performance when it crashed >while there is nothing cover the reason.I have also >check the /var/log/messages and find nothing useful. > >Note that I run the command- sar -v .It shows something abnormal: > >02:50:01 PM 11542262 9216 76446 258 >03:00:01 PM 11645526 9536 76421 258 >03:10:01 PM 11748690 9216 76451 258 >03:20:01 PM 11850191 9152 76331 258 >03:30:35 PM 11972313 10112132625 258 >03:42:40 PM 12177319 13760340227 258 >Average: 8293601 8950 68187 161 > >04:42:04 PM LINUX RESTART > >04:50:01 PM dentunusd file-nr inode-nrpty-nr >05:00:01 PM 35410 7616 35223 4 >05:10:01 PM137320 7296 42632 6 >05:20:01 PM247010 7296 42839 9 >05:30:01 PM358434 7360 42697 9 >05:40:01 PM471543 7040 4292910 >05:50:01 PM583787 7296 4283713 >
Re: High Cpu sys usage
Yeah, I did’t pay attention to the cached memory at all, my bad! I remember running into a similar situation a couple of years ago, one of the things to investigate our memory profile was to produce a full heap dump and manually analyse that using a tool like MAT. Cheers, -patrick On 17/03/2016, 21:58, "Otis Gospodnetić" wrote: >Hi, > >On Wed, Mar 16, 2016 at 10:59 AM, Patrick Plaatje >wrote: > >> Hi, >> >> From the sar output you supplied, it looks like you might have a memory >> issue on your hosts. The memory usage just before your crash seems to be >> *very* close to 100%. Even the slightest increase (Solr itself, or possibly >> by a system service) could caused the system crash. What are the >> specifications of your hosts and how much memory are you allocating? > > >That's normal actually - http://www.linuxatemyram.com/ > >You *want* Linux to be using all your memory - you paid for it :) > >Otis >-- >Monitoring - Log Management - Alerting - Anomaly Detection >Solr & Elasticsearch Consulting Support Training - http://sematext.com/ > > > > >> > > >> >> >> On 16/03/2016, 14:52, "YouPeng Yang" wrote: >> >> >Hi >> > It happened again,and worse thing is that my system went to crash.we can >> >even not connect to it with ssh. >> > I use the sar command to capture the statistics information about it.Here >> >are my details: >> > >> > >> >[1]cpu(by using sar -u),we have to restart our system just as the red font >> >LINUX RESTART in the logs. >> >> >-- >> >03:00:01 PM all 7.61 0.00 0.92 0.07 0.00 >> >91.40 >> >03:10:01 PM all 7.71 0.00 1.29 0.06 0.00 >> >90.94 >> >03:20:01 PM all 7.62 0.00 1.98 0.06 0.00 >> >90.34 >> >03:30:35 PM all 5.65 0.00 31.08 0.04 0.00 >> >63.23 >> >03:42:40 PM all 47.58 0.00 52.25 0.00 0.00 >> > 0.16 >> >Average:all 8.21 0.00 1.57 0.05 0.00 >> >90.17 >> > >> >04:42:04 PM LINUX RESTART >> > >> >04:50:01 PM CPU %user %nice %system %iowait%steal >> >%idle >> >05:00:01 PM all 3.49 0.00 0.62 0.15 0.00 >> >95.75 >> >05:10:01 PM all 9.03 0.00 0.92 0.28 0.00 >> >89.77 >> >05:20:01 PM all 7.06 0.00 0.78 0.05 0.00 >> >92.11 >> >05:30:01 PM all 6.67 0.00 0.79 0.06 0.00 >> >92.48 >> >05:40:01 PM all 6.26 0.00 0.76 0.05 0.00 >> >92.93 >> >05:50:01 PM all 5.49 0.00 0.71 0.05 0.00 >> >93.75 >> >> >-- >> > >> >[2]mem(by using sar -r) >> >> >-- >> >03:00:01 PM 1519272 196633272 99.23361112 76364340 143574212 >> >47.77 >> >03:10:01 PM 1451764 196700780 99.27361196 76336340 143581608 >> >47.77 >> >03:20:01 PM 1453400 196699144 99.27361448 76248584 143551128 >> >47.76 >> >03:30:35 PM 1513844 196638700 99.24361648 76022016 143828244 >> >47.85 >> >03:42:40 PM 1481108 196671436 99.25361676 75718320 144478784 >> >48.07 >> >Average: 5051607 193100937 97.45362421 81775777 142758861 >> >47.50 >> > >> >04:42:04 PM LINUX RESTART >> > >> >04:50:01 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit >> >%commit >> >05:00:01 PM 154357132 43795412 22.10 92012 18648644 134950460 >> >44.90 >> >05:10:01 PM 136468244 61684300 31.13219572 31709216 134966548 >> >44.91 >> >05:20:01 PM 135092452 63060092 31.82221488 32162324 134949788 >> >44.90 >> >05:30:01 PM 133410464 64742080 32.67233848 32793848 134976828 >> >44.91 >> >05:40:01 PM 132022052 66130492 33.37235812 33278908 135007268 >> >44.92 >> >05:50:01 PM 130630408 67522136 34.08237140 33900912 135099764 >> >44.95 >> >Average:136996792 6
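For what it's worth, a heap dump for MAT can usually be produced with the JDK's jmap (a sketch, assuming a HotSpot JVM, where <pid> is the Solr/Tomcat process id):

  jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <pid>

Adding -XX:+HeapDumpOnOutOfMemoryError to the JVM options will also capture one automatically if the process ever dies with an OutOfMemoryError.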
Querying Dynamic Fields
I have a simple Solr schema that uses dynamic fields to create most of my fields. This works great. Unfortunately, I now need to ask Solr to give me the names of the fields in the schema. I'm using: http://localhost:8983/solr/core/schema/fields This returns the statically defined fields, but does not return the ones that were created matching my dynamic definitions, such as *_s, *_i, *_txt, etc. I know Solr is aware of these fields, because I can query against them. What is the secret sauce to query their names and data types? Thanks, Patrick Hoeffel Senior Software Engineer Intelligent Software Solutions (www.issinc.com<http://www.issinc.com/>) (719) 452-7371 (direct) (719) 210-3706 (mobile) "Bringing Knowledge to Light"
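In case it's useful to later readers, two hedged pointers (not verified against this exact setup): /schema/dynamicfields lists the dynamic-field patterns themselves, while the Luke handler reports the fields that actually exist in the index, including the ones instantiated from patterns like *_s, together with their types:

  http://localhost:8983/solr/core/schema/dynamicfields
  http://localhost:8983/solr/core/admin/luke?numTerms=0&wt=json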
Apache Solr Reference Guide 5.0
Greetings, I was looking at the PDF version of the Apache Solr Reference Guide 5.0 and noticed that it has no TOC nor any section numbering. http://apache.claz.org/lucene/solr/ref-guide/apache-solr-ref-guide-5.0.pdf The lack of a TOC and section headings makes navigation difficult. I have just started making suggestions on the documentation and was wondering if there is a reason why the TOC and section headings are missing? (that isn't apparent from the document) Thanks! Hope everyone is near a great weekend! Patrick
Re: Apache Solr Reference Guide 5.0
Shawn, Thanks! I was using Document Viewer and not Adobe Acrobat so was unclear. The TOC I meant was as in a traditional print publication with section #s, etc. Not a navigation TOC sans numbering as in Adobe. The Confluence documentation (I can't see the actual stylesheet in use, I don't think) here: https://confluence.atlassian.com/display/DOC/Customising+Exports+to+PDF Says: * Disabling the Table of Contents To prevent the table of contents from being generated in your PDF document, add the div.toc-macro rule to the PDF Stylesheet and set its display property to none: * Which is why I was asking if there was a reason for the TOC and section numbering not appearing. They can be defeated but that doesn't appear to be the default setting. This came up because a section said it would cover topics N - S and I could not determine if all those topics fell in that section or not. Thanks! Hope you are having a great day! Patrick On 03/06/2015 12:28 PM, Shawn Heisey wrote: On 3/6/2015 10:20 AM, Patrick Durusau wrote: I was looking at the PDF version of the Apache Solr Reference Guide 5.0 and noticed that it has no TOC nor any section numbering. http://apache.claz.org/lucene/solr/ref-guide/apache-solr-ref-guide-5.0.pdf The lack of a TOC and section headings makes navigation difficult. I have just started making suggestions on the documentation and was wondering if there is a reason why the TOC and section headings are missing? (that isn't apparent from the document) The TOC is built into the PDF and it's up to the PDF viewer to display it. Here's a screenshot of the ref guide in Adobe Reader with a clickable TOC open. https://www.dropbox.com/s/3ajuri1emj61imu/refguide-5.0-TOC.png?dl=0 Section numbering might be a good idea, if it's not too intrusive or difficult. Thanks, Shawn
How do I tell Tika to not complement a field's value defined in my Solr schema when indexing a binary document?
I use Solr to index different kinds of database tables. I have a Solr index containing a field named category. I make sure that the category field in Solr is populated with the right value depending on the table. I use this to build facet queries, which works fine. The problem I have is with tables that contain records which represent binary documents like PDFs. I use the extract handler (Tika) to index the contents of the binary document along with the data from the database record. Tika sometimes finds metadata in the document with the same name as one of the index fields in my schema.xml, like category. I end up with the category field being a multi-valued field containing the category data from my database record AND the additional data from the category (meta)field extracted by Tika from the actual binary document. It seems that the extract handler adds every field it finds to my index if there is a corresponding field in my schema. How can I prevent this from happening? All I need is the textual representation of the binary document added as content, not the extra (meta) fields. I don't want the extra data Tika may find to be added to any field in my index. However, I do want to keep the data in the category field which comes from my database record. So adding fmap.category="ignored_" won't help me, because then the data from my database record will be ignored as well. Another reason for wanting to prevent this is that I cannot know in advance which other fields Tika might come up with when the document is extracted. In other words, choosing more elaborate names (like a namespace-style prefix) for my index fields will never prevent field name collisions 100%. So, how can I prevent the data the extraction comes up with from being added to my index fields, or am I missing a point here?
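For context, the pattern usually suggested for this situation (a sketch, not tested against this setup) is to route all unknown Tika fields to an ignored dynamic field via uprefix, and pass the trusted database value as a literal; if I remember correctly, literalsOverride defaults to true in the ExtractingRequestHandler, so the literal value wins over Tika metadata with the same name such as category. Field names and values below are placeholders; the "ignored" field type is the one shipped in the example schema:

  In schema.xml:
    <dynamicField name="ignored_*" type="ignored" indexed="false" stored="false" multiValued="true"/>

  Extract request:
    curl "http://localhost:8983/solr/core/update/extract?literal.id=42&literal.category=invoices&uprefix=ignored_&fmap.content=text" -F "file=@document.pdf"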
can't seem to get delta imports to work.
Hi, I am having problems getting the delta import working. Full import works fine. I am using current version of solr (6.1). I have been looking at this pretty much all day and can't find what I am not doing correctly... I did try the Using query attribute for both full and delta import and that worked, but as soon I ran it for a full import via clean=true my queries performance went very bad (oracle execution plain must of went bonkers). Anyways, I would appreciate any help. Thanks Here is my dataimportHandler config: [hubadm@emcappd43:solr-6.1.0]$ cat ./server/solr/dmtec1/conf/db-data-config.xml Here is the log output: 2016-08-31 19:45:42.641 INFO (qtp403424356-68) [ x:dmtec1] o.a.s.h.d.DataImporter Loading DIH Configuration: db-data-config.xml 2016-08-31 19:45:42.648 INFO (qtp403424356-68) [ x:dmtec1] o.a.s.h.d.DataImporter Data Configuration loaded successfully 2016-08-31 19:45:42.649 INFO (qtp403424356-68) [ x:dmtec1] o.a.s.c.S.Request [dmtec1] webapp=/solr path=/dataimport params={indent=on&wt=json&command=reload-config&_=1472648332418} status=0 QTime=9 2016-08-31 19:45:42.680 INFO (qtp403424356-77) [ x:dmtec1] o.a.s.c.S.Request [dmtec1] webapp=/solr path=/admin/mbeans params={cat=QUERYHANDLER&wt=json&_=1472648332418} status=0 QTime=1 2016-08-31 19:45:42.695 INFO (qtp403424356-84) [ x:dmtec1] o.a.s.c.S.Request [dmtec1] webapp=/solr path=/dataimport params={indent=on&wt=json&command=show-config&_=1472648332418} status=0 QTime=1 2016-08-31 19:45:42.696 INFO (qtp403424356-49) [ x:dmtec1] o.a.s.c.S.Request [dmtec1] webapp=/solr path=/dataimport params={indent=on&wt=json&command=status&_=1472648332418} status=0 QTime=0 2016-08-31 19:45:48.550 INFO (qtp403424356-68) [ x:dmtec1] o.a.s.h.d.DataImporter Loading DIH Configuration: db-data-config.xml 2016-08-31 19:45:48.558 INFO (qtp403424356-68) [ x:dmtec1] o.a.s.h.d.DataImporter Data Configuration loaded successfully 2016-08-31 19:45:48.560 INFO (qtp403424356-68) [ x:dmtec1] o.a.s.c.S.Request [dmtec1] webapp=/solr path=/dataimport params={core=dmtec1&optimize=false&indent=on&commit=true&clean=false&wt=json&command=delta-import&_=1472648332418&verbose=false} status=0 QTime=10 2016-08-31 19:45:48.560 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DataImporter Starting Delta Import 2016-08-31 19:45:48.574 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.SimplePropertiesWriter Read dataimport.properties 2016-08-31 19:45:48.576 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Starting delta collection. 2016-08-31 19:45:48.577 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Running ModifiedRowKey() for Entity: viewables 2016-08-31 19:45:48.577 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Completed ModifiedRowKey for Entity: viewables rows obtained : 0 2016-08-31 19:45:48.577 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Completed DeletedRowKey for Entity: viewables rows obtained : 0 2016-08-31 19:45:48.577 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Completed parentDeltaQuery for Entity: viewables 2016-08-31 19:45:48.577 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Running ModifiedRowKey() for Entity: relParts 2016-08-31 19:45:48.578 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Completed ModifiedRowKey for Entity: relParts rows obtained : 0 2016-08-31 19:45:48.578 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Completed DeletedRowKey for Entity: relParts rows obtained : 0 2016-08-31 19:45:48.578 INFO (Thread-39) [ x:dmtec1] o.a.s.h.d.DocBuilder Completed parentDeltaQuery for Entity: relParts 2016-08-31 19:45:48.578 INFO (Threa
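For comparison, the delta-related attributes DIH expects on the root entity look roughly like this generic sketch (table and column names are made up, since the actual db-data-config.xml is not shown above). A common gotcha with Oracle is that column names come back in uppercase, so the key referenced in deltaImportQuery must match that case exactly:

  <entity name="item" pk="ID"
          query="select * from item"
          deltaQuery="select ID from item where LAST_MODIFIED &gt; to_date('${dataimporter.last_index_time}', 'YYYY-MM-DD HH24:MI:SS')"
          deltaImportQuery="select * from item where ID='${dataimporter.delta.ID}'">
    ...
  </entity>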
RE: CDCR - how to deal with the transaction log files
I'm working on my first setup of CDCR, and I'm seeing the same "The log reader for target collection {collection name} is not initialised" as you saw. It looks like you're creating collections on a regular basis, but for me, I create it one time and never again. I've been creating the collection first from defaults and then applying the CDCR-aware solrconfig changes afterward. It sounds like maybe I need to create the configset in ZK first, then create the collections, first on the Target and then on the Source, and I should be good? Thanks, Patrick Hoeffel Senior Software Engineer (Direct) 719-452-7371 (Mobile) 719-210-3706 patrick.hoef...@polarisalpha.com PolarisAlpha.com -Original Message- From: jmyatt [mailto:jmy...@wayfair.com] Sent: Wednesday, July 12, 2017 4:49 PM To: solr-user@lucene.apache.org Subject: Re: CDCR - how to deal with the transaction log files glad to hear you found your solution! I have been combing over this post and others on this discussion board many times and have tried so many tweaks to configuration, order of steps, etc, all with absolutely no success in getting the Source cluster tlogs to delete. So incredibly frustrating. If anyone has other pearls of wisdom I'd love some advice. Quick hits on what I've tried: - solrconfig exactly like Sean's (target and source respectively) expect no autoSoftCommit - I am also calling cdcr?action=DISABLEBUFFER (on source as well as on target) explicitly before starting since the config setting of defaultState=disabled doesn't seem to work - when I create the collection on source first, I get the warning "The log reader for target collection {collection name} is not initialised". When I reverse the order (create the collection on target first), no such warning - tlogs replicate as expected, hard commits on both target and source cause tlogs to rollover, etc - all of that works as expected - action=QUEUES on source reflects the queueSize accurately. Also *always* shows updateLogSynchronizer state as "stopped" - action=LASTPROCESSEDVERSION on both source and target always seems correct (I don't see the -1 that Sean mentioned). - I'm creating new collections every time and running full data imports that take 5-10 minutes. Again, all data replication, log rollover, and autocommit activity seems to work as expected, and logs on target are deleted. It's just those pesky source tlogs I can't get to delete. -- View this message in context: http://lucene.472066.n3.nabble.com/CDCR-how-to-deal-with-the-transaction-log-files-tp4345062p4345715.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: CDCR - how to deal with the transaction log files
Amrit, Problem solved! My biggest mistake was in my SOURCE-side configuration. The zkHost field needed the entire zkHost string, including the CHROOT indicator. I suppose that should have been obvious to me, but the examples only showed the IP Address of the target ZK, and I made a poor assumption. 10.161.0.7:2181,10.161.0.6:2181,10.161.0.5:2181/chroot/solr ks_v1 ks_v1 10.161.0.7:2181 <=== Problem was here. ks_v1 ks_v1 After that, I just made sure I did this: 1. Stop all Solr nodes at both SOURCE and TARGET. 2. $ rm -rf $SOLR_HOME/server/solr/collection_name/data/tlog/*.* 3. On the TARGET: a. $ collection/cdcr?action=DISABLEBUFFER b. $ collection/cdcr?action=START 4. On the Source: a. $ collection/cdcr?action=DISABLEBUFFER b. $ collection/cdcr?action=START At this point any existing data in the SOURCE collection started flowing into the TARGET collection, and it has remained congruent ever since. Thanks, Patrick Hoeffel Senior Software Engineer (Direct) 719-452-7371 (Mobile) 719-210-3706 patrick.hoef...@polarisalpha.com PolarisAlpha.com -Original Message- From: Amrit Sarkar [mailto:sarkaramr...@gmail.com] Sent: Friday, July 21, 2017 7:21 AM To: solr-user@lucene.apache.org Cc: jmy...@wayfair.com Subject: Re: CDCR - how to deal with the transaction log files Patrick, Yes! You created default UpdateLog which got written to a disk and then you changed it to CdcrUpdateLog in configs. I find no reason it would create a proper COLLECTIONCHECKPOINT on target tlog. One thing you can try before creating / starting from scratch is restarting source cluster nodes, the leaders of shard will try to create the same COLLECTIONCHECKPOINT, which may or may not be successful. Amrit Sarkar Search Engineer Lucidworks, Inc. 415-589-9269 www.lucidworks.com Twitter http://twitter.com/lucidworks LinkedIn: https://www.linkedin.com/in/sarkaramrit2 On Fri, Jul 21, 2017 at 11:09 AM, Patrick Hoeffel < patrick.hoef...@polarisalpha.com> wrote: > I'm working on my first setup of CDCR, and I'm seeing the same "The > log reader for target collection {collection name} is not initialised" > as you saw. > > It looks like you're creating collections on a regular basis, but for > me, I create it one time and never again. I've been creating the > collection first from defaults and then applying the CDCR-aware > solrconfig changes afterward. It sounds like maybe I need to create > the configset in ZK first, then create the collections, first on the > Target and then on the Source, and I should be good? > > Thanks, > > Patrick Hoeffel > Senior Software Engineer > (Direct) 719-452-7371 > (Mobile) 719-210-3706 > patrick.hoef...@polarisalpha.com > PolarisAlpha.com > > > -Original Message- > From: jmyatt [mailto:jmy...@wayfair.com] > Sent: Wednesday, July 12, 2017 4:49 PM > To: solr-user@lucene.apache.org > Subject: Re: CDCR - how to deal with the transaction log files > > glad to hear you found your solution! I have been combing over this > post and others on this discussion board many times and have tried so > many tweaks to configuration, order of steps, etc, all with absolutely > no success in getting the Source cluster tlogs to delete. So > incredibly frustrating. If anyone has other pearls of wisdom I'd love some > advice. 
> Quick hits on what I've tried: > > - solrconfig exactly like Sean's (target and source respectively) > expect no autoSoftCommit > - I am also calling cdcr?action=DISABLEBUFFER (on source as well as on > target) explicitly before starting since the config setting of > defaultState=disabled doesn't seem to work > - when I create the collection on source first, I get the warning "The > log reader for target collection {collection name} is not > initialised". When I reverse the order (create the collection on > target first), no such warning > - tlogs replicate as expected, hard commits on both target and source > cause tlogs to rollover, etc - all of that works as expected > - action=QUEUES on source reflects the queueSize accurately. Also > *always* shows updateLogSynchronizer state as "stopped" > - action=LASTPROCESSEDVERSION on both source and target always seems > correct (I don't see the -1 that Sean mentioned). > - I'm creating new collections every time and running full data > imports that take 5-10 minutes. Again, all data replication, log > rollover, and autocommit activity seems to work as expected, and logs > on target are deleted. It's just those pesky source tlogs I can't get to > delete. > > > > -- > View this message in context: http://lucene.472066.n3. > nabble.com/CDCR-how-to-deal-with-the-transaction-log- > files-tp4345062p4345715.html > Sent from the Solr - User mailing list archive at Nabble.com. >
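For anyone following along (the XML markup in the config snippet above was stripped in transit), the source-side CDCR handler ends up looking roughly like this once the full ZooKeeper connect string, including the chroot, is used. The replicator and updateLogSynchronizer values below are illustrative rather than the exact ones used here:

  <requestHandler name="/cdcr" class="solr.CdcrRequestHandler">
    <lst name="replica">
      <str name="zkHost">10.161.0.7:2181,10.161.0.6:2181,10.161.0.5:2181/chroot/solr</str>
      <str name="source">ks_v1</str>
      <str name="target">ks_v1</str>
    </lst>
    <lst name="replicator">
      <str name="threadPoolSize">2</str>
      <str name="schedule">1000</str>
      <str name="batchSize">128</str>
    </lst>
    <lst name="updateLogSynchronizer">
      <str name="schedule">60000</str>
    </lst>
  </requestHandler>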
JSON facet SUM precision and accuracy is incorrect
I would appreciate it if anyone could help raise an issue for the JSON facet sum error my staff member Edwin raised earlier; we have not gotten any response from the Solr community and developers. Our production operation urgently needs this accuracy to proceed, as it impacts audit issues. Best regards, Dr. Patrick On Tue, Jul 25, 2017 at 6:27 PM, Zheng Lin Edwin Yeo wrote: > This is the way in which I put my JSON facet. > > totalAmount:"sum(sum(amount1_d,amount2_d))" > > amount1_d: 69446961.2 > amount2_d: 0 > > Result I get: 69446959.27 > > > Regards, > Edwin > > > On 25 July 2017 at 20:44, Zheng Lin Edwin Yeo > wrote: > > > Hi, > > > > I'm trying to do a sum of two double fields in JSON Facet. One of the > > fields has a value of 69446961.2, while the other is 0. However, when I > get > > the result, I'm getting a value of 69446959.27. This is 1.93 less than > > the original value. > > > > What could be the reason? > > > > I'm using Solr 6.5.1. > > > > Regards, > > Edwin > >
Solr Issue
Hey guys, I've got a problem with my Solr highlighter. When I search for a word, I get some results. For every result I want to display the highlighted text, and here is my problem: some of the returned documents have a highlighted text, the others don't. I don't know why that is, but I need to fix this problem. Below is the configuration of my managed-schema. The configuration of the highlighter in solrconfig.xml is the default. I hope someone can help me. If you need more details, you can ask me for sure. managed-schema: id Best regards, Patrick Fallert Rainer-Haungs-Straße 7, D-77933 Lahr Tel.: +49 7821 9509-0 Fax: +49 7821 9509-99 i...@schrempp-edv.de www.schrempp-edv.de
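In case it helps narrow things down, a hedged example of the highlighting parameters worth checking (field names are placeholders): the highlighted field must be stored, and by default only the first 51,200 characters of a field are analyzed for snippets, so long documents can come back without any highlight at all:

  http://localhost:8983/solr/core/select?q=word&hl=true&hl.fl=title,content&hl.requireFieldMatch=false&hl.maxAnalyzedChars=1000000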
TermVectors and ExactStatsCache
Hi! I have a SolrCloud 6.6 collection with 3 shards setup where I need the TermVectors TF and DF values for queries. I have configured the ExactStatsCache in the solrConfig: When I query "detector works", it returns different docfreq values based on the shard the document comes from: "termVectors":[ "27504103",[ "uniqueKey","27504103", "kc",[ "detector works",[ "tf",1, "df",3, "tf-idf",0.]]], "27507925",[ "uniqueKey","27507925", "kc",[ "detector works",[ "tf",1, "df",3, "tf-idf",0.]]], "27504105",[ "uniqueKey","27504105", "kc",[ "detector works",[ "tf",1, "df",2, "tf-idf",0.5]]], "27507927",[ "uniqueKey","27507927", "kc",[ "detector works",[ "tf",1, "df",2, "tf-idf",0.5]]], "27507929",[ "uniqueKey","27507929", "kc",[ "detector works",[ "tf",1, "df",1, "tf-idf",1.0]]], "27504107",[ "uniqueKey","27504107", "kc",[ "detector works",[ "tf",1, "df",3, "tf-idf",0.} I expect to see the DF values to be 6 and TF-IDF to be adjusted on that value. I can see in the debug logs that the cache was active. I have found a pending bug (since Solr 5.5: https://issues.apache.org/jira/browse/SOLR-8893) that explains that this ExactStatsCache is used to compute the correct TF-IDF for the query but not for the TermVectors component. Is there any way to get the correctly merged DF values (and TF-IDF) from multiple shards? Is there a way to get from which shard a document comes from so I could compute my own correct DF? Thank you, Patrick
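For reference, the ExactStatsCache is registered in solrconfig.xml with a single line of this form:

  <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>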
Unexpected query result
I'm using Solr 4.4.0 running on Tomcat 7.0.29. The solrconfig.xlm is as-delivered (excepted for the Solr home directory of course). I could pass on the schema.xml, though I doubt this would help much, as the following will show. If I select all documents containing "russia" in the text, which is the default field, ie if I execute the query "russia", I find only 1 document, which is correct. If I select all documents containing "web" in the text ("web"), the result is 29, which is also correct. If I search for all documents that do not contain "russia" ("NOT(russia)"), the result is still correct (202). If I search for all documents that contain "web" and do not contain "russia" ("web AND NOT(russia)"), the result is, once again, correct (28, because the document containing "russia" also contains "web"). But if I search for all documents that contain "web" or do not contain "russia" ("web OR NOT(russia)"), the result is still 28, though I should get 203 matches (the whole set). Has anyone got an explanation ?? For information, the AND and OR work correctly if I don't use a NOT somewhere in the query, i.e. : "web AND russia" --> OK "web OR russia" --> OK -- View this message in context: http://lucene.472066.n3.nabble.com/Unexpected-query-result-tp416.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Unexpected query result
Thank you for your very quick reply - and for your solution, which works perfectly well. Still, I wonder why this simple and straightforward syntax "web OR NOT(russia)" needs some translation to be processed correctly... From the many related posts I read before asking my question, I know that I'm not the first one to be puzzled by this behavior. Wouldn't it be a good idea to modify the (Lucene, I guess?) parser so that the subsequent processing would produce a correct result? Thanks again for your help! -- View this message in context: http://lucene.472066.n3.nabble.com/Unexpected-query-result-tp416p4100015.html Sent from the Solr - User mailing list archive at Nabble.com.
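For readers finding this thread later: the usual workaround (and, assuming it matches the suggestion referred to above, the one that works here) is to make the negative clause a complete query by anchoring it to the set of all documents, since Lucene cannot evaluate a purely negative clause inside a boolean OR:

  web OR (*:* -russia)
  (equivalently: web OR (*:* AND NOT russia))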
Solr 3.6.1 stalling with high CPU and blocking on field cache
I've been tracking a problem in our Solr environment for awhile with periodic stalls of Solr 3.6.1. I'm running up to a wall on ideas to try and thought I might get some insight from some others on this list. The load on the server is normally anywhere between 1-3. It's an 8-core machine with 40GB of RAM. I have about 25GB of index data that is replicated to this server every 5 minutes. It's taking about 200 connections per second and roughly every 5-10 minutes it will stall for about 30 seconds to a minute. The stall causes the load to go to as high as 90. It is all CPU bound in user space - all cores go to 99% utilization (spinlock?). When doing a thread dump, the following line is blocked in all running Tomcat threads: org.apache.lucene.search.FieldCacheImpl$Cache.get ( FieldCacheImpl.java:230 ) Looking the source code in 3.6.1, that is a function call to syncronized() which blocks all threads and causes the backlog. I've tried to correlate these events to the replication events - but even with replication disabled - this still happens. We run multiple data centers using Solr and I was comparing garbage collection processes between and noted that the old generation is collected very differently on this data center versus others. The old generation is collected as a massive collect event (several gigabytes worth) - the other data center is more saw toothed and collects only in 500MB-1GB at a time. Here's my parameters to java (the same in all environments): /usr/java/jre/bin/java \ -verbose:gc \ -XX:+PrintGCDetails \ -server \ -Dcom.sun.management.jmxremote \ -XX:+UseConcMarkSweepGC \ -XX:+UseParNewGC \ -XX:+CMSIncrementalMode \ -XX:+CMSParallelRemarkEnabled \ -XX:+CMSIncrementalPacing \ -XX:NewRatio=3 \ -Xms30720M \ -Xmx30720M \ -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \ -classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \ -Dcatalina.base=/usr/local/share/apache-tomcat \ -Dcatalina.home=/usr/local/share/apache-tomcat \ -Djava.io.tmpdir=/tmp \ org.apache.catalina.startup.Bootstrap start I've tried a few GC option changes from this (been running this way for a couple of years now) - primarily removing CMS Incremental mode as we have 8 cores and remarks on the internet suggest that it is only for smaller SMP setups. Removing CMS did not fix anything. I've considered that the heap is way too large (30GB from 40GB) and may not leave enough memory for mmap operations (MMap appears to be used in the field cache). Based on active memory utilization in Java, seems like I might be able to reduce down to 22GB safely - but I'm not sure if that will help with the CPU issues. I think field cache is used for sorting and faceting. I've started to investigate facet.method, but from what I can tell, this doesn't seem to influence sorting at all - only facet queries. I've tried setting useFilterForSortQuery, and seems to require less field cache but doesn't address the stalling issues. Is there something I am overlooking? Perhaps the system is becoming oversubscribed in terms of resources? Thanks for any help that is offered. -- Patrick O'Lone Director of Software Development TownNews.com E-mail ... pol...@townnews.com Phone 309-743-0809 Fax .. 309-743-0830
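One low-risk way to confirm or rule out GC as the source of these pauses is to log every stop-the-world pause with timestamps and compare them against the stall times. A sketch, using standard HotSpot flags available on the Java 6/7 generation of JVMs in use here (the log path is illustrative):

  -Xloggc:/var/log/tomcat/gc.log \
  -XX:+PrintGCTimeStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -XX:+PrintGCApplicationConcurrentTime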
Re: Solr 3.6.1 stalling with high CPU and blocking on field cache
We do perform a lot of sorting - on multiple fields in fact. We have different kinds of Solr configurations - our news searches do little with regards to faceting, but heavily sort. We provide classified ad searches and that heavily uses faceting. I might try reducing the JVM memory some and amount of perm generation as suggested earlier. It feels like a GC issue and loading the cache just happens to be the victim of a stop-the-world event at the worse possible time. > My gut instinct is that your heap size is way too high. Try decreasing it to > like 5-10G. I know you say it uses more than that, but that just seems > bizarre unless you're doing something like faceting and/or sorting on every > field. > > -Michael > > -Original Message- > From: Patrick O'Lone [mailto:pol...@townnews.com] > Sent: Tuesday, November 26, 2013 11:59 AM > To: solr-user@lucene.apache.org > Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache > > I've been tracking a problem in our Solr environment for awhile with periodic > stalls of Solr 3.6.1. I'm running up to a wall on ideas to try and thought I > might get some insight from some others on this list. > > The load on the server is normally anywhere between 1-3. It's an 8-core > machine with 40GB of RAM. I have about 25GB of index data that is replicated > to this server every 5 minutes. It's taking about 200 connections per second > and roughly every 5-10 minutes it will stall for about 30 seconds to a > minute. The stall causes the load to go to as high as 90. It is all CPU bound > in user space - all cores go to 99% utilization (spinlock?). When doing a > thread dump, the following line is blocked in all running Tomcat threads: > > org.apache.lucene.search.FieldCacheImpl$Cache.get ( > FieldCacheImpl.java:230 ) > > Looking the source code in 3.6.1, that is a function call to > syncronized() which blocks all threads and causes the backlog. I've tried to > correlate these events to the replication events - but even with replication > disabled - this still happens. We run multiple data centers using Solr and I > was comparing garbage collection processes between and noted that the old > generation is collected very differently on this data center versus others. > The old generation is collected as a massive collect event (several gigabytes > worth) - the other data center is more saw toothed and collects only in > 500MB-1GB at a time. Here's my parameters to java (the same in all > environments): > > /usr/java/jre/bin/java \ > -verbose:gc \ > -XX:+PrintGCDetails \ > -server \ > -Dcom.sun.management.jmxremote \ > -XX:+UseConcMarkSweepGC \ > -XX:+UseParNewGC \ > -XX:+CMSIncrementalMode \ > -XX:+CMSParallelRemarkEnabled \ > -XX:+CMSIncrementalPacing \ > -XX:NewRatio=3 \ > -Xms30720M \ > -Xmx30720M \ > -Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \ -classpath > /usr/local/share/apache-tomcat/bin/bootstrap.jar \ > -Dcatalina.base=/usr/local/share/apache-tomcat \ > -Dcatalina.home=/usr/local/share/apache-tomcat \ -Djava.io.tmpdir=/tmp \ > org.apache.catalina.startup.Bootstrap start > > I've tried a few GC option changes from this (been running this way for a > couple of years now) - primarily removing CMS Incremental mode as we have 8 > cores and remarks on the internet suggest that it is only for smaller SMP > setups. Removing CMS did not fix anything. > > I've considered that the heap is way too large (30GB from 40GB) and may not > leave enough memory for mmap operations (MMap appears to be used in the field > cache). 
Based on active memory utilization in Java, seems like I might be > able to reduce down to 22GB safely - but I'm not sure if that will help with > the CPU issues. > > I think field cache is used for sorting and faceting. I've started to > investigate facet.method, but from what I can tell, this doesn't seem to > influence sorting at all - only facet queries. I've tried setting > useFilterForSortQuery, and seems to require less field cache but doesn't > address the stalling issues. > > Is there something I am overlooking? Perhaps the system is becoming > oversubscribed in terms of resources? Thanks for any help that is offered. > > -- > Patrick O'Lone > Director of Software Development > TownNews.com > > E-mail ... pol...@townnews.com > Phone 309-743-0809 > Fax .. 309-743-0830 > > -- Patrick O'Lone Director of Software Development TownNews.com E-mail ... pol...@townnews.com Phone 309-743-0809 Fax .. 309-743-0830
facet.method=fcs vs facet.method=fc on solr slaves
Is there any advantage on a Solr slave to receive queries using facet.method=fcs instead of the default of facet.method=fc? Most of the segment files are unchanged between replication events - but I wasn't sure if replication would cause the unchanged segment field caches to be lost anyway. -- Patrick O'Lone Director of Software Development TownNews.com E-mail ... pol...@townnews.com Phone 309-743-0809 Fax .. 309-743-0830
Re: facet.method=fcs vs facet.method=fc on solr slaves
So does it make the most sense then to force, by default, facet.method=fcs on slave nodes that receive updates every 5 minutes but with large segments that don't change every update? Right now, everything I have configured uses facet.method=fc since we don't declare it at all. Randomly, after replication, I have several threads that will hang on reading data from field cache and I'm trying to think of things I can do to mitigate that. Thanks for the info. > Hello Patrick, > > Replication flushes UnInvertedField cache that impacts fc, but doesn't > harm Lucene's FieldCache which is for fcs. You can check how much time > in millis is spend on UnInvertedField cache regeneration in INFO logs like > "UnInverted multi-valued field ,time=### ..." > > > On Thu, Dec 5, 2013 at 12:15 AM, Patrick O'Lone <mailto:pol...@townnews.com>> wrote: > > Is there any advantage on a Solr slave to receive queries using > facet.method=fcs instead of the default of facet.method=fc? Most of the > segment files are unchanged between replication events - but I wasn't > sure if replication would cause the unchanged segment field caches to be > lost anyway. > -- > Patrick O'Lone > Director of Software Development > TownNews.com > > E-mail ... pol...@townnews.com <mailto:pol...@townnews.com> > Phone 309-743-0809 > Fax .. 309-743-0830 > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > <mailto:mkhlud...@griddynamics.com> -- Patrick O'Lone Director of Software Development TownNews.com E-mail ... pol...@townnews.com Phone 309-743-0809 Fax .. 309-743-0830
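For the record, the facet method can be forced per request or per field without touching the config, which makes it easy to A/B test on a slave right after replication (a hedged example; 'section' stands in for whatever field is being faceted):

  ...&facet=true&facet.field=section&facet.method=fcs
  ...&facet=true&facet.field=section&f.section.facet.method=fcs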
Re: Solr 3.6.1 stalling with high CPU and blocking on field cache
I have a new question about this issue - I create a filter queries of the form: fq=start_time:[* TO NOW/5MINUTE] This is used to restrict the set of documents to only items that have a start time within the next 5 minutes. Most of my indexes have millions of documents with few documents that start sometime in the future. Nearly all of my queries include this, would this cause every other search thread to block until the filter query is re-cached every 5 minutes and if so, is there a better way to do it? Thanks for any continued help with this issue! > We have a webapp running with a very high HEAP size (24GB) and we have > no problems with it AFTER we enabled the new GC that is meant to replace > sometime in the future the CMS GC, but you have to have Java 6 update > "Some number I couldn't find but latest should cover" to be able to use: > > 1. Remove all GC options you have and... > 2. Replace them with /"-XX:+UseG1GC -XX:MaxGCPauseMillis=50"/ > > As a test of course, more information you can read on the following (and > interesting) article, we also have Solr running with these options, no > more pauses or HEAP size hitting the sky. > > Don't get bored reading the 1st (and small) introduction page of the > article, page 2 and 3 will make lot of sense: > http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061 > > > HTH, > > Guido. > > On 26/11/13 21:59, Patrick O'Lone wrote: >> We do perform a lot of sorting - on multiple fields in fact. We have >> different kinds of Solr configurations - our news searches do little >> with regards to faceting, but heavily sort. We provide classified ad >> searches and that heavily uses faceting. I might try reducing the JVM >> memory some and amount of perm generation as suggested earlier. It feels >> like a GC issue and loading the cache just happens to be the victim of a >> stop-the-world event at the worse possible time. >> >>> My gut instinct is that your heap size is way too high. Try >>> decreasing it to like 5-10G. I know you say it uses more than that, >>> but that just seems bizarre unless you're doing something like >>> faceting and/or sorting on every field. >>> >>> -Michael >>> >>> -Original Message- >>> From: Patrick O'Lone [mailto:pol...@townnews.com] >>> Sent: Tuesday, November 26, 2013 11:59 AM >>> To: solr-user@lucene.apache.org >>> Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache >>> >>> I've been tracking a problem in our Solr environment for awhile with >>> periodic stalls of Solr 3.6.1. I'm running up to a wall on ideas to >>> try and thought I might get some insight from some others on this list. >>> >>> The load on the server is normally anywhere between 1-3. It's an >>> 8-core machine with 40GB of RAM. I have about 25GB of index data that >>> is replicated to this server every 5 minutes. It's taking about 200 >>> connections per second and roughly every 5-10 minutes it will stall >>> for about 30 seconds to a minute. The stall causes the load to go to >>> as high as 90. It is all CPU bound in user space - all cores go to >>> 99% utilization (spinlock?). When doing a thread dump, the following >>> line is blocked in all running Tomcat threads: >>> >>> org.apache.lucene.search.FieldCacheImpl$Cache.get ( >>> FieldCacheImpl.java:230 ) >>> >>> Looking the source code in 3.6.1, that is a function call to >>> syncronized() which blocks all threads and causes the backlog. 
I've >>> tried to correlate these events to the replication events - but even >>> with replication disabled - this still happens. We run multiple data >>> centers using Solr and I was comparing garbage collection processes >>> between and noted that the old generation is collected very >>> differently on this data center versus others. The old generation is >>> collected as a massive collect event (several gigabytes worth) - the >>> other data center is more saw toothed and collects only in 500MB-1GB >>> at a time. Here's my parameters to java (the same in all environments): >>> >>> /usr/java/jre/bin/java \ >>> -verbose:gc \ >>> -XX:+PrintGCDetails \ >>> -server \ >>> -Dcom.sun.management.jmxremote \ >>> -XX:+UseConcMarkSweepGC \ >>> -XX:+UseParNewGC \ >>> -XX:+CMSIncrementalMode \ >>> -XX:+CMSParallelRemarkEnabled \ >>> -XX:+CMSIncrementalPacing \ >>> -XX:NewRatio=3 \ >>> -Xm
Re: Solr 3.6.1 stalling with high CPU and blocking on field cache
Unfortunately, in a test environment, this happens in version 4.4.0 of Solr as well. > I was trying to locate the release notes for 3.6.x it is too old, if I > were you I would update to 3.6.2 (from 3.6.1), it shouldn't affect you > since it is a minor release, locate the release notes and see if > something that is affecting you got fixed, also, I would be thinking on > moving on to 4.x which is quite stable and fast. > > Like anything with Java and concurrency, it will just get better (and > faster) with bigger numbers and concurrency frameworks becoming more and > more reliable, standard and stable. > > Regards, > > Guido. > > On 09/12/13 15:07, Patrick O'Lone wrote: >> I have a new question about this issue - I create a filter queries of >> the form: >> >> fq=start_time:[* TO NOW/5MINUTE] >> >> This is used to restrict the set of documents to only items that have a >> start time within the next 5 minutes. Most of my indexes have millions >> of documents with few documents that start sometime in the future. >> Nearly all of my queries include this, would this cause every other >> search thread to block until the filter query is re-cached every 5 >> minutes and if so, is there a better way to do it? Thanks for any >> continued help with this issue! >> >>> We have a webapp running with a very high HEAP size (24GB) and we have >>> no problems with it AFTER we enabled the new GC that is meant to replace >>> sometime in the future the CMS GC, but you have to have Java 6 update >>> "Some number I couldn't find but latest should cover" to be able to use: >>> >>> 1. Remove all GC options you have and... >>> 2. Replace them with /"-XX:+UseG1GC -XX:MaxGCPauseMillis=50"/ >>> >>> As a test of course, more information you can read on the following (and >>> interesting) article, we also have Solr running with these options, no >>> more pauses or HEAP size hitting the sky. >>> >>> Don't get bored reading the 1st (and small) introduction page of the >>> article, page 2 and 3 will make lot of sense: >>> http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061 >>> >>> >>> >>> HTH, >>> >>> Guido. >>> >>> On 26/11/13 21:59, Patrick O'Lone wrote: >>>> We do perform a lot of sorting - on multiple fields in fact. We have >>>> different kinds of Solr configurations - our news searches do little >>>> with regards to faceting, but heavily sort. We provide classified ad >>>> searches and that heavily uses faceting. I might try reducing the JVM >>>> memory some and amount of perm generation as suggested earlier. It >>>> feels >>>> like a GC issue and loading the cache just happens to be the victim >>>> of a >>>> stop-the-world event at the worse possible time. >>>> >>>>> My gut instinct is that your heap size is way too high. Try >>>>> decreasing it to like 5-10G. I know you say it uses more than that, >>>>> but that just seems bizarre unless you're doing something like >>>>> faceting and/or sorting on every field. >>>>> >>>>> -Michael >>>>> >>>>> -Original Message- >>>>> From: Patrick O'Lone [mailto:pol...@townnews.com] >>>>> Sent: Tuesday, November 26, 2013 11:59 AM >>>>> To: solr-user@lucene.apache.org >>>>> Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache >>>>> >>>>> I've been tracking a problem in our Solr environment for awhile with >>>>> periodic stalls of Solr 3.6.1. I'm running up to a wall on ideas to >>>>> try and thought I might get some insight from some others on this >>>>> list. >>>>> >>>>> The load on the server is normally anywhere between 1-3. 
It's an >>>>> 8-core machine with 40GB of RAM. I have about 25GB of index data that >>>>> is replicated to this server every 5 minutes. It's taking about 200 >>>>> connections per second and roughly every 5-10 minutes it will stall >>>>> for about 30 seconds to a minute. The stall causes the load to go to >>>>> as high as 90. It is all CPU bound in user space - all cores go to >>>>> 99% utilization (spinlock?). When doing a thread dump, the following >>>>> line is blocked in all running Tomcat threads: >>&
Re: Solr 3.6.1 stalling with high CPU and blocking on field cache
Yeah, I tried G1, but it did not help - I don't think it is a garbage collection issue. I've made various changes to iCMS as well and the issue ALWAYS happens - no matter what I do. If I'm taking heavy traffic (200 requests per second) - as soon as I hit a 5 minute mark - the world stops - garbage collection would be less predictable. Nearly all of my requests have this 5 minute windowing behavior on time though, which is why I have it as a strong suspect now. If it blocks on that - even for a couple of seconds, my traffic backlog will be 600-800 requests. > Did you add the Garbage collection JVM options I suggested you? > > -XX:+UseG1GC -XX:MaxGCPauseMillis=50 > > Guido. > > On 09/12/13 16:33, Patrick O'Lone wrote: >> Unfortunately, in a test environment, this happens in version 4.4.0 of >> Solr as well. >> >>> I was trying to locate the release notes for 3.6.x it is too old, if I >>> were you I would update to 3.6.2 (from 3.6.1), it shouldn't affect you >>> since it is a minor release, locate the release notes and see if >>> something that is affecting you got fixed, also, I would be thinking on >>> moving on to 4.x which is quite stable and fast. >>> >>> Like anything with Java and concurrency, it will just get better (and >>> faster) with bigger numbers and concurrency frameworks becoming more and >>> more reliable, standard and stable. >>> >>> Regards, >>> >>> Guido. >>> >>> On 09/12/13 15:07, Patrick O'Lone wrote: >>>> I have a new question about this issue - I create a filter queries of >>>> the form: >>>> >>>> fq=start_time:[* TO NOW/5MINUTE] >>>> >>>> This is used to restrict the set of documents to only items that have a >>>> start time within the next 5 minutes. Most of my indexes have millions >>>> of documents with few documents that start sometime in the future. >>>> Nearly all of my queries include this, would this cause every other >>>> search thread to block until the filter query is re-cached every 5 >>>> minutes and if so, is there a better way to do it? Thanks for any >>>> continued help with this issue! >>>> >>>>> We have a webapp running with a very high HEAP size (24GB) and we have >>>>> no problems with it AFTER we enabled the new GC that is meant to >>>>> replace >>>>> sometime in the future the CMS GC, but you have to have Java 6 update >>>>> "Some number I couldn't find but latest should cover" to be able to >>>>> use: >>>>> >>>>> 1. Remove all GC options you have and... >>>>> 2. Replace them with /"-XX:+UseG1GC -XX:MaxGCPauseMillis=50"/ >>>>> >>>>> As a test of course, more information you can read on the following >>>>> (and >>>>> interesting) article, we also have Solr running with these options, no >>>>> more pauses or HEAP size hitting the sky. >>>>> >>>>> Don't get bored reading the 1st (and small) introduction page of the >>>>> article, page 2 and 3 will make lot of sense: >>>>> http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061 >>>>> >>>>> >>>>> >>>>> >>>>> HTH, >>>>> >>>>> Guido. >>>>> >>>>> On 26/11/13 21:59, Patrick O'Lone wrote: >>>>>> We do perform a lot of sorting - on multiple fields in fact. We have >>>>>> different kinds of Solr configurations - our news searches do little >>>>>> with regards to faceting, but heavily sort. We provide classified ad >>>>>> searches and that heavily uses faceting. I might try reducing the JVM >>>>>> memory some and amount of perm generation as suggested earlier. 
It >>>>>> feels >>>>>> like a GC issue and loading the cache just happens to be the victim >>>>>> of a >>>>>> stop-the-world event at the worse possible time. >>>>>> >>>>>>> My gut instinct is that your heap size is way too high. Try >>>>>>> decreasing it to like 5-10G. I know you say it uses more than that, >>>>>>> but that just seems bizarre unless you're doing something like >>>>>>> faceting and/or sorting on every field. >>>>>>> >>>>>>> -Michael &g
Re: Solr 3.6.1 stalling with high CPU and blocking on field cache
Well, I want to include everything will start in the next 5 minute interval and everything that came before. The query is more like: fq=start_time:[* TO NOW+5MINUTE/5MINUTE] so that it rounds to the nearest 5 minute interval on the right-hand side. But, as soon as 1 second after that 5 minute window, everything pauses wanting for filter cache (at least that's my working theory based on observation). Is it possible to do something like: fq=start_time:[* TO NOW+1DAY/DAY]&q=start_time:[* TO NOW/MINUTE] where it would use the filter cache to narrow down by day resolution and then filter as part of the standard query, or something like that? My thought is that this would still gain a benefit from a query cache, but somewhat slower since it must remove results for things appearing later in the day. > If you want a start time within the next 5 minutes, I think your filter > is not the good one. > * will be replaced by the first date in your field > > Try : > fq=start_time:[NOW TO NOW+5MINUTE] > > Franck Brisbart > > > Le lundi 09 décembre 2013 à 09:07 -0600, Patrick O'Lone a écrit : >> I have a new question about this issue - I create a filter queries of >> the form: >> >> fq=start_time:[* TO NOW/5MINUTE] >> >> This is used to restrict the set of documents to only items that have a >> start time within the next 5 minutes. Most of my indexes have millions >> of documents with few documents that start sometime in the future. >> Nearly all of my queries include this, would this cause every other >> search thread to block until the filter query is re-cached every 5 >> minutes and if so, is there a better way to do it? Thanks for any >> continued help with this issue! >> >>> We have a webapp running with a very high HEAP size (24GB) and we have >>> no problems with it AFTER we enabled the new GC that is meant to replace >>> sometime in the future the CMS GC, but you have to have Java 6 update >>> "Some number I couldn't find but latest should cover" to be able to use: >>> >>> 1. Remove all GC options you have and... >>> 2. Replace them with /"-XX:+UseG1GC -XX:MaxGCPauseMillis=50"/ >>> >>> As a test of course, more information you can read on the following (and >>> interesting) article, we also have Solr running with these options, no >>> more pauses or HEAP size hitting the sky. >>> >>> Don't get bored reading the 1st (and small) introduction page of the >>> article, page 2 and 3 will make lot of sense: >>> http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061 >>> >>> >>> HTH, >>> >>> Guido. >>> >>> On 26/11/13 21:59, Patrick O'Lone wrote: >>>> We do perform a lot of sorting - on multiple fields in fact. We have >>>> different kinds of Solr configurations - our news searches do little >>>> with regards to faceting, but heavily sort. We provide classified ad >>>> searches and that heavily uses faceting. I might try reducing the JVM >>>> memory some and amount of perm generation as suggested earlier. It feels >>>> like a GC issue and loading the cache just happens to be the victim of a >>>> stop-the-world event at the worse possible time. >>>> >>>>> My gut instinct is that your heap size is way too high. Try >>>>> decreasing it to like 5-10G. I know you say it uses more than that, >>>>> but that just seems bizarre unless you're doing something like >>>>> faceting and/or sorting on every field. 
>>>>> >>>>> -Michael >>>>> >>>>> -Original Message- >>>>> From: Patrick O'Lone [mailto:pol...@townnews.com] >>>>> Sent: Tuesday, November 26, 2013 11:59 AM >>>>> To: solr-user@lucene.apache.org >>>>> Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache >>>>> >>>>> I've been tracking a problem in our Solr environment for awhile with >>>>> periodic stalls of Solr 3.6.1. I'm running up to a wall on ideas to >>>>> try and thought I might get some insight from some others on this list. >>>>> >>>>> The load on the server is normally anywhere between 1-3. It's an >>>>> 8-core machine with 40GB of RAM. I have about 25GB of index data that >>>>> is replicated to this server every 5 minutes. It's taking about 200 >>>>> connections per second and roughly every 5-10 minutes it will stall >>&
Re: Solr 3.6.1 stalling with high CPU and blocking on field cache
I initially thought this was the case as well. These are slave nodes that receive updates every 5-10 minutes. However, this issue happens even if replication is turned off and no update handler is provided at all. I have confirmed against my data that simply querying the fq for a start_time in a range takes 11-13 seconds to actually populate the cache. If I make the fq not cache at all, my QTime raises by about 100ms, but does not have the stalling effect. A purely negative query also seems to have this effect, that is: fq=-start_time:[NOW/MINUTE TO *] But, I'm not sure if that is because it actually caches the negative query or if it discards it entirely. > Patrick, > > Are you getting these stalls following a commit? If so then the issue is > most likely fieldCache warming pauses. To stop your users from seeing > this pause you'll need to add static warming queries to your > solrconfig.xml to warm the fieldCache before it's registered . > > > On Mon, Dec 9, 2013 at 12:33 PM, Patrick O'Lone <mailto:pol...@townnews.com>> wrote: > > Well, I want to include everything will start in the next 5 minute > interval and everything that came before. The query is more like: > > fq=start_time:[* TO NOW+5MINUTE/5MINUTE] > > so that it rounds to the nearest 5 minute interval on the right-hand > side. But, as soon as 1 second after that 5 minute window, everything > pauses wanting for filter cache (at least that's my working theory based > on observation). Is it possible to do something like: > > fq=start_time:[* TO NOW+1DAY/DAY]&q=start_time:[* TO NOW/MINUTE] > > where it would use the filter cache to narrow down by day resolution and > then filter as part of the standard query, or something like that? > > My thought is that this would still gain a benefit from a query cache, > but somewhat slower since it must remove results for things appearing > later in the day. > > > If you want a start time within the next 5 minutes, I think your > filter > > is not the good one. > > * will be replaced by the first date in your field > > > > Try : > > fq=start_time:[NOW TO NOW+5MINUTE] > > > > Franck Brisbart > > > > > > Le lundi 09 d�cembre 2013 � 09:07 -0600, Patrick O'Lone a �crit : > >> I have a new question about this issue - I create a filter queries of > >> the form: > >> > >> fq=start_time:[* TO NOW/5MINUTE] > >> > >> This is used to restrict the set of documents to only items that > have a > >> start time within the next 5 minutes. Most of my indexes have > millions > >> of documents with few documents that start sometime in the future. > >> Nearly all of my queries include this, would this cause every other > >> search thread to block until the filter query is re-cached every 5 > >> minutes and if so, is there a better way to do it? Thanks for any > >> continued help with this issue! > >> > >>> We have a webapp running with a very high HEAP size (24GB) and > we have > >>> no problems with it AFTER we enabled the new GC that is meant to > replace > >>> sometime in the future the CMS GC, but you have to have Java 6 > update > >>> "Some number I couldn't find but latest should cover" to be able > to use: > >>> > >>> 1. Remove all GC options you have and... > >>> 2. Replace them with /"-XX:+UseG1GC -XX:MaxGCPauseMillis=50"/ > >>> > >>> As a test of course, more information you can read on the > following (and > >>> interesting) article, we also have Solr running with these > options, no > >>> more pauses or HEAP size hitting the sky. 
> >>> > >>> Don't get bored reading the 1st (and small) introduction page of the > >>> article, page 2 and 3 will make lot of sense: > >>> > > http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061 > >>> > >>> > >>> HTH, > >>> > >>> Guido. > >>> > >>> On 26/11/13 21:59, Patrick O'Lone wrote: > >>>> We do perform a lot of sorting - on multiple fields in fact. We > have > >>>> different kinds of Solr configurations - our news searches do > little > >>>>
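For anyone following this thread: the static warming queries suggested above go into solrconfig.xml as a newSearcher listener. A minimal sketch, assuming the start_time field from this thread and otherwise default settings:

  <!-- pre-populate the filter cache with the rounded start_time filter after each commit -->
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="fq">start_time:[* TO NOW+5MINUTE/5MINUTE]</str>
        <str name="rows">0</str>
      </lst>
    </arr>
  </listener>

Note that this only fires when a new searcher is opened (after a commit or replication), so it covers the post-commit stall case, not the pure five-minute NOW rollover on an otherwise idle index. The non-caching variant mentioned above is just the local-param form of the same filter, fq={!cache=false}start_time:[* TO NOW+5MINUTE/5MINUTE].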
LFU cache and autowarming
If I were to use the LFU cache instead of FastLRU for the filter cache, and I enable auto-warming on that cache type, does it warm the most frequently used fq entries in the filter cache? Thanks for any info! -- Patrick O'Lone Director of Software Development TownNews.com E-mail ... pol...@townnews.com Phone 309-743-0809 Fax .. 309-743-0830
Re: LFU cache and autowarming
Well, I haven't tested it - if it's not ready yet I will probably avoid it for now. > On 12/19/2013 1:46 PM, Patrick O'Lone wrote: >> If I were to use the LFU cache instead of FastLRU for the filter cache, and >> I enable auto-warming on that cache type, does it warm the most >> frequently used fq entries in the filter cache? Thanks for any info! > > I wrote that cache. It's a really really crappy implementation; I would > only expect it to work well if the cache is very very small. > > I do have a replacement implementation that's just about ready, but I've > not been able to find 'round tuits to work on getting it polished and > committed. > > https://issues.apache.org/jira/browse/SOLR-2906 > https://issues.apache.org/jira/browse/SOLR-3393 > > Thanks, > Shawn > > -- Patrick O'Lone Director of Software Development TownNews.com E-mail ... pol...@townnews.com Phone 309-743-0809 Fax .. 309-743-0830
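For context, the cache being discussed is configured in solrconfig.xml. A minimal sketch with illustrative sizes (swap the class to solr.LFUCache to get the implementation Shawn wrote):

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="128"/>

autowarmCount is how many entries get regenerated against the new searcher after a commit; which entries are picked is up to the cache implementation, which is exactly what the question above is asking about.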
Re: no servers hosting shard
After a full bounce of Tomcat, I'm now getting a new exception (below). I can browse the Zookeeper config in the Solr admin UI, and can confirm that there's a node for '/collections/customerOrderSearch/leaders/shard2', but no node for 'collections/customerOrderSearch/leaders/shard1'. Still, any ideas or guidance on how to recover would be appreciated. We've restarted all three zookeeper instances and both Solr instances, but that hasn't made any appreciable difference. --p. 2014-01-07 10:06:14,980 [coreLoadExecutor-4-thread-1] ERROR org.apache.solr.core.CoreContainer - null:org.apache.solr.common.cloud.ZooKeeperException: at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:309) at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:556) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:365) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.solr.common.SolrException: Error getting leader from zk for shard shard1 at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:864) at org.apache.solr.cloud.ZkController.register(ZkController.java:773) at org.apache.solr.cloud.ZkController.register(ZkController.java:723) at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:286) ... 11 more Caused by: org.apache.solr.common.SolrException: Could not get leader props at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:911) at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:875) at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:839) ... 14 more Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /collections/customerOrderSearch/leaders/shard1 at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151) at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:252) at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:249) at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65) at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:249) at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:889) ... 16 more On Tue, Jan 7, 2014 at 9:57 AM, patrick conant wrote: > In our Solr instance we have two shards each running on two servers. The > server that was the leader for one of the shards ran into a problem, and > when we restarted the service, Solar is no longer electing a leader for the > shard. > > The stack traces from the logs are below, and the 'Cloud Dump' from the > Solr console is attached. We're running Solr 4.4.0. Any guidance on how > to recover from this? Restarting or redeploying the service doesn't seem > to make any difference. > > Thanks, > Pat. 
> > > 2014-01-07 00:00:10,754 [http-8080-62] ERROR org.apache.solr.core.SolrCore > - org.apache.solr.common.SolrException: no servers hosting shard: > at > org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) > at > org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > > 2014-01-07 09:38:33,701 [http-8080-21] ERROR org.apache.solr.core.SolrCore > - org.apache.solr.common.SolrException: No registered leader was found, > collection:customerOrderSearch slice:shard1 > at > org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:487) > at > org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:470) > at > org.
Re: no servers hosting shard
We found a way to recover. This sequence allowed everything to start up successfully. - Stop all Solr instances - Stop all Zookeeper instances - Start all Zookeeper instances - Start Solr instances one at a time. Restarting the first Solr instance took several minutes, but the subsequent instances started up much more quickly. Cheers, Pat. On Tue, Jan 7, 2014 at 10:20 AM, patrick conant wrote: > After a full bounce of Tomcat, I'm now getting a new exception (below). I > can browse the Zookeeper config in the Solr admin UI, and can confirm that > there's a node for '/collections/customerOrderSearch/leaders/shard2', but > no node for 'collections/customerOrderSearch/leaders/shard1'. Still, any > ideas or guidance on how to recover would be appreciated. We've restarted > all three zookeeper instances and both Solr instances, but that hasn't made > any appreciable difference. > > --p. > > > > > 2014-01-07 10:06:14,980 [coreLoadExecutor-4-thread-1] ERROR > org.apache.solr.core.CoreContainer - > null:org.apache.solr.common.cloud.ZooKeeperException: > at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:309) > at org.apache.solr.core.CoreContainer.registerCore(CoreContainer.java:556) > at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:365) > at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:356) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439) > at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) > at java.util.concurrent.FutureTask.run(FutureTask.java:138) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > Caused by: org.apache.solr.common.SolrException: Error getting leader from > zk for shard shard1 > at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:864) > at org.apache.solr.cloud.ZkController.register(ZkController.java:773) > at org.apache.solr.cloud.ZkController.register(ZkController.java:723) > at org.apache.solr.core.ZkContainer.registerInZk(ZkContainer.java:286) > ... 11 more > Caused by: org.apache.solr.common.SolrException: Could not get leader props > at > org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:911) > at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:875) > at org.apache.solr.cloud.ZkController.getLeader(ZkController.java:839) > ... 14 more > Caused by: org.apache.zookeeper.KeeperException$NoNodeException: > KeeperErrorCode = NoNode for /collections/customerOrderSearch/leaders/shard1 > at org.apache.zookeeper.KeeperException.create(KeeperException.java:111) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1151) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:252) > at > org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:249) > at > org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:65) > at > org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:249) > at org.apache.solr.cloud.ZkController.getLeaderProps(ZkController.java:889) > ... 
16 more > > > > On Tue, Jan 7, 2014 at 9:57 AM, patrick conant > wrote: > >> In our Solr instance we have two shards each running on two servers. The >> server that was the leader for one of the shards ran into a problem, and >> when we restarted the service, Solar is no longer electing a leader for the >> shard. >> >> The stack traces from the logs are below, and the 'Cloud Dump' from the >> Solr console is attached. We're running Solr 4.4.0. Any guidance on how >> to recover from this? Restarting or redeploying the service doesn't seem >> to make any difference. >> >> Thanks, >> Pat. >> >> >> 2014-01-07 00:00:10,754 [http-8080-62] ERROR >> org.apache.solr.core.SolrCore - org.apache.solr.common.SolrException: no >> servers hosting shard: >> at >> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:149) >> at >> org.apache.solr.handler.component.HttpShardHandler$1.call(HttpShardHandler.java:119) >> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >> at >> java.util.concurrent.Executors$RunnableAdapter.call(E
Best way to map holidays to corresponding date
Hey, maybe someone has already faced this situation and could give me a hint. Given a query that includes "Easter" or "Sylvester", I am looking for the best place to translate that string into the corresponding date. Is there any solr.Mapping*Factory for that, or do I need to implement it in a custom Solr query parser? Regards, Patrick
Handling growth
Hello everyone, I am working with a Solr collection that is several terabytes in size, spread over several hundred million documents. Each document is very rich, and over the past few years we have consistently quadrupled the size of our collection annually. Unfortunately, this sits on a single node with only a few hundred megabytes of memory - so our performance is less than ideal. I am looking into implementing a SolrCloud cluster. From reading a few books (e.g. Solr in Action), various internet blogs, and the reference guide, the advice is to build a cluster with room to grow. I can probably provision enough hardware for a year's worth of growth from today, however I would like to have a plan beyond that. Shard splitting seems pretty straightforward. We are continuously adding documents and never change existing ones. Based on that, one individual recommended that I implement custom hashing and route the latest documents to the shard with the fewest documents, and when that shard fills up, add a new shard and index on the new shard, rinse and repeat. The last approach makes sense. However, my concern with it is that I lose distributed indexing, plus the implementation and maintainability concerns. My question for the community is: what are your thoughts on this, and do you have any suggestions and/or recommendations on planning for future growth? Look forward to your responses, Patrick
RE: Handling growth
Good eye, that should have been gigabytes. When adding to the new shard, is the shard already part of the the collection? What mechanism have you found useful in accomplishing this (i.e. routing)? On Nov 14, 2014 7:07 AM, "Toke Eskildsen" wrote: > Patrick Henry [patricktheawesomeg...@gmail.com] wrote: > > >I am working with a Solr collection that is several terabytes in size over > > several hundred millions of documents. Each document is very rich, and > > over the past few years we have consistently quadrupled the size our > > collection annually. Unfortunately, this sits on a single node with > only a > > few hundred megabytes of memory - so our performance is less than ideal. > > I assume you mean gigabytes of memory. If you have not already done so, > switching to SSDs for storage should buy you some more time. > > > [Going for SolrCloud] We are in a continuous adding documents and never > change > > existing ones. Based on that, one individual recommended for me to > > implement custom hashing and route the latest documents to the shard with > > the least documents, and when that shard fills up add a new shard and > index > > on the new shard, rinse and repeat. > > We have quite a similar setup, where we produce a never-changing shard > once every 8 days and add it to our cloud. One could also combine this > setup with a single live shard, for keeping the full index constantly up to > date. The memory overhead of running an immutable shard is smaller than a > mutable one and easier to fine-tune. It also allows you to optimize the > index down to a single segment, which requires a bit less processing power > and saves memory when faceting. There's a description of our setup at > http://sbdevel.wordpress.com/net-archive-search/ > > From an administrative point of view, we like having complete control over > each shard. We keep track of what goes in it and in case of schema or > analyze chain changes, we can re-build each shard one at a time and deploy > them continuously, instead of having to re-build everything in one go on a > parallel setup. Of course, fundamental changes to the schema would require > a complete re-build before deploy, so we hope to avoid that. > > - Toke Eskildsen >
Re: Handling growth
Michael, Interesting, I'm still unfamiliar with limitations (if any) of aliasing. Does architecture utilize realtime get? On Nov 18, 2014 11:49 AM, "Michael Della Bitta" < michael.della.bi...@appinions.com> wrote: > We're achieving some success by treating aliases as collections and > collections as shards. > > More specifically, there's a read alias that spans all the collections, > and a write alias that points at the 'latest' collection. Every week, I > create a new collection, add it to the read alias, and point the write > alias at it. > > Michael > > On 11/14/14 07:06, Toke Eskildsen wrote: > >> Patrick Henry [patricktheawesomeg...@gmail.com] wrote: >> >> I am working with a Solr collection that is several terabytes in size >>> over >>> several hundred millions of documents. Each document is very rich, and >>> over the past few years we have consistently quadrupled the size our >>> collection annually. Unfortunately, this sits on a single node with >>> only a >>> few hundred megabytes of memory - so our performance is less than ideal. >>> >> I assume you mean gigabytes of memory. If you have not already done so, >> switching to SSDs for storage should buy you some more time. >> >> [Going for SolrCloud] We are in a continuous adding documents and never >>> change >>> existing ones. Based on that, one individual recommended for me to >>> implement custom hashing and route the latest documents to the shard with >>> the least documents, and when that shard fills up add a new shard and >>> index >>> on the new shard, rinse and repeat. >>> >> We have quite a similar setup, where we produce a never-changing shard >> once every 8 days and add it to our cloud. One could also combine this >> setup with a single live shard, for keeping the full index constantly up to >> date. The memory overhead of running an immutable shard is smaller than a >> mutable one and easier to fine-tune. It also allows you to optimize the >> index down to a single segment, which requires a bit less processing power >> and saves memory when faceting. There's a description of our setup at >> http://sbdevel.wordpress.com/net-archive-search/ >> >> From an administrative point of view, we like having complete control >> over each shard. We keep track of what goes in it and in case of schema or >> analyze chain changes, we can re-build each shard one at a time and deploy >> them continuously, instead of having to re-build everything in one go on a >> parallel setup. Of course, fundamental changes to the schema would require >> a complete re-build before deploy, so we hope to avoid that. >> >> - Toke Eskildsen >> > >
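For anyone wanting to try this, the alias juggling is done with the Collections API; a rough sketch (collection and alias names are made up for illustration, and CREATE will also need whatever configName/shard parameters your setup uses):

  http://host:8983/solr/admin/collections?action=CREATE&name=docs_2014_47&numShards=1&replicationFactor=2
  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=docs_read&collections=docs_2014_45,docs_2014_46,docs_2014_47
  http://host:8983/solr/admin/collections?action=CREATEALIAS&name=docs_write&collections=docs_2014_47

Issuing CREATEALIAS again with an existing alias name simply repoints it, which is how the weekly rollover described above works: the read alias grows by one collection, and the write alias moves to the newest one.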
SolrCloud: Collection API question and problem with core loading
Hi there, I run 2 Solr instances (Tomcat 7, Solr 4.3.0, one shard), one external Zookeeper instance, and have lots of cores. I use the collection API to create each new core dynamically after the configuration for the core is uploaded to Zookeeper, and it all works fine. As there are so many cores, it takes a very long time to load them all at startup; I would like to start the server quickly and load the cores on demand. When a core is created via the collection API it is created with the default parameter loadOnStartup="true" (this can be seen in solr.xml). Question: is there a way to specify this parameter so it can be set to 'false' via the collection API? Problem: If I manually set loadOnStartup="false" for the core, I get the exception below when I use CloudSolrServer to query the core: Error: org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request Seems to me that CloudSolrServer will not trigger the core to be loaded. Is it possible to get the core loaded using CloudSolrServer? Regards, Patrick
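For reference, the attributes in question sit on the core entry in (legacy-style) solr.xml; a hand-edited, lazily loaded core would look roughly like this (core and collection names are illustrative):

  <!-- load this core only when it is first requested; allow it to be unloaded again when idle -->
  <core name="core1" instanceDir="core1" collection="collection1"
        loadOnStartup="false" transient="true"/>

Whether the collection API in 4.3 can set loadOnStartup/transient at creation time is exactly the open question in this mail, so the snippet only shows the end state one would want in solr.xml.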
RE: SolrCloud with Zookeeper ensemble : fail to restart master server
After a number of testing I found that running embedded zookeeper isn't a good idea especially only run one Zookeeper instance. When the Solr instance with ZooKeeper embedded gets rebooted it got confused who should be the leader therefore it will not start while others(followers) are still running. I now use standalone Zookeeper instance and that works well. Thanks Erick for giving the right direction, much appreciated! Regards, Patrick -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, 20 March 2013 2:57 a.m. To: solr-user@lucene.apache.org Subject: Re: SolrCloud with Zookeeper ensemble : fail to restart master server First, the bootstrap_conf and numShards should only be specified the _first_ time you start up your leader. bootstrap_conf's purpose is to push the configuration files to Zookeeper. numShards is a one-time-only parameter that you shouldn't specify more than once, it is ignored afterwards I think. Once the conf files are up in zookeeper, then they don't need to be pushed again until they change, and you can use the command-line tools to do that Terminology: we're trying to get away from master/slave and use leader/replica in SolrCloud mode to distinguish it from the old replication process, so just checking to be sure that you probably really mean leader/replica, right? Watch your admin/SolrCloud link as you bring machines up and down. That page will show you the state of each of your machines. Normally there's no trouble bringing the leader up and down, _except_ it sounds like you have your zookeeper running embedded. A quorum of ZK nodes (in this case one) needs to be running for SolrCloud to operate. Still, that shouldn't prevent your machine running ZK from coming back up. So I'm a bit puzzled, but let's straighten out the startup stuff and watch your solr log on your leader when you bring it up, that should generate some more questions.. Best Erick On Mon, Mar 18, 2013 at 11:12 PM, Patrick Mi wrote: > Hi there, > > I have experienced some problems starting the master server. > > Solr4.2 under Tomcat 7 on Centos6. > > Configuration : > 3 solr instances running on different machines, one shard, 3 cores, 2 > replicas, using Zookeeper comes with Solr > > The master server A has the following run option: -Dbootstrap_conf=true > -DzkRun -DnumShards=1, > The slave servers B and C have : -DzkHost=masterServerIP:2181 > > It works well for add/update/delete etc after I start up master and slave > servers in order. > > When the master A is up stop/start slave B and C are OK. > > When slave B and C are running I couldn't restart master A. Only after I > shutdown B and C then I can start master A. > > Is this a feature or bug or something I haven't configure properly? > > Thanks advance for your help > > Regards, > Patrick > >
OPENNLP current patch compiling problem for 4.x branch
Hi, I checked out from here http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_3_0 and downloaded the latest patch LUCENE-2899-current.patch. Applied the patch ok but when I did 'ant compile' I got the following error: == [javac] /home/lucene_solr_4_3_0/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/FilterPayloadsFilter.java:43: error: cannot find symbol [javac] super(Version.LUCENE_44, input); [javac] ^ [javac] symbol: variable LUCENE_44 [javac] location: class Version [javac] 1 error == Compiled it on trunk without problem. Is this patch supposed to work for 4.X? Regards, Patrick
RE: OPENNLP current patch compiling problem for 4.x branch
Thanks Steve, that worked for branch_4x -Original Message- From: Steve Rowe [mailto:sar...@gmail.com] Sent: Friday, 24 May 2013 3:19 a.m. To: solr-user@lucene.apache.org Subject: Re: OPENNLP current patch compiling problem for 4.x branch Hi Patrick, I think you should check out and apply the patch to branch_4x, rather than the lucene_solr_4_3_0 tag: http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x Steve On May 23, 2013, at 2:08 AM, Patrick Mi wrote: > Hi, > > I checked out from here > http://svn.apache.org/repos/asf/lucene/dev/tags/lucene_solr_4_3_0 and > downloaded the latest patch LUCENE-2899-current.patch. > > Applied the patch ok but when I did 'ant compile' I got the following error: > > > == >[javac] > /home/lucene_solr_4_3_0/lucene/analysis/opennlp/src/java/org/apache/lucene/a > nalysis/opennlp/FilterPayloadsFilter.java:43: error > r: cannot find symbol >[javac] super(Version.LUCENE_44, input); >[javac] ^ >[javac] symbol: variable LUCENE_44 >[javac] location: class Version >[javac] 1 error > == > > Compiled it on trunk without problem. > > Is this patch supposed to work for 4.X? > > Regards, > Patrick >
OPENNLP problems
Hi there, Checked out branch_4x and applied the latest patch LUCENE-2899-current.patch however I ran into 2 problems Followed the wiki page instruction and set up a field with this type aiming to keep nouns and verbs and do a facet on the field == == Struggled to get that going until I put the extra parameter keepPayloads="true" in as below. Question: am I doing the right thing? Is this a mistake on wiki Second problem: Posted the document xml one by one to the solr and the result was what I expected. 1 check in the hotel However if I put multiple documents into the same xml file and post it in one go only the first document gets processed( only 'check' and 'hotel' were showing in the facet result.) 1 check in the hotel 2 removes the payloads 3 retains only nouns and verbs Same problem when updated the data using csv upload. Is that a bug or something I did wrong? Thanks in advance! Regards, Patrick
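For reference, the field type being described, reconstructed roughly from the attributes quoted later in this thread and the OpenNLP wiki example, would look something like the following (the exact factory class names and the sentenceModel attribute are assumptions and may differ in the patch you applied):

  <fieldType name="text_opennlp_nvf" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.OpenNLPTokenizerFactory"
                 sentenceModel="opennlp/en-sent.bin"
                 tokenizerModel="opennlp/en-token.bin"/>
      <filter class="solr.OpenNLPFilterFactory"
              posTaggerModel="opennlp/en-pos-maxent.bin"/>
      <!-- keep only tokens tagged as nouns and verbs -->
      <filter class="solr.FilterPayloadsFilterFactory"
              keepPayloads="true"
              payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/>
      <filter class="solr.StripPayloadsFilterFactory"/>
    </analyzer>
  </fieldType>

The keepPayloads="true" attribute is the fix described above: without it the filter removes the listed part-of-speech payloads instead of keeping them.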
RE: OPENNLP problems
Hi Lance, I updated the src from 4.x and applied the latest patch LUCENE-2899-x.patch uploaded on 6th June but still had the same problem. Regards, Patrick -Original Message- From: Lance Norskog [mailto:goks...@gmail.com] Sent: Thursday, 6 June 2013 5:16 p.m. To: solr-user@lucene.apache.org Subject: Re: OPENNLP problems Patrick- I found the problem with multiple documents. The problem was that the API for the life cycle of a Tokenizer changed, and I only noticed part of the change. You can now upload multiple documents in one post, and the OpenNLPTokenizer will process each document. You're right, the example on the wiki is wrong. The FilterPayloadsFilter default is to remove the given payloads, and needs keepPayloads="true" to retain them. The fixed patch is up as LUCENE-2899-x.patch. Again, thanks for trying it. Lance https://issues.apache.org/jira/browse/LUCENE-2899 On 05/28/2013 10:08 PM, Patrick Mi wrote: > Hi there, > > Checked out branch_4x and applied the latest patch > LUCENE-2899-current.patch however I ran into 2 problems > > Followed the wiki page instruction and set up a field with this type aiming > to keep nouns and verbs and do a facet on the field > == > positionIncrementGap="100"> > > tokenizerModel="opennlp/en-token.bin"/> > posTaggerModel="opennlp/en-pos-maxent.bin"/> > payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/> > > > > == > > Struggled to get that going until I put the extra parameter > keepPayloads="true" in as below. >payloadList="NN,NNS,NNP,NNPS,VB,VBD,VBG,VBN,VBP,VBZ,FW"/> > > Question: am I doing the right thing? Is this a mistake on wiki > > Second problem: > > Posted the document xml one by one to the solr and the result was what I > expected. > > > >1 >check in the hotel > > > However if I put multiple documents into the same xml file and post it in > one go only the first document gets processed( only 'check' and 'hotel' were > showing in the facet result.) > > > >1 >check in the hotel > > >2 >removes the payloads > > >3 >retains only nouns and verbs > > > > Same problem when updated the data using csv upload. > > Is that a bug or something I did wrong? > > Thanks in advance! > > Regards, > Patrick > >
Stemming and other tokenizers
Hello, I want to implement some kind of auto-stemming that will detect the language of a field based on a tag at the start of that field, like #en#. My field is stored on disk, but I don't want this tag to be stored. Is there a way to avoid having the tag stored? As I understand it, all the filters and tokenizers interact only with the indexed value and not the stored one. Am I wrong? Is it possible to write such a filter? Patrick.
Re: Master Slave Question
Use near-real-time indexing (Solr 4), or decrease the replication poll interval and the auto-commit time. 2011/9/10 Jamie Johnson > Is it appropriate to query the master servers when replicating? I ask > because there could be a case where we index say 50 documents to the > master, they have not yet been replicated and a user asks for page 2, > when they ask for page 2 the request could be sent to a slave and get > 0. Is there a way to avoid this? My thought was to not allow > querying of the master but I'm not sure that this could be configured > in solr >
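In config terms, the poll interval mentioned above is set on the slave side of the replication handler in solrconfig.xml; a sketch (master URL and interval are illustrative):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <str name="masterUrl">http://master-host:8983/solr/core0/replication</str>
      <!-- poll the master for new index versions every 20 seconds -->
      <str name="pollInterval">00:00:20</str>
    </lst>
  </requestHandler>

A shorter poll interval narrows the window in which a slave can serve results that lag behind the master, at the cost of more frequent replication checks.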
Re: Stemming and other tokenizers
I can't create one field per language, that is the problem but I'll dig into it following your indications. I let you know what I could come out with. Patrick. 2011/9/11 Jan Høydahl > Hi, > > You'll not be able to detect language and change stemmer on the same field > in one go. You need to create one fieldType in your schema per language you > want to use, and then use LanguageIdentification (SOLR-1979) to do the magic > of detecting language and renaming the field. If you set > langid.override=false, languid.map=true and populate your "language" field > with the known language, you will probably get the desired effect. > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > Solr Training - www.solrtraining.com > > On 10. sep. 2011, at 03:24, Patrick Sauts wrote: > > > Hello, > > > > > > > > I want to implement some king of AutoStemming that will detect the > language > > of a field based on a tag at the start of this field like #en# my field > is > > stored on disc but I don't want this tag to be stored. Is there a way to > > avoid this field to be stored ? > > > > To me all the filters and the tokenizers interact only with the indexed > > field and not the stored one. > > > > Am I wrong ? > > > > Is it possible to you to do such a filter. > > > > > > > > Patrick. > > > >
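Once the SOLR-1979 language identification is available in your Solr version, Jan's suggestion translates to an update processor chain in solrconfig.xml; a rough sketch with illustrative field names (the chain still has to be referenced from your update handler via update.chain):

  <updateRequestProcessorChain name="langid">
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">text</str>
      <str name="langid.langField">language</str>
      <bool name="langid.map">true</bool>
      <bool name="langid.override">false</bool>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

With langid.map=true the text field is mapped to text_en, text_fr and so on, and each of those field types can carry its own stemmer, which is the part that cannot be done inside a single analyzer chain.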
RE: Weird behaviors with not operators.
Maybe this will answer your question http://wiki.apache.org/solr/FAQ Why does 'foo AND -baz' match docs, but 'foo AND (-bar)' doesn't ? Boolean queries must have at least one "positive" expression (ie; MUST or SHOULD) in order to match. Solr tries to help with this, and if asked to execute a BooleanQuery that does contains only negatived clauses _at the topmost level_, it adds a match all docs query (ie: *:*) If the top level BoolenQuery contains somewhere inside of it a nested BooleanQuery which contains only negated clauses, that nested query will not be modified, and it (by definition) an't match any documents -- if it is required, that means the outer query will not match. More Detail: * https://issues.apache.org/jira/browse/SOLR-80 * https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201006.mbox/%3Cal pine.deb.1.10.1006011609080.29...@radix.cryptio.net%3E Patrick. -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Monday, September 12, 2011 3:04 PM To: solr-user@lucene.apache.org Subject: Re: Weird behaviors with not operators. : I'm crashing into a weird behavior with - operators. I went ahead and added a FAQ on this using some text from a previous nearly identical email ... https://wiki.apache.org/solr/FAQ#Why_does_.27foo_AND_-baz.27_match_docs.2C_b ut_.27foo_AND_.28-bar.29.27_doesn.27t_.3F please reply if you have followup questions. -Hoss
RE: Weird behaviors with not operators.
I mean it's a known bug. Hostetter AND (-chris *:*) Should do the trick. Depending on your request. NAME:(-chris *:*) -Original Message- From: Patrick Sauts [mailto:patrick.via...@gmail.com] Sent: Monday, September 12, 2011 3:57 PM To: solr-user@lucene.apache.org Subject: RE: Weird behaviors with not operators. Maybe this will answer your question http://wiki.apache.org/solr/FAQ Why does 'foo AND -baz' match docs, but 'foo AND (-bar)' doesn't ? Boolean queries must have at least one "positive" expression (ie; MUST or SHOULD) in order to match. Solr tries to help with this, and if asked to execute a BooleanQuery that does contains only negatived clauses _at the topmost level_, it adds a match all docs query (ie: *:*) If the top level BoolenQuery contains somewhere inside of it a nested BooleanQuery which contains only negated clauses, that nested query will not be modified, and it (by definition) an't match any documents -- if it is required, that means the outer query will not match. More Detail: * https://issues.apache.org/jira/browse/SOLR-80 * https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201006.mbox/%3Cal pine.deb.1.10.1006011609080.29...@radix.cryptio.net%3E Patrick. -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Monday, September 12, 2011 3:04 PM To: solr-user@lucene.apache.org Subject: Re: Weird behaviors with not operators. : I'm crashing into a weird behavior with - operators. I went ahead and added a FAQ on this using some text from a previous nearly identical email ... https://wiki.apache.org/solr/FAQ#Why_does_.27foo_AND_-baz.27_match_docs.2C_b ut_.27foo_AND_.28-bar.29.27_doesn.27t_.3F please reply if you have followup questions. -Hoss
facet.method=fc
Is the parameter facet.method=fc still needed ? Thank you. Patrick.
Solr-3.5.0/Nutch-1.4 - SolrDeleteDuplicates fails
Greetings! This may be a Nutch question and if so, I will repost to the Nutch list. I can run the following commands with Solr-3.5.0/Nutch-1.4: bin/nutch crawl urls -dir crawl -depth 3 -topN 5 then: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/* successfully. But, if I run: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5 It fails with the following messages: SolrIndexer: starting at 2011-12-11 14:01:27 Adding 11 documents SolrIndexer: finished at 2011-12-11 14:01:28, elapsed: 00:00:01 SolrDeleteDuplicates: starting at 2011-12-11 14:01:28 SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/ Exception in thread "main" java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353) at org.apache.nutch.crawl.Crawl.run(Crawl.java:153) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.nutch.crawl.Crawl.main(Crawl.java:55) I am running on Ubuntu 10.10 with 12 GB of memory, Java version 1.6.0_26. I can delete the crawl directory and replicate this error consistently. Suggestions? Other than "...use the way that doesn't fail." ;-) I am concerned that a different invocation of Solr failing consistently represents something that may cause trouble elsewhere when least expected. (And hard to isolate as the problem.) Thanks! Hope everyone is having a great weekend! Patrick PS: From the hadoop log (when it fails) if that's helpful: 2011-12-11 15:21:51,436 INFO solr.SolrWriter - Adding 11 documents 2011-12-11 15:21:52,250 INFO solr.SolrIndexer - SolrIndexer: finished at 2011-12-11 15:21:52, elapsed: 00:00:01 2011-12-11 15:21:52,251 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: starting at 2011-12-11 15:21:52 2011-12-11 15:21:52,251 INFO solr.SolrDeleteDuplicates - SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/ 2011-12-11 15:21:52,330 WARN mapred.LocalJobRunner - job_local_0020 java.lang.NullPointerException at org.apache.hadoop.io.Text.encode(Text.java:388) at org.apache.hadoop.io.Text.set(Text.java:178) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:270) at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat$1.next(SolrDeleteDuplicates.java:241) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:192) at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:176) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:48) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177) -- Patrick Durusau patr...@durusau.net Chair, V1 - US TAG to JTC 1/SC 34 Convener, JTC 1/SC 34/WG 3 (Topic Maps) Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300 Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps) OASIS Technical Advisory Board (TAB) - member Another Word For It (blog): http://tm.durusau.net Homepage: http://www.durusau.net Twitter: patrickDurusau
Re: How to get SolrServer within my own servlet
Have a look here first; you'll probably be using EmbeddedSolrServer. http://wiki.apache.org/solr/Solrj Patrick On 13 Dec 2011, at 20:38, Joey wrote: > Anybody could help? > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-to-get-SolrServer-within-my-own-servlet-tp3583304p3583368.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to get SolrServer within my own servlet
Hey Joey, You should first configure your deployed Solr instance by adding/changing the schema.xml and solrconfig.xml. After that you can use SolrJ to connect to that Solr instance and add documents to it. At the link I posted earlier, you'll find a couple of examples of how to do that. - Patrick Sent from my iPhone On 13 Dec 2011, at 20:53, Joey wrote: > Thanks Patrick for the reply. > > What I did was un-jar solr.war and created my own web application. Now I > want to write my own servlet to index all files inside a folder. > > I suppose there is already solrserver instance initialized when my web app > started. > > How can I access that solr server instance in my servlet? > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-to-get-SolrServer-within-my-own-servlet-tp3583304p3583416.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: blocking access by user-agent
Hi Roland, you can configure Jetty to use a simple .htaccess file to allow only specific IP addresses access to your webapp. Have a look here on how to do that: http://www.viaboxxsystems.de/how-to-configure-your-jetty-webapp-to-grant-access-for-dedicated-ip-addresses-only If you want more sophisticated access control, you need it to be included in an extra layer between Solr and the devices accessing your Solr instance. - Patrick 2011/12/21 RT > Hi, > > I would like to control what applications get access to the solr database. > I am using jetty as the appcontainer. > > Is this at all achievable? If yes, how? > > Internet search has not yielded anything I could use so far. > > Thanks in advance. > > Roland > -- Patrick Plaatje Senior Consultant <http://www.nmobile.nl/>
Re: Searching partial phone numbers
Hi Marotosg, you can index the phone number field with an n-gram based field type, which allows partial (substring) matches on this field without needing wildcards. Have a look here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory Cheers, Patrick 2012/1/19 marotosg > Hi. > I have phone numbers in my solr schema in a field. At the moment i have > this > field as string. > I would like to be able to make searches that find parts of a phone > number. > > For instance: > Number +35384589458 > > search by *+35384* or search by *84589*. > > Do you know if this is posible? > > Thanks a lot > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Searching-partial-phone-numbers-tp3671908p3671908.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Patrick Plaatje Senior Consultant <http://www.nmobile.nl/>
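A sketch of such a field type (field name and gram sizes are illustrative; the KeywordTokenizer keeps the '+' and the whole number together, and grams are only produced at index time):

  <fieldType name="phone_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <!-- index every 3..15 character substring of the number -->
      <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
  </fieldType>

With the number indexed in a field of this type, a plain query for 84589 or +35384 matches without any wildcards, as long as the query string length falls between minGramSize and maxGramSize.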
Re: How to accelerate your Solr-Lucene appication by 4x
Partially agree. If just the facts are given, and not a complete sales talk instead, it'll be fine. Don't overdo it like this though. Cheers, Patrick 2012/1/19 Darren Govoni > I think the occassional "Hey, we made something cool you might be > interested in!" notice, even if commercial, is ok > because it addresses numerous issues we struggle with on this list. > > Now, if it were something completely off-base or unrelated (e.g. male > enhancement pills), then yeah, I agree. > > On 01/18/2012 11:08 PM, Steven A Rowe wrote: > >> Hi Darren, >> >> I think it's rare because it's rare: if this were found to be a useful >> advertising space, rare would cease to be descriptive of it. But I could >> be wrong. >> >> Steve >> >> -Original Message- >>> From: Darren Govoni [mailto:dar...@ontrenet.com] >>> Sent: Wednesday, January 18, 2012 8:40 PM >>> To: solr-user@lucene.apache.org >>> Subject: Re: How to accelerate your Solr-Lucene appication by 4x >>> >>> And to be honest, many people on this list are professionals who not >>> only build their own solutions, but also buy tools and tech. >>> >>> I don't see what the big deal is if some clever company has something of >>> imminent value here to share it. Considering that its a rare event. >>> >>> On 01/18/2012 08:28 PM, Jason Rutherglen wrote: >>> >>>> Steven, >>>> >>>> If you are going to admonish people for advertising, it should be >>>> equally dished out or not at all. >>>> >>>> On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe wrote: >>>> >>>>> Hi Peter, >>>>> >>>>> Commercial solicitations are taboo here, except in the context of a >>>>> >>>> request for help that is directly relevant to a product or service. >>> >>>> Please don’t do this again. >>>>> >>>>> Steve Rowe >>>>> >>>>> From: Peter Velikin [mailto:pe...@velobit.com] >>>>> Sent: Wednesday, January 18, 2012 6:33 PM >>>>> To: solr-user@lucene.apache.org >>>>> Subject: How to accelerate your Solr-Lucene appication by 4x >>>>> >>>>> Hello Solr users, >>>>> >>>>> Did you know that you can boost the performance of your Solr >>>>> >>>> application using your existing servers? All you need is commodity SSD >>> and >>> plug-and-play software like VeloBit. >>> >>>> At ZoomInfo, a leading business information provider, VeloBit increased >>>>> >>>> the performance of the Solr-Lucene-powered application by 4x. >>> >>>> I would love to tell you more about VeloBit and find out if we can >>>>> >>>> deliver same business benefits at your company. Click >>> here<http://www.velobit.com/**15-minute-brief<http://www.velobit.com/15-minute-brief>> >>> for a 15-minute >>> briefing<http://www.velobit.**com/15-minute-brief<http://www.velobit.com/15-minute-brief>> >>> on the VeloBit >>> technology. 
>>> >>>> Here is more information on how VeloBit helped ZoomInfo: >>>>> >>>>> * Increased Solr-Lucene performance by 4x using existing servers >>>>> >>>> and commodity SSD >>> >>>> * Installed VeloBit plug-and-play SSD caching software in 5-minutes >>>>> >>>> transparent to running applications and storage infrastructure >>> >>>> * Reduced by 75% the hardware and monthly operating costs required >>>>> >>>> to support service level agreements >>> >>>> Technical Details: >>>>> >>>>> * Environment: Solr‐Lucene indexed directory search service fronted >>>>> >>>> by J2EE web application technology >>> >>>> * Index size: 600 GB >>>>> * Number of items indexed: 50 million >>>>> * Primary storage: 6 x SAS HDD >>>>> * SSD Cache: VeloBit software + OCZ Vertex 3 >>>>> >>>>> Click >>>>> here<http://www.velobit.com/**use-cases/enterprise-search/<http://www.velobit.com/use-cases/enterprise-search/>> >>>>> to >>>>> >>>> read more about the ZoomInfo Solr-Lucene case >>> study<http://www.velobit.com/**us
Re: Problem instantiating CommonsHttpSolrServer using solrj
I went through jar hell yesterday. I finally got Solrj working. http://jarfinder.com was a big help. Rock on, PLA Patrick L Archibald http://patrickarchibald.com On Fri, Aug 13, 2010 at 7:25 PM, Chris Hostetter wrote: > > : I get the following runtime error: > : > : Exception in thread "main" java.lang.NoClassDefFoundError: > : org/apache/solr/client/solrj/SolrServerException > : Caused by: java.lang.ClassNotFoundException: > : org.apache.solr.client.solrj.SolrServerException > ... > : I am following the this link : http://wiki.apache.org/solr/Solrj ,and > : have included all the jar files specified there, in the classpath. > > Are you certain? > > the class it can't find is > org.apache.solr.client.solrj.SolrServerException which is definitely in > the apache-solr-solrj-*.jar > > did you perchance copy the list of jars verbatim from that wiki? because > someone seems to have made a typo and called it "solr-solrj-1.4.0.jar" > instead of "apache-solr-solrj-1.4.0.jar" but if you actually *look* at the > jars available, it's pretty obvious. > > > -Hoss > >
Limitations of prohibited clauses in sub-expression - pure negative query
I can't find the answer: is this problem solved in Solr 1.4.1? Thx for your answers.
RE: Limitations of prohibited clauses in sub-expression - pure negative query
Maybe SOLR-80 jira issue ? As written in Solr 1.4 book; "pure negative query doesn't work correctly ." you have to add 'AND *:* ' thx From: Patrick Sauts [mailto:patrick.via...@gmail.com] Sent: mardi 28 septembre 2010 11:53 To: 'solr-user@lucene.apache.org' Subject: Limitations of prohibited clausses in sub-expression - pure negative query I can find the answer but is this problem solved in Solr 1.4.1 ? Thx for your answers.
DataDirectory: relative path doesn't work
I am running Solr 4.0/Tomcat 7 on CentOS 6. According to this page http://wiki.apache.org/solr/SolrConfigXml if dataDir is not absolute, then it is relative to the instanceDir of the SolrCore. However, the index directory is always created under the directory where I start Tomcat (startup.sh) rather than under the instanceDir of the SolrCore. Am I doing something wrong in the configuration? Regards, Patrick
RE: DataDirectory: relative path doesn't work
Thanks for fixing the wiki page http://wiki.apache.org/solr/SolrConfigXml now it says this: 'If this directory is not absolute, then it is relative to the directory you're in when you start SOLR.' It will be nice if you drop me a line here after you make the change on the document ... -Original Message- From: Patrick Mi [mailto:patrick...@touchpointgroup.com] Sent: Tuesday, 26 February 2013 5:49 p.m. To: solr-user@lucene.apache.org Subject: DataDirectory: relative path doesn't work I am running Solr4.0/Tomcat 7 on Centos6 According to this page http://wiki.apache.org/solr/SolrConfigXml if is not absolute, then it is relative to the instanceDir of the SolrCore. However the index directory is always created under the directory where I start the Tomcat (startup.sh) rather than under instanceDir of the SolrCore. Am I doing something wrong in configuration? Regards, Patrick
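Given that relative paths resolve against the startup directory, the usual workaround is an absolute or property-based dataDir in solrconfig.xml; for example (the path and property name are illustrative):

  <dataDir>${solr.data.dir:/var/solr/data/core0}</dataDir>

The ${property:default} form lets you override the location per deployment with a system property while still having a sane absolute default.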
SolrCloud with Zookeeper ensemble : fail to restart master server
Hi there, I have experienced some problems starting the master server. Solr 4.2 under Tomcat 7 on CentOS 6. Configuration: 3 Solr instances running on different machines, one shard, 3 cores, 2 replicas, using the Zookeeper that comes with Solr. The master server A has the following run options: -Dbootstrap_conf=true -DzkRun -DnumShards=1. The slave servers B and C have: -DzkHost=masterServerIP:2181. It works well for add/update/delete etc. after I start up the master and slave servers in order. When master A is up, stopping/starting slaves B and C is OK. When slaves B and C are running I couldn't restart master A. Only after I shut down B and C can I start master A. Is this a feature, a bug, or something I haven't configured properly? Thanks in advance for your help. Regards, Patrick
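For clarity, the startup options described above, e.g. passed to Tomcat via CATALINA_OPTS (how exactly the flags are set is an assumption; the flags themselves are from the post):

  # master A: runs embedded ZooKeeper, uploads the config, one shard
  CATALINA_OPTS="$CATALINA_OPTS -Dbootstrap_conf=true -DzkRun -DnumShards=1"

  # slaves B and C: point at the ZooKeeper embedded in A
  CATALINA_OPTS="$CATALINA_OPTS -DzkHost=masterServerIP:2181"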
Re: is there any practice to load index into RAM to accelerate solr performance?
A start may be to use a RAM disk for that. Mount it as a normal disk and have the index files stored there. Have a read here: http://en.wikipedia.org/wiki/RAM_disk Cheers, Patrick 2012/2/8 Ted Dunning > This is true with Lucene as it stands. It would be much faster if there > were a specialized in-memory index such as is typically used with high > performance search engines. > > On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog wrote: > > > Experience has shown that it is much faster to run Solr with a small > > amount of memory and let the rest of the ram be used by the operating > > system "disk cache". That is, the OS is very good at keeping the right > > disk blocks in memory, much better than Solr. > > > > How much RAM is in the server and how much RAM does the JVM get? How > > big are the documents, and how large is the term index for your > > searches? How many documents do you get with each search? And, do you > > use filter queries- these are very powerful at limiting searches. > > > > 2012/2/7 James : > > > Is there any practice to load index into RAM to accelerate solr > > performance? > > > The over all documents is about 100 million. The search time around > > 100ms. I am seeking some method to accelerate the respond time for solr. > > > Just check that there is some practice use SSD disk. And SSD is also > > cost much, just want to know is there some method like to load the index > > file in RAM and keep the RAM index and disk index synchronized. Then I can > > search on the RAM index. > > > > > > > > -- > > Lance Norskog > > goks...@gmail.com > > > -- Patrick Plaatje Senior Consultant <http://www.nmobile.nl/>
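A minimal sketch of the RAM-disk idea on Linux (mount point, size and index location are assumptions; tmpfs contents are lost on reboot, so a copy on real disk is still needed):

  mkdir -p /mnt/ramdisk
  mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk
  cp -r /var/solr/data/index /mnt/ramdisk/index
  # then point the core's dataDir (or a symlink) at /mnt/ramdisk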
Show SQL-DIH datasource name in result list
Hey, does somebody know if there is a command option in Solr to show which datasource provided the result? Or in other words: is it possible to output in the result the name given in the <dataSource> or <entity> tag? Let me explain: I'm using the SQL-DIH with a lot of datasources and several entities. Every datasource has a name, and every entity, too. Now in the result list I would need to know which table/datasource a result came from. Thanks, Patrick
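For context, a stripped-down data-config.xml of the kind of setup being described (names, URLs and SQL are made up); the question is how to get a name like "ds1" or "products" back into each search result:

  <dataConfig>
    <dataSource name="ds1" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://host/db1" user="solr" password="secret"/>
    <dataSource name="ds2" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://host/db2" user="solr" password="secret"/>
    <document>
      <entity name="products" dataSource="ds1" query="select id, title from products"/>
      <entity name="articles" dataSource="ds2" query="select id, title from articles"/>
    </document>
  </dataConfig>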
Solr 1.4.1 and carrot2 clustering
Dear all, I really enjoy using Solr so far. During the last days I tried to activate the ClusteringComponent in Solr as indicated here http://wiki.apache.org/solr/ClusteringComponent and copied all the relevant java libraries in the WEB-INF/lib folder of my tomcat installation of Solr. But everytime I try to issue a request to my Solr server using http://localhost:9005/apache-solr-1.4.1/job0/select?q=*:*&fl=title,score,url&start=0&rows=100&indent=on&clustering=true I get the following error message: java.lang.NoClassDefFoundError: bak/pcj/set/IntSet at org.carrot2.text.preprocessing.PreprocessingPipeline.(PreprocessingPipeline.java:47) at org.carrot2.clustering.lingo.LingoClusteringAlgorithm.(LingoClusteringAlgorithm.java:108) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at java.lang.Class.newInstance0(Class.java:355) at java.lang.Class.newInstance(Class.java:308) at org.carrot2.util.pool.SoftUnboundedPool.borrowObject(SoftUnboundedPool.java:114) at org.carrot2.core.CachingController.borrowProcessingComponent(CachingController.java:329) Hence I have downloaded the corresponding pcj-1.2.jar providing the interface "bak.pcj.set.IntSet" and I have also put it in the WEB-INF/lib folder But I still keep getting this error message though the corresponding interface MUST be on the classpath now. Can anyone help me out with this one? I'm really eager to give this clustering extension a try from within Solr using the 1.4.1 version that I have already running on my server. Thanks for a brief feedback. Best regards, Patrick
Re: [POLL] Where do you get Lucene/Solr from? Maven? ASF Mirrors?
[X] ASF Mirrors (linked in our release announcements or via the Lucene website) [] Maven repository (whether you use Maven, Ant+Ivy, Buildr, etc.) [] I/we build them from source via an SVN/Git checkout. [] Other (someone in your company mirrors them internally or via a downstream project)
solr 1.4.1 -> 3.6.1; SOLR-758
Regarding https://issues.apache.org/jira/browse/SOLR-758 (Enhance DisMaxQParserPlugin to support full-Solr syntax and to support alternate escaping strategies.) I'm updating from Solr 1.4.1 to 3.6.1 (I'm aware that it is not beautiful). After applying the attached patches to 3.6.1 I'm experiencing this problem: - SEVERE: org.apache.solr.common.SolrException: Error Instantiating QParserPlugin, org.apache.solr.search.AdvancedQParserPlugin is not a org.apache.solr.search.QParserPlugin at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:421) at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:441) at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1612) [...] These patches seem to be no longer valid. Which leads me to the more experienced users here: - Although not directly mentioned in https://issues.apache.org/jira/browse/SOLR-758, is there any other (new) QParser which obsoletes the DisMax? - Furthermore I tried to make the patches apply ("forward porting"), but always get the error "Error Instantiating QParserPlugin, org.apache.solr.search.AdvancedQParserPlugin is not a org.apache.solr.search.QParserPlugin", although the class dependency is linear: ./core/src/java/org/apache/solr/search/AdvancedQParserPlugin.java: [...] public class AdvancedQParserPlugin extends DisMaxQParserPlugin { [...] ./core/src/java/org/apache/solr/search/DisMaxQParserPlugin.java: [...] public class DisMaxQParserPlugin extends QParserPlugin { [...] Thanks, Patrick
Re: Solr3.6 DeleteByQuery not working with negated query
Hi Markus, Why do you think it's not deleting anything? Thanks, Patrick On 22 Oct 2012 08:36, "Markus.Mirsberger" wrote: > Hi, > > I am trying to delete some documents in my index by query. > When I just select them with this negated query, I get all the documents I > want to delete, but when I use this query in the DeleteByQuery it is not > working. > I'm trying to delete all elements whose value ends with 'somename/'. > When I use this for selection it works and I get exactly the right > documents (about 10.000, so too many to delete one by one :) ) > > curl http://:8080/solr/core/update/?commit=true -H > "Content-Type: text/xml" --data-binary '<delete><query>-field:*somename/</query></delete>'; > > And here the response: > > <int name="status">0</int><int name="QTime">11091</int> > > I tried to perform it in the browser too by using /update?stream.body ... > but the result is the same. > And no error in the Solr log. > > I hope someone can help me ... I don't want to do this manually :) > > Regards, > Markus >
Re: Solr3.6 DeleteByQuery not working with negated query
Did you make sure to commit after the delete? Patrick On 22 Oct 2012 08:43, "Markus.Mirsberger" wrote: > Hi, Patrick, > > Because I have the same amount of documents in my index as before I > performed the query. > And when I use the negated query just to select the documents I can see > they are still there (and of course all other documents too :) ) > > Regards, > Markus > > > > > On 22.10.2012 14:38, Patrick Plaatje wrote: > >> Hi Markus, >> >> Why do you think it's not deleting anything? >> >> Thanks, >> Patrick >> On 22 Oct 2012 08:36, "Markus.Mirsberger" < >> markus.mirsber...@gmx.de> >> wrote: >> >> Hi, >>> >>> I am trying to delete some documents in my index by query. >>> When I just select them with this negated query, I get all the documents >>> I >>> want to delete, but when I use this query in the DeleteByQuery it is not >>> working. >>> I'm trying to delete all elements whose value ends with 'somename/'. >>> When I use this for selection it works and I get exactly the right >>> documents (about 10.000, so too many to delete one by one :) ) >>> >>> curl http://:8080/solr/core/update/?commit=true -H >>> "Content-Type: text/xml" --data-binary '<delete><query>-field:*somename/</query></delete>'; >>> >>> And here the response: >>> >>> <int name="status">0</int><int name="QTime">11091</int> >>> >>> I tried to perform it in the browser too by using /update?stream.body >>> ... >>> but the result is the same. >>> And no error in the Solr log. >>> >>> I hope someone can help me ... I don't want to do this manually :) >>> >>> Regards, >>> Markus >>> >>> >
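For reference, the delete being discussed with the commit sent as a separate, explicit call (host, core and field are taken from the thread and partly placeholders):

  curl 'http://localhost:8080/solr/core/update' -H 'Content-Type: text/xml' \
       --data-binary '<delete><query>-field:*somename/</query></delete>'
  curl 'http://localhost:8080/solr/core/update' -H 'Content-Type: text/xml' \
       --data-binary '<commit/>'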
Re: Problems with WordDelimiterFilterFactory
Hi Bern, the problem is the character sequence "--". A query must not contain consecutive minus characters. Remove one minus character and the query will be parsed without problems. Because of this parsing problem, I'd recommend cleaning up the query before submitting it to the Solr server, replacing each sequence of minus characters with a single one (see the small cleanup sketch after this message). Regards, Patrick Bernadette Houghton wrote: > Sorry, the last line was truncated - > > HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse '(Asia -- Civilization AND status_i:(2)) ': Encountered "-" at line 1, column 7. Was expecting one of: "(" ... "*" ... ... ... > ... ... "[" ... "{" ... ... > > -Original Message- > From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au] > Sent: Friday, 9 October 2009 8:22 AM > To: 'solr-user@lucene.apache.org' > Subject: RE: Problems with WordDelimiterFilterFactory > > Here's the query and the error - > > Oct 09 08:20:17 [debug] [196] Solr query string:(Asia -- Civilization > AND status_i:(2)) > Oct 09 08:20:17 [debug] [196] Solr sort by: score desc > Oct 09 08:20:17 [error] Error on searching: "400" Status: > org.apache.lucene.queryParser.ParseException: Cannot parse ' (Asia -- > Civilization AND status_i:(2)) ': Encount > > Bern > > -Original Message- > From: Christian Zambrano [mailto:czamb...@gmail.com] > Sent: Thursday, 8 October 2009 12:48 PM > To: solr-user@lucene.apache.org > Cc: solr-user@lucene.apache.org > Subject: Re: Problems with WordDelimiterFilterFactory > > Bern, > > I am interested on the solr query. In other words, the query that your > system sends to solr. > > Thanks, > > > Christian > > On Oct 7, 2009, at 5:56 PM, Bernadette Houghton > > wrote: > >> Hi Christian, try this one - http://www.deakin.edu.au/dro/view/DU:3601 >> >> Either scroll down and click one of the "television broadcasting -- >> asia" links, or type it in the Quick Search box. >> >> >> TIA >> >> bern >> >> -Original Message- >> From: Christian Zambrano [mailto:czamb...@gmail.com] >> Sent: Thursday, 8 October 2009 9:43 AM >> To: solr-user@lucene.apache.org >> Subject: Re: Problems with WordDelimiterFilterFactory >> >> Could you please provide the exact URL of a query where you are >> experiencing this problem? >> eg(Not URL encoded): q=fieldName:"hot and cold: temperatures" >> >> On 10/07/2009 05:32 PM, Bernadette Houghton wrote: >>> We are having some issues with our solr parent application not >>> retrieving records as expected. >>> >>> For example, if the input query includes a colon (e.g. hot and >>> cold: temperatures), the relevant record (which contains a colon in >>> the same place) does not get retrieved; if the input query does not >>> include the colon, all is fine. Ditto if the user searches for a >>> query containing hyphens, e.g. "asia - civilization, although with >>> the qualifier that something like "asia-civilization" (no spaces >>> either side of the hyphen) works fine, whereas "asia - >>> civilization" (spaces either side of hyphen) doesn't work. 
>>> >>> Our schema.xml contains the following - >>> >>> >> positionIncrementGap="100"> >>> >>> >>> >>> >> class="solr.ISOLatin1AccentFilterFactory"/> >>> >> words="stopwords.txt"/> >>> >> generateWordParts="1" generateNumberParts="1" catenateWords="1" >>> catenateNumbers="1" catenateAll="0"/> >>> >>> >> protected="protwords.txt"/> >>> >>> >>> >>> >>> >> class="solr.ISOLatin1AccentFilterFactory"/> >>> >> synonyms="synonyms.txt" ignoreCase="true" expand="true"/> >>> >> words="stopwords.txt"/> >>> >> generateWordParts="1" generateNumberParts="1" catenateWords="0" >>> catenateNumbers="0" catenateAll="0"/> >>> >>> >> protec
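A minimal sketch of the query cleanup Patrick suggests above, collapsing runs of minus characters before the query string is sent to Solr (plain Java; the method name is made up):

  // "Asia -- Civilization" trips up the query parser, "Asia - Civilization" does not
  public static String cleanQuery(String q) {
      if (q == null) return null;
      return q.replaceAll("-{2,}", "-");
  }

  // cleanQuery("(Asia -- Civilization AND status_i:(2))")
  //   -> "(Asia - Civilization AND status_i:(2))"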
multi-word synonyms and analysis.jsp vs real field analysis (query, index)
Hi list, I worked on a field type and its analyzing chain, at which I want to use the SynonymFilter with entries similar to: foo bar=>foo_bar During the analysis phase, I used the /admin/analysis.jsp view to test the analyzing results produced by the created field type. The output shows that a query "foo bar" will first be separated by the WhitespaceTokenizer to the two tokens "foo" and "bar", and that the SynonymFilter will replace the both tokens with "foo_bar". But as I tried this at "real" query time with the request handler "standard" and also with "dismax", the tokens "foo" and "bar" were not replaced. The parsedQueryString was something similar to "field:foo field:bar". At index time, it works like expected. Has anybody experienced this and/or knows a workaround, a solution for it? Thanks, Patrick
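For context, a trimmed field type of the kind described, with the SynonymFilter sitting behind the whitespace tokenizer (the class names are the stock Solr ones; the real chain may differ):

  <fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  # synonyms.txt
  foo bar=>foo_bar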
Re: multi-word synonyms and analysis.jsp vs real field analysis (query, index)
Hi Koji, using phrase queries is no alternative for us, because all query parts has to be optional parts. The phrase query workaround will work for a query "foo bar", but only for this exact query. If the user queries for "foo bar baz", it will be changed to "foo_bar baz", but it will not match the indexed documents that only contains "foo_bar". And this is, what we need here. The cause of my problem should be the query parsing, but I don't know, if there is any solution for it. I need a possibility that works like the analysis/query parsing within /admin/analysis.jsp view. Patrick Koji Sekiguchi schrieb: > Patrick, > >> parsedQueryString was something similar to "field:foo field:bar". At >> index time, it works like expected. > > I guess because you are searching q=foo bar, this causes OR query. > Use q="foo bar", instead. > > Koji > > > Patrick Jungermann wrote: >> Hi list, >> >> I worked on a field type and its analyzing chain, at which I want to use >> the SynonymFilter with entries similar to: >> >> foo bar=>foo_bar >> >> During the analysis phase, I used the /admin/analysis.jsp view to test >> the analyzing results produced by the created field type. The output >> shows that a query "foo bar" will first be separated by the >> WhitespaceTokenizer to the two tokens "foo" and "bar", and that the >> SynonymFilter will replace the both tokens with "foo_bar". But as I >> tried this at "real" query time with the request handler "standard" and >> also with "dismax", the tokens "foo" and "bar" were not replaced. The >> parsedQueryString was something similar to "field:foo field:bar". At >> index time, it works like expected. >> >> Has anybody experienced this and/or knows a workaround, a solution for >> it? >> >> >> Thanks, Patrick >> >> >> >> >> >> >> >
Re: multi-word synonyms and analysis.jsp vs real field analysis (query, index)
Hi Chantal, yes, I'm using the SynonymFilter at index and query chain. Using it only at query time or only at index time was part of former considerations, but both don't fit all of our requirements. But as I wrote in my first mail, it works only within the /admin/analysis.jsp view and not at "real" query time. Patrick Chantal Ackermann schrieb: > Hi Patrick, > > have you added that SynonymFilter to the index chain and the query > chain? You have to add it to both if you want to have it replaced at > index and query time. It might also be enough to add it to the query > chain only. Than your index still preserves the original data. > > Cheers, > Chantal > > Patrick Jungermann schrieb: >> Hi Koji, >> >> using phrase queries is no alternative for us, because all query parts >> has to be optional parts. The phrase query workaround will work for a >> query "foo bar", but only for this exact query. If the user queries for >> "foo bar baz", it will be changed to "foo_bar baz", but it will not >> match the indexed documents that only contains "foo_bar". And this is, >> what we need here. >> >> The cause of my problem should be the query parsing, but I don't know, >> if there is any solution for it. I need a possibility that works like >> the analysis/query parsing within /admin/analysis.jsp view. >> >> >> Patrick >> >> >> >> Koji Sekiguchi schrieb: >>> Patrick, >>> >>>> parsedQueryString was something similar to "field:foo field:bar". At >>>> index time, it works like expected. >>> I guess because you are searching q=foo bar, this causes OR query. >>> Use q="foo bar", instead. >>> >>> Koji >>> >>> >>> Patrick Jungermann wrote: >>>> Hi list, >>>> >>>> I worked on a field type and its analyzing chain, at which I want to >>>> use >>>> the SynonymFilter with entries similar to: >>>> >>>> foo bar=>foo_bar >>>> >>>> During the analysis phase, I used the /admin/analysis.jsp view to test >>>> the analyzing results produced by the created field type. The output >>>> shows that a query "foo bar" will first be separated by the >>>> WhitespaceTokenizer to the two tokens "foo" and "bar", and that the >>>> SynonymFilter will replace the both tokens with "foo_bar". But as I >>>> tried this at "real" query time with the request handler "standard" and >>>> also with "dismax", the tokens "foo" and "bar" were not replaced. The >>>> parsedQueryString was something similar to "field:foo field:bar". At >>>> index time, it works like expected. >>>> >>>> Has anybody experienced this and/or knows a workaround, a solution for >>>> it? >>>> >>>> >>>> Thanks, Patrick >>>> >>>> >>>> >>>> >>>> >>>> >>>> >
query highlighting
Hi list, is there any possibility to get highlighting also for the query string? Example: Query: fooo bar Tokens after query analysis: foo[0,4], bar[5,8] Token "foo" matches a token of one of the queried fields. -> Query highlighting: "fooo" Thanks, Patrick
Re: multi-word synonyms and analysis.jsp vs real field analysis (query, index)
Hi Koji, the problem is, that this doesn't fit all of our requirements. We have some Solr documents that must not be matched by "foo" or "bar" but by "foo bar" as part of the query. Also, we have some other documents that could be matched by "foo" and "foo bar" or "bar" and "foo bar". The best way to handle this, seems to be by using synonyms that allows the precise configuration of this and that could be managed by an editorial staff. Besides, foo bar=>foo_bar works at anything (index time, analysis.jsp) but query time. Patrick Koji Sekiguchi schrieb: > Hi Patrick, > > Why don't you define: > > foo bar, foo_bar (and expand="true") > > instead of: > > foo bar=>foo_bar > > in only indexing side? Doesn't it make a change for the better? > > Koji > > > Patrick Jungermann wrote: >> Hi Koji, >> >> using phrase queries is no alternative for us, because all query parts >> has to be optional parts. The phrase query workaround will work for a >> query "foo bar", but only for this exact query. If the user queries for >> "foo bar baz", it will be changed to "foo_bar baz", but it will not >> match the indexed documents that only contains "foo_bar". And this is, >> what we need here. >> >> The cause of my problem should be the query parsing, but I don't know, >> if there is any solution for it. I need a possibility that works like >> the analysis/query parsing within /admin/analysis.jsp view. >> >> >> Patrick >> >> >> >> Koji Sekiguchi schrieb: >> >>> Patrick, >>> >>> >>>> parsedQueryString was something similar to "field:foo field:bar". At >>>> index time, it works like expected. >>>> >>> I guess because you are searching q=foo bar, this causes OR query. >>> Use q="foo bar", instead. >>> >>> Koji >>> >>> >>> Patrick Jungermann wrote: >>> >>>> Hi list, >>>> >>>> I worked on a field type and its analyzing chain, at which I want to >>>> use >>>> the SynonymFilter with entries similar to: >>>> >>>> foo bar=>foo_bar >>>> >>>> During the analysis phase, I used the /admin/analysis.jsp view to test >>>> the analyzing results produced by the created field type. The output >>>> shows that a query "foo bar" will first be separated by the >>>> WhitespaceTokenizer to the two tokens "foo" and "bar", and that the >>>> SynonymFilter will replace the both tokens with "foo_bar". But as I >>>> tried this at "real" query time with the request handler "standard" and >>>> also with "dismax", the tokens "foo" and "bar" were not replaced. The >>>> parsedQueryString was something similar to "field:foo field:bar". At >>>> index time, it works like expected. >>>> >>>> Has anybody experienced this and/or knows a workaround, a solution for >>>> it? >>>> >>>> >>>> Thanks, Patrick >>>> >>>> >>>> >>>> >>>> >>>> >>>> >> >> >> >
Re: multi-word synonyms and analysis.jsp vs real field analysis (query, index)
Thanks Hoss, after your hints, which partially confirmed my considerations, I made some tests with the FieldQParser. At the beginning I had some problems, but finally I was able to solve the problem of multi-word synonyms at query time in a way that is suitable for us - and possibly for others, too. For my solution, I re-used the FieldQParserPlugin. At first I ported it to the new API (incrementToken instead of next, etc.) and then I modified the code so that no PhraseQueries are created, but only BooleanQueries. Now, with my new QParserPlugin based on the FieldQParserPlugin, it's possible to search for things like "foo bar baz", where "foo bar" has to be changed to "foo_bar" and where at the end the tokens "foo_bar" and "baz" will be created, so that both can match independently. Patrick Chris Hostetter wrote: > : The cause of my problem should be the query parsing, but I don't know, > : if there is any solution for it. I need a possibility that works like > : the analysis/query parsing within /admin/analysis.jsp view. > > The behavior you are describing is very well documented on the wiki... > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory > > in general, QueryParsers parse input strings according to their > parsing rules, then send each component of the input string to the > analyzer. this is a fundamental behavior, w/o it the query parser would > have no way of knowing when to make a phrase query, or a term query, or > which field to use. > > You may find something like the FieldQParserPlugin helpful as it has *no* > markup of its own, it just hands the string off to an analyzer based on > the specified field ... but it will still generate a phrase query when a > single piece of input generates multiple tokens with non-zero offsets from > each other, which also confuses people sometimes (not sure if that's what > you'd want) > > : >> SynonymFilter will replace the both tokens with "foo_bar". But as I > : >> tried this at "real" query time with the request handler "standard" and > > you've used the phrase '"real" query time' (in contrast to analysis.jsp) a > few times in this thread ... to be clear about something: there is nothing > different between analysis.jsp and what happens when a query is executed, > the reason you see different behavior is because you are pasting what > you consider a "query string" into the analysis form, but that's not what > happens at query time, and it's not what that form expects -- that form is > designed for users to paste in the strings that the query parser would > extract from its query syntax. it's not surprising that you'll get > something different than if you just did a straight search on the same > input, any more than it would be surprising if pasting > "fieldname:value +otherfield:value" in analysis.jsp didn't produce the > same tokens as a query for that string. > > > -Hoss
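A rough sketch of that idea against the Solr 1.4/3.x-era API (class name and details are made up, error handling is minimal, and like the stock field parser it expects the field via local params, e.g. {!bool_field f=myfield}foo bar baz): the field's query analyzer is run over the raw input and every resulting token becomes an optional boolean clause instead of a position in a PhraseQuery.

  import java.io.IOException;
  import java.io.StringReader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.queryParser.ParseException;
  import org.apache.lucene.search.BooleanClause;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TermQuery;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.common.util.NamedList;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.search.QParser;
  import org.apache.solr.search.QParserPlugin;
  import org.apache.solr.search.QueryParsing;

  public class BooleanFieldQParserPlugin extends QParserPlugin {
    public void init(NamedList args) {}

    public QParser createParser(String qstr, SolrParams localParams,
                                SolrParams params, SolrQueryRequest req) {
      return new QParser(qstr, localParams, params, req) {
        public Query parse() throws ParseException {
          String field = localParams.get(QueryParsing.F);        // field to search
          String text  = localParams.get(QueryParsing.V, qstr);  // raw query text
          Analyzer analyzer = req.getSchema().getFieldType(field).getQueryAnalyzer();
          BooleanQuery bq = new BooleanQuery();
          try {
            TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
            TermAttribute term = ts.addAttribute(TermAttribute.class);
            ts.reset();
            // one optional clause per token, so "foo_bar" and "baz" can match independently
            while (ts.incrementToken()) {
              bq.add(new TermQuery(new Term(field, term.term())), BooleanClause.Occur.SHOULD);
            }
            ts.end();
            ts.close();
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
          return bq;
        }
      };
    }
  }

The plugin would then be registered in solrconfig.xml under a name of your choosing, e.g. <queryParser name="bool_field" class="com.example.BooleanFieldQParserPlugin"/>.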
Re: Solrj Javabin and JSON
Hi Stefan, you don't need to convert the Java objects built from the result returned as Javabin. Instead of this, you could easily use the JSON return format by setting "wt=json". See also at [0] for more information about this. Patrick [0] http://wiki.apache.org/solr/SolJSON SGE0 schrieb: > Hi Paul, > > > fair enough. Is this included in the Solrj package ? Any examples how to do > this ? > > > Stefan > > > > Noble Paul നോബിള് नोब्ळ्-2 wrote: >> There is no point converting javabin to json. javabin is in >> intermediate format it is converted to the java objects as soon as >> comes. You just need means to convert the java object to json. >> >> >> >> On Sat, Oct 24, 2009 at 12:10 PM, SGE0 wrote: >>> Hi, >>> >>> did anyone write a Javabin to JSON convertor and is willing to share this >>> ? >>> >>> In our servlet we use a CommonsHttpSolrServer instance to execute a >>> query. >>> >>> The problem is that is returns Javabin format and we need to send the >>> result >>> back to the browser using JSON format. >>> >>> And no, the browser is not allowed to directly query Lucene with the >>> wt=json >>> format. >>> >>> Regards, >>> >>> S. >>> -- >>> View this message in context: >>> http://www.nabble.com/Solrj-Javabin-and-JSON-tp26036551p26036551.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> >> -- >> - >> Noble Paul | Principal Engineer| AOL | http://aol.com >> >> >
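For example, the servlet could request the JSON representation directly over HTTP instead of converting SolrJ's Java objects (host, port and query are placeholders):

  http://localhost:8080/solr/select?q=foo&wt=json&json.nl=map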
Re: Multi-Term Synonyms
Hi Brad, I was trying this, too, and there is a way to get multi-term synonyms to work properly. I already wrote my solution on this list. My solution was as follows: [cite] after your hints, which partially confirmed my considerations, I made some tests with the FieldQParser. At the beginning I had some problems, but finally I was able to solve the problem of multi-word synonyms at query time in a way that is suitable for us - and possibly for others, too. For my solution, I re-used the FieldQParserPlugin. At first I ported it to the new API (incrementToken instead of next, etc.) and then I modified the code so that no PhraseQueries are created, but only BooleanQueries. Now, with my new QParserPlugin based on the FieldQParserPlugin, it's possible to search for things like "foo bar baz", where "foo bar" has to be changed to "foo_bar" and where at the end the tokens "foo_bar" and "baz" will be created, so that both can match independently. [/cite] Our current version has been reworked again so that multi-field queries are also possible. If you want to use such a solution, you probably have to go without complex query parsing et cetera. You also have to write your own modified QParser that fits your special needs. Some higher-level features, like those offered by other QParsers, could also be integrated. It's all up to you and your needs. Patrick brad anderson wrote: > Thanks for the help. Can't believe I missed that part in the wiki. > > 2009/11/24 Tom Hill > >> Hi Brad, >> >> >> I suspect that this section from the wiki for SynonymFilterFactory might be >> relevant: >> >> >> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory >> >> "Keep in mind that while the SynonymFilter will happily work with synonyms >> containing multiple words (ie: "sea biscuit, sea biscit, seabiscuit") >> The recommended approach for dealing with synonyms like this, is to expand >> the synonym when indexing. This is because there are two potential issues >> that can arrise at query time: >> >> 1. >> >> The Lucene QueryParser tokenizes on white space before giving any text >> to the Analyzer, so if a person searches for the words sea biscit the >> analyzer will be given the words "sea" and "biscit" seperately, and will >> not >> know that they match a synonym." >> >> ... >> >> Tom >> >> On Tue, Nov 24, 2009 at 10:47 AM, brad anderson >> wrote: >>> Hi Folks, >>> >>> I was trying to get multi term synonyms to work. I'm experiencing some >>> strange behavior and would like some feedback. >>> >>> In the synonyms file I have the line: >>> >>> thomas, boll holly, thomas a, john q => tom >>> >>> And I have a document with the text field as; >>> >>> tom >>> >>> However, when I do a search on boll holly, it does not return the >> document >>> with tom. The same thing happens if I do a query on john q. But if I do a >>> query on thomas, it gives me the document. Also, if I quote "boll holly" >> or >>> "john q" it gives back the document. >>> >>> When I look at the analyzer page on the solr admin page, it is >> transforming >>> "boll holly" to "tom" when it isn't quoted. Why is it that it is not >>> returning the document? Is there some configuration I can make so it does >>> return the document if I do an unquoted search on "boll holly"? 
>>> >>> My synonym filter is defined as follows, and is only defined on the query >>> side: >>> >>> >> synonyms="synonyms.txt" ignoreCase="true" expand="true"/> >>> >>> >>> I've also tried changing the synonym file to be >>> >>> tom, thomas, boll holly, thomas a, john q >>> >>> This produces the same results. >>> >>> Thanks, >>> Brad >>> >
Re: synonyms
Hello Peter, by using the existing SynonymFilterFactory, it is not possible to use a database instead of a text file. This file will be read at startup and the internal synonym catalogue (SynonymMap) will be created. You could create your own filter factory that could create the needed synonym catalogue by using a database. Look into the SynonymFilterFactory and the SynonymFilter and you could get this to work. As another possibility, you could create the needed synonym text file by a script or something else, before the startup of Solr server. This could probably be the easiest way. -Patrick Peter A. Kirk schrieb: > Hi > > > > It appears that Solr reads a synonym list at startup from a text file. > > Is it possible to alter this behaviour so that Solr obtains the synonym list > from a database instead? > > > > Thanks, > > Peter > >
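A small sketch of the second suggestion, dumping synonyms from a database into synonyms.txt before Solr is started (driver, connection details, table and column are made up):

  import java.io.FileWriter;
  import java.io.PrintWriter;
  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class SynonymFileExporter {
    public static void main(String[] args) throws Exception {
      Connection con = DriverManager.getConnection(
          "jdbc:mysql://localhost/search", "user", "pass");
      Statement st = con.createStatement();
      // one synonym group per row, already comma separated, e.g. "tv, television"
      ResultSet rs = st.executeQuery("select synonym_line from synonyms");
      PrintWriter out = new PrintWriter(new FileWriter("synonyms.txt"));
      while (rs.next()) {
        out.println(rs.getString(1));
      }
      out.close();
      rs.close();
      st.close();
      con.close();
    }
  }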
Re: Huge load and long response times during search
Try solr.FastLRUCache instead of solr.LRUCache it's the new cache gesture for solr 1.4. And maybe true in main index section or diminish mergefactor see http://wiki.apache.org/lucene-java/ImproveSearchingSpeed Tomasz Kępski a écrit : Hi, I'm using SOLR(1.4) to search among about 3,500,000 documents. After the server kernel was updated to 64bit system has started to suffer. Our server has 8G of RAM and double Intel Core 2 DUO. We used to have average loads around 2-2,5. It was not as good as it should but as long HTTP response times was acceptable we do not care to much ;-) Since few days avg loads are usually around 6, sometimes goes even to 20. PHP, Mysql and Postgresql based application is rather fine, but when tries to access SOLR it takes ages to load page. In top java process (Jetty) takes 200-250% of CPU, iotop shows that most of the disk operations are done by SOLR threads as well. When we do shut down Jetty load goes down to 1,5 or even less than 1. My index has ~12G below is a part of my solrconf.xml: 1024 true true 40 200 solr 0 name="rows">10 solr price 0 10 solr name="sort">rekomendacja 0 name="rows">10 static newSearcher warming query from solrconfig.xml fast_warm 0 10 static firstSearcher warming query from solrconfig.xml false dismax explicit 0.01 name^90.0 scategory^450.0 brand^90.0 text^0.01 description^30 brand,description,id,name,price,score 4<100% 5<90% 100 *:* sample query parameters from log looks like this: 2009-11-20 21:07:15 org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/select params={spellcheck=true&wt=json&rows=20&json.nl=map&start=520&facet=true&spellcheck.collate=true&fl=id,name,description,preparation,url,shop_id&q=camera&qt=dismax&version=1.3&hl.fl=name,description,atributes,brand,url&facet.field=shop_id&facet.field=brand&hl.fragsize=200&spellcheck.count=5&hl.snippets=3&hl=true} hits=3784 status=0 QTime=83 2009-11-20 21:07:15 org.apache.solr.core.SolrCore execute INFO: [] webapp=/solr path=/spellCheckCompRH params={spellcheck=true&wt=json&rows=20&json.nl=map&start=520&facet=true&spellcheck.collate=true&fl=id,name,description,preparation,url,shop_id&q=camera&qt=dismax&version=1.3&hl.fl=name,description,atributes,brand,url&facet.field=shop_id&facet.field=brand&hl.fragsize=200&spellcheck.count=5&hl.snippets=3&hl=true} hits=3784 status=0 QTime=16 And at last the question ;-) How to speed up the search? Which parameters should I check first to find out what is the bottleneck? Sorry for verbose entry but I would like to give as clear point of view as possible Thanks in advance, Tom
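The cache change suggested above would look roughly like this in solrconfig.xml (the sizes are just the stock example values, not tuned recommendations):

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <documentCache class="solr.FastLRUCache" size="512" initialSize="512"/>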
Keyword extraction
Hi all, Struggling with a question I recently got from a colleague: is it possible to extract keywords from indexed content? In my opinion it should be possible to find out on which words the ranking of the indexed content is the highest (Lucene or Solr), but I have no clue where to begin. Does anyone have suggestions? Best, Patrick
RE: Keyword extraction
Hi All, as an addition to my previous post, no interestingTerms are returned when I execute the following URL: http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.interestingTerms=list&mlt=true&mlt.match.include=true I get a moreLikeThis list though, any thoughts? Best, Patrick
RE: Keyword extraction
Hi Aleksander, Thanx for clearing this up. I am confident that this is a way to explore for me as I'm just starting to grasp the matter. Do you know why I'm not getting any results with the query posted earlier then? It gives me the folowing only: Instead of delivering details of the interestingTerms. Thanks in advance Patrick -Original Message- From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] Sent: woensdag 26 november 2008 13:03 To: solr-user@lucene.apache.org Subject: Re: Keyword extraction I do not agree with you at all. The concept of MoreLikeThis is based on the fundamental idea of TF-IDF weighting, and not term frequency alone. Please take a look at: http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThis.html As you can see, it is possible to use cut-off thresholds to significantly reduce the number of unimportant terms, and generate highly suitable queries based on the tf-idf frequency of the term, since as you point out, high frequency terms alone tends to be useless for querying, but taking the document frequency into account drastically increases the importance of the term! In solr, use parameters to manipulate your desired results: http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e22ec5d1519c456b2c For instance: mlt.mintf - Minimum Term Frequency - the frequency below which terms will be ignored in the source doc. mlt.mindf - Minimum Document Frequency - the frequency at which words will be ignored which do not occur in at least this many docs. You can also set thresholds for term length etc. Hope this gives you a better idea of things. - Aleks On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> wrote: > Dear Partick, I had the same problem with MoreLikeThis function. > > After briefly reading and analyzing the source code of moreLikeThis > function in solr, I conducted: > > MoreLikeThis uses term vectors to ranks all the terms from a document > by its frequency. According to its ranking, it will start to generate > queries, artificially, and search for documents. > > So, moreLikeThis will retrieve related documents by artificially > generating queries based on most frequent terms. > > There's a big problem with "most frequent terms" from documents. Most > frequent words are usually meaningless, or so called function words, > or, people from Information Retrieval like to call them stopwords. > However, ignoring technical problems of implementation of > moreLikeThis function, this approach is very dangerous, since queries > are generated artificially based on a given document. > Writting queries for retrieving a document is a human task, and it > assumes some knowledge (user knows what document he wants). > > I advice to use others approaches, depending on your expectation. For > example, you can extract similar documents just by searching for > documents with similar title (more like this doesn't work in this case). > > I hope it helps, > Best Regards, > Vitalie Scurtu > --- On Wed, 11/26/08, Plaatje, Patrick <[EMAIL PROTECTED]> > wrote: > From: Plaatje, Patrick <[EMAIL PROTECTED]> > Subject: RE: Keyword extraction > To: solr-user@lucene.apache.org > Date: Wednesday, November 26, 2008, 10:52 AM > > Hi All, > as an addition to my previous post, no interestingTerms are returned > when i execute the folowing url: > http://localhost:8080/solr/select/?q=id=18477975&mlt.fl=text&mlt.inter > es tingTerms=list&mlt=true&mlt.match.include=true > I get a moreLikeThis list though, any thoughts? 
> Best, > Patrick > > > > -- Aleksander M. Stensby Senior software developer Integrasco A/S www.integrasco.no
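Putting the parameters from this thread together, a request along those lines might look like this (host, core, document id and the threshold values are placeholders):

  http://localhost:8080/solr/select/?q=id:18477975&mlt=true&mlt.fl=text&mlt.mintf=2&mlt.mindf=5&mlt.interestingTerms=list&mlt.match.include=true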
RE: Keyword extraction
Hi Aleksander, This was a typo on my end, the original query included a semicolon instead of an equal sign. But I think it has to do with my field not being stored and not being identified as termVectors="true". I'm recreating the index now, and see if this fixes the problem. Best, patrick -Original Message- From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] Sent: woensdag 26 november 2008 14:37 To: solr-user@lucene.apache.org Subject: Re: Keyword extraction Hi there! Well, first of all i think you have an error in your query, if I'm not mistaken. You say http://localhost:8080/solr/select/?q=id=18477975... but since you are referring to the field called "id", you must say: http://localhost:8080/solr/select/?q=id:18477975... (use colon instead of the equals sign). I think that will do the trick. If not, try adding the &debugQuery=on at the end of your request url, to see debug output on how the query is parsed and if/how any documents are matched against your query. Hope this helps. Cheers, Aleksander On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote: > Hi Aleksander, > > Thanx for clearing this up. I am confident that this is a way to > explore for me as I'm just starting to grasp the matter. Do you know > why I'm not getting any results with the query posted earlier then? It > gives me the folowing only: > > > > > Instead of delivering details of the interestingTerms. > > Thanks in advance > > Patrick > > > -Original Message- > From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] > Sent: woensdag 26 november 2008 13:03 > To: solr-user@lucene.apache.org > Subject: Re: Keyword extraction > > I do not agree with you at all. The concept of MoreLikeThis is based > on the fundamental idea of TF-IDF weighting, and not term frequency alone. > Please take a look at: > http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/simil > ar/MoreLikeThis.html As you can see, it is possible to use cut-off > thresholds to significantly reduce the number of unimportant terms, > and generate highly suitable queries based on the tf-idf frequency of > the term, since as you point out, high frequency terms alone tends to > be useless for querying, but taking the document frequency into > account drastically increases the importance of the term! > > In solr, use parameters to manipulate your desired results: > http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e2 > 2ec5d1519c456b2c > For instance: > mlt.mintf - Minimum Term Frequency - the frequency below which terms > will be ignored in the source doc. > mlt.mindf - Minimum Document Frequency - the frequency at which words > will be ignored which do not occur in at least this many docs. > You can also set thresholds for term length etc. > > Hope this gives you a better idea of things. > - Aleks > > On Wed, 26 Nov 2008 12:38:38 +0100, Scurtu Vitalie <[EMAIL PROTECTED]> > wrote: > >> Dear Partick, I had the same problem with MoreLikeThis function. >> >> After briefly reading and analyzing the source code of moreLikeThis >> function in solr, I conducted: >> >> MoreLikeThis uses term vectors to ranks all the terms from a document >> by its frequency. According to its ranking, it will start to generate >> queries, artificially, and search for documents. >> >> So, moreLikeThis will retrieve related documents by artificially >> generating queries based on most frequent terms. >> >> There's a big problem with "most frequent terms" from documents. 
>> Most frequent words are usually meaningless, or so called function >> words, or, people from Information Retrieval like to call them stopwords. >> However, ignoring technical problems of implementation of >> moreLikeThis function, this approach is very dangerous, since queries >> are generated artificially based on a given document. >> Writting queries for retrieving a document is a human task, and it >> assumes some knowledge (user knows what document he wants). >> >> I advice to use others approaches, depending on your expectation. For >> example, you can extract similar documents just by searching for >> documents with similar title (more like this doesn't work in this case). >> >> I hope it helps, >> Best Regards, >> Vitalie Scurtu >> --- On Wed, 11/26/08, Plaatje, Patrick >> <[EMAIL PROTECTED]> >> wrote: >> From: Plaatje, Patrick <[EMAIL PROTECTED]> >> Subject: RE: Keyword extraction >> To: solr-user@lucene.apache.org >> Date: Wednesday, November 26, 2008, 10:52 AM >
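The schema change Patrick describes boils down to the extra attributes on the field definition in schema.xml (field name and type are examples):

  <field name="text" type="text" indexed="true" stored="true" termVectors="true"/>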
RE: Keyword extraction
Hi Aleksander, With all the help of you and the other comments, we're now at a point where a MoreLikeThis list is returned, and shows 10 related records. However on the query executed there are no keywords whatsoever being returned. Is the querystring still wrong or is something else required? The querystring we're currently executing is: http://suempnr3:8080/solr/select/?q=amsterdam&mlt.fl=text&mlt.displayTerms=list&mlt=true Best, Patrick -Original Message- From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] Sent: woensdag 26 november 2008 15:07 To: solr-user@lucene.apache.org Subject: Re: Keyword extraction Ah, yes, That is important. In lucene, the MLT will see if the term vector is stored, and if it is not it will still be able to perform the querying, but in a much much much less efficient way.. Lucene will analyze the document (and the variable DEFAULT_MAX_NUM_TOKENS_PARSED will be used to limit the number of tokens that will be parsed). (don't want to go into details on this since I haven't really dug through the code:p) But when the field isn't stored either, it is rather difficult to re-analyze the document;) On a general note, if you want to "really" understand how the MLT works, take a look at the wiki or read this thorough blog post: http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ Regards, Aleksander On Wed, 26 Nov 2008 14:41:52 +0100, Plaatje, Patrick <[EMAIL PROTECTED]> wrote: > Hi Aleksander, > > This was a typo on my end, the original query included a semicolon > instead of an equal sign. But I think it has to do with my field not > being stored and not being identified as termVectors="true". I'm > recreating the index now, and see if this fixes the problem. > > Best, > > patrick > > -Original Message- > From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] > Sent: woensdag 26 november 2008 14:37 > To: solr-user@lucene.apache.org > Subject: Re: Keyword extraction > > Hi there! > Well, first of all i think you have an error in your query, if I'm not > mistaken. > You say http://localhost:8080/solr/select/?q=id=18477975... > but since you are referring to the field called "id", you must say: > http://localhost:8080/solr/select/?q=id:18477975... > (use colon instead of the equals sign). > I think that will do the trick. > If not, try adding the &debugQuery=on at the end of your request url, > to see debug output on how the query is parsed and if/how any > documents are matched against your query. > Hope this helps. > > Cheers, > Aleksander > > > > On Wed, 26 Nov 2008 13:08:30 +0100, Plaatje, Patrick > <[EMAIL PROTECTED]> wrote: > >> Hi Aleksander, >> >> Thanx for clearing this up. I am confident that this is a way to >> explore for me as I'm just starting to grasp the matter. Do you know >> why I'm not getting any results with the query posted earlier then? >> It gives me the folowing only: >> >> >> >> >> Instead of delivering details of the interestingTerms. >> >> Thanks in advance >> >> Patrick >> >> >> -Original Message- >> From: Aleksander M. Stensby [mailto:[EMAIL PROTECTED] >> Sent: woensdag 26 november 2008 13:03 >> To: solr-user@lucene.apache.org >> Subject: Re: Keyword extraction >> >> I do not agree with you at all. The concept of MoreLikeThis is based >> on the fundamental idea of TF-IDF weighting, and not term frequency >> alone. 
>> Please take a look at: >> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/simi >> l ar/MoreLikeThis.html As you can see, it is possible to use cut-off >> thresholds to significantly reduce the number of unimportant terms, >> and generate highly suitable queries based on the tf-idf frequency of >> the term, since as you point out, high frequency terms alone tends to >> be useless for querying, but taking the document frequency into >> account drastically increases the importance of the term! >> >> In solr, use parameters to manipulate your desired results: >> http://wiki.apache.org/solr/MoreLikeThis#head-6460069f297626f2a982f1e >> 2 >> 2ec5d1519c456b2c >> For instance: >> mlt.mintf - Minimum Term Frequency - the frequency below which terms >> will be ignored in the source doc. >> mlt.mindf - Minimum Document Frequency - the frequency at which words >> will be ignored which do not occur in at least this many docs. >> You can also set thresholds for term length etc. >> >> Hope this gives you a better idea of things. >> - Aleks >>
RE: php client. json communication
Or have a look at the Wiki, probably a better way to start: http://wiki.apache.org/solr/SolPHP Best, Patrick -- Just trying to help http://www.ipros.nl/ -- -Original Message- From: KishoreVeleti CoreObjects [mailto:kisho...@coreobjects.com] Sent: dinsdag 16 december 2008 15:14 To: solr-user@lucene.apache.org Subject: Re: php client. json communication Check out this link http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html If anyone of you used it can you share your experiences. Thanks, Kishore Veleti A.V.K. Julian Davchev wrote: > > Hi, > I am about to integrate solr for index/search of my documents/data. > It's php application but I see it should be no problem as solr works > with xml by default. > Is there any read php lib that will ease/help whole communication with > solr and if possible to send/receive json data. > > I looked up archive list and seems not many discussions in php. Also > from manual it seems that it can only get json response but request > should always be xml. > Cheers, > > -- View this message in context: http://www.nabble.com/php-client.-json-communication-tp21033573p21033806 .html Sent from the Solr - User mailing list archive at Nabble.com.
Using DIH, getting exception
Hi All, I'm trying to use the DataImportHandler, with the data config below (snippet): The variables are all good (username+password, etc), but I'm getting the following exception, any thoughts? org.apache.solr.handler.dataimport.DataImportHandlerException: No dataSource :null available for entity :item Processing Document # Best, Patrick
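Since the config snippet did not survive the mail, here is a generic data-config.xml skeleton that shows the relationship the error message complains about: the entity's dataSource attribute has to reference a dataSource that is defined by name (all names, the driver and the SQL are made up):

  <dataConfig>
    <dataSource name="ds" type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/db" user="user" password="pass"/>
    <document>
      <entity name="item" dataSource="ds" query="select * from item"/>
    </document>
  </dataConfig>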
RE: checkout 1.4 snapshot
Hi, You can find the SVN repository here: http://www.apache.org/dev/version-control.html#anon-svn I'm not sure if this represent the 1.4 version, but as being the trunk it's the latest version. Best, Patrick -Original Message- From: roberto [mailto:miles.c...@gmail.com] Sent: dinsdag 16 december 2008 22:13 To: solr-user@lucene.apache.org Subject: checkout 1.4 snapshot Hello, Someone could tell me how can i checkout the 1.4 snapshot ? thanks, -- "Without love, we are birds with broken wings." Morrie
RE: checkout 1.4 snapshot
Sorry all, Wrong url in the post, right url should be: http://svn.apache.org/repos/asf/lucene/solr/ Best, Patrick -Original Message- From: Plaatje, Patrick [mailto:patrick.plaa...@getronics.com] Sent: dinsdag 16 december 2008 22:19 To: solr-user@lucene.apache.org Subject: RE: checkout 1.4 snapshot Hi, You can find the SVN repository here: http://www.apache.org/dev/version-control.html#anon-svn I'm not sure if this represent the 1.4 version, but as being the trunk it's the latest version. Best, Patrick -Original Message- From: roberto [mailto:miles.c...@gmail.com] Sent: dinsdag 16 december 2008 22:13 To: solr-user@lucene.apache.org Subject: checkout 1.4 snapshot Hello, Someone could tell me how can i checkout the 1.4 snapshot ? thanks, -- "Without love, we are birds with broken wings." Morrie
RE: php client. json communication
Glad that's sorted. On the other issue (directly accessing solr from any client) I think I saw a discussion on the list earlier, but I don't know what the result was, browse through the archives and look for something about security (I think). Best, patrick -Original Message- From: Julian Davchev [mailto:j...@drun.net] Sent: dinsdag 16 december 2008 23:02 To: solr-user@lucene.apache.org Subject: Re: php client. json communication I think I got it now. Search request is actually just simple url with few params...no json or xml or fancy stuff needed. I was concerned with this cause I need to use solr with javascript directly, bypassing application and directly searching stuff. Plaatje, Patrick wrote: > Hi Julian, > > I'm a bit confused. The indexing is indeed being done through XML, but > in searching it is possible to get JSON results by using the wt=json > parameter, have a look here: > > http://wiki.apache.org/solr/SolJSON > > Best, > > Patrick > > > -Original Message- > From: Julian Davchev [mailto:j...@drun.net] > Sent: dinsdag 16 december 2008 22:39 > To: solr-user@lucene.apache.org > Subject: Re: php client. json communication > > Hi, > 1. Thanks for links, I looked at both. Still I think that solr or > at least those php clients doesn't support jason as input. > It's clear that it's possible to get json response.but search is > only possible via xml queries. > > > Plaatje, Patrick wrote: > >> Or have a look at the Wiki, probably a better way to start: >> >> http://wiki.apache.org/solr/SolPHP >> >> Best, >> >> Patrick >> >> -- >> Just trying to help >> http://www.ipros.nl/ >> -- >> >> -Original Message- >> From: KishoreVeleti CoreObjects [mailto:kisho...@coreobjects.com] >> Sent: dinsdag 16 december 2008 15:14 >> To: solr-user@lucene.apache.org >> Subject: Re: php client. json communication >> >> >> Check out this link >> http://www.ibm.com/developerworks/library/os-php-apachesolr/index.htm >> l >> >> If anyone of you used it can you share your experiences. >> >> Thanks, >> Kishore Veleti A.V.K. >> >> >> Julian Davchev wrote: >> >> >>> Hi, >>> I am about to integrate solr for index/search of my documents/data. >>> It's php application but I see it should be no problem as solr works >>> with xml by default. >>> Is there any read php lib that will ease/help whole communication >>> with >>> >>> >> >> >>> solr and if possible to send/receive json data. >>> >>> I looked up archive list and seems not many discussions in php. Also >>> from manual it seems that it can only get json response but request >>> should always be xml. >>> Cheers, >>> >>> >>> >>> >> -- >> View this message in context: >> http://www.nabble.com/php-client.-json-communication-tp21033573p21033 >> 8 >> 06 >> .html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> >> > >
RE: php client. json communication
Hi Julian, I'm a bit confused. The indexing is indeed being done through XML, but in searching it is possible to get JSON results by using the wt=json parameter, have a look here: http://wiki.apache.org/solr/SolJSON Best, Patrick -Original Message- From: Julian Davchev [mailto:j...@drun.net] Sent: dinsdag 16 december 2008 22:39 To: solr-user@lucene.apache.org Subject: Re: php client. json communication Hi, 1. Thanks for links, I looked at both. Still I think that solr or at least those php clients doesn't support jason as input. It's clear that it's possible to get json response.but search is only possible via xml queries. Plaatje, Patrick wrote: > Or have a look at the Wiki, probably a better way to start: > > http://wiki.apache.org/solr/SolPHP > > Best, > > Patrick > > -- > Just trying to help > http://www.ipros.nl/ > -- > > -Original Message- > From: KishoreVeleti CoreObjects [mailto:kisho...@coreobjects.com] > Sent: dinsdag 16 december 2008 15:14 > To: solr-user@lucene.apache.org > Subject: Re: php client. json communication > > > Check out this link > http://www.ibm.com/developerworks/library/os-php-apachesolr/index.html > > If anyone of you used it can you share your experiences. > > Thanks, > Kishore Veleti A.V.K. > > > Julian Davchev wrote: > >> Hi, >> I am about to integrate solr for index/search of my documents/data. >> It's php application but I see it should be no problem as solr works >> with xml by default. >> Is there any read php lib that will ease/help whole communication >> with >> > > >> solr and if possible to send/receive json data. >> >> I looked up archive list and seems not many discussions in php. Also >> from manual it seems that it can only get json response but request >> should always be xml. >> Cheers, >> >> >> > > -- > View this message in context: > http://www.nabble.com/php-client.-json-communication-tp21033573p210338 > 06 > .html > Sent from the Solr - User mailing list archive at Nabble.com. > >
RE: Change in config file (synonym.txt) requires container restart?
Hi , I'm wondering if you could not implement a custom filter which reads the file realtime (you might even keep the create synonym map in memory for a predefined time). This then doesn't need a restart of the container. Best, Patrick -Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: vrijdag 19 december 2008 7:30 To: solr-user@lucene.apache.org Subject: Re: Change in config file (synonym.txt) requires container restart? Please note that a core reload will also stop Solr from serving any search requests in the time it reloads. On Fri, Dec 19, 2008 at 8:24 AM, Sagar Khetkade wrote: > > But i am using CommonsHttpSolrServer for Solr server configuation as > it is accepts the url. So here how can i reload the core. > > -Sagar> Date: Thu, 18 Dec 2008 07:55:02 -0500> From: > -Sagar> markrmil...@gmail.com> > To: solr-user@lucene.apache.org> Subject: Re: Change in config file > (synonym.txt) requires container restart?> > Sagar Khetkade wrote:> > > Hi,> > > > > I am using SolrJ client to connect to the Solr 1.3 server and the > > > whole > POC (doing a feasibility study ) reside in Tomcat web server. If any > change I am making in the synonym.txt file to add the synonym in the > file to make it reflect I have to restart the tomcat server. The > synonym filter factory that I am using are in both in analyzers for > type index and query in schema.xml. Please tell me whether this > approach is good or any other way to make the change reflect while > searching without restarting of tomcat server.> > > > Thanks and > Regards,> > Sagar Khetkade> > > _> > > Chose your Life Partner? Join MSN Matrimony FREE> > > http://in.msn.com/matrimony> > > > You can also reload the core.> > - Mark > _ > Chose your Life Partner? Join MSN Matrimony FREE > http://in.msn.com/matrimony > -- Regards, Shalin Shekhar Mangar.
Getting request object within search component
Hi All, I developed my own custom search component, in which I need to get the requestors ip-address. But I can't seem to find a request object from where I can get this string, ideas anyone? Best, Patrick
RE: Solr statistics of top searches and results returned
Hi, At the moment Solr does not have such functionality. I have written a plugin for Solr though which uses a second Solr core to store/index the searches. If you're interested, send me an email and I'll get you the source for the plugin. Regards, Patrick -Original Message- From: solrpowr [mailto:solrp...@hotmail.com] Sent: dinsdag 19 mei 2009 20:21 To: solr-user@lucene.apache.org Subject: Solr statistics of top searches and results returned Hi, Besides my own offline processing via logs, does solr have the functionality to give me statistics such as top searches, how many results were returned on these searches, and/or how long it took to get these results on average. Thanks, Bob -- View this message in context: http://www.nabble.com/Solr-statistics-of-top-searches-and-results-returned-tp23621779p23621779.html Sent from the Solr - User mailing list archive at Nabble.com.
RE: Solr statistics of top searches and results returned
Hi Shalin, Let me investigate. I think the challenge will be in storing/managing these statistics. I'll get back to the list when I have thought of something. Rgrds, Patrick -Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: woensdag 20 mei 2009 10:33 To: solr-user@lucene.apache.org Subject: Re: Solr statistics of top searches and results returned On Wed, May 20, 2009 at 1:31 PM, Plaatje, Patrick < patrick.plaa...@getronics.com> wrote: > > At the moment Solr does not have such functionality. I have written a > plugin for Solr though which uses a second Solr core to store/index > the searches. If you're interested, send me an email and I'll get you > the source for the plugin. > > Patrick, this will be a useful addition. However instead of doing this with another core, we can keep running statistics which can be shown on the statistics page itself. What do you think? A related approach for showing slow queries was discussed recently. There's an issue open which has more details: https://issues.apache.org/jira/browse/SOLR-1101 -- Regards, Shalin Shekhar Mangar.
RE: Solr statistics of top searches and results returned
Hi all, I created a script that uses a Solr Search Component, which hooks into the main solr core and catches the searches being done. After this it tokenizes the search and sends both the tokenized as well as the original query to another Solr core. I have not written a factory for this, but if required, it shouldn't be so hard to modify the script and code database support into it. You can find the source here: http://www.ipros.nl/uploads/Stats-component.zip It includes a README, and a schema.xml that should be used. Please let me know your thoughts. Best, Patrick -Original Message- From: Umar Shah [mailto:u...@wisdomtap.com] Sent: vrijdag 22 mei 2009 10:03 To: solr-user@lucene.apache.org Subject: Re: Solr statistics of top searches and results returned Hi, good feature to have, maintaining top N would also require storing all the search queries done so far and keep updating (or atleast in some time window). having pluggable persistent storage for all time search queries would be great. tell me how can I help? -umar On Fri, May 22, 2009 at 12:21 PM, Shalin Shekhar Mangar wrote: > On Fri, May 22, 2009 at 3:22 AM, Grant Ingersoll wrote: > >> >> I think you will want some type of persistence mechanism otherwise >> you will end up consuming a lot of resources keeping track of all the >> query strings, unless I'm missing something. Either a Lucene index >> (Solr core) or the option of embedding a DB. Ideally, it would be >> pluggable such that people could choose their storage mechanism. >> Most people do this kind of thing offline via log analysis as logs can grow >> quite large quite quickly. >> > > For a general case, yes. But I was thinking more of a top 'n' queries > as a running statistic. > > -- > Regards, > Shalin Shekhar Mangar. >
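For readers who just want the shape of the approach, here is a hedged sketch, not the code in the zip above: a small helper that forwards the raw query plus a crude whitespace tokenization to a second core via SolrJ. The stats core URL and field names are illustrative assumptions; the schema.xml shipped with the plugin is authoritative.

import java.io.IOException;
import java.net.MalformedURLException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class QueryStatsLogger {
    private final CommonsHttpSolrServer statsCore;

    public QueryStatsLogger(String statsCoreUrl) throws MalformedURLException {
        // e.g. "http://localhost:8983/solr/stats" - illustrative URL
        this.statsCore = new CommonsHttpSolrServer(statsCoreUrl);
    }

    // Index the raw query plus a simple whitespace tokenization into the stats core.
    // Commits are left to autoCommit or a periodic commit elsewhere.
    public void log(String query) throws SolrServerException, IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", query + "-" + System.currentTimeMillis());
        doc.addField("query", query);
        for (String token : query.toLowerCase().split("\\s+")) {
            doc.addField("token", token);
        }
        statsCore.add(doc);
    }
}

A SearchComponent could call something like this from its process() hook with the incoming q parameter.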
RE: Solr statistics of top searches and results returned
Hi, In our specific implementation this is not really an issue, but I can imagine it could impact performance. I guess a new thread could be spawned, which takes care of any performance issues, thanks for pointing it out. I'll post a message once I've coded the change. Regards, Patrick -Original Message- From: rswart [mailto:rjsw...@gmail.com] Sent: dinsdag 26 mei 2009 16:42 To: solr-user@lucene.apache.org Subject: RE: Solr statistics of top searches and results returned If this is not done in an async way wouldn't this have a serious performance impact? Plaatje, Patrick wrote: > > Hi all, > > I created a script that uses a Solr Search Component, which hooks into > the main solr core and catches the searches being done. After this it > tokenizes the search and sends both the tokenized as well as the > original query to another Solr core. I have not written a factory for > this, but if required, it shouldn't be so hard to modify the script > and code database support into it. > > You can find the source here: > > http://www.ipros.nl/uploads/Stats-component.zip > > It includes a README, and a schema.xml that should be used. > > Please let me know your thoughts. > > Best, > > Patrick > > > > > > -Original Message- > From: Umar Shah [mailto:u...@wisdomtap.com] > Sent: vrijdag 22 mei 2009 10:03 > To: solr-user@lucene.apache.org > Subject: Re: Solr statistics of top searches and results returned > > Hi, > > good feature to have, > maintaining top N would also require storing all the search queries > done so far and keep updating (or atleast in some time window). > > having pluggable persistent storage for all time search queries would > be great. > > tell me how can I help? > > -umar > > On Fri, May 22, 2009 at 12:21 PM, Shalin Shekhar Mangar > wrote: >> On Fri, May 22, 2009 at 3:22 AM, Grant Ingersoll wrote: >> >>> >>> I think you will want some type of persistence mechanism otherwise >>> you will end up consuming a lot of resources keeping track of all >>> the query strings, unless I'm missing something. Either a Lucene >>> index (Solr core) or the option of embedding a DB. Ideally, it >>> would be pluggable such that people could choose their storage mechanism. >>> Most people do this kind of thing offline via log analysis as logs >>> can grow quite large quite quickly. >>> >> >> For a general case, yes. But I was thinking more of a top 'n' queries >> as a running statistic. >> >> -- >> Regards, >> Shalin Shekhar Mangar. >> > > -- View this message in context: http://www.nabble.com/Solr-statistics-of-top-searches-and-results-returned-tp23621779p23724277.html Sent from the Solr - User mailing list archive at Nabble.com.
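A minimal sketch of the asynchronous variant discussed here: hand the stats write off to a small background executor so the user's search never waits on the stats core. It reuses the hypothetical QueryStatsLogger from the earlier sketch:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncQueryStatsLogger {
    private final QueryStatsLogger delegate;
    // Single background thread; queries queue up instead of blocking searches.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public AsyncQueryStatsLogger(QueryStatsLogger delegate) {
        this.delegate = delegate;
    }

    public void log(final String query) {
        executor.submit(new Runnable() {
            public void run() {
                try {
                    delegate.log(query);
                } catch (Exception e) {
                    // Never let stats logging break or slow down the search path.
                    e.printStackTrace();
                }
            }
        });
    }

    public void shutdown() {
        executor.shutdown();
    }
}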
Delete, Commit, Add Interaction
We're indexing a potentially large collection of documents into smaller subgroups we call "collections". Each document has a field that identifies the collection it belongs to, in addition to a unique document id field: foo-1 foo .. foo-2 foo . . etc. "collection" and "id" are defined in schema.xml as string fields. When a collection is being added to the index, it's possible that there is an existing "foo" collection in the index that needs to be replaced. The ids in the new collection will reuse many of the ids in the old collection, but the replacement is not a document-for-document replacement process -- there may be more or fewer documents in the new collection. So the replacement operation goes as follows: collection:foo . Each of these XML commands happens on a separate HTTP connection. If the collection doesn't already exist in the index, then the delete is essentially a noop. Finally, here's the behavior we're seeing. In some cases, usually when the index is starting to get larger (approaching 500,000 documents), the above procedure will fail to add anything to the index. That is, none of the commands return an error code, there is no indication of a problem in the log files and the process DOES take some amount of time to complete. But at the end of the process, there are no documents in the index whose collection is "foo". This can happen whether or not there is an existing "foo" collection already in the index -- in fact, the typical case is that there is not. So my question is: Is there any chance that the delete, commit, and add commands are interacting in such a way as to cause the add to happen before the delete so that the add is just replacing the existing "foo" documents and then the delete is coming along and deleting everything? My understanding is that the wait attributes on the commit command should flush the delete out to the index before the add can start but I have no knowledge of the true sequencing of events in either Solr or Lucene. If this is happening, how can I know when the delete has been processed before initiating the add process? Thanks, Patrick Johnstone
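The XML commands in the message above were stripped by the mail archive. Based on the description, the sequence presumably looks roughly like the following, with "id", "collection", and "foo" being the placeholders used in the message, and each block posted on its own HTTP connection:

<delete><query>collection:foo</query></delete>
<commit waitFlush="true" waitSearcher="true"/>

<add>
  <doc>
    <field name="id">foo-1</field>
    <field name="collection">foo</field>
  </doc>
  <doc>
    <field name="id">foo-2</field>
    <field name="collection">foo</field>
  </doc>
</add>
<commit waitFlush="true" waitSearcher="true"/>

The intermediate commit with waitFlush/waitSearcher is what the message expects to flush the delete out before the add starts.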
StreamingUpdateSolrServer
Hi All, I'm testing StreamingUpdateSolrServer for indexing, but I never see the final "finished: org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer$Runner@" line in my logs. Do I have to use a special function to wait until the update is effective? Another question (maybe an easy one for you): I'm running Solr on Tomcat 5.0.28 and sometimes, not at the time of an rsync, heavy traffic, or a commit, it stops responding and the load reported by uptime is very high. Thank you for your help. Patrick.
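If memory serves, SolrJ 1.4's StreamingUpdateSolrServer has a blockUntilFinished() method for exactly this. A hedged sketch follows; the URL, queue size, and thread count are illustrative values:

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexExample {
    public static void main(String[] args) throws Exception {
        // queue size 20, 4 background threads - illustrative values
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8080/solr", 20, 4);

        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            server.add(doc);   // queued and sent by the background threads
        }

        // Wait until the background threads have drained the queue...
        server.blockUntilFinished();
        // ...and only then make the documents searchable.
        server.commit();
    }
}

Whether commit() alone drains the queue is not guaranteed here, so blocking explicitly before committing is the safer order.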
Invalid CRLF - StreamingUpdateSolrServer ?
I'm using Solr 1.4 on Tomcat 5.0.28, with the StreamingUpdateSolrServer client, 10 threads, and XML communication via the POST method. Is there a way to avoid this error (data is lost)? And is StreamingUpdateSolrServer reliable?
GRAVE: org.apache.solr.common.SolrException: Invalid CRLF
at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:72)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
at java.lang.Thread.run(Thread.java:619)
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid CRLF