Changing existing index to use block-join
Hello, I read that Solr supports nested documents via block-join since version 4.5, and I'm wondering whether I can change an existing index to take advantage of this. Right now I have an index that stores information about a journal and each of its articles. For example, a journal has the id - and its articles have ids like --01, --02, …. I also already use a field called j-id in all documents to refer to the id of the journal (so all articles of the journal in the example above have the j-id -). I use this j-id to group all results of a journal with the grouping feature. Obviously this solution lacks some features, like faceting or finding the parent journal of an article without a second request. The new block-join feature seems to solve some of these problems (sadly not all; as far as I can see, I can't get the parent document together with the articles where the search term was found as a nested result). So, my question: can I change my existing index by just adding an is_parent and a _root_ field and storing the journal id there like I did with j-id, or do I have to reindex all my documents? I made a test by adding the id of the parent journal to the _root_ field of the articles and running a query like q={!parent which='is_parent:true'}+description:test, but it didn't seem to work. I only got an error message (code 500):

java.lang.IllegalArgumentException: docID must be >= 0 and < maxDoc=1418849 (got docID=-1)
  at org.apache.lucene.index.BaseCompositeReader.readerIndex(BaseCompositeReader.java:182)
  at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:109)
  at org.apache.lucene.index.IndexReader.document(IndexReader.java:436)
  at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:657)
  at org.apache.solr.response.TextResponseWriter.writeDocuments(TextResponseWriter.java:270)
  at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:172)
  at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183)
  at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299)
  at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95)
  at org.apache.solr.response.JSONResponseWriter.write(JSONResponseWriter.java:60)
  at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:698)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:426)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
  at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:929)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
  at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1002)
  at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
  at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:724)

Do you have any advice on how to fix this or how to use block-join properly? Thanks, Gesh
Re: Changing existing index to use block-join
Quoting Mikhail Khludnev: On Sat, Jan 18, 2014 at 11:25 PM, wrote: So, my question: can I change my existing index by just adding an is_parent and a _root_ field and storing the journal id there like I did with j-id, or do I have to reindex all my documents? Absolutely, to use block-join you need to index nested documents as blocks, as described at http://blog.griddynamics.com/2013/09/solr-block-join-support.html e.g. https://gist.github.com/mkhludnev/6406734#file-t-shirts-xml Thank you for the clarification. But is there no way to add new children without indexing the parent document and all existing children again? So, in the example on GitHub, if I want to add new sizes and colors to an existing T-shirt, I have to reindex the already existing T-shirt and all its variations again? I understand that the blocks are created at index time, so I can't change an existing index to build blocks just by adding the _root_ field, but I don't get why it's not possible to add new children. Or did I misinterpret your statement? Thanks, -Gesh
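(For reference, a minimal sketch of what one such block looks like in the XML update format, modeled on the t-shirts example; the field names id, is_parent, title and description are placeholders borrowed from this thread, not a definitive schema. The parent and all of its children are sent together as one nested <doc>, and the whole block has to be resent whenever any child changes.)

<add>
  <doc>
    <!-- parent: the journal -->
    <field name="id">journal-1</field>
    <field name="is_parent">true</field>
    <field name="title">Example Journal</field>
    <!-- children: the articles, nested inside the parent doc -->
    <doc>
      <field name="id">journal-1-01</field>
      <field name="description">First article about test</field>
    </doc>
    <doc>
      <field name="id">journal-1-02</field>
      <field name="description">Second article</field>
    </doc>
  </doc>
</add>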
Searching and scoring with block join
Hello again, I'm using the Solr block-join feature to index a journal and all of its articles. Here is a short example of one block, a journal with two nested articles (field values only):

journal: 527fcbf8-c140-4ae6-8f51-68cd2efc1343, Sozialmagazin, 8, 2008, 0340-8469, ..., juventa, ..., is_parent=true
article: 527fcb34-4570-4a86-b9e7-68cd2efc1343, A World out of Balance, 62, Amthor, ..., ...
article: 527fcbf8-84ec-424f-9d58-68cd2efc1343, Die Philosophie des Helfens, 50, Keck, ..., ...

I read about the search syntax in this article: http://blog.griddynamics.com/2013/09/solr-block-join-support.html Yet I'm wondering how to use it properly. If I want to make a "fulltext" search over all journals and their articles and get the journals with the highest score as the result, what should my query look like? I know that I can't just make a query like {!parent which=is_parent:true}+Term; most likely I'll get this error: child query must only match non-parent docs, but parent docID= matched childScorer=class org.apache.lucene.search.TermScorer So, how do I make a query that searches both journals and articles and gives me the journals ordered by their score? How do I get the scores of the child documents added to the score of the parent document? Thank you for your help. - Gesh
Re: Searching and scoring with block join
Quoting Mikhail Khludnev: On Wed, Jan 22, 2014 at 10:17 PM, wrote: I know that I can't just make a query like {!parent which=is_parent:true}+Term; most likely I'll get this error: child query must only match non-parent docs, but parent docID= matched childScorer=class org.apache.lucene.search.TermScorer Hello Gesh, As it's stated there, the child clause must not match any parent docs, but the query +Term matches them because it applies some default field which, I believe, belongs to the parent docs. That blog has an example of searching across both 'scopes': q=+BRAND_s:Nike +_query_:"{!parent which=type_s:parent}+COLOR_s:Red +SIZE_s:XL" Mind the exact fields specified for both scopes. In your case you need to switch from conjunction '+' to disjunction. Hello Mikhail, Yes, that's correct. I also already tried the query you gave as an example, but I have problems with the scoring. I'm using edismax as defType, but I'm not quite sure how to use it with a {!parent} query. For example, if I do this query, the score is always 0: {!parent which=is_parent:true}+content_de:Test The blog says: ToParentBlockJoinQuery supports a few modes of score calculation; the {!parent} parser has the None mode hardcoded. So, can I change the hardcoded mode somehow? I didn't find any further documentation about the parameters of {!parent}. If I do this request, the score seems to be calculated only from the matches found in "title": title:Test _query_:"{!parent which=is_parent:true}+content_de:Test" Sorry if I'm asking stupid questions, but I have just started to work with Solr and some techniques are not very familiar yet. Thanks -Gesh
Re: Searching and scoring with block join
Quoting Mikhail Khludnev: Nesting query parsers is shown at http://blog.griddynamics.com/2013/12/grandchildren-and-siblings-with-block.html Try to start from the following: title:Test _query_:"{!parent which=is_parent:true}{!dismax qf=content_de}Test" Mind the local-params referencing, e.g. {!... v=$nest}&nest=... Thank you for the hint. I don't really know how {!dismax ...} and local-parameter referencing are supposed to solve my problem. I read your blog entry, but I have trouble understanding how to apply your explanations. Would you mind giving me a short example of how these query parameters help me get a proper result with a combined score for parent and children? Thank you very much. There is no such param in https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/search/join/BlockJoinParentQParser.java#L67 Raise a feature request issue at least; don't hesitate to contribute. Ah, okay, it was a misunderstanding then. I created an issue: https://issues.apache.org/jira/browse/SOLR-5662 Thanks -Gesh
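(Putting the hints from this thread together, one possible shape for the request, written with local-param referencing so the child query stays readable; the field names title, is_parent and content_de are taken from the thread. With the None score mode hardcoded in {!parent}, the block-join clause still contributes no score, so until SOLR-5662 lands the ranking effectively comes from the title clause alone:)

q=title:Test _query_:"{!parent which=is_parent:true v=$childq}"
&childq={!dismax qf=content_de}Test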
Indexing and searching documents in different languages
Hello, I'm trying to index a large number of documents in different languages. I don't know the language of a document in advance, so I'm using TikaLanguageIdentifierUpdateProcessorFactory to identify it. This is my configuration in solrconfig.xml, a processor of class org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory with the values (element markup lost): true, title,subtitle,content, language_s, 0.3, general, en,fr,de,it,es, true, true. The detection works fine, and I put some dynamic fields in schema.xml to store the results (one stored, multiValued dynamic field per language). My main problem now is how to search without knowing the language of the document I'm searching for. I don't want a huge query string like ?q=title_en:+term+subtitle_en:+term+title_de:+term... Okay, I could use copyField and copy all fields into the "text" field... but "text" has the type text_general, so the language-specific indexing is not applied. I could at least use a combined field for every language (like text_en, text_fr...), but my query string still gets very long, and adding new languages is terribly uncomfortable. So, what can I do? Is there a better solution to index and search documents in many languages without knowing the language of the document and of the query beforehand? - Geschan
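(A sketch of what that update chain probably looked like, mapping the listed values onto the standard langid parameters; the assignment of the last two booleans to langid.map and langid.map.keepOrig is an assumption:)

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <bool name="langid">true</bool>
    <str name="langid.fl">title,subtitle,content</str>
    <str name="langid.langField">language_s</str>
    <float name="langid.threshold">0.3</float>
    <str name="langid.fallback">general</str>
    <str name="langid.whitelist">en,fr,de,it,es</str>
    <!-- map fields to language-specific names, e.g. title -> title_de -->
    <bool name="langid.map">true</bool>
    <bool name="langid.map.keepOrig">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>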
Re: Indexing and searching documents in different languages
Thanks, I'll try this approach. Quoting Alexandre Rafalovitch: Have you looked at edismax and the 'qf' fields parameter? It allows you to define the fields to search. Also, you can define those parameters in solrconfig.xml and not have to send them down the wire. Finally, you can define several different request handlers (e.g. /ensearch, /frsearch) and have each of them use different 'qf' values, possibly with 'fl' also defined and with field-name aliasing from language-specific to generic names. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Tue, Apr 9, 2013 at 2:32 PM, wrote: Hello, I'm trying to index a large number of documents in different languages. I don't know the language of a document in advance, so I'm using TikaLanguageIdentifierUpdateProcessorFactory to identify it. This is my configuration in solrconfig.xml, with the values true, title,subtitle,content, language_s, 0.3, general, en,fr,de,it,es, true, true. The detection works fine, and I put some dynamic fields in schema.xml to store the results. My main problem now is how to search without knowing the language of the document I'm searching for. I don't want a huge query string like ?q=title_en:+term+subtitle_en:+term+title_de:+term... Okay, I could use copyField and copy all fields into the "text" field... but "text" has the type text_general, so the language-specific indexing is not applied. I could at least use a combined field for every language (like text_en, text_fr...), but my query string still gets very long, and adding new languages is terribly uncomfortable. So, what can I do? Is there a better solution to index and search documents in many languages without knowing the language of the document and of the query beforehand? - Geschan
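(Following that suggestion, a sketch of one per-language request handler; the handler name and the exact field names are illustrative only, taken from the fields mentioned in this thread:)

<requestHandler name="/desearch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- search only the German variants of the fields -->
    <str name="qf">title_de subtitle_de content_de</str>
    <!-- alias the language-specific stored fields back to generic names -->
    <str name="fl">id,title:title_de,subtitle:subtitle_de</str>
  </lst>
</requestHandler>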
Very bad search performance with group=true
Hi, I'm indexing PDF documents to use full-text search with Solr. To get the number of the page where a result was found, I index every page as a separate document and group the results on a field called doc_id. (See this topic: http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201303.mbox/%3c1362242815.4092.140661199082425.338ed...@webmail.messagingengine.com%3E ) This works fine if I search in a single document, but if I search over the whole database for a term, the results are really, really slow, especially if group.limit is above 10. I have indexed about 150,000 pages for now, but in the end it will be more than 1,000,000 pages. How can I improve search performance? I'm using this request handler configuration (element markup lost; the values were, roughly in order):
echoParams explicit, wt json, indent true, df text, defType edismax
qf: id^10.0 ean^10.0 title^10.0 subtitle^10.0 original_title^5.0 content^3.0 content_en^3.0 content_fr^3.0 content_de^3.0 content_it^3.0 content_es^3.0 keyword^5.0 text^0.5 author^2.0 editor^1.0 publisher^3.0 category^1.0 series^5.0 information^1.0
mm 100%, q.alt *:*, rows 10
fl: id, title, subtitle, original_title, author, editor, publisher, category, series, score
grouping: true, doc_id, 20, true
highlighting: content_*, content, true
Thanks for your help. - Gesh
Get page number of searchresult of a pdf in solr
Hello, I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page. So what I need is the page number and a short text snippet for every search result. I'm using Solr 4.1 for indexing the PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result. I only get the document the search term was found in. -Gesh
Re: Get page number of searchresult of a pdf in solr
Is it possible to write a plugin that converts each page separately with Tika and saves all pages in one document (maybe in a dynamic field like "page_*")? I would like to have only one document stored in Solr for each PDF (it fits better with the way my web application manages these documents, and I would like to use the same id as the unique identifier). To be honest, I can't understand why Solr is not able to tell me the page where the search term was found. It seems like a quite common task to me. -Gesh Quoting Michael Della Bitta: My guess is the best way to do this is to index each page separately and to store a link to the PDF/page in each document. That would probably require you to preprocess the PDFs to turn each one into a single page per PDF, or to extract the text per page another way. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn't a Game On Thu, Feb 28, 2013 at 3:26 PM, wrote: Hello, I'm building a web application where users can search for PDF documents and view them with pdf.js. I would like to display the search results with a short snippet of the paragraph where the search term was found and a link to open the document at the right page. So what I need is the page number and a short text snippet for every search result. I'm using Solr 4.1 for indexing the PDF documents. The indexing itself works fine, but I don't know how to get the page number and paragraph of a search result. I only get the document the search term was found in. -Gesh
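(A sketch of the per-page approach suggested above: each page becomes its own Solr document carrying the page number and a reference back to the PDF, so a hit directly yields the page, and results can be grouped per PDF on the doc_id field. The field names are placeholders, not a prescribed schema.)

<add>
  <doc>
    <!-- unique per page -->
    <field name="id">report.pdf_page_12</field>
    <!-- shared by all pages of one PDF, used for grouping -->
    <field name="doc_id">report.pdf</field>
    <field name="page">12</field>
    <field name="content">extracted text of page 12 ...</field>
  </doc>
</add>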
Help me understand these newrelic graphs
Here are some screenshots of our Solr Cloud cluster via New Relic: http://postimg.org/gallery/2hyzyeyc/ We currently have a 5 node cluster, and all indexing is done on separate machines and shipped over. Our machines are running on SSDs with 18G of RAM (index size is 8G). We only have 1 shard at the moment, with replicas on all 5 machines. I'm guessing that's a bit of a waste? How come when we do our bulk updating the response time actually decreases? I would think the load would be higher, and therefore the response time should be higher. Any way I can decrease the response time? Thanks
Re: Help me understand these newrelic graphs
Ahh.. its including the add operation. That makes sense I then. A bit silly on NR's part they don't break it down. Otis, our index is only 8G so I don't consider that big by any means but our queries can get a bit complex with a bit of faceting. Do you still think it makes sense to shard? How easy would this be to get working? On Thu, Mar 13, 2014 at 4:02 PM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > Hi, > > I think NR has support for breaking by handler, no? Just checked - no. > Only webapp controller, but that doesn't apply to Solr. > > SPM should be more helpful when it comes to monitoring Solr - you can > filter by host, handler, collection/core, etc. -- you can see the demo - > https://apps.sematext.com/demo - though this is plain Solr, not SolrCloud. > > If your index is big or queries are complex, shard it and parallelize > search. > > Otis > -- > Performance Monitoring * Log Analytics * Search Analytics > Solr & Elasticsearch Support * http://sematext.com/ > > > On Thu, Mar 13, 2014 at 6:17 PM, ralph tice wrote: > > > I think your response time is including the average response for an add > > operation, which generally returns very quickly and due to sheer number > are > > averaging out the response time of your queries. New Relic should break > > out requests based on which handler they're hitting but they don't seem > to. > > > > > > On Thu, Mar 13, 2014 at 2:18 PM, Software Dev > >wrote: > > > > > Here are some screen shots of our Solr Cloud cluster via Newrelic > > > > > > http://postimg.org/gallery/2hyzyeyc/ > > > > > > We currently have a 5 node cluster and all indexing is done on separate > > > machines and shipped over. Our machines are running on SSD's with 18G > of > > > ram (Index size is 8G). We only have 1 shard at the moment with > replicas > > on > > > all 5 machines. I'm guessing thats a bit of a waste? > > > > > > How come when we do our bulk updating the response time actually > > decreases? > > > I would think the load would be higher therefor response time should be > > > higher. Any way I can decrease the response time? > > > > > > Thanks > > > > > >
Re: Help me understand these newrelic graphs
If that is the case, what would help? On Thu, Mar 13, 2014 at 8:46 PM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > It really depends, hard to give a definitive instruction without more > pieces of info. > e.g. if your CPUs are all maxed out and you already have a high number of > concurrent queries than sharding may not be of any help at all. > > Otis > -- > Performance Monitoring * Log Analytics * Search Analytics > Solr & Elasticsearch Support * http://sematext.com/ > > > On Thu, Mar 13, 2014 at 7:42 PM, Software Dev >wrote: > > > Ahh.. its including the add operation. That makes sense I then. A bit > silly > > on NR's part they don't break it down. > > > > Otis, our index is only 8G so I don't consider that big by any means but > > our queries can get a bit complex with a bit of faceting. Do you still > > think it makes sense to shard? How easy would this be to get working? > > > > > > On Thu, Mar 13, 2014 at 4:02 PM, Otis Gospodnetic < > > otis.gospodne...@gmail.com> wrote: > > > > > Hi, > > > > > > I think NR has support for breaking by handler, no? Just checked - no. > > > Only webapp controller, but that doesn't apply to Solr. > > > > > > SPM should be more helpful when it comes to monitoring Solr - you can > > > filter by host, handler, collection/core, etc. -- you can see the demo > - > > > https://apps.sematext.com/demo - though this is plain Solr, not > > SolrCloud. > > > > > > If your index is big or queries are complex, shard it and parallelize > > > search. > > > > > > Otis > > > -- > > > Performance Monitoring * Log Analytics * Search Analytics > > > Solr & Elasticsearch Support * http://sematext.com/ > > > > > > > > > On Thu, Mar 13, 2014 at 6:17 PM, ralph tice > > wrote: > > > > > > > I think your response time is including the average response for an > add > > > > operation, which generally returns very quickly and due to sheer > number > > > are > > > > averaging out the response time of your queries. New Relic should > > break > > > > out requests based on which handler they're hitting but they don't > seem > > > to. > > > > > > > > > > > > On Thu, Mar 13, 2014 at 2:18 PM, Software Dev < > > static.void@gmail.com > > > > >wrote: > > > > > > > > > Here are some screen shots of our Solr Cloud cluster via Newrelic > > > > > > > > > > http://postimg.org/gallery/2hyzyeyc/ > > > > > > > > > > We currently have a 5 node cluster and all indexing is done on > > separate > > > > > machines and shipped over. Our machines are running on SSD's with > 18G > > > of > > > > > ram (Index size is 8G). We only have 1 shard at the moment with > > > replicas > > > > on > > > > > all 5 machines. I'm guessing thats a bit of a waste? > > > > > > > > > > How come when we do our bulk updating the response time actually > > > > decreases? > > > > > I would think the load would be higher therefor response time > should > > be > > > > > higher. Any way I can decrease the response time? > > > > > > > > > > Thanks > > > > > > > > > > > > > > >
Re: Help me understand these newrelic graphs
Here is a screenshot of the host information: http://postimg.org/image/vub5ihxix/ As you can see we have 24 core CPU's and the load is only at 5-7.5. On Fri, Mar 14, 2014 at 10:02 AM, Software Dev wrote: > If that is the case, what would help? > > > On Thu, Mar 13, 2014 at 8:46 PM, Otis Gospodnetic < > otis.gospodne...@gmail.com> wrote: > >> It really depends, hard to give a definitive instruction without more >> pieces of info. >> e.g. if your CPUs are all maxed out and you already have a high number of >> concurrent queries than sharding may not be of any help at all. >> >> Otis >> -- >> Performance Monitoring * Log Analytics * Search Analytics >> Solr & Elasticsearch Support * http://sematext.com/ >> >> >> On Thu, Mar 13, 2014 at 7:42 PM, Software Dev > >wrote: >> >> > Ahh.. its including the add operation. That makes sense I then. A bit >> silly >> > on NR's part they don't break it down. >> > >> > Otis, our index is only 8G so I don't consider that big by any means but >> > our queries can get a bit complex with a bit of faceting. Do you still >> > think it makes sense to shard? How easy would this be to get working? >> > >> > >> > On Thu, Mar 13, 2014 at 4:02 PM, Otis Gospodnetic < >> > otis.gospodne...@gmail.com> wrote: >> > >> > > Hi, >> > > >> > > I think NR has support for breaking by handler, no? Just checked - >> no. >> > > Only webapp controller, but that doesn't apply to Solr. >> > > >> > > SPM should be more helpful when it comes to monitoring Solr - you can >> > > filter by host, handler, collection/core, etc. -- you can see the >> demo - >> > > https://apps.sematext.com/demo - though this is plain Solr, not >> > SolrCloud. >> > > >> > > If your index is big or queries are complex, shard it and parallelize >> > > search. >> > > >> > > Otis >> > > -- >> > > Performance Monitoring * Log Analytics * Search Analytics >> > > Solr & Elasticsearch Support * http://sematext.com/ >> > > >> > > >> > > On Thu, Mar 13, 2014 at 6:17 PM, ralph tice >> > wrote: >> > > >> > > > I think your response time is including the average response for an >> add >> > > > operation, which generally returns very quickly and due to sheer >> number >> > > are >> > > > averaging out the response time of your queries. New Relic should >> > break >> > > > out requests based on which handler they're hitting but they don't >> seem >> > > to. >> > > > >> > > > >> > > > On Thu, Mar 13, 2014 at 2:18 PM, Software Dev < >> > static.void@gmail.com >> > > > >wrote: >> > > > >> > > > > Here are some screen shots of our Solr Cloud cluster via Newrelic >> > > > > >> > > > > http://postimg.org/gallery/2hyzyeyc/ >> > > > > >> > > > > We currently have a 5 node cluster and all indexing is done on >> > separate >> > > > > machines and shipped over. Our machines are running on SSD's with >> 18G >> > > of >> > > > > ram (Index size is 8G). We only have 1 shard at the moment with >> > > replicas >> > > > on >> > > > > all 5 machines. I'm guessing thats a bit of a waste? >> > > > > >> > > > > How come when we do our bulk updating the response time actually >> > > > decreases? >> > > > > I would think the load would be higher therefor response time >> should >> > be >> > > > > higher. Any way I can decrease the response time? >> > > > > >> > > > > Thanks >> > > > > >> > > > >> > > >> > >> > >
Re: Help me understand these newrelic graphs
Otis, I want to get those spikes down lower if possible. As mentioned in the above posts that the 25ms timing you are seeing is not really accurate because that's the average response time for ALL requests including the bulk add operations which are generally super fast. Our true response time is around 100ms. On Fri, Mar 14, 2014 at 10:54 AM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > Are you trying to bring that 24.9 ms response time down? > Looks like there is room for more aggressive sharing there, yes. > > Otis > -- > Performance Monitoring * Log Analytics * Search Analytics > Solr & Elasticsearch Support * http://sematext.com/ > > > On Fri, Mar 14, 2014 at 1:07 PM, Software Dev >wrote: > > > Here is a screenshot of the host information: > > http://postimg.org/image/vub5ihxix/ > > > > As you can see we have 24 core CPU's and the load is only at 5-7.5. > > > > > > On Fri, Mar 14, 2014 at 10:02 AM, Software Dev < > static.void@gmail.com > > >wrote: > > > > > If that is the case, what would help? > > > > > > > > > On Thu, Mar 13, 2014 at 8:46 PM, Otis Gospodnetic < > > > otis.gospodne...@gmail.com> wrote: > > > > > >> It really depends, hard to give a definitive instruction without more > > >> pieces of info. > > >> e.g. if your CPUs are all maxed out and you already have a high number > > of > > >> concurrent queries than sharding may not be of any help at all. > > >> > > >> Otis > > >> -- > > >> Performance Monitoring * Log Analytics * Search Analytics > > >> Solr & Elasticsearch Support * http://sematext.com/ > > >> > > >> > > >> On Thu, Mar 13, 2014 at 7:42 PM, Software Dev < > > static.void@gmail.com > > >> >wrote: > > >> > > >> > Ahh.. its including the add operation. That makes sense I then. A > bit > > >> silly > > >> > on NR's part they don't break it down. > > >> > > > >> > Otis, our index is only 8G so I don't consider that big by any means > > but > > >> > our queries can get a bit complex with a bit of faceting. Do you > still > > >> > think it makes sense to shard? How easy would this be to get > working? > > >> > > > >> > > > >> > On Thu, Mar 13, 2014 at 4:02 PM, Otis Gospodnetic < > > >> > otis.gospodne...@gmail.com> wrote: > > >> > > > >> > > Hi, > > >> > > > > >> > > I think NR has support for breaking by handler, no? Just checked > - > > >> no. > > >> > > Only webapp controller, but that doesn't apply to Solr. > > >> > > > > >> > > SPM should be more helpful when it comes to monitoring Solr - you > > can > > >> > > filter by host, handler, collection/core, etc. -- you can see the > > >> demo - > > >> > > https://apps.sematext.com/demo - though this is plain Solr, not > > >> > SolrCloud. > > >> > > > > >> > > If your index is big or queries are complex, shard it and > > parallelize > > >> > > search. > > >> > > > > >> > > Otis > > >> > > -- > > >> > > Performance Monitoring * Log Analytics * Search Analytics > > >> > > Solr & Elasticsearch Support * http://sematext.com/ > > >> > > > > >> > > > > >> > > On Thu, Mar 13, 2014 at 6:17 PM, ralph tice > > > >> > wrote: > > >> > > > > >> > > > I think your response time is including the average response for > > an > > >> add > > >> > > > operation, which generally returns very quickly and due to sheer > > >> number > > >> > > are > > >> > > > averaging out the response time of your queries. New Relic > should > > >> > break > > >> > > > out requests based on which handler they're hitting but they > don't > > >> seem > > >> > > to. 
> > >> > > > > > >> > > > > > >> > > > On Thu, Mar 13, 2014 at 2:18 PM, Software Dev < > > >> > static.void@gmail.com > > >> > > > >wrote: > > >> > > > > > >> > > > > Here are some screen shots of our Solr Cloud cluster via > > Newrelic > > >> > > > > > > >> > > > > http://postimg.org/gallery/2hyzyeyc/ > > >> > > > > > > >> > > > > We currently have a 5 node cluster and all indexing is done on > > >> > separate > > >> > > > > machines and shipped over. Our machines are running on SSD's > > with > > >> 18G > > >> > > of > > >> > > > > ram (Index size is 8G). We only have 1 shard at the moment > with > > >> > > replicas > > >> > > > on > > >> > > > > all 5 machines. I'm guessing thats a bit of a waste? > > >> > > > > > > >> > > > > How come when we do our bulk updating the response time > actually > > >> > > > decreases? > > >> > > > > I would think the load would be higher therefor response time > > >> should > > >> > be > > >> > > > > higher. Any way I can decrease the response time? > > >> > > > > > > >> > > > > Thanks > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > > > > > > > >
Solr Cloud collection keep going down?
We have 2 collections with 1 shard each, replicated over 5 servers in the cluster. We see a lot of flapping (down or recovering) on one of the collections. When this happens, the other collection hosted on the same machine is still marked as active. When this happens it takes a fairly long time (~30 minutes) for the collection to come back online, if at all. I find that it's usually more reliable to completely shut down Solr on the affected machine and bring it back up with its core disabled. We then re-enable the core when it's marked as active. A few questions: 1) What is the healthcheck in SolrCloud? Put another way, what is failing that marks one collection as down but the other on the same machine as up? 2) Why does recovery take forever when a node goes down, even if it's only down for 30 seconds? Our index is only 7-8G and we are running on SSDs. 3) What can be done to diagnose and fix this problem?
Re: Solr Cloud collection keep going down?
iter.write(OutputStreamWriter.java:207) at org.apache.solr.util.FastWriter.flush(FastWriter.java:141) at org.apache.solr.util.FastWriter.write(FastWriter.java:55) at org.apache.solr.response.RubyWriter.writeStr(RubyResponseWriter.java:87) at org.apache.solr.response.JSONWriter.writeNamedListAsFlat(JSONResponseWriter.java:285) at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:301) at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188) at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183) at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299) at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188) at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183) at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299) at org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:188) at org.apache.solr.response.JSONWriter.writeNamedListAsMapWithDups(JSONResponseWriter.java:183) at org.apache.solr.response.JSONWriter.writeNamedList(JSONResponseWriter.java:299) at org.apache.solr.response.JSONWriter.writeResponse(JSONResponseWriter.java:95) at org.apache.solr.response.RubyResponseWriter.write(RubyResponseWriter.java:37) at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:768) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:440) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:368) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489) at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53) at org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240) at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72) at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264) at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:744) Caused by: java.net.SocketException: Connection reset at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118) at java.net.SocketOutputStream.write(SocketOutputStream.java:159) at org.eclipse.jetty.io.ByteArrayBuffer.writeTo(ByteArrayBuffer.java:375) at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:164) at org.eclipse.jetty.io.bio.StreamEndPoint.flush(StreamEndPoint.java:182) at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:838) ... 51 more ,code=500} On Sat, Mar 22, 2014 at 12:23 PM, Software Dev wrote: > We have 2 collections with 1 shard each replicated over 5 servers in the > cluster. We see a lot of flapping (down or recovering) on one of the > collections. When this happens the other collection hosted on the same > machine is still marked as active. When this happens it takes a fairly long > time (~30 minutes) for the collection to come back online, if at all. I find > that its usually more reliable to completely shutdown solr on the affected > machine and bring it back up with its core disabled. We then re-enable the > core when its marked as active. > > A few questions: > > 1) What is the healt
Re: Solr Cloud collection keep going down?
Shawn, Thanks for pointing me in the right direction. After consulting the above document I *think* that the problem may be too large of a heap and which may be affecting GC collection and hence causing ZK timeouts. We have around 20G of memory on these machines with a min/max of heap at 6, 8 respectively (-Xms6G -Xmx10G). The rest was allocated for aside for disk cache. Why did we choose 6-10? No other reason than we wanted to allot enough for disk cache and then everything else was thrown and Solr. Does this sound about right? I took some screenshots for VisualVM and our NewRelic reporting as well as some relevant portions of our SolrConfig.xml. Any thoughts/comments would be greatly appreciated. http://postimg.org/gallery/4t73sdks/1fc10f9c/ Thanks On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey wrote: > On 3/22/2014 1:23 PM, Software Dev wrote: >> We have 2 collections with 1 shard each replicated over 5 servers in the >> cluster. We see a lot of flapping (down or recovering) on one of the >> collections. When this happens the other collection hosted on the same >> machine is still marked as active. When this happens it takes a fairly long >> time (~30 minutes) for the collection to come back online, if at all. I >> find that its usually more reliable to completely shutdown solr on the >> affected machine and bring it back up with its core disabled. We then >> re-enable the core when its marked as active. >> >> A few questions: >> >> 1) What is the healthcheck in Solr-Cloud? Put another way, what is failing >> that marks one collection as down but the other on the same machine as up? >> >> 2) Why does recovery take forever when a node goes down.. even if its only >> down for 30 seconds. Our index is only 7-8G and we are running on SSD's. >> >> 3) What can be done to diagnose and fix this problem? > > Unless you are actually using the ping request handler, the healthcheck > config will not matter. Or were you referring to something else? > > Referencing the logs you included in your reply: The EofException > errors happen because your client code times out and disconnects before > the request it made has completed. That is most likely just a symptom > that has nothing at all to do with the problem. > > Read the following wiki page. What I'm going to say below will > reference information you can find there: > > http://wiki.apache.org/solr/SolrPerformanceProblems > > Relevant side note: The default zookeeper client timeout is 15 seconds. > A typical zookeeper config defines tickTime as 2 seconds, and the > timeout cannot be configured to be more than 20 times the tickTime, > which means it cannot go beyond 40 seconds. The default timeout value > 15 seconds is usually more than enough, unless you are having > performance problems. > > If you are not actually taking Solr instances down, then the fact that > you are seeing the log replay messages indicates to me that something is > taking so much time that the connection to Zookeeper times out. When it > finally responds, it will attempt to recover the index, which means > first it will replay the transaction log and then it might replicate the > index from the shard leader. > > Replaying the transaction log is likely the reason it takes so long to > recover. The wiki page I linked above has a "slow startup" section that > explains how to fix this. > > There is some kind of underlying problem that is causing the zookeeper > connection to timeout. 
It is most likely garbage collection pauses or > insufficient RAM to cache the index, possibly both. > > You did not indicate how much total RAM you have or how big your Java > heap is. As the wiki page mentions in the SSD section, SSD is not a > substitute for having enough RAM to cache at significant percentage of > your index. > > Thanks, > Shawn >
Question on highlighting edgegrams
In 3.5.0 we have the following field type (definition lost here; an analyzer using EdgeNGramFilterFactory with maxGramSize=30). If we searched for "c" with highlighting enabled we would get back results such as: cdat crocdile cool beans But in the latest Solr (4.7) we get the full words highlighted back. Did something change between these versions with regards to highlighting? Thanks
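(A sketch of the kind of field type being described, reconstructed from the fragments quoted later in the thread; only maxGramSize="30" and positionIncrementGap="100" appear in the original, so minGramSize=1 and the tokenizer choice are assumptions:)

<fieldType name="text_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index prefixes so that a query like "c" matches "crocdile" -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>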
Re: Question on highlighting edgegrams
Bump On Mon, Mar 24, 2014 at 3:00 PM, Software Dev wrote: > In 3.5.0 we have the following. > > positionIncrementGap="100"> > > > > maxGramSize="30"/> > > > > > > > > If we searched for "c" with highlighting enabled we would get back > results such as: > > cdat > crocdile > cool beans > > But in the latest Solr (4.7) we get the full words highlighted back. > Did something change from these versions with regards to highlighting? > > Thanks
Replication (Solr Cloud)
I see that by default in SolrCloud my collections are replicating. Should this be disabled in SolrCloud, since it already handles this? From the documentation: "The Replication screen shows you the current replication state for the named core you have specified. In Solr, replication is for the index only. SolrCloud has supplanted much of this functionality, but if you are still using index replication, you can use this screen to see the replication state:" I just want to make sure, before I disable it, that if we send an update to one server the document will be correctly replicated across all nodes. Thanks
Re: Replication (Solr Cloud)
Thanks for the reply. I'll make sure NOT to disable it.
Re: Solr Cloud collection keep going down?
Can anyone else chime in? Thanks On Mon, Mar 24, 2014 at 10:10 AM, Software Dev wrote: > Shawn, > > Thanks for pointing me in the right direction. After consulting the > above document I *think* that the problem may be too large of a heap > and which may be affecting GC collection and hence causing ZK > timeouts. > > We have around 20G of memory on these machines with a min/max of heap > at 6, 8 respectively (-Xms6G -Xmx10G). The rest was allocated for > aside for disk cache. Why did we choose 6-10? No other reason than we > wanted to allot enough for disk cache and then everything else was > thrown and Solr. Does this sound about right? > > I took some screenshots for VisualVM and our NewRelic reporting as > well as some relevant portions of our SolrConfig.xml. Any > thoughts/comments would be greatly appreciated. > > http://postimg.org/gallery/4t73sdks/1fc10f9c/ > > Thanks > > > > > On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey wrote: >> On 3/22/2014 1:23 PM, Software Dev wrote: >>> We have 2 collections with 1 shard each replicated over 5 servers in the >>> cluster. We see a lot of flapping (down or recovering) on one of the >>> collections. When this happens the other collection hosted on the same >>> machine is still marked as active. When this happens it takes a fairly long >>> time (~30 minutes) for the collection to come back online, if at all. I >>> find that its usually more reliable to completely shutdown solr on the >>> affected machine and bring it back up with its core disabled. We then >>> re-enable the core when its marked as active. >>> >>> A few questions: >>> >>> 1) What is the healthcheck in Solr-Cloud? Put another way, what is failing >>> that marks one collection as down but the other on the same machine as up? >>> >>> 2) Why does recovery take forever when a node goes down.. even if its only >>> down for 30 seconds. Our index is only 7-8G and we are running on SSD's. >>> >>> 3) What can be done to diagnose and fix this problem? >> >> Unless you are actually using the ping request handler, the healthcheck >> config will not matter. Or were you referring to something else? >> >> Referencing the logs you included in your reply: The EofException >> errors happen because your client code times out and disconnects before >> the request it made has completed. That is most likely just a symptom >> that has nothing at all to do with the problem. >> >> Read the following wiki page. What I'm going to say below will >> reference information you can find there: >> >> http://wiki.apache.org/solr/SolrPerformanceProblems >> >> Relevant side note: The default zookeeper client timeout is 15 seconds. >> A typical zookeeper config defines tickTime as 2 seconds, and the >> timeout cannot be configured to be more than 20 times the tickTime, >> which means it cannot go beyond 40 seconds. The default timeout value >> 15 seconds is usually more than enough, unless you are having >> performance problems. >> >> If you are not actually taking Solr instances down, then the fact that >> you are seeing the log replay messages indicates to me that something is >> taking so much time that the connection to Zookeeper times out. When it >> finally responds, it will attempt to recover the index, which means >> first it will replay the transaction log and then it might replicate the >> index from the shard leader. >> >> Replaying the transaction log is likely the reason it takes so long to >> recover. The wiki page I linked above has a "slow startup" section that >> explains how to fix this. 
>> >> There is some kind of underlying problem that is causing the zookeeper >> connection to timeout. It is most likely garbage collection pauses or >> insufficient RAM to cache the index, possibly both. >> >> You did not indicate how much total RAM you have or how big your Java >> heap is. As the wiki page mentions in the SSD section, SSD is not a >> substitute for having enough RAM to cache at significant percentage of >> your index. >> >> Thanks, >> Shawn >>
Re: Replication (Solr Cloud)
One other question. If I optimize a collection on one node, does this get replicated to all others when finished? On Tue, Mar 25, 2014 at 10:13 AM, Software Dev wrote: > Thanks for the reply. Ill make sure NOT to disable it.
Re: Replication (Solr Cloud)
Ehh.. found out the hard way. I optimized the collection on 1 machine and when it was completed it replicated to the others and took my cluster down. Shitty On Tue, Mar 25, 2014 at 10:46 AM, Software Dev wrote: > One other question. If I optimize a collection on one node, does this > get replicated to all others when finished? > > On Tue, Mar 25, 2014 at 10:13 AM, Software Dev > wrote: >> Thanks for the reply. Ill make sure NOT to disable it.
Re: Replication (Solr Cloud)
So its generally a bad idea to optimize I gather? - In older versions it might have done them all at once, but I believe that newer versions only do one core at a time. On Tue, Mar 25, 2014 at 11:16 AM, Shawn Heisey wrote: > On 3/25/2014 11:59 AM, Software Dev wrote: >> >> Ehh.. found out the hard way. I optimized the collection on 1 machine >> and when it was completed it replicated to the others and took my >> cluster down. Shitty > > > It doesn't get replicated -- each core in the collection will be optimized. > In older versions it might have done them all at once, but I believe that > newer versions only do one core at a time. > > Doing an optimize on a Solr core results in a LOT of I/O. If your Solr > install is having performance issues, that will push it over the edge. When > SolrCloud ends up with a performance problem in one place, they tend to > multiply and cause MORE problems. It can get bad enough that the whole > cluster goes down because it's trying to do a recovery on every node. For > that reason, it's extremely important that you have enough system resources > available across your cloud (RAM in particular) to avoid performance issues. > > Thanks, > Shawn >
Re: Replication (Solr Cloud)
"In older versions it might have done them all at once, but I believe that newer versions only do one core at a time." It looks like it did it all at once and I'm on the latest (4.7) On Tue, Mar 25, 2014 at 11:27 AM, Software Dev wrote: > So its generally a bad idea to optimize I gather? > > - In older versions it might have done them all at once, but I believe > that newer versions only do one core at a time. > > On Tue, Mar 25, 2014 at 11:16 AM, Shawn Heisey wrote: >> On 3/25/2014 11:59 AM, Software Dev wrote: >>> >>> Ehh.. found out the hard way. I optimized the collection on 1 machine >>> and when it was completed it replicated to the others and took my >>> cluster down. Shitty >> >> >> It doesn't get replicated -- each core in the collection will be optimized. >> In older versions it might have done them all at once, but I believe that >> newer versions only do one core at a time. >> >> Doing an optimize on a Solr core results in a LOT of I/O. If your Solr >> install is having performance issues, that will push it over the edge. When >> SolrCloud ends up with a performance problem in one place, they tend to >> multiply and cause MORE problems. It can get bad enough that the whole >> cluster goes down because it's trying to do a recovery on every node. For >> that reason, it's extremely important that you have enough system resources >> available across your cloud (RAM in particular) to avoid performance issues. >> >> Thanks, >> Shawn >>
Re: Question on highlighting edgegrams
Same problem here: http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html On Tue, Mar 25, 2014 at 9:39 AM, Software Dev wrote: > Bump > > On Mon, Mar 24, 2014 at 3:00 PM, Software Dev > wrote: >> In 3.5.0 we have the following. >> >> > positionIncrementGap="100"> >> >> >> >> > maxGramSize="30"/> >> >> >> >> >> >> >> >> If we searched for "c" with highlighting enabled we would get back >> results such as: >> >> cdat >> crocdile >> cool beans >> >> But in the latest Solr (4.7) we get the full words highlighted back. >> Did something change from these versions with regards to highlighting? >> >> Thanks
What contributes to disk IO?
What are the main contributing factors for Solr Cloud generating a lot of disk IO? A lot of reads? Writes? Insufficient RAM? I would think if there was enough disk cache available for the whole index there would be little to no disk IO.
Re: Question on highlighting edgegrams
Is this a known bug? On Tue, Mar 25, 2014 at 1:12 PM, Software Dev wrote: > Same problem here: > http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html > > On Tue, Mar 25, 2014 at 9:39 AM, Software Dev > wrote: >> Bump >> >> On Mon, Mar 24, 2014 at 3:00 PM, Software Dev >> wrote: >>> In 3.5.0 we have the following. >>> >>> >> positionIncrementGap="100"> >>> >>> >>> >>> >> maxGramSize="30"/> >>> >>> >>> >>> >>> >>> >>> >>> If we searched for "c" with highlighting enabled we would get back >>> results such as: >>> >>> cdat >>> crocdile >>> cool beans >>> >>> But in the latest Solr (4.7) we get the full words highlighted back. >>> Did something change from these versions with regards to highlighting? >>> >>> Thanks
What are my options?
We have a collection named "items". These are simply products that we sell. A large part of our scoring involves boosting on certain metrics for each product (amount sold, total GMS, ratings, etc.). Some of these metrics are actually split across multiple tables. We currently re-index the complete document any time any of these values changes. I'm wondering if there is a better way. Some ideas:
1) Partially update the document. Is this even possible?
2) Add a parent-child relationship between an item and its metrics?
3) Dump all metrics to a file and use that as it changes throughout the day? I forgot the actual component that does it. Either way, can it handle multiple values?
4) Something else?
I appreciate any feedback. Thanks
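(Regarding option 1: Solr 4.x does support atomic updates, provided the fields are stored, so only the changed metrics have to be sent while the rest of the document is rebuilt from stored values. A sketch in the XML update format; the field names are placeholders:)

<add>
  <doc>
    <field name="id">item-12345</field>
    <!-- "set" replaces just these fields without resending the whole document -->
    <field name="amount_sold" update="set">4217</field>
    <field name="rating" update="set">4.6</field>
  </doc>
</add>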
Re: Question on highlighting edgegrams
Certainly I am not the only user experiencing this? On Wed, Mar 26, 2014 at 1:11 PM, Software Dev wrote: > Is this a known bug? > > On Tue, Mar 25, 2014 at 1:12 PM, Software Dev > wrote: >> Same problem here: >> http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html >> >> On Tue, Mar 25, 2014 at 9:39 AM, Software Dev >> wrote: >>> Bump >>> >>> On Mon, Mar 24, 2014 at 3:00 PM, Software Dev >>> wrote: >>>> In 3.5.0 we have the following. >>>> >>>> >>> positionIncrementGap="100"> >>>> >>>> >>>> >>>> >>> maxGramSize="30"/> >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>>> If we searched for "c" with highlighting enabled we would get back >>>> results such as: >>>> >>>> cdat >>>> crocdile >>>> cool beans >>>> >>>> But in the latest Solr (4.7) we get the full words highlighted back. >>>> Did something change from these versions with regards to highlighting? >>>> >>>> Thanks
Re: Question on highlighting edgegrams
Shalin, I am running 4.7 and seeing this behavior :( On Thu, Mar 27, 2014 at 10:36 PM, Shalin Shekhar Mangar wrote: > Yes, there are known bugs with EdgeNGram filters. I think they are fixed in > 4.4 > > See https://issues.apache.org/jira/browse/LUCENE-3907 > > On Fri, Mar 28, 2014 at 10:17 AM, Software Dev > wrote: >> Certainly I am not the only user experiencing this? >> >> On Wed, Mar 26, 2014 at 1:11 PM, Software Dev >> wrote: >>> Is this a known bug? >>> >>> On Tue, Mar 25, 2014 at 1:12 PM, Software Dev >>> wrote: >>>> Same problem here: >>>> http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html >>>> >>>> On Tue, Mar 25, 2014 at 9:39 AM, Software Dev >>>> wrote: >>>>> Bump >>>>> >>>>> On Mon, Mar 24, 2014 at 3:00 PM, Software Dev >>>>> wrote: >>>>>> In 3.5.0 we have the following. >>>>>> >>>>>> >>>>> positionIncrementGap="100"> >>>>>> >>>>>> >>>>>> >>>>>> >>>>> maxGramSize="30"/> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> If we searched for "c" with highlighting enabled we would get back >>>>>> results such as: >>>>>> >>>>>> cdat >>>>>> crocdile >>>>>> cool beans >>>>>> >>>>>> But in the latest Solr (4.7) we get the full words highlighted back. >>>>>> Did something change from these versions with regards to highlighting? >>>>>> >>>>>> Thanks > > > > -- > Regards, > Shalin Shekhar Mangar.
Highlighting bug with edgegrams
In 3.5.0 we have the following field type (definition lost here; an analyzer using EdgeNGramFilterFactory). If we searched for "c" with highlighting enabled we would get back results such as: cdat crocdile cool beans But in the latest Solr (4.7.1) we get the full words highlighted back. Did something change between these versions with regards to highlighting? Thanks Found an old post but no info: http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html
Re: Sharding and replicas (Solr Cloud)
Sorry about the confusion. I meant I created my config via the ZkCLI and then I wanted to create my core via the Collections API. I *think* I have it working, but I was wondering why there is a crazy number of core names under the admin "Core Selector"? When I create X shards via the bootstrap command, I think it only creates 1 core. Am I missing something? On Thu, Nov 7, 2013 at 1:06 PM, Shawn Heisey wrote: > On 11/7/2013 1:58 PM, Mark wrote: > >> If I create my collection via the ZkCLI (https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities) how do I configure the number of shards and replicas? >> > > I was not aware that you could create collections with zkcli. I did not think that was possible. Use the collections API: > > http://wiki.apache.org/solr/SolrCloud#Managing_collections_via_the_Collections_API > > Thanks, > Shawn > >
Re: Sharding and replicas (Solr Cloud)
I too want to be in control of everything that is created. Here is what I'm trying to do:
1) Start up a cluster of 5 Solr instances
2) Import the configuration to ZooKeeper
3) Manually create a collection via the Collections API with the number of shards and the replication factor
Now there are some issues with step 3. After creating the collection and reloading the GUI I always see: - *collection1:* org.apache.solr.common.cloud.ZooKeeperException:org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null until I restart the cluster. Is there a way around this? Also, after creating the collection it creates directories in the Solr home. So in this example it created ${SOLR_HOME}/collection1_shard1_replica1 and ${SOLR_HOME}/collection1_shard1_replica2. What happens if I rename both of these to the same name in the core admin? On Thu, Nov 7, 2013 at 3:15 PM, Shawn Heisey wrote: > On 11/7/2013 2:52 PM, Software Dev wrote: > >> Sorry about the confusion. I meant I created my config via the ZkCLI and >> then I wanted to create my core via the Collections API. I *think* I have it >> working but was wondering why there are a crazy amount of core names under >> the admin "Core Selector"? >> >> When I create X amount of shards via the bootstrap command I think it only >> creates 1 core. Am I missing something? >> > > If you create it with numShards=1 and replicationFactor=2, you'll end up with a total of 2 cores across all your Solr instances. For my simple cloud install, these are the numbers that I'm using. One shard, a total of two copies. > > If you create it with the numbers given on the wiki page, numShards=3 and replicationFactor=4, there would be a total of 12 cores created across all your servers. The maxShardsPerNode parameter defaults to 1, which means that only 1 core per instance (SolrCloud node) is allowed for that collection. If there aren't enough Solr instances for the numbers you have entered, the creation will fail. > > I don't know any details about what the bootstrap_conf parameter actually does when it creates collections. I've never used it - I want to be in control of the configs and collections that get created. > > Thanks, > Shawn > >
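(For reference, the two manual steps roughly look like this; the host, port, ZooKeeper address, paths and the config name "myconf" are placeholders. The collection.configName parameter is what ties the new collection to the configset uploaded in the first step:)

./zkcli.sh -zkhost localhost:2181 -cmd upconfig -confdir ./solr/collection1/conf -confname myconf

http://localhost:8983/solr/admin/collections?action=CREATE&name=collection1&numShards=1&replicationFactor=2&maxShardsPerNode=1&collection.configName=myconf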
Solr Cloud Bulk Indexing Questions
We are testing our shiny new Solr Cloud architecture but we are experiencing some issues when doing bulk indexing. We have 5 Solr Cloud machines running and 3 indexing machines (separate from the cloud servers). The indexing machines pull ids off a queue, then they index and ship over a document via a CloudSolrServer. It appears that the indexers are too fast, because the load (particularly disk IO) on the Solr Cloud machines spikes through the roof, making the entire cluster unusable. It's kind of odd because the total index size is not even large, i.e. < 10GB. Are there any optimizations/enhancements I could try to help alleviate these problems? I should note that for the above collection we only have 1 shard that's replicated across all machines, so all machines have the full index. Would we benefit from switching to a ConcurrentUpdateSolrServer where all updates get sent to 1 machine and 1 machine only? We could then remove this machine from the part of the cluster that handles user requests. Thanks for any input.
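For context, a minimal sketch of the indexer side as described above, assuming SolrJ 4.x; the ZooKeeper address, collection name, and field names are hypothetical.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BulkIndexer {
        public static void main(String[] args) throws Exception {
            // Hypothetical ZooKeeper ensemble address and collection name.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);        // hypothetical fields
                doc.addField("body", "some text " + i);
                batch.add(doc);
            }

            // Sending documents in batches keeps per-request overhead down
            // compared with one add() call per document.
            server.add(batch);
            server.shutdown();
        }
    }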
Re: Solr Cloud Bulk Indexing Questions
We commit have a soft commit every 5 seconds and hard commit every 30. As far as docs/second it would guess around 200/sec which doesn't seem that high. On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson wrote: > Questions: How often do you commit your updates? What is your > indexing rate in docs/second? > > In a SolrCloud setup, you should be using a CloudSolrServer. If the > server is having trouble keeping up with updates, switching to CUSS > probably wouldn't help. > > So I suspect there's something not optimal about your setup that's > the culprit. > > Best, > Erick > > On Mon, Jan 20, 2014 at 4:00 PM, Software Dev > wrote: > > We are testing our shiny new Solr Cloud architecture but we are > > experiencing some issues when doing bulk indexing. > > > > We have 5 solr cloud machines running and 3 indexing machines (separate > > from the cloud servers). The indexing machines pull off ids from a queue > > then they index and ship over a document via a CloudSolrServer. It > appears > > that the indexers are too fast because the load (particularly disk io) on > > the solr cloud machines spikes through the roof making the entire cluster > > unusable. It's kind of odd because the total index size is not even > > large..ie, < 10GB. Are there any optimization/enhancements I could try to > > help alleviate these problems? > > > > I should note that for the above collection we have only have 1 shard > thats > > replicated across all machines so all machines have the full index. > > > > Would we benefit from switching to a ConcurrentUpdateSolrServer where all > > updates get sent to 1 machine and 1 machine only? We could then remove > this > > machine from our cluster than that handles user requests. > > > > Thanks for any input. >
Re: Solr Cloud Bulk Indexing Questions
We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all updates get sent to one machine or something? On Mon, Jan 20, 2014 at 2:42 PM, Software Dev wrote: > We commit have a soft commit every 5 seconds and hard commit every 30. As > far as docs/second it would guess around 200/sec which doesn't seem that > high. > > > On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson > wrote: > >> Questions: How often do you commit your updates? What is your >> indexing rate in docs/second? >> >> In a SolrCloud setup, you should be using a CloudSolrServer. If the >> server is having trouble keeping up with updates, switching to CUSS >> probably wouldn't help. >> >> So I suspect there's something not optimal about your setup that's >> the culprit. >> >> Best, >> Erick >> >> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev >> wrote: >> > We are testing our shiny new Solr Cloud architecture but we are >> > experiencing some issues when doing bulk indexing. >> > >> > We have 5 solr cloud machines running and 3 indexing machines (separate >> > from the cloud servers). The indexing machines pull off ids from a queue >> > then they index and ship over a document via a CloudSolrServer. It >> appears >> > that the indexers are too fast because the load (particularly disk io) >> on >> > the solr cloud machines spikes through the roof making the entire >> cluster >> > unusable. It's kind of odd because the total index size is not even >> > large..ie, < 10GB. Are there any optimization/enhancements I could try >> to >> > help alleviate these problems? >> > >> > I should note that for the above collection we have only have 1 shard >> thats >> > replicated across all machines so all machines have the full index. >> > >> > Would we benefit from switching to a ConcurrentUpdateSolrServer where >> all >> > updates get sent to 1 machine and 1 machine only? We could then remove >> this >> > machine from our cluster than that handles user requests. >> > >> > Thanks for any input. >> > >
Re: Solr Cloud Bulk Indexing Questions
4.6.0 On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller wrote: > What version are you running? > > - Mark > > On Jan 20, 2014, at 5:43 PM, Software Dev > wrote: > > > We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all > > updates get sent to one machine or something? > > > > > > On Mon, Jan 20, 2014 at 2:42 PM, Software Dev >wrote: > > > >> We commit have a soft commit every 5 seconds and hard commit every 30. > As > >> far as docs/second it would guess around 200/sec which doesn't seem that > >> high. > >> > >> > >> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson < > erickerick...@gmail.com>wrote: > >> > >>> Questions: How often do you commit your updates? What is your > >>> indexing rate in docs/second? > >>> > >>> In a SolrCloud setup, you should be using a CloudSolrServer. If the > >>> server is having trouble keeping up with updates, switching to CUSS > >>> probably wouldn't help. > >>> > >>> So I suspect there's something not optimal about your setup that's > >>> the culprit. > >>> > >>> Best, > >>> Erick > >>> > >>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev < > static.void@gmail.com> > >>> wrote: > >>>> We are testing our shiny new Solr Cloud architecture but we are > >>>> experiencing some issues when doing bulk indexing. > >>>> > >>>> We have 5 solr cloud machines running and 3 indexing machines > (separate > >>>> from the cloud servers). The indexing machines pull off ids from a > queue > >>>> then they index and ship over a document via a CloudSolrServer. It > >>> appears > >>>> that the indexers are too fast because the load (particularly disk io) > >>> on > >>>> the solr cloud machines spikes through the roof making the entire > >>> cluster > >>>> unusable. It's kind of odd because the total index size is not even > >>>> large..ie, < 10GB. Are there any optimization/enhancements I could try > >>> to > >>>> help alleviate these problems? > >>>> > >>>> I should note that for the above collection we have only have 1 shard > >>> thats > >>>> replicated across all machines so all machines have the full index. > >>>> > >>>> Would we benefit from switching to a ConcurrentUpdateSolrServer where > >>> all > >>>> updates get sent to 1 machine and 1 machine only? We could then remove > >>> this > >>>> machine from our cluster than that handles user requests. > >>>> > >>>> Thanks for any input. > >>> > >> > >> > >
Removing a node from Solr Cloud
What is the process for completely removing a node from Solr Cloud? We recently removed one but it's still showing up as "Gone" in the Cloud admin. Thanks
Setting leaderVoteWait for auto discovered cores
How is this accomplished? We currently have an empty solr.xml (core auto-discovery), so I'm not sure where to put this value.
Re: Removing a node from Solr Cloud
Thanks. Anyway to accomplish this if the machine crashed (ie, can't unload it from that admin)? On Tue, Jan 21, 2014 at 11:25 AM, Anshum Gupta wrote: > You could unload the cores. This optionally also deletes the data and > instance directory. > Look at http://wiki.apache.org/solr/CoreAdmin#UNLOAD. > > > On Tue, Jan 21, 2014 at 10:22 AM, Software Dev >wrote: > > > What is the process for completely removing a node from Solr Cloud? We > > recently removed one but t its still showing up as "Gone" in the Cloud > > admin. > > > > Thanks > > > > > > -- > > Anshum Gupta > http://www.anshumgupta.net >
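A minimal SolrJ sketch of the UNLOAD call Anshum points to, assuming the node hosting the core is still reachable (which, as noted, is not the case for a crashed machine); the host and core name are hypothetical.

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.CoreAdminRequest;

    public class UnloadCoreExample {
        public static void main(String[] args) throws Exception {
            // Point at the Solr instance that hosts the core (hypothetical URL).
            HttpSolrServer adminServer = new HttpSolrServer("http://solr-node1:8983/solr");

            // Unload the core; the data and instance directories can optionally be
            // deleted with the UNLOAD parameters described on the wiki page.
            CoreAdminRequest.unloadCore("collection1_shard1_replica1", adminServer);

            adminServer.shutdown();
        }
    }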
Re: Solr Cloud Bulk Indexing Questions
Any other suggestions? On Mon, Jan 20, 2014 at 2:49 PM, Software Dev wrote: > 4.6.0 > > > On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller wrote: > >> What version are you running? >> >> - Mark >> >> On Jan 20, 2014, at 5:43 PM, Software Dev >> wrote: >> >> > We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do all >> > updates get sent to one machine or something? >> > >> > >> > On Mon, Jan 20, 2014 at 2:42 PM, Software Dev < >> static.void@gmail.com>wrote: >> > >> >> We commit have a soft commit every 5 seconds and hard commit every 30. >> As >> >> far as docs/second it would guess around 200/sec which doesn't seem >> that >> >> high. >> >> >> >> >> >> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson < >> erickerick...@gmail.com>wrote: >> >> >> >>> Questions: How often do you commit your updates? What is your >> >>> indexing rate in docs/second? >> >>> >> >>> In a SolrCloud setup, you should be using a CloudSolrServer. If the >> >>> server is having trouble keeping up with updates, switching to CUSS >> >>> probably wouldn't help. >> >>> >> >>> So I suspect there's something not optimal about your setup that's >> >>> the culprit. >> >>> >> >>> Best, >> >>> Erick >> >>> >> >>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev < >> static.void@gmail.com> >> >>> wrote: >> >>>> We are testing our shiny new Solr Cloud architecture but we are >> >>>> experiencing some issues when doing bulk indexing. >> >>>> >> >>>> We have 5 solr cloud machines running and 3 indexing machines >> (separate >> >>>> from the cloud servers). The indexing machines pull off ids from a >> queue >> >>>> then they index and ship over a document via a CloudSolrServer. It >> >>> appears >> >>>> that the indexers are too fast because the load (particularly disk >> io) >> >>> on >> >>>> the solr cloud machines spikes through the roof making the entire >> >>> cluster >> >>>> unusable. It's kind of odd because the total index size is not even >> >>>> large..ie, < 10GB. Are there any optimization/enhancements I could >> try >> >>> to >> >>>> help alleviate these problems? >> >>>> >> >>>> I should note that for the above collection we have only have 1 shard >> >>> thats >> >>>> replicated across all machines so all machines have the full index. >> >>>> >> >>>> Would we benefit from switching to a ConcurrentUpdateSolrServer where >> >>> all >> >>>> updates get sent to 1 machine and 1 machine only? We could then >> remove >> >>> this >> >>>> machine from our cluster than that handles user requests. >> >>>> >> >>>> Thanks for any input. >> >>> >> >> >> >> >> >> >
Re: Solr Cloud Bulk Indexing Questions
A suggestion would be to hard commit much less often, ie every 10 minutes, and see if there is a change. - Will try this How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache ? - We have 18G of ram 12 dedicated to Solr but as of right now the total index size is only 5GB Ah, and what about network IO ? Could that be a limiting factor ? - What is the size of your documents ? A few KB, MB, ... ? Under 1MB - Again, total index size is only 5GB so I dont know if this would be a problem On Wed, Jan 22, 2014 at 12:26 AM, Andre Bois-Crettez wrote: > 1 node having more load should be the leader (because of the extra work > of receiving and distributing updates, but my experiences show only a > bit more CPU usage, and no difference in disk IO). > > A suggestion would be to hard commit much less often, ie every 10 > minutes, and see if there is a change. > How much system RAM ? JVM Heap ? Enough space in RAM for system disk cache > ? > What is the size of your documents ? A few KB, MB, ... ? > Ah, and what about network IO ? Could that be a limiting factor ? > > > André > > > On 2014-01-21 23:40, Software Dev wrote: > >> Any other suggestions? >> >> >> On Mon, Jan 20, 2014 at 2:49 PM, Software Dev >> wrote: >> >> 4.6.0 >>> >>> >>> On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller >> >wrote: >>> >>> What version are you running? >>>> >>>> - Mark >>>> >>>> On Jan 20, 2014, at 5:43 PM, Software Dev >>>> wrote: >>>> >>>> We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do >>>>> all >>>>> updates get sent to one machine or something? >>>>> >>>>> >>>>> On Mon, Jan 20, 2014 at 2:42 PM, Software Dev < >>>>> >>>> static.void@gmail.com>wrote: >>>> >>>>> We commit have a soft commit every 5 seconds and hard commit every 30. >>>>>> >>>>> As >>>> >>>>> far as docs/second it would guess around 200/sec which doesn't seem >>>>>> >>>>> that >>>> >>>>> high. >>>>>> >>>>>> >>>>>> On Mon, Jan 20, 2014 at 2:26 PM, Erick Erickson < >>>>>> >>>>> erickerick...@gmail.com>wrote: >>>> >>>>> Questions: How often do you commit your updates? What is your >>>>>>> indexing rate in docs/second? >>>>>>> >>>>>>> In a SolrCloud setup, you should be using a CloudSolrServer. If the >>>>>>> server is having trouble keeping up with updates, switching to CUSS >>>>>>> probably wouldn't help. >>>>>>> >>>>>>> So I suspect there's something not optimal about your setup that's >>>>>>> the culprit. >>>>>>> >>>>>>> Best, >>>>>>> Erick >>>>>>> >>>>>>> On Mon, Jan 20, 2014 at 4:00 PM, Software Dev < >>>>>>> >>>>>> static.void@gmail.com> >>>> >>>>> wrote: >>>>>>> >>>>>>>> We are testing our shiny new Solr Cloud architecture but we are >>>>>>>> experiencing some issues when doing bulk indexing. >>>>>>>> >>>>>>>> We have 5 solr cloud machines running and 3 indexing machines >>>>>>>> >>>>>>> (separate >>>> >>>>> from the cloud servers). The indexing machines pull off ids from a >>>>>>>> >>>>>>> queue >>>> >>>>> then they index and ship over a document via a CloudSolrServer. It >>>>>>>> >>>>>>> appears >>>>>>> >>>>>>>> that the indexers are too fast because the load (particularly disk >>>>>>>> >>>>>>> io) >>>> >>>>> on >>>>>>> >>>>>>>> the solr cloud machines spikes through the roof making the entire >>>>>>>> >>>>>>> cluster >>>>>>> >>>>>>>> unusable. It's kind of odd because the total index size is not even >>>>>>>> large..ie, < 10GB. Are there any optimization/enhancements I could >>>>>>>> >>>>>>> try >>>> >>>>> to >>>>>>> >>>>>>>> help alleviate these problems? 
>> -- >> André Bois-Crettez >> Software Architect >> Search Developer >> http://www.kelkoo.com/
Re: Solr Cloud Bulk Indexing Questions
Thanks for suggestions. After reading that document I feel even more confused though because I always thought that hard commits should be less frequent that hard commits. Is there any way to configure autoCommit, softCommit values on a per request basis? The majority of the time we have small flow of updates coming in and we would like to see them in ASAP. However we occasionally need to do some bulk indexing (once a week or less) and the need to see those updates right away isn't as critical. I would say 95% of the time we are in "Index-Light Query-Light/Heavy" mode and the other 5% is "Index-Heavy Query-Light/Heavy" mode. Thanks On Wed, Jan 22, 2014 at 5:33 PM, Erick Erickson wrote: > When you're doing hard commits, is it with openSeacher = true or > false? It should probably be false... > > Here's a rundown of the soft/hard commit consequences: > > > http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ > > I suspect (but, of course, can't prove) that you're over-committing > and hitting segment > merges without meaning to... > > FWIW, > Erick > > On Wed, Jan 22, 2014 at 1:46 PM, Software Dev > wrote: > > A suggestion would be to hard commit much less often, ie every 10 > > minutes, and see if there is a change. > > > > - Will try this > > > > How much system RAM ? JVM Heap ? Enough space in RAM for system disk > cache ? > > > > - We have 18G of ram 12 dedicated to Solr but as of right now the total > > index size is only 5GB > > > > Ah, and what about network IO ? Could that be a limiting factor ? > > > > - What is the size of your documents ? A few KB, MB, ... ? > > > > Under 1MB > > > > - Again, total index size is only 5GB so I dont know if this would be a > > problem > > > > > > > > > > > > > > On Wed, Jan 22, 2014 at 12:26 AM, Andre Bois-Crettez > > wrote: > > > >> 1 node having more load should be the leader (because of the extra work > >> of receiving and distributing updates, but my experiences show only a > >> bit more CPU usage, and no difference in disk IO). > >> > >> A suggestion would be to hard commit much less often, ie every 10 > >> minutes, and see if there is a change. > >> How much system RAM ? JVM Heap ? Enough space in RAM for system disk > cache > >> ? > >> What is the size of your documents ? A few KB, MB, ... ? > >> Ah, and what about network IO ? Could that be a limiting factor ? > >> > >> > >> André > >> > >> > >> On 2014-01-21 23:40, Software Dev wrote: > >> > >>> Any other suggestions? > >>> > >>> > >>> On Mon, Jan 20, 2014 at 2:49 PM, Software Dev < > static.void@gmail.com> > >>> wrote: > >>> > >>> 4.6.0 > >>>> > >>>> > >>>> On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller >>>> >wrote: > >>>> > >>>> What version are you running? > >>>>> > >>>>> - Mark > >>>>> > >>>>> On Jan 20, 2014, at 5:43 PM, Software Dev > > >>>>> wrote: > >>>>> > >>>>> We also noticed that disk IO shoots up to 100% on 1 of the nodes. Do > >>>>>> all > >>>>>> updates get sent to one machine or something? > >>>>>> > >>>>>> > >>>>>> On Mon, Jan 20, 2014 at 2:42 PM, Software Dev < > >>>>>> > >>>>> static.void@gmail.com>wrote: > >>>>> > >>>>>> We commit have a soft commit every 5 seconds and hard commit every > 30. > >>>>>>> > >>>>>> As > >>>>> > >>>>>> far as docs/second it would guess around 200/sec which doesn't seem > >>>>>>> > >>>>>> that > >>>>> > >>>>>> high. 
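On the per-request question above: autoCommit and softCommit live in solrconfig.xml, but SolrJ does let a client attach a commitWithin deadline to individual update requests, which is one hedged way to loosen commit behaviour during bulk loads without touching the config. A sketch, with hypothetical connection details and field values:

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitWithinExample {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181");  // hypothetical zk host
            server.setDefaultCollection("collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");  // hypothetical field

            // Interactive path: ask Solr to make this visible within ~5 seconds.
            server.add(doc, 5000);

            // Bulk path: allow a much longer window, e.g. 10 minutes,
            // so many updates accumulate before a commit happens.
            server.add(doc, 10 * 60 * 1000);

            server.shutdown();
        }
    }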
Re: Solr Cloud Bulk Indexing Questions
Also, any suggestions on debugging? What should I look for and how? Thanks On Thu, Jan 23, 2014 at 10:01 AM, Software Dev wrote: > Thanks for suggestions. After reading that document I feel even more > confused though because I always thought that hard commits should be less > frequent that hard commits. > > Is there any way to configure autoCommit, softCommit values on a per > request basis? The majority of the time we have small flow of updates > coming in and we would like to see them in ASAP. However we occasionally > need to do some bulk indexing (once a week or less) and the need to see > those updates right away isn't as critical. > > I would say 95% of the time we are in "Index-Light Query-Light/Heavy" mode > and the other 5% is "Index-Heavy Query-Light/Heavy" mode. > > Thanks > > > On Wed, Jan 22, 2014 at 5:33 PM, Erick Erickson > wrote: > >> When you're doing hard commits, is it with openSeacher = true or >> false? It should probably be false... >> >> Here's a rundown of the soft/hard commit consequences: >> >> >> http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ >> >> I suspect (but, of course, can't prove) that you're over-committing >> and hitting segment >> merges without meaning to... >> >> FWIW, >> Erick >> >> On Wed, Jan 22, 2014 at 1:46 PM, Software Dev >> wrote: >> > A suggestion would be to hard commit much less often, ie every 10 >> > minutes, and see if there is a change. >> > >> > - Will try this >> > >> > How much system RAM ? JVM Heap ? Enough space in RAM for system disk >> cache ? >> > >> > - We have 18G of ram 12 dedicated to Solr but as of right now the total >> > index size is only 5GB >> > >> > Ah, and what about network IO ? Could that be a limiting factor ? >> > >> > - What is the size of your documents ? A few KB, MB, ... ? >> > >> > Under 1MB >> > >> > - Again, total index size is only 5GB so I dont know if this would be a >> > problem >> > >> > >> > >> > >> > >> > >> > On Wed, Jan 22, 2014 at 12:26 AM, Andre Bois-Crettez >> > wrote: >> > >> >> 1 node having more load should be the leader (because of the extra work >> >> of receiving and distributing updates, but my experiences show only a >> >> bit more CPU usage, and no difference in disk IO). >> >> >> >> A suggestion would be to hard commit much less often, ie every 10 >> >> minutes, and see if there is a change. >> >> How much system RAM ? JVM Heap ? Enough space in RAM for system disk >> cache >> >> ? >> >> What is the size of your documents ? A few KB, MB, ... ? >> >> Ah, and what about network IO ? Could that be a limiting factor ? >> >> >> >> >> >> André >> >> >> >> >> >> On 2014-01-21 23:40, Software Dev wrote: >> >> >> >>> Any other suggestions? >> >>> >> >>> >> >>> On Mon, Jan 20, 2014 at 2:49 PM, Software Dev < >> static.void@gmail.com> >> >>> wrote: >> >>> >> >>> 4.6.0 >> >>>> >> >>>> >> >>>> On Mon, Jan 20, 2014 at 2:47 PM, Mark Miller > >>>> >wrote: >> >>>> >> >>>> What version are you running? >> >>>>> >> >>>>> - Mark >> >>>>> >> >>>>> On Jan 20, 2014, at 5:43 PM, Software Dev < >> static.void@gmail.com> >> >>>>> wrote: >> >>>>> >> >>>>> We also noticed that disk IO shoots up to 100% on 1 of the nodes. >> Do >> >>>>>> all >> >>>>>> updates get sent to one machine or something? >> >>>>>> >> >>>>>> >> >>>>>> On Mon, Jan 20, 2014 at 2:42 PM, Software Dev < >> >>>>>> >> >>>>> static.void@gmail.com>wrote: >> >>>>> >> >>>>>> We commit have a soft commit every 5 seconds and hard commit every >> 30. 
Re: Solr Cloud Bulk Indexing Questions
Does maxWriteMBPerSec apply to NRTCachingDirectoryFactory? I only see maxMergeSizeMB and maxCachedMB as configuration values. On Thu, Jan 23, 2014 at 11:05 AM, Otis Gospodnetic < otis.gospodne...@gmail.com> wrote: > Hi, > > Have you tried maxWriteMBPerSec? > > http://search-lucene.com/?q=maxWriteMBPerSec&fc_project=Solr > > Otis > -- > Performance Monitoring * Log Analytics * Search Analytics > Solr & Elasticsearch Support * http://sematext.com/ > > > On Mon, Jan 20, 2014 at 4:00 PM, Software Dev >wrote: > > > We are testing our shiny new Solr Cloud architecture but we are > > experiencing some issues when doing bulk indexing. > > > > We have 5 solr cloud machines running and 3 indexing machines (separate > > from the cloud servers). The indexing machines pull off ids from a queue > > then they index and ship over a document via a CloudSolrServer. It > appears > > that the indexers are too fast because the load (particularly disk io) on > > the solr cloud machines spikes through the roof making the entire cluster > > unusable. It's kind of odd because the total index size is not even > > large..ie, < 10GB. Are there any optimization/enhancements I could try to > > help alleviate these problems? > > > > I should note that for the above collection we have only have 1 shard > thats > > replicated across all machines so all machines have the full index. > > > > Would we benefit from switching to a ConcurrentUpdateSolrServer where all > > updates get sent to 1 machine and 1 machine only? We could then remove > this > > machine from our cluster than that handles user requests. > > > > Thanks for any input. > > >
SolrCloudServer questions
Can someone clarify what the following options are: - updatesToLeaders - shutdownLBHttpSolrServer - parallelUpdates Also, I remember in older versions of Solr there was an efficient, more compact format that was used between SolrJ and Solr. Does this still exist in the latest version of Solr? If so, is it the default? Thanks
Disabling Commit/Auto-Commit (SolrCloud)
Is there a way to disable commit/hard-commit at runtime? For example, we usually have our hard commit and soft commit set really low, but when we do bulk indexing we would like to disable this to increase performance. If there isn't an easy way of doing this, would simply pushing a new solrconfig to SolrCloud work?
Re: SolrCloudServer questions
Which of any of these settings would be beneficial when bulk uploading? On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller wrote: > > > On Jan 31, 2014, at 1:56 PM, Greg Walters > wrote: > > > I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore > my response. > > > >> -updatesToLeaders > > > > Only send documents to shard leaders while indexing. This saves > cross-talk between slaves and leaders which results in more efficient > document routing. > > Right, but recently this has less of an affect because CloudSolrServer can > now hash documents and directly send them to the right place. This option > has become more historical. Just make sure you set the correct id field on > the CloudSolrServer instance for this hashing to work (I think it defaults > to "id"). > > > > >> shutdownLBHttpSolrServer > > > > CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute > requests (that aren't updates directly to leaders). Where did you find > this? I don't see this in the javadoc anywhere but it is a boolean in the > CloudSolrServer class. It looks like when you create a new CloudSolrServer > and pass it your own LBHttpSolrServer the boolean gets set to false and the > CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down. > > > >> parellelUpdates > > > > The javadoc's done have any description for this one but I checked out > the code for CloudSolrServer and if parallelUpdates it looks like it > executes update statements to multiple shards at the same time. > > Right, we should def add some javadoc, but this sends updates to shards in > parallel rather than with a single thread. Can really increase update > speed. Still not as powerful as using CloudSolrServer from multiple > threads, but a nice improvement non the less. > > > - Mark > > http://about.me/markrmiller > > > > > I'm no dev but I can read so please excuse any errors on my part. > > > > Thanks, > > Greg > > > > On Jan 31, 2014, at 11:40 AM, Software Dev > wrote: > > > >> Can someone clarify what the following options are: > >> > >> - updatesToLeaders > >> - shutdownLBHttpSolrServer > >> - parallelUpdates > >> > >> Also, I remember in older version of Solr there was an efficient format > >> that was used between SolrJ and Solr that is more compact. Does this > sill > >> exist in the latest version of Solr? If so, is it the default? > >> > >> Thanks > > > >
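To illustrate the shutdownLBHttpSolrServer point Greg describes, here is a hedged sketch of the two ways a CloudSolrServer can be built (the ZooKeeper host and URLs are hypothetical): when you pass in your own LBHttpSolrServer, you own its lifecycle and CloudSolrServer will not shut it down for you.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.impl.LBHttpSolrServer;

    public class CloudSolrServerConstruction {
        public static void main(String[] args) throws Exception {
            // 1) Let CloudSolrServer create (and later shut down) its own
            //    internal LBHttpSolrServer.
            CloudSolrServer managed = new CloudSolrServer("zk1:2181");  // hypothetical zk host
            managed.setDefaultCollection("collection1");
            managed.shutdown();  // also shuts down the internal LB server

            // 2) Supply your own LBHttpSolrServer; CloudSolrServer will not
            //    shut it down for you (shutdownLBHttpSolrServer == false).
            LBHttpSolrServer lb = new LBHttpSolrServer(
                    "http://solr1:8983/solr", "http://solr2:8983/solr");
            CloudSolrServer external = new CloudSolrServer("zk1:2181", lb);
            external.setDefaultCollection("collection1");
            external.shutdown();
            lb.shutdown();  // caller owns the LB server's lifecycle
        }
    }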
Re: SolrCloudServer questions
Out use case is we have 3 indexing machines pulling off a kafka queue and they are all sending individual updates. On Fri, Jan 31, 2014 at 12:54 PM, Mark Miller wrote: > Just make sure parallel updates is set to true. > > If you want to load even faster, you can use the bulk add methods, or if > you need more fine grained responses, use the single add from multiple > threads (though bulk add can also be done via multiple threads if you > really want to try and push the max). > > - Mark > > http://about.me/markrmiller > > On Jan 31, 2014, at 3:50 PM, Software Dev > wrote: > > > Which of any of these settings would be beneficial when bulk uploading? > > > > > > On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller > wrote: > > > >> > >> > >> On Jan 31, 2014, at 1:56 PM, Greg Walters > >> wrote: > >> > >>> I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore > >> my response. > >>> > >>>> -updatesToLeaders > >>> > >>> Only send documents to shard leaders while indexing. This saves > >> cross-talk between slaves and leaders which results in more efficient > >> document routing. > >> > >> Right, but recently this has less of an affect because CloudSolrServer > can > >> now hash documents and directly send them to the right place. This > option > >> has become more historical. Just make sure you set the correct id field > on > >> the CloudSolrServer instance for this hashing to work (I think it > defaults > >> to "id"). > >> > >>> > >>>> shutdownLBHttpSolrServer > >>> > >>> CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute > >> requests (that aren't updates directly to leaders). Where did you find > >> this? I don't see this in the javadoc anywhere but it is a boolean in > the > >> CloudSolrServer class. It looks like when you create a new > CloudSolrServer > >> and pass it your own LBHttpSolrServer the boolean gets set to false and > the > >> CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut > down. > >>> > >>>> parellelUpdates > >>> > >>> The javadoc's done have any description for this one but I checked out > >> the code for CloudSolrServer and if parallelUpdates it looks like it > >> executes update statements to multiple shards at the same time. > >> > >> Right, we should def add some javadoc, but this sends updates to shards > in > >> parallel rather than with a single thread. Can really increase update > >> speed. Still not as powerful as using CloudSolrServer from multiple > >> threads, but a nice improvement non the less. > >> > >> > >> - Mark > >> > >> http://about.me/markrmiller > >> > >>> > >>> I'm no dev but I can read so please excuse any errors on my part. > >>> > >>> Thanks, > >>> Greg > >>> > >>> On Jan 31, 2014, at 11:40 AM, Software Dev > >> wrote: > >>> > >>>> Can someone clarify what the following options are: > >>>> > >>>> - updatesToLeaders > >>>> - shutdownLBHttpSolrServer > >>>> - parallelUpdates > >>>> > >>>> Also, I remember in older version of Solr there was an efficient > format > >>>> that was used between SolrJ and Solr that is more compact. Does this > >> sill > >>>> exist in the latest version of Solr? If so, is it the default? > >>>> > >>>> Thanks > >>> > >> > >> > >
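A hedged sketch of Mark's advice for this kind of setup: one shared CloudSolrServer with parallel updates enabled, fed in batches from several indexer threads. The queue handling is elided and the field values are hypothetical.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelBulkLoader {
        public static void main(String[] args) throws Exception {
            final CloudSolrServer server = new CloudSolrServer("zk1:2181");  // hypothetical
            server.setDefaultCollection("collection1");
            server.setParallelUpdates(true);  // send updates to shards in parallel

            ExecutorService pool = Executors.newFixedThreadPool(3);  // one per indexer
            for (int t = 0; t < 3; t++) {
                final int worker = t;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                            for (int i = 0; i < 500; i++) {
                                SolrInputDocument doc = new SolrInputDocument();
                                doc.addField("id", "worker" + worker + "-" + i);
                                batch.add(doc);
                            }
                            server.add(batch);  // bulk add: one request per batch
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.MINUTES);
            server.shutdown();
        }
    }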
Re: SolrCloudServer questions
Also, if we are seeing a huge cpu spike on the leader when doing a bulk index, would changing any of the options help? On Sat, Feb 1, 2014 at 2:59 PM, Software Dev wrote: > Out use case is we have 3 indexing machines pulling off a kafka queue and > they are all sending individual updates. > > > On Fri, Jan 31, 2014 at 12:54 PM, Mark Miller wrote: > >> Just make sure parallel updates is set to true. >> >> If you want to load even faster, you can use the bulk add methods, or if >> you need more fine grained responses, use the single add from multiple >> threads (though bulk add can also be done via multiple threads if you >> really want to try and push the max). >> >> - Mark >> >> http://about.me/markrmiller >> >> On Jan 31, 2014, at 3:50 PM, Software Dev >> wrote: >> >> > Which of any of these settings would be beneficial when bulk uploading? >> > >> > >> > On Fri, Jan 31, 2014 at 11:05 AM, Mark Miller >> wrote: >> > >> >> >> >> >> >> On Jan 31, 2014, at 1:56 PM, Greg Walters >> >> wrote: >> >> >> >>> I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore >> >> my response. >> >>> >> >>>> -updatesToLeaders >> >>> >> >>> Only send documents to shard leaders while indexing. This saves >> >> cross-talk between slaves and leaders which results in more efficient >> >> document routing. >> >> >> >> Right, but recently this has less of an affect because CloudSolrServer >> can >> >> now hash documents and directly send them to the right place. This >> option >> >> has become more historical. Just make sure you set the correct id >> field on >> >> the CloudSolrServer instance for this hashing to work (I think it >> defaults >> >> to "id"). >> >> >> >>> >> >>>> shutdownLBHttpSolrServer >> >>> >> >>> CloudSolrServer uses a LBHttpSolrServer behind the scenes to >> distribute >> >> requests (that aren't updates directly to leaders). Where did you find >> >> this? I don't see this in the javadoc anywhere but it is a boolean in >> the >> >> CloudSolrServer class. It looks like when you create a new >> CloudSolrServer >> >> and pass it your own LBHttpSolrServer the boolean gets set to false >> and the >> >> CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut >> down. >> >>> >> >>>> parellelUpdates >> >>> >> >>> The javadoc's done have any description for this one but I checked out >> >> the code for CloudSolrServer and if parallelUpdates it looks like it >> >> executes update statements to multiple shards at the same time. >> >> >> >> Right, we should def add some javadoc, but this sends updates to >> shards in >> >> parallel rather than with a single thread. Can really increase update >> >> speed. Still not as powerful as using CloudSolrServer from multiple >> >> threads, but a nice improvement non the less. >> >> >> >> >> >> - Mark >> >> >> >> http://about.me/markrmiller >> >> >> >>> >> >>> I'm no dev but I can read so please excuse any errors on my part. >> >>> >> >>> Thanks, >> >>> Greg >> >>> >> >>> On Jan 31, 2014, at 11:40 AM, Software Dev > > >> >> wrote: >> >>> >> >>>> Can someone clarify what the following options are: >> >>>> >> >>>> - updatesToLeaders >> >>>> - shutdownLBHttpSolrServer >> >>>> - parallelUpdates >> >>>> >> >>>> Also, I remember in older version of Solr there was an efficient >> format >> >>>> that was used between SolrJ and Solr that is more compact. Does this >> >> sill >> >>>> exist in the latest version of Solr? If so, is it the default? >> >>>> >> >>>> Thanks >> >>> >> >> >> >> >> >> >
How does Solr parse schema.xml?
Can anyone point me in the right direction? I'm trying to duplicate the functionality of the analysis request handler so we can wrap a service around it to return the terms given a string of text. We would like to read the same schema.xml file to configure the analyzer, tokenizer, etc., but I can't seem to find the class that actually does the parsing of that file. Thanks
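The schema parsing itself happens in org.apache.solr.schema.IndexSchema (the IndexSchema.readSchema calls are visible in stack traces elsewhere in this archive). Below is a hedged sketch of reusing it to get at a field's analyzer; constructor signatures differ between Solr releases and the paths and field name are hypothetical, so treat this as an outline rather than exact API.

    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.solr.core.SolrConfig;
    import org.apache.solr.schema.IndexSchema;
    import org.xml.sax.InputSource;

    public class SchemaAnalysisSketch {
        public static void main(String[] args) throws Exception {
            // Paths and field name are hypothetical; adjust to your Solr home layout.
            String solrHome = "/path/to/solr/collection1";
            SolrConfig config = new SolrConfig(solrHome, "solrconfig.xml", null);
            IndexSchema schema = new IndexSchema(config, "schema.xml",
                    new InputSource(solrHome + "/conf/schema.xml"));

            // Pull the analyzer configured for a (hypothetical) field.
            Analyzer analyzer = schema.getFieldType("title").getAnalyzer();

            TokenStream stream = analyzer.tokenStream("title",
                    new StringReader("Some text to analyze"));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());  // one term per line
            }
            stream.end();
            stream.close();
        }
    }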
Re: Does Solr flush to disk even before ramBufferSizeMB is hit?
Thanks Shawn. If Solr writes this info to Disk as soon as possible (which is what I am seeing) then ramBuffer setting seems to be misleading. Anyone else has any thoughts on this? -Saroj On Mon, Aug 29, 2011 at 6:14 AM, Shawn Heisey wrote: > On 8/28/2011 11:18 PM, roz dev wrote: > >> I notice that even though InfoStream does not mention that data is being >> flushed to disk, new segment files were created on the server. >> Size of these files kept growing even though there was enough Heap >> available >> and 856MB Ram was not even used. >> > > With the caveat that I am not an expert and someone may correct me, I'll > offer this: It's been my experience that Solr will write the files that > constitute stored fields as soon as they are available, because that > information is always the same and nothing will change in those files based > on the next chunk of data. > > Thanks, > Shawn > >
Re: DataImportHandler using new connection on each query
I am not sure if the current version has this, but DIH used to reload connections after some idle time: if (currTime - connLastUsed > CONN_TIME_OUT) { synchronized (this) { Connection tmpConn = factory.call(); closeConnection(); connLastUsed = System.currentTimeMillis(); return conn = tmpConn; } } Where CONN_TIME_OUT = 10 seconds On Fri, Sep 2, 2011 at 12:36 AM, Chris Hostetter wrote: > > : However, I tested this against a slower SQL Server and I saw > : dramatically worse results. Instead of re-using their database, each of > : the sub-entities is recreating a connection each time the query runs. > > are you seeing any specific errors logged before these new connections are > created? > > I don't *think* there's anything in the DIH JDBC/SQL code that causes it > to timeout existing connections -- is it possible this is something > specific to the JDBC Driver you are using? > > Or maybe you are using the DIH "threads" option along with a JNDI/JDBC > based pool of connections that is configured to create new Connections on > demand, and with the fast DB it can reuse them but on the slow DB it does > enough stuff in parallel to keep asking for new connections to be created? > > > If it's DIH creating new connections over and over then i'm pretty sure > you should see an INFO level log message like this for each connection... > > LOG.info("Creating a connection for entity " > + context.getEntityAttribute(DataImporter.NAME) + " with URL: " > + url); > > ...are those messages different against your fast DB and your slow DB? > > -Hoss >
Re: DataImportHandler using new connection on each query
take care, "running 10 hours" != "idling 10 seconds" and trying again. Those are different cases. It is not dropping *used* connections (good to know it works that good, thanks for reporting!), just not reusing connections more than 10 seconds idle On Fri, Sep 2, 2011 at 10:26 PM, Gora Mohanty wrote: > On Sat, Sep 3, 2011 at 1:38 AM, Shawn Heisey wrote: > [...] >> I use DIH with MySQL. When things are going well, a full rebuild will leave >> connections open and active for over two hours. This is the case with >> 1.4.0, 1.4.1, 3.1.0, and 3.2.0. Due to some kind of problem on the database >> server, last night I had a rebuild going for more than 11 hours with no >> problems, verified from the processlist on the server. > > Will second that. Have had DIH connections open to both > mysql, and MS-SQL for 8-10h. Dropped connections could > be traced to network issues, or some other exception. > > Regards, > Gora >
Which Solr / Lucene direcotory for ramfs?
probably stupid question, Which Directory implementation should be the best suited for index mounted on ramfs/tmpfs? I guess plain old FSDirectory, (or mmap/nio?)
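Not an authoritative answer, but for experimentation the Lucene Directory implementations can be picked explicitly and compared on the ram-backed filesystem; a small sketch (the tmpfs path is hypothetical). FSDirectory.open() chooses MMapDirectory on 64-bit JVMs, which is usually a reasonable default even on ramfs/tmpfs.

    import java.io.File;

    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.store.NIOFSDirectory;

    public class DirectoryChoice {
        public static void main(String[] args) throws Exception {
            File indexDir = new File("/mnt/tmpfs/solr-index");  // hypothetical tmpfs mount

            // Let Lucene pick: on a 64-bit JVM this returns MMapDirectory.
            Directory auto = FSDirectory.open(indexDir);
            System.out.println("auto-selected: " + auto.getClass().getSimpleName());

            // Or pick explicitly to compare behaviour on the ram-backed filesystem.
            Directory mmap = new MMapDirectory(indexDir);
            Directory nio = new NIOFSDirectory(indexDir);

            auto.close();
            mmap.close();
            nio.close();
        }
    }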
what is the default value of omitNorms and termVectors in solr schema
Hi, As per this document, http://wiki.apache.org/solr/FieldOptionsByUseCase, omitNorms and termVectors have to be "explicitly" specified in some cases. I am wondering what the default value of these settings is if the Solr schema definition does not state them. *Example:* In the above case, will Solr create norms for this field, and a term vector as well? Any ideas? Thanks Saroj
cache invalidation in slaves
Hi All, Solr has different types of caches, such as filterCache, queryResultCache and documentCache. I know that if a commit is done then a new searcher is opened and new caches are built, and this makes sense. What happens when commits are happening on the master and slaves are pulling all the delta updates? Do slaves trash their caches and rebuild them every time a new delta index update is downloaded to the slave? Thanks Saroj
q and fq in solr 1.4.1
Hi All, I am sure that the q vs fq question has been answered several times, but I still have a question which I would like to know the answer to. If we have a Solr query like this: q=*&fq=field_1:XYZ&fq=field_2:ABC&sortBy=field_3+asc How does SolrIndexSearcher fire the query in 1.4.1? Will it fire the query against the whole index first (because q=*), then filter the results against field_1 and field_2, or is it done in parallel? And, if we say to get only 20 rows at a time, will Solr do the following: 1) get all the docs (because q is set to *) and sort them by field_3, 2) then filter the results by field_1 and field_2? Or will it apply sorting after doing the filter? Please let me know how Solr 1.4.1 works. Thanks Saroj
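For reference, the same query expressed through SolrJ (1.4-era API, using *:* for the match-all query); the field names follow the example above and the Solr URL is hypothetical.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FilterQueryExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");  // hypothetical URL

            SolrQuery query = new SolrQuery("*:*");              // q
            query.addFilterQuery("field_1:XYZ");                 // fq, cached in filterCache
            query.addFilterQuery("field_2:ABC");                 // fq
            query.addSortField("field_3", SolrQuery.ORDER.asc);  // sort
            query.setRows(20);                                   // page size

            QueryResponse response = server.query(query);
            System.out.println("hits: " + response.getResults().getNumFound());
        }
    }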
Production Issue: SolrJ client throwing this error even though field type is not defined in schema
Hi All We are getting this error in our Production Solr Setup. Message: Element type "t_sort" must be followed by either attribute specifications, ">" or "/>". Solr version is 1.4.1 Stack trace indicates that solr is returning malformed document. Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing query at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95) at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) at com.gap.gid.search.impl.SearchServiceImpl.executeQuery(SearchServiceImpl.java:232) ... 15 more Caused by: org.apache.solr.common.SolrException: parsing error at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:140) at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:101) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:481) at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) ... 17 more Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[3,136974] Message: Element type "t_sort" must be followed by either attribute specifications, ">" or "/>". at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594) at org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:282) at org.apache.solr.client.solrj.impl.XMLResponseParser.readDocument(XMLResponseParser.java:410) at org.apache.solr.client.solrj.impl.XMLResponseParser.readDocuments(XMLResponseParser.java:360) at org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:241) at org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:125) ... 21 more
Re: Production Issue: SolrJ client throwing - Element type must be followed by either attribute specifications, ">" or "/>".
Wanted to update the list with our finding. We reduced the number of documents which are being retrieved from Solr and this error did not appear again. Might be the case that due to high number of documents, solr is returning incomplete documents. -Saroj On Wed, Sep 21, 2011 at 12:13 PM, roz dev wrote: > Hi All > > We are getting this error in our Production Solr Setup. > > Message: Element type "t_sort" must be followed by either attribute > specifications, ">" or "/>". > Solr version is 1.4.1 > > Stack trace indicates that solr is returning malformed document. > > > Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing > query > at > org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95) > at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) > at > com.gap.gid.search.impl.SearchServiceImpl.executeQuery(SearchServiceImpl.java:232) > ... 15 more > Caused by: org.apache.solr.common.SolrException: parsing error > at > org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:140) > at > org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:101) > at > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:481) > at > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) > at > org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) > ... 17 more > Caused by: javax.xml.stream.XMLStreamException: ParseError at > [row,col]:[3,136974] > Message: Element type "t_sort" must be followed by either attribute > specifications, ">" or "/>". > at > com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594) > at > org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:282) > at > org.apache.solr.client.solrj.impl.XMLResponseParser.readDocument(XMLResponseParser.java:410) > at > org.apache.solr.client.solrj.impl.XMLResponseParser.readDocuments(XMLResponseParser.java:360) > at > org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:241) > at > org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:125) > ... 21 more > >
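Given that the fix was simply fetching fewer documents per request, here is a hedged sketch of paging through results with start/rows instead of pulling one huge response (SolrJ 1.4-era API; the URL is hypothetical).

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocumentList;

    public class PagedQueryExample {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");  // hypothetical URL

            int pageSize = 500;  // keep each response small
            int start = 0;
            while (true) {
                SolrQuery query = new SolrQuery("*:*");
                query.setStart(start);
                query.setRows(pageSize);

                QueryResponse response = server.query(query);
                SolrDocumentList page = response.getResults();
                // ... process the page ...

                if (page.isEmpty()) {
                    break;
                }
                start += page.size();
                if (start >= page.getNumFound()) {
                    break;  // fetched everything
                }
            }
        }
    }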
Update ingest rate drops suddenly
Just looking for hints on where to look... We were testing single-threaded ingest rate on Solr, trunk version, on an atypical collection (a lot of small documents), and we noticed something we are not able to explain. Setup: We use defaults for index settings, Windows 64-bit, JDK 7 U2, on SSD, a machine with enough memory and 8 cores. The schema has 5 stored fields, 4 of them indexed with no positions and no norms. Average net document size (optimized index size / number of documents) is around 100 bytes. On a test with 40 Mio documents: - we had an update ingest rate on the first 4.4 Mio documents @ an incredible 34k records / second... - then it dropped suddenly to 20k records per second, and this rate remained stable (variance 1k) until... - we hit 13 Mio, where the ingest rate dropped again really hard, from one instant to another, to 10k records per second. It stayed there until we reached the end @ 40 Mio (slightly reducing, to ca. 9k, but this is not long enough to see a trend). Nothing unusual was happening with JVM memory (tooth-saw 200-450M, fully regular). CPU in turn was following the ingest rate trend, indicating that we were waiting on something. No searches, no commits, nothing. autoCommit was turned off. Updates were streaming directly from the database. I did not expect something like this, knowing Lucene merges in the background. Also, having such sudden drops in ingest rate is indicative that we are not leaking something (the drop would have been much more gradual). Is it some caches? But why two really significant drops, 33k/sec to 20k and then to 10k? We would love to keep it @ 34k/second :) I am not really acquainted with the new MergePolicy and flushing settings, but I suspect this is something there we could tweak. Could it be that Windows is somehow quirky with the Solr default directory on win64/JVM (I think it is MMAP by default)? We did not saturate IO with such small documents, I guess; it is just a couple of gigs over 1-2 hours. All in all it works well, but are such hard update ingest rate drops normal? Thanks, eks.
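For what it's worth, the Lucene-level knobs being speculated about here look roughly like the sketch below; in Solr they are normally set in solrconfig.xml rather than in code, and the concrete numbers are only illustrative, not recommendations.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;
    import org.apache.lucene.util.Version;

    public class IndexingKnobsSketch {
        public static void main(String[] args) {
            IndexWriterConfig config = new IndexWriterConfig(
                    Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));

            // Larger RAM buffer -> fewer flushes of tiny segments
            // (illustrative value only).
            config.setRAMBufferSizeMB(256.0);

            // The merge policy controls how aggressively segments are combined;
            // capping the max merged segment size avoids occasional huge merges.
            TieredMergePolicy mergePolicy = new TieredMergePolicy();
            mergePolicy.setMaxMergedSegmentMB(2048.0);
            config.setMergePolicy(mergePolicy);

            System.out.println(config);
        }
    }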
Re: Update ingest rate drops suddenly
Thanks Otis, we will look into these issues again, slightly deeper. Network problems are not likely, but DB, I do not know, this is huge select ... we will try to scan db, without indexing, just to see if it can sustain... But gut feeling says, nope, this is not the one. IO saturation would surprise me, but you never know. Might be very well that SSD is somehow having problems with this sustained throughput. 8 Core... no, this was single update thread. we left default index settings (do not tweak if it works :) 32 32MB sounds like a lot of our documents (100b average on disk size). Assuming ram efficiency of 50% (?), we lend at 100k buffered documents. Yes, this is kind of smallish as every ~3 seconds we fill-up ramBuffer. (our Analyzers surprised me with 30k+ records per second). 256 will do the job, ~24 seconds should be plenty of "idle" time for IO-OS-JVM to sort out MMAP issues, if any (windows was newer MMAP performance champion when using it from java, but once you dance around it, it works ok)... Max jvm heap on this test was 768m, memory never went above 500m, Using -XX:-UseParallelGC ... this is definitely not a gc problem. cheers, eks On Sun, Sep 25, 2011 at 6:20 AM, Otis Gospodnetic wrote: > eks, > > This is clear as day - you're using Winblows! Kidding. > > I'd: > * watch IO with something like vmstat 2 and see if the rate drops correlate > to increased disk IO or IO wait time > * monitor the DB from which you were pulling the data - maybe the DB or the > server that runs it had issues > * monitor the network over which you pull data from DB > > If none of the above reveals the problem I'd still: > * grab all data you need to index and copy it locally > * index everything locally > > Out of curiosity, how big is your ramBufferSizeMB and your -Xmx? > And on that 8-core box you have ~8 indexing threads going? > > Otis > > Sematext is Hiring -- http://sematext.com/about/jobs.html > > > > >> >>From: eks dev >>To: solr-user >>Sent: Saturday, September 24, 2011 3:18 PM >>Subject: Update ingest rate drops suddenly >> >>just looking for hints where to look for... >> >>We were testing single threaded ingest rate on solr, trunk version on >>atypical collection (a lot of small documents), and we noticed >>something we are not able to explain. >> >>Setup: >>We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD, >>machine with enough memory and 8 cores. Schema has 5 stored fields, >>4 of them indexed no positions no norms. >>Average net document size (optimized index size / number of documents) >>is around 100 bytes. >> >>On a test with 40 Mio document: >>- we had update ingest rate on first 4,4Mio documents @ incredible >>34k records / second... >>- then it dropped, suddenly to 20k records per second and this rate >>remained stable (variance 1k) until... >>- we hit 13Mio, where ingest rate dropped again really hard, from one >>instant in time to another to 10k records per second. >> >>it stayed there until we reached the end @40Mio (slightly reducing, to >>ca 9k, but this is not long enough to see trend). >> >>Nothing unusual happening with jvm memory ( tooth-saw 200- 450M fully >>regular). CPU in turn was following the ingest rate trend, inicating >>that we were waiting on something. No searches , no commits, nothing. >> >>autoCommit was turned off. Updates were streaming directly from the database. >> >>- >>I did not expect something like this, knowing lucene merges in >>background. 
Also, having such sudden drops in ingest rate is >>indicative that we are not leaking something. (drop would have been >>much more gradual). It is some caches, but why two really significant >>drops? 33k/sec to 20k and than to 10k... We would love to keep it @34 >>k/second :) >> >>I am not really acquainted with the new MergePolicy and flushing >>settings, but I suspect this is something there we could tweak. >> >>Could it be windows is somehow, hmm, quirky with solr default >>directory on win64/jvm (I think it is MMAP by default)... We did not >>saturate IO with such a small documents I guess, It is a just couple >>of Gig over 1-2 hours. >> >>All in all, it works good, but is having such hard update ingest rate >>drops normal? >> >>Thanks, >>eks. >> >> >>
Re: Update ingest rate drops suddenly
Just to bring closure on this one, we were slurping data from the wrong DB (hardly desktop class machine)... Solr did not cough on 41Mio records @34k updates / sec., single threaded. Great! On Sat, Sep 24, 2011 at 9:18 PM, eks dev wrote: > just looking for hints where to look for... > > We were testing single threaded ingest rate on solr, trunk version on > atypical collection (a lot of small documents), and we noticed > something we are not able to explain. > > Setup: > We use defaults for index settings, windows 64 bit, jdk 7 U2. on SSD, > machine with enough memory and 8 cores. Schema has 5 stored fields, > 4 of them indexed no positions no norms. > Average net document size (optimized index size / number of documents) > is around 100 bytes. > > On a test with 40 Mio document: > - we had update ingest rate on first 4,4Mio documents @ incredible > 34k records / second... > - then it dropped, suddenly to 20k records per second and this rate > remained stable (variance 1k) until... > - we hit 13Mio, where ingest rate dropped again really hard, from one > instant in time to another to 10k records per second. > > it stayed there until we reached the end @40Mio (slightly reducing, to > ca 9k, but this is not long enough to see trend). > > Nothing unusual happening with jvm memory ( tooth-saw 200- 450M fully > regular). CPU in turn was following the ingest rate trend, inicating > that we were waiting on something. No searches , no commits, nothing. > > autoCommit was turned off. Updates were streaming directly from the database. > > - > I did not expect something like this, knowing lucene merges in > background. Also, having such sudden drops in ingest rate is > indicative that we are not leaking something. (drop would have been > much more gradual). It is some caches, but why two really significant > drops? 33k/sec to 20k and than to 10k... We would love to keep it @34 > k/second :) > > I am not really acquainted with the new MergePolicy and flushing > settings, but I suspect this is something there we could tweak. > > Could it be windows is somehow, hmm, quirky with solr default > directory on win64/jvm (I think it is MMAP by default)... We did not > saturate IO with such a small documents I guess, It is a just couple > of Gig over 1-2 hours. > > All in all, it works good, but is having such hard update ingest rate > drops normal? > > Thanks, > eks. >
Re: Production Issue: SolrJ client throwing this error even though field type is not defined in schema
This issue disappeared when we reduced the number of documents which were being returned from Solr. Looks to be some issue with Tomcat or Solr, returning truncated responses. -Saroj On Sun, Sep 25, 2011 at 9:21 AM, wrote: > If I had to give a gentle nudge, I would ask you to validate your schema > XML file. You can do so by looking for any w3c XML validator website and > just copy pasting the text there to find out where its malformed. > > Sent from my iPhone > > On Sep 24, 2011, at 2:01 PM, Erick Erickson > wrote: > > > You might want to review: > > > > http://wiki.apache.org/solr/UsingMailingLists > > > > There's really not much to go on here. > > > > Best > > Erick > > > > On Wed, Sep 21, 2011 at 12:13 PM, roz dev wrote: > >> Hi All > >> > >> We are getting this error in our Production Solr Setup. > >> > >> Message: Element type "t_sort" must be followed by either attribute > >> specifications, ">" or "/>". > >> Solr version is 1.4.1 > >> > >> Stack trace indicates that solr is returning malformed document. > >> > >> > >> Caused by: org.apache.solr.client.solrj.SolrServerException: Error > >> executing query > >>at > org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95) > >>at > org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) > >>at > com.gap.gid.search.impl.SearchServiceImpl.executeQuery(SearchServiceImpl.java:232) > >>... 15 more > >> Caused by: org.apache.solr.common.SolrException: parsing error > >>at > org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:140) > >>at > org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:101) > >>at > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:481) > >>at > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) > >>at > org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) > >>... 17 more > >> Caused by: javax.xml.stream.XMLStreamException: ParseError at > >> [row,col]:[3,136974] > >> Message: Element type "t_sort" must be followed by either attribute > >> specifications, ">" or "/>". > >>at > com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:594) > >>at > org.apache.solr.client.solrj.impl.XMLResponseParser.readArray(XMLResponseParser.java:282) > >>at > org.apache.solr.client.solrj.impl.XMLResponseParser.readDocument(XMLResponseParser.java:410) > >>at > org.apache.solr.client.solrj.impl.XMLResponseParser.readDocuments(XMLResponseParser.java:360) > >>at > org.apache.solr.client.solrj.impl.XMLResponseParser.readNamedList(XMLResponseParser.java:241) > >>at > org.apache.solr.client.solrj.impl.XMLResponseParser.processResponse(XMLResponseParser.java:125) > >>... 21 more > >> >
Re: capacity planning
Re. "I have little experience with VM servers for search." We had huge performance penalty on VMs, CPU was bottleneck. We couldn't freely run measurements to figure out what the problem really was (hosting was contracted by customer...), but it was something pretty scary, kind of 8-10 times slower than advertised dedicated equivalent. Whatever its worth, if you can afford it, keep lucene away from it. Lucene is highly optimized machine, and someone twiddling with context switches is not welcome there. Of course, if you get IO bound, it makes no big diff anyhow. This is just my singular experience, might be the hosting team did not configure it right, or something changed in meantime (~ 4 Years old experience), but we burnt our fingers that hard I still remember it On Tue, Oct 11, 2011 at 7:49 PM, Toke Eskildsen wrote: > Travis Low [t...@4centurion.com] wrote: > > Toke, thanks. Comments embedded (hope that's okay): > > Inline or top-posting? Long discussion, but for mailing lists I clearly > prefer the former. > > [Toke: Estimate characters] > > > Yes. We estimate each of the 23K DB records has 600 pages of text for > the > > combined documents, 300 words per page, 5 characters per word. Which > > coincidentally works out to about 21GB, so good guessing there. :) > > Heh. Lucky Guess indeed, although the factors were off. Anyway, 21GB does > not sound scary at all. > > > The way it works is we have researchers modifying the DB records during > the > > day, and they may upload documents at that time. We estimate 50-60 > uploads > > throughout the day. If possible, we'd like to index them as they are > > uploaded, but if that would negatively affect the search, then we can > > rebuild the index nightly. > > > > Which is better? > > The analyzing part is only CPU and you're running multi-core so as long as > you only analyze using one thread you're safe there. That leaves us with > I/O: Even for spinning drives, a daily load of just 60 updates of 1MB of > extracted text each shouldn't have any real effect - with the usual caveat > that large merges should be avoided by either optimizing at night or > tweaking merge policy to avoid large segments. With such a relatively small > index, (re)opening and warm up should be painless too. > > Summary: 300GB is a fair amount of data and takes some power to crunch. > However, in the Solr/Lucene end your index size and your update rates are > nothing to worry about. Usual caveat for advanced use and all that applies. > > [Toke: i7, 8GB, 1TB spinning, 256GB SSD] > > > We have a very beefy VM server that we will use for benchmarking, but > your > > specs provide a starting point. Thanks very much for that. > > I have little experience with VM servers for search. Although we use a lot > of virtual machines, we use dedicated machines for our searchers, primarily > to ensure low latency for I/O. They might be fine for that too, but we > haven't tried it yet. > > Glad to be of help, > Toke
Index format difference between 4.0 and 3.4
Hi All, We are using Solr 1.4.1 in production and are considering an upgrade to a newer version. It seems that Solr 3.x requires a complete rebuild of the index, as the format seems to have changed. Is the Solr 4.0 index file format compatible with the Solr 3.x format? Please advise. Thanks Saroj
codec="Pulsing" per field broken?
on the latest trunk, my schema.xml with field type declaration containing //codec="Pulsing"// does not work any more (throws exception from FieldType). It used to work wit approx. a month old trunk version. I didn't dig deeper, can be that the old schema.xml was broken and worked by accident. org.apache.solr.common.SolrException: Plugin Initializing failure for [schema.xml] fieldType at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:183) at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:368) at org.apache.solr.schema.IndexSchema.(IndexSchema.java:107) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:651) at org.apache.solr.core.CoreContainer.load(CoreContainer.java:409) at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:243) at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:93) at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713) at org.mortbay.jetty.servlet.Context.startContext(Context.java:140) at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282) at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130) at org.mortbay.jetty.Server.doStart(Server.java:224) at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50) at runjettyrun.Bootstrap.main(Bootstrap.java:86) Caused by: java.lang.RuntimeException: schema fieldtype storableCity(X.StorableField) invalid arguments:{codec=Pulsing} at org.apache.solr.schema.FieldType.setArgs(FieldType.java:177) at org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:127) at org.apache.solr.schema.FieldTypePluginLoader.init(FieldTypePluginLoader.java:43) at org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:180) ... 18 more
Re: codec="Pulsing" per field broken?
Thanks Robert, I've missed LUCENE-3490... Awesome! On Sun, Dec 11, 2011 at 6:37 PM, Robert Muir wrote: > On Sun, Dec 11, 2011 at 11:34 AM, eks dev wrote: >> on the latest trunk, my schema.xml with field type declaration >> containing //codec="Pulsing"// does not work any more (throws >> exception from FieldType). It used to work wit approx. a month old >> trunk version. >> >> I didn't dig deeper, can be that the old schema.xml was broken and >> worked by accident. >> > > Hi, > > The short answer is, you should change this to //postingsFormat="Pulsing40"// > See > http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema_codec.xml > > The longer answer is that the Codec API in lucene trunk was extended recently: > https://issues.apache.org/jira/browse/LUCENE-3490 > > Previously "Codec" only allowed you to customize the format of the > postings lists. > We are working to have it cover the entire index segment (at the > moment nearly everything except deletes and encoding of compound files > can be customized). > > For example, look at SimpleText now: > http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/simpletext/ > As you see, it now implements plain-text stored fields, term vectors, > norms, segments file, fieldinfos, etc. > See Codec.java > (http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/Codec.java) > or LUCENE-3490 for more details. > > Because of this, what you had before is now just "PostingsFormat", as > Pulsing is just a wrapper around a postings implementation that > inlines low frequency terms. > Lucene's default Codec uses a per-field postings setup, so you can > still configure the postings per-field, just differently. > > -- > lucidimagination.com
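For completeness, the per-field hook Robert describes looks roughly like the sketch below at the Lucene level (on the Solr side it is just postingsFormat="Pulsing40" on the fieldType). The class and field names here are illustrative, and the codec packages moved around during trunk development, so treat this as a sketch against the eventual 4.0 API:

import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40Codec;

// Per-field postings selection: pulse low-frequency terms for one field,
// keep the default postings format for everything else.
public class PulsingCityCodec extends Lucene40Codec {
    private final PostingsFormat pulsing = PostingsFormat.forName("Pulsing40");
    private final PostingsFormat standard = PostingsFormat.forName("Lucene40");

    @Override
    public PostingsFormat getPostingsFormatForField(String field) {
        return "storableCity".equals(field) ? pulsing : standard;
    }
}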
hot deploy of newer version of solr schema in production
Hi All, I need the community's feedback about deploying newer versions of a Solr schema into production while the existing (older) schema is in use by applications. How do people perform these deployments, and what have they learned from doing so? Any thoughts are welcome. Thanks Saroj
Re: filter query from external list of Solr unique IDs
if your index is read-only in production, can you add mapping unique_id-Lucene docId in your kv store and and build filters externally? That would make unique Key obsolete in your production index, as you would work at lucene doc id level. That way, you offline the problem to update/optimize phase. Ugly part is a lot of updates on your kv-store... I am not really familiar with solr, but working directly with lucene this is doable, even having parallel index that has unique ID as a stored field, and another index with indexed fields on update master, and than having only this index with indexed fields in production. On Fri, Oct 15, 2010 at 8:59 PM, Burton-West, Tom wrote: > Hi Jonathan, > > The advantages of the obvious approach you outline are that it is simple, > it fits in to the existing Solr model, it doesn't require any customization > or modification to Solr/Lucene java code. Unfortunately, it does not scale > well. We originally tried just what you suggest for our implementation of > Collection Builder. For a user's personal collection we had a table that > maps the collection id to the unique Solr ids. > Then when they wanted to search their collection, we just took their search > and added a filter query with the fq=(id:1 OR id:2 OR). I seem to > remember running in to a limit on the number of OR clauses allowed. Even if > you can set that limit larger, there are a number of efficiency issues. > > We ended up constructing a separate Solr index where we have a multi-valued > collection number field. Unfortunately, until incremental field updating > gets implemented, this means that every time someone adds a document to a > collection, the entire document (including 700KB of OCR) needs to be > re-indexed just to update the collection number field. This approach has > allowed us to scale up to a total of something under 100,000 documents, but > we don't think we can scale it much beyond that for various reasons. > > I was actually thinking of some kind of custom Lucene/Solr component that > would for example take a query parameter such as &lookitUp=123 and the > component might do a JDBC query against a database or kv store and return > results in some form that would be efficient for Solr/Lucene to process. (Of > course this assumes that a JDBC query would be more efficient than just > sending a long list of ids to Solr). The other part of the equation is > mapping the unique Solr ids to internal Lucene ids in order to implement a > filter query. I was wondering if something like the unique id to Lucene id > mapper in zoie might be useful or if that is too specific to zoie. SoThis > may be totally off-base, since I haven't looked at the zoie code at all yet. > > In our particular use case, we might be able to build some kind of > in-memory map after we optimize an index and before we mount it in > production. In our workflow, we update the index and optimize it before we > release it and once it is released to production there is no > indexing/merging taking place on the production index (so the internal > Lucene ids don't change.) > > Tom > > > > -Original Message- > From: Jonathan Rochkind [mailto:rochk...@jhu.edu] > Sent: Friday, October 15, 2010 1:07 PM > To: solr-user@lucene.apache.org > Subject: RE: filter query from external list of Solr unique IDs > > Definitely interested in this. > > The naive obvious approach would be just putting all the ID's in the query. > Like fq=(id:1 OR id:2 OR). Or making it another clause in the 'q'. 
> > Can you outline what's wrong with this approach, to make it more clear > what's needed in a solution? > >
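Since the thread is about turning an external list of unique keys into a filter, here is one Lucene-level sketch using TermsFilter from the queries contrib module; the field name and the ID source are assumptions, and at the Solr level the naive equivalent is still the big fq=(id:1 OR id:2 ...) clause discussed above:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.TermsFilter;
import java.util.Collection;

public class ExternalIdFilter {
    // Builds a Lucene filter over the unique-key field from IDs fetched
    // externally, e.g. from a database or kv-store lookup.
    public static Filter build(Collection<String> uniqueIds) {
        TermsFilter filter = new TermsFilter();
        for (String id : uniqueIds) {
            filter.addTerm(new Term("id", id));
        }
        return filter;
    }
}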
Re: can we configure spellcheck to be invoked after request processing?
James, You are right. I was setting up spell checker incorrectly. It works correctly as you described. Spell checker is invoked after the query component and it does not stop Solr from executing query. Thanks for correcting me. Saroj On Fri, Mar 1, 2013 at 7:30 AM, Dyer, James wrote: > I'm a little confused here because if you are searching q=jeap OR denim , > then you should be getting both documents back. Having spellcheck > configured does not affect your search results at all. Having it in your > request will sometime result in spelling suggestions, usually if one or > more terms you queried is not in the index. But if all of your query terms > are optional then you need only have 1 term match anything to get results. > You should get the same results regardless of whether or not you have > spellcheck in the request. > > While spellcheck does not affect your query results, the results do affect > spellcheck. This is why you should put spellcheck in the "last-components" > section of your request handler configuration. This ensures that the query > is run before spellcheck. > > James Dyer > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: roz dev [mailto:rozde...@gmail.com] > Sent: Thursday, February 28, 2013 6:33 PM > To: solr-user@lucene.apache.org > Subject: can we configure spellcheck to be invoked after request > processing? > > Hi All, > I may be asking a stupid question but please bear with me. > > Is it possible to configure Spell check to be invoked after Solr has > processed the original query? > > My use case is : > > I am using DirectSpellChecker and have a document which has "Denim" as a > term and there is another document which has "Jeap". > > I am issuing a Search as "Jean" or "Denim" > > I am finding that this Solr query is giving me ZERO results and suggesting > "Jeap" as an alternative. > > I want Solr to try to run the query for "Jean" or "Denim" and if there are > no results found then only suggest "Jeap" as an alternative > > Is this doable in Solr? > > Any suggestions. > > -Saroj > >
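A small SolrJ sketch of the behaviour James describes: the query runs normally and any suggestions simply ride along in the response. The URL is an example, and it assumes spellcheck is wired into the handler's last-components as discussed:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class SpellcheckQuerySketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("jean OR denim");
        q.set("spellcheck", true); // suggestions are additional, results are unaffected

        QueryResponse rsp = solr.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
        SpellCheckResponse sc = rsp.getSpellCheckResponse();
        if (sc != null && !sc.getSuggestions().isEmpty()) {
            System.out.println("suggestions: " + sc.getSuggestions());
        }
    }
}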
Can we manipulate termfreq to count as 1 for multiple matches?
Hi All, I am wondering if there is a way to cap the term frequency of a certain field at 1, even if there are multiple matches in that document. The use case is: let's say I have a document with 2 fields, Name and Description, and the data looks like this: Document_1, Name = Blue Jeans, Description = This jeans is very soft. Jeans is pretty nice. Now, if I search for "Jeans", then "Jeans" is found in 2 places in the Description field, so the term frequency for Description is 2. I want Solr to count the term frequency for Description as 1 even if "Jeans" is found multiple times in this field. For all other fields, I do want to get the term frequency as it is. Is this doable in Solr with any of the functions? Any inputs are welcome. Thanks Saroj
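Not a function query, but one common way to get "count multiple occurrences as one" for scoring is a custom Similarity whose tf() saturates at 1; a sketch against the Lucene 4.x DefaultSimilarity API (wiring it to a single field, e.g. via a per-field similarity in the schema, is an additional assumption about your setup):

import org.apache.lucene.search.similarities.DefaultSimilarity;

// Scores a term the same whether it occurs once or ten times in the field.
public class BinaryTfSimilarity extends DefaultSimilarity {
    @Override
    public float tf(float freq) {
        return freq > 0 ? 1.0f : 0.0f;
    }
}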
Re: hot deploy of newer version of solr schema in production
Thanks Jan for your inputs. I am keen to know about the way people keep running live sites while there is a breaking change which calls for complete re-indexing. we want to build a new index , with new schema (it may take couple of hours) without impacting live e-commerce site. any thoughts are welcome Thanks Saroj On Tue, Jan 24, 2012 at 12:21 AM, Jan Høydahl wrote: > Hi, > > To be able to do a true hot deploy of newer schema without reindexing, you > must carefully see to that none of your changes are breaking changes. So > you should test the process on your development machine and make sure it > works. Adding and deleting fields would work, but not changing the > field-type or analysis of an existing field. Depending on from/to version, > you may want to keep the old schema-version number. > > The process is: > 1. Deploy the new schema, including all dependencies such as dictionaries > 2. Do a RELOAD CORE http://wiki.apache.org/solr/CoreAdmin#RELOAD > > My preference is to do a more thorough upgrade of schema including new > functionality and breaking changes, and then do a full reindex. The > exception is if my index is huge and the reason for Solr upgrade or schema > change is to fix a bug, not to use new functionality. > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > Solr Training - www.solrtraining.com > > On 24. jan. 2012, at 01:51, roz dev wrote: > > > Hi All, > > > > I need community's feedback about deploying newer versions of solr schema > > into production while existing (older) schema is in use by applications. > > > > How do people perform these things? What has been the learning of people > > about this. > > > > Any thoughts are welcome. > > > > Thanks > > Saroj > >
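For the RELOAD step Jan mentions, besides the plain HTTP core-admin call there is a SolrJ helper; a sketch, assuming a SolrJ version that ships CoreAdminRequest.reloadCore (core name and URL are examples):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.CoreAdminRequest;

public class ReloadCoreSketch {
    public static void main(String[] args) throws Exception {
        // Points at the core-admin root, not at a specific core.
        CommonsHttpSolrServer admin = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Equivalent to /admin/cores?action=RELOAD&core=collection1
        CoreAdminRequest.reloadCore("collection1", admin);
    }
}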
reader/searcher refresh after replication (commit)
Hi all, I am a bit confused with IndexSearcher refresh lifecycles... In a master/slave setup, I override the postCommit listener on the slave (solr trunk version) to read some user information stored in userCommitData on the master -- @Override public final void postCommit() { // This returns "stale" information that was present before replication finished RefCounted<SolrIndexSearcher> refC = core.getNewestSearcher(true); Map<String, String> userData = refC.get().getIndexReader().getIndexCommit().getUserData(); } I expected core.getNewestSearcher(true); to return a refreshed SolrIndexSearcher, but it didn't. When is this information going to be refreshed to the state from the replicated index? I repeat, this is a postCommit listener. What is the way to get the information from the last commit point? Maybe like this? core.getDeletionPolicy().getLatestCommit().getUserData(); Or do I need to explicitly open a new searcher (doesn't Solr do this behind the scenes?) core.openNewSearcher(false, false) Not critical, reopening a new searcher works, but I would like to understand these lifecycles, i.e. when solr loads the latest commit point... Thanks, eks
Re: reader/searcher refresh after replication (commit)
Thanks Mark, Hmm, I would like to have this information asap, not to wait until the first search gets executed (depends on user) . Is solr going to create new searcher as a part of "replication transaction"... Just to make it clear why I need it... I have simple master, many slaves config where master does "batch" updates in big chunks (things user can wait longer to see on search side) but slaves work in soft commit mode internally where I permit them to run away slightly from master in order to know where "incremental update" should start, I read it from UserData Basically, ideally, before commit (after successful replication is finished) ends, I would like to read in these counters to let "incremental update" run from the right point... I need to prevent updating "replicated index" before I read this information (duplicates can appear) are there any "IndexWriter" listeners around? Thanks again, eks. On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller wrote: > Post commit calls are made before a new searcher is opened. > > Might be easier to try to hook in with a new searcher listener? > > On Feb 21, 2012, at 8:23 AM, eks dev wrote: > >> Hi all, >> I am a bit confused with IndexSearcher refresh lifecycles... >> In a master slave setup, I override postCommit listener on slave >> (solr trunk version) to read some user information stored in >> userCommitData on master >> >> -- >> @Override >> public final void postCommit() { >> // This returnes "stale" information that was present before >> replication finished >> RefCounted refC = core.getNewestSearcher(true); >> Map userData = >> refC.get().getIndexReader().getIndexCommit().getUserData(); >> } >> >> I expected core.getNewestSearcher(true); to return refreshed >> SolrIndexSearcher, but it didn't >> >> When is this information going to be refreshed to the status from the >> replicated index, I repeat this is postCommit listener? >> >> What is the way to get the information from the last commit point? >> >> Maybe like this? >> core.getDeletionPolicy().getLatestCommit().getUserData(); >> >> Or I need to explicitly open new searcher (isn't solr does this behind >> the scenes?) >> core.openNewSearcher(false, false) >> >> Not critical, reopening new searcher works, but I would like to >> understand these lifecycles, when solr loads latest commit point... >> >> Thanks, eks > > - Mark Miller > lucidimagination.com > > > > > > > > > > >
Re: reader/searcher refresh after replication (commit)
And drinks on me to those who decoupled implicit commit from close... this was tricky trap On Tue, Feb 21, 2012 at 9:10 PM, eks dev wrote: > Thanks Mark, > Hmm, I would like to have this information asap, not to wait until the > first search gets executed (depends on user) . Is solr going to create > new searcher as a part of "replication transaction"... > > Just to make it clear why I need it... > I have simple master, many slaves config where master does "batch" > updates in big chunks (things user can wait longer to see on search > side) but slaves work in soft commit mode internally where I permit > them to run away slightly from master in order to know where > "incremental update" should start, I read it from UserData > > Basically, ideally, before commit (after successful replication is > finished) ends, I would like to read in these counters to let > "incremental update" run from the right point... > > I need to prevent updating "replicated index" before I read this > information (duplicates can appear) are there any "IndexWriter" > listeners around? > > > Thanks again, > eks. > > > > On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller wrote: >> Post commit calls are made before a new searcher is opened. >> >> Might be easier to try to hook in with a new searcher listener? >> >> On Feb 21, 2012, at 8:23 AM, eks dev wrote: >> >>> Hi all, >>> I am a bit confused with IndexSearcher refresh lifecycles... >>> In a master slave setup, I override postCommit listener on slave >>> (solr trunk version) to read some user information stored in >>> userCommitData on master >>> >>> -- >>> @Override >>> public final void postCommit() { >>> // This returnes "stale" information that was present before >>> replication finished >>> RefCounted refC = core.getNewestSearcher(true); >>> Map userData = >>> refC.get().getIndexReader().getIndexCommit().getUserData(); >>> } >>> >>> I expected core.getNewestSearcher(true); to return refreshed >>> SolrIndexSearcher, but it didn't >>> >>> When is this information going to be refreshed to the status from the >>> replicated index, I repeat this is postCommit listener? >>> >>> What is the way to get the information from the last commit point? >>> >>> Maybe like this? >>> core.getDeletionPolicy().getLatestCommit().getUserData(); >>> >>> Or I need to explicitly open new searcher (isn't solr does this behind >>> the scenes?) >>> core.openNewSearcher(false, false) >>> >>> Not critical, reopening new searcher works, but I would like to >>> understand these lifecycles, when solr loads latest commit point... >>> >>> Thanks, eks >> >> - Mark Miller >> lucidimagination.com >> >> >> >> >> >> >> >> >> >> >>
Re: reader/searcher refresh after replication (commit)
Yes, I consciously let my slaves run away from the master in order to reduce update latency, but every now and then they sync up with master that is doing heavy lifting. The price you pay is that slaves do not see the same documents as the master, but this is the case anyhow with replication, in my setup slave may go ahead of master with updates, this delta gets zeroed after replication and the game starts again. What you have to take into account with this is very small time window where you may "go back in time" on slaves (not seeing documents that were already there), but we are talking about seconds and a couple out of 200Mio documents (only those documents that were softComited on slave during replication, since commit ond master and postCommit on slave). Why do you think something is strange here? > What are you expecting a BeforeCommitListener could do for you, if one > would exist? Why should I be expecting something? I just need to read userCommit Data as soon as replication is done, and I am looking for proper/easy way to do it. (postCommitListener is what I use now). What makes me slightly nervous are those life cycle questions, e.g. when I issue update command before and after postCommit event, which index gets updated, the one just replicated or the one that was there just before replication. There are definitely ways to optimize this, for example to force replication handler to copy only delta files if index gets updated on slave and master (there is already todo somewhere on solr replication Wiki I think). Now replicationHandler copies complete index if this gets detected ... I am all ears if there are better proposals to have low latency updates in multi server setup... On Tue, Feb 21, 2012 at 11:53 PM, Em wrote: > Eks, > > that sounds strange! > > Am I getting you right? > You have a master which indexes batch-updates from time to time. > Furthermore you got some slaves, pulling data from that master to keep > them up-to-date with the newest batch-updates. > Additionally your slaves index own content in soft-commit mode that > needs to be available as soon as possible. > In consequence the slavesare not in sync with the master. > > I am not 100% certain, but chances are good that Solr's > replication-mechanism only changes those segments that are not in sync > with the master. > > What are you expecting a BeforeCommitListener could do for you, if one > would exist? > > Kind regards, > Em > > Am 21.02.2012 21:10, schrieb eks dev: >> Thanks Mark, >> Hmm, I would like to have this information asap, not to wait until the >> first search gets executed (depends on user) . Is solr going to create >> new searcher as a part of "replication transaction"... >> >> Just to make it clear why I need it... >> I have simple master, many slaves config where master does "batch" >> updates in big chunks (things user can wait longer to see on search >> side) but slaves work in soft commit mode internally where I permit >> them to run away slightly from master in order to know where >> "incremental update" should start, I read it from UserData >> >> Basically, ideally, before commit (after successful replication is >> finished) ends, I would like to read in these counters to let >> "incremental update" run from the right point... >> >> I need to prevent updating "replicated index" before I read this >> information (duplicates can appear) are there any "IndexWriter" >> listeners around? >> >> >> Thanks again, >> eks. 
>> >> >> >> On Tue, Feb 21, 2012 at 8:03 PM, Mark Miller wrote: >>> Post commit calls are made before a new searcher is opened. >>> >>> Might be easier to try to hook in with a new searcher listener? >>> >>> On Feb 21, 2012, at 8:23 AM, eks dev wrote: >>> >>>> Hi all, >>>> I am a bit confused with IndexSearcher refresh lifecycles... >>>> In a master slave setup, I override postCommit listener on slave >>>> (solr trunk version) to read some user information stored in >>>> userCommitData on master >>>> >>>> -- >>>> @Override >>>> public final void postCommit() { >>>> // This returnes "stale" information that was present before >>>> replication finished >>>> RefCounted refC = core.getNewestSearcher(true); >>>> Map userData = >>>> refC.get().getIndexReader().getIndexCommit().getUserData(); >>>> } >>>> >>>> I expected core.getNewestSearcher(true); to return refreshed >>>>
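For reference, the "last commit point" route mentioned earlier in the thread, wrapped as a small helper; this is just the call chain from the original mail, not a listener implementation:

import org.apache.lucene.index.IndexCommit;
import org.apache.solr.core.SolrCore;
import java.util.Map;

public class CommitUserDataHelper {
    // Reads the userCommitData written on the master from the newest commit point,
    // without waiting for a new searcher to be opened and warmed.
    public static Map<String, String> readLastCommitUserData(SolrCore core) throws Exception {
        IndexCommit latest = core.getDeletionPolicy().getLatestCommit();
        return latest == null ? null : latest.getUserData();
    }
}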
SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
We started observing strange failures from ReplicationHandler when we commit on master trunk version 4-5 days old. It works sometimes, and sometimes not didn't dig deeper yet. Looks like the real culprit hides behind: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed Looks familiar to somebody? 120222 154959 SEVERE SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043) at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source) at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503) at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348) at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source) at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810) at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815) at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984) at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254) at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233) at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223) at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095) ... 15 more
Re: Unusually long data import time?
Devon, you ought to try updating from many threads (I do not know if DIH can do it, check it), but Lucene does a great job if fed from many update threads... It depends where your time gets lost, but it is usually a) the analysis chain or b) the database. If it is a) and your server has spare CPU cores, you can scale at roughly X * number-of-cores rate. On Wed, Feb 22, 2012 at 7:41 PM, Devon Baumgarten wrote: > Ahmet, > > I do not. I commented autoCommit out. > > Devon Baumgarten > > > > -Original Message- > From: Ahmet Arslan [mailto:iori...@yahoo.com] > Sent: Wednesday, February 22, 2012 12:25 PM > To: solr-user@lucene.apache.org > Subject: Re: Unusually long data import time? > >> Would it be unusual for an import of 160 million documents >> to take 18 hours? Each document is less than 1kb and I >> have the DataImportHandler using the jdbc driver to connect >> to SQL Server 2008. The full-import query calls a stored >> procedure that contains only a select from my target table. >> >> Is there any way I can speed this up? I saw recently someone >> on this list suggested a new user could get all their Solr >> data imported in under an hour. I sure hope that's true! > > Do have autoCommit or autoSoftCommit configured in solrconfig.xml?
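If the analysis chain is the bottleneck and DIH cannot parallelise it, one client-side alternative is to stream documents from several threads; a hedged SolrJ sketch using StreamingUpdateSolrServer (URL, queue size, thread count and field names are arbitrary examples; the class was later renamed ConcurrentUpdateSolrServer):

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexSketch {
    public static void main(String[] args) throws Exception {
        // 4 internal sender threads, queue of 10000 buffered documents.
        StreamingUpdateSolrServer solr =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 10000, 4);
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "document " + i);
            solr.add(doc); // non-blocking; documents are streamed in the background
        }
        solr.commit(); // single commit at the end, no autoCommit needed
    }
}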
dih and solr cloud
Out of curiosity, trying to see if the new cloud features can replace what I use now... how is (batch) update forwarding solved at the cloud level? Imagine the simple one-shard, one-replica case: if I fire up a DIH update, is this going to be replicated to the replica shard? If yes, - is it going to be sent document by document (network-wise, imagine 100Mio+ update commands going from the receiving node to the replica for big batches) - somehow batched into "packages" to reduce load - or distributed at index level somehow? This is an important case, covered today with master/slave solr replication, but it is not mentioned at http://wiki.apache.org/solr/SolrCloud
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
thanks Mark, I will give it a go and report back... On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller wrote: > Looks like an issue around replication IndexWriter reboot, soft commits and > hard commits. > > I think I've got a workaround for it: > > Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java > === > --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision > 1292344) > +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working copy) > @@ -499,6 +499,17 @@ > > // reboot the writer on the new index and get a new searcher > solrCore.getUpdateHandler().newIndexWriter(); > + Future[] waitSearcher = new Future[1]; > + solrCore.getSearcher(true, false, waitSearcher, true); > + if (waitSearcher[0] != null) { > + try { > + waitSearcher[0].get(); > + } catch (InterruptedException e) { > + SolrException.log(LOG,e); > + } catch (ExecutionException e) { > + SolrException.log(LOG,e); > + } > + } > // update our commit point to the right dir > solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, false)); > > That should allow the searcher that the following commit command prompts to > see the *new* IndexWriter. > > On Feb 22, 2012, at 10:56 AM, eks dev wrote: > >> We started observing strange failures from ReplicationHandler when we >> commit on master trunk version 4-5 days old. >> It works sometimes, and sometimes not didn't dig deeper yet. >> >> Looks like the real culprit hides behind: >> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed >> >> Looks familiar to somebody? >> >> >> 120222 154959 SEVERE SnapPull failed >> :org.apache.solr.common.SolrException: Error opening new searcher >> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138) >> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251) >> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043) >> at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source) >> at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503) >> at >> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348) >> at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source) >> at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163) >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) >> at >> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) >> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) >> at >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) >> at >> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) >> at java.lang.Thread.run(Thread.java:722) >> Caused by: org.apache.lucene.store.AlreadyClosedException: this >> IndexWriter is closed >> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810) >> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815) >> at org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984) >> at >> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254) >> at >> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233) >> at >> 
org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223) >> at >> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) >> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095) >> ... 15 more > > - Mark Miller > lucidimagination.com > > > > > > > > > > >
Re: SnapPull failed :org.apache.solr.common.SolrException: Error opening new searcher
it loos like it works, with patch, after a couple of hours of testing under same conditions didn't see it happen (without it, approx. every 15 minutes). I do not think it will happen again with this patch. Thanks again and my respect to your debugging capacity, my bug report was really thin. On Thu, Feb 23, 2012 at 8:47 AM, eks dev wrote: > thanks Mark, I will give it a go and report back... > > On Thu, Feb 23, 2012 at 1:31 AM, Mark Miller wrote: >> Looks like an issue around replication IndexWriter reboot, soft commits and >> hard commits. >> >> I think I've got a workaround for it: >> >> Index: solr/core/src/java/org/apache/solr/handler/SnapPuller.java >> === >> --- solr/core/src/java/org/apache/solr/handler/SnapPuller.java (revision >> 1292344) >> +++ solr/core/src/java/org/apache/solr/handler/SnapPuller.java (working >> copy) >> @@ -499,6 +499,17 @@ >> >> // reboot the writer on the new index and get a new searcher >> solrCore.getUpdateHandler().newIndexWriter(); >> + Future[] waitSearcher = new Future[1]; >> + solrCore.getSearcher(true, false, waitSearcher, true); >> + if (waitSearcher[0] != null) { >> + try { >> + waitSearcher[0].get(); >> + } catch (InterruptedException e) { >> + SolrException.log(LOG,e); >> + } catch (ExecutionException e) { >> + SolrException.log(LOG,e); >> + } >> + } >> // update our commit point to the right dir >> solrCore.getUpdateHandler().commit(new CommitUpdateCommand(req, >> false)); >> >> That should allow the searcher that the following commit command prompts to >> see the *new* IndexWriter. >> >> On Feb 22, 2012, at 10:56 AM, eks dev wrote: >> >>> We started observing strange failures from ReplicationHandler when we >>> commit on master trunk version 4-5 days old. >>> It works sometimes, and sometimes not didn't dig deeper yet. >>> >>> Looks like the real culprit hides behind: >>> org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed >>> >>> Looks familiar to somebody? 
>>> >>> >>> 120222 154959 SEVERE SnapPull failed >>> :org.apache.solr.common.SolrException: Error opening new searcher >>> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1138) >>> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1251) >>> at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1043) >>> at org.apache.solr.update.DirectUpdateHandler2.commit(Unknown Source) >>> at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:503) >>> at >>> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:348) >>> at org.apache.solr.handler.ReplicationHandler.doFetch(Unknown Source) >>> at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:163) >>> at >>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) >>> at >>> java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) >>> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) >>> at >>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) >>> at >>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) >>> at >>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) >>> at >>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) >>> at java.lang.Thread.run(Thread.java:722) >>> Caused by: org.apache.lucene.store.AlreadyClosedException: this >>> IndexWriter is closed >>> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:810) >>> at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:815) >>> at >>> org.apache.lucene.index.IndexWriter.nrtIsCurrent(IndexWriter.java:3984) >>> at >>> org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:254) >>> at >>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:233) >>> at >>> org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:223) >>> at >>> org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170) >>> at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1095) >>> ... 15 more >> >> - Mark Miller >> lucidimagination.com >> >> >> >> >> >> >> >> >> >> >>
Solr Cloud, Commits and Master/Slave configuration
Hi All, I am trying to understand features of Solr Cloud, regarding commits and scaling. - If I am using Solr Cloud then do I need to explicitly call commit (hard-commit)? Or, a soft commit is okay and Solr Cloud will do the job of writing to disk? - Do We still need to use Master/Slave setup to scale searching? If we have to use Master/Slave setup then do i need to issue hard-commit to make my changes visible to slaves? - If I were to use NRT with Master/Slave setup with soft commit then will the slave be able to see changes made on master with soft commit? Any inputs are welcome. Thanks -Saroj
Re: Solr Cloud, Commits and Master/Slave configuration
SolrCloud is going to be great; the NRT feature is a really huge step forward, as well as central configuration, elasticity ... The only thing I do not yet understand is the treatment of cases that were traditionally covered by a Master/Slave setup: batch updates. If I get it right (?), updates to replicas are sent one by one, meaning when one server receives an update, it gets forwarded to all replicas. This is great for the reduced-update-latency case, but I do not know how it is implemented if you hit it with a "batch" update. This would cause a huge amount of update commands going to replicas. Not so good for throughput. - Master/slave does distribution at segment level (no need to replicate analysis, far less network traffic). Good for batch updates. - SolrCloud does it per update command (low latency, but chatty, and the analysis step is done N_Servers times). Good for incremental updates. Ideally, some sort of "batching" is going to be available in SolrCloud, and some control over it, e.g. forward batches of 1000 documents (basically keep the update log slightly longer and forward it as a batch update command). This would still cause duplicate analysis, but would reduce network traffic. Please bear in mind, this is more of a question than a statement; I didn't look at the cloud code. It might be I am completely wrong here! On Tue, Feb 28, 2012 at 4:01 AM, Erick Erickson wrote: > As I understand it (and I'm just getting into SolrCloud myself), you can > essentially forget about master/slave stuff. If you're using NRT, > the soft commit will make the docs visible, you don't ned to do a hard > commit (unlike the master/slave days). Essentially, the update is sent > to each shard leader and then fanned out into the replicas for that > leader. All automatically. Leaders are elected automatically. ZooKeeper > is used to keep the cluster information. > > Additionally, SolrCloud keeps a transaction log of the updates, and replays > them if the indexing is interrupted, so you don't risk data loss the way > you used to. > > There aren't really masters/slaves in the old sense any more, so > you have to get out of that thought-mode (it's hard, I know). > > The code is under pretty active development, so any feedback is > valuable > > Best > Erick > > On Mon, Feb 27, 2012 at 3:26 AM, roz dev wrote: >> Hi All, >> >> I am trying to understand features of Solr Cloud, regarding commits and >> scaling. >> >> >> - If I am using Solr Cloud then do I need to explicitly call commit >> (hard-commit)? Or, a soft commit is okay and Solr Cloud will do the job of >> writing to disk? >> >> >> - Do We still need to use Master/Slave setup to scale searching? If we >> have to use Master/Slave setup then do i need to issue hard-commit to make >> my changes visible to slaves? >> - If I were to use NRT with Master/Slave setup with soft commit then >> will the slave be able to see changes made on master with soft commit? >> >> Any inputs are welcome. >> >> Thanks >> >> -Saroj
Re: Solr Cloud, Commits and Master/Slave configuration
Thanks Mark, Good, this is probably good enough to give it a try. My analyzers are normally fast, doing duplicate analysis (at each replica) is probably not going to cost a lot, if there is some decent "batching" Can this be somehow controlled (depth of this buffer / time till flush or some such). Which "events" trigger this flushing to replicas (softCommit, commit, something new?) What I found useful is to always think in terms of incremental (low latency) and batch (high throughput) updates. I just then need some knobs to tweak behavior of this update process. I wold really like to move away from Master/Slave, Cloud makes a lot of things way simpler for us users ... Will give it a try in a couple of weeks Later we can even think about putting replication at segment level for "extremely expensive analysis, batch cases", or "initial cluster seeding" as a replication option. But this is then just an optimization. Cheers, eks On Thu, Mar 1, 2012 at 5:24 AM, Mark Miller wrote: > We actually do currently batch updates - we are being somewhat loose when we > say a document at a time. There is a buffer of updates per replica that gets > flushed depending on the requests coming through and the buffer size. > > - Mark Miller > lucidimagination.com > > On Feb 28, 2012, at 3:38 AM, eks dev wrote: > >> SolrCluod is going to be great, NRT feature is really huge step >> forward, as well as central configuration, elasticity ... >> >> The only thing I do not yet understand is treatment of cases that were >> traditionally covered by Master/Slave setup. Batch update >> >> If I get it right (?), updates to replicas are sent one by one, >> meaning when one server receives update, it gets forwarded to all >> replicas. This is great for reduced update latency case, but I do not >> know how is it implemented if you hit it with "batch" update. This >> would cause huge amount of update commands going to replicas. Not so >> good for throughput. >> >> - Master slave does distribution at segment level, (no need to >> replicate analysis, far less network traffic). Good for batch updates >> - SolrCloud does par update command (low latency, but chatty and >> Analysis step is done N_Servers times). Good for incremental updates >> >> Ideally, some sort of "batching" is going to be available in >> SolrCloud, and some cont roll over it, e.g. forward batches of 1000 >> documents (basically keep update log slightly longer and forward it as >> a batch update command). This would still cause duplicate analysis, >> but would reduce network traffic. >> >> Please bare in mind, this is more of a question than a statement, I >> didn't look at the cloud code. It might be I am completely wrong here! >> >> >> >> >> >> On Tue, Feb 28, 2012 at 4:01 AM, Erick Erickson >> wrote: >>> As I understand it (and I'm just getting into SolrCloud myself), you can >>> essentially forget about master/slave stuff. If you're using NRT, >>> the soft commit will make the docs visible, you don't ned to do a hard >>> commit (unlike the master/slave days). Essentially, the update is sent >>> to each shard leader and then fanned out into the replicas for that >>> leader. All automatically. Leaders are elected automatically. ZooKeeper >>> is used to keep the cluster information. >>> >>> Additionally, SolrCloud keeps a transaction log of the updates, and replays >>> them if the indexing is interrupted, so you don't risk data loss the way >>> you used to. 
>>> >>> There aren't really masters/slaves in the old sense any more, so >>> you have to get out of that thought-mode (it's hard, I know). >>> >>> The code is under pretty active development, so any feedback is >>> valuable >>> >>> Best >>> Erick >>> >>> On Mon, Feb 27, 2012 at 3:26 AM, roz dev wrote: >>>> Hi All, >>>> >>>> I am trying to understand features of Solr Cloud, regarding commits and >>>> scaling. >>>> >>>> >>>> - If I am using Solr Cloud then do I need to explicitly call commit >>>> (hard-commit)? Or, a soft commit is okay and Solr Cloud will do the job >>>> of >>>> writing to disk? >>>> >>>> >>>> - Do We still need to use Master/Slave setup to scale searching? If we >>>> have to use Master/Slave setup then do i need to issue hard-commit to >>>> make >>>> my changes visible to slaves? >>>> - If I were to use NRT with Master/Slave setup with soft commit then >>>> will the slave be able to see changes made on master with soft commit? >>>> >>>> Any inputs are welcome. >>>> >>>> Thanks >>>> >>>> -Saroj > > > > > > > > > > > >
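For the commit question that started the thread, a small SolrJ illustration of the soft/hard commit difference (assuming a SolrJ build recent enough to expose the soft-commit flag on commit()):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CommitModesSketch {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "42");
        solr.add(doc);

        // Soft commit: visible to searchers quickly, not yet flushed durably to disk.
        solr.commit(true, true, true);   // waitFlush, waitSearcher, softCommit

        // Hard commit: durable on disk (the transaction log covers the window in between).
        solr.commit(true, true, false);
    }
}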
Re: Solr Design question on spatial search
So let's say x=10 miles. Now if I search for San then San Francisco, San Mateo should be returned because there is a retail store in San Francisco. But San Jose should not be returned because it is more than 10 miles away from San Francisco. Had there been a retail store in San Jose then it should be also returned when you search for San. I can restrict the queries to a country. Thanks, ~Venu On Mar 2, 2012, at 5:57 AM, Erick Erickson wrote: > I don't see how this works, since your search for San could also return > San Marino, Italy. Would you then return all retail stores in > X miles of that city? What about San Salvador de Jujuy, Argentina? > > And even in your example, San would match San Mateo. But should > the search then return any stores within X miles of San Mateo? > You have to stop somewhere > > Is there any other information you have that restricts how far to expand the > search? > > Best > Erick > > On Thu, Mar 1, 2012 at 4:57 PM, Venu Gmail Dev > wrote: >> I don't think Spatial search will fully fit into this. I have 2 approaches >> in mind but I am not satisfied with either one of them. >> >> a) Have 2 separate indexes. First one to store the information about all the >> cities and second one to store the retail stores information. Whenever user >> searches for a city then I return all the matching cities from first index >> and then do a spatial search on each of the matched city in the second >> index. But this is too costly. >> >> b) Index only the cities which have a nearby store. Do all the >> calculation(s) before indexing the data so that the search is fast. The >> problem that I see with this approach is that if a new retail store or a >> city is added then I would have to re-index all the data again. >> >> >> On Mar 1, 2012, at 7:59 AM, Dirceu Vieira wrote: >> >>> I believe that what you need is spatial search... >>> >>> Have a look a the documention: http://wiki.apache.org/solr/SpatialSearch >>> >>> On Wed, Feb 29, 2012 at 10:54 PM, Venu Shankar >>> wrote: >>> >>>> Hello, >>>> >>>> I have a design question for Solr. >>>> >>>> I work for an enterprise which has a lot of retail stores (approx. 20K). >>>> These retail stores are spread across the world. My search requirement is >>>> to find all the cities which are within x miles of a retail store. >>>> >>>> So lets say if we have a retail Store in San Francisco and if I search for >>>> "San" then San Francisco, Santa Clara, San Jose, San Juan, etc should be >>>> returned as they are within x miles from San Francisco. I also want to rank >>>> the search results by their distance. >>>> >>>> I can create an index with all the cities in it but I am not sure how do I >>>> ensure that the cities returned in a search result have a nearby retail >>>> store. Any suggestions ? >>>> >>>> Thanks, >>>> Venu, >>>> >>> >>> >>> >>> -- >>> Dirceu Vieira Júnior >>> --- >>> +47 9753 2473 >>> dirceuvjr.blogspot.com >>> twitter.com/dirceuvjr >>
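Assuming each city document carries a lat/lon point and the store locations are known, the "within x miles" part is a standard geofilt plus a geodist sort; a sketch where city_name and location are assumed field names and 16 km approximates the 10-mile example:

import org.apache.solr.client.solrj.SolrQuery;

public class NearbyStoreQuerySketch {
    // Cities matching the typed prefix, restricted to within ~16 km (10 miles)
    // of a known store location, nearest first.
    public static SolrQuery citiesNearStore(String prefix, double lat, double lon) {
        SolrQuery q = new SolrQuery("city_name:" + prefix + "*");
        q.set("sfield", "location");        // a solr.LatLonType field on the city docs
        q.set("pt", lat + "," + lon);       // the store's location
        q.set("d", "16");                   // radius in km
        q.addFilterQuery("{!geofilt}");     // keep only cities within d km of pt
        q.addSortField("geodist()", SolrQuery.ORDER.asc);
        return q;
    }
}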
Re: [SoldCloud] Slow indexing
hmm, loks like you are facing exactly the phenomena I asked about. See my question here: http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/61326 On Sun, Mar 4, 2012 at 9:24 PM, Markus Jelsma wrote: > Hi, > > With auto-committing disabled we can now index many millions of documents in > our test environment on a 5-node cluster with 5 shards and a replication > factor of 2. The documents are uploaded from map/reduce. No significant > changes were made to solrconfig and there are no update processors enabled. > We are using a trunk revision from this weekend. > > The indexing speed is well below what we are used to see, we can easily > index 5 millions documents on a non-cloud enabled Solr 3.x instance within > an hour. What could be going on? There aren't many open TCP connections and > the number of file descriptors is stable and I/O is low but CPU-time is > high! Each node has two Solr cores both writing to their dedicated disk. > > The indexing speed is stable, it was slow at start and still is. It's now > running for well over 6 hours and only 3.5 millions documents are indexed. > Another strange detail is that the node receiving all incoming documents > (we're not yet using a client side Solr server pool) has a much larger disk > usage than all other nodes. This is peculiar as we expected all replica's to > be a about the same size. > > The receiving node has slightly higher CPU than the other nodes but the > thread dump shows a very large amount of threads of type > cmdDistribExecutor-8-thread-292260 (295090) with 0-100ms CPU-time. At the > top of the list these threads all have < 20ms time but near the bottom it > rises to just over 100ms. All nodes have a couple of http-80-30 (121994) > threads with very high CPU-time each. > > Is this a known issue? Did i miss something? Any ideas? > > Thanks
Re: Solr 4.0 and production environments
I have been here on Lucene as a user since the project started, even before Solr came to life, many many years ago. And I was always using the trunk version for pretty big customers, and *never* experienced any serious problems. The worst thing that can happen is to notice a bug somewhere, and if you have some reasonable testing for your product, you will see it quickly. But, with this community, *you will definitely not have to wait long to get it fixed*. Not only will they fix it, they will thank you for bringing it up! I can, as an old user, 100 % vouch for what Robert said below. Simply, just go for it, test your application a bit and make your users happy. On Wed, Mar 7, 2012 at 5:55 PM, Robert Muir wrote: > On Wed, Mar 7, 2012 at 11:47 AM, Dirceu Vieira wrote: >> Hi All, >> >> Has anybody started using Solr 4.0 in production environments? Is it stable >> enough? >> I'm planning to create a proof of concept using solr 4.0, we have some >> projects that will gain a lot with features such as near real time search, >> joins and others, that are available only on version 4. >> >> Is it too risky to think of using it right now? >> What are your thoughts and experiences with that? >> > > In general, we try to keep our 'trunk' (slated to be 4.0) in very > stable condition. > > Really, it should be 'ready-to-release' at any time, of course 4.0 has > had many drastic changes: both at the Lucene and Solr level. > > Before deciding what is stable, you should define stability: is it: > * api stability: will i be able to upgrade to a more recent snapshot > of 4.0 without drastic changes to my app? > * index format stability: will i be able to upgrade to a more recent > snapshot of 4.0 without re-indexing? > * correctness: is 4.0 dangerous in some way that it has many bugs > since much of the code is new? > > I think you should limit your concerns to only the first 2 items, as > far as correctness, just look at the tests. For any open source > project, you can easily judge its quality by its tests: this is a > fact. > > For lucene/solr the testing strategy, in my opinion, goes above and > beyond many other projects: for example random testing: > http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011_presentations#dawid_weiss > > and the new solr cloud functionality also adds the similar chaosmonkey > concept on top of this already. > > If you are worried about bugs, is a lucene/solr trunk snapshot less > reliable than even a released version of alternative software? its an > interesting question. look at their tests. > > -- > lucidimagination.com
Is there any performance cost of using lots of OR in the solr query
Hi All, I am working on an application which makes a few Solr calls to get the data. At a high level, we have a requirement like this: - Make a first call to Solr to get the list of products which are children of a given category - Make a 2nd Solr call to get product documents based on a list of product ids The 2nd query will look like q=document_type:SKU&fq=product_id:(34 OR 45 OR 56 OR 77) and we can have close to 100 product ids in the fq. Is there a performance cost of doing these Solr calls which have lots of ORs? As per slide #41 of the presentation "The Seven Deadly Sins of Solr", it is a bad idea to have this kind of query: http://www.slideshare.net/lucenerevolution/hill-jay-7-sins-of-solrpdf But it does not make clear why it is bad. Any inputs will be welcome. Thanks Saroj
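For reference, a sketch of how such a filter is typically assembled client-side; the practical limits are maxBooleanClauses in solrconfig.xml and the fact that each distinct fq string is cached as its own entry in the filter cache (field names and IDs follow the example above):

import org.apache.solr.client.solrj.SolrQuery;
import java.util.List;

public class ProductIdFilterSketch {
    public static SolrQuery skusByProductIds(List<String> productIds) {
        StringBuilder fq = new StringBuilder("product_id:(");
        for (int i = 0; i < productIds.size(); i++) {
            if (i > 0) fq.append(" OR ");
            fq.append(productIds.get(i));
        }
        fq.append(')');

        SolrQuery q = new SolrQuery("document_type:SKU");
        q.addFilterQuery(fq.toString()); // ~100 clauses is usually fine; thousands may need tuning
        return q;
    }
}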
Re: How to do custom sorting in Solr?
Hi All > > I have an index which contains a Catalog of Products and Categories, with > Solr 4.0 from trunk > > Data is organized like this: > > Category: Books > > Sub Category: Programming > > Products: > > Product # 1, Price: Regular Sort Order:1 > Product # 2, Price: Markdown, Sort Order:2 > Product # 3 Price: Regular, Sort Order:3 > Product # 4 Price: Regular, Sort Order:4 > > . > ... > Product # 100 Price: Regular, Sort Order:100 > > Sub Category: Fiction > > Products: > > Product # 1, Price: Markdown, Sort Order:1 > Product # 2, Price: Regular, Sort Order:2 > Product # 3 Price: Regular, Sort Order:3 > Product # 4 Price: Markdown, Sort Order:4 > > . > ... > Product # 70 Price: Regular, Sort Order:70 > > > I want to query Solr and sort these products within each of the > sub-category in a such a way that products which are on markdown, are at > the bottom of the documents list and other products > which are on regular price, are sorted as per their sort order in their > sub-category. > > Expected Results are > > Category: Books > > Sub Category: Programming > > Products: > > Product # 1, Price: Regular Sort Order:1 > Product # 2, Price: Markdown, Sort Order:101 > Product # 3 Price: Regular, Sort Order:3 > Product # 4 Price: Regular, Sort Order:4 > > . > ... > Product # 100 Price: Regular, Sort Order:100 > > Sub Category: Fiction > > Products: > > Product # 1, Price: Markdown, Sort Order:71 > Product # 2, Price: Regular, Sort Order:2 > Product # 3 Price: Regular, Sort Order:3 > Product # 4 Price: Markdown, Sort Order:71 > > . > ... > Product # 70 Price: Regular, Sort Order:70 > > > My query is like this: > > q=*:*&fq=category:Books > > What are the options to implement custom sorting and how do I do it? > > >- Define a Custom Function query? >- Define a Custom Comparator? Or, >- Define a Custom Collector? > > > Please let me know the best way to go about it and any pointers to > customize Solr 4. > Thanks Saroj
Re: How to do custom sorting in Solr?
Thanks Erick for your quick feedback. When products are assigned to a category or sub-category they can be in any order, and the price type can be regular or markdown. So, regular and markdown products are intermingled as per their assignment, but I want to sort them in such a way that we ensure that all the products which are on markdown are at the bottom of the list. I can use these multiple sorts, but I realize that they are costly in terms of heap used, as they are using FieldCache. I have an index with 2M docs and the docs are pretty big. So, I don't want to use them unless there is no other option. I am wondering if I can define a custom function query which works like this: - check if the product is on markdown - if yes, then change its sort order field to be the max value in the given sub-category, say 99 - else, use the sort order of the product in the sub-category I have been looking at existing function queries but do not have a good handle on how to make one of my own. - Another option could be to use a custom sort comparator, but I am not sure about the way it works Any thoughts? -Saroj On Sun, Jun 10, 2012 at 5:02 AM, Erick Erickson wrote: > Skimming this, two options come to mind: > > 1> Simply apply primary, secondary, etc sorts. Something like > &sort=subcategory asc,markdown_or_regular desc,sort_order asc > > 2> You could also use grouping to arrange things in groups and sort within > those groups. This has the advantage of returning some members > of each of the top N groups in the result set, which makes it easier > to > get some of each group rather than having to analyze the whole > list > > But your example is somewhat contradictory. You say > "products which are on markdown, are at > the bottom of the documents list" > > But in your examples, products on "markdown" are intermingled > > Best > Erick > > On Sun, Jun 10, 2012 at 3:36 AM, roz dev wrote: > > Hi All > > > >> > >> I have an index which contains a Catalog of Products and Categories, > with > >> Solr 4.0 from trunk > >> > >> Data is organized like this: > >> > >> Category: Books > >> > >> Sub Category: Programming > >> > >> Products: > >> > >> Product # 1, Price: Regular Sort Order:1 > >> Product # 2, Price: Markdown, Sort Order:2 > >> Product # 3 Price: Regular, Sort Order:3 > >> Product # 4 Price: Regular, Sort Order:4 > >> > >> . > >> ... > >> Product # 100 Price: Regular, Sort Order:100 > >> > >> Sub Category: Fiction > >> > >> Products: > >> > >> Product # 1, Price: Markdown, Sort Order:1 > >> Product # 2, Price: Regular, Sort Order:2 > >> Product # 3 Price: Regular, Sort Order:3 > >> Product # 4 Price: Markdown, Sort Order:4 > >> > >> . > >> ... > >> Product # 70 Price: Regular, Sort Order:70 > >> > >> > >> I want to query Solr and sort these products within each of the > >> sub-category in a such a way that products which are on markdown, are at > >> the bottom of the documents list and other products > >> which are on regular price, are sorted as per their sort order in their > >> sub-category. > >> > >> Expected Results are > >> > >> Category: Books > >> > >> Sub Category: Programming > >> > >> Products: > >> > >> Product # 1, Price: Regular Sort Order:1 > >> Product # 2, Price: Markdown, Sort Order:101 > >> Product # 3 Price: Regular, Sort Order:3 > >> Product # 4 Price: Regular, Sort Order:4 > >> > >> . > >> ...
> >> Product # 100 Price: Regular, Sort Order:100 > >> > >> Sub Category: Fiction > >> > >> Products: > >> > >> Product # 1, Price: Markdown, Sort Order:71 > >> Product # 2, Price: Regular, Sort Order:2 > >> Product # 3 Price: Regular, Sort Order:3 > >> Product # 4 Price: Markdown, Sort Order:71 > >> > >> . > >> ... > >> Product # 70 Price: Regular, Sort Order:70 > >> > >> > >> My query is like this: > >> > >> q=*:*&fq=category:Books > >> > >> What are the options to implement custom sorting and how do I do it? > >> > >> > >>- Define a Custom Function query? > >>- Define a Custom Comparator? Or, > >>- Define a Custom Collector? > >> > >> > >> Please let me know the best way to go about it and any pointers to > >> customize Solr 4. > >> > > > > Thanks > > Saroj >
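Regarding the custom function query idea above: assuming a Solr 4.x build where sorting by function queries and the if()/termfreq() functions are available, the "markdown last" rule can be expressed as a sort function instead of a custom comparator. This is only a sketch; price_type and sort_order are hypothetical field names standing in for however the price type and per-sub-category sort order are actually indexed.

import org.apache.solr.client.solrj.SolrQuery;

public class MarkdownLastQuery {
    // Builds a query that sorts markdown products to the bottom and
    // regular-priced products by their existing sort order.
    public static SolrQuery build(String category) {
        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("category:" + category);
        // if(termfreq(price_type,'Markdown'),1,0) evaluates to 1 for markdown
        // documents and 0 otherwise, so ascending order puts markdown last.
        query.set("sort", "if(termfreq(price_type,'Markdown'),1,0) asc, sort_order asc");
        return query;
    }
}

Note that sort_order is still loaded into the FieldCache for sorting, so this alone does not address the memory concern with many per-category sort fields.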
Re: How to do custom sorting in Solr?
Yes, these documents have lots of unique values, as the same product could be assigned to lots of other categories, and that too in a different sort order. We did some evaluation of heap usage and found that with the kind of queries we generate, heap usage was going up to 24-26 GB. I could trace it to the fact that the FieldCache is creating an array of 2M size for each of the sort fields. Since the same products are mapped to multiple categories, we incur significant memory overhead. Therefore, any solution where memory consumption can be reduced is a good one for me. In fact, we have situations where the same product is mapped to more than one sub-category in the same category, like Books -- Programming - Java in a Nutshell -- Sale (40% off) - Java in a Nutshell So, another thought in my mind is to somehow use a second-pass collector to group books appropriately in the Programming and Sale categories, with the right sort order. But I have no clue about that piece :( -Saroj On Sun, Jun 10, 2012 at 4:30 PM, Erick Erickson wrote: > 2M docs is actually pretty small. Sorting is sensitive to the number > of _unique_ values in the sort fields, not necessarily the number of > documents. > > And sorting only works on fields with a single value (i.e. it can't have > more than one token after analysis). So for each field you're only talking > 2M values at the very maximum, assuming that the field in question has > a unique value per document, which I doubt very much given your > problem description. > > So with a corpus that size, I'd just try it. > > Best > Erick > > On Sun, Jun 10, 2012 at 7:12 PM, roz dev wrote: > > Thanks Erik for your quick feedback > > > > When Products are assigned to a category or Sub-Category then they can be > > in any order and price type can be regular or markdown. > > So, reg and markdown products are intermingled as per their assignment > but > > I want to sort them in such a way that we > > ensure that all the products which are on markdown are at the bottom of > the > > list. > > > > I can use these multiple sorts but I realize that they are costly in > terms > > of heap used, as they are using FieldCache. > > > > I have an index with 2M docs and docs are pretty big. So, I don't want to > > use them unless there is no other option. > > > > I am wondering if I can define a custom function query which can be like > > this: > > > > > > - check if product is on the markdown > > - if yes then change its sort order field to be the max value in the > > given sub-category, say 99 > > - else, use the sort order of the product in the sub-category > > > > I have been looking at existing function queries but do not have a good > > handle on how to make one of my own. > > > > - Another option could be use a custom sort comparator but I am not sure > > about the way it works > > > > Any thoughts? > > > > > > -Saroj > > > > > > > > > > On Sun, Jun 10, 2012 at 5:02 AM, Erick Erickson >wrote: > > > >> Skimming this, I two options come to mind: > >> > >> 1> Simply apply primary, secondary, etc sorts. Something like > >> &sort=subcategory asc,markdown_or_regular desc,sort_order asc > >> > >> 2> You could also use grouping to arrange things in groups and sort > within > >> those groups. This has the advantage of returning some members > >> of each of the top N groups in the result set, which makes it > easier > >> to > >> get some of each group rather than having to analyze the whole > >> list > >> > >> But your example is somewhat contradictory.
You say > >> "products which are on markdown, are at > >> the bottom of the documents list" > >> > >> But in your examples, products on "markdown" are intermingled > >> > >> Best > >> Erick > >> > >> On Sun, Jun 10, 2012 at 3:36 AM, roz dev wrote: > >> > Hi All > >> > > >> >> > >> >> I have an index which contains a Catalog of Products and Categories, > >> with > >> >> Solr 4.0 from trunk > >> >> > >> >> Data is organized like this: > >> >> > >> >> Category: Books > >> >> > >> >> Sub Category: Programming > >> >> > >> >> Products: > >> >> > >> >> Product # 1, Price: Regular Sort Order:1 > >> >> Product # 2, Price: Markdown, So
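For Erick's second option, a grouped request along the following lines might be worth trying. Again just a sketch: sub_category, price_type, and sort_order are hypothetical field names, and group.limit controls how many products come back per sub-category.

import org.apache.solr.client.solrj.SolrQuery;

public class GroupedCatalogQuery {
    // Groups results by sub-category and sorts within each group.
    public static SolrQuery build(String category) {
        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("category:" + category);
        query.set("group", "true");
        query.set("group.field", "sub_category");
        // With the values 'Markdown'/'Regular', descending order puts 'Regular'
        // first and 'Markdown' last within each sub-category group.
        query.set("group.sort", "price_type desc, sort_order asc");
        query.set("group.limit", "20");
        return query;
    }
}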
Re: Issue with field collapsing in solr 4 while performing distributed search
I think there is no way around doing custom logic in this case. If the indexing process knows that documents have to be grouped, then they had better be indexed together. -Saroj On Mon, Jun 11, 2012 at 6:37 AM, Nitesh Nandy wrote: > Martijn, > > How do we add a custom algorithm for distributing documents in Solr Cloud? > According to this discussion > > http://lucene.472066.n3.nabble.com/SolrCloud-how-to-index-documents-into-a-specific-core-and-how-to-search-against-that-core-td3985262.html > , Mark discourages users from using a custom distribution mechanism in Solr > Cloud. > > Load balancing is not an issue for us at the moment. In that case, how > should we implement a custom partitioning algorithm? > > > On Mon, Jun 11, 2012 at 6:23 PM, Martijn v Groningen < > martijn.v.gronin...@gmail.com> wrote: > > > The ngroups returns the number of groups that have matched with the > > query. However if you want ngroups to be correct in a distributed > > environment you need > > to put document belonging to the same group into the same shard. > > Groups can't cross shard boundaries. I guess you need to do > > some manual document partitioning. > > > > Martijn > > > > On 11 June 2012 14:29, Nitesh Nandy wrote: > > > Version: Solr 4.0 (svn build 30th may, 2012) with Solr Cloud (2 slices > > and > > > 2 shards) > > > > > > The setup was done as per the wiki: > > http://wiki.apache.org/solr/SolrCloud > > > > > > We are doing distributed search. While querying, we use field > collapsing > > > with "ngroups" set as true as we need the number of search results. > > > > > > However, there is a difference in the number of "result list" returned > > and > > > the "ngroups" value returned. > > > > > > Ex: > > > > > > http://localhost:8983/solr/select?q=message:blah%20AND%20userid:3&&group=true&group.field=id&group.ngroups=true > > > > > > The response XML (markup was stripped in this archive) shows matches=10 and ngroups=9 in the grouped section, but only four groups are returned: 320043, 398807, 346878 and 346880. > > > So you can see that the ngroups value returned is 9 and the actual > number > > > of groups returned is 4 > > > > > > Why do we have this discrepancy in the ngroups, matches and actual > number > > > of groups. Is this an open issue ? > > > > > > Any kind of help is appreciated. > > > > > > -- > > > Regards, > > > > > > Nitesh Nandy > > > > > > > > -- > > Met vriendelijke groet, > > > > Martijn van Groningen > > > > > > -- > Regards, > > Nitesh Nandy >
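To illustrate the manual partitioning idea, and only as a rough sketch: if each shard can be addressed directly for indexing and documents are not re-routed by SolrCloud's distributed update handling (which is exactly the kind of custom distribution Mark discourages in the linked thread), the indexer could pick the target shard from a hash of the group key. The host URLs and the use of the userid as the group key here are purely illustrative.

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class GroupAwareIndexer {
    // Hypothetical shard endpoints; in a real setup these would come from
    // the cluster configuration.
    private final HttpSolrServer[] shards = {
        new HttpSolrServer("http://host1:8983/solr"),
        new HttpSolrServer("http://host2:8983/solr")
    };

    // Send every document of a group (here: the userid) to the same shard,
    // so that groups never cross shard boundaries and ngroups adds up.
    public void index(SolrInputDocument doc, String groupKey)
            throws SolrServerException, IOException {
        int shard = (groupKey.hashCode() & 0x7fffffff) % shards.length;
        shards[shard].add(doc);
    }
}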
SolrJ Question about Bad Request Root cause error
Hi All, We are using the SolrJ client (v1.4.1) to integrate with our Solr search server. We notice that whenever a SolrJ request does not match the Solr schema, we get a Bad Request exception, which makes sense: org.apache.solr.common.SolrException: Bad Request But the SolrJ client does not provide any clue about the reason the request is bad. Is there any way to get the root cause on the client side? Of course, the Solr server logs have enough info to know that the data is bad, but it would be great to have the same info in the exception generated by SolrJ. Any thoughts? Is there any plan to add this in future releases? Thanks, Saroj
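As far as I know, with SolrJ 1.4.x the most you can pull out on the client side is the HTTP status code and the generic message carried by the SolrException; the field-level detail stays in the server log. A small sketch of catching it (the class and method names are hypothetical):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithDiagnostics {
    public static void addDoc(CommonsHttpSolrServer server, SolrInputDocument doc)
            throws SolrServerException, IOException {
        try {
            server.add(doc);
        } catch (SolrException e) {
            // SolrJ 1.4.x typically surfaces only the HTTP status and a generic
            // message; the schema-level cause must be read from the server log.
            System.err.println("Solr rejected the document: code=" + e.code()
                    + ", message=" + e.getMessage());
            throw e;
        }
    }
}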
Question about http://wiki.apache.org/solr/Deduplication
Hi, The use case I am trying to figure out is about preserving IDs without re-indexing on duplicates; instead, I want to add the new ID to a list of document ID "aliases". Example input collection: "id":1, "text":"dummy text 1", "signature":"A" "id":2, "text":"dummy text 1", "signature":"A" I add the first document into an empty index; the text is indexed and the ID is "1", so far so good. Now the question: if I add the second document with id == "2", instead of deleting and re-indexing, I would like to store id == 2 in a multivalued field "id". In the end, I would have one less document indexed and both IDs would be searchable (and stored as well)... Is it possible in Solr to have a multivalued "id"? Or do I need to make my own "mv_ID" field for this? Any ideas how to achieve this efficiently? My goal is not to add new documents if the signature matches, but still to have the IDs indexed and stored. Thanks, eks
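As far as I know, the uniqueKey field itself cannot be multivalued in Solr, so an auxiliary multivalued field for the alias IDs is probably needed. Below is a rough client-side sketch of one way to do it, assuming Solr 4.x atomic updates are available (updateLog enabled, fields stored), the signature is computed before indexing, and the field names id_aliases, signature, and text are hypothetical.

import java.io.IOException;
import java.util.Collections;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class AliasingIndexer {
    public static void indexOrAlias(HttpSolrServer server, String id,
                                    String text, String signature)
            throws SolrServerException, IOException {
        // Look for an existing document with the same content signature.
        QueryResponse rsp = server.query(new SolrQuery("signature:" + signature));
        SolrDocumentList hits = rsp.getResults();

        SolrInputDocument doc = new SolrInputDocument();
        if (hits.getNumFound() > 0) {
            // Duplicate content: keep the existing document and just append
            // the new id to its multivalued alias field via an atomic update.
            doc.addField("id", hits.get(0).getFieldValue("id"));
            doc.addField("id_aliases", Collections.singletonMap("add", id));
        } else {
            doc.addField("id", id);
            doc.addField("text", text);
            doc.addField("signature", signature);
        }
        server.add(doc);
        server.commit();
    }
}

As far as I can tell, the stock SignatureUpdateProcessor either overwrites or drops duplicates rather than collecting their IDs, so something along these lines (or a custom update processor doing the same on the server side) seems necessary.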