Re: Using solr(cloud) as source-of-truth for data (with no backing external db)
@alex That makes sense, but it can be ~fixed by just storing every field that you need.

@Walter Many of those things are missing from many NoSQL dbs, yet they're used as the source of data. As long as the backup is "point in time", meaning a consistent timestamp across all shards, it ~should be OK for many use cases.

The 1-line curl may need a patch so it can be disabled from config.

On Thu, Nov 17, 2016 at 6:29 PM, Walter Underwood wrote:
> I agree, it is a bad idea.
>
> Solr is missing nearly everything you want in a repository, because it is not designed to be a repository.
>
> Does not have:
>
> * access control
> * transactions
> * transactional backup
> * dump and load
> * schema migration
> * versioning
>
> And so on.
>
> Also, I'm glad to share a one-line curl command that will delete all the documents in your collection.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Nov 17, 2016, at 1:20 AM, Alexandre Rafalovitch wrote:
> >
> > I've heard of people doing it but it is not recommended.
> >
> > One of the biggest implementation breakthroughs is that - after the initial learning curve - you will start mapping your input data to signals. Those signals will not look very much like your original data and therefore are not terribly suitable to be the source of it.
> >
> > We are talking copyFields, UpdateRequestProcessor pre-processing, fields that are not stored, nested documents flattening, denormalization, etc. Getting back from that to the original shape of data is painful.
> >
> > Regards,
> >    Alex.
> >
> > Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG
> > Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/
> >
> > On 17 November 2016 at 18:46, Dorian Hoxha wrote:
> >> Hi,
> >>
> >> Anyone use solr for source-of-data with no `normal` db (of course with normal backups/replication)?
> >>
> >> Are there any drawbacks?
> >>
> >> Thank You
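To make Walter's caution concrete, here is a minimal sketch (not his exact curl command, which he didn't post) of how little stands between "Solr as source of truth" and an empty index; the URL and collection name are placeholders.

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DeleteAllExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint; when Solr holds the only copy of the data,
        // these three lines are unrecoverable without a point-in-time backup.
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
        client.deleteByQuery("*:*");   // removes every document in the collection
        client.commit();
        client.close();
    }
}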
Combined Dismax and Block Join Scoring on nested documents
Apologies if I'm doing something incredibly stupid as I'm new to Solr. I am having an issue with scoring child documents in a block join query when including a dismax query. I'm actually a little unclear on whether or not that's a complete oxymoron, combining dismax and block join.

Problem statement: Given a set of Product documents - which contain the product names and descriptions - which contain nested variant documents (see below for an abridged example) - which contain the boolean stock status (in_stock) and the variant prices (list_price_gbp) - I want to do a Dismax query of, say, "skirt" on the product name (name) and sort the resulting product documents by the minimum price (list_price_gbp) of their child variant documents. Note that, although the abridged document doesn't show them, there are a number of other arbitrary fields which may be used as filter queries on the child documents, for example size or colour, which will in effect change the "active" minimum price of a product. Hence, denormalizing, or flattening, the documents is not really an option I want to pursue.

An abridged example document returned by the Solr Admin Query console which I am querying:

12345
product
black flared skirt
40.0

12345abcd
12345
variant
65.0
true

12345fghi
12345
variant
40.0
true

So I am familiar with the block join score mode; setting aside the dismax aspect for now, this query, using the Function Query {!func}list_price_gbp, with score ascending, returns documents ordered correctly, with a £2.00 (cheapest) product first:

q={!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml

The "explain" for this is:

2.184 = Score based on 1 child docs in range from 26752 to 26752, best match:
  2.184 = sum of:
    1.8374416E-5 = weight(in_stock:T in 26752) [], result of:
      1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0), product of:
        1.8374416E-5 = idf(docFreq=27211, docCount=27211)
        1.0 = tfNorm, computed from:
          1.0 = termFreq=1.0
          1.2 = parameter k1
          0.0 = parameter b (norms omitted for field)
    2.0 = FunctionQuery(float(list_price_gbp)), product of:
      2.0 = float(list_price_gbp)=2.0
      1.0 = boost
      1.0 = queryNorm

Even though this is doing what I want, I have a slight niggle that the overall score is not just the result of the Function Query; however, as all results get the same tiny fraction added, it doesn't matter.

However, when I prepend my dismax query:

q={!dismax v="skirt" qf="name"}+{!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml

The scoring is only dependent on the dismax scoring, where the "explain" for this is:

2.7600822 = sum of:
  2.7600822 = weight(name:skirt in 13406) [], result of:
    2.7600822 = score(doc=13406,freq=1.0 = termFreq=1.0), product of:
      3.5851278 = idf(docFreq=103, docCount=3731)
      0.76987 = tfNorm, computed from:
        1.0 = termFreq=1.0
        1.2 = parameter k1
        0.75 = parameter b
        4.108818 = avgFieldLength
        7.11 = fieldLength

So in actual fact, with score ascending, it is ordering the results by least matching first and the nested document list_price_gbp is irrelevant.
I strongly suspect I am being totally dumb and that this is expected behaviour for an obvious reason that escapes me, apart from perhaps it's because the two scoring methods are just plainly incompatible.

I have additionally tried just doing a lucene query instead:

q=+name:skirt +{!parent which=content_type:product score=min} (in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml

The "explain" of this indicates it's scoring products, for which list_price_gbp simply does not exist, as the Function Query always returns zero.

3.6243963 = sum of:
  3.624396 = weight(name:skirt in 18113) [], result of:
    3.624396 = score(doc=18113,freq=1.0 = termFreq=1.0), product of:
      3.5851278 = idf(docFreq=103, docCount=3731)
Re: Using solr(cloud) as source-of-truth for data (with no backing external db)
Sure. And people do it. Especially for their first deployment. I have some prototypes/proof-of-concepts like that myself.

Just don't say later that you didn't ask and we didn't tell :-)

Regards,
   Alex.

Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/

On 18 November 2016 at 20:45, Dorian Hoxha wrote:
> @alex
> That makes sense, but it can be ~fixed by just storing every field that you need.
>
> @Walter
> Many of those things are missing from many nosql dbs yet they're used as source of data.
> As long as the backup is "point in time", meaning consistent timestamp across all shards it ~should be ok for many usecases.
>
> The 1-line-curl may need a patch to be disabled from config.
>
> On Thu, Nov 17, 2016 at 6:29 PM, Walter Underwood wrote:
>
>> I agree, it is a bad idea.
>>
>> Solr is missing nearly everything you want in a repository, because it is not designed to be a repository.
>>
>> Does not have:
>>
>> * access control
>> * transactions
>> * transactional backup
>> * dump and load
>> * schema migration
>> * versioning
>>
>> And so on.
>>
>> Also, I'm glad to share a one-line curl command that will delete all the documents in your collection.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>> > On Nov 17, 2016, at 1:20 AM, Alexandre Rafalovitch wrote:
>> >
>> > I've heard of people doing it but it is not recommended.
>> >
>> > One of the biggest implementation breakthroughs is that - after the initial learning curve - you will start mapping your input data to signals. Those signals will not look very much like your original data and therefore are not terribly suitable to be the source of it.
>> >
>> > We are talking copyFields, UpdateRequestProcessor pre-processing, fields that are not stored, nested documents flattening, denormalization, etc. Getting back from that to original shape of data is painful.
>> >
>> > Regards,
>> >    Alex.
>> >
>> > Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG
>> > Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/
>> >
>> > On 17 November 2016 at 18:46, Dorian Hoxha wrote:
>> >> Hi,
>> >>
>> >> Anyone use solr for source-of-data with no `normal` db (of course with normal backups/replication)?
>> >>
>> >> Are there any drawbacks?
>> >>
>> >> Thank You
json facet api and facet.threads
Hi everybody,

can anyone point me in the right direction for using "facet.threads" with the JSON Faceting API? Does it only work if terms facets are exclusively used in the query?

Best regards

Michael Aleythe
Java Developer | STERNWALD SYSTEMS GMBH
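For context, a sketch of the kind of request being asked about - a JSON terms facet with facet.threads passed alongside as an ordinary parameter. The collection and field names are invented, and whether the parameter has any effect on JSON faceting is exactly the open question here.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class JsonFacetThreadsExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        // JSON Facet API request: a terms facet on a hypothetical "category" field
        q.set("json.facet", "{categories:{type:terms,field:category,limit:20}}");
        // The parameter whose interaction with JSON faceting is being asked about
        q.set("facet.threads", 4);
        QueryResponse rsp = client.query(q);
        System.out.println(rsp.getResponse().get("facets"));
        client.close();
    }
}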
Re: Combined Dismax and Block Join Scoring on nested documents
Hello Mike,

Structured queries in Solr are way cumbersome. Start from:

q=+{!dismax v="skirt" qf="name"} +{!parent which=content_type:product score=min v=childq}&childq=+in_stock:true^=0 {!func}list_price_gbp&...

Besides "explain", there is a parsed-query entry in the debug output that is more useful for troubleshooting purposes. Please also make sure that + is properly encoded as %2B so it survives the HTTP hurdle.

On Fri, Nov 18, 2016 at 2:14 PM, Mike Allen <mike.al...@thecommercepartnership.com> wrote:

> Apologies if I'm doing something incredibly stupid as I'm new to Solr. I am having an issue with scoring child documents in a block join query when including a dismax query. I'm actually a little unclear on whether or not that's a complete oxymoron, combining dismax and block join.
>
> Problem statement: Given a set of Product documents - which contain the product names and descriptions - which contain nested variant documents (see below for abridged example) - which contain the boolean stock status (in_stock) and the variant prices (list_price_gbp) - I want to do a Dismax query of, say, "skirt" on the product name (name) and sort the resulting product documents by the minimum price (list_price_gbp) of their child variant documents. Note that, although the abridged document doesn't show them, there are a number of other arbitrary fields which may be used as filter queries on the child documents, for example size or colour, which will in effect change the "active" minimum price of a product. Hence, denormalizing, or flattening, the documents is not really an option I want to pursue.
>
> An abridged example document returned by the Solr Admin Query console which I am querying:
>
> 12345
> product
> black flared skirt
> 40.0
>
> 12345abcd
> 12345
> variant
> 65.0
> true
>
> 12345fghi
> 12345
> variant
> 40.0
> true
>
> So I am familiar with the block join score mode; setting aside the dismax aspect for now, this query, using the Function Query {!func}list_price_gbp, with score ascending, returns documents ordered correctly, with a £2.00 (cheapest) product first:
>
> q={!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
> The "explain" for this is:
>
> 2.184 = Score based on 1 child docs in range from 26752 to 26752, best match:
>   2.184 = sum of:
>     1.8374416E-5 = weight(in_stock:T in 26752) [], result of:
>       1.8374416E-5 = score(doc=26752,freq=1.0 = termFreq=1.0), product of:
>         1.8374416E-5 = idf(docFreq=27211, docCount=27211)
>         1.0 = tfNorm, computed from:
>           1.0 = termFreq=1.0
>           1.2 = parameter k1
>           0.0 = parameter b (norms omitted for field)
>     2.0 = FunctionQuery(float(list_price_gbp)), product of:
>       2.0 = float(list_price_gbp)=2.0
>       1.0 = boost
>       1.0 = queryNorm
>
> Even though this is doing what I want, I have a slight niggle that the overall score is not just the result of the Function Query; however, as all results get the same tiny fraction added, it doesn't matter.
>
> However, when I prepend my dismax query:
>
> q={!dismax v="skirt" qf="name"}+{!parent which=content_type:product score=min}+(in_stock:(true)){!func}list_price_gbp&doc.q={!terms f="productid" v=$row.id}&doc.rows=1000&doc.fl=score,*&doc.fq=(in_stock:(true))&start=0&rows=103&fl=score,*,doc:[subquery]&sort=score asc&debugQuery=on&wt=xml
>
> The scoring is only dependent on the dismax scoring, where the "explain" for this is:
>
> 2.7600822 = sum of:
>   2.7600822 = weight(name:skirt in 13406) [], result of:
>     2.7600822 = score(doc=13406,freq=1.0 = termFreq=1.0), product of:
>       3.5851278 = idf(docFreq=103, docCount=3731)
>       0.76987 = tfNorm, computed from:
>         1.0 = termFreq=1.0
>         1.2 = parameter k1
>         0.75 = parameter b
>         4.108818 = avgFieldLength
>         7.11 = fieldLength
>
> So in actual fact, with score ascending, it is ordering the results by least matching first and the nested document list_price_gbp is irrelevant. I strongly suspect I am being totally dumb and that this is expected behaviour for an obvious reason that escapes me, apart from perhaps it's because the two scoring methods are just plainly incompatible.
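If it helps, here is roughly the structure suggested above assembled with SolrJ, which encodes parameters itself so the leading + signs survive. This is a sketch only: the collection name is made up, the doc.* subquery parameters from the original request are omitted for brevity, and it assumes the child query is dereferenced with v=$childq.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class BlockJoinDismaxExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/products");
        SolrQuery q = new SolrQuery();
        // Both clauses are mandatory: the dismax match on the parent name, and the
        // block-join clause that takes score=min from the matching children.
        q.setQuery("+{!dismax v=\"skirt\" qf=\"name\"} "
                 + "+{!parent which=content_type:product score=min v=$childq}");
        // Child query passed by reference: in_stock contributes no score (^=0),
        // so only the price function drives the child score.
        q.set("childq", "+in_stock:true^=0 {!func}list_price_gbp");
        q.setSort("score", SolrQuery.ORDER.asc);
        q.set("debugQuery", "true");
        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}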
SolrJ bulk indexing documents - HttpSolrClient vs. ConcurrentUpdateSolrClient
Hi all,

I am looking to improve indexing speed when loading many documents as part of an import. I am using the SolrJ client, and currently I add the documents one-by-one using HttpSolrClient and its method add(SolrInputDocument doc, int commitWithinMs).

My first step would be to change that to use add(Collection<SolrInputDocument> docs, int commitWithinMs) instead, which I expect would already improve performance.

Does it matter which method I use? Besides the method taking a Collection, there is also one that takes an Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient? Should I use it for bulk indexing instead of HttpSolrClient?

Currently we are on version 5.5.0 of Solr, and we don't run SolrCloud, i.e. only one instance etc. Indexing 39657 documents (which results in a core size of approx. 127MB) took about 10 minutes with the one-by-one approach.

Best regards and thanks for any suggestions,

Sebastian Riemer
Re: SolrJ bulk indexing documents - HttpSolrClient vs. ConcurrentUpdateSolrClient
On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
> I am looking to improve indexing speed when loading many documents as part of an import. I am using the SolrJ-Client and currently I add the documents one-by-one using HttpSolrClient and its method add(SolrInputDocument doc, int commitWithinMs).

If you batch them (probably around 500 to 1000 at a time), indexing speed will go up. Below you have described the add methods used for batching.

> My first step would be to change that to use add(Collection<SolrInputDocument> docs, int commitWithinMs) instead, which I expect would already improve performance.
> Does it matter which method I use? Beside the method taking a Collection there is also one that takes an Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient? Should I use it for bulk indexing instead of HttpSolrClient?
>
> Currently we are on version 5.5.0 of solr, and we don't run SolrCloud, i.e. only one instance etc.
> Indexing 39657 documents (which result in a core size of appr. 127MB) took about 10 minutes with the one-by-one approach.

The concurrent client will send updates in parallel, without any threading code in your own program, but there is one glaring disadvantage -- indexing failures will be logged (via SLF4J), but your program will NOT be informed about them, which means that the entire Solr cluster could be down, and all your indexing requests will still appear to succeed from your program's point of view. Here's an issue I filed on the problem. It hasn't been fixed because there really isn't a good solution.

https://issues.apache.org/jira/browse/SOLR-3284

The concurrent client swallows all exceptions that occur during add() operations -- they are conducted in the background. This might also happen during delete operations, though I am unsure about that. You won't know about any problems unless those problems are still there when your program tries an operation that can't happen in the background, like commit or query. If you're relying on automatic commits, your indexing program might NEVER become aware of problems on the server end.

In a nutshell ... the concurrent client is great for initial bulk loading (if and only if you don't need error detection), but not all that useful for ongoing update activity that runs all the time.

If you set up multiple indexing threads in your own program, you can use HttpSolrClient or CloudSolrClient with similar concurrent effectiveness to the concurrent client, without sacrificing the ability to detect errors during indexing.

Indexing 40K documents in batches should take very little time, and in my opinion is not worth the disadvantages of the concurrent client, or taking the time to write multi-threaded code. If you reach the point where you've got millions of documents, then you might want to consider writing multi-threaded indexing code.

Thanks,
Shawn
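A minimal sketch of the batching described above, assuming documents arrive from some iterable source; the 1,000-document batch size, the 60-second commitWithin, and the core name are illustrative only.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    private static final int BATCH_SIZE = 1000;

    public static void index(SolrClient client, Iterable<SolrInputDocument> docs) throws Exception {
        List<SolrInputDocument> batch = new ArrayList<>(BATCH_SIZE);
        for (SolrInputDocument doc : docs) {
            batch.add(doc);
            if (batch.size() >= BATCH_SIZE) {
                // add() throws on failure here, unlike ConcurrentUpdateSolrClient,
                // so indexing errors surface immediately instead of being swallowed.
                client.add(batch, 60000);   // commitWithin 60 seconds
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch, 60000);       // flush the final partial batch
        }
        client.commit();
    }

    public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        // index(client, someDocumentSource());  // hypothetical document source
        client.close();
    }
}

Keeping the error handling in your own code (a try/catch around each add) is precisely what the concurrent client cannot offer.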
Re: Bkd tree numbers/geo on solr 6.3 ?
Looks like it needs https://issues.apache.org/jira/browse/SOLR-8396 .

On Thu, Nov 17, 2016 at 2:41 PM, Dorian Hoxha wrote:
> Hi,
>
> I've read that lucene 6 has fancy bkd-tree implementation for numbers. But on latest cwiki I only see TrieNumbers. Aren't they implemented or did I miss something (they still mention "indexing multiple values for range-queries", which is the old way)?
>
> Thank You
Data Import Request Handler isolated into its own project - any suggestions?
Hello. My name is Marek Scevlik.

Currently I am working for a small company where we are interested in implementing your Solr 6.3 search engine.

We are hoping to take the Data Import Request Handler out of the original source package into its own project and create a usable .jar file out of it.

It should then serve as a tool that would allow us to connect to a remote server and return data to our other application, which would use the returned data.

What do you think? Would anything like this be possible? To isolate the Data Import Request Handler into its own standalone project?

If we can achieve this, we won't mind sharing this new feature with the community.

I realize this is a first email and may lead into several hundred more, so for a start my request is very simple and not very detailed, but I am sure you realize it may become quite complex.

So I wonder if anyone replies.

Thanks a lot for any replies and further info or guidance.

Thanks.
Regards Marek Scevlik
RE: Data Import Request Handler isolated into its own project - any suggestions?
Marek,

I've wanted to do something like this in the past as well. However, a rewrite that supports the same XML syntax might be better. There are several problems with the design of the Data Import Handler that make it not quite suitable:

- Not designed for multi-threading
- Bad implementation of XPath

Another issue is that one of the big advantages of the Data Import Handler goes away at this point, which is that it is hosted within Solr and has a UI for testing within the Solr admin.

A better open-source Java solution might be to connect Solr with Apache Camel - http://camel.apache.org/solr.html. If you are not tied absolutely to pure open source, and freemium products will do, then you might look at Pentaho Spoon and Kettle. Although Talend is much more established in the market, I find Pentaho's XML-based ETL a bit easier to integrate as a developer, and to unit test and such. Talend does better when you have a full infrastructure set up, but then the attention required for unit tests and Git integration seems over the top.

Another powerful way to get things done, depending on what you are indexing, is to use LogStash and couple that with document-processing chains.

Many of our projects benefit from having a single RDBMS view, perhaps a materialized view, that is used for the index. LogStash does just fine here, pulling from the RDBMS and posting each row to Solr. The hierarchical execution of the Data Import Handler is very nice, but this can often be handled on the RDBMS side by creating a view, maybe using functions to provide some rows. Many RDBMS systems also support federation and the import of XML from files, so that this brings XML processing into the picture.

Hoping this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH

-----Original Message-----
From: Marek Ščevlík [mailto:mscev...@codenameprojects.com]
Sent: Friday, November 18, 2016 9:29 AM
To: solr-user@lucene.apache.org
Subject: Data Import Request Handler isolated into its own project - any suggestions?

Hello. My name is Marek Scevlik.

Currently I am working for a small company where we are interested in implementing your Solr 6.3 search engine.

We are hoping to take out from the original source package the Data Import Request Handler into its own project and create a usable .jar file out of it.

It should then serve as tool that would allow to connect to a remote server and return data for us to our other application that would use the returned data.

What do you think? Would anything like this possible? To isolate out the Data Import Request Handler into its own standalone project?

If we could achieve this we won't mind to share with the community this new feature.

I realize this is a first email and may lead into several hundreds so for the start my request is very simple and not so high level detailed but I am sure you realize it may lead into being quite complex.

So I wonder if anyone replies.

Thanks a lot for any replies and further info or guidance.

Thanks.
Regards Marek Scevlik
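A rough sketch of the flattened-view pattern Dan describes above - reading a denormalized RDBMS view with plain JDBC and posting each row to Solr with SolrJ. The JDBC URL, view name and column/field names are invented for illustration, and a JDBC driver is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ViewToSolr {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/mycollection");
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass");
             Statement stmt = conn.createStatement();
             // A (materialized) view does the joins and denormalization on the RDBMS side
             ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM solr_feed_view")) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("title", rs.getString("title"));
                doc.addField("body", rs.getString("body"));
                solr.add(doc, 60000);   // batching, as in the SolrJ thread above, would be faster
            }
        }
        solr.commit();
        solr.close();
    }
}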
Re: field set up help
Perfect. Just had to wrap the PHP curl request URL with urlencode and it worked.

Sent from my iPhone

> On Nov 17, 2016, at 5:56 PM, Kris Musshorn wrote:
>
> This q={!prefix f=metatag.date}2016-10 returns zero records
>
> -----Original Message-----
> From: KRIS MUSSHORN [mailto:mussho...@comcast.net]
> Sent: Thursday, November 17, 2016 3:00 PM
> To: solr-user@lucene.apache.org
> Subject: Re: field set up help
>
> so if the field was named metatag.date q={!prefix f=metatag.date}2016-10
>
> ----- Original Message -----
>
> From: "Erik Hatcher"
> To: solr-user@lucene.apache.org
> Sent: Thursday, November 17, 2016 2:46:32 PM
> Subject: Re: field set up help
>
> Given what you've said, my hunch is you could make the query like this:
>
>    q={!prefix f=field_name}2016-10
>
> tada! ?!
>
> there's nothing wrong with indexing dates as text like that, as long as your queries are performantly possible. And in the case of the query type you mentioned, the text/string'ish indexing you've done is suited quite well to prefix queries to grab dates by year, year-month, and year-month-day. But you could, if you needed to get more sophisticated with date queries (DateRangeField is my new favorite), leverage ParseDateFieldUpdateProcessorFactory without having to change the incoming format.
>
>    Erik
>
>> On Nov 17, 2016, at 1:55 PM, KRIS MUSSHORN wrote:
>>
>> I have a field in solr 5.4.1 that has values like:
>> 2016-10-15
>> 2016-09-10
>> 2015-10-12
>> 2010-09-02
>>
>> Yes it is a date being stored as text.
>>
>> I am getting the data onto solr via nutch and the metatag plug in.
>>
>> The data is coming directly from the website I am crawling and I am not able to change the data at the source to something more palpable.
>>
>> The field is set in solr to be of type TextField that is indexed, tokenized, stored, multivalued and norms are omitted.
>>
>> Both the index and query analysis chains contain just the whitespace tokenizer factory and the lowercase filter factory.
>>
>> I need to be able to query for 2016-10 and only match 2016-10-15.
>>
>> Any ideas on how to set this up?
>>
>> TIA
>>
>> Kris
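For anyone else hitting the same snag: the curly-brace local params in the prefix query must be URL-encoded when sent over raw HTTP, which is what the urlencode fix above accomplishes. A small Java sketch of the same request, with the core name assumed; SolrJ does the encoding itself.

import java.net.URLEncoder;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PrefixQueryExample {
    public static void main(String[] args) throws Exception {
        // Over raw HTTP the local-params query must be URL-encoded, e.g.
        String raw = "{!prefix f=metatag.date}2016-10";
        System.out.println("q=" + URLEncoder.encode(raw, "UTF-8"));

        // SolrJ encodes parameters itself, so the query can be passed as-is.
        // With whitespace tokenization, "2016-10-15" is a single token,
        // so the prefix "2016-10" matches it but not "2016-09-10".
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/nutchcore");
        SolrQuery q = new SolrQuery(raw);
        System.out.println(client.query(q).getResults().getNumFound());
        client.close();
    }
}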
Re: Detecting schema errors while adding documents
On 11/16/2016 11:02 AM, Mike Thomsen wrote:
> We're stuck on Solr 4.10.3 (Cloudera bundle). Is there any way to detect with SolrJ when a document added to the index violated the schema? All we see when we look at the stacktrace for the SolrException that comes back is that it contains messages about an IOException when talking to the solr nodes. Solr is up and running, and the documents are only invalid because I added a Java statement to make a field invalid for testing purposes. When I remove that statement, the indexing happens just fine.
>
> Any way to do this? I seem to recall that at least in newer versions of Solr it would tell you more about the specific error.

What *exactly* are you trying to get SolrJ/Solr to tell you that it isn't telling you? Erick's response has information for one possible scenario you might be describing.

Using the 4.10.3 client, trying to add a document with an unknown field, I get very specific and relevant messages like the following from both HttpSolrServer and CloudSolrServer:

org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: ERROR: [doc=123] unknown field 'florj'
    at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:552)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.doRequest(LBHttpSolrServer.java:340)
    at org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:301)
    at org.apache.solr.client.solrj.impl.CloudSolrServer.request(CloudSolrServer.java:659)
    at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:116)
    at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:102)
    at org.elyograg.Flubber.main(Flubber.java:44)

(this specific stacktrace came from using a 4.10.3 client with SolrCloud running 4.2.1 -- so my CloudSolrServer object had to be configured to use xml instead of javabin)

When I adjusted the code to send a collection of docs instead of a single doc, with one good doc and one bad doc, I got the same message, with the uniqueKey field value from the bad document.

For newer versions, there is an issue where the load balancing client (used by the cloud client) wraps *any* problem in an exception that just says "No live SolrServers available to handle this request" ... but that doesn't seem to be a problem in SolrJ 4.10.3. The problem was probably introduced by the big changes for 5.0.

https://issues.apache.org/jira/browse/SOLR-7951

If you are running into SOLR-7951 (or any other bug), it will NOT be fixed in any 4.x version. Development on 4.x has ceased entirely. There's a good chance it won't even be fixed in 5.x, but only in a new 6.x version. I have no idea when Cloudera might update the version of Solr that they include.

Note that even on versions affected by SOLR-7951, you'd still be able to see the actual problem exception, because it's still there, as the cause of the outer exception.

It's always possible that Cloudera has embedded a layer on top of Solr or SolrJ that gets rid of the meaningful messages that Solr normally returns. We'll need the actual entire stacktrace and error message you're seeing.

Thanks,
Shawn
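A small sketch of the kind of test described above - adding a document with a field the schema doesn't define and inspecting the exception. It uses the 4.10-era HttpSolrServer class to match the version in this thread; the core URL is a placeholder and the field name is deliberately bogus.

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class SchemaErrorProbe {
    public static void main(String[] args) {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mycore");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "123");
        doc.addField("florj", "value for a field the schema does not know");  // deliberately invalid
        try {
            server.add(doc);
        } catch (SolrException e) {
            // RemoteSolrException extends SolrException; the message should carry
            // the "unknown field 'florj'" detail quoted in the stacktrace above
            System.err.println("Rejected by Solr: " + e.getMessage());
        } catch (Exception e) {
            System.err.println("Other failure (I/O, server unreachable, etc.): " + e);
        }
        server.shutdown();   // 4.x API; close() in 5.x and later
    }
}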
Re: SolrJ bulk indexing documents - HttpSolrClient vs. ConcurrentUpdateSolrClient
Here's some numbers for batching improvements:

https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/

And I totally agree with Shawn that for 40K documents anything more complex is probably overkill.

Best,
Erick

On Fri, Nov 18, 2016 at 6:02 AM, Shawn Heisey wrote:
> On 11/18/2016 6:00 AM, Sebastian Riemer wrote:
>> I am looking to improve indexing speed when loading many documents as part of an import. I am using the SolrJ-Client and currently I add the documents one-by-one using HttpSolrClient and its method add(SolrInputDocument doc, int commitWithinMs).
>
> If you batch them (probably around 500 to 1000 at a time), indexing speed will go up. Below you have described the add methods used for batching.
>
>> My first step would be to change that to use add(Collection<SolrInputDocument> docs, int commitWithinMs) instead, which I expect would already improve performance.
>> Does it matter which method I use? Beside the method taking a Collection there is also one that takes an Iterator<SolrInputDocument> ... and what about ConcurrentUpdateSolrClient? Should I use it for bulk indexing instead of HttpSolrClient?
>>
>> Currently we are on version 5.5.0 of solr, and we don't run SolrCloud, i.e. only one instance etc.
>> Indexing 39657 documents (which result in a core size of appr. 127MB) took about 10 minutes with the one-by-one approach.
>
> The concurrent client will send updates in parallel, without any threading code in your own program, but there is one glaring disadvantage -- indexing failures will be logged (via SLF4J), but your program will NOT be informed about them, which means that the entire Solr cluster could be down, and all your indexing requests will still appear to succeed from your program's point of view. Here's an issue I filed on the problem. It hasn't been fixed because there really isn't a good solution.
>
> https://issues.apache.org/jira/browse/SOLR-3284
>
> The concurrent client swallows all exceptions that occur during add() operations -- they are conducted in the background. This might also happen during delete operations, though I am unsure about that. You won't know about any problems unless those problems are still there when your program tries an operation that can't happen in the background, like commit or query. If you're relying on automatic commits, your indexing program might NEVER become aware of problems on the server end.
>
> In a nutshell ... the concurrent client is great for initial bulk loading (if and only if you don't need error detection), but not all that useful for ongoing update activity that runs all the time.
>
> If you set up multiple indexing threads in your own program, you can use HttpSolrClient or CloudSolrClient with similar concurrent effectiveness to the concurrent client, without sacrificing the ability to detect errors during indexing.
>
> Indexing 40K documents in batches should take very little time, and in my opinion is not worth the disadvantages of the concurrent client, or taking the time to write multi-threaded code. If you reach the point where you've got millions of documents, then you might want to consider writing multi-threaded indexing code.
>
> Thanks,
> Shawn
Best Way to Read A Nested Structure from Solr?
Hello,

I am sure there have been many discussions on the best way to do this, but I am lost and need your advice. I have a nested Solr Document containing multiple levels of sub-documents. Here is a JSON example so you can see the full structure:

{
  "id": "Test Library",
  "description": "example of nested document",
  "content_type": "library",
  "authors": [{
      "id": "author1",
      "content_type": "author",
      "name": "First Author",
      "books": {
        "id": "book1",
        "content_type": "book",
        "title": "title of book 1"
      },
      "shortStories": {
        "id": "shortStory1",
        "content_type": "shortStory",
        "title": "title of short story 1"
      }
    },
    {
      "id": "author2",
      "content_type": "author",
      "name": "Second Author",
      "books": {
        "id": "book1",
        "content_type": "book",
        "title": "title of book 1"
      },
      "shortStories": {
        "id": "shortStory1",
        "content_type": "shortStory",
        "title": "title of short story 1"
      }
    }]
}

I want to query for a document and retrieve the nested structure. I tried using the ChildDocumentTransformerFactory but it flattened the result to be just Library and all other documents as children:

{
  "id": "Test Library",
  "description": "example of nested document",
  "content_type": "library",
  "_childDocuments_": [
    { "id": "author1", "content_type": "author", "name": "First Author" },
    { "id": "book1", "content_type": "book", "title": "title of book 1" },
    { "id": "shortStory1", "content_type": "shortStory", "title": "title of short story 1" },
    { "id": "author2", "content_type": "author", "name": "Second Author" },
    { "id": "book1", "content_type": "book", "title": "title of book 1" },
    { "id": "shortStory1", "content_type": "shortStory", "title": "title of short story 1" }
  ]
}

Here are the query parameters I used:

q={!parent which='content_type:library'}
df=id
fl=*,[child parentFilter='content_type:library' childFilter='id:*']
wt=json
indent=true

What is the best way to read the nested structure from Solr? Do I need to do some sort of faceting?

Thank you,
Jennifer Coston

P.S. I am using Solr version 5.2.1
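For reference only, the same request assembled with SolrJ - a sketch that reproduces the query above rather than answering the question; the collection name is assumed.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class NestedReadExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/library");
        SolrQuery q = new SolrQuery("{!parent which='content_type:library'}");
        q.set("df", "id");
        // The child transformer returns all descendants as one flat _childDocuments_
        // list, which is exactly the flattening described in the question above.
        q.setFields("*", "[child parentFilter='content_type:library' childFilter='id:*']");
        for (SolrDocument doc : client.query(q).getResults()) {
            System.out.println(doc.getFieldValue("id") + " -> " + doc.getChildDocuments());
        }
        client.close();
    }
}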
Index and search on PDF text using Solr
Hello, I'm new to Solr and I have a big problem. I have many text documents in PDF format (more than 1) and I need to create a site with these PDFs. On this site, I have to create a search over any terms in these PDFs. I don't have an idea how to start. Can anyone help me?

Thank you so much.

--
View this message in context: http://lucene.472066.n3.nabble.com/Index-and-search-on-PDF-text-using-Solr-tp4306486.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Index and search on PDF text using Solr
see the section in the Solr Reference Guide: "Uploading Data with Solr Cell using Apache Tika" here:

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

to get a start. The basic idea is to use Apache Tika to parse the PDF file and then stuff the data into Solr. There are a lot of tweaks you'll need to do, particularly mapping the meta-data fields to Solr fields, but the above should get you started. Once you get that operating, you can refine your approach.

I'm personally not a fan of doing all this on the Solr server in a _production_ environment unless it's a one-time operation. Here's a writeup of why I think that, and a model Java program that'd allow you to do this on a Java client. It uses some older Solr classes (i.e. CloudSolrServer, not CloudSolrClient) but it should give you a starting place if you want to do something similar. It has both a database bit and a Tika bit, but the database bits can just be taken out; there's nothing about parsing the files with Tika that requires it.

https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/

Best,
Erick

On Fri, Nov 18, 2016 at 10:14 AM, vascaino90 wrote:
> Hello, i'm new in Solr and i have a big problem.
> I have many text documents in PDF format (more than 1) and I need to create a site with this PDFs. In this site, I have to create a search by any terms in this PDFs.
> I don't have idea how to start.
> Anyone can help me?
>
> Thank you so much.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Index-and-search-on-PDF-text-using-Solr-tp4306486.html
> Sent from the Solr - User mailing list archive at Nabble.com.
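A bare-bones sketch of that client-side approach - Tika extracts the text, SolrJ sends it to Solr. The file path, core name and field names are invented; the blog post linked above adds batching, directory recursion and proper metadata mapping.

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient("http://localhost:8983/solr/pdfs");
        AutoDetectParser parser = new AutoDetectParser();

        File pdf = new File("/data/pdfs/example.pdf");        // hypothetical path
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(pdf)) {
            parser.parse(in, handler, metadata);              // Tika extracts text client-side
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pdf.getAbsolutePath());
        doc.addField("title", metadata.get("title"));         // metadata keys vary by file type
        doc.addField("text", handler.toString());
        solr.add(doc);
        solr.commit();
        solr.close();
    }
}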
Re: Data Import Request Handler isolated into its own project - any suggestions?
Is your goal to still index into Solr? It was not clear.

If yes, then it has been discussed quite a bit. The challenge is that DIH is integrated into the Admin UI, which makes it easier to see the progress and set some flags. Plus the required jars are loaded via solrconfig.xml, just like all other extra libraries. So, a contribution back would need to take that into account.

If you are not ready to face that, it may make sense to look at other libraries first: Apache Camel, Apache NiFi, Cloudera Morphlines, etc. All of them can send data into Solr, though their version support differs. For example, Camel seems to still target Solr 3.5. Somebody updating their implementation to Solr 6.3 and contributing that back to that project would do a lot of good.

Regards,
   Alex.

Solr Example reading group is starting November 2016, join us at http://j.mp/SolrERG
Newsletter and resources for Solr beginners and intermediates: http://www.solr-start.com/

On 19 November 2016 at 01:29, Marek Ščevlík wrote:
> Hello. My name is Marek Scevlik.
>
> Currently I am working for a small company where we are interested in implementing your Solr 6.3 search engine.
>
> We are hoping to take out from the original source package the Data Import Request Handler into its own project and create a usable .jar file out of it.
>
> It should then serve as tool that would allow to connect to a remote server and return data for us to our other application that would use the returned data.
>
> What do you think? Would anything like this possible? To isolate out the Data Import Request Handler into its own standalone project?
>
> If we could achieve this we won't mind to share with the community this new feature.
>
> I realize this is a first email and may lead into several hundreds so for the start my request is very simple and not so high level detailed but I am sure you realize it may lead into being quite complex.
>
> So I wonder if anyone replies.
>
> Thanks a lot for any replies and further info or guidance.
>
> Thanks.
> Regards Marek Scevlik
CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.
Hi,

I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The SolrCloud is having difficulties talking to ZK when I am ingesting data into the collections. At that time I am also running queries (that return millions of docs). The ingest job is crying with the following exception:

org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at http://xxx/solr/collection1_shard15_replica1: Cannot talk to ZooKeeper - Updates are disabled.

I think this is happening when the ingest job is trying to update the clusterstate.json file but the query is reading from that file and thus has some kind of a lock on that file. Are there any factors that will cause the "READ" to acquire a lock for a long time? Is my understanding correct? I am using the cursor approach with SolrJ to get back results from Solr.

How often is ZK updated with the latest cluster state, and what parameter governs that? Should I just increase the ZK client timeout so that it retries connecting to ZK for a longer period of time (right now it is 15 seconds)?

Thanks!
Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.
The clusterstate on Zookeeper shouldn't be changing very often, only when nodes come and go.

bq: At that time I am also running queries (that return millions of docs).

As in rows=millions? This is an anti-pattern; if that's true then you're probably network saturated and the like. If you mean your numFound is millions, then this is unlikely to be a problem.

You say "clusterstate.json", which indicates you're on 4x? This has been changed to make a state.json for each collection, so either you upgraded sometime and didn't transform your ZK (there's a command to do that), or can you upgrade?

What I'm guessing is that you have too much going on somehow and you're overloading your system and getting a timeout. So increasing the timeout is definitely a possibility, or reducing the ingestion load as a test.

Best,
Erick

On Fri, Nov 18, 2016 at 4:51 PM, Chetas Joshi wrote:
> Hi,
>
> I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The SolrCloud is having difficulties talking to ZK when I am ingesting data into the collections. At that time I am also running queries (that return millions of docs). The ingest job is crying with the following exception:
>
> org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at http://xxx/solr/collection1_shard15_replica1: Cannot talk to ZooKeeper - Updates are disabled.
>
> I think this is happening when the ingest job is trying to update the clusterstate.json file but the query is reading from that file and thus has some kind of a lock on that file. Are there any factors that will cause the "READ" to acquire a lock for a long time? Is my understanding correct? I am using the cursor approach with SolrJ to get back results from Solr.
>
> How often is ZK updated with the latest cluster state, and what parameter governs that? Should I just increase the ZK client timeout so that it retries connecting to ZK for a longer period of time (right now it is 15 seconds)?
>
> Thanks!
Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.
Thanks Erick. The numFound is millions, but I was also trying with rows = 1 million. I will reduce it to 500K.

I am sorry, it is state.json. I am using Solr 5.5.0.

One of the things I am not able to understand is why my ingestion job is complaining about "Cannot talk to ZooKeeper - Updates are disabled."

I have a Spark streaming job that continuously ingests into Solr. My shards are always up and running. The moment I start a query on SolrCloud it starts running into this exception. However, as you said, ZK will only update the state of the cluster when the shards go down. Then why is my job trying to contact ZK when the cluster is up, and why is the exception about updating ZK?

On Fri, Nov 18, 2016 at 5:11 PM, Erick Erickson wrote:
> The clusterstate on Zookeeper shouldn't be changing very often, only when nodes come and go.
>
> bq: At that time I am also running queries (that return millions of docs).
>
> As in rows=millions? This is an anti-pattern; if that's true then you're probably network saturated and the like. If you mean your numFound is millions, then this is unlikely to be a problem.
>
> You say "clusterstate.json", which indicates you're on 4x? This has been changed to make a state.json for each collection, so either you upgraded sometime and didn't transform your ZK (there's a command to do that), or can you upgrade?
>
> What I'm guessing is that you have too much going on somehow and you're overloading your system and getting a timeout. So increasing the timeout is definitely a possibility, or reducing the ingestion load as a test.
>
> Best,
> Erick
>
> On Fri, Nov 18, 2016 at 4:51 PM, Chetas Joshi wrote:
> > Hi,
> >
> > I have a SolrCloud (on HDFS) of 50 nodes and a ZK quorum of 5 nodes. The SolrCloud is having difficulties talking to ZK when I am ingesting data into the collections. At that time I am also running queries (that return millions of docs). The ingest job is crying with the following exception:
> >
> > org.apache.solr.client.solrj.impl.CloudSolrClient$RouteException: Error from server at http://xxx/solr/collection1_shard15_replica1: Cannot talk to ZooKeeper - Updates are disabled.
> >
> > I think this is happening when the ingest job is trying to update the clusterstate.json file but the query is reading from that file and thus has some kind of a lock on that file. Are there any factors that will cause the "READ" to acquire a lock for a long time? Is my understanding correct? I am using the cursor approach with SolrJ to get back results from Solr.
> >
> > How often is ZK updated with the latest cluster state and what parameter governs that? Should I just increase the ZK client timeout so that it retries connecting to ZK for a longer period of time (right now it is 15 seconds)?
> >
> > Thanks!
Re: CloudSolrClient$RouteException: Cannot talk to ZooKeeper - Updates are disabled.
On 11/18/2016 6:50 PM, Chetas Joshi wrote:
> The numFound is millions but I was also trying with rows= 1 Million. I will reduce it to 500K.
>
> I am sorry. It is state.json. I am using Solr 5.5.0
>
> One of the things I am not able to understand is why my ingestion job is complaining about "Cannot talk to ZooKeeper - Updates are disabled."
>
> I have a spark streaming job that continuously ingests into Solr. My shards are always up and running. The moment I start a query on SolrCloud it starts running into this exception. However as you said ZK will only update the state of the cluster when the shards go down. Then why my job is trying to contact ZK when the cluster is up and why is the exception about updating ZK?

SolrCloud and SolrJ (CloudSolrClient) both maintain constant connections to all the zookeeper servers they are configured to use. If zookeeper quorum is lost, SolrCloud will go read-only -- no updating is possible. That is what is meant by "updates are disabled."

Solr and Lucene are optimized for very low rowcounts, typically two or three digits. Asking for hundreds of thousands of rows is problematic. The cursorMark feature is designed for efficient queries when paging deeply into results, but it assumes your rows value is relatively small, and that you will be making many queries to get a large number of results, each of which will be fast and won't overload the server.

Since it appears you are having a performance issue, here's a few things I have written on the topic:

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn
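A sketch of the cursorMark pattern referred to above, with a modest rows value - many small, fast requests instead of one huge one. The ZooKeeper hosts, collection name and uniqueKey field ("id") are placeholders; the sort must end on the uniqueKey for cursors to work.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorPager {
    public static void main(String[] args) throws Exception {
        CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
        client.setDefaultCollection("collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1000);                            // small pages, many requests
        q.setSort(SolrQuery.SortClause.asc("id"));  // cursors require a sort on the uniqueKey

        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = client.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                // process doc
            }
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) {
                break;                              // cursor unchanged: no more results
            }
            cursor = next;
        }
        client.close();
    }
}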