Rolling partitions with solr shards
Is there a simple way to get Solr to maintain shards as rolling partitions by date, e.g., the last day's documents in one shard, the rest of the past week in the next shard, the rest of the past month in the one after that, and so on? I don't really need querying to be fast on the entire index, but it is critical that it be blazing fast on recent documents.

A related but different question: in which config file can I change the default hash function that assigns documents to shards? This outdated post, http://wiki.apache.org/solr/NewSolrCloudDesign, seems to suggest that you can define your own hash functions as well as assign hash ranges to partitions, but I am not sure whether or how Solr 3.6 supports this. For that matter, I don't know whether or how SolrCloud (which I understand is available only in Solr 4) supports this.
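[For reference: Solr 3.6 has no built-in rolling time partitions, but one hedged way to approximate them is to keep one core per time window and fan wider queries out with the distributed-search 'shards' parameter. All host, port, and core names below are made up, and the merge step is only a sketch:]

    # Hypothetical layout: one core per time window, e.g. "today", "lastweek", "lastmonth".
    # Fast queries hit only the small recent core:
    curl 'http://localhost:8983/solr/today/select?q=title:foo&rows=10'

    # Occasional queries over a wider window use the 'shards' parameter to fan out
    # across the per-window cores:
    curl 'http://localhost:8983/solr/today/select?q=title:foo&shards=localhost:8983/solr/today,localhost:8983/solr/lastweek,localhost:8983/solr/lastmonth'

    # The "rolling" itself happens outside Solr: a periodic job moves documents from
    # "today" into the older cores (re-index them there, or merge with the core-admin
    # MERGEINDEXES action if that fits your setup) and then clears "today".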
solr java.lang.NullPointerException on select queries
I have recently started getting the error pasted below with solr-3.6 on /select queries. I don't know of anything that changed in the config to start causing this error. I am also running a second independent solr server on the same machine, which continues to run fine and has the same configuration as the first one except for the port number. The first one seems to be doing dataimport operations fine and updating index files as usual, but fails on select queries. An example of a failing query (that used to run fine) is:

http:///solr/select/?q=title%3Afoo&version=2.2&start=0&rows=10&indent=on

I am stupefied. Any idea?

HTTP ERROR 500
Problem accessing /solr/select/. Reason: null

java.lang.NullPointerException
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:398)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
    at org.mortbay.jetty.Server.handle(Server.java:326)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
    at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Re: solr java.lang.NullPointerException on select queries
For the first install, I copied all the files in the "example" directory into, let's call it, "install1", and did the same for "install2". The two installs run on different ports, use different jar files, and are not related to each other in any way as far as I can see. In particular, they are not "multicore". They have the same access control setup via jetty. I did a diff on the config files and confirmed that only the port numbers differ. Both had been running fine in parallel, importing from a common database, for several weeks. The documents indexed by install1 (the problematic one currently) are a vastly bigger (~2.5B) superset of those indexed by install2 (~250M).

At this point, select queries on install1 incur the NullPointerException irrespective of whether install2 is running or not. The log file suggests indexing is proceeding normally as always, and the index is growing at the usual rate each day. Just select queries fail. :(
Re: solr java.lang.NullPointerException on select queries
Erick, thanks for pointing that out. I was going to say in my original post that it is almost as if some limit on the maximum number of documents suddenly got violated, but the rest of the symptoms didn't seem to quite match. Now that I think about it, the problem probably happened at 2B documents (corresponding exactly to the size of the signed int space); my ID space in the database has roughly 85% holes, and the problem probably appeared when the ID hit around 2.4B.

It is still odd that indexing appears to proceed normally and that the select queries "know" which IDs are used: the error happens only for queries with non-empty results, e.g., searching for an ID that doesn't exist gives a valid "0 numResponses" response. Is this because solr uses 'long' or more for indexing (given that the schema supports long) but not in the querying modules?

I hadn't used solr sharding because I really needed "rolling" partitions, where I keep a small index of recent documents and throw the rest into a slow "archive" index. Maintaining the smaller install2 (usually < 50M) and replicating it if needed was my homebrewed sharding approach. But I guess it is time to shard the archive after all.

AV
Re: solr java.lang.NullPointerException on select queries
Yes, wonky indeed:

numDocs : -2006905329
maxDoc : -1993357870

And yes, I meant that the holes are in the database's auto-increment ID space; they have nothing to do with lucene IDs.

I will set up sharding. But is there any way to retrieve most of the current index? Currently, all select queries, even ones restricted to ID ranges in the hundreds of millions, return the NullPointerException. It would suck to lose all of this. :(
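[For reference: negative counts like these are what a 32-bit counter shows once the true count passes 2^31 - 1. A minimal Java sketch; the long literal is hypothetical, chosen so the printed value matches the numDocs shown above:]

    public class IntWrap {
        public static void main(String[] args) {
            long actualDocs = 2288061967L;          // ~2.29B documents, past the 2^31 - 1 limit
            int stored = (int) actualDocs;          // what a 32-bit int counter ends up holding
            System.out.println(stored);             // prints -2006905329, matching numDocs above
            System.out.println(Integer.MAX_VALUE);  // 2147483647, the largest representable count
        }
    }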
Re: solr java.lang.NullPointerException on select queries
Thanks. Do you know if the many index files with names like '_zxt.tis' in the index/data/ directory have the lucene IDs embedded in them? The files look fine to me and are partly readable even though they are binary. I am wondering if I could just set up a new solr instance, move these index files there, and use them (or most of them) as is, without shards. If so, I will set up a separate sharded index for the documents indexed henceforth, but won't bother splitting the huge existing index.
Re: solr java.lang.NullPointerException on select queries
Erick, thanks for the advice, but let me make sure you haven't misunderstood what I was asking. I am not trying to split the huge existing index in install1 into shards. I am also not trying to make the huge install1 index one shard of a sharded solr setup. I plan to use a sharded setup only for future docs. I do want to avoid re-indexing the docs in install1, and instead think of that install as a slow "tape archive" index server in case I ever need to query past documents.

So I was wondering if I could somehow use the existing segment files to run an isolated (unsharded) solr server that lets me query roughly the first 2B docs indexed before the wraparound problem happened. If the "negative" internal doc IDs have pervasively corrupted the segment files, this would not be possible, but I am not able to imagine an underlying lucene design that would cause such a problem. Is my only option to re-index the past 2B docs if I want to be able to query them at this point, or is there a way to use the existing segment files?
Re: solr java.lang.NullPointerException on select queries
Erick, much thanks for detailing these options. I am currently trying the second one, as it seems a little easier and quicker to me. I successfully deleted the documents with IDs after the problem time, which I know to within a couple of hours. Now the stats are:

numDocs : 2132454075
maxDoc : -2130733352

The former is nicely below 2^31, but I can't seem to get the latter to "decrease" and become positive by deleting further. Should I just run an optimize at this point? I have never manually run an optimize and plan to just hit

http:///solr/update?optimize=true

Can you confirm this?
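[For reference: an optimize can be issued either via the URL parameter mentioned above or by posting an <optimize/> command to the update handler; host and port below are placeholders. Note that optimizing typically needs extra free disk space on the order of the index size (or more) while it runs.]

    # URL-parameter form:
    curl 'http://localhost:8983/solr/update?optimize=true'

    # Equivalent XML form posted to the update handler:
    curl 'http://localhost:8983/solr/update' -H 'Content-type:text/xml' --data-binary '<optimize/>'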
Re: solr java.lang.NullPointerException on select queries
So, I tried 'optimize', but it failed because of lack of space on the first machine. I then moved the whole thing to a different machine, where the index was pretty much the only thing on disk and used about 37% of it, but the optimize still failed with a "No space left on device" IOException. The size of the index has since doubled to roughly 74% of the disk on this second machine, and the number of files has increased from 3289 to 3329. Actually, even the 3289 files on the first machine were counted after I had tried an optimize there once, so the "original" size must have been smaller still.

I don't think I can afford any more space and am close to giving up and reclaiming space on the two machines. A couple more questions before that:

1) I am tempted to try editing the binary--the "magnetic needle" option. Could you elaborate on this? Would there be a way to go back to an index of the original size from its super-sized current form(s)?

2) Will CheckIndex also need more than twice the space? Would there be a way to bring the size back down to the original without running 'optimize' if I try that route? Also, how exactly do I run CheckIndex, e.g., what is the exact URL I need to hit?
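[For reference on question 2: CheckIndex is a command-line tool in the Lucene core jar, not a URL, and it runs read-only unless you pass -fix, so by itself it should not need extra disk space. A hedged sketch; the jar name and paths are placeholders for your install:]

    # Stop Solr first, then point CheckIndex at the index directory:
    java -cp lucene-core-3.6.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /path/to/solr/data/index

    # With -fix it rewrites the segments file to drop segments it considers corrupt
    # (the documents in those segments are lost), so back up the index before using it:
    java -cp lucene-core-3.6.0.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /path/to/solr/data/index -fix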
SolrCloud error while propagating update to primary ZK node
I get a JSON parse error (pasted below) when I send an update to a replica node. I downloaded solr 4 alpha, followed the instructions at http://wiki.apache.org/solr/SolrCloud/, and set up numShards=1 with 3 total servers managed by a zookeeper ensemble: the primary at 8983 and the other two at 7574 and 8900.

The error below shows up in the primary's log when I try to add a document to either replica, and the document add fails. I am able to successfully add documents by sending them directly to the primary. How do I correctly add documents to replicas?

SEVERE: org.apache.noggit.JSONParser$ParseException: JSON Parse Error: char=<,position=0 BEFORE='<' AFTER='add>2'
    at org.apache.noggit.JSONParser.err(JSONParser.java:221)
    at org.apache.noggit.JSONParser.next(JSONParser.java:620)
    at org.apache.noggit.JSONParser.nextEvent(JSONParser.java:661)
    at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:105)
    at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:95)
    at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:59)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1561)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:442)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:263)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
    ... [snip]
SolrCloud replication question
I am trying to wrap my head around replication in SolrCloud. I tried the setup at http://wiki.apache.org/solr/SolrCloud/. I mainly need replication for high query throughput. The setup at the URL above appears to maintain just one copy of the index at the primary node (instead of a replicated index as in a master/slave configuration). Will I still get roughly an n-fold increase in query throughput with n replicas? And if so, why would one do master/slave replication with multiple copies of the index at all?
DataImport using last_indexed_id or getting max(id) quickly
My understanding is that the DIH in solr only records last_index_time in dataimport.properties, but not, say, a last_indexed_id for a primary key 'id'. How can I efficiently get max(id)? (Note that 'id' is an auto-increment field in the database.) Maintaining max(id) outside of solr is brittle, and computing max(id) before each dataimport can take several minutes when the index has several hundred million records. How can I either import based on ID or get max(id) quickly? I cannot use timestamp-based import because I get out-of-memory errors if/when solr falls behind, and the fixes suggested online did not work for me.
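[For reference, two hedged sketches. (a) asks the index itself for the largest indexed id; (b) feeds that value into DIH as a request parameter. Field, table, and parameter names ('docs', 'lastid') are made up, and (a) assumes 'id' is a sortable numeric field such as long:]

    # (a) Largest indexed id:
    curl 'http://localhost:8983/solr/select?q=*:*&fl=id&sort=id+desc&rows=1'

    # (b) In db-data-config.xml, import only rows beyond a caller-supplied id:
    #
    #   <entity name="doc" query="SELECT * FROM docs WHERE id &gt; ${dataimporter.request.lastid}" ...>
    #
    # and trigger the import with:
    curl 'http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true&lastid=2400000000'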
Re: SolrCloud error while propagating update to primary ZK node
I tried adding in two ways, with the same outcome: (1) using solrj to call HttpSolrServer.add(docList) with BinaryRequestWriter; (2) using DataImportHandler to import directly from a database through a db-data-config.xml file. The document I'm adding has a long primary-key id field and a few other string and timestamp fields. I also added a long _version_ field because the wiki page says to. I've been using this schema without problems on 3.6 for a while, and it works fine when adding to the primary in 4.0.

"Mark Miller-3 [via Lucene]" wrote:

Can you show us exactly how you are adding the document? Eg, what update handler are you using, and what is the document you are adding?

On Jul 8, 2012, at 12:52 PM, avenka wrote:
> [original post and stack trace quoted in full; snipped here]

- Mark Miller
lucidimagination.com
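[For reference, a minimal sketch of the solrj path described in (1) above; class names are from solrj 4.x, and the URL and field names are placeholders:]

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class AddToReplica {
        public static void main(String[] args) throws Exception {
            // Point at one of the replicas, e.g. the server on port 7574
            HttpSolrServer server = new HttpSolrServer("http://localhost:7574/solr");
            server.setRequestWriter(new BinaryRequestWriter());  // send updates as javabin instead of XML

            List<SolrInputDocument> docList = new ArrayList<SolrInputDocument>();
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", 12345L);              // long primary key
            doc.addField("title", "example title");  // placeholder string field
            docList.add(doc);

            server.add(docList);
            server.commit();
            server.shutdown();
        }
    }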
Re: SolrCloud replication question
Erick, thanks. I now do see segment files in an index. directory at the replicas. Not sure why they were not getting populated earlier. I have a couple more questions; the second is more elaborate, so let me know if I should move it to a separate thread.

(1) The speed of adding documents in SolrCloud is excruciatingly slow. It takes about 30-50 seconds to add a batch of 100 documents (and about twice that to add 200, etc.) to the primary, but just ~10 seconds to add 5K documents in batches of 200 on a standalone solr 4 server. The log files indicate that the primary is timing out with messages like the one below, and Cloud->Graph in the UI shows the other two replicas in orange after starting out green.

org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://localhost:7574/solr

Any idea why?

(2) I am seriously considering using symbolic links for a replicated solr setup with completely independent instances on a *single machine*. Tell me if I am thinking about this incorrectly. Here is my reasoning:

(a) Master/slave replication in 3.6 simply seems old school, as it doesn't have the nice consistency properties of SolrCloud. Polling, say, every 20 seconds means I don't know exactly how up-to-speed each replica is, which will complicate my request re-distribution.

(b) SolrCloud seems like a great alternative to master/slave replication. But it seems slow (see 1), and having played with it, I don't feel comfortable with the maturity of the ZK integration (or my comprehension of it) in solr 4 alpha.

(c) Symbolic links seem like the fastest and most space-efficient solution *provided* there is only a single writer, which is just fine for me. I plan to run completely separate solr instances with one designated as the primary and do the following operations in sequence (a rough script sketch appears after the quoted message below): add a batch to the primary and commit --> from each replica's index directory, remove all symlinks and re-create symlinks to the segment files in the primary (but not the write.lock file) --> call update?commit=true to force the replicas to re-load their in-memory index --> do whatever read-only processing is required on the batch using the primary and all replicas by manually (randomly) distributing read requests --> repeat.

Is there any downside to 2(c) (other than maintaining a trivial script to manage symlinks and call commit)? I tested it on small index sizes and it seems to work fine. The throughput improves with more replicas (for 2-4 replicas), as a single replica is not enough to saturate the machine (due to high query latency). Am I overlooking something in this setup?

Overall, I need high throughput and minimal latency from the time a document is added to the time it is available at a replica. SolrCloud's automated request redirection, consistency, and fault-tolerance are awesome for a physically distributed setup, but I don't see how it beats 2(c) in a single-writer, single-machine, replicated setup.

AV

On Jul 9, 2012, at 9:43 AM, Erick Erickson [via Lucene] wrote:

> No, you're misunderstanding the setup. Each replica has a complete
> index. Updates get automatically forwarded to _both_ nodes for a
> particular shard. So, when a doc comes in to be indexed, it gets
> sent to the leader for, say, shard1. From there:
> 1> it gets indexed on the leader
> 2> it gets forwarded to the replica(s) where it gets indexed locally.
>
> Each replica has a complete index (for that shard).
>
> There is no master/slave setup any more. And you do
> _not_ have to configure replication.
>
> Best
> Erick
>
> > [original question snipped]
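[For concreteness, a rough sketch of the single-writer symlink loop described in 2(c) above; all paths and ports are made up, and it assumes one primary core and one replica core on the same box:]

    #!/bin/bash
    PRIMARY_IDX=/path/to/primary/solr/data/index
    REPLICA_IDX=/path/to/replica1/solr/data/index

    # 1. add a batch to the primary and commit (indexing client not shown)

    # 2. re-point the replica's index at the primary's current segment files
    rm -f "$REPLICA_IDX"/*                                   # removes only the old symlinks
    for f in "$PRIMARY_IDX"/*; do
        [ "$(basename "$f")" = "write.lock" ] && continue    # never share the write lock
        ln -s "$f" "$REPLICA_IDX/$(basename "$f")"
    done

    # 3. make the read-only replica reopen its searcher on the new files
    curl 'http://localhost:7574/solr/update?commit=true'

    # 4. fan read-only queries out across primary and replica, then repeat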
Re: SolrCloud replication question
Hmm, never mind my question about replicating using symlinks. Given that replication on a single machine improves throughput, I should be able to get a similar improvement by simply sharding on a single machine, as also observed at http://carsabi.com/car-news/2012/03/23/optimizing-solr-7x-your-search-speed/. I am now benchmarking my workload to compare replication vs. sharding performance on a single machine.
Re: DataImport using last_indexed_id or getting max(id) quickly
Thanks. Can you explain the first option, using TermsComponent to obtain max(id), in more detail? Do I have to modify schema.xml to add a new field? How exactly do I query for the lowest value of "1 - id"?
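[For reference, a hedged sketch of what the TermsComponent approach might look like, assuming the earlier suggestion meant indexing a "reversed" copy of the id so that its first term in index order corresponds to max(id). The field name is made up, and this assumes a /terms request handler wired to TermsComponent as in the example solrconfig.xml:]

    <!-- schema.xml: extra field holding (BIG_CONSTANT - id), zero-padded so that
         lexicographic order matches numeric order -->
    <field name="reverse_id" type="string" indexed="true" stored="false"/>

    # The first term in index order on reverse_id maps back to max(id):
    curl 'http://localhost:8983/solr/terms?terms.fl=reverse_id&terms.limit=1&terms.sort=index'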