Numerous problems with SolrCloud
This is my first experience with SolrCloud, so please bear with me.

I've inherited a setup with 5 servers, 2 of which are Zookeeper only and the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 & 3.4.7. There's around 80 Gb of index, some collections are rather big (20 Gb) and some very small. All of them have only one shard. The bigger ones are almost constantly being updated (and of course queried at the same time).

I've had a huge number of errors, many different ones. At some point the system seemed rather stable, but I've tried to add a few new collections and things went wrong again. The usual symptom is that some cores stop synchronizing; sometimes an entire server is shown as "gone" (although it's still alive and well). When I add a core on a server, another (or several others) often goes down on that server. Even when the system is rather stable some cores are shown as recovering. When restarting a server it takes a very long time (30 min at least) to fully recover.

Some of the many errors I've got (I've skipped the warnings):
- org.apache.solr.common.SolrException: Error trying to proxy request for url
- org.apache.solr.update.processor.DistributedUpdateProcessor; Setting up to try to start recovery on replica
- org.apache.solr.common.SolrException; Error while trying to recover. core=[...]:org.apache.solr.common.SolrException: No registered leader was found after waiting
- update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING, tlog=null}
- org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE after succesful recovery
- org.apache.solr.common.SolrException; Could not find core to call recovery
- org.apache.solr.common.SolrException: Error CREATEing SolrCore '...': Unable to create core
- org.apache.solr.request.SolrRequestInfo; prev == info : false
- org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was not closed!
- org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter
- org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!!
- org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard
- org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
- and so on...

Any advice on where I should start? I've checked disk space, memory usage, max number of open files, everything seems fine there. My guess is that the configuration is rather unaltered from the defaults. I've extended timeouts in Zookeeper already.

Thanks,
John
Re: Numerous problems with SolrCloud
Thanks, I'll have a try. Can the load on the Solr servers impair the zk response time in the current situation, which would cause the desync? Is this the reason for the change? John. On 21/12/15 16:45, Erik Hatcher wrote: > John - the first recommendation that pops out is to run (only) 3 zookeepers, > entirely separate from Solr servers, and then as many Solr servers from there > that you need to scale indexing and querying to your needs. Sounds like 3 > ZKs + 2 Solr’s is a good start, given you have 5 servers at your disposal. > > > — > Erik Hatcher, Senior Solutions Architect > http://www.lucidworks.com <http://www.lucidworks.com/> > > > >> On Dec 21, 2015, at 10:37 AM, John Smith wrote: >> >> This is my first experience with SolrCloud, so please bear with me. >> >> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and >> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 & >> 3.4.7. There's around 80 Gb of index, some collections are rather big >> (20Gb) and some very small. All of them have only one shard. The bigger >> ones are almost constantly being updated (and of course queried at the >> same time). >> >> I've had a huge number of errors, many different ones. At some point the >> system seemed rather stable, but I've tried to add a few new collections >> and things went wrong again. The usual symptom is that some cores stop >> synchronizing; sometimes an entire server is shown as "gone" (although >> it's still alive and well). When I add a core on a server, another (or >> several others) often goes down on that server. Even when the system is >> rather stable some cores are shown as recovering. When restarting a >> server it takes a very long time (30 min at least) to fully recover. >> >> Some of the many errors I've got (I've skipped the warnings): >> - org.apache.solr.common.SolrException: Error trying to proxy request >> for url >> - org.apache.solr.update.processor.DistributedUpdateProcessor; Setting >> up to try to start recovery on replica >> - org.apache.solr.common.SolrException; Error while trying to recover. >> core=[...]:org.apache.solr.common.SolrException: No registered leader >> was found after waiting >> - update log not in ACTIVE or REPLAY state. FSUpdateLog{state=BUFFERING, >> tlog=null} >> - org.apache.solr.cloud.RecoveryStrategy; Could not publish as ACTIVE >> after succesful recovery >> - org.apache.solr.common.SolrException; Could not find core to call recovery >> - org.apache.solr.common.SolrException: Error CREATEing SolrCore '...': >> Unable to create core >> - org.apache.solr.request.SolrRequestInfo; prev == info : false >> - org.apache.solr.request.SolrRequestInfo; Previous SolrRequestInfo was >> not closed! >> - org.apache.solr.update.SolrIndexWriter; Error closing IndexWriter >> - org.apache.solr.update.SolrIndexWriter; SolrIndexWriter was not closed >> prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!! >> - org.apache.solr.cloud.OverseerCollectionMessageHandler; Error from shard >> - org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting >> for connection from pool >> - and so on... >> >> Any advice on where I should start? I've checked disk space, memory >> usage, max number of open files, everything seems fine there. My guess >> is that the configuration is rather unaltered from the defaults. I've >> extended timeouts in Zookeeper already. >> >> Thanks, >> John >> >
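For reference, a minimal sketch of the separate 3-node ZooKeeper ensemble Erik describes; hostnames zk1/zk2/zk3 and paths are assumptions, one zoo.cfg per ZooKeeper machine:

    # zoo.cfg (identical on zk1, zk2 and zk3, apart from each node's myid file in dataDir)
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1:2888:3888
    server.2=zk2:2888:3888
    server.3=zk3:2888:3888

The Solr nodes would then be pointed at that ensemble (e.g. bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181) instead of running co-located ZooKeepers.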
Re: Numerous problems with SolrCloud
OK, great. I've eliminated OOM errors after increasing the memory allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal setting but this is all I can have right now on the Solr machines. I'll look into GC logging too. Turning to the Solr logs, a quick sweep revealed a lot of "Caused by: java.net.SocketException: Connection reset" lines, but this isn't very explicit. I suppose I'll have to cross-check on the concerned server(s). Anyway, I'll have a try at the updated setting and I'll get back to the list. Thanks, John. On 21/12/15 17:21, Erick Erickson wrote: > ZK isn't pushed all that heavily, although all things are possible. Still, > for maintenance putting Zk on separate machines is a good idea. They > don't have to be very beefy machines. > > Look in your logs for LeaderInitiatedRecovery messages. If you find them > then _probably_ you have some issues with timeouts, often due to > excessive GC pauses, turning on GC logging can help you get > a handle on that. > > Another "popular" reason for nodes going into recovery is Out Of Memory > errors, which is easy to do in a system that gets set up and > then more and more docs get added to it. You either have to move > some collections to other Solr instances, get more memory to the JVM > (but watch out for GC pauses and starving the OS's memory) etc. > > But the Solr logs are the place I'd look first for any help in understanding > the root cause of nodes going into recovery. > > Best, > Erick > > On Mon, Dec 21, 2015 at 8:04 AM, John Smith wrote: >> Thanks, I'll have a try. Can the load on the Solr servers impair the zk >> response time in the current situation, which would cause the desync? Is >> this the reason for the change? >> >> John. >> >> >> On 21/12/15 16:45, Erik Hatcher wrote: >>> John - the first recommendation that pops out is to run (only) 3 >>> zookeepers, entirely separate from Solr servers, and then as many Solr >>> servers from there that you need to scale indexing and querying to your >>> needs. Sounds like 3 ZKs + 2 Solr’s is a good start, given you have 5 >>> servers at your disposal. >>> >>> >>> — >>> Erik Hatcher, Senior Solutions Architect >>> http://www.lucidworks.com <http://www.lucidworks.com/> >>> >>> >>> >>>> On Dec 21, 2015, at 10:37 AM, John Smith wrote: >>>> >>>> This is my first experience with SolrCloud, so please bear with me. >>>> >>>> I've inherited a setup with 5 servers, 2 of which are Zookeeper only and >>>> the 3 others SolrCloud + Zookeeper. Versions are respectively 5.4.0 & >>>> 3.4.7. There's around 80 Gb of index, some collections are rather big >>>> (20Gb) and some very small. All of them have only one shard. The bigger >>>> ones are almost constantly being updated (and of course queried at the >>>> same time). >>>> >>>> I've had a huge number of errors, many different ones. At some point the >>>> system seemed rather stable, but I've tried to add a few new collections >>>> and things went wrong again. The usual symptom is that some cores stop >>>> synchronizing; sometimes an entire server is shown as "gone" (although >>>> it's still alive and well). When I add a core on a server, another (or >>>> several others) often goes down on that server. Even when the system is >>>> rather stable some cores are shown as recovering. When restarting a >>>> server it takes a very long time (30 min at least) to fully recover. 
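As a side note on the GC-logging suggestion above, a minimal sketch of the JVM flags typically involved on a Java 7/8 JVM; the log path is an assumption, and Solr's own start scripts expose a similar set via GC_LOG_OPTS in solr.in.sh:

    -verbose:gc
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -Xloggc:/var/solr/logs/solr_gc.log

Long "Total time for which application threads were stopped" entries in that log are the stop-the-world pauses Erick mentions.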
Re: Numerous problems with SolrCloud
Hi,

Thanks Erick for your input. I've added GC logging, but it was normal when the error came again this morning. I was adding a large collection (27 Gb): on the first server all went well. At the time I created the core on a second server, it was almost immediately disconnected from the cloud. This time I could nail what seems to be the root cause in the logs:

ERROR - 2015-12-22 09:39:29.029; [ ] org.apache.solr.common.SolrException; OverseerAutoReplicaFailoverThread had an error in its thread work loop.:org.apache.solr.common.SolrException: Error reading cluster properties
        at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:738)
        at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.doWork(OverseerAutoReplicaFailoverThread.java:153)
        at org.apache.solr.cloud.OverseerAutoReplicaFailoverThread.run(OverseerAutoReplicaFailoverThread.java:132)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException: sleep interrupted
        at java.lang.Thread.sleep(Native Method)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryDelay(ZkCmdExecutor.java:108)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:76)
        at org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:308)
        at org.apache.solr.common.cloud.ZkStateReader.getClusterProps(ZkStateReader.java:731)
        ... 3 more

WARN - 2015-12-22 09:39:29.890; [ ] org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter; Keeper Exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /live_nodes
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
        at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
        at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
        at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
        at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.printTree(ZookeeperInfoHandler.java:581)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler$ZKPrinter.print(ZookeeperInfoHandler.java:527)
        at org.apache.solr.handler.admin.ZookeeperInfoHandler.handleRequestBody(ZookeeperInfoHandler.java:406)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:156)
        at org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:664)
        at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:438)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:222)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:181)
        ...

After that the server was marked as "gone" in the cloud graph and it took a long time to register itself again and recover.

I haven't changed the ZK config yet as per your suggestion below. Could this fix the problem? Do you have any other suggestion?

Thanks,
John

On 21/12/15 17:39, Erick Erickson wrote:
> right, do note that when you _do_ hit an OOM, you really
> should restart the JVM as nothing is _really_ certain after
> that.
>
> You're right, just bumping the memory is a band-aid, but
> whatever gets you by. Lucene makes heavy use of
> MMapDirectory which uses OS memory rather than JVM
> memory, so you're robbing Peter to pay Paul when you
> allocate high percentages of the physical memory to the JVM.
> See Uwe's excellent blog here: > http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html > > And yeah, your "connection reset" errors may well be GC-related > if you're getting a lot of stop-the-world GC pauses. > > Sounds like you inherited a system that's getting more and more > docs added to it over time and outgrew its host, but that's a guess. > > And you get to deal with it over the holidays too ;) > > Best, > Erick > > On Mon, Dec 21, 2015 at 8:33 AM, John Smith wrote: >> OK, great. I've eliminated OOM errors after increasing the memory >> allocated to Solr: 12Gb out of 20Gb. It's probably not an optimal >> setting but this is all I can have right now on the Solr machines. I'll >> look into GC logging too. >> >> Turning to the Solr logs, a quick sweep revealed a lot of "Caused by: >> java.net.SocketException: Connection reset" lines, but this isn't very >> explicit. I suppose I'll have to cross-check on the concerned server(s). >> >> Anyway, I'll have a
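A sketch of the heap setting under discussion, assuming the stock bin/solr.in.sh is used; the exact split is workload-dependent, and the point of Uwe's article is to leave a large share of RAM to the OS page cache for MMapDirectory rather than to the JVM:

    # solr.in.sh
    SOLR_HEAP="12g"   # JVM heap; the remaining ~8 GB of the 20 GB machine stays available to the OS cache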
More problems (now jetty errors) with SolrCloud
Hi,

This morning one of the 2 nodes of our SolrCloud went down. I've tried many ways to recover it but to no avail. I've tried to unload all cores on the failed node and reload it after emptying the data directory, hoping it would sync from scratch. The core is still marked as down and no data is downloaded.

I get a lot of messages like the following ones in the log:

WARN - 2016-01-22 14:14:28.535; [ ] org.eclipse.jetty.http.HttpParser; badMessage: 400 Unknown Version for HttpChannelOverHttp@e795880{r=0,c=false,a=IDLE,uri=-}
WARN - 2016-01-22 14:15:02.559; [ ] org.eclipse.jetty.http.HttpParser; badMessage: 400 Unknown Version for HttpChannelOverHttp@727c5f10{r=0,c=false,a=IDLE,uri=-}
WARN - 2016-01-22 14:15:02.580; [ ] org.eclipse.jetty.http.HttpParser; badMessage: java.lang.IllegalStateException: too much data after closed for HttpChannelOverHttp@727c5f10{r=0,c=true,a=COMPLETED,uri=null}
ERROR - 2016-01-22 14:15:03.496; [ ] org.apache.solr.common.SolrException; null:org.apache.solr.common.SolrException: Unexpected method type: [... truncated json data, presumably from a document being updated ...]POST
etc.

I've restarted all nodes in zk and SolrCloud, and reloaded all cores on the main node (that is, the one that seems to work).

After some research, I saw that the zk timeout is set to 60 sec. in the solr config, but at least one entry in the gc logs mentions 134 sec. However, I noticed that the zk logs state a negotiated timeout of 30 sec...

Here are my questions:
- Are the log entries shown above related to the zk session timeout, or should I look elsewhere?
- How do I make sure the timeout negotiated with zk matches the value from the solr config?
- What parameter(s) would allow me to reduce the gc execution time, presumably at the expense of a more frequent gc?

Thanks!
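On the negotiated-timeout question above: ZooKeeper clamps whatever session timeout a client requests into the server-side window between minSessionTimeout (default 2 x tickTime) and maxSessionTimeout (default 20 x tickTime), so a negotiated 30 sec despite a 60 sec request usually means the server-side ceiling is lower than the request. A sketch of the two settings that have to line up (values are illustrative, not a recommendation):

    # zoo.cfg (ZooKeeper side)
    tickTime=2000
    maxSessionTimeout=60000

    # solr.xml, <solrcloud> section (Solr side)
    <int name="zkClientTimeout">${zkClientTimeout:60000}</int>

Of course a 134 sec GC pause will expire the session no matter how the timeouts are tuned.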
Re: Boosts for relevancy (shopping products)
Hi, For once I might be of some help: I've had a similar configuration (large set of products from various sources). It's very difficult to find the right balance between all parameters and requires a lot of tweaking, most often in the dark unfortunately. What I've found is that omitNorms=true is a real breakthrough: without it results tend to favor small texts, which is not what's wanted for product names. I also added a RemoveDuplicatesTokenFilterFactory for the name as it's a common practice for spammers to repeat some key words in order to be better placed in results. Stemming and custom stop words (e.g. "cheap", "sale", ...) are other potential ideas. I've also ended up in removing the description field as it's often too broad, and name is now the only field left: brand, category and merchant (as well as other fields) are offered as additional filters using facets. Note that you'd have to re-index them as plain strings. It's more difficult to achieve but popularity boost can also be useful: you can measure it by sales or by number of clicks. I use a combination of both, and store those values using partial updates. Hope it helps, John On 17/03/16 09:36, Robert Brown wrote: > Hi, > > I currently have an index of ~50m docs representing shopping products: > name, description, brand, category, etc. > > Our "qf" is currently setup as: > > name^5 > brand^2 > category^3 > merchant^2 > description^1 > > mm: 100% > ps: 5 > > I'm getting complaints from the business concerning relevancy, and was > hoping to get some constructive ideas/thoughts on whether these boosts > look semi-sensible or not, I think they were put in place pretty much > at random. > > I know it's going to be a case of rounds upon rounds of testing, but > maybe there's a good starting point that will save me some time? > > My initial thoughts right now are to actually just search on the name > field, and maybe the brand (for things like "Apple Ipod"). > > Has anyone got a similar setup that could share some direction? > > Many Thanks, > Rob >
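A rough sketch of the schema.xml shape John describes; the type and field names are made up, the factories are stock Solr ones:

    <fieldType name="text_product" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- custom stopwords such as "cheap", "sale", ... -->
        <filter class="solr.StopFilterFactory" words="stopwords_products.txt" ignoreCase="true"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- omitNorms="true" so short names aren't automatically favoured by length normalization -->
    <field name="name" type="text_product" indexed="true" stored="true" omitNorms="true"/>
    <!-- facet/filter fields re-indexed as plain strings -->
    <field name="brand" type="string" indexed="true" stored="true" docValues="true"/>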
Actual (specific) RT Search?
Hi,

The purpose of the project is an actual RT Search, not NRT, but with a specific condition: when an updated document meets a fixed criteria, it should be filtered out from future results (no reuse of the document). This criteria is present in the search query but of course doesn't work for uncommitted documents.

What I wrote is a combination of the following:
- an UpdateRequestProcessor in the update chain storing the document unique key in a local cache when the condition is met
- a postCommit listener clearing the cache
- a PostFilter collecting documents that aren't found in the cache, activated in the search query as a fq parameter

Functionally it does the job, however for large indexes the filter takes a hit. The index that poses problem has 18 mil documents in 13Gb, and queries return an average of 25,000 docs in results. The VM has 8 cores and 20Gb RAM, and uses nimble storage (combination of ssd & hd). Without the code Solr works like a charm. My guess so far is that the filter has to fetch the unique key for all documents in results, which consumes a lot of resources.

What would be your advice?
- Could I use the internal document id instead of a field value? This id would have to be available both in the UpdateRequestProcessor and PostFilter: is it the case and how can I access it? I suppose the SolrInputDocument in the request processor doesn't have it yet anyway.
- If I reduce the autoSoftCommit maxDocs value (how far?), would it be wise (and feasible) to convert the PostFilter into a plain filter query such as "*:* NOT (id:1 OR id:2)" or something similar? How could I implement this and how to estimate the filter cost in order for Solr to execute it at the right position?
- Maybe I took the wrong path altogether?

Thanks in advance
John
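A small sketch of the two invocation styles being weighed; the parser name "liveFilter" and the ids are made up:

    # custom PostFilter: post filters are only triggered with cache=false and cost >= 100
    fq={!liveFilter cache=false cost=200}

    # plain filter-query alternative (negative clauses against the unique key)
    fq=*:* -id:(1 OR 2)

The cost value is also how Solr orders post filters relative to each other: higher cost runs later, after cheaper filters have already narrowed the candidate set.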
Unexpected delayed document deletion with atomic updates
Hi,

I'm bumping on the following problem with update XML messages. The idea is to record the number of clicks for a document: each time, a message is sent to .../update such as this one:

<add>
  <doc>
    <field name="id">abc</field>
    <field name="Clicks" update="set">1</field>
    <field name="Boost" update="set">1.05</field>
  </doc>
</add>

(Clicks is an int field; Boost is a float field, it's updated to reflect the change in popularity using a formula based on the number of clicks).

At the moment in the dev environment, changes are committed immediately.

When a document is updated, the changes are indeed reflected in the search results. If I click on the same document again, all goes well. But when I click on an other document, the latter gets updated as expected but the former is plainly deleted. It can no longer be found and the admin core Overview page counts 1 document less. If I click on a 3rd document, so goes the 2nd one.

The schema is the default one amended to remove unneeded fields and add new ones, nothing fancy. All fields are stored="true" and there's no . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with the same outcome. It looks like a bug to me but I might have overlooked something? This is my first attempt at atomic updates.

Thanks,
John.
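For what it's worth, a sketch of how such an atomic-update message is typically posted; URL, core name and values are placeholders:

    curl 'http://localhost:8983/solr/mycore/update?commit=true' \
         -H 'Content-Type: text/xml' \
         --data-binary '<add><doc><field name="id">abc</field><field name="Clicks" update="set">1</field><field name="Boost" update="set">1.05</field></doc></add>'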
Re: Unexpected delayed document deletion with atomic updates
The ids are all different: they're unique numbers followed by a couple of keywords. I've made a test with a small collection of 10 documents to make sure I can manage them manually: all ids are confirmed as different.

I also dumped the exact command, here's one example:

<add>
  <doc>
    <field name="id">101084385_Sebago_ sebago shoes</field>
    <field name="Clicks" update="set">1</field>
    <field name="Boost" update="set">1.8701925463775</field>
  </doc>
</add>

It's sent as the body of a POST request to http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a Content-Type: text/xml header. I still noted the consistent loss of another document with the update above.

John

On 08/10/15 00:38, Upayavira wrote:
> What ID are you using? Are you possibly using the same ID field for
> both, so the second document you visit causes the first to be
> overwritten?
>
> Upayavira
>
> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote:
>> This certainly should not be happening. I'd
>> take a careful look at what you actually send.
>> My _guess_ is that you're not sending the update
>> command you think you are
>>
>> As a test you could just curl (or use post.jar) to
>> send these types of commands up individually.
>>
>> Perhaps looking at the solr log would help too...
>>
>> Best,
>> Erick
>>
>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith wrote:
>>> Hi,
>>>
>>> I'm bumping on the following problem with update XML messages. The idea
>>> is to record the number of clicks for a document: each time, a message
>>> is sent to .../update such as this one:
>>>
>>> <add>
>>>   <doc>
>>>     <field name="id">abc</field>
>>>     <field name="Clicks" update="set">1</field>
>>>     <field name="Boost" update="set">1.05</field>
>>>   </doc>
>>> </add>
>>>
>>> (Clicks is an int field; Boost is a float field, it's updated to reflect
>>> the change in popularity using a formula based on the number of clicks).
>>>
>>> At the moment in the dev environment, changes are committed immediately.
>>>
>>> When a document is updated, the changes are indeed reflected in the
>>> search results. If I click on the same document again, all goes well.
>>> But when I click on an other document, the latter gets updated as
>>> expected but the former is plainly deleted. It can no longer be found
>>> and the admin core Overview page counts 1 document less. If I click on a
>>> 3rd document, so goes the 2nd one.
>>>
>>> The schema is the default one amended to remove unneeded fields and add
>>> new ones, nothing fancy. All fields are stored="true" and there's no
>>> . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with
>>> the same outcome. It looks like a bug to me but I might have overlooked
>>> something? This is my first attempt at atomic updates.
>>>
>>> Thanks,
>>> John.
Re: Unexpected delayed document deletion with atomic updates
Oh, I forgot Erick's mention of the logs: there's nothing unusual in INFO level, the update request just gets mentioned. No exception. I reran it with the DEBUG level, but most of the log was related to jetty. Here's a line I noticed though: org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest: {wt=json&commit=true&update.chain=dedupe} The update.chain parameter wasn't part of the original request, and "dedupe" looks suspicious to me. Perhaps should I investigate further there? Thanks, John. On 08/10/15 08:25, John Smith wrote: > The ids are all different: they're unique numbers followed by a couple > of keywords. I've made a test with a small collection of 10 documents to > make sure I can manage them manually: all ids are confirmed as different. > > I also dumped the exact command, here's one example: > > 101084385_Sebago_ sebago shoes name="Clicks" update="set">1 update="set">1.8701925463775 > > It's sent as the body of a POST request to > http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a > Content-Type: text/xml header. I still noted the consistent loss of > another document with the update above. > > John > > > On 08/10/15 00:38, Upayavira wrote: >> What ID are you using? Are you possibly using the same ID field for >> both, so the second document you visit causes the first to be >> overwritten? >> >> Upayavira >> >> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote: >>> This certainly should not be happening. I'd >>> take a careful look at what you actually send. >>> My _guess_ is that you're not sending the update >>> command you think you are.... >>> >>> As a test you could just curl (or use post.jar) to >>> send these types of commands up individually. >>> >>> Perhaps looking at the solr log would help too... >>> >>> Best, >>> Erick >>> >>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith >>> wrote: >>>> Hi, >>>> >>>> I'm bumping on the following problem with update XML messages. The idea >>>> is to record the number of clicks for a document: each time, a message >>>> is sent to .../update such as this one: >>>> >>>> >>>> >>>> abc >>>> 1 >>>> 1.05 >>>> >>>> >>>> >>>> (Clicks is an int field; Boost is a float field, it's updated to reflect >>>> the change in popularity using a formula based on the number of clicks). >>>> >>>> At the moment in the dev environment, changes are committed immediately. >>>> >>>> >>>> When a document is updated, the changes are indeed reflected in the >>>> search results. If I click on the same document again, all goes well. >>>> But when I click on an other document, the latter gets updated as >>>> expected but the former is plainly deleted. It can no longer be found >>>> and the admin core Overview page counts 1 document less. If I click on a >>>> 3rd document, so goes the 2nd one. >>>> >>>> >>>> The schema is the default one amended to remove unneeded fields and add >>>> new ones, nothing fancy. All fields are stored="true" and there's no >>>> . I've tried versions 5.2.1 & 5.3.1 in standalone mode, with >>>> the same outcome. It looks like a bug to me but I might have overlooked >>>> something? This is my first attempt at atomic updates. >>>> >>>> Thanks, >>>> John. >>>> >
Re: Unexpected delayed document deletion with atomic updates
Yes indeed, the update chain had been activated... I commented it out again and the problem vanished. Good job, thanks Erick and Upayavira! John On 08/10/15 08:58, Upayavira wrote: > Look for the DedupUpdateProcessor in an update chain. > > that is there, but commented out IIRC in the techproducts sample > configs. > > Perhaps you uncommented it to use your own update processors, but didn't > remove that component? > > On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote: >> Oh, I forgot Erick's mention of the logs: there's nothing unusual in >> INFO level, the update request just gets mentioned. No exception. I >> reran it with the DEBUG level, but most of the log was related to jetty. >> Here's a line I noticed though: >> >> org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest: >> {wt=json&commit=true&update.chain=dedupe} >> >> The update.chain parameter wasn't part of the original request, and >> "dedupe" looks suspicious to me. Perhaps should I investigate further >> there? >> >> Thanks, >> John. >> >> >> On 08/10/15 08:25, John Smith wrote: >>> The ids are all different: they're unique numbers followed by a couple >>> of keywords. I've made a test with a small collection of 10 documents to >>> make sure I can manage them manually: all ids are confirmed as different. >>> >>> I also dumped the exact command, here's one example: >>> >>> 101084385_Sebago_ sebago shoes>> name="Clicks" update="set">1>> update="set">1.8701925463775 >>> >>> It's sent as the body of a POST request to >>> http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a >>> Content-Type: text/xml header. I still noted the consistent loss of >>> another document with the update above. >>> >>> John >>> >>> >>> On 08/10/15 00:38, Upayavira wrote: >>>> What ID are you using? Are you possibly using the same ID field for >>>> both, so the second document you visit causes the first to be >>>> overwritten? >>>> >>>> Upayavira >>>> >>>> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote: >>>>> This certainly should not be happening. I'd >>>>> take a careful look at what you actually send. >>>>> My _guess_ is that you're not sending the update >>>>> command you think you are >>>>> >>>>> As a test you could just curl (or use post.jar) to >>>>> send these types of commands up individually. >>>>> >>>>> Perhaps looking at the solr log would help too... >>>>> >>>>> Best, >>>>> Erick >>>>> >>>>> On Wed, Oct 7, 2015 at 6:32 AM, John Smith >>>>> wrote: >>>>>> Hi, >>>>>> >>>>>> I'm bumping on the following problem with update XML messages. The idea >>>>>> is to record the number of clicks for a document: each time, a message >>>>>> is sent to .../update such as this one: >>>>>> >>>>>> >>>>>> >>>>>> abc >>>>>> 1 >>>>>> 1.05 >>>>>> >>>>>> >>>>>> >>>>>> (Clicks is an int field; Boost is a float field, it's updated to reflect >>>>>> the change in popularity using a formula based on the number of clicks). >>>>>> >>>>>> At the moment in the dev environment, changes are committed immediately. >>>>>> >>>>>> >>>>>> When a document is updated, the changes are indeed reflected in the >>>>>> search results. If I click on the same document again, all goes well. >>>>>> But when I click on an other document, the latter gets updated as >>>>>> expected but the former is plainly deleted. It can no longer be found >>>>>> and the admin core Overview page counts 1 document less. If I click on a >>>>>> 3rd document, so goes the 2nd one. 
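For context, this is roughly what such a chain looks like in the sample solrconfig.xml (the field names shown are the sample's, not John's):

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">id</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">name,features,cat</str>
        <str name="signatureClass">solr.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

With overwriteDupes set to true and a signature computed over fields that don't change between updates, every new update deletes the previously indexed document carrying the same signature, which matches the behaviour described above.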
Re: Unexpected delayed document deletion with atomic updates
After some further investigation, for those interested: the SignatureUpdateProcessorFactory fields were somehow mis-configured (I guess copied over from another collection). The initial import had been made using a data import handler: I suppose the update chain isn't called in this process and no signature field is created - am I right?. The first time a document was updated, a signature field with value "" was added. The next time, the same signature was generated for the new udpate, which triggered the deletion of all documents with the same signature (i.e. the first one) as overwriteDupes was set to true. Correct behavior but quite tricky... So my conclusion here (please correct me if I'm wrong) is of course to fix the signature configuration problem, but also to manage calling the update chain (or maybe a simplified one, e.g. by skipping logging) in the data import handler. Is there an easy way to do this? Conceptually, shouldn't the update chain be callable from the data import process - maybe it is? John On 08/10/15 09:43, Upayavira wrote: > Yay! > > On Thu, Oct 8, 2015, at 08:38 AM, John Smith wrote: >> Yes indeed, the update chain had been activated... I commented it out >> again and the problem vanished. >> >> Good job, thanks Erick and Upayavira! >> John >> >> >> On 08/10/15 08:58, Upayavira wrote: >>> Look for the DedupUpdateProcessor in an update chain. >>> >>> that is there, but commented out IIRC in the techproducts sample >>> configs. >>> >>> Perhaps you uncommented it to use your own update processors, but didn't >>> remove that component? >>> >>> On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote: >>>> Oh, I forgot Erick's mention of the logs: there's nothing unusual in >>>> INFO level, the update request just gets mentioned. No exception. I >>>> reran it with the DEBUG level, but most of the log was related to jetty. >>>> Here's a line I noticed though: >>>> >>>> org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest: >>>> {wt=json&commit=true&update.chain=dedupe} >>>> >>>> The update.chain parameter wasn't part of the original request, and >>>> "dedupe" looks suspicious to me. Perhaps should I investigate further >>>> there? >>>> >>>> Thanks, >>>> John. >>>> >>>> >>>> On 08/10/15 08:25, John Smith wrote: >>>>> The ids are all different: they're unique numbers followed by a couple >>>>> of keywords. I've made a test with a small collection of 10 documents to >>>>> make sure I can manage them manually: all ids are confirmed as different. >>>>> >>>>> I also dumped the exact command, here's one example: >>>>> >>>>> 101084385_Sebago_ sebago shoes>>>> name="Clicks" update="set">1>>>> update="set">1.8701925463775 >>>>> >>>>> It's sent as the body of a POST request to >>>>> http://127.0.0.1:8080/solr/ato_test/update?wt=json&commit=true, with a >>>>> Content-Type: text/xml header. I still noted the consistent loss of >>>>> another document with the update above. >>>>> >>>>> John >>>>> >>>>> >>>>> On 08/10/15 00:38, Upayavira wrote: >>>>>> What ID are you using? Are you possibly using the same ID field for >>>>>> both, so the second document you visit causes the first to be >>>>>> overwritten? >>>>>> >>>>>> Upayavira >>>>>> >>>>>> On Wed, Oct 7, 2015, at 06:38 PM, Erick Erickson wrote: >>>>>>> This certainly should not be happening. I'd >>>>>>> take a careful look at what you actually send. 
Re: Unexpected delayed document deletion with atomic updates
Well, every day we update a lot of documents (usually several millions) so the DIH is a good fit. Calling the update chain would make sense there: after all a data import is just a batch update. Otherwise, the same operations would have to be made upfront, possibly in another environment and/or language. That's probably what I'm gonna do anyway. Thanks for your help! John On 08/10/15 13:39, Upayavira wrote: > You can either specify the update chain via an update.chain request > parameter, or you can configure a new request parameter with its own URL > and separate update.chain value. > > I have no idea how you would then reference that in the DIH - I've never > really used it. > > Upayavira > > On Thu, Oct 8, 2015, at 09:25 AM, John Smith wrote: >> After some further investigation, for those interested: the >> SignatureUpdateProcessorFactory fields were somehow mis-configured (I >> guess copied over from another collection). The initial import had been >> made using a data import handler: I suppose the update chain isn't >> called in this process and no signature field is created - am I right?. >> >> The first time a document was updated, a signature field with value >> "" was added. The next time, the same signature was >> generated for the new udpate, which triggered the deletion of all >> documents with the same signature (i.e. the first one) as overwriteDupes >> was set to true. Correct behavior but quite tricky... >> >> So my conclusion here (please correct me if I'm wrong) is of course to >> fix the signature configuration problem, but also to manage calling the >> update chain (or maybe a simplified one, e.g. by skipping logging) in >> the data import handler. Is there an easy way to do this? Conceptually, >> shouldn't the update chain be callable from the data import process - >> maybe it is? >> >> John >> >> >> On 08/10/15 09:43, Upayavira wrote: >>> Yay! >>> >>> On Thu, Oct 8, 2015, at 08:38 AM, John Smith wrote: >>>> Yes indeed, the update chain had been activated... I commented it out >>>> again and the problem vanished. >>>> >>>> Good job, thanks Erick and Upayavira! >>>> John >>>> >>>> >>>> On 08/10/15 08:58, Upayavira wrote: >>>>> Look for the DedupUpdateProcessor in an update chain. >>>>> >>>>> that is there, but commented out IIRC in the techproducts sample >>>>> configs. >>>>> >>>>> Perhaps you uncommented it to use your own update processors, but didn't >>>>> remove that component? >>>>> >>>>> On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote: >>>>>> Oh, I forgot Erick's mention of the logs: there's nothing unusual in >>>>>> INFO level, the update request just gets mentioned. No exception. I >>>>>> reran it with the DEBUG level, but most of the log was related to jetty. >>>>>> Here's a line I noticed though: >>>>>> >>>>>> org.apache.solr.servlet.HttpSolrCall; Closing out SolrRequest: >>>>>> {wt=json&commit=true&update.chain=dedupe} >>>>>> >>>>>> The update.chain parameter wasn't part of the original request, and >>>>>> "dedupe" looks suspicious to me. Perhaps should I investigate further >>>>>> there? >>>>>> >>>>>> Thanks, >>>>>> John. >>>>>> >>>>>> >>>>>> On 08/10/15 08:25, John Smith wrote: >>>>>>> The ids are all different: they're unique numbers followed by a couple >>>>>>> of keywords. I've made a test with a small collection of 10 documents to >>>>>>> make sure I can manage them manually: all ids are confirmed as >>>>>>> different. 
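On wiring a chain into the DIH: the import handler is itself a request handler, so an update.chain value can (I believe) be attached to it in solrconfig.xml, or passed on the /dataimport request itself. A sketch, assuming a chain named "dedupe" and a DIH config file called data-config.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">data-config.xml</str>
        <str name="update.chain">dedupe</str>
      </lst>
    </requestHandler>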
Re: Unexpected delayed document deletion with atomic updates
Hi Allessandro, In the example I set the value to 1, but it's actually incremented in the code, so with time it should go up. You're right though, I could use an inc update instead. John On 08/10/15 16:45, Alessandro Benedetti wrote: > Not related to the deletion problem, only as a curiosity for your use case : > > 1 > > Have i misunderstood your use case, or you should use : > > inc > > Increments a numeric value by a specific amount. > > Must be specified as a single numeric value. > > Basically overtime you click, you always set the value for that field to > "1" . > So a document with 1 click will be considered equal to one with 1000 clicks. > My 2 cents > > Cheers > > On 8 October 2015 at 14:10, John Smith wrote: > >> Well, every day we update a lot of documents (usually several millions) >> so the DIH is a good fit. >> >> Calling the update chain would make sense there: after all a data import >> is just a batch update. Otherwise, the same operations would have to be >> made upfront, possibly in another environment and/or language. That's >> probably what I'm gonna do anyway. >> >> Thanks for your help! >> John >> >> >> On 08/10/15 13:39, Upayavira wrote: >>> You can either specify the update chain via an update.chain request >>> parameter, or you can configure a new request parameter with its own URL >>> and separate update.chain value. >>> >>> I have no idea how you would then reference that in the DIH - I've never >>> really used it. >>> >>> Upayavira >>> >>> On Thu, Oct 8, 2015, at 09:25 AM, John Smith wrote: >>>> After some further investigation, for those interested: the >>>> SignatureUpdateProcessorFactory fields were somehow mis-configured (I >>>> guess copied over from another collection). The initial import had been >>>> made using a data import handler: I suppose the update chain isn't >>>> called in this process and no signature field is created - am I right?. >>>> >>>> The first time a document was updated, a signature field with value >>>> "" was added. The next time, the same signature was >>>> generated for the new udpate, which triggered the deletion of all >>>> documents with the same signature (i.e. the first one) as overwriteDupes >>>> was set to true. Correct behavior but quite tricky... >>>> >>>> So my conclusion here (please correct me if I'm wrong) is of course to >>>> fix the signature configuration problem, but also to manage calling the >>>> update chain (or maybe a simplified one, e.g. by skipping logging) in >>>> the data import handler. Is there an easy way to do this? Conceptually, >>>> shouldn't the update chain be callable from the data import process - >>>> maybe it is? >>>> >>>> John >>>> >>>> >>>> On 08/10/15 09:43, Upayavira wrote: >>>>> Yay! >>>>> >>>>> On Thu, Oct 8, 2015, at 08:38 AM, John Smith wrote: >>>>>> Yes indeed, the update chain had been activated... I commented it out >>>>>> again and the problem vanished. >>>>>> >>>>>> Good job, thanks Erick and Upayavira! >>>>>> John >>>>>> >>>>>> >>>>>> On 08/10/15 08:58, Upayavira wrote: >>>>>>> Look for the DedupUpdateProcessor in an update chain. >>>>>>> >>>>>>> that is there, but commented out IIRC in the techproducts sample >>>>>>> configs. >>>>>>> >>>>>>> Perhaps you uncommented it to use your own update processors, but >> didn't >>>>>>> remove that component? >>>>>>> >>>>>>> On Thu, Oct 8, 2015, at 07:38 AM, John Smith wrote: >>>>>>>> Oh, I forgot Erick's mention of the logs: there's nothing unusual in >>>>>>>> INFO level, the update request just gets mentioned. No exception. 
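For completeness, a sketch of the "inc" variant Alessandro refers to, using the same placeholder id as earlier in the thread:

    <add>
      <doc>
        <field name="id">abc</field>
        <field name="Clicks" update="inc">1</field>
      </doc>
    </add>

The counter is then maintained by Solr itself instead of being recomputed client-side before each update.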
Best way to backup and restore an index for a cloud setup in 4.6.1?
All, With a cloud setup for a collection in 4.6.1, what is the most elegant way to backup and restore an index? We are specifically looking into the application of when doing a full reindex, with the idea of building an index on one set of servers, backing up the index, and then restoring that backup on another set of servers. Is there a better way to rebuild indexes on another set of servers? We are not sharding if that makes any difference. Thanks, g10vstmoney
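In the 4.x line the usual building block for this is the replication handler's backup command; a sketch, with host, core name and paths as placeholders:

    # take a snapshot of the index on the node hosting the core
    http://host:8983/solr/collection1/replication?command=backup&location=/backups

    # poll status
    http://host:8983/solr/collection1/replication?command=details

The snapshot ends up as a snapshot.* directory under the given location; restoring on another set of servers in 4.6 is essentially copying that snapshot into the new core's data/index directory while the core is down (a dedicated restore command only arrived in later 5.x releases).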
Creating a collection with 1 shard gives a weird range
I'm trying to create a collection starting with only one shard (numShards=1) using a compositeId router. The purpose is to start small and begin splitting shards when the index grows larger.

The shard created gets a weird range value: 80000000-7fffffff, which doesn't look effective. Indeed, if I try to import some documents using a DIH, none gets added.

If I create the same collection with 2 shards, the ranges seem more logical (0-7fffffff & 80000000-ffffffff). In this case documents are indexed correctly.

Is this behavior by design, i.e. is a minimum of 2 shards required? If not, how can I create a working collection with a single shard?

This is Solr 6.0.0 in cloud mode with zookeeper-3.4.8.

Thanks,
John
Re: Creating a collection with 1 shard gives a weird range
On 17/05/16 11:56, Tom Evans wrote: > On Tue, May 17, 2016 at 9:40 AM, John Smith wrote: >> I'm trying to create a collection starting with only one shard >> (numShards=1) using a compositeID router. The purpose is to start small >> and begin splitting shards when the index grows larger. The shard >> created gets a weird range value: 8000-7fff, which doesn't look >> effective. Indeed, if a try to import some documents using a DIH, none >> gets added. >> >> If I create the same collection with 2 shards, the ranges seem more >> logical (0-7fff & 8000-). In this case documents are >> indexed correctly. >> >> Is this behavior by design, i.e. is a minimum of 2 shards required? If >> not, how can I create a working collection with a single shard? >> >> This is Solr-6.0.0 in cloud mode with zookeeper-3.4.8. >> > I believe this is as designed, see this email from Shawn: > > https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c570d0a03.5010...@elyograg.org%3E > > Cheers > > Tom Thanks Tom, signed integers make sense here, I overlooked that - classical mistake. I still have a problem with DIH though, I'll investigate further. John
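In other words, the range is just the full signed 32-bit integer space; a quick worked check:

    0x80000000 as a signed 32-bit int = -2147483648  (Integer.MIN_VALUE)
    0x7fffffff as a signed 32-bit int =  2147483647  (Integer.MAX_VALUE)

so a single shard with range 80000000-7fffffff covers every possible document hash, and the two-shard layout simply splits that same space at 0.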
parent/child rows in solr
Hi, I have a document structure like this (this is a made up schema, my data has nothing to do with departments and employees, but the structure holds true to my real data):

department 1
    employee 11
    employee 12
    employee 13
    room 11
    room 12
    room 13

department 2
    employee 21
    employee 22
    room 21

... etc

I'm trying to figure out the best way to index this, and perform queries. Due to the sheer volume of data, I cannot do a simple "flat file" approach, repeating the header data for each child entry. So that leaves me with "graph traversal" or "block joins". I've played with both of those, but I'm running into various issues with each approach.

I need to be able to run filters on any or all of the header + child rows in the same query, at once (can't seem to get that working in either graph or block join).

One problem I had with graph is that I can't force solr to return the header, then all the children for that header, then the next header + all its children, it just spits them out without keeping them together.

block join seems to return the children nested under the parents, which is great, but then I can't seem to filter on parent + children in the same query: I get the dreaded error message "Parent query must not match any docs besides parent filter"

Kinda lost here, any tips/suggestions?
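A sketch of the block-join pattern that usually avoids that last error; the field names doc_type, emp_name and dept_name are made up. The which= filter must match exactly the set of parent documents, and child-side conditions go inside the {!parent} query rather than in a separate bare fq:

    q={!parent which="doc_type:department"}(doc_type:employee AND emp_name:smith)
    fq=dept_name:"department 1"
    fl=*,[child parentFilter="doc_type:department" childFilter="doc_type:employee"]

"Parent query must not match any docs besides parent filter" is typically raised when the inner (child-side) query also matches documents that the which= filter identifies as parents.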
Re: parent/child rows in solr
Thanks Shawn, for your comments. The reason why I don't want to go flat file structure, is due to all the wasted/duplicated data. If a department has 100 employees, then it's very wasteful in terms of disk space to repeat the header data over and over again, 100 times. In this example there is only a few doc types, but my real-life data is much larger, and the problem is a "scaling" problem; with just a little bit of data, no problem in duplicating header fields, but with massive amounts of data it's a large problem. My understanding of both graph traversal and block joins, is that the header data would only be present once, so that's why I'm gravitating towards those solutions. I just can't seem to line up the "fq" and queries correctly such that I am able to join 3+ document types together, filter on them, and return my requested columns. On Fri, Sep 7, 2018 at 9:32 PM Shawn Heisey wrote: > On 9/7/2018 3:06 PM, John Smith wrote: > > Hi, I have a document structure like this (this is a made up schema, my > > data has nothing to do with departments and employees, but the structure > > holds true to my real data): > > > > department 1 > > employee 11 > > employee 12 > > employee 13 > > room 11 > > room 12 > > room 13 > > > > department 2 > > employee 21 > > employee 22 > > room 21 > > > > ... etc > > > > I'm trying to figure out the best way to index this, and perform queries. > > Due to the sheer volume of data, I cannot do a simple "flat file" > approach, > > repeating the header data for each child entry. > > Why not? > > For the precise use case you have outlined, Solr will work better if you > only have the child documents and simply have every document contain a > "department" field which contains an identifier for the department. > Since this precise structure is not what you are doing, you'll need to > adapt what I'm saying to your actual data. > > The volume of data should be irrelevant to this decision. Solr will > always work best with a flat document structure. > > I have never used the parent/child document feature in Solr, so I cannot > offer any advice on it. Somebody else will need to help you if you > choose to use that feature. > > Thanks, > Shawn > >
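For contrast, a sketch of the flat layout Shawn describes, where each child document simply carries whatever parent fields it needs to be filtered on (all field names are made up):

    { "id": "emp-11",  "doc_type": "employee", "dept_id": "1", "dept_name": "department 1", "emp_name": "..." }
    { "id": "room-11", "doc_type": "room",     "dept_id": "1", "dept_name": "department 1", "room_name": "..." }

Searches then filter on the dept_* and child fields in one flat query, at the cost of repeating the department data on every child document.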
Re: parent/child rows in solr
> > On 9/7/2018 7:44 PM, John Smith wrote: > > Thanks Shawn, for your comments. The reason why I don't want to go flat > > file structure, is due to all the wasted/duplicated data. If a department > > has 100 employees, then it's very wasteful in terms of disk space to > repeat > > the header data over and over again, 100 times. In this example there is > > only a few doc types, but my real-life data is much larger, and the > problem > > is a "scaling" problem; with just a little bit of data, no problem in > > duplicating header fields, but with massive amounts of data it's a large > > problem. > > If your goal is data storage, then you are completely correct. All that > data duplication is something to avoid for a data storage situation. > Normalizing your data so it's relational makes perfect sense, because > most database software is designed to efficiently deal with those > relationships. > > Solr is not designed as a data storage platform, and does not handle > those relationships efficiently. Solr's design goals are all about > *search*. It often gets touted as filling a NoSQL role ... but it's not > something I would personally use as a primary data repository. Search > is a space where data duplication is expected and completely normal. > This is something that people often have a hard time accepting. > > I'm not actually trying to use solr as a data storage platform; all our data is stored in an sql database, we are using solr strictly for the search features, not storage features. Here is a good example from a test I ran today. I have a header table, and 8 child tables which link directly to the header table. The children link only to 1 header row, and they do not link to other children. So a 1:many between header and each child. Some row counts: header: 223,580 child1: 124,978 child2: 254,045 child3: 127,917 child4:1,009,030 child5: 225,311 child6: 381,561 child7: 438,315 child8: 18,850 Trying to index that into solr with a flatfile schema, blows up into 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a left outer join between header and each child and getting a row count in the database. That's not going to scale, at all, considering the small size of the source input tables. Some of our indexes would require 50 million header rows alone, never mind the child tables. So solr has no way of indexing something like this? I can't believe I would be the first person to run into this issue, I have a feeling I'm missing something obvious somewhere.
Re: parent/child rows in solr
On Tue, Sep 11, 2018 at 9:32 PM Shawn Heisey wrote: > On 9/11/2018 7:07 PM, John Smith wrote: > > header: 223,580 > > > > child1: 124,978 > > child2: 254,045 > > child3: 127,917 > > child4:1,009,030 > > child5: 225,311 > > child6: 381,561 > > child7: 438,315 > > child8: 18,850 > > > > > > Trying to index that into solr with a flatfile schema, blows up into > > 5,475,316,072 rows. Yes, 5.5 billion rows. I calculated that by running a > > I think you're not getting what I'm suggesting. Or maybe there's an > aspect of your data that I'm not understanding. > > If we add up all those numbers for the child docs, there are 2.5 million > of them. So you would have 2.5 million docs in Solr. I have created > Solr indexes far larger than this, and I do not consider my work to be > "big data". Solr can handle 2.5 million docs easily, as long as the > hardware resources are sufficient. > > Where the data duplication will come in is in additional fields in those > 2.5 million docs. Each one will contain some (or maybe all) of the data > that WOULD have been in the parent document. The amount of data > balloons, but the number of documents (rows) doesn't. > > That kind of arrangement is usually enough to accomplish whatever is > needed. I cannot assume that it will work for your use case, but it > does work for most. > > Thanks, > Shawn > > The problem is that the math isn't a simple case of adding up all the row counts. These are "left outer join"s. In sql, it would be this query: select * from header h left outer join child1 c1 on c1.hid = h.id left outer join child2 c2 on c2.hid = h.id ... left outer join child8 c8 on c8.hid = h.id If there are 10 rows in child1 linked to 1 header with id "abc", and 10 rows in child2 linked to that same header, then we end up with 10 * 10 rows in solr, not 20. Considering there are 8 child tables in this example, there is simply an explosion of data. I can't describe it much better than that (abstractly), though perhaps I could put together a simple example with live data. Suffice it to say, in my example row counts above, that is all "live data" in a relatively small database of ours, the row counts are real, and the final row count of 5.5 billion was calculated inside sql using that query above: select count(*) from ( select id from header h left outer join child1 c1 on c1.hid = h.id left outer join child2 c2 on c2.hid = h.id ... left outer join child8 c8 on c8.hid = h.id ) tmp;
Re: parent/child rows in solr
On Tue, Sep 11, 2018 at 11:00 PM Shawn Heisey wrote: > On 9/11/2018 8:35 PM, John Smith wrote: > > The problem is that the math isn't a simple case of adding up all the row > > counts. These are "left outer join"s. In sql, it would be this query: > > I think we'll just have to conclude that I do not understand what you > are doing. I have no idea what "left outer join" even means, how it's > different than a join that's NOT "left outer". > > I will say this: Solr is not very efficient at joins, and there are a > bunch of caveats involved. It's usually better to go with a flat > document space for a search engine. > > Thanks, > Shawn > > A "left outer join" in sql is a join such that if there is no match in the child table for a given header id, then the child cells are returned as "null" values, instead of the header row being removed from the result set (which is what happens in "inner join" or standard sql join). A good rundown on the various sql joins: https://stackoverflow.com/questions/38549/what-is-the-difference-between-inner-join-and-outer-join
Re: parent/child rows in solr
On Tue, Sep 11, 2018 at 11:05 PM Walter Underwood wrote: > Have you tried modeling it with multivalued fields? > > That's an interesting idea, but I don't think that would work. We would lose the concept of "rows". So let's say child1 has col "a" and col "b", both are turned into multi-value fields in the solr index. Normally in sql we can query for a specific value in col "a", and then see what the associated value in col "b" would be, but we can't do that if we stuff the col values in multi-value; we can no longer see which value from col "a" corresponds to which value in col "b". I'm probably explaining that poorly, but I just don't see how that would work.
statistics in hitlist
I'm using solr, and enabling stats as per this page:
https://lucene.apache.org/solr/guide/6_6/the-stats-component.html

I want to get more stat values though. Specifically I'm looking for r-squared (coefficient of determination). This value is not present in solr, however some of the pieces used to calculate r^2 are in the stats element, for example:

<double name="min">0.0</double>
<double name="max">10.0</double>
<long name="count">15</long>
<long name="missing">17</long>
<double name="sum">85.0</double>
<double name="sumOfSquares">603.0</double>
<double name="mean">5.667</double>
<double name="stddev">2.943920288775949</double>

So I have the sumOfSquares available (SST), and using this calculation, I can get R^2:

R^2 = 1 - SSE/SST

All I need then is SSE. Is there anyway I can get SSE from those other stats in solr?

Thanks in advance!
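A hedged aside on the arithmetic, since it depends on what SST is taken to be: sumOfSquares in the stats output is the raw sum of y^2, not the total sum of squares about the mean; the latter can be derived from the same response as

    SST = sumOfSquares - count * mean^2
        = 603.0 - 15 * (85.0/15)^2
        ≈ 121.3

which agrees with stddev^2 * (count - 1) = 2.9439^2 * 14 ≈ 121.3. Either way, SSE (the residual sum of squares from a regression) is still not available from the stats component alone.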
Re: statistics in hitlist
Hi Joel, thanks for the answer. I'm not really a stats guy, but the end result of all this is supposed to be obtaining R^2. Is there no way of obtaining this value, then (short of iterating over all the results in the hitlist and calculating it myself)? On Fri, Feb 23, 2018 at 12:26 PM, Joel Bernstein wrote: > Typically SSE is the sum of the squared errors of the prediction in a > regression analysis. The stats component doesn't perform regression, > although it might be a nice feature. > > > > Joel Bernstein > http://joelsolr.blogspot.com/ > > On Fri, Feb 23, 2018 at 12:17 PM, John Smith wrote: > > > I'm using solr, and enabling stats as per this page: > > https://lucene.apache.org/solr/guide/6_6/the-stats-component.html > > > > I want to get more stat values though. Specifically I'm looking for > > r-squared (coefficient of determination). This value is not present in > > solr, however some of the pieces used to calculate r^2 are in the stats > > element, for example: > > > > 0.0 > > 10.0 > > 15 > > 17 > > 85.0 > > 603.0 > > 5.667 > > 2.943920288775949 > > > > > > So I have the sumOfSquares available (SST), and using this calculation, I > > can get R^2: > > > > R^2 = 1 - SSE/SST > > > > All I need then is SSE. Is there anyway I can get SSE from those other > > stats in solr? > > > > Thanks in advance! > > >
Re: statistics in hitlist
Joel, thanks for the pointers to the streaming feature. I had no idea Solr had that (and I also just discovered the very interesting SQL feature! I will be sure to investigate that in more detail in the future).

However I'm having some trouble getting basic streaming functions working. I've already figured out that I had to move to "solr cloud" instead of "solr standalone", because I was getting errors about "cannot find zk instance" or similar, which went away when using "solr start -c" instead.

But now I'm trying to use the random function, since that was one of the functions used in your example:

random(tx_header, q="*:*", rows="100", fl="countyname")

I posted that directly in the "stream" section of the Solr admin UI. This is all on Linux, with Solr 7.1.0 and 7.2.1 (I tried several versions in case it was a bug in one).

I get back an error message:
*sort param could not be parsed as a query, and is not a field that exists in the index: random_-255009774*

I'm not passing in any sort field anywhere. But the Solr logs show these three log entries:

2018-03-01 21:41:18.954 INFO (qtp257513673-21) [c:tx_header s:shard1 r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request [tx_header_shard1_replica_n1] webapp=/solr path=/select params={q=*:*&_stateVer_=tx_header:6&fl=countyname *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400 QTime=19

2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1 r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient Request to collection [tx_header] failed due to (400) org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.13.31:8983/solr/tx_header: sort param could not be parsed as a query, and is not a field that exists in the index: random_-255009774, retry? 0

2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_header s:shard1 r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.s.ExceptionStream java.io.IOException: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://192.168.13.31:8983/solr/tx_header: sort param could not be parsed as a query, and is not a field that exists in the index: random_-255009774

So basically it looks like Solr is injecting the "sort=random_..." parameter into my query, and of course that fails on the search side since that field doesn't exist in my schema. Every time I run the random function I get a slightly different field name injected, but they all start with "random_". I have tried adding my own sort field, hoping Solr wouldn't inject one for me, but it still injected a random sort field name:

random(tx_header, q="*:*", rows="100", fl="countyname", sort="countyname asc")

Assuming I can fix that whole problem, my second question is: can I add multiple "fq=" parameters to the random function? I build a pretty complicated query using many fq= fields and then want to run some stats on that hitlist, so somehow I have to pass in the query that made up the exact hitlist to these various functions. But when I used multiple "fq=" values, it only seemed to use the last one I specified and ignored all the previous fq's.

Thanks in advance for any comments/suggestions...!

On Fri, Feb 23, 2018 at 5:59 PM, Joel Bernstein wrote:
> This is going to be a complex answer because Solr actually now has multiple
> ways of doing regression analysis as part of the Streaming Expression
> statistical programming library.
> The basic documentation is here:
>
> https://lucene.apache.org/solr/guide/7_2/statistical-programming.html
>
> Here is a sample expression that performs a simple linear regression in
> Solr 7.2:
>
> let(a=random(collection1, q="any query", rows="15000", fl="fieldA, fieldB"),
>     b=col(a, fieldA),
>     c=col(a, fieldB),
>     d=regress(b, c))
>
> The expression above takes a random sample of 15000 results from
> collection1. The result set will include fieldA and fieldB in each record.
> The result set is stored in variable "a".
>
> Then the "col" function creates arrays of numbers from the results stored
> in variable a. The values in fieldA are stored in the variable "b". The
> values in fieldB are stored in variable "c".
>
> Then the regress function performs a simple linear regression on arrays
> stored in variables "b" and "c".
>
> The output of the regress function is a map containing the regression
> result. This result includes RSquared and other attributes of the
> regression model such as R (correlation), slope, y intercept etc...
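As a side note, if only the regression result is wanted back (rather than every variable), the echo parameter that Joel uses later in this thread can be combined with the same example. A minimal sketch, reusing Joel's hypothetical collection1/fieldA/fieldB names and assuming echo accepts a single variable name just as it accepts the comma-separated list shown later:

    let(echo="d",
        a=random(collection1, q="*:*", rows="15000", fl="fieldA,fieldB"),
        b=col(a, fieldA),
        c=col(a, fieldB),
        d=regress(b, c))

Only the map stored in d (R, slope, y intercept, RSquared and so on, per Joel's description) is then returned.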
Re: statistics in hitlist
Thanks Joel for your help on this.

What I've done so far:
- unzipped the downloaded solr-7.2
- modified the _default "managed-schema" to add the random field type and the dynamic random field
- started Solr 7 using "solr start -c"
- indexed my data using pint/pdouble/boolean field types etc.

I can now run the random function all by itself and it returns random results as expected. So far so good!

However... now trying to get the regression stuff working:

let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
             fl="oil_first_90_days_production,oil_last_30_days_production"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c))

I posted this directly into the Solr admin UI, ran the streaming expression, and I get this error message:

"EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value expected but found type java.lang.String for value oil_first_90_days_production"

It thinks my numeric field is defined as a string? But when I view the schema, those 2 fields are defined as ints:

When I run a normal query and choose XML as the output format, it also puts "int" elements into the hitlist, so the schema appears to be correct; it's just when using this regress function that something goes wrong and Solr thinks the field is a string.

Any suggestions? Thanks!

On Thu, Mar 1, 2018 at 9:12 PM, Joel Bernstein wrote:
> The field type will also need to be in the schema:
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 1, 2018 at 8:00 PM, Joel Bernstein wrote:
> > You'll need to have this field in your schema:
> >
> >
> > I'll check to see if the default schema used with solr start -c has this
> > field, if not I'll add it. Thanks for pointing this out.
> >
> > I checked and right now the random expression is only accepting one fq,
> > but I consider this a bug. It should accept multiple. I'll create a ticket
> > for getting this fixed.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 1, 2018 at 4:55 PM, John Smith wrote:
> >> Joel, thanks for the pointers to the streaming feature. I had no idea solr
> >> had that (and also just discovered the very interesting sql feature! I will
> >> be sure to investigate that in more detail in the future).
> >>
> >> However I'm having some trouble getting basic streaming functions working.
> >> I've already figured out that I had to move to "solr cloud" instead of
> >> "solr standalone" because I was getting errors about "cannot find zk
> >> instance" or whatever which went away when using "solr start -c" instead.
> >>
> >> But now I'm trying to use the random function since that was one of the
> >> functions used in your example.
> >>
> >> random(tx_header, q="*:*", rows="100", fl="countyname")
> >>
> >> I posted that directly in the "stream" section of the solr admin UI. This
> >> is all on linux, with solr 7.1.0 and 7.2.1 (tried several versions in case
> >> it was a bug in one)
> >>
> >> I get back an error message:
> >> *sort param could not be parsed as a query, and is not a field that exists
> >> in the index: random_-255009774*
> >>
> >> I'm not passing in any sort field anywhere.
> >> But the solr logs show these three log entries:
> >>
> >> 2018-03-01 21:41:18.954 INFO (qtp257513673-21) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.S.Request
> >> [tx_header_shard1_replica_n1] webapp=/solr path=/select
> >> params={q=*:*&_stateVer_=tx_header:6&fl=countyname
> >> *&sort=random_-255009774+asc*&rows=100&wt=javabin&version=2} status=400
> >> QTime=19
> >>
> >> 2018-03-01 21:41:18.966 ERROR (qtp257513673-17) [c:tx_header s:shard1
> >> r:core_node2 x:tx_header_shard1_replica_n1] o.a.s.c.s.i.CloudSolrClient
> >> Request to collection [tx_header] failed due to (400)
> >> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> >> from server at http://192.168.13.31:8983/solr/tx_header: sort param could
> >> not be parsed as a query, and is not a field that exists in the index:
> >> random_-255009774, retry? 0
> >>
> >> 2018-03-01 21:41:18.968 ERROR (qtp257513673-17) [c:tx_head
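Joel's schema snippets above did not survive the mail archive. For reference, the corresponding stock definitions that ship in Solr's example schemas look like the following; this is a reconstruction for illustration, not Joel's exact text:

    <dynamicField name="random_*" type="random" />
    <fieldType name="random" class="solr.RandomSortField" indexed="true" />

With these two entries in managed-schema, the sort=random_<seed> parameter that the random() expression injects (visible in the logs above) resolves to a RandomSortField instead of failing as an unknown field.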
Re: statistics in hitlist
Hi Joel, I did some more work on this statistics stuff today.

Yes, we do have nulls in our data. The document contains many fields and we don't always have values for each field, but we can't set the nulls to 0 (or any other value, really) either, as that would mess up other calculations (such as when calculating the average, etc.); we would normally just ignore fields with null values when calculating stats manually ourselves.

Adding a check in the "q" parameter to ensure that the fields used in the calculations are > 0 does work now. Thanks for the tip (and sorry, I should have caught that myself). But I am unable to use "fq" for these checks; they have to be added to the q instead. Adding fq's doesn't have any effect.

Anyway, I'm trying to change this up a little. This is what I'm currently using (I switched from "random" to "search" since I actually need the full hitlist, not just a random subset):

let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
             fq="isParent:true", rows="150",
             fl="id,oil_first_90_days_production,oil_last_30_days_production",
             sort="id asc"),
    b=col(a, oil_first_90_days_production),
    c=col(a, oil_last_30_days_production),
    d=regress(b, c))

So I have two fields defined there, and that works great (in terms of a test run of the query). But I need to replace the second field, "oil_last_30_days_production", with the average value of oil_first_90_days_production. I can get the average with this expression:

stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
      fq="isParent:true", rows="150", avg(oil_first_90_days_production))

But I don't know how to push that avg value into the first streaming expression. I'm guessing I have to set "c=" to it, but that is where I'm getting lost, since avg only returns one value while the first parameter, "b", returns a list of sorts. Somehow I have to get the avg value stuffed inside a "col", where it is the same value for every row in the hitlist...?

Thanks for your help!

On Mon, Mar 5, 2018 at 10:50 PM, Joel Bernstein wrote:
> I suspect you've got nulls in your data. I just tested with null values and
> got the same error. For testing purposes try loading the data with default
> values of zero.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Mon, Mar 5, 2018 at 10:12 PM, Joel Bernstein wrote:
> > Let's break the expression down and build it up slowly. Let's start with:
> >
> > let(echo="true",
> >     a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15",
> >              fl="oil_first_90_days_production,oil_last_30_days_production"),
> >     b=col(a, oil_first_90_days_production))
> >
> > This should return variables a and b. Let's see what the data looks like.
> > I changed the rows from 15 to 15000. If it all looks good we can expand the
> > rows and continue adding functions.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Mon, Mar 5, 2018 at 4:11 PM, John Smith wrote:
> >> Thanks Joel for your help on this.
> >>
> >> What I've done so far:
> >> - unzip downloaded solr-7.2
> >> - modify the _default "managed-schema" to add the random field type and
> >> the dynamic random field
> >> - start solr7 using "solr start -c"
> >> - indexed my data using pint/pdouble/boolean field types etc
> >>
> >> I can now run the random function all by itself, it returns random
> >> results as expected. So far so good!
> >>
> >> However...
> >> now trying to get the regression stuff working:
> >>
> >> let(a=random(tx_prod_production, q="*:*", fq="isParent:true", rows="15000",
> >>              fl="oil_first_90_days_production,oil_last_30_days_production"),
> >>     b=col(a, oil_first_90_days_production),
> >>     c=col(a, oil_last_30_days_production),
> >>     d=regress(b, c))
> >>
> >> Posted directly into solr admin UI. Run the streaming expression and I
> >> get this error message:
> >> "EXCEPTION": "Failed to evaluate expression regress(b,c) - Numeric value
> >> expected but found type java.lang.String for value
> >> oil_first_90_days_production"
> >>
> >> It thinks my numeric field is defined as a string? But when I view the
> >> schema, those 2 fields are defined as ints:
> >>
> >> When I
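One way to compute that average inside the same expression is sketched below. It reuses the poster's collection and field names and assumes the mean evaluator is available in the Solr release being used (it belongs to the same math-expression library as col and regress, but the exact function set varies by version):

    let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
                 fq="isParent:true", rows="150",
                 fl="id,oil_first_90_days_production,oil_last_30_days_production",
                 sort="id asc"),
        b=col(a, oil_first_90_days_production),
        m=mean(b),
        c=col(a, oil_last_30_days_production),
        d=regress(b, c))

Here m holds the average of the sampled oil_first_90_days_production values, taken from the same hitlist rather than from a separate stats() call; the echo parameter shown in Joel's examples elsewhere in the thread can then control which of the variables are returned.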
Re: statistics in hitlist
Thanks for the link to the documentation, that will probably come in useful. I didn't see a way, though, to get my avg function working. So instead of doing a linear regression on two fields, X and Y, in a hitlist, we need to do a linear regression on field X and the average value of X. Is that possible? Can a function be passed to the regress function instead of a field?

On Thu, Mar 15, 2018 at 10:41 PM, Joel Bernstein wrote:
> I've been working on the user guide for the math expressions. Here is the
> page on regression:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/regression.adoc
>
> This page is part of the larger math expression documentation. The TOC is
> here:
>
> https://github.com/joel-bernstein/lucene-solr/blob/math_expressions_documentation/solr/solr-ref-guide/src/math-expressions.adoc
>
> The docs are still very rough but you can get an idea of the coverage.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Thu, Mar 15, 2018 at 10:26 PM, Joel Bernstein wrote:
> > If you want to get everything in one query you can do this:
> >
> > let(echo="d,e",
> >     a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> >              fq="isParent:true", rows="150",
> >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> >              sort="id asc"),
> >     b=col(a, oil_first_90_days_production),
> >     c=col(a, oil_last_30_days_production),
> >     d=regress(b, c),
> >     e=someExpression())
> >
> > The echo parameter tells the let expression which variables to output.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> > On Thu, Mar 15, 2018 at 3:13 PM, Erick Erickson wrote:
> > > What does the fq clause look like?
> > >
> > > On Thu, Mar 15, 2018 at 11:51 AM, John Smith wrote:
> > > > Hi Joel, I did some more work on this statistics stuff today. Yes, we do
> > > > have nulls in our data; the document contains many fields, we don't always
> > > > have values for each field, but we can't set the nulls to 0 either (or any
> > > > other value, really) as that will mess up other calculations (such as when
> > > > calculating average etc); we would normally just ignore fields with null
> > > > values when calculating stats manually ourselves.
> > > >
> > > > Adding a check in the "q" parameter to ensure that the fields used in the
> > > > calculations are > 0 does work now. Thanks for the tip (and sorry, should
> > > > have caught that myself). But I am unable to use "fq" for these checks,
> > > > they have to be added to the q instead. Adding fq's doesn't have any effect.
> > > >
> > > > Anyway, I'm trying to change this up a little. This is what I'm currently
> > > > using (switched from "random" to "search" since I actually need the full
> > > > hitlist not just a random subset):
> > > >
> > > > let(a=search(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > > >              fq="isParent:true", rows="150",
> > > >              fl="id,oil_first_90_days_production,oil_last_30_days_production",
> > > >              sort="id asc"),
> > > >     b=col(a, oil_first_90_days_production),
> > > >     c=col(a, oil_last_30_days_production),
> > > >     d=regress(b, c))
> > > >
> > > > So I have 2 fields there defined, that works great (in terms of a test and
> > > > running the query); but I need to replace the second field,
> > > > "oil_last_30_days_production" with the avg value in
> > > > oil_first_90_days_production.
> > > >
> > > > I can get the avg with this expression:
> > > >
> > > > stats(tx_prod_production, q="oil_first_90_days_production:[1 TO *]",
> > > >       fq="isParent:true", rows="150", avg(oil_first_90_days_production))
> > > >
> > > > But I don't know how to push that avg value into the first streaming
> > > > expression; guessing I have to set "c=" but that is where I'm getting
> > > > los