Solr Cloud - is replication really a feature on the trunk?
Hello,

I'm working off the trunk and the following wiki page: http://wiki.apache.org/solr/SolrCloud

The wiki has a section that aims to quickly familiarize a user with replication in SolrCloud - "Example B: Simple two shard cluster with shard replicas". But after going through it, I have to wonder whether this is truly replication. Because if it is, then somewhere along the line the following properties must have been set programmatically: replicateAfter, confFiles, masterUrl, pollInterval.

Can someone tell me where exactly in the code this is happening? I've been looking through some older threads where I see exchanges like:

[Jan Høydahl]: Question: Is ReplicationHandler ZK-aware yet?
[Mark Miller]: As I think you now know, not yet ;)

Not sure if the comments above really fit with my question, but they certainly aren't encouraging. SolrCloud does an excellent job of super-simplifying the sharding process, so can anyone tell me what needs to happen to make it do the same for replication? I'm willing to get my hands dirty and contribute to the trunk if someone can provide high-level mentoring/guidance around the existing SolrCloud code.
Re: Solr Cloud - is replication really a feature on the trunk?
Thank You Yury. After looking at your thread, there's something I must clarify: Is solr.xml not uploaded and held in ZooKeeper? I ask this because you have a slightly different config between Node 1 & 2: http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html On Wed, Sep 7, 2011 at 8:34 PM, Yury Kats wrote: > On 9/7/2011 3:18 PM, Pulkit Singhal wrote: >> Hello, >> >> I'm working off the trunk and the following wiki link: >> http://wiki.apache.org/solr/SolrCloud >> >> The wiki link has a section that seeks to quickly familiarize a user >> with replication in SolrCloud - "Example B: Simple two shard cluster >> with shard replicas" >> >> But after going through it, I have to wonder if this is truly >> replication? > > Not really. Replication is not set up in the example. > The example use "replicas" as "copies", to demonstrate high search > availability. > >> Because if it is truly replication then somewhere along >> the line, the following properties must have been set >> programmatically: >> replicateAfter, confFiles, masterUrl, pollInterval >> Can someone tell me: Where exactly in the code is this happening? > > Nowhere. > > If you want replication, you need to set all the properties you listed > in solrconfig.xml. > > I've done it recently, see > http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html > >
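For readers following along: the properties named above live in the ReplicationHandler section of solrconfig.xml. A minimal sketch, assuming placeholder host, core and config-file names (SolrCloud does not fill any of this in for you):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/core1/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>

On a real pair of nodes you would normally keep only the master block on the master and only the slave block on the slave, or gate them with the enable.master/enable.slave flags that come up later in this archive.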
Re: SolrCloud Feedback
Hello Jan,

You've made a very good point in (b). I would be happy to make the edit to the wiki if I understood your explanation completely.

When you say that it is "looking up what collection that core is part of" ... I'm curious: how does a core get put under a particular collection in the first place, and what is that collection named? Obviously you've made it clear that "collection1" is really the name of the core itself. And where is this association stored for the code to look it up?

If not Jan, then perhaps the gurus who wrote Solr Cloud could answer :)

Thanks! - Pulkit

On Thu, Feb 10, 2011 at 9:10 AM, Jan Høydahl wrote: > Hi, > > I have so far just tested the examples and got a N by M cluster running. My > feedback: > > a) First of all, a major update of the SolrCloud Wiki is needed, to clearly > state what is in which version, what are current improvement plans and get > rid of outdated stuff. That said I think there are many good ideas there. > > b) The "collection" terminology is too much confused with "core", and should > probably be made more distinct. I just tried to configure two cores on the > same Solr instance into the same collection, and that worked fine, both as > distinct shards and as same shard (replica). The wiki examples give the > impression that "collection1" in > localhost:8983/solr/collection1/select?distrib=true is some magic collection > identifier, but what it really does is doing the query on the *core* named > "collection1", looking up what collection that core is part of and > distributing the query to all shards in that collection. > > c) ZK is not designed to store large files. While the files in conf are > normally well below the 1M limit ZK imposes, we should perhaps consider using > a lightweight distributed object or k/v store for holding the /CONFIGS and > let ZK store a reference only > > d) How are admins supposed to update configs in ZK? Install their favourite > ZK editor? > > e) We should perhaps not be so afraid to make ZK a requirement for Solr in > v4. Ideally you should interact with a 1-node Solr in the same manner as you > do with a 100-node Solr. An example is the Admin GUI where the "schema" and > "solrconfig" links assume local file. This requires decent tool support to > make ZK interaction intuitive, such as "import" and "export" commands. > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > > On 19. jan. 2011, at 21.07, Mark Miller wrote: > >> Hello Users, >> >> About a little over a year ago, a few of us started working on what we >> called SolrCloud. >> >> This initial bit of work was really a combination of laying some base work - >> figuring out how to integrate ZooKeeper with Solr in a limited way, dealing >> with some infrastructure - and picking off some low hanging search side >> fruit. >> >> The next step is the indexing side. And we plan on starting to tackle that >> sometime soon. >> >> But first - could you help with some feedback? Some people are using our >> SolrCloud start - I have seen evidence of it ;) Some, even in production. >> >> I would love to have your help in targeting what we now try and improve. Any >> suggestions or feedback? If you have sent this before, I/others likely >> missed it - send it again! >> >> I know anyone that has used SolrCloud has some feedback. I know it because >> I've used it too ;) It's too complicated to setup still. There are still >> plenty of pain points. 
We accepted some compromise trying to fit into what >> Solr was, and not wanting to dig in too far before feeling things out and >> letting users try things out a bit. Thinking that we might be able to adjust >> Solr to be more in favor of SolrCloud as we go, what is the ideal state of >> the work we have currently done? >> >> If anyone using SolrCloud helps with the feedback, I'll help with the coding >> effort. >> >> - Mark Miller >> -- lucidimagination.com > >
Re: SolrCloud Feedback
I think I understand it a bit better now but wouldn't mind some validation. 1) solr.xml does not become part of ZooKeeper 2) The default looks like this out-of-box: so that may leave one wondering where the core's association to a collection name is made? It can be made like so: a) statically in a file: b) at start time via java: java ... -Dcollection.configName=myconf ... -jar start.jar And I'm guessing that since the core's name ("collection1") for shard1 has already been associated with -Dcollection.configname=myconf in http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster once already, adding an additional shard2 with the same core name ("collection1"), automatically throws it in with the collection name ("myconf") without any need to specify anything at startup via -D or statically in solr.xml file. Validate away otherwise I'll just accept any hate mail after making edits to the Solr wiki directly. - Pulkit On Fri, Sep 9, 2011 at 11:38 AM, Pulkit Singhal wrote: > Hello Jan, > > You've made a very good point in (b). I would be happy to make the > edit to the wiki if I understood your explanation completely. > > When you say that it is "looking up what collection that core is part > of" ... I'm curious how a core is being put under a particular > collection in the first place? And what that collection is named? > Obviously you've made it clear that colelction1 is really the name of > the core itself. And where this association is being stored for the > code to look it up? > > If not Jan, then perhaps the gurus who wrote Solr Cloud could answer :) > > Thanks! > - Pulkit > > On Thu, Feb 10, 2011 at 9:10 AM, Jan Høydahl wrote: >> Hi, >> >> I have so far just tested the examples and got a N by M cluster running. My >> feedback: >> >> a) First of all, a major update of the SolrCloud Wiki is needed, to clearly >> state what is in which version, what are current improvement plans and get >> rid of outdated stuff. That said I think there are many good ideas there. >> >> b) The "collection" terminology is too much confused with "core", and should >> probably be made more distinct. I just tried to configure two cores on the >> same Solr instance into the same collection, and that worked fine, both as >> distinct shards and as same shard (replica). The wiki examples give the >> impression that "collection1" in >> localhost:8983/solr/collection1/select?distrib=true is some magic collection >> identifier, but what it really does is doing the query on the *core* named >> "collection1", looking up what collection that core is part of and >> distributing the query to all shards in that collection. >> >> c) ZK is not designed to store large files. While the files in conf are >> normally well below the 1M limit ZK imposes, we should perhaps consider >> using a lightweight distributed object or k/v store for holding the /CONFIGS >> and let ZK store a reference only >> >> d) How are admins supposed to update configs in ZK? Install their favourite >> ZK editor? >> >> e) We should perhaps not be so afraid to make ZK a requirement for Solr in >> v4. Ideally you should interact with a 1-node Solr in the same manner as you >> do with a 100-node Solr. An example is the Admin GUI where the "schema" and >> "solrconfig" links assume local file. This requires decent tool support to >> make ZK interaction intuitive, such as "import" and "export" commands. >> >> -- >> Jan Høydahl, search solution architect >> Cominvent AS - www.cominvent.com >> >> On 19. jan. 
2011, at 21.07, Mark Miller wrote: >> >>> Hello Users, >>> >>> About a little over a year ago, a few of us started working on what we >>> called SolrCloud. >>> >>> This initial bit of work was really a combination of laying some base work >>> - figuring out how to integrate ZooKeeper with Solr in a limited way, >>> dealing with some infrastructure - and picking off some low hanging search >>> side fruit. >>> >>> The next step is the indexing side. And we plan on starting to tackle that >>> sometime soon. >>> >>> But first - could you help with some feedback?ISome people are using our >>> SolrCloud start - I have seen evidence of it ;) Some, even in production. >>> >>> I would love to have your help in targeting what we now try and improve. >>> Any suggestions or feedback? If you have sent this before, I/othe
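For reference, the out-of-box solr.xml that the message above alludes to is roughly the following (reconstructed from memory of the example config of that era, so treat it as a sketch):

<solr persistent="false">
  <cores adminPath="/admin/cores">
    <core name="collection1" instanceDir="." />
  </cores>
</solr>

Option (a), the static association, amounts to adding the collection attribute to that core element, e.g. <core name="collection1" instanceDir="." collection="myconf" />, while option (b) passes -Dcollection.configName=myconf on the startup command line instead.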
Re: Solr Cloud - is replication really a feature on the trunk?
Thanks Again. Another question: My solr.xml has: And I omitted -Dcollection.configName=myconf from the startup command because I felt that specifying collection="myconf" should take care of that: cd /trunk/solr/example java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar But the zookeeper.jsp page doesn't seem to take any of that into effect and shows: /collections (v=6 children=1) collection1 (v=0 children=1) "configName=configuration1" shards (v=0 children=1) shard1 (v=0 children=1) tiklup-mac.local:8983_solr_ (v=0) "node_name=tiklup-mac.local:8983_solr url=http://tiklup-mac.local:8983/solr/"; Then what is the point of naming the core and the collection? - Pulkit 2011/9/9 Yury Kats : > On 9/9/2011 10:52 AM, Pulkit Singhal wrote: >> Thank You Yury. After looking at your thread, there's something I must >> clarify: Is solr.xml not uploaded and held in ZooKeeper? > > Not as far as I understand. Cores are loaded/created by the local > Solr server based on solr.xml and then registered with ZK, so that > ZK know what cores are out there and how they are organized in shards. > > >> because you have a slightly different config between Node 1 & 2: >> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html > > > I have two shards, each shard having a master and a slave core. > Cores are located so that master and slave are on different nodes. > This protects search (but not indexing) from node failure. >
Re: Solr Cloud - is replication really a feature on the trunk?
I had forgotten to save the file, the collection name at least shows up but the core name is still not used, is it simply decorative? /collections (v=6 children=1) myconf (v=0 children=1) "configName=configuration1" shards (v=0 children=1) shard1 (v=0 children=1) tiklup-mac.local:8983_solr_ (v=0) "node_name=tiklup-mac.local:8983_solr url=http://tiklup-mac.local:8983/solr/"; Thanks! - Pulkit On Fri, Sep 9, 2011 at 5:54 PM, Pulkit Singhal wrote: > Thanks Again. > > Another question: > > My solr.xml has: > > > > > And I omitted -Dcollection.configName=myconf from the startup command > because I felt that specifying collection="myconf" should take care of > that: > cd /trunk/solr/example > java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar > > But the zookeeper.jsp page doesn't seem to take any of that into > effect and shows: > /collections (v=6 children=1) > collection1 (v=0 children=1) "configName=configuration1" > shards (v=0 children=1) > shard1 (v=0 children=1) > tiklup-mac.local:8983_solr_ (v=0) > "node_name=tiklup-mac.local:8983_solr > url=http://tiklup-mac.local:8983/solr/"; > > Then what is the point of naming the core and the collection? > > - Pulkit > > 2011/9/9 Yury Kats : >> On 9/9/2011 10:52 AM, Pulkit Singhal wrote: >>> Thank You Yury. After looking at your thread, there's something I must >>> clarify: Is solr.xml not uploaded and held in ZooKeeper? >> >> Not as far as I understand. Cores are loaded/created by the local >> Solr server based on solr.xml and then registered with ZK, so that >> ZK know what cores are out there and how they are organized in shards. >> >> >>> because you have a slightly different config between Node 1 & 2: >>> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html >> >> >> I have two shards, each shard having a master and a slave core. >> Cores are located so that master and slave are on different nodes. >> This protects search (but not indexing) from node failure. >> >
Re: Solr Cloud - is replication really a feature on the trunk?
1s of all, thanks everyone, your expertise and time is much appreciated. @Jamie: Great suggestion, I just have one small objection to it ... I wouldn't want to mix the core's name with the collection's configName. Wouldn't you also want to keep the two separate for clarity? What do you think about that? @Yury: Overall what you said makes sense and I'll roll with it. But FYI, through experimentation I found out that collection="myconf" does not become the value for configName when I inspect ZooKeeper.jsp, here's an example of what shows up if I setup the solr.xml file but don't say anything in the cmd line startup: myconf (v=0 children=1) "configName=configuration1" But perhaps that's exactly what you are trying to warn me about. I'll experiment more and get back. - Pulkit On Fri, Sep 9, 2011 at 10:17 PM, Jamie Johnson wrote: > as a note you could change out the values in solr.xml to be as follows > and pull these values from System Properties. > > > > > > unless someone says otherwise, but the quick tests I've run seem to > work perfectly well with this setup. > > 2011/9/9 Yury Kats : >> On 9/9/2011 6:54 PM, Pulkit Singhal wrote: >>> Thanks Again. >>> >>> Another question: >>> >>> My solr.xml has: >>> >>> >> collection="myconf"/> >>> >>> >>> And I omitted -Dcollection.configName=myconf from the startup command >>> because I felt that specifying collection="myconf" should take care of >>> that: >>> cd /trunk/solr/example >>> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar >> >> With this you are telling ZK to bootstrap a collection with content of >> specific >> files, but you don't tell what collection that should be. >> >> Hence you want collection.configName parameter, and you want >> solr.xml to reference the same name in 'collection' attribute for the cores, >> so that SolrCloud knows where to pull configuration for that core from. >> >> >> >
Re: Solr Cloud - is replication really a feature on the trunk?
Yes now I'm sure that a) collection="blah" in solr.xml, and b) -Dcollection.configName="myconf" at cmd line actually fill in values for two very different fields. Here's why I say so: Example config # 1: Results in: /collections (v=6 children=1) scaleDeep (v=0 children=1) "configName=myconf" Example config # 2: Results in: /collections (v=6 children=1) scaleDeep (v=0 children=1) "configName=scaleDeep" What do you think about that? I maybe mis-interpreting the resutls so pleaase pelase feel free to set me straight on this. Also it would be nice if I knew the code well enough to just look @ it and give an authoritative answer. Does anyone have that kind of expertise? Reverse-engineering is getting a bit mundane. Thanks! - Pulkit On Sat, Sep 10, 2011 at 11:43 AM, Pulkit Singhal wrote: > 1s of all, thanks everyone, your expertise and time is much appreciated. > > @Jamie: > Great suggestion, I just have one small objection to it ... I wouldn't > want to mix the core's name with the collection's configName. Wouldn't > you also want to keep the two separate for clarity? What do you think > about that? > > @Yury: > Overall what you said makes sense and I'll roll with it. But FYI, > through experimentation I found out that collection="myconf" does not > become the value for configName when I inspect ZooKeeper.jsp, here's > an example of what shows up if I setup the solr.xml file but don't say > anything in the cmd line startup: > > myconf (v=0 children=1) "configName=configuration1" > > But perhaps that's exactly what you are trying to warn me about. I'll > experiment more and get back. > > - Pulkit > > On Fri, Sep 9, 2011 at 10:17 PM, Jamie Johnson wrote: >> as a note you could change out the values in solr.xml to be as follows >> and pull these values from System Properties. >> >> >> >> >> >> unless someone says otherwise, but the quick tests I've run seem to >> work perfectly well with this setup. >> >> 2011/9/9 Yury Kats : >>> On 9/9/2011 6:54 PM, Pulkit Singhal wrote: >>>> Thanks Again. >>>> >>>> Another question: >>>> >>>> My solr.xml has: >>>> >>>> >>> collection="myconf"/> >>>> >>>> >>>> And I omitted -Dcollection.configName=myconf from the startup command >>>> because I felt that specifying collection="myconf" should take care of >>>> that: >>>> cd /trunk/solr/example >>>> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar >>>> start.jar >>> >>> With this you are telling ZK to bootstrap a collection with content of >>> specific >>> files, but you don't tell what collection that should be. >>> >>> Hence you want collection.configName parameter, and you want >>> solr.xml to reference the same name in 'collection' attribute for the cores, >>> so that SolrCloud knows where to pull configuration for that core from. >>> >>> >>> >> >
Re: Solr Cloud - is replication really a feature on the trunk?
Sorry a message got sent without me finishing it up, ctrl+s is not save but send ... sigh! Yes now I'm sure that a) collection="blah" in solr.xml, and b) -Dcollection.configName="myconf" at cmd line actually fill in values for two very different fields. Here's why I say so: Example config # 1: java -Dcollection.configName=*myconf* ... -DzkRun -jar start.jar Results in: /collections (v=6 children=1) *scaleDeep* (v=0 children=1) "configName=*myconf*" Example config # 2: java -Dcollection.configName=*scaleDeep* ... -DzkRun -jar start.jar Results in: /collections (v=6 children=1) *scaleDeep* (v=0 children=1) "configName=*scaleDeep*" What do you think about that? I maybe mis-interpreting the results so please please feel free to set me straight on this. Also it would be nice if I knew the code well enough to just look @ it and give an authoritative answer. Does anyone have that kind of expertise? Reverse-engineering is getting a bit mundane. Thanks! - Pulkit > On Sat, Sep 10, 2011 at 11:43 AM, Pulkit Singhal > wrote: >> 1s of all, thanks everyone, your expertise and time is much appreciated. >> >> @Jamie: >> Great suggestion, I just have one small objection to it ... I wouldn't >> want to mix the core's name with the collection's configName. Wouldn't >> you also want to keep the two separate for clarity? What do you think >> about that? >> >> @Yury: >> Overall what you said makes sense and I'll roll with it. But FYI, >> through experimentation I found out that collection="myconf" does not >> become the value for configName when I inspect ZooKeeper.jsp, here's >> an example of what shows up if I setup the solr.xml file but don't say >> anything in the cmd line startup: >> >> myconf (v=0 children=1) "configName=configuration1" >> >> But perhaps that's exactly what you are trying to warn me about. I'll >> experiment more and get back. >> >> - Pulkit >> >> On Fri, Sep 9, 2011 at 10:17 PM, Jamie Johnson wrote: >>> as a note you could change out the values in solr.xml to be as follows >>> and pull these values from System Properties. >>> >>> >>> >>> >>> >>> unless someone says otherwise, but the quick tests I've run seem to >>> work perfectly well with this setup. >>> >>> 2011/9/9 Yury Kats : >>>> On 9/9/2011 6:54 PM, Pulkit Singhal wrote: >>>>> Thanks Again. >>>>> >>>>> Another question: >>>>> >>>>> My solr.xml has: >>>>> >>>>> >>>>> >>>>> >>>>> And I omitted -Dcollection.configName=myconf from the startup command >>>>> because I felt that specifying collection="myconf" should take care of >>>>> that: >>>>> cd /trunk/solr/example >>>>> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar >>>> >>>> With this you are telling ZK to bootstrap a collection with content of specific >>>> files, but you don't tell what collection that should be. >>>> >>>> Hence you want collection.configName parameter, and you want >>>> solr.xml to reference the same name in 'collection' attribute for the cores, >>>> so that SolrCloud knows where to pull configuration for that core from. >>>> >>>> >>>> >>> >> >
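To restate the two knobs as the experiments above suggest: the collection attribute in solr.xml names the collection a core belongs to, while collection.configName names the configuration set in ZooKeeper that the collection points at. As a sketch (the core element is a placeholder, since the original XML did not survive the archive):

solr.xml:  <core name="collection1" instanceDir="." collection="scaleDeep" />
startup:   java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar

which is what produces the "/collections -> scaleDeep -> configName=myconf" layout shown in Example config # 1.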
Re: Replication setup with SolrCloud/Zk
Hi Yury,

How do you manage to start the instances without any issues? The way I see it, no matter which instance is started first, the slave will complain about not being able to find its respective master because that instance hasn't been started yet ... no?

Thanks,
- Pulkit

2011/5/17 Yury Kats > On 5/17/2011 10:17 AM, Stefan Matheis wrote: > > Yury, > > > > perhaps Java-Params (like used for this sample: > > > http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node > ) > > can help you? > > Ah, thanks! It does seem to work! > > Cluster's solrconfig.xml (shared between all Solr instances and cores via > SolrCloud/ZK): > > > ${enable.master:false} > commit >startup > > >${enable.slave:false} >00:01:00 >http:// > ${masterHost:xyz}/solr/master/replication > > > > Node 1 solr.xml: > > collection="myconf" > > > > collection="myconf"> > > > > > > Node 2 solr.xml: > > collection="myconf" > > > > collection="myconf"> > > > > > >
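The quoted solrconfig.xml appears to have lost its XML tags in the archive; judging by the property names that survived, it was presumably along these lines:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="pollInterval">00:01:00</str>
    <str name="masterUrl">http://${masterHost:xyz}/solr/master/replication</str>
  </lst>
</requestHandler>

with each node's solr.xml (also garbled above) supplying enable.master, enable.slave and masterHost per core, so that the master and slave of each shard land on different nodes.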
Re: Replication setup with SolrCloud/Zk
Sorry, stupid question, now I see that the core still starts and the polling process simply logs an error: SEVERE: Master at: http://localhost:7574/solr/master2/replication is not available. Index fetch failed. Exception: Connection refused I was able to setup the instructions in-detail with this thread's help here: http://pulkitsinghal.blogspot.com/2011/09/multicore-master-slave-replication-in.html Thanks, - Pulkit On Sat, Sep 10, 2011 at 2:54 PM, Pulkit Singhal wrote: > Hi Yury, > > How do you manage to start the instances without any issues? The way I see > it, no matter which instance is started first, the slave will complain about > not being to find its respective master because that instance hasn't been > started yet ... no? > > Thanks, > - Pulkit > > 2011/5/17 Yury Kats > >> On 5/17/2011 10:17 AM, Stefan Matheis wrote: >> > Yury, >> > >> > perhaps Java-Pararms (like used for this sample: >> > >> http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node >> ) >> > can help you? >> >> Ah, thanks! It does seem to work! >> >> Cluster's solrconfig.xml (shared between all Solr instances and cores via >> SolrCloud/ZK): >> >> >> ${enable.master:false} >> commit >>startup >> >> >>${enable.slave:false} >>00:01:00 >>http:// >> ${masterHost:xyz}/solr/master/replication >> >> >> >> Node 1 solr.xml: >> >>> collection="myconf" > >> >> >>> collection="myconf"> >> >> >> >> >> >> Node 2 solr.xml: >> >>> collection="myconf" > >> >> >>> collection="myconf"> >> >> >> >> >> >> >
Re: Example Solr Config on EC2
Just to clarify, that link doesn't do anything to promote an already running slave into a master. One would have to bounce the Solr node which has that slave and then make the shift. It is not something that happens at runtime live. On Wed, Aug 10, 2011 at 4:04 PM, Akshay wrote: > Yes you can promote a slave to be master refer > > http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node > > In AWS one can use an elastic IP(http://aws.amazon.com/articles/1346) to > refer to the master and this can be assigned to slaves as they assume the > role of master(in case of failure). All slaves will then refer to this new > master and there will be no need to regenerate data. > > Automation of this maybe possible through CloudWatch alarm-actions. I don't > know of any available example automation scripts. > > Cheers > Akshay. > > On Wed, Aug 10, 2011 at 9:08 PM, Matt Shields > wrote: > > > If I were to build a master with multiple slaves, is it possible to > promote > > a slave to be the new master if the original master fails? Will all the > > slaves pickup right where they left off, or any time the master fails > will > > we need to completely regenerate all the data? > > > > If this is possible, are there any examples of this being automated? > > Especially on Win2k3. > > > > Matthew Shields > > Owner > > BeanTown Host - Web Hosting, Domain Names, Dedicated Servers, Colocation, > > Managed Services > > www.beantownhost.com > > www.sysadminvalley.com > > www.jeeprally.com > > > > > > > > On Mon, Aug 8, 2011 at 5:34 PM, wrote: > > > > > Matthew, > > > > > > Here's another resource: > > > > > > > > > http://www.lucidimagination.com/blog/2010/02/01/solr-shines-through-the-cloud-lucidworks-solr-on-ec2/ > > > > > > > > > Michael Bohlig > > > Lucid Imagination > > > > > > > > > > > > - Original Message > > > From: Matt Shields > > > To: solr-user@lucene.apache.org > > > Sent: Mon, August 8, 2011 2:03:20 PM > > > Subject: Example Solr Config on EC2 > > > > > > I'm looking for some examples of how to setup Solr on EC2. The > > > configuration I'm looking for would have multiple nodes for redundancy. > > > I've tested in-house with a single master and slave with replication > > > running in Tomcat on Windows Server 2003, but even if I have multiple > > > slaves > > > the single master is a single point of failure. Any suggestions or > > example > > > configurations? The project I'm working on is a .NET setup, so ideally > > I'd > > > like to keep this search cluster on Windows Server, even though I > prefer > > > Linux. > > > > > > Matthew Shields > > > Owner > > > BeanTown Host - Web Hosting, Domain Names, Dedicated Servers, > Colocation, > > > Managed Services > > > www.beantownhost.com > > > www.sysadminvalley.com > > > www.jeeprally.com > > > > > > > > >
How to combine RSS w/ Tika when using Data Import Handler (DIH)
Given an RSS raw feed source link such as the following: http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn I can easily get to the value of the description for an item like so: But the content of "description" happens to be in HTML and sadly it is this HTML chunk that has some pretty decent information that I would like to import as well. 1) For example it has the image for the item: http://ecx.images-amazon.com/images/I/51yyAAoYzKL._SL160_SS160_.jpg"; ... /> 2) It has the price for the item: $13.99 And many other useful pieces of data that aren't in a proper rss format but they are simply thrown together inside the html chunk that is served as the value for the xpath="/rss/item/description" So, how can I configure DIH to start importing this html information as well? Is Tika the way to go? Can someone give a brief example of what a config file with both Tika config and RSS config would/should look like? Thanks! - Pulkit
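For context, the DIH entity being described (the <field> elements were stripped by the archive) is roughly of this shape, which is also the direction the later RegexTransformer messages in this archive take; the entity name and feed URL are placeholders:

<entity name="amazonRss"
        pk="link"
        url="http://persistent.info/cgi-bin/feed-proxy?url=..."
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        transformer="RegexTransformer,HTMLStripTransformer">
  <field column="description" xpath="/rss/channel/item/description" />
  <field column="price" regex=".*\$(\d*.\d*)" sourceColName="description" />
</entity>

i.e. the HTML chunk is pulled once via XPath and the extra fields are derived from it with RegexTransformer, rather than with Tika.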
Re: Parameter not working for master/slave
Hello Bill,

I can't really answer your question about replication being supported on Solr 3.3 (I use trunk 4.x myself), BUT I can tell you that if each Solr node has just one core ... only then does it make sense to use -Denable.master=true and -Denable.slave=true ... otherwise, as Yury points out, you should use solr.xml to pass in the value for each core individually.

What is a node, you ask? To me it means one app server (Jetty) running Solr ... it doesn't matter if it's multiple ones on the same machine or single ones on different machines. That's what I mean by a node here.

2011/9/12 Yury Kats > On 9/11/2011 11:24 PM, William Bell wrote: > > I am using 3.3 SOLR. I tried passing in -Denable.master=true and > > -Denable.slave=true on the Slave machine. > > Then I changed solrconfig.xml to reference each as per: > > > > > http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node > > These are core parameters, you need to set them in solr.xml per core. >
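A sketch of the per-core approach Yury is pointing at, using property elements in solr.xml so that each core resolves the variables differently (the core names here are placeholders):

<cores adminPath="/admin/cores">
  <core name="master1" instanceDir="master1">
    <property name="enable.master" value="true" />
    <property name="enable.slave"  value="false" />
  </core>
  <core name="slave2" instanceDir="slave2">
    <property name="enable.master" value="false" />
    <property name="enable.slave"  value="true" />
  </core>
</cores>

The solrconfig.xml then references ${enable.master:false} and ${enable.slave:false}, as in the replication thread elsewhere in this archive.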
Re: Re; DIH Scheduling
I don't see, anywhere in http://issues.apache.org/jira/browse/SOLR-2305, any statement showing that the code's inclusion was "decided against". When did this happen, and what is needed from the community before someone with the powers to do so will actually commit this?

2011/6/24 Noble Paul നോബിള് नोब्ळ् > On Thu, Jun 23, 2011 at 9:13 PM, simon wrote: > > The Wiki page describes a design for a scheduler, which has not been > > committed to Solr yet (I checked). I did see a patch the other day > > (see https://issues.apache.org/jira/browse/SOLR-2305) but it didn't > > look well tested. > > > > I think that you're basically stuck with something like cron at this > > time. If your application is written in java, take a look at the > > Quartz scheduler - http://www.quartz-scheduler.org/ > > It was considered and decided against. > > > > -Simon > > > > > > -- > - > Noble Paul >
Re: How to combine RSS w/ Tika when using Data Import Handler (DIH)
Hello Everyone, I've been investigating and I understand that using the RegexTransformer is an option that is open for identifying and extracting data to multiple fields from a single rss value source ... But rather than hack together something I once again wanted to check with the community: Is there another option for navigating the HTML DOM tree using some well-tested transformer or TIka or something? Thanks! - Pulkit On Mon, Sep 12, 2011 at 1:45 PM, Pulkit Singhal wrote: > Given an RSS raw feed source link such as the following: > > http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn > > I can easily get to the value of the description for an item like so: > > > But the content of "description" happens to be in HTML and sadly it is this > HTML chunk that has some pretty decent information that I would like to > import as well. > 1) For example it has the image for the item: > http://ecx.images-amazon.com/images/I/51yyAAoYzKL._SL160_SS160_.jpg"; ... > /> > 2) It has the price for the item: > $13.99 > And many other useful pieces of data that aren't in a proper rss format but > they are simply thrown together inside the html chunk that is served as the > value for the xpath="/rss/item/description" > > So, how can I configure DIH to start importing this html information as > well? > Is Tika the way to go? > Can someone give a brief example of what a config file with both Tika > config and RSS config would/should look like? > > Thanks! > - Pulkit >
Re: DIH load only selected documents with XPathEntityProcessor
This solution doesn't seem to be working for me. I am using Solr trunk and I have the same question as Bernd with a small twist: the field that should NOT be empty, happens to be a derived field called price, see the config below: ... I have also changed the sample script to check the price field isntead of the link field that was being used as an example in this thread earlier: Does anyone have any thoughts on what I'm missing? Thanks! - Pulkit On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling < bernd.fehl...@uni-bielefeld.de> wrote: > Hi Gora, > > thanks a lot, very nice solution, works perfectly. > I will dig more into ScriptTransformer, seems to be very powerful. > > Regards, > Bernd > > Am 08.01.2011 14:38, schrieb Gora Mohanty: > > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling > > wrote: > >> Hello list, > >> > >> is it possible to load only selected documents with > XPathEntityProcessor? > >> While loading docs I want to drop/skip/ignore documents with missing > URL. > >> > >> Example: > >> > >> > >>first title > >>identifier_01 > >>http://www.foo.com/path/bar.html > >> > >> > >>second title > >>identifier_02 > >> > >> > >> > >> > >> The first document should be loaded, the second document should be > ignored > >> because it has an empty link (should also work for missing link field). > > [...] > > > > You can use a ScriptTransformer, along with $skipRow/$skipDoc. > > E.g., something like this for your data import configuration file: > > > > > > > function skipRow(row) { > > var link = row.get( 'link' ); > > if( link == null || link == '' ) { > > row.put( '$skipRow', 'true' ); > > } > > return row; > > } > > ]]> > > > > > > > baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'" > > recursive="true" rootEntity="false" dataSource="null"> > > > forEach="/documents/document" url="${f.fileAbsolutePath}" > > transformer="script:skipRow"> > > > > > > > > > > > > > > > > > > Regards, > > Gora >
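Piecing the configuration together from this message, the quoted example in it, and the follow-up message (the archive stripped most of the XML), the setup being described is approximately:

<entity ... transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
  <field column="description" xpath="/rss/channel/item/description" />
  <field column="price" regex=".*\$(\d*.\d*)" sourceColName="description" />
  ...
</entity>

<script><![CDATA[
  function skipRow(row) {
    var price = row.get('price');
    if (price == null || price == '') {
      row.put('$skipRow', 'true');
    }
    return row;
  }
]]></script>

This is only a reconstruction of what was tried, not a fix; the thread does not record why it failed.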
Re: DIH load only selected documents with XPathEntityProcessor
Oh and I"m sure that I'm using Java 6 because the properties from the Solr webpage spit out: java.runtime.version = 1.6.0_26-b03-384-10M3425 On Tue, Sep 13, 2011 at 4:15 PM, Pulkit Singhal wrote: > This solution doesn't seem to be working for me. > > I am using Solr trunk and I have the same question as Bernd with a small > twist: the field that should NOT be empty, happens to be a derived field > called price, see the config below: > >transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer, > script:skipRow"> > >xpath="/rss/channel/item/description" > /> > > regex=".*\$(\d*.\d*)" > sourceColName="description" > /> > ... > > > I have also changed the sample script to check the price field isntead of > the link field that was being used as an example in this thread earlier: > > > > <![CDATA[ > function skipRow(row) { > var price = row.get( 'price' ); > if ( price == null || price == '' ) { > > row.put( '$skipRow', 'true' ); > } > return row; > } > ]]> > > > Does anyone have any thoughts on what I'm missing? > Thanks! > - Pulkit > > > On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling < > bernd.fehl...@uni-bielefeld.de> wrote: > >> Hi Gora, >> >> thanks a lot, very nice solution, works perfectly. >> I will dig more into ScriptTransformer, seems to be very powerful. >> >> Regards, >> Bernd >> >> Am 08.01.2011 14:38, schrieb Gora Mohanty: >> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling >> > wrote: >> >> Hello list, >> >> >> >> is it possible to load only selected documents with >> XPathEntityProcessor? >> >> While loading docs I want to drop/skip/ignore documents with missing >> URL. >> >> >> >> Example: >> >> >> >> >> >>first title >> >>identifier_01 >> >>http://www.foo.com/path/bar.html >> >> >> >> >> >>second title >> >>identifier_02 >> >> >> >> >> >> >> >> >> >> The first document should be loaded, the second document should be >> ignored >> >> because it has an empty link (should also work for missing link field). >> > [...] >> > >> > You can use a ScriptTransformer, along with $skipRow/$skipDoc. >> > E.g., something like this for your data import configuration file: >> > >> > >> > <![CDATA[ >> > function skipRow(row) { >> > var link = row.get( 'link' ); >> > if( link == null || link == '' ) { >> > row.put( '$skipRow', 'true' ); >> > } >> > return row; >> > } >> > ]]> >> > >> > >> > > > baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'" >> > recursive="true" rootEntity="false" dataSource="null"> >> > > > forEach="/documents/document" url="${f.fileAbsolutePath}" >> > transformer="script:skipRow"> >> > >> > >> > >> > >> > >> > >> > >> > >> > Regards, >> > Gora >> > >
DIH skipping imports with skipDoc vs skipRow
Hello, 1) The documented explanation of skipDoc and skipRow is not enough for me to discern the difference between them: $skipDoc : Skip the current document . Do not add it to Solr. The value can be String true/false $skipRow : Skip the current row. The document will be added with rows from other entities. The value can be String true/false Can someone please elaborate and help me out with an example? 2) I am working off the Solr trunk (4.x) and nothing I do seems to make the import for a given row/doc get skipped. As proof I've added these tests to my data import xml and all the rows are still getting indexed!!! If anyone sees something wrong with my config please tell me. Make sure to take note of the blatant use of row.put( '$skipDoc', 'true' ); and Yet stuff still gets imported, this is beyond me. Need a fresh pair of eyes :) http://www.amazon.com/gp/rss/new-releases/apparel/1040660/ref=zg_bsnr_1040660_rsslink"; processor="XPathEntityProcessor" forEach="/rss/channel | /rss/channel/item" transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow,TemplateTransformer"> Thanks! - Pulkit
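For what it's worth, the intended difference between the two flags, going by the wiki text quoted above, is that $skipRow only drops the current entity's row (a document assembled from other entities may still be indexed), while $skipDoc drops the whole document. In ScriptTransformer terms, a sketch (the function name and the price check are illustrative only):

function skipIfNoPrice(row) {
  var price = row.get('price');
  if (price == null || price == '') {
    row.put('$skipRow', 'true');    // drop just this row of this entity
    // row.put('$skipDoc', 'true'); // use this instead to drop the whole document
  }
  return row;
}

This sketch does not explain why neither flag took effect in the run described above.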
RegexTransformer - need help with regex value
Hello, Feel free to point me to alternate sources of information if you deem this question unworthy of the Solr list :) But until then please hear me out! When my config is something like: I don't get any data. But when my config is like: I get the following data as the value for imageUrl: http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_.gif"; width="64" As the result shows, this is a string that should be able to match even on the 1st regex=".*img src=.(.*)\.gif..alt=.*" and produce a result like: http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_ But it doesn't! Can anyone tell me why that would be the case? Is it something about the way RegexTransformer is wired or is it just my regex value that isn't right?
Re: RegexTransformer - need help with regex value
Thanks a bunch, got it working with a reluctant quantifier and the use of &quot; as the escaped representation of double quotes within the regex value, so that the config file doesn't crash & burn.

Cheers,
- Pulkit

On Wed, Sep 14, 2011 at 2:24 PM, Pulkit Singhal wrote: > Hello, > > Feel free to point me to alternate sources of information if you deem > this question unworthy of the Solr list :) > > But until then please hear me out! > > When my config is something like: > regex=".*img src=.(.*)\.gif..alt=.*" > sourceColName="description" > /> > I don't get any data. > > But when my config is like: > regex=".*img src=.(.*)..alt=.*" > sourceColName="description" > /> > I get the following data as the value for imageUrl: > http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_.gif"; > width="64" > > As the result shows, this is a string that should be able to match > even on the 1st regex=".*img src=.(.*)\.gif..alt=.*" and produce a > result like: > http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_ > But it doesn't! > Can anyone tell me why that would be the case? > Is it something about the way RegexTransformer is wired or is it just > my regex value that isn't right? >
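The exact working expression did not survive the archive's HTML stripping, but based on the description above (a reluctant quantifier plus &quot; for the quote characters) it was presumably something along the lines of:

<field column="imageUrl"
       regex=".*img src=&quot;(.*?)\.gif&quot;.*"
       sourceColName="description" />

with the reluctant (.*?) stopping at the first .gif&quot; instead of greedily swallowing the rest of the HTML.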
[DIH] How to combine Regex and HTML transformers
Hello, I need to pull out the price and imageURL for products in an Amazon RSS feed. PROBLEM STATEMENT: The following: works but I am left with html junk inside the description! USELESS WORKAROUND: If I try to strip the html from the data being fed into description while letting the price and imageURL know of the direct path of the RSS feed field like so: then this fails and only the last configured field in this list (imageURL) ends up having any data imported. Is this a bug? CRUX OF THE PROBLEM: Also I tried to then create a field just to store the raw html data like so but this configuration yields no results for the description field so I'm back to where I started: I was suspicious of trying to combine sourceColName with stripHTML to begin with ... I suppose that I was hoping that the regex transformer will run first and copy all the html data as-is which will then be stripped out later by the HTML transformer but this didn't work. Why? what else can I do? Thanks! - Pulkit
Generating large datasets for Solr proof-of-concept
Hello Everyone, I have a goal of populating Solr with a million unique products in order to create a test environment for a proof of concept. I started out by using DIH with Amazon RSS feeds but I've quickly realized that there's no way I can glean a million products from one RSS feed. And I'd go mad if I just sat at my computer all day looking for feeds and punching them into DIH config for Solr. Has anyone ever had to create large mock/dummy datasets for test environments or for POCs/Demos to convince folks that Solr was the wave of the future? Any tips would be greatly appreciated. I suppose it sounds a lot like crawling even though it started out as innocent DIH usage. - Pulkit
Re: Generating large datasets for Solr proof-of-concept
Ah, missing } ... doh!

BTW I still welcome any ideas on how to build an e-commerce test base. It doesn't have to be Amazon, that was just my approach. Anyone?

- Pulkit

On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal wrote: > Thanks for all the feedback thus far. Now to get little technical about it :) > > I was thinking of feeding a file with all the tags of amazon that > yield close to roughly 5 results each into a file and then running > my rss DIH off of that, I came up with the following config but > something is amiss, can someone please point out what is off about > this? > > > processor="LineEntityProcessor" > url="file:///xxx/yyy/zzz/amazonfeeds.txt" > rootEntity="false" > dataSource="myURIreader1" > transformer="RegexTransformer,DateFormatTransformer" > > > pk="link" > url="${amazonFeeds.rawLine" > processor="XPathEntityProcessor" > forEach="/rss/channel | /rss/channel/item" > > transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow"> > ... > > The rawline should feed into the url key but instead i get: > > Caused by: java.net.MalformedURLException: no protocol: > null${amazonFeeds.rawLine > at > org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90) > > Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback > INFO: start rollback > > Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback > SEVERE: Exception while solr rollback. > > Thanks in advance! > > On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma > wrote: >> If we want to test with huge amounts of data we feed portions of the >> internet. >> The problem is it takes a lot of bandwith and lots of computing power to get >> to a `reasonable` size. On the positive side, you deal with real text so it's >> easier to tune for relevance. >> >> I think it's easier to create a simple XML generator with mock data, prices, >> popularity rates etc. It's fast to generate millions of mock products and >> once >> you have a large quantity of XML files, you can easily index, test, change >> config or schema and reindex. >> >> On the other hand, the sample data that comes with the Solr example is a good >> set as well as it proves the concepts well, especially with the stock >> Velocity >> templates. >> >> We know Solr will handle enormous sets but quantity is not always a part of a >> PoC. >> >>> Hello Everyone, >>> >>> I have a goal of populating Solr with a million unique products in >>> order to create a test environment for a proof of concept. I started >>> out by using DIH with Amazon RSS feeds but I've quickly realized that >>> there's no way I can glean a million products from one RSS feed. And >>> I'd go mad if I just sat at my computer all day looking for feeds and >>> punching them into DIH config for Solr. >>> >>> Has anyone ever had to create large mock/dummy datasets for test >>> environments or for POCs/Demos to convince folks that Solr was the >>> wave of the future? Any tips would be greatly appreciated. I suppose >>> it sounds a lot like crawling even though it started out as innocent >>> DIH usage. >>> >>> - Pulkit >> >
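For anyone skimming: the one-character fix is the closing brace in the variable reference of the nested entity's url attribute, i.e.

url="${amazonFeeds.rawLine}"

instead of url="${amazonFeeds.rawLine", which is why URLDataSource saw the literal, unresolved string and reported "no protocol".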
Re: Generating large datasets for Solr proof-of-concept
Thanks for all the feedback thus far. Now to get little technical about it :) I was thinking of feeding a file with all the tags of amazon that yield close to roughly 5 results each into a file and then running my rss DIH off of that, I came up with the following config but something is amiss, can someone please point out what is off about this? ... The rawline should feed into the url key but instead i get: Caused by: java.net.MalformedURLException: no protocol: null${amazonFeeds.rawLine at org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90) Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: start rollback Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback SEVERE: Exception while solr rollback. Thanks in advance! On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma wrote: > If we want to test with huge amounts of data we feed portions of the internet. > The problem is it takes a lot of bandwith and lots of computing power to get > to a `reasonable` size. On the positive side, you deal with real text so it's > easier to tune for relevance. > > I think it's easier to create a simple XML generator with mock data, prices, > popularity rates etc. It's fast to generate millions of mock products and once > you have a large quantity of XML files, you can easily index, test, change > config or schema and reindex. > > On the other hand, the sample data that comes with the Solr example is a good > set as well as it proves the concepts well, especially with the stock Velocity > templates. > > We know Solr will handle enormous sets but quantity is not always a part of a > PoC. > >> Hello Everyone, >> >> I have a goal of populating Solr with a million unique products in >> order to create a test environment for a proof of concept. I started >> out by using DIH with Amazon RSS feeds but I've quickly realized that >> there's no way I can glean a million products from one RSS feed. And >> I'd go mad if I just sat at my computer all day looking for feeds and >> punching them into DIH config for Solr. >> >> Has anyone ever had to create large mock/dummy datasets for test >> environments or for POCs/Demos to convince folks that Solr was the >> wave of the future? Any tips would be greatly appreciated. I suppose >> it sounds a lot like crawling even though it started out as innocent >> DIH usage. >> >> - Pulkit >
How to set up the schema to avoid NumberFormatException
Hello Folks,

Surprisingly, the value from the following raw data gives me a NumberFormatException (NFE) when running the DIH (Data Import Handler):

$1,000.00

The error logs look like:

Caused by: org.apache.solr.common.SolrException: Error while creating field 'price{type=sdouble,properties=indexed,stored,omitNorms,sortMissingLast}' from value '1,000'
at org.apache.solr.schema.FieldType.createField(FieldType.java:249)
at org.apache.solr.schema.SchemaField.createField(SchemaField.java:102)
at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:198)
at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:257)
... 13 more
Caused by: java.lang.NumberFormatException: For input string: "1,000"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222)
at java.lang.Double.parseDouble(Double.java:510)
at org.apache.solr.util.NumberUtils.double2sortableStr(NumberUtils.java:129)
at org.apache.solr.schema.SortableDoubleField.toInternal(SortableDoubleField.java:61)
at org.apache.solr.schema.FieldType.createField(FieldType.java:247)

It is pretty obvious from this that the "sdouble" schema fieldtype is not set up to parse out group-separators from a number.

1) Then my question is: which type of schema fieldtype will parse out the comma group-separator from 1,000?

2) Also, shouldn't we think about making locale-based parsing be part of this stack trace as well?

at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222)
at java.lang.Double.parseDouble(Double.java:510)
at org.apache.solr.util.NumberUtils.double2sortableStr(NumberUtils.java:129)

Thanks!
- Pulkit
Miscellaneous DIH related questions
My DIH's full-import logs end with a trailing output saying that 1500 documents were added, which is correct because I have 16 sources, one of them was down, and each source is supposed to give me 100 results:

(1500 adds)],optimize=} 0 0

But when I check my document count I get only 1384 results:

INFO: [rss] webapp=/solr path=/select params={start=0&q=*:*&rows=0} hits=1384 status=0 QTime=0

1) I think I may have duplicates based on the primary key for the data coming in. Is there any other explanation than that?

2) Is there some way to get a log of how many documents were deleted? Because an update does a delete then add, this would allow me to make sure of what is going on.

The sources I have are URL based, and sometimes they appear to be down because the request gets denied, I suppose:

SEVERE: Exception thrown while getting data
java.io.FileNotFoundException: http://www.amazon.com/rss/tag/anime/popular/ref=tag_tdp_rss_pop_man?length=100
Caused by: java.io.FileNotFoundException: http://www.amazon.com/rss/tag/anime/popular/ref=tag_tdp_rss_pop_man?length=100
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1434)

3) Is there some way to configure the datasource to retry 3 times or something like that? I have increased the values for connectionTimeout and readTimeout, but that doesn't help when the server simply denies the request due to heavy load. I need to be able to retry at those times. The onError attribute has only the abort, skip, and continue options, none of which really let me retry anything.

Thank You.
- Pulkit
Re: Generating large datasets for Solr proof-of-concept
Thanks Hoss. I agree that the way you restated the question is better for getting results. BTW I think you've tipped me off to exactly what I needed with this URL: http://bbyopen.com/ Thanks! - Pulkit On Fri, Sep 16, 2011 at 4:35 PM, Chris Hostetter wrote: > > : Has anyone ever had to create large mock/dummy datasets for test > : environments or for POCs/Demos to convince folks that Solr was the > : wave of the future? Any tips would be greatly appreciated. I suppose > : it sounds a lot like crawling even though it started out as innocent > : DIH usage. > > the better question to ask is where you can find good sample data sets for > building proof of concept implementations. > > If you want an example of product data, the best buy product catalog is > available for developers using either an API or a bulk download of xml > files... > > http://bbyopen.com/ > > ...last time i looked (~1 year ago) there were about 1 million products in > the data dump. > > > -Hoss >
Re: JSON and DataImportHandler
Any updates on this topic? On Fri, Jul 16, 2010 at 5:36 PM, P Williams wrote: > Hi All, > > Has anyone gotten the DataImportHandler to work with json as input? Is > there an even easier alternative to DIH? Could you show me an example? > > Many thanks, > Tricia >
Re: JSON and DataImportHandler
Ah I see now: http://wiki.apache.org/solr/UpdateJSON#Example Not part of DIH that's all. On Sun, Sep 18, 2011 at 5:42 PM, Pulkit Singhal wrote: > Any updates on this topic? > > On Fri, Jul 16, 2010 at 5:36 PM, P Williams > wrote: >> Hi All, >> >> Has anyone gotten the DataImportHandler to work with json as input? Is >> there an even easier alternative to DIH? Could you show me an example? >> >> Many thanks, >> Tricia >> >
JSON indexing failing...
Hello, I am running a simple test after reading: http://wiki.apache.org/solr/UpdateJSON I am only using one object from a large json file to test and see if the indexing works: curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @productSample.json -H 'Content-type:application/json' The data is from bbyopen.com, I've attached the one single object that I'm testing with. The indexing process fails with: Sep 19, 2011 2:37:54 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: invalid key: url [1701] at org.apache.solr.handler.JsonLoader.parseDoc(JsonLoader.java:355) I thought that any json attributes that did not have a mapping in the schema.xml file would simply not get indexed. (a) Is this not true? But this error made me retry after adding url to schema.xml file: I retried after a restart but I still keep getting the same error! (b) Can someone wise perhaps point me in the right direction for troubleshooting this issue? Thank You! - Pulkit productSample.json Description: application/json
How does Solr deal with JSON data?
Hello Everyone, I'm quite curious about how does the following data get understood and indexed by Solr? [{ "id":"Fubar", "url": null, "regularPrice": 3.99, "offers": [ { "url": "", "text": "On Sale", "id": "OS" } ] }] 1) The field "id" is present as part of the main object and as part of a nested offers object, so how does Solr make sense of it? 2) Is the data under offers expected to be stored as multi-value in Solr? Or am I supposed to create offerURL, offerText and offerId fields in schema.xml? Even if I do that how do I tell Solr what data to match up where? Please be kind, I know I'm not thinking about this in the right manner, just gently set me straight about all this :) - Pulkit
Re: JSON indexing failing...
Ok a little bit of deleting lines from the json file led me to realize that Solr isn't happy with the following: "offers": [ { "url": "", "text": "On Sale", "id": "OS" } ], But as to why? Or what to do to remedy this ... I have no clue :( - Pulkit On Mon, Sep 19, 2011 at 2:45 PM, Pulkit Singhal wrote: > Hello, > > I am running a simple test after reading: > http://wiki.apache.org/solr/UpdateJSON > > I am only using one object from a large json file to test and see if > the indexing works: > curl 'http://localhost:8983/solr/update/json?commit=true' > --data-binary @productSample.json -H 'Content-type:application/json' > > The data is from bbyopen.com, I've attached the one single object that > I'm testing with. > > The indexing process fails with: > Sep 19, 2011 2:37:54 PM org.apache.solr.common.SolrException log > SEVERE: org.apache.solr.common.SolrException: invalid key: url [1701] > at org.apache.solr.handler.JsonLoader.parseDoc(JsonLoader.java:355) > > I thought that any json attributes that did not have a mapping in the > schema.xml file would simply not get indexed. > (a) Is this not true? > > But this error made me retry after adding url to schema.xml file: > > I retried after a restart but I still keep getting the same error! > (b) Can someone wise perhaps point me in the right direction for > troubleshooting this issue? > > Thank You! > - Pulkit >
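The /update/json handler of that era expects flat documents: field/value pairs where a value may be a primitive or an array of primitives for a multiValued field, but not a nested object. So one workaround (assuming matching fields are declared in schema.xml, e.g. offers_url, offers_text, offers_id) is to pre-flatten the record before posting, roughly:

[{
  "id": "Fubar",
  "regularPrice": 3.99,
  "offers_url":  [""],
  "offers_text": ["On Sale"],
  "offers_id":   ["OS"]
}]

The field names here are made up for illustration; the flattening itself has to happen outside Solr, before the curl call.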
Troubleshooting OOM in DIH w/ FileListEntityProcessor and XPathEntityProcessor
Hello Everyone, I need help in: (a) figuring out the causes of OutOfMemoryError (OOM) when I run Data Import Handler (DIH), (b) finding workarounds and fixes to get rid of the OOM issue per cause. The stacktrace is at the very bottom to avoid having your eyes glaze over and to prevent you from skipping this thread ;) 1) Based on the documentation so far, I would say that "batchSize" based control does not exist for FileListEntityProcessor or XPathEntityProcessor. Please correct me if I'm wrong about this. 2) The files being processed by FileListEntityProcessor range from 90.9 to 2.8 MB in size. 2.1) Is there some way to let FileListEntityProcessor bring in only one file at a time? Or is that the default already? 2.2) Is there some way to let FileListEntityProcessor stream the file to its nested XPathEntityProcessor? 2.3) If streaming a file is something that should be configured directly on XPathEntityProcessor, then please let me know how to do that as well. 3) Where are the default xms and xmx for Solr configured? Please let me know so I may try tweaking them for startup. STACKTRACE: SEVERE: Exception while processing: bbyopenProductsArchive document : null: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718) ... Caused by: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2734) at java.util.ArrayList.toArray(ArrayList.java:275) at java.util.ArrayList.(ArrayList.java:131) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.getDeepCopy(XPathRecordReader.java:586) ... INFO: start rollback Sep 20, 2011 4:22:26 PM org.apache.solr.handler.dataimport.SolrWriter rollback SEVERE: Exception while solr rollback. java.lang.NullPointerException at org.apache.solr.update.DefaultSolrCoreState.rollbackIndexWriter(DefaultSolrCoreState.java:73)
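On question 3: there is no Solr config file that sets the heap; -Xms/-Xmx are JVM flags on whatever command launches the servlet container. For the example Jetty setup used elsewhere in this archive, that means something like:

cd trunk/solr/example
java -Xms512m -Xmx2g -jar start.jar

The sizes are placeholders; pick values that fit the machine. This only answers where the default heap settings come from, not questions 2.1-2.3 about streaming.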
Re: How to set up the schema to avoid NumberFormatException
Hi Hoss, Thanks for the input! Something rather strange happened. I fixed my regex such that instead of returning just 1,000 ... it would return 1,000.00 and voila it worked! So Parsing group separators is already supported apparently then ... its just that the format is also looking for a decimal-separator and digits after that ... weird huh? - Pulkit On Fri, Sep 16, 2011 at 10:53 AM, Chris Hostetter wrote: > > : It is pretty obvious from this that the "sdouble" schema fieldtype is > : not setup to parse out group-separators from a number. > > correct. the numeric (and date) field types are all designed to deal with > conversion of the canonical string represetantion. > > : 1) Then my question is which type pf schema fieldtype will parse out > : the comma group-separator from 1,000? > > that depends on how you wnat to interpret/use those values.. > > : 2) Also, shouldn't we think about making locale based parsing be part > : of this stack trace as well? > > Not in the field types. > > 1) adding extra parse logic there would be inefficient for people who are > only ever sending well formed data. > 2) as a client/server setup, it would be a bad idea for hte server to > assume the client is using the same locale > > The right place in the stack for this type of logic would be in an > UpdateProcessor (for indexing docs) or in a > QueryParser/DocTransformer (for querying / writing back values in the > results). > > Solr could certainly use some more generla purpose UpdateProcessors for > parsing various non-canonical input formats (we've talked about one for > doing rule based SimpleDateParsing as well) if you'd like to take a stab > at writting one and contributing it. > > > -Hoss >
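To make the UpdateProcessor suggestion above a bit more concrete, here is a rough sketch of what such a processor could look like. This is hypothetical code, not an existing Solr component: the class name and the "price" field are made up, the locale is hard-coded to US, it would still need to be registered in an updateRequestProcessorChain in solrconfig.xml, and the exact package names (e.g. for SolrQueryResponse) shift a little between Solr versions.

import java.io.IOException;
import java.text.NumberFormat;
import java.util.Locale;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

/** Hypothetical processor that parses locale-grouped numbers (e.g. "1,000") before indexing. */
public class ParseGroupedNumberProcessorFactory extends UpdateRequestProcessorFactory {

  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object raw = doc.getFieldValue("price"); // hypothetical field name
        if (raw instanceof String) {
          try {
            Number parsed = NumberFormat.getInstance(Locale.US).parse((String) raw);
            doc.setField("price", parsed.doubleValue());
          } catch (java.text.ParseException e) {
            // leave the original value alone; the field type will reject it if it really is junk
          }
        }
        super.processAdd(cmd);
      }
    };
  }
}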
How to skip fields when using DIH?
The data I'm running through the DIH looks like: false false 349.99 As you can see, in this particular instance of a product, there is no value for "salesRankShortTerm" which happens to be defined in my schema like so: Having an empty value in the incoming DIH data leads to an exception: Caused by: java.lang.NumberFormatException: For input string: "" 1) How can I skip this field if it's empty? If I use a script transformer like so: function skipRow(row) { var salesRankShortTerm = row.get( 'salesRankShortTerm' ); if ( salesRankShortTerm == null || salesRankShortTerm == '' ) { row.put( '$skipRow', 'true' ); } return row; } THEN, I will end up skipping the entire document :( 2) So please help me understand how I can configure it to only skip a field and not the document? Thanks, - Pulkit
Re: How to skip fields when using DIH?
OMG, I'm so sorry, please ignore. Its so simple, just had to use: row.remove( 'salesRankShortTerm' ); because the script runs at the end after the entire entity has been processed (I suppose) rather than per field. Thanks! On Tue, Sep 20, 2011 at 5:42 PM, Pulkit Singhal wrote: > The data I'm running through the DIH looks like: > > > > false > false > 349.99 > > > > > As you can see, in this particular instance of a product, there is no > value for "salesRankShortTerm" which happens to be defined in my > schema like so: > /> > > Having an empty value in the incoming DIH data leads to an exception: > Caused by: java.lang.NumberFormatException: For input string: "" > > 1) How can I skip this field if its empty? > > If I use script transformer like so: > > <![CDATA[ > function skipRow(row) { > var salesRankShortTerm = row.get( 'salesRankShortTerm' ); > if ( salesRankShortTerm == null || salesRankShortTerm == '' ) { > row.put( '$skipRow', 'true' ); > } > return row; > } > ]]> > > THEN, I will end up skipping the entire document :( > > 2) So please help me understand how I can configure it to only skip a > field and not the document? > > Thanks, > - Pulkit >
Best Practices for indexing nested XML in Solr via DIH
Hello Everyone, I was wondering what are the various best practices that everyone follows for indexing nested XML into Solr. Please don't feel limited by examples, feel free to share your own experiences. Given an xml structure such as the following: cat001 Everything cat002 Music cat003 Pop How do you make the best use of the data when indexing? 1) Do you use Scenario A? categoryPath_category_id = cat001 cat002 cat003 (flattened) categoryPath_category_name = Everything Music Pop (flattened) If so then how do you manage to find the corresponding categoryPath_category_id if someone's search matches a value in the categoryPath_category_name field? I understand that Solr is not about lookups but this may be important information for you to display right away as part of the search results page rendering. 2) Do you use Scenario B? categoryPath_category_id = [cat001 cat002 cat003] (the [] signifies a multi-value field) categoryPath_category_name = [Everything Music Pop] (the [] signifies a multi-value field) And once again how do you find associated data sets once something matches. Side Question: How can one configure DIH to store the data this way for Scenario B? Thanks! - Pulkit
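For Scenario B, one approach that seems common is to lean on the fact that stored multi-valued fields come back in the order the values were added, so indexing ids and names as parallel multi-valued fields keeps position i of one lined up with position i of the other. A rough, untested SolrJ sketch is below; the field names, server URL, and the assumption that both fields are declared multiValued and stored in schema.xml are mine, not part of the original post.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class CategoryPathExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Index ids and names as parallel multi-valued fields, added in the same order.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "product-1");
        String[] ids = { "cat001", "cat002", "cat003" };
        String[] names = { "Everything", "Music", "Pop" };
        for (int i = 0; i < ids.length; i++) {
            doc.addField("categoryPath_category_id", ids[i]);
            doc.addField("categoryPath_category_name", names[i]);
        }
        server.add(doc);
        server.commit();

        // Stored values come back in insertion order, so the i-th name maps to the i-th id.
        SolrDocument hit = server.query(new SolrQuery("categoryPath_category_name:Music"))
                                 .getResults().get(0);
        List<Object> hitIds = new ArrayList<Object>(hit.getFieldValues("categoryPath_category_id"));
        List<Object> hitNames = new ArrayList<Object>(hit.getFieldValues("categoryPath_category_name"));
        System.out.println(hitNames.get(1) + " -> " + hitIds.get(1));
    }
}

The trade-off is that the pairing only holds at display time; this layout cannot express a query like "name X inside the same path element as id Y", for which a separate lookup core or a combined id|name field could be used instead.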
Re: How to write core's name in log
Not sure if this is a good lead for you but when I run out-of-the-box multi-core example-DIH instance of Solr, I often see core name thrown about in the logs. Perhaps you can look there? On Thu, Sep 15, 2011 at 6:50 AM, Joan wrote: > Hi, > > I have multiple core in Solr and I want to write core name in log through to > lo4j. > > I've found in SolrException a method called log(Logger log, Throwable e) but > when It try to build a Exception it haven't core's name. > > The Exception is built in toStr() method in SolrException class, so I want > to write core's name in the message of Exception. > > I'm thinking to add MDC variable, this will be name of core. Finally I'll > use it in log4j configuration like this in ConversionPattern %X{core} > > The idea is that when Solr received a request I'll add this new variable > "name of core". > > But I don't know if it's a good idea or not. > > or Do you already exists any solution for add name of core in log? > > Thanks > > Joan >
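The MDC idea in the quoted mail could look roughly like the sketch below: a hypothetical servlet filter (not existing Solr code) that parses the core name out of the request URI and puts it into the SLF4J MDC, so that a log4j ConversionPattern containing %X{core} can print it. The URI parsing is an assumption and would need adjusting to the real URL layout and filter ordering.

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;

import org.slf4j.MDC;

/** Hypothetical filter that exposes the core name to log4j as %X{core}. */
public class CoreNameMdcFilter implements Filter {

  public void init(FilterConfig config) {}

  public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
      throws IOException, ServletException {
    // e.g. /solr/core0/select -> "core0"; adjust the index to your deployment's URL layout.
    String[] parts = ((HttpServletRequest) req).getRequestURI().split("/");
    String core = parts.length > 2 ? parts[2] : "unknown";
    MDC.put("core", core);
    try {
      chain.doFilter(req, resp);
    } finally {
      MDC.remove("core"); // don't leak the value to the next request on this thread
    }
  }

  public void destroy() {}
}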
Re: strange copied field problem
I am NOT claiming that making a copy of a copy field is wrong or leads to a race condition. I don't know that. BUT did you try to copy into the text field directly from the genre field? Instead of the genre_search field? Did that yield working queries? On Wed, Sep 21, 2011 at 12:16 PM, Tanner Postert wrote: > i have 3 fields that I am working with: genre, genre_search and text. genre > is a string field which comes from the data source. genre_search is a text > field that is copied from genre, and text is a text field that is copied > from genre_search and a few other fields. Text field is the default search > field for queries. When I search for q=genre_search:indie+rock, solr returns > several records that have both Indie as a genre and Rock as a genre, which > is great, but when I search for q=indie+rock or q=text:indie+rock, i get no > results. > > Why would the source field return the value and the destination wouldn't. > Both genre_search and text are the same data type, so there shouldn't be any > strange translations happening. >
Re: OOM errors and -XX:OnOutOfMemoryError flag not working on solr?
Usually any good piece of java code refrains from capturing Throwable so that Errors will bubble up unlike exceptions. Having said that, perhaps someone in the list can help, if you share which particular Solr version you are using where you suspect that the Error is being eaten up. On Fri, Sep 16, 2011 at 2:47 PM, Jason Toy wrote: > I have solr issues where I keep running out of memory. I am working on > solving the memory issues (this will take a long time), but in the meantime, > I'm trying to be notified when the error occurs. I saw with the jvm I can > pass the -XX:OnOutOfMemoryError= flag and pass a script to run. Every time > the out of memory issue occurs though my script never runs. Does solr let > the error bubble up so that the jvm can call this script? If not how can I > have a script run when solr gets an out of memory issue? >
Re: add quartz like scheduling cabalities to solr-DIH
I think what Ahmet is trying to say is that such functionality does not exist. Since the functionality does not exist, there is no procedure or conf-file-related work to speak of. There has been a request to have this work done and you can vote for/watch it here: https://issues.apache.org/jira/browse/SOLR-1251 On Fri, Sep 16, 2011 at 7:35 AM, vighnesh wrote: > thanks iroxxx > > > but how can l add quartz like scheduling to solr dih ,is there any changes > required in anyof the configuration files please specify the procedure. > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/add-quartz-like-scheduling-cabalities-to-solr-DIH-tp3341141p3341795.html > Sent from the Solr - User mailing list archive at Nabble.com. >
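Until something like SOLR-1251 lands, the usual workaround is to trigger the DIH handler on a timer from outside, e.g. a cron job hitting the import URL with curl or wget. If it has to live inside a JVM instead, a minimal sketch along the same lines is below; the URL, command and interval are assumptions to adjust for your setup.

import java.io.InputStream;
import java.net.URL;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Minimal stand-in for a quartz job: poke the DIH handler on a fixed interval. */
public class DihScheduler {
    public static void main(String[] args) {
        // Hypothetical URL; adjust host, core name, command and params for your setup.
        final String dihUrl =
            "http://localhost:8983/solr/dataimport?command=delta-import&commit=true";

        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                try {
                    // DIH runs asynchronously; the response body just acknowledges the request.
                    InputStream in = new URL(dihUrl).openStream();
                    in.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }, 0, 2, TimeUnit.HOURS); // kick off now, then every two hours
    }
}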
Re: Solr Indexing - Null Values in date field
Also you may use the script transformer to explicitly remove the field from the document if the field is null. I do this for all my sdouble and sdate fields ... its a bit manual and I would like to see Solr enhanced to simply skip stuff like this by having a flag for its DIH code but until then it suffices: ... transformer="DateFormatTransformer,script:skipEmptyFields" On Wed, Sep 21, 2011 at 6:06 AM, Gora Mohanty wrote: > On Wed, Sep 21, 2011 at 4:08 PM, mechravi25 wrote: >> Hi, >> >> I have a field in my source with data type as string and that field has NULL >> values. I am trying to index this field in solr as a date data type with >> multivalued = true. Following is the entry for that field in my schema.xml > [...] > > One cannot have NULL values as input for Solr date fields. The > multivalued part is irrelevant here. > > As it seems like you are getting the input data from a database, > you will need to supply some invalid date for NULL date values. > E.g., with mysql, we have: > COALESCE( CreationDate, STR_TO_DATE( '1970,1,1', '%Y,%m,%d' ) ) > The required syntax will be different for other databases. > > Regards, > Gora >
Debugging DIH by placing breakpoints
Hello, I was wondering where I can find the source code for DIH? I want to check out the source and step through it breakpoint by breakpoint to understand it better :) Thanks! - Pulkit
Re: Debugging DIH by placing breakpoints
Correct! With that additional info, plus http://wiki.apache.org/solr/HowToContribute (ant eclipse), plus a refreshed (close/open) eclipse project ... I'm all set. Thanks Again. On Wed, Sep 21, 2011 at 1:43 PM, Gora Mohanty wrote: > On Thu, Sep 22, 2011 at 12:08 AM, Pulkit Singhal > wrote: >> Hello, >> >> I was wondering where can I find the source code for DIH? I want to >> checkout the source and step-trhought it breakpoint by breakpoint to >> understand it better :) > > Should be under contrib/dataimporthandler in your Solr source > tree. > > Regards, > Gora >
Re: strange copied field problem
No probs. I would still hope someone would comment on you thread with some expert opinions about making a copy of a copy :) On Wed, Sep 21, 2011 at 1:38 PM, Tanner Postert wrote: > sure enough that worked. could have sworn we had it this way before, but > either way, that fixed it. Thanks. > > On Wed, Sep 21, 2011 at 11:01 AM, Tanner Postert > wrote: > >> i believe that was the original configuration, but I can switch it back and >> see if that yields any results. >> >> >> On Wed, Sep 21, 2011 at 10:54 AM, Pulkit Singhal >> wrote: >> >>> I am NOT claiming that making a copy of a copy field is wrong or leads >>> to a race condition. I don't know that. BUT did you try to copy into >>> the text field directly from the genre field? Instead of the >>> genre_search field? Did that yield working queries? >>> >>> On Wed, Sep 21, 2011 at 12:16 PM, Tanner Postert >>> wrote: >>> > i have 3 fields that I am working with: genre, genre_search and text. >>> genre >>> > is a string field which comes from the data source. genre_search is a >>> text >>> > field that is copied from genre, and text is a text field that is copied >>> > from genre_search and a few other fields. Text field is the default >>> search >>> > field for queries. When I search for q=genre_search:indie+rock, solr >>> returns >>> > several records that have both Indie as a genre and Rock as a genre, >>> which >>> > is great, but when I search for q=indie+rock or q=text:indie+rock, i get >>> no >>> > results. >>> > >>> > Why would the source field return the value and the destination >>> wouldn't. >>> > Both genre_search and text are the same data type, so there shouldn't be >>> any >>> > strange translations happening. >>> > >>> >> >> >
ScriptTransformer question
Hello, I'm using DIH in the trunk version and I have placed breakpoints in the Solr code. I can see that the value for a row being fed into the ScriptTransformer instance is: {buybackPlans.buybackPlan.type=[PSP-PRP], buybackPlans.buybackPlan.name=[2-Year Buy Back Plan], buybackPlans.buybackPlan.sku=[2490748], $forEach=/products/product/buybackPlans/buybackPlan, buybackPlans.buybackPlan.price=[]} Now, price cannot be left empty because Solr will complain, so the following script should be kicking in, but it doesn't do anything!!! Can anyone spot the issue here? function skipEmptyFieldsInBuybackPlans(row) { var buybackPlans_buybackPlan_price = row.get( 'buybackPlans.buybackPlan.price' ); if ( buybackPlans_buybackPlan_price == null || buybackPlans_buybackPlan_price == '' || buybackPlans_buybackPlan_price.length == 0) { row.remove( 'buybackPlans.buybackPlan.price' ); } return row; } I would hate to have to get the Rhino JavaScript engine source code and step through that. I'm sure I'm being really dumb and am hoping that someone on the Solr mailing list can help me spot the issue :) Thanks! - Pulkit
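One observation on the row dump above: the values print with square brackets (price=[]), which suggests they are java.util.List instances rather than Strings, so the == '' comparison never matches and .length is not a property a Java list exposes to the script; that would explain why the function appears to do nothing. One workaround, sketched below and untested, is a small custom Java transformer referenced from the entity's transformer attribute (the class and package name are made up); checking isEmpty() on the list sidesteps the string/list confusion.

import java.util.List;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

/**
 * Hypothetical DIH transformer: drops the price field when XPathEntityProcessor
 * hands it over as an empty list or an empty string.
 * Wire it in with transformer="com.example.SkipEmptyPriceTransformer" (assumed class name).
 */
public class SkipEmptyPriceTransformer extends Transformer {

  @Override
  public Object transformRow(Map<String, Object> row, Context context) {
    Object price = row.get("buybackPlans.buybackPlan.price");
    boolean empty = price == null
        || (price instanceof String && ((String) price).length() == 0)
        || (price instanceof List && ((List<?>) price).isEmpty());
    if (empty) {
      row.remove("buybackPlans.buybackPlan.price");
    }
    return row;
  }
}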
Re: DIH error when nested db datasource and file data source
Few thoughts: 1) If you place the script transformer method on the entity named "x" and then pass the ${topic_tree.topic_id} to that as an argument, then shouldn't you have everything you need to work with x's row? Even if you can't look up at the parent, all you needed to know was the topic_id and based on that you can edit or not edit x's row ... shouldn't that be sufficient to get you what you need to do? 2) Regarding the manner in which you are trying to use the following xpath syntax: forEach="/gvpVideoMetaData/mediaItem[@media_id='${topic_tree.topic_id}']" There are two other closely related thread that I've come across: (a) http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html (b) http://lucene.472066.n3.nabble.com/using-DIH-with-mets-alto-file-sets-td1926642.html They both seemed to want to use the full power of XPath like you do and I think that in a roundabout way they were told utilize the xsl attribute to make up for what the XPath was lacking by default. Here are some choice words by Lance that I've extracted out for you: "XPathEntityProcessor parses a very limited XPath syntax. However, you can add an XSL script as an attribute, and this somehow gets called instead." - Lance There is an option somewhere to use the full XML DOM implementation for using xpaths. The purpose of the XPathEP is to be as simple and dumb as possible and handle most cases: RSS feeds and other open standards. Search for xsl(optional) http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1 - Lance I hope you can make some sense of this, I'm no expert, but just thought I'd offer my 2 cts. On Fri, Sep 23, 2011 at 9:21 AM, abhayd wrote: > hi > I am not getting exception anymore.. I had issue with database > > But now real problem i always have ... > Now that i can fetch ID's from database how would i fetch correcponding data > from ID in xm file > > So after getting DB info from jdbcsource I use xpath processor like this, > but it does not work. > baseDir="${solr.solr.home}" fileName=".xml" > recursive="false" rootEntity="true" > dataSource="video_datasource"> > > forEach="/gvpVideoMetaData/mediaItem[@media_id='${topic_tree.topic_id}']" > url="${f.fileAbsolutePath}" > > > > I even tried using script transformer but "row" in script transformer has > scope limited to entity "f" If this is nested under another entity u cant > access top level variables with "row" . > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/DIH-error-when-nested-db-datasource-and-file-data-source-tp3345664p3362007.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: UIMA DictionaryAnnotator partOfSpeach
At first glance it seems like a simple localization issue as indicated by this: > org.apache.uima.annotator.dict_annot.impl.DictionaryAnnotatorProcessException: > EXCEPTION MESSAGE LOCALIZATION FAILED: java.util.MissingResourceException: > Can't find bundle for base name > org.apache.uima.annotator.dict_annot.dictionaryAnnotatorMessages, locale > en_US Perhaps you can get the source code for UIMA and run the server hosting Solr in debug mode then remote connect to it via eclipse or some other IDE and use a breakpoint to figure out which resource is the issue. After that it would be UIMA specific solution, I think. On Wed, Sep 28, 2011 at 4:11 PM, chanhangfai wrote: > Hi all, > > I have the dictionary Annotator UIMA-solr running, > used my own dictionary file and it works, > it will match all the words (Nouns, Verbs and Adjectives) from my dictionary > file. > > *but now, if I only want to match "Nouns", (ignore other part of speech)* > > how can I configure it? > > > http://uima.apache.org/d/uima-addons-current/DictionaryAnnotator/DictionaryAnnotatorUserGuide.html > > From the above user guide, in section (3.3. Input Match Type Filters), > i added the following code to my DictionaryAnnotatorDescriptor.xml, > > > InputMatchFilterFeaturePath > > *partOfSpeach* > > > > > FilterConditionOperator > > EQUALS > > > > > FilterConditionValue > > noun > > > > > but it fails, and the error said featurePathElementNames "*partOfSpeach*" is > invalid. > > org.apache.uima.annotator.dict_annot.impl.DictionaryAnnotatorProcessException: > EXCEPTION MESSAGE LOCALIZATION FAILED: java.util.MissingResourceException: > Can't find bundle for base name > org.apache.uima.annotator.dict_annot.dictionaryAnnotatorMessages, locale > en_US > at > org.apache.uima.annotator.dict_annot.impl.FeaturePathInfo_impl.typeSystemInit(FeaturePathInfo_impl.java:110) > at > org.apache.uima.annotator.dict_annot.impl.DictionaryAnnotator.typeSystemInit(DictionaryAnnotator.java:383) > at > org.apache.uima.analysis_component.CasAnnotator_ImplBase.checkTypeSystemChange(CasAnnotator_ImplBase.java:100) > at > org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:55) > at > org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377) > at > org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295) > at > org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567) > at > org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.(ASB_impl.java:409) > at > org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342) > at > org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267) > at > org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267) > at > org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280) > > > > Any idea please, > Thanks in advance.. > > Frankie > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/UIMA-DictionaryAnnotator-partOfSpeach-tp3377440p3377440.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: basic solr cloud questions
@Darren: I feel that the question itself is misleading. Creating shards is meant to separate out the data ... not keep the exact same copy of it. I think the two node setup that was attempted by Sam mislead him and us into thinking that configuring two nodes which are to be named "shard1" ... somehow means that they are instantly replicated too ... this is not the case! I can see how this misunderstanding can develop as I too was confused until Yury cleared it up. @Sam: If you are interested in performing a quick exercise to understand the pieces involved for replication rather than sharding ... perhaps this link would be of help in taking you through it: http://pulkitsinghal.blogspot.com/2011/09/setup-solr-master-slave-replication.html - Pulkit 2011/9/27 Yury Kats : > On 9/27/2011 5:16 PM, Darren Govoni wrote: >> On 09/27/2011 05:05 PM, Yury Kats wrote: >>> You need to either submit the docs to both nodes, or have a replication >>> setup between the two. Otherwise they are not in sync. >> I hope that's not the case. :/ My understanding (or hope maybe) is that >> the new Solr Cloud implementation will support auto-sharding and >> distributed indexing. This means that shards will receive different >> documents regardless of which node received the submitted document >> (spread evenly based on a hash<->node assignment). Distributed queries >> will thus merge all the solr shard/node responses. > > All cores in the same shard must somehow have the same index. > Only then can you continue servicing searches when individual cores > fail. Auto-sharding and distributed indexing don't have anything to > do with this. > > In the future, SolrCloud may be managing replication between cores > in the same shard automatically. But right now it does not. >
Re: Why I can't take an full-import with entity name?
Can you monitor the DB side to see what results it returned for that query? 2011/8/30 于浩 : > I am using solr1.3,I updated solr index throgh solr delta import every two > hours. but the delta import is database connection wasteful. > So i want to use full-import with entity name instead of delta import. > > my db-data-config.xml file: > > > > query="select Article_ID,Article_Title,Article_Abstract from Article_Detail > where Article_ID>'${dataimporter.request.minID}' and Article_ID > <='{dataimporter.request.maxID}' > "> > > > > > then I uses > http://192.168.1.98:8081/solr/db_article/dataimport?command=full-import&entity=delta_article&commit=true&clean=false&maxID=1000&minID=10 > but the solr will finish nearyly instant,and there is no any record > imported. but what the fact is there are many records meets the condtion of > maxID and minID. > > > the tomcat log: > 信息: [db_article] webapp=/solr path=/dataimport > params={maxID=6737277&clean=false&commit=true&entity=delta_article&command=full-import&minID=6736841} > status=0 QTime=0 > 2011-8-29 19:00:03 org.apache.solr.handler.dataimport.DataImporter > doFullImport > 信息: Starting Full Import > 2011-8-29 19:00:03 org.apache.solr.handler.dataimport.SolrWriter > readIndexerProperties > 信息: Read dataimport.properties > 2011-8-29 19:00:03 org.apache.solr.handler.dataimport.SolrWriter > persistStartTime > 信息: Wrote last indexed time to dataimport.properties > 2011-8-29 19:00:03 org.apache.solr.handler.dataimport.DocBuilder commit > 信息: Full Import completed successfully > > > some body who can help or some advices? >
Re: SolrCloud: is there a programmatic way to create an ensemble
Did you find out about this? 2011/8/2 Yury Kats : > I have multiple SolrCloud instances, each running its own Zookeeper > (Solr launched with -DzkRun). > > I would like to create an ensemble out of them. I know about -DzkHost > parameter, but can I achieve the same programmatically? Either with > SolrJ or REST API? > > Thanks, > Yury >
Re: basic solr cloud questions
SOLR-2355 is definitely a step in the right direction but something I would like to get clarified: a) There were some fixes to it that went on the 3.4 & 3.5 branch based on the comments section ... are they not available or not needed on 4.x trunk? b) Does this basic implementation distribute across shards or across cores? I think that distributing across all the cores in a shard is the key towards using it successfully with SolrCloud and I really don't know if this does this right now as I am not familiar with the source code. If someone could answer this it would be great otherwise I'll post back eventually when I do become familiar. Cheers, - Pulkit
Re: basic solr cloud questions
BTW I updated the wiki with the following, hope it keeps things simple for others starting out: Example B: Simple two shard cluster with shard replicas Note: This setup leverages copy/paste to set up 2 cores per shard, and distributed searches validate a successful completion of this example/exercise. But DO NOT assume that any new data that you index will be distributed across, and indexed at, each core of a given shard. That will not happen. Distributed Indexing is not part of SolrCloud yet. You may however adapt a basic implementation of distributed indexing by referring to SOLR-2355. On Fri, Sep 30, 2011 at 11:26 AM, Pulkit Singhal wrote: > SOLR-2355 is definitely a step in the right direction but something I > would like to get clarified: > > a) There were some fixes to it that went on the 3.4 & 3.5 branch based > on the comments section ... are they not available or not needed on > 4.x trunk? > > b) Does this basic implementation distribute across shards or across > cores? I think that distributing across all the cores in a shard is > the key towards using it successfully with SolrCloud and I really > don't know if this does this right now as I am not familiar with the > source code. If someone could answer this it would be great otherwise > I'll post back eventually when I do become familiar. > > Cheers, > - Pulkit >
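In the spirit of that note, a crude client-side stand-in for distributed indexing is to hash the uniqueKey and always send a given document to the same shard, which is roughly what the SOLR-2355 patch automates on the server side. An untested sketch is below; the shard URLs and the "id" uniqueKey are assumptions, and it does nothing about keeping the copies within a shard in sync.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

/** Crude client-side document distribution: hash the uniqueKey to pick a shard. */
public class SimpleShardRouter {

    private final SolrServer[] shards;

    public SimpleShardRouter(String... shardUrls) throws Exception {
        shards = new SolrServer[shardUrls.length];
        for (int i = 0; i < shardUrls.length; i++) {
            shards[i] = new CommonsHttpSolrServer(shardUrls[i]);
        }
    }

    public void add(SolrInputDocument doc) throws Exception {
        String id = (String) doc.getFieldValue("id"); // uniqueKey assumed to be "id"
        int shard = Math.abs(id.hashCode() % shards.length);
        shards[shard].add(doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical URLs matching the two-shard wiki example.
        SimpleShardRouter router =
            new SimpleShardRouter("http://localhost:8983/solr", "http://localhost:7574/solr");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-42");
        doc.addField("name", "routed by id hash");
        router.add(doc);
        for (SolrServer shard : router.shards) {
            shard.commit();
        }
    }
}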
Bug in DIH?
It's a rather strange stacktrace (at the bottom). An entire 1+ dataset finishes up only to end up crashing & burning due to a log statement :) Based on what I can tell from the stacktrace and the 4.x trunk source code, it seems that the following log statement dies: //LogUpdateProcessorFactory.java:188 log.info( ""+toLog + " 0 " + (elapsed) ); Eventually at the strict cast: //NamedList.java:127 return (String)nvPairs.get(idx << 1); I was wondering what kind of mistaken data I would have ended up getting misplaced into: //LogUpdateProcessorFactory.java:76 private final NamedList toLog; To cause the java.util.ArrayList cannot be cast to java.lang.String issue? Could it be due to the multivalued fields that I'm trying to index? Is this a bug or just a mistake in how I use DIH? Please let me know your thoughts! SEVERE: Full Import failed:java.lang.ClassCastException: java.util.ArrayList cannot be cast to java.lang.String at org.apache.solr.common.util.NamedList.getName(NamedList.java:127) at org.apache.solr.common.util.NamedList.toString(NamedList.java:263) at java.lang.String.valueOf(String.java:2826) at java.lang.StringBuilder.append(StringBuilder.java:115) at org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188) at org.apache.solr.handler.dataimport.SolrWriter.close(SolrWriter.java:57) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:265) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:372) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:440) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:421)
Enabling the right logs for DIH
The Problem: When using DIH with trunk 4.x, I am seeing some very funny numbers with a particularly large XML file that I'm trying to import. Usually there are bound to be more rows than documents indexed in DIH because of the forEach property, but my other xml files have maybe 1.5 times the rows compared to the # of docs indexed. This particular funky file ends up with something like: 25614008 1048 That's 25 million rows fetched before even a measly 1000 docs are indexed! Something has to be wrong here. I checked the xml for well-formed-ness in vim by running ":!xmllint --noout %" so I think there are no issues there. The Question: For those intimately familiar with DIH code/behaviour: What is the appropriate log level that will let me see the rows & docs printed out to the log as each one is fetched/created? I don't want to make the logs explode because then I won't be able to read through them. Is there some gentle balance here that I can leverage? Thanks! - Pulkit
Re: Bug in DIH?
Thanks Lance, its logged as: https://issues.apache.org/jira/browse/SOLR-2804 - Pulkit On Sat, Oct 1, 2011 at 8:59 PM, Lance Norskog wrote: > Should bugs in LogProcessor should be ignored by DIH? They are not required > to index data, right? > > Please open an issue for this. The fix should have two parts: > 1) fix the exception > 2) log and ignore exceptions in the LogProcessor > > On Sat, Oct 1, 2011 at 2:02 PM, Pulkit Singhal wrote: > >> Its rather strange stacktrace(at the bottom). >> An entire 1+ dataset finishes up only to end up crashing & burning >> due to a log statement :) >> >> Based on what I can tell from the stacktrace and the 4.x trunk source >> code, it seems that the follwoign log statement dies: >> //LogUpdateProcessorFactory.java:188 >> log.info( ""+toLog + " 0 " + (elapsed) ); >> >> Eventually at the strict cast: >> //NamedList.java:127 >> return (String)nvPairs.get(idx << 1); >> >> I was wondering what kind of mistaken data would I have ended up >> getting misplaced into: >> //LogUpdateProcessorFactory.java:76 >> private final NamedList toLog; >> >> To cause the java.util.ArrayList cannot be cast to java.lang.String issue? >> Could it be due to the multivalued fields that I'm trying to index? >> Is this a bug or just a mistake in how I use DIH, please let me know >> your thoughts! >> >> SEVERE: Full Import failed:java.lang.ClassCastException: >> java.util.ArrayList cannot be cast to java.lang.String >> at org.apache.solr.common.util.NamedList.getName(NamedList.java:127) >> at >> org.apache.solr.common.util.NamedList.toString(NamedList.java:263) >> at java.lang.String.valueOf(String.java:2826) >> at java.lang.StringBuilder.append(StringBuilder.java:115) >> at >> org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188) >> at >> org.apache.solr.handler.dataimport.SolrWriter.close(SolrWriter.java:57) >> at >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:265) >> at >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:372) >> at >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:440) >> at >> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:421) >> > > > > -- > Lance Norskog > goks...@gmail.com >
DIH full-import with clean=false is still removing old data
Hello, I have a unique dataset of 1,110,000 products, each as its own file. It is split across three different directories of 500,000, 110,000 and 500,000 files. When I run: http://localhost:8983/solr/bbyopen/dataimport?command=full-import&clean=false&commit=true The first 500,000 entries are successfully indexed and then the next 110,000 entries also work ... but after I run the third full-import on the last set of 500,000 entries, the document count remains at 610,000 ... it doesn't go up to 1,110,000! 1) Is there some kind of limit here? Why can the full-import keep the initial 500,000 entries and then let me do a full-import with 110,000 more entries ... but when I try to do a 3rd full-import, the document count doesn't go up? 2) I know for sure that all the data is unique. Since I am not doing delta-imports, I have NOT specified any primary key in the data-import.xml file. But I do have a uniqueKey in the schema.xml file. Any tips? - Pulkit
Re: DIH full-import with clean=false is still removing old data
Bah it worked after cleaning it out for the 3rd time, don't know what I did differently this time :( On Tue, Oct 4, 2011 at 8:00 PM, Pulkit Singhal wrote: > Hello, > > I have a unique dataset of 1,110,000 products, each as its own file. > It is split into three different directories as 500,000 and 110,000 > files and 500,000. > > When I run: > http://localhost:8983/solr/bbyopen/dataimport?command=full-import&clean=false&commit=true > The first 500,000 entries are successfully indexed and then the next > 110,000 entries also work ... but after I run the third full-import on > the last set of 500,000 entries, the document count remains at 610,000 > ... it doesn't go up to 1,110,000! > > 1) Is there some kind of limit here? Why can the full-import keep the > initial 500,000 entries and then let me do a full-import with 110,000 > more entries ... but when I try to do a 3rd full-import, the document > count doesn't go up. > > 2) I know for sure that all the data is unique. Since I am not doing > delta-imports, I have NOT specified any primary key in the > data-import.xml file. But I do have a uniqueKey in the schema.xml > file. > > Any tips? > - Pulkit >
Interesting DIH challenge
Hello Folks, I'm a big DIH fan but I'm fairly sure that now I've run into a scenario where it can't help me anymore ... but before I give up and roll my own solution, I just wanted to check with everyone else. The scenario: - already have 1M+ documents indexed - the schema.xml needs to have one more field added to it ... problem/do-able? yes? no? remove all the old data? or do the update per doc (add/delete)? - need to populate data from a file that has a key and value per line and I need to use the key to find the doc to update and then add the value to the new schema field Any ideas?
Re: Interesting DIH challenge
@Gora Thank You! I know that Solr accepts xml with Solr specific elements that are commands that only it understands ... such as , etc. Question: Is there some way to ask Solr to dump out whatever it has in its index already ... as a Solr xml document? Plan: I intend to massage that xml dump (add the field + value that I need in every doc's xml element) and then I should be able to push this dump back to Solr to get data indexed again, I hope. Thanks! - Pulkit On Sun, Oct 9, 2011 at 2:57 PM, Gora Mohanty wrote: > On Mon, Oct 10, 2011 at 1:17 AM, Pulkit Singhal > wrote: > > Hello Folks, > > > > I'm a big DIH fan but I'm fairly sure that now I've run into a scenario > > where it can't help me anymore ... but before I give up and roll my own > > solution, I jsut wanted to check with everyone else. > > > > The scenario: > > - already have 1M+ documents indexed > > - the schema.xml needs to have one more field added to it ... > > problem/do-able? yes? no? remove all the old data? or do the update per > doc > > (add/delete)? > > This is independent of DIH. If you want to add a new field to the schema, > you should reindex. 1M documents should not take that long. > > > - need to populate data from a file that has a key and value per line and > i > > need to use the key to find the doc to update and then add the value to > the > > new schema field > > It is best just to reindex, but it should be possible to write a script to > pull > the doc from the existing Solr index, massage the return format into > Solr's XML format, adding a value for the new field in the process, and > then posting the new file to Solr for indexing. > > Regards, > Gora >
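The same dump-massage-reindex plan can also be done with SolrJ instead of raw XML, provided every field in the schema is stored: page through the index, copy each document, add the new field from the key/value file, and re-add it (the same uniqueKey simply overwrites the old doc). A rough, untested sketch; the file name, tab-separated format, the "category" field and the "id" uniqueKey are all assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AddFieldReindexer {
    public static void main(String[] args) throws Exception {
        // One "key<TAB>value" pair per line; "category" is a made-up name for the new field.
        Map<String, String> extra = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(new FileReader("new-field.txt"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] kv = line.split("\t", 2);
            if (kv.length == 2) {
                extra.put(kv[0], kv[1]);
            }
        }
        reader.close();

        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        int rows = 1000;
        for (int start = 0; ; start += rows) {
            QueryResponse rsp = server.query(new SolrQuery("*:*").setStart(start).setRows(rows));
            if (rsp.getResults().isEmpty()) {
                break;
            }
            for (SolrDocument old : rsp.getResults()) {
                SolrInputDocument doc = new SolrInputDocument();
                for (String field : old.getFieldNames()) {
                    doc.addField(field, old.getFieldValue(field));
                }
                String id = (String) old.getFieldValue("id"); // uniqueKey assumed to be "id"
                if (extra.containsKey(id)) {
                    doc.addField("category", extra.get(id));
                }
                server.add(doc); // same uniqueKey, so this replaces the old document
            }
        }
        server.commit(); // commit once at the end so the *:* paging sees a stable snapshot
    }
}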
Re: Interesting DIH challenge
Oh also: Does DIH have any experimental way for folks to be reading data from one solr core and then massaging it and importing it into another core? If not, then would that be a good addition or just a waste of time for some architectural reason? On Sun, Oct 9, 2011 at 8:00 PM, Pulkit Singhal wrote: > @Gora Thank You! > > I know that Solr accepts xml with Solr specific elements that are commands > that only it understands ... such as , etc. > > Question: Is there some way to ask Solr to dump out whatever it has in its > index already ... as a Solr xml document? > > Plan: I intend to message that xml dump (add the field + value that I need > in every doc's xml element) and then I should be able to push this dump back > to Solr to get data indexed again, I hope. > > Thanks! > - Pulkit > > > On Sun, Oct 9, 2011 at 2:57 PM, Gora Mohanty wrote: > >> On Mon, Oct 10, 2011 at 1:17 AM, Pulkit Singhal >> wrote: >> > Hello Folks, >> > >> > I'm a big DIH fan but I'm fairly sure that now I've run into a scenario >> > where it can't help me anymore ... but before I give up and roll my own >> > solution, I jsut wanted to check with everyone else. >> > >> > The scenario: >> > - already have 1M+ documents indexed >> > - the schema.xml needs to have one more field added to it ... >> > problem/do-able? yes? no? remove all the old data? or do the update per >> doc >> > (add/delete)? >> >> This is independent of DIH. If you want to add a new field to the schema, >> you should reindex. 1M documents should not take that long. >> >> > - need to populate data from a file that has a key and value per line >> and i >> > need to use the key to find the doc to update and then add the value to >> the >> > new schema field >> >> It is best just to reindex, but it should be possible to write a script to >> pull >> the doc from the existing Solr index, massage the return format into >> Solr's XML format, adding a value for the new field in the process, and >> then posting the new file to Solr for indexing. >> >> Regards, >> Gora >> > >
Re: Replication fails in SolrCloud
@Prakash: Can your please format the body a bit for readability? @Solr-Users: Is anybody else having any problems when running Zookeeper from the latest code in the trunk(4.x)? On Mon, Nov 7, 2011 at 4:44 PM, prakash chandrasekaran < prakashchandraseka...@live.com> wrote: > > hi all, i followed steps in link > http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensembleand > created "Two shard cluster with shard replicas and zookeeper ensemble", > and then for Solr Replication i followed steps in link > http://wiki.apache.org/solr/SolrReplication .. > now after server start, when slave tries to pull data from master .. i m > seeing below error messages .. > org.apache.solr.common.SolrException logSEVERE: > org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does > not support getConfigDir() - likely, what you are trying to do is not > supported in ZooKeeper modeat > org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99) >at > org.apache.solr.handler.ReplicationHandler.getConfFileInfoFromCache(ReplicationHandler.java:378) > at > org.apache.solr.handler.ReplicationHandler.getFileList(ReplicationHandler.java:364) > i have few questions regarding this 1) Does Solr Cloud supports > Replication ??2) or do we need to follow different steps to achieve > Replication in Solr Cloud ?? > > Thanks,prakash > > > From: prakashchandraseka...@live.com > > To: solr-user@lucene.apache.org > > Subject: Zookeeper aware Replication in SolrCloud > > Date: Fri, 4 Nov 2011 03:36:27 + > > > > > > > > hi, > > i m using SolrCloud and i wanted to add Replication feature to it .. > > i followed the steps in Solr Wiki .. but when the client tried to poll > for data from server i got below Error Message .. > > in Master LogNov 3, 2011 8:34:00 PM > > > > in Slave logNov 3, 2011 8:34:00 PM > org.apache.solr.handler.ReplicationHandler doFetchSEVERE: SnapPull failed > org.apache.solr.common.SolrException: Request failed for the url > org.apache.commons.httpclient.methods.PostMethod@18eabf6at > org.apache.solr.handler.SnapPuller.getNamedListResponse(SnapPuller.java:197) > at org.apache.solr.handler.SnapPuller.fetchFileList(SnapPuller.java:219) > at > org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:281) > at > org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:284) > > but i could see the slave pointing to correct master from link : > http://localhost:7574/solr/replication?command=details > > i m also seeing these values in replication details link .. ( > http://localhost:7574/solr/replication?command=details) > > Thu Nov 03 20:28:00 PDT > 2011Thu Nov 03 20:27:00 PDT 2011Thu Nov 03 20:26:00 > PDT 2011Thu Nov 03 20:25:00 PDT 2011 name="replicationFailedAtList"> Thu Nov 03 20:28:00 PDT 2011 > Thu Nov 03 20:27:00 PDT 2011 Thu Nov 03 20:26:00 PDT > 2011 Thu Nov 03 20:25:00 PDT 2011 > > > > > > Thanks,Prakash >
Re: Error while trying to load JSON
It seems that you are using the bbyopen data. If have made up your mind on using the JSON data then simply store it in ElasticSearch instead of Solr as they do take any valid JSON structure. Otherwise, you can download the xml archive from bbyopen and prepare a schema: Here are some generic instructions to familiarize you with building schema given arbitrary data, it should help speed things up, they don't apply directly to bbyopen data though: http://pulkitsinghal.blogspot.com/2011/10/import-dynamic-fields-from-xml-into.html http://pulkitsinghal.blogspot.com/2011/09/import-data-from-amazon-rss-feeds-into.html Keep in mind, ES also does you a favor by building the right schema dynamically on the fly as you feed it the JSON data. So it is much easier to work with. On Fri, Mar 16, 2012 at 1:26 PM, Erick Erickson wrote: > bq: Shouldn't it be able to take any valid JSON structure? > > No, that was never the intent. The intent here was just to provide > a JSON-compatible format for indexing data for those who > don't like/want to use XML or SolrJ or Solr doesn't index arbitrary > XML either. And I have a hard time imagining what the > schema.xml file would look like when trying to map > arbitrary JSON (or XML or) into fields. > > Best > Erick > > On Fri, Mar 16, 2012 at 12:54 PM, Chambeda wrote: > > Ok, so my issue is that it must be a flat structure. Why isn't the JSON > > parser able to deconstruct the object into a flatter structure for > indexing? > > Shouldn't it be able to take any valid JSON structure? > > > > -- > > View this message in context: > http://lucene.472066.n3.nabble.com/Error-while-trying-to-load-JSON-tp3832518p3832611.html > > Sent from the Solr - User mailing list archive at Nabble.com. >
Schema error unknown field
I'm getting the following exception SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'desc' I'm wondering what I need to do in order to add the "desc" field to the Solr schema for indexing?
@Field annotation support
Hello All, When I use Maven or Eclipse to try and compile my bean which has the @Field annotation as specified in http://wiki.apache.org/solr/Solrj page ... the compiler doesn't find any class to support the annotation. What jar should we use to bring in this custom Solr annotation?
Re: Schema error unknown field
I guess my n00b-ness is showing :) I started off using the instructions directly from http://wiki.apache.org/solr/Solrj and there was no mention of schema there and even after gettign this error and searching for schema.xml in the wiki ... I found no meaningful hits so I thought it best to ask. With your advice, I searched for schema.xml and found 13 instances of it: \solr_1.4.0\client\ruby\solr-ruby\solr\conf\schema.xml \solr_1.4.0\client\ruby\solr-ruby\test\conf\schema.xml \solr_1.4.0\contrib\clustering\src\test\resource\schema.xml \solr_1.4.0\contrib\extraction\src\test\resource\schema.xml \solr_1.4.0\contrib\velocity\src\main\solr\conf\schema.xml \solr_1.4.0\example\example-DIH\solr\db\conf\schema.xml \solr_1.4.0\example\example-DIH\solr\mail\conf\schema.xml \solr_1.4.0\example\example-DIH\solr\rss\conf\schema.xml \solr_1.4.0\example\multicore\core0\conf\schema.xml \solr_1.4.0\example\multicore\core1\conf\schema.xml \solr_1.4.0\example\solr\conf\schema.xml \solr_1.4.0\src\test\test-files\solr\conf\schema.xml \solr_1.4.0\src\test\test-files\solr\shared\conf\schema.xml I took a wild guess and added the field I wanted ("desc") into this file since its name seemed to be the most generic one: C:\apps\solr_1.4.0\example\solr\conf\schema.xml And it worked ... a bit strange that an example directory is used but I suppose it is configurable somewhere? Thanks for you help Erick! Cheers, - Pulkit On Thu, Feb 18, 2010 at 9:53 AM, Erick Erickson wrote: > Add desc as a in your schema.xml > file would be my first guess. > > Providing some explanation of what you're trying to do > would help diagnose your issues. > > HTH > Erick > > On Thu, Feb 18, 2010 at 12:21 PM, Pulkit Singhal > wrote: > >> I'm getting the following exception >> SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'desc' >> >> I'm wondering what I need to do in order to add the "desc" field to >> the Solr schema for indexing? >> >
Run Solr within my war
Hello Everyone, I do NOT want to host Solr separately. I want to run it within my war, alongside the Java application that is using it. How easy/difficult is that to set up? Can anyone with past experience on this topic please comment? thanks, - Pulkit
Re: Run Solr within my war
Yeah, I have been pitching that, but I want all the functionality of Solr in a small package; scalability is not a concern given the specifically limited data set being searched. I understand that the # of users is another part of this equation, but there just aren't that many at this time, and having it separate will add to deployment complexity and kill the product before it ever takes off. Adoption is key for me. On Thu, Feb 18, 2010 at 2:25 PM, Dave Searle wrote: > Why would you want to? Surely having it seperate increases scalablity? > > On 18 Feb 2010, at 22:23, "Pulkit Singhal" > wrote: > >> Hello Everyone, >> >> I do NOT want to host Solr separately. I want to run it within my war >> with the Java Application which is using it. How easy/difficult is >> that to setup? Can anyone with past experience on this topic, please >> comment. >> >> thanks, >> - Pulkit >
Re: Run Solr within my war
Using EmbeddedSolrServer is a client-side way of communicating with Solr via the file system; a Solr core and its config still have to be set up on disk before that. My question is more along the lines of how to bundle the server jars that provide the core functionality so that they start up within a war, which is also the application war for the program that will communicate with the Solr server as its client. On Thu, Feb 18, 2010 at 5:49 PM, Richard Frovarp wrote: > On 2/18/2010 4:22 PM, Pulkit Singhal wrote: >> Hello Everyone, >> >> I do NOT want to host Solr separately. I want to run it within my war >> with the Java Application which is using it. How easy/difficult is >> that to setup? Can anyone with past experience on this topic, please >> comment. >> >> thanks, >> - Pulkit >> >> > > So basically you're talking about running an embedded version of Solr like > the EmbeddedSolrServer? I have no experience on this, but this should > provide you the correct search term to find documentation on use. From what > little code I've seen to run test cases against Solr, it looks relatively > straight forward to get running. To use you would use the SolrJ library to > communicate with the embedded solr server. > > Richard >
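For what it's worth, EmbeddedSolrServer does not actually need a separate Solr instance running; it loads the core inside the calling JVM, so bundling solr-core plus its dependencies and a solr home directory into the war is essentially the whole story. A minimal sketch of the in-process usage against the 1.4-era API (class names and constructors shift between versions), assuming solr.solr.home points at a directory containing conf/solrconfig.xml and conf/schema.xml:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
        // Requires -Dsolr.solr.home=/path/to/solr/home (or a solr/ dir under the working dir).
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        SolrServer server = new EmbeddedSolrServer(container, ""); // "" = the default core

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "embedded-1");
        server.add(doc);
        server.commit();
        System.out.println(server.query(new SolrQuery("id:embedded-1")).getResults().size());

        container.shutdown();
    }
}

In a webapp the container would be created once (e.g. in a ServletContextListener) and shut down when the app is undeployed, rather than in a main method.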
Re: @Field annotation support
OK then, is this the correct class to support the @Field annotation? Because I have it on the path but it's not working. org\apache\solr\solr-solrj\1.4.0\solr-solrj-1.4.0.jar/org\apache\solr\client\solrj\beans\Field.class 2010/2/18 Noble Paul നോബിള് नोब्ळ् : > solrj jar > > On Thu, Feb 18, 2010 at 10:52 PM, Pulkit Singhal > wrote: >> Hello All, >> >> When I use Maven or Eclipse to try and compile my bean which has the >> @Field annotation as specified in http://wiki.apache.org/solr/Solrj >> page ... the compiler doesn't find any class to support the >> annotation. What jar should we use to bring in this custom Solr >> annotation? >> > > > > -- > - > Noble Paul | Systems Architect| AOL | http://aol.com >
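For reference, once solr-solrj-1.4.0.jar and its dependencies (commons-httpclient, slf4j-api, etc.) are on the compile classpath, the annotation from that class is used roughly as sketched below; the Item bean, its fields and the server URL are made up for illustration.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.beans.Field;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class FieldAnnotationExample {

    /** Simple bean; the annotated names must match fields declared in schema.xml. */
    public static class Item {
        @Field
        public String id;

        @Field("desc") // map a differently-named bean property onto the "desc" field
        public String description;
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Item item = new Item();
        item.id = "item-1";
        item.description = "indexed via the @Field annotation";
        server.addBean(item); // SolrJ's DocumentObjectBinder reads the @Field annotations
        server.commit();
    }
}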