I also took out my requestHandler and used the standard /update/extract handler. Same result.
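For reference, exercising the stock extraction handler directly looks roughly like this (the port, file name, and literal.id value are placeholders, and id is assumed to be the schema's uniqueKey; unlike /update/sharepoint, the standard handler does not run the sharepoint-pipeline chain, so the key has to be passed in explicitly):

curl "http://localhost:8081/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@test.doc"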
On Wed, Feb 29, 2012 at 11:47 AM, Matthew Parker <mpar...@apogeeintegration.com> wrote:

> I tried running SOLR Cloud with the default number of shards (i.e. 1), and I get the same results.
>
> On Wed, Feb 29, 2012 at 10:46 AM, Matthew Parker <mpar...@apogeeintegration.com> wrote:
>
>> Mark,
>>
>> Nothing appears to be wrong in the logs. I wiped the indexes and imported 37 files from SharePoint using Manifold. All 37 make it in, but SOLR still has issues with the results being inconsistent.
>>
>> Let me run my setup by you, and see whether that is the issue.
>>
>> On one machine, I have three zookeeper instances, four solr instances, and a data directory for solr and zookeeper config data.
>>
>> Step 1 - I modified each zoo.cfg configuration file to have:
>>
>> Zookeeper 1 - Create /zookeeper1/conf/zoo.cfg
>> ================
>> tickTime=2000
>> initLimit=10
>> syncLimit=5
>> dataDir=[DATA_DIRECTORY]/zk1_data
>> clientPort=2181
>> server.1=localhost:2888:3888
>> server.2=localhost:2889:3889
>> server.3=localhost:2890:3890
>>
>> Zookeeper 1 - Create /[DATA_DIRECTORY]/zk1_data/myid with the following contents:
>> ================
>> 1
>>
>> Zookeeper 2 - Create /zookeeper2/conf/zoo.cfg
>> ================
>> tickTime=2000
>> initLimit=10
>> syncLimit=5
>> dataDir=[DATA_DIRECTORY]/zk2_data
>> clientPort=2182
>> server.1=localhost:2888:3888
>> server.2=localhost:2889:3889
>> server.3=localhost:2890:3890
>>
>> Zookeeper 2 - Create /[DATA_DIRECTORY]/zk2_data/myid with the following contents:
>> ================
>> 2
>>
>> Zookeeper 3 - Create /zookeeper3/conf/zoo.cfg
>> ================
>> tickTime=2000
>> initLimit=10
>> syncLimit=5
>> dataDir=[DATA_DIRECTORY]/zk3_data
>> clientPort=2183
>> server.1=localhost:2888:3888
>> server.2=localhost:2889:3889
>> server.3=localhost:2890:3890
>>
>> Zookeeper 3 - Create /[DATA_DIRECTORY]/zk3_data/myid with the following contents:
>> ================
>> 3
>>
>> Step 2 - SOLR Build
>> ================
>>
>> I pulled the latest SOLR trunk down and built it with the following command:
>>
>> ant example dist
>>
>> I modified the solr.war files and added the solr cell and extraction libraries to WEB-INF/lib. I couldn't get the extraction to work any other way. Will ZooKeeper pick up jar files stored with the rest of the configuration files in ZooKeeper?
>>
>> I copied the contents of the example directory to each of my SOLR directories.
>>
>> Step 3 - Starting Zookeeper instances
>> ================
>>
>> I ran the following commands to start the zookeeper instances:
>>
>> start .\zookeeper1\bin\zkServer.cmd
>> start .\zookeeper2\bin\zkServer.cmd
>> start .\zookeeper3\bin\zkServer.cmd
>>
>> Step 4 - Start Main SOLR instance
>> ================
>>
>> I ran the following command to start the main SOLR instance:
>>
>> java -Djetty.port=8081 -Dhostport=8081 -Dbootstrap_configdir=[DATA_DIRECTORY]/solr/conf -Dnumshards=2 -Dzkhost=localhost:2181,localhost:2182,localhost:2183 -jar start.jar
>>
>> Starts up fine.
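As an aside, before starting any of the Solr nodes it can be worth confirming that the three ZooKeeper instances actually formed an ensemble. A minimal check with the stock client that ships next to zkServer.cmd, using the client ports configured above:

.\zookeeper1\bin\zkCli.cmd -server localhost:2181,localhost:2182,localhost:2183
ls /

Once the bootstrap node has pushed its configuration, the same ls / should show znodes along the lines of /configs, /live_nodes, /collections and /clusterstate.json, matching the dump further down this thread.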
>> Step 5 - Start the Remaining 3 SOLR Instances
>> ================
>>
>> I ran the following commands to start the other 3 instances from their home directories:
>>
>> java -Djetty.port=8082 -Dhostport=8082 -Dzkhost=localhost:2181,localhost:2182,localhost:2183 -jar start.jar
>>
>> java -Djetty.port=8083 -Dhostport=8083 -Dzkhost=localhost:2181,localhost:2182,localhost:2183 -jar start.jar
>>
>> java -Djetty.port=8084 -Dhostport=8084 -Dzkhost=localhost:2181,localhost:2182,localhost:2183 -jar start.jar
>>
>> All start up without issue.
>>
>> Step 6 - Modified solrconfig.xml to have a custom request handler
>> ================
>>
>> <requestHandler name="/update/sharepoint" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
>>   <lst name="defaults">
>>     <str name="update.chain">sharepoint-pipeline</str>
>>     <str name="fmap.content">text</str>
>>     <str name="lowernames">true</str>
>>     <str name="uprefix">ignored</str>
>>     <str name="captureAttr">true</str>
>>     <str name="fmap.a">links</str>
>>     <str name="fmap.div">ignored</str>
>>   </lst>
>> </requestHandler>
>>
>> <updateRequestProcessorChain name="sharepoint-pipeline">
>>   <processor class="solr.processor.SignatureUpdateProcessorFactory">
>>     <bool name="enabled">true</bool>
>>     <str name="signatureField">id</str>
>>     <bool name="overwriteDupes">true</bool>
>>     <str name="fields">url</str>
>>     <str name="signatureClass">solr.processor.Lookup3Signature</str>
>>   </processor>
>>   <processor class="solr.LogUpdateProcessorFactory"/>
>>   <processor class="solr.RunUpdateProcessorFactory"/>
>> </updateRequestProcessorChain>
>>
>> Hopefully this will shed some light on why my configuration is having issues.
>>
>> Thanks for your help.
>>
>> Matt
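A quick way to sanity-check this handler and the signature-based dedup is to post the same document twice with the same url and confirm the document count only goes up by one. A sketch with placeholder file and url values; the /solr/select path assumes the default core, so adjust it if your URLs include a core name:

curl "http://localhost:8081/solr/update/sharepoint?literal.url=http://sharepoint/docs/test.doc&commit=true" -F "myfile=@test.doc"   (run this twice)
curl "http://localhost:8081/solr/select?q=*:*&rows=0"

Since the chain derives id from url via Lookup3Signature, the second post should overwrite the first rather than create a duplicate.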
>> On Tue, Feb 28, 2012 at 8:29 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Hmm...this is very strange - there is nothing interesting in any of the logs?
>>>
>>> In clusterstate.json, all of the shards have an active state?
>>>
>>> There are quite a few of us doing exactly this setup recently, so there must be something we are missing here...
>>>
>>> Any info you can offer might help.
>>>
>>> - Mark
>>>
>>> On Feb 28, 2012, at 1:00 PM, Matthew Parker wrote:
>>>
>>> > Mark,
>>> >
>>> > I got the codebase from 2/26/2012, and I got the same inconsistent results.
>>> >
>>> > I have solr running on four ports, 8081-8084.
>>> >
>>> > 8081 and 8082 are the leaders for shard 1 and shard 2, respectively.
>>> > 8083 is assigned to shard 1.
>>> > 8084 is assigned to shard 2.
>>> >
>>> > Queries come in, and sometimes it seems the windows for 8081 and 8083 move as they respond to the query, but there are no results.
>>> >
>>> > If the queries run on 8081/8082 or 8081/8084, then results come back OK.
>>> >
>>> > The query is nothing more than: q=*:*
>>> >
>>> > Regards,
>>> >
>>> > Matt
>>> >
>>> > On Mon, Feb 27, 2012 at 9:26 PM, Matthew Parker <mpar...@apogeeintegration.com> wrote:
>>> >
>>> >> I'll have to check on the commit situation. We have been pushing data from SharePoint the last week or so. Would that somehow block the documents moving between the solr instances?
>>> >>
>>> >> I'll try another version tomorrow. Thanks for the suggestions.
>>> >>
>>> >> On Mon, Feb 27, 2012 at 5:34 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> >>
>>> >>> Hmmm...all of that looks pretty normal...
>>> >>>
>>> >>> Did a commit somehow fail on the other machine? When you view the stats for the update handler, are there a lot of pending adds for one of the nodes? Do the commit counts match across nodes?
>>> >>>
>>> >>> You can also query an individual node with distrib=false to check that.
>>> >>>
>>> >>> If your build is a month old, I'd honestly recommend you try upgrading as well.
>>> >>>
>>> >>> - Mark
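For the record, the per-node check suggested here looks something like the following, with the ports as used elsewhere in this thread (add the core name to the path if your URLs include it, and rows=0 just to compare counts). numFound should match between the two replicas of each shard:

http://localhost:8081/solr/select?q=*:*&distrib=false&rows=0   (shard 1 leader)
http://localhost:8083/solr/select?q=*:*&distrib=false&rows=0   (shard 1 replica)
http://localhost:8082/solr/select?q=*:*&distrib=false&rows=0   (shard 2 leader)
http://localhost:8084/solr/select?q=*:*&distrib=false&rows=0   (shard 2 replica)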
>>> >>> On Feb 27, 2012, at 3:34 PM, Matthew Parker wrote:
>>> >>>
>>> >>>> Here is most of the cluster state:
>>> >>>>
>>> >>>> Connected to Zookeeper
>>> >>>> localhost:2181, localhost:2182, localhost:2183
>>> >>>>
>>> >>>> /(v=0 children=7) ""
>>> >>>>   /CONFIGS(v=0 children=1)
>>> >>>>     /CONFIGURATION(v=0 children=25)
>>> >>>>       <<<<< all the configuration files, velocity info, xslt, etc. >>>>>
>>> >>>>   /NODE_STATES(v=0 children=4)
>>> >>>>     MACHINE1:8083_SOLR (v=121) "[{"shard_id":"shard1","state":"active","core":"","collection":"collection1","node_name":"..."
>>> >>>>     MACHINE1:8082_SOLR (v=101) "[{"shard_id":"shard2","state":"active","core":"","collection":"collection1","node_name":"..."
>>> >>>>     MACHINE1:8081_SOLR (v=92) "[{"shard_id":"shard1","state":"active","core":"","collection":"collection1","node_name":"..."
>>> >>>>     MACHINE1:8084_SOLR (v=73) "[{"shard_id":"shard2","state":"active","core":"","collection":"collection1","node_name":"..."
>>> >>>>   /ZOOKEEPER (v=0 children=1)
>>> >>>>     QUOTA(v=0)
>>> >>>>   /CLUSTERSTATE.JSON(v=272) "{"collection1":{"shard1":{"MACHINE1:8081_solr_":{"shard_id":"shard1","leader":"true","..."
>>> >>>>   /LIVE_NODES (v=0 children=4)
>>> >>>>     MACHINE1:8083_SOLR(ephemeral v=0)
>>> >>>>     MACHINE1:8082_SOLR(ephemeral v=0)
>>> >>>>     MACHINE1:8081_SOLR(ephemeral v=0)
>>> >>>>     MACHINE1:8084_SOLR(ephemeral v=0)
>>> >>>>   /COLLECTIONS (v=1 children=1)
>>> >>>>     COLLECTION1(v=0 children=2) "{"configName":"configuration1"}"
>>> >>>>       LEADER_ELECT(v=0 children=2)
>>> >>>>         SHARD1(v=0 children=1)
>>> >>>>           ELECTION(v=0 children=2)
>>> >>>>             87186203314552835-MACHINE1:8081_SOLR_-N_0000000096(ephemeral v=0)
>>> >>>>             87186203314552836-MACHINE1:8083_SOLR_-N_0000000084(ephemeral v=0)
>>> >>>>         SHARD2(v=0 children=1)
>>> >>>>           ELECTION(v=0 children=2)
>>> >>>>             231301391392833539-MACHINE1:8084_SOLR_-N_0000000085(ephemeral v=0)
>>> >>>>             159243797356740611-MACHINE1:8082_SOLR_-N_0000000084(ephemeral v=0)
>>> >>>>       LEADERS (v=0 children=2)
>>> >>>>         SHARD1 (ephemeral v=0) "{"core":"","node_name":"MACHINE1:8081_solr","base_url":"http://MACHINE1:8081/solr"}"
>>> >>>>         SHARD2 (ephemeral v=0) "{"core":"","node_name":"MACHINE1:8082_solr","base_url":"http://MACHINE1:8082/solr"}"
>>> >>>>   /OVERSEER_ELECT (v=0 children=2)
>>> >>>>     ELECTION (v=0 children=4)
>>> >>>>       231301391392833539-MACHINE1:8084_SOLR_-N_0000000251(ephemeral v=0)
>>> >>>>       87186203314552835-MACHINE1:8081_SOLR_-N_0000000248(ephemeral v=0)
>>> >>>>       159243797356740611-MACHINE1:8082_SOLR_-N_0000000250(ephemeral v=0)
>>> >>>>       87186203314552836-MACHINE1:8083_SOLR_-N_0000000249(ephemeral v=0)
>>> >>>>     LEADER (ephemeral v=0) "{"id":"87186203314552835-MACHINE1:8081_solr-n_000000248"}"
>>> >>>>
>>> >>>> On Mon, Feb 27, 2012 at 2:47 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> >>>>
>>> >>>>> On Feb 27, 2012, at 2:22 PM, Matthew Parker wrote:
>>> >>>>>
>>> >>>>>> Thanks for your reply Mark.
>>> >>>>>>
>>> >>>>>> I believe the build was towards the beginning of the month. The solr.spec.version is 4.0.0.2012.01.10.38.09.
>>> >>>>>>
>>> >>>>>> I cannot access the clusterstate.json contents. I clicked on it a couple of times, but nothing happens. Is that stored on disk somewhere?
>>> >>>>>
>>> >>>>> Are you using the new admin UI? That has recently been updated to work better with cloud - it had some troubles not too long ago. If you are, you should try using the old admin UI's zookeeper page - that should show the cluster state.
>>> >>>>>
>>> >>>>> That being said, there have been a lot of bug fixes over the past month - so you may just want to update to a recent version.
>>> >>>>>
>>> >>>>>> I configured a custom request handler to calculate a unique document id based on the file's url.
>>> >>>>>>
>>> >>>>>> On Mon, Feb 27, 2012 at 1:13 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>>> Hey Matt - is your build recent?
>>> >>>>>>>
>>> >>>>>>> Can you visit the cloud/zookeeper page in the admin and send the contents of the clusterstate.json node?
>>> >>>>>>>
>>> >>>>>>> Are you using a custom index chain or anything out of the ordinary?
>>> >>>>>>>
>>> >>>>>>> - Mark
>>> >>>>>>>
>>> >>>>>>> On Feb 27, 2012, at 12:26 PM, Matthew Parker wrote:
>>> >>>>>>>
>>> >>>>>>>> TWIMC:
>>> >>>>>>>>
>>> >>>>>>>> Environment
>>> >>>>>>>> =========
>>> >>>>>>>> Apache SOLR rev-1236154
>>> >>>>>>>> Apache Zookeeper 3.3.4
>>> >>>>>>>> Windows 7
>>> >>>>>>>> JDK 1.6.0_23.b05
>>> >>>>>>>>
>>> >>>>>>>> I have built a SOLR Cloud instance with 4 nodes using the embedded Jetty servers.
>>> >>>>>>>>
>>> >>>>>>>> I created a 3 node zookeeper ensemble to manage the solr configuration data.
>>> >>>>>>>>
>>> >>>>>>>> All the instances run on one server so I've had to move ports around for the various applications.
>>> >>>>>>>>
>>> >>>>>>>> I start the 3 zookeeper nodes.
>>> >>>>>>>>
>>> >>>>>>>> I started the first instance of solr cloud with the parameter to have two shards.
>>> >>>>>>>>
>>> >>>>>>>> Then I start the remaining 3 solr nodes.
>>> >>>>>>>>
>>> >>>>>>>> The system comes up fine. No errors thrown.
>>> >>>>>>>>
>>> >>>>>>>> I can view the solr cloud console and I can see the SOLR configuration files managed by ZooKeeper.
>>> >>>>>>>>
>>> >>>>>>>> I published data into the SOLR Cloud instances from SharePoint using Apache Manifold 0.4-incubating. Manifold is set up to publish the data into collection1, which is the only collection defined in the cluster.
>>> >>>>>>>>
>>> >>>>>>>> When I query the data from collection1 as per the solr wiki, the results are inconsistent. Sometimes all the results are there, other times nothing comes back at all.
>>> >>>>>>>>
>>> >>>>>>>> It seems to be having an issue auto replicating the data across the cloud.
>>> >>>>>>>>
>>> >>>>>>>> Is there some specific setting I might have missed? Based upon what I read, I thought that SOLR cloud would take care of distributing and replicating the data automatically. Do you have to tell it what shard to publish the data into as well?
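On the commit question raised earlier in the thread, one inexpensive check after Manifold finishes a push is to issue an explicit commit against any node and re-run the query; if the results then stabilize, the missing documents simply had not been committed yet rather than not replicated. A sketch, with the port as a placeholder:

curl "http://localhost:8081/solr/update" -H "Content-Type: text/xml" --data-binary "<commit/>"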
>>> >>>>>>>> Any help would be appreciated.
>>> >>>>>>>>
>>> >>>>>>>> Thanks,
>>> >>>>>>>>
>>> >>>>>>>> Matt
>>> >>>>>>>
>>> >>>>>>> - Mark Miller
>>> >>>>>>> lucidimagination.com
>>> >>>>>>
>>> >>>>>> Matt
>>> >>>>>
>>> >>>>> - Mark Miller
>>> >>>>> lucidimagination.com
>>> >>>
>>> >>> - Mark Miller
>>> >>> lucidimagination.com
>>>
>>> - Mark Miller
>>> lucidimagination.com