Getting started with Solr
Hi, I've just installed Solr (I'll be controlling it with Solarium and using it to search Nutch crawl results). I'm working through the getting-started tutorial described here:
https://cwiki.apache.org/confluence/display/solr/Running+Solr

When I try to run `bin/post -c gettingstarted example/exampledocs/*.json`, I get a bunch of errors about there not being a gettingstarted folder in /solr/. Is this normal? Should I create one?

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype
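A minimal sketch of creating the collection by hand, assuming a stock Solr 5.x install with the server already running on the default port (the reply below recommends the same command):

  # create a collection named "gettingstarted" with the default configuration
  bin/solr create -c gettingstarted

  # then re-run the post tool against it
  bin/post -c gettingstarted example/exampledocs/*.json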
Re: Getting started with Solr
Oh, I see. I used the start -e cloud command, then ran through a setup with one core and default options for the rest, then tried to post the JSON example again, and got another error:

  ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted example/exampledocs/*.json
  /usr/lib/jvm/java-7-oracle/bin/java -classpath /home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/books.json
  SimplePostTool version 5.0.0
  Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
  Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
  POSTing file books.json (application/json) to [base]
  SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/gettingstarted/update
  SimplePostTool: WARNING: Response: Error 404 Not Found
  HTTP ERROR 404
  Problem accessing /solr/gettingstarted/update. Reason: Not Found. Powered by Jetty://

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype

On Thu, Feb 26, 2015 at 4:07 PM, Erik Hatcher wrote:

> How did you start Solr? If you started with `bin/solr start -e cloud`
> you'll have a gettingstarted collection created automatically, otherwise
> you'll need to create it yourself with `bin/solr create -c gettingstarted`
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
> On Feb 26, 2015, at 4:53 AM, Baruch Kogan wrote:
> > [...]
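A 404 from /solr/gettingstarted/update means Jetty is running but no core or collection by that name is registered, so it can help to verify the name before posting; a quick sketch, assuming the default port:

  # 200 plus a JSON body means the collection exists and is queryable;
  # a 404 here matches the SimplePostTool warning above
  curl -i "http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json"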
Re: Getting started with Solr
Thanks for bearing with me. I start Solr with `bin/solr start -e cloud` with 2 nodes. Then I get this:

  Welcome to the SolrCloud example!

  This interactive session will help you launch a SolrCloud cluster on your local workstation.

  To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]
  Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
  Please enter the port for node1 [8983]
  8983
  Please enter the port for node2 [7574]
  7574
  Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1 into /home/ubuntu/crawler/solr/example/cloud/node2
  Starting up SolrCloud node1 on port 8983 using command:
  solr start -cloud -s example/cloud/node1/solr -p 8983

I then go to http://localhost:8983/solr/admin/cores and the core STATUS response lists four cores, all replicas of a collection named testCollection and all empty (numDocs 0, segmentCount 0, index size 71 bytes each):

  testCollection_shard1_replica1  (startTime 2015-03-01T06:59:12.296Z)
  testCollection_shard1_replica2  (startTime 2015-03-01T06:59:12.751Z)
  testCollection_shard2_replica1  (startTime 2015-03-01T06:59:12.596Z)
  testCollection_shard2_replica2  (startTime 2015-03-01T06:59:12.718Z)

  (instanceDirs under /home/ubuntu/crawler/solr/example/cloud/node1/solr/;
  directory org.apache.lucene.store.NRTCachingDirectory over MMapDirectory,
  maxCacheMB=48.0, maxMergeSizeMB=4.0)

I do not seem to have a gettingstarted collection.

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype

On Fri, Feb 27, 2015 at 12:00 AM, Erik Hatcher wrote:

> I'm sorry, I'm not following exactly.
>
> Somehow you no longer have a gettingstarted collection, but it is not
> clear how that happened.
>
> Could you post the exact script steps you used that got you this error?
>
> What collections/cores does the Solr admin show you have? What are the
> results of http://localhost:8983/solr/admin/cores ?
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
> On Feb 26, 2015, at 9:58 AM, Baruch Kogan wrote:
> > [...]
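Erik's two questions can also be answered from the command line; a sketch, assuming SolrCloud mode on the default port:

  # collections known to the cluster (Collections API LIST action)
  curl "http://localhost:8983/solr/admin/collections?action=LIST&wt=json"

  # status of the cores hosted by this node (same data as the admin/cores URL above)
  curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"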
Re: Getting started with Solr
OK, got it, works now. Maybe you can advise on something more general?

I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to crawl a list of webpages built according to a certain template, and analyze certain fields in their HTML (identified by a span class and consisting of a number), then output the results as CSV to generate a list with each website's domain and the sum of the numbers in all the specified fields.

How should I set up the flow? Should I configure Nutch to only pull the relevant fields from each page, then use Solr to add the integers in those fields and output to a CSV? Or should I use Nutch to pull in everything from the relevant page and then use Solr to strip out the relevant fields and process them as above? Can I do the processing strictly in Solr, using the material found here
<https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations>,
or should I use PHP through Solarium or something along those lines?

Your advice would be appreciated; I don't want to reinvent the wheel.

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan wrote:

> [...]
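Much of the "sum a numeric field per domain" step can happen inside Solr once the pages are indexed. A sketch using the StatsComponent and the built-in CSV response writer; the collection name gettingstarted and the fields domain and item_value are hypothetical stand-ins for whatever the schema ends up using, and both fields would need to be indexed/stored:

  # sum (plus count, min, max, ...) of item_value, broken out per domain value
  curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=0&stats=true&stats.field=item_value&stats.facet=domain&wt=json"

  # raw rows as CSV, if the aggregation is done outside Solr instead
  curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&fl=domain,item_value&wt=csv&rows=1000"

Extracting the span contents out of the raw HTML still has to happen before or at index time (for example a Nutch parse plugin or a small client script feeding Solr); Solr aggregates only what is already in its fields.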
Integrating Solr with Nutch
Hi, guys, I'm working through the tutorial here
<http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch>.

I've run a crawl on a list of webpages. Now I'm trying to index them into Solr. Solr is installed, runs fine, indexes .json, .xml, whatever, and returns queries. I've edited the Nutch schema as per the instructions. Now I hit a wall:

- Save the file and restart Solr under ${APACHE_SOLR_HOME}/example: java -jar start.jar

On my install (the latest Solr), there is no such file, but there is a solr.sh file in /bin which I can start. So I pasted it into solr/example/ and ran it from there. Solr cranks over. Now I need to:

- run the Solr Index command from ${NUTCH_RUNTIME_HOME}: bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/

and I get this:

  ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
  Indexer: starting at 2015-03-01 19:51:09
  Indexer: deleting gone documents: false
  Indexer: URL filtering: false
  Indexer: URL normalizing: false
  Active IndexWriters :
  SOLRIndexWriter
      solr.server.url : URL of the SOLR instance (mandatory)
      solr.commit.size : buffer size when sending to SOLR (default 1000)
      solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
      solr.auth : use authentication (default false)
      solr.auth.username : use authentication (default false)
      solr.auth : username for authentication
      solr.auth.password : password for authentication
  Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch
  Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse
  Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data
  Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text
  Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/crawldb/current
  Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/linkdb/current
      at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
      at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
      at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
      at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
      at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
      at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:415)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
      at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
      at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

What am I doing wrong?
Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype
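The "Input path does not exist" errors show the indexer treating crawl/segments/ itself as a single segment (hence the missing crawl_fetch, crawl_parse, parse_data and parse_text subdirectories); Nutch segments actually live in timestamped subdirectories under segments/. A sketch of the two usual invocations, assuming the crawl output really is under crawl/ in the Nutch runtime directory (the timestamp below is a made-up example):

  # pass one concrete segment directory explicitly
  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/20150301185432/

  # or let the indexer pick up every segment under the parent directory
  bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments/

The missing crawl/crawldb/current and crawl/linkdb/current paths suggest the crawl may have written its output somewhere else entirely, so it is worth confirming those directories exist before re-running.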