OK, got it, works now. Maybe you can advise on something more general?
I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to crawl a list of webpages built according to a certain template, analyze certain fields in their HTML (each identified by a span class and consisting of a number), and output the results as CSV: a list of each website's domain together with the sum of the numbers in all the specified fields.

How should I set up the flow? Should I configure Nutch to pull only the relevant fields from each page, then use Solr to add up the integers in those fields and output them to a CSV? Or should I use Nutch to pull in everything from each relevant page and then use Solr to strip out the relevant fields and process them as above? Can I do the processing strictly in Solr, using what is described at <https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations>, or should I use PHP through Solarium or something along those lines? Your advice would be appreciated; I don't want to reinvent the wheel.

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:

> Thanks for bearing with me.
>
> I start Solr with `bin/solr start -e cloud` with 2 nodes. Then I get this:
>
>     Welcome to the SolrCloud example!
>
>     This interactive session will help you launch a SolrCloud cluster on your local workstation.
>
>     To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]
>     Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
>
>     Please enter the port for node1 [8983]
>     8983
>     Please enter the port for node2 [7574]
>     7574
>     Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1 into /home/ubuntu/crawler/solr/example/cloud/node2
>
>     Starting up SolrCloud node1 on port 8983 using command:
>
>     solr start -cloud -s example/cloud/node1/solr -p 8983
>
> I then go to http://localhost:8983/solr/admin/cores and get the following
> ("This XML file does not appear to have any style information associated
> with it. The document tree is shown below."):
>
>     <response>
>       <lst name="responseHeader">
>         <int name="status">0</int>
>         <int name="QTime">2</int>
>       </lst>
>       <lst name="initFailures"/>
>       <lst name="status">
>         <lst name="testCollection_shard1_replica1">
>           <str name="name">testCollection_shard1_replica1</str>
>           <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/</str>
>           <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/</str>
>           <str name="config">solrconfig.xml</str>
>           <str name="schema">schema.xml</str>
>           <date name="startTime">2015-03-01T06:59:12.296Z</date>
>           <long name="uptime">46380</long>
>           <lst name="index">
>             <int name="numDocs">0</int>
>             <int name="maxDoc">0</int>
>             <int name="deletedDocs">0</int>
>             <long name="indexHeapUsageBytes">0</long>
>             <long name="version">1</long>
>             <int name="segmentCount">0</int>
>             <bool name="current">true</bool>
>             <bool name="hasDeletions">false</bool>
>             <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>             <lst name="userData"/>
>             <long name="sizeInBytes">71</long>
>             <str name="size">71 bytes</str>
>           </lst>
>         </lst>
>         <lst name="testCollection_shard1_replica2">
>           <str name="name">testCollection_shard1_replica2</str>
>           <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/</str>
>           <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/</str>
>           <str name="config">solrconfig.xml</str>
>           <str name="schema">schema.xml</str>
>           <date name="startTime">2015-03-01T06:59:12.751Z</date>
>           <long name="uptime">45926</long>
>           <lst name="index">
>             <int name="numDocs">0</int>
>             <int name="maxDoc">0</int>
>             <int name="deletedDocs">0</int>
>             <long name="indexHeapUsageBytes">0</long>
>             <long name="version">1</long>
>             <int name="segmentCount">0</int>
>             <bool name="current">true</bool>
>             <bool name="hasDeletions">false</bool>
>             <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>             <lst name="userData"/>
>             <long name="sizeInBytes">71</long>
>             <str name="size">71 bytes</str>
>           </lst>
>         </lst>
>         <lst name="testCollection_shard2_replica1">
>           <str name="name">testCollection_shard2_replica1</str>
>           <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/</str>
>           <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/data/</str>
>           <str name="config">solrconfig.xml</str>
>           <str name="schema">schema.xml</str>
>           <date name="startTime">2015-03-01T06:59:12.596Z</date>
>           <long name="uptime">46081</long>
>           <lst name="index">
>             <int name="numDocs">0</int>
>             <int name="maxDoc">0</int>
>             <int name="deletedDocs">0</int>
>             <long name="indexHeapUsageBytes">0</long>
>             <long name="version">1</long>
>             <int name="segmentCount">0</int>
>             <bool name="current">true</bool>
>             <bool name="hasDeletions">false</bool>
>             <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>             <lst name="userData"/>
>             <long name="sizeInBytes">71</long>
>             <str name="size">71 bytes</str>
>           </lst>
>         </lst>
>         <lst name="testCollection_shard2_replica2">
>           <str name="name">testCollection_shard2_replica2</str>
>           <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica2/</str>
>           <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica2/data/</str>
>           <str name="config">solrconfig.xml</str>
>           <str name="schema">schema.xml</str>
>           <date name="startTime">2015-03-01T06:59:12.718Z</date>
>           <long name="uptime">45959</long>
>           <lst name="index">
>             <int name="numDocs">0</int>
>             <int name="maxDoc">0</int>
>             <int name="deletedDocs">0</int>
>             <long name="indexHeapUsageBytes">0</long>
>             <long name="version">1</long>
>             <int name="segmentCount">0</int>
>             <bool name="current">true</bool>
>             <bool name="hasDeletions">false</bool>
>             <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>             <lst name="userData"/>
>             <long name="sizeInBytes">71</long>
>             <str name="size">71 bytes</str>
>           </lst>
>         </lst>
>       </lst>
>     </response>
>
> I do not seem to have a gettingstarted collection.
>
> Sincerely,
>
> Baruch Kogan
>
> On Fri, Feb 27, 2015 at 12:00 AM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>
>> I’m sorry, I’m not following exactly.
>>
>> Somehow you no longer have a gettingstarted collection, but it is not
>> clear how that happened.
>>
>> Could you post the exact script steps you used that got you this error?
>>
>> What collections/cores does the Solr admin show you have? What are the
>> results of http://localhost:8983/solr/admin/cores ?
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com
>>
>> > On Feb 26, 2015, at 9:58 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:
>> >
>> > Oh, I see. I used the `start -e cloud` command, then ran through a setup
>> > with one core and default options for the rest, then tried to post the
>> > JSON example again, and got another error:
>> >
>> >     ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted example/exampledocs/*.json
>> >     /usr/lib/jvm/java-7-oracle/bin/java -classpath /home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/books.json
>> >     SimplePostTool version 5.0.0
>> >     Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
>> >     Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
>> >     POSTing file books.json (application/json) to [base]
>> >     SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/gettingstarted/update
>> >     SimplePostTool: WARNING: Response: <html>
>> >     <head>
>> >     <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
>> >     <title>Error 404 Not Found</title>
>> >     </head>
>> >     <body><h2>HTTP ERROR 404</h2>
>> >     <p>Problem accessing /solr/gettingstarted/update. Reason:
>> >     <pre>    Not Found</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/>
>> >
>> > Sincerely,
>> >
>> > Baruch Kogan
>> >
>> > On Thu, Feb 26, 2015 at 4:07 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>> >
>> >> How did you start Solr? If you started with `bin/solr start -e cloud`
>> >> you’ll have a gettingstarted collection created automatically; otherwise
>> >> you’ll need to create it yourself with `bin/solr create -c gettingstarted`.
>> >>
>> >> —
>> >> Erik Hatcher, Senior Solutions Architect
>> >> http://www.lucidworks.com
>> >>
>> >>> On Feb 26, 2015, at 4:53 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:
>> >>>
>> >>> Hi, I've just installed Solr (I'll be controlling it with Solarium and
>> >>> using it to search Nutch queries). I'm working through the getting-started
>> >>> tutorials described here:
>> >>> https://cwiki.apache.org/confluence/display/solr/Running+Solr
>> >>>
>> >>> When I try to run `bin/post -c gettingstarted example/exampledocs/*.json`,
>> >>> I get a bunch of errors about there not being a gettingstarted folder
>> >>> in /solr/. Is this normal? Should I create one?
>> >>>
>> >>> Sincerely,
>> >>>
>> >>> Baruch Kogan
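For the pipeline question at the top of the thread, a minimal shell sketch of the per-page arithmetic (the span class `count`, the example URL, and the choice to sum client-side rather than at index time are assumptions for illustration, not anything stated in the thread):

```shell
#!/bin/sh
# extract_sum: read one page's HTML on stdin and print the sum of the
# integers wrapped in <span class="count">N</span> elements.
# The class name "count" stands in for whatever class the template uses.
extract_sum() {
  grep -o '<span class="count">[0-9][0-9]*</span>' \
    | grep -o '[0-9][0-9]*' \
    | awk '{ s += $1 } END { print s + 0 }'
}

# csv_row: fetch one URL with curl and emit a "domain,sum" CSV row,
# taking the domain as the third slash-separated field of the URL.
csv_row() {
  url="$1"
  domain=$(printf '%s\n' "$url" | awk -F/ '{ print $3 }')
  printf '%s,%s\n' "$domain" "$(curl -s "$url" | extract_sum)"
}

# Example usage (hypothetical URL):
#   echo 'domain,sum'
#   csv_row 'http://example.com/page.html'
```

Whether this arithmetic lives in a client script like the above, in PHP via Solarium, or at index time (for example, extracting the numbers during parsing into an integer field so Solr can aggregate them) is exactly the design choice the question asks about; the sketch only shows the extraction and summing step.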