OK, got it, works now. Maybe you can advise on something more general?
I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to crawl a list of webpages built according to a certain template, analyze certain fields in their HTML (each identified by a span class and consisting of a number), and output the results as CSV: a list of each website's domain together with the sum of the numbers in all the specified fields.

How should I set up the flow? Should I configure Nutch to pull only the relevant fields from each page, then use Solr to add up the integers in those fields and output them to a CSV? Or should I use Nutch to pull in everything from each relevant page and then use Solr to strip out the relevant fields and process them as above? Can I do the processing strictly in Solr, using what is described at <https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations>, or should I use PHP through Solarium or something along those lines? Your advice would be appreciated; I don't want to reinvent the wheel.

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:

> Thanks for bearing with me.
>
> I start Solr with `bin/solr start -e cloud` with 2 nodes. Then I get this:
>
>     Welcome to the SolrCloud example!
>
>     This interactive session will help you launch a SolrCloud cluster on your local workstation.
>
>     To begin, how many Solr nodes would you like to run in your local cluster? (specify 1-4 nodes) [2]
>     Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
>
>     Please enter the port for node1 [8983]
>     8983
>     Please enter the port for node2 [7574]
>     7574
>     Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1 into /home/ubuntu/crawler/solr/example/cloud/node2
>
>     Starting up SolrCloud node1 on port 8983 using command:
>
>     solr start -cloud -s example/cloud/node1/solr -p 8983
>
> I then go to http://localhost:8983/solr/admin/cores and get the following
> ("This XML file does not appear to have any style information associated
> with it. The document tree is shown below."):
>
>     <response>
>       <lst name="responseHeader">
>         <int name="status">0</int>
>         <int name="QTime">2</int>
>       </lst>
>       <lst name="initFailures"/>
>       <lst name="status">
>         <lst name="testCollection_shard1_replica1">
>           <str name="name">testCollection_shard1_replica1</str>
>           <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/</str>
>           <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/</str>
>           <str name="config">solrconfig.xml</str>
>           <str name="schema">schema.xml</str>
>           <date name="startTime">2015-03-01T06:59:12.296Z</date>
>           <long name="uptime">46380</long>
>           <lst name="index">
>             <int name="numDocs">0</int>
>             <int name="maxDoc">0</int>
>             <int name="deletedDocs">0</int>
>             <long name="indexHeapUsageBytes">0</long>
>             <long name="version">1</long>
>             <int name="segmentCount">0</int>
>             <bool name="current">true</bool>
>             <bool name="hasDeletions">false</bool>
>             <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>             <lst name="userData"/>
>             <long name="sizeInBytes">71</long>
>             <str name="size">71 bytes</str>
>           </lst>
>         </lst>
>         <lst name="testCollection_shard1_replica2">
>           <str name="name">testCollection_shard1_replica2</str>
>           <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/</str>
>           <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/</str>
>           <str name="config">solrconfig.xml</str>
>           <str name="schema">schema.xml</str>
>           <date name="startTime">2015-03-01T06:59:12.751Z</date>
>           <long name="uptime">45926</long>
>           <lst name="index">
>             <int name="numDocs">0</int>
>             <int name="maxDoc">0</int>
>             <int name="deletedDocs">0</int>
>             <long name="indexHeapUsageBytes">0</long>
>             <long name="version">1</long>
>             <int name="segmentCount">0</int>
>             <bool name="current">true</bool>
>             <bool name="hasDeletions">false</bool>
>             <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard1_replica2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>             <lst name="userData"/>
>             <long name="sizeInBytes">71</long>
>             <str name="size">71 bytes</str>
>           </lst>
>         </lst>
>         <lst name="testCollection_shard2_replica1">
>           <str name="name">testCollection_shard2_replica1</str>
>           <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/</str>
>           <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/data/</str>
>           <str name="config">solrconfig.xml</str>
>           <str name="schema">schema.xml</str>
>           <date name="startTime">2015-03-01T06:59:12.596Z</date>
>           <long name="uptime">46081</long>
>           <lst name="index">
>             <int name="numDocs">0</int>
>             <int name="maxDoc">0</int>
>             <int name="deletedDocs">0</int>
>             <long name="indexHeapUsageBytes">0</long>
>             <long name="version">1</long>
>             <int name="segmentCount">0</int>
>             <bool name="current">true</bool>
>             <bool name="hasDeletions">false</bool>
>             <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica1/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>             <lst name="userData"/>
>             <long name="sizeInBytes">71</long>
>             <str name="size">71 bytes</str>
>           </lst>
>         </lst>
>         <lst name="testCollection_shard2_replica2">
>           <str name="name">testCollection_shard2_replica2</str>
>           <str name="instanceDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica2/</str>
>           <str name="dataDir">/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica2/data/</str>
>           <str name="config">solrconfig.xml</str>
>           <str name="schema">schema.xml</str>
>           <date name="startTime">2015-03-01T06:59:12.718Z</date>
>           <long name="uptime">45959</long>
>           <lst name="index">
>             <int name="numDocs">0</int>
>             <int name="maxDoc">0</int>
>             <int name="deletedDocs">0</int>
>             <long name="indexHeapUsageBytes">0</long>
>             <long name="version">1</long>
>             <int name="segmentCount">0</int>
>             <bool name="current">true</bool>
>             <bool name="hasDeletions">false</bool>
>             <str name="directory">org.apache.lucene.store.NRTCachingDirectory:NRTCachingDirectory(MMapDirectory@/home/ubuntu/crawler/solr/example/cloud/node1/solr/testCollection_shard2_replica2/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2a4f8f8b; maxCacheMB=48.0 maxMergeSizeMB=4.0)</str>
>             <lst name="userData"/>
>             <long name="sizeInBytes">71</long>
>             <str name="size">71 bytes</str>
>           </lst>
>         </lst>
>       </lst>
>     </response>
>
> I do not seem to have a gettingstarted collection.
>
> Sincerely,
>
> Baruch Kogan
>
> On Fri, Feb 27, 2015 at 12:00 AM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>
>> I’m sorry, I’m not following exactly.
>>
>> Somehow you no longer have a gettingstarted collection, but it is not
>> clear how that happened.
>>
>> Could you post the exact script steps you used that got you this error?
>>
>> What collections/cores does the Solr admin show you have? What are the
>> results of http://localhost:8983/solr/admin/cores ?
>>
>> —
>> Erik Hatcher, Senior Solutions Architect
>> http://www.lucidworks.com
>>
>> > On Feb 26, 2015, at 9:58 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:
>> >
>> > Oh, I see. I used the `start -e cloud` command, then ran through a setup
>> > with one core and default options for the rest, then tried to post the
>> > JSON example again, and got another error:
>> >
>> >     ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted example/exampledocs/*.json
>> >     /usr/lib/jvm/java-7-oracle/bin/java -classpath /home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool example/exampledocs/books.json
>> >     SimplePostTool version 5.0.0
>> >     Posting files to [base] url http://localhost:8983/solr/gettingstarted/update...
>> >     Entering auto mode. File endings considered are xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
>> >     POSTing file books.json (application/json) to [base]
>> >     SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url: http://localhost:8983/solr/gettingstarted/update
>> >     SimplePostTool: WARNING: Response: <html>
>> >     <head>
>> >     <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
>> >     <title>Error 404 Not Found</title>
>> >     </head>
>> >     <body><h2>HTTP ERROR 404</h2>
>> >     <p>Problem accessing /solr/gettingstarted/update. Reason:
>> >     <pre>    Not Found</pre></p><hr /><i><small>Powered by Jetty://</small></i><br/>
>> >
>> > Sincerely,
>> >
>> > Baruch Kogan
>> >
>> > On Thu, Feb 26, 2015 at 4:07 PM, Erik Hatcher <erik.hatc...@gmail.com> wrote:
>> >
>> >> How did you start Solr? If you started with `bin/solr start -e cloud`
>> >> you’ll have a gettingstarted collection created automatically; otherwise
>> >> you’ll need to create it yourself with `bin/solr create -c gettingstarted`.
>> >>
>> >> —
>> >> Erik Hatcher, Senior Solutions Architect
>> >> http://www.lucidworks.com
>> >>
>> >>> On Feb 26, 2015, at 4:53 AM, Baruch Kogan <bar...@sellerpanda.com> wrote:
>> >>>
>> >>> Hi, I've just installed Solr (I'll be controlling it with Solarium and
>> >>> using it to search Nutch queries). I'm working through the getting-started
>> >>> tutorials described here:
>> >>> https://cwiki.apache.org/confluence/display/solr/Running+Solr
>> >>>
>> >>> When I try to run `bin/post -c gettingstarted example/exampledocs/*.json`,
>> >>> I get a bunch of errors about there not being a gettingstarted folder
>> >>> in /solr/. Is this normal? Should I create one?
>> >>>
>> >>> Sincerely,
>> >>>
>> >>> Baruch Kogan
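For the pipeline question at the top of the thread, a minimal shell sketch of the per-page arithmetic (the span class `count`, the example URL, and the choice to sum client-side rather than at index time are assumptions for illustration, not anything stated in the thread):

```shell
#!/bin/sh
# extract_sum: read one page's HTML on stdin and print the sum of the
# integers wrapped in <span class="count">N</span> elements.
# The class name "count" stands in for whatever class the template uses.
extract_sum() {
  grep -o '<span class="count">[0-9][0-9]*</span>' \
    | grep -o '[0-9][0-9]*' \
    | awk '{ s += $1 } END { print s + 0 }'
}

# csv_row: fetch one URL with curl and emit a "domain,sum" CSV row,
# taking the domain as the third slash-separated field of the URL.
csv_row() {
  url="$1"
  domain=$(printf '%s\n' "$url" | awk -F/ '{ print $3 }')
  printf '%s,%s\n' "$domain" "$(curl -s "$url" | extract_sum)"
}

# Example usage (hypothetical URL):
#   echo 'domain,sum'
#   csv_row 'http://example.com/page.html'
```

Whether this arithmetic lives in a client script like the above, in PHP via Solarium, or at index time (for example, extracting the numbers during parsing into an integer field so Solr can aggregate them) is exactly the design choice the question asks about; the sketch only shows the extraction and summing step.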