Getting started with Solr

2015-02-26 Thread Baruch Kogan
Hi, I've just installed Solr (I'll be controlling it with Solarium and
using it to search Nutch crawl results). I'm working through the
getting-started tutorial described here:
https://cwiki.apache.org/confluence/display/solr/Running+Solr

When I try to run $ bin/post -c gettingstarted example/exampledocs/*.json,
I get a bunch of errors about there not being a gettingstarted folder under
/solr/. Is this normal?
Should I create one?

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype


Re: Getting started with Solr

2015-02-26 Thread Baruch Kogan
Oh, I see. I used the `bin/solr start -e cloud` command, then ran through the
setup with one core and default options for the rest, then tried to post the
JSON example again and got another error:
ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted
example/exampledocs/*.json
/usr/lib/jvm/java-7-oracle/bin/java -classpath
/home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes
-Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool
example/exampledocs/books.json
SimplePostTool version 5.0.0
Posting files to [base] url
http://localhost:8983/solr/gettingstarted/update...
Entering auto mode. File endings considered are
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing file books.json (application/json) to [base]
SimplePostTool: WARNING: Solr returned an error #404 (Not Found) for url:
http://localhost:8983/solr/gettingstarted/update
SimplePostTool: WARNING: Response:

Error 404 Not Found

HTTP ERROR 404
Problem accessing /solr/gettingstarted/update. Reason:
    Not Found
Powered by Jetty://
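
For reference, a 404 on /solr/gettingstarted/update generally means no core or
collection with that name exists. A quick way to check what the server does
have (a sketch, assuming the default port):

    curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"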

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype

On Thu, Feb 26, 2015 at 4:07 PM, Erik Hatcher 
wrote:

> How did you start Solr?   If you started with `bin/solr start -e cloud`
> you’ll have a gettingstarted collection created automatically, otherwise
> you’ll need to create it yourself with `bin/solr create -c gettingstarted`
>
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>
>
>
> > On Feb 26, 2015, at 4:53 AM, Baruch Kogan 
> wrote:
> >
> > Hi, I've just installed Solr (I'll be controlling it with Solarium and
> > using it to search Nutch crawl results). I'm working through the
> > getting-started tutorial described here:
> > https://cwiki.apache.org/confluence/display/solr/Running+Solr
> >
> > When I try to run $ bin/post -c gettingstarted
> example/exampledocs/*.json,
> > I get a bunch of errors about there not being a gettingstarted folder
> > under /solr/. Is this normal?
> > Should I create one?
> >
> > Sincerely,
> >
> > Baruch Kogan
> > Marketing Manager
> > Seller Panda <http://sellerpanda.com>
> > +972(58)441-3829
> > baruch.kogan at Skype
>
>
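
Spelling Erik's two paths out as commands (a sketch, assuming a stock Solr
5.0 install run from the install directory):

    # Path 1: the cloud example auto-creates the gettingstarted collection
    bin/solr start -e cloud

    # Path 2: on an already-running Solr, create the collection yourself,
    # then post the sample documents to it
    bin/solr create -c gettingstarted
    bin/post -c gettingstarted example/exampledocs/*.json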


Re: Getting started with Solr

2015-02-28 Thread Baruch Kogan
Thanks for bearing with me.

I start Solr with `bin/solr start -e cloud` with 2 nodes. Then I get this:

Welcome to the SolrCloud example!

This interactive session will help you launch a SolrCloud cluster on your
local workstation.

To begin, how many Solr nodes would you like to run in your local cluster?
(specify 1-4 nodes) [2]
Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.

Please enter the port for node1 [8983]
8983
Please enter the port for node2 [7574]
7574
Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1
into /home/ubuntu/crawler/solr/example/cloud/node2

Starting up SolrCloud node1 on port 8983 using command:

solr start -cloud -s example/cloud/node1/solr -p 8983
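
As an aside, the prompts can apparently be skipped entirely, assuming Solr
5.0's -noprompt flag accepts all the defaults:

    bin/solr start -e cloud -noprompt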

I then go to http://localhost:8983/solr/admin/cores and get the following:


This XML file does not appear to have any style information associated
with it. The document tree is shown below.

[Core status XML, with its tags stripped in the archive: it lists four cores,
testCollection_shard1_replica1, testCollection_shard1_replica2,
testCollection_shard2_replica1, and testCollection_shard2_replica2, each with
an instanceDir and dataDir under example/cloud/node1/solr/, a solrconfig.xml
and schema.xml, zero documents, and an index size of 71 bytes.]

I do not seem to have a gettingstarted collection.
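
Presumably the interactive setup above created testCollection instead, since
that is the name the core status reports. A hedged fix, assuming the
SolrCloud example is still running: either create the missing collection and
post to it, or post to the collection that does exist.

    # create gettingstarted and post the sample docs to it...
    bin/solr create -c gettingstarted
    bin/post -c gettingstarted example/exampledocs/*.json

    # ...or just target the existing collection instead
    bin/post -c testCollection example/exampledocs/*.json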

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype

On Fri, Feb 27, 2015 at 12:00 AM, Erik Hatcher 
wrote:

> I’m sorry, I’m not following exactly.
>
> Somehow you no longer have a gettingstarted collection, but it is not
> clear how that happened.
>
> Could you post the exact script steps you used that got you this error?
>
> What collections/cores does the Solr admin show you have? What are the
> results of http://localhost:8983/solr/admin/cores ?
>
> —
> Erik Hatcher, Senior Solutions Architect
> http://www.lucidworks.com
>
>
>
>
> > On Feb 26, 2015, at 9:58 AM, Baruch Kogan 
> wrote:
> >
> > Oh, I see. I used the `bin/solr start -e cloud` command, then ran through
> > the setup with one core and default options for the rest, then tried to
> > post the JSON example again and got another error:
> > ubuntu@ubuntu-VirtualBox:~/crawler/solr$ bin/post -c gettingstarted
> > example/exampledocs/*.json
> > /usr/lib/jvm/java-7-oracle/bin/java -classpath
> > /home/ubuntu/crawler/solr/dist/solr-core-5.0.0.jar -Dauto=yes
> > -Dc=gettingstarted -Ddata=files org.apache.solr.util.SimplePostTool
> > example/exampledocs/books.json
> > SimplePostTool version 5.0.0
> > Posting files to [base] url
> > http://localhost:8983/solr/gettingstarted/update...
> > Entering auto mode. File endings considered are
> >
> xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
> > POSTing file books.json (application/json) to [base]
> 

Re: Getting started with Solr

2015-03-01 Thread Baruch Kogan
OK, got it, works now.

Maybe you can advise on something more general?

I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to
crawl a list of webpages built according to a certain template and analyze
certain fields in their HTML (each identified by a span class and consisting
of a number), then output the results as CSV: a list of each website's domain
and the sum of the numbers in all the specified fields.

How should I set up the flow? Should I configure Nutch to pull only the
relevant fields from each page, then use Solr to add up the integers in those
fields and output them to a CSV? Or should I use Nutch to pull in everything
from the relevant pages and then use Solr to strip out the relevant fields
and process them as above? Can I do the processing strictly in Solr, using
the material found here
<https://cwiki.apache.org/confluence/display/solr/Indexing+and+Basic+Data+Operations>,
or should I use PHP through Solarium or something along those lines?

Your advice would be appreciated; I don't want to reinvent the wheel.
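
For the all-in-Solr route, the StatsComponent looks like it could do the
summing, with CSV output built in. A sketch using hypothetical field names
(domain_s for the site, amount_i for the number pulled from the span; both
parameters are documented Solr features, but the fields are invented here):

    # sum amount_i, broken out per domain via the legacy stats.facet param
    curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&rows=0&stats=true&stats.field=amount_i&stats.facet=domain_s&wt=json"

    # or dump the raw fields as CSV and sum them downstream
    curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&fl=domain_s,amount_i&rows=100000&wt=csv"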

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan  wrote:

> Thanks for bearing with me.
>
> I start Solr with `bin/solr start -e cloud' with 2 nodes. Then I get this:
>
> Welcome to the SolrCloud example!
>
> This interactive session will help you launch a SolrCloud cluster on your
> local workstation.
>
> To begin, how many Solr nodes would you like to run in your local
> cluster? (specify 1-4 nodes) [2]
> Ok, let's start up 2 Solr nodes for your example SolrCloud cluster.
>
> Please enter the port for node1 [8983]
> 8983
> Please enter the port for node2 [7574]
> 7574
> Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1
> into /home/ubuntu/crawler/solr/example/cloud/node2
>
> Starting up SolrCloud node1 on port 8983 using command:
>
> solr start -cloud -s example/cloud/node1/solr -p 8983
>
> I then go to http://localhost:8983/solr/admin/cores and get the following:
>
>
> This XML file does not appear to have any style information associated
> with it. The document tree is shown below.
>
> [Core status XML, mangled and truncated in the archive; see the previous
> message for the four testCollection cores it lists.]

Integrating Solr with Nutch

2015-03-01 Thread Baruch Kogan
Hi, guys,

I'm working through the tutorial here
<http://wiki.apache.org/nutch/NutchTutorial#A6._Integrate_Solr_with_Nutch>.
I've run a crawl on a list of webpages. Now I'm trying to index them into
Solr. Solr's installed, runs fine, indexes .json, .xml, whatever, returns
queries. I've edited the Nutch schema as per instructions. Now I hit a wall:

   - Save the file and restart Solr under ${APACHE_SOLR_HOME}/example:

     java -jar start.jar

On my install (the latest Solr), there is no such file, but there is a
solr.sh script in bin/ which I can run. So I copied it into solr/example/
and ran it from there, and Solr cranks over. Now I need to:


   - run the Solr Index command from ${NUTCH_RUNTIME_HOME}:

     bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
       -linkdb crawl/linkdb crawl/segments/


and I get this:

ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex
http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Indexer: starting at 2015-03-01 19:51:09
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance (mandatory)
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : use authentication (default false)
    solr.auth : username for authentication
    solr.auth.password : password for authentication

Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/crawldb/current
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/linkdb/current
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
    at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

What am I doing wrong?
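
Two things stand out, both hedged guesses: the indexer is being pointed at
the parent crawl/segments/ directory rather than at an actual segment inside
it, and the missing crawldb/current and linkdb/current paths suggest the
crawl output may not live where the command assumes. Assuming Nutch 1.x's
indexer usage, something like this should address the first problem:

    # let the shell expand each segment directory...
    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
        -linkdb crawl/linkdb crawl/segments/*

    # ...or hand the parent directory to the -dir option instead
    bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb \
        -linkdb crawl/linkdb -dir crawl/segments

The Solr URL may also need to point at a specific core or collection, e.g.
http://127.0.0.1:8983/solr/gettingstarted (assuming one exists), rather than
the bare /solr/ root.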

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda <http://sellerpanda.com>
+972(58)441-3829
baruch.kogan at Skype