I am having trouble connecting the Nutch 1.10 web crawler with Solr 5.3.0.

2016-03-15 Thread John Mitchell
Hi,

I am having trouble connecting the Nutch 1.10 web crawler with Solr 5.3.0.

I have Solr set up correctly via "bin/solr start -e cloud -noprompt", and I
have already crawled data with the Norconex web crawler and successfully
committed it into Solr. Now I want to see whether I can commit data crawled
by Apache Nutch into Solr as well.

I tried the "Integrate Solr with Nutch" section of the tutorial at
https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch but
the locations and files it refers to don't match my Solr 5.3.0 setup.


Thanks,

John Mitchell


Would it be better to make my Schema changes within the renamed "/solr-5.3.0/server/solr/configsets/data_driven_schema_configs/conf/schema.xml" instead of the way that I am doing it now via curl -X PO

2016-03-19 Thread John Mitchell
I noticed that
"/solr-5.3.0/server/solr/configsets/data_driven_schema_configs/conf"
contains a file called "managed-schema", and a comment inside this file says
"This is the Solr schema file. This file should be named "schema.xml" and
should be in the conf directory".  So far I have not renamed this file to
"schema.xml"; I have made every adjustment to the schema with commands of
the form "curl -X POST -H 'Content-type:application/json' --data-binary
'{ "add-field": ... }'".

My question is: would it be better to make my schema changes in the renamed
"/solr-5.3.0/server/solr/configsets/data_driven_schema_configs/conf/schema.xml"
instead of the way I am doing it now, via curl -X POST -H
'Content-type:application/json' --data-binary '{ "add-field": ... }'?

I have pasted my shell script below. It starts with an empty Solr, adds
fields to the schema via "curl -X POST -H 'Content-type:application/json'
--data-binary '{ "add-field": ... }'", and then runs the
Norconex-Collector-Http web crawler, which commits into Solr.

Thanks,

John Mitchell



#!/bin/bash

cd /home/jmitchell/20150905/solr-5.3.0

bin/solr stop -all ; rm -Rf example/cloud/

bin/solr start -e cloud -noprompt

# I am using a dynamic schema, so I add the "content" field with the type
# "text_general" (see below) before loading any data into Apache Solr; with
# that in place everything loads correctly. For now I am not using the
# default "_text_" field as a replacement for the "content" field.

curl -X POST -H 'Content-type:application/json' --data-binary \
  '{ "add-field":{ "name":"content", "type":"text_general" } }' \
  http://localhost:8983/solr/gettingstarted/schema

# Adding the "Institutions_of_Higher_Education" field with a type of
# "strings", followed by the other faceted fields of the same type:

curl -X POST -H 'Content-type:application/json' --data-binary \
  '{ "add-field":{ "name":"Institutions_of_Higher_Education", "type":"strings" } }' \
  http://localhost:8983/solr/gettingstarted/schema

curl -X POST -H 'Content-type:application/json' --data-binary \
  '{ "add-field":{ "name":"Local_Education_Agencies", "type":"strings" } }' \
  http://localhost:8983/solr/gettingstarted/schema

curl -X POST -H 'Content-type:application/json' --data-binary \
  '{ "add-field":{ "name":"Nonprofit_Organizations", "type":"strings" } }' \
  http://localhost:8983/solr/gettingstarted/schema

curl -X POST -H 'Content-type:application/json' --data-binary \
  '{ "add-field":{ "name":"Other_Organizations_and_or_Agencies", "type":"strings" } }' \
  http://localhost:8983/solr/gettingstarted/schema

curl -X POST -H 'Content-type:application/json' --data-binary \
  '{ "add-field":{ "name":"State_Education_Agencies", "type":"strings" } }' \
  http://localhost:8983/solr/gettingstarted/schema

cd ..

chmod -R 777 solr-5.3.0

cd norconex-collector-http-2.2.1

rm -rf committer-queue/ examples-output/

#curl "http://localhost:8983/solr/gettingstarted/select?q=*:*&wt=json&indent=true&rows=10"

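# Run the collector in the background; stdout and stderr both go to the log
# file (the stdout redirect must come before 2>&1 for that to work).
/home/jmitchell/20150905/norconex-collector-http-2.2.1/collector-http.sh -a start \
  -c examples/minimum/minimum-config.xml \
  > /home/jmitchell/collector-http-solr-gettingstarted-changed_content_structure_adding_new_faceted_fields_3.txt 2>&1 &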
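
The five "strings" fields above differ only in their name, so the repeated
curl calls could be folded into a small helper. The following is only a
sketch under assumptions: the function names (add_field_payload, add_field,
add_all_fields) and the SOLR_SCHEMA_URL and DRY_RUN variables are
hypothetical and not part of the original script; the endpoint, field names,
and types are the ones used above. It also assumes field names and types
contain no characters that need JSON escaping.

```shell
#!/bin/bash

# Hypothetical variable: schema endpoint of the "gettingstarted" collection
# used by the script above. Override it to target another collection.
SOLR_SCHEMA_URL="${SOLR_SCHEMA_URL:-http://localhost:8983/solr/gettingstarted/schema}"

# Build the JSON body of one "add-field" command.
# $1 = field name, $2 = field type (assumed to need no JSON escaping).
add_field_payload() {
  printf '{"add-field":{"name":"%s","type":"%s"}}' "$1" "$2"
}

# POST one "add-field" command to Solr.
# With DRY_RUN=1 (hypothetical switch) the payload is printed instead of sent.
add_field() {
  local payload
  payload="$(add_field_payload "$1" "$2")"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$payload"
  else
    curl -X POST -H 'Content-type:application/json' \
      --data-binary "$payload" "$SOLR_SCHEMA_URL"
  fi
}

# Add the same six fields as the script above: "content" as text_general,
# then the five faceted fields as strings.
add_all_fields() {
  local name
  add_field content text_general
  for name in Institutions_of_Higher_Education Local_Education_Agencies \
              Nonprofit_Organizations Other_Organizations_and_or_Agencies \
              State_Education_Agencies; do
    add_field "$name" strings
  done
}
```

In the script above, a single call to add_all_fields after "bin/solr start -e
cloud -noprompt" would stand in for the six curl commands. Setting DRY_RUN=1
first prints the payloads without contacting Solr, which may help when
checking the field list.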