Re: Getting started with Solr

2015-03-01 Thread Baruch Kogan
OK, got it, works now.

Maybe you can advise on something more general?

I'm trying to use Solr to analyze HTML data retrieved with Nutch. I want to
crawl a list of webpages built according to a certain template, analyze
certain fields in their HTML (each identified by a span class and consisting
of a number), and then output the results as CSV: a list of each website's
domain and the sum of the numbers in all the specified fields.

How should I set up the flow? Should I configure Nutch to only pull the
relevant fields from each page, then use Solr to add the integers in those
fields and output to CSV? Or should I use Nutch to pull in everything from
the relevant page and then use Solr to strip out the relevant fields and
process them as above? Can I do the processing strictly in Solr, using the
stuff found here, or should I use PHP through Solarium or something along
those lines?
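
(One note: if each extracted number ends up in its own numeric field in Solr,
the per-domain summing itself can be done by Solr's StatsComponent rather than
in client code. Below is a minimal SolrJ sketch of that idea; the field names
"amount" and "domain", the core URL, and the SolrJ 4.x HttpSolrServer API are
illustrative assumptions, not taken from the setup described above.)

    import java.util.List;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FieldStatsInfo;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DomainSums {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);                    // only the stats are needed, not the documents
            q.set("stats", true);
            q.set("stats.field", "amount");  // numeric field holding the number from the span
            q.set("stats.facet", "domain");  // one stats bucket per website domain (string field)

            QueryResponse rsp = solr.query(q);
            FieldStatsInfo stats = rsp.getFieldStatsInfo().get("amount");

            // Print a simple CSV: domain,sum
            for (Map.Entry<String, List<FieldStatsInfo>> facet : stats.getFacets().entrySet()) {
                for (FieldStatsInfo perDomain : facet.getValue()) {
                    System.out.println(perDomain.getName() + "," + perDomain.getSum());
                }
            }
            solr.shutdown();
        }
    }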

Your advice would be appreciated; I don't want to reinvent the wheel.

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda 
+972(58)441-3829
baruch.kogan at Skype

On Sun, Mar 1, 2015 at 9:17 AM, Baruch Kogan  wrote:

> Thanks for bearing with me.
>
> I start Solr with `bin/solr start -e cloud' with 2 nodes. Then I get this:
>
> Welcome to the SolrCloud example!
>
>
> This interactive session will help you launch a SolrCloud cluster on your
> local workstation.
>
> To begin, how many Solr nodes would you like to run in your local
> cluster? (specify 1-4 nodes) [2]
> 8983
> Please enter the port for node2 [7574]
> 7574
> Cloning Solr home directory /home/ubuntu/crawler/solr/example/cloud/node1
> into /home/ubuntu/crawler/solr/example/cloud/node2
>
> Starting up SolrCloud node1 on port 8983 using command:
>
> solr start -cloud -s example/cloud/node1/solr -p 8983
>
> I then go to http://localhost:8983/solr/admin/cores and get the following:
>
>
> This XML file does not appear to have any style information associated
> with it. The document tree is shown below.
>
> [core status XML, mangled in the archive: it lists four testCollection
> replicas (shard1_replica1, shard1_replica2, shard2_replica1 and
> shard2_replica2) under /home/ubuntu/crawler/solr/example/cloud/node1/solr/,
> each using solrconfig.xml and schema.xml, with numDocs and maxDoc at 0 and
> an index size of 71 bytes; the message is truncated here]

Correct connection methodology for Zookeeper/SolrCloud?

2015-03-01 Thread Julian Perry

Hi

I'm really after best practice guidelines for making queries to
an index on a Solr cluster.  I'm not calling from Java.

I have Solr 4.10.2 up and running, seems stable.

I have about 6 indexes/collections - am running SolrCloud with
two Solr instances (both currently running on the same dev. box -
just one shard each) and standalone Zookeeper with 3 instances.
All seems fine.  I can do queries against either instance, and
perform index updates and replication works fine.

I'm not using Java to talk to Solr - the web pages are built with
PHP (or something similar - happy to call zk/Solr from C).  So I
need to call Solr from the web page code.  Clearly I need
resilience and so don't want to specifically call one of the Solr
instances directly.

I could just set up a load balancer on the two Solr instances and
let client query requests use the load balancer to find a working
instance.

From what I have read though - I am supposed to make a call to
zookeeper to ask which Solr instances are running up to date and
working replicas of the collection that I need.  Is that right?
I should do that every time I need to make a query?

There seems to be a zookeeper client library in the zk dist - in
zookeeper-3.4.6/src/c/ - can I use that?  It looks like I can
pass in a list of potential zk host:port pairs and it will find
a working zk for me - is that right?

Then I need to ask the working zk which solr instance I should
connect to for the given index/collection - how do I do that -
is that held in clusterstate.json?

So the steps to make a Solr query against my cluster would be:

a) call zk client library with list of zk host/ports

b) ask zk for clusterstate.json

c) pick an active server (at random) for the relevant collection
   (is there some load balancing option in there?)

d) call the Solr server returned by (c)

Is that best practice - or am I missing something?

--
Cheers
Jules.



Conditional invocation of HTMLStripCharFilterFactory

2015-03-01 Thread SolrUser1543
Is it possible to make a conditional invocation of HTMLStripCharFilterFactory? I
want to decide when to enable or disable it according to the value of a specific
field in my document. E.g. when the value of field A is true, enable the filter
on field B, and disable it otherwise.
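
(One possible workaround, since a char filter runs inside a single field's
analysis chain and cannot see other fields of the document: do the stripping
before analysis, in a custom UpdateRequestProcessor. A rough sketch only; field
names "a" and "b" are placeholders, and the matching UpdateRequestProcessorFactory
plus the update-chain configuration are omitted.)

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class ConditionalHtmlStripProcessor extends UpdateRequestProcessor {

        public ConditionalHtmlStripProcessor(UpdateRequestProcessor next) {
            super(next);
        }

        @Override
        public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrInputDocument doc = cmd.getSolrInputDocument();
            // Only strip HTML from field "b" when field "a" is true.
            if (Boolean.parseBoolean(String.valueOf(doc.getFieldValue("a")))) {
                Object raw = doc.getFieldValue("b");
                if (raw != null) {
                    doc.setField("b", stripHtml(raw.toString()));
                }
            }
            super.processAdd(cmd);
        }

        private String stripHtml(String html) throws IOException {
            StringBuilder sb = new StringBuilder();
            try (HTMLStripCharFilter filter = new HTMLStripCharFilter(new StringReader(html))) {
                char[] buf = new char[1024];
                int n;
                while ((n = filter.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
            }
            return sb.toString();
        }
    }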



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Conditional-invocation-of-HTMLStripCharFactory-tp4190010.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Is it possible to use multiple index data directory in Apache Solr?

2015-03-01 Thread Susheel Kumar
Under the Solr example folder you will find a "multicore" folder, under which you
can create multiple core/index directories and edit the solr.xml there to register
each new core and its directory.

When you start Solr from the example directory, use a command line like the one
below to load Solr; you should then be able to see these multiple cores in the
Solr admin UI and the index data in each core's data directory.

> java -Dsolr.solr.home=multicore -jar start.jar 

Thnx

-Original Message-
From: Jou Sung-Shik [mailto:lik...@gmail.com] 
Sent: February 28, 2015 10:03 PM
To: solr-user@lucene.apache.org
Subject: Is it possible to use multiple index data directory in Apache Solr?

I'm new to Apache Lucene/Solr.

I'm trying to move from Elasticsearch to Apache Solr.

So, I have a question about the following index data location configuration.


*in Elasticsearch*

# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favoring locations with most free
# space on creation. For example:
#
# path.data: /path/to/data1,/path/to/data2

*in Apache Solr*

<dataDir>/var/data/solr/</dataDir>


I want to configure multiple index data directories in Apache Solr, the way
Elasticsearch does.

Is it possible?

How can I reach that goal?





--
-
BLOG : http://www.codingstar.net
-


filtering tfq() function query to a specific part of the collection, not the whole document set

2015-03-01 Thread Ali Nazemian
Hi,
I was wondering whether it is possible to restrict the tfq() function query to a
specific subset of the collection. Suppose I want to count all occurrences of the
term "test" in documents matching fq=category:2; how can I handle such a query
with the tfq() function query? Applying fq=category:2 to a "select" query that
uses tfq() does not seem to affect tfq(): no matter what the rest of my query is,
tfq() always returns the total term frequency for the field across the whole
collection. So what is the solution for this case?
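
(One possible workaround: request the per-document term frequency as a
pseudo-field and sum it on the client side over the filtered result set. A rough
SolrJ sketch follows; the field name "body", the core URL, and the single-page
rows shortcut are illustrative assumptions, and real code would page through the
results.)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrDocument;

    public class FilteredTermFreqSum {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("category:2");                // restrict to the subset of interest
            q.setFields("id", "tf:termfreq(body,'test')"); // per-document term frequency as a pseudo-field
            q.setRows(10000);                              // simplification: assumes the subset fits in one page

            long total = 0;
            for (SolrDocument doc : solr.query(q).getResults()) {
                Object tf = doc.getFieldValue("tf");
                if (tf != null) {
                    total += ((Number) tf).longValue();
                }
            }
            System.out.println("Total occurrences of 'test' in category:2 docs: " + total);
            solr.shutdown();
        }
    }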
Best regards.

-- 
A.Nazemian


Re: [ANNOUNCE] Luke 4.10.3 released

2015-03-01 Thread Dmitry Kan
Hi Tomoko,

I have just created the pivot branch off of the current master. Let's move
our discussion there:

https://github.com/DmitryKey/luke/tree/pivot-luke

Thanks,
Dmitry

On Fri, Feb 27, 2015 at 7:53 PM, Tomoko Uchida  wrote:

> Hi Dmitry,
>
> In my environment, I cannot reproduce this Pivot error on HotSpot VM
> 1.7.0; please give me some time...
> Or I'll try to make pull requests against https://github.com/DmitryKey/luke for
> the Pivot version.
>
> At any rate, it would be best to manage both the (current) Thinlet and
> Pivot versions in the same place, as you suggested.
>
> Thanks,
> Tomoko
>
> 2015-02-26 22:15 GMT+09:00 Dmitry Kan :
>
> > Sure, it is:
> >
> > java version "1.7.0_76"
> > Java(TM) SE Runtime Environment (build 1.7.0_76-b13)
> > Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)
> >
> >
> > On Thu, Feb 26, 2015 at 2:39 PM, Tomoko Uchida <
> > tomoko.uchida.1...@gmail.com
> > > wrote:
> >
> > > Sorry, I'm afraid I have not encountered such errors at launch.
> > > Something seems wrong around Pivot, but I have no idea what.
> > > Would you tell me which Java version you're using?
> > >
> > > Tomoko
> > >
> > > 2015-02-26 21:15 GMT+09:00 Dmitry Kan :
> > >
> > > > Thanks, Tomoko, it compiles ok!
> > > >
> > > > Now launching produces some errors:
> > > >
> > > > $ java -cp "dist/*" org.apache.lucene.luke.ui.LukeApplication
> > > > Exception in thread "main" java.lang.ExceptionInInitializerError
> > > > at org.apache.lucene.luke.ui.LukeApplication.main(Unknown
> > Source)
> > > > Caused by: java.lang.NumberFormatException: For input string: "3
> > 1644336
> > > "
> > > > at
> > > >
> > > >
> > >
> >
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> > > > at java.lang.Integer.parseInt(Integer.java:492)
> > > > at java.lang.Byte.parseByte(Byte.java:148)
> > > > at java.lang.Byte.parseByte(Byte.java:174)
> > > > at org.apache.pivot.util.Version.decode(Version.java:156)
> > > > at
> > > >
> > > >
> > >
> >
> org.apache.pivot.wtk.ApplicationContext.(ApplicationContext.java:1704)
> > > > ... 1 more
> > > >
> > > >
> > > > On Thu, Feb 26, 2015 at 1:48 PM, Tomoko Uchida <
> > > > tomoko.uchida.1...@gmail.com
> > > > > wrote:
> > > >
> > > > > Thank you for checking it out!
> > > > > Sorry, I forgot to note some important information...
> > > > >
> > > > > ivy jar is needed to compile. Packaging process needs to be
> > organized,
> > > > but
> > > > > for now, I'm borrowing it from lucene's tools/lib.
> > > > > In my environment, Fedora 20 and OpenJDK 1.7.0_71, it can be
> compiled
> > > and
> > > > > run as follows.
> > > > > If there are any problems, please let me know.
> > > > >
> > > > > 
> > > > >
> > > > > $ svn co http://svn.apache.org/repos/asf/lucene/sandbox/luke/
> > > > > $ cd luke/
> > > > >
> > > > > // copy ivy jar to lib/tools
> > > > > $ cp /path/to/lucene_solr_4_10_3/lucene/tools/lib/ivy-2.3.0.jar
> > > > lib/tools/
> > > > > $ ls lib/tools/
> > > > > ivy-2.3.0.jar
> > > > >
> > > > > $ java -version
> > > > > java version "1.7.0_71"
> > > > > OpenJDK Runtime Environment (fedora-2.5.3.3.fc20-x86_64 u71-b14)
> > > > > OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)
> > > > >
> > > > > $ ant ivy-resolve
> > > > > ...
> > > > > BUILD SUCCESSFUL
> > > > >
> > > > > // compile and make jars and run
> > > > > $ ant dist
> > > > > ...
> > > > > BUILD SUCCESSFULL
> > > > > $ java -cp "dist/*" org.apache.lucene.luke.ui.LukeApplication
> > > > > ...
> > > > > 
> > > > >
> > > > > Thanks,
> > > > > Tomoko
> > > > >
> > > > > 2015-02-26 16:39 GMT+09:00 Dmitry Kan :
> > > > >
> > > > > > Hi Tomoko,
> > > > > >
> > > > > > Thanks for the link. Do you have build instructions somewhere?
> > When I
> > > > > > executed ant with no params, I get:
> > > > > >
> > > > > > BUILD FAILED
> > > > > > /home/dmitry/projects/svn/luke/build.xml:40:
> > > > > > /home/dmitry/projects/svn/luke/lib-ivy does not exist.
> > > > > >
> > > > > >
> > > > > > On Thu, Feb 26, 2015 at 2:27 AM, Tomoko Uchida <
> > > > > > tomoko.uchida.1...@gmail.com
> > > > > > > wrote:
> > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > Would you announce at LUCENE-2562 to me and all watchers
> > interested
> > > > in
> > > > > > this
> > > > > > > issue, when the branch is ready? :)
> > > > > > > As you know, current pivots's version (that supports Lucene
> > 4.10.3)
> > > > is
> > > > > > > here.
> > > > > > > http://svn.apache.org/repos/asf/lucene/sandbox/luke/
> > > > > > >
> > > > > > > Regards,
> > > > > > > Tomoko
> > > > > > >
> > > > > > > 2015-02-25 18:37 GMT+09:00 Dmitry Kan :
> > > > > > >
> > > > > > > > Ok, sure. The plan is to make the pivot branch in the current
> > > > github
> > > > > > repo
> > > > > > > > and update its structure accordingly.
> > > > > > > > Once it is there, I'll let you know.
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > > Dmitry
> > > > > > > >
> 

Re: Is it possible to use multiple index data directory in Apache Solr?

2015-03-01 Thread Alexandre Rafalovitch
On 1 March 2015 at 01:03, Shawn Heisey  wrote:
> How exactly does ES split the index files when multiple paths are
> configured?  I am very curious about exactly how this works.  Google is
> not helping me figure it out.  I even grabbed the ES master branch and
> wasn't able to trace how path.data is used after it makes it into the
> environment.

Elasticsearch automatically creates indexes and shards. So, multiple
directories are just used to distribute the shards' indexes among
them. 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-dir-layout.html
So, when a new shard is created, one of the directories is chosen either
at random or based on usage.

So, to me, the question is not about matching the implementation but
about what the OP is trying to achieve with that: replication? More even
disk utilization? Something else?

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


Re: About solr recovery

2015-03-01 Thread Erick Erickson
Several. One is if your network has trouble and Zookeeper times out a Solr node.

Can you describe your problem though? Or is this just an informational
question? Because I'm not quite sure how to respond helpfully here.

Best,
Erick

On Fri, Feb 27, 2015 at 10:37 PM, 龚俊衡  wrote:
> HI,
>
> Our production Solr's replica went offline for some time, but both ZooKeeper
> and the network were OK, and the Solr JVM was normal.
>
> My question: is there any other reason that would put a Solr replica into
> the recovering state?


Re: Correct connection methodology for Zookeeper/SolrCloud?

2015-03-01 Thread Erick Erickson
bq: I could just set up a load balancer on the two Solr instances and
let client query requests use the load balancer to find a working
instance.

That's all you need to do. The client shouldn't really even have to be
aware that Zookeeper exists; there's no need to query ZK and route your
requests yourself. The _Solr_ instances query ZK, "know" about each
other's state, and are notified of any problems, i.e. nodes going
up/down etc. Once a request hits any running Solr node, it'll be routed
around any problems. In the setup you describe, i.e. not using SolrJ,
your client really shouldn't need to be aware that ZK exists at all.

Your load balancer should know what nodes are up and route your
requests around any hosed machines.

If you _do_ decide to use SolrJ sometime, CloudSolrServer (renamed
CloudSolrClient in 5x) _does_ take the ZK ensemble and do some smart
routing on the client side, including simple load balancing, and
responds to any solr nodes going up/down for you.

Putting a load balancer in front or some other type of connection,
though, will accomplish much the same thing if Java isn't an option.
The SolrJ stuff is more sophisticated though.
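
For completeness, a minimal sketch of that SolrJ option, assuming the 4.x
CloudSolrServer API and placeholder ZooKeeper hosts and collection name:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CloudQueryExample {
        public static void main(String[] args) throws Exception {
            // Pass the ZooKeeper ensemble, not individual Solr nodes; the client
            // watches cluster state and load-balances across live replicas for you.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("collection1");

            SolrQuery query = new SolrQuery("*:*");
            QueryResponse rsp = server.query(query);
            System.out.println("Found " + rsp.getResults().getNumFound() + " docs");

            server.shutdown();
        }
    }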

Best,
Erick

On Sun, Mar 1, 2015 at 3:51 AM, Julian Perry  wrote:
> Hi
>
> I'm really after best practice guidelines for making queries to
> an index on a Solr cluster.  I'm not calling from Java.
>
> I have Solr 4.10.2 up and running, seems stable.
>
> I have about 6 indexes/collections - am running SolrCloud with
> two Solr instances (both currently running on the same dev. box -
> just one shard each) and standalone Zookeeper with 3 instances.
> All seems fine.  I can do queries against either instance, and
> perform index updates and replication works fine.
>
> I'm not using Java to talk to Solr - the web pages are built with
> PHP (or something similar - happy to call zk/Solr from C).  So I
> need to call Solr from the web page code.  Clearly I need
> resilience and so don't want to specifically call one of the Solr
> instances directly.
>
> I could just set up a load balancer on the two Solr instances and
> let client query requests use the load balancer to find a working
> instance.
>
> From what I have read though - I am supposed to make a call to
> zookeeper to ask which Solr instances are running up to date and
> working replicas of the collection that I need.  Is that right?
> I should do that every time I need to make a query?
>
> There seems to be a zookeeper client library in the zk dist - in
> zookeeper-3.4.6/src/c/ - can I use that?  It looks like I can
> pass in a list of potential zk host:port pairs and it will find
> a working zk for me - is that right?
>
> Then I need to ask the working zk which solr instance I should
> connect to for the given index/collection - how do I do that -
> is that held in clusterstate.json?
>
> So the steps to make a Solr query against my cluster would be:
>
> a) call zk client library with list of zk host/ports
>
> b) ask zk for clusterstate.json
>
> c) pick an active server (at random) for the relevant collection
>(is there some load balancing option in there)
>
> d) call the Solr server returned by (c)
>
> Is that best practice - or am I missing something?
>
> --
> Cheers
> Jules.
>


Integrating Solr with Nutch

2015-03-01 Thread Baruch Kogan
Hi, guys,

I'm working through the tutorial here.
I've run a crawl on a list of webpages. Now I'm trying to index them into
Solr. Solr's installed, runs fine, indexes .json, .xml, whatever, returns
queries. I've edited the Nutch schema as per instructions. Now I hit a wall:

   - Save the file and restart Solr under ${APACHE_SOLR_HOME}/example:

       java -jar start.jar


On my install (the latest Solr), there is no such file, but there is a
solr.sh file in /bin which I can start. So I pasted it into
solr/example/ and ran it from there. Solr cranks over. Now I need to:


   - run the Solr Index command from ${NUTCH_RUNTIME_HOME}:

       bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/


and I get this:

ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex
http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/
Indexer: starting at 2015-03-01 19:51:09
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
 solr.server.url : URL of the SOLR instance (mandatory)
 solr.commit.size : buffer size when sending to SOLR (default 1000)
 solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
 solr.auth : use authentication (default false)
 solr.auth.username : use authentication (default false)
 solr.auth : username for authentication
 solr.auth.password : password for authentication


Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/crawldb/current
Input path does not exist: file:/home/ubuntu/crawler/nutch/crawl/linkdb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)
        at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)

What am I doing wrong?

Sincerely,

Baruch Kogan
Marketing Manager
Seller Panda 
+972(58)441-3829
baruch.kogan at Skype


RE: Integrating Solr with Nutch

2015-03-01 Thread Markus Jelsma
Hello Baruch!

You are pointing to a directory of segments, not a specific segment.

You must either point to a directory with the -dir option:
   bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
-linkdb crawl/linkdb -dir crawl/segments/

Or point to a segment:

   bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
-linkdb crawl/linkdb crawl/segments/YOUR_SEGMENT

Cheers
 
 
-Original message-
> From:Baruch Kogan 
> Sent: Sunday 1st March 2015 18:57
> To: solr-user@lucene.apache.org
> Subject: Integrating Solr with Nutch
> 
> Hi, guys,
> 
> I'm working through the tutorial here
> .
> I've run a crawl on a list of webpages. Now I'm trying to index them into
> Solr. Solr's installed, runs fine, indexes .json, .xml, whatever, returns
> queries. I've edited the Nutch schema as per instructions. Now I hit a wall:
> 
>-
> 
>Save the file and restart Solr under ${APACHE_SOLR_HOME}/example:
> 
>java -jar start.jar\
> 
> 
> On my install (the latest Solr,) there is no such file, but there is a
> solr.sh file in the /bin which I can start. So I pasted it into
> solr/example/ and ran it from there. Solr cranks over. Now I need to:
> 
> 
>-
> 
>run the Solr Index command from ${NUTCH_RUNTIME_HOME}:
> 
>bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb
> -linkdb crawl/linkdb crawl/segments/
> 
> 
> and I get this:
> 
> *ubuntu@ubuntu-VirtualBox:~/crawler/nutch$ bin/nutch solrindex
> http://127.0.0.1:8983/solr/  crawl/crawldb
> -linkdb crawl/linkdb crawl/segments/*
> *Indexer: starting at 2015-03-01 19:51:09*
> *Indexer: deleting gone documents: false*
> *Indexer: URL filtering: false*
> *Indexer: URL normalizing: false*
> *Active IndexWriters :*
> *SOLRIndexWriter*
> * solr.server.url : URL of the SOLR instance (mandatory)*
> * solr.commit.size : buffer size when sending to SOLR (default 1000)*
> * solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)*
> * solr.auth : use authentication (default false)*
> * solr.auth.username : use authentication (default false)*
> * solr.auth : username for authentication*
> * solr.auth.password : password for authentication*
> 
> 
> *Indexer: org.apache.hadoop.mapred.InvalidInputException: Input path does
> not exist: file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_fetch*
> *Input path does not exist:
> file:/home/ubuntu/crawler/nutch/crawl/segments/crawl_parse*
> *Input path does not exist:
> file:/home/ubuntu/crawler/nutch/crawl/segments/parse_data*
> *Input path does not exist:
> file:/home/ubuntu/crawler/nutch/crawl/segments/parse_text*
> *Input path does not exist:
> file:/home/ubuntu/crawler/nutch/crawl/crawldb/current*
> *Input path does not exist:
> file:/home/ubuntu/crawler/nutch/crawl/linkdb/current*
> * at
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:197)*
> * at
> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)*
> * at
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)*
> * at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1081)*
> * at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1073)*
> * at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)*
> * at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)*
> * at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)*
> * at java.security.AccessController.doPrivileged(Native Method)*
> * at javax.security.auth.Subject.doAs(Subject.java:415)*
> * at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)*
> * at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)*
> * at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)*
> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)*
> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
> 
> What am I doing wrong?
> 
> Sincerely,
> 
> Baruch Kogan
> Marketing Manager
> Seller Panda 
> +972(58)441-3829
> baruch.kogan at Skype
> 


backport Heliosearch features to Solr

2015-03-01 Thread Yonik Seeley
As many of you know, I've been doing some work in the experimental
"heliosearch" fork of Solr over the past year.  I think it's time to
bring some more of those changes back.

So here's a poll: Which Heliosearch features do you think should be
brought back to Apache Solr?

http://bit.ly/1E7wi1Q
(link to google form)

-Yonik


Using HDFS with Solr

2015-03-01 Thread Jou Sung-Shik
Hello.

I have a question about using HDFS with Solr.

I observed that when one of the shard nodes goes down, another node takes over
its shards, as in this graph from the admin console (10.62.65.46 is gone):

collection-hdfs ---+- shard 1 - 10.62.65.48 (active)
                   +- shard 2 - 10.62.65.47 (active)
                   +- shard 3 - 10.62.65.48 (active)

So when 10.62.65.46 is restarted, shard 1 is still assigned to the 10.62.65.48
node.

Is that right?

I expected shard 1 to be assigned to the 10.62.65.46 node instead of the
10.62.65.48 node.

Please comment.

Thanks.

-- 
-
BLOG : http://www.codingstar.net
-


solr5 - where does solr5 look for schema files?

2015-03-01 Thread Gulliver Smith
I am running the out-of-the-box solr5 as instructed in the tutorial.

The Solr documentation says nothing useful about the schema file
argument to "create core".

I have a schema.xml that I was using for a solr 4 installation by
manually editing the core directories as root.

When playing with solr5, I have tried a number of things without success.

a) copied my custom schema.xml to
server/solr/configsets/basic_configs/conf/custom_schema.xml
- when I typed custom_schema.xml into the "schema:" field in the
create core dialog, a core is created but the new schema isn't used.
Making custom_schema.xml into invalid XML doesn't break anything.

b) put custom_schema.xml in an accessible location on my server and
entered the full path into the schema field - in this case I got an
error message "Error CREATEing SolrCore 'xxx': Unable to create core
... Invalid path string
"/configs/gettingstarted//.../custom_schema.xml

There is no "configs" directory in the solr installaition.There is no
"gettingstarted" directory either, though there are
gettingstarted_shard1_replica1 etc. directories.

The only meaningful schema.xml seems to be
server/solr/configsets/basic_configs/conf/schema.xml.

The cores are created in example/cloud/node*/solr

There is no directory structure in the installation matching the one
described in the 500-page PDF. The Files screen in the admin console does
not mention schema.xml, and there doesn't seem to be any place naming or
showing schema.xml in the admin interface.

So how in the world is one to install a custom schema?

Thanks
Gulliver


Re: solr5 - where does solr5 look for schema files?

2015-03-01 Thread Erick Erickson
You haven't stated it explicitly, but I think you're running SolrCloud, right?

In which case... the configs are all stored in ZooKeeper, and you don't
edit them there. The startup scripts automate the "upconfig" step that
pushes your configs to Zookeeper. Thereafter, they are read from
Zookeeper by each Solr node on startup, but not stored locally
on each node. Otherwise, keeping all the nodes coordinated would be
difficult.

You can see the uploaded configs in the Solr admin UI/Cloud/tree/configs
area.

So you keep your configs somewhere (some kind of VCS is recommended)
and, when you make changes to them, push the results to ZK and either
restart or reload your collection.

Did you see the documentation at:
https://cwiki.apache.org/confluence/display/solr/Using+ZooKeeper+to+Manage+Configuration+Files?

And assuming I'm right and you're using SolrCloud, I _strongly_ suggest
you try to think in terms of replicas rather than cores. In particular,
avoid using the old, familiar core admin API and instead use the
collections API (see the ref guide). You can do pretty much anything with
the collections API that you used to do with the core admin API, and at
the same time have a lot less chance of getting something wrong. The
collections API makes use of the individual core admin API calls to carry
out the instructed tasks as necessary.

All that said, "the new way of doing things" is a bit of a shock to
the system if you're an
old Solr hand, especially in SolrCloud.

Best,
Erick


On Sun, Mar 1, 2015 at 4:58 PM, Gulliver Smith
 wrote:
> I am running the out-of-the-box solr5 as instructed in the tutorial.
>
> The solr documentation has no useful documentation about the shema
> file argument to create core.
>
> I have a schema.xml that I was using for a solr 4 installation by
> manually editing the core directories as root.
>
> When playing with solr5, I have tried a number of things without success.
>
> a) copied my custom schema.xml to
> server/solr/configsets/basic_configs/conf/custom_schema.xml
> - when I typed custom_schema.xml into the "schema:" field in the
> create core dialog, a core is created but the new schema isn't used.
> Making cusom_schema.xml into invalid XML doesn't break anything.
>
> b) put custom_schema.xml in an accessible location on my server and
> entered the full path into the schema field - in this case I got an
> error message "Error CREATEing SolrCore 'xxx': Unable to create core
> ... Invalid path string
> "/configs/gettingstarted//.../custom_schema.xml
>
> There is no "configs" directory in the solr installaition.There is no
> "gettingstarted" directory either, though there are
> gettingstarted_shard1_replica1 etc. directories.
>
> The only meaningful schema.xml seems to be
> server/solr/configsets/basic_configs/conf/schema.xml.
>
> The cores are created in example/cloud/node*/solr
>
> There is no directory structure in the installation matching that
> described in the 500 page pdf.The files screen in the admin console
> does not mention schema.xml and there doesn't seem to be any place
> namimg or showing schema.xml in the admin interface.
>
> So how in the world is one to install a custom schema?
>
> Thanks
> Gulliver


Re: backport Heliosearch features to Solr

2015-03-01 Thread Otis Gospodnetic
Hi Yonik,

Now that you joined Cloudera, why not everything?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley  wrote:

> As many of you know, I've been doing some work in the experimental
> "heliosearch" fork of Solr over the past year.  I think it's time to
> bring some more of those changes back.
>
> So here's a poll: Which Heliosearch features do you think should be
> brought back to Apache Solr?
>
> http://bit.ly/1E7wi1Q
> (link to google form)
>
> -Yonik
>


Re: backport Heliosearch features to Solr

2015-03-01 Thread Yonik Seeley
On Sun, Mar 1, 2015 at 7:18 PM, Otis Gospodnetic
 wrote:
> Hi Yonik,
>
> Now that you joined Cloudera, why not everything?

Everything is on the table, but from a practical point of view I
wanted to verify areas of user interest/support before doing the work
to get things back.

Even when there is user support, some things may be blocked anyway
(part of the reason why I did things under a fork in the first place).
I'll do what I can though.

-Yonik


> Otis
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Sun, Mar 1, 2015 at 4:50 PM, Yonik Seeley  wrote:
>
>> As many of you know, I've been doing some work in the experimental
>> "heliosearch" fork of Solr over the past year.  I think it's time to
>> bring some more of those changes back.
>>
>> So here's a poll: Which Heliosearch features do you think should be
>> brought back to Apache Solr?
>>
>> http://bit.ly/1E7wi1Q
>> (link to google form)
>>
>> -Yonik
>>


SOLR Backup and Restore - Solr 3.6.1

2015-03-01 Thread abhi Abhishek
Hello,
   We have Solr 3.6.1 in our environment and are trying to evaluate backup
and recovery solutions for it. Is there a way to compress the backup that is
taken?

We have explored the replicationHandler with the backup command, but as our
index is in the hundreds of GBs, we would like a solution that provides
compression to reduce the storage overhead.

thanks in advance

Regards,
Abhishek


Re: solr cloud does not start with many collections

2015-03-01 Thread Damien Kamerman
I still see the same cloud startup issue with Solr 5.0.0. I created 4,000
collections from scratch and then attempted to stop/start the cloud.

node1:
WARN  - 2015-03-02 18:09:02.371;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:07.196; org.apache.solr.cloud.ZkController; Timed
out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:46.238; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DD-3219 after 30 seconds; our state says
http://host:8002/solr/DD-3219_shard1_replica1/, but ZooKeeper says
http://host:8000/solr/DD-3219_shard1_replica2/

node2:
WARN  - 2015-03-02 18:09:01.871;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:17:04.458;
org.apache.solr.common.cloud.ZkStateReader$3; ZooKeeper watch triggered,
but Solr cannot talk to ZK
stop/start
WARN  - 2015-03-02 18:53:12.725;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:56:30.702; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DD-3581 after 30 seconds; our state says
http://host:8001/solr/DD-3581_shard1_replica2/, but ZooKeeper says
http://host:8002/solr/DD-3581_shard1_replica1/

node3:
WARN  - 2015-03-02 18:09:03.022;
org.eclipse.jetty.server.handler.RequestLogHandler; !RequestLog
WARN  - 2015-03-02 18:10:08.178; org.apache.solr.cloud.ZkController; Timed
out waiting to see all nodes published as DOWN in our cluster state.
WARN  - 2015-03-02 18:13:47.737; org.apache.solr.cloud.ZkController; Still
seeing conflicting information about the leader of shard shard1 for
collection DD-2707 after 30 seconds; our state says
http://host:8002/solr/DD-2707_shard1_replica2/, but ZooKeeper says
http://host:8000/solr/DD-2707_shard1_replica1/



On 27 February 2015 at 17:48, Shawn Heisey  wrote:

> On 2/26/2015 11:14 PM, Damien Kamerman wrote:
> > I've run into an issue with starting my solr cloud with many collections.
> > My setup is:
> > 3 nodes (solr 4.10.3 ; 64GB RAM each ; jdk1.8.0_25) running on a single
> > server (256GB RAM).
> > 5,000 collections (1 x shard ; 2 x replica) = 10,000 cores
> > 1 x Zookeeper 3.4.6
> > Java arg -Djute.maxbuffer=67108864 added to solr and ZK.
> >
> > Then I stop all nodes, then start all nodes. All replicas are in the down
> > state, some have no leader. At times I have seen some (12 or so) leaders
> in
> > the active state. In the solr logs I see lots of:
> >
> > org.apache.solr.cloud.ZkController; Still seeing conflicting information
> > about the leader of shard shard1 for collection DD-4351 after 30
> > seconds; our state says
> http://ftea1:8001/solr/DD-4351_shard1_replica1/,
> > but ZooKeeper says http://ftea1:8000/solr/DD-4351_shard1_replica2/
>
> 
>
> > I've tried staggering the starts (1min) but does not help.
> > I've reproduced with zero documents.
> > Restarts are OK up to around 3,000 cores.
> > Should this work?
>
> This is going to push SolrCloud beyond its limits.  Is this just an
> exercise to see how far you can push Solr, or are you looking at setting
> up a production install with several thousand collections?
>
> In Solr 4.x, the clusterstate is one giant JSON structure containing the
> state of the entire cloud.  With 5000 collections, the entire thing
> would need to be downloaded and uploaded at least 5000 times during the
> course of a successful full system startup ... and I think with
> replicationFactor set to 2, that might actually be 10,000 times. The
> best-case scenario is that it would take a VERY long time, the
> worst-case scenario is that concurrency problems would lead to a
> deadlock.  A deadlock might be what is happening here.
>
> In Solr 5.x, the clusterstate is broken up so there's a separate state
> structure for each collection.  This setup allows for faster and safer
> multi-threading and far less data transfer.  Assuming I understand the
> implications correctly, there might not be any need to increase
> jute.maxbuffer with 5.x ... although I have to assume that I might be
> wrong about that.
>
> I would very much recommend that you set your scenario up from scratch
> in Solr 5.0.0, to see if the new clusterstate format can eliminate the
> problem you're seeing.  If it doesn't, then we can pursue it as a likely
> bug in the 5.x branch and you can file an issue in Jira.
>
> Thanks,
> Shawn
>
>


-- 
Damien Kamerman