delta import not working properly

2014-07-05 Thread madhav bahuguna
I have 8 tables in my Solr data-config file and all are joined, since I
need data from all of them.
But out of those 8 tables I have three tables with common fields
which I can use to link them. The issue is that the common fields in those
three tables have repeating values.
For example:

First table (businessmasters):

Business_id (PK)   Business_name   Mod1 (on-update timestamp)
1                  ABC
2                  XYZ
3                  KOT


Second table (search_tag):

Search_tag_id (PK)   Business_id   Search_tag_name   Mod2 (on-update timestamp)
1                    1             hair
2                    1             trimming
3                    2             massage
4                    1             facial
5                    2             makeup

Now I want to join the two tables, and I do that on Business_id. But since I am
joining on Business_id, which is a non-key column in the second table, I simply
declare Business_id as the key there: I can't use Search_tag_id because it is
not common to both tables, and if I take Search_tag_id as the pk and join the
tables on Business_id, Solr gives an error during delta import
("deltaQuery has no column to resolve to declared primary key").
But when I say that Business_id is the pk and join the tables on it, delta
import works fine.

But the issue comes when I try to change the Business_id in the second table.
For example, I change the Business_id of Search_tag_id 3 from 2 to 1 and run a
delta import. The update adds the search tag to Business_id 1 but does not
remove it from Business_id 2. The search tag remains on Business_id 2, whereas
it should go away once I change the Business_id from 2 to 1, since Business_id
2 is no longer linked to that search tag.
I am not able to figure out how to resolve this issue. This is how I write it
in the data-config file:
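(As an illustrative sketch only, not the original file from this mail: a
parent/child delta setup for the two tables above might look like the
following. The data source details are placeholders and the exact column lists
are assumptions.)

<dataConfig>
  <!-- Connection details below are placeholders -->
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/mydb"
              user="user" password="pass"/>
  <document>
    <!-- Parent entity: one Solr document per business row. Column names used in
         ${dataimporter.delta.*} may need to match the case returned by the JDBC driver. -->
    <entity name="business" pk="business_id"
            query="SELECT business_id, business_name FROM businessmasters"
            deltaQuery="SELECT business_id FROM businessmasters
                        WHERE mod1 &gt; '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT business_id, business_name FROM businessmasters
                              WHERE business_id = '${dataimporter.delta.business_id}'">
      <!-- Child entity: search tags joined on the non-key business_id.
           parentDeltaQuery re-imports only the row's *current* parent business, so a
           tag moved from business 2 to business 1 refreshes business 1 but leaves the
           stale tag on business 2 - which is the behaviour described above. -->
      <entity name="search_tag" pk="search_tag_id"
              query="SELECT search_tag_id, search_tag_name FROM search_tag
                     WHERE business_id = '${business.business_id}'"
              deltaQuery="SELECT search_tag_id, business_id FROM search_tag
                          WHERE mod2 &gt; '${dataimporter.last_index_time}'"
              parentDeltaQuery="SELECT business_id FROM businessmasters
                                WHERE business_id = '${search_tag.business_id}'"/>
    </entity>
  </document>
</dataConfig>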
Thanks in advance; I am new to Solr and I am stuck on this issue.

-- 
Regards
Madhav Bahuguna


Re: Field for 'species' data?

2014-07-05 Thread Dan Bolser
I'm a super noob... Why choose to write it as a custom update request
processor rather than an analysis pipeline?

Cheers, Dan.
On 5 Jul 2014 03:45, "Alexandre Rafalovitch"  wrote:

> Do that with a custom update request processor.
>
> Just remember Solr is there to find things not to preserve structure. So
> mangle your data until you can find it.
>
> Also check if SirenDB would fit your requirements if you want to encode the
> information as complex structure.
>
> Regards,
> Alex
>


Solr and SolrCloud replication, and load balancing questions.

2014-07-05 Thread Himanshu Mehrotra
Hi,

I had three quesions/doubts regarding Solr and SolrCloud functionality.
Can anyone help clarify these? I know these are bit long, please bear with
me.

[A] Replication related - As I understand it, before SolrCloud, under a classic
master/slave replication setup, every 'X' minutes the slaves pull/poll the
updated index (index segments added and deleted/merged away).  And when a
client explicitly issues a 'commit', only the master Solr closes/finalizes the
current index segment and creates a new current index segment.  As part of
this, index segment merges happen, as well as an 'fsync' ensuring the data is
on the disk.

I read documentation regarding replication on SolrCloud but unfortunately
it is still not very clear to me.

Say I have a SolrCloud setup of 3 Solr servers with just a single shard.
Let's call them L (the leader) and F1 and F2, the followers.

Case 1: We are not using autoCommit, and explicitly issue 'commit' via the
client.  How does replication happen now?
Does each update to leader L that goes into the tlog get replicated to
followers F1 and F2 (where they also put the update in their tlog) before the
client sees the response from leader L?  What happens when the client issues a
'commit'?  Does the creation of a new segment, the merging of index segments if
required, and the fsync happen on all three Solrs, or does that happen only on
leader L while followers F1 and F2 simply sync the post-commit state of the
index?  Moreover, does leader L wait for the fsync on followers F1 and F2
before responding successfully to the client?  If yes, does it sequentially
update F1 and then F2, or is the process concurrent/parallel via threads?

Case 2: We use autoCommit every 'X' minutes and do not issue 'commit' via the
client.  Is this setup similar to classic master/slave in terms of data/index
updates?
That is, since autoCommit happens every 'X' minutes, replication happens after
each commit and every 'X' minutes the followers get an updated index.  But do
simple updates, the ones that go into the tlog, get replicated immediately to
the followers' tlogs?

Another thing I noticed in the Solr Admin UI is that replication is set to
afterCommit.  What are the other possible settings for this knob, and what
behaviour do we get out of them?




[B] Load balancing related - In a traditional master/slave setup we use a load
balancer to distribute search query load equally over the slaves.  In case one
of the slave Solr instances is running on a 'beefier' machine (say more RAM or
CPU or both) than the others, load balancers allow distributing load by
weights, so that we can distribute load proportional to perceived machine
capacity.

With a SolrCloud setup, let's take an example: 2 shards, 3 replicas per shard,
totaling 6 Solr servers.  Say servers S1L1, S1F1, S1F2 host the replicas of
shard1 and servers S2L1, S2F1, S2F2 host the replicas of shard2.  S1L1 and S2L1
happen to be the leaders of their respective shards.  And let's say S1F2 and
S2F1 happen to be twice as big machines as the others (twice the RAM and CPU).

Ideally speaking, in such a case we would want S2F1 and S1F2 to handle twice
the search query load of their peers.  That is, if 100 search queries come in,
we know each shard will receive those 100 queries.  So we want S1L1 and S1F1 to
handle 25 queries each, and S1F2 to handle 50 queries.  Similarly, we would
want S2L1 and S2F2 to handle 25 queries each and S2F1 to handle 50 queries.

As far as I understand, this is not possible via the smart client provided in
SolrJ.  All Solr servers will handle 33% of the query load.

An alternative is to use a dumb client and a load balancer over all servers.
But even then I guess we won't get the correct/desired distribution of queries.
Say we put the following weights on each server:

1 - S1L1
1 - S1F1
2 - S1F2
1 - S2L1
2 - S2F1
1 - S2F2

Now 1/4 of the total requests go to S1F2 directly, plus it receives 1/6
(1/2 * 1/3) of the requests that first land on a shard-2 server.  This totals
1/4 + 1/6 = 10/24 of the request load, not half as we would expect.

One way could be to choose weights y and x such that y/(2*(y + 2x)) + 1/6 =
1/2 (S1F2's direct share plus the shard-1 sub-requests forwarded from shard-2
nodes), which works out to y = 4x, e.g. y = 4 and x = 1.  But that seems too
much trouble: every time we add/remove/upgrade servers we would need to
recalculate the weights.

A simpler alternative, it appears, would be for each Solr node to register its
'query_weight' with ZooKeeper on joining the cluster.  This 'query_weight'
could be a property, similar to 'solr.solr.home' or 'zkHosts', that we specify
on the startup command line for the Solr server.

All smart clients and Solr servers would then honour that weight when they
distribute load.  Is there such a feature planned for SolrCloud?




[C] GC/Memory usage related - From the documentation and videos available on
the internet, it appears that Solr performs well if the index fits into memory
and the stored fields fit in memory.  Holding just the index in memory has a
more degrading impact on Solr performance, and if we don't have enough memory
to hold the index Solr is slower still; the moment the Java process hits swap
space, Solr will slow to a crawl.

My question is what th

Re: Field for 'species' data?

2014-07-05 Thread Jack Krupansky
Focus on your data model and queries first, then you can decide on the 
implementation.


Take a semi-complex example and manually break it down into field values and 
then write some queries, including filters, in English, that do the required 
navigation. Once you have a handle on what fields you need to populate, the 
analysis and processing details can be worked out.


-- Jack Krupansky

-Original Message- 
From: Dan Bolser

Sent: Saturday, July 5, 2014 4:49 AM
To: solr-user
Subject: Re: Field for 'species' data?

I'm super noob... Why choose to write it add a custom update request
processor rather than an analysis pipeline?

Cheers, Dan.
On 5 Jul 2014 03:45, "Alexandre Rafalovitch"  wrote:


Do that with a custom update request processor.

Just remember Solr is there to find things not to preserve structure. So
mangle your data until you can find it.

Also check if SirenDB would fit your requirements if you want to encode 
the

information as complex structure.

Regards,
Alex





Re: Field for 'species' data?

2014-07-05 Thread Dan Bolser
One requirement is that the hierarchical facet implementation matches
whatever the Drupal ApacheSolr module does with taxonomy terms.

The key thing is to add the taxonomy to the doc which only has one 'leaf'
term.
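For the faceting side, one common approach (an assumption on my part, not
necessarily what the Drupal module does) is to index the expanded path into a
path-hierarchy field once the ancestry has been filled in; the field and type
names below are just illustrative:

<fieldType name="taxonomy_path" class="solr.TextField">
  <analyzer type="index">
    <!-- "a/b/c" is tokenized to "a", "a/b", "a/b/c" -->
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <!-- keep the filter/facet value as a single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

Indexing a value like "Eukaryota/Chordata/Mammalia/Homo sapiens" then produces
one token per ancestor path, so a filter or facet on any level of the hierarchy
matches the document.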
On 5 Jul 2014 15:01, "Jack Krupansky"  wrote:

> Focus on your data model and queries first, then you can decide on the
> implementation.
>
> Take a semi-complex example and manually break it down into field values
> and then write some queries, including filters, in English, that do the
> required navigation. Once you have a handle on what fields you need to
> populate, the analysis and processing details can be worked out.
>
> -- Jack Krupansky
>
> -Original Message- From: Dan Bolser
> Sent: Saturday, July 5, 2014 4:49 AM
> To: solr-user
> Subject: Re: Field for 'species' data?
>
> I'm super noob... Why choose to write it add a custom update request
> processor rather than an analysis pipeline?
>
> Cheers, Dan.
> On 5 Jul 2014 03:45, "Alexandre Rafalovitch"  wrote:
>
>  Do that with a custom update request processor.
>>
>> Just remember Solr is there to find things not to preserve structure. So
>> mangle your data until you can find it.
>>
>> Also check if SirenDB would fit your requirements if you want to encode
>> the
>> information as complex structure.
>>
>> Regards,
>> Alex
>>
>>
>


Re: Field for 'species' data?

2014-07-05 Thread Jack Krupansky
So, the immediate question is whether the value in the Solr source document 
has the full taxonomy path for the species, or just parts, and some external 
taxonomy definition must be consulted to "fill in" the rest of the hierarchy 
path for that species.


-- Jack Krupansky

-Original Message- 
From: Dan Bolser

Sent: Saturday, July 5, 2014 10:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Field for 'species' data?

One requirement is that the hierarchical facet implementation marches
whatever the Drupal ApacheSolr module does with taxonomy terms.

The key thing is to add the taxonomy to the doc which only has one 'leaf'
term.
On 5 Jul 2014 15:01, "Jack Krupansky"  wrote:


Focus on your data model and queries first, then you can decide on the
implementation.

Take a semi-complex example and manually break it down into field values
and then write some queries, including filters, in English, that do the
required navigation. Once you have a handle on what fields you need to
populate, the analysis and processing details can be worked out.

-- Jack Krupansky

-Original Message- From: Dan Bolser
Sent: Saturday, July 5, 2014 4:49 AM
To: solr-user
Subject: Re: Field for 'species' data?

I'm super noob... Why choose to write it add a custom update request
processor rather than an analysis pipeline?

Cheers, Dan.
On 5 Jul 2014 03:45, "Alexandre Rafalovitch"  wrote:

 Do that with a custom update request processor.


Just remember Solr is there to find things not to preserve structure. So
mangle your data until you can find it.

Also check if SirenDB would fit your requirements if you want to encode
the
information as complex structure.

Regards,
Alex








error during heavy indexing

2014-07-05 Thread navdeep agarwal
I am getting the following error on heavy indexing.  I am using Solr 4.7,
creating the index in HDFS through MapReduce, and sending docs in batches of
50.

ERROR org.apache.solr.core.SolrCore  – java.lang.RuntimeException: [was
class org.eclipse.jetty.io.EofException] early EOF
at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:397)
at
org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:246)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
at
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:768)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:415)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:205)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.eclipse.jetty.io.EofException: early EOF
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:65)
at java.io.InputStream.read(InputStream.java:101)
at com.ctc.wstx.io.UTF8Reader.loadMore(UTF8Reader.java:365)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:110)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
at
com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
at
com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
at
com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
at
com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
... 36 more

5409264 [qtp667019593-1270] ERROR
org.apache.solr.servlet.SolrDispatchFilter  –
null:java.lang.RuntimeException: [was class
org.eclipse.jetty.io.EofException] early EOF
at
com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
at
com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)

Re: Field for 'species' data?

2014-07-05 Thread Dan Bolser
The latter
On 5 Jul 2014 16:39, "Jack Krupansky"  wrote:

> So, the immediate question is whether the value in the Solr source
> document has the full taxonomy path for the species, or just parts, and
> some external taxonomy definition must be consulted to "fill in" the rest
> of the hierarchy path for that species.
>
> -- Jack Krupansky
>
> -Original Message- From: Dan Bolser
> Sent: Saturday, July 5, 2014 10:36 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Field for 'species' data?
>
> One requirement is that the hierarchical facet implementation marches
> whatever the Drupal ApacheSolr module does with taxonomy terms.
>
> The key thing is to add the taxonomy to the doc which only has one 'leaf'
> term.
> On 5 Jul 2014 15:01, "Jack Krupansky"  wrote:
>
>  Focus on your data model and queries first, then you can decide on the
>> implementation.
>>
>> Take a semi-complex example and manually break it down into field values
>> and then write some queries, including filters, in English, that do the
>> required navigation. Once you have a handle on what fields you need to
>> populate, the analysis and processing details can be worked out.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Dan Bolser
>> Sent: Saturday, July 5, 2014 4:49 AM
>> To: solr-user
>> Subject: Re: Field for 'species' data?
>>
>> I'm super noob... Why choose to write it add a custom update request
>> processor rather than an analysis pipeline?
>>
>> Cheers, Dan.
>> On 5 Jul 2014 03:45, "Alexandre Rafalovitch"  wrote:
>>
>>  Do that with a custom update request processor.
>>
>>>
>>> Just remember Solr is there to find things not to preserve structure. So
>>> mangle your data until you can find it.
>>>
>>> Also check if SirenDB would fit your requirements if you want to encode
>>> the
>>> information as complex structure.
>>>
>>> Regards,
>>> Alex
>>>
>>>
>>>
>>
>


Re: error during heavy indexing

2014-07-05 Thread Shawn Heisey
On 7/5/2014 9:40 AM, navdeep agarwal wrote:
> i am getting following error on heavy indexing .i am using Solr 4.7
> .creating index in hdfs through map reduce .sending docs in batch of 50
> .
> 
> ERROR org.apache.solr.core.SolrCore  – java.lang.RuntimeException: [was
> class org.eclipse.jetty.io.EofException] early EOF

This means that your client (the software making the HTTP connections)
disconnected before Solr was finished with the request, most likely due
to a configured connection inactivity timeout.  It's actually Jetty that
reports the situation, since it is handling the low-level TCP layers.

I don't know enough about the stacktraces to know whether this was the
client doing the indexing, or whether it was a client doing queries.  If
I had to guess, I would say it was likely a client doing queries.  Your
queries are likely taking longer during heavy indexing, long enough to
exceed the client timeout.
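If the timeout turns out to be on a SolrJ client, it can be raised explicitly.
A rough sketch, with a placeholder URL and placeholder values:

import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class IndexingClient {
    // Raise the read timeout so long-running requests during heavy indexing
    // are not cut off by the client. Values here are illustrative only.
    static HttpSolrServer build() {
        HttpSolrServer server =
            new HttpSolrServer("http://localhost:8983/solr/collection1");
        server.setConnectionTimeout(5000); // ms to establish the TCP connection
        server.setSoTimeout(300000);       // ms of read inactivity before giving up
        return server;
    }
}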

Thanks,
Shawn



Re: Solr Map Reduce Indexer Tool GoLive to SolrCloud with index on local file system

2014-07-05 Thread Erick Erickson
Ok, I asked some folks who know and the response is that "that should
work, but it's not supported/tested". IOW, you're into somewhat
uncharted territory. The people who wrote the code don't have this
use-case in their priority list and probably won't be expending energy
in this direction any time soon.

So feel free! It'd be great if you reported/supplied patches for any
problems you run across, this has been a recurring theme with
HdfsDirectoryFactory and Solr replicas: "Why should three replicas
have 9 copies of the index laying around?"

Do note, however, that disk space is cheap and considerable work has been
done to minimize any performance issues with HDFS.

Best,
Erick

On Thu, Jul 3, 2014 at 9:18 AM, Tom Chen  wrote:
> Hi,
>
> In the GoLive stage, the MRIT sends MERGEINDEXES requests to the Solr
> instances. The request has an indexDir parameter with an HDFS path to the
> index generated on HDFS, as shown in the MRIT log:
>
> 2014-07-02 15:03:55,123 DEBUG
> org.apache.http.impl.conn.DefaultClientConnection: Sending request: GET
> /solr/admin/cores?action=MERGEINDEXES&core=collection1&indexDir=hdfs%3A%2F%
> 2Fhdtest041.test.com%3A9000%2Foutdir_webaccess_app%2Fresults%2Fpart-0%2Fdata%2Findex&wt=javabin&version=2
> HTTP/1.1
>
> So it's up to the Solr instance to understand reading index from HDFS
> (rather than for the MRIT to find the local disk to write from HDFS).
>
> The go-live option is very convenient for merging the generated index into
> the live index. It's preferable to use go-live rather than copy indexes
> around to the local file system and then merge.
>
> I tried to start Solr instance with these properties to allow solr instance
> to write to local file system while being able to read index on HDFS when
> doing MERGEINDEXES:
>
>   -Dsolr.directoryFactory=HdfsDirectoryFactory \
>   -Dsolr.hdfs.confdir=$HADOOP_HOME/hadoop-conf \
>   -Dsolr.lock.type=hdfs \
>   -Dsolr.hdfs.home=file:///opt/test/solr/node/solr \
>
> i.e. the full command:
> java -DnumShards=2 \
>   -Dbootstrap_confdir=./solr/collection1/conf
> -Dcollection.configName=myconf \
>   -DzkHost=:2181 \
>   -Dhost= \
>   -DSTOP.PORT=7983 -DSTOP.KEY=key \
>   -Dsolr.directoryFactory=HdfsDirectoryFactory \
>   -Dsolr.hdfs.confdir=$HADOOP_HOME/hadoop-conf \
>   -Dsolr.lock.type=hdfs \
>   -Dsolr.hdfs.home=file:///opt/test/solr/node/solr \
>   -jar start.jar
>
>
> With that, the  go-live works fine.
>
> Any comment on this approach?
>
>
>
> Tom
>
> On Wed, Jul 2, 2014 at 9:50 PM, Erick Erickson 
> wrote:
>
>> How would the MapReduceIndexerTool (MRIT for short)
>> find the local disk to write from HDFS to for each shard?
>> All it has is the information in the Solr configs, which are
>> usually relative paths on the local Solr machines, relative
>> to SOLR_HOME. Which could be different on each node
>> (that would be screwy, but possible).
>>
>> Permissions would also be a royal pain to get right
>>
>> You _can_ forego the --go-live option and copy from
>> the HDFS nodes to your local drive and then execute
>> the "mergeIndexes" command, see:
>> https://cwiki.apache.org/confluence/display/solr/Merging+Indexes
>> Note that there is the MergeIndexTool, but there's also
>> the Core Admin command.
>>
>> The sub-indexes are in a partition in HDFS and numbered
>> sequentially.
>>
>> Best,
>> Erick
>>
>> On Wed, Jul 2, 2014 at 3:23 PM, Tom Chen  wrote:
>> > Hi,
>> >
>> >
>> > When we run Solr Map Reduce Indexer Tool (
>> > https://github.com/markrmiller/solr-map-reduce-example), it generates
>> > indexes on HDFS
>> >
>> > The last stage is Go Live to merge the generated index to live SolrCloud
>> > index.
>> >
>> > If the live SolrCloud write index to local file system (rather than
>> HDFS),
>> > the Go Live gives such error like this:
>> >
>> > 2014-07-02 13:41:01,518 INFO org.apache.solr.hadoop.GoLive: Live merge
>> > hdfs://
>> >
>> bdvs086.test.com:9000/tmp/088-140618120223665-oozie-oozi-W/results/part-0
>> > into http://bdvs087.test.com:8983/solr
>> > 2014-07-02 13:41:01,796 ERROR org.apache.solr.hadoop.GoLive: Error
>> sending
>> > live merge command
>> > java.util.concurrent.ExecutionException:
>> > org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> > directory '/opt/testdir/solr/node/hdfs:/
>> >
>> bdvs086.test.com:9000/tmp/088-140618120223665-oozie-oozi-W/results/part-1/data/index
>> '
>> > does not exist
>> > at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:233)
>> > at java.util.concurrent.FutureTask.get(FutureTask.java:94)
>> > at org.apache.solr.hadoop.GoLive.goLive(GoLive.java:126)
>> > at
>> >
>> org.apache.solr.hadoop.MapReduceIndexerTool.run(MapReduceIndexerTool.java:867)
>> > at
>> >
>> org.apache.solr.hadoop.MapReduceIndexerTool.run(MapReduceIndexerTool.java:609)
>> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>> > at
>> >
>> org.apache.solr.hadoop.MapReduceIndexerTool.main(MapReduceIndexerTool.java:596)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Re: Solr 4.7 Payload

2014-07-05 Thread Erick Erickson
Take a look at PayloadTermQuery, I think that should give you some
hints.

Best,
Erick
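As an aside, here is a minimal sketch of the positions-enum approach used in
the quoted code below (not PayloadTermQuery itself): in the 4.x API,
getPayload() is only defined after nextPosition() has been called, which is
the usual reason it comes back null. The reader and term setup are assumed to
match the original snippet.

import java.io.IOException;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.util.BytesRef;

public class PayloadDump {
    // Print the float payload stored on each position of the given term.
    static void dumpPayloads(AtomicReader reader, Term t) throws IOException {
        DocsAndPositionsEnum dpe = reader.termPositionsEnum(t);
        if (dpe == null) {
            return; // term not present, or no positions indexed for this field
        }
        int doc;
        while ((doc = dpe.nextDoc()) != DocsAndPositionsEnum.NO_MORE_DOCS) {
            for (int i = 0; i < dpe.freq(); i++) {
                dpe.nextPosition();            // must be called before getPayload()
                BytesRef payload = dpe.getPayload();
                if (payload != null) {
                    float v = PayloadHelper.decodeFloat(payload.bytes, payload.offset);
                    System.out.println("doc=" + doc + " payload=" + v);
                }
            }
        }
    }
}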

On Fri, Jul 4, 2014 at 8:19 AM, Ranjith Venkatesan
 wrote:
> Hi all,
>
> I am evaluating payloads in Lucene. I am using Solr 4.7.2 for this. I could
> index with payloads, but I couldn't retrieve the payload from
> DocsAndPositionsEnum; it returns just null. But terms.hasPayloads() is
> returning true, and I can see the payload value in Luke (image attached
> below).
>
> I have the following schema for the payload field:
>
> *schema.xml*
>
> <fieldtype name="payloads" stored="false" indexed="true" class="solr.TextField">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
>   </analyzer>
> </fieldtype>
>
> *My indexing code,*
>
> for(int i=1;i<=1000;i++)
> {
> SolrInputDocument doc1= new SolrInputDocument();
> doc1.addField("id", "test:"+i);
> doc1.addField("uid", ""+i);
> doc1.addField("payloads", "_UID_|"+i+"f");
> doc1.addField("content", "test");
>
> server.add(doc1);
> if(i%1 == 0)
> {
> server.commit();
> }
> }
>
> server.commit();
>
> *Search code :*
> DocsAndPositionsEnum termPositionsEnum =
> solrSearcher.getAtomicReader().termPositionsEnum(t);
> int doc = -1;
>
> while((doc = termPositionsEnum.nextDoc()) !=
> DocsAndPositionsEnum.NO_MORE_DOCS)
> {
> System.out.println(termPositionsEnum.getPayload()); // returns null
> }
>
>
> *luke *
> 
>
> Am I missing some configuration, or am I doing it the wrong way? Any help in
> resolving this issue will be appreciated.
>
> Thanks in advance
>
> Ranjith Venkatesan
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-4-7-Payload-tp4145641.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr and SolrCloud replication, and load balancing questions.

2014-07-05 Thread Erick Erickson
Question1, both sub-cases.

You're off on the wrong track here, you have to forget about replication.

When documents are added to the index, they get forwarded to _all_
replicas. So the flow is like this...
1> leader gets update request
2> leader indexes docs locally, and adds to (local) transaction log
  _and_ forwards request to all followers
3> followers add docs to tlog and index locally
4> followers ack back to leader
5> leader acks back to client.

There is no replication in the old sense at all in this scenario. I'll
add parenthetically that old-style replication _is_ still used to
"catch up" a follower that is waay behind, but the follower is
in the "recovering" state if this ever occurs.

About commit. If you commit from the client, the commit is forwarded
to all followers (actually, all nodes in the collection). If you have
autocommit configured, each of the replicas will fire their commit when
the time period expires.

Here's a blog that might help:
http://searchhub.org/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

[B] right, SolrCloud really supposes that the machines are pretty
similar so doesn't provide any way to do what you're asking. Really,
you're asking for some way to assign "beefiness" to the node in terms
of load sent to it... I don't know of a way to do that and I'm not
sure it's on the roadmap either.

What you'd really want, though, is some kind of heuristic that was
automatically applied. That would take into account transient load
problems, i.e. replica N happened to get a really nasty query to run
and is just slow for a while. I can see this being very tricky to get
right though. Would a GC pause get weighted as "slow" even though the
pause could be over already? Anyway, I don't think this is on the
roadmap at present but could well be wrong.

In your specific example, though (this works because of the convenient
2x) you could host 2x the number of shards/replicas on the beefier
machines.

[C] Right, memory allocation is difficult. The general recommendation
is that memory for Solr allocated in the JVM should be as small as
possible, and let the op system use the memory for MMapDirectory.
See the excellent blog here:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html.
If you over-allocate memory to the JVM, your GC profile worsens...

Generally, when people throw "memory" around they're talking about JVM memory...

And don't be misled by the notion of "the index fitting into memory".
You're absolutely right that when you get into a swapping situation,
performance will suffer. But there are some very interesting tricks
played to keep JVM consumption down. For instance, only every 128th
term is stored in JVM memory; other terms are then read as needed
and served from OS memory via the MMapDirectory implementations.

Your GC stats look quite reasonable. You can get a snapshot of memory
usage by attaching, say, jConsole to the running JVM and see what
memory usage was after a forced GC. Sounds like you've already seen
this, but in case not:
http://searchhub.org/2011/03/27/garbage-collection-bootcamp-1-0/. It
was written before there was much mileage on the new G1 garbage
collector which has received mixed reviews.

Note that the stored fields kept in memory are controlled by the
documentCache in solrconfig.xml. I think of this as just something
that holds documents when assembling the return list, it really
doesn't have anything to do with searching per-se, just keeping disk
seeks down during processing for a particular query. I.e. for a query
returning 10 rows, only 10 docs will be kept here not the 5M rows that
matched.
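For reference, it is declared in the <query> section of solrconfig.xml along
these lines (the sizes are just the stock example values):

<query>
  <documentCache class="solr.LRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="0"/> <!-- docs aren't autowarmed; internal ids change between commits -->
</query>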

Whether 4G is sufficient is not answerable. I've doubled the
memory requirements by changing the query without changing the index.
Here's a blog outlining why we can't predict and how to get an answer
empirically:
http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best,
Erick

On Sat, Jul 5, 2014 at 1:57 AM, Himanshu Mehrotra
 wrote:
> Hi,
>
> I had three quesions/doubts regarding Solr and SolrCloud functionality.
> Can anyone help clarify these? I know these are bit long, please bear with
> me.
>
> [A] Replication related - As I understand before SolrCloud, under a classic
> master/slave replication setup, every 'X' minutes slaves will pull/poll the
> updated index (index segments added and deleted/merged away ).  And when a
> client explicitly issues a 'commit' only master solr closes/finalizes
> current index segment and creates a new current index segment.  As port of
> this index segment merges as well as 'fsync' ensuring data is on the disk
> also happens.
>
> I read documentation regarding replication on SolrCloud but unfortunately
> it is still not very clear to me.
>
> Say I have solr cloud setup of 3 solr servers with just a single shard.
> Let's call them L (the leader) and F1 and F2, the followers.
>
> Case 1: We are not u

Re: Field for 'species' data?

2014-07-05 Thread Erick Erickson
re: do this in an update processor or in other parts of the pipeline:

whichever is easier, the result will be the same. Personally I like
putting stuff like this in other parts of the pipeline if for no other reason
than the load isn't concentrated on the Solr machine.

In particular if you enrich the document in the pipeline, you can then
scale up indexing by having multiple processes running the pipeline on
multiple clients. Eventually, you'll hit the Solr node's limits, but it'll
be later than if you do all your processing there.

It may be a little easier to manage since you don't have to worry about
getting your custom Jars to the solr nodes as you would in the update
processor case.
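For concreteness, wiring a custom update request processor means something like
this in solrconfig.xml (the factory class name here is made up), plus shipping
that jar to every node and selecting the chain via the update.chain parameter
or the update handler's defaults:

<updateRequestProcessorChain name="enrich">
  <processor class="com.example.TaxonomyExpanderProcessorFactory"/> <!-- hypothetical custom factory -->
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>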

But really, whatever is most convenient and meets your SLA. If you
are _already_ going to have a pipeline, there are fewer moving parts there

Best,
Erick

On Sat, Jul 5, 2014 at 9:02 AM, Dan Bolser  wrote:
> The latter
> On 5 Jul 2014 16:39, "Jack Krupansky"  wrote:
>
>> So, the immediate question is whether the value in the Solr source
>> document has the full taxonomy path for the species, or just parts, and
>> some external taxonomy definition must be consulted to "fill in" the rest
>> of the hierarchy path for that species.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Dan Bolser
>> Sent: Saturday, July 5, 2014 10:36 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Field for 'species' data?
>>
>> One requirement is that the hierarchical facet implementation marches
>> whatever the Drupal ApacheSolr module does with taxonomy terms.
>>
>> The key thing is to add the taxonomy to the doc which only has one 'leaf'
>> term.
>> On 5 Jul 2014 15:01, "Jack Krupansky"  wrote:
>>
>>  Focus on your data model and queries first, then you can decide on the
>>> implementation.
>>>
>>> Take a semi-complex example and manually break it down into field values
>>> and then write some queries, including filters, in English, that do the
>>> required navigation. Once you have a handle on what fields you need to
>>> populate, the analysis and processing details can be worked out.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Dan Bolser
>>> Sent: Saturday, July 5, 2014 4:49 AM
>>> To: solr-user
>>> Subject: Re: Field for 'species' data?
>>>
>>> I'm super noob... Why choose to write it add a custom update request
>>> processor rather than an analysis pipeline?
>>>
>>> Cheers, Dan.
>>> On 5 Jul 2014 03:45, "Alexandre Rafalovitch"  wrote:
>>>
>>>  Do that with a custom update request processor.
>>>

 Just remember Solr is there to find things not to preserve structure. So
 mangle your data until you can find it.

 Also check if SirenDB would fit your requirements if you want to encode
 the
 information as complex structure.

 Regards,
 Alex



>>>
>>


TieredMergePolicy

2014-07-05 Thread Kireet Reddy
I have a question about the maxMergeAtOnce parameter. We are using
Elasticsearch and one of our nodes seems to have very high merge activity;
however, it seems to be high CPU activity and not I/O constrained. I have
enabled the IndexWriter info stream logs, and often it seems to do merges of
quite small segments (100KB) that are much below the floor size (2MB). I
suspect this is due to frequent refreshes and/or using lots of threads
concurrently for indexing.

My supposition is that this is leading to the merge policy doing lots of merges
of very small segments into another small segment, which will again require a
merge to even reach the floor size. My index has 64 segments and 25 are below
the floor size. I am wondering if there should be an exception to the
maxMergeAtOnce parameter for the first level, so that many small segments could
be merged at once in this case?

I am considering changing the other parameters (wider tiers, lower floor size, 
more concurrent merges allowed) but these all seem to have side effects I may 
not necessarily want. Is there a good solution here?
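For reference, a sketch of the knobs being discussed, set directly on Lucene's
TieredMergePolicy (the values shown are the stock defaults, purely
illustrative; Elasticsearch and Solr expose the same settings through their own
configuration):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.util.Version;

public class MergePolicyConfig {
    static IndexWriterConfig buildConfig() {
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setMaxMergeAtOnce(10);     // max segments merged together in one merge
        tmp.setSegmentsPerTier(10.0);  // wider tier => merges fire less often but are bigger
        tmp.setFloorSegmentMB(2.0);    // segments below this count as floor-sized when scoring merges
        IndexWriterConfig iwc =
            new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        iwc.setMergePolicy(tmp);
        return iwc;
    }
}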