solr nested multivalued fields

2012-06-12 Thread jerome


I would like to produce the following result in a Solr search response, but I am
not sure whether it is possible (using Solr 3.6):


   <arr name="people">
  <lst>
 <str name="firstName">John</str>
 <str name="lastName">Darby</str>
  </lst>
  <lst>
 <str name="firstName">Sue</str>
 <str name="lastName">Berger</str>
  </lst>
   </arr>


However, I can't seem to get this tree-like structure in my results. At best I
can get something like the following, which is not even close:


   <arr name="people">
  <str>John</str>
  <str>Darby</str>
  <str>Sue</str>
  <str>Berger</str>
   </arr>


There are two problems here. First, I cannot seem to "group" these people
into a meaningful tag structure as in the top example. Second, I can't for
the life of me get the tags to display an attribute name like "lastName" or
"firstName" when inside an array.

In my project I am pulling this data using the DIH, and from the example above
one can see that this is a one-to-many relationship between groups and users.

I would really appreciate it if someone has some suggestions or alternative
thoughts.

Any assistance would be greatly appreciated




Re: solr nested multivalued fields

2012-06-12 Thread jerome
Thanks. From all the material I have looked at and searched, I am inclined to
believe that those are indeed my options; any others are still welcome...



Same query, inconsistent result in SolrCloud

2015-06-19 Thread Jerome Yang
Hi!

I'm facing a problem.
I'm using SolrCloud 4.10.3, with 2 shards, and each shard has 2 replicas.

After indexing data to the collection and running the same query,

http://localhost:8983/solr/catalog/select?q=a&wt=json&indent=true

Sometimes it returns the right result:

{
  "responseHeader":{
    "status":0,
    "QTime":19,
    "params":{
      "indent":"true",
      "q":"a",
      "wt":"json"}},
  "response":{"numFound":5,"start":0,"maxScore":0.43969032,"docs":[
      {},{},...
    ]
  }
}

But when I re-run the same query, it returns:

{
  "responseHeader":{
    "status":0,
    "QTime":14,
    "params":{
      "indent":"true",
      "q":"a",
      "wt":"json"}},
  "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
  },
  "highlighting":{}}


Only some short words trigger this kind of problem.

Does anyone know what's going on?

Thanks

Regards,

Jerome


Re: Same query, inconsistent result in SolrCloud

2015-06-23 Thread Jerome Yang
Dear Erick,

Thank you, I found it was a problem with my text segmentation setting.
Anyway, thanks.

Regards,
Jerome
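
(For reference, a minimal sketch of the per-replica check Erick suggests in the quoted message below; the core name and port are assumptions, yours will differ:)

curl 'http://localhost:8983/solr/catalog_shard1_replica1/select?q=*:*&rows=0&distrib=false&wt=json'

Comparing numFound across replicas of the same shard shows whether they hold the same documents.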

2015-06-21 0:43 GMT+08:00 Erick Erickson :

> Just that this _shouldn't_ be going on at all. Either
> 1> you've done something when setting up this collection that's
>  producing this incorrect result
> or
> 2> you've found something really bad.
>
> So:
> 1> How did you create your collection? Did you use the Collections API
> or try to define individual cores with the Core Admin api? If the latter,
> you likely missed some innocent-seeming property on the core and your
> collection isn't correctly configured. Please do NOT use the "core admin
> API"
> in SolrCloud unless you know the code very, very well. Use the
> Collections API always.
>
> 2> Try querying each replica with &distrib=false. That'll return only the
> docs on the particular replica you fire the query at. Do you have
> replicas in the _same_ shard with different numbers of docs? If so,
> what can you see that's different about those cores?
>
> 3> What does your clusterstate.json file show?
>
> 4> how did you index documents?
>
> Best,
> Erick
>
> On Fri, Jun 19, 2015 at 8:07 PM, Jerome Yang 
> wrote:
> > Hi!
> >
> > I'm facing a problem.
> > I'm using SolrCloud 4.10.3, with 2 shards, each shard have 2 replicas.
> >
> > After index data to the collection, and run the same query,
> >
> > http://localhost:8983/solr/catalog/select?q=a&wt=json&indent=true
> >
> > Sometimes, it return the right,
> >
> > {
> >   "responseHeader":{
> > "status":0,
> > "QTime":19,
> > "params":{
> >   "indent":"true",
> >   "q":"a",
> >   "wt":"json"}},
> >   "response":{"numFound":5,"start":0,"maxScore":0.43969032,"docs":[
> >   {},{},...
> >
> > ]
> >
> >   }
> >
> > }
> >
> > But, when I re-run the same query, it return :
> >
> > {
> >   "responseHeader":{
> > "status":0,
> > "QTime":14,
> > "params":{
> >   "indent":"true",
> >   "q":"a",
> >   "wt":"json"}},
> >   "response":{"numFound":0,"start":0,"maxScore":0.0,"docs":[]
> >   },
> >   "highlighting":{}}
> >
> >
> > Just some short word will show this kind of problem.
> >
> > Do anyone know what's going on?
> >
> > Thanks
> >
> > Regards,
> >
> > Jerome
>


Send kill -9 to a node and can not delete down replicas with onlyIfDown.

2016-07-18 Thread Jerome Yang
Hi all,

Here's the situation.
I'm using Solr 5.3 in cloud mode.

I have 4 nodes.

After using "kill -9 pid-solr-node" to kill 2 of the nodes, the replicas on
those two nodes still show as "ACTIVE" in ZooKeeper's state.json.

The problem is that when I try to delete these down replicas with the
parameter onlyIfDown='true', it says:
"Delete replica failed: Attempted to remove replica :
demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
'active'."
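
(The delete call being attempted presumably looks something like this hedged sketch, reusing the collection, shard, and replica names from the error above; host and port are assumptions:)

curl 'http://localhost:8983/solr/admin/collections?action=DELETEREPLICA&collection=demo.public.tbl&shard=shard0&replica=core_node4&onlyIfDown=true&wt=json'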

From this link:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/common/cloud/Replica.State.html#ACTIVE

It says:
"NOTE: when the node the replica is hosted on crashes, the replica's state
may remain ACTIVE in ZK. To determine if the replica is truly active, you
must also verify that its node is under /live_nodes in ZK (or use
ClusterState.liveNodesContain(String))."
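
A quick way to see live_nodes is the CLUSTERSTATUS call (a hedged sketch; host and port are assumptions):

curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'

The response lists live_nodes alongside the recorded replica states, so a replica marked "active" whose node is missing from live_nodes is exactly the situation described above.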

So, is this a bug?

Regards,
Jerome


Re: Send kill -9 to a node and can not delete down replicas with onlyIfDown.

2016-07-19 Thread Jerome Yang
What I'm doing is simulating a host-crash situation.

Consider this: a host is not connected to the cluster.

So, if a host crashed, I cannot delete the down replicas using
onlyIfDown='true'.
But in the Solr admin UI, these replicas show as down.
And without "onlyIfDown", it still shows a failure:
Delete replica failed: Attempted to remove replica :
demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
'active'.

Is this the right behavior? If a host is gone, I cannot delete the replicas on
that host?

Regards,
Jerome

On Wed, Jul 20, 2016 at 1:58 AM, Justin Lee  wrote:

> Thanks for taking the time for the detailed response. I completely get what
> you are saying. Makes sense.
> On Tue, Jul 19, 2016 at 10:56 AM Erick Erickson 
> wrote:
>
> > Justin:
> >
> > Well, "kill -9" just makes it harder. The original question
> > was whether a replica being "active" was a bug, and it's
> > not when you kill -9; the Solr node has no chance to
> > tell Zookeeper it's going away. ZK does modify
> > the live_nodes by itself, thus there are checks as
> > necessary when a replica's state is referenced
> > whether the node is also in live_nodes. And an
> > overwhelming amount of the time this is OK, Solr
> > recovers just fine.
> >
> > As far as the write locks are concerned, those are
> > a Lucene level issue so if you kill Solr at just the
> > wrong time it's possible that that'll be left over. The
> > write locks are held for as short a period as possible
> > by Lucene, but occasionally they can linger if you kill
> > -9.
> >
> > When a replica comes up, if there is a write lock already, it
> > doesn't just take over; it fails to load instead.
> >
> > A kill -9 won't bring the cluster down by itself except
> > if there are several coincidences. Just don't make
> > it a habit. For instance, consider if you kill -9 on
> > two Solrs that happen to contain all of the replicas
> > for a shard1 for collection1. And you _happen_ to
> > kill them both at just the wrong time and they both
> > leave Lucene write locks for those replicas. Now
> > no replica will come up for shard1 and the collection
> > is unusable.
> >
> > So the shorter form is that using "kill -9" is a poor practice
> > that exposes you to some risk. The hard-core Solr
> > guys work extremely hard to compensate for this kind
> > of thing, but kill -9 is a harsh, last-resort option and
> > shouldn't be part of your regular process. And you should
> > expect some "interesting" states when you do. And
> > you should use the bin/solr script to stop Solr
> > gracefully.
> >
> > Best,
> > Erick
> >
> >
> > On Tue, Jul 19, 2016 at 9:29 AM, Justin Lee 
> > wrote:
> > > Pardon me for hijacking the thread, but I'm curious about something you
> > > said, Erick.  I always thought that the point (in part) of going
> through
> > > the pain of using zookeeper and creating replicas was so that the
> system
> > > could seamlessly recover from catastrophic failures.  Wouldn't an OOM
> > > condition have a similar effect (or maybe java is better at cleanup on
> > that
> > > kind of error)?  The reason I ask is that I'm trying to set up a solr
> > > system that is highly available and I'm a little bit surprised that a
> > kill
> > > -9 on one process on one machine could put the entire system in a bad
> > > state.  Is it common to have to address problems like this with manual
> > > intervention in production systems?  Ideally, I'd hope to be able to
> set
> > up
> > > a system where a single node dying a horrible death would never require
> > > intervention.
> > >
> > > On Tue, Jul 19, 2016 at 8:54 AM Erick Erickson <
> erickerick...@gmail.com>
> > > wrote:
> > >
> > >> First of all, killing with -9 is A Very Bad Idea. You can
> > >> leave write lock files laying around. You can leave
> > >> the state in an "interesting" place. You haven't given
> > >> Solr a chance to tell Zookeeper that it's going away.
> > >> (which would set the state to "down"). In short
> > >> when you do this you have to deal with the consequences
> > >> yourself, one of which is this mismatch between
> > >> cluster state and live_nodes.
> > >>
> > >> Now, that rant done the bin/solr script tries to stop Solr
> >

Re: Send kill -9 to a node and can not delete down replicas with onlyIfDown.

2016-07-20 Thread Jerome Yang
Thanks a lot everyone!

By setting onlyIfDown=false, it did remove the replica, but it still returned a
failure message.
That confused me.

Anyway, thanks Erick and Chris.

Regards,
Jerome

On Thu, Jul 21, 2016 at 5:47 AM, Chris Hostetter 
wrote:

>
> Maybe the problem here is some confusion/ambiguity about the meaning of
> "down" ?
>
> TL;DR: think of "onlyIfDown" as "onlyIfShutDownCleanly"
>
>
> IIUC, the purpose of the 'onlyIfDown' is a safety valve so (by default)
> the cluster will prevent you from removing a replica that wasn't shutdown
> *cleanly* and is officially in a "down" state -- as recorded in the
> ClusterState for the collection (either the collections state.json or the
> global clusterstate.json if you have an older solr instance)
>
> when you kill -9 a solr node, the replicas that were hosted on that node
> will typically still be listed in the cluster state as "active" -- but it
> will *not* be in live_nodes, which is how solr knows that replica can't
> currently be used (and leader recovery happens as needed, etc...).
>
> If, however, you shut the node down cleanly (or if -- for whatever reason
> -- the node is up, but the replica's SolrCore is not active) then the
> cluster state will record that replica as "down"
>
> Where things unfortunately get confusing is that the CLUSTERSTATUS api
> call -- apparently in an attempt to simplify things -- changes the
> recorded status of any replica to "down" if that replica is hosted on a
> node which is not in live_nodes.
>
> I suspect that since the UI uses the CLUSTERSTATUS api to get its state
> information, it doesn't display much difference between a replica shut down
> cleanly and a replica that is hosted on a node which died abruptly.
>
> I suspect that's where your confusion is coming from?
>
>
> Ultimately, what onlyIfDown is trying to do is help ensure that you don't
> accidentally delete a replica that you didn't mean to.  the operating
> assumption is that the only replicas you will (typically) delete are
> replicas that you shut down cleanly ... if a replica is down because of a
> hard crash, then that is an exceptional situation and presumably you will
> either: a) try to bring the replica back up; b) delete the replica using
> onlyIfDown=false to indicate that you know the replica you are deleting
> isn't 'down' intentionally, but you want to delete it anyway.
>
>
>
>
>
> On Wed, 20 Jul 2016, Erick Erickson wrote:
>
> : Date: Wed, 20 Jul 2016 08:26:32 -0700
> : From: Erick Erickson 
> : Reply-To: solr-user@lucene.apache.org
> : To: solr-user 
> : Subject: Re: Send kill -9 to a node and can not delete down replicas with
> : onlyIfDown.
> :
> : Yes, it's the intended behavior. The whole point of the
> : onlyIfDown flag was as a safety valve for those
> : who wanted to be cautious and guard against typos
> : and the like.
> :
> : If you specify onlyIfDown=false and the node still
> : isn't removed from ZK, it's not right.
> :
> : Best,
> : Erick
> :
> : On Tue, Jul 19, 2016 at 10:41 PM, Jerome Yang  wrote:
> : > What I'm doing is to simulate host crashed situation.
> : >
> : > Consider this, a host is not connected to the cluster.
> : >
> : > So, if a host crashed, I can not delete the down replicas by using
> : > onlyIfDown='true'.
> : > But in solr admin ui, it shows down for these replicas.
> : > And whiteout "onlyIfDown", it still show a failure:
> : > Delete replica failed: Attempted to remove replica :
> : > demo.public.tbl/shard0/core_node4 with onlyIfDown='true', but state is
> : > 'active'.
> : >
> : > Is this the right behavior? If a hosts gone, I can not delete replicas
> in
> : > this host?
> : >
> : > Regards,
> : > Jerome
> : >
> : > On Wed, Jul 20, 2016 at 1:58 AM, Justin Lee 
> wrote:
> : >
> : >> Thanks for taking the time for the detailed response. I completely
> get what
> : >> you are saying. Makes sense.
> : >> On Tue, Jul 19, 2016 at 10:56 AM Erick Erickson <
> erickerick...@gmail.com>
> : >> wrote:
> : >>
> : >> > Justin:
> : >> >
> : >> > Well, "kill -9" just makes it harder. The original question
> : >> > was whether a replica being "active" was a bug, and it's
> : >> > not when you kill -9; the Solr node has no chance to
> : >> > tell Zookeeper it's going away. ZK does modify
> : >> > the live_nodes by itself, thu

Delete replica on down node, after start down node, the deleted replica comes back.

2016-08-16 Thread Jerome Yang
Hi all,

I ran into some strange behavior,
on both Solr 6.1 and Solr 5.3.

For example, there are 4 nodes in cloud mode and one of them is stopped.
Then I delete a replica on the down node.
After that, I start the down node again.
The deleted replica comes back.

Is this normal behavior?

Same situation: 4 nodes, 1 node down.
I delete a collection.
After starting the down node,
that collection's replicas on the down node come back again.
And I cannot use the Collections API DELETE to delete it;
it says the collection does not exist.
But if I use the CREATE action to create a collection with the same name,
it says the collection already exists.
The only way to make things right is to clean it up manually in ZooKeeper
and the data directory.

How can I prevent this from happening?

Regards,
Jerome


In cloud mode, using implicit router. Leader changed, not available to index data, and no error occurred.

2016-09-18 Thread Jerome Yang
Hi all,

The situation is:
Three hosts: host1, host2, host3. Solr version 6.1 in cloud mode, with 8 Solr
nodes on each host.

I create a collection using the implicit router, run indexing and deletes, and
the collection works fine.
Then I kill 3 nodes, and some shards change leader.
Then I index data to the new shard leaders and commit, but some shards
still have 0 documents, and no error occurs.
Checking the log on such a leader replica, it did receive the update
request and processed it. No error was found in the log.

After restarting all nodes, everything works fine.

I think this is a serious bug.
Can you confirm it's a bug or not?

Regards,
Jerome


Re: In cloud mode, using implicit router. Leader changed, not available to index data, and no error occurred.

2016-09-19 Thread Jerome Yang
I'm sure I sent documents to that shard and executed a commit.

I also used curl to index, but no error occurred and no documents were
indexed.
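
(A hedged sketch of explicit routing with the implicit router; the collection name, shard name, and data file are assumptions, and the _route_ parameter names the target shard:)

curl 'http://localhost:8983/solr/mycollection/update?commit=true&_route_=shard1' --data-binary @docs.json -H 'Content-type:application/json'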

On Mon, Sep 19, 2016 at 11:27 PM, Erick Erickson 
wrote:

> Are all the documents in the collection? By using implicit router, you are
> assuming control of what shard each document ends up on. So my
> guess is that you are not routing the docs to each shard.
>
> If you want Solr to automatically assign the shard to a doc, you should
> be using the default compositeId routing scheme.
>
> If you index docs and not all of them are somewhere in the collection,
> that's a problem, assuming you are routing them properly when using
> the implicit router.
>
> Best,
> Erick
>
> On Sun, Sep 18, 2016 at 8:04 PM, Jerome Yang  wrote:
> > Hi all,
> >
> > The situation is:
> > Three hosts, host1, host2, host3. Solr version 6.1 in cloud mode. 8 solr
> > nodes on each host.
> >
> > Create a collection using implicit router. Execute index and delete
> index.
> > The collection works fine.
> > Then kill 3 nodes, some of shards change leader.
> > Then index data to new leaders of shards, and commit. But some of shards
> > still has 0 documents. And no error occurred.
> > By checking the log on that leader replica, it did receive the update
> > request and processed. No error found in the log.
> >
> > After restart all nodes, everything works fine.
> >
> > This is a serious bug I think.
> > Can you confirm it's a bug or not?
> >
> > Regards,
> > Jerome
>


Re: In cloud mode, using implicit router. Leader changed, not available to index data, and no error occurred.

2016-09-19 Thread Jerome Yang
That shard did receive the update request (it shows in the log),
and also the commit request.
But no documents were indexed.

On Tue, Sep 20, 2016 at 2:26 PM, Jerome Yang  wrote:

> I'm sure I send documents to that shard. And execute commit.
>
> I also use curl to index, but not error occurred and no documents are
> indexed.
>
> On Mon, Sep 19, 2016 at 11:27 PM, Erick Erickson 
> wrote:
>
>> Are all the documents in the collection? By using implicit router, you are
>> assuming control of what shard each document ends up on. So my
>> guess is that you are not routing the docs to each shard.
>>
>> If you want Solr to automatically assign the shard to a doc, you should
>> be using the default compositeId routing scheme.
>>
>> If you index docs and not all of them are somewhere in the collection,
>> that's a problem, assuming you are routing them properly when using
>> the implicit router.
>>
>> Best,
>> Erick
>>
>> On Sun, Sep 18, 2016 at 8:04 PM, Jerome Yang  wrote:
>> > Hi all,
>> >
>> > The situation is:
>> > Three hosts, host1, host2, host3. Solr version 6.1 in cloud mode. 8 solr
>> > nodes on each host.
>> >
>> > Create a collection using implicit router. Execute index and delete
>> index.
>> > The collection works fine.
>> > Then kill 3 nodes, some of shards change leader.
>> > Then index data to new leaders of shards, and commit. But some of shards
>> > still has 0 documents. And no error occurred.
>> > By checking the log on that leader replica, it did receive the update
>> > request and processed. No error found in the log.
>> >
>> > After restart all nodes, everything works fine.
>> >
>> > This is a serious bug I think.
>> > Can you confirm it's a bug or not?
>> >
>> > Regards,
>> > Jerome
>>
>
>


Solrcloud after restore collection, when index new documents into restored collection, leader not write to index.

2016-10-11 Thread Jerome Yang
Hi all,

I'm facing a strange problem.

Here's a SolrCloud setup on a single machine with 2 Solr nodes, version
Solr 6.1.

I create a collection called "test_collection" with 2 shards, a replication
factor of 3, and the default router.
I index some documents and commit, then back up this collection.
After that, I restore from the backup and name the restored collection
"restore_test_collection".
Querying "restore_test_collection" works fine and the data is consistent.

Then I index some new documents and commit.
I find that the new documents are all indexed into shard1, and the leader of
shard1 doesn't have these new documents but the other replicas do.

Has anyone seen this issue?
I really need your help.
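
(For reference, a hedged sketch of the backup and restore calls described above, using the Collections API; the backup name and location are assumptions:)

curl 'http://localhost:8983/solr/admin/collections?action=BACKUP&name=test_backup&collection=test_collection&location=/tmp/solr_backups&wt=json'
curl 'http://localhost:8983/solr/admin/collections?action=RESTORE&name=test_backup&collection=restore_test_collection&location=/tmp/solr_backups&wt=json'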

Regards,
Jerome


Re: Solrcloud after restore collection, when index new documents into restored collection, leader not write to index.

2016-10-11 Thread Jerome Yang
Using curl, I did some tests.

curl 'http://localhost:8983/solr/restore_test_collection/update?commit=true&wt=json' --data-binary @test.json -H 'Content-type:application/json'

The leader doesn't have the new documents, but the other replicas do.

curl 'http://localhost:8983/solr/restore_test_collection/update?commitWithin=1000&wt=json' --data-binary @test.json -H 'Content-type:application/json'

All replicas in shard1, including the leader, have the new documents, and all
new documents are routed to shard1.

On Tue, Oct 11, 2016 at 5:27 PM, Jerome Yang  wrote:

> Hi all,
>
> I'm facing a strange problem.
>
> Here's a solrcloud on a single machine which has 2 solr nodes, version:
> solr6.1.
>
> I create a collection with 2 shards and replica factor is 3 with default
> router called "test_collection".
> Index some documents and commit. Then I backup this collection.
> After that, I restore from the backup and name the restored collection
> "restore_test_collection".
> Query from "restore_test_collection". It works fine and data is consistent.
>
> Then, I index some new documents, and commit.
> I find that the documents are all indexed in shard1 and the leader of
> shard1 don't have these new documents but other replicas do have these new
> documents.
>
> Anyone have this issue?
> Really need your help.
>
> Regards,
> Jerome
>


Re: Solrcloud after restore collection, when index new documents into restored collection, leader not write to index.

2016-10-11 Thread Jerome Yang
@Mark Miller Please help~

On Tue, Oct 11, 2016 at 5:32 PM, Jerome Yang  wrote:

> Using curl do some tests.
>
> curl 'http://localhost:8983/solr/restore_test_collection/update?
> *commit=true*&wt=json' --data-binary @test.json -H
> 'Content-type:application/json'
>
> The leader don't have new documents, but other replicas have.
>
> curl 'http://localhost:8983/solr/restore_test_collection/update?
> *commitWithin**=1000*&wt=json' --data-binary @test.json -H
> 'Content-type:application/json'
> All replicas in shard1 have new documents include leader, and all new
> documents route to shard1.
>
> On Tue, Oct 11, 2016 at 5:27 PM, Jerome Yang  wrote:
>
>> Hi all,
>>
>> I'm facing a strange problem.
>>
>> Here's a solrcloud on a single machine which has 2 solr nodes, version:
>> solr6.1.
>>
>> I create a collection with 2 shards and replica factor is 3 with default
>> router called "test_collection".
>> Index some documents and commit. Then I backup this collection.
>> After that, I restore from the backup and name the restored collection
>> "restore_test_collection".
>> Query from "restore_test_collection". It works fine and data is
>> consistent.
>>
>> Then, I index some new documents, and commit.
>> I find that the documents are all indexed in shard1 and the leader of
>> shard1 don't have these new documents but other replicas do have these new
>> documents.
>>
>> Anyone have this issue?
>> Really need your help.
>>
>> Regards,
>> Jerome
>>
>
>


Re: Solrcloud after restore collection, when index new documents into restored collection, leader not write to index.

2016-10-11 Thread Jerome Yang
Hi Shawn,

I just checked the clusterstate.json
(http://192.168.33.10:18983/solr/admin/zookeeper?detail=true&path=%2Fclusterstate.json)
which was restored for "restore_test_collection".
The router is "router":{"name":"compositeId"},
not implicit.

So I think it's a very serious bug.
Should this bug go into JIRA?

Please help!

Regards,
Jerome


On Tue, Oct 11, 2016 at 8:34 PM, Shawn Heisey  wrote:

> On 10/11/2016 3:27 AM, Jerome Yang wrote:
> > Then, I index some new documents, and commit. I find that the
> > documents are all indexed in shard1 and the leader of shard1 don't
> > have these new documents but other replicas do have these new documents.
>
> Not sure why the leader would be missing the documents but other
> replicas have them, but I do have a theory about why they are only in
> shard1.  Testing that theory will involve obtaining some information
> from your system:
>
> What is the router on the restored collection? You can see this in the
> admin UI by going to Cloud->Tree, opening "collections", and clicking on
> the collection.  In the right-hand side, there will be some info from
> zookeeper, with some JSON below it that should mention the router.  I
> suspect that the router on the new collection may have been configured
> as implicit, instead of compositeId.
>
> Thanks,
> Shawn
>
>


Re: Solrcloud after restore collection, when index new documents into restored collection, leader not write to index.

2016-10-11 Thread Jerome Yang
@Erick Please help😂

On Wed, Oct 12, 2016 at 10:21 AM, Jerome Yang  wrote:

> Hi Shawn,
>
> I just check the clusterstate.json
> <http://192.168.33.10:18983/solr/admin/zookeeper?detail=true&path=%2Fclusterstate.json>
>  which
> is restored for "restore_test_collection".
> The router is "router":{"name":"compositeId"},
> not implicit.
>
> So, it's a very serious bug I think.
> Should this bug go into jira?
>
> Please help!
>
> Regards,
> Jerome
>
>
> On Tue, Oct 11, 2016 at 8:34 PM, Shawn Heisey  wrote:
>
>> On 10/11/2016 3:27 AM, Jerome Yang wrote:
>> > Then, I index some new documents, and commit. I find that the
>> > documents are all indexed in shard1 and the leader of shard1 don't
>> > have these new documents but other replicas do have these new documents.
>>
>> Not sure why the leader would be missing the documents but other
>> replicas have them, but I do have a theory about why they are only in
>> shard1.  Testing that theory will involve obtaining some information
>> from your system:
>>
>> What is the router on the restored collection? You can see this in the
>> admin UI by going to Cloud->Tree, opening "collections", and clicking on
>> the collection.  In the right-hand side, there will be some info from
>> zookeeper, with some JSON below it that should mention the router.  I
>> suspect that the router on the new collection may have been configured
>> as implicit, instead of compositeId.
>>
>> Thanks,
>> Shawn
>>
>>
>


Reload schema or configs failed then drop index, can not recreate that index.

2016-11-22 Thread Jerome Yang
Hi all,


Here's my situation:

In cloud mode.

   1. I created a collection called "test" and then modified the
   managed-schema. I got an error, as shown in picture 2.
   2. To get a fuller error message, I checked the Solr logs and got the message
   shown in picture 3.
   3. If I corrected the managed-schema, everything would be fine. But I
   dropped the index, and the index couldn't be created again, as in picture 4.
   I restarted gptext using "gptext-start -r" and recreated the index, and it was
   created successfully, as in picture 5.


Re: Reload schema or configs failed then drop index, can not recreate that index.

2016-11-22 Thread Jerome Yang
Sorry, wrong message. To correct it:

In cloud mode.

   1. I created a collection called "test" and then modified the
   managed-schema, writing something wrong, for example
   "id"; reloading the collection then failed.
   2. Then I dropped the collection "test" and deleted the configs from ZooKeeper.
   That works fine: the collection is removed both from ZooKeeper and from disk.
   3. I uploaded the correct configs under the same name as before and tried to
   create a collection named "test"; it failed with the error "core with name
   '*' already exists", but actually there is no such core.
   4. After restarting the whole cluster and doing the create again, everything
   works fine.


I think that when the collection is deleted, something is still held
somewhere and not cleaned up.
Please have a look.

Regards,
Jerome

On Wed, Nov 23, 2016 at 10:16 AM, Jerome Yang  wrote:

> Hi all,
>
>
> Here's my situation:
>
> In cloud mode.
>
>1. I created a collection called "test" and then modified the
>managed-schemaI got an error as shown in picture 2.
>2. To get enough error message, I checked solr logs and get message
>shown in picture 3.
>3. If I corrected the managed-schema, everything would be fine. But I
>dropped the index. The index couldn't be created it again, like picture 4.
>I restarted gptext using "gptext-start -r" and recreated the index, it was
>created successfully like picture 5.
>
>


Re: Solr 6 Performance Suggestions

2016-11-22 Thread Jerome Yang
Have you run IndexUpgrader?

Index Format Changes

Solr 6 has no support for reading Lucene/Solr 4.x and earlier indexes.  Be
sure to run the Lucene IndexUpgrader included with Solr 5.5 if you might
still have old 4x formatted segments in your index. Alternatively: fully
optimize your index with Solr 5.5 to make sure it consists only of one
up-to-date index segment.
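
(A hedged sketch of running IndexUpgrader from the command line; the jar versions and index path are assumptions, and the backward-codecs jar is needed to read 4.x segments:)

java -cp lucene-core-5.5.0.jar:lucene-backward-codecs-5.5.0.jar org.apache.lucene.index.IndexUpgrader -delete-prior-commits /path/to/core/data/index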

Regards,
Jerome

On Tue, Nov 22, 2016 at 10:48 PM, Yonik Seeley  wrote:

> It depends highly on what your requests look like, and which ones are
> slower.
> If your request mix is heterogeneous, find the types of requests
> that seem to have the largest slowdown and let us know what they look
> like.
>
> -Yonik
>
>
> On Tue, Nov 22, 2016 at 8:54 AM, Max Bridgewater
>  wrote:
> > I migrated an application from Solr 4 to Solr 6.  solrconfig.xml  and
> > schema.xml are sensibly the same. The JVM params are also pretty much
> > similar.  The indices each have about 2 million documents. No particular
> > tuning was done to Solr 6 beyond the default settings. Solr 4 is running
> in
> > Tomcat 7.
> >
> > Early results seem to show Solr 4 outperforming Solr 6. The first shows
> an
> > average response time of 280 ms while the second averages at 430 ms. The
> > test cases were exactly the same, the machines were exactly the same and
> > heap settings exactly the same (Xms24g, Xmx24g). Requests were sent with
> > Jmeter with 50 concurrent threads for 2h.
> >
> > I know that this is not enough information to claim that Solr 4 generally
> > outperforms Solr 6. I also know that this pretty much depends on what the
> > application does. So I am not claiming anything general. All I want to do
> > is get some input before I start digging.
> >
> > What are some things I could tune to improve the numbers for Solr 6? Have
> > you guys experienced such discrepancies?
> >
> > Thanks,
> > Max.
>


Re: Reload schema or configs failed then drop index, can not recreate that index.

2016-11-23 Thread Jerome Yang
It's Solr 6.1, cloud mode.

Please ignore the first message; just check my second email.

I mean: if I modify an existing collection's managed-schema and the
modification makes reloading the collection fail,
then I delete the collection and delete the configs from ZooKeeper.
After that I upload configs under the same name as before, with the
unmodified managed-schema.
Then recreating the collection throws an error, "core already
exists", but actually it doesn't exist.
After restarting the whole cluster, recreating the collection succeeds.

Regards,
Jerome


On Wed, Nov 23, 2016 at 3:26 PM, Erick Erickson 
wrote:

> The mail server is pretty heavy-handed at deleting attachments, none of
> your
> (presumably) screenshots came through.
>
> You also haven't told us what version of Solr you're using.
>
> Best,
> Erick
>
> On Tue, Nov 22, 2016 at 6:25 PM, Jerome Yang  wrote:
> > Sorry, wrong message.
> > To correct.
> >
> > In cloud mode.
> >
> >1. I created a collection called "test" and then modified the
> >managed-schemaI, write something wrong, for example
> >"id", then reload collection would failed.
> >2. Then I drop the collection "test" and delete configs form
> zookeeper.
> >It works fine. The collection is removed both from zookeeper and hard
> disk.
> >3. Upload the right configs with the same name as before, try to
> create
> >collection as name "test", it would failed and the error is "core
> with name
> >'*' already exists". But actually not.
> >4. The restart the whole cluster, do the create again, everything
> works
> >fine.
> >
> >
> > I think when doing the delete collection, there's something still hold in
> > somewhere not deleted.
> > Please have a look
> >
> > Regards,
> > Jerome
> >
> > On Wed, Nov 23, 2016 at 10:16 AM, Jerome Yang  wrote:
> >
> >> Hi all,
> >>
> >>
> >> Here's my situation:
> >>
> >> In cloud mode.
> >>
> >>1. I created a collection called "test" and then modified the
> >>managed-schemaI got an error as shown in picture 2.
> >>2. To get enough error message, I checked solr logs and get message
> >>shown in picture 3.
> >>3. If I corrected the managed-schema, everything would be fine. But I
> >>dropped the index. The index couldn't be created it again, like
> picture 4.
> >>I restarted gptext using "gptext-start -r" and recreated the index,
> it was
> >>created successfully like picture 5.
> >>
> >>
>


Re: SolrCloud -Distribued Indexing

2016-11-23 Thread Jerome Yang
Hi,

1. You can use the Solr Collections API to create a collection with the
"implicit" router.
Please check CREATE:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api1
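
A hedged sketch of such a CREATE call (collection name, shard names, and config name are assumptions):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&router.name=implicit&shards=shard1,shard2&replicationFactor=1&collection.configName=myconfig&wt=json'

Documents can then be directed to a specific shard with the _route_ request parameter or a router.field.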

2. There are several ways to indicate which collection you want to send requests
to:
a> setDefaultCollection
b> sendRequest(SolrRequest request, String collection)
Please check
https://lucene.apache.org/solr/6_1_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html

Regards,
Jerome


On Wed, Nov 23, 2016 at 6:43 PM, Udit Tyagi  wrote:

> Hi,
>
> I am a solr user, I am using solr-6.3.0 version, I have some doubts about
> Distributed indexing and sharding in SolrCloud, please clarify,
>
> 1. How can I index documents to a specific shard(I heard about document
> routing not documentation is not proper for that).
>
> I am using solr create command from terminal to create collection i don't
> have any option to specify router name while creating collection from
> terminal so how can i implement implicit router for my collection.
>
> 2.In documentation of Solr-6.3.0 for client API solrj the way to connect to
> solrcloud is specified as
>
> String zkHostString = "zkServerA:2181,zkServerB:2181,zkServerC:2181/solr";
> SolrClient solr = new
> CloudSolrClient.Builder().withZkHost(zkHostString).build();
>
> please update documentation or reply back how can i specify collection name
> to
> query after connecting to zookeeper.
>
> Any help will be appreciated,Thanks
>
> Regards,
> Udit Tyagi
>


Re: Reload schema or configs failed then drop index, can not recreate that index.

2016-11-24 Thread Jerome Yang
Thanks Erick!
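
(A hedged sketch of the recovery Erick describes below -- correct the schema, push the config to ZooKeeper, then RELOAD -- with the config name, path, and ZooKeeper address as assumptions:)

bin/solr zk upconfig -z localhost:2181 -n test_config -d /path/to/corrected/conf
curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=test&wt=json'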

On Fri, Nov 25, 2016 at 1:38 AM, Erick Erickson 
wrote:

> This is arguably a bug. I raised a JIRA, see:
>
>  https://issues.apache.org/jira/browse/SOLR-9799
>
> Managed schema is not necessary to show this problem, generically if
> you upload a bad config by whatever means, then
> RELOAD/DELETE/correct/CREATE it fails. The steps I outlined
> in the JIRA force the same replica to be created on the same Solr instance
> to insure it can be reproduced at will.
>
> In the meantime, you can keep from having to restart Solr by:
> - correcting the schema
> - pushing it to Zookeeper (managed schema API does this for you)
> - RELOAD the collection (do NOT delete it first).
>
> Since you can just RELOAD, I doubt this will be a high priority though.
>
> Thanks for reporting!
> Erick
>
>
> On Wed, Nov 23, 2016 at 6:37 PM, Jerome Yang  wrote:
> > It's solr 6.1, cloud mode.
> >
> > Please ignore the first message. Just take check my second email.
> >
> > I mean if I modify an existing collections's managed-schema and the
> > modification makes reload collection failed.
> > Then I delete the collection, and delete the configs from zookeeper.
> > After that upload an configs as the same name as before, and the
> > managed-schema is the not modified version.
> > Then recreate the collection, it will throw an error, "core already
> > exists". But actually it's not.
> > After restart the whole cluster, recreate collection will success.
> >
> > Regards,
> > Jerome
> >
> >
> > On Wed, Nov 23, 2016 at 3:26 PM, Erick Erickson  >
> > wrote:
> >
> >> The mail server is pretty heavy-handed at deleting attachments, none of
> >> your
> >> (presumably) screenshots came through.
> >>
> >> You also haven't told us what version of Solr you're using.
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Nov 22, 2016 at 6:25 PM, Jerome Yang  wrote:
> >> > Sorry, wrong message.
> >> > To correct.
> >> >
> >> > In cloud mode.
> >> >
> >> >1. I created a collection called "test" and then modified the
> >> >managed-schemaI, write something wrong, for example
> >> >"id", then reload collection would failed.
> >> >2. Then I drop the collection "test" and delete configs form
> >> zookeeper.
> >> >It works fine. The collection is removed both from zookeeper and
> hard
> >> disk.
> >> >3. Upload the right configs with the same name as before, try to
> >> create
> >> >collection as name "test", it would failed and the error is "core
> >> with name
> >> >'*' already exists". But actually not.
> >> >4. The restart the whole cluster, do the create again, everything
> >> works
> >> >fine.
> >> >
> >> >
> >> > I think when doing the delete collection, there's something still
> hold in
> >> > somewhere not deleted.
> >> > Please have a look
> >> >
> >> > Regards,
> >> > Jerome
> >> >
> >> > On Wed, Nov 23, 2016 at 10:16 AM, Jerome Yang 
> wrote:
> >> >
> >> >> Hi all,
> >> >>
> >> >>
> >> >> Here's my situation:
> >> >>
> >> >> In cloud mode.
> >> >>
> >> >>1. I created a collection called "test" and then modified the
> >> >>managed-schemaI got an error as shown in picture 2.
> >> >>2. To get enough error message, I checked solr logs and get
> message
> >> >>shown in picture 3.
> >> >>3. If I corrected the managed-schema, everything would be fine.
> But I
> >> >>dropped the index. The index couldn't be created it again, like
> >> picture 4.
> >> >>I restarted gptext using "gptext-start -r" and recreated the
> index,
> >> it was
> >> >>created successfully like picture 5.
> >> >>
> >> >>
> >>
>


DIH documents not indexed because of loss in xsl transformation.

2013-12-10 Thread jerome . dupont

Hello

I'm indexing XML files with XPathEntityProcessor, and some hundreds of
documents out of 12 million are not processed.

When I tried to index only one of the failing documents, it wasn't indexed
either, so it's not a matter of the sheer number of documents.

We tried doing the XSLT transformation externally, capturing the transformed
XML and indexing it in Solr: it worked.
So the document seems OK.
I looked at the document; it was big, so I commented out a part, and then it
was indexed in Solr with the XSL transform.


So I downloaded the DIH code and debugged the execution of these lines,
which launch the XSLT transformation, to see what was happening exactly:

  // Transform the raw XML (data) with the configured XSL stylesheet,
  // writing the result into an in-memory character buffer.
  SimpleCharArrayReader caw = new SimpleCharArrayReader();
  xslTransformer.transform(new StreamSource(data),
      new StreamResult(caw));
  // Replace the original input with a reader over the transformed output.
  data = caw.getReader();

It appeared that the caw was missing data, so the XSLT transformer hadn't
worked correctly.
Digging further into the TransformerImpl code, I see the content of my XML file
in some buffer, but somewhere something goes wrong that I don't understand
(it's getting very tricky for me).

xslTransformer is an instance of the class
com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.

Is there a way to change the XSLT transformer class,
or is there a known size limitation in this transformer that can be
increased?

I've tried this in Solr 4.2 and then in Solr 4.6.

Thanks in advance

Regards
JérÎme Dupont
BibliothÚque Nationale de France
Département des SystÚmes d'Information
Tour T3 - Quai François Mauriac
75706 Paris Cedex 13
téléphone: 33 (0)1 53 79 45 40
e-mail: jerome.dup...@bnf.fr
---


using facet enum et fc in the same query.

2014-09-22 Thread jerome . dupont
Hello,

I have a Solr index (12M docs, 45 GB) with facets, and I'm trying to
improve facet query performance.
1/ I tried using docValues on the facet fields; it didn't work well.
2/ I tried facet.threads=-1 in my queries, and it worked perfectly (from more
than 15 s down to 2 s for the longest queries).

3/ I'm trying to use facet.method=enum. It's supposed to improve
performance for facet fields with few distinct values (type of
document, things like that).

My problem is that I don't know whether there is a way to specify the enum
method for some facets (3 to 5000 distinct values) and the fc method for some
others (up to 12M distinct values) in the same query.

Is it possible with something like MyFacet..facet.method=enum?

Thanks in advance for the answer.
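
(A hedged sketch of mixing per-field facet methods in one request; the core name and the field names typedoc and author are assumptions standing in for a small and a large facet field:)

curl 'http://localhost:8080/solr/catalog/select?q=*:*&rows=0&facet=true&facet.threads=-1&facet.field=typedoc&f.typedoc.facet.method=enum&facet.field=author&f.author.facet.method=fc&wt=json'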

---
JérÎme Dupont
BibliothÚque Nationale de France
Département des SystÚmes d'Information
Tour T3 - Quai François Mauriac
75706 Paris Cedex 13
téléphone: 33 (0)1 53 79 45 40
e-mail: jerome.dup...@bnf.fr
---



RE: RE: using facet enum et fc in the same query.

2014-09-23 Thread jerome . dupont
First, thanks very much for your answer, and for Alan's.

>> I have a solr index (12 M docs, 45Go) with facets, and I'm trying to
>> improve facet queries performances.
>> 1/ I tried to use docvalue on facet fields, it didn't work well

> That was surprising, as the normal result of switching to DocValues is
> positive. Can you elaborate on what you did and how it failed?

When I said it failed, I just meant it was a little bit slower.


>> 2/ I tried facet.threads=-1 in my queries, and worked perfectely (from more
>> 15s  to 2s for longest queries)

> That tells us that your primary problem is not IO. If your usage is
> normally single-threaded that can work, but it also means that you have a
> lot of CPU cores standing idle most of the time. How many fields are you
> using for faceting and how many of them are large (more unique values than
> the 5000 you mention)?

The "slow" request corresponds to our website search query. It is for our
book catalog: some facets are for type of document, author, title,
subjects, location of the book, dates...

In this request we now have 35 facets.
About unique values, for the "slow" query:
1 facet goes up to 4M unique values (authors),
1 facet has 250,000 unique values,
1 has 5,
1 has 6,700,
4 have between 300 and 1,000,
5 have between 100 and 160,
16 have fewer than 65.


>> 3/ I'm trying to use facet.method=enum. It's supposed to improve the
>> performance for facets fileds with few differents values. (type of
>> documents, things like that)

> Having a mix of facet methods seems like a fine idea, although my
> personal experience is that enums gets slower than fc quite earlier than
> the 5000 unique values mark. As Alan states, the call is
> f.myfacetfield.facet.method=enum (Remember the 'facet.'-part. See
> https://wiki.apache.org/solr/SimpleFacetParameters#Parameters
> for details).

> Or you could try Sparse Faceting (Disclaimer: I am the author), which
> seems to fit your setup very well: http://tokee.github.io/lucene-solr/


Right now we use Solr 4.6, and we deliver our release soon, so I'm
afraid I won't have time to try it this time, but I can try for the next
release (next month, I think).

Thanks very much again
Jerome
Dupont
jerome.dupont_at#bnf.fr

[SOLR 4.4 or 4.2] indexing with dih and solrcloud

2013-08-29 Thread jerome . dupont

Hello,

I'm trying to index documents with the Data Import Handler and SolrCloud at the
same time (huge collection, I need parallel indexing).

First I had a DIH configuration which works with standalone Solr
(indexing for two months, every week).

I've transformed my configuration to "cloudify" it with one shard at the
beginning (adding the config file + launching with the zkRun option).
I see the cloud panels in my Solr admin interface (tree view, 1 shard
connected and active ...), so it seems to work.

When I index using DIH, it looks like it is working: the input XML files
are read, but no documents are stored in the index, exactly as if I had
set the commit argument to false.

This is the response of the DIH request:
{
  "responseHeader":{
"status":0,
"QTime":32871},
  "initArgs":[
"defaults",[
  "config","mnb-data-config.xml"]],
  "command":"full-import",
  "mode":"debug",
  "documents":[],
  "verbose-output":[
"entity:noticebib",[
  "entity:processorDocument",[],
...
  "entity:processorDocument",[],
  null,"--- row #1-",
  "CHEMINRELATIF","3/7/000/37000143.xml",
  null,"-",
...
"status":"idle",
  "importResponse":"",
  "statusMessages":{
"Total Requests made to DataSource":"16",
"Total Rows Fetched":"15",
"Total Documents Skipped":"0",
"Full Dump Started":"2013-08-29 12:08:48",
"Total Documents Processed":"0",
"Time taken":"0:0:32.684"},

In the logs (see below), I see the PRE_UPDATE FINISH message,
and after that, some debug messages about "Could not retrieve configuration"
coming from ZooKeeper.

So my question: what can be wrong in my config?
_ something about synchronization in ZooKeeper (the "could not retrieve"
message)?
_ a step missing in the Data Import Handler?
I don't see how to diagnose that point.

DEBUG 2013-08-29 12:09:21,411 http-8080-1
org.apache.solr.handler.dataimport.URLDataSource  (92) - Accessing URL:
file:/X:/3/7/000/37000190.xml
DEBUG 2013-08-29 12:09:21,520 http-8080-1
org.apache.solr.handler.dataimport.LogTransformer  (58) - Notice fichier:
3/7/000/37000190.xml
DEBUG 2013-08-29 12:09:21,520 http-8080-1 fr.bnf.solr.BnfDateTransformer
(696) - NN=37000190
INFO 2013-08-29 12:09:21,520 http-8080-1
org.apache.solr.handler.dataimport.DocBuilder  (267) - Time taken =
0:0:32.684
DEBUG 2013-08-29 12:09:21,536 http-8080-1
org.apache.solr.update.processor.LogUpdateProcessor  (178) - PRE_UPDATE
FINISH {{params
(optimize=true&indent=true&start=10&commit=true&verbose=true&entity=noticebib&command=full-import&debug=true&wt=json&rows=5),defaults
(config=mnb-data-config.xml)}}
INFO 2013-08-29 12:09:21,536 http-8080-1
org.apache.solr.update.processor.LogUpdateProcessor  (198) - [noticesBIB]
webapp=/solr-0.4.0-pfd path=/dataimportMNb params=
{optimize=true&indent=true&start=10&commit=true&verbose=true&entity=noticebib&command=full-import&debug=true&wt=json&rows=5}
 {} 0 32871
DEBUG 2013-08-29 12:09:21,583 http-8080-1
org.apache.solr.servlet.SolrDispatchFilter  (388) - Closing out
SolrRequest: {{params
(optimize=true&indent=true&start=10&commit=true&verbose=true&entity=noticebib&command=full-import&debug=true&wt=json&rows=5),defaults
(config=mnb-data-config.xml)}}
DEBUG 2013-08-29 12:09:21,833 main-SendThread(127.0.0.1:9080)
org.apache.zookeeper.client.ZooKeeperSaslClient  (519) - Could not retrieve
login configuration: java.lang.SecurityException: Impossible de trouver une
configuration de connexion
DEBUG 2013-08-29 12:09:21,833 main-SendThread(127.0.0.1:9080)
org.apache.zookeeper.client.ZooKeeperSaslClient  (519) - Could not retrieve
login configuration: java.lang.SecurityException: Impossible de trouver une
configuration de connexion
DEBUG 2013-08-29 12:09:21,833 main-SendThread(127.0.0.1:9080)
org.apache.zookeeper.client.ZooKeeperSaslClient  (519) - Could not retrieve
login configuration: java.lang.SecurityException: Impossible de trouver une
configuration de connexion
DEBUG 2013-08-29 12:09:21,833 SyncThread:0
org.apache.zookeeper.server.FinalRequestProcessor  (88) - Processing
request:: sessionid:0x140c98bbe43 type:getData cxid:0x39d
zxid:0xfffe txntype:unknown reqpath:/overseer_elect/leader
DEBUG 2013-08-29 12:09:21,833 SyncThread:0
org.apache.zookeeper.server.FinalRequestProcessor  (160) -
sessionid:0x140c98bbe43 type:getData cxid:0x39d zxid:0xfffe
txntype:unknown reqpath:/overseer_elect/leader
DEBUG 2013-08-29 12:09:21,833 main-SendThread(127.0.0.1:9080)
org.apache.zookeeper.client.ZooKeeperSaslClient  (519) - Could not retrieve
login configuration: java.lang.SecurityException: Impossible de trouver une
configuration de connexion
DEBUG 2013-08-29 12:09:21,833 main-SendThread(127.0.0.1:9080)
org.apache.zookeeper.client.ZooKeeperSaslClient  (519) - Could not retrieve
login configuration: java.lang.SecurityException: Impossible de trouver une
configuration de connexion


PS: At the beginning I was on Solr 4.2.1, and I tried with 4.0.0, but I have
the same problem.

Re

Re: Re: [SOLR 4.4 or 4.2] indexing with dih and solrcloud

2013-08-29 Thread jerome . dupont

Hello again,

Finally, I found the problem.
It seems that:
_ The indexing request was done with an HTTP GET and not with a POST,
because I was launching it from a bookmark in my browser.
Launching the indexing of my documents from the admin interface made it
work.
_ Another problem was that some documents are not indexed (in particular
the first ones in the list) for some reason (due to our configuration), so when
I was trying with the first ten documents, it couldn't work.

Now I will try with 2 shards...

Jerome



solr cloud and DIH, indexation runs only on one shard.

2013-09-03 Thread jerome . dupont

Hello again,

I'm still trying to index with SolrCloud and DIH. I can index, but it seems
that indexing is done on only 1 shard (my goal was to parallelize it to
go faster).
This is my conf:
I have 2 Tomcat instances,
one with the ZooKeeper embedded in Solr 4.4.0 started and 1 shard (port 8080),
the other with the second shard (port 9180).
In my admin interface, I see 2 shards, and each one is a leader.


When I launch the DIH, documents are indexed, but only shard1 is
working:
http://localhost:8080/solr-0.4.0-pfd/noticesBIBcollection/dataimportMNb?command=full-import&entity=noticebib&optimize=true&indent=true&clean=true&commit=true&verbose=false&debug=false&wt=json&rows=1000


In my first shard, I see messages coming from my indexing process:
DEBUG 2013-09-03 11:48:57,801 Thread-12
org.apache.solr.handler.dataimport.URLDataSource  (92) - Accessing URL:
file:/X:/3/7/002/37002118.xml
DEBUG 2013-09-03 11:48:57,832 Thread-12
org.apache.solr.handler.dataimport.URLDataSource  (92) - Accessing URL:
file:/X:/3/7/002/37002120.xml
DEBUG 2013-09-03 11:48:57,966 Thread-12
org.apache.solr.handler.dataimport.LogTransformer  (58) - Notice fichier:
3/7/002/37002120.xml
DEBUG 2013-09-03 11:48:57,966 Thread-12 fr.bnf.solr.BnfDateTransformer
(696) - NN=37002120

In the second instance, I just have this kind of log, as if it were receiving
notifications of new updates from ZooKeeper:
INFO 2013-09-03 11:48:57,323 http-9180-7
org.apache.solr.update.processor.LogUpdateProcessor  (198) - [noticesBIB]
webapp=/solr-0.4.0-pfd path=/update params=
{distrib.from=http://172.20.48.237:8080/solr-0.4.0-pfd/noticesBIB/&update.distrib=TOLEADER&wt=javabin&version=2}
 {add=[37001748 (1445149264874307584), 37001757 (1445149264879550464),
37001764 (1445149264883744768), 37001786 (1445149264887939072), 37001817
(1445149264891084800), 37001819 (1445149264896327680), 37001837
(1445149264900521984), 37001861 (1445149264903667712), 37001869
(1445149264907862016), 37001963 (1445149264912056320)]} 0 41

I supposed there was confusion between the core names and the collection name,
and I tried to change the name of the collection, but that solved nothing.
When I go to the DIH interface, on shard1 I see indexing in progress, and
on shard2 "no information available".

Is there something special to do to distribute the indexing process?
Should I run ZooKeeper on both instances (even if it's not mandatory)?
...
Regards
Jerome




Re: solr cloud and DIH, indexation runs only on one shard.

2013-09-03 Thread jerome . dupont

It works!

I've done what you said:
_ In my request to get the list of documents, I added a where clause filtering
on the select that fetches the documents to index:
where noticebib.numnoticebib LIKE '%${dataimporter.request.suffixeNotice}'
_ And I called the DIH on each shard with the parameter suffixeNotice=1 or
suffixeNotice=2 (see the sketch below).
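
A hedged sketch of the two per-shard calls, reusing the webapp, handler, and ports mentioned in this thread (exact core paths are assumptions):

curl 'http://localhost:8080/solr-0.4.0-pfd/noticesBIB/dataimportMNb?command=full-import&clean=true&commit=true&suffixeNotice=1&wt=json'
curl 'http://localhost:9180/solr-0.4.0-pfd/noticesBIB/dataimportMNb?command=full-import&clean=true&commit=true&suffixeNotice=2&wt=json'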

Each shard indexed its part at the same time (more or less 1000 docs each).

When I execute a select on the collection, I get more or less 2000
documents.

Now my goal is to merge the indexes, but that's another story.

Another possibility would have been to play with the rows and start parameters,
but that supposes 2 things:
_ knowing the number of documents,
_ adding an order by clause to make sure the subsets of documents are disjoint
(and even in that case, I'm not completely sure, because the source database
can change).

Thanks very much !!

JerÎme




[DIH] Logging skipped documents

2013-09-23 Thread jerome . dupont

Hello,

I have a question: I index documents and a small part of them are skipped
(I am in onError="skip" mode).
I'm trying to get a list of them, in order to analyse what's wrong with
these documents.
Is there a way to get the list of skipped documents, and some more
information? (My onError="skip" is on an XPathEntityProcessor; the name of
the file being processed would be enough.)


Regards,
---
JérÎme Dupont
BibliothÚque Nationale de France
Département des SystÚmes d'Information
Tour T3 - Quai François Mauriac
75706 Paris Cedex 13
téléphone: 33 (0)1 53 79 45 40
e-mail: jerome.dup...@bnf.fr
---




error while indexing huge filesystem with data import handler and FileListEntityProcessor

2013-05-24 Thread jerome . dupont

Hello,


We are trying to use the Data Import Handler, in particular on a collection
which contains many files (one XML file per document).

Our configuration works for a small number of files, but data import fails
with an OutOfMemoryError when running it on 10M files (in several
directories...).

This is the content of our config.xml:






When we try it on a directory which contains 10 subdirectories, each
subdirectory containing 1000 subdirectories, each one containing 1000 XML files
(so 10M files), the indexing process doesn't work anymore.

We get a java.lang.OutOfMemoryError (even with 512 MB and 1 GB of memory):
ERROR 2013-05-24 15:26:25,733 http-9145-2
org.apache.solr.handler.dataimport.DataImporter  (96) - Full Import
failed:java.lang.RuntimeException: java.lang.RuntimeException:
java.lang.ClassCastException: java.lang.OutOfMemoryError cannot be cast to
java.lang.Exception
at org.apache.solr.handler.dataimport.DocBuilder.execute
(DocBuilder.java:266)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport
(DataImporter.java:422)
at org.apache.solr.handler.dataimport.DataImporter.runCmd
(DataImporter.java:487)
at
org.apache.solr.handler.dataimport.DataImportHandler.handleRequestBody
(DataImportHandler.java:179)
at org.apache.solr.handler.RequestHandlerBase.handleRequest
(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1817)

Monitoring the JVM with VisualVM, I've seen that most of the time is taken by
the method FileListEntityProcessor.accept (called by getFolderFiles), so I
assumed the error occurs while building the list of files to be indexed;
indeed, the list of files is built by this method, which is called by
getFolderFiles.

Basically, the list of files to index is built by getFolderFiles, itself
called on the first call to nextRow(). The indexing itself starts only after
that.
org/apache/solr/handler/dataimport/FileListEntityProcessor.java
  private void getFolderFiles(File dir, final List<Map<String, Object>> fileDetails) {

I found the variable fileDetails, which contains the list of my XML
files. It contained 611,345 entries (for approximately 500 MB of memory),
and I have 10M XML files (more or less...). That's why I think it's not
finished yet.
To get the entire list I guess I need something between 5 and 10 GB for my
process.

So I have several questions:
_ Is it possible to have several FileListEntityProcessor entities attached to only
one XPathEntityProcessor in the data-config.xml? That way I could do it in
ten passes, with my 10 first-level directories.
_ Is there a roadmap to optimize this method, for example by not building the
list of all files up front, but every 1000 documents, for instance?
_ Or to store the file list in a temporary file in order to save some
memory?

Regards,
---
Jérôme Dupont
---


Re: Re: error while indexing huge filesystem with data import handler and FileListEntityProcessor

2013-05-29 Thread jerome . dupont


The configuration works with LineEntityProcessor, with a few documents (I
haven't tested with many documents yet).
For information, this is the config:








... field definitions
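
Schematically, the two entities look like this (a sketch: data source type,
forEach and field names are placeholders; the point is the LineEntityProcessor
feeding the XPathEntityProcessor through the rawLine variable):

  <dataConfig>
    <dataSource type="URLDataSource" name="fileds" encoding="UTF-8"/>
    <document>
      <entity name="fileline" processor="LineEntityProcessor"
              dataSource="fileds"
              url="file:///D:/jed/noticesBib/listeNotices.txt"
              rootEntity="false">
        <entity name="notice" processor="XPathEntityProcessor"
                dataSource="fileds"
                url="file:///D:/${fileline.rawLine}"
                forEach="/record">
          <field column="id" xpath="/record/id"/>
        </entity>
      </entity>
    </document>
  </dataConfig>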

file:///D:/jed/noticesBib/listeNotices.txt contains the following lines:
jed/noticesBib/3/4/307/34307035.xml
jed/noticesBib/3/4/307/34307082.xml
jed/noticesBib/3/4/307/34307110.xml
jed/noticesBib/3/4/307/34307197.xml
jed/noticesBib/3/4/307/34307350.xml
jed/noticesBib/3/4/307/34307399.xml
...
(It could have contained the full location from the beginning, but I wanted to
test the concatenation of the file name.)

That works fine, thanks for the help!!

Next step, the same without using a file. (I'll write it in another post).

Regards,
Jérôme


[DIH] Using SqlEntity to get a list of files and read files in XpathEntityProcessor

2013-05-30 Thread jerome . dupont

Hello,

I want to index a huge list of xml files.
_ Using FileListEntityProcessor causes an OutOfMemoryError (too many
files...)
_ I can do it using a LineEntityProcessor reading a list of files
generated externally, but I would prefer to generate the list in SOLR.
_ So, to avoid maintaining a list of files, I'm trying to generate the list
with an SQL query and give the list of results to the XPathEntityProcessor,
which will read the files.

The query select DISTINCT... generates this result:
CHEMINRELATIF
3/0/000/3001

But the problem is that with the following configuration, no request to the DB
is made, according to the message returned by DIH.

 "statusMessages":{
"Total Requests made to DataSource":"0",
"Total Rows Fetched":"0",
"Total Documents Processed":"0",
"Total Documents Skipped":"0",
"":"Indexing completed. Added/Updated: 0 documents. Deleted 0
documents.",
"Committed":"2013-05-30 10:23:30",
"Optimized":"2013-05-30 10:23:30",

And the log:
INFO 2013-05-30 10:23:29,924 http-8080-1
org.apache.solr.handler.dataimport.DataImporter  (121) - Loading DIH
Configuration: mnb-data-config.xml
INFO 2013-05-30 10:23:29,957 http-8080-1
org.apache.solr.handler.dataimport.DataImporter  (224) - Data Configuration
loaded successfully
INFO 2013-05-30 10:23:29,969 http-8080-1
org.apache.solr.handler.dataimport.DataImporter  (414) - Starting Full
Import
INFO 2013-05-30 10:23:30,009 http-8080-1
org.apache.solr.handler.dataimport.SimplePropertiesWriter  (219) - Read
dataimportMNb.properties
INFO 2013-05-30 10:23:30,045 http-8080-1
org.apache.solr.handler.dataimport.DocBuilder  (292) - Import completed
successfully


Has someone already done this kind of configuration, or is it just not
possible?

The config:





I'm trying to inde
Regards,
---
Jérôme Dupont
Bibliothèque Nationale de France
Département des Systèmes d'Information
Tour T3 - Quai François Mauriac
75706 Paris Cedex 13
téléphone: 33 (0)1 53 79 45 40
e-mail: jerome.dup...@bnf.fr
---


Re: Re: [DIH] Using SqlEntity to get a list of files and read files in XpathEntityProcessor

2013-05-30 Thread jerome . dupont

Hi,

Thanks for your answer, it helped me move forward.
The name of the entity was wrong, not consistent with the schema.
Now the first entity works fine: the query is sent to the database and
returns the right result.
The problem is that the second entity, which is an XPathEntityProcessor
entity, doesn't read the file specified in the url attribute, but tries to
execute it as an SQL query on my database.

I tried to put a fake query (select 1 from dual) but it changes nothing.
It's as if the XPathEntityProcessor entity behaved like an
SqlEntityProcessor, using the url attribute instead of the query attribute.

I forgot to say which version I use: SOLR 4.2.1 (it can be changed, it's
just the beginning of the development).
See next the config, and the return message:


The verbose output:

  "verbose-output":[
"entity:noticebib",[
  "query","select DISTINCT   SUBSTR( to_char(noticebib.numnoticebib,
'9'), 3, 1) || '/' ||SUBSTR( to_char(noticebib.numnoticebib,
'9'), 4, 1) || '/' ||SUBSTR( to_char(noticebib.numnoticebib,
'9'), 5, 3) || '/' ||to_char(noticebib.numnoticebib) || '.xml'
as CHEMINRELATIF   from bnf.noticebibwhere numnoticebib = '3001'",
  "time-taken","0:0:0.141",
  null,"--- row #1-",
  "CHEMINRELATIF","3/0/000/3001.xml",
  null,"-",
  "entity:processorDocument",[
"document#1",[
  "query","file:///D:/jed/noticesbib/3/0/000/3001.xml",

"EXCEPTION","org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: file:///D:/jed/noticesbib/3/0/000/3001.xml
Processing Document # 1\r\n\tat
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow
(DataImportHandlerException.java:71)\r\n\tat ...
oracle.jdbc.driver.OracleStatementWrapper.execute
(OracleStatementWrapper.java:1203)\r\n\tat
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.
(JdbcDataSource.java:246)\r\n\t... 32 more\r\n",
  "time-taken","0:0:0.124",


This is the configuration
















Regards,
---
Jérôme Dupont
Bibliothèque Nationale de France
Département des Systèmes d'Information
Tour T3 - Quai François Mauriac
75706 Paris Cedex 13
téléphone: 33 (0)1 53 79 45 40
e-mail: jerome.dup...@bnf.fr
---


RE: [DIH] Using SqlEntity to get a list of files and read files in XpathEntityProcessor

2013-05-31 Thread jerome . dupont

Thanks very much, it works with dataSource (capital S)!!!
In the end, I didn't have to define a "CHEMINRELATIF" field in the
configuration; it works without it.

This is the definitive working configuration:
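
In outline it looks like this (a sketch: the data source names, types and the
Oracle connection details are placeholders; the query and the url pattern are
the ones shown in the previous messages):

  <dataConfig>
    <dataSource name="jdbc-ds" driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@..." user="..." password="..."/>
    <dataSource name="file-ds" type="URLDataSource"/>
    <document>
      <entity name="noticebib" dataSource="jdbc-ds"
              query="select DISTINCT ... || to_char(noticebib.numnoticebib) || '.xml'
                     as CHEMINRELATIF from bnf.noticebib where ...">
        <entity name="processorDocument" processor="XPathEntityProcessor"
                dataSource="file-ds"
                url="file:///D:/jed/noticesbib/${noticebib.CHEMINRELATIF}"
                forEach="/record">
          <!-- field definitions -->
        </entity>
      </entity>
    </document>
  </dataConfig>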












Thanks again!

---
Jérôme Dupont
Bibliothèque Nationale de France
Département des Systèmes d'Information
Tour T3 - Quai François Mauriac
75706 Paris Cedex 13
téléphone: 33 (0)1 53 79 45 40
e-mail: jerome.dup...@bnf.fr
---



Weird behaviour with phrase queries

2011-01-24 Thread Jerome Renard
Hi,

I have a problem with phrase queries: from time to time I do not get any
results,
whereas I know something should be returned.

The search is run against a field of type "text" whose definition is
available at the following URL:
- http://pastebin.com/Ncem7M8z

This field is defined with the following configuration:


I use the following request handler:

   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <float name="tie">0.01</float>
     <str name="qf">meta_text</str>
     <str name="pf">meta_text</str>
     <str name="mm">1&lt;1 2&lt;-1 5&lt;-2 7&lt;60%</str>
     <int name="ps">100</int>
     <str name="q.alt">*:*</str>
   </lst>

Depending on the kind of phrase query I use I get either exactly what I am
looking for or nothing.

The index's contents are all French, so I thought about a possible problem with
accents, but I got phrase queries working
that contain "é" and "è" chars, like "académie" or
"ingénieur".

As you will see, the "text" type uses a
SnowballPorterFilterFactory configured for the English language;
I plan to fix that by using the correct language for the index (French) and
the following protwords file: http://bit.ly/i8JeX6 .

But apart from this mistake with the stemmer, did I do something (else) wrong?
Did I overlook something? What could
explain why I do not always get results for my phrase queries?

Thanks in advance for your feedback.

Best Regards,

--
Jérôme


Re: Weird behaviour with phrase queries

2011-01-24 Thread Jerome Renard
Hi Em, Erick

thanks for your feedback.

Em: yes. Here is the stopwords.txt I use:
-
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt

On Mon, Jan 24, 2011 at 6:58 PM, Erick Erickson wrote:

> Try submitting your query from the admin page with &debugQuery=on and see
> if that helps. The output is pretty dense, so feel free to cut-paste the
> results for
> help.
>
> Your stemmers have English as the language, which could also be
> "interesting".
>
>
Yes, I noticed that; this will be fixed.


> As Em says, the analysis page may help here, but I'd start by taking out
> WordDelimiterFilterFactory, SnowballPorterFilterFactory and
> StopFilterFactory
> and build back up if you really need them. Although, again, the analysis
> page
> that's accessible from the admin page may help greatly (check "debug" in
> both
> index and query).
>
>
You will find attached two xml files, one with no results (noresult.xml.gz)
and one with
a lot of results (withresults.xml.gz). You will also find attached two
screenshots showing
that there is a highlighted section in the "Index analyzer" section when
analysing text.


> Oh, and you MUST re-index after changing your schema to have a true test.
>
>
Yes, the problem is that reindexing takes around 12 hours which makes it
really hard
for testing :/

Thanks in advance for your feedback.

Best Regards,

-- 
Jérôme


noresult.xml.gz
Description: GNU Zip compressed data


withresults.xml.gz
Description: GNU Zip compressed data


Re: Weird behaviour with phrase queries

2011-01-24 Thread Jerome Renard
Erick,

On Mon, Jan 24, 2011 at 9:57 PM, Erick Erickson wrote:

> Hmmm, I don't see any screen shots. Several things:
> 1> If your stopword file has comments, I'm not sure what the effect would
> be.
>

Ha, I thought comments were supported in stopwords.txt


> 2> Something's not right here, or I'm being fooled again. Your withresults
> xml has this line:
> +DisjunctionMaxQuery((meta_text:"ecol d
> ingenieur")~0.01) ()
> and your noresults has this line:
> +DisjunctionMaxQuery((meta_text:"academi
> charpenti")~0.01) DisjunctionMaxQuery((meta_text:"academi
> charpenti"~100)~0.01)
>
> the empty () in the first one often means you're NOT going to your
> configured dismax parser in solrconfig.xml. Yet that doesn't square with
> your custom qt, so I'm puzzled.
>
> Could we see your raw query string on the way in? It's almost as if you
> defined qt in one and defType in the other, which are not equivalent.
>

You are right, I fixed this problem (my bad).

3> It may take 12 hours to index, but you could experiment with a smaller
> subset. You say you know that the noresults one should return documents,
> what proof do
> you have? If there's a single document that you know should match this,
> just
> index it and a few others and you should be able to make many runs until
> you
> get
> to the bottom of this...
>
>
I could, but I always thought I had to fully re-index after updating
schema.xml. If
I update only a few documents, will that take the changes into account without
breaking
the rest?


> And obviously your stemming is happening on the query, are you sure it's
> happening at index time too?
>
>
Since you did not get the screenshots you will find attached the full output
of the analysis
for a phrase that works and for another that does not.

Thanks for your support

Best Regards,

--
Jérôme


analysis-noresults.html.gz
Description: GNU Zip compressed data


analysis-withresults.html.gz
Description: GNU Zip compressed data


Re: Weird behaviour with phrase queries

2011-01-26 Thread Jerome Renard
Hi Erick,

On Tue, Jan 25, 2011 at 1:38 PM, Erick Erickson wrote:

> Frankly, this puzzles me. It *looks* like it should be OK. One warning, the
> analysis page sometimes is a bit misleading, so beware of that.
>
> But the output of your queries make it look like the query is parsing as
> you
> expect, which leaves the question of whether your index contains what
> you think it does. You might get a copy of Luke, which allows you to
> examine
> what's actually in your index instead of what you think is in there.
> Sometimes
> there are surprises here!
>
>
Bingo! Some data was not in the index. Indexing it obviously fixed the
problem.


> I didn't mean to re-index your whole corpus, I was thinking that you could
> just index a few documents in a test index so you have something small to
> look at.
>
> Sorry I can't spot what's happening right away.
>
>
No worries, thanks for your support :)

-- 
Jérôme


Data not always returned

2011-06-07 Thread Jerome Renard
Hi all,

I have a problem with my index. Even though I always index the same
data over and over again, whenever I try
a couple of searches (they are always the same, as they are issued by a
unit test suite) I do not get the same
results: sometimes I get 3 successes and 2 failures, and sometimes it
is the other way around; it is unpredictable.

Here is what I am trying to do:

I created a new Solr core with its specific solrconfig.xml and schema.xml
This core stores a list of towns which I plan to use with an
auto-suggestion system, using ngrams (no Suggester)
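
For reference, the "name" field uses an ngram-based text type roughly along
these lines; this is only a sketch from memory, the exact filters and gram
sizes are in the attached schema.xml:

  <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>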

The indexing process is always the same:
1. the import script deletes all documents in the core:
<delete><query>*:*</query></delete> and <commit/>
2. the import script fetches data from postgres, 100 rows at a time
3. the import script adds these 100 documents and sends a <commit/>
4. once all the rows (around 40 000) have been imported the script
sends an <optimize/> query

Here is what happens:
I run the indexer once and search for 'foo': I get the results I expect, but
if I search for 'bar' I get nothing.
I reindex once again and search for 'foo': I get nothing, but if I
search for 'bar' I get results.
The search is made on the "name" field, which is a pretty common
TextField with ngrams.

I tried to physically remove the index (rm -rf path/to/index) and
reindex everything as well, and
not all searches work; sometimes the 'foo' search works, sometimes the 'bar' one.

I tried a lot of different things but now I am running out of ideas.
This is why I am asking for help.

Some useful information:
Solr version : 3.1.0
Solr Implementation Version: 3.1.0 1085815 - grantingersoll -
2011-03-26 18:00:07
Lucene Implementation Version: 3.1.0 1085809 - 2011-03-26 18:06:58
Java 1.5.0_24 on Mac Os X
solrconfig.xml and schema.xml are attached

Thanks in advance for your help.


schema.xml.gz
Description: GNU Zip compressed data


solrconfig.xml.gz
Description: GNU Zip compressed data


Re: Data not always returned

2011-06-07 Thread Jerome Renard
Hi Erick

On Tue, Jun 7, 2011 at 11:42 PM, Erick Erickson  wrote:
> Well, this is odd. Several questions
>
> 1> what do your logs show? I'm wondering if somehow some data is getting
>     rejected. I have no idea why that would be, but if you're seeing indexing
>     exceptions that would explain it.
> 2> on the admin/stats page, are maxDocs and numDocs the same in the success
>     /failure case? And are they equal to 40,000?
> 3> what does &debugQuery=on show in the two cases? I'd expect it to be
> identical, but...
> 4> admin/schema browser. Look at your three fields and see if things
> like unique-terms are
>     identical.
> 5> are the rows being returned before indexing in the same order? I'm
> wondering if somehow
>     you're getting documents overwritten by having the same id (uniqueKey).
> 6> Have you poked around with Luke to see what, if anything, is dissimilar?
>
> These are shots in the dark, but my supposition is that somehow you're
> not indexing what
> you expect, the questions above might give us a clue where to look next.
>

You were right, I found a nasty problem with the indexer and postgres which
prevented some documents from being indexed. Once I fixed this problem, everything
worked fine.

Thanks a lot for your support.

Best Regards,

-- 
Jérôme


Issue with dataimport xml validation with dtd and jetty: conflict of use for user.dir variable

2019-02-08 Thread jerome . dupont
Hello,

I use solr and dataimport to index xml files with a dtd.
The dtd is referenced like this:
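
(That is, through a document type declaration with a relative system
identifier, something like the following; the root element and dtd file names
here are only illustrative:)

  <!DOCTYPE record SYSTEM "record.dtd">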


Previously we were using solr4 in a tomcat container.
During the import process, solr tries to validate the xml file with the 
dtd.
To find it we were defining -Duser.dir=pathToDtd, and solr could find the
dtd and validation was working.

Now we are migrating to solr7 (with embedded jetty).
When we start solr  with -a "-Duser.dir=pathToDtd", solr doesn't start and 
returns an error: Cannot find jetty main class

So I removed the -a "-Duser.dir=pathToDtd" option, and solr starts.
BUT
now solr can no longer open the xml files, because it doesn't find the dtd
during the validation stage.

Is there a way to:
- activate an xml catalog file to indicate where the dtd is? (It seems this
would be the better way, but I didn't find how to do it)
- disable dtd validation 

Regards,
---
Jérôme Dupont
Bibliothèque Nationale de France
Département des Systèmes d'Information
Tour T3 - Quai François Mauriac
75706 Paris Cedex 13
téléphone: 33 (0)1 53 79 45 40
e-mail: jerome.dup...@bnf.fr
---


Re: Solr OpenNLP named entity extraction

2018-07-08 Thread Jerome Yang
Hi guys,

In SolrCloud mode, where should the OpenNLP models be put?
Upload them to ZooKeeper?
As I tested on Solr 7.3.1, it seems an absolute path on the local host is not working.
And I cannot upload them into ZooKeeper if the model size exceeds 1MB.

Regards,
Jerome

On Wed, Apr 18, 2018 at 9:54 AM Steve Rowe  wrote:

> Hi Alexey,
>
> First, thanks for moving the conversation to the mailing list.  Discussion
> of usage problems should take place here rather than in JIRA.
>
> I locally set up Solr 7.3 similarly to you and was able to get things to
> work.
>
> Problems with your setup:
>
> 1. Your update chain is missing the Log and Run update processors at the
> end (I see these are missing from the example in the javadocs for the
> OpenNLP NER update processor; I’ll fix that):
>
>  
>  
>
>The Log update processor isn’t strictly necessary, but, from <
> https://lucene.apache.org/solr/guide/7_3/update-request-processors.html#custom-update-request-processor-chain
> >:
>
>Do not forget to add RunUpdateProcessorFactory at the end of any
>chains you define in solrconfig.xml. Otherwise update requests
>processed by that chain will not actually affect the indexed data.
>
> 2. Your example document is missing an “id” field.
>
> 3. For whatever reason, the pre-trained model "en-ner-person.bin" doesn’t
> extract anything from text “This is Steve Jobs 2”.  It will extract “Steve
> Jobs” from text “This is Steve Jobs in white” e.g. though.
>
> 4. (Not a problem necessarily) You may want to use a multi-valued “string”
> field for the “dest” field in your update chain, e.g. “people_str” (“*_str”
> in the default configset is so configured).
>
> --
> Steve
> www.lucidworks.com
>
> > On Apr 17, 2018, at 8:23 AM, Alexey Ponomarenko 
> wrote:
> >
> > Hi once more I am trying to implement named entities extraction using
> this
> > manual
> >
> https://lucene.apache.org/solr/7_3_0//solr-analysis-extras/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.html
> >
> > I am modified solrconfig.xml like this:
> >
> > 
> >class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
> > opennlp/en-ner-person.bin
> > text_opennlp
> > description_en
> > content
> >   
> > 
> >
> > But when I was trying to add data using:
> >
> > *request:*
> >
> > POST
> >
> http://localhost:8983/solr/numberplate/update?version=2.2&wt=xml&update.chain=multiple-extract
> >
> > This is Steve Jobs 2
> > This is text 2 > name="content">This is text for content 2
> >
> > *response*
> >
> > 
> > 
> >
> >0
> >3
> >
> > 
> >
> > But I don't see any data inserted to *content* field and in any other
> field.
> >
> > *If you need some additional data I can provide it.*
> >
> > Can you help me? What have I done wrong?
>
>

-- 
 Pivotal Greenplum | Pivotal Software, Inc. <https://pivotal.io/>


Re: Solr OpenNLP named entity extraction

2018-07-09 Thread Jerome Yang
Thanks Steve!


On Tue, Jul 10, 2018 at 5:20 AM Steve Rowe  wrote:

> Hi Jerome,
>
> See the ref guide[1] for a writeup of how to enable uploading files larger
> than 1MB into ZooKeeper.
>
> Local storage should also work - have you tried placing OpenNLP model
> files in ${solr.solr.home}/lib/ ? - make sure you do the same on each node.
>
> [1]
> https://lucene.apache.org/solr/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit
>
> --
> Steve
> www.lucidworks.com
>
> > On Jul 9, 2018, at 12:50 AM, Jerome Yang  wrote:
> >
> > Hi guys,
> >
> > In Solrcloud mode, where to put the OpenNLP models?
> > Upload to zookeeper?
> > As I test on solr 7.3.1, seems absolute path on local host is not
> working.
> > And can not upload into zookeeper if the model size exceed 1M.
> >
> > Regards,
> > Jerome
> >
> > On Wed, Apr 18, 2018 at 9:54 AM Steve Rowe  wrote:
> >
> >> Hi Alexey,
> >>
> >> First, thanks for moving the conversation to the mailing list.
> Discussion
> >> of usage problems should take place here rather than in JIRA.
> >>
> >> I locally set up Solr 7.3 similarly to you and was able to get things to
> >> work.
> >>
> >> Problems with your setup:
> >>
> >> 1. Your update chain is missing the Log and Run update processors at the
> >> end (I see these are missing from the example in the javadocs for the
> >> OpenNLP NER update processor; I’ll fix that):
> >>
> >> 
> >> 
> >>
> >>   The Log update processor isn’t strictly necessary, but, from <
> >>
> https://lucene.apache.org/solr/guide/7_3/update-request-processors.html#custom-update-request-processor-chain
> >>> :
> >>
> >>   Do not forget to add RunUpdateProcessorFactory at the end of any
> >>   chains you define in solrconfig.xml. Otherwise update requests
> >>   processed by that chain will not actually affect the indexed data.
> >>
> >> 2. Your example document is missing an “id” field.
> >>
> >> 3. For whatever reason, the pre-trained model "en-ner-person.bin"
> doesn’t
> >> extract anything from text “This is Steve Jobs 2”.  It will extract
> “Steve
> >> Jobs” from text “This is Steve Jobs in white” e.g. though.
> >>
> >> 4. (Not a problem necessarily) You may want to use a multi-valued
> “string”
> >> field for the “dest” field in your update chain, e.g. “people_str”
> (“*_str”
> >> in the default configset is so configured).
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >>> On Apr 17, 2018, at 8:23 AM, Alexey Ponomarenko <
> alex1989s...@gmail.com>
> >> wrote:
> >>>
> >>> Hi once more I am trying to implement named entities extraction using
> >> this
> >>> manual
> >>>
> >>
> https://lucene.apache.org/solr/7_3_0//solr-analysis-extras/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.html
> >>>
> >>> I am modified solrconfig.xml like this:
> >>>
> >>> 
> >>>   >> class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
> >>>opennlp/en-ner-person.bin
> >>>text_opennlp
> >>>description_en
> >>>content
> >>>  
> >>> 
> >>>
> >>> But when I was trying to add data using:
> >>>
> >>> *request:*
> >>>
> >>> POST
> >>>
> >>
> http://localhost:8983/solr/numberplate/update?version=2.2&wt=xml&update.chain=multiple-extract
> >>>
> >>> This is Steve Jobs 2
> >>> This is text 2 >>> name="content">This is text for content 2
> >>>
> >>> *response*
> >>>
> >>> 
> >>> 
> >>>   
> >>>   0
> >>>   3
> >>>   
> >>> 
> >>>
> >>> But I don't see any data inserted to *content* field and in any other
> >> field.
> >>>
> >>> *If you need some additional data I can provide it.*
> >>>
> >>> Can you help me? What have I done wrong?
> >>
> >>
> >
> > --
> > Pivotal Greenplum | Pivotal Software, Inc. <https://pivotal.io/>
>
>

-- 
 Pivotal Greenplum | Pivotal Software, Inc. <https://pivotal.io/>


Re: Solr OpenNLP named entity extraction

2018-07-09 Thread Jerome Yang
Hi Steve,

Putting models under "${solr.solr.home}/lib/" is not working.
I checked "ZkSolrResourceLoader"; it seems it will first try to find models in
the config set.
If not found, it then uses the class loader to load from resources.
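
To be concrete, the chain I am testing is the one from earlier in this thread,
sketched below from the values already posted (Alexey's processor settings plus
the Log and Run processors Steve mentioned); the model file itself sits under
${solr.solr.home}/lib/ as described above:

  <updateRequestProcessorChain name="multiple-extract">
    <processor class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
      <str name="modelFile">opennlp/en-ner-person.bin</str>
      <str name="analyzerFieldType">text_opennlp</str>
      <str name="source">description_en</str>
      <str name="dest">content</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>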

Regards,
Jerome

On Tue, Jul 10, 2018 at 9:58 AM Jerome Yang  wrote:

> Thanks Steve!
>
>
> On Tue, Jul 10, 2018 at 5:20 AM Steve Rowe  wrote:
>
>> Hi Jerome,
>>
>> See the ref guide[1] for a writeup of how to enable uploading files
>> larger than 1MB into ZooKeeper.
>>
>> Local storage should also work - have you tried placing OpenNLP model
>> files in ${solr.solr.home}/lib/ ? - make sure you do the same on each node.
>>
>> [1]
>> https://lucene.apache.org/solr/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit
>>
>> --
>> Steve
>> www.lucidworks.com
>>
>> > On Jul 9, 2018, at 12:50 AM, Jerome Yang  wrote:
>> >
>> > Hi guys,
>> >
>> > In Solrcloud mode, where to put the OpenNLP models?
>> > Upload to zookeeper?
>> > As I test on solr 7.3.1, seems absolute path on local host is not
>> working.
>> > And can not upload into zookeeper if the model size exceed 1M.
>> >
>> > Regards,
>> > Jerome
>> >
>> > On Wed, Apr 18, 2018 at 9:54 AM Steve Rowe  wrote:
>> >
>> >> Hi Alexey,
>> >>
>> >> First, thanks for moving the conversation to the mailing list.
>> Discussion
>> >> of usage problems should take place here rather than in JIRA.
>> >>
>> >> I locally set up Solr 7.3 similarly to you and was able to get things
>> to
>> >> work.
>> >>
>> >> Problems with your setup:
>> >>
>> >> 1. Your update chain is missing the Log and Run update processors at
>> the
>> >> end (I see these are missing from the example in the javadocs for the
>> >> OpenNLP NER update processor; I’ll fix that):
>> >>
>> >> 
>> >> 
>> >>
>> >>   The Log update processor isn’t strictly necessary, but, from <
>> >>
>> https://lucene.apache.org/solr/guide/7_3/update-request-processors.html#custom-update-request-processor-chain
>> >>> :
>> >>
>> >>   Do not forget to add RunUpdateProcessorFactory at the end of any
>> >>   chains you define in solrconfig.xml. Otherwise update requests
>> >>   processed by that chain will not actually affect the indexed
>> data.
>> >>
>> >> 2. Your example document is missing an “id” field.
>> >>
>> >> 3. For whatever reason, the pre-trained model "en-ner-person.bin"
>> doesn’t
>> >> extract anything from text “This is Steve Jobs 2”.  It will extract
>> “Steve
>> >> Jobs” from text “This is Steve Jobs in white” e.g. though.
>> >>
>> >> 4. (Not a problem necessarily) You may want to use a multi-valued
>> “string”
>> >> field for the “dest” field in your update chain, e.g. “people_str”
>> (“*_str”
>> >> in the default configset is so configured).
>> >>
>> >> --
>> >> Steve
>> >> www.lucidworks.com
>> >>
>> >>> On Apr 17, 2018, at 8:23 AM, Alexey Ponomarenko <
>> alex1989s...@gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi once more I am trying to implement named entities extraction using
>> >> this
>> >>> manual
>> >>>
>> >>
>> https://lucene.apache.org/solr/7_3_0//solr-analysis-extras/org/apache/solr/update/processor/OpenNLPExtractNamedEntitiesUpdateProcessorFactory.html
>> >>>
>> >>> I am modified solrconfig.xml like this:
>> >>>
>> >>> 
>> >>>  > >> class="solr.OpenNLPExtractNamedEntitiesUpdateProcessorFactory">
>> >>>opennlp/en-ner-person.bin
>> >>>text_opennlp
>> >>>description_en
>> >>>content
>> >>>  
>> >>> 
>> >>>
>> >>> But when I was trying to add data using:
>> >>>
>> >>> *request:*
>> >>>
>> >>> POST
>> >>>
>> >>
>> http://localhost:8983/solr/numberplate/update?version=2.2&wt=xml&update.chain=multiple-extract
>> >>>
>> >>> This is Steve Jobs 2
>> >>> This is text 2> >>> name="content">This is text for content 2
>> >>>
>> >>> *response*
>> >>>
>> >>> 
>> >>> 
>> >>>   
>> >>>   0
>> >>>   3
>> >>>   
>> >>> 
>> >>>
>> >>> But I don't see any data inserted to *content* field and in any other
>> >> field.
>> >>>
>> >>> *If you need some additional data I can provide it.*
>> >>>
>> >>> Can you help me? What have I done wrong?
>> >>
>> >>
>> >
>> > --
>> > Pivotal Greenplum | Pivotal Software, Inc. <https://pivotal.io/>
>>
>>
>
> --
>  Pivotal Greenplum | Pivotal Software, Inc. <https://pivotal.io/>
>
>

-- 
 Pivotal Greenplum | Pivotal Software, Inc. <https://pivotal.io/>


Re: Solr OpenNLP named entity extraction

2018-07-10 Thread Jerome Yang
Thanks a lot Steve!

On Wed, Jul 11, 2018 at 10:24 AM Steve Rowe  wrote:

> Hi Jerome,
>
> I was able to setup a configset to perform OpenNLP NER, loading the model
> files from local storage.
>
> There is a trick though[1]: the model files must be located *in a jar* or
> *in a subdirectory* under ${solr.solr.home}/lib/ or under a directory
> specified via a solrconfig.xml <lib> directive.
>
> I tested with the bin/solr cloud example, and put model files under the
> two solr home directories, at example/cloud/node1/solr/lib/opennlp/ and
> example/cloud/node2/solr/lib/opennlp/.  The “opennlp/“ subdirectory is
> required, though its name can be anything else you choose.
>
> [1] As you noted, ZkSolrResourceLoader delegates to its parent classloader
> when it can’t find resources in a configset, and the parent classloader is
> set up to load from subdirectories and jar files under
> ${solr.solr.home}/lib/ or under a directory specified via a solrconfig.xml
>  directive.  These directories themselves are not included in the set
> of directories from which resources are loaded; only their children are.
>
> --
> Steve
> www.lucidworks.com
>
> > On Jul 9, 2018, at 10:10 PM, Jerome Yang  wrote:
> >
> > Hi Steve,
> >
> > Put models under " ${solr.solr.home}/lib/ " is not working.
> > I check the "ZkSolrResourceLoader" seems it will first try to find modes
> in
> > config set.
> > If not find, then it uses class loader to load from resources.
> >
> > Regards,
> > Jerome
> >
> > On Tue, Jul 10, 2018 at 9:58 AM Jerome Yang  wrote:
> >
> >> Thanks Steve!
> >>
> >>
> >> On Tue, Jul 10, 2018 at 5:20 AM Steve Rowe  wrote:
> >>
> >>> Hi Jerome,
> >>>
> >>> See the ref guide[1] for a writeup of how to enable uploading files
> >>> larger than 1MB into ZooKeeper.
> >>>
> >>> Local storage should also work - have you tried placing OpenNLP model
> >>> files in ${solr.solr.home}/lib/ ? - make sure you do the same on each
> node.
> >>>
> >>> [1]
> >>>
> https://lucene.apache.org/solr/guide/7_4/setting-up-an-external-zookeeper-ensemble.html#increasing-the-file-size-limit
> >>>
> >>> --
> >>> Steve
> >>> www.lucidworks.com
> >>>
> >>>> On Jul 9, 2018, at 12:50 AM, Jerome Yang  wrote:
> >>>>
> >>>> Hi guys,
> >>>>
> >>>> In Solrcloud mode, where to put the OpenNLP models?
> >>>> Upload to zookeeper?
> >>>> As I test on solr 7.3.1, seems absolute path on local host is not
> >>> working.
> >>>> And can not upload into zookeeper if the model size exceed 1M.
> >>>>
> >>>> Regards,
> >>>> Jerome
> >>>>
> >>>> On Wed, Apr 18, 2018 at 9:54 AM Steve Rowe  wrote:
> >>>>
> >>>>> Hi Alexey,
> >>>>>
> >>>>> First, thanks for moving the conversation to the mailing list.
> >>> Discussion
> >>>>> of usage problems should take place here rather than in JIRA.
> >>>>>
> >>>>> I locally set up Solr 7.3 similarly to you and was able to get things
> >>> to
> >>>>> work.
> >>>>>
> >>>>> Problems with your setup:
> >>>>>
> >>>>> 1. Your update chain is missing the Log and Run update processors at
> >>> the
> >>>>> end (I see these are missing from the example in the javadocs for the
> >>>>> OpenNLP NER update processor; I’ll fix that):
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>  The Log update processor isn’t strictly necessary, but, from <
> >>>>>
> >>>
> https://lucene.apache.org/solr/guide/7_3/update-request-processors.html#custom-update-request-processor-chain
> >>>>>> :
> >>>>>
> >>>>>  Do not forget to add RunUpdateProcessorFactory at the end of any
> >>>>>  chains you define in solrconfig.xml. Otherwise update requests
> >>>>>  processed by that chain will not actually affect the indexed
> >>> data.
> >>>>>
> >>>>> 2. Your example document is missing an “id” field.
> >>>>>
> >>>>> 3. For whatever reason, the pre-trained model "en-ner-person.bin&

Solr 1.3 query and index perf tank during optimize

2009-11-12 Thread Jerome L Quinn

Hi, everyone, this is a problem I've had for quite a while,
and have basically avoided optimizing because of it.  However,
eventually we will get to the point where we must delete as
well as add docs continuously.

I have a Solr 1.3 index with ~4M docs at around 90G.  This is a single
instance running inside tomcat 6, so no replication.  Merge factor is the
default 10.  ramBufferSizeMB is 32.  maxWarmingSearchers=4.
autoCommit is set at 3 sec.
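
For reference, the relevant pieces of my solrconfig.xml look roughly like this
(a sketch, with the elements placed as in the stock 1.3 example config):

  <indexDefaults>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
  </indexDefaults>

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxTime>3000</maxTime>  <!-- 3 seconds -->
    </autoCommit>
  </updateHandler>

  <query>
    <maxWarmingSearchers>4</maxWarmingSearchers>
  </query>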

We continually push new data into the index, at somewhere between 1-10 docs
every 10 sec or so.  Solr is running on a quad-core 3.0GHz server
under IBM java 1.6.  The index is sitting on a local 15K scsi disk.
There's nothing
else of substance running on the box.

Optimizing the index takes about 65 min.

As long as I'm not optimizing, search and indexing times are satisfactory.

When I start the optimize, I see massive problems with timeouts pushing new
docs
into the index, and search times balloon.  A typical search while
optimizing takes
about 1 min instead of a few seconds.

Can anyone offer me help with fixing the problem?

Thanks,
Jerry Quinn

Re: Solr 1.3 query and index perf tank during optimize

2009-11-13 Thread Jerome L Quinn

Mark Miller  wrote on 11/12/2009 07:18:03 PM:
> Ah, the pains of optimization. Its kind of just how it is. One solution
> is to use two boxes and replication - optimize on the master, and then
> queries only hit the slave. Out of reach for some though, and adds many
> complications.

Yes, in my use case 2 boxes isn't a great option.


> Another kind of option is to use the partial optimize feature:
>
>  
>
> Using this, you can optimize down to n segments and take a shorter hit
> each time.

Is this a 1.4 feature?  I'm planning to migrate to 1.4, but it'll take a
while since
I have to port custom code forward, including a query parser.
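
(If I read this right, the partial optimize is just an extra attribute on the
optimize message sent to /update, something like:)

  <optimize maxSegments="4"/>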


> Also, if optimizing is so painful, you might lower the merge factor
> amortize that pain better. Thats another way to slowly get there - if
> you lower the merge factor, as merging takes place, the new merge factor
> will be respected, and semgents will merge down. A merge factor of 2
> (the lowest) will make it so you only ever have 2 segments. Sometimes
> that works reasonably well - you could try 3-6 or something as well.
> Then when you do your partial optimizes (and eventually a full optimize
> perhaps), you want have so far to go.

So this will slow down indexing but speed up optimize somewhat?
Unfortunately,
right now I lose docs I'm indexing, as well as slowing searching to a crawl.
Ugh.

I've got plenty of CPU horsepower.  This is where having the ability to
optimize
on another filesystem would be useful.

Would it perhaps make sense to set up a master/slave on the same machine?
Then
I suppose I could have an index being optimized without it clobbering the
search.
Would new indexed items still be dropped on the floor?

Thanks,
Jerry

Re: Solr 1.3 query and index perf tank during optimize

2009-11-13 Thread Jerome L Quinn
ysee...@gmail.com wrote on 11/13/2009 09:06:29 AM:
>
> On Fri, Nov 13, 2009 at 6:27 AM, Michael McCandless
>  wrote:
> > I think we sorely need a Directory impl that down-prioritizes IO
> > performed by merging.
>
> It's unclear if this case is caused by IO contention, or the OS cache
> of the hot parts of the index being lost by that extra IO activity.
> Of course the latter would lead to the former, but without that OS
> disk cache, the searches may be too slow even w/o the extra IO.

Is there a way to configure things so that search and new data indexing
get cached under the control of solr/lucene?  Then we'd be less reliant
on the OS behavior.

Alternatively if there are OS params I can tweak (RHEL/Centos 5)
to solve the problem, that's an option for me.

Would you know if 1.4 is better behaved than 1.3?

Thanks,
Jerry

Re: Solr 1.3 query and index perf tank during optimize

2009-11-13 Thread Jerome L Quinn
ysee...@gmail.com wrote on 11/13/2009 09:06:29 AM:

> On Fri, Nov 13, 2009 at 6:27 AM, Michael McCandless
>  wrote:
> > I think we sorely need a Directory impl that down-prioritizes IO
> > performed by merging.
>
> It's unclear if this case is caused by IO contention, or the OS cache
> of the hot parts of the index being lost by that extra IO activity.
> Of course the latter would lead to the former, but without that OS
> disk cache, the searches may be too slow even w/o the extra IO.

On linux there's the ionice command to try to throttle processes.  Would it
be possible and make sense to have a separate process for optimizing that
had ionice set to idle?  Can the index be shared this way?

Thanks,
Jerry

Re: Solr 1.3 query and index perf tank during optimize

2009-11-14 Thread Jerome L Quinn


Lance Norskog  wrote on 11/13/2009 11:18:42 PM:

> The 'maxSegments' feature is new with 1.4.  I'm not sure that it will
> cause any less disk I/O during optimize.

It could still be useful to manage the "too many open files" problem that
rears its ugly head on occasion.

> The 'mergeFactor=2' idea is not what you think: in this case the index
> is always "mostly optimized", so you never need to run optimize.
> Indexing is always slower, because you amortize the optimize time into
> little continuous chunks during indexing. You never stop indexing. You
> should not lose documents.

Is the space taken by deleted documents recovered in this case?

Jerry

Re: Solr 1.3 query and index perf tank during optimize

2009-11-16 Thread Jerome L Quinn


Otis Gospodnetic  wrote on 11/13/2009 11:15:43
PM:

> Let's take a step back.  Why do you need to optimize?  You said: "As
> long as I'm not optimizing, search and indexing times are
satisfactory." :)
>
> You don't need to optimize just because you are continuously adding
> and deleting documents.  On the contrary!


That's a fair question.

Basically, search entries are keyed to other documents.  We have finite
storage,
so we purge old documents.  My understanding was that deleted documents
still
take space until an optimize is done.  Therefore, if I don't optimize, the
index
size on disk will grow without bound.

Am I mistaken?  If I don't ever have to optimize, it would make my life
easier.

Thanks,
Jerry


Plans for 1.3.1?

2009-01-07 Thread Jerome L Quinn

Hi, all.  Are there any plans for putting together a bugfix release?  I'm
not looking for particular bugs, but would like to know if bug fixes are
only going to be done mixed in with new features.

Thanks,
Jerry Quinn

Help with Solr 1.3 lockups?

2009-01-15 Thread Jerome L Quinn

Hi, all.

I'm running solr 1.3 inside Tomcat 6.0.18.  I'm running a modified query
parser, tokenizer, highlighter, and have a CustomScoreQuery for dates.

After some amount of time, I see solr stop responding to update requests.
When crawling through the logs, I see the following pattern:

Jan 12, 2009 7:27:42 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
Jan 12, 2009 7:28:11 PM org.apache.solr.common.SolrException log
SEVERE: Error during auto-warming of
key:org.apache.solr.search.queryresult...@ce0f92b9:java.lang.OutOfMemoryError
at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
at org.apache.lucene.index.SegmentTermEnum.term
(SegmentTermEnum.java:167)
at org.apache.lucene.index.SegmentMergeInfo.next
(SegmentMergeInfo.java:66)
at org.apache.lucene.index.MultiSegmentReader$MultiTermEnum.next
(MultiSegmentReader.java:492)
at org.apache.lucene.search.FieldCacheImpl$7.createValue
(FieldCacheImpl.java:267)
at org.apache.lucene.search.FieldCacheImpl$Cache.get
(FieldCacheImpl.java:72)
at org.apache.lucene.search.FieldCacheImpl.getInts
(FieldCacheImpl.java:245)
at org.apache.solr.search.function.IntFieldSource.getValues
(IntFieldSource.java:50)
at org.apache.solr.search.function.SimpleFloatFunction.getValues
(SimpleFloatFunction.java:41)
at org.apache.solr.search.function.BoostedQuery$CustomScorer.
(BoostedQuery.java:111)
at org.apache.solr.search.function.BoostedQuery$CustomScorer.
(BoostedQuery.java:97)
at org.apache.solr.search.function.BoostedQuery
$BoostedWeight.scorer(BoostedQuery.java:88)
at org.apache.lucene.search.IndexSearcher.search
(IndexSearcher.java:132)
at org.apache.lucene.search.Searcher.search(Searcher.java:126)
at org.apache.lucene.search.Searcher.search(Searcher.java:105)
at org.apache.solr.search.SolrIndexSearcher.getDocListNC
(SolrIndexSearcher.java:966)
at org.apache.solr.search.SolrIndexSearcher.getDocListC
(SolrIndexSearcher.java:838)
at org.apache.solr.search.SolrIndexSearcher.access$000
(SolrIndexSearcher.java:56)
at org.apache.solr.search.SolrIndexSearcher$2.regenerateItem
(SolrIndexSearcher.java:260)
at org.apache.solr.search.LRUCache.warm(LRUCache.java:194)
at org.apache.solr.search.SolrIndexSearcher.warm
(SolrIndexSearcher.java:1518)
at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1018)
at java.util.concurrent.FutureTask$Sync.innerRun
(FutureTask.java:314)
at java.util.concurrent.FutureTask.run(FutureTask.java:149)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask
(ThreadPoolExecutor.java:896)
at java.util.concurrent.ThreadPoolExecutor$Worker.run
(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:735)

Jan 12, 2009 7:28:11 PM org.apache.tomcat.util.net.JIoEndpoint$Acceptor run
SEVERE: Socket accept failed
Throwable occurred: java.lang.OutOfMemoryError
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:414)
at java.net.ServerSocket.implAccept(ServerSocket.java:464)
at java.net.ServerSocket.accept(ServerSocket.java:432)
at
org.apache.tomcat.util.net.DefaultServerSocketFactory.acceptSocket
(DefaultServerSocketFactory.java:61)
at org.apache.tomcat.util.net.JIoEndpoint$Acceptor.run
(JIoEndpoint.java:310)
at java.lang.Thread.run(Thread.java:735)

<<<>
<< Java dumps core and heap at this point >>
<<<>

Jan 12, 2009 7:28:21 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
timed out: SingleInstanceLock: write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:85)
at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1140)
at org.apache.lucene.index.IndexWriter.(IndexWriter.java:938)
at org.apache.solr.update.SolrIndexWriter.
(SolrIndexWriter.java:116)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter
(UpdateHandler.java:122)
at org.apache.solr.update.DirectUpdateHandler2.openWriter
(DirectUpdateHandler2.java:167)
at org.apache.solr.update.DirectUpdateHandler2.addDoc
(DirectUpdateHandler2.java:221)
at org.apache.solr.update.processor.RunUpdateProcessor.processAdd
(RunUpdateProcessorFactory.java:59)
at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate
(XmlUpdateRequestHandler.java:196)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody
(XmlUpdateRequestHandler.java:123)
at org.apache.solr.handler.RequestHandlerBase.handleRequest
(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at org.apache.solr.servlet.SolrDispatchFilter.execute
(

Re: I get SEVERE: Lock obtain timed out

2009-01-23 Thread Jerome L Quinn


Julian Davchev  wrote on 01/20/2009 10:07:48 AM:

> Julian Davchev 
> 01/20/2009 10:07 AM
>
> I get SEVERE: Lock obtain timed out
>
> Hi,
> Any documents or something I can read on how locks work and how I can
> controll it. When do they occur etc.
> Cause only way I got out of this mess was restarting tomcat
>
> SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain
> timed out: SingleInstanceLock: write.lock


I've seen this with my customized setup.  Before I saw the write.lock
messages, I had an OutOfMemoryError, but the container didn't shut down.
After that Solr spewed write lock messages and I had to restart.

So, you might want to search backwards in your logs and see if you can find
when the write lock problems started and whether there is some identifiable problem
preceding that.

Jerry Quinn



Re: Help with Solr 1.3 lockups?

2009-01-26 Thread Jerome L Quinn
Hi and thanks for looking at the problem ...


Mark Miller  wrote on 01/15/2009 02:58:24 PM:

> Mark Miller 
> 01/15/2009 02:58 PM
>
> Re: Help with Solr 1.3 lockups?
>
> How much RAM are you giving the JVM? Thats running out of memory loading
> a FieldCache, which can be a more memory intensive data structure. It
> pretty much points to the JVM not having enough RAM to do what you want.
> How many fields do you sort on? How many fields do you facet on? How
> much RAM do you have available and how much have you given Solr? How
> many documents are you working with?

I'm using the stock tomcat and JVM settings.  I see the VM footprint
sitting at 877M
right now.  It hasn't locked up yet this time around.

There are 2 fields we facet on and 1 that we sort on.  The machine has 16G of
memory, and the
index is currently sitting at 38G, though I haven't run an optimize in a
while.  There are
about 1 million docs in the index, though we have 3 full copies of the data
stored in
different fields and processed in different ways.

I do a commit every 10 docs or 3 seconds, whichever comes first.  We're
approximating
real-time updating.

The index is currently sitting on NFS, which I know isn't great for
performance.  I didn't
think it could cause reliability issues though.


> As far as rebooting a failed server, the best technique is generally
> external. I would recommend a script/program on another machine that
> hits the Solr instance with a simple query every now and again. If you
> don't get a valid response within a reasonable amount of time, or after
> a reasonable number of tries, fire off alert emails and issue a command
> to that server to reboot the JVM. Or something to that effect.

I suspect I'll add a watchdog, no matter what's causing the problem here.

> However, you should figure out why you are running out of memory. You
> don't want to use more resources than you have available if you can help
it.

Definitely. That's on the agenda :-)

Thanks,
Jerry



> - Mark
>
> Jerome L Quinn wrote:
> > Hi, all.
> >
> > I'm running solr 1.3 inside Tomcat 6.0.18.  I'm running a modified
query
> > parser, tokenizer, highlighter, and have a CustomScoreQuery for dates.
> >
> > After some amount of time, I see solr stop responding to update
requests.
> > When crawling through the logs, I see the following pattern:
> >
> > Jan 12, 2009 7:27:42 PM org.apache.solr.update.DirectUpdateHandler2
commit
> > INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
> > Jan 12, 2009 7:28:11 PM org.apache.solr.common.SolrException log
> > SEVERE: Error during auto-warming of
> >
>
key:org.apache.solr.search.queryresult...@ce0f92b9:java.lang.OutOfMemoryError

> > at org.apache.lucene.index.TermBuffer.toTerm
(TermBuffer.java:122)
> > at org.apache.lucene.index.SegmentTermEnum.term
> > (SegmentTermEnum.java:167)
> > at org.apache.lucene.index.SegmentMergeInfo.next
> > (SegmentMergeInfo.java:66)
> > at org.apache.lucene.index.MultiSegmentReader
$MultiTermEnum.next
> > (MultiSegmentReader.java:492)
> > at org.apache.lucene.search.FieldCacheImpl$7.createValue
> > (FieldCacheImpl.java:267)
> > at org.apache.lucene.search.FieldCacheImpl$Cache.get
> > (FieldCacheImpl.java:72)
> > at org.apache.lucene.search.FieldCacheImpl.getInts
> > (FieldCacheImpl.java:245)
> > at org.apache.solr.search.function.IntFieldSource.getValues
> > (IntFieldSource.java:50)
> > at
org.apache.solr.search.function.SimpleFloatFunction.getValues
> > (SimpleFloatFunction.java:41)
> > at org.apache.solr.search.function.BoostedQuery
$CustomScorer.
> > (BoostedQuery.java:111)
> > at org.apache.solr.search.function.BoostedQuery
$CustomScorer.
> > (BoostedQuery.java:97)
> > at org.apache.solr.search.function.BoostedQuery
> > $BoostedWeight.scorer(BoostedQuery.java:88)
> > at org.apache.lucene.search.IndexSearcher.search
> > (IndexSearcher.java:132)
> > at org.apache.lucene.search.Searcher.search(Searcher.java:126)
> > at org.apache.lucene.search.Searcher.search(Searcher.java:105)
> > at org.apache.solr.search.SolrIndexSearcher.getDocListNC
> > (SolrIndexSearcher.java:966)
> > at org.apache.solr.search.SolrIndexSearcher.getDocListC
> > (SolrIndexSearcher.java:838)
> > at org.apache.solr.search.SolrIndexSearcher.access$000
> > (SolrIndexSearcher.java:56)
> > at org.apache.solr.search.SolrIndexSearcher$2.regenerateItem
> > (SolrIndexSearcher.java:260)
> > at org.apache.solr

Re: Help with Solr 1.3 lockups?

2009-01-26 Thread Jerome L Quinn


"Lance Norskog"  wrote on 01/20/2009 02:16:47 AM:

> "Lance Norskog" 
> 01/20/2009 02:16 AM

> Java 1.5 has thread-locking bugs. Switching to Java 1.6 may cure this
> problem.

Thanks for taking time to look at the problem.  Unfortunately, this is
happening on Java 1.6,
so I can't put the blame there.

Thanks,
Jerry

Re: Help with Solr 1.3 lockups?

2009-01-28 Thread Jerome L Quinn


Mark Miller  wrote on 01/26/2009 04:30:00 PM:

> Just a point or I missed: with such a large index (not doc size large,
> but content wise), I imagine a lot of your 16GB of RAM is being used by
> the system disk cache - which is good. Another reason you don't want to
> give too much RAM to the JVM. But you still want to give it enough to
> avoid the OOM :) Assuming you are using the RAM you are legitimately.
> And I don't yet have a reason to think you are not.

I've bumped the JVM max memory to 2G.  Hopefully that is enough.  I'll be
keeping an eye on it.


> Also, there has been a report of or two of a lockup that didn't appear
> to involve an OOM, so this is not guaranteed to solve that. However,
> seeing that the lockup comes after the OOM, its the likely first thing
> to fix. Once the memory problems are taken care of, the locking issue
> can be addressed if you find it still remains. My bet is that fixing the
> OOM will clear it up.

I've gone through my code looking for possible leaks and didn't find
anything.  That doesn't mean they're not there of course.

I ran an analyzer on the heap dump from the last OOM event.  These were the
likely items it identified:

org/apache/catalina/connector/Connector      java/util/WeakHashMap$Entry      399,913,269 bytes
org/apache/catalina/connector/Connector      java/lang/Object[ ]              197,256,078 bytes
org/apache/lucene/search/ExtendedFieldCache  java/util/WeakHashMap$Entry[ ]   177,893,021 bytes
org/apache/lucene/search/ExtendedFieldCache  java/util/HashMap$Entry[ ]        42,490,720 bytes
org/apache/lucene/search/ExtendedFieldCache  java/util/HashMap$Entry[ ]        42,490,656 bytes

I'm not sure what to make of this, though.


> > You also might lower the max warming searchers setting if that makes
> > sense.

I'm using the default setting of 2.  I have seen an error about too many
warming searchers once or twice, but not often.

Thanks,
Jerry

[1.3] help with update timeout issue?

2010-01-14 Thread Jerome L Quinn


Hi, folks,

I am using Solr 1.3 pretty successfully, but am running into an issue that
hits once in a long while.  I'm still using 1.3 since I have some custom
code I will have to port forward to 1.4.

My basic setup is that I have data sources continually pushing data into
Solr, around 20K adds per day.  The index is currently around 100G, stored
on local disk on a fast linux server.  I'm trying to make new docs
searchable as quickly as possible, so I currently have autocommit set to
15s.  I originally had 3s but that seems to be a little too unstable.  I
never optimize the index since optimize will lock things up solid for 2
hours, dropping docs until the optimize completes.  I'm using the default
segment merging settings.

Every once in a while I'm getting a socket timeout when trying to add a
document.  I traced it to a 20s timeout and then found the corresponding
point in the Solr log.

Jan 13, 2010 2:59:15 PM org.apache.solr.core.SolrCore execute
INFO: [tales] webapp=/solr path=/update params={} status=0 QTime=2
Jan 13, 2010 2:59:15 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
Jan 13, 2010 2:59:56 PM org.apache.solr.search.SolrIndexSearcher 
INFO: Opening searc...@26e926e9 main
Jan 13, 2010 2:59:56 PM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush

Solr locked up for 41 seconds here while doing some of the commit work.
So, I have a few questions.

Is this related to GC?
Does Solr always lock up when merging segments and I just have to live with
losing the doc I want to add?
Is there a timeout that would guarantee me a write success?
Should I just retry in this situation? If so, how do I distinguish between
this and Solr just being down?
I already have had issues in the past with too many files open, so
increasing the merge factor isn't an option.


On a related note, I had previously asked about optimizing and was told
that segment merging would take care of cleaning up deleted docs.  However,
I have the following stats for my index:

numDocs : 2791091
maxDoc : 4811416

My understanding is that numDocs is the docs being searched and maxDoc is
the number of docs including ones that will disappear after optimization.
How do I get this cleanup without using optimize, since it locks up Solr
for multiple hours.  I'm deleting old docs daily as well.

Thanks for all the help,
Jerry

Re: [1.3] help with update timeout issue?

2010-01-15 Thread Jerome L Quinn
Otis Gospodnetic  wrote on 01/14/2010 10:07:15
PM:

> See those "waitFlush=true,waitSearcher=true" ?  Do things improve if
> you make them false? (not sure how with autocommit without looking
> at the config and not sure if this makes a difference when
> autocommit triggers commits)

Looking at DirectUpdateHandler2, it appears that those values are hardwired
to true for autocommit.  Unless there's another mechanism for changing
that.

> Re deleted docs, they are probably getting expunged, it's just that
> you always have more deleted docs, so those 2 numbers will never be
> the same without optimize.

I can accept that they will always be different, but that's a large
difference.
Hmm, a couple of weeks ago, I manually deleted a bunch of docs whose
associated
data had become corrupted.  Normally, I'd only be deleting a day's worth of docs
at
a time.  Is there a time by which I could expect the old stuff to get cleaned up
without optimizing?

Thanks,
Jerry

Re: [1.3] help with update timeout issue?

2010-01-20 Thread Jerome L Quinn


Lance Norskog  wrote on 01/16/2010 12:43:09 AM:

> If your indexing software does not have the ability to retry after a
> failure, you might with to change the timeout from 20 seconds to, say,
> 5 minutes.

I can make it retry, but I have somewhat real-time processes doing these
updates.  Does anyone
push updates into a temporary file and then have an async process push the
updates so that it
can survive the lockups without worry?  This seems like a real hack, but I
don't want a
long timeout like that in the program that currently pushes the data.

One thing that worries me is that solr may not respond to searches in these
windows.  I'm basing
that on the observation that search does not respond when solr is
optimizing.

Can anyone offer me insight on why these delays happen?

Thanks,
Jerry

Re: solr blocking on commit

2010-01-20 Thread Jerome L Quinn

ysee...@gmail.com wrote on 01/19/2010 06:05:45 PM:
> On Tue, Jan 19, 2010 at 5:57 PM, Steve Conover 
wrote:
> > I'm using latest solr 1.4 with java 1.6 on linux.  I have a 3M
> > document index that's 10+GB.  We currently give solr 12GB of ram to
> > play in and our machine has 32GB total.
> >
> > We're seeing a problem where solr blocks during commit - it won't
> > server /select requests - in some cases for more than 15-30 seconds.
> > We'd like to somehow configure things such that there's no
> > interruption in /select service.
>
> A commit shouldn't cause searches to block.
> Could this perhaps be a stop-the-world GC pause that coincides with the
> commit?

This is essentially the same problem I'm fighting with.  Once in a while,
commit
causes everything to freeze, causing add commands to time out.

My large index sees pauses on the order of 50 seconds once every day or
two.  I
have a small index of 700M on disk that sees 20 second pauses once in a
while.

I'm using the IBM 1.6 jvm on linux.
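
To test the GC theory on my side, I'm thinking of dropping a tiny pause
detector into the indexing JVM (and maybe into Solr's container as well): a
daemon thread that sleeps for a fixed interval and complains whenever it
wakes up much later than it should have. If the 20-50 second freezes line up
with its complaints, a stop-the-world collection is a good suspect. Rough
sketch, thresholds picked arbitrarily:

/** Minimal stop-the-world detector: a long gap between wakeups means the whole JVM stalled. */
public class PauseDetector implements Runnable {
    public void run() {
        long last = System.currentTimeMillis();
        while (true) {
            try { Thread.sleep(100); } catch (InterruptedException e) { return; }
            long now = System.currentTimeMillis();
            long overshoot = now - last - 100;             // how much later than expected we woke up
            if (overshoot > 1000) {
                System.err.println("JVM pause of ~" + overshoot + " ms at " + new java.util.Date(now));
            }
            last = now;
        }
    }

    public static void start() {
        Thread t = new Thread(new PauseDetector(), "pause-detector");
        t.setDaemon(true);
        t.start();
    }
}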

Jerry

Re: solr blocking on commit

2010-01-20 Thread Jerome L Quinn

ysee...@gmail.com wrote on 01/20/2010 02:24:04 PM:
> On Wed, Jan 20, 2010 at 2:18 PM, Jerome L Quinn  wrote:
> > This is essentially the same problem I'm fighting with.  Once in a while,
> > commit causes everything to freeze, causing add commands to time out.
>
> This could be a bit different.  Commits do currently block other
> update operations such as adds, but not searches.

Ah, this is good to know.  Is there any logging in solr 1.3 I could turn on
to verify that this is indeed what's happening for me?
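
In the meantime, I'll probably try to confirm it from the client side: run a
trivial query in a loop next to the adder and log whenever either one stalls.
If the adds stall around commits while the queries stay quick, that matches
what you're describing. Something along these lines (the query URL is just an
example):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/** Times a trivial query once a second; run it next to the adder and compare the logs. */
public class LatencyProbe implements Runnable {
    // Example URL only; any cheap query against the real index will do.
    private static final String QUERY_URL = "http://localhost:8983/solr/select?q=*:*&rows=0";

    public void run() {
        while (true) {
            long start = System.currentTimeMillis();
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(QUERY_URL).openConnection();
                conn.setReadTimeout(60000);
                InputStream in = conn.getInputStream();
                while (in.read() != -1) { /* drain the response */ }
                in.close();
            } catch (Exception e) {
                System.err.println("query failed: " + e);
            }
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > 2000) {
                // If commits only block updates, these should stay rare even while adds stall.
                System.err.println("slow query: " + elapsed + " ms");
            }
            try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
        }
    }
}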

Thanks,
Jerry

Re: solr blocking on commit

2010-01-20 Thread Jerome L Quinn

ysee...@gmail.com wrote on 01/20/2010 02:24:04 PM:
> On Wed, Jan 20, 2010 at 2:18 PM, Jerome L Quinn  wrote:
> > This is essentially the same problem I'm fighting with.  Once in a while,
> > commit causes everything to freeze, causing add commands to time out.
>
> This could be a bit different.  Commits do currently block other
> update operations such as adds, but not searches.

How is Solr organized so that search can continue when a commit has closed
the index?  Also, looking at the Lucene docs, commit causes a system
fsync().  Won't search also get blocked by the I/O traffic it generates?

Thanks,
Jerry

Re: solr blocking on commit

2010-01-22 Thread Jerome L Quinn


Otis Gospodnetic  wrote on 01/22/2010 12:20:45
AM:
> I'm missing the bigger context of this thread here, but from the
> snippet below - sure, commits cause in-memory index to get written
> to disk, that causes some IO, and that *could* affect search *if*
> queries are running on the same box.  When index and/or query volume
> is high, one typically puts indexing and searching on different servers.

After some more research, I realize that what we're trying to do is
essentially
near-real-time processing.

I have a data-collection pipeline that is near-real-time, and I'm trying to
avoid arbitrary delays pushing the data into the index so that collection
doesn't stall.  On the search side, we don't have a lot of search traffic,
but we would like it to be responsive when it comes in.

We also dynamically purge old data to keep the storage requirements within
limits.

So basically I'm trying to tune the system so that this all works well :-)
I'm also trying to keep indexing and search on a single system to keep the
costs down.

One thing I'm trying now is to put an intermediary in so that updates can
be asynchronous.  Then
my data collection processes can continue without waiting for unpredictable
index merges.
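
Concretely, the intermediary I'm trying looks something like the sketch
below: the collectors drop documents onto a bounded in-memory queue and
return immediately, and a single background thread does the actual posting
and absorbs the merge/commit stalls. The names and the posting call are
placeholders, and a file-backed spool would survive a crash better, but this
is the shape of it:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AsyncIndexer {
    // Bounded on purpose: if Solr stalls for minutes, the collectors get back-pressure
    // instead of the queue eating the heap.
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);

    public AsyncIndexer() {
        Thread drainer = new Thread(new Runnable() {
            public void run() {
                while (true) {
                    try {
                        String addXml = queue.take();       // blocks until something arrives
                        postWithRetry(addXml);              // the slow part; tolerant of commit stalls
                    } catch (InterruptedException e) {
                        return;
                    }
                }
            }
        }, "solr-drainer");
        drainer.setDaemon(true);
        drainer.start();
    }

    /** Called by the near-real-time collectors; returns quickly unless the queue is full. */
    public void submit(String addXml) throws InterruptedException {
        queue.put(addXml);
    }

    private void postWithRetry(String xml) {
        // HTTP POST to /update with a generous read timeout, retrying on timeouts
        // (same shape as the retry sketch earlier in these threads).
    }
}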

Thanks,
Jerry



Re: SolrJ commit options

2010-03-05 Thread Jerome L Quinn
Shalin Shekhar Mangar  wrote on 02/25/2010 07:38:39
AM:

> On Thu, Feb 25, 2010 at 5:34 PM, gunjan_versata  wrote:
>
> >
> > We are using SolrJ to handle commits to our solr server.. All runs fine..
> > But whenever the commit happens, the server becomes slow and stops
> > responding.. thereby resulting in TimeOut errors on our production. We are
> > using the default commit with waitFlush = true, waitSearcher = true...
> >
> > Can I change their values so that the requests coming to solr don't block
> > on recent commit?? Also, what will be the impact of changing these
> > values??
> >
>
> Solr does not block reads during a commit/optimize. Write operations are
> queued up but they are still accepted. Are you using the same Solr server
> for reads as well as writes?

I've seen similar things with Solr 1.3 (not using SolrJ).  If I try to
optimize the
index, queries will take much longer - easily a minute or more, resulting
in timeouts.
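
For the SolrJ folks on this thread (I'm not using it myself, so treat this as
an untested sketch): my understanding is that both flags can be passed
explicitly on a commit, and I believe 1.4 also adds a partial optimize down
to a maximum segment count, which should hurt far less than a full optimize.
Roughly, with the URL just an example:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CommitOptions {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");  // example URL

        // Explicit commit that neither waits for the flush nor for the new searcher to warm.
        server.commit(false, false);

        // Partial optimize: merge down to at most 10 segments instead of 1.
        // (I believe this overload is a 1.4 addition; it isn't in 1.3.)
        server.optimize(false, false, 10);
    }
}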

Jerry