Solr Cloud - is replication really a feature on the trunk?

2011-09-07 Thread Pulkit Singhal
Hello,

I'm working off the trunk and the following wiki link:
http://wiki.apache.org/solr/SolrCloud

The wiki link has a section that seeks to quickly familiarize a user
with replication in SolrCloud - "Example B: Simple two shard cluster
with shard replicas"

But after going through it, I have to wonder if this is truly
replication? Because if it is truly replication then somewhere along
the line, the following properties must have been set
programmatically:
replicateAfter, confFiles, masterUrl, pollInterval
Can someone tell me: Where exactly in the code is this happening?

I've been looking through some older threads where I see stuff like:
[Jan Høydahl]: Question: Is ReplicationHandler ZK-aware yet?
[Mark Miller]: As I think you now know, not yet ;)

Not sure if the comments above really fit-in with my question but it
certainly isn't encouraging.

SolrCloud does an excellent job of super-simplifying the sharding
process, so I'm hoping someone can tell me what needs to happen to
make it do the same for replication. I'm willing to get my hands dirty
and contribute to the trunk if someone can provide high-level
mentoring/guidance around the already existing SolrCloud code.


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Pulkit Singhal
Thank You Yury. After looking at your thread, there's something I must
clarify: Is solr.xml not uploaded and held in ZooKeeper? I ask this
because you have a slightly different config between Node 1 & 2:
http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html

On Wed, Sep 7, 2011 at 8:34 PM, Yury Kats  wrote:
> On 9/7/2011 3:18 PM, Pulkit Singhal wrote:
>> Hello,
>>
>> I'm working off the trunk and the following wiki link:
>> http://wiki.apache.org/solr/SolrCloud
>>
>> The wiki link has a section that seeks to quickly familiarize a user
>> with replication in SolrCloud - "Example B: Simple two shard cluster
>> with shard replicas"
>>
>> But after going through it, I have to wonder if this is truly
>> replication?
>
> Not really. Replication is not set up in the example.
> The example uses "replicas" as "copies", to demonstrate high search
> availability.
>
>> Because if it is truly replication then somewhere along
>> the line, the following properties must have been set
>> programmatically:
>> replicateAfter, confFiles, masterUrl, pollInterval
>> Can someone tell me: Where exactly in the code is this happening?
>
> Nowhere.
>
> If you want replication, you need to set all the properties you listed
> in solrconfig.xml.
>
> I've done it recently, see 
> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html
>
>
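(For reference, the properties named above go into solrconfig.xml roughly like
this; a minimal sketch, with the host name and confFiles list made up:)

    <!-- master core -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- slave core -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://masterhost:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>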


Re: SolrCloud Feedback

2011-09-09 Thread Pulkit Singhal
Hello Jan,

You've made a very good point in (b). I would be happy to make the
edit to the wiki if I understood your explanation completely.

When you say that it is "looking up what collection that core is part
of" ... I'm curious: how does a core get put under a particular
collection in the first place, and what is that collection named?
Obviously you've made it clear that "collection1" is really the name of
the core itself. And where is this association stored for the code to
look it up?

If not Jan, then perhaps the gurus who wrote Solr Cloud could answer :)

Thanks!
- Pulkit

On Thu, Feb 10, 2011 at 9:10 AM, Jan Høydahl  wrote:
> Hi,
>
> I have so far just tested the examples and got a N by M cluster running. My 
> feedback:
>
> a) First of all, a major update of the SolrCloud Wiki is needed, to clearly 
> state what is in which version, what are current improvement plans and get 
> rid of outdated stuff. That said I think there are many good ideas there.
>
> b) The "collection" terminology is too much confused with "core", and should 
> probably be made more distinct. I just tried to configure two cores on the 
> same Solr instance into the same collection, and that worked fine, both as 
> distinct shards and as same shard (replica). The wiki examples give the 
> impression that "collection1" in 
> localhost:8983/solr/collection1/select?distrib=true is some magic collection 
> identifier, but what it really does is doing the query on the *core* named 
> "collection1", looking up what collection that core is part of and 
> distributing the query to all shards in that collection.
>
> c) ZK is not designed to store large files. While the files in conf are 
> normally well below the 1M limit ZK imposes, we should perhaps consider using 
> a lightweight distributed object or k/v store for holding the /CONFIGS and 
> let ZK store a reference only
>
> d) How are admins supposed to update configs in ZK? Install their favourite 
> ZK editor?
>
> e) We should perhaps not be so afraid to make ZK a requirement for Solr in 
> v4. Ideally you should interact with a 1-node Solr in the same manner as you 
> do with a 100-node Solr. An example is the Admin GUI where the "schema" and 
> "solrconfig" links assume local file. This requires decent tool support to 
> make ZK interaction intuitive, such as "import" and "export" commands.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 19. jan. 2011, at 21.07, Mark Miller wrote:
>
>> Hello Users,
>>
>> About a little over a year ago, a few of us started working on what we 
>> called SolrCloud.
>>
>> This initial bit of work was really a combination of laying some base work - 
>> figuring out how to integrate ZooKeeper with Solr in a limited way, dealing 
>> with some infrastructure - and picking off some low hanging search side 
>> fruit.
>>
>> The next step is the indexing side. And we plan on starting to tackle that 
>> sometime soon.
>>
>> But first - could you help with some feedback? Some people are using our 
>> SolrCloud start - I have seen evidence of it ;) Some, even in production.
>>
>> I would love to have your help in targeting what we now try and improve. Any 
>> suggestions or feedback? If you have sent this before, I/others likely 
>> missed it - send it again!
>>
>> I know anyone that has used SolrCloud has some feedback. I know it because 
>> I've used it too ;) It's too complicated to setup still. There are still 
>> plenty of pain points. We accepted some compromise trying to fit into what 
>> Solr was, and not wanting to dig in too far before feeling things out and 
>> letting users try things out a bit. Thinking that we might be able to adjust 
>> Solr to be more in favor of SolrCloud as we go, what is the ideal state of 
>> the work we have currently done?
>>
>> If anyone using SolrCloud helps with the feedback, I'll help with the coding 
>> effort.
>>
>> - Mark Miller
>> -- lucidimagination.com
>
>


Re: SolrCloud Feedback

2011-09-09 Thread Pulkit Singhal
I think I understand it a bit better now but wouldn't mind some validation.

1) solr.xml does not become part of ZooKeeper
2) The default looks like this out-of-box:
  <cores adminPath="/admin/cores">
    <core name="collection1" instanceDir="." />
  </cores>
so that may leave one wondering: where is the core's association to a
collection name made?

It can be made like so:
a) statically, in the solr.xml file:
<core name="collection1" instanceDir="." collection="myconf" />
b) at start time via java:
java ... -Dcollection.configName=myconf ... -jar start.jar

And I'm guessing that since the core's name ("collection1") for shard1
has already been associated with -Dcollection.configName=myconf in
http://wiki.apache.org/solr/SolrCloud#Example_A:_Simple_two_shard_cluster
once already, adding an additional shard2 with the same core name
("collection1") automatically throws it in with the collection name
("myconf"), without any need to specify anything at startup via -D or
statically in the solr.xml file.
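(For context, the wiki's Example A starts the two shard servers roughly like
this, which is where the "collection1" core first gets tied to the "myconf"
config name:)

    # first node: uploads ./solr/conf into ZK as "myconf" and runs embedded ZooKeeper
    java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -jar start.jar
    # second node: just points at the first node's ZooKeeper
    java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar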

Validate away; otherwise I'll just accept any hate mail after making
edits to the Solr wiki directly.

- Pulkit

On Fri, Sep 9, 2011 at 11:38 AM, Pulkit Singhal  wrote:
> Hello Jan,
>
> You've made a very good point in (b). I would be happy to make the
> edit to the wiki if I understood your explanation completely.
>
> When you say that it is "looking up what collection that core is part
> of" ... I'm curious: how does a core get put under a particular
> collection in the first place, and what is that collection named?
> Obviously you've made it clear that "collection1" is really the name of
> the core itself. And where is this association stored for the code to
> look it up?
>
> If not Jan, then perhaps the gurus who wrote Solr Cloud could answer :)
>
> Thanks!
> - Pulkit
>
> On Thu, Feb 10, 2011 at 9:10 AM, Jan Høydahl  wrote:
>> Hi,
>>
>> I have so far just tested the examples and got a N by M cluster running. My 
>> feedback:
>>
>> a) First of all, a major update of the SolrCloud Wiki is needed, to clearly 
>> state what is in which version, what are current improvement plans and get 
>> rid of outdated stuff. That said I think there are many good ideas there.
>>
>> b) The "collection" terminology is too much confused with "core", and should 
>> probably be made more distinct. I just tried to configure two cores on the 
>> same Solr instance into the same collection, and that worked fine, both as 
>> distinct shards and as same shard (replica). The wiki examples give the 
>> impression that "collection1" in 
>> localhost:8983/solr/collection1/select?distrib=true is some magic collection 
>> identifier, but what it really does is doing the query on the *core* named 
>> "collection1", looking up what collection that core is part of and 
>> distributing the query to all shards in that collection.
>>
>> c) ZK is not designed to store large files. While the files in conf are 
>> normally well below the 1M limit ZK imposes, we should perhaps consider 
>> using a lightweight distributed object or k/v store for holding the /CONFIGS 
>> and let ZK store a reference only
>>
>> d) How are admins supposed to update configs in ZK? Install their favourite 
>> ZK editor?
>>
>> e) We should perhaps not be so afraid to make ZK a requirement for Solr in 
>> v4. Ideally you should interact with a 1-node Solr in the same manner as you 
>> do with a 100-node Solr. An example is the Admin GUI where the "schema" and 
>> "solrconfig" links assume local file. This requires decent tool support to 
>> make ZK interaction intuitive, such as "import" and "export" commands.
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>>
>> On 19. jan. 2011, at 21.07, Mark Miller wrote:
>>
>>> Hello Users,
>>>
>>> About a little over a year ago, a few of us started working on what we 
>>> called SolrCloud.
>>>
>>> This initial bit of work was really a combination of laying some base work 
>>> - figuring out how to integrate ZooKeeper with Solr in a limited way, 
>>> dealing with some infrastructure - and picking off some low hanging search 
>>> side fruit.
>>>
>>> The next step is the indexing side. And we plan on starting to tackle that 
>>> sometime soon.
>>>
>>> But first - could you help with some feedback? Some people are using our 
>>> SolrCloud start - I have seen evidence of it ;) Some, even in production.
>>>
>>> I would love to have your help in targeting what we now try and improve. 
>>> Any suggestions or feedback? If you have sent this before, I/othe

Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Pulkit Singhal
Thanks Again.

Another question:

My solr.xml has:
  <cores adminPath="/admin/cores">
    <core name="collection1" instanceDir="." collection="myconf"/>
  </cores>

And I omitted -Dcollection.configName=myconf from the startup command
because I felt that specifying collection="myconf" should take care of
that:
cd /trunk/solr/example
java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar

But the zookeeper.jsp page doesn't seem to take any of that into
effect and shows:
/collections (v=6 children=1)
    collection1 (v=0 children=1) "configName=configuration1"
        shards (v=0 children=1)
            shard1 (v=0 children=1)
                tiklup-mac.local:8983_solr_ (v=0)
                "node_name=tiklup-mac.local:8983_solr
                 url=http://tiklup-mac.local:8983/solr/"

Then what is the point of naming the core and the collection?

- Pulkit

2011/9/9 Yury Kats :
> On 9/9/2011 10:52 AM, Pulkit Singhal wrote:
>> Thank You Yury. After looking at your thread, there's something I must
>> clarify: Is solr.xml not uploaded and held in ZooKeeper?
>
> Not as far as I understand. Cores are loaded/created by the local
> Solr server based on solr.xml and then registered with ZK, so that
> ZK knows what cores are out there and how they are organized in shards.
>
>
>> because you have a slightly different config between Node 1 & 2:
>> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html
>
>
> I have two shards, each shard having a master and a slave core.
> Cores are located so that master and slave are on different nodes.
> This protects search (but not indexing) from node failure.
>
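(Yury's layout from the linked thread is roughly the following; the core names
are only illustrative:)

    Node 1:  shard1 master core   +   shard2 slave core (pulls from Node 2)
    Node 2:  shard2 master core   +   shard1 slave core (pulls from Node 1)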


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-09 Thread Pulkit Singhal
I had forgotten to save the file. The collection name at least shows
up now, but the core name is still not used; is it simply decorative?

/collections (v=6 children=1)
    myconf (v=0 children=1) "configName=configuration1"
        shards (v=0 children=1)
            shard1 (v=0 children=1)
                tiklup-mac.local:8983_solr_ (v=0)
                "node_name=tiklup-mac.local:8983_solr
                 url=http://tiklup-mac.local:8983/solr/"

Thanks!
- Pulkit

On Fri, Sep 9, 2011 at 5:54 PM, Pulkit Singhal  wrote:
> Thanks Again.
>
> Another question:
>
> My solr.xml has:
>  
>    
>  
>
> And I omitted -Dcollection.configName=myconf from the startup command
> because I felt that specifying collection="myconf" should take care of
> that:
> cd /trunk/solr/example
> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar
>
> But the zookeeper.jsp page doesn't seem to take any of that into
> effect and shows:
>     /collections (v=6 children=1)
>          collection1 (v=0 children=1) "configName=configuration1"
>               shards (v=0 children=1)
>                    shard1 (v=0 children=1)
>                         tiklup-mac.local:8983_solr_ (v=0)
> "node_name=tiklup-mac.local:8983_solr
> url=http://tiklup-mac.local:8983/solr/";
>
> Then what is the point of naming the core and the collection?
>
> - Pulkit
>
> 2011/9/9 Yury Kats :
>> On 9/9/2011 10:52 AM, Pulkit Singhal wrote:
>>> Thank You Yury. After looking at your thread, there's something I must
>>> clarify: Is solr.xml not uploaded and held in ZooKeeper?
>>
>> Not as far as I understand. Cores are loaded/created by the local
>> Solr server based on solr.xml and then registered with ZK, so that
>> ZK knows what cores are out there and how they are organized in shards.
>>
>>
>>> because you have a slightly different config between Node 1 & 2:
>>> http://lucene.472066.n3.nabble.com/Replication-setup-with-SolrCloud-Zk-td2952602.html
>>
>>
>> I have two shards, each shard having a master and a slave core.
>> Cores are located so that master and slave are on different nodes.
>> This protects search (but not indexing) from node failure.
>>
>


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-10 Thread Pulkit Singhal
First of all, thanks everyone; your expertise and time are much appreciated.

@Jamie:
Great suggestion, I just have one small objection to it ... I wouldn't
want to mix the core's name with the collection's configName. Wouldn't
you also want to keep the two separate for clarity? What do you think
about that?
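(A sketch of the kind of property-driven solr.xml Jamie is describing, relying
on Solr's ${property:default} substitution; the property names are made up:)

    <cores adminPath="/admin/cores">
      <core name="${core.name:collection1}" instanceDir="."
            collection="${collection.name:myconf}"/>
    </cores>

    java -Dcore.name=collection1 -Dcollection.name=myconf ... -jar start.jar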

@Yury:
Overall what you said makes sense and I'll roll with it. But FYI,
through experimentation I found out that collection="myconf" does not
become the value for configName when I inspect zookeeper.jsp. Here's
an example of what shows up if I set up the solr.xml file but don't say
anything in the command-line startup:

myconf (v=0 children=1) "configName=configuration1"

But perhaps that's exactly what you are trying to warn me about. I'll
experiment more and get back.

- Pulkit

On Fri, Sep 9, 2011 at 10:17 PM, Jamie Johnson  wrote:
> as a note you could change out the values in solr.xml to be as follows
> and pull these values from System Properties.
>
>  
>    
>  
>
> unless someone says otherwise, but the quick tests I've run seem to
> work perfectly well with this setup.
>
> 2011/9/9 Yury Kats :
>> On 9/9/2011 6:54 PM, Pulkit Singhal wrote:
>>> Thanks Again.
>>>
>>> Another question:
>>>
>>> My solr.xml has:
>>>   
>>>     >> collection="myconf"/>
>>>   
>>>
>>> And I omitted -Dcollection.configName=myconf from the startup command
>>> because I felt that specifying collection="myconf" should take care of
>>> that:
>>> cd /trunk/solr/example
>>> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar start.jar
>>
>> With this you are telling ZK to bootstrap a collection with content of 
>> specific
>> files, but you don't tell what collection that should be.
>>
>> Hence you want collection.configName parameter, and you want
>> solr.xml to reference the same name in 'collection' attribute for the cores,
>> so that SolrCloud knows where to pull configuration for that core from.
>>
>>
>>
>


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-10 Thread Pulkit Singhal
Yes now I'm sure that
a) collection="blah" in solr.xml, and
b) -Dcollection.configName="myconf" at cmd line
actually fill in values for two very different fields.

Here's why I say so:

Example config # 1:


Results in:
/collections (v=6 children=1)
  scaleDeep (v=0 children=1) "configName=myconf"

Example config # 2:
Results in:
/collections (v=6 children=1)
  scaleDeep (v=0 children=1) "configName=scaleDeep"

What do you think about that? I may be mis-interpreting the results so
please, please feel free to set me straight on this.

Also it would be nice if I knew the code well enough to just look @ it
and give an authoritative answer. Does anyone have that kind of
expertise? Reverse-engineering is getting a bit mundane.

Thanks!
- Pulkit

On Sat, Sep 10, 2011 at 11:43 AM, Pulkit Singhal
 wrote:
> 1s of all, thanks everyone, your expertise and time is much appreciated.
>
> @Jamie:
> Great suggestion, I just have one small objection to it ... I wouldn't
> want to mix the core's name with the collection's configName. Wouldn't
> you also want to keep the two separate for clarity? What do you think
> about that?
>
> @Yury:
> Overall what you said makes sense and I'll roll with it. But FYI,
> through experimentation I found out that collection="myconf" does not
> become the value for configName when I inspect ZooKeeper.jsp, here's
> an example of what shows up if I setup the solr.xml file but don't say
> anything in the cmd line startup:
>
> myconf (v=0 children=1) "configName=configuration1"
>
> But perhaps that's exactly what you are trying to warn me about. I'll
> experiment more and get back.
>
> - Pulkit
>
> On Fri, Sep 9, 2011 at 10:17 PM, Jamie Johnson  wrote:
>> as a note you could change out the values in solr.xml to be as follows
>> and pull these values from System Properties.
>>
>>  
>>    
>>  
>>
>> unless someone says otherwise, but the quick tests I've run seem to
>> work perfectly well with this setup.
>>
>> 2011/9/9 Yury Kats :
>>> On 9/9/2011 6:54 PM, Pulkit Singhal wrote:
>>>> Thanks Again.
>>>>
>>>> Another question:
>>>>
>>>> My solr.xml has:
>>>>   
>>>>     >>> collection="myconf"/>
>>>>   
>>>>
>>>> And I omitted -Dcollection.configName=myconf from the startup command
>>>> because I felt that specifying collection="myconf" should take care of
>>>> that:
>>>> cd /trunk/solr/example
>>>> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar 
>>>> start.jar
>>>
>>> With this you are telling ZK to bootstrap a collection with content of 
>>> specific
>>> files, but you don't tell what collection that should be.
>>>
>>> Hence you want collection.configName parameter, and you want
>>> solr.xml to reference the same name in 'collection' attribute for the cores,
>>> so that SolrCloud knows where to pull configuration for that core from.
>>>
>>>
>>>
>>
>


Re: Solr Cloud - is replication really a feature on the trunk?

2011-09-10 Thread Pulkit Singhal
Sorry, a message got sent without me finishing it up; ctrl+s is not save but
send ... sigh!

Yes now I'm sure that
a) collection="blah" in solr.xml, and
b) -Dcollection.configName="myconf" at cmd line
actually fill in values for two very different fields.

Here's why I say so:

Example config # 1:

solr.xml:  <core name="..." instanceDir="..." collection="scaleDeep"/>
java -Dcollection.configName=*myconf* ... -DzkRun -jar start.jar

Results in:
/collections (v=6 children=1)
*scaleDeep* (v=0 children=1) "configName=*myconf*"

Example config # 2:

solr.xml:  <core name="..." instanceDir="..." collection="scaleDeep"/>
java -Dcollection.configName=*scaleDeep* ... -DzkRun -jar start.jar

Results in:
/collections (v=6 children=1)
*scaleDeep* (v=0 children=1) "configName=*scaleDeep*"

What do you think about that? I may be mis-interpreting the results so
please, please feel free to set me straight on this.

Also it would be nice if I knew the code well enough to just look @ it
and give an authoritative answer. Does anyone have that kind of
expertise? Reverse-engineering is getting a bit mundane.

Thanks!
- Pulkit

> On Sat, Sep 10, 2011 at 11:43 AM, Pulkit Singhal
>  wrote:
>> 1s of all, thanks everyone, your expertise and time is much appreciated.
>>
>> @Jamie:
>> Great suggestion, I just have one small objection to it ... I wouldn't
>> want to mix the core's name with the collection's configName. Wouldn't
>> you also want to keep the two separate for clarity? What do you think
>> about that?
>>
>> @Yury:
>> Overall what you said makes sense and I'll roll with it. But FYI,
>> through experimentation I found out that collection="myconf" does not
>> become the value for configName when I inspect ZooKeeper.jsp, here's
>> an example of what shows up if I setup the solr.xml file but don't say
>> anything in the cmd line startup:
>>
>> myconf (v=0 children=1) "configName=configuration1"
>>
>> But perhaps that's exactly what you are trying to warn me about. I'll
>> experiment more and get back.
>>
>> - Pulkit
>>
>> On Fri, Sep 9, 2011 at 10:17 PM, Jamie Johnson  wrote:
>>> as a note you could change out the values in solr.xml to be as follows
>>> and pull these values from System Properties.
>>>
>>>  
>>>
>>>  
>>>
>>> unless someone says otherwise, but the quick tests I've run seem to
>>> work perfectly well with this setup.
>>>
>>> 2011/9/9 Yury Kats :
>>>> On 9/9/2011 6:54 PM, Pulkit Singhal wrote:
>>>>> Thanks Again.
>>>>>
>>>>> Another question:
>>>>>
>>>>> My solr.xml has:
>>>>>   
>>>>> 
>>>>>   
>>>>>
>>>>> And I omitted -Dcollection.configName=myconf from the startup command
>>>>> because I felt that specifying collection="myconf" should take care of
>>>>> that:
>>>>> cd /trunk/solr/example
>>>>> java -Dbootstrap_confdir=./solr/conf -Dslave=disabled -DzkRun -jar
start.jar
>>>>
>>>> With this you are telling ZK to bootstrap a collection with content of
specific
>>>> files, but you don't tell what collection that should be.
>>>>
>>>> Hence you want collection.configName parameter, and you want
>>>> solr.xml to reference the same name in 'collection' attribute for the
cores,
>>>> so that SolrCloud knows where to pull configuration for that core from.
>>>>
>>>>
>>>>
>>>
>>
>


Re: Replication setup with SolrCloud/Zk

2011-09-10 Thread Pulkit Singhal
Hi Yury,

How do you manage to start the instances without any issues? The way I see
it, no matter which instance is started first, the slave will complain about
not being able to find its respective master because that instance hasn't been
started yet ... no?

Thanks,
- Pulkit

2011/5/17 Yury Kats 

> On 5/17/2011 10:17 AM, Stefan Matheis wrote:
> > Yury,
> >
> > perhaps Java-Pararms (like used for this sample:
> >
> http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node
> )
> > can help you?
>
> Ah, thanks! It does seem to work!
>
> Cluster's solrconfig.xml (shared between all Solr instances and cores via
> SolrCloud/ZK):
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="master">
>     <str name="enable">${enable.master:false}</str>
>     <str name="replicateAfter">commit</str>
>     <str name="replicateAfter">startup</str>
>   </lst>
>   <lst name="slave">
>     <str name="enable">${enable.slave:false}</str>
>     <str name="pollInterval">00:01:00</str>
>     <str name="masterUrl">http://${masterHost:xyz}/solr/master/replication</str>
>   </lst>
> </requestHandler>
>
> Node 1 solr.xml:
>  
> collection="myconf" >
>  
>
> collection="myconf">
>  
>  
>
>  
>
> Node 2 solr.xml:
>  
> collection="myconf" >
>  
>
> collection="myconf">
>  
>  
>
>  
>
>


Re: Replication setup with SolrCloud/Zk

2011-09-10 Thread Pulkit Singhal
Sorry, stupid question, now I see that the core still starts and the polling
process simply logs an error:

SEVERE: Master at: http://localhost:7574/solr/master2/replication is not
available.
Index fetch failed. Exception: Connection refused

I was able to write up the setup instructions in detail with this thread's help, here:
http://pulkitsinghal.blogspot.com/2011/09/multicore-master-slave-replication-in.html

Thanks,
- Pulkit

On Sat, Sep 10, 2011 at 2:54 PM, Pulkit Singhal wrote:

> Hi Yury,
>
> How do you manage to start the instances without any issues? The way I see
> it, no matter which instance is started first, the slave will complain about
> not being able to find its respective master because that instance hasn't been
> started yet ... no?
>
> Thanks,
> - Pulkit
>
> 2011/5/17 Yury Kats 
>
>> On 5/17/2011 10:17 AM, Stefan Matheis wrote:
>> > Yury,
>> >
>> > perhaps Java-Pararms (like used for this sample:
>> >
>> http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node
>> )
>> > can help you?
>>
>> Ah, thanks! It does seem to work!
>>
>> Cluster's solrconfig.xml (shared between all Solr instances and cores via
>> SolrCloud/ZK):
>> <requestHandler name="/replication" class="solr.ReplicationHandler">
>>   <lst name="master">
>>     <str name="enable">${enable.master:false}</str>
>>     <str name="replicateAfter">commit</str>
>>     <str name="replicateAfter">startup</str>
>>   </lst>
>>   <lst name="slave">
>>     <str name="enable">${enable.slave:false}</str>
>>     <str name="pollInterval">00:01:00</str>
>>     <str name="masterUrl">http://${masterHost:xyz}/solr/master/replication</str>
>>   </lst>
>> </requestHandler>
>>
>> Node 1 solr.xml:
>>  
>>> collection="myconf" >
>>  
>>
>>> collection="myconf">
>>  
>>  
>>
>>  
>>
>> Node 2 solr.xml:
>>  
>>> collection="myconf" >
>>  
>>
>>> collection="myconf">
>>  
>>  
>>
>>  
>>
>>
>


Re: Example Solr Config on EC2

2011-09-11 Thread Pulkit Singhal
Just to clarify, that link doesn't do anything to promote an already running
slave into a master. One would have to bounce the Solr node which has that
slave and then make the shift; it is not something that happens live at
runtime.

On Wed, Aug 10, 2011 at 4:04 PM, Akshay  wrote:

> Yes you can promote a slave to be master refer
>
> http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node
>
> In AWS one can use an elastic IP(http://aws.amazon.com/articles/1346) to
> refer to the master and this can be assigned to slaves as they assume the
> role of master(in case of failure). All slaves will then refer to this new
> master and there will be no need to regenerate data.
>
> Automation of this maybe possible through CloudWatch alarm-actions. I don't
> know of any available example automation scripts.
>
> Cheers
> Akshay.
>
> On Wed, Aug 10, 2011 at 9:08 PM, Matt Shields 
> wrote:
>
> > If I were to build a master with multiple slaves, is it possible to
> promote
> > a slave to be the new master if the original master fails?  Will all the
> > slaves pickup right where they left off, or any time the master fails
> will
> > we need to completely regenerate all the data?
> >
> > If this is possible, are there any examples of this being automated?
> >  Especially on Win2k3.
> >
> > Matthew Shields
> > Owner
> > BeanTown Host - Web Hosting, Domain Names, Dedicated Servers, Colocation,
> > Managed Services
> > www.beantownhost.com
> > www.sysadminvalley.com
> > www.jeeprally.com
> >
> >
> >
> > On Mon, Aug 8, 2011 at 5:34 PM,  wrote:
> >
> > > Matthew,
> > >
> > > Here's another resource:
> > >
> > >
> >
> http://www.lucidimagination.com/blog/2010/02/01/solr-shines-through-the-cloud-lucidworks-solr-on-ec2/
> > >
> > >
> > > Michael Bohlig
> > > Lucid Imagination
> > >
> > >
> > >
> > > - Original Message 
> > > From: Matt Shields 
> > > To: solr-user@lucene.apache.org
> > > Sent: Mon, August 8, 2011 2:03:20 PM
> > > Subject: Example Solr Config on EC2
> > >
> > > I'm looking for some examples of how to setup Solr on EC2.  The
> > > configuration I'm looking for would have multiple nodes for redundancy.
> > > I've tested in-house with a single master and slave with replication
> > > running in Tomcat on Windows Server 2003, but even if I have multiple
> > > slaves
> > > the single master is a single point of failure.  Any suggestions or
> > example
> > > configurations?  The project I'm working on is a .NET setup, so ideally
> > I'd
> > > like to keep this search cluster on Windows Server, even though I
> prefer
> > > Linux.
> > >
> > > Matthew Shields
> > > Owner
> > > BeanTown Host - Web Hosting, Domain Names, Dedicated Servers,
> Colocation,
> > > Managed Services
> > > www.beantownhost.com
> > > www.sysadminvalley.com
> > > www.jeeprally.com
> > >
> > >
> >
>


How to combine RSS w/ Tika when using Data Import Handler (DIH)

2011-09-12 Thread Pulkit Singhal
Given an RSS raw feed source link such as the following:
http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn

I can easily get to the value of the description for an item like so:
<field column="description" xpath="/rss/item/description" />
But the content of "description" happens to be in HTML and sadly it is this
HTML chunk that has some pretty decent information that I would like to
import as well.
1) For example it has the image for the item:
<img src="http://ecx.images-amazon.com/images/I/51yyAAoYzKL._SL160_SS160_.jpg" ... />
2) It has the price for the item:
$13.99
And many other useful pieces of data that aren't in a proper rss format but
they are simply thrown together inside the html chunk that is served as the
value for the xpath="/rss/item/description"

So, how can I configure DIH to start importing this html information as
well?
Is Tika the way to go?
Can someone give a brief example of what a config file with both Tika config
and RSS config would/should look like?

Thanks!
- Pulkit


Re: Parameter not working for master/slave

2011-09-12 Thread Pulkit Singhal
Hello Bill,

I can't really answer your question about replication being supported on
Solr 3.3 (I use trunk 4.x myself) BUT I can tell you that if each Solr node
has just one core ... only then does it make sense to use
-Denable.master=true and -Denable.slave=true ... otherwise, as Yury points
out, you should use solr.xml to pass in the value for each core
individually.

What is a node, you ask? To me it means one app server (Jetty) running Solr
... doesn't matter if it's multiple ones on the same machine or single ones
on different machines. That's what I mean by a node here.

2011/9/12 Yury Kats 

> On 9/11/2011 11:24 PM, William Bell wrote:
> > I am using 3.3 SOLR. I tried passing in -Denable.master=true and
> > -Denable.slave=true on the Slave machine.
> > Then I changed solrconfig.xml to reference each as per:
> >
> >
> http://wiki.apache.org/solr/SolrReplication#enable.2BAC8-disable_master.2BAC8-slave_in_a_node
>
> These are core parameters, you need to set them in solr.xml per core.
>
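(Per core, that looks roughly like the following in solr.xml; core names and
instanceDirs here are only illustrative:)

    <cores adminPath="/admin/cores">
      <core name="master1" instanceDir="master1">
        <property name="enable.master" value="true" />
        <property name="enable.slave" value="false" />
      </core>
      <core name="slave1" instanceDir="slave1">
        <property name="enable.master" value="false" />
        <property name="enable.slave" value="true" />
      </core>
    </cores>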


Re: Re; DIH Scheduling

2011-09-12 Thread Pulkit Singhal
I don't see anywhere in
http://issues.apache.org/jira/browse/SOLR-2305
any statement showing that the code's inclusion was "decided against".
When did this happen, and what is needed from the community before
someone with the powers to do so will actually commit this?

2011/6/24 Noble Paul നോബിള്‍ नोब्ळ् 

> On Thu, Jun 23, 2011 at 9:13 PM, simon  wrote:
> > The Wiki page describes a design for a scheduler, which has not been
> > committed to Solr yet (I checked). I did see a patch the other day
> > (see https://issues.apache.org/jira/browse/SOLR-2305) but it didn't
> > look well tested.
> >
> > I think that you're basically stuck with something like cron at this
> > time. If your application is written in java, take a look at the
> > Quartz scheduler - http://www.quartz-scheduler.org/
>
> It was considered and decided against.
> >
> > -Simon
> >
>
>
>
> --
> -
> Noble Paul
>


Re: How to combine RSS w/ Tika when using Data Import Handler (DIH)

2011-09-13 Thread Pulkit Singhal
Hello Everyone,

I've been investigating and I understand that using the RegexTransformer is
one available option for identifying and extracting data into multiple
fields from a single RSS value source ... But rather than hack together
something, I once again wanted to check with the community: Is there another
option for navigating the HTML DOM tree using some well-tested transformer
or Tika or something?

Thanks!
- Pulkit

On Mon, Sep 12, 2011 at 1:45 PM, Pulkit Singhal wrote:

> Given an RSS raw feed source link such as the following:
>
> http://persistent.info/cgi-bin/feed-proxy?url=http%3A%2F%2Fwww.amazon.com%2Frss%2Ftag%2Fblu-ray%2Fnew%2Fref%3Dtag_rsh_hl_ersn
>
> I can easily get to the value of the description for an item like so:
> 
>
> But the content of "description" happens to be in HTML and sadly it is this
> HTML chunk that has some pretty decent information that I would like to
> import as well.
> 1) For example it has the image for the item:
> http://ecx.images-amazon.com/images/I/51yyAAoYzKL._SL160_SS160_.jpg"; ...
> />
> 2) It has the price for the item:
> $13.99
> And many other useful pieces of data that aren't in a proper rss format but
> they are simply thrown together inside the html chunk that is served as the
> value for the xpath="/rss/item/description"
>
> So, how can I configure DIH to start importing this html information as
> well?
> Is Tika the way to go?
> Can someone give a brief example of what a config file with both Tika
> config and RSS config would/should look like?
>
> Thanks!
> - Pulkit
>


Re: DIH load only selected documents with XPathEntityProcessor

2011-09-13 Thread Pulkit Singhal
This solution doesn't seem to be working for me.

I am using Solr trunk and I have the same question as Bernd with a small
twist: the field that should NOT be empty, happens to be a derived field
called price, see the config below:

<entity name="..." url="..."
        processor="XPathEntityProcessor"
        transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
    <field column="description"
           xpath="/rss/channel/item/description" />
    <field column="price"
           regex=".*\$(\d*.\d*)"
           sourceColName="description" />
    ...
</entity>

I have also changed the sample script to check the price field instead of
the link field that was being used as an example in this thread earlier:

<script><![CDATA[
function skipRow(row) {
    var price = row.get( 'price' );
    if ( price == null || price == '' ) {
        row.put( '$skipRow', 'true' );
    }
    return row;
}
]]></script>

Does anyone have any thoughts on what I'm missing?
Thanks!
- Pulkit

On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling <
bernd.fehl...@uni-bielefeld.de> wrote:

> Hi Gora,
>
> thanks a lot, very nice solution, works perfectly.
> I will dig more into ScriptTransformer, seems to be very powerful.
>
> Regards,
> Bernd
>
> Am 08.01.2011 14:38, schrieb Gora Mohanty:
> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
> >  wrote:
> >> Hello list,
> >>
> >> is it possible to load only selected documents with
> XPathEntityProcessor?
> >> While loading docs I want to drop/skip/ignore documents with missing
> URL.
> >>
> >> Example:
> >> 
> >>
> >>first title
> >>identifier_01
> >>http://www.foo.com/path/bar.html
> >>
> >>
> >>second title
> >>identifier_02
> >>
> >>
> >> 
> >>
> >> The first document should be loaded, the second document should be
> ignored
> >> because it has an empty link (should also work for missing link field).
> > [...]
> >
> > You can use a ScriptTransformer, along with $skipRow/$skipDoc.
> > E.g., something like this for your data import configuration file:
> >
> > 
> >  >   function skipRow(row) {
> > var link = row.get( 'link' );
> > if( link == null || link == '' ) {
> >   row.put( '$skipRow', 'true' );
> > }
> > return row;
> >   }
> > ]]>
> > 
> > 
> >  > baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
> > recursive="true" rootEntity="false" dataSource="null">
> >  > forEach="/documents/document" url="${f.fileAbsolutePath}"
> > transformer="script:skipRow">
> >
> >
> >
> > 
> > 
> > 
> > 
> >
> > Regards,
> > Gora
>


Re: DIH load only selected documents with XPathEntityProcessor

2011-09-13 Thread Pulkit Singhal
Oh and I'm sure that I'm using Java 6 because the properties from the Solr
webpage spit out:

java.runtime.version = 1.6.0_26-b03-384-10M3425


On Tue, Sep 13, 2011 at 4:15 PM, Pulkit Singhal wrote:

> This solution doesn't seem to be working for me.
>
> I am using Solr trunk and I have the same question as Bernd with a small
> twist: the field that should NOT be empty, happens to be a derived field
> called price, see the config below:
>
>transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,
> script:skipRow">
>
>xpath="/rss/channel/item/description"
>   />
>
>   regex=".*\$(\d*.\d*)"
>  sourceColName="description"
>  />
> ...
> 
>
> I have also changed the sample script to check the price field isntead of
> the link field that was being used as an example in this thread earlier:
>
>
> 
> <![CDATA[
> function skipRow(row) {
> var price = row.get( 'price' );
> if ( price == null || price == '' ) {
>
> row.put( '$skipRow', 'true' );
> }
> return row;
> }
> ]]>
> 
>
> Does anyone have any thoughts on what I'm missing?
> Thanks!
> - Pulkit
>
>
> On Mon, Jan 10, 2011 at 3:06 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
>
>> Hi Gora,
>>
>> thanks a lot, very nice solution, works perfectly.
>> I will dig more into ScriptTransformer, seems to be very powerful.
>>
>> Regards,
>> Bernd
>>
>> Am 08.01.2011 14:38, schrieb Gora Mohanty:
>> > On Fri, Jan 7, 2011 at 12:30 PM, Bernd Fehling
>> >  wrote:
>> >> Hello list,
>> >>
>> >> is it possible to load only selected documents with
>> XPathEntityProcessor?
>> >> While loading docs I want to drop/skip/ignore documents with missing
>> URL.
>> >>
>> >> Example:
>> >> 
>> >>
>> >>first title
>> >>identifier_01
>> >>http://www.foo.com/path/bar.html
>> >>
>> >>
>> >>second title
>> >>identifier_02
>> >>
>> >>
>> >> 
>> >>
>> >> The first document should be loaded, the second document should be
>> ignored
>> >> because it has an empty link (should also work for missing link field).
>> > [...]
>> >
>> > You can use a ScriptTransformer, along with $skipRow/$skipDoc.
>> > E.g., something like this for your data import configuration file:
>> >
>> > 
>> > <![CDATA[
>> >   function skipRow(row) {
>> > var link = row.get( 'link' );
>> > if( link == null || link == '' ) {
>> >   row.put( '$skipRow', 'true' );
>> > }
>> > return row;
>> >   }
>> > ]]>
>> > 
>> > 
>> > > > baseDir="/home/gora/test" fileName=".*xml" newerThan="'NOW-3DAYS'"
>> > recursive="true" rootEntity="false" dataSource="null">
>> > > > forEach="/documents/document" url="${f.fileAbsolutePath}"
>> > transformer="script:skipRow">
>> >
>> >
>> >
>> > 
>> > 
>> > 
>> > 
>> >
>> > Regards,
>> > Gora
>>
>
>


DIH skipping imports with skipDoc vs skipRow

2011-09-13 Thread Pulkit Singhal
Hello,

1)  The documented explanation of skipDoc and skipRow is not enough
for me to discern the difference between them:
$skipDoc : Skip the current document . Do not add it to Solr. The
value can be String true/false
$skipRow : Skip the current row. The document will be added with rows
from other entities. The value can be String true/false
Can someone please elaborate and help me out with an example?
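To make the question concrete, here is a sketch of what I think the difference
is (function and field names are made up): with several entities contributing
rows to one document, $skipRow should drop only the current entity's row while
the document still gets built from the other entities, whereas $skipDoc should
throw the whole document away.

    <script><![CDATA[
    function checkRow(row) {
        // leave out just this entity's columns; the doc is still indexed
        if (row.get('price') == null) row.put('$skipRow', 'true');
        // drop the entire document from the import
        if (row.get('link') == null) row.put('$skipDoc', 'true');
        return row;
    }
    ]]></script>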

2) I am working off the Solr trunk (4.x) and nothing I do seems to
make the import for a given row/doc get skipped.
As proof I've added these tests to my data import xml and all the rows
are still getting indexed!!!
If anyone sees something wrong with my config please tell me.
Make sure to take note of the blatant use of row.put( '$skipDoc',
'true' ); and 
Yet stuff still gets imported, this is beyond me. Need a fresh pair of eyes :)







<entity ... url="http://www.amazon.com/gp/rss/new-releases/apparel/1040660/ref=zg_bsnr_1040660_rsslink"
        processor="XPathEntityProcessor"
        forEach="/rss/channel | /rss/channel/item"
        transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow,TemplateTransformer">









Thanks!
- Pulkit


RegexTransformer - need help with regex value

2011-09-14 Thread Pulkit Singhal
Hello,

Feel free to point me to alternate sources of information if you deem
this question unworthy of the Solr list :)

But until then please hear me out!

When my config is something like:
<field column="imageUrl"
       regex=".*img src=.(.*)\.gif..alt=.*"
       sourceColName="description" />
I don't get any data.

But when my config is like:
<field column="imageUrl"
       regex=".*img src=.(.*)..alt=.*"
       sourceColName="description" />
I get the following data as the value for imageUrl:
http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_.gif";
width="64"

As the result shows, this is a string that should be able to match
even on the 1st regex=".*img src=.(.*)\.gif..alt=.*" and produce a
result like:
http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_
But it doesn't!
Can anyone tell me why that would be the case?
Is it something about the way RegexTransformer is wired or is it just
my regex value that isn't right?


Re: RegexTransformer - need help with regex value

2011-09-14 Thread Pulkit Singhal
Thanks a bunch, got it working with a reluctant qualifier and the use
of &quot; as the escaped representation of double quotes within the
regex value so that the config file doesn't crash & burn:
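(Something along these lines, with a reluctant (.*?) group and &quot; standing
in for the quotes around the src value; treat the exact pattern as a sketch
rather than the verbatim entry:)

    <field column="imageUrl"
           regex=".*img src=&quot;(.*?)\.gif&quot;.*"
           sourceColName="description" />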



Cheers,
- Pulkit

On Wed, Sep 14, 2011 at 2:24 PM, Pulkit Singhal  wrote:
> Hello,
>
> Feel free to point me to alternate sources of information if you deem
> this question unworthy of the Solr list :)
>
> But until then please hear me out!
>
> When my config is something like:
>                               regex=".*img src=.(.*)\.gif..alt=.*"
>                   sourceColName="description"
>                   />
> I don't get any data.
>
> But when my config is like:
>                               regex=".*img src=.(.*)..alt=.*"
>                   sourceColName="description"
>                   />
> I get the following data as the value for imageUrl:
> http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_.gif";
> width="64"
>
> As the result shows, this is a string that should be able to match
> even on the 1st regex=".*img src=.(.*)\.gif..alt=.*" and produce a
> result like:
> http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/stars-5-0._V192240867_
> But it doesn't!
> Can anyone tell me why that would be the case?
> Is it something about the way RegexTransformer is wired or is it just
> my regex value that isn't right?
>


[DIH] How to combine Regex and HTML transformers

2011-09-15 Thread Pulkit Singhal
Hello,

I need to pull out the price and imageURL for products in an Amazon RSS feed.

PROBLEM STATEMENT:
The following:



works but I am left with html junk inside the description!

USELESS WORKAROUND:
If I try to strip the html from the data being fed into description
while letting the price and imageURL know of the direct path of the
RSS feed field like so:



then this fails and only the last configured field in this list
(imageURL) ends up having any data imported.
Is this a bug?

CRUX OF THE PROBLEM:
Also I tried to then create a field just to store the raw html data
like so but this configuration yields no results for the description
field so I'm back to where I started:




I was suspicious of trying to combine sourceColName with stripHTML to
begin with ... I suppose I was hoping that the RegexTransformer would run
first and copy all the HTML data as-is, which would then be stripped out
later by the HTMLStripTransformer, but this didn't work. Why? What else
can I do?

Thanks!
- Pulkit


Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Hello Everyone,

I have a goal of populating Solr with a million unique products in
order to create a test environment for a proof of concept. I started
out by using DIH with Amazon RSS feeds but I've quickly realized that
there's no way I can glean a million products from one RSS feed. And
I'd go mad if I just sat at my computer all day looking for feeds and
punching them into DIH config for Solr.

Has anyone ever had to create large mock/dummy datasets for test
environments or for POCs/Demos to convince folks that Solr was the
wave of the future? Any tips would be greatly appreciated. I suppose
it sounds a lot like crawling even though it started out as innocent
DIH usage.

- Pulkit


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Ah missing } doh!

BTW I still welcome any ideas on how to build an e-commerce test base.
It doesn't have to be amazon that was jsut my approach, any one?

- Pulkit

On Thu, Sep 15, 2011 at 8:52 PM, Pulkit Singhal  wrote:
> Thanks for all the feedback thus far. Now to get  little technical about it :)
>
> I was thinking of feeding a file with all the tags of amazon that
> yield close to roughly 5 results each into a file and then running
> my rss DIH off of that, I came up with the following config but
> something is amiss, can someone please point out what is off about
> this?
>
>    <entity name="amazonFeeds"
>                processor="LineEntityProcessor"
>                url="file:///xxx/yyy/zzz/amazonfeeds.txt"
>                rootEntity="false"
>                dataSource="myURIreader1"
>                transformer="RegexTransformer,DateFormatTransformer"
>                >
>        <entity name="..." pk="link"
>                    url="${amazonFeeds.rawLine"
>                    processor="XPathEntityProcessor"
>                    forEach="/rss/channel | /rss/channel/item"
>                    transformer="RegexTransformer,HTMLStripTransformer,DateFormatTransformer,script:skipRow">
> ...
>
> The rawline should feed into the url key but instead i get:
>
> Caused by: java.net.MalformedURLException: no protocol:
> null${amazonFeeds.rawLine
>        at 
> org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
>
> Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
> SEVERE: Exception while solr rollback.
>
> Thanks in advance!
>
> On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
>  wrote:
>> If we want to test with huge amounts of data we feed portions of the 
>> internet.
>> The problem is it takes a lot of bandwith and lots of computing power to get
>> to a `reasonable` size. On the positive side, you deal with real text so it's
>> easier to tune for relevance.
>>
>> I think it's easier to create a simple XML generator with mock data, prices,
>> popularity rates etc. It's fast to generate millions of mock products and 
>> once
>> you have a large quantity of XML files, you can easily index, test, change
>> config or schema and reindex.
>>
>> On the other hand, the sample data that comes with the Solr example is a good
>> set as well as it proves the concepts well, especially with the stock 
>> Velocity
>> templates.
>>
>> We know Solr will handle enormous sets but quantity is not always a part of a
>> PoC.
>>
>>> Hello Everyone,
>>>
>>> I have a goal of populating Solr with a million unique products in
>>> order to create a test environment for a proof of concept. I started
>>> out by using DIH with Amazon RSS feeds but I've quickly realized that
>>> there's no way I can glean a million products from one RSS feed. And
>>> I'd go mad if I just sat at my computer all day looking for feeds and
>>> punching them into DIH config for Solr.
>>>
>>> Has anyone ever had to create large mock/dummy datasets for test
>>> environments or for POCs/Demos to convince folks that Solr was the
>>> wave of the future? Any tips would be greatly appreciated. I suppose
>>> it sounds a lot like crawling even though it started out as innocent
>>> DIH usage.
>>>
>>> - Pulkit
>>
>


Re: Generating large datasets for Solr proof-of-concept

2011-09-15 Thread Pulkit Singhal
Thanks for all the feedback thus far. Now to get a little technical about it :)

I was thinking of putting all the Amazon tags that yield roughly 5
results each into a file and then running
my RSS DIH off of that. I came up with the following config but
something is amiss; can someone please point out what is off about
this?




...

The rawLine should feed into the url key, but instead I get:

Caused by: java.net.MalformedURLException: no protocol:
null${amazonFeeds.rawLine
at 
org.apache.solr.handler.dataimport.URLDataSource.getData(URLDataSource.java:90)

Sep 15, 2011 8:48:01 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback

Sep 15, 2011 8:48:01 PM org.apache.solr.handler.dataimport.SolrWriter rollback
SEVERE: Exception while solr rollback.

Thanks in advance!

On Thu, Sep 15, 2011 at 4:12 PM, Markus Jelsma
 wrote:
> If we want to test with huge amounts of data we feed portions of the internet.
> The problem is it takes a lot of bandwith and lots of computing power to get
> to a `reasonable` size. On the positive side, you deal with real text so it's
> easier to tune for relevance.
>
> I think it's easier to create a simple XML generator with mock data, prices,
> popularity rates etc. It's fast to generate millions of mock products and once
> you have a large quantity of XML files, you can easily index, test, change
> config or schema and reindex.
>
> On the other hand, the sample data that comes with the Solr example is a good
> set as well as it proves the concepts well, especially with the stock Velocity
> templates.
>
> We know Solr will handle enormous sets but quantity is not always a part of a
> PoC.
>
>> Hello Everyone,
>>
>> I have a goal of populating Solr with a million unique products in
>> order to create a test environment for a proof of concept. I started
>> out by using DIH with Amazon RSS feeds but I've quickly realized that
>> there's no way I can glean a million products from one RSS feed. And
>> I'd go mad if I just sat at my computer all day looking for feeds and
>> punching them into DIH config for Solr.
>>
>> Has anyone ever had to create large mock/dummy datasets for test
>> environments or for POCs/Demos to convince folks that Solr was the
>> wave of the future? Any tips would be greatly appreciated. I suppose
>> it sounds a lot like crawling even though it started out as innocent
>> DIH usage.
>>
>> - Pulkit
>


How to set up the schema to avoid NumberFormatException

2011-09-16 Thread Pulkit Singhal
Hello Folks,

Surprisingly, the value from the following raw data gives me a NFE
(Number Format Exception) when running the DIH (Data Import Handler):
$1,000.00

The error logs look like:
Caused by: org.apache.solr.common.SolrException: Error while creating
field 'price{type=sdouble,properties=indexed,stored,omitNorms,sortMissingLast}'
from value '1,000'
at org.apache.solr.schema.FieldType.createField(FieldType.java:249)
at org.apache.solr.schema.SchemaField.createField(SchemaField.java:102)
at 
org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:198)
at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:257)
... 13 more
Caused by: java.lang.NumberFormatException: For input string: "1,000"
at 
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222)
at java.lang.Double.parseDouble(Double.java:510)
at 
org.apache.solr.util.NumberUtils.double2sortableStr(NumberUtils.java:129)
at 
org.apache.solr.schema.SortableDoubleField.toInternal(SortableDoubleField.java:61)
at org.apache.solr.schema.FieldType.createField(FieldType.java:247)

It is pretty obvious from this that the "sdouble" schema fieldtype is
not set up to parse out group separators from a number.
1) Then my question is: which schema fieldtype will parse out
the comma group separator from 1,000?
2) Also, shouldn't we think about making locale-based parsing part
of this code path as well?
at 
sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1222)
at java.lang.Double.parseDouble(Double.java:510)
at 
org.apache.solr.util.NumberUtils.double2sortableStr(NumberUtils.java:129)
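One import-side option (a sketch, assuming the price still arrives through DIH
as in my other threads) is DIH's NumberFormatTransformer, which parses
locale-formatted numbers before they ever reach the sdouble field:

    <entity ... transformer="RegexTransformer,NumberFormatTransformer">
        <!-- RegexTransformer captures the grouped number (e.g. "1,000"),
             then NumberFormatTransformer parses it using the given locale -->
        <field column="price"
               regex=".*\$(\d*.\d*)"
               sourceColName="description"
               formatStyle="number"
               locale="en-US" />
    </entity>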

Thanks!
- Pulkit


Miscellaneous DIH related questions

2011-09-17 Thread Pulkit Singhal
My DIH's full-import logs end with a tailing output saying that 1500
documents were added, which is correct because I have 16 sources and
one of them was down and each source is supposed to give me 100
results:
(1500 adds)],optimize=} 0 0

But When I check my document count I get only 1384 results:
INFO: [rss] webapp=/solr path=/select params={start=0&q=*:*&rows=0}
hits=1384 status=0 QTime=0

1) I think I may have duplicates based on the primary key for the data
coming in. Is there any other explanation than that?
2) Is there some way to get a log of how many documents were deleted?
Because an update does a delete then add, this would allow me to make
sure of what is going on.

The sources I have are URL-based; sometimes they appear to be down
because the request gets denied, I suppose:
SEVERE: Exception thrown while getting data
java.io.FileNotFoundException:
http://www.amazon.com/rss/tag/anime/popular/ref=tag_tdp_rss_pop_man?length=100
Caused by: java.io.FileNotFoundException:
http://www.amazon.com/rss/tag/anime/popular/ref=tag_tdp_rss_pop_man?length=100
at 
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1434)

3) Is there some way to configure the datasource to retry 3 times or
something like that? I have increased the values for connectionTimeout
and readTimeout but it doesn't help when sometimes the server simply
denies the request due to heavy load. I need to be able to retry at
those times. The onError attribute has only the abort, skip, continue
options, none of which really lets me retry anything.

Thank You.
- Pulkit


Re: Generating large datasets for Solr proof-of-concept

2011-09-17 Thread Pulkit Singhal
Thanks Hoss. I agree that the way you restated the question is better
for getting results. BTW I think you've tipped me off to exactly what
I needed with this URL: http://bbyopen.com/

Thanks!
- Pulkit

On Fri, Sep 16, 2011 at 4:35 PM, Chris Hostetter
 wrote:
>
> : Has anyone ever had to create large mock/dummy datasets for test
> : environments or for POCs/Demos to convince folks that Solr was the
> : wave of the future? Any tips would be greatly appreciated. I suppose
> : it sounds a lot like crawling even though it started out as innocent
> : DIH usage.
>
> the better question to ask is where you can find good sample data sets for
> building proof of concept implementations.
>
> If you want an example of product data, the best buy product catalog is
> available for developers using either an API or a bulk download of xml
> files...
>
>        http://bbyopen.com/
>
> ...last time i looked (~1 year ago) there were about 1 million products in
> the data dump.
>
>
> -Hoss
>


Re: JSON and DataImportHandler

2011-09-18 Thread Pulkit Singhal
Any updates on this topic?

On Fri, Jul 16, 2010 at 5:36 PM, P Williams
 wrote:
> Hi All,
>
>    Has anyone gotten the DataImportHandler to work with json as input?  Is
> there an even easier alternative to DIH?  Could you show me an example?
>
> Many thanks,
> Tricia
>


Re: JSON and DataImportHandler

2011-09-18 Thread Pulkit Singhal
Ah I see now:
http://wiki.apache.org/solr/UpdateJSON#Example
Not part of DIH that's all.

On Sun, Sep 18, 2011 at 5:42 PM, Pulkit Singhal  wrote:
> Any updates on this topic?
>
> On Fri, Jul 16, 2010 at 5:36 PM, P Williams
>  wrote:
>> Hi All,
>>
>>    Has anyone gotten the DataImportHandler to work with json as input?  Is
>> there an even easier alternative to DIH?  Could you show me an example?
>>
>> Many thanks,
>> Tricia
>>
>


JSON indexing failing...

2011-09-19 Thread Pulkit Singhal
Hello,

I am running a simple test after reading:
http://wiki.apache.org/solr/UpdateJSON

I am only using one object from a large json file to test and see if
the indexing works:
curl 'http://localhost:8983/solr/update/json?commit=true'
--data-binary @productSample.json -H 'Content-type:application/json'

The data is from bbyopen.com, I've attached the one single object that
I'm testing with.

The indexing process fails with:
Sep 19, 2011 2:37:54 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: invalid key: url [1701]
at org.apache.solr.handler.JsonLoader.parseDoc(JsonLoader.java:355)

I thought that any json attributes that did not have a mapping in the
schema.xml file would simply not get indexed.
(a) Is this not true?

But this error made me retry after adding url to schema.xml file:

I retried after a restart but I still keep getting the same error!
(b) Can someone wise perhaps point me in the right direction for
troubleshooting this issue?

Thank You!
- Pulkit


productSample.json
Description: application/json


How does Solr deal with JSON data?

2011-09-19 Thread Pulkit Singhal
Hello Everyone,

I'm quite curious about how does the following data get understood and
indexed by Solr?
[{
"id":"Fubar",
"url": null,
"regularPrice": 3.99,
 "offers": [
{
  "url": "",
  "text": "On Sale",
  "id": "OS"
}
 ]
}]

1) The field "id" is present as part of the main object and as part of
a nested offers object, so how does Solr make sense of it?
2) Is the data under offers expected to be stored as multi-value in
Solr? Or am I supposed to create offerURL, offerText and offerId
fields in schema.xml? Even if I do that how do I tell Solr what data
to match up where?
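
For illustration, a flattened shape that the /update/json handler accepts
without any custom handling would look something like this (the offer*
field names are invented, and the arrays would need matching
multiValued fields in schema.xml):

[{
  "id": "Fubar",
  "regularPrice": 3.99,
  "offerUrl": [""],
  "offerText": ["On Sale"],
  "offerId": ["OS"]
}]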

Please be kind, I know I'm not thinking about this in the right
manner, just gently set me straight about all this :)
- Pulkit


Re: JSON indexing failing...

2011-09-19 Thread Pulkit Singhal
Ok a little bit of deleting lines from the json file led me to realize
that Solr isn't happy with the following:
  "offers": [
{
  "url": "",
  "text": "On Sale",
  "id": "OS"
}
  ],
But as to why? Or what to do to remedy this ... I have no clue :(

- Pulkit

On Mon, Sep 19, 2011 at 2:45 PM, Pulkit Singhal  wrote:
> Hello,
>
> I am running a simple test after reading:
> http://wiki.apache.org/solr/UpdateJSON
>
> I am only using one object from a large json file to test and see if
> the indexing works:
> curl 'http://localhost:8983/solr/update/json?commit=true'
> --data-binary @productSample.json -H 'Content-type:application/json'
>
> The data is from bbyopen.com, I've attached the one single object that
> I'm testing with.
>
> The indexing process fails with:
> Sep 19, 2011 2:37:54 PM org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.SolrException: invalid key: url [1701]
>        at org.apache.solr.handler.JsonLoader.parseDoc(JsonLoader.java:355)
>
> I thought that any json attributes that did not have a mapping in the
> schema.xml file would simply not get indexed.
> (a) Is this not true?
>
> But this error made me retry after adding url to schema.xml file:
> 
> I retried after a restart but I still keep getting the same error!
> (b) Can someone wise perhaps point me in the right direction for
> troubleshooting this issue?
>
> Thank You!
> - Pulkit
>


Troubleshooting OOM in DIH w/ FileListEntityProcessor and XPathEntityProcessor

2011-09-20 Thread Pulkit Singhal
Hello Everyone,

I need help in:
(a) figuring out the causes of OutOfMemoryError (OOM) when I run Data
Import Handler (DIH),
(b) finding workarounds and fixes to get rid of the OOM issue per cause.

The stacktrace is at the very bottom to avoid having your eyes glaze
over and to prevent you from skipping this thread ;)

1) Based on the documentation so far, I would say that "batchSize"
based control does not exist for FileListEntityProcessor or
XPathEntityProcessor. Please correct me if I'm wrong about this.

2) The files being processed by FileListEntityProcessor range from
90.9 to 2.8 MB in size.
2.1) Is there some way to let FileListEntityProcessor bring in only
one file at a time? Or is that the default already?
2.2) Is there some way to let FileListEntityProcessor stream the file
to its nested XPathEntityProcessor?
2.3) If streaming a file is something that should be configured
directly on XPathEntityProcessor, then please let me know how to do
that as well.

3) Where are the default xms and xmx for Solr configured? Please let
me know so I may try tweaking them for startup.
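
As far as I can tell, the example Jetty distribution does not set any heap
size itself, so the heap is simply whatever you pass to the JVM on startup,
e.g. from the example directory (values below are just placeholders):

cd example
java -Xms256m -Xmx1024m -jar start.jar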


STACKTRACE:

SEVERE: Exception while processing: bbyopenProductsArchive document : null:
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.OutOfMemoryError: Java heap space
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:718)
...
Caused by: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2734)
    at java.util.ArrayList.toArray(ArrayList.java:275)
    at java.util.ArrayList.<init>(ArrayList.java:131)
    at org.apache.solr.handler.dataimport.XPathRecordReader$Node.getDeepCopy(XPathRecordReader.java:586)
...
INFO: start rollback
Sep 20, 2011 4:22:26 PM org.apache.solr.handler.dataimport.SolrWriter rollback
SEVERE: Exception while solr rollback.
java.lang.NullPointerException
    at org.apache.solr.update.DefaultSolrCoreState.rollbackIndexWriter(DefaultSolrCoreState.java:73)


Re: How to set up the schema to avoid NumberFormatException

2011-09-20 Thread Pulkit Singhal
Hi Hoss,

Thanks for the input!

Something rather strange happened. I fixed my regex such that instead
of returning just 1,000 ... it would return 1,000.00 and voila it
worked! So parsing group separators is apparently already supported
then ... it's just that the format is also looking for a
decimal separator and digits after that ... weird huh?



- Pulkit

On Fri, Sep 16, 2011 at 10:53 AM, Chris Hostetter
 wrote:
>
> : It is pretty obvious from this that the "sdouble" schema fieldtype is
> : not setup to parse out group-separators from a number.
>
> correct.  the numeric (and date) field types are all designed to deal with
> conversion of the canonical string representation.
>
> : 1) Then my question is which type pf schema fieldtype will parse out
> : the comma group-separator from 1,000?
>
> that depends on how you want to interpret/use those values..
>
> : 2) Also, shouldn't we think about making locale based parsing be part
> : of this stack trace as well?
>
> Not in the field types.
>
> 1) adding extra parse logic there would be inefficient for people who are
> only ever sending well formed data.
> 2) as a client/server setup, it would be a bad idea for the server to
> assume the client is using the same locale
>
> The right place in the stack for this type of logic would be in an
> UpdateProcessor (for indexing docs) or in a
> QueryParser/DocTransformer (for querying / writing back values in the
> results).
>
> Solr could certainly use some more general purpose UpdateProcessors for
> parsing various non-canonical input formats (we've talked about one for
> doing rule based SimpleDateParsing as well) if you'd like to take a stab
> at writing one and contributing it.
>
>
> -Hoss
>
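
A minimal sketch of what such an update processor could look like (the
class name and the hard-coded "price" field are invented for
illustration; this is not a stock Solr processor):

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class StripGroupingSeparatorsProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.solrDoc;
        Object raw = doc.getFieldValue("price"); // assumed field name
        if (raw instanceof String) {
          // turn "1,000" into "1000" before the numeric field type parses it
          doc.setField("price", ((String) raw).replace(",", ""));
        }
        super.processAdd(cmd);
      }
    };
  }
}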


How to skip fields when using DIH?

2011-09-20 Thread Pulkit Singhal
The data I'm running through the DIH looks like:


  
false
false
349.99

  


As you can see, in this particular instance of a product, there is no
value for "salesRankShortTerm" which happens to be defined in my
schema like so:


Having an empty value in the incoming DIH data leads to an exception:
Caused by: java.lang.NumberFormatException: For input string: ""

1) How can I skip this field if its empty?

If I use script transformer like so:
  <script><![CDATA[
  function skipRow(row) {
      var salesRankShortTerm = row.get( 'salesRankShortTerm' );
      if ( salesRankShortTerm == null || salesRankShortTerm == '' ) {
          row.put( '$skipRow', 'true' );
      }
      return row;
  }
  ]]></script>
THEN, I will end up skipping the entire document :(

2) So please help me understand how I can configure it to only skip a
field and not the document?

Thanks,
- Pulkit


Re: How to skip fields when using DIH?

2011-09-20 Thread Pulkit Singhal
OMG, I'm so sorry, please ignore.

It's so simple, I just had to use:
row.remove( 'salesRankShortTerm' );
because the script runs at the end after the entire entity has been
processed (I suppose) rather than per field.
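
For completeness, the whole transformer function in data-config.xml ends up
looking roughly like this (a sketch, reusing the same field name as above
and referenced via transformer="script:skipEmptyFields" on the entity):

<script><![CDATA[
  function skipEmptyFields(row) {
      var salesRankShortTerm = row.get( 'salesRankShortTerm' );
      if ( salesRankShortTerm == null || salesRankShortTerm == '' ) {
          // remove only this field; the document itself is still indexed
          row.remove( 'salesRankShortTerm' );
      }
      return row;
  }
]]></script>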

Thanks!

On Tue, Sep 20, 2011 at 5:42 PM, Pulkit Singhal  wrote:
> The data I'm running through the DIH looks like:
>
> 
>  
>    false
>    false
>    349.99
>    
>  
> 
>
> As you can see, in this particular instance of a product, there is no
> value for "salesRankShortTerm" which happens to be defined in my
> schema like so:
>  />
>
> Having an empty value in the incoming DIH data leads to an exception:
> Caused by: java.lang.NumberFormatException: For input string: ""
>
> 1) How can I skip this field if its empty?
>
> If I use script transformer like so:
>  
>        <![CDATA[
>        function skipRow(row) {
>            var salesRankShortTerm = row.get( 'salesRankShortTerm' );
>            if ( salesRankShortTerm == null || salesRankShortTerm == '' ) {
>                row.put( '$skipRow', 'true' );
>            }
>            return row;
>        }
>        ]]>
>  
> THEN, I will end up skipping the entire document :(
>
> 2) So please help me understand how I can configure it to only skip a
> field and not the document?
>
> Thanks,
> - Pulkit
>


Best Practices for indexing nested XML in Solr via DIH

2011-09-21 Thread Pulkit Singhal
Hello Everyone,

I was wondering what are the various best practices that everyone
follows for indexing nested XML into Solr. Please don't feel limited
by examples, feel free to share your own experiences.

Given an xml structure such as the following:

<categoryPath>
  <category>
    <id>cat001</id>
    <name>Everything</name>
  </category>
  <category>
    <id>cat002</id>
    <name>Music</name>
  </category>
  <category>
    <id>cat003</id>
    <name>Pop</name>
  </category>
</categoryPath>

How do you make the best use of the data when indexing?

1) Do you use Scenario A?
categoryPath_category_id = cat001 cat002 cat003 (flattened)
categoryPath_category_name = Everything Music Pop (flattened)
If so then how do you manage to find the corresponding
categoryPath_category_id if someone's search matches a value in the
categoryPath_category_name field? I understand that Solr is not about
lookups but this may be important information for you to display right
away as part of the search results page rendering.

2) Do you use Scenario B?
categoryPath_category_id = [cat001 cat002 cat003] (the [] signifies a
multi-value field)
categoryPath_category_name = [Everything Music Pop] (the [] signifies
a multi-value field)
And once again how do you find associated data sets once something matches.
Side Question: How can one configure DIH to store the data this way
for Scenario B?
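
My best guess at how Scenario B could be wired up with DIH: an
XPathEntityProcessor walking the structure above, with both columns
declared multiValued="true" in schema.xml (the names and the url variable
below are illustrative):

<entity name="product"
        processor="XPathEntityProcessor"
        url="${f.fileAbsolutePath}"
        forEach="/categoryPath">
  <field column="categoryPath_category_id"   xpath="/categoryPath/category/id"/>
  <field column="categoryPath_category_name" xpath="/categoryPath/category/name"/>
</entity>

Repeated matches of the same xpath within one forEach record come back as
multiple values, which is what Scenario B needs.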

Thanks!
- Pulkit


Re: How to write core's name in log

2011-09-21 Thread Pulkit Singhal
Not sure if this is a good lead for you but when I run out-of-the-box
multi-core example-DIH instance of Solr, I often see core name thrown
about in the logs. Perhaps you can look there?

On Thu, Sep 15, 2011 at 6:50 AM, Joan  wrote:
> Hi,
>
> I have multiple core in Solr and I want to write core name in log through to
> lo4j.
>
> I've found in SolrException a method called log(Logger log, Throwable e) but
> when It try to build a Exception it haven't core's name.
>
> The Exception is built in toStr() method in SolrException class, so I want
> to write core's name in the message of Exception.
>
> I'm thinking to add MDC variable, this will be name of core. Finally I'll
> use it in log4j configuration like this in ConversionPattern %X{core}
>
> The idea is that when Solr received a request I'll add this new variable
> "name of core".
>
> But I don't know if it's a good idea or not.
>
> or Do you already exists any solution for add name of core in log?
>
> Thanks
>
> Joan
>


Re: strange copied field problem

2011-09-21 Thread Pulkit Singhal
I am NOT claiming that making a copy of a copy field is wrong or leads
to a race condition. I don't know that. BUT did you try to copy into
the text field directly from the genre field? Instead of the
genre_search field? Did that yield working queries?

On Wed, Sep 21, 2011 at 12:16 PM, Tanner Postert
 wrote:
> i have 3 fields that I am working with: genre, genre_search and text. genre
> is a string field which comes from the data source. genre_search is a text
> field that is copied from genre, and text is a text field that is copied
> from genre_search and a few other fields. Text field is the default search
> field for queries. When I search for q=genre_search:indie+rock, solr returns
> several records that have both Indie as a genre and Rock as a genre, which
> is great, but when I search for q=indie+rock or q=text:indie+rock, i get no
> results.
>
> Why would the source field return the value and the destination wouldn't.
> Both genre_search and text are the same data type, so there shouldn't be any
> strange translations happening.
>


Re: OOM errors and -XX:OnOutOfMemoryError flag not working on solr?

2011-09-21 Thread Pulkit Singhal
Usually any good piece of java code refrains from capturing Throwable
so that Errors will bubble up unlike exceptions. Having said that,
perhaps someone in the list can help, if you share which particular
Solr version you are using where you suspect that the Error is being
eaten up.

On Fri, Sep 16, 2011 at 2:47 PM, Jason Toy  wrote:
> I have solr issues where I keep running out of memory. I am working on
> solving the memory issues (this will take a long time), but in the meantime,
> I'm trying to be notified when the error occurs.  I saw with the jvm I can
> pass the -XX:OnOutOfMemoryError= flag and pass a script to run. Every time
> the out of memory issue occurs though my script never runs. Does solr let
> the error bubble up so that the jvm can call this script? If not how can I
> have a script run when solr gets an out of memory issue?
>


Re: add quartz like scheduling cabalities to solr-DIH

2011-09-21 Thread Pulkit Singhal
I think what Ahmet is trying to say is that such functionality does not exist.
As the functionality does not exist, there is no procedure or conf
file related work to speak of.
There has been request to have this work done and you can vote/watch
for it here:
https://issues.apache.org/jira/browse/SOLR-1251

On Fri, Sep 16, 2011 at 7:35 AM, vighnesh  wrote:
> thanks iroxxx
>
>
> but how can l add quartz like scheduling to solr dih ,is there any changes
> required in anyof the configuration files please specify the procedure.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/add-quartz-like-scheduling-cabalities-to-solr-DIH-tp3341141p3341795.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr Indexing - Null Values in date field

2011-09-21 Thread Pulkit Singhal
Also you may use the script transformer to explicitly remove the field
from the document if the field is null. I do this for all my sdouble
and sdate fields ... it's a bit manual and I would like to see Solr
enhanced to simply skip stuff like this by having a flag for its DIH
code but until then it suffices:

... transformer="DateFormatTransformer,script:skipEmptyFields"

  

  



On Wed, Sep 21, 2011 at 6:06 AM, Gora Mohanty  wrote:
> On Wed, Sep 21, 2011 at 4:08 PM, mechravi25  wrote:
>> Hi,
>>
>> I have a field in my source with data type as string and that field has NULL
>> values. I am trying to index this field in solr as a date data type with
>> multivalued = true. Following is the entry for that field in my schema.xml
> [...]
>
> One cannot have NULL values as input for Solr date fields. The
> multivalued part is irrelevant here.
>
> As it seems like you are getting the input data from a database,
> you will need to supply some invalid date for NULL date values.
> E.g., with mysql, we have:
> COALESCE( CreationDate, STR_TO_DATE( '1970,1,1', '%Y,%m,%d' ) )
> The required syntax will be different for other databases.
>
> Regards,
> Gora
>


Debugging DIH by placing breakpoints

2011-09-21 Thread Pulkit Singhal
Hello,

I was wondering where can I find the source code for DIH? I want to
checkout the source and step-trhought it breakpoint by breakpoint to
understand it better :)

Thanks!
- Pulkit


Re: Debugging DIH by placing breakpoints

2011-09-21 Thread Pulkit Singhal
Correct! With that additional info, plus
http://wiki.apache.org/solr/HowToContribute (ant eclipse), plus a
refreshed (close/open) eclipse project ... I'm all set.

Thanks Again.

On Wed, Sep 21, 2011 at 1:43 PM, Gora Mohanty  wrote:
> On Thu, Sep 22, 2011 at 12:08 AM, Pulkit Singhal
>  wrote:
>> Hello,
>>
>> I was wondering where can I find the source code for DIH? I want to
>> checkout the source and step-trhought it breakpoint by breakpoint to
>> understand it better :)
>
> Should be under contrib/dataimporthandler in your Solr source
> tree.
>
> Regards,
> Gora
>


Re: strange copied field problem

2011-09-21 Thread Pulkit Singhal
No probs. I would still hope someone would comment on you thread with
some expert opinions about making a copy of a copy :)

On Wed, Sep 21, 2011 at 1:38 PM, Tanner Postert
 wrote:
> sure enough that worked. could have sworn we had it this way before, but
> either way, that fixed it. Thanks.
>
> On Wed, Sep 21, 2011 at 11:01 AM, Tanner Postert
> wrote:
>
>> i believe that was the original configuration, but I can switch it back and
>> see if that yields any results.
>>
>>
>> On Wed, Sep 21, 2011 at 10:54 AM, Pulkit Singhal 
>> wrote:
>>
>>> I am NOT claiming that making a copy of a copy field is wrong or leads
>>> to a race condition. I don't know that. BUT did you try to copy into
>>> the text field directly from the genre field? Instead of the
>>> genre_search field? Did that yield working queries?
>>>
>>> On Wed, Sep 21, 2011 at 12:16 PM, Tanner Postert
>>>  wrote:
>>> > i have 3 fields that I am working with: genre, genre_search and text.
>>> genre
>>> > is a string field which comes from the data source. genre_search is a
>>> text
>>> > field that is copied from genre, and text is a text field that is copied
>>> > from genre_search and a few other fields. Text field is the default
>>> search
>>> > field for queries. When I search for q=genre_search:indie+rock, solr
>>> returns
>>> > several records that have both Indie as a genre and Rock as a genre,
>>> which
>>> > is great, but when I search for q=indie+rock or q=text:indie+rock, i get
>>> no
>>> > results.
>>> >
>>> > Why would the source field return the value and the destination
>>> wouldn't.
>>> > Both genre_search and text are the same data type, so there shouldn't be
>>> any
>>> > strange translations happening.
>>> >
>>>
>>
>>
>


ScriptTransformer question

2011-09-22 Thread Pulkit Singhal
Hello,

I'm using DIH in the trunk version and I have placed breakpoints in
the Solr code.
I can see that the value for a row being fed into the
ScriptTransformer instance is:
{buybackPlans.buybackPlan.type=[PSP-PRP],
buybackPlans.buybackPlan.name=[2-Year Buy Back Plan],
buybackPlans.buybackPlan.sku=[2490748],
$forEach=/products/product/buybackPlans/buybackPlan,
buybackPlans.buybackPlan.price=[]}

Now price cannot be empty because Solr will complain so the following
script should be running but it doesn't do anything!!!
Can anyone spot the issue here?
function skipEmptyFieldsInBuybackPlans(row) {
var buybackPlans_buybackPlan_price = row.get(
'buybackPlans.buybackPlan.price' );
if ( buybackPlans_buybackPlan_price == null ||
 buybackPlans_buybackPlan_price == '' ||
 buybackPlans_buybackPlan_price.length == 0)
{
row.remove( 'buybackPlans.buybackPlan.price' );
}
return row;
}
I would hate to have to get the rhino javascript engine source code
and step-through that.
I'm sure I'm being really dumb and am hoping that someone on the Solr
mailing list can help me spot the issue :)
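
One detail worth noting from the dump above: the values in the row map print
as [...], i.e. they are Java List objects rather than Strings, so the == ''
and .length checks never match an empty list. A size-based check seems more
likely to fire; a sketch:

function skipEmptyFieldsInBuybackPlans(row) {
    var price = row.get( 'buybackPlans.buybackPlan.price' );
    // price is a java.util.List here, so test isEmpty() rather than == '' or .length
    if ( price == null || price.isEmpty() ) {
        row.remove( 'buybackPlans.buybackPlan.price' );
    }
    return row;
}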

Thanks!
- Pulkit


Re: DIH error when nested db datasource and file data source

2011-09-23 Thread Pulkit Singhal
Few thoughts:

1) If you place the script transformer method on the entity named "x"
and then pass the ${topic_tree.topic_id} to that as an argument, then
shouldn't you have everything you need to work with x's row? Even if
you can't look up at the parent, all you needed to know was the
topic_id and based on that you can edit or not edit x's row ...
shouldn't that be sufficient to get you what you need to do?

2) Regarding the manner in which you are trying to use the following
xpath syntax:
forEach="/gvpVideoMetaData/mediaItem[@media_id='${topic_tree.topic_id}']"
There are two other closely related thread that I've come across:
(a) 
http://lucene.472066.n3.nabble.com/DIH-Enhance-XPathRecordReader-to-deal-with-body-FLATTEN-true-and-body-h1-td2799005.html
(b) 
http://lucene.472066.n3.nabble.com/using-DIH-with-mets-alto-file-sets-td1926642.html

They both seemed to want to use the full power of XPath like you do
and I think that in a roundabout way they were told to utilize the xsl
attribute to make up for what the XPath was lacking by default.

Here are some choice words by Lance that I've extracted out for you:

"XPathEntityProcessor parses a very limited XPath syntax. However, you
can add an XSL script as an attribute, and this somehow gets called
instead."

- Lance


There is an option somewhere to use the full XML DOM implementation
for using xpaths. The purpose of the XPathEP is to be as simple and
dumb as possible and handle most cases: RSS feeds and other open
standards.
Search for xsl(optional)
http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

- Lance

I hope you can make some sense of this, I'm no expert, but just
thought I'd offer my 2 cts.

On Fri, Sep 23, 2011 at 9:21 AM, abhayd  wrote:
> hi
> I am not getting exception anymore.. I had issue with database
>
> But now real problem i always have ...
> Now that i can fetch ID's from database how would i fetch correcponding data
> from ID in xm file
>
> So after getting DB info from jdbcsource I use xpath processor like this,
> but it does not work.
>  baseDir="${solr.solr.home}" fileName=".xml"
>                recursive="false" rootEntity="true"
> dataSource="video_datasource">
>           
> forEach="/gvpVideoMetaData/mediaItem[@media_id='${topic_tree.topic_id}']"
>            url="${f.fileAbsolutePath}"
>                    >
>
> I even tried using script transformer but "row" in script transformer has
> scope limited to entity "f"  If this is nested under another entity u cant
> access top level variables with "row" .
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/DIH-error-when-nested-db-datasource-and-file-data-source-tp3345664p3362007.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: UIMA DictionaryAnnotator partOfSpeach

2011-09-28 Thread Pulkit Singhal
At first glance it seems like a simple localization issue as indicated by this:

> org.apache.uima.annotator.dict_annot.impl.DictionaryAnnotatorProcessException:
> EXCEPTION MESSAGE LOCALIZATION FAILED: java.util.MissingResourceException:
> Can't find bundle for base name
> org.apache.uima.annotator.dict_annot.dictionaryAnnotatorMessages, locale
> en_US

Perhaps you can get the source code for UIMA and run the server
hosting Solr in debug mode then remote connect to it via eclipse or
some other IDE and use a breakpoint to figure out which resource is
the issue.

After that it would be UIMA specific solution, I think.

On Wed, Sep 28, 2011 at 4:11 PM, chanhangfai  wrote:
> Hi all,
>
> I have the dictionary Annotator UIMA-solr running,
> used my own dictionary file and it works,
> it will match all the words (Nouns, Verbs and Adjectives) from my dictionary
> file.
>
> *but now, if I only want to match "Nouns",  (ignore other part of speech)*
>
> how can I configure it?
>
>
> http://uima.apache.org/d/uima-addons-current/DictionaryAnnotator/DictionaryAnnotatorUserGuide.html
>
> From the above user guide, in section (3.3. Input Match Type Filters),
> i added the following code to my DictionaryAnnotatorDescriptor.xml,
>
> 
>   InputMatchFilterFeaturePath
>   
>      *partOfSpeach*
>   
> 
>
> 
>   FilterConditionOperator
>   
>      EQUALS
>   
> 
>
> 
>   FilterConditionValue
>   
>      noun
>   
> 
>
>
> but it fails, and the error said featurePathElementNames "*partOfSpeach*" is
> invalid.
>
> org.apache.uima.annotator.dict_annot.impl.DictionaryAnnotatorProcessException:
> EXCEPTION MESSAGE LOCALIZATION FAILED: java.util.MissingResourceException:
> Can't find bundle for base name
> org.apache.uima.annotator.dict_annot.dictionaryAnnotatorMessages, locale
> en_US
>        at
> org.apache.uima.annotator.dict_annot.impl.FeaturePathInfo_impl.typeSystemInit(FeaturePathInfo_impl.java:110)
>        at
> org.apache.uima.annotator.dict_annot.impl.DictionaryAnnotator.typeSystemInit(DictionaryAnnotator.java:383)
>        at
> org.apache.uima.analysis_component.CasAnnotator_ImplBase.checkTypeSystemChange(CasAnnotator_ImplBase.java:100)
>        at
> org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:55)
>        at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
>        at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
>        at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
>        at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.(ASB_impl.java:409)
>        at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
>        at
> org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
>        at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
>        at
> org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
>
>
>
> Any idea please,
> Thanks in advance..
>
> Frankie
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/UIMA-DictionaryAnnotator-partOfSpeach-tp3377440p3377440.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: basic solr cloud questions

2011-09-28 Thread Pulkit Singhal
@Darren: I feel that the question itself is misleading. Creating
shards is meant to separate out the data ... not keep the exact same
copy of it.

I think the two node setup that was attempted by Sam misled him and
us into thinking that configuring two nodes which are to be named
"shard1" ... somehow means that they are instantly replicated too ...
this is not the case! I can see how this misunderstanding can develop
as I too was confused until Yury cleared it up.

@Sam: If you are interested in performing a quick exercise to
understand the pieces involved for replication rather than sharding
... perhaps this link would be of help in taking you through it:
http://pulkitsinghal.blogspot.com/2011/09/setup-solr-master-slave-replication.html
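
In a nutshell (per the SolrReplication wiki), the wiring is a pair of
requestHandler entries in solrconfig.xml, roughly like this (host names
and file lists are placeholders):

<!-- on the master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- on the slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>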

- Pulkit

2011/9/27 Yury Kats :
> On 9/27/2011 5:16 PM, Darren Govoni wrote:
>> On 09/27/2011 05:05 PM, Yury Kats wrote:
>>> You need to either submit the docs to both nodes, or have a replication
>>> setup between the two. Otherwise they are not in sync.
>> I hope that's not the case. :/ My understanding (or hope maybe) is that
>> the new Solr Cloud implementation will support auto-sharding and
>> distributed indexing. This means that shards will receive different
>> documents regardless of which node received the submitted document
>> (spread evenly based on a hash<->node assignment). Distributed queries
>> will thus merge all the solr shard/node responses.
>
> All cores in the same shard must somehow have the same index.
> Only then can you continue servicing searches when individual cores
> fail. Auto-sharding and distributed indexing don't have anything to
> do with this.
>
> In the future, SolrCloud may be managing replication between cores
> in the same shard automatically. But right now it does not.
>


Re: Why I can't take an full-import with entity name?

2011-09-28 Thread Pulkit Singhal
Can you monitor the DB side to see what results it returned for that query?

2011/8/30 于浩 :
> I am using solr1.3,I updated solr index throgh solr delta import every two
> hours. but the delta import is database connection wasteful.
> So i want to use full-import with entity name instead of delta import.
>
> my db-data-config.xml  file:
> 
>                
> 
>   query="select Article_ID,Article_Title,Article_Abstract from Article_Detail
> where Article_ID>'${dataimporter.request.minID}' and Article_ID
> <='{dataimporter.request.maxID}'
> ">
>                
> 
>
>
> then I uses
> http://192.168.1.98:8081/solr/db_article/dataimport?command=full-import&entity=delta_article&commit=true&clean=false&maxID=1000&minID=10
> but the solr will finish nearyly instant,and there is no any record
> imported. but what the fact is there are many records meets the condtion of
> maxID and minID.
>
>
> the tomcat log:
> INFO: [db_article] webapp=/solr path=/dataimport
> params={maxID=6737277&clean=false&commit=true&entity=delta_article&command=full-import&minID=6736841}
> status=0 QTime=0
> 2011-8-29 19:00:03 org.apache.solr.handler.dataimport.DataImporter doFullImport
> INFO: Starting Full Import
> 2011-8-29 19:00:03 org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties
> INFO: Read dataimport.properties
> 2011-8-29 19:00:03 org.apache.solr.handler.dataimport.SolrWriter persistStartTime
> INFO: Wrote last indexed time to dataimport.properties
> 2011-8-29 19:00:03 org.apache.solr.handler.dataimport.DocBuilder commit
> INFO: Full Import completed successfully
>
>
> some body who can help or some advices?
>


Re: SolrCloud: is there a programmatic way to create an ensemble

2011-09-28 Thread Pulkit Singhal
Did you find out about this?

2011/8/2 Yury Kats :
> I have multiple SolrCloud instances, each running its own Zookeeper
> (Solr launched with -DzkRun).
>
> I would like to create an ensemble out of them. I know about -DzkHost
> parameter, but can I achieve the same programmatically? Either with
> SolrJ or REST API?
>
> Thanks,
> Yury
>


Re: basic solr cloud questions

2011-09-30 Thread Pulkit Singhal
SOLR-2355 is definitely a step in the right direction but something I
would like to get clarified:

a) There were some fixes to it that went on the 3.4 & 3.5 branch based
on the comments section ... are they not available or not needed on
4.x trunk?

b) Does this basic implementation distribute across shards or across
cores? I think that distributing across all the cores in a shard is
the key towards using it successfully with SolrCloud and I really
don't know if this does this right now as I am not familiar with the
source code. If someone could answer this it would be great otherwise
I'll post back eventually when I do become familiar.

Cheers,
- Pulkit


Re: basic solr cloud questions

2011-09-30 Thread Pulkit Singhal
BTW I updated the wiki with the following, hope it keeps things simple for
others starting out:

Example B: Simple two shard cluster with shard replicas
Note: This setup leverages copy/paste to set up 2 cores per shard, and
distributed searches validate a successful completion of this
example/exercise. But DO NOT assume that any new data that you index
will be distributed across and indexed at each core of a given shard.
That will not happen. Distributed Indexing is not part of SolrCloud
yet. You may however adapt a basic implementation of distributed
indexing by referring to SOLR-2355.

On Fri, Sep 30, 2011 at 11:26 AM, Pulkit Singhal
 wrote:
> SOLR-2355 is definitely a step in the right direction but something I
> would like to get clarified:
>
> a) There were some fixes to it that went on the 3.4 & 3.5 branch based
> on the comments section ... are they not available or not needed on
> 4.x trunk?
>
> b) Does this basic implementation distribute across shards or across
> cores? I think that distributing across all the cores in a shard is
> the key towards using it successfully with SolrCloud and I really
> don't know if this does this right now as I am not familiar with the
> source code. If someone could answer this it would be great otherwise
> I'll post back eventually when I do become familiar.
>
> Cheers,
> - Pulkit
>


Bug in DIH?

2011-10-01 Thread Pulkit Singhal
It's a rather strange stacktrace (at the bottom).
An entire 1+ dataset finishes up only to end up crashing & burning
due to a log statement :)

Based on what I can tell from the stacktrace and the 4.x trunk source
code, it seems that the following log statement dies:
//LogUpdateProcessorFactory.java:188
log.info( ""+toLog + " 0 " + (elapsed) );

Eventually at the strict cast:
//NamedList.java:127
return (String)nvPairs.get(idx << 1);

I was wondering what kind of mistaken data I would have ended up
getting misplaced into:
//LogUpdateProcessorFactory.java:76
private final NamedList toLog;

To cause the java.util.ArrayList cannot be cast to java.lang.String issue?
Could it be due to the multivalued fields that I'm trying to index?
Is this a bug or just a mistake in how I use DIH, please let me know
your thoughts!

SEVERE: Full Import failed:java.lang.ClassCastException:
java.util.ArrayList cannot be cast to java.lang.String
at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
at org.apache.solr.common.util.NamedList.toString(NamedList.java:263)
at java.lang.String.valueOf(String.java:2826)
at java.lang.StringBuilder.append(StringBuilder.java:115)
at 
org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
at 
org.apache.solr.handler.dataimport.SolrWriter.close(SolrWriter.java:57)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:265)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:372)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:440)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:421)


Enabling the right logs for DIH

2011-10-01 Thread Pulkit Singhal

The Problem:

When using DIH with trunk 4.x, I am seeing some very funny numbers
with a particularly large XML file that I'm trying to import. Usually
there are bound to be more rows than documents indexed in DIH because
of the forEach property, but my other xml files have maybe 1.5 times
the rows compared to the # of docs indexed.

This particular funky file ends up with something like:
rows fetched: 25614008
documents indexed: 1048
That's 25 million rows fetched before even a measly 1000 docs are indexed!
Something has to be wrong here.
I checked the xml for well-formed-ness in vim by running ":!xmllint
--noout %" so I think there are no issues there.


The Question:

For those intimately familiar with DIH code/behaviour: What is the
appropriate log-level that will let me see the rows & docs printed out
to log as each one is fetched/created? I don't want to make the logs
explode because then I won't be able to read through them. Is there
some gentle balance here that I can leverage?
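
One knob that may help here without touching log levels at all is DIH's
debug mode, which runs over a handful of rows and reports verbosely on
what it fetched and produced; something along these lines (parameter
spellings as I recall them from the DIH wiki):

curl 'http://localhost:8983/solr/dataimport?command=full-import&debug=on&verbose=true&commit=false'

The handler path depends on how the dataimport handler is registered in
solrconfig.xml, of course.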

Thanks!
- Pulkit


Re: Bug in DIH?

2011-10-01 Thread Pulkit Singhal
Thanks Lance, its logged as:
https://issues.apache.org/jira/browse/SOLR-2804

- Pulkit

On Sat, Oct 1, 2011 at 8:59 PM, Lance Norskog  wrote:
> Should bugs in LogProcessor should be ignored by DIH? They are not required
> to index data, right?
>
> Please open an issue for this. The fix should have two parts:
> 1) fix the exception
> 2) log and ignore exceptions in the LogProcessor
>
> On Sat, Oct 1, 2011 at 2:02 PM, Pulkit Singhal wrote:
>
>> Its rather strange stacktrace(at the bottom).
>> An entire 1+ dataset finishes up only to end up crashing & burning
>> due to a log statement :)
>>
>> Based on what I can tell from the stacktrace and the 4.x trunk source
>> code, it seems that the follwoign log statement dies:
>>    //LogUpdateProcessorFactory.java:188
>>    log.info( ""+toLog + " 0 " + (elapsed) );
>>
>> Eventually at the strict cast:
>>    //NamedList.java:127
>>    return (String)nvPairs.get(idx << 1);
>>
>> I was wondering what kind of mistaken data would I have ended up
>> getting misplaced into:
>>    //LogUpdateProcessorFactory.java:76
>>    private final NamedList toLog;
>>
>> To cause the java.util.ArrayList cannot be cast to java.lang.String issue?
>> Could it be due to the multivalued fields that I'm trying to index?
>> Is this a bug or just a mistake in how I use DIH, please let me know
>> your thoughts!
>>
>> SEVERE: Full Import failed:java.lang.ClassCastException:
>> java.util.ArrayList cannot be cast to java.lang.String
>>        at org.apache.solr.common.util.NamedList.getName(NamedList.java:127)
>>        at
>> org.apache.solr.common.util.NamedList.toString(NamedList.java:263)
>>        at java.lang.String.valueOf(String.java:2826)
>>        at java.lang.StringBuilder.append(StringBuilder.java:115)
>>        at
>> org.apache.solr.update.processor.LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:188)
>>        at
>> org.apache.solr.handler.dataimport.SolrWriter.close(SolrWriter.java:57)
>>        at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:265)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:372)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:440)
>>        at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:421)
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


DIH full-import with clean=false is still removing old data

2011-10-04 Thread Pulkit Singhal
Hello,

I have a unique dataset of 1,110,000 products, each as its own file.
It is split into three different directories of 500,000, 110,000, and
500,000 files.

When I run:
http://localhost:8983/solr/bbyopen/dataimport?command=full-import&clean=false&commit=true
The first 500,000 entries are successfully indexed and then the next
110,000 entries also work ... but after I run the third full-import on
the last set of 500,000 entries, the document count remains at 610,000
... it doesn't go up to 1,110,000!

1) Is there some kind of limit here? Why can the full-import keep the
initial 500,000 entries and then let me do a full-import with 110,000
more entries ... but when I try to do a 3rd full-import, the document
count doesn't go up.

2) I know for sure that all the data is unique. Since I am not doing
delta-imports, I have NOT specified any primary key in the
data-import.xml file. But I do have a uniqueKey in the schema.xml
file.

Any tips?
- Pulkit



Re: DIH full-import with clean=false is still removing old data

2011-10-04 Thread Pulkit Singhal
Bah it worked after cleaning it out for the 3rd time, don't know what
I did differently this time :(



On Tue, Oct 4, 2011 at 8:00 PM, Pulkit Singhal  wrote:
> Hello,
>
> I have a unique dataset of 1,110,000 products, each as its own file.
> It is split into three different directories as 500,000 and 110,000
> files and 500,000.
>
> When I run:
> http://localhost:8983/solr/bbyopen/dataimport?command=full-import&clean=false&commit=true
> The first 500,000 entries are successfully indexed and then the next
> 110,000 entries also work ... but after I run the third full-import on
> the last set of 500,000 entries, the document count remains at 610,000
> ... it doesn't go up to 1,110,000!
>
> 1) Is there some kind of limit here? Why can the full-import keep the
> initial 500,000 entries and then let me do a full-import with 110,000
> more entries ... but when I try to do a 3rd full-import, the document
> count doesn't go up.
>
> 2) I know for sure that all the data is unique. Since I am not doing
> delta-imports, I have NOT specified any primary key in the
> data-import.xml file. But I do have a uniqueKey in the schema.xml
> file.
>
> Any tips?
> - Pulkit
>


Interesting DIH challenge

2011-10-09 Thread Pulkit Singhal
Hello Folks,

I'm a big DIH fan but I'm fairly sure that now I've run into a scenario
where it can't help me anymore ... but before I give up and roll my own
solution, I just wanted to check with everyone else.

The scenario:
- already have 1M+ documents indexed
- the schema.xml needs to have one more field added to it ...
problem/do-able? yes? no? remove all the old data? or do the update per doc
(add/delete)?
- need to populate data from a file that has a key and value per line and i
need to use the key to find the doc to update and then add the value to the
new schema field

Any ideas?


Re: Interesting DIH challenge

2011-10-09 Thread Pulkit Singhal
@Gora Thank You!

I know that Solr accepts xml with Solr-specific elements that are commands
that only it understands ... such as <add>, <commit>, etc.

Question: Is there some way to ask Solr to dump out whatever it has in its
index already ... as a Solr xml document?

Plan: I intend to massage that xml dump (add the field + value that I need
in every doc's xml element) and then I should be able to push this dump back
to Solr to get data indexed again, I hope.
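
A crude first step toward that dump is just paging through the select
handler, e.g.:

curl 'http://localhost:8983/solr/select?q=*:*&start=0&rows=1000&fl=*&wt=xml'

Only stored fields come back this way, and the output is the response
format (response/result/doc), not the add/doc update format, so it still
needs the massaging step before it can be posted back.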

Thanks!
- Pulkit

On Sun, Oct 9, 2011 at 2:57 PM, Gora Mohanty  wrote:

> On Mon, Oct 10, 2011 at 1:17 AM, Pulkit Singhal 
> wrote:
> > Hello Folks,
> >
> > I'm a big DIH fan but I'm fairly sure that now I've run into a scenario
> > where it can't help me anymore ... but before I give up and roll my own
> > solution, I jsut wanted to check with everyone else.
> >
> > The scenario:
> > - already have 1M+ documents indexed
> > - the schema.xml needs to have one more field added to it ...
> > problem/do-able? yes? no? remove all the old data? or do the update per
> doc
> > (add/delete)?
>
> This is independent of DIH. If you want to add a new field to the schema,
> you should reindex. 1M documents should not take that long.
>
> > - need to populate data from a file that has a key and value per line and
> i
> > need to use the key to find the doc to update and then add the value to
> the
> > new schema field
>
> It is best just to reindex, but it should be possible to write a script to
> pull
> the doc from the existing Solr index, massage the return format into
> Solr's XML format, adding a value for the new field in the process, and
> then posting the new file to Solr for indexing.
>
> Regards,
> Gora
>


Re: Interesting DIH challenge

2011-10-09 Thread Pulkit Singhal
Oh also: Does DIH have any experimental way for folks to be reading data
from one solr core and then massaging it and importing it into another core?
If not, then would that be a good addition or just a waste of time for some
architectural reason?
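
For what it's worth, there is an open issue, SOLR-1499 (SolrEntityProcessor),
aimed at exactly this: a DIH entity that reads documents out of another Solr
instance. Going by the patch, a data-config entity would look roughly like
the following (I have not verified the attribute names):

<entity name="sourceCore"
        processor="SolrEntityProcessor"
        url="http://localhost:8983/solr/oldcore"
        query="*:*"/>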

On Sun, Oct 9, 2011 at 8:00 PM, Pulkit Singhal wrote:

> @Gora Thank You!
>
> I know that Solr accepts xml with Solr specific elements that are commands
> that only it understands ... such as ,  etc.
>
> Question: Is there some way to ask Solr to dump out whatever it has in its
> index already ... as a Solr xml document?
>
> Plan: I intend to message that xml dump (add the field + value that I need
> in every doc's xml element) and then I should be able to push this dump back
> to Solr to get data indexed again, I hope.
>
> Thanks!
> - Pulkit
>
>
> On Sun, Oct 9, 2011 at 2:57 PM, Gora Mohanty  wrote:
>
>> On Mon, Oct 10, 2011 at 1:17 AM, Pulkit Singhal 
>> wrote:
>> > Hello Folks,
>> >
>> > I'm a big DIH fan but I'm fairly sure that now I've run into a scenario
>> > where it can't help me anymore ... but before I give up and roll my own
>> > solution, I jsut wanted to check with everyone else.
>> >
>> > The scenario:
>> > - already have 1M+ documents indexed
>> > - the schema.xml needs to have one more field added to it ...
>> > problem/do-able? yes? no? remove all the old data? or do the update per
>> doc
>> > (add/delete)?
>>
>> This is independent of DIH. If you want to add a new field to the schema,
>> you should reindex. 1M documents should not take that long.
>>
>> > - need to populate data from a file that has a key and value per line
>> and i
>> > need to use the key to find the doc to update and then add the value to
>> the
>> > new schema field
>>
>> It is best just to reindex, but it should be possible to write a script to
>> pull
>> the doc from the existing Solr index, massage the return format into
>> Solr's XML format, adding a value for the new field in the process, and
>> then posting the new file to Solr for indexing.
>>
>> Regards,
>> Gora
>>
>
>


Re: Replication fails in SolrCloud

2011-11-08 Thread Pulkit Singhal
@Prakash: Can your please format the body a bit for readability?

@Solr-Users: Is anybody else having any problems when running Zookeeper
from the latest code in the trunk(4.x)?

On Mon, Nov 7, 2011 at 4:44 PM, prakash chandrasekaran <
prakashchandraseka...@live.com> wrote:

>
> hi all, i followed the steps in
> http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
> and created "Two shard cluster with shard replicas and zookeeper ensemble",
> and then for Solr Replication i followed the steps in
> http://wiki.apache.org/solr/SolrReplication ..
> now after server start, when the slave tries to pull data from the master
> i am seeing the error messages below:
>
> org.apache.solr.common.SolrException log
> SEVERE: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does
> not support getConfigDir() - likely, what you are trying to do is not
> supported in ZooKeeper mode
>     at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
>     at org.apache.solr.handler.ReplicationHandler.getConfFileInfoFromCache(ReplicationHandler.java:378)
>     at org.apache.solr.handler.ReplicationHandler.getFileList(ReplicationHandler.java:364)
>
> i have a few questions regarding this:
> 1) Does Solr Cloud support Replication?
> 2) or do we need to follow different steps to achieve Replication in Solr Cloud?
>
> Thanks, prakash
>
> > From: prakashchandraseka...@live.com
> > To: solr-user@lucene.apache.org
> > Subject: Zookeeper aware Replication in SolrCloud
> > Date: Fri, 4 Nov 2011 03:36:27 +
> >
> >
> >
> > hi,
> > i m using SolrCloud and i wanted to add Replication feature to it ..
> > i followed the steps in Solr Wiki .. but when the client tried to poll
> for data from server i got below Error Message ..
> > in Master LogNov 3, 2011 8:34:00 PM
> >
> > in Slave logNov 3, 2011 8:34:00 PM
> org.apache.solr.handler.ReplicationHandler doFetchSEVERE: SnapPull failed
> org.apache.solr.common.SolrException: Request failed for the url
> org.apache.commons.httpclient.methods.PostMethod@18eabf6at
> org.apache.solr.handler.SnapPuller.getNamedListResponse(SnapPuller.java:197)
> at org.apache.solr.handler.SnapPuller.fetchFileList(SnapPuller.java:219)
>  at
> org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:281)
>   at
> org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:284)
> > but i could see the slave pointing to correct master from link :
> http://localhost:7574/solr/replication?command=details
> > i m also seeing these values in replication details link .. (
> http://localhost:7574/solr/replication?command=details)
> > Thu Nov 03 20:28:00 PDT
> 2011Thu Nov 03 20:27:00 PDT 2011Thu Nov 03 20:26:00
> PDT 2011Thu Nov 03 20:25:00 PDT 2011  name="replicationFailedAtList"> Thu Nov 03 20:28:00 PDT 2011
> Thu Nov 03 20:27:00 PDT 2011 Thu Nov 03 20:26:00 PDT
> 2011 Thu Nov 03 20:25:00 PDT 2011
> >
> >
> > Thanks,Prakash
>


Re: Error while trying to load JSON

2012-03-16 Thread Pulkit Singhal
It seems that you are using the bbyopen data. If you have made up your mind
on using the JSON data then simply store it in ElasticSearch instead of Solr,
as it does take any valid JSON structure. Otherwise, you can download the
xml archive from bbyopen and prepare a schema:
xml archive from bbyopen and prepare a schema:

Here are some generic instructions to familiarize you with building schema
given arbitrary data, it should help speed things up, they don't apply
directly to bbyopen data though:
http://pulkitsinghal.blogspot.com/2011/10/import-dynamic-fields-from-xml-into.html
http://pulkitsinghal.blogspot.com/2011/09/import-data-from-amazon-rss-feeds-into.html

Keep in mind, ES also does you a favor by building the right schema
dynamically on the fly as you feed it the JSON data. So it is much easier
to work with.

On Fri, Mar 16, 2012 at 1:26 PM, Erick Erickson wrote:

> bq: Shouldn't it be able to take any valid JSON structure?
>
> No, that was never the intent. The intent here was just to provide
> a JSON-compatible format for indexing data for those who
> don't like/want to use XML or SolrJ or Solr doesn't index arbitrary
> XML either. And I have a hard time imagining what the
> schema.xml file would look like when trying to map
> arbitrary JSON (or XML or) into fields.
>
> Best
> Erick
>
> On Fri, Mar 16, 2012 at 12:54 PM, Chambeda  wrote:
> > Ok, so my issue is that it must be a flat structure.  Why isn't the JSON
> > parser able to deconstruct the object into a flatter structure for
> indexing?
> > Shouldn't it be able to take any valid JSON structure?
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Error-while-trying-to-load-JSON-tp3832518p3832611.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Schema error unknown field

2010-02-18 Thread Pulkit Singhal
I'm getting the following exception
SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'desc'

I'm wondering what I need to do in order to add the "desc" field to
the Solr schema for indexing?


@Field annotation support

2010-02-18 Thread Pulkit Singhal
Hello All,

When I use Maven or Eclipse to try and compile my bean which has the
@Field annotation as specified in http://wiki.apache.org/solr/Solrj
page ... the compiler doesn't find any class to support the
annotation. What jar should we use to bring in this custom Solr
annotation?


Re: Schema error unknown field

2010-02-18 Thread Pulkit Singhal
I guess my n00b-ness is showing :)

I started off using the instructions directly from
http://wiki.apache.org/solr/Solrj and there was no mention of schema
there and even after getting this error and searching for schema.xml
in the wiki ... I found no meaningful hits so I thought it best to
ask.

With your advice, I searched for schema.xml and found 13 instances of it:

\solr_1.4.0\client\ruby\solr-ruby\solr\conf\schema.xml
\solr_1.4.0\client\ruby\solr-ruby\test\conf\schema.xml
\solr_1.4.0\contrib\clustering\src\test\resource\schema.xml
\solr_1.4.0\contrib\extraction\src\test\resource\schema.xml
\solr_1.4.0\contrib\velocity\src\main\solr\conf\schema.xml
\solr_1.4.0\example\example-DIH\solr\db\conf\schema.xml
\solr_1.4.0\example\example-DIH\solr\mail\conf\schema.xml
\solr_1.4.0\example\example-DIH\solr\rss\conf\schema.xml
\solr_1.4.0\example\multicore\core0\conf\schema.xml
\solr_1.4.0\example\multicore\core1\conf\schema.xml
\solr_1.4.0\example\solr\conf\schema.xml
\solr_1.4.0\src\test\test-files\solr\conf\schema.xml
\solr_1.4.0\src\test\test-files\solr\shared\conf\schema.xml

I took a wild guess and added the field I wanted ("desc") into this
file since its name seemed to be the most generic one:
C:\apps\solr_1.4.0\example\solr\conf\schema.xml

And it worked ... a bit strange that an example directory is used but
I suppose it is configurable somewhere?
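
If I understand the startup correctly, the example Jetty simply uses ./solr
as the Solr home unless told otherwise, and pointing it elsewhere is just a
system property, e.g.:

java -Dsolr.solr.home=/path/to/my/solr/home -jar start.jar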

Thanks for your help Erick!

Cheers,
- Pulkit

On Thu, Feb 18, 2010 at 9:53 AM, Erick Erickson  wrote:
> Add desc as a <field> in your schema.xml
> file would be my first guess.
>
> Providing some explanation of what you're trying to do
> would help diagnose your issues.
>
> HTH
> Erick
>
> On Thu, Feb 18, 2010 at 12:21 PM, Pulkit Singhal 
> wrote:
>
>> I'm getting the following exception
>> SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'desc'
>>
>> I'm wondering what I need to do in order to add the "desc" field to
>> the Solr schema for indexing?
>>
>


Run Solr within my war

2010-02-18 Thread Pulkit Singhal
Hello Everyone,

I do NOT want to host Solr separately. I want to run it within my war
with the Java Application which is using it. How easy/difficult is
that to setup? Can anyone with past experience on this topic, please
comment.

thanks,
- Pulkit


Re: Run Solr within my war

2010-02-18 Thread Pulkit Singhal
Yeah I have been pitching that but I want all the functionality of
Solr in a small package; scalability is not a concern given the
specifically limited data set being searched upon. I understand that
the # of users is still another part of this equation but there just
aren't that many at this time and having it separate will add to
deployment complexity and kill the product before it ever takes off.
Adoption is key for me.

On Thu, Feb 18, 2010 at 2:25 PM, Dave Searle  wrote:
> Why would you want to? Surely having it separate increases scalability?
>
> On 18 Feb 2010, at 22:23, "Pulkit Singhal" 
> wrote:
>
>> Hello Everyone,
>>
>> I do NOT want to host Solr separately. I want to run it within my war
>> with the Java Application which is using it. How easy/difficult is
>> that to setup? Can anyone with past experience on this topic, please
>> comment.
>>
>> thanks,
>> - Pulkit
>


Re: Run Solr within my war

2010-02-19 Thread Pulkit Singhal
Using EmbeddedSolrServer is a client side way of communicating with
Solr via the file system. Solr has to still be up and running before
that. My question is more along the lines of how to take the server
jars that provide the core functionality and bundle them so that Solr
starts up within the same war as the application that will be
communicating with it as the client.

On Thu, Feb 18, 2010 at 5:49 PM, Richard Frovarp  wrote:
> On 2/18/2010 4:22 PM, Pulkit Singhal wrote:
>>
>> Hello Everyone,
>>
>> I do NOT want to host Solr separately. I want to run it within my war
>> with the Java Application which is using it. How easy/difficult is
>> that to setup? Can anyone with past experience on this topic, please
>> comment.
>>
>> thanks,
>> - Pulkit
>>
>>
>
> So basically you're talking about running an embedded version of Solr like
> the EmbeddedSolrServer? I have no experience on this, but this should
> provide you the correct search term to find documentation on use. From what
> little code I've seen to run test cases against Solr, it looks relatively
> straight forward to get running. To use you would use the SolrJ library to
> communicate with the embedded solr server.
>
> Richard
>
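
For reference, the embedded setup on the SolrJ wiki boils down to something
like the sketch below (paths are placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.core.CoreContainer;

public class EmbeddedExample {
  public static void main(String[] args) throws Exception {
    // point at a normal Solr home directory (conf/, schema.xml, solrconfig.xml);
    // the core runs in-process, there is no separately hosted Solr in this mode
    System.setProperty("solr.solr.home", "/path/to/solr/home");
    CoreContainer.Initializer initializer = new CoreContainer.Initializer();
    CoreContainer coreContainer = initializer.initialize();
    // "" selects the default core; use it like any other SolrServer
    SolrServer server = new EmbeddedSolrServer(coreContainer, "");
  }
}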


Re: @Field annotation support

2010-02-19 Thread Pulkit Singhal
Ok then, is this the correct class to support the @Field annotation?
Because I have it on the path but it's not working.

org\apache\solr\solr-solrj\1.4.0\solr-solrj-1.4.0.jar/org\apache\solr\client\solrj\beans\Field.class
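
For Maven builds, the same jar can be pulled in with a dependency along
these lines (coordinates inferred from the path above):

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>1.4.0</version>
</dependency>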

2010/2/18 Noble Paul നോബിള്‍  नोब्ळ् :
> solrj jar
>
> On Thu, Feb 18, 2010 at 10:52 PM, Pulkit Singhal
>  wrote:
>> Hello All,
>>
>> When I use Maven or Eclipse to try and compile my bean which has the
>> @Field annotation as specified in http://wiki.apache.org/solr/Solrj
>> page ... the compiler doesn't find any class to support the
>> annotation. What jar should we use to bring in this custom Solr
>> annotation?
>>
>
>
>
> --
> -
> Noble Paul | Systems Architect| AOL | http://aol.com
>