Re: SolrCloud DIH issue

2015-09-20 Thread Upayavira
It is worth noting that the ref guide page on configsets refers to
non-cloud mode (a useful new feature) whereas people may confuse this
with configsets in cloud mode,  which use Zookeeper.

Upayavira

On Sun, Sep 20, 2015, at 04:59 AM, Ravi Solr wrote:
> Can't thank you enough for clarifying it at length. Yeah, it's pretty
> confusing even for experienced Solr users. I used the upconfig and
> linkconfig commands to update 4 collections into Zookeeper... As you
> described, I lucked out as I used the same name for the configset and the
> collection and hence did not have to use the collections API :-)
> 
> Thanks,
> 
> Ravi Kiran Bhaskar
> 
> On Sat, Sep 19, 2015 at 11:22 PM, Erick Erickson
> 
> wrote:
> 
> > Let's back up a second. Configsets are what _used_ to be in the conf
> > directory for each core on a local drive, it's just that they're now
> > kept up on Zookeeper. Otherwise, you'd have to put them on each
> > instance in SolrCloud, and bringing up a new replica on a new machine
> > would look a lot like adding a core with the old core admin API.
> >
> > So instead, configurations are kept on zookeeper. A config set
> > consists of, essentially, a named old-style "conf" directory. There's
> > no a-priori limit to the number of config sets you can have. Look in
> > the admin UI, Cloud>>tree>>configs and you'll see each name you've
> > pushed to ZK. If you explore that tree, you'll see a lot of old
> > familiar faces, schema.xml, solrconfig.xml, etc.
> >
> > So now we come to associating configs with collections. You've
> > probably done one of the examples where some things happen under the
> > covers, including explicitly pushing the configset to Zookeeper.
> > Currently, there's no option in the bin/solr script to push a config,
> > although I know there's a JIRA to do that.
> >
> > So, to put a new config set up you currently need to use the zkcli.sh
> > script (see
> > https://cwiki.apache.org/confluence/display/solr/Command+Line+Utilities),
> > specifically the "upconfig" command. That pushes the configset up to ZK
> > and gives it a name.
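> >
> > A bare-bones sketch of that call (the script location varies by Solr
> > version, and the zkhost, directory, and configset name here are just
> > placeholders):
> >
> >   ./server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
> >     -cmd upconfig -confdir /path/to/myconf/conf -confname myconf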
> >
> > Now, you create a collection and it needs a configset stored in ZK.
> > It's a little tricky in that if you do _not_ explicitly specify a
> > configset (using the collection.configName parameter to the
> > collections API CREATE command), then by default it'll look for a
> > configset with the same name as the collection. If it doesn't find
> > one, _and_ there is one and only one configset, then it'll use that
> > one (personally I find that confusing, but that's the way it works).
> > See: https://cwiki.apache.org/confluence/display/solr/Collections+API
> >
> > If you have two or more configsets in ZK, then either the configset
> > name has to be identical to the collection name (if you don't specify
> > collection.configName), _or_ you specify collection.configName at
> > create time.
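> >
> > For example, something along these lines (collection name, configset
> > name, and host are placeholders):
> >
> >   # create a collection and bind it explicitly to the "myconf" configset
> >   curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2&collection.configName=myconf"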
> >
> > NOTE: there are _no_ config files on the local disk! When a replica of
> > a collection loads, it "knows" what collection it's part of and pulls
> > the corresponding configset from ZK.
> >
> > So typically the process is this.
> > > you create the config set by editing all the usual suspects, schema.xml,
> > solrconfig.xml, DIH config etc.
> > > you put those configuration files into some version control system (you
> > are using one, right?)
> > > you push the configs to Zookeeper
> > > you create the collection
> > > you figure out you need to change the configs so you
> >   > check the code out of your version control
> >   > edit them
> >   > put the current version back into version control
> >   > push the configs up to zookeeper, overwriting the ones already
> > there with that name
> >   > reload the collection or bounce all the servers. As each replica
> > in the collection comes up,
> >  it downloads the latest configs from Zookeeper to memory (not to
> > disk) and uses them.
> >
> > Seems like a long drawn-out process, but pretty soon it's automatic.
> > And really, the only extra step is the push to Zookeeper, the rest is
> > just like old-style cores with the exception that you don't have to
> > manually push all the configs to all the machines hosting cores.
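> >
> > In shell terms, that edit-and-redeploy loop might look roughly like this
> > (same placeholder names as above):
> >
> >   # overwrite the existing configset with the edited files
> >   ./server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
> >     -cmd upconfig -confdir /path/to/myconf/conf -confname myconf
> >   # have every replica in the collection pick up the new configs
> >   curl "http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection"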
> >
> > Notice that I have mostly avoided talking about "cores" here. Although
> > it's true that a replica in a collection is just another core, it's
> > "special" in that it has certain very specific properties set. I
> > _strongly_ advise you stop thinking about old-style Solr cores and
> > instead think about collections and replicas. And above all, do _not_
> > use the admin core API to try to create members of a collection
> > (cores), use the collections API to ADDREPLICA/DELETEREPLICA instead.
> > Loading/unloading cores is less "fraught", but I try to avoid that too
> > and use
> >
> > Best,
> > Erick
> >
> > On Sat, Sep 19, 2015 at 9:08 PM, Ravi Solr  wrote:
> > > Thanks Erick, I will report back once the reindex is finished. Oh, your
> > > answer reminded me of another question - Regarding configs

Re: SolrCloud DIH issue

2015-09-20 Thread Ravi Solr
Yes Upayavira, that's exactly what prompted me to ask Erick as soon as I
read https://cwiki.apache.org/confluence/display/solr/Config+Sets

Erick, regarding my delta-import not working: I do see the
dataimport.properties in Zookeeper after I "upconfig" and "linkconfig" my
conf files into ZK... see below

[zk: localhost: (CONNECTED) 0] ls /configs/xx
[admin-extra.menu-top.html, person-synonyms.txt, entity-stopwords.txt,
protwords.txt, location-synonyms.txt, solrconfig.xml,
organization-synonyms.txt, stopwords.txt, spellings.txt,
dataimport.properties, admin-extra.html, xslt, synonyms.txt, scripts.conf,
subject-synonyms.txt, elevate.xml, admin-extra.menu-bottom.html,
solr-import-config.xml, clustering, schema.xml]

However, the dataimport.properties in my local 'conf' folder hasn't been
updated, even after successfully running a full-import on Sep 19 2015 1:00 AM
and a subsequent delta-import on Sep 20 2015 11 AM, which did not import
newer docs. This prompted me to look into the dataimport.properties in the
conf folder... the details are shown below, and you can clearly see the
dates are quite a bit off.

[@y conf]$ cat dataimport.properties
#Tue Sep 15 18:11:17 UTC 2015
reindex-docs.last_index_time=2015-09-15 18\:11\:16
last_index_time=2015-09-15 18\:11\:16
sep.last_index_time=2014-03-24 13\:41\:46
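
For comparison, the copy in ZK can be dumped like this (same zkcli.sh I used
for upconfig; the script path and zkhost are placeholders for my setup):

  ./server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:2181 \
    -cmd get /configs/xx/dataimport.properties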


I saw some JIRA tickets about a different location for dataimport.properties
in SolrCloud, but couldn't find the path where it is stored... Does anybody
have an idea where it stores it?

Thanks

Ravi Kiran Bhaskar



On Sun, Sep 20, 2015 at 5:28 AM, Upayavira  wrote:

> It is worth noting that the ref guide page on configsets refers to
> non-cloud mode (a useful new feature) whereas people may confuse this
> with configsets in cloud mode,  which use Zookeeper.
>
> Upayavira

Questions regarding indexing JSON data

2015-09-20 Thread Kevin Vasko
I am new to Apache Solr and have been struggling with indexing some JSON files.

I have several TB of twitter data in JSON format that I am having trouble 
posting/indexing. I am trying to use a schemaless schema so I don't have to add 
200+ record fields manually.

1.

The first issue is none of the records have '[' or ']' wrapped around the 
records. So it looks like this:

 { "created_at": "Sun Apr 19 23:45:45 + 2015","id": 5.899379634353e+17, 
"id_str": "589937963435302912",}


Just to validate that the schemaless portion was working, I used a single "tweet" 
and trimmed it down to the bare minimum. The brackets not being in the original 
appears to be a problem: when I tried to process just a small portion of one record, 
it required me to wrap the row in [ ] (I assume to make it an array) to index 
correctly. Like the following:

[{ "created_at": "Sun Apr 19 23:45:45 + 2015","id": 5.899379634353e+17, 
"id_str": "589937963435302912",}]

Is there a way around this? I didn't want to preprocess the TBs of JSON data 
that is in this format to add '[', ',' and ']' around all of the data.

2. 

The second issue is some of the fields have null values. 
e.g. "in_reply_to_status_id": null,

I think I figured out a way to resolve this by manually adding the field as a 
"strings" type, but if I miss one it will kick the file out. I just wanted to see 
if there was something I could add to the schemaless configuration to have it 
pick up null fields and treat them as strings automatically. Or is there a 
better way to handle this?
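
Roughly, the manual step looks like this (a sketch; the collection name is a 
placeholder, and in_reply_to_status_id is just one example field):

  curl -X POST 'http://localhost:8983/solr/tweets/schema' \
    -H 'Content-Type: application/json' \
    -d '{ "add-field": { "name": "in_reply_to_status_id", "type": "strings", "stored": true } }'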


3. 
The last issue is, I think, my most difficult one: dealing with "nested" 
or "children" fields in my JSON data.

The data looks like this: https://gist.github.com/gnip/764239. Is there any way 
to index this information, preferably automatically (schemaless method), without 
having to flatten all of my data?

Thanks.


Re: Does more shards in core improve performance?

2015-09-20 Thread Zheng Lin Edwin Yeo
I didn't find any increase in indexing throughput by adding shards on the
same machine.

However, I've managed to feed the index to Solr from more than one thread
at a time. It can take up to 3 threads without affecting the indexing
speed. With anything more than that, the CPU will hit 100% and the indexing
speed in all the threads will be reduced.

Regards,
Edwin


On 18 September 2015 at 19:38, Gili Nachum  wrote:

> If CPU is just 50% and adding a shard doesn't increase indexing throughput,
> then check for a disk bottleneck.
> On Sep 17, 2015 18:19, "Zheng Lin Edwin Yeo"  wrote:
>
> > Thank you everyone for your reply.
> >
> > > How many CPUs on that machine? How many other requests using the
> > > server?
> >
> > A) There are 8 CPUs on the machine, and there are no other requests
> > using the server. Only the indexing script is running.
> >
> > > A simple metric is to look at CPU usage on the machine: If it is near
> > > 100% when you index, you will need extra hardware to get more speed.
> > > If it is substantially less than 100%, then feed Solr from more than one
> > > thread at a time.
> >
> > A) So far, from what I observe, the CPU usage is usually around 50% to
> > 70%. It hasn't gone up to 100% yet. But I'll probably try to do sharding
> > on a different machine, as that is probably the case for the real
> > production server.
> >
> >
> > Regards,
> > Edwin
> >
> >
> > On 17 September 2015 at 19:55, Toke Eskildsen 
> > wrote:
> >
> > > On Thu, 2015-09-17 at 16:58 +0800, Zheng Lin Edwin Yeo wrote:
> > >
> > > > I was trying with 2 shards and 4 shards but all on the same machine,
> > > > and they have the same performance (no improvement in performance) as
> > > > the one with 1 shard. My machine has a 32GB RAM.
> > >
> > > As you are testing indexing speed, Shalin's post is spot-on: Sharding on
> > > the same machine won't help you. I just added my comment on search to
> > > help build a complete picture.
> > >
> > > A simple metric is to look at CPU usage on the machine: If it is near
> > > 100% when you index, you will need extra hardware to get more speed.
> > > If it is substantially less than 100%, then feed Solr from more than
> > > one thread at a time.
> > >
> > > - Toke Eskildsen, State and University Library, Denmark
> > >
> > >
> > >
> > >
> >
>


Cost of using group.cache.percent parameters in Result Grouping

2015-09-20 Thread Zheng Lin Edwin Yeo
Hi,

I've been trying to improve the speed of my Result Grouping, and I've found
that setting the parameter group.cache.percent to 100 actually does improve
the speed, especially for longer query strings.
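
For reference, the kind of grouped query I am timing looks roughly like this
(collection and field names are placeholders):

  curl 'http://localhost:8983/solr/mycollection/select?q=content:solr&group=true&group.field=category&group.cache.percent=100'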

But I would like to find out whether there is any cost in doing so, such as
in terms of memory usage, or whether there will be other side effects on
other functions in the system?


Regards,
Edwin


How can I get a monotonically increasing field value for docs?

2015-09-20 Thread Gili Nachum
I've implemented a custom solr2solr ongoing unidirectional replication
mechanism.

A Replicator (acting as a SolrJ client) crawls documents from SolrCloud1 and
writes them to SolrCloud2 in batches.
The replicator's crawl logic is to read documents with a time greater than or
equal to the time of the last replicated document.
Whenever a document is added/updated, I auto-update a tdate field
"last_updated_in_solr" using TimestampUpdateProcessorFactory.

*My problem: *When a client indexes a batch of 100 documents, all 100 docs
have the same "last_updated_in_solr" value. This makes my ongoing
replication check for new documents to replicate much more complex than if
the time value was unique.

1. Can I use some other processor to generate increasing unique values?
2. Can I use the internal _version_ field for this? Is it guaranteed to be
monotonically increasing for the entire collection, or only per document
with each add/update?
Any other options?

Schema.xml:

   <field name="last_updated_in_solr" type="tdate" indexed="true" stored="true" />

solrconfig.xml:

   <processor class="solr.TimestampUpdateProcessorFactory">
     <str name="fieldName">last_updated_in_solr</str>
   </processor>


I know there's work on a built-in replication mechanism, but it's not yet
released.
Using Solr 4.7.2.