Lucene Sorting

2013-04-17 Thread pankaj.pandey4
Hi,

We are facing a sorting issue with data indexed using Solr. Below is the sample 
code. The problem is that the results returned by this code are not sorted, 
i.e. there is no discernible ordering. Can anyone assist me with this?

TopDocs topDocs = null;
Directory directory = FSDirectory.open(indexDir);
IndexSearcher searcher = new IndexSearcher(IndexReader.open(directory));
// Sort on a single field as a string, optionally in reverse order.
Sort column = new Sort(new SortField(sortColumn, SortField.STRING, reverse));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser queryParser = new QueryParser(Version.LUCENE_36, fieldName, analyzer);
queryParser.setAllowLeadingWildcard(true);
queryParser.setDefaultOperator(Operator.AND);
topDocs = searcher.search(queryParser.parse(queryStr), filter, maxHits, column);

Thanks!

Regards,
Pankaj

Please do not print this email unless it is absolutely necessary.

The information contained in this electronic message and any attachments to 
this message are intended for the exclusive use of the addressee(s) and may 
contain proprietary, confidential or privileged information. If you are not the 
intended recipient, you should not disseminate, distribute or copy this e-mail. 
Please notify the sender immediately and destroy all copies of this message and 
any attachments.

WARNING: Computer viruses can be transmitted via email. The recipient should 
check this email and any attachments for the presence of viruses. The company 
accepts no liability for any damage caused by any virus transmitted by this 
email.

www.wipro.com


Billion document index

2013-05-14 Thread pankaj.pandey4
Hi,

We have to set up a billion-document index using Apache Solr (about 2 billion 
docs). I need some assistance choosing the right configuration for the 
environment. I have been going through the Solr documentation, but couldn't 
figure out what the best configuration would be.

Below is the configuration of one of the boxes we have. Can you please advise 
whether it will meet our requirements?
OS: SunOS
RAM: 32GB
Processor: 4 (SPARC64-VII+/2660Mhz)
Server: Sun SPARC Enterprise M5000

Desired System Requirements:
Expected number of requests/day: 10,000
New documents ingested/day: 1 million

I have the following questions:

1.   Does the above system configuration seem OK for our requirements?

2.   Is it OK to host the entire index on a single physical box, or should I 
use multiple physical boxes?

3.   Should I go for the simple installation or SolrCloud?

4.   If I use SolrCloud, I may have to use some master/slave setup (not sure 
though). What I have in mind is to use the master for ingesting new documents 
and the slaves for querying, then run replication at the end of the day to 
update the slaves. Can you please advise whether this is a good approach and, 
most importantly, whether it is feasible?

On Google, I found that many people have already set up such environments, 
but I couldn't figure out the configurations they are using. If someone can 
share their experience, it will probably help others as well.

Thanks!

Regards,
Pankaj




RE: Billion document index

2013-05-15 Thread pankaj.pandey4
Thanks, Shawn, for explaining everything in such detail; it was really helpful.

I have a few more questions on the same topic. Can you please explain the 
purpose of the 3rd box in the minimal configuration, the one with the 
standalone ZooKeeper?

On a separate note, if we go ahead with 4 boxes (8 shards with a replication 
factor of 2 for each):
1. Would it be OK to keep the replicas on the same box, or would we 
need a separate box for them?
2. Is the above configuration sufficient to guarantee failover 
and high availability?
3. How can I configure my application to always query against the 
replicas and let the master be used only for ingestion? The replicas will be 
synced with the master after working hours (overnight).


Regards,
Pankaj

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: Wednesday, May 15, 2013 12:01 PM
To: solr-user@lucene.apache.org
Subject: Billion document index

On 5/14/2013 11:00 PM, pankaj.pand...@wipro.com wrote:
> We have to setup a billion document index using Apache Solr(about 2 billion 
> docs). I need some assistance on choosing the right configuration for the 
> environment setup. I have been going through the Solr documentation, but 
> couldn't figure out what would be the best configuration for same.
>
> Below is the configuration on one of the box we have, can you please assist 
> if this will suffice our requirement.
> OS: SunOS
> RAM : 32GB
> Processor: 4 (SPARC64-VII+/2660Mhz)
> Server: Sun SPARC Enterprise M5000
>
> Desired System Requirements:
> Expected number of requests/day: 10,000
> New Documents ingested/day: 1Million
>
> I have below questions on same:
>
> 1.   Does the above system configuration seems ok for our requirement?
>
> 2.   Is it ok if we host entire index on a single physical box or I 
> should use multiple physical box?
>
> 3.   Should I go for the simple installation or SolrCloud?
>
> 4.   If I should use SolrCloud, then probably I may have to use some 
> master/slave setup(not sure though)? What I have in mind is to use master for 
> ingestion of new documents and slave for querying. Then at the end of the day 
> I can have the replication to update the slaves? Can you please advise if 
> this is a good approach and most importantly if this is feasible?
>
> On google, I could find that many people have already setup such environment, 
> but I couldn't figure out the configuration they are using. If some can share 
> their experience, then it will probably help others as well.

There's a lot of information here for you to digest.  Be sure to read all the 
way to the end.  The numbers (and the costs associated with those numbers) 
might scare you.

I have no idea what's going to be in your index.  I even looked up the website 
for your email domain, and still don't know what you might be trying to search.

Because it uses a 32-bit signed number to track things, a single Solr index 
(not sharded) is limited to a little more than 2 billion documents.  This means 
you'll want to use distributed search (sharding) from the beginning.  For new 
deployments, SolrCloud is much better than trying to handle sharding yourself.  
You'll probably want 8 or more shards and a minimum replication factor of 2, so 
that you have two copies of every shard.  That doesn't necessarily mean that 
you'll need that many machines, but you might want to plan on at least four of 
them.  You'll probably be putting more than one shard per server.
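
As a rough illustration of the arithmetic in the paragraph above, the sketch 
below works out how close 2 billion documents is to the 32-bit signed limit, 
and what 8 shards with a replication factor of 2 would mean on four servers. 
The class and method names are hypothetical (they are not part of Solr), and 
the even spread of documents and shard copies is an assumption.

```java
// Sketch only: ShardPlan and its methods are hypothetical, not a Solr API.
final class ShardPlan {
    // Largest value of a 32-bit signed number; a single index cannot track
    // more documents than roughly this.
    static final long MAX_DOCS_PER_INDEX = Integer.MAX_VALUE;

    // Documents per shard, assuming an even spread (rounded up).
    static long docsPerShard(long totalDocs, int shards) {
        return (totalDocs + shards - 1) / shards;
    }

    // Shard copies hosted per server, assuming an even spread (rounded up).
    static int coresPerServer(int shards, int replicationFactor, int servers) {
        int totalCopies = shards * replicationFactor;
        return (totalCopies + servers - 1) / servers;
    }

    public static void main(String[] args) {
        long totalDocs = 2_000_000_000L; // figure from the original question
        System.out.println("Limit for one index: " + MAX_DOCS_PER_INDEX);
        System.out.println("Headroom at 2 billion docs: "
                + (MAX_DOCS_PER_INDEX - totalDocs));
        System.out.println("Docs per shard (8 shards): "
                + docsPerShard(totalDocs, 8));
        System.out.println("Shard copies per server (8 x 2 on 4 servers): "
                + coresPerServer(8, 2, 4));
    }
}
```

With daily ingestion of a million documents, the headroom figure shows why a 
single unsharded index is not a workable starting point at this scale.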

SolrCloud is a true cluster - there is no master and no slaves.  Both indexing 
and queries are completely distributed.  Clients that you write in Java get 
these distributed features with no extra requirements; non-Java clients will 
require some form of external load balancing to ensure that they are always 
talking to a server that's up.

The absolute minimum number of physical machines you need for SolrCloud is 
three.  Two of those need to be the beefy workhorses.  Each of them will run 
Solr and a standalone ZooKeeper.  The third machine can be modest and will just 
run a third instance of ZooKeeper.  If you have more than two servers that will 
run Solr, then you can just run the standalone ZooKeeper on three of them and 
won't need any extra hardware.
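
The reason the count is three rather than two is that a ZooKeeper ensemble 
only keeps working while a strict majority of its instances are up. The sketch 
below is a minimal illustration of that majority rule; the class and method 
names are hypothetical and it does not talk to ZooKeeper at all.

```java
// Sketch only: QuorumMath is a hypothetical helper, not a ZooKeeper API.
final class QuorumMath {
    // Smallest strict majority of an ensemble of the given size.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of instances that can be lost while a majority still remains.
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " instance(s): quorum = " + quorumSize(n)
                    + ", can lose " + tolerableFailures(n));
        }
    }
}
```

Two instances tolerate no failures at all, while three tolerate one, which is 
why the third, modest machine is worth having.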

Memory is going to be your real problem with a very large index.  When it comes 
to the amount of required memory, you might want to read this wiki page, then 
come back here:

http://wiki.apache.org/solr/SolrPerformanceProblems

I really was serious about reading that page, and not just because I wrote it.  
The information you'll find there is key to understanding the scale of what you 
propose and what I'm going to say below.

Even with very small documents, an index with 2 billion of them is probably 
going to be at least 100GB, and quite possibly 300GB, 500GB, or larger.
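
To put those sizes in perspective, here is a small back-of-the-envelope 
sketch of the average index space per document that each figure implies for 
2 billion documents. The names are hypothetical, and treating a gigabyte as 
10^9 bytes is an assumption made only to keep the arithmetic simple.

```java
// Sketch only: IndexSizeEstimate is a hypothetical helper for rough sizing.
final class IndexSizeEstimate {
    // Assumption: decimal gigabytes (10^9 bytes).
    static final long BYTES_PER_GB = 1_000_000_000L;

    // Average index bytes per document for a given total index size.
    static long bytesPerDoc(long indexSizeGb, long docs) {
        return indexSizeGb * BYTES_PER_GB / docs;
    }

    public static void main(String[] args) {
        long docs = 2_000_000_000L; // figure from the original question
        for (long sizeGb : new long[] {100, 300, 500}) {
            System.out.println(sizeGb + "GB index: about "
                    + bytesPerDoc(sizeGb, docs) + " bytes per document");
        }
    }
}
```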

For discussion purposes, let's say that you've got the extremely conservative 
index size of 100GB and you're going to put that on four servers.  To cache 
this in