Thanks, Mark! Do you have any comments on the performance difference between indexing from TSV files versus directly indexing each document via HTTP POST?
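For concreteness, here is a minimal sketch of the two paths, assuming Python 3 and a Solr core at localhost:8983; the field names and file path below are placeholders, not from this thread:

import urllib.request

SOLR = "http://localhost:8983/solr"

def post_one_doc(doc_id, title):
    # Per-document path: one HTTP POST per document to the XML update
    # handler. Real code should XML-escape the field values.
    xml = ('<add><doc><field name="id">%s</field>'
           '<field name="title">%s</field></doc></add>' % (doc_id, title))
    req = urllib.request.Request(
        SOLR + "/update",
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"})
    urllib.request.urlopen(req).read()

def post_tsv_file(path):
    # Bulk path: one request for a whole file via the CSV update
    # handler. separator=%09 (tab) makes the CSV loader read TSV, and
    # stream.file reads the file from the Solr server's filesystem.
    url = (SOLR + "/update/csv?separator=%09&commit=true"
           "&stream.file=" + path)
    urllib.request.urlopen(url).read()

post_one_doc("doc-1", "some title")   # repeated per document, plus a <commit/>
post_tsv_file("/data/docs.tsv")       # one call per TSV file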
-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: help required: how to design a large scale solr system

From my limited experience: I think you might have a bit of trouble getting 60 million docs onto a single machine. Cached queries will probably still be *very* fast, but non-cached queries are going to be very slow in many cases. Is that 5 seconds for all queries? You will never meet that on first-run queries with 60 million docs on that machine. The light query load might make things workable... but you're near the limits of a single machine (4 cores or not) with 60 million. You want to use a very good stopword list - common-term queries will be killers. The docs being so small will be your only possible savior if you go the one-machine route - that and cached hits. You don't have enough RAM to get as much of the filesystem into RAM as you'd like for 60 million docs either.

I think you might try two machines with 30, 3 with 20, or 4 with 15. The more you spread, even with slower machines, the faster you're likely to index, which, as you say, will take a long time for 60 million docs (start today <g>). Multiple machines will help the indexing speed the most for sure - it's still going to take a long time. I don't think you will get much advantage from using more than one Solr install on a single machine - if you do, that should be addressed in the code, even with RAID. So I say, spread if you can: faster indexing, faster search, easy to expand later.

Distributed search is so easy with Solr 1.3, you won't regret it. I think there is a bug to be addressed if you need this in a week, though - in my experience with distributed search, for every million docs on a machine beyond the first, you lose a doc in a search across all machines (i.e. with 1 million on machine 1 and 1 million on machine 2, a *:* search will be missing 1 doc; with 10 million each on 3 machines, a *:* search will be missing 30). Not a big deal, but could be a concern for some with picky, look-at-everything customers.

- Mark
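For illustration, a minimal sketch of the Solr 1.3 distributed query Mark describes - the hostnames and the *:* query below are placeholders:

import urllib.parse
import urllib.request

# The shards parameter lists every shard whose results should be
# merged; the request may be sent to any one of the listed instances.
params = urllib.parse.urlencode({
    "q": "*:*",
    "shards": "host1:8983/solr,host2:8983/solr",
})
response = urllib.request.urlopen("http://host1:8983/solr/select?" + params)
print(response.read()[:500])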
Ben Shlomo, Yatir wrote:
> Hi!
>
> I am already using Solr 1.2 and am happy with it.
>
> In a new project with a very tight deadline (10 development days from today) I need to set up a more ambitious system in terms of scale. Here is the spec:
>
> * I need to index about 60,000,000 documents
> * Each document has 11 textual fields to be indexed & stored, and 4 more fields to be stored only
> * Most fields are short (2-14 characters); however, 2 indexed fields can be up to 1KB, and another stored field is up to 1KB
> * On average, every document is about 0.5KB to be stored and 0.4KB to be indexed
> * The SLA for data freshness is a full nightly re-index (I cannot obtain incremental update/delete lists of the modified documents)
> * The SLA for query time is 5 seconds
> * The number of expected queries is 2-3 queries per second
> * The queries are simple - a combination of Boolean operations and name searches (no fancy fuzzy searches or Levenshtein distances, no faceting, etc.)
> * I have a 64-bit Dell 2950 4-CPU machine (2 dual cores) with RAID 10, 200GB of HD space, and 8GB of RAM
> * The documents are not given to me explicitly - I am given raw documents in RAM, one by one, from which I create my document in RAM, and then I can either http-post it to index it directly or append it to a TSV file for later indexing
> * Each document has a unique ID
>
> I have a few directions I am thinking about.
>
> The simple approach:
> * Have one Solr instance index the entire document set (from files). I am afraid this will take too much time.
>
> Direction 1:
> * Create TSV files from all the documents - this will take around 3-4 hours
> * Have all the documents partitioned into several subsets (how many should I choose?)
> * Have multiple Solr instances on the same machine
> * Let each Solr instance concurrently index the appropriate subset
> * At the end, merge all the indices using the IndexMergeTool (how much time will that take?)
>
> Direction 2:
> * Like the previous, but instead of using the IndexMergeTool, use distributed search with shards (upgrading to Solr 1.3)
>
> Directions 3, 4:
> * Like the previous directions, only avoid using TSV files at all and directly index the documents from RAM
>
> Questions:
> * Which direction do you recommend in order to meet the SLAs in the fastest way?
> * Since I have RAID on the machine, can I gain performance by using multiple Solr instances on the same machine, or will only multiple machines help me?
> * What's the minimal number of machines I should require? (I might get more, weaker machines)
> * How many concurrent indexers are recommended?
> * Do you agree that the bottleneck is the indexing time?
>
> Any help is appreciated.
>
> Thanks in advance,
> yatir
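On the open question in Direction 1 of how to partition the subsets: one common approach (not from this thread - shard_for and NUM_SHARDS are hypothetical names) is to hash each document's unique ID, so the same ID always routes to the same subset and the subsets stay roughly equal in size:

import hashlib

NUM_SHARDS = 4  # e.g. the "4 machines with 15 million docs" split above

def shard_for(doc_id):
    # md5 gives a stable, evenly distributed hash of the unique ID.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Append each raw document to the TSV file (or post it to the Solr
# instance) numbered shard_for(doc_id).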