Hi!

I am already using Solr 1.2 and am happy with it.

In a new project with a very tight deadline (10 development days from
today) I need to set up a more ambitious system in terms of scale.
Here is the spec:

 

* I need to index about 60,000,000 documents

* Each document has 11 textual fields to be indexed and stored, and 4
more fields to be stored only

* Most fields are short (2-14 characters); however, 2 of the indexed
fields can be up to 1KB each, and another stored field is up to 1KB

* On average, every document is about 0.5KB to be stored and 0.4KB to
be indexed

* The SLA for data freshness is a full nightly re-index (I cannot
obtain incremental update/delete lists of the modified documents)

* The SLA for query time is 5 seconds

* The expected query load is 2-3 queries per second

* The queries are simple: a combination of Boolean operators and name
searches (no fancy fuzzy searches or Levenshtein distances, no
faceting, etc.)

* I have a 64-bit Dell 2950 machine with 4 CPU cores (2 dual-core
CPUs), RAID 10, 200GB of disk space, and 8GB of RAM

* The documents are not given to me explicitly - I am given raw
documents in RAM, one by one, from which I create my Solr document in
RAM; I can then either HTTP-post it to index it directly or append it
to a TSV file for later indexing

* Each document has a unique ID
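
My rough back-of-envelope for disk (my own arithmetic, unmeasured):

60,000,000 docs x 0.5KB stored  = ~30GB of stored fields
60,000,000 docs x 0.4KB indexed = ~24GB of raw text to index

so the raw data is on the order of 55GB. The inverted index itself is
usually only a fraction of the raw text, but disk usage can roughly
double while segments are merged/optimized, and the old index has to
stay online during the nightly rebuild, so the 200GB disk looks
workable but not generous.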

 

I have a few directions I am thinking about:

 

The simple approach

* Have one Solr instance index the entire document set (from files). I
am afraid this will take too much time.
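
Concretely, since the CSV loader in Solr 1.2 also handles
tab-separated files, I imagine the "index from files" step as a plain
streaming POST to /update/csv, something like this (URL and paths are
made up; the first line of each TSV names the fields):

import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class PostTsv {
    public static void main(String[] args) throws Exception {
        // separator=%09 (tab) makes the CSV handler read TSV; add
        // &commit=true on the last file only, or commit separately.
        URL url = new URL("http://localhost:8983/solr/update/csv?separator=%09");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/plain; charset=utf-8");

        InputStream in = new FileInputStream(args[0]); // e.g. part-00.tsv
        OutputStream out = conn.getOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) > 0; ) {
            out.write(buf, 0, n);
        }
        out.close();
        in.close();
        System.out.println("HTTP " + conn.getResponseCode());
    }
}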

 

Direction 1

* Create TSV files from all the documents - this will take around 3-4
hours

* Have all the documents partitioned into several subsets (how many
should I choose?)

* Have multiple Solr instances on the same machine

* Let each Solr instance concurrently index the appropriate subset

* At the end, merge all the indices using the IndexMergeTool (how much
time will it take?)
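
(If I read the Lucene code right, IndexMergeTool is a thin wrapper
around IndexWriter.addIndexes(), so the merge step would look roughly
like the sketch below - index paths invented. I assume the merge is
mostly sequential I/O over the combined index size, so it should not
depend much on how many parts there are.)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeShards {
    public static void main(String[] args) throws Exception {
        // args[0] is the destination index; the rest are the parts, e.g.
        // java MergeShards /idx/merged /idx/part0 /idx/part1 ...
        IndexWriter writer = new IndexWriter(
                FSDirectory.getDirectory(args[0]), new StandardAnalyzer(), true);
        Directory[] parts = new Directory[args.length - 1];
        for (int i = 1; i < args.length; i++) {
            parts[i - 1] = FSDirectory.getDirectory(args[i]);
        }
        writer.addIndexes(parts); // same call IndexMergeTool makes
        writer.optimize();
        writer.close();
    }
}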

 

Direction 2

* Like the previous direction, but instead of using the
IndexMergeTool, use distributed search with shards (upgrading to Solr
1.3)
 

Directions 3 and 4

* Like the previous directions, only avoid using TSV files altogether
and directly index the documents from RAM
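
For the direct path I would post each document as an XML <add> message
to /update (SolrJ in 1.3 wraps the same wire format). Roughly like
this - the field names are placeholders for my real schema, and real
values would need XML-escaping:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;

public class PostOneDoc {
    static void postDoc(String id, String name) throws Exception {
        URL url = new URL("http://localhost:8983/solr/update");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");

        Writer w = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        w.write("<add><doc>"
                + "<field name=\"id\">" + id + "</field>"
                + "<field name=\"name\">" + name + "</field>"
                + "</doc></add>");
        w.close();
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("update failed: " + conn.getResponseCode());
        }
    }
}

In practice I would batch many <doc> elements into one <add> and post
a single <commit/> at the end of the run - one HTTP round trip per
document would be far too slow for 60M documents.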

Questions:

* Which direction do you recommend in order to meet the SLAs in the
fastest way?

* Since I have RAID on the machine, can I gain performance by using
multiple Solr instances on the same machine, or will only multiple
machines help me?

* What is the minimal number of machines I should require (I might be
able to get a larger number of weaker machines)?

* How many concurrent indexers are recommended?

* Do you agree that the bottleneck is the indexing time?

Any help is appreciated 

Thanks in advance

yatir

 
