Hi Ning,
In continuation of our offline conversation, here is a public
expression of interest in your work and a description of ours. Apologies
in advance for the length, and I hope folks will be able to
collaborate and/or share experiences and/or give us some pointers...
1) We are trying to leverage Lucene on Hadoop for blog archiving and
searching, i.e. ever-increasing data (in terabytes) on commodity hardware
in a generic LAN. These machines are neither high-spec nor dedicated; they
are actually used within the lab by users for day-to-day tasks.
Unfortunately, Nutch and Solr are not applicable to our situation -
at least not directly. Think of us as an academically oriented Technorati.
2) There are two aspects. One is that we want to archive the blogposts that
we hit under a UUID/timestamp taxonomy. This archive can be used for
many things like cached copies, diffing, surf acceleration, etc. The
other aspect is to archive the indexes. You see, the indexes have a
lifecycle. For simplicity's sake, an index consists of one day's worth of
blogposts (roughly 15MM documents) and follows the <YYYYMMDD> taxonomy.
Ideally, we want to store an indefinite archive of blogposts and their
indexes side by side, but 1 year or 365 days is a start.
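To make the taxonomy concrete, here is a rough Java sketch of the layout we
have in mind; the /archive/... root paths are only placeholders, nothing is
settled:

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.UUID;

public class ArchivePaths {
    private static final SimpleDateFormat DAY = new SimpleDateFormat("yyyyMMdd");

    // A blogpost is archived under its UUID plus the fetch timestamp.
    public static String postPath(UUID postId, long fetchedAtMillis) {
        return "/archive/posts/" + postId + "/" + fetchedAtMillis;
    }

    // One Lucene index per day, named by the <YYYYMMDD> taxonomy.
    public static String indexPath(Date day) {
        return "/archive/indexes/" + DAY.format(day);
    }
}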
3) We want to use the taxonomical name of the post as a specific ID
field in the Lucene index, and we want to get away with not storing the
content of the post at all, only a file pointer/reference to it. This,
we hope, will keep the index sizes low, but the fact remains that this is
a case of multiple threads on multiple JVMs handling multiple indexes on
multiple machines. Further, the posts and indexes are mostly WORM, but
there may be situations where they have to be updated - for example, if
some blog posts have edited content, have to be removed for copyright
reasons, or are updated with metadata like rank. There is some duplication
detection work that has to be done here, but it is out of scope for now.
And oh, the lab is a Linux-Windows environment.
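A minimal sketch of what such a document might look like, assuming Lucene
2.x-era field options; the field names, analyzer and paths are just
placeholders:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class PostIndexer {
    public void addPost(IndexWriter writer, String postId,
                        String hdfsPath, String body) throws Exception {
        Document doc = new Document();
        // Taxonomical name of the post, used as the unique ID field.
        doc.add(new Field("id", postId, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // File pointer/reference into the blogpost archive, not the content.
        doc.add(new Field("path", hdfsPath, Field.Store.YES, Field.Index.NO));
        // The content itself is indexed for search but not stored.
        doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("/local/indexes/20080101",
                                             new StandardAnalyzer(), true);
        new PostIndexer().addPost(writer, "uuid-1234/1199145600000",
                                  "/archive/posts/uuid-1234/1199145600000",
                                  "post body text ...");
        writer.optimize();
        writer.close();
    }
}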
4) Our first port of call is to have Hadoop running on this group of
machines (without clustering or load balancing or grid or master/slave
mumbo jumbo) in the simplest way possible. The goal is to make
applications see the bunch of machines as a reliable, scalable,
fault-tolerant, average-performing file store with simple file CRUD
operations. For example, the blog crawler should be able to put the
blogposts into this HDFS in live or batch mode. With about 20 machines,
each fitted with a 240GB drive for the experiment, we have
about 4.5 TB of storage available.
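For illustration, something like the following is all the crawler should
need against the Hadoop FileSystem API; the namenode address and paths are
assumptions for our setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ArchiveWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");  // assumed address
        FileSystem fs = FileSystem.get(conf);

        // "Create": write one fetched blogpost under its UUID/timestamp path.
        Path post = new Path("/archive/posts/uuid-1234/1199145600000");
        FSDataOutputStream out = fs.create(post);
        out.write("<html>...fetched post...</html>".getBytes("UTF-8"));
        out.close();

        // "Read" and "Delete" are just as plain: fs.open(post), fs.delete(post).
        fs.close();
    }
}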
5) Next we want to handle Lucene and exploit the structure of its index
and the algorithms behind it. Since a Lucene index is a directory of
files, we intend to 'tag' the files as belonging to one index and store
them on HDFS. At any instant in time, an index can be regenerated
and used. The regenerated index is, however, not used directly from HDFS
but copied into the local filesystem of the indexer/searcher. This copy
is subject to change, and every once in a while the constituent files in
HDFS are overwritten with the latest files. Hence, naming is quite
important to us. Even so, we feel that the number of files that have to
be updated is quite small, and that we can use MD5 sums to make sure we
only update the files whose content has changed. However, this means that
of the 4.5 TB available, we use half for archival and the other half for
searching. Even so, we should be able to store a year's worth of posts
and indexes. Disks are no problem.
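A rough sketch of that pull/push cycle, assuming the placeholder paths from
above and plain java.security MD5 sums:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.MessageDigest;
import java.util.Arrays;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IndexSync {
    // MD5 of any stream (a local index file or one opened from HDFS).
    static byte[] md5(InputStream in) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) md.update(buf, 0, n);
        in.close();
        return md.digest();
    }

    // Pull: copy a whole <YYYYMMDD> index directory out of HDFS for searching.
    public static void pull(FileSystem fs, String day) throws Exception {
        fs.copyToLocalFile(new Path("/archive/indexes/" + day),
                           new Path("/local/indexes/" + day));
    }

    // Push: overwrite on HDFS only the constituent files whose MD5 differs.
    public static void push(FileSystem fs, String day) throws Exception {
        Path remoteDir = new Path("/archive/indexes/" + day);
        for (File local : new File("/local/indexes/" + day).listFiles()) {
            Path remote = new Path(remoteDir, local.getName());
            if (!fs.exists(remote)
                || !Arrays.equals(md5(new FileInputStream(local)),
                                  md5(fs.open(remote)))) {
                fs.copyFromLocalFile(new Path(local.getAbsolutePath()), remote);
            }
        }
    }
}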
6) Right then. So, we have (365*15MM) posts and (365*LFi) Lucene file
segments on the HDFS. Suppose there are N machines online; then each
machine will have to own 365/N indexes. N keeps changing constantly, but
at any instant all 365 indexes should be live, and we are working on the
best way to achieve this kind of 'fair' autonomic computing cloud where,
when a machine goes down, the other machines add some of its indexes to
their kitty, and when a machine is added, it relieves other machines of
some indexes. The searcher runs on each of these machines as a
service (IP:port), and queries are served using a ParallelMultiSearcher
[across the machines] and a MultiSearcher [within each machine] so that we
need not have an unmanageable number of JVMs per machine - at most, 1 for
Hadoop, 1 for Cloud and 1 for Search. We are wondering if Solr can be
used for search, if it supports multiple indexes available on the same
machine.
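A sketch of that two-level arrangement with Lucene 2.x-era classes; how each
machine's Searchable is actually exposed on an IP:port (e.g. via
RMI/RemoteSearchable) is left open here:

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.ParallelMultiSearcher;
import org.apache.lucene.search.Searchable;

public class TwoLevelSearch {
    // Within a machine: one MultiSearcher over all <YYYYMMDD> indexes it owns.
    public static MultiSearcher localSearcher(String[] indexDirs) throws Exception {
        Searchable[] perIndex = new Searchable[indexDirs.length];
        for (int i = 0; i < indexDirs.length; i++) {
            perIndex[i] = new IndexSearcher(indexDirs[i]);
        }
        return new MultiSearcher(perIndex);
    }

    // Across machines: each element would be one machine's remote Searchable.
    public static ParallelMultiSearcher clusterSearcher(Searchable[] machines)
            throws Exception {
        return new ParallelMultiSearcher(machines);
    }
}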
As you can see, this is not a simple endeavour, and it is obvious, I
suppose, that we are still in the theory stage and only now getting to know
the Lucene projects better. There is a huge body of work here, albeit not
acknowledged by the scientific community as much as it should be, and I
want to say kudos to all who have been responsible for it.
I wish and hope to utilize the collective consciousness to mount our
challenge. Any pointers, code, help, collaboration et al. for any of the
6 points above - it goes without saying/asking - are welcome, and we look
forward to sharing our experiences in a formal written discourse as and
when we have them.
Cheers,
Srikant
Ning Li wrote:
There have been several proposals for a Lucene-based distributed index
architecture.
1) Doug Cutting's "Index Server Project Proposal" at
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00338.html
2) Solr's "Distributed Search" at
http://wiki.apache.org/solr/DistributedSearch
3) Mark Butler's "Distributed Lucene" at
http://wiki.apache.org/hadoop/DistributedLucene