Hey there,

We've actually been tackling this problem at Drawn to Scale. We'd really like to get our hands on LuceHBase to see how it scales. Our faceting still needs to be done in-memory, which is kinda tricky, but it's worth exploring.
On Mon, Apr 12, 2010 at 7:27 AM, Thomas Koch <tho...@koch.ro> wrote:
> Hi,
>
> could I interest you in this project?
> http://github.com/thkoch2001/lucehbase
>
> The aim is to store the index directly in HBase, a database system
> modelled after Google's Bigtable, designed to store data on the order of
> terabytes or petabytes.
>
> Best regards, Thomas Koch
>
> Lance Norskog:
> > The 2B limitation is within one shard, due to using a signed 32-bit
> > integer. There is no limit in that regard in sharding - Distributed
> > Search uses the stored unique document id rather than the internal
> > docid.
> >
> > On Fri, Apr 2, 2010 at 10:31 AM, Rich Cariens <richcari...@gmail.com> wrote:
> > > A colleague of mine is using native Lucene + some home-grown
> > > patches/optimizations to index over 13B small documents in a 32-shard
> > > environment, which is around 406M docs per shard.
> > >
> > > If there's a 2B doc id limitation in Lucene then I assume he's
> > > patched it himself.
> > >
> > > On Fri, Apr 2, 2010 at 1:17 PM, <dar...@ontrenet.com> wrote:
> > >> My guess is that you will need to take advantage of Solr 1.5's
> > >> upcoming cloud/cluster renovations and use multiple indexes to
> > >> comfortably achieve those numbers. Hypothetically, in that case, you
> > >> won't be limited by single-index docid limitations of Lucene.
> > >>
> > >> > We are currently indexing 5 million books in Solr, scaling up over
> > >> > the next few years to 20 million. However, we are using the entire
> > >> > book as a Solr document. We are evaluating the possibility of
> > >> > indexing individual pages, as there are some use cases where users
> > >> > want the most relevant pages regardless of what book they occur
> > >> > in. However, we estimate that we are talking about somewhere
> > >> > between 1 and 6 billion pages and have concerns over whether Solr
> > >> > will scale to this level.
> > >> >
> > >> > Does anyone have experience using Solr with 1-6 billion Solr
> > >> > documents?
> > >> >
> > >> > The Lucene file format document
> > >> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> > >> > mentions a limit of about 2 billion document ids. I assume this is
> > >> > the Lucene internal document id and would therefore be a
> > >> > per-index/per-shard limit. Is this correct?
> > >> >
> > >> > Tom Burton-West
>
> Thomas Koch, http://www.koch.ro

--
Bradford Stephens, Founder, Drawn to Scale
drawntoscalehq.com
727.697.7528
http://www.drawntoscalehq.com -- The intuitive, cloud-scale data solution. Process, store, query, search, and serve all your data.

http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
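Lance's point upthread, that sharding sidesteps the 32-bit limit because Distributed Search merges on the stored unique document id rather than the internal docid, can be illustrated with a toy merge. This is not Solr's actual implementation, just a sketch of the idea: each shard numbers its documents with its own int docid, so the same int can legitimately appear in every shard, and only the unique key identifies a document globally.

```java
import java.util.Comparator;
import java.util.List;

// Toy illustration (not Solr's code): internal docids are only meaningful
// within one shard, so a cross-shard merge identifies results by the
// stored unique key, which has no 32-bit ceiling.
public class DistributedMerge {
    record Hit(int internalDocId, String uniqueKey, float score) {}

    /** Merge per-shard result lists into a global top-N by score. */
    static List<String> merge(List<List<Hit>> shardResults, int topN) {
        return shardResults.stream()
                .flatMap(List::stream)
                .sorted(Comparator.comparingDouble((Hit h) -> -h.score()))
                .map(Hit::uniqueKey) // global identity = unique key, not docid
                .limit(topN)
                .toList();
    }

    public static void main(String[] args) {
        // Internal docid 0 exists in both shards; the unique keys differ.
        var shard0 = List.of(new Hit(0, "book-1/page-12", 0.9f));
        var shard1 = List.of(new Hit(0, "book-7/page-3", 0.7f));
        System.out.println(merge(List.of(shard0, shard1), 10));
        // [book-1/page-12, book-7/page-3]
    }
}
```

The uniqueKey strings here are made up; the point is only that the int docids collide across shards while the keys do not.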
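For anyone redoing the back-of-the-envelope math from this thread: the ~2B per-shard ceiling is just Java's `Integer.MAX_VALUE` (2,147,483,647), since the internal docid is a signed 32-bit int. A quick arithmetic sketch (the shard counts are pure division, not any Solr API):

```java
public class ShardMath {
    // Lucene's internal docid is a signed 32-bit int, so one shard can
    // address at most Integer.MAX_VALUE (~2.147 billion) documents.
    static final long MAX_DOCS_PER_SHARD = Integer.MAX_VALUE;

    /** Theoretical minimum number of shards needed for totalDocs documents. */
    static long minShards(long totalDocs) {
        return (totalDocs + MAX_DOCS_PER_SHARD - 1) / MAX_DOCS_PER_SHARD;
    }

    public static void main(String[] args) {
        // The 13B-document deployment mentioned above, over 32 shards:
        System.out.println(13_000_000_000L / 32);      // 406250000 (~406M/shard)

        // Worst case from the page-indexing estimate, 6 billion pages:
        System.out.println(minShards(6_000_000_000L)); // 3
    }
}
```

Of course, 3 is only the docid-addressing minimum for 6B pages; real shard counts are driven by memory, merge, and query-latency constraints, which is why the 13B deployment uses 32.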