We have found that 200-250 MB per Lucene index is where efficiency drops off and Lucene gets slow. You will have to use a sharding approach: many small indexes, each holding a different subset of the documents. Solr has a feature for running queries across many shards, called Distributed Search.
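For example, the shards request parameter fans a single query out to all of the shards and merges the results. A minimal SolrJ sketch against the 1.4-era API (the shard host names here are invented):

<code>
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        // Any one shard acts as the aggregator for a distributed query.
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://shard1:8983/solr");

        SolrQuery query = new SolrQuery("title:lucene");
        // Comma-separated list of shards to fan the query out to
        // (host:port/path, without the http:// prefix).
        query.set("shards", "shard1:8983/solr,shard2:8983/solr");

        QueryResponse rsp = server.query(query);
        System.out.println("Total hits: " + rsp.getResults().getNumFound());
    }
}
</code>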
More details: http://wiki.apache.org/solr/DistributedSearch

There is a great book from Packt Books:
https://www.packtpub.com/solr-1-4-enterprise-search-server/book

On Tue, Aug 24, 2010 at 10:10 AM, Liz Sommers <lizswo...@gmail.com> wrote:
> We will be ingesting gigabytes of new data per day, but have a lot of legacy
> data (petabytes) that will also need to be indexed. We will probably index
> many fields per record (avg. 50/record) and hope to add facets in the near
> future.
>
> If this solution gives us the speed and facet capabilities we are hoping
> for, our searches per hour will go up by 10 times or more, but will probably
> max out at a couple of searches per second.
>
> Thanks.
>
> Liz Sommers
>
> On Tue, Aug 24, 2010 at 12:53 PM, Glen Newton <glen.new...@gmail.com> wrote:
>
>> Liz,
>>
>> I've built terabyte-scale (1-2 TB) test Lucene indexes, but have not
>> reached the petabyte level, so I am not sure. Certainly there is
>> overhead in the HTTP and XML marshaling/de-marshaling, which may
>> or may not be a critical factor for you.
>>
>> Could you give more information about your application: the nature of
>> your data loading (many PB at once, or GB per hour/day/week
>> accumulating to PB, or MB per second/minute/hour eventually
>> accumulating to PB... ;), your searching (the number of fields indexed
>> and the query complexity; whether you are using facets, etc.), and the
>> number of queries per second expected.
>>
>> Lucene has a limit on the number of documents in a single index that
>> might impact your application:
>>
>> http://lucene.apache.org/java/3_0_2/api/core/org/apache/lucene/index/IndexWriter.html#numDocs%28%29
>>
>> The limit is the maximum value of a signed 32-bit int, 2,147,483,647.
>>
>> -glen
>>
>> On 24 August 2010 12:29, Liz Sommers <lizswo...@gmail.com> wrote:
>> > I was worried that it wouldn't scale. We are going to be indexing
>> > petabytes of data. Does the HTTP server solution scale?
>> >
>> > Thanks
>> >
>> > Liz Sommers
>> > lizswo...@gmail.com
>> >
>> > On Tue, Aug 24, 2010 at 12:23 PM, Thomas Joiner
>> > <thomas.b.joi...@gmail.com> wrote:
>> >
>> >> Is there any reason you aren't using http://wiki.apache.org/solr/Solrj
>> >> to interact with Solr?
>> >>
>> >> On Tue, Aug 24, 2010 at 11:12 AM, Liz Sommers <lizswo...@gmail.com> wrote:
>> >>
>> >> > I am very new to the solr/lucene world. I am using solr 1.4.0 and
>> >> > cannot move to 1.4.1.
>> >> >
>> >> > I have to index about 50 fields for each document; these fields are
>> >> > already in key/value pairs by the time I get to my index methods. I
>> >> > was able to index them with Lucene without any problem, but found
>> >> > that I could not then read the indexes with solr/admin. So I decided
>> >> > to use Solr for my indexing.
>> >> >
>> >> > The error I am currently getting is:
>> >> >
>> >> > java.lang.RuntimeException: Can't find resource 'synonyms.txt' in
>> >> > classpath or 'solr/conf/'
>> >> >
>> >> > This exception is thrown by SolrResourceLoader.openResource (line
>> >> > 260), which is called by IndexSchema.<init> (line 102).
>> >> >
>> >> > The code that leads up to this follows:
>> >> >
>> >> > <code>
>> >> > String path = "c:/swdev/apache-solr-1.4.0/IDW";
>> >> > SolrConfig cfg = new SolrConfig(path + "/solr/conf/solrconfig.xml");
>> >> > schema = new IndexSchema(cfg, path + "/solr/conf/schema.xml", null);
>> >> > </code>
>> >> >
>> >> > This also fails if I use:
>> >> >
>> >> > schema = new IndexSchema(cfg, "schema.xml", null);
>> >> >
>> >> > Any help would be greatly appreciated. (A working initialization
>> >> > sketch appears below, after the thread.)
>> >> >
>> >> > Thank you
>> >> >
>> >> > Liz Sommers
>> >> > lizswo...@gmail.com

--
Lance Norskog
goks...@gmail.com
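The synonyms.txt error above usually means the schema was loaded outside Solr's resource loader, so relative resources are not resolved against the instance directory's conf/. A minimal sketch of the embedded route Thomas suggested, against the Solr 1.4-era SolrJ API (the solr home path follows the one in the thread; the field names are hypothetical):

<code>
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedIndexer {
    public static void main(String[] args) throws Exception {
        // Point Solr at the instance dir that holds conf/solrconfig.xml,
        // conf/schema.xml, and conf/synonyms.txt, rather than constructing
        // SolrConfig/IndexSchema by hand.
        System.setProperty("solr.solr.home", "c:/swdev/apache-solr-1.4.0/IDW/solr");

        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        SolrServer server = new EmbeddedSolrServer(container, "");

        // Key/value pairs map directly onto a SolrInputDocument.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");           // hypothetical field names
        doc.addField("title", "hello world");
        server.add(doc);
        server.commit();

        container.shutdown();
    }
}
</code>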