Hmmm, I'm always suspicious when I see a schema.xml with a lot of "string"
types. This is tangential to your question, but I thought I'd butt in anyway.

String types are completely unanalyzed. So if the input for a field is "I like Strings",
the only match will be the exact text "I like Strings". "I like strings" won't match due
to the lower-case 's' in "strings", and "like" won't match since it isn't the complete input.

You may already know this, but I thought I'd point it out: for tokenized
searches, text_general is a good place to start.
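
As a reference point, the example schema that ships with Solr defines
text_general roughly like this (simplified here; copy the full definition from
your own Solr version's example schema rather than from this mail):

  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>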

Lots of string types also tend to lead people with DB backgrounds to reach for
wildcard searches (the Solr equivalent of SQL's LIKE), which will be slow, FWIW.
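
For example, on a string field you end up needing a query like q=fieldA:*foo*
to get a substring match, and leading-wildcard queries like that are expensive.
On a text_general field a plain term query (q=fieldA:foo) matches any document
containing that word.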

Best,
Erick

On Sat, Jan 25, 2014 at 5:51 AM, svante karlsson <s...@csi.se> wrote:
> That got away a little early...
>
> The inserter is a small C++ program that uses pglib to speak to postgres
> and an http-client library that uses libcurl under the hood. The
> inserter draws very little CPU and we normally use 2 writer threads that
> each post 1000 records at a time. It's very inefficient to post one at a
> time, but I've not done any specific testing to know whether 1000 is better
> than 500....
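>
> Each request is basically a batch of docs POSTed to the update handler;
> roughly the equivalent of something like this with plain curl (collection
> and field names here are just placeholders):
>
> curl -s 'http://localhost:8983/solr/collection1/update' \
>      -H 'Content-Type: application/json' \
>      -d '[{"id":"1","fieldA":"foo"},{"id":"2","fieldA":"bar"}]'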
>
> What we're doing now is trying to figure out how to get the query
> performance up, since it's not where we need it to be, so we're not done
> either...
>
>
> 2014/1/25 svante karlsson <s...@csi.se>
>
>> We are using a postgres server on a different host (same hardware as the
>> test solr server). The reason we take the data from the postgres server is
>> that it's easy to automate testing, since we use the same server to produce
>> queries. In production we preload solr from a csv file produced by a hive
>> (hadoop) job and then only write updates ( < 500 / sec ). In our use case we
>> use solr as a NoSQL database since we really want to do SHOULD queries
>> against all the fields. The fields are typically very small text fields
>> (<30 chars), occasionally bigger, but I don't think I have more than 128
>> chars in anything in the whole dataset.
>>
>> <?xml version="1.0" encoding="UTF-8" ?>
>> <schema name="example" version="1.1">
>>   <types>
>>   <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
>>   <fieldType name="string" class="solr.StrField" sortMissingLast="true"
>> omitNorms="true"/>
>>    <fieldType name="boolean" class="solr.BoolField"
>> sortMissingLast="true"/>
>>    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
>> positionIncrementGap="0"/>
>>    <fieldType name="int" class="solr.TrieIntField" precisionStep="0"
>> positionIncrementGap="0"/>
>>    <fieldType name="long" class="solr.TrieLongField" precisionStep="0"
>> positionIncrementGap="0"/>
>>    </types>
>> <fields>
>> <field name="_version_" type="long" indexed="true" stored="true"
>> multiValued="false"/>
>> <field name="id" type="string" indexed="true" stored="true"
>> required="true" multiValued="false" />
>> <field name="name" type="int" indexed="true" stored="true"/>
>> <field name="fieldA" type="string" indexed="true" stored="true"/>
>> <field name="fieldB" type="string" indexed="true" stored="true"/>
>> <field name="fieldC" type="int" indexed="true" stored="true"/>
>> <field name="fieldD" type="int" indexed="true" stored="true"/>
>> <field name="fieldE" type="int" indexed="true" stored="true"/>
>> <field name="fieldF" type="string" indexed="true" stored="true"
>> multiValued="true"/>
>> <field name="fieldG" type="string" indexed="true" stored="true"
>> multiValued="true"/>
>> <field name="fieldH" type="string" indexed="true" stored="true"
>> multiValued="true"/>
>> <field name="fieldI" type="string" indexed="true" stored="true"
>> multiValued="true"/>
>> <field name="fieldJ" type="string" indexed="true" stored="true"
>> multiValued="true"/>
>> <field name="fieldK" type="string" indexed="true" stored="true"
>> multiValued="true"/>
>> <field name="fieldL" type="string" indexed="true" stored="true"/>
>> <field name="fieldM" type="string" indexed="true" stored="true"
>> multiValued="true"/>
>> <field name="fieldN" type="string" indexed="true" stored="true"/>
>>
>> <field name="fieldO" type="string" indexed="false" stored="true"
>> required="false" />
>> <field name="ts"  type="long" indexed="true" stored="true"/>
>> </fields>
>> <uniqueKey>id</uniqueKey>
>> <solrQueryParser defaultOperator="OR"/>
>> </schema>
>>
>>
>>
>>
>>
>> 2014/1/25 Kranti Parisa <kranti.par...@gmail.com>
>>
>>> Can you post the complete solrconfig.xml and schema.xml files so we can
>>> review all of your settings that would impact your indexing performance?
>>>
>>> Thanks,
>>> Kranti K. Parisa
>>> http://www.linkedin.com/in/krantiparisa
>>>
>>>
>>>
>>> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
>>> susheel.ku...@thedigitalgroup.net> wrote:
>>>
>>> > Thanks, Svante. Your indexing speed using the db seems to be really fast.
>>> > Can you please provide some more detail on how you are indexing db records?
>>> > Is it through DataImportHandler? And what database? Is that a local db? We
>>> > are indexing around 70 fields (60 multivalued), but data is not always
>>> > populated in all fields. The average document size is 5-10 KB.
>>> >
>>> > -----Original Message-----
>>> > From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
>>> > svante karlsson
>>> > Sent: Friday, January 24, 2014 5:05 PM
>>> > To: solr-user@lucene.apache.org
>>> > Subject: Re: Solr server requirements for 100+ million documents
>>> >
>>> > I just indexed 100 million db docs (records) with 22 fields (4
>>> > multivalued) in 9524 sec using libcurl.
>>> > 11 million took 763 seconds, so the speed drops somewhat with increasing
>>> > db size.
>>> >
>>> > We write 1000 docs (just an arbitrary number) in each request from two
>>> > threads. If you will be using SolrCloud, you will want more writer
>>> > threads.
>>> >
>>> > The hardware is a single cheap HP DL320e Gen8 v2 1P E3-1220v3 with one SSD
>>> > and 32 GB of RAM, and Solr runs on Ubuntu 13.10 inside an ESXi virtual
>>> > machine.
>>> >
>>> > /svante
>>> >
>>> >
>>> >
>>> >
>>> > 2014/1/24 Susheel Kumar <susheel.ku...@thedigitalgroup.net>
>>> >
>>> > > Thanks, Erick, for the info.
>>> > >
>>> > > For indexing, I agree that more time is consumed in data acquisition,
>>> > > which in our case is from the database. For indexing we are currently
>>> > > using a manual process, i.e. the Solr dashboard Data Import, but are now
>>> > > looking to automate it. How do you suggest we automate the indexing part?
>>> > > Do you recommend using SolrJ, or should we try to automate using curl?
>>> > >
>>> > >
>>> > > -----Original Message-----
>>> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> > > Sent: Friday, January 24, 2014 2:59 PM
>>> > > To: solr-user@lucene.apache.org
>>> > > Subject: Re: Solr server requirements for 100+ million documents
>>> > >
>>> > > Can't be done with the information you provided, and can only be
>>> > > guessed at even with more comprehensive information.
>>> > >
>>> > > Here's why:
>>> > >
>>> > >
>>> > >
>>> > > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>> > >
>>> > > Also, at a guess, your indexing speed is so slow due to data
>>> > > acquisition; I rather doubt you're being limited by raw Solr indexing.
>>> > > If you're using SolrJ, try commenting out the server.add() bit and
>>> > > running again. My guess is that your indexing speed will be almost
>>> > > unchanged, in which case it's the data acquisition process where you
>>> > > should concentrate your efforts. As a comparison, I can index 11M
>>> > > Wikipedia docs on my laptop in 45 minutes without any attempts at
>>> > > parallelization.
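>>> > >
>>> > > To be concrete, I mean something like this stripped-down SolrJ sketch
>>> > > (the URLs, SQL, and field names are just placeholders, not your actual
>>> > > code):
>>> > >
>>> > > import java.sql.*;
>>> > > import java.util.ArrayList;
>>> > > import java.util.List;
>>> > > import org.apache.solr.client.solrj.impl.HttpSolrServer;
>>> > > import org.apache.solr.common.SolrInputDocument;
>>> > >
>>> > > public class IndexTimingTest {
>>> > >   public static void main(String[] args) throws Exception {
>>> > >     HttpSolrServer server =
>>> > >         new HttpSolrServer("http://localhost:8983/solr/collection1");
>>> > >     Connection db = DriverManager.getConnection(
>>> > >         "jdbc:postgresql://dbhost/mydb", "user", "pass");
>>> > >     ResultSet rs = db.createStatement()
>>> > >         .executeQuery("SELECT id, fieldA FROM mytable");
>>> > >
>>> > >     List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
>>> > >     while (rs.next()) {
>>> > >       SolrInputDocument doc = new SolrInputDocument();
>>> > >       doc.addField("id", rs.getString("id"));
>>> > >       doc.addField("fieldA", rs.getString("fieldA"));
>>> > >       batch.add(doc);
>>> > >       if (batch.size() == 1000) {
>>> > >         server.add(batch);   // comment out the add() calls to time
>>> > >         batch.clear();       // data acquisition alone
>>> > >       }
>>> > >     }
>>> > >     if (!batch.isEmpty()) server.add(batch);
>>> > >     server.commit();
>>> > >     server.shutdown();
>>> > >     db.close();
>>> > >   }
>>> > > }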
>>> > >
>>> > >
>>> > > Best,
>>> > > Erick
>>> > >
>>> > > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
>>> > > susheel.ku...@thedigitalgroup.net> wrote:
>>> > > > Hi,
>>> > > >
>>> > > > Currently we are indexing 10 million documents from a database (10 db
>>> > > > data entities) & the index size is around 8 GB on a Windows virtual box.
>>> > > > Indexing in one shot takes 12+ hours, while indexing in parallel in
>>> > > > separate cores & merging them together takes 4+ hours.
>>> > > >
>>> > > > We are looking to scale to 100+ million documents and are looking for
>>> > > > recommendations on server requirements for the parameters below, for a
>>> > > > production environment. There can be 200+ users performing searches at
>>> > > > the same time.
>>> > > >
>>> > > > No. of physical servers (considering SolrCloud)
>>> > > > Memory requirement
>>> > > > Processor requirement (# cores)
>>> > > > Linux as OS as opposed to Windows
>>> > > >
>>> > > > Thanks in advance.
>>> > > > Susheel
>>> > > >
>>> > >
>>> >
>>>
>>
>>
