Hmmm, I'm always suspicious when I see a schema.xml with a lot of "string" types. This is tangential to your question, but I thought I'd butt in anyway.
String types are totally unanalyzed, so if the input for a field is "I like Strings", the only match will be "I like Strings". "I like strings" won't match because of the lower-case 's' in "strings", and "like" won't match since it isn't the complete input. You may already know this, but I thought I'd point it out. For tokenized searches, text_general is a good place to start (there's a minimal sketch of one below the quoted thread). Pardon me if this is repeating what you already know.... Lots of string types sometimes lead people with DB backgrounds to reach for *like*-style wildcard searches, which will be slow, FWIW.

Best,
Erick

On Sat, Jan 25, 2014 at 5:51 AM, svante karlsson <s...@csi.se> wrote:
> That got away a little early...
>
> The inserter is a small C++ program that uses pglib to speak to postgres
> and an http-client library that uses libcurl under the hood. The inserter
> draws very little CPU and we normally use 2 writer threads that each post
> 1000 records at a time. It's very inefficient to post one at a time, but
> I've not done any specific testing to know whether 1000 is better than
> 500....
>
> What we're doing now is trying to figure out how to get the query
> performance up, since it's not where we need it to be, so we're not done
> either...
>
>
> 2014/1/25 svante karlsson <s...@csi.se>
>
>> We are using a postgres server on a different host (same hardware as the
>> test solr server). The reason we take the data from the postgres server
>> is that it's easy to automate testing, since we use the same server to
>> produce queries. In production we preload the solr from a csv file from
>> a hive (hadoop) job and then only write updates (< 500 / sec). In our
>> use case we use solr as a NoSQL database, since we really want to do
>> SHOULD queries against all the fields. The fields are typically very
>> small text fields (<30 chars), occasionally bigger, but I don't think I
>> have more than 128 chars on anything in the whole dataset.
>>
>> <?xml version="1.0" encoding="UTF-8" ?>
>> <schema name="example" version="1.1">
>>   <types>
>>     <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
>>     <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
>>     <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
>>     <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
>>     <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
>>     <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
>>   </types>
>>   <fields>
>>     <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
>>     <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
>>     <field name="name" type="int" indexed="true" stored="true"/>
>>     <field name="fieldA" type="string" indexed="true" stored="true"/>
>>     <field name="fieldB" type="string" indexed="true" stored="true"/>
>>     <field name="fieldC" type="int" indexed="true" stored="true"/>
>>     <field name="fieldD" type="int" indexed="true" stored="true"/>
>>     <field name="fieldE" type="int" indexed="true" stored="true"/>
>>     <field name="fieldF" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="fieldG" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="fieldH" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="fieldI" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="fieldJ" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="fieldK" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="fieldL" type="string" indexed="true" stored="true"/>
>>     <field name="fieldM" type="string" indexed="true" stored="true" multiValued="true"/>
>>     <field name="fieldN" type="string" indexed="true" stored="true"/>
>>     <field name="fieldO" type="string" indexed="false" stored="true" required="false" />
>>     <field name="ts" type="long" indexed="true" stored="true"/>
>>   </fields>
>>   <uniqueKey>id</uniqueKey>
>>   <solrQueryParser defaultOperator="OR"/>
>> </schema>
>>
>>
>> 2014/1/25 Kranti Parisa <kranti.par...@gmail.com>
>>
>>> can you post the complete solrconfig.xml and schema.xml files to review
>>> all of your settings that would impact your indexing performance.
>>>
>>> Thanks,
>>> Kranti K. Parisa
>>> http://www.linkedin.com/in/krantiparisa
>>>
>>>
>>> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
>>> susheel.ku...@thedigitalgroup.net> wrote:
>>>
>>> > Thanks, Svante. Your indexing speed using the db seems to be really
>>> > fast. Can you please provide some more detail on how you are indexing
>>> > db records? Is it through DataImportHandler? And what database? Is it
>>> > a local db? We are indexing around 70 fields (60 multivalued), but
>>> > data is not always populated in all fields. The average document size
>>> > is 5-10 KB.
>>> >
>>> > -----Original Message-----
>>> > From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
>>> > svante karlsson
>>> > Sent: Friday, January 24, 2014 5:05 PM
>>> > To: solr-user@lucene.apache.org
>>> > Subject: Re: Solr server requirements for 100+ million documents
>>> >
>>> > I just indexed 100 million db docs (records) with 22 fields (4
>>> > multivalued) in 9524 sec using libcurl.
>>> > 11 million took 763 seconds, so the speed drops somewhat with
>>> > increasing db size.
>>> >
>>> > We write 1000 docs (just an arbitrary number) in each request from two
>>> > threads. If you will be using SolrCloud you will want more writer
>>> > threads.
>>> >
>>> > The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one
>>> > SSD and 32GB, and Solr runs on Ubuntu 13.10 inside an ESXi virtual
>>> > machine.
>>> >
>>> > /svante
>>> >
>>> >
>>> > 2014/1/24 Susheel Kumar <susheel.ku...@thedigitalgroup.net>
>>> >
>>> > > Thanks, Erick, for the info.
>>> > >
>>> > > For indexing, I agree that more time is consumed in data acquisition,
>>> > > which in our case is from the database. Currently we are using a
>>> > > manual process, i.e. the Solr dashboard's Data Import, but we are now
>>> > > looking to automate it. How do you suggest we automate the indexing
>>> > > part? Do you recommend SolrJ, or should we try to automate it using
>>> > > curl?
>>> > >
>>> > >
>>> > > -----Original Message-----
>>> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> > > Sent: Friday, January 24, 2014 2:59 PM
>>> > > To: solr-user@lucene.apache.org
>>> > > Subject: Re: Solr server requirements for 100+ million documents
>>> > >
>>> > > Can't be done with the information you provided, and can only be
>>> > > guessed at even with more comprehensive information.
>>> > >
>>> > > Here's why:
>>> > >
>>> > > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>> > >
>>> > > Also, at a guess, your indexing speed is so slow due to data
>>> > > acquisition; I rather doubt you're being limited by raw Solr
>>> > > indexing. If you're using SolrJ, try commenting out the server.add()
>>> > > bit and running again. My guess is that your indexing speed will be
>>> > > almost unchanged, in which case the data acquisition process is where
>>> > > you should concentrate your efforts. As a comparison, I can index 11M
>>> > > Wikipedia docs on my laptop in 45 minutes without any attempt at
>>> > > parallelization.
>>> > >
>>> > > Best,
>>> > > Erick
>>> > >
>>> > > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
>>> > > susheel.ku...@thedigitalgroup.net> wrote:
>>> > > > Hi,
>>> > > >
>>> > > > Currently we are indexing 10 million documents from a database (10
>>> > > > db data entities) and the index size is around 8 GB on a Windows
>>> > > > virtual box. Indexing in one shot takes 12+ hours, while indexing
>>> > > > in parallel in separate cores and merging them together takes 4+
>>> > > > hours.
>>> > > >
>>> > > > We are looking to scale to 100+ million documents and are looking
>>> > > > for recommendations on server requirements for a production
>>> > > > environment, on the parameters below. There can be 200+ users
>>> > > > performing searches at the same time.
>>> > > >
>>> > > > Number of physical servers (considering SolrCloud)
>>> > > > Memory requirement
>>> > > > Processor requirement (# cores)
>>> > > > Linux as OS as opposed to Windows
>>> > > >
>>> > > > Thanks in advance.
>>> > > > Susheel
>>> > > >
>>> > >
>>> >
>>
>>
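For concreteness, here is a minimal sketch of the kind of text_general type Erick is referring to, as it could appear in the schema.xml above. This is a bare-bones assumption (just a tokenizer plus lower-casing), not the exact stock definition shipped with Solr, and fieldA is only borrowed from the schema above as an example of a free-text field:

    <!-- A tokenized field type: StandardTokenizer splits the input into words,
         LowerCaseFilter makes matching case-insensitive, so a document containing
         "I like Strings" matches queries for "like" or "strings". -->
    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- Switch the free-text fields over to it, e.g.: -->
    <field name="fieldA" type="text_general" indexed="true" stored="true"/>

With a type like this, wildcard-heavy *like*-style queries become unnecessary for simple word matching, which is the slowness Erick is warning about.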
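And since the curl-vs-SolrJ question came up: the batched posting svante describes (roughly 1000 docs per HTTP request) amounts to sending one <add> element with many <doc> children to Solr's /update handler. A rough sketch of such a payload follows; the field values are made up for illustration, and the real inserter builds and posts this from C++ via libcurl rather than from the command line:

    <add>
      <doc>
        <field name="id">00000000-0000-0000-0000-000000000001</field>
        <field name="name">1</field>
        <field name="fieldA">some short text</field>
        <field name="ts">1390600000000</field>
      </doc>
      <doc>
        <field name="id">00000000-0000-0000-0000-000000000002</field>
        <field name="name">2</field>
        <field name="fieldA">another short text</field>
        <field name="ts">1390600000001</field>
      </doc>
      <!-- ... up to ~1000 <doc> elements per POST to /update; issue a
           <commit/> (or use commitWithin) once the batch run is done -->
    </add>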