Hi Jack,

Thanks for the response.

Each document has 106 fields: 20 of them are integers, 30 are character varying, and the rest are text fields. We don't have any blob data. About 40% of the documents are small, since they have little or no content in the text field values (we don't index those fields at all in that case); the rest are larger. The PG data is all varchar and text, and we trim the data before converting it to XML, so trailing blanks are not an issue.

How do you query?: This powers a free-text search portal where Solr is used to retrieve data mainly based on the text field values. We have a copy field (all_text) which copies most of these 106 fields into a single big_text field, and that is the default search field set in the df param. This is not a new project; it has been live for a year now and search works as expected. We push incremental data on a daily basis, and the indexing issue only shows up when we have to re-index the full data set for certain field additions and similar changes.

Just curious:
- Is there any difference in posting the data in JSON format vs. XML?
- Do we get any performance improvement if we generate the JSON/XML files, scp them to the Solr server and then push them via a curl command?
(I have put a rough curl sketch of both at the end of this mail.)

Regards,
Aneesh N

On Fri, Mar 4, 2016 at 1:44 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:

> What does a typical document look like - number of columns, data type,
> size? How much is text vs. numeric? Are there any large blobs? I mean, 15M
> docs in 425GB indicates about 28K per row/document, which seems rather
> large.
>
> Is the PG data VARCHAR(n) or CHAR(n)? IOW, might it have lots of trailing
> blanks for text columns?
>
> As always, the very first question in Solr data modeling is how do
> you intend to query the data - queries will determine the data model.
>
> Ultimately, the issue will not be how long it takes to index, but query
> latency and query throughput.
>
> 30GB sounds way too small for a 425GB index in terms of odds of low query
> latency.
>
> -- Jack Krupansky
>
> On Thu, Mar 3, 2016 at 12:54 PM, Aneesh Mon N <aneeshm...@gmail.com> wrote:
>
> > Hi,
> >
> > We are facing a huge performance issue while indexing data into Solr:
> > we have around 15 million records in a PostgreSQL database which have
> > to be indexed into a Solr 5.3.1 server.
> > It currently takes around 16 hours to complete the indexing.
> >
> > Note that all the fields are stored so as to support atomic updates.
> >
> > Current approach:
> > We use an ETL tool (Pentaho) to fetch the data from the database in
> > chunks of 1000 records, convert them into XML format and push them to
> > Solr. This is run in 10 parallel threads.
> >
> > System params
> > Solr version: 5.3.1
> > Size on disk: 425 GB
> >
> > Database, ETL machine and Solr are 16-core machines with 30 GB RAM each.
> > Database and Solr disk: RAID
> >
> > Any pointers on best approaches to index this kind of data would be
> > helpful.
> >
> > --
> > Regards,
> > Aneesh Mon N
> > Chennai
> > +91-8197-188-588
> >

--
Regards,
Aneesh Mon N
Bangalore
+91-8197-188-588
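
P.S. To make the two questions above concrete, this is roughly the push I have in mind for a single pre-generated batch file, in either format (hostname, collection and file names are placeholders, not our actual setup, and I have not benchmarked either path):

    # same batch of documents, posted as Solr XML ...
    curl 'http://localhost:8983/solr/my_collection/update?commitWithin=60000' \
         -H 'Content-Type: text/xml' --data-binary @batch_0001.xml

    # ... or posted as JSON (a plain array of documents)
    curl 'http://localhost:8983/solr/my_collection/update?commitWithin=60000' \
         -H 'Content-Type: application/json' --data-binary @batch_0001.json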
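
And for the scp-then-curl idea: once the chunk files are already sitting on the Solr box, something along these lines is what I was picturing (again, the paths, collection name and parallelism level of 8 are just placeholders):

    # post all pre-generated chunk files with up to 8 parallel curl workers
    ls /data/chunks/*.json | xargs -P 8 -I{} \
        curl -s 'http://localhost:8983/solr/my_collection/update' \
             -H 'Content-Type: application/json' --data-binary @{}

    # issue a single commit once everything is in
    curl 'http://localhost:8983/solr/my_collection/update?commit=true'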
I have 106 fields in a document and for 20 of them are integer, 30 character varying and rest all are text fields. We don't have any blob data. To add on to this, 40% of the documents are of smaller in size as it has less or no content in text field values(we don't index that field at all in this case) and rest are of the bigger in size. PG data is all of varchar as well as text and we do trim the data before converting to xml, so we can avoid about trailing blanks. How do you query?: This is used for a free text search portal where Solr is used to retrieve the data mainly based on the text field values. We do have a copy field all_text which copies most of these 106 fields to a single big_text filed which is the default search field in df param. This is not a new project, but is been live for an year now, search works as expected. We do push incremental data on a daily basis and the indexing issue occurs when we have to re index the full data again for certain field additions and all. Just curious, - is there any difference in posting the data in json format vs xml? - do we get any performance improvement if we generate the json/xml files, scp to the solr server and then push via curl command Regards, Aneesh N On Fri, Mar 4, 2016 at 1:44 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote: > What does a typical document look like - number of columns, data type, > size? How much is text vs. numeric? Are there any large blobs? I mean, 15M > docs in 425GB indicates about 28K per row/document which seems rather > large. > > Is the PG data VARCHAR(n) or CHAR(n). IOW, might it have lots of trailing > blanks for text columns? > > As always, the very first question in Solr data modeling is always how do > you intend to query the data - queries will determine the data model. > > Ultimately, the issue will not be how long it takes to index, but query > latency and query throughput. > > 30GB sounds way too small for a 425GB index in terms of odds of low query > latency. > > -- Jack Krupansky > > On Thu, Mar 3, 2016 at 12:54 PM, Aneesh Mon N <aneeshm...@gmail.com> > wrote: > > > Hi, > > > > We are facing a huge performance issue while indexing the data to Solr, > we > > have around 15 million records in a PostgreSql database which has to be > > indexed to Solr 5.3.1 server. > > It takes around 16 hours to complete the indexing as of now. > > > > To be noted that all the fields are stored so as to support the atomic > > updates. > > > > Current approach: > > We use a ETL tool(Pentaho) to fetch the data from database in chunks of > > 1000 records, convert them into xml format and pushes to Solr. This is > run > > in 10 parallel threads. > > > > System params > > Solr Version: 5.3.1 > > Size on disk: 425 GB > > > > Database, ETL machine and SOLR are of 16 core and 30 GB RAM > > Database and SOLR Disk: RAID > > > > Any pointers best approaches to index these kind of data would be > helpful. > > > > -- > > Regards, > > Aneesh Mon N > > Chennai > > +91-8197-188-588 > > > -- Regards, Aneesh Mon N Bangalore +91-8197-188-588