Hi Jack

Thanks for the response.

I have 106 fields in a document; 20 of them are integer, 30 are
character varying, and the rest are text fields. We don't have any
blob data.

To add to this, about 40% of the documents are smaller because they
have little or no content in the text field values (we don't index
those fields at all in that case); the rest are bigger.

The PG data is all varchar and text, and we trim the data before
converting it to XML, so we avoid trailing blanks.

How do you query?: This is used for a free-text search portal where
Solr retrieves data mainly based on the text field values. We have a
copy field setup (all_text) that copies most of these 106 fields into
a single big_text field, which is the default search field set in the
df param.
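
For context, the catch-all copy is declared roughly like the sketch
below (the collection name "mycollection" and the wildcard source are
placeholders, not our real config; with a classic schema.xml the
equivalent would be a <copyField source="..." dest="big_text"/> line
instead of the Schema API call):

  # sketch only, assuming a managed schema; names are placeholders
  curl -X POST -H 'Content-type:application/json' \
    'http://localhost:8983/solr/mycollection/schema' \
    --data-binary '{ "add-copy-field": { "source": "*", "dest": "big_text" } }'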

This is not a new project; it has been live for a year now and search
works as expected.
We push incremental data on a daily basis, and the indexing issue
occurs only when we have to re-index the full data set for certain
field additions and the like.

Just curious,

   - is there any difference in posting the data in JSON format vs. XML?
   - do we get any performance improvement if we generate the JSON/XML
   files, scp them to the Solr server, and then push them via the curl
   command? (rough sketches of both formats below)
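
To make the questions concrete, this is roughly what we mean by the
two options (collection and field names are placeholders, not our real
schema):

  # XML update, roughly what the ETL pushes today
  curl 'http://localhost:8983/solr/mycollection/update' \
    -H 'Content-Type: text/xml' \
    --data-binary '<add><doc><field name="id">1</field><field name="big_text">sample text</field></doc></add>'

  # the same document as a JSON update
  curl 'http://localhost:8983/solr/mycollection/update' \
    -H 'Content-Type: application/json' \
    --data-binary '[{"id":"1","big_text":"sample text"}]'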


Regards,
Aneesh N

On Fri, Mar 4, 2016 at 1:44 AM, Jack Krupansky <jack.krupan...@gmail.com>
wrote:

> What does a typical document look like - number of columns, data type,
> size? How much is text vs. numeric? Are there any large blobs? I mean, 15M
> docs in 425GB indicates about 28K per row/document which seems rather
> large.
>
> Is the PG data VARCHAR(n) or CHAR(n)? IOW, might it have lots of trailing
> blanks for text columns?
>
> As always, the very first question in Solr data modeling is how you
> intend to query the data - queries will determine the data model.
>
> Ultimately, the issue will not be how long it takes to index, but query
> latency and query throughput.
>
> 30GB sounds way too small for a 425GB index in terms of odds of low query
> latency.
>
> -- Jack Krupansky
>
> On Thu, Mar 3, 2016 at 12:54 PM, Aneesh Mon N <aneeshm...@gmail.com>
> wrote:
>
> > Hi,
> >
> > We are facing a huge performance issue while indexing data to Solr:
> > we have around 15 million records in a PostgreSQL database that have
> > to be indexed into a Solr 5.3.1 server.
> > It takes around 16 hours to complete the indexing as of now.
> >
> > Note that all the fields are stored so as to support atomic updates.
> >
> > Current approach:
> > We use an ETL tool (Pentaho) to fetch the data from the database in
> > chunks of 1000 records, convert them into XML format, and push them
> > to Solr. This is run in 10 parallel threads.
> >
> > System params
> > Solr Version: 5.3.1
> > Size on disk: 425 GB
> >
> > Database, ETL machine and Solr server each have 16 cores and 30 GB RAM
> > Database and Solr disk: RAID
> >
> > Any pointers on the best approaches to index this kind of data would
> > be helpful.
> >
> > --
> > Regards,
> > Aneesh Mon N
> > Chennai
> > +91-8197-188-588
> >
>



-- 
Regards,
Aneesh Mon N
Bangalore
+91-8197-188-588
