Suresh and Meena,

I have solved this problem by taking the row number of each result in a
query and adding its modulo as another field called threadid. The base query
is then wrapped in an outer query that selects only one subset of the
results for indexing. Using the modulo of the row number, rather than of an
id column, was intentional - you cannot rely on id columns to be well
distributed, and you cannot rely on the number of rows to stay constant over
time.

To make it more concrete, I have a base DataImportHandler configuration that
looks something like the one below; your SQL may differ, as we use Oracle
(RowNum is Oracle-specific).

 <entity name="medsite" dataSource="oltp01_prod"
         rootEntity="true"
         query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid
                FROM medplus.public_topic_sites_us_v t)
                WHERE threadid = %%d%%"
         transformer="TemplateTransformer">
        ...

 </entity>


To make the import multi-threaded, I then generate 4 configuration files
from this template as follows:

echo "Medical Sites Configuration - "
${MEDSITES_CONF:=medical-sites-conf.xml}
echo "Medical Sites Prototype - "
${MEDSITES_PROTOTYPE:=medical-sites-%%d%%-conf.xml}
for tid in `seq 0 3`; do
   MEDSITES_OUT=`echo $MEDSITES_PROTOTYPE| sed -e "s/%%d%%/$tid/"`
   sed -e "s/%%d%%/$tid/" $MEDSITES_CONF > $MEDSITES_OUT
done
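
The result, assuming the default names above, is medical-sites-0-conf.xml
through medical-sites-3-conf.xml, identical except that each one's query
selects its own partition, e.g. the first ends up with:

 query="SELECT * FROM (SELECT t.*, Mod(RowNum, 4) threadid
        FROM medplus.public_topic_sites_us_v t)
        WHERE threadid = 0"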


Then, I have 4 requestHandlers in solrconfig.xml, one pointing to each of
these files. They are "/import/medical-sites-0" through
"/import/medical-sites-3". Note that this wouldn't work with a single,
parameterized Data Import Handler - a given Data Import Handler is either
idle or busy, and can no longer be run in multiple threads. How this would
work if the first entity weren't the root entity is another question - with
SQL you can usually structure things so that the first query is the root
entity; XML sources are another story, however.
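
For reference, each of those handlers is just the standard DataImportHandler
registration pointing at one of the generated files. A minimal sketch,
assuming the file names produced by the script above:

 <requestHandler name="/import/medical-sites-0"
                 class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
         <str name="config">medical-sites-0-conf.xml</str>
     </lst>
 </requestHandler>
 <!-- ...and likewise /import/medical-sites-1 through /import/medical-sites-3,
      each pointing at its own medical-sites-N-conf.xml -->

Because each partition has its own handler, the four full imports can be
kicked off at the same time, e.g. with a loop like the one below ("yourcore"
is a placeholder for whatever your core is actually called; clean=false
keeps one import from wiping out the others' work):

 for tid in `seq 0 3`; do
    curl "http://localhost:8983/solr/yourcore/import/medical-sites-$tid?command=full-import&clean=false" &
 done
 wait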

I did it this way because I wanted to stay with Solr "out-of-the-box"; the
exercise was an evaluation of what Data Import Handler could do. If I didn't
have a business requirement to evaluate whether out-of-the-box Solr could do
a multithreaded database import, I'd probably write a multi-threaded
front-end that did the queries and transformations I needed. In this case, I
was considering the best way to do "all" our data imports from an RDBMS, and
Data Import Handler is the only good solution that involves writing
configuration rather than code. The distinction is slight, I think.

Hope this helps,

Dan Davis

On Wed, Feb 4, 2015 at 3:02 AM, Mikhail Khludnev
<mkhlud...@griddynamics.com> wrote:

> Suresh,
>
> There are a few common workarounds for such a problem. But I think that
> submitting more than "maxIndexingThreads" is not really productive. Also, I
> think the out-of-memory problem is caused not by indexing, but by opening a
> searcher. Do you really need to open it? I don't think it's a good idea to
> search on an instance that is cooking a many-terabyte index at the same
> time. Are you sure you don't issue superfluous commits, and that you've
> disabled auto-commit?
>
> Let's nail down the OOM problem first, and then deal with indexing speedup.
> I like huge indices!
>
> On Wed, Feb 4, 2015 at 1:10 AM, Arumugam, Suresh <suresh.arumu...@emc.com>
> wrote:
>
> > We are also facing the same problem in loading 14 Billion documents into
> > Solr 4.8.10.
> >
> > Dataimport runs single-threaded, which is taking more than 3 weeks. It
> > works fine without any issues, but it takes months to complete the load.
> >
> > When we tried SolrJ with the below configuration for a multithreaded
> > load, Solr takes more memory and at one point we end up running out of
> > memory as well.
> >
> >         Batch Doc count      :  100000 docs
> >         No of Threads          : 16/32
> >
> >         Solr Memory Allocated : 200 GB
> >
> > The reason can be as below.
> >
> >         Solr takes a snapshot whenever we open a SearchIndexer. Due to
> >         this, more memory is consumed and Solr is extremely slow while
> >         running 16 or more threads for loading.
> >
> > If anyone has already done a multithreaded data load into Solr in a
> > quicker way, can you please share the code or logic for using the SolrJ
> > API?
> >
> > Thanks in advance.
> >
> > Regards,
> > Suresh.A
> >
> > -----Original Message-----
> > From: Dyer, James [mailto:james.d...@ingramcontent.com]
> > Sent: Tuesday, February 03, 2015 1:58 PM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Solr 4.9 Calling DIH concurrently
> >
> > DIH is single-threaded.  There was once a threaded option, but it was
> > buggy and subsequently was removed.
> >
> > What I do is partition my data and run multiple DIH request handlers at
> > the same time. It means redundant sections in solrconfig.xml, and it's not
> > very elegant, but it works.
> >
> > For instance, for a sql query, I add something like this:
> > "where mod(id, ${dataimporter.request.numPartitions})=${dataimporter.request.currentPartition}".
> >
> > I think, though, most users who want to make the most out of
> > multithreading write their own program and use the solrj api to send the
> > updates.
> >
> > James Dyer
> > Ingram Content Group
> >
> >
> > -----Original Message-----
> > From: meena.sri...@mathworks.com [mailto:meena.sri...@mathworks.com]
> > Sent: Tuesday, February 03, 2015 3:43 PM
> > To: solr-user@lucene.apache.org
> > Subject: Solr 4.9 Calling DIH concurrently
> >
> > Hi
> >
> > I am using Solr 4.9 and need to index millions of documents from a
> > database. I am using DIH and sending requests to fetch by ids. Is there a
> > way to run multiple indexing threads concurrently in DIH?
> > I want to take advantage of the
> > <maxIndexingThreads>
> > parameter. How do I do it? I am just invoking the DIH handler using the
> > solrj HttpSolrServer and issuing requests sequentially.
> >
> >
> http://localhost:8983/solr/db/dataimport?command=full-import&clean=false&maxId=100&minId=1
> >
> >
> >
> http://localhost:8983/solr/db/dataimport?command=full-import&clean=false&maxId=201&minId=101
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Solr-4-9-Calling-DIH-concurrently-tp4183744.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> <mkhlud...@griddynamics.com>
>
