We import anywhere from five to fifty million small documents a day from a postgres database. I wrestled with the DIH stuff for about a year and was much happier when I ditched that approach and instead wrote the few hundred lines of relatively simple code that directly handle the logic of what gets updated and how it gets queried from postgres.

The DIH stuff is great for lots of cases, but if you are getting to the point of trying to hack its undocumented internals, I suspect you are better off spending a day or two of your time just writing all of the update logic yourself.

We found a relatively simple combination of postgres triggers, CSV exports driven by those triggers, and calls to update/csv to work best for us.
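The last step of that pipeline might look something like the sketch below, assuming rows arrive as dicts from a trigger-maintained change table (the function and column names here are hypothetical, not Hal's actual code) and get posted to Solr's CSV update handler:

```python
import csv
import io
import urllib.request

def rows_to_csv(rows, fieldnames):
    """Serialize row dicts into a CSV string that Solr's CSV
    update handler can ingest (header line, then one row each)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def post_csv(solr_base_url, csv_body):
    """POST a CSV batch to the core's update/csv handler and commit.
    solr_base_url is a placeholder, e.g. http://localhost:8983/solr/core."""
    url = solr_base_url.rstrip("/") + "/update/csv?commit=true"
    req = urllib.request.Request(
        url,
        data=csv_body.encode("utf-8"),
        headers={"Content-Type": "text/csv; charset=utf-8"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The CSV column names must match the Solr schema fields; committing on every batch is shown for simplicity and would normally be tuned for the import volumes described above.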

-hal

On 3/16/15 9:59 AM, Shawn Heisey wrote:
On 3/16/2015 7:15 AM, sreedevi s wrote:
I had checked this post. I don't know whether this is possible, but my query
is whether I can use the DIH configuration for indexing via SolrJ.

You can use SolrJ for accessing DIH.  I have code that does this, but
only for full index rebuilds.

It won't be particularly obvious how to do it.  Writing code that can
interpret DIH status and know when it finishes, succeeds, or fails is
very tricky, because DIH only exposes human-readable status info, not
machine-readable, and the info is not very consistent.

I can't just share my code, because it's extremely convoluted ... but
the general gist is to create a SolrQuery object, use setRequestHandler
to set the handler to "/dataimport" or whatever your DIH handler is, and
set the other parameters on the request like "command" to "full-import"
and so on.
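A plain-HTTP sketch of that gist (not Shawn's SolrJ code, and the status-string heuristics are assumptions based on typical DIH responses, which are free-text and may vary by version) could look like:

```python
import urllib.parse

def dih_url(solr_base_url, command, handler="/dataimport"):
    """Build the request URL for a DIH command such as full-import,
    hitting whatever handler path DIH is registered under."""
    params = urllib.parse.urlencode({"command": command, "wt": "json"})
    return solr_base_url.rstrip("/") + handler + "?" + params

def dih_finished(status_response):
    """Heuristic check of DIH's status payload (a parsed JSON dict).

    DIH typically reports 'busy' while an import runs and 'idle' when
    done; success vs. failure has to be guessed from the free-text
    statusMessages, which is exactly the fragility described above.
    Returns (finished, succeeded)."""
    if status_response.get("status", "") == "busy":
        return False, None
    messages = status_response.get("statusMessages", {})
    failed = any("Failed" in str(v) or "Aborted" in str(v)
                 for v in messages.values())
    return True, not failed
```

A caller would request `dih_url(base, "full-import")`, then poll `dih_url(base, "status")` and feed the parsed response to `dih_finished` until it reports done.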

Thanks,
Shawn


--
Hal Roberts
Fellow
Berkman Center for Internet & Society
Harvard University
