We import anywhere from five to fifty million small documents a day from
a postgres database. I wrestled with getting the DIH stuff to work for
us for about a year, and I was much happier when I ditched that approach
and switched to writing a few hundred lines of relatively simple code
that directly handles the logic of what gets updated and how it gets
queried from postgres.
The DIH stuff is great for lots of cases, but if you are getting to the
point of trying to hack its undocumented internals, I suspect you are
better off spending a day or two of your time just writing all of the
update logic yourself.
We found that a relatively simple combination of postgres triggers, CSV
exports driven by those triggers, and calls to the /update/csv handler
worked best for us.
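(None of this code is in the thread, but the fiddliest part of a homegrown CSV export tends to be the quoting. A minimal sketch, with class and method names of my own choosing, assuming Solr's default double-quote encapsulator on /update/csv:)

```java
import java.util.List;

public class CsvRows {
    // Quote one field for Solr's /update/csv handler (default
    // encapsulator is a double quote): wrap the value in quotes and
    // double any embedded quotes, RFC 4180 style.
    static String csvField(String value) {
        if (value == null) {
            return "";
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join one exported row into a single CSV line.
    static String csvLine(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(csvField(fields.get(i)));
        }
        return sb.toString();
    }
}
```

With quoting handled this way, embedded commas, quotes, and newlines in document bodies survive the round trip without tripping up the CSV loader.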
-hal
On 3/16/15 9:59 AM, Shawn Heisey wrote:
On 3/16/2015 7:15 AM, sreedevi s wrote:
I had checked this post. I don't know whether this is possible, but my
query is whether I can use the DIH configuration for indexing via SolrJ.
You can use SolrJ for accessing DIH. I have code that does this, but
only for full index rebuilds.
It won't be particularly obvious how to do it. Writing code that can
interpret DIH status and know when it finishes, succeeds, or fails is
very tricky, because DIH only reports human-readable status info, not
machine-readable info, and the info is not very consistent.
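(To illustrate why this is fragile: the only signals a client gets are a status flag and free-text messages whose wording varies across Solr versions, so any interpreter ends up being a keyword heuristic. A sketch of that heuristic, with names and the message strings in the test chosen by me as illustrations:)

```java
public class DihStatus {
    enum Result { RUNNING, SUCCEEDED, FAILED }

    // Heuristic interpretation of a DIH status response. DIH reports
    // status "busy" while an import runs; once it is "idle", the only
    // clue to success or failure is the wording of the human-readable
    // statusMessages text, so we scan it for failure keywords.
    static Result interpret(String status, String messages) {
        if (!"idle".equals(status)) {
            return Result.RUNNING;
        }
        String text = messages == null ? "" : messages.toLowerCase();
        if (text.contains("failed") || text.contains("aborted")
                || text.contains("rolled back")) {
            return Result.FAILED;
        }
        return Result.SUCCEEDED;
    }
}
```

Any change to the message wording in a new Solr release can silently break a parser like this, which is exactly the inconsistency problem described above.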
I can't just share my code, because it's extremely convoluted ... but
the general gist is to create a SolrQuery object, use setRequestHandler
to set the handler to "/dataimport" or whatever your DIH handler is, and
set the other parameters on the request like "command" to "full-import"
and so on.
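(In SolrJ that gist is: build a SolrQuery, call setRequestHandler("/dataimport"), set "command" to "full-import", and run it through your SolrClient. Under the hood that amounts to a plain HTTP GET, which this stdlib-only sketch builds explicitly; the base URL, handler path, and parameter values are illustrative, not from the thread:)

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

public class DihRequest {
    // Build the URL a DIH command resolves to. With SolrJ you would put
    // the same parameters on a SolrQuery, call
    // setRequestHandler("/dataimport"), and execute it via a SolrClient;
    // the resulting request is equivalent to this GET URL.
    static String dihUrl(String baseUrl, String handler,
                         Map<String, String> params) {
        StringBuilder sb = new StringBuilder(baseUrl).append(handler).append('?');
        boolean first = true;
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (!first) {
                sb.append('&');
            }
            first = false;
            sb.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
              .append('=')
              .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return sb.toString();
    }
}
```

For example, passing "command" = "full-import" and "clean" = "true" against a core at http://localhost:8983/solr/mycore yields the same request SolrJ would send to the /dataimport handler.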
Thanks,
Shawn
--
Hal Roberts
Fellow
Berkman Center for Internet & Society
Harvard University