The disadvantages of DIH are: 1> it's a black box, so debugging it isn't easy; 2> it puts all the work on the Solr node. Parsing documents in various formats can be pretty heavyweight and steal cycles from indexing and searching. 2a> the extracting request handler also puts all the load on Solr, FWIW.
Personally I prefer an external program (and I was gratified to see Yavar's
reference to the indexing-with-SolrJ article). But then I'm a Java
programmer by training, so that seems easy to me.

Best,
Erick

On Tue, Apr 7, 2015 at 7:41 AM, Dan Davis <dansm...@gmail.com> wrote:
> Sangeetha,
>
> You can also run Tika directly from the Data Import Handler, and the Data
> Import Handler can be made to run several threads if you can partition the
> input documents by directory or database id. I've done 4 "threads" by
> having a base configuration that does an Oracle query like this:
>
> SELECT * FROM (SELECT id, url, ..., MOD(ROWNUM, 4) AS threadid FROM ...
> WHERE ...) WHERE threadid = %d
>
> A bash/sed script writes several data import handler XML files, and I can
> then index several threads at a time.
>
> Each of these threads can then use all the transformers, e.g.
> TemplateTransformer, and XML can be transformed via XSLT.
>
> The Data Import Handler also has entity processors that go out to the web
> and then index the document via Tika.
>
> If you are indexing generic HTML, you may want to figure out an approach
> to SOLR-3808 and SOLR-2250. These can be resolved by recompiling Solr and
> Tika locally, because Boilerpipe has a bug that has been fixed but not
> pushed to Maven Central. Without that, the ASF cannot include the fix, but
> distributions such as LucidWorks Solr Enterprise can.
>
> I can drop some configs into github.com if I clean them up to obfuscate
> host names, passwords, and such.
>
>
> On Tue, Apr 7, 2015 at 9:14 AM, Yavar Husain <yavarhus...@gmail.com> wrote:
>
>> Well, I have indexed heterogeneous sources, including a variety of
>> NoSQL stores, RDBMSes, and rich documents (PDF, Word, etc.), using SolrJ.
>> The only prerequisite for using SolrJ is that you have an API to fetch
>> data from your data source (say JDBC for an RDBMS, Tika for extracting
>> text content from rich documents, etc.); then SolrJ is so damn great and
>> simple. It's as simple as downloading the jar, and a few lines of code
>> will send data to your Solr server after pre-processing it. More details
>> here:
>>
>> http://lucidworks.com/blog/indexing-with-solrj/
>>
>> https://wiki.apache.org/solr/Solrj
>>
>> http://www.solrtutorial.com/solrj-tutorial.html
>>
>> Cheers,
>> Yavar
>>
>>
>>
>> On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com <
>> sangeetha.subraman...@gtnexus.com> wrote:
>>
>> > Hi,
>> >
>> > I am a newbie to Solr and come from a database background. We have a
>> > requirement to index files of different formats (X12, EDIFACT, CSV,
>> > XML). The input files can be of any of these formats, and we need to
>> > do a content-based search on them.
>> >
>> > From the web I understand we can use the Tika processor to extract the
>> > content and store it in Solr. What I want to know is: is there a better
>> > approach for indexing files in Solr? Can we index the documents by
>> > streaming them directly from the application? If so, what are the
>> > disadvantages of that (versus DIH, which fetches from the database)?
>> > Could someone share some insight on this? Are there any web links I can
>> > refer to to get some idea of it? Please do help.
>> >
>> > Thanks
>> > Sangeetha
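
A minimal sketch of the external SolrJ + Tika approach Erick and Yavar
describe, assuming Solr 5.x-era SolrJ; the URL, core name, and field names
are illustrative, not taken from the thread:

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class TikaSolrJIndexer {
    public static void main(String[] args) throws Exception {
        // Illustrative URL and core name; adjust to your installation.
        SolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/collection1");
        Tika tika = new Tika();
        for (String path : args) {
            File file = new File(path);
            // Tika auto-detects the format (PDF, Word, HTML, ...) and
            // returns plain text, so the parsing load stays in this client
            // program rather than on the Solr node.
            String text = tika.parseToString(file);
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getAbsolutePath()); // any unique key
            doc.addField("filename", file.getName());
            doc.addField("content", text); // assumes a "content" text field
            client.add(doc);
        }
        client.commit();
        client.close();
    }
}

Run it as "java TikaSolrJIndexer /path/to/doc1.pdf /path/to/doc2.docx" with
the SolrJ and Tika jars on the classpath. This is the point Erick makes
above: document parsing happens in the client, off the Solr node.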
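Dan's modulo partitioning isn't DIH-specific; the same idea can drive
several indexing threads in an external SolrJ program. A hedged sketch of
that variant follows (his actual setup generates several DIH XML configs
via bash/sed instead; the JDBC URL, table, and column names here are
hypothetical, and it partitions on a numeric id column rather than ROWNUM):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PartitionedIndexer {
    static final int THREADS = 4; // Dan's four "threads"

    public static void main(String[] args) throws Exception {
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final int threadId = i;
            workers[i] = new Thread(() -> indexPartition(threadId));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    static void indexPartition(int threadId) {
        // Each worker claims the rows where MOD(id, THREADS) = threadId,
        // mirroring the MOD(ROWNUM, 4) partitioning in Dan's query.
        String sql = "SELECT id, url FROM docs WHERE MOD(id, " + THREADS
                + ") = " + threadId;
        try (SolrClient client =
                 new HttpSolrClient("http://localhost:8983/solr/collection1");
             Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("url", rs.getString("url"));
                client.add(doc);
            }
            client.commit();
        } catch (Exception e) {
            e.printStackTrace(); // crude; real code would handle/retry
        }
    }
}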