On 10/6/2014 5:09 AM, Karunakar Reddy wrote:
> Please suggest me effective way of using data import handler.
>
> Here is my use case.
>
> I have different kind of items which needs to be indexed in solr . Eg(
> books, shoes,electronics etc... ) each one has in different relational
> table.
> I have only one core as of now which is been used for public search and for
> other search pages like (book search page/ electronics search page..)
> and updates are happening through indexing script which we are maintaining
> internally .
> We are planning to use DIH(data import handler).
>
> 1)Is it best way to use DIH/over indexing script? any pros and cons of
> using DIH?
>
> 2) How can we index different type of documents(books,electronic.. the
> data is there in different tables in mysql ) through document import
> handler?
>
> 3)What is the best way to do delta-import.? how do we fire delta-import
> request? is there any thing like auto delta import like autocommit?

If you already have an effective indexing method that does everything you need, I would suggest sticking with it. I think of DIH as a stopgap feature, a way to get started with Solr when using a structured data store, until you can write your own indexing procedure that is highly tailored to your situation. I'm actually still using DIH for full reindexes, controlled with SolrJ, but I have grand designs for replacing it with a multi-threaded approach that hopefully will be much faster.

DIH is a fairly efficient single-threaded way of accessing a single flat table space from a database. As soon as you try to make it include multiple and/or nested entities, its performance will often drop significantly. If you can reduce all of your interaction with the database to a single SELECT call -- using joins, a stored procedure, or something similar -- then you MIGHT be able to use DIH effectively. The DIH handler on each of my shards uses exactly one SELECT call.

There is currently no DIH scheduler built into Solr. There are two reasons that the idea has met with resistance:

1) There is already a built-in scheduling apparatus on *every* modern operating system, one that has been tested, debugged, and is generally bulletproof. If a feature like that is built into Solr, users will be unhappy if it doesn't work as advertised because we made a mistake in the code. I'd rather rely on an OS feature that's been around for multiple decades.

2) As a group, the developers are resistant to features that would cause Solr to make changes in the index without being *told* to do it by an outside force.

There is already an issue in Jira for a DIH scheduler, but the patch hasn't been committed. Some developers would like to include it.

Thanks,
Shawn
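
For anyone wondering what "let the OS scheduler do it" might look like in practice, here is a minimal sketch of a script that cron (or Task Scheduler) could run to fire a delta-import over HTTP. It assumes the handler is registered at the commonly used `/dataimport` path; the base URL, the core name `products`, and the helper names `build_dih_url` and `trigger` are made up for illustration, so adjust them to match your own solrconfig.xml.

```python
"""Sketch: fire a DIH delta-import from an externally scheduled script."""
import urllib.parse
import urllib.request


def build_dih_url(solr_base, core, command="delta-import", clean=False, commit=True):
    """Build the request URL for a DataImportHandler command.

    Assumes the handler is mapped to /dataimport (a common choice, but it
    is whatever path your solrconfig.xml registers).
    """
    params = urllib.parse.urlencode({
        "command": command,
        # clean=false keeps existing documents; DIH takes lowercase booleans.
        "clean": str(clean).lower(),
        "commit": str(commit).lower(),
        "wt": "json",
    })
    return "{}/{}/dataimport?{}".format(solr_base.rstrip("/"), core, params)


def trigger(url, timeout=30):
    """Send the request and return (HTTP status, response body)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8")
```

Calling `trigger(build_dih_url("http://localhost:8983/solr", "products"))` from a script scheduled every few minutes would give roughly the "auto delta import" behaviour asked about. The request returns as soon as the import is started, so a script that needs to know when it finished would have to poll with `command=status`.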