That's not quite the question I asked. Do a distinct on 'id' only in the database itself. If your ids are NOT unique, you need to create a composite or a virtual id for Solr. Because whatever your solrconfig.xml say is uniqueKey will be used to deduplicate the documents. If you have 10 documents with the same id value, only one will be in the final Solr.
I am not saying that's where the problem is, DIH is fiddly. But just get that out of the way. If that's not the case, you may need to isolate which documents are failing. The easiest way to do so is probably to index a smaller subset of records, say 1000. Pick a condition in your SQL to do so (e.g. id value range). Then, see how many made it into Solr. If not all 1000, export the list of IDs from SQL, then a list of IDs from Solr (use CSV format and just fl=id). Sort both, compare, see what ids are missing. Look what is strange about those documents as opposed to the documents that did make it into Solr. Try to push one of those missing documents explicitly into Solr by either modifying SQL query in DIH or as CSV or whatever. Good luck, Alex. ---- Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/ On 7 November 2015 at 19:07, Yangrui Guo <guoyang...@gmail.com> wrote: > Hi thanks for the continued support. I'm really worried as my project > deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select > distinct in the beginning of the query because IMDB doesn't have a table > for cast & crew. It puts movie and person and their roles into one huge > table 'cast_info'. Hence there are multiple rows for a director, one row > per his movie. > > On Saturday, November 7, 2015, Alexandre Rafalovitch <arafa...@gmail.com> > wrote: > >> Just to get the paranoid option out of the way, is 'id' actually the >> column that has unique ids in your database? If you do "select >> distinct id from imdb.director" - how many items do you get? >> >> Regards, >> Alex. >> ---- >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: >> http://www.solr-start.com/ >> >> >> On 7 November 2015 at 18:21, Yangrui Guo <guoyang...@gmail.com >> <javascript:;>> wrote: >> > Hello >> > >> > I'm being troubled by solr's data import handler. My solr version is >> 5.3.1 >> > and mysql is 5.5. I tried to index imdb data but found solr only >> partially >> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the >> query >> > result was 1636549. However DIH only fetched and indexed 287041 rows. I >> > didn't see any error in the log. Why was this happening? >> > >> > Here's my data-config.xml >> > >> > <dataConfig> >> > <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" >> > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" /> >> > <document> >> > <entity name="director" transformer="RegexTransformer" query="SELECT >> > DISTINCT * FROM imdb.director"> >> > <field name="id" column="id" /> >> > <field name="content_type" column="content_type" /> >> > </entity> >> > </document> >> > </dataConfig> >> > >> > Yangrui Guo >>