Yes the id is unique. If I only select distinct id,count(id) I get the same results. However I found this is more likely a MySQL issue. I created a new table called director1 and ran query "insert into director1 select * from director" I got only 287041 results inserted, which was the same as Solr. I don't know why the same query is causing two different results.
On Saturday, November 7, 2015, Alexandre Rafalovitch <arafa...@gmail.com> wrote: > That's not quite the question I asked. Do a distinct on 'id' only in > the database itself. If your ids are NOT unique, you need to create a > composite or a virtual id for Solr. Because whatever your > solrconfig.xml say is uniqueKey will be used to deduplicate the > documents. If you have 10 documents with the same id value, only one > will be in the final Solr. > > I am not saying that's where the problem is, DIH is fiddly. But just > get that out of the way. > > If that's not the case, you may need to isolate which documents are > failing. The easiest way to do so is probably to index a smaller > subset of records, say 1000. Pick a condition in your SQL to do so > (e.g. id value range). Then, see how many made it into Solr. If not > all 1000, export the list of IDs from SQL, then a list of IDs from > Solr (use CSV format and just fl=id). Sort both, compare, see what ids > are missing. Look what is strange about those documents as opposed to > the documents that did make it into Solr. Try to push one of those > missing documents explicitly into Solr by either modifying SQL query > in DIH or as CSV or whatever. > > Good luck, > Alex. > ---- > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 7 November 2015 at 19:07, Yangrui Guo <guoyang...@gmail.com > <javascript:;>> wrote: > > Hi thanks for the continued support. I'm really worried as my project > > deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select > > distinct in the beginning of the query because IMDB doesn't have a table > > for cast & crew. It puts movie and person and their roles into one huge > > table 'cast_info'. Hence there are multiple rows for a director, one row > > per his movie. > > > > On Saturday, November 7, 2015, Alexandre Rafalovitch <arafa...@gmail.com > <javascript:;>> > > wrote: > > > >> Just to get the paranoid option out of the way, is 'id' actually the > >> column that has unique ids in your database? If you do "select > >> distinct id from imdb.director" - how many items do you get? > >> > >> Regards, > >> Alex. > >> ---- > >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > >> http://www.solr-start.com/ > >> > >> > >> On 7 November 2015 at 18:21, Yangrui Guo <guoyang...@gmail.com > <javascript:;> > >> <javascript:;>> wrote: > >> > Hello > >> > > >> > I'm being troubled by solr's data import handler. My solr version is > >> 5.3.1 > >> > and mysql is 5.5. I tried to index imdb data but found solr only > >> partially > >> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the > >> query > >> > result was 1636549. However DIH only fetched and indexed 287041 rows. > I > >> > didn't see any error in the log. Why was this happening? > >> > > >> > Here's my data-config.xml > >> > > >> > <dataConfig> > >> > <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" > >> > url="jdbc:mysql://localhost:3306/imdb" user="root" > password="password" /> > >> > <document> > >> > <entity name="director" transformer="RegexTransformer" query="SELECT > >> > DISTINCT * FROM imdb.director"> > >> > <field name="id" column="id" /> > >> > <field name="content_type" column="content_type" /> > >> > </entity> > >> > </document> > >> > </dataConfig> > >> > > >> > Yangrui Guo > >> >