Yes the id is unique. If I only select distinct id,count(id) I get the same
results. However I found this is more likely a MySQL issue. I created a new
table called director1 and ran query "insert into director1 select * from
director" I got only 287041 results inserted, which was the same as Solr. I
don't know why the same query is causing two different results.

On Saturday, November 7, 2015, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

> That's not quite the question I asked. Do a distinct on 'id' only in
> the database itself. If your ids are NOT unique, you need to create a
> composite or a virtual id for Solr. Because whatever your
> solrconfig.xml say is uniqueKey will be used to deduplicate the
> documents. If you have 10 documents with the same id value, only one
> will be in the final Solr.
>
> I am not saying that's where the problem is, DIH is fiddly. But just
> get that out of the way.
>
> If that's not the case, you may need to isolate which documents are
> failing. The easiest way to do so is probably to index a smaller
> subset of records, say 1000. Pick a condition in your SQL to do so
> (e.g. id value range). Then, see how many made it into Solr. If not
> all 1000, export the list of IDs from SQL, then a list of IDs from
> Solr (use CSV format and just fl=id). Sort both, compare, see what ids
> are missing. Look what is strange about those documents as opposed to
> the documents that did make it into Solr. Try to push one of those
> missing documents explicitly into Solr by either modifying SQL query
> in DIH or as CSV or whatever.
>
> Good luck,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 7 November 2015 at 19:07, Yangrui Guo <guoyang...@gmail.com
> <javascript:;>> wrote:
> > Hi thanks for the continued support. I'm really worried as my project
> > deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> > distinct in the beginning of the query because IMDB doesn't have a table
> > for cast & crew. It puts movie and person and their roles into one huge
> > table 'cast_info'. Hence there are multiple rows for a director, one row
> > per his movie.
> >
> > On Saturday, November 7, 2015, Alexandre Rafalovitch <arafa...@gmail.com
> <javascript:;>>
> > wrote:
> >
> >> Just to get the paranoid option out of the way, is 'id' actually the
> >> column that has unique ids in your database? If you do "select
> >> distinct id from imdb.director" - how many items do you get?
> >>
> >> Regards,
> >>    Alex.
> >> ----
> >> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> >> http://www.solr-start.com/
> >>
> >>
> >> On 7 November 2015 at 18:21, Yangrui Guo <guoyang...@gmail.com
> <javascript:;>
> >> <javascript:;>> wrote:
> >> > Hello
> >> >
> >> > I'm being troubled by solr's data import handler. My solr version is
> >> 5.3.1
> >> > and mysql is 5.5. I tried to index imdb data but found solr only
> >> partially
> >> > indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director" and the
> >> query
> >> > result was 1636549. However DIH only fetched and indexed 287041 rows.
> I
> >> > didn't see any error in the log. Why was this happening?
> >> >
> >> > Here's my data-config.xml
> >> >
> >> > <dataConfig>
> >> > <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
> >> > url="jdbc:mysql://localhost:3306/imdb" user="root"
> password="password" />
> >> > <document>
> >> > <entity name="director" transformer="RegexTransformer" query="SELECT
> >> > DISTINCT * FROM imdb.director">
> >> > <field name="id" column="id" />
> >> > <field name="content_type" column="content_type" />
> >> > </entity>
> >> > </document>
> >> > </dataConfig>
> >> >
> >> > Yangrui Guo
> >>
>

Reply via email to