That's not quite the question I asked. Run a distinct on 'id' only, in
the database itself. If your ids are NOT unique, you need to create a
composite or a virtual id for Solr, because whatever field your
schema.xml (or managed-schema) declares as uniqueKey will be used to
deduplicate the documents. If you have 10 documents with the same id
value, only one will end up in the final Solr index.
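
To illustrate the overwrite behaviour described above, here is a minimal
Python sketch. It does not talk to Solr; it only simulates "last document
with a given uniqueKey wins" using a dict. The column names person_id and
movie_id are made up for the example and are not taken from your schema.

```python
def surviving_docs(rows, key="id"):
    """Simulate uniqueKey overwrite: a later row with the same key
    replaces the earlier one, so only one document per key survives."""
    index = {}
    for row in rows:
        index[row[key]] = row
    return list(index.values())


def with_composite_id(rows, parts=("person_id", "movie_id")):
    """Return copies of the rows whose 'id' is built from several columns.
    The column names are hypothetical; use whatever makes a row unique."""
    return [dict(row, id="-".join(str(row[p]) for p in parts)) for row in rows]


# Three rows, two of which share the same id (one director, two movies).
rows = [
    {"id": 1, "person_id": 1, "movie_id": 10},
    {"id": 1, "person_id": 1, "movie_id": 11},
    {"id": 2, "person_id": 2, "movie_id": 10},
]

print(len(surviving_docs(rows)))                     # 2
print(len(surviving_docs(with_composite_id(rows))))  # 3
```

With the plain id, one of the three rows is silently replaced; with the
composite id, all three survive.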

I am not saying that's where the problem is; DIH is fiddly. But just
get that out of the way first.

If that's not the case, you may need to isolate which documents are
failing. The easiest way is probably to index a smaller subset of
records, say 1000. Pick a condition in your SQL to do so (e.g. an id
value range), then see how many made it into Solr. If not all 1000
did, export the list of ids from SQL and the list of ids from Solr
(use CSV format and just fl=id). Sort both, compare, and see which ids
are missing. Look at what is strange about those documents as opposed
to the documents that did make it into Solr. Try to push one of the
missing documents into Solr explicitly, either by modifying the SQL
query in DIH or by posting it as CSV or in whatever other format is
convenient.
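
The compare step can be sketched in Python roughly as follows. This
assumes both exports are CSV with the id in the first column and an
optional header line "id"; the function names are made up for the
example. Ids are compared as strings, so the result is sorted as text.

```python
import csv
import io


def read_ids(csv_file):
    """Collect the first column of a CSV file object into a set,
    skipping blank lines and a header row whose value is 'id'."""
    ids = set()
    for row in csv.reader(csv_file):
        if not row:
            continue
        value = row[0].strip()
        if value and value != "id":
            ids.add(value)
    return ids


def missing_ids(sql_file, solr_file):
    """Return the ids present in the SQL export but absent from Solr."""
    return sorted(read_ids(sql_file) - read_ids(solr_file))


# Tiny in-memory example; with real exports, pass open(...) file objects.
sql_export = io.StringIO("id\n1\n2\n3\n")
solr_export = io.StringIO("id\n1\n3\n")
print(missing_ids(sql_export, solr_export))  # ['2']
```

Whatever ids come back are the ones to inspect in the database.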

Good luck,
   Alex.
----
Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 7 November 2015 at 19:07, Yangrui Guo <guoyang...@gmail.com> wrote:
> Hi thanks for the continued support. I'm really worried as my project
> deadline is near. It was 1636549 in MySQL vs 287041 in Solr. I put select
> distinct in the beginning of the query because IMDB doesn't have a table
> for cast & crew. It puts movie and person and their roles into one huge
> table 'cast_info'. Hence there are multiple rows for a director, one row
> per his movie.
>
> On Saturday, November 7, 2015, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
>> Just to get the paranoid option out of the way, is 'id' actually the
>> column that has unique ids in your database? If you do "select
>> distinct id from imdb.director" - how many items do you get?
>>
>> Regards,
>>    Alex.
>> ----
>> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
>> http://www.solr-start.com/
>>
>>
>> On 7 November 2015 at 18:21, Yangrui Guo <guoyang...@gmail.com> wrote:
>> > Hello
>> >
>> > I'm being troubled by solr's data import handler. My solr version is
>> > 5.3.1 and mysql is 5.5. I tried to index imdb data but found solr only
>> > partially indexed. I ran "SELECT DISTINCT COUNT(*) FROM imdb.director"
>> > and the query result was 1636549. However DIH only fetched and indexed
>> > 287041 rows. I didn't see any error in the log. Why was this happening?
>> >
>> > Here's my data-config.xml
>> >
>> > <dataConfig>
>> > <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
>> > url="jdbc:mysql://localhost:3306/imdb" user="root" password="password" />
>> > <document>
>> > <entity name="director" transformer="RegexTransformer" query="SELECT
>> > DISTINCT * FROM imdb.director">
>> > <field name="id" column="id" />
>> > <field name="content_type" column="content_type" />
>> > </entity>
>> > </document>
>> > </dataConfig>
>> >
>> > Yangrui Guo
>>
