Hi Paul, hi Glen, hi all,
thank you for your answers.
I have followed Paul's solution (as I received it earlier). (I'll keep
your suggestion in mind, though, Glen.)
It looks good, except that it's not creating any documents... ;-)
It is most probably some misunderstanding on my side, and maybe you can
help me correct that?
So, I have subclassed SqlEntityProcessor, essentially overriding
nextRow() as Paul suggested:
public Map<String, Object> nextRow() {
    if (rowcache != null)
        return getFromRowCache();
    if (rowIterator == null) {
        String q = getQuery();
        initQuery(resolver.replaceTokens(q));
    }
    Map<String, Object> pivottedRow = new HashMap<String, Object>();
    Map<String, Object> fieldRow = getNext();
    while (fieldRow != null) {
        // populate pivottedRow: one entry per fieldRow, keyed by its
        // "name" column and holding its "value" column
        fieldRow = getNext();
    }
    pivottedRow = applyTransformer(pivottedRow);
    log.info("Returning: " + pivottedRow);
    return pivottedRow;
}
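(The elided population step is essentially the following; nameCol and
valueCol are fields holding the columnName/columnValue attributes from
the entity declaration further below, i.e. "name" and "value". For
multi-valued fields such as cat the value additionally gets wrapped in a
List; that part is left out here.)

while (fieldRow != null) {
    Object key = fieldRow.get(nameCol);    // e.g. "audio"
    Object val = fieldRow.get(valueCol);   // e.g. 2
    if (key != null) {
        pivottedRow.put(key.toString(), val);
    }
    fieldRow = getNext();
}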
This seems to work as intended. From the log output, I can see that I
get only the rows that I expect for one iteration, in the correct
key-value structure. I can also see that the returned pivottedRow is
what I want it to be: a map with one entry per column, where each column
holds what was previously stored as a row.
Example (shortened):
INFO: Next fieldRow: {value=2, name=audio, id=1}
INFO: Next fieldRow: {value=773, name=cat, id=23}
INFO: Next fieldRow: {value=642058, name=sid, id=17}
INFO: Returning: {sid=642058, cat=[773], audio=2}
The entity declaration in the DIH config (db_data_config.xml) looks like
this (shortened):
<entity name="my_value" processor="PivotSqlEntityProcessor"
        columnValue="value" columnName="name"
        query="select id, name, value from datamart where
               parent_id=${id_definition.ID} and id in (1,23,17)">
    <field column="sid" name="sid" />
    <field column="audio" name="audio" />
    <field column="cat" name="cat" />
</entity>
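For reference, here is a sketch of how the custom columnName/columnValue
attributes are picked up in init() via Context.getEntityAttribute() (my
actual class is a bit longer):

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class PivotSqlEntityProcessor extends SqlEntityProcessor {

    // DB columns that hold the attribute name and the attribute value
    private String nameCol;
    private String valueCol;

    @Override
    public void init(Context context) {
        super.init(context);
        // custom attributes from the <entity> declaration above
        nameCol = context.getEntityAttribute("columnName");   // -> "name"
        valueCol = context.getEntityAttribute("columnValue"); // -> "value"
    }

    // nextRow() as shown above
}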
id_definition is the root entity. Per parent_id there are several rows
in the datamart table which describe one data set (=>lucene document).
The object type of "value" is either String, String[] or List. I am not
handling that explicitly yet. If that were the problem, it would throw
an exception, wouldn't it?
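Should I need to handle it after all, I'd probably normalize everything
to a List before putting it into pivottedRow, with a small helper along
these lines (hypothetical, not in my code yet):

// (imports: java.util.Arrays, java.util.Collections, java.util.List)
@SuppressWarnings("unchecked")
private List<Object> asList(Object value) {
    if (value instanceof List) {
        return (List<Object>) value;
    }
    if (value instanceof Object[]) {       // covers String[]
        return Arrays.asList((Object[]) value);
    }
    return Collections.singletonList(value);
}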
But it is not creating any documents at all, although the data seems to
be returned correctly from the processor, so it's probably something far
more fundamental.
<str name="Total Requests made to DataSource">1069</str>
<str name="Total Rows Fetched">1069</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2009-07-23 12:57:07</str>
<str name="">
Indexing completed. Added/Updated: 0 documents. Deleted 0 documents.
</str>
Any help / hint on what the root cause is or how to debug it would be
greatly appreciated.
Thank you!
Chantal
Noble Paul നോബിള് नोब्ळ् wrote:
Alternatively, you can write your own EntityProcessor and just override
nextRow(). I guess you can still use the JdbcDataSource.
On Wed, Jul 22, 2009 at 10:05 PM, Chantal
Ackermann<chantal.ackerm...@btelligent.de> wrote:
Hi all,
this is my first post, as I am new to Solr (some Lucene experience).
I am trying to load data from an existing datamart into Solr using the
DataImportHandler, but in my opinion it is too slow due to the special
structure of the datamart I have to use.
Root Cause:
This datamart uses a row-based approach (pivot) to present its data. This
was done to allow adding more attributes to a data set without having to
change the table structure.
Impact:
To use the DataImportHandler, I have to pivot the data to recreate one
row per data set. Unfortunately, this results in more queries, and less
performant ones. Moreover, there are sometimes multiple rows for a single
attribute, which require separate queries, or trickier subselects that
probably don't speed things up.
Here is an example of the relation between DB requests, row fetches and
actual number of documents created:
<lst name="statusMessages">
<str name="Total Requests made to DataSource">3737</str>
<str name="Total Rows Fetched">5380</str>
<str name="Total Documents Skipped">0</str>
<str name="Full Dump Started">2009-07-22 18:19:06</str>
<str name="">
Indexing completed. Added/Updated: 934 documents. Deleted 0 documents.
</str>
<str name="Committed">2009-07-22 18:22:29</str>
<str name="Optimized">2009-07-22 18:22:29</str>
<str name="Time taken ">0:3:22.484</str>
</lst>
(Full index creation.)
There are about half a million data sets in total. That would require
about 30 hours for indexing? My feeling is that there are far too many
row fetches per data set.
I am testing it on a smaller machine (2 GB RAM, Windows :-( ), Tomcat 6
using around 680 MB RAM, Java 6. I haven't changed the Lucene
configuration (merge factor 10, RAM buffer size 32).
Possible solutions?
A) Write my own DataImportHandler?
B) Write my own "MultiRowTransformer" that accepts several rows as input
(not sure whether this is a valid option)?
C) Approach the DB developers to add a flat table with one data set per row?
D) ...?
If someone would like to share their experiences, that would be great!
Thanks a lot!
Chantal
--
Chantal Ackermann
--
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com