Re: DIH for multilingual index & multiValued field?

Ken Stanley Sat, 13 Nov 2010 15:00:19 -0800

On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan <iori...@yahoo.com> wrote:
> For (1) you probably need to write a custom transformer. Something like:
> public Object transformRow(Map<String, Object> row)     {
> String language_code = row.get("language_code");
> String text = row.get("text");
> if("en".equals(language_code))
>       row.put("text_en", text);
> else if if("fr".equals(language_code))
>       row.put("text_fr", text);
>
> return row;
> }
>
>
> For (2), it doable with regex transformer.
>
> "<field column="mailId" splitBy="," sourceColName="emailids"/>
> The 'emailids' field in the table can be a comma separated value. So it ends 
> up giving out one or more than one email ids and we expect the 'mailId' to be 
> a multivalued field in Solr." [1]
>
> [1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer
>


In my opinion, I think that this is a bit of overkill. Since the DIH
supports multiple entities, with no real limit on the SQL queries, I
think that the easiest (and less involved) approach would be to create
three entities for the languages the OP wishes to index:

<entity name="english" query="SELECT * FROM documents WHERE
language_code='en'" transformer="RegexTransformer">
    <field column="text_en" column="text" />
    <field column="tags" column="tags" splitBy="," />
</entity>

<entity name="french" query="SELECT * FROM documents WHERE
language_code='fr'" transformer="RegexTransformer">
    <field column="text_fr" column="text" />
    <field column="tags" column="tags" splitBy="," />
</entity>

<entity name="chinese" query="SELECT * FROM documents WHERE
language_code='zh'" transformer="RegexTransformer">
    <field column="text_zh" column="text" />
    <field column="tags" column="tags" splitBy="," />
</entity>

But, I admit that depending on future growth of languages, as well as
other factors (i.e., needing more specific logic, etc), a programmatic
approach might be warranted.

I would recommend, however, that the database table be a little more
normalized. Your definition for tags is quite limiting, and could be
better served using a many-to-many relationship. Something like the
following might serve you well:

   CREATE TABLE documents (
       id INT NOT NULL AUTO_INCREMENT,
       language_code CHAR(2),
       tags CHAR(30),
       text TEXT,
       PRIMARY KEY (id)
   );

   CREATE TABLE document_tags (
       id INT NOT NULL AUTO_INCREMENT,
       tag CHAR(30),
       PRIMARY KEY (id)
   );

   CREATE TABLE document_tag_lookup (
       document_id INT NOT NULL,
       tag_id INT NOT NULL,
       PRIMARY KEY (document_id, tag_id)
   );

Then in the DIH, you simply nest a second entity to look up the zero
or more tags that might be associated with your documents; take the
"english" entity from above:

<entity name="english" query="SELECT * FROM documents WHERE
language_code='en'" transformer="RegexTransformer">
    <field name="text_en" column="text" />

    <entity name="english_tags" query="SELECT * FROM document_tags dt
INNER JOIN document_tag_lookup dtl ON (dtl.tag_id = dt.id AND
dtl.document_id='${english.id}')">
        <field name="tags" column="tag" />
    </entity>
</entity>

This would allow for growth, and is easy to maintain. Additionally, if
you wanted to implement a custom transformer of your own, you could.
As an aside, a sort of compromise, you could also use the
ScriptTransformer [1] to create a Javascript function that can do your
language logic and create the necessary fields, and not have to worry
about maintaining any custom Java code.

[1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer

- Ken

Re: DIH for multilingual index & multiValued field?

Reply via email to