On Sat, Nov 13, 2010 at 4:56 PM, Ahmet Arslan <iori...@yahoo.com> wrote: > For (1) you probably need to write a custom transformer. Something like: > public Object transformRow(Map<String, Object> row) { > String language_code = row.get("language_code"); > String text = row.get("text"); > if("en".equals(language_code)) > row.put("text_en", text); > else if if("fr".equals(language_code)) > row.put("text_fr", text); > > return row; > } > > > For (2), it doable with regex transformer. > > "<field column="mailId" splitBy="," sourceColName="emailids"/> > The 'emailids' field in the table can be a comma separated value. So it ends > up giving out one or more than one email ids and we expect the 'mailId' to be > a multivalued field in Solr." [1] > > [1]http://wiki.apache.org/solr/DataImportHandler#RegexTransformer >
In my opinion, I think that this is a bit of overkill. Since the DIH supports multiple entities, with no real limit on the SQL queries, I think that the easiest (and less involved) approach would be to create three entities for the languages the OP wishes to index: <entity name="english" query="SELECT * FROM documents WHERE language_code='en'" transformer="RegexTransformer"> <field column="text_en" column="text" /> <field column="tags" column="tags" splitBy="," /> </entity> <entity name="french" query="SELECT * FROM documents WHERE language_code='fr'" transformer="RegexTransformer"> <field column="text_fr" column="text" /> <field column="tags" column="tags" splitBy="," /> </entity> <entity name="chinese" query="SELECT * FROM documents WHERE language_code='zh'" transformer="RegexTransformer"> <field column="text_zh" column="text" /> <field column="tags" column="tags" splitBy="," /> </entity> But, I admit that depending on future growth of languages, as well as other factors (i.e., needing more specific logic, etc), a programmatic approach might be warranted. I would recommend, however, that the database table be a little more normalized. Your definition for tags is quite limiting, and could be better served using a many-to-many relationship. Something like the following might serve you well: CREATE TABLE documents ( id INT NOT NULL AUTO_INCREMENT, language_code CHAR(2), tags CHAR(30), text TEXT, PRIMARY KEY (id) ); CREATE TABLE document_tags ( id INT NOT NULL AUTO_INCREMENT, tag CHAR(30), PRIMARY KEY (id) ); CREATE TABLE document_tag_lookup ( document_id INT NOT NULL, tag_id INT NOT NULL, PRIMARY KEY (document_id, tag_id) ); Then in the DIH, you simply nest a second entity to look up the zero or more tags that might be associated with your documents; take the "english" entity from above: <entity name="english" query="SELECT * FROM documents WHERE language_code='en'" transformer="RegexTransformer"> <field name="text_en" column="text" /> <entity name="english_tags" query="SELECT * FROM document_tags dt INNER JOIN document_tag_lookup dtl ON (dtl.tag_id = dt.id AND dtl.document_id='${english.id}')"> <field name="tags" column="tag" /> </entity> </entity> This would allow for growth, and is easy to maintain. Additionally, if you wanted to implement a custom transformer of your own, you could. As an aside, a sort of compromise, you could also use the ScriptTransformer [1] to create a Javascript function that can do your language logic and create the necessary fields, and not have to worry about maintaining any custom Java code. [1] http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - Ken