Hi,

I am working on Solr, using DataImportHandler to index rich documents
such as PDF, Word and image files.
I am using TikaEntityProcessor to extract the content from the files.

I have one small issue with setting the value of the 'url' attribute.

My data-config.xml file is like so:

<dataConfig>
    <dataSource name="db_ds" type="JdbcDataSource"
    driver="oracle.jdbc.OracleDriver"
    url="jdbc:oracle:thin:@KOR308051.bmh.apac.bosch.com:1521:xe"
    user="ezbdb"
    password="ezbdb"/>
        
    <dataSource name="tk_ds" type="BinFileDataSource" />

     
        
    <document name="db_doc">
                <entity name="db_link"
                                query="SELECT 
                                                d.doc_url as Link,
                                                d.doc_name as Name,
                                                cast(trunc(d.last_modified) as date) as Last_modified
                                                FROM doc_data d"
                                dataSource="db_ds"
                                transformer="DateFormatTransformer,script:getFilePath">
                        <field column="LINK" name="link"/>
                        <field column="NAME" name="name"/>
                        <field column="LAST_MODIFIED" name="last_modified"
                               xpath="/RDF/item/date"
                               dateTimeFormat="yyyy-MM-dd HH:mm:ss"/>
        
                        <entity name="tika-doc" dataSource="tk_ds"
                                processor="TikaEntityProcessor"
                                url="${db_link.LINK}" format="text"
                                onError="skip">
                             <field column="text" name="content"/>
                       </entity>
                        
               </entity>
    </document>
</dataConfig>

The problem is that the file path is stored in an unusual pattern in the
database. "doc_url" is the column that stores the URL or file path, and the
value is stored like this:
             D:\Games\CS2\setup.doc#D:\Games\CS2\setup.doc#
i.e. the path is stored twice, separated by a '#'. I am not sure why it is
done this way; it was set up by our client.

All I need is the single file path, i.e. D:\Games\CS2\setup.doc
I am passing the url value to Tika as url="${db_link.LINK}"
but ${db_link.LINK} contains the path exactly as it comes from the database.
I have tried using a script transformer that splits the path string on '#'
and takes the first part, using the function getFilePath(row), but no
luck.
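
For reference, a minimal sketch of the kind of function described above (not
necessarily the exact one that was tried). It assumes the row exposes the
path under the key 'LINK', matching ${db_link.LINK} in the config, and that
overwriting that key is acceptable. Whether the nested Tika entity then picks
up the cleaned value is not confirmed here.

```javascript
// Sketch of a DIH ScriptTransformer function.
// Assumption: the parent row stores the raw path under the key 'LINK'.
function getFilePath(row) {
    var raw = row.get('LINK');
    if (raw === null || raw === undefined) {
        return row; // nothing to clean up
    }
    // Force a JavaScript string, since row.get() may hand back a Java object.
    var parts = String(raw).split('#');
    // The first segment is the path; the rest is the duplicate copy
    // and the empty piece after the trailing '#'.
    row.put('LINK', parts[0]);
    return row;
}
```

In a DIH config, a function like this normally sits inside a
<script><![CDATA[ ... ]]></script> element directly under <dataConfig>, which
is where transformer="script:getFilePath" looks it up.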

I am still getting the path exactly as it is stored in the db. This causes a
FileNotFound exception when trying to index the file, which is expected
because the path is incorrect.

How can I get only the first path and drop the rest of the value,
including the '#' characters?

Help would be much appreciated :)







--
View this message in context: 
http://lucene.472066.n3.nabble.com/Tika-url-issue-tp4139781.html
Sent from the Solr - User mailing list archive at Nabble.com.
