Re: Tika: url issue

Paul Rogers Thu, 05 Jun 2014 15:11:30 -0700

Hi

Can you not split it using Oracle's string functions (as part of your
SELECT statement)?


Something along the lines of:

SELECT .........

SUBSTR(d.doc_url, 2, INSTR(d.doc_url, '#') - 2) as Link,
                  ^----- (start at position 2 to strip the asterisk from the front;
                          use 1 and "- 1" if no asterisk is actually stored)
...........
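
If changing the query is not an option, the same split could be done in the script transformer mentioned in the original post. What follows is only a rough sketch, not the poster's actual getFilePath: it assumes the row key is 'LINK' (matching ${db_link.LINK}), that the row behaves like a java.util.Map with get/put, and that the function is declared in a <script><![CDATA[ ... ]]></script> element under <dataConfig>.

```javascript
// Hypothetical getFilePath for the DIH script transformer.
// Assumption: 'LINK' is the key under which the row holds d.doc_url.
function getFilePath(row) {
    var raw = row.get('LINK');
    if (raw != null) {
        // Keep only the text before the first '#'.
        var path = String(raw).split('#')[0];
        // Drop a leading asterisk, in case one really is stored in the column.
        if (path.charAt(0) === '*') {
            path = path.substring(1);
        }
        // Write the cleaned value back under the same key so that
        // ${db_link.LINK} resolves to the plain file path.
        row.put('LINK', path);
    }
    return row;
}
```

The value is written back under the same key that the inner Tika entity reads. If the transformer stores the result under a different key (for example a lowercase 'link'), ${db_link.LINK} would still resolve to the unmodified database value.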

Regards

P

On 4 June 2014 06:46, harshrossi <harshro...@gmail.com> wrote:

> Hi,
>
> I am working on Solr, using DataImportHandler to index rich documents
> such as PDF, Word, and image files.
> I am using TikaEntityProcessor to extract content from the files.
>
> I have one small issue with setting the value of the 'url' attribute.
>
> My data-config.xml file is like so:
>
> <dataConfig>
>     <dataSource name="db_ds" type="JdbcDataSource"
>     driver="oracle.jdbc.OracleDriver"
>     url="jdbc:oracle:thin:@KOR308051.bmh.apac.bosch.com:1521:xe"
>     user="ezbdb"
>     password="ezbdb"/>
>
>     <dataSource name="tk_ds" type="BinFileDataSource" />
>
>
>
>     <document name="db_doc">
>         <entity name="db_link"
>                 query="SELECT
>                            d.doc_url as Link,
>                            d.doc_name as Name,
>                            cast(trunc(d.last_modified) as date) as Last_modified
>                        FROM doc_data d"
>                 dataSource="db_ds"
>                 transformer="DateFormatTransformer,script:getFilePath">
>             <field column="LINK" name="link"/>
>             <field column="NAME" name="name"/>
>             <field column="LAST_MODIFIED" name="last_modified"
>                    xpath="/RDF/item/date"
>                    dateTimeFormat="yyyy-MM-dd HH:mm:ss"/>
>
>             <entity name="tika-doc" dataSource="tk_ds"
>                     processor="TikaEntityProcessor"
>                     url="${db_link.LINK}" format="text"
>                     onError="skip">
>                 <field column="text" name="content"/>
>             </entity>
>
>         </entity>
>     </document>
> </dataConfig>
>
> The problem is that the file path is stored in an unusual pattern in the
> database. "doc_url" is the db field that stores the URL or file path, and
> the path is stored like this:
>              *D:\Games\CS2\setup.doc#D:\Games\CS2\setup.doc#*
> i.e. the path is stored twice, separated by a '#'. I am not sure why it is
> done this way; it was done by our client.
>
> All I need is the single file path, i.e. D:\Games\CS2\setup.doc
> I am passing the url value to Tika as *url="${db_link.LINK}"*
> but *${db_link.LINK}* contains the path exactly as it comes from the
> database. I have tried using a script transformer, splitting the path
> string on '#' and taking the first part in the method *getFilePath(row)*,
> but no luck.
>
> I am still getting the path as stored in the db. This gives a *FileNotFound*
> exception while trying to index it, which is expected because the path is
> incorrect.
>
> What can be done to get only the path, leaving out the rest after the '#'?
>
> Help would be much appreciated :)
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tika-url-issue-tp4139781.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
