Hi,

Can you not split it using Oracle's string functions (as part of your select statement)?
Something along the lines of:

SELECT
    .........
    SUBSTR(d.doc_url, 2, INSTR(d.doc_url, '#') - 2) as Link,
           ^----- (starts at position 2 to strip the asterisk from the front;
                   use SUBSTR(d.doc_url, 1, INSTR(d.doc_url, '#') - 1) if the
                   stored value has no leading asterisk)
    ...........

Oracle has no LEFT() or RIGHT(), so SUBSTR and INSTR do the work here.

Regards
P

On 4 June 2014 06:46, harshrossi <harshro...@gmail.com> wrote:

> Hi,
>
> I am working on Solr using DataImportHandler for indexing rich documents
> like pdf, word, image etc.
> I am using TikaEntityProcessor for extracting contents from the files.
>
> I have one small issue regarding setting the value of the 'url' entry.
>
> My data-config.xml file is like so:
>
> <dataConfig>
>   <dataSource name="db_ds" type="JdbcDataSource"
>               driver="oracle.jdbc.OracleDriver"
>               url="jdbc:oracle:thin:@KOR308051.bmh.apac.bosch.com:1521:xe"
>               user="ezbdb"
>               password="ezbdb"/>
>
>   <dataSource name="tk_ds" type="BinFileDataSource" />
>
>   <document name="db_doc">
>     <entity name="db_link"
>             query="SELECT
>                      d.doc_url as Link,
>                      d.doc_name as Name,
>                      cast(trunc(d.last_modified) as date) as Last_modified
>                    FROM doc_data d"
>             dataSource="db_ds"
>             transformer="DateFormatTransformer,script:getFilePath">
>       <field column="LINK" name="link"/>
>       <field column="NAME" name="name"/>
>       <field column="LAST_MODIFIED" name="last_modified"
>              xpath="/RDF/item/date"
>              dateTimeFormat="yyyy-MM-dd HH:mm:ss"/>
>
>       <entity name="tika-doc" dataSource="tk_ds"
>               processor="TikaEntityProcessor"
>               url="${db_link.LINK}" format="text"
>               onError="skip">
>         <field column="text" name="content"/>
>       </entity>
>
>     </entity>
>   </document>
> </dataConfig>
>
> The thing is, the file path is stored in a different pattern in the
> database:
> "doc_url" is the field in db which stores the url or file path. The file
> path is stored in this way:
> *D:\Games\CS2\setup.doc#D:\Games\CS2\setup.doc#*
> i.e. the path is stored twice, separated by a '#'. I am not sure why it is
> done. It has been done by our client.
>
> All I need is only the one file path, i.e.
> D:\Games\CS2\setup.doc
> I am passing the url value to tika as *url="${db_link.LINK}"*
> But the *${db_link.LINK}* contains the path coming from the database
> directly.
> I have tried using a script transformer, splitting the path string into
> parts by '#' and taking the first path using the method
> *getFilePath(row)*, but no luck.
>
> I am still getting the path as stored in db. This gives a *FileNotFound*
> exception while trying to index it, and that is obvious because the path
> is incorrect.
>
> What can be done to get only the path, leaving out the rest of the path
> having # and all?
>
> Help would be much appreciated :)
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tika-url-issue-tp4139781.html
> Sent from the Solr - User mailing list archive at Nabble.com.
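
P.S. For the script transformer route described in the quoted message, the split on '#' can be sketched in JavaScript as below. This is only a sketch under assumptions: the original getFilePath was not posted, so the body here is a guess at what it needs to do; firstPath is a hypothetical helper name; and LINK is assumed to be the row key because that is the column name used in the field mapping. The function would still have to be declared in a script block in data-config.xml, and whether the nested Tika entity then sees the changed value is not verified here.

```javascript
// Hypothetical helper: return the part of the stored value before the first '#'.
// If the value contains no '#', it is returned unchanged.
function firstPath(value) {
    // String(...) turns a java.lang.String (as passed in by DIH) into a JS string
    return String(value).split('#')[0];
}

// Script transformer function in the shape DIH expects: it takes the row
// (a java.util.Map-like object with get/put) and returns it.
// ASSUMPTION: the path column is available under the key 'LINK'.
function getFilePath(row) {
    var raw = row.get('LINK');
    if (raw !== null && raw !== undefined) {
        row.put('LINK', firstPath(raw));
    }
    return row;
}
```

If the stored values really do carry a leading asterisk, that character would need to be removed as well; the sketch leaves the text before the first '#' untouched.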