I am pulling a large amount of data from a local source, D:\foo\resource\. I am using Tika through the DataImportHandler (DIH) to index multiple file formats, with both text and metadata. I am getting almost all the information I want, but I am having a few issues:
1. I need to run a regex replace that turns D:\foo\resource\ into http://, which is part of what I want to use XPath for. I have the regex written, but not the replacement, and I am not sure where it needs to go in my data-config.xml file.
2. I also want to strip HTML where necessary, again using XPath.
3. I need to remove \n, \t, \r, and any other extra junk I am getting in the text field, so that I am left with just the text content of the document, whatever its MIME type, and it can be searched.

I am running the import through the Solr admin data import rather than post.jar (I have tried both). This is running on Windows and cannot be run on Linux, as we have no one who can support it.

I am posting my tika-data-config.xml (not tikaconfig); I named it this way so it would not be confused with the db-config for our catalog pull.

Thanks in advance for any help. I will upload any additional files that might be helpful upon request - I don't want to overload the post.

tika-data-config-2.xml <http://lucene.472066.n3.nabble.com/file/t494707/tika-data-config-2.xml>

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
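For illustration only, the three clean-ups described in the post (prefix rewrite, HTML stripping, whitespace removal) can be sketched in plain Python. This is not DIH configuration and not the poster's actual regex. The function names are hypothetical, flipping the remaining backslashes to forward slashes is an assumption, and because the post gives only "http://" with no host, the replacement is left as the bare scheme. The tag-stripping pattern is deliberately crude and would not cope with every HTML document.

```python
import html
import re

# Local prefix from the post, anchored at the start of the path.
LOCAL_PREFIX = re.compile(r"^D:\\foo\\resource\\")

# Crude tag matcher (assumption): good enough for simple markup only.
TAG = re.compile(r"<[^>]+>")


def path_to_url(path: str) -> str:
    """Replace the local prefix with http:// (host not given in the post)."""
    match = LOCAL_PREFIX.match(path)
    if match is None:
        return path  # paths outside the resource folder are left untouched
    remainder = path[match.end():]
    # Assumption: the rest of the Windows path should use URL-style slashes.
    return "http://" + remainder.replace("\\", "/")


def strip_html(text: str) -> str:
    """Drop tags, then decode entities such as &amp;."""
    return html.unescape(TAG.sub(" ", text))


def collapse_whitespace(text: str) -> str:
    """Turn runs of \\n, \\t, \\r and spaces into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def clean_text(text: str) -> str:
    """Apply HTML stripping followed by whitespace clean-up."""
    return collapse_whitespace(strip_html(text))
```

A quick check of the intended behaviour: `path_to_url(r"D:\foo\resource\docs\a.pdf")` yields a string beginning with `http://`, and `clean_text` reduces a fragment containing tags, tabs, and line breaks to space-separated words.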