Unsuccessful queries for terms next to tabs and newlines in uploaded Word documents

chtjfi Mon, 31 Mar 2014 00:24:07 -0700

Short Version: What do I need to do to successfully query for terms that are
adjacent to tabs and newlines (i.e. \t, \n) in an uploaded Word document?

Long Version:

I am using Solr 4.6.1. I am running an unmodified version of the example
core that is started by running java -jar start.jar in the example
directory. The schema.xml in use is example/solr/collection1/conf/schema.xml
and is unmodified (it is the one downloaded with the distribution), so I
won't post it unless someone says it is helpful.

After uploading a Word document to Solr with a request to
http://localhost:8983/solr/update/extract?literal.id=yabba&uprefix=attr_&fmap.content=attr_content&commit=true
the attr_content field contains hundreds of tab and newline characters
(i.e. \t and \n). When a string occurs only once in the document and is
adjacent to one of these characters, queries for that string return no
results.
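
The upload request above could be reproduced from a script roughly as follows. This is only a sketch: the helper names build_extract_url and upload_word_document are made up for illustration, and the Content-Type value is an assumption for a .doc file. The URL parameters are the ones from the request above.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SOLR_BASE = "http://localhost:8983/solr"  # host and port from the request above


def build_extract_url(doc_id, base=SOLR_BASE):
    """Return the /update/extract URL with the parameters used in the post."""
    params = [
        ("literal.id", doc_id),            # unique key for the document
        ("uprefix", "attr_"),              # prefix for fields not in the schema
        ("fmap.content", "attr_content"),  # map the extracted body to attr_content
        ("commit", "true"),
    ]
    return base + "/update/extract?" + urlencode(params)


def upload_word_document(path, doc_id):
    """POST the raw file to Solr. Needs a running Solr instance."""
    with open(path, "rb") as handle:
        body = handle.read()
    request = Request(
        build_extract_url(doc_id),
        data=body,
        # Assumption: a legacy .doc file; adjust for other formats.
        headers={"Content-Type": "application/msword"},
    )
    with urlopen(request) as response:
        return response.status


print(build_extract_url("yabba"))
```

Only build_extract_url runs without a server; upload_word_document is included to show where the file body would go.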

A specific example is an uploaded Word document that after upload contains
"Vorname:\t\t\tYasmin" in the attr_content field. The original document
contained "Vorname:", then two tab characters, then "Yasmin" (the string
"\t" does not appear in the document). The string "Yasmin" appears only in
that location in the document.
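
To make the distinction between a real tab character and the two-character text "\t" concrete, here is a small Python sketch. It is not part of Solr; the function names are made up for illustration. It relies only on the fact that JSON output shows a real tab as the escape \t, so decoding the JSON string tells you which of the two the field holds.

```python
import json


def describe_whitespace(text):
    """Count real tab/newline characters and literal backslash sequences."""
    return {
        "real_tabs": text.count("\t"),
        "real_newlines": text.count("\n"),
        "literal_backslash_t": text.count("\\t"),
        "literal_backslash_n": text.count("\\n"),
    }


def decode_json_string(raw):
    """Decode one JSON string literal (quotes included) as it appears in a response."""
    return json.loads(raw)


# The field value as it would be written inside a JSON response.
raw_from_response = r'"Vorname:\t\t\tYasmin"'
decoded = decode_json_string(raw_from_response)

print(describe_whitespace(decoded))
# A value holding backslash + 't' as plain text, for comparison.
print(describe_whitespace(r"Vorname:\t\t\tYasmin"))
```

If the decoded value reports real tabs and no literal backslash sequences, the field contains tab characters, as described above.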

When I query for "Yasmin" with the query
http://127.0.0.1:8983/solr/collection1/select?q=Yasmin&wt=json&indent=true I
get no results. Queries for terms that are not next to a \t or a \n are
successful.
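
The query above can be built the same way. In this sketch the helper build_select_url and its optional field argument are hypothetical; the field argument is there only so that a query against attr_content by name can be compared with the default-field query. Whether that changes the result is not something established here.

```python
from urllib.parse import urlencode

SELECT_BASE = "http://127.0.0.1:8983/solr/collection1"  # from the query above


def build_select_url(term, field=None, base=SELECT_BASE):
    """Return a /select URL; with field set, the query becomes field:term."""
    query = term if field is None else field + ":" + term
    params = [("q", query), ("wt", "json"), ("indent", "true")]
    return base + "/select?" + urlencode(params)


# The query from the post (searches the default field).
print(build_select_url("Yasmin"))
# Assumption for comparison: the same term against attr_content explicitly.
print(build_select_url("Yasmin", field="attr_content"))
```

The colon in the second query is percent-encoded as %3A by urlencode.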

What can I do so that a query for a term next to a tab or newline will be
successful? Must I change the way the document is uploaded? Or change the
way the search is performed?

--
View this message in context:
http://lucene.472066.n3.nabble.com/Unsuccessful-queries-for-terms-next-to-tabs-and-newlines-in-uploaded-Word-documents-tp4128090.html
Sent from the Solr - User mailing list archive at Nabble.com.
