There’s a space between “l” and “oad” in your second doc, or perhaps markup is splitting the word. If you do what I mentioned and use the /terms endpoint to examine what’s actually in your index, I’m pretty sure you’ll see “l” and “oad” as separate terms, so not finding “load” is perfectly understandable.
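For reference, here is a minimal sketch of checking the index through the terms handler from Python. It assumes a Solr instance with the /terms handler available and the default flat JSON layout for term counts. The base URL, the collection name, and the helper names (build_terms_url, parse_terms, indexed_terms) are placeholders of mine, not anything from your setup; _text_ is the field shown in your debug output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_terms_url(base_url, collection, field, prefix, limit=20):
    """Build a terms-handler URL listing indexed terms that start with `prefix`."""
    params = urlencode({
        "terms.fl": field,        # field whose indexed terms we want to see
        "terms.prefix": prefix,   # only return terms starting with this prefix
        "terms.limit": limit,
        "wt": "json",
    })
    return f"{base_url}/{collection}/terms?{params}"


def parse_terms(response, field):
    """Turn the flat [term, count, term, count, ...] list into a dict.

    Assumes the default flat JSON layout; adjust if json.nl is set differently.
    """
    flat = response.get("terms", {}).get(field, [])
    return dict(zip(flat[0::2], flat[1::2]))


def indexed_terms(base_url, collection, field, prefix):
    """Query a running Solr (needs network access) and return {term: doc count}."""
    with urlopen(build_terms_url(base_url, collection, field, prefix)) as resp:
        return parse_terms(json.load(resp), field)
```

Calling indexed_terms("http://localhost:8983/solr", "mycollection", "_text_", "oad") and again with the prefix "load" would show whether the index holds “oad” as its own term. If “oad” shows up and “load” does not for that document, the split happened before the text ever reached Solr.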
What’s happening is that whatever turns the doc into your XML format breaks the text up like this. I’ve seen this happen with other markups. In other words, this has nothing to do with Solr and everything to do with whatever extracts the text from the original document. If you’re using ExtractingRequestHandler to process this, you’re getting the defaults that Tika uses, which can be tweaked if you run Tika outside Solr; see the Tika website.

And you’ll never get this 100% right. Every document format does weird things, and docs produced by one version don’t necessarily match another version even in the same format (say PDF). Extracting the plain text correctly for every version of every format is nearly impossible unless you handle them one by one.

Best,
Erick

> On Jul 23, 2020, at 2:48 AM, Khare, Kushal (MIND)
> <kushal.kh...@mind-infotech.com> wrote:
>
> I did this debug query thing and everything seems good but still am unable to
> get the desired doc in my result.
>
> "debug":{
>   "rawquerystring":"load",
>   "querystring":"load",
>   "parsedquery":"_text_:load",
>   "parsedquery_toString":"_text_:load",
>
> Actually, CASE 2 in my previous mail is the same text: "Doing load test
> for Solr" but the diff I forgot to mention was here the text is formatted to
> BOLD & Text color is RED.
> In case 1, it was simple text.
> What I observed is while parsing, if I print the textHandler String...I
> get this
>
> [Content_Types].xml
>
> _rels/.rels
>
> word/document.xml
> Thi s docum ent is being used for the QDMS l
> oad testing .
>
>
> So, I don't know what goes wrong when I have same text but formatted.
> Please help me with this as it is critical and needs to be delivered very
> soon.
>
> Thanks !
> ________________________________
> From: Erick Erickson <erickerick...@gmail.com>
> Sent: Thursday, July 23, 2020 1:49 AM
> To: solr-user@lucene.apache.org <solr-user@lucene.apache.org>
> Subject: Re: Can't search formatted text in solr
>
> ---- This email originated from an external source i.e. outside of the
> organization. Please do not click on links or open any attachment unless you
> recognize the sender and know the content is safe ----
>
> There’s not much info to go on here. Try attaching &debug=query to the
> queries and see if the parsed query returned is what you expect. If it is,
> the next thing I’d do is attach
> &debug=true&explainOther=id:id_of_doc_that_isnt_showing_up
>
> This last will show you how scoring was done whether or not the doc is
> returned in the result set.
>
> Finally, you can use the admin UI to look at the actual tokens indexed.
>
> My bet is that your doc format isn’t being analyzed properly, perhaps due to
> markup, and the second case doesn’t get indexed the way you think it should.
> You can use the terms handler to examine exactly what’s in the index.
>
> Best,
> Erick
>
>> On Jul 22, 2020, at 12:42 PM, Khare, Kushal (MIND)
>> <kushal.kh...@mind-infotech.com> wrote:
>>
>> Hello guys,
>> I have been using solr for my java application to carry out content search
>> from the saved docs.
>> I am facing a problem in searching for a word - 'load'
>> There are 2 cases, in 1st search is working good but in second case with the
>> same doc and same query - 'load' am not getting the result
>>
>> CASE 1 :
>>
>> "Doing load test for Solr" - Simple text in doc format.
>> Works fine
>>
>> CASE 2 :
>>
>> "Doing load test for Solr" - Simple text in doc format.
>> In this case, the solr search fails. I don't get the result when I search
>> for the term load.
>>
>>
>> Please help me with this as am unable to get any help with this
>>
>>
>> Thanks !
>> Regards,
>> Kushal Khare
>>