Ok, thanks for providing the information.

Regards,
Edwin

On Fri, 18 Jan 2019 at 00:33, Tim Allison <talli...@apache.org> wrote:

> Y, I tracked this down within Solr. This is a feature, not a bug. I
> found a solution (set {{captureAttr}} to {{true}}):
>
> https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263
>
> Please, though, for the sake of Solr, run Tika outside of Solr
> in production (e.g. SolrJ...see:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/)
>
> On Thu, Jan 17, 2019 at 2:15 AM Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> >
> > Based on the discussion in Tika and also on the Jira (TIKA-2814), it was
> > said that the issue could be with Solr's ExtractingRequestHandler, in
> > which the HTMLParser is either not being applied, or is somehow not
> > stripping the content of <span/> elements. The straight Tika app is able
> > to do the right thing.
> >
> > Regards,
> > Edwin
> >
> > On Tue, 15 Jan 2019 at 10:56, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > > Hi Alex,
> > >
> > > Thanks for the suggestions.
> > > Yes, I have posted it in the Tika mailing list too.
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch <arafa...@gmail.com>
> > > wrote:
> > >
> > >> I think asking this question on the Tika mailing list may give you better
> > >> answers. Then, if the conclusion is that the behavior is configurable,
> > >> you can see how to do it in Solr. It may be, however, that you need to
> > >> do the parsing outside of Solr with standalone Tika. Running standalone
> > >> Tika is the advice for production anyway.
> > >>
> > >> I would suggest the title be something like "How to prefer plain/text
> > >> part of an email message when parsing .eml files".
> > >>
> > >> Regards,
> > >> Alex.
> > >>
> > >> On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I have uploaded a sample EML file here:
> > >> >
> > >> > https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
> > >> >
> > >> > This is what is indexed in the content:
> > >> >
> > >> > "content":" font-size: 14pt; font-family: book antiqua,
> > >> > palatino, serif; Hi There, <br><br> font-size: 14pt; font-family:
> > >> > book antiqua, palatino, serif; My client owns the domain name “
> > >> > font-size: 14pt; color: #0000ff; font-family: arial black, sans-serif;
> > >> > TravelInsuranceEurope.com font-size: 14pt; font-family: book
> > >> > antiqua, palatino, serif; ” and is considering putting it in market.
> > >> > It is keyword rich domain with good search volume,adword bidding and
> > >> > type-in-traffic. <br><br> font-size: 14pt; font-family: book
> > >> > antiqua, palatino, serif; Based on our extensive study, we strongly
> > >> > feel that you should consider buying this domain name to improve the
> > >> > SEO, Online visibility, brand image, authority and type-in-traffic for
> > >> > your business. We also do provide free 1 year hosting and unlimited
> > >> > emails along with domain name. <br><br> font-size: 14pt;
> > >> > font-family: book antiqua, palatino, serif; Besides this, if you need
> > >> > any other domain name, web and app designing services and digital
> > >> > marketing services (SEO, PPC and SMO) at reasonable charges, feel free
> > >> > to contact us. <br><br> font-size: 14pt; font-family: book antiqua,
> > >> > palatino, serif; Best Regards, <br><br> font-size: 14pt;
> > >> > font-family: book antiqua, palatino, serif; Josh <br><br>",
> > >> >
> > >> > As you can see, this is taken from the Content-Type: text/html part.
> > >> > However, the Content-Type: text/plain part looks clean, and that is
> > >> > what we want to be indexed.
> > >> >
> > >> > How can we configure the Tika in Solr to change the priority to get the
> > >> > content from Content-Type: text/plain instead of Content-Type: text/html?
> > >> >
> > >> > On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I am using Solr 7.5.0 with Tika 1.18.
> > >> > >
> > >> > > Currently I am facing a situation during the indexing of EML files,
> > >> > > whereby the content is being extracted from the Content-type=text/html
> > >> > > part instead of the Content-type=text/plain part.
> > >> > >
> > >> > > The problem with Content-type=text/html is that it contains a lot of
> > >> > > words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and
> > >> > > all of these get indexed in Solr as well, which makes the content very
> > >> > > cluttered. It also affects search: when we search for words like
> > >> > > "font", all of the content gets returned because of this.
> > >> > >
> > >> > > Would like to enquire on the following:
> > >> > > 1. Why didn't Tika get the text part (text/plain)? Is there any way to
> > >> > > configure the Tika in Solr to change the priority to get the text part
> > >> > > (text/plain) instead of the html part (text/html)?
> > >> > > 2. If that is not possible, as you can see, the content is not clean,
> > >> > > which is not right. How can we get this to be clean when Tika is
> > >> > > extracting text?
> > >> > >
> > >> > > Regards,
> > >> > > Edwin
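
A side note on the advice in this thread to do the parsing outside Solr. The following is only a minimal sketch of preferring the text/plain part of a multipart EML message before anything is sent to Solr. It is not Tika and it is not the fix from TIKA-2814 (that fix was setting captureAttr to true). It uses only Python's standard `email` package; the function name `extract_plain_text` and the sample message `SAMPLE_EML` are hypothetical and made up for illustration.

```python
from email import message_from_string, policy

# Hypothetical sample message with both a text/plain and a text/html part,
# loosely modelled on the kind of EML file discussed in this thread.
SAMPLE_EML = (
    "From: sender@example.com\n"
    "To: recipient@example.com\n"
    "Subject: Domain offer\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="BOUNDARY"\n'
    "\n"
    "--BOUNDARY\n"
    'Content-Type: text/plain; charset="utf-8"\n'
    "\n"
    "Hi There,\n"
    "--BOUNDARY\n"
    'Content-Type: text/html; charset="utf-8"\n'
    "\n"
    '<span style="font-size: 14pt;">Hi There,</span><br><br>\n'
    "--BOUNDARY--\n"
)


def extract_plain_text(raw_eml: str) -> str:
    """Return the text/plain body of an EML message, falling back to text/html.

    Assumption: the caller indexes the returned string itself (for example
    through a Solr client), which is outside the scope of this sketch.
    """
    # policy.default makes the parser return an EmailMessage, which has get_body().
    msg = message_from_string(raw_eml, policy=policy.default)
    # Ask for the plain part first; only use the html part if no plain part exists.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    return body.get_content()


if __name__ == "__main__":
    print(extract_plain_text(SAMPLE_EML).strip())
```

With this approach the style attributes of the HTML part never reach the index, because the HTML part is only read when the message has no plain-text alternative.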