
Zheng Lin Edwin Yeo Sat, 19 Jan 2019 06:25:23 -0800

Ok, thanks for providing the information.

Regards,
Edwin


On Fri, 18 Jan 2019 at 00:33, Tim Allison <talli...@apache.org> wrote:

> Y, I tracked this down within Solr.  This is a feature, not a bug.  I
> found a solution (set {{captureAttr}} to {{true}}):
>
> https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263
>
> Please, though, for the sake of Solr, run Tika outside of Solr in
> production (e.g. with SolrJ; see:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/)
>
> On Thu, Jan 17, 2019 at 2:15 AM Zheng Lin Edwin Yeo
> <edwinye...@gmail.com> wrote:
> >
> > Based on the discussion on the Tika list and on the Jira issue (TIKA-2814),
> > the problem appears to be with Solr's ExtractingRequestHandler, in which
> > the HTMLParser is either not being applied or is somehow not stripping the
> > content of <span/> elements. The standalone Tika app does the right thing.
> >
> > Regards,
> > Edwin
> >
> > On Tue, 15 Jan 2019 at 10:56, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > wrote:
> >
> > > Hi Alex,
> > >
> > > Thanks for the suggestions.
> > > Yes, I have posted it in the Tika mailing list too.
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch <arafa...@gmail.com>
> > > wrote:
> > >
> > >> I think asking this question on the Tika mailing list may give you
> > >> better answers. Then, if the conclusion is that the behavior is
> > >> configurable, you can see how to do it in Solr. It may be, however,
> > >> that you need to do the parsing outside of Solr with standalone Tika.
> > >> Running Tika standalone is the recommended setup for production anyway.
> > >>
> > >> I would suggest the title be something like "How to prefer the
> > >> text/plain part of an email message when parsing .eml files".
> > >>
> > >> Regards,
> > >>   Alex.
> > >>
> > >> On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I have uploaded a sample EML file here:
> > >> >
> > >> > https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
> > >> >
> > >> > This is what is indexed in the content:
> > >> >
> > >> >         "content":"  font-size: 14pt; font-family: book antiqua,
> > >> > palatino, serif;  Hi There,   <br><br> font-size: 14pt; font-family:
> > >> > book antiqua, palatino, serif;  My client owns the domain name “
> > >> > font-size: 14pt; color: #0000ff; font-family: arial black, sans-serif;
> > >> >  TravelInsuranceEurope.com   font-size: 14pt; font-family: book
> > >> > antiqua, palatino, serif;  ” and is considering putting it in market.
> > >> > It is keyword rich domain with good search volume,adword bidding and
> > >> > type-in-traffic.   <br><br> font-size: 14pt; font-family: book
> > >> > antiqua, palatino, serif;  Based on our extensive study, we strongly
> > >> > feel that you should consider buying this domain name to improve the
> > >> > SEO, Online visibility, brand image, authority and type-in-traffic for
> > >> > your business. We also do provide free 1 year hosting and unlimited
> > >> > emails along with domain name.   <br><br> font-size: 14pt;
> > >> > font-family: book antiqua, palatino, serif;  Besides this, if you need
> > >> > any other domain name, web and app designing services and digital
> > >> > marketing services (SEO, PPC and SMO) at reasonable charges, feel free
> > >> > to contact us.   <br><br> font-size: 14pt; font-family: book antiqua,
> > >> > palatino, serif;  Best Regards,   <br><br> font-size: 14pt;
> > >> > font-family: book antiqua, palatino, serif;  Josh   <br><br>",
> > >> >
> > >> >
> > >> > As you can see, this is taken from the Content-Type: text/html part.
> > >> > However, the Content-Type: text/plain part looks clean, and that is
> > >> > what we want to be indexed.
> > >> >
> > >> > How can we configure Tika in Solr to change the priority so that it
> > >> > takes the content from Content-Type: text/plain instead of
> > >> > Content-Type: text/html?
> > >> >
> > >> > On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I am using Solr 7.5.0 with Tika 1.18.
> > >> > >
> > >> > > Currently I am facing a situation during the indexing of EML files,
> > >> > > whereby the content is being extracted from the
> > >> > > Content-type=text/html instead of Content-type=text/plain.
> > >> > >
> > >> > > The problem with Content-type=text/html is that it contains a lot of
> > >> > > words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and
> > >> > > all of these get indexed in Solr as well. This makes the content very
> > >> > > cluttered, and it also affects search: when we search for words like
> > >> > > "font", all the content gets returned because of this.
> > >> > >
> > >> > > I would like to ask the following:
> > >> > > 1. Why didn't Tika take the text part (text/plain)? Is there any way
> > >> > > to configure Tika in Solr to change the priority so that it takes the
> > >> > > text part (text/plain) instead of the HTML part (text/html)?
> > >> > > 2. If that is not possible, as you can see, the content is not clean,
> > >> > > which is not right. How can we get clean content when Tika is
> > >> > > extracting text?
> > >> > >
> > >> > > Regards,
> > >> > > Edwin
> > >> > >
> > >>
> > >
>
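
For readers who follow the advice in this thread to do the extraction outside of Solr, the "prefer text/plain over text/html" behaviour asked about above can be sketched as a pre-processing step. This is only an illustration under stated assumptions: it uses Python's standard-library `email` package rather than Tika, the function name `preferred_text` and the sample message are hypothetical, and it does not change how Solr's ExtractingRequestHandler behaves.

```python
from email import message_from_bytes, policy


def preferred_text(raw_eml: bytes) -> str:
    """Return the text/plain body of an EML message, else the text/html body.

    Hypothetical helper for illustration; not part of Tika or Solr.
    """
    # policy.default makes the parser return an EmailMessage, which has get_body().
    msg = message_from_bytes(raw_eml, policy=policy.default)
    # get_body() picks the best candidate body part in the given preference
    # order, so the plain-text alternative wins when both are present.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return ""
    return body.get_content()


# Hypothetical sample: a multipart/alternative message with both parts.
SAMPLE_EML = b"\r\n".join([
    b"From: sender@example.com",
    b"To: recipient@example.com",
    b"Subject: Sample",
    b"MIME-Version: 1.0",
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"',
    b"",
    b"--BOUNDARY",
    b"Content-Type: text/plain; charset=utf-8",
    b"",
    b"Hi There,",
    b"--BOUNDARY",
    b"Content-Type: text/html; charset=utf-8",
    b"",
    b'<span style="font-size: 14pt;">Hi There,</span>',
    b"--BOUNDARY--",
    b"",
])

if __name__ == "__main__":
    print(preferred_text(SAMPLE_EML).strip())
```

The returned string could then be sent to Solr as an ordinary field value by whatever indexing client is in use (the thread suggests SolrJ). If only an HTML part exists, the sketch returns the raw HTML, which would still need cleaning.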

Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain
