Re: Content from EML files indexing from text/html (which is not clean) instead of text/plain

2019-01-19 Thread Zheng Lin Edwin Yeo
Ok, thanks for providing the information.

Regards,
Edwin

On Fri, 18 Jan 2019 at 00:33, Tim Allison wrote:

> Y, I tracked this down within Solr.  This is a feature, not a bug.  I
> found a solution (set {{captureAttr}} to {{true}}):
>
> https://issues.apache.org/jira/browse/TIKA-2814?focusedCommentId=16745263&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16745263
>
> Please, though, for the sake of Solr, run Tika outside of Solr
> in production (e.g. SolrJ...see:
> https://lucidworks.com/2012/02/14/indexing-with-solrj/)
>
> On Thu, Jan 17, 2019 at 2:15 AM Zheng Lin Edwin Yeo
>  wrote:
> >
> > Based on the discussion on the Tika list and on the Jira issue (TIKA-2814),
> > it was suggested that the issue could be in Solr's ExtractingRequestHandler,
> > in which the HTMLParser is either not being applied, or is somehow not
> > stripping the content of  elements. The standalone Tika app is able to do
> > the right thing.
> >
> > Regards,
> > Edwin
> >
> > On Tue, 15 Jan 2019 at 10:56, Zheng Lin Edwin Yeo 
> > wrote:
> >
> > > Hi Alex,
> > >
> > > Thanks for the suggestions.
> > > Yes, I have posted it in the Tika mailing list too.
> > >
> > > Regards,
> > > Edwin
> > >
> > > On Mon, 14 Jan 2019 at 21:16, Alexandre Rafalovitch <arafa...@gmail.com>
> > > wrote:
> > >
> > >> I think asking this question on the Tika mailing list may give you better
> > >> answers. Then, if the conclusion is that the behavior is configurable,
> > >> you can see how to do it in Solr. It may be, however, that you need to
> > >> do the parsing outside of Solr with standalone Tika. Running Tika
> > >> standalone is the recommended practice for production anyway.
> > >>
> > >> I would suggest the title be something like "How to prefer the text/plain
> > >> part of an email message when parsing .eml files".
> > >>
> > >> Regards,
> > >>   Alex.
> > >>
> > >> On Mon, 14 Jan 2019 at 00:20, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > I have uploaded a sample EML file here:
> > >> >
> > >> > https://drive.google.com/file/d/1z1gujv4SiacFeganLkdb0DhfZsNeGD2a/view?usp=sharing
> > >> >
> > >> > This is what is indexed in the content:
> > >> >
> > >> > "content":"  font-size: 14pt; font-family: book antiqua,
> > >> > palatino, serif;  Hi There,font-size: 14pt; font-family:
> > >> > book antiqua, palatino, serif;  My client owns the domain name “
> > >> > font-size: 14pt; color: #ff; font-family: arial black, sans-serif;
> > >> >  TravelInsuranceEurope.com   font-size: 14pt; font-family: book
> > >> > antiqua, palatino, serif;  ” and is considering putting it in market.
> > >> > It is keyword rich domain with good search volume,adword bidding and
> > >> > type-in-traffic.font-size: 14pt; font-family: book
> > >> > antiqua, palatino, serif;  Based on our extensive study, we strongly
> > >> > feel that you should consider buying this domain name to improve the
> > >> > SEO, Online visibility, brand image, authority and type-in-traffic for
> > >> > your business. We also do provide free 1 year hosting and unlimited
> > >> > emails along with domain name.font-size: 14pt;
> > >> > font-family: book antiqua, palatino, serif;  Besides this, if you need
> > >> > any other domain name, web and app designing services and digital
> > >> > marketing services (SEO, PPC and SMO) at reasonable charges, feel free
> > >> > to contact us.font-size: 14pt; font-family: book antiqua,
> > >> > palatino, serif;  Best Regards,font-size: 14pt;
> > >> > font-family: book antiqua, palatino, serif;  Josh   ",
> > >> >
> > >> >
> > >> > As you can see, this is taken from the Content-Type: text/html part.
> > >> > However, the Content-Type: text/plain part looks clean, and that is what
> > >> > we want to be indexed.
> > >> >
> > >> > How can we configure Tika in Solr to change the priority so that the
> > >> > content is taken from Content-Type: text/plain instead of Content-Type:
> > >> > text/html?
> > >> >
> > >> > On Mon, 14 Jan 2019 at 11:18, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > I am using Solr 7.5.0 with Tika 1.18.
> > >> > >
> > >> > > Currently I am facing a situation during the indexing of EML files,
> > >> > > whereby the content is being extracted from the Content-type=text/html
> > >> > > part instead of the Content-type=text/plain part.
> > >> > >
> > >> > > The problem with Content-type=text/html is that it contains a lot of
> > >> > > words like "*FONT-SIZE: 9pt; FONT-FAMILY: arial*" in the content, and all
> > >> > > of these get indexed in Solr as well, which makes the content very
> > >> > > cluttered. It also affects search: when we search for words like
> > >> > > "font", all of the content is returned because of this.
> > >> > >
> > >> > > Would like to enquire on the following:
> > >> > > 1. Why Tika didn't get the text part (text/plain). Is there any
> way to
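
A minimal sketch of the fix mentioned at the top of this thread (setting
captureAttr to true on the ExtractingRequestHandler request). It only builds
the request URL and does not contact Solr. The base URL, the collection name
"mycollection" and the document id "sample-eml-1" are placeholders assumed for
illustration; they are not taken from the thread.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, collection, doc_id, capture_attr=True):
    """Build a /update/extract URL for Solr's ExtractingRequestHandler.

    base_url, collection and doc_id are hypothetical values for illustration.
    """
    params = {
        # literal.<field> sets a field value on the extracted document.
        "literal.id": doc_id,
        # captureAttr=true indexes element attributes into separate fields
        # named after the element.
        "captureAttr": "true" if capture_attr else "false",
        "commit": "true",
    }
    return f"{base_url}/{collection}/update/extract?{urlencode(params)}"


url = build_extract_url("http://localhost:8983/solr", "mycollection", "sample-eml-1")
print(url)
```

The EML file itself would then be sent as the body of a POST to this URL. As
advised above, for production it is still preferable to run Tika outside of
Solr and send the extracted fields with a client such as SolrJ.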
> 

Should Solr and SolrJ version should match?

2019-01-19 Thread Arunan Sugunakumar
Hi,

I created a project with solrj 7.2.1 which worked perfectly with Apache
Solr 7.2. But it does not seem to work with Apache Solr 7.6. I would like
to know whether it is mandatory to use the same solrj version as the solr.

I have pasted the stacktrace for the exception below.

[corePostProcess] org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteExecutionException: Error from server at http://localhost:8983/solr/biotestmine-search: error processing commands
[corePostProcess]   at org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteExecutionException.create(HttpSolrClient.java:829)
[corePostProcess]   at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:620)
[corePostProcess]   at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:255)
[corePostProcess]   at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:244)
[corePostProcess]   at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:194)
[corePostProcess]   at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:211)
[corePostProcess]   at org.intermine.api.searchengine.solr.SolrIndexHandler.addFieldNameToSchema(SolrIndexHandler.java:204)
[corePostProcess]   at org.intermine.api.searchengine.solr.SolrIndexHandler.createIndex(SolrIndexHandler.java:86)
[corePostProcess]   at org.intermine.bio.postprocess.CreateSearchIndexProcess.postProcess(CreateSearchIndexProcess.java:66)
[corePostProcess]   at org.intermine.task.PostProcessorTask.execute(PostProcessorTask.java:70)

Thank you,
Arunan

*Sugunakumar Arunan*
Undergraduate - CSE | UOM

Email : aruna ns...@cse.mrt.ac.lk
LinkedIn : https://www.linkedin.com/in/arunans23/


Re: Should Solr and SolrJ version should match?

2019-01-19 Thread Shawn Heisey

On 1/19/2019 10:05 AM, Arunan Sugunakumar wrote:
> I created a project with solrj 7.2.1 which worked perfectly with Apache
> Solr 7.2. But it does not seem to work with Apache Solr 7.6. I would like
> to know whether it is mandatory to use the same solrj version as the solr.
>
> I have pasted the stacktrace for the exception below.


That's not the whole stacktrace.  The whole thing could be hundreds of 
lines long, and if you leave any of it out, understanding it might not 
be possible.  The top of the stacktrace says that the server returned an 
error.  The error from the server will most likely be in the stacktrace 
on the client, but if it's not, you should be able to check the server's 
logs and see it there.


Seeing the SolrJ code that produced the error might become necessary.

HttpSolrClient should be widely compatible across versions.  
CloudSolrClient is where compatibility across a large gap might be 
problematic.  The gap between 7.2 and 7.6 is not large.  If using the 
same version for both isn't possible, best results are obtained when the 
client version is newer than the server version.  Using an older client 
with a newer Solr can be problematic, even when it's not CloudSolrClient.
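
The rule of thumb above (client version the same as or newer than the server
version) can be sketched as a small check. This is only an illustration of the
comparison, not something SolrJ provides; the function names are made up for
this example.

```python
def parse_version(version, width=3):
    """Turn a dotted version such as '7.2.1' or '7.6' into a tuple of ints.

    Missing components are padded with zeros so that '7.6' equals '7.6.0'.
    """
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (width - len(parts))
    return tuple(parts)


def client_not_older(client_version, server_version):
    """True when the client is the same version as, or newer than, the server."""
    return parse_version(client_version) >= parse_version(server_version)


# The combination from this thread: SolrJ 7.2.1 talking to Solr 7.6.
print(client_not_older("7.2.1", "7.6"))
```

A False result does not mean the combination will fail, only that it is the
direction (older client, newer server) described above as more likely to cause
trouble.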


General client functionality tends to be VERY stable -- queries, 
updates, etc.  I see from the included stacktrace that you're calling 
something you've named "addFieldNameToSchema" ... that is one of Solr's 
specialty capabilities that hasn't been around as long as core 
functionality.  That kind of functionality tends to change more 
frequently than core functionality.


Thanks,
Shawn



modifying the export request handler

2019-01-19 Thread tom400
Hey,
I'm using Solr 6.5.
I'm trying to modify the implicit /export request handler.
I want to add a search component to the export handler, or create a
/my_export request handler from the exportHandler class.
For some reason, I receive an error when trying to create the /my_export
handler.
Does anyone know how to define this request handler correctly?

  



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: modifying the export request handler

2019-01-19 Thread Erick Erickson
Attachments and graphics tend to be stripped by the mail server,
I can't see the error.

Best,
Erick

On Sat, Jan 19, 2019 at 1:12 PM tom400 wrote:
>
> hey,
> i'm using solr 6.5 .
> i'm trying to modify the /export implicit request handler.
> i want to add a search components to the export handler, or create
> /my_export request handler from the exportHandler class that.
> for some reason , i receive an error when trying to create a /my_export
> handler.
> does someone knows how to define this request handler correctly?
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Should Solr and SolrJ version should match?

2019-01-19 Thread Arunan Sugunakumar
Dear Shawn,

Thanks for the info. You are right: I pasted only the client-side stack
trace and failed to check the Solr logs. Going over the logs, I found that
Solr 7.6.0 does not allow a field name that already exists to be added
again. This was not the case in Solr 7.2.1. Anyhow, I'll create a separate
mail thread to discuss this matter.

Thanks again,

Regards,
Arunan


On Sun, 20 Jan 2019 at 02:26, Shawn Heisey wrote:

> On 1/19/2019 10:05 AM, Arunan Sugunakumar wrote:
> > I created a project with solrj 7.2.1 which worked perfectly with Apache
> > Solr 7.2. But it does not seem to work with Apache Solr 7.6. I would like
> > to know whether it is mandatory to use the same solrj version as the
> > solr.
> >
> > I have pasted the stacktrace for the exception below.
>
> That's not the whole stacktrace.  The whole thing could be hundreds of
> lines long, and if you leave any of it out, understanding it might not
> be possible.  The top of the stacktrace says that the server returned an
> error.  The error from the server will most likely be in the stacktrace
> on the client, but if it's not, you should be able to check the server's
> logs and see it there.
>
> Seeing the SolrJ code that produced the error might become necessary.
>
> HttpSolrClient should be widely compatible across versions.
> CloudSolrClient is where compatibility across a large gap might be
> problematic.  The gap between 7.2 and 7.6 is not large.  If using the
> same version for both isn't possible, best results are obtained when the
> client version is newer than the server version.  Using an older client
> with a newer Solr can be problematic, even when it's not CloudSolrClient.
>
> General client functionality tends to be VERY stable -- queries,
> updates, etc.  I see from the included stacktrace that you're calling
> something you've named "addFieldNameToSchema" ... that is one of Solr's
> specialty capabilities that hasn't been around as long as core
> functionality.  That kind of functionality tends to change more
> frequently than core functionality.
>
> Thanks,
> Shawn
>
>


Same field name cannot be added in newer Solr versions

2019-01-19 Thread Arunan Sugunakumar
Hi,

I used SolrJ 7.2.1 and Apache Solr 7.2.1 in one of my projects. Prior to
data indexing, I create the fields and field types using the Schema API in
SolrJ. If the same field name was added more than once, Solr did not return
an exception, but simply ignored it (or overrode it). When I tried Solr 7.6.0
with my project, however, it returned exceptions. I went over the Solr logs
and found the error below. It seems that I cannot create a field that already
exists. I would like to know whether I should do something different to
overcome this problem.
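
One possible approach, sketched under assumptions: look up the field names
that are already in the schema and send a "replace-field" command instead of
"add-field" when the name exists. Both are Schema API commands, but the helper
below is hypothetical and only builds the JSON command; it does not talk to
Solr. The set of existing names is hard-coded here and would normally come
from the schema's field list (GET /solr/<core>/schema/fields).

```python
import json


def schema_command(field, existing_names):
    """Pick 'replace-field' if the name is already in the schema, else 'add-field'."""
    op = "replace-field" if field["name"] in existing_names else "add-field"
    return {op: field}


# Field definition taken from the error message in the Solr log.
category = {
    "name": "Category",
    "type": "analyzed_string",
    "indexed": True,
    "stored": False,
    "multiValued": True,
    "required": False,
}

# Hypothetical list of fields already present in the schema.
existing = {"id", "Category"}
print(json.dumps(schema_command(category, existing)))
```

Note that replace-field expects the full field definition. If the existing
definition should be left untouched, the client can instead skip the command
whenever the name is already present.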

2019-01-19 16:46:43.580 ERROR (qtp952486988-18) [   x:biotestmine-search] o.a.s.h.RequestHandlerBase org.apache.solr.api.ApiBag$ExceptionWithErrObject: error processing commands, errors: [{add-field={indexed=true, stored=false, name=Category, type=analyzed_string, multiValued=true, required=false}, errorMessages=[Field 'Category' already exists.
]}],
    at org.apache.solr.handler.SchemaHandler.handleRequestBody(SchemaHandler.java:92)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2541)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)
    at java.lang.Thread.run(Thread.java:748)



Thanks in advance

Regards,
Arunan