date:20181021

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)

Hi again, Is there anyone who has experience using Tesseract’s OCR module within Solr? The files I am trying to read into Solr are Danish TIFF documents. Martin Frank Hansen, Senior Data Analytiker Data, IM & Analytics Lautrupparken 40-42, DK-2750

Error while indexing Thai core with SolrCloud

2018-10-21 Thread Moshe Recanati | KMS

Hi, We have a specific exception that happens only on the Thai core and only when we're using SolrCloud. The same indexing activity runs successfully on the EN core with SolrCloud, or on the Thai core in a standalone configuration. We're running on Linux with Solr 4.6 and with -Dfile.encod

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Alexandre Rafalovitch

I would check if the Byte-order mark is the cause: https://en.wikipedia.org/wiki/Byte_order_mark The error message does not seem to be a perfect match to this issue, but a good thing to check anyway. That symbol (right at the file start) is usually invisible and can trip Java XML parsers for some
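The byte-order-mark check suggested above can be sketched in a few lines of plain Java. This is only an illustration, not code from the thread: the class name `BomCheck` and its methods are hypothetical, and it covers only the UTF-8 BOM (the bytes EF BB BF), on the assumption that the files being indexed are UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not part of Solr or SolrJ): detects and strips a UTF-8 BOM.
final class BomCheck {
    // U+FEFF encoded as UTF-8.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if the data starts with the three UTF-8 BOM bytes.
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= UTF8_BOM.length
                && data[0] == UTF8_BOM[0]
                && data[1] == UTF8_BOM[1]
                && data[2] == UTF8_BOM[2];
    }

    // Returns the data without a leading BOM; returns the input unchanged if there is none.
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data)
                ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
                : data;
    }

    public static void main(String[] args) {
        // Build an in-memory sample: BOM followed by a small XML payload.
        byte[] body = "<add/>".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, withBom, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, withBom, UTF8_BOM.length, body.length);

        System.out.println("BOM present: " + hasUtf8Bom(withBom));
        System.out.println("After strip: "
                + new String(stripUtf8Bom(withBom), StandardCharsets.UTF_8));
    }
}
```

Running a check like this on the bytes of the file being posted, before it reaches the XML parser, would at least rule the BOM in or out as a cause.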

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Moshe Recanati | KMS

Hi Alexandre, Thank you. How does this explain why the issue exists only with SolrCloud and not standalone? Moshe From: Alexandre Rafalovitch Sent: Sunday, October 21, 2018 5:18:24 PM To: solr-user Subject: Re: Error while indexing Thai core with SolrCloud I would che

Re: Tesseract language

2018-10-21 Thread Alexandre Rafalovitch

There are a couple of things mixed in here: 1) The extract handler is not recommended for production usage. It is great for a quick test, just as you did, but in production it is better to run it externally. Tika, especially with large files, can use up a lot of memory and trip up the Solr ins

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Alexandre Rafalovitch

OK, if the same file and the same core definition work on a standalone instance, then the issue may be different. Can you please share the full stack trace of the message? It may be important to see which thread died. Also, I would just spin up a test Solr 7.5 instance and see if the problem is still the

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)

Hi Alexandre, Thanks for your reply. Yes, right now it is just for testing the possibilities of Solr and Tesseract. I will take a look at the Tika documentation to see if I can make it work. You said that DIH is not recommended for production usage, what is the recommended method(s) to upload

6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Clemens Wyss DEV

Just upgrading from 6.6 to 7.5 and am now seeing many "Connection evictor"-threads which are all Thread.sleep()ing ... As of 6.6 I am keeping the SolrClients (one per core) in a HashMap. Is this ok or should I create a new SolrClient for each request I am doing? SolrClient creation is as follows

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Moshe Recanati | KMS

Hi, Thank you. Full stacktrace below "core_node_name":"172.19.218.201:8082_solr_core_th"}DEBUG - 2018-10-19 02:13:20.343; org.apache.zookeeper.ClientCnxn$SendThread; Reading reply sessionid:0x200b5a04a770005, packet:: clientPath:null serverPath:null finished:false header:: 356,1 replyHeader

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Alexandre Rafalovitch

OK, that may have been a bit too much :-) However, it was useful. There seem to be several possible avenues: 1) You are using SolrJ and your SolrJ version is not the same as the version of the Solr server. There was a bunch of things that could trigger, especially in combination with Unicode bu

Re: Tesseract language

2018-10-21 Thread Alexandre Rafalovitch

Usually, we just say to build a custom solution using the SolrJ client to connect. This gives you maximum flexibility and allows you to integrate Tika either inside your code or as a server. Latest Tika actually has some off-thread handling I believe, to make it safer to embed. For DIH alternatives, if you w

Re: 6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Shawn Heisey

On 10/21/2018 10:13 AM, Clemens Wyss DEV wrote: Just upgrading from 6.6 to 7.5 and am now seeing many "Connection evictor"-threads which are all Thread.sleep()ing ... What's the stacktrace on those threads? If they're sleeping, then it's unlikely that there's any real contribution to system l

AW: 6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Clemens Wyss DEV

Thx Shawn! > If they're sleeping, then it's unlikely that there's any real contribution to > system load. I know, but > seeing threads you didn't expect to see? exactly this > You should really be keeping one SolrClient per server node, >and indicating which core to access with each request Du

Re: Error while indexing Thai core with SolrCloud

2018-10-21 Thread Moshe Recanati | KMS

Thank you. Will check all options and let you know. From: Alexandre Rafalovitch Sent: Sunday, October 21, 2018 8:09:34 PM To: solr-user Subject: Re: Error while indexing Thai core with SolrCloud Ok, That may have been a bit too much :-) However, it was useful.

Re: AW: 6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Shawn Heisey

On 10/21/2018 11:43 AM, Clemens Wyss DEV wrote: If I omit the core in the url upon creation of the SolrClient, where can I then "indicate" the core? You do it with the request, not with the client. https://lucene.apache.org/solr/7_5_0//solr-solrj/org/apache/solr/client/solrj/SolrClient.html#a

Re: Tesseract language

2018-10-21 Thread Erick Erickson

Here's a skeletal program that uses Tika in a stand-alone client. Rip the RDBMS parts out https://lucidworks.com/2012/02/14/indexing-with-solrj/ On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch wrote: > > Usually, we just say to do a custom solution using SolrJ client to > connect. This

Re: Tesseract language

2018-10-21 Thread Gus Heck

Hi Martin, I wrote a framework (https://github.com/nsoft/jesterj) that is meant to help with small to medium custom solutions. It's not (yet) ready for cases where you need multiple machines feeding data, but so long as a single box can do the work it should be useful. It has a basic Tika stage whi

Re: Is there a tool to directly index hdfs files to solr?

2018-10-21 Thread Jason Gerlowski

Not familiar with the contrib you mentioned, or the rationale behind its removal. But as to your first question, you might be interested in looking at: https://github.com/lucidworks/hadoop-solr Disclaimer: I help maintain the "hadoop-solr" project mentioned. On Thu, Oct 18, 2018 at 8:17 AM shreck

AW: AW: 6.6 -> 7.5 SolrJ, seeing many "Connection evictor"-Threads

2018-10-21 Thread Clemens Wyss DEV

On 10/21/2018 01:06 PM, Shawn Heisey wrote: > You do it with the request, not with the client For the UpdateRequests it is the "commitWithinMs"-parameter? To me this parameter sounds like telling the solr-server I need to see this data within "x ms". As we have autoCommit and autoSoftCommit ...

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)

Hi Alex, Thanks again for your reply, much appreciated. Martin Frank Hansen, Senior Data Analytiker Data, IM & Analytics Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk Web www.kmd.dk Mobile +4525571418 -Original message- From: Alexandre Rafalovitch Sent: 21 October

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)

Hi Gus, Thank you so much! I will definitely take a look at it during the day. Martin Frank Hansen, -Original message- From: Gus Heck Sent: 22 October 2018 00:06 To: solr-user@lucene.apache.org Subject: Re: Tesseract language Hi Martin, I wrote a framework (https://github.com/nso

SV: Tesseract language

2018-10-21 Thread Martin Frank Hansen (MHQ)

Hi Erick, Thanks for the help! I will take a look at it. Martin Frank Hansen, Senior Data Analytiker Data, IM & Analytics Lautrupparken 40-42, DK-2750 Ballerup E-mail m...@kmd.dk Web www.kmd.dk Mobile +4525571418 -Original message- From: Erick Erickson Sent: 21 October 2018 2

22 matches
