Hi again,
Does anyone have experience using Tesseract’s OCR module within
Solr? The files I am trying to read into Solr are Danish TIFF documents.
Martin Frank Hansen, Senior Data Analytiker
Data, IM & Analytics
Lautrupparken 40-42, DK-2750
Hi,
We have a specific exception that happens only on the Thai core and only when we're
using SolrCloud.
The same indexing activity runs successfully on the EN core with
SolrCloud, or on the Thai core in a standalone configuration.
We're running on Linux with Solr 4.6
and with -Dfile.encod
I would check if the Byte-order mark is the cause:
https://en.wikipedia.org/wiki/Byte_order_mark
The error message does not seem to be a perfect match to this issue,
but a good thing to check anyway.
That symbol (right at the file start) is usually invisible and can
trip Java XML parsers for some
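A minimal sketch of the BOM check suggested above, assuming the file being indexed is UTF-8 and Java 11 or later. The class and method names (BomCheck, hasUtf8Bom, fileHasUtf8Bom) are made up for illustration and are not part of Solr or SolrJ:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not a Solr API): detects a UTF-8 byte order mark.
final class BomCheck {
    // U+FEFF encoded as UTF-8 is the three bytes EF BB BF.
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Returns true if the given leading bytes start with the UTF-8 BOM.
    static boolean hasUtf8Bom(byte[] head) {
        if (head.length < UTF8_BOM.length) {
            return false;
        }
        for (int i = 0; i < UTF8_BOM.length; i++) {
            if (head[i] != UTF8_BOM[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads at most the first three bytes of the file and checks them.
    static boolean fileHasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return hasUtf8Bom(in.readNBytes(UTF8_BOM.length));
        }
    }
}
```

If fileHasUtf8Bom returns true for a config or data file, re-saving that file without the BOM is a cheap way to rule this cause out.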
Hi Alexandre,
Thank you.
How does this explain why the issue exists only with SolrCloud and not standalone?
Moshe
From: Alexandre Rafalovitch
Sent: Sunday, October 21, 2018 5:18:24 PM
To: solr-user
Subject: Re: Error while indexing Thai core with SolrCloud
I would che
There are a couple of things mixed in here:
1) Extract handler is not recommended for production usage. It is great for
a quick test, just like you did it, but going to production, running it
externally is better. Tika, especially with large files, can use up a lot
of memory and trip up the Solr ins
Ok,
If the same file and the same core definition work on a standalone instance,
then the issue may be different. Can you please share the full stack
trace of the message? It may be important to see which thread died.
Also, I would just spin up a test Solr 7.5 instance and see if the
problem is still the
Hi Alexandre,
Thanks for your reply.
Yes, right now it is just for testing the possibilities of Solr and Tesseract.
I will take a look at the Tika documentation to see if I can make it work.
You said that DIH is not recommended for production usage; what is the
recommended method(s) to upload
I have just upgraded from 6.6 to 7.5 and am now seeing many "Connection
evictor" threads, which are all in Thread.sleep() ...
As I did with 6.6, I am keeping the SolrClients (one per core) in a HashMap. Is this OK,
or should I create a new SolrClient for each request?
SolrClient creation is as follows
Hi,
Thank you.
Full stacktrace below
"core_node_name":"172.19.218.201:8082_solr_core_th"}DEBUG - 2018-10-19
02:13:20.343; org.apache.zookeeper.ClientCnxn$SendThread; Reading reply
sessionid:0x200b5a04a770005, packet:: clientPath:null serverPath:null
finished:false header:: 356,1 replyHeader
Ok,
That may have been a bit too much :-) However, it was useful.
There seem to be several possible avenues:
1) You are using SolrJ and your SolrJ version is not the same as the
version of the Solr server. There were a bunch of things that could
trigger, especially in combination with Unicode bu
Usually, we just say to do a custom solution using a SolrJ client to
connect. This gives you maximum flexibility and allows you to integrate
Tika either inside your code or as a server. The latest Tika actually has
some off-thread handling, I believe, to make it safer to embed.
For DIH alternatives, if you w
On 10/21/2018 10:13 AM, Clemens Wyss DEV wrote:
> Just upgrading from 6.6 to 7.5 and am now seeing many "Connection
> evictor" threads which are all in Thread.sleep() ...
What's the stacktrace on those threads? If they're sleeping, then it's
unlikely that there's any real contribution to system l
Thx Shawn!
> If they're sleeping, then it's unlikely that there's any real contribution to
> system load.
I know, but
> seeing threads you didn't expect to see?
exactly this
> You should really be keeping one SolrClient per server node,
> and indicating which core to access with each request
Du
Thank you. Will check all options and let you know.
From: Alexandre Rafalovitch
Sent: Sunday, October 21, 2018 8:09:34 PM
To: solr-user
Subject: Re: Error while indexing Thai core with SolrCloud
Ok,
That may have been a bit too much :-) However, it was useful.
On 10/21/2018 11:43 AM, Clemens Wyss DEV wrote:
If I omit the core in the url upon creation of the SolrClient, where can I then
"indicate" the core?
You do it with the request, not with the client.
https://lucene.apache.org/solr/7_5_0//solr-solrj/org/apache/solr/client/solrj/SolrClient.html#a
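A rough sketch of the pattern described above: one long-lived client per server node, with the core chosen per request rather than baked into the client. To stay self-contained it deliberately uses the JDK's java.net.http.HttpClient (Java 11+) instead of SolrJ, so NodeClients, clientFor and selectUri are illustrative names only, and the node URL and core names in the usage note are hypothetical. In SolrJ itself, the corresponding calls are, as far as I know, the SolrClient methods that take the collection (core) name as their first argument, such as query(String collection, SolrParams params).

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustration only: caches one HTTP client per Solr node, not per core.
final class NodeClients {
    // Key is the node base URL (assumed form: http://host:port/solr).
    private final Map<String, HttpClient> clientsByNode = new ConcurrentHashMap<>();

    // Returns the shared client for a node, creating it on first use.
    HttpClient clientFor(String nodeBaseUrl) {
        return clientsByNode.computeIfAbsent(nodeBaseUrl, url -> HttpClient.newHttpClient());
    }

    // The core is supplied with each request and becomes part of the request path.
    static URI selectUri(String nodeBaseUrl, String core, String query) {
        String base = nodeBaseUrl.endsWith("/")
                ? nodeBaseUrl.substring(0, nodeBaseUrl.length() - 1)
                : nodeBaseUrl;
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return URI.create(base + "/" + core + "/select?q=" + q);
    }
}
```

With this shape, requests for "core_en" and "core_th" on the same node reuse the client returned by clientFor("http://localhost:8983/solr") and differ only in the URI built by selectUri, so the number of clients (and their background threads) follows the number of nodes rather than the number of cores.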
Here's a skeletal program that uses Tika in a stand-alone client. Rip
the RDBMS parts out
https://lucidworks.com/2012/02/14/indexing-with-solrj/
On Sun, Oct 21, 2018 at 1:13 PM Alexandre Rafalovitch
wrote:
>
> Usually, we just say to do a custom solution using SolrJ client to
> connect. This
Hi Martin,
I wrote a framework (https://github.com/nsoft/jesterj) that is meant to
help with small to medium custom solutions. It's not (yet) ready for cases
where you need multiple machines feeding data, but so long as a single box
can do the work it should be useful. It has a basic Tika stage whi
Not familiar with the contrib you mentioned, or the rationale behind
its removal. But as to your first question, you might be interested
in looking at: https://github.com/lucidworks/hadoop-solr
Disclaimer: I help maintain the "hadoop-solr" project mentioned.
On Thu, Oct 18, 2018 at 8:17 AM shreck
On 10/21/2018 01:06 PM, Shawn Heisey wrote:
> You do it with the request, not with the client
For UpdateRequests, is it the "commitWithinMs" parameter? To me this
parameter sounds like telling the Solr server I need to see this data within "x
ms". As we have autoCommit and autoSoftCommit
...
Hi Alex,
Thanks again for your reply, much appreciated.
Martin Frank Hansen, Senior Data Analytiker
-Original message-
From: Alexandre Rafalovitch
Sent: 21 October
Hi Gus,
Thank you so much! I will definitely take a look at it during the day.
Martin Frank Hansen,
-Original message-
From: Gus Heck
Sent: 22 October 2018 00:06
To: solr-user@lucene.apache.org
Subject: Re: Tesseract language
Hi Martin,
I wrote a framework (https://github.com/nso
Hi Erick,
Thanks for the help! I will take a look at it.
Martin Frank Hansen, Senior Data Analytiker
-Original message-
From: Erick Erickson
Sent: 21 October 2018 2