Re: Using Tesseract OCR to extract PDF files in EML file attachment

2017-04-04 Thread AJ Weber
You'll need to use something like javax.mail (or one of the libraries built 
on top of it for higher-level access) to open the EML files and extract the 
attachments, then operate on the extracted attachments as you would any file.


There are also alternative, paid libraries that parse and extract attachments 
from EML files.


EML attachments will have a MIME type associated with their metadata.



On 4/4/2017 2:00 AM, Zheng Lin Edwin Yeo wrote:

Hi,

Currently, I am able to extract scanned PDF images and index them to Solr
using Tesseract OCR, although the speed is very slow.

However, for EML files with PDF attachments that consist of scanned images,
the Tesseract OCR is not able to extract the text from those PDF
attachments.

Can we use the same method for EML files? Or what would you suggest we do to
extract those attachments?

I'm using Solr 6.5.0

Regards,
Edwin





Re: The book: Solr 4.x Deep Dive - Early Access Release #1

2013-06-21 Thread AJ Weber



On 6/21/2013 9:22 AM, Alexandre Rafalovitch wrote:


I might, however, be confused regarding your strategy. I thought you
were going to do several different volumes, rather than one large one.
Or is this all a 'first' volume discussion so far?

Pricing: $7.99 feels better for the book this size. Under $5 it feels
like it may be mostly filler (even if it is not). I don't think
anybody will pay every month just because it got updated.

I agree that I'm a little confused as to the pricing.  Are you saying 
you'll keep updating it and everyone would just download the latest version 
monthly?  If so, what's to stop someone from waiting to "subscribe" 
until it is entirely complete and paying the $8 once for the whole 
thing -- versus those of us (me included) who would be sending our $8 
every month and therefore paying, say, 10x the price for the same work?


I'm with one of the previous responses:  I'd be willing to pay $30 for 
early-access (and updates) to an eBook as a one-time-cost and then when 
you release the final, set it at $40 or more.




newbie questions about cache stats & query perf

2013-01-09 Thread AJ Weber
Sorry, I did search for an answer but didn't find an applicable one.  
I'm currently stuck on 1.4.1 (running in Tomcat 6 on 64-bit Linux) for 
the time being...


When I see stats like this:
name:  documentCache
class:  org.apache.solr.search.LRUCache
version:  1.0
description:  LRU Cache(maxSize=512, initialSize=512)
lookups : 0
hits : 0
hitratio : 0.00
inserts : 0
evictions : 0
size : 0
warmupTime : 0
cumulative_lookups : 8158
cumulative_hits : 685
cumulative_hitratio : 0.08
cumulative_inserts : 7473
cumulative_evictions : 3023

I don't understand "lookups" vs. "cumulative_lookups", etc.  I _do_ 
understand that a hit-ratio of 0.08 isn't a very good one.
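For what it's worth, the 0.08 quoted above is just the ratio of the two cumulative counters from the stats block, as a quick check shows (numbers copied from the documentCache stats above):

```python
# The cumulative hit ratio is cumulative_hits / cumulative_lookups.
# Values taken from the documentCache stats quoted in this message.
cumulative_lookups = 8158
cumulative_hits = 685
cumulative_hitratio = cumulative_hits / cumulative_lookups
print(round(cumulative_hitratio, 2))  # matches the 0.08 Solr reports
```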


Something I definitely find strange is that I've allocated 4G of RAM to 
the Java heap, but Solr consistently remains around 1.7G.  I'm trying to 
give it all the RAM I can spare (I could go higher, but it's not even 
using what I'm giving it) to make it faster.


The index takes up roughly 25GB on disk, and indexing is very fast 
(well, nothing we're complaining about anyway).  We're trying to figure 
out why queries against the default field (document content) are slow (15-30 
seconds for only a few million total documents).  mergeFactor=3, if that helps.


So if anyone could point me to someplace that defines what these stats 
mean, and if anyone has any immediate tips/tricks/recommendations as to 
increasing query performance (and whether this documentCache is a good 
candidate to be increased substantially), I would very much appreciate it.
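For reference, the documentCache whose stats are quoted above is sized in solrconfig.xml. The stats show maxSize=512 with roughly 3,000 cumulative evictions against 7,500 inserts, so a larger size may be worth testing; the values below are purely illustrative, not a recommendation:

```xml
<!-- solrconfig.xml: documentCache sizing (illustrative values only).
     documentCache entries are tied to internal doc ids, so autowarming
     is left at 0. -->
<documentCache class="solr.LRUCache"
               size="4096"
               initialSize="4096"
               autowarmCount="0"/>
```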


-AJ