Re: Using Tesseract OCR to extract PDF files in EML file attachment
You'll need to use something like JavaMail (javax.mail), or one of the higher-level libraries built on top of it, to open the EML files and extract the attachments, then operate on the extracted attachments as you would any other file. There are also commercial libraries that parse and extract attachments from EML files. EML attachments carry a MIME type in their metadata, which you can use to identify the PDFs.

On 4/4/2017 2:00 AM, Zheng Lin Edwin Yeo wrote:
> Hi,
>
> Currently, I am able to extract scanned PDF images and index them to Solr
> using Tesseract OCR, although the speed is very slow.
>
> However, for EML files with PDF attachments that consist of scanned images,
> Tesseract OCR is not able to extract the text from those PDF attachments.
>
> Can we use the same method for EML files? Or what do you suggest we do to
> extract those attachments?
>
> I'm using Solr 6.5.0.
>
> Regards,
> Edwin
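[Editorial aside: the extraction step described above — parse the message, walk its MIME parts, pull out the parts whose MIME type is application/pdf — can be sketched with Python's standard-library email parser; the same flow applies with JavaMail. The function name is hypothetical.]

```python
# Minimal sketch: extract PDF attachments from an EML message using
# only the Python standard library. The JavaMail equivalent walks
# MimeMessage/Multipart parts and checks each part's content type.
from email import policy
from email.parser import BytesParser

def extract_pdf_attachments(eml_bytes):
    """Return a list of (filename, payload_bytes) for PDF attachments."""
    msg = BytesParser(policy=policy.default).parsebytes(eml_bytes)
    pdfs = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            # decode=True undoes the base64/quoted-printable transfer encoding
            pdfs.append((part.get_filename(), part.get_payload(decode=True)))
    return pdfs
```

The extracted bytes can then be written to disk (or streamed) and fed through the same Tesseract/OCR pipeline already used for standalone PDFs.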
Re: The book: Solr 4.x Deep Dive - Early Access Release #1
On 6/21/2013 9:22 AM, Alexandre Rafalovitch wrote:
> I might, however, be confused regarding your strategy. I thought you were
> going to do several different volumes rather than one large one. Or is this
> all a 'first' volume discussion so far?
>
> Pricing: $7.99 feels better for a book this size. Under $5 it feels like it
> may be mostly filler (even if it is not). I don't think anybody will pay
> every month just because it got updated.

I agree that I'm a little confused about the pricing. Are you saying you'll keep updating it and everyone would just download the latest version monthly? If so, what's to stop someone from waiting to "subscribe" until it is entirely complete and paying the $8 once for the whole thing -- versus those of us (me included) who would be sending our $8 every month and therefore receiving the same work at 10x the price (for example)?

I'm with one of the previous responses: I'd be willing to pay $30 for early access (and updates) to an eBook as a one-time cost, and then when you release the final version, set it at $40 or more.
newbie questions about cache stats & query perf
Sorry, I did search for an answer, but didn't find an applicable one. I'm currently stuck on 1.4.1 (running in Tomcat 6 on 64-bit Linux) for the time being.

When I see stats like this:

name: documentCache
class: org.apache.solr.search.LRUCache
version: 1.0
description: LRU Cache(maxSize=512, initialSize=512)
lookups : 0
hits : 0
hitratio : 0.00
inserts : 0
evictions : 0
size : 0
warmupTime : 0
cumulative_lookups : 8158
cumulative_hits : 685
cumulative_hitratio : 0.08
cumulative_inserts : 7473
cumulative_evictions : 3023

I don't understand "lookups" vs. "cumulative_lookups", etc. I _do_ understand that a hit ratio of 0.08 isn't a very good one.

Something I definitely find strange is that I've allocated 4G of RAM to the Java heap, but Solr consistently stays around 1.7G. I'm trying to give it all the RAM I can spare (I could go higher, but it's not even using what I'm giving it) to make it faster.

The index takes up roughly 25GB on disk, and indexing is very fast (well, nothing we're complaining about anyway). We're trying to figure out why queries against the default document content are slow (15-30 seconds for only a few million total documents). mergeFactor=3, if that helps.

So if anyone could point me to someplace that defines what these stats mean, and if anyone has any immediate tips/tricks/recommendations for increasing query performance (and whether this documentCache is a good candidate to be increased substantially), I would very much appreciate it.

-AJ
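[Editorial aside on the last point: documentCache sizing lives in solrconfig.xml. A sketch of a larger cache follows; the numbers are illustrative only, not recommendations for this index, and note that the documentCache cannot usefully be autowarmed because internal document IDs change when a new searcher opens.]

```xml
<!-- Hypothetical solrconfig.xml fragment: a larger documentCache.
     Sizes are illustrative; tune against your own hit ratios and evictions. -->
<documentCache class="solr.LRUCache"
               size="16384"
               initialSize="4096"
               autowarmCount="0"/>
```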
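[Editorial aside on the lookups vs. cumulative_lookups question: the unprefixed counters cover only the current searcher and reset each time a commit opens a new one, while the cumulative_* counters span the life of the core. The reported ratios are simply hits divided by lookups:]

```python
# Counters taken from the stats block above (lifetime of the core).
cumulative_hits = 685
cumulative_lookups = 8158

# cumulative_hitratio is just hits / lookups, rounded for display.
ratio = cumulative_hits / cumulative_lookups
print(round(ratio, 2))  # 0.08 -- roughly 92% of document lookups miss this cache
```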