RE: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
December 2017 3:36 p.m. To: solr-user@lucene.apache.org Subject: Re: Alternatives to tika for extracting text out of PDFs No need to prove it. More modern PDF formats are easier to decode, but for many years the text was move-print-move-print, so the font metrics were necessary to guess at spaces. Plus

Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Walter Underwood
No need to prove it. More modern PDF formats are easier to decode, but for many years the text was move-print-move-print, so the font metrics were necessary to guess at spaces. Plus, the glyph IDs had to be mapped to characters, so some PDFs were effectively a substitution code. Our team joked

Re: Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Erick Erickson
der it a single word. I'm not quite sure how to prove that, but I'd be willing to make a bet ;) Erick On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden wrote: > I am indexing PDFs and a separate process has converted any image PDFs to > search PDF before solr gets near it. I notice

Alternatives to tika for extracting text out of PDFs

2017-12-07 Thread Phil Scadden
I am indexing PDFs and a separate process has converted any image PDFs to search PDF before solr gets near it. I notice that tika is very slow at parsing some PDFs. I don't need any metadata (which I suspect is slowing tika down), just the text. Has anyone used an alternative PDF

Re: [Result Query Solr] How to retrieve the content of pdfs

2016-09-20 Thread Dmitry Kan
Hi Alexandre, Could you add fl=* to your query and check the output? Alternatively, have a look at your schema file and check which field looks like the content field: text or similar. Dmitry On 14 Sep 2016, 1:27 AM, "Alexandre Martins" < alexandremart...@gmail.com> wrote: > Hi Guys, >
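A minimal sketch of the fl=* suggestion above, assuming a Solr instance at http://localhost:8983/solr and a collection called "mycollection" (both are placeholders, not taken from the thread). It only builds the query URL; note that fl=* returns stored fields only, so a content field that is indexed but not stored will still be absent from the response.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query="*:*", fields="*"):
    """Build a Solr /select URL that asks for all stored fields.

    base_url and collection are hypothetical values supplied by the caller.
    """
    params = urlencode({"q": query, "fl": fields, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


# Example (placeholder host and collection name):
# build_select_url("http://localhost:8983/solr", "mycollection")
```

Opening the resulting URL in a browser or with curl shows every stored field for each document, which makes it easy to spot which field, if any, holds the extracted text.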

Re: [Result Query Solr] How to retrieve the content of pdfs

2016-09-14 Thread Alexandre Rafalovitch
The extracted content goes into text field which is not stored. You can make it stored but the output will really not be pretty. PDF is not a linear storage format. Regards, Alex On 14 Sep 2016 5:16 AM, "Alexandre Martins" wrote: > Hi Guys, > > I'm trying to use the last version of solr and

[Result Query Solr] How to retrieve the content of pdfs

2016-09-13 Thread Alexandre Martins
Hi Guys, I'm trying to use the latest version of Solr and I have used the post tool to upload 28 PDF files, and it works fine. However, I don't know how to show the content of the files in the resulting JSON. Does anybody know how to include this field? "responseHeader":{ "zkConnected":true, "status":0, "

Re: Question about indexing PDFs

2016-08-26 Thread Betsey Benagh
is very easy to do without knowing it. > >> not actually having 'indexed="true" set in your schema > >> not committing after inserting the doc > >Best, >Erick > >On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh < >betsey.ben...@stresearch.com

RE: Question about indexing PDFs

2016-08-26 Thread Srinivasa Meenavalli
on&indent=true Regards Srinivas Meenavalli -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Friday, August 26, 2016 3:09 AM To: solr-user Subject: Re: Question about indexing PDFs That is always a dangero

Re: Question about indexing PDFs

2016-08-25 Thread Erick Erickson
you are. This is very easy to do without knowing it. > not actually having 'indexed="true" set in your schema > not committing after inserting the doc Best, Erick On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh < betsey.ben...@stresearch.com> wrote: > It looks like the

Re: Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
Right, that's where I looked. No 'content'. Which is what confused me. On 8/25/16, 1:56 PM, "Erick Erickson" wrote: >when you say "I don't see it in the schema for that collection" are you >talking schema.xml? managed_schema? Or actual documents in the index? >Often >these are defined by dyna

Re: Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
It looks like the metadata of the PDFs was indexed, but not the content (which is what I was interested in). Searches on terms I know exist in the content come up empty. On 8/25/16, 2:16 PM, "Betsey Benagh" wrote: >Right, that's where I looked. No 'content'. Which is wha

Re: Question about indexing PDFs

2016-08-25 Thread Erick Erickson
when you say "I don't see it in the schema for that collection" are you talking schema.xml? managed_schema? Or actual documents in the index? Often these are defined by dynamic fields and the like in the schema files. Take a look at the admin UI>>schema browser>>drop down and you'll see all the ac

Question about indexing PDFs

2016-08-25 Thread Betsey Benagh
Following the instructions in the quick start guide, I imported a bunch of PDF documents into my Solr 6.0 instance. As far as I can tell from the documentation, there should be a 'content' field indexing, well, the content, but I don't see it in the schema for that collection. Is there somethi

Re: iText hitting infinite loop - Was Re: pdfs

2014-06-02 Thread Erick Erickson
Siegfried Goeschl >>> >>> On 25 May 2014, at 10:06, Siegfried Goeschl wrote: >>> >>> Hi Brian, >>>> >>>> can you send me the email? I would like to play around :-) >>>> >>>> Have you opened a JIRA for PdfBox? If not

iText hitting infinite loop - Was Re: pdfs

2014-06-02 Thread Siegfried Goeschl
egfried Goeschl On 25 May 2014, at 04:18, Brian McDowell wrote: Our feeding (indexing) tool halts because Solr becomes unresponsive after getting some really bad pdfs. There are levels of pdf "badness." Some just will not parse and that's fine, but others are more problematic

Re: pdfs

2014-05-26 Thread Erick Erickson
;> >> On 25 May 2014, at 04:18, Brian McDowell wrote: >> >>> Our feeding (indexing) tool halts because Solr becomes unresponsive after >>> getting some really bad pdfs. There are levels of pdf "badness." Some just >>> will not parse and that'

Re: pdfs

2014-05-25 Thread Siegfried Goeschl
eproduce > the issue … > > Thanks in advance > > Siegfried Goeschl > > > On 25 May 2014, at 04:18, Brian McDowell wrote: > >> Our feeding (indexing) tool halts because Solr becomes unresponsive after >> getting some really bad pdfs. There are levels of

Re: pdfs

2014-05-25 Thread Siegfried Goeschl
use Solr becomes unresponsive after > getting some really bad pdfs. There are levels of pdf "badness." Some just > will not parse and that's fine, but others are more problematic in that our > Operations team has to restart Solr because it just hangs and accepts no > m

Re: pdfs

2014-05-24 Thread Brian McDowell
Our feeding (indexing) tool halts because Solr becomes unresponsive after getting some really bad pdfs. There are levels of pdf "badness." Some just will not parse and that's fine, but others are more problematic in that our Operations team has to restart Solr because it just hangs

Re: pdfs

2014-05-22 Thread Jack Krupansky
Subject: Re: pdfs Hi folks, for a small customer project I'm running SOLR with embedded Tika. * memory consumption is an issue but can be handled * there is an issue with PDFBox hitting an infinite loop which causes excessive CPU usage - requires SOLR restart but happens only once within 40

Re: pdfs

2014-05-22 Thread Siegfried Goeschl
Krupansky -Original Message- From: Brian McDowell Sent: Thursday, May 22, 2014 12:24 AM To: solr-user@lucene.apache.org Subject: pdfs Has anyone had issues with indexing pdf files? Some pdfs are bringing down Solr completely so that it actually needs to be manually restarted. We are using Solr 4.4 a

Re: pdfs

2014-05-21 Thread Jack Krupansky
cific symptom? -- Jack Krupansky -Original Message- From: Brian McDowell Sent: Thursday, May 22, 2014 12:24 AM To: solr-user@lucene.apache.org Subject: pdfs Has anyone had issues with indexing pdf files? Some pdfs are bringing down Solr completely so that it actually needs to be man

Re: pdfs

2014-05-21 Thread Alexandre Rafalovitch
Run Tika in a client instead? (Or as a standalone server listening over a TCP socket.) Ship only extractions to Solr. This is more efficient as well. I suspect there would always be PDFs that cause strange behaviour, even if just based on memory requirements (e.g. embedded images). If that becomes
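A rough sketch of the "extract outside Solr, ship only the text" idea above, under stated assumptions: the extractor is any external command that prints plain text to stdout (the actual Tika invocation is not shown in the thread and is left to the caller), and the "content" field name is a placeholder for whatever the schema defines. Running the extractor as a separate process with a timeout means a PDF that sends the parser into an endless loop costs one skipped document rather than a Solr restart.

```python
import subprocess
import sys


def extract_text(command, timeout_seconds=30):
    """Run an external extractor and return its stdout, or None on hang or failure.

    command is a list such as [program, arg1, ...]; the exact extractor
    command is an assumption left to the caller.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return None  # treat a hung parser as a skipped document
    if result.returncode != 0:
        return None  # extractor reported an error
    return result.stdout


def to_solr_doc(doc_id, text):
    """Wrap extracted text in a dict suitable for Solr's JSON update format.

    "content" is a hypothetical field name; use the one in your schema.
    """
    return {"id": doc_id, "content": text}


# Stand-in for a real extractor, just to show the call shape:
# extract_text([sys.executable, "-c", "print('hello')"])
```

Documents for which extract_text returns None can be logged and set aside for manual inspection, while the rest are posted to Solr as ordinary documents.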

pdfs

2014-05-21 Thread Brian McDowell
Has anyone had issues with indexing pdf files? Some pdfs are bringing down Solr completely so that it actually needs to be manually restarted. We are using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the problem because the release notes associated with the new tika version and

Re: Indexing scanned PDFs

2014-05-06 Thread Jack Krupansky
ubject: Re: Indexing scanned PDFs Nothing I am aware of for Solr directly. You may have better luck chasing this at TIKA mailing list, as that's what Solr uses under covers to index PDF otherwise. Doing a quick search for Tika and OCR brings up a number of links. Regards, Alex. Persona

Re: Indexing scanned PDFs

2014-05-05 Thread Alexandre Rafalovitch
s.com/ Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency On Tue, May 6, 2014 at 12:15 PM, Chandan Tamrakar wrote: > we are using SOLr to index pdf documents but there are cases where PDFs > are usually a scanned document with no text to extract and index . >

Indexing scanned PDFs

2014-05-05 Thread Chandan Tamrakar
we are using SOLr to index pdf documents but there are cases where PDFs are usually a scanned document with no text to extract and index . Is there a plugin or module in SOLR that we can integrate so that it would actually extract a text / OCR and then index? Thanks in advance Chandan

RE: Many PDFs indexed but only one returned in te Solr-UI

2014-03-11 Thread Croci Francesco Luigi (ID SWS)
uid Greetings Francesco -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Tuesday, 11 March 2014 12:46 To: solr-user@lucene.apache.org Subject: Re: Many PDFs indexed but only one returned in te Solr-UI Hmmm, that looks OK to me. I'd log o

Re: Many PDFs indexed but only one returned in te Solr-UI

2014-03-11 Thread Erick Erickson
. doc.add("id", filename + timestamp) just to have something that changes every run. Best Erick On Tue, Mar 11, 2014 at 6:00 AM, Croci Francesco Luigi (ID SWS) wrote: > I followed the example here > (http://searchhub.org/2012/02/14/indexing-with-solrj/) for indexing all the >
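A small illustration of why the id suggestion above matters, assuming the symptom comes from every document being sent with the same unique key (the truncated snippet does not confirm this). Solr replaces an existing document when a new one arrives with the same uniqueKey value, so the plain dictionary here stands in for that behaviour; make_doc_id and index_by_id are hypothetical helper names.

```python
import os
import time


def make_doc_id(path, timestamp=None):
    """Combine the file name with a timestamp so the id differs per file and per run."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{os.path.basename(path)}-{timestamp}"


def index_by_id(docs):
    """Mimic uniqueKey semantics: a later doc with the same id replaces the earlier one."""
    index = {}
    for doc in docs:
        index[doc["id"]] = doc
    return index
```

Feeding index_by_id many documents that share one id leaves a single entry, which matches the "only one result for q=*:*" symptom; giving each file its own id keeps them all.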

Many PDFs indexed but only one returned in te Solr-UI

2014-03-11 Thread Croci Francesco Luigi (ID SWS)
I followed the example here (http://searchhub.org/2012/02/14/indexing-with-solrj/) for indexing all the pdfs in a directory. The process seems to work well, but at the end, when I go in the Solr-UI and click on "Execute query"(with q=*:*), I get only one entry. Do I miss something

Re: Issue regarding Indexing PDFs into Solr.

2013-04-29 Thread Furkan KAMACI
It seems that your solrconfig.xml can not find libraries. Here is an example path from solrconfig.xml: 2013/4/29 Krishna Venkateswaran > Hi > > I have installed Solr over Apache Tomcat. > I have used Apache Tomcat v6.x for Solr to work. > > When trying to upload a file using SolrJ to index it

Issue regarding Indexing PDFs into Solr.

2013-04-29 Thread Krishna Venkateswaran
Hi I have installed Solr over Apache Tomcat. I have used Apache Tomcat v6.x for Solr to work. When trying to upload a file using SolrJ to index it into Solr, I am getting an exception as follows: Server at http://localhost:8080/solr-example returned non ok status:500, message:Internal Server Err

Is it possible to index pdfs and database into single document?

2012-05-11 Thread anarchos78
pl.execute(StatementImpl.java:841) at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:681) at org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:246) ... 13 more Is it possible to index pdfs, docs, rtf along with database and havin

Re: bug in ExtractingRequestHandler with PDFs and metadata field Category

2011-07-07 Thread Juan Grande
Hi Andras, I added metadata_ so all PDF metadata fields > should be saved in solr as "metadata_something" fields. > The problem is that the "Category" metadata field from the PDF for some > reason is not prefixed with "metadata_" and > solr will merge the "Category" field I have in the schema with

bug in ExtractingRequestHandler with PDFs and metadata field Category

2011-07-07 Thread Andras Balogh
Hi, I think this is a bug but before reporting to issue tracker I thought I will ask it here first. So the problem is I have a PDF file which among other metadata fields like Author, CreatedDate etc. has a metadata field Category (I can see all metadata fields with tika-app.jar started in

RE: Setting up Solr for PDFs on JBoss

2011-01-04 Thread Olson, Ron
8:10 PM To: solr-user@lucene.apache.org Subject: Re: Setting up Solr for PDFs on JBoss What's your solrconfig.xml look like for setting up the ExtractingReqHandler? -Grant On Jan 3, 2011, at 4:44 PM, Olson, Ron wrote: > Hi all- > > After testing the PDF import functionality in my

Re: Setting up Solr for PDFs on JBoss

2011-01-04 Thread Jak Akdemir
In JBoss, duplicate libraries will be ignored as you mentioned. You may start to find libraries used in JBoss with "find -name *.jar". I don't know any other resource than wiki. It says remove the libraries below. - xercesImpl-2.8.1.jar - xml-apis-1.3.03.jar http://wiki.apache.org/solr/Sol

Re: Setting up Solr for PDFs on JBoss

2011-01-03 Thread Grant Ingersoll
What's your solrconfig.xml look like for setting up the ExtractingReqHandler? -Grant On Jan 3, 2011, at 4:44 PM, Olson, Ron wrote: > Hi all- > > After testing the PDF import functionality in my local copy of Solr 1.4.1 > with the included Jetty app server, I tried replicating it using my copy

Setting up Solr for PDFs on JBoss

2011-01-03 Thread Olson, Ron
Hi all- After testing the PDF import functionality in my local copy of Solr 1.4.1 with the included Jetty app server, I tried replicating it using my copy of Solr running in JBoss 5.10 (which uses Tomcat as its servlet container). When I try to add a PDF, I get an error buried in the stack trac

Re: Solr Cell - PDFs plus literal metadata - GET or POST ?

2010-01-06 Thread Ross
ap.content=attr_content&commit=true"; -F "myfi...@tutorial.html" -F "literal.mydata= > -Original Message- > From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] > Sent: Monday, January 04, 2010 4:28 AM > To: solr-user@lucene.apache.org > Subject

RE: Solr Cell - PDFs plus literal metadata - GET or POST ?

2010-01-05 Thread Giovanni Fernandez-Kincade
-Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: Monday, January 04, 2010 4:28 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cell - PDFs plus literal metadata - GET or POST ? On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote: > Hi all > >

Re: Solr Cell - PDFs plus literal metadata - GET or POST ?

2010-01-04 Thread Shalin Shekhar Mangar
On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote: > Hi all > > I'm experimenting with Solr. I've successfully indexed some PDFs and > all looks good but now I want to index some PDFs with metadata pulled > from another source. I see this example in the docs. > > curl

Solr Cell - PDFs plus literal metadata - GET or POST ?

2009-12-29 Thread Ross
Hi all I'm experimenting with Solr. I've successfully indexed some PDFs and all looks good but now I want to index some PDFs with metadata pulled from another source. I see this example in the docs. curl "http://localhost:8983/solr/update/extract?literal.id=doc4&captureAtt