December 2017 3:36 p.m.
To: solr-user@lucene.apache.org
Subject: Re: Alternatives to tika for extracting text out of PDFs
No need to prove it. More modern PDF formats are easier to decode, but for many
years the text was move-print-move-print, so the font metrics were necessary to
guess at spaces. Plus
No need to prove it. More modern PDF formats are easier to decode, but for many
years the text was move-print-move-print, so the font metrics were necessary to
guess at spaces. Plus, the glyph IDs had to be mapped to characters, so some
PDFs were effectively a substitution code. Our team joked
der it a single word. I'm not quite sure how to prove that, but
I'd be willing to make a bet ;)
Erick
On Thu, Dec 7, 2017 at 4:57 PM, Phil Scadden wrote:
> I am indexing PDFs and a separate process has converted any image PDFs to
> search PDF before solr gets near it. I notice
I am indexing PDFs and a separate process has converted any image PDFs to
search PDF before solr gets near it. I notice that tika is very slow at parsing
some PDFs. I don't need any metadata (which I suspect is slowing tika down),
just the text. Has anyone used an alternative PDF
Hi Alexandre,
Could you add fl=* to your query and check the output? Alternatively, have
a look at your schema file and check what could look like content field:
text or similar.
Dmitry
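A minimal sketch of the fl=* suggestion above, in Python. The host, port and
collection name ("mycollection") are assumptions for illustration and do not
come from the thread; the snippet only builds the query URL and does not
contact Solr.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def select_url(base, collection, query="*:*", fields="*"):
    """Build a Solr /select URL that asks for the given fields to be returned."""
    params = {"q": query, "fl": fields, "wt": "json", "indent": "true"}
    return f"{base}/solr/{collection}/select?{urlencode(params)}"

# Hypothetical host and collection name, for illustration only.
url = select_url("http://localhost:8983", "mycollection")
print(url)
```

Note that fl=* can only return fields the schema actually keeps for display,
so a content field that is indexed but not stored will still be missing from
the response.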
On 14 Sep 2016 1:27 AM, "Alexandre Martins" <
alexandremart...@gmail.com> wrote:
> Hi Guys,
>
The extracted content goes into text field which is not stored. You can
make it stored but the output will really not be pretty. PDF is not a
linear storage format.
Regards,
Alex
On 14 Sep 2016 5:16 AM, "Alexandre Martins"
wrote:
> Hi Guys,
>
> I'm trying to use the latest version of Solr and
Hi Guys,
I'm trying to use the latest version of Solr. I have used the post tool to
upload 28 PDF files and it works fine. However, I don't know how to show
the content of the files in the resulting JSON. Does anybody know how to include
this field?
"responseHeader":{ "zkConnected":true, "status":0, "
>is very easy to do without knowing it.
>
>> not actually having 'indexed="true"' set in your schema
>
>> not committing after inserting the doc
>
>Best,
>Erick
>
>On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <
>betsey.ben...@stresearch.com
on&indent=true
Regards
Srinivas Meenavalli
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, August 26, 2016 3:09 AM
To: solr-user
Subject: Re: Question about indexing PDFs
That is always a dangero
you are. This
is very easy to do without knowing it.
> not actually having 'indexed="true"' set in your schema
> not committing after inserting the doc
Best,
Erick
On Thu, Aug 25, 2016 at 11:19 AM, Betsey Benagh <
betsey.ben...@stresearch.com> wrote:
> It looks like the
Right, that's where I looked. No 'content'. Which is what confused me.
On 8/25/16, 1:56 PM, "Erick Erickson" wrote:
>when you say "I don't see it in the schema for that collection" are you
>talking schema.xml? managed_schema? Or actual documents in the index?
>Often
>these are defined by dyna
It looks like the metadata of the PDFs was indexed, but not the content
(which is what I was interested in). Searches on terms I know exist in
the content come up empty.
On 8/25/16, 2:16 PM, "Betsey Benagh" wrote:
>Right, that's where I looked. No 'content'. Which is wha
when you say "I don't see it in the schema for that collection" are you
talking schema.xml? managed_schema? Or actual documents in the index? Often
these are defined by dynamic fields and the like in the schema files.
Take a look at the admin UI>>schema browser>>drop down and you'll see all
the ac
Following the instructions in the quick start guide, I imported a bunch of PDF
documents into my Solr 6.0 instance. As far as I can tell from the
documentation, there should be a 'content' field indexing, well, the content,
but I don't see it in the schema for that collection. Is there somethi
> Siegfried Goeschl
>>>
>>> On 25 May 2014, at 10:06, Siegfried Goeschl wrote:
>>>
>>> Hi Brian,
>>>>
>>>> can you send me the email? I would like to play around :-)
>>>>
>>>> Have you opened a JIRA for PdfBox? If not
egfried Goeschl
On 25 May 2014, at 04:18, Brian McDowell wrote:
Our feeding (indexing) tool halts because Solr becomes unresponsive after
getting some really bad pdfs. There are levels of pdf "badness." Some just
will not parse and that's fine, but others are more problematic
>>
>> On 25 May 2014, at 04:18, Brian McDowell wrote:
>>
>>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>>> will not parse and that'
eproduce
> the issue …
>
> Thanks in advance
>
> Siegfried Goeschl
>
>
> On 25 May 2014, at 04:18, Brian McDowell wrote:
>
>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>> getting some really bad pdfs. There are levels of
use Solr becomes unresponsive after
> getting some really bad pdfs. There are levels of pdf "badness." Some just
> will not parse and that's fine, but others are more problematic in that our
> Operations team has to restart Solr because it just hangs and accepts no
> m
Our feeding (indexing) tool halts because Solr becomes unresponsive after
getting some really bad pdfs. There are levels of pdf "badness." Some just
will not parse and that's fine, but others are more problematic in that our
Operations team has to restart Solr because it just hangs
Subject: Re: pdfs
Hi folks,
for a small customer project I'm running SOLR with embedded Tika.
* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes
excessive CPU usage - requires SOLR restart but happens only once
within 40
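
One common way to contain a parser that loops forever is to run it in a
separate process with a time limit, so a bad PDF costs one killed child
instead of a Solr restart. The Python sketch below assumes extraction happens
outside Solr, which is not how the setup described above works; the two
commands are stand-ins, not a real extractor.

```python
import subprocess
import sys

def run_with_timeout(cmd, timeout_s):
    """Run cmd in a child process; return its stdout, or None on timeout or failure."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # subprocess.run kills the child before raising
    return done.stdout if done.returncode == 0 else None

# Stand-in commands: a real setup would invoke whatever extractor is in use.
ok = run_with_timeout([sys.executable, "-c", "print('extracted text')"], 10)
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1)
print(repr(ok), hung)
```

A file that times out can be logged and skipped, and the feeder keeps going.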
Krupansky
-Original Message- From: Brian McDowell
Sent: Thursday, May 22, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: pdfs
Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be manually restarted. We are
using Solr 4.4 a
cific symptom?
-- Jack Krupansky
-Original Message-
From: Brian McDowell
Sent: Thursday, May 22, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: pdfs
Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be man
Run Tika in a client instead? (Or as a standalone server listening over a
TCP socket.) Ship only the extractions to Solr. This is more efficient as
well.
I suspect there will always be PDFs that cause strange behaviour,
even if just based on memory requirements (e.g. embedded images). If
that becomes
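
The "extract outside Solr, ship only the text" idea can be sketched in Python
as two HTTP requests. Assumptions not taken from the thread: a standalone Tika
server at its usual localhost:9998 address, a Solr collection called
"mycollection", and a schema field named "text". Adjust all three to the real
setup.

```python
import json
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumed tika-server address
# Hypothetical collection name; commit=true makes the doc searchable at once.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update?commit=true"

def tika_request(pdf_bytes):
    """PUT the raw PDF to a standalone Tika server and ask for plain text back."""
    return urllib.request.Request(
        TIKA_URL, data=pdf_bytes, method="PUT",
        headers={"Accept": "text/plain", "Content-Type": "application/pdf"},
    )

def solr_request(doc_id, text):
    """POST only the extracted text to Solr as a JSON document."""
    body = json.dumps([{"id": doc_id, "text": text}]).encode("utf-8")
    return urllib.request.Request(
        SOLR_UPDATE_URL, data=body, method="POST",
        headers={"Content-Type": "application/json"},
    )

def index_pdf(path, timeout_s=60):
    """Extract with Tika, then send the text to Solr. Needs both servers running."""
    with open(path, "rb") as fh:
        pdf_bytes = fh.read()
    with urllib.request.urlopen(tika_request(pdf_bytes), timeout=timeout_s) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(solr_request(path, text), timeout=timeout_s) as resp:
        return resp.status
```

With this split, a PDF that crashes or stalls the extractor affects only the
Tika process, and Solr only ever sees plain text.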
Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be manually restarted. We are
using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
problem because the release notes associated with the new tika version and
ubject: Re: Indexing scanned PDFs
Nothing I am aware of for Solr directly. You may have better luck
chasing this at TIKA mailing list, as that's what Solr uses under
covers to index PDF otherwise. Doing a quick search for Tika and OCR
brings up a number of links.
Regards,
Alex.
Persona
s.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
On Tue, May 6, 2014 at 12:15 PM, Chandan Tamrakar
wrote:
> we are using Solr to index PDF documents, but there are cases where the PDFs
> are scanned documents with no text to extract and index.
>
we are using Solr to index PDF documents, but there are cases where the PDFs
are scanned documents with no text to extract and index.
Is there a plugin or module in Solr that we can integrate so that it would
actually extract the text via OCR and then index it?
Thanks in advance
Chandan
uid
Greetings
Francesco
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Dienstag, 11. März 2014 12:46
To: solr-user@lucene.apache.org
Subject: Re: Many PDFs indexed but only one returned in the Solr-UI
Hmmm, that looks OK to me. I'd log o
.
doc.add("id", filename + timestamp) just to
have something that changes every run.
Best
Erick
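
The advice to make the id change per document suggests one possible cause:
every document being added with the same unique key, so each add replaces the
previous one. That diagnosis is an assumption, since the message is truncated.
The Python sketch below only mimics the overwrite behaviour with a dict and
does not talk to Solr.

```python
import time

def index_docs(index, filenames, make_id):
    """Mimic Solr's uniqueKey behaviour: a doc with an existing id replaces the old one."""
    for name in filenames:
        doc = {"id": make_id(name), "filename": name}
        index[doc["id"]] = doc
    return index

files = ["a.pdf", "b.pdf", "c.pdf"]
stamp = str(int(time.time()))

same_id = index_docs({}, files, lambda name: "doc1")          # constant id
unique_id = index_docs({}, files, lambda name: name + stamp)  # filename + timestamp

print(len(same_id), len(unique_id))  # prints "1 3"
```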
On Tue, Mar 11, 2014 at 6:00 AM, Croci Francesco Luigi (ID SWS)
wrote:
> I followed the example here
> (http://searchhub.org/2012/02/14/indexing-with-solrj/) for indexing all the
>
I followed the example here
(http://searchhub.org/2012/02/14/indexing-with-solrj/) for indexing all the
pdfs in a directory. The process seems to work well, but at the end, when I go
to the Solr-UI and click on "Execute query" (with q=*:*), I get only one entry.
Am I missing something
It seems that your solrconfig.xml cannot find the libraries. Here is an
example path from solrconfig.xml:
2013/4/29 Krishna Venkateswaran
> Hi
>
> I have installed Solr over Apache Tomcat.
> I have used Apache Tomcat v6.x for Solr to work.
>
> When trying to upload a file using SolrJ to index it
Hi
I have installed Solr over Apache Tomcat.
I have used Apache Tomcat v6.x for Solr to work.
When trying to upload a file using SolrJ to index it into Solr, I am
getting an exception as follows:
Server at http://localhost:8080/solr-example returned non ok status:500,
message:Internal Server Err
pl.execute(StatementImpl.java:841)
at com.mysql.jdbc.StatementImpl.execute(StatementImpl.java:681)
at
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.(JdbcDataSource.java:246)
... 13 more
Is it possible to index pdfs, docs, rtf along with database and havin
Hi Andras,
I added metadata_ so all PDF metadata fields
> should be saved in solr as "metadata_something" fields.
>
The problem is that the "Category" metadata field from the PDF for some
> reason is not prefixed with "metadata_" and
>
solr will merge the "Category" field I have in the schema with
Hi,
I think this is a bug but before reporting to issue tracker I
thought I will ask it here first.
So the problem is I have a PDF file which among other metadata fields
like Author, CreatedDate etc. has a metadata
field Category (I can see all metadata fields with tika-app.jar started
in
8:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Setting up Solr for PDFs on JBoss
What does your solrconfig.xml look like for setting up the ExtractingRequestHandler?
-Grant
On Jan 3, 2011, at 4:44 PM, Olson, Ron wrote:
> Hi all-
>
> After testing the PDF import functionality in my
In JBoss, duplicate libraries will be ignored as you mentioned. You can
start by finding the libraries used in JBoss with "find . -name '*.jar'". I don't know
of any resource other than the wiki. It says to remove the libraries below.
- xercesImpl-2.8.1.jar
- xml-apis-1.3.03.jar
http://wiki.apache.org/solr/Sol
What does your solrconfig.xml look like for setting up the ExtractingRequestHandler?
-Grant
On Jan 3, 2011, at 4:44 PM, Olson, Ron wrote:
> Hi all-
>
> After testing the PDF import functionality in my local copy of Solr 1.4.1
> with the included Jetty app server, I tried replicating it using my copy
Hi all-
After testing the PDF import functionality in my local copy of Solr 1.4.1 with
the included Jetty app server, I tried replicating it using my copy of Solr
running in JBoss 5.10 (which uses Tomcat as its servlet container). When I try
to add a PDF, I get an error buried in the stack trac
ap.content=attr_content&commit=true";
-F "myfi...@tutorial.html" -F "literal.mydata=
> -Original Message-
> From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
> Sent: Monday, January 04, 2010 4:28 AM
> To: solr-user@lucene.apache.org
> Subject
-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
Sent: Monday, January 04, 2010 4:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr Cell - PDFs plus literal metadata - GET or POST ?
On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote:
> Hi all
>
>
On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote:
> Hi all
>
> I'm experimenting with Solr. I've successfully indexed some PDFs and
> all looks good but now I want to index some PDFs with metadata pulled
> from another source. I see this example in the docs.
>
> curl
Hi all
I'm experimenting with Solr. I've successfully indexed some PDFs and
all looks good but now I want to index some PDFs with metadata pulled
from another source. I see this example in the docs.
curl
"http://localhost:8983/solr/update/extract?literal.id=doc4&captureAtt