Walter,
Well said. (And I love the hamburger conversion analogy - very apt.)
The only thing I will add is that when you have a collection of similar
rich text documents, you might be able to construct queries to respect
internal structures within the documents. If all/most of your documents
hav
You may try the Tesseract tool to check data extraction from PDFs or
images, and then proceed accordingly. As far as I understand, the PDF is
an image and not data. A searchable PDF actually overlays the selectable
text as hidden text over the PDF image. These PDFs can be indexed and
extracted
PDF is not a structured document format. It is a printer control format.
PDF does not have a paragraph marker. Instead, it says to move
to this spot on the page, choose this font, and print this letter. For a
paragraph, it moves farther. For the next letter in a word, it moves a
little bit. Extra
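
To illustrate why this matters for text extraction, here is a minimal sketch of how an extractor might have to guess word boundaries from glyph positions alone. The glyph tuples, the `join_glyphs` helper, and the `space_gap` threshold are all hypothetical; this is not how Tika or PDFBox is implemented, only a toy model of the idea described above.

```python
def join_glyphs(glyphs, space_gap=2.0):
    """Rebuild a line of text from positioned glyphs.

    glyphs: list of (x_start, x_end, char) tuples, sorted left to right.
    space_gap: hypothetical threshold; a horizontal gap wider than this
    is treated as a word break, since the page stores no space marker.
    """
    out = []
    prev_end = None
    for x_start, x_end, ch in glyphs:
        # A small move means the next letter of the same word;
        # a larger move is assumed to start a new word.
        if prev_end is not None and x_start - prev_end > space_gap:
            out.append(" ")
        out.append(ch)
        prev_end = x_end
    return "".join(out)


# Made-up positions: three letters close together, then a wider gap.
glyphs = [(0, 5, "P"), (5.5, 10, "D"), (10.5, 15, "F"), (20, 25, "i"), (25.5, 30, "s")]
text = join_glyphs(glyphs)
print(text)
```

Paragraph breaks would need a similar guess based on vertical movement, which is one reason extracted text rarely preserves document structure reliably.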
Solr will not do this automatically; the Extracting Request Handler
simply indexes the entire contents of the doc without regard to things
like paragraphs. Ditto with HTML. This is actually a task that
requires getting into Tika and using all the bells and whistles there.
I'd recommend two thi
Hello Team,
I am using Solr to index and search PDF documents.
I have gone through the documentation on your website and installed Solr, but
I am unable to index and search the documents.
For example, suppose we have a PDF file that has a number of paragraphs, each
with a separate heading.
So if I search for t
Hi Team,
I am indexing PDFs using Apache Solr 3.6. By passing around 3000
keywords using the OR operator (gardens OR flowers OR time OR train OR trees
OR etc.), I am able to get the files containing these keywords. But not every
PDF file will contain all the keywords; some may contai
On Apr 29, 2014 2:52 PM, "vignesh" wrote:
>
> Hi Team,
>
>
>
> I am indexing PDFs using Apache Solr 3.6. By passing around
3000 keywords using the OR operator, I am able to get the files containing
the keywords. Kindly guide me on how to get the keyword list in a PDF file.
What do you mean? Do
Your question is not terribly clear. Are you having trouble indexing PDFs
in general? Try the tutorial and specifically look for the extract handler.
Or have you already got the PDFs into the system, but your 3000-keyword query
does not match them? In which case it might be just that PDF extraction is limited by
d
Hi Team,
I am indexing PDFs using Apache Solr 3.6. By passing around 3000
keywords using the OR operator, I am able to get the files containing the
keywords. Kindly guide me on how to get the keyword list in a PDF file.
Note: in schema.xml I have declared a unique key "id".
Than
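
One possible reading of this request is: for each PDF, report which of the query keywords it actually contains. The sketch below assumes the text of each PDF has already been extracted and saved outside Solr. The helper names `or_query` and `keywords_in_text` are hypothetical, and the matching is plain lowercase word comparison, which will not match Solr's own analysis (stemming, stop words, and so on).

```python
import re


def or_query(keywords):
    """Build the OR query string described in the message."""
    return " OR ".join(keywords)


def keywords_in_text(text, keywords):
    """Return the sorted subset of keywords that occur as whole words in text.

    Assumption: simple lowercase word matching, not Solr's analyzers.
    """
    words = set(re.findall(r"\w+", text.lower()))
    return sorted(k for k in keywords if k.lower() in words)


keywords = ["gardens", "flowers", "time", "train", "trees"]
sample_text = "The train passed gardens full of flowers."  # stand-in for extracted PDF text

query = or_query(keywords)
found = keywords_in_text(sample_text, keywords)
print(query)
print(found)
```

Running the check per file after retrieving the matching documents gives a keyword list for each PDF without relying on any particular Solr feature.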
: Wednesday, April 2, 2014 3:35 PM
To: solr-user@lucene.apache.org
Subject: Re: PDF Indexing
Hi Sujatha,
There is no built-in mechanism. Prepare page documents outside of Solr.
http://searchhub.org/2012/02/14/indexing-with-solrj/
And you may want to save text content somewhere too. If you change
Hi Sujatha,
There is no built-in mechanism. Prepare page documents outside of Solr.
http://searchhub.org/2012/02/14/indexing-with-solrj/
And you may want to save the text content somewhere too. If you change something in
the index analysis or schema, you need to reindex. If you save the text data, you can
Hi,
I am able to use Tika and DIH to index a PDF as a single document. However,
I need each page to be a single document. Is there any built-in mechanism to
achieve this, or do I have to use PDFBox or another tool to achieve it?
Regards
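
The replies in this thread suggest preparing page documents outside of Solr. A minimal sketch of that idea follows, assuming the per-page text has already been extracted by some external tool (for example PDFBox). The `page_documents` helper and the field names `file_id`, `page`, and `content` are hypothetical and would need matching fields in your schema; only `id` is taken from the thread as the unique key.

```python
import json


def page_documents(file_id, page_texts):
    """Turn a list of per-page texts into one Solr document per page."""
    docs = []
    for number, text in enumerate(page_texts, start=1):
        docs.append({
            "id": f"{file_id}_p{number}",  # unique key: file name plus page number
            "file_id": file_id,            # hypothetical field: groups pages of one PDF
            "page": number,                # hypothetical field: 1-based page number
            "content": text,               # hypothetical field: extracted page text
        })
    return docs


# Stand-in for text extracted page by page from a PDF.
pages = ["First page text.", "Second page text."]
docs = page_documents("report.pdf", pages)

# Serialized form that could be sent to an update handler or kept on disk,
# so the text does not have to be re-extracted when reindexing.
payload = json.dumps(docs, indent=2)
print(payload)
```

Keeping the serialized page text on disk follows the advice above about saving the text content, since a schema change then only requires re-posting, not re-extracting.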
You should check the Apache PDFBox project. A similar question:
https://issues.apache.org/jira/browse/PDFBOX-940
2013/11/15 Marcello Lorenzi
Hi,
During our testing of Apache Solr 4.3, we noticed some errors
during PDF indexing:
ERROR - 2013-11-15 15:14:2
You should check the Apache PDFBox project. A similar question:
https://issues.apache.org/jira/browse/PDFBOX-940
2013/11/15 Marcello Lorenzi
> Hi,
> During our testing of Apache Solr 4.3, we noticed some errors
> during PDF indexing:
>
> ERROR - 2013-11-
Hi,
During our testing of Apache Solr 4.3, we noticed some errors
during PDF indexing:
ERROR - 2013-11-15 15:14:26.248;
org.apache.pdfbox.pdmodel.font.PDCIDFont; Error: Could not parse
predefined CMAP file for 'PDFXC30-Indentity0-UCS2'
ERROR - 2013-11-15 15
solr/ExtractingRequestHandler
>>
>> Again, DO NOT MIX the instructions from the two.
>>
>> post.jar is designed so that you do not need to know or care exactly how
>> rich document indexing works.
>>
>> -- Jack Krupansky
>>
>> -Original Message
nsky
>
> -Original Message- From: Furkan KAMACI
> Sent: Friday, April 26, 2013 5:30 AM
> To: solr-user@lucene.apache.org
> Subject: Document is missing mandatory uniqueKey field: id for Solr PDF
> indexing
>
>
> I use Solr 4.2.1 and these are my fiel
Krupansky
-Original Message-
From: Furkan KAMACI
Sent: Friday, April 26, 2013 5:30 AM
To: solr-user@lucene.apache.org
Subject: Document is missing mandatory uniqueKey field: id for Solr PDF
indexing
I use Solr 4.2.1 and these are my fields:
I run th
I think that I should start a new thread for my question, to help people who
search for the same situation.
2013/4/26 Furkan KAMACI
> If you can help me it would be nice. I get that error:
>
> SimplePostTool version 1.5
> Posting files to base url http://localhost:8983/solr/update/extract..
> Enter
It would be nice if you could help me. I get this error:
SimplePostTool version 1.5
Posting files to base url http://localhost:8983/solr/update/extract..
Entering auto mode. File endings considered are
xml,json,csv,pdf,doc,docx,ppt,pptx,xls,xlsx,odt,odp,ods,ott,otp,ots,rtf,htm,html,txt,log
POSTing f
http://wiki.apache.org/solr/post.jar
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
On 26 Apr 2013, at 13:28, Furkan KAMACI wrote:
> Hi Raymond;
>
> Now I get that error: SimplePostTool: WARNING: IOException while reading
> respons
Hi Raymond;
Now I get this error: SimplePostTool: WARNING: IOException while reading
response: java.io.FileNotFoundException:
2013/4/26 Raymond Wiker
> You could start by doing
>
> java -jar post.jar -help
>
> --- the 7th example shows exactly what you need to do to add a document id.
>
> On Fri, Ap
You could start by doing
java -jar post.jar -help
--- the 7th example shows exactly what you need to do to add a document id.
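
For the missing `id` problem in this thread, the usual approach with the ExtractingRequestHandler is to pass the unique key as a `literal.id` request parameter. The sketch below only builds the request URL and does not send anything. The base URL and the file name 523387.pdf are taken from the thread; the `extract_url` helper and the choice of using the file name without its extension as the id are assumptions made for illustration.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, commit=True):
    """Build an /update/extract URL that supplies the uniqueKey as literal.id."""
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return base_url + "?" + urlencode(params)


base = "http://localhost:8983/solr/update/extract"
# Assumption: use the PDF file name without its extension as the document id.
url = extract_url(base, "523387")
print(url)
```

The PDF itself would then be posted to this URL with a content type of application/pdf, either by post.jar or by any HTTP client.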
On Fri, Apr 26, 2013 at 11:30 AM, Furkan KAMACI wrote:
> I use Solr 4.2.1 and these are my fields:
>
> multiValued="false" />
>
>
>
>
> multiValued="true"/>
>
> stored=
I use Solr 4.2.1 and these are my fields:
I run this command:
java -Durl=http://localhost:8983/solr/update/extract -jar post.jar 523387.pdf
However, I get this error. Any ideas?
Apr 26, 2013 12:26:51 PM org.apache.solr.common.SolrException log
SEVERE: org.apache
ctingRequestHandler
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Tolga Sent: Monday, May 07, 2012 3:24 PM
>> To: solr-user@lucene.apache.org Subject: PDF indexing
>> Hi,
>>
>> From what I have read, I think I have to use Tika (?) to ind
On 05/07/2012 10:35 PM, Jack Krupansky wrote:
Try SolrCell (ExtractingRequestHandler).
See:
http://wiki.apache.org/solr/ExtractingRequestHandler
-- Jack Krupansky
-Original Message- From: Tolga Sent: Monday, May 07, 2012 3:24
PM To: solr-user@lucene.apache.org Subject: PDF indexing
Try SolrCell (ExtractingRequestHandler).
See:
http://wiki.apache.org/solr/ExtractingRequestHandler
-- Jack Krupansky
-Original Message-
From: Tolga
Sent: Monday, May 07, 2012 3:24 PM
To: solr-user@lucene.apache.org
Subject: PDF indexing
Hi,
From what I have read, I think I have
Hi,
From what I have read, I think I have to use Tika (?) to index PDF,
xls, doc, and similar files. How do I start? Do I run mvn clean install in the
source directory to get all the jar files to begin with? CentOS doesn't
provide mvn; how do I build Tika after getting it from
http://maven.apache.org ?
Good day,
I'm checking if Solr would work for indexing PDFs. My requirements are:
1) I must know which page has what contents.
2) Right-to-left search support, such as for Hebrew. This has been the
trickiest to achieve.
I also prefer to know the position of the searched contents on the page but
How long are the documents? Indexing a large document can be slow
(although 2 seconds is very slow indeed).
2011/6/22 Rode González (libnova) :
> Hi !
>
>
>
> We are using Zend Search, based on Lucene. Our queries against indexed PDFs
> take longer than 2 seconds.
>
> We want to change to solr to tr
o Iglesias; Leo; Marcos; Mario Crespo
> (Silvereme); 'Rode'
> Subject: response time for pdf indexing
>
> Hi !
>
>
>
> We are using Zend Search, based on Lucene. Our queries against indexed PDFs
> take longer than 2 seconds.
>
>
>
> We want to chan
Hi !
We are using Zend Search, based on Lucene. Our queries against indexed PDFs
take longer than 2 seconds.
We want to change to Solr to try to solve this problem.
i. Can anyone tell me the response time for queries on PDF documents in Solr?
ii. Can anyone tell me some strategies to reduce