Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Erick... On Sun, Jun 7, 2020 at 1:50 PM Erick Erickson wrote: > https://lucidworks.com/post/indexing-with-solrj/ > > > > On Jun 7, 2020, at 3:22 PM, Fiz N wrote: > > > > Thanks Jorn and Erick. > > > > Hi Erick, looks like the skeletal SOLRJ program attachment is missing. > > > > Thanks >

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
https://lucidworks.com/post/indexing-with-solrj/ > On Jun 7, 2020, at 3:22 PM, Fiz N wrote: > > Thanks Jorn and Erick. > > Hi Erick, looks like the skeletal SOLRJ program attachment is missing. > > Thanks > Fiz > > On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson > wrote: > >> Here’s a skele

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Jorn and Erick. Hi Erick, looks like the skeletal SOLRJ program attachment is missing. Thanks Fiz On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson wrote: > Here’s a skeletal SolrJ program using Tika as another alternative. > > Best, > Erick > > > On Jun 7, 2020, at 2:06 PM, Jörn Franke w

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
Here’s a skeletal SolrJ program using Tika as another alternative. Best, Erick > On Jun 7, 2020, at 2:06 PM, Jörn Franke wrote: > > You have to write an external application that creates multiple threads, > parses the PDFs and index them in Solr. Ideally you parse the PDFs once and > store th

Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Jörn Franke
You have to write an external application that creates multiple threads, parses the PDFs and indexes them in Solr. Ideally you parse the PDFs once and store the resulting text on some file system and then index it. The reason is that if you upgrade across two major versions of Solr you might need to reind
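The "parse once, store the text, index from the stored text" idea above can be sketched roughly as follows. This is only a minimal sketch under stated assumptions: `extract` is a hypothetical callable standing in for whatever parser you use (Tika or otherwise), the field names `id` and `content` are illustrative, and the actual upload to Solr is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def cached_text(pdf_path, cache_dir, extract):
    """Return the text for pdf_path, parsing the PDF only if no cached copy exists."""
    # Assumption: file names are unique; same-named PDFs in different
    # directories would share one cache file in this simple layout.
    cache_file = Path(cache_dir) / (Path(pdf_path).name + ".txt")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(pdf_path)  # hypothetical extractor, e.g. a Tika wrapper
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text


def build_docs(pdf_paths, cache_dir, extract, workers=4):
    """Extract (or load cached) text for many PDFs in parallel threads."""
    def one(path):
        # "id" and "content" are assumed field names, not taken from the thread.
        return {"id": str(path), "content": cached_text(path, cache_dir, extract)}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(one, pdf_paths))
```

With the text cached on disk, a later reindex can call `build_docs` again without touching the PDF parser at all.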

Re: Indexing PDF files in SqlBase database

2019-04-03 Thread Arunas Spurga
Yes, I know the reasons to put this work on a client rather than use Solr directly, and that may be my next task. But first I need to finish my current task: indexing PDF files stored in a SqlBase database. The PDF files are pretty simple, sometimes only a few dozen lines of text. Regards, Aruna On Wed

Re: Indexing PDF files in SqlBase database

2019-04-03 Thread Erick Erickson
For a lot of reasons, I greatly prefer to put this work on a client rather than use Solr directly. Here’s a place to get started, it connects to a DB and also scans local file directory for docs to push through (local) Tika and index. So you should be able to modify it relatively easily to get t

RE: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Phil Scadden
mmit(solr, "prindex"); return true; -Original Message- From: Erick Erickson Sent: Wednesday, 31 October 2018 06:00 To: solr-user Subject: Re: Indexing PDF file in Apache SOLR via Apache TIKA All of the above work, but for robust production situations you'll wa

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread ☼ R Nair
I have done a production implementation of this, running for the last four months without any issue. Just a restart of all components every week. http://blog.cloudera.com/blog/2015/10/how-to-index-scanned-pdfs-at-scale-using-fewer-than-50-lines-of-code/ Best, Ravion On Tue, Oct 30, 2018, 1:00 PM Er

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Erick Erickson
All of the above work, but for robust production situations you'll want to consider a SolrJ client, see: https://lucidworks.com/2012/02/14/indexing-with-solrj/. That blog combines indexing from a DB and using Tika, but those are independent. Best, Erick On Tue, Oct 30, 2018 at 12:21 AM Kamuela Lau

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Kamuela Lau
Hi there, Here are a couple of ways I'm aware of: 1. Extract-handler / post tool You can use the curl command with the extract handler or bin/post to upload a single document. Reference: https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html 2. DataImportHa

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.
>http - however, the big advantage of doing your indexing on different machine >is that the heavy lifting that tika does in extracting text from documents, >finding metadata etc is not happening on the server. If the indexer crashes, >it doesn’t affect Solr either. +1 for what can go wrong:

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Phil Scadden
: ZiYuan [mailto:ziyu...@gmail.com] Sent: Tuesday, 20 June 2017 11:29 p.m. To: solr-user@lucene.apache.org Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context Dear Erick and Timothy, I also took a look at the Python clients (say, SolrClient and

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread Allison, Timothy B.
Yeah, Chris knows a thing or two about Tika. :) -Original Message- From: ZiYuan [mailto:ziyu...@gmail.com] Sent: Tuesday, June 20, 2017 8:00 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context No

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread ZiYuan
No intention of spamming but I also want to mention tika-python in the toolchain. Ziyuan On Tue, Jun 20, 2017 at 2:29 PM, ZiYuan wrote: > Dear Erick and Timothy, > > I also took a look at the Python clients (say, SolrClient and pysolr) > because Py

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-20 Thread ZiYuan
Dear Erick and Timothy, I also took a look at the Python clients (say, SolrClient and pysolr) because Python is my main programming language. I have an impression that 1. they send HTTP requests to the server according to the server APIs; 2. they are not official and thus possibly not up to date.

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Dear Erick and Timothy, yes I will parse from the client for all the benefits. I am just trying to figure out what is going on by indexing one or two PDF files first. Thank you both. Best regards, Ziyuan On Mon, Jun 19, 2017 at 6:17 PM, Erick Erickson wrote: > bq: Hope that there is no side ef

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Erick Erickson
bq: Hope that there is no side effect of not mapping the PDF Well, yes it will have that side effect. You can cure that with a copyField directive from content to _text_. But do really consider running this as a SolrJ program on the client. Tim knows in far more painful detail than I do what kind
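For the copyField from content to _text_ mentioned above, one possible route is the Schema API. The sketch below only builds the JSON command; it assumes the collection uses a managed schema, and where the payload gets POSTed (the collection's `/schema` endpoint) depends on your setup. With a classic schema.xml the equivalent would be a copyField element with source "content" and dest "_text_".

```python
import json


def copy_field_command(source, dest):
    """Build a Schema API 'add-copy-field' command as a JSON string."""
    return json.dumps({"add-copy-field": {"source": source, "dest": dest}})


# Field names come from the thread: copy "content" into the catch-all "_text_".
payload = copy_field_command("content", "_text_")
print(payload)
```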

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Hi Erick, Now it is clear. I have to update the request handler of /update/extract/ from "defaults":{"fmap.content":"_text_"} to "defaults":{"fmap.content":"content"} to fill the field. Hope that there is no side effect of not mapping the PDF content to _text_. Thank you for the hint. Best regar

RE: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Allison, Timothy B.
file. Finally, and I mean it this time, I heartily second Erik's point about SolrJ and the need to keep your file processing outside of Solr's JVM, VM and M! -Original Message- From: Erik Hatcher [mailto:erik.hatc...@gmail.com] Sent: Monday, June 19, 2017 6:56 AM To: solr-us

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread Erik Hatcher
Ziyuan - You may be interested in the example/files that ships with Solr too. It’s got schema and config and even UI for file indexing and searching. Check out README.txt under example/files in your Solr install. Erik > On Jun 19, 2017, at 6:52 AM, ZiYuan wrote: > > Hi Erick, >

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-19 Thread ZiYuan
Hi Erick, thanks very much for the explanations! Clarification for question 2: more specifically I cannot see the field content in the returned JSON, with the same definitions as in the post

Re: Indexing PDF files with Solr 6.6 while allowing highlighting matched text with context

2017-06-18 Thread Erick Erickson
1> Yes, you can use your single definition. The author identifies the "text" field as a catch-all. Somewhere in the schema there'll be a copyField directive copying (perhaps) many different fields to the "text" field. That permits simple searches against a single field rather than, say, using edism

Re: indexing pdf files using post tool

2016-03-19 Thread Francisco Andrés Fernández
Vidya, I don't know if I'm understanding it very well but, I think that the best way is to parse your text using a routine outside Solr. You might need to map the different parts of your document using your domain knowledge and use such routine to produce an XML document for example, with correspon

Re: indexing pdf files using post tool

2016-03-19 Thread Binoy Dalal
Take a look at the CloneFieldUpdateProcessorFactory here: http://www.solr-start.com/info/update-request-processors/ On Wed, 16 Mar 2016, 18:25 Binoy Dalal, wrote: > Like Francisco said, use a custom update processor to map the fields the > way you want and add it to your update chain. > > On Wed

Re: indexing pdf files using post tool

2016-03-18 Thread Binoy Dalal
Like Francisco said, use a custom update processor to map the fields the way you want and add it to your update chain. On Wed, 16 Mar 2016, 18:16 Francisco Andrés Fernández, wrote: > Vidya, I don't know if I'm understanding it very well but, I think that the > best way is to parse your text usin

Re: indexing pdf files using post tool

2016-03-18 Thread Jan Høydahl
Hi You can look at the Apache Tika project or the PDFBox project to parse your files before sending to Solr. Alternatively, if your processing is very simple, you can use the built-in Tika as you just did, and then deploy some UpdateRequestProcessors in order to modify the Tika output into whate

Re: indexing pdf files using post tool

2016-03-16 Thread vidya
Sorry for conveying it the wrong way. I want the data from one PDF file to be indexed into different fields of a Solr document according to its content, such as name, id, title, content, etc. Thanks

Re: indexing pdf files using post tool

2016-03-15 Thread roshan agarwal
Yes vidya, you just have to use copy field Roshan On Tue, Mar 15, 2016 at 3:07 PM, vidya wrote: > Hi > I got data into my content field. But I wanted to have different fields to > be > allocated for data in my file. How can I achieve this? > > > > -- > View this message in context: > http://luce

Re: indexing pdf files using post tool

2016-03-15 Thread Binoy Dalal
You should use copy fields. https://cwiki.apache.org/confluence/display/solr/Copying+Fields On Tue, 15 Mar 2016, 15:07 vidya, wrote: > Hi > I got data into my content field. But I wanted to have different fields to > be > allocated for data in my file. How can I achieve this? > > > > -- > View th

Re: indexing pdf files using post tool

2016-03-15 Thread vidya
Hi I got data into my content field. But I wanted to have different fields to be allocated for data in my file. How can I achieve this?

Re: indexing pdf files using post tool

2016-03-15 Thread Binoy Dalal
Do you have a "content" field defined in your schema? Is it stored? By default, the content from the docs uploaded through post should be mapped to a field called "content". On Tue, 15 Mar 2016, 12:47 vidya, wrote: > Hi > I am trying to index a pdf file by using post tool in my linux system,Whe

Re: indexing pdf binary stored in mongodb?

2016-02-05 Thread Jack Krupansky
See if they are stored in BSON format using GridFS. If so, you can simply use the mongofiles command to retrieve the PDF into a local file and index that in Solr either using Solr Cell or Tika. See: http://blog.mongodb.org/post/183689081/storing-large-objects-and-files-in-mongodb https://docs.mong

Re: Indexing PDF and MS Office files

2015-04-16 Thread Walter Underwood
Turning PDF back into a structured document is like trying to turn hamburger back into a cow. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Apr 16, 2015, at 4:55 AM, Allison, Timothy B. wrote: > +1 > > :) > >> PS: one more thing - please, tell

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]
@lucene.apache.org Subject: RE: Indexing PDF and MS Office files +1 :) >PS: one more thing - please, tell your management that you will never >ever successfully all real-world PDFs and cater for that fact in your >requirements :-)

RE: Indexing PDF and MS Office files

2015-04-16 Thread Davis, Daniel (NIH/NLM) [C]
and httpd, at least to me. -Original Message- From: Siegfried Goeschl [mailto:sgoes...@gmx.at] Sent: Thursday, April 16, 2015 7:53 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Hi Vijay, I know this road too well :-) For PDF you can fall back to

Re: Indexing PDF and MS Office files

2015-04-16 Thread Charlie Hull
On 16/04/2015 12:53, Siegfried Goeschl wrote: Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction * ps2ascii.ps * XPDF's pdftotext CLI utility (more comfortable than Ghostscript) * some other tools exist as well (pdflib) Here's some file e

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
> > >> > Cheers, >> > >> > Tim >> > >> > [1] >> > >> http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf >> > [2] >> > >> http://events.linuxfoundation.org/sites/events/files/slides/Tik

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
on some files, but we are always > working to improve it. > > Best, > > Tim > > -Original Message- > From: Vijaya Narayana Reddy Bhoomi Reddy [mailto: > vijaya.bhoomire...@whishworks.com] > Sent: Thursday, April 16, 2015 7:44 AM > To: solr-user

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
+1 :) >PS: one more thing - please, tell your management that you will never >ever successfully all real-world PDFs and cater for that fact in your >requirements :-)

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
Sent: Thursday, April 16, 2015 7:44 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Thanks Allison. I tried with the mentioned changes. But still no luck. I am using the code from lucidworks site provided by Erick and now included the changes mentioned by you

Re: Indexing PDF and MS Office files

2015-04-16 Thread Siegfried Goeschl
Hi Vijay, I know this road too well :-) For PDF you can fall back to other tools for text extraction * ps2ascii.ps * XPDF's pdftotext CLI utility (more comfortable than Ghostscript) * some other tools exist as well (pdflib) If you start command line tools from your JVM please have a look a
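The fallback to pdftotext described above might look something like this from a client program. It is a hedged sketch, not anything posted in the thread: `primary` is a hypothetical callable for your main extractor (for example a Tika wrapper), and the timeout is my own assumption about what is sensible when launching external tools.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Command line asking pdftotext to write UTF-8 text to stdout ('-')."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_with_fallback(pdf_path, primary, timeout=60):
    """Try the primary extractor; on any failure fall back to the pdftotext CLI."""
    try:
        return primary(pdf_path)
    except Exception:
        # The timeout keeps one bad PDF from hanging the whole indexing run.
        result = subprocess.run(
            pdftotext_command(pdf_path),
            capture_output=True,
            timeout=timeout,
            check=True,
        )
        return result.stdout.decode("utf-8", errors="replace")
```

If pdftotext also fails, `subprocess.run` raises, so the caller can log the file and move on rather than index an empty document.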

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
TikaEval_ACNA15_allison_herceg_v2.pdf > -Original Message- > From: Vijaya Narayana Reddy Bhoomi Reddy [mailto: > vijaya.bhoomire...@whishworks.com] > Sent: Thursday, April 16, 2015 7:10 AM > To: solr-user@lucene.apache.org > Subject: Re: Indexing PDF and MS Office files >

RE: Indexing PDF and MS Office files

2015-04-16 Thread Allison, Timothy B.
arayana Reddy Bhoomi Reddy [mailto:vijaya.bhoomire...@whishworks.com] Sent: Thursday, April 16, 2015 7:10 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF and MS Office files Erick, I tried indexing both ways - SolrJ / Tika's AutoParser and as well as SolrCell's ExtractReque

Re: Indexing PDF and MS Office files

2015-04-16 Thread Vijaya Narayana Reddy Bhoomi Reddy
Erick, I tried indexing both ways - SolrJ / Tika's AutoParser as well as SolrCell's ExtractRequestHandler. The majority of the PDF and Word documents are getting parsed properly and indexed into Solr. However, a minority of them keep failing with either a PDFParser or an OfficeParser error. Not sure if thi

Re: Indexing PDF and MS Office files

2015-04-15 Thread Erick Erickson
There's quite a discussion here: https://issues.apache.org/jira/browse/SOLR-7137 But, I personally am not a huge fan of pushing all the work on to Solr, in a production environment the Solr server is responsible for indexing, parsing the docs through Tika, perhaps searching etc. This doesn't scale

Re: Indexing PDF and MS Office files

2015-04-15 Thread Vijaya Narayana Reddy Bhoomi Reddy
Thanks everyone for the responses. Now I am able to index PDF documents successfully. I have implemented manual extraction using Tika's AutoParser and PDF functionality is working fine. However, the error with some MS Office Word documents still persists. The error message is "java.lang.IllegalArg

Re: Indexing PDF and MS Office files

2015-04-14 Thread Shyam R
Vijay, You could try different excel files with different formats to rule out the issue is with TIKA version being used. Thanks Murthy On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes wrote: > Perhaps the PDF is protected and the content can not be extracted? > > i have an unverified suspicion th

Re: Indexing PDF and MS Office files

2015-04-14 Thread Terry Rhodes
Perhaps the PDF is protected and the content can not be extracted? i have an unverified suspicion that the tika shipped with solr 4.10.2 may not support some/all office 2013 document formats. On 4/14/2015 8:18 PM, Jack Krupansky wrote: Try doing a manual extraction request directly to Solr

Re: Indexing PDF and MS Office files

2015-04-14 Thread Jack Krupansky
Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted. See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika Also, some PDF files actually have the content a
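A request URL for the extractOnly check suggested above could be built along these lines. This is a sketch only: the base URL and the core name `collection1` are assumptions about a default local install, and the file itself would still have to be sent as the request body (for example with curl's -F option).

```python
from urllib.parse import urlencode


def extract_only_url(base_url, core, extract_format="text"):
    """Build an /update/extract URL that returns the extracted content without indexing it."""
    params = {"extractOnly": "true", "extractFormat": extract_format}
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


# Assumed local install and core name; adjust to your own setup.
print(extract_only_url("http://localhost:8983/solr", "collection1"))
```

If the response contains no body text for a given PDF, the problem is on the extraction side rather than in the schema or field mapping.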

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini
It seems something like https://issues.apache.org/jira/browse/TIKA-1251. I see you're using Solr 4.10.2 which uses Tika 1.5 and that issue seems to be fixed in Tika 1.6. I agree with Erik: you should try with another version of Tika. Best, Andrea On 04/14/2015 06:44 PM, Vijaya Narayana Reddy

Re: Indexing PDF and MS Office files

2015-04-14 Thread Erick Erickson
looks like this is just a file that Tika can't handle, based on this line: bq: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser You might be able to get some joy from parsing this from Java and see if a more recent Tika would

Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Andrea, Yes, I am using the stock schema.xml that comes with the example server of Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and put into the content field in the index. Please find the log information for the Parsing error below. org.apache.solr.common.SolrExcepti

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini
Hi, solrconfig.xml (especially if you didn't touch it) should be good. What about the schema? Are you using the one that comes with the download bundle, too? I don't see the stacktrace..did you forget to paste it? Best, Andrea On 04/14/2015 06:06 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Re: Indexing PDF and MS Office files

2015-04-14 Thread Vijaya Narayana Reddy Bhoomi Reddy
Hi, Here are the solr-config xml and the error log from Solr logs for your reference. As mentioned earlier, I didn't make any changes to the solr-config.xml, as I am using the out-of-the-box file that came with the default installation. Please let me know your thoughts on why these issues a

Re: Indexing PDF and MS Office files

2015-04-14 Thread Andrea Gazzarini
Hi Vijay, Please paste an extract of your schema, where the "content" field (the field where the PDF text should be) and its type are declared. For the other issue, please paste the whole stacktrace because org.apache.tika.parser.microsoft.OfficeParser* says nothing. The complete stacktrace (o

Re: Indexing PDF in Apache Solr 4.8.0 - Problem.

2014-05-12 Thread Siegfried Goeschl
Hi Vignesh, can you check your SOLR Server Log?! Not all PDF documents on this planet can be processed using Tika :-) Cheers, Siegfried Goeschl On 07 May 2014, at 09:40, vignesh wrote: > Dear Team, > > I am Vignesh using the latest version 4.8.0 Apache Solr and am > Indexing my

Re: Indexing pdf files - question.

2013-09-08 Thread Nutan Shinde
Error got resolved,solution was must be within tag. On Sun, Sep 8, 2013 at 3:31 AM, Furkan KAMACI wrote: > Could you show us logs you get when you start your web container? > > > 2013/9/4 Nutan Shinde > > > My solrconfig.xml is: > > > > > > > > > class="solr.extraction.ExtractingRequestHandl

Re: Indexing pdf files - question.

2013-09-07 Thread Furkan KAMACI
Could you show us logs you get when you start your web container? 2013/9/4 Nutan Shinde > My solrconfig.xml is: > > > > class="solr.extraction.ExtractingRequestHandler" > > > > > descwhich > is defined as shown below in schem.xml--> > > true > > attr_ > > true > > > > > > > > > > Schem

Re: Indexing pdf files - question.

2013-09-04 Thread Nutan Shinde
My solrconfig.xml is: desc true attr_ true Schema.xml: doc_id I have created extract directory and copied all required .jar and solr-cell jar files into this extract directory and given its path in lib tag in solrconfig.xml When I try

Re: Indexing PDF Files

2013-04-24 Thread Jack Krupansky
or these "lib" elements ("INFO org.apache.solr.core.SolrConfig – Adding specified lib dirs to ClassLoader"). -- Jack Krupansky -Original Message- From: Furkan KAMACI Sent: Wednesday, April 24, 2013 6:50 PM To: solr-user@lucene.apache.org Subject: Re: Indexing PD

Re: Indexing PDF Files

2013-04-24 Thread Jan Høydahl
In your schema you have written > class="solr.StrField" /> Note that XML tag and param names are case sensitive, so instead of fieldtype you should use fieldType I see that you have the same error for several fieldTypes in your schema, probably resulting in other similar errors too. -- Jan H

Re: Indexing PDF Files

2013-04-24 Thread Furkan KAMACI
Hi Alex; What do you mean with wrong case. Could you tell me what should I do? 2013/4/25 Alexandre Rafalovitch > You still seem to have 'fieldtype' with wrong case. Can you try that > simple thing before doing other complicated steps? And yes, restart > Solr after you change schema.xml > > Regar

Re: Indexing PDF Files

2013-04-24 Thread Alexandre Rafalovitch
You still seem to have 'fieldtype' with wrong case. Can you try that simple thing before doing other complicated steps? And yes, restart Solr after you change schema.xml Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time

Re: Indexing PDF Files

2013-04-24 Thread Furkan KAMACI
Here is my definition for handler: text true attr_ true 2013/4/25 Furkan KAMACI > I just want to search on rich documents but I still get same error. I have > copied example folder into anywhere else at my computer. I have copied dist > and contrib folders from my build folder into that

Re: Indexing PDF Files

2013-04-24 Thread Furkan KAMACI
I just want to search on rich documents but I still get the same error. I have copied the example folder somewhere else on my computer. I have copied the dist and contrib folders from my build folder into that copy of the example folder (because solr-cell etc. are within those folders). However I still get same

Re: Indexing PDF Files

2013-04-24 Thread Erik Hatcher
Did you restart after adding those fields and types? On Apr 24, 2013, at 16:59, Furkan KAMACI wrote: > I have added that fields: > > > stored="true" multiValued="true"/> > > > and I have that definition: > > class="solr.StrField" /> > > here is my error: > > > > > 400 > 4154 > > >

Re: Indexing PDF Files

2013-04-24 Thread Alexandre Rafalovitch
Wrong case for ? Though I would have thought Solr would complain about that when it hits dynamicField with unknown type. Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events fro

Re: Indexing PDF Files

2013-04-24 Thread Furkan KAMACI
I have added that fields: and I have that definition: here is my error: 400 4154 ERROR: [doc=1] unknown field 'ignored_meta' 400 What should I do more? 2013/4/24 Erik Hatcher > Also, at Solr startup time it logs what it loads from those > elements, so you can see whether it is

Re: Indexing PDF Files

2013-04-24 Thread Erik Hatcher
Also, at Solr startup time it logs what it loads from those elements, so you can see whether it is loading the files you intend to or not. Erik On Apr 24, 2013, at 10:05 , Alexandre Rafalovitch wrote: > Have you tried using absolute path to the relevant urls? That will > cleanly split

Re: Indexing PDF Files

2013-04-24 Thread Alexandre Rafalovitch
Have you tried using absolute path to the relevant urls? That will cleanly split the problem into 'still not working' and 'wrong relative path'. Regards, Alex. On Wed, Apr 24, 2013 at 9:02 AM, Furkan KAMACI wrote: > > Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.li

Re: Indexing PDF-Files using Solr Cell

2012-09-17 Thread Jack Krupansky
ber 17, 2012 1:12 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF-Files using Solr Cell Thank you for your response. I'm writing my Bachelor-Thesis about Solr and my company doesn't want me to use a beta-version. I dont want to be annoying, but "how" do i direct the

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Alexander Troost
ng. > > Again, this is all simplified in Solr 4.0-BETA. > > > -- Jack Krupansky > > -Original Message- From: Alexander Troost > Sent: Sunday, September 16, 2012 11:59 PM > To: solr-user@lucene.apache.org > Subject: Re: Indexing PDF-Files using Solr Cell > >

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Jack Krupansky
n Solr 4.0-BETA. -- Jack Krupansky -Original Message- From: Alexander Troost Sent: Sunday, September 16, 2012 11:59 PM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF-Files using Solr Cell Hi, first of all: Thank you for that quick response! But i am not sure if i am doing this r

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Alexander Troost
Hi, first of all: Thank you for that quick response! But I am not sure if I am doing this right. From my point of view the command now has to look like: curl "http://localhost:8983/solr/update/extract?literal.id=doc11&literal.filename=markus&fmap.content=text&commit=true" -F "myfile=@markus.pdf

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Jack Krupansky
The content will be sent to the "content" field, which you can redirect using the &fmap.content=some-field request parameter. You need to explicitly set the file name field yourself, using the &literal.your-file-name-field=file-name request parameter. Also, if using Solr 4.0-BETA, you can simp
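To make the parameter names above concrete, here is a small sketch that assembles such a request URL. The helper name and the example values (`doc11`, `text`, `filename`) are illustrative assumptions; only `fmap.content`, `literal.<field>` and `commit` are the request parameters being discussed.

```python
from urllib.parse import urlencode


def extract_url(base_url, doc_id, content_field, literals=None, commit=True):
    """Build an /update/extract URL with a content-field mapping and literal field values."""
    params = {"literal.id": doc_id, "fmap.content": content_field}
    # Each extra literal becomes a literal.<field>=<value> request parameter.
    for field, value in (literals or {}).items():
        params[f"literal.{field}"] = value
    if commit:
        params["commit"] = "true"
    return f"{base_url}/update/extract?{urlencode(params)}"


url = extract_url("http://localhost:8983/solr", "doc11", "text", {"filename": "markus.pdf"})
print(url)
```

Building the query string with `urlencode` also takes care of escaping file names that contain spaces or other special characters.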

Re: Indexing PDF

2011-10-05 Thread Héctor Trujillo
I've uploaded the file here: http://www.filesonic.com/file/2342166624/Starting_a_Search_Application.pdf try this, thanks 2011/10/5 Michael McCandless > Hmm, no attachment; maybe it's too large? > > Can you send it directly to me? > > Mike McCandless > > http://blog.mikemccandless.com > > 2011/1

Re: Indexing PDF

2011-10-05 Thread Michael McCandless
Hmm, no attachment; maybe it's too large? Can you send it directly to me? Mike McCandless http://blog.mikemccandless.com 2011/10/5 Héctor Trujillo : > This is the file that give me errors. > > 2011/10/5 Michael McCandless >> >> Can you attach this PDF to an email & send to the list?  Or is it

Re: Indexing PDF

2011-10-05 Thread Michael McCandless
Can you attach this PDF to an email & send to the list? Or is it too large for that? Or, you can try running Tika directly on the PDF to see if it's able to extract the text. Mike McCandless http://blog.mikemccandless.com 2011/10/5 Héctor Trujillo : > Sorry you have the reason, this file was i

Re: Indexing PDF

2011-10-05 Thread Paul Libbrecht
Héctor, I was meaning you need another way to reference the file *to the mailing list*. Sorry for the confusion. I do not think there's anything special to the set of interfaces you're using if the delivery is the same for the solr client and the acrobat plugin. To make sure of it, you could t

Re: Indexing PDF

2011-10-05 Thread Héctor Trujillo
Sorry, you are right: this file was indexed with a .NET web service client that calls a Java application (a web service), which calls Solr using SolrJ. I will try to index it a different way; maybe that will resolve the problem. Thanks Best regards On October 5, 2011 at 08:42, Héctor Tr

Re: Indexing PDF

2011-10-05 Thread Héctor Trujillo
It seems unreasonable that if I want to index a local file, I have to reference it by a URL. This isn't a strange file; it is a file downloaded from the Lucid web portal called: Starting a Search Application.pdf This problem may be an encoding problem, or a character set problem. I op

Re: Indexing PDF

2011-10-04 Thread Robert Muir
Your persian pdf problem is different, and already taken care of in pdfbox trunk https://issues.apache.org/jira/browse/PDFBOX-1127 On Tue, Oct 4, 2011 at 2:04 PM, ahmad ajiloo wrote: > I have this problem too, in indexing some of persian pdf files. > > 2011/10/4 Héctor Trujillo > >> Hi all, I'm

Re: Indexing PDF

2011-10-04 Thread ahmad ajiloo
I have this problem too, in indexing some of persian pdf files. 2011/10/4 Héctor Trujillo > Hi all, I'm indexing pdf's files with SolrJ, and most of them work. But > with > some files I’ve got problems because they stored estrange characters. I got > stored this content: > +++ > > Starting a

Re: Indexing PDF

2011-10-04 Thread Paul Libbrecht
full of boxes for me. Héctor, you need another way to reference these! (e.g. a URL) paul On Oct 4, 2011, at 16:49, Héctor Trujillo wrote: > Hi all, I'm indexing PDF files with SolrJ, and most of them work. But with > some files I’ve got problems because they stored strange characters. I got

Re: Indexing pdf files - question.

2011-04-08 Thread Mike
Hi Erick, Thank you for the Reply. Now I am able to index the PDF files and search. I am left with couple of questions: 1. Can I add custom field to Search Response XML (Ex: Need to as description which gives brief description about the PDF file). 2. Currently Solr runs as a separate applicatio

Re: Indexing pdf files - question.

2011-04-07 Thread Erick Erickson
Did you try the curl commands that Adam suggested as part of this e-mail thread? If so, what happened? Best Erick On Wed, Apr 6, 2011 at 7:50 AM, Mike wrote: > Hi All, > > I am new to solr. I have gone through solr documents to index pdf files, > But > it was hard to find the exact procedure to

Re: Indexing pdf files - question.

2011-04-07 Thread Mike
Hi All, I am new to Solr. I have gone through the Solr documentation on indexing PDF files, but it was hard to find the exact procedure to get started. I need a step-by-step procedure to do this. Could you please let me know the steps to index PDF files? Thanks, Mike

Re: Indexing pdf files - question.

2010-12-13 Thread Wodek Siebor
The sample /docs/tutorial.pdf does not require OCR.

Re: Indexing pdf files - question.

2010-12-13 Thread Adam Estrada
Hi, I use the following command to post PDF files. $ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\document.docx&stream.contentType=application/msword&literal.id=esc.doc&commit=true" $ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\features.pdf&stream

Re: Indexing PDF - literal field already there & many "null"'s in text field

2010-09-17 Thread Lance Norskog
Tika is not perfect. Very much not perfect. I've seen a 10-15% failure rate on randomly sampled files. It works for creating searchable text fields, but not for text fields to return. That is, the analyzers rip out the nulls and make an intelligible stream of words. If you want to save these words

Re: indexing pdf documents

2008-05-14 Thread Brian Carmalt
Hello Cam, The wiki for RichDocuments explains how you can add metadata to the RDUpdater. http://wiki.apache.org/solr/UpdateRichDocuments I have used the patch to index docs and their metadata, but it was not exactly what we needed. Brian. On Wednesday, 14.05.2008, at 12:38 +0300,

Re: indexing pdf documents

2008-05-14 Thread Cam Bazz
Hello Elizabeth; Yes, I have PDF files, and metadata about them already extracted. so I need something like: someone content of my pdf file it seems that the updaterichdocument patch can only accept pdfs in raw form - so it is not possible to feed metadata. Have you found a solution other th

Re: indexing pdf documents

2008-05-13 Thread Bess Sadler
C.B., are you saying you have metadata about your PDF files (i.e., title, author, etc) separate from the PDF file itself, or are you saying you want to extract that information from the PDF file? The first of these is pretty easy, the second of these can be difficult or impossible, dependin

Re: indexing pdf documents

2008-05-13 Thread Cam Bazz
yes, I have seen the documentation on RichDocumentRequestHandler at the http://wiki.apache.org/solr/UpdateRichDocuments page. However, from what I understand this just feeds documents to solr. How can I construct something like: document_id, document_name, document_text and feed it in. (i.e. my doc

Re: indexing pdf documents

2008-05-12 Thread Chris Harris
Solr does not have this support built in, but there's a patch for it: https://issues.apache.org/jira/browse/SOLR-284 On Mon, May 12, 2008 at 2:02 PM, Cam Bazz <[EMAIL PROTECTED]> wrote: > Hello, > > Before making a little program to extract the txt from my pdfs and feed it > into solr with xml,