RE: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Phil Scadden
mmit(solr, "prindex"); return true; -Original Message- From: Erick Erickson Sent: Wednesday, 31 October 2018 06:00 To: solr-user Subject: Re: Indexing PDF file in Apache SOLR via Apache TIKA All of the above work, but for robust production situations you'll wa

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread ☼ R Nair
> > > let me introduce myself. My name is Mohammad Kevin Putra (you can call me Kevin), from Indonesia. I am a beginner backend developer; I use Linux Mint, Apache SOLR 7.5.0 and Apache TIKA 1.91.0. I have a small problem

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Erick Erickson
0, 2018 at 3:40 PM adiyaksa kevin wrote: > > Hello there, let me introduce myself. My name is Mohammad Kevin Putra (you can call me Kevin), from Indonesia. I am a beginner backend developer; I use Linux Mint, Apache SOLR 7.5.0 and Apache TIKA 1.91.0.

Re: Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-30 Thread Kamuela Lau
> can call me Kevin), from Indonesia. I am a beginner backend developer; I use Linux Mint, Apache SOLR 7.5.0 and Apache TIKA 1.91.0. I have a small problem with how to index a PDF file via Apache TIKA. I understand how SOLR and TIKA work, but I don't know how th

Indexing PDF file in Apache SOLR via Apache TIKA

2018-10-29 Thread adiyaksa kevin
Hello there, let me introduce myself. My name is Mohammad Kevin Putra (you can call me Kevin), from Indonesia. I am a beginner backend developer; I use Linux Mint, Apache SOLR 7.5.0 and Apache TIKA 1.91.0. I have a small problem with how to index a PDF file via Apache TIKA. I

Re: Can't upload pdf file to example Core

2017-06-14 Thread Susheel Kumar
:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf" On Wed, Jun 14, 2017 at 1:30 PM, Vasiliy Boldyrev < vasiliy.boldy...@gmail.com> wrote: > Hello, > > I used Apache Solr™ version 6.6.0 but can't upload
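The curl call quoted above posts a file to the `/update/extract` handler of the `techproducts` core and sets the document id through `literal.id`. A minimal sketch of how that request URL is put together, assuming a base URL of `http://localhost:8983/solr` (the host is cut off in the snippet, so this value is hypothetical) and a helper name, `build_extract_url`, invented for illustration:

```python
from urllib.parse import urlencode


def build_extract_url(base_url, core, doc_id, commit=True):
    """Return the /update/extract URL for one document.

    base_url is an assumption (e.g. a local Solr install); core and
    doc_id correspond to 'techproducts' and 'doc1' in the quoted command.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{base_url.rstrip('/')}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host; the file itself would still be sent as multipart form data.
    print(build_extract_url("http://localhost:8983/solr", "techproducts", "doc1"))
```

The PDF is still sent separately as multipart form data (the `-F "myfile=@..."` part of the curl command); the sketch only covers the query string.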

Can't upload pdf file to example Core

2017-06-14 Thread Vasiliy Boldyrev
Hello, I used Apache Solr™ version 6.6.0 but can't upload a PDF file to the core. The instructions and example were taken from https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika I added to solrconfig.xml additional paths to /dist/ and /contrib/extractio

Re: Indexing several parts of PDF file

2013-02-05 Thread Jorge Luis Betancourt Gonzalez
hanks for the replies! - Original message - From: "Upayavira" To: solr-user@lucene.apache.org Sent: Tuesday, 5 February 2013 9:05:58 Subject: Re: Indexing several parts of PDF file This would involve you querying against every page in your document, which will be too many

Re: Indexing several parts of PDF file

2013-02-05 Thread VIGNESH S
Yes, I think the same. Better to index each page as a document. On Tue, Feb 5, 2013 at 7:35 PM, Upayavira wrote: > This would involve you querying against every page in your document, > which will be too many fields and will break quickly. > > The best way to do it is to index pages as documents

Re: Indexing several parts of PDF file

2013-02-05 Thread Upayavira
This would involve you querying against every page in your document, which will be too many fields and will break quickly. The best way to do it is to index pages as documents. You can use field collapsing to group pages from the same document together. Upayavira On Tue, Feb 5, 2013, at 02:00 PM
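Upayavira's suggestion, one Solr document per page with field collapsing to group the pages of the same PDF, could look roughly like the sketch below. The field names `id`, `doc_id`, `page` and `text` and both helper functions are assumptions made for illustration and do not come from the thread; `group` and `group.field` are Solr's result-grouping parameters.

```python
def pages_to_docs(doc_id, pages):
    """Turn a list of per-page texts into one Solr document per page.

    Field names are hypothetical; 'doc_id' ties each page back to its PDF.
    """
    return [
        {"id": f"{doc_id}_p{n}", "doc_id": doc_id, "page": n, "text": text}
        for n, text in enumerate(pages, start=1)
    ]


def grouped_query_params(query):
    """Query parameters that collapse matching pages by their parent PDF."""
    return {"q": query, "group": "true", "group.field": "doc_id"}


if __name__ == "__main__":
    for doc in pages_to_docs("report", ["first page text", "second page text"]):
        print(doc)
    print(grouped_query_params("text:solr"))
```

Each hit then identifies both the document (`doc_id`) and the matching page (`page`), which is the requirement described in the original question.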

Indexing several parts of PDF file

2013-02-05 Thread Jorge Luis Betancourt Gonzalez
Hi: I'm working on a search engine for several PDF documents. Right now one of the requirements is that we provide not only the documents matching the search criteria but also the page that matches the criteria. Normally Tika only extracts the text content and does not make this distinction, but usi

Re: Apache solr not indexing complete pdf file using tikka

2012-04-03 Thread Ravish Bhagdev
I'd also suggest trying to extract text using tika-app (shipped with the Tika distribution as an executable jar) on the PDF(s) in question to see if the problem is with extraction or with indexing. Rav On Mon, Apr 2, 2012 at 1:55 PM, Erick Erickson wrote: > You can index 2B tokens, so upping maxFieldLength
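One way to follow this advice is to run tika-app on the PDF and compare the amount of extracted text with what ends up in the index. The sketch below builds the tika-app command line (using its `--text` option) and counts whitespace-separated tokens. The jar and PDF paths are placeholders, the function names are invented for illustration, and the count is only a rough estimate, not what Solr's analyzer would produce.

```python
import subprocess


def tika_text_command(jar_path, pdf_path):
    """Command line that asks tika-app for plain-text output."""
    return ["java", "-jar", jar_path, "--text", pdf_path]


def count_tokens(text):
    """Rough token count: whitespace-separated words."""
    return len(text.split())


def extract_and_count(jar_path, pdf_path):
    """Run tika-app (needs Java and the jar on disk) and count extracted tokens."""
    result = subprocess.run(
        tika_text_command(jar_path, pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return count_tokens(result.stdout)
```

If the count is far above the number of tokens Solr indexed, the limit is probably on the indexing side (for example maxFieldLength, mentioned in the quoted reply); if the extracted text is itself short, the problem lies in extraction.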

Re: Apache solr not indexing complete pdf file using tikka

2012-04-02 Thread Erick Erickson
You can index 2B tokens, so upping maxFieldLength should have fixed your problem at least as far as Solr is concerned. How many tokens get indexed? I'm not as familiar with Tika, but there may be some kind of parameter there (although I don't remember this coming up before)... Did you restart Solr

Apache solr not indexing complete pdf file using tikka

2012-04-02 Thread Manoj Saini
Hello Guys, I am using Apache Solr 3.3.0 with Tika 1.0. I have PDF files which I am pushing into Solr for content searching. Apache Solr is indexing the PDF files and I can see them in the Apache Solr admin interface for search. But the issue is that Apache Solr is not indexing the whole file content. It is index

Unexpected Tika Exception extracting text from a PDF file.

2012-03-23 Thread Jon Dragt
Howdy Folks, I'm stumped and hope somebody can give me some clues on how to work around this occasional error I'm getting. I've got a .Net console program using SolrNet to scour certain folders at certain times and extract text from PDF files and index them. It succeeds on a majority of the fi

SolrJ Request issue when trying to add a PDF file to Index

2012-03-16 Thread Jones, Rhys
Hello, I'm having trouble adding a pdf file to my index. It's multicored. My server object instantiates properly (StreamingUpdateSolrServer). In my request object (ContentStreamUpdateRequest) I add a couple of literals to populate fields in the index that the parsed content of the

Re: How to index PDF file stored in SQL Server 2008

2011-04-11 Thread Roy Liu
I changed data-config-sql.xml to There are no errors, but the indexed PDF is converted to numbers: 200 1 202 1 203 1 212 1 222 1 236 1 242 1 244 1 254 1 255 -- Best Regards, Roy Liu On Mon, Apr 11, 2011 at 2:02 PM, Roy Liu wrote: >

Re: How to index PDF file stored in SQL Server 2008

2011-04-10 Thread Roy Liu
Hi, I have copied \apache-solr-3.1.0\dist\apache-solr-dataimporthandler-extras-3.1.0.jar into \apache-tomcat-6.0.32\webapps\solr\WEB-INF\lib\ Other Errors: Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Unclosed quotation mark after the character string 'B@3e574'. -- Best Regards,

Re: How to index PDF file stored in SQL Server 2008

2011-04-10 Thread Roy Liu
Hi all, thank you very much for your kind help. 1. I have upgraded from Solr 1.4 to Solr 3.1. 2. Changed data-config-sql.xml. 3. solrconfig.xml and schema.xml are NOT changed. However, when I

Re: How to index PDF file stored in SQL Server 2008

2011-04-10 Thread Lance Norskog
You have to upgrade completely to the Apache Solr 3.1 release. It is worth the effort. You cannot copy any jars between Solr releases. Also, you cannot copy over jars from newer Tika releases. On Fri, Apr 8, 2011 at 10:47 AM, Darx Oman wrote: > Hi again > what you are missing is field mapping >

Re: How to index PDF file stored in SQL Server 2008

2011-04-08 Thread Darx Oman
Hi again, what you are missing is field mapping. There is no need for TikaEntityProcessor since you are not accessing PDF files

Re: How to index PDF file stored in SQL Server 2008

2011-04-08 Thread Darx Oman
Hi there, TikaEntityProcessor is available as part of DIH-extras*.jar in 3.x and 4.0

Re: How to index PDF file stored in SQL Server 2008

2011-04-07 Thread Roy Liu
Thanks Lance, I'm using Solr 1.4. If I want to use TikaEP, do I need to upgrade to Solr 3.1 or import jar files? Best Regards, Roy Liu On Fri, Apr 8, 2011 at 10:22 AM, Lance Norskog wrote: > You need the TikaEntityProcessor to unpack the PDF image. You are > sticking binary blobs into the index.

Re: How to index PDF file stored in SQL Server 2008

2011-04-07 Thread Lance Norskog
You need the TikaEntityProcessor to unpack the PDF image. You are sticking binary blobs into the index. Tika unpacks the text out of the file. TikaEP is not in Solr 1.4, but it is in the new Solr 3.1 release. On Thu, Apr 7, 2011 at 7:14 PM, Roy Liu wrote: > Hi, > > I have a table named *attachme

How to index PDF file stored in SQL Server 2008

2011-04-07 Thread Roy Liu
Hi, I have a table named *attachment* in MS SQL Server 2008.

COLUMN      TYPE
id          int
title       varchar(200)
attachment  image

I need to index the attachment column (which stores PDF files) from the database via DIH. After accessing this URL, it returns "Ind

Re: Internal Server Error when indexing a pdf file

2011-01-10 Thread Grijesh.singh
Check your libraries for Tika-related jar files. Tika-related files must be on Solr's classpath. - Grijesh -- View this message in context: http://lucene.472066.n3.nabble.com/Internal-Server-Error-when-indexing-a-pdf-file-tp2214617p2226374.html Sent from the Solr - User mailing list archive

Internal Server Error when indexing a pdf file

2011-01-07 Thread Alessandro Marino
Hi, I was trying to use Solr Cell (through the Java API) to index a pdf file. The class has been extracted from http://wiki.apache.org/solr/ContentStreamUpdateRequestExample public class Solr { public static void main(String[] args) { try { String solrId = "beautiful_st

Re: PDF file

2010-08-12 Thread Chris Hostetter
: Subject: PDF file : References: <20100729152139.321c4...@ibis> : : In-Reply-To: http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh

RE: PDF file

2010-08-11 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
ur help! Thanks, -Original Message- From: Ma, Xiaohui (NIH/NLM/LHC) [C] Sent: Wednesday, August 11, 2010 10:36 AM To: solr-user@lucene.apache.org Cc: 'jayendra.patil@gmail.com' Subject: RE: PDF file Thanks so much for your help! I got "Remote Streaming is disabled" error.

RE: PDF file

2010-08-11 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
rg Subject: Re: PDF file Try ... curl " http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?stream.file= /pub2009001.pdf&literal.id=777045&commit=true" stream.file - specify full path literal. - specify any extra params if needed Regards, Jayendra On Tue, Aug 10, 2010

Re: PDF file

2010-08-10 Thread Jayendra Patil
i (NIH/NLM/LHC) [C] < xiao...@mail.nlm.nih.gov> wrote: > Thanks so much for your help! I tried to index a pdf file and got the > following. The command I used is > > curl ' > http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&

RE: PDF file

2010-08-10 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Thanks so much for your help! I tried to index a pdf file and got the following. The command I used is curl 'http://lhcinternal.nlm.nih.gov:8989/solr/lhc/update/extract?map.content=text&map.stream_name=id&commit=true' -F "fi...@pub2009001.pdf" Did I do somet

RE: PDF file

2010-08-10 Thread Sharp, Jonathan
, Xiaohui (NIH/NLM/LHC) [C] [xiao...@mail.nlm.nih.gov] Sent: Tuesday, August 10, 2010 11:57 AM To: 'solr-user@lucene.apache.org' Subject: RE: PDF file Does anyone have any experience with PDF file? I really appreciate your help! Thanks so much in advance. -Original Message- From: M

RE: PDF file

2010-08-10 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Does anyone have any experience with PDF file? I really appreciate your help! Thanks so much in advance. -Original Message- From: Ma, Xiaohui (NIH/NLM/LHC) [C] Sent: Tuesday, August 10, 2010 10:37 AM To: 'solr-user@lucene.apache.org' Subject: PDF file I have a lot of pdf f

PDF file

2010-08-10 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
I have a lot of PDF files. I am trying to import them into Solr and index them. I added ExtractingRequestHandler to solrconfig.xml. Please tell me if I need to download some jar files. The Solr 1.4 Enterprise Search Server book uses the following command to import a mccm.pdf. curl 'http://loc

Re: Posting pdf file and posting from remote

2010-02-11 Thread alendo
tingDocumentLoader.load(ExtractingDocumentLoader.java:158) >> at >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54) >> at >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) >> at >> org.

Re: Posting pdf file and posting from remote

2010-02-09 Thread Lance Norskog
solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at > > etc etc... > > -- > View this message in context: > http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512952.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Lance Norskog goks...@gmail.com

Re: Posting pdf file and posting from remote

2010-02-09 Thread alendo
e.java:54) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at etc etc... -- View this message in context: http://old.nabble.com/Posting

Posting pdf file and posting from remote

2010-02-09 Thread alendo
/old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512455.html Sent from the Solr - User mailing list archive at Nabble.com.

Multiple field / pdf file per document

2009-08-24 Thread Joe Kessel
. The data for these text fields comes from multiple PDF files. As I am currently supporting 4 locales, I will have a different PDF file for each locale. In addition I have a number of other fields that are used by the application. Solr will be returning a reference used by the application to