Re: ExtractingRequestHandler indexing zip files

2014-09-11 Thread keeblerh
Working now - fyi - the "update/extract" from a post works extracting from a kmz(zip) but I am still having trouble from the dataimport. I'll move to another thread for that. THANKS all. -- View this message in context: http://lucene.472066.n3.nabble.com/ExtractingRequestHandler-indexing-zip

Re: ExtractingRequestHandler indexing zip files

2014-09-10 Thread keeblerh
Thanks for the info Sergio. I updated my 4.8.1 version with that patch and SOLR 4216 (which was really the same thing). It took a day to get it to compile on my network and it still doesn't work. Did my config file look correct? I'm wondering if I need another param somewhere. "Patch has to be

Re: ExtractingRequestHandler indexing zip files

2014-09-09 Thread marotosg
Hi keeblerh, the patch has to be applied to the source code and Solr.war compiled again. If you do that, then extracting the content of the documents works. Regards, Sergio

Re: ExtractingRequestHandler indexing zip files

2014-09-09 Thread keeblerh
I am also having the issue where my zip contents (or kmz contents) are not being processed - only the file names are processed. It seems to recognize the kmz extension and open the file; it just doesn't recurse the processing on the contents. The patch you mention has been around for a while. I am ru

Re: ExtractingRequestHandler - extracted files caching?

2014-06-30 Thread Erick Erickson
Here's an example of what Alexandre is talking about: http://searchhub.org/2012/02/14/indexing-with-solrj/ It mixes database fetching in with the Tika processing, but that should be pretty easy to pull out. Best, Erick On Mon, Jun 30, 2014 at 8:21 PM, Alexandre Rafalovitch wrote: > Under the co

Re: ExtractingRequestHandler - extracted files caching?

2014-06-30 Thread Alexandre Rafalovitch
Under the covers, Tika is used. You can use Tika yourself on the client side and cache its output in the database or a text file. Then, send that to Solr instead. That puts less load on Solr as well. Or you can use atomic update, but then all the primary (not copyField) fields must be stored="true". Re
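A rough sketch of the client-side caching idea described above, not anyone's actual code from this thread. The `extract` callable stands in for whatever Tika invocation you use and is hypothetical, as are the cache layout and the `content` field name.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cached_extract(path: Path, cache_dir: Path,
                   extract: Callable[[bytes], str]) -> str:
    """Return extracted text for `path`, reusing a cached copy when present."""
    data = path.read_bytes()
    # Key the cache on the file content, so an unchanged file is never re-parsed.
    key = hashlib.sha256(data).hexdigest()
    cache_file = cache_dir / f"{key}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = extract(data)  # hypothetical: e.g. a call into Tika on the client
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text


def to_solr_doc(doc_id: str, text: str, field: str = "content") -> str:
    """Build a JSON body holding one plain-text document for a Solr update."""
    return json.dumps([{"id": doc_id, field: text}])
```

With this shape, a reindex only pays the extraction cost for files whose bytes have changed; everything else is read back from the cache and sent to Solr as ordinary text.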

Re: ExtractingRequestHandler indexing zip files

2014-05-28 Thread marotosg
I extended ExtractingDocumentLoader with this patch and it works. https://issues.apache.org/jira/secure/attachment/12473188/SOLR-2416_ExtractingDocumentLoader.patch It iterates through all documents and extracts the name and the content of all documents inside the file. Regards, Sergio

Re: ExtractingRequestHandler indexing zip files

2014-05-27 Thread Siegfried Goeschl
Hi Sergio, you either do the work on the caller side (which is probably a good idea since you are offloading the Solr server) or extend the ExtractingRequestHandler Cheers, Siegfried Goeschl On 27 May 2014, at 10:37, marotosg wrote: > Hi, > > Thanks for your answer Alexandre. > I have zip f

Re: ExtractingRequestHandler indexing zip files

2014-05-27 Thread marotosg
Hi, Thanks for your answer Alexandre. I have zip files with only one document inside per zip file. These documents are mainly pdf,xml,html. I tried to index "tini.txt.gz" file which is located in the trunk to be used by extraction tests \trunk\solr\contrib\extraction\src\test-files\extraction\tin

Re: ExtractingRequestHandler indexing zip files

2014-05-26 Thread Alexandre Rafalovitch
A zip file can contain many files and directories in a nested structure. With files of any type and size. What would you expect Solr to do facing a generic Zip file? And what would you like it to do for _your_ - one assumes more restricted - scenario? Regards, Alex. Personal website: http://

Re: ExtractingRequestHandler causes Out of Memory Error

2012-10-03 Thread Jan Høydahl
Hi, If you like, you can open a JIRA issue on this and provide as much info as possible. Someone can then look into (potential) memory optimization of this part of the code. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com 28. sep.

Re: ExtractingRequestHandler causes Out of Memory Error

2012-09-27 Thread Shigeki Kobayashi
Hi Jan. Thank you very much for your advice. So I understand Solr needs more memory to parse the files. To parse a file of size x, it needs double the memory (2x). Then how much memory should be allocated to the heap? 8x? 16x? Regards, Shigeki 2012/9/28 Jan Høydahl > Please try to incr

Re: ExtractingRequestHandler causes Out of Memory Error

2012-09-27 Thread Jan Høydahl
Please try to increase -Xmx and see how much RAM you need for it to succeed. I believe it is simply a case where this particular file needs double memory (480Mb) to parse and you have only allocated 1Gb (which is not particularly much). Perhaps the code could be optimized to avoid the Arrays.cop

Re: ExtractingRequestHandler causes Out of Memory Error

2012-09-27 Thread Lance Norskog
These are very large files and this is not enough memory. Do you upload these as files? If the CSV file is one document per line, you can split it up. Unix has a 'split' command which does this very nicely. - Original Message - | From: "Shigeki Kobayashi" | To: solr-user@lucene.apach
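For anyone without the Unix 'split' command at hand, here is a small sketch of the same idea. It assumes, as above, that the CSV really is one document per line (no quoted fields containing newlines); the function name and the header handling are my own choices.

```python
from typing import Iterable, Iterator, List


def split_lines(lines: Iterable[str], chunk_size: int,
                has_header: bool = True) -> Iterator[List[str]]:
    """Yield chunks of at most `chunk_size` data lines, repeating the header."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    it = iter(lines)
    header: List[str] = []
    if has_header:
        first = next(it, None)
        if first is None:
            return  # empty input: nothing to yield
        header = [first]
    chunk: List[str] = []
    for line in it:
        chunk.append(line)
        if len(chunk) == chunk_size:
            yield header + chunk
            chunk = []
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield header + chunk
```

Each chunk can then be written to its own file and uploaded separately, which keeps the size of any single request small.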

Re: ExtractingRequestHandler

2012-04-03 Thread Ravish Bhagdev
(Bit off-topic but...) I understand the fact that Solr isn't meant to 'store' everything, but because highlighting matches requires a field to be stored I would expect most people to end up storing full document content in their indexes? I can't think of any good workaround for this...

RE: ExtractingRequestHandler

2012-04-02 Thread spring
> Solr Cell is great for proof-of-concept, but for heavy-duty > applications, > you're offloading all the processing on the Solr server, > which can be a > problem. Good point! Thank you

Re: ExtractingRequestHandler

2012-04-01 Thread Bill Bell
I have had good luck with creating a separate core index for just data. This is a different core than the indexed core. Very fast. Bill Bell Sent from mobile On Apr 1, 2012, at 11:15 AM, Erick Erickson wrote: > Yes, you can. but Generally, storing the raw input in Solr is > not the best

Re: ExtractingRequestHandler

2012-04-01 Thread Erick Erickson
Ahhh, OK. Sure, anything you store in Solr you can get back. The key is not Tika, but your schema.xml file, and setting 'stored="true"' bq: So my question was if I can index the original doc via ExtractingRequestHandler in Solr AND get back the text output, in a single call. I know of no way to

RE: ExtractingRequestHandler

2012-04-01 Thread spring
Hi Erick, I think we have some misunderstanding. I want to index the text of the docs in Solr (only indexed, NOT stored). But I want the text (Tika output) back for: * later faster reindexing (some text extraction like OCR takes really long) * use the text for other processing The original doc

Re: ExtractingRequestHandler

2012-04-01 Thread Erick Erickson
Yes, you can, but generally, storing the raw input in Solr is not the best approach. The problem here is that pretty soon you get a huge index that contains *everything*. Solr was not intended to be a data store. Besides, you then need to store the binary form of the file. Solr only deals with

Re: ExtractingRequestHandler HTTP GET Problem

2011-11-17 Thread Chris Hostetter
: indexed file. The CommonsHttpSolrServer sends the parameters as a HTTP : GET request. Because of that I'll get a "socket write error". If I : change the CommonsHttpSolrServer to send the parameters as HTTP POST : sending will work, but the ExtractingRequestHandler will not recognize : the parame

Re: ExtractingRequestHandler - renaming tika generated fields

2011-06-09 Thread Jan Høydahl
One solution to this problem is to change the order of field operation (http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations) to first do fmap.*= processing, then add the fields from literal.*=. Why would anyone want to rename a field they just have explicitly named any

Re: ExtractingRequestHandler and Solr 3.1

2011-04-14 Thread Liam O'Boyle
Hi Grant, After comparing the differences between my solrconfig.xml and that used by the example, the key difference is that I didn't have true in the defaults for the ERH. Commenting out this line in the example configuration causes the example to display the same behaviour as I'm seeing. I've

Re: ExtractingRequestHandler and Solr 3.1

2011-04-13 Thread Grant Ingersoll
On Apr 13, 2011, at 12:06 AM, Liam O'Boyle wrote: > Afternoon, > > After an upgrade to Solr 3.1 which has largely been very smooth and > painless, I'm having a minor issue with the ExtractingRequestHandler. > > The problem is that it's inserting metadata into the extracted > content, as well as

Re: ExtractingRequestHandler "multiple values encountered for non multiValued field last_modified"

2010-02-04 Thread Lance Norskog
The Tika integration with the DataImportHandler allows you to control many aspects of what goes into the index, including solving this problem: http://wiki.apache.org/solr/TikaEntityProcessor (Tika is the extraction library, and ExtractingRequestHandler and the TikaEntityProcessor both use it.)

Re: ExtractingRequestHandler unknown field 'stream_source_info'

2009-10-01 Thread Tricia Williams
Thanks Lance, I have lucid's search as one of my open search tools in my browser. Generally pretty useful (especially the ability to filter) but it's not of much help when the tool points out that the best info is on the wiki and the link to the wiki reveals that it can't be reached. This

Re: ExtractingRequestHandler unknown field 'stream_source_info'

2009-10-01 Thread Lance Norskog
For future reference, the Solr & Lucene wikis and mailing lists are indexed on http://www.lucidimagination.com/search/ On Thu, Oct 1, 2009 at 11:40 AM, Tricia Williams wrote: > If the wiki isn't working >> >> >> https://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2 >

Re: ExtractingRequestHandler unknown field 'stream_source_info'

2009-10-01 Thread Tricia Williams
If the wiki isn't working https://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2 gave me more information. The LucidImagination article helps too. Now that the wiki is up again it is more obvious that I need to add: fulltext text to my solrconfig.xml Tricia

Re: ExtractingRequestHandler unknown field 'stream_source_info'

2009-10-01 Thread Walter Lewis
On 1 Oct 09, at 12:46 PM, Tricia Williams wrote: STREAM_SOURCE_INFO https://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2 appears to be a constant from this page: http://lucene.apache.org/solr/api/constant-values.html This has it embedded as an "arr" in the re

RE: ExtractingRequestHandler and local files

2009-06-09 Thread Fergus McMenemie
and stream.body >start working everywhere > > > >So wanted to confirm. > >> From: gsing...@apache.org >> To: solr-user@lucene.apache.org >> Subject: Re: ExtractingRequestHandler and local files >> Date: Tue, 9 Jun 2009 14:50:43 -0400 >> >> I haven&

RE: ExtractingRequestHandler and local files

2009-06-09 Thread doraiswamy thirumalai
stream.file and stream.body start working everywhere So wanted to confirm. > From: gsing...@apache.org > To: solr-user@lucene.apache.org > Subject: Re: ExtractingRequestHandler and local files > Date: Tue, 9 Jun 2009 14:50:43 -0400 > > I haven't tried it, but I thought th

Re: ExtractingRequestHandler and local files

2009-06-09 Thread Grant Ingersoll
I haven't tried it, but I thought the enableRemoteStreaming stuff should work. That stuff is handled by Solr in other places, if I recall correctly. Have you tried it? -Grant On Jun 9, 2009, at 2:28 PM, doraiswamy thirumalai wrote: Hi, I would greatly appreciate a quick response to t
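A minimal sketch of what such a request could look like, assuming remote streaming is enabled in solrconfig.xml and the handler is mapped at /update/extract; the base URL, file path and id below are placeholders. It only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode


def stream_file_url(base_url: str, server_path: str, doc_id: str) -> str:
    """Build an extract URL that asks Solr to read a file local to the server."""
    params = [
        ("stream.file", server_path),  # path as seen from the Solr server
        ("literal.id", doc_id),
    ]
    return f"{base_url}/update/extract?{urlencode(params)}"
```

For example, `stream_file_url("http://localhost:8983/solr", "/tmp/a.pdf", "doc1")` yields a URL whose `stream.file` value is percent-encoded, so paths containing spaces or slashes survive the trip.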

Re: ExtractingRequestHandler Question

2009-05-10 Thread Erick Erickson
Why is this surprising? *Assuming* that the EnglishPorterFilterFactory doesn't stem "clai" to "cla", this makes perfect sense. And since "clai" isn't English in the first place. Or am I missing something? Have you looked at your index with Luke to see what actually gets placed in it (i.e. whe

Re: ExtractingRequestHandler Question

2009-05-10 Thread Venu Mittal
Hi, Wondering if somebody could help me understand the following behavior: If I search on a text field with the search query "davi cla" then it does not yield any search results; however, if I search for "davi clai" then it yields 100+ results. The field I am searching on is a text fi

Re: ExtractingRequestHandler and SolrRequestHandler issue

2009-04-29 Thread francisco treacy
Well, problem seems to be with > java -Dsolr.solr.home="/my/path/to/solr" -jar start.jar Everything runs fine if I copy my xmls to the original conf directory of the example (example/solr/conf) and I execute like > java -jar start.jar Some wrong path to libs somewhere - who knows. Couldn't find

Re: ExtractingRequestHandler and SolrRequestHandler issue

2009-04-27 Thread francisco treacy
Thanks for your answers. Still no success. >> These need to be in your Solr home lib, not example/lib. I sometimes get >> confused on this one, too, forgetting that I need to go down a few more >> directories. The example/lib directory is where the Jetty stuff lives, >> example/solr/lib is the l

Re: ExtractingRequestHandler and SolrRequestHandler issue

2009-04-22 Thread Peter Wolanin
I had problems with this when trying to set this up with multiple cores - I had to set the shared lib as: in example/solr/solr.xml in order for it to find the jars in example/solr/lib -Peter On Wed, Apr 22, 2009 at 11:43 AM, Grant Ingersoll wrote: > > On Apr 20, 2009, at 12:46 PM, francisco t

Re: ExtractingRequestHandler and SolrRequestHandler issue

2009-04-22 Thread Grant Ingersoll
On Apr 20, 2009, at 12:46 PM, francisco treacy wrote: Additionally, here's what I've got in example/lib: These need to be in your Solr home lib, not example/lib. I sometimes get confused on this one, too, forgetting that I need to go down a few more directories. The example/lib director

Re: ExtractingRequestHandler and SolrRequestHandler issue

2009-04-20 Thread francisco treacy
Additionally, here's what I've got in example/lib: apache-solr-cell-nightly.jar bcmail-jdk14-132.jar commons-lang-2.1.jar icu4j-3.8.jar log4j-1.2.14.jar poi-3.5-beta5.jar slf4j-api-1.5.5.jar xml-apis-1.0.b2.jar apache-solr-core-nightly.jar bcprov-jdk14-132.jar common

Re: ExtractingRequestHandler and SolrRequestHandler issue

2009-04-20 Thread francisco treacy
Hi Grant, Here is the full stacktrace: 20-Apr-2009 12:36:39 org.apache.solr.common.SolrException log SEVERE: java.lang.ClassCastException: org.apache.solr.handler.extraction.ExtractingRequestHandler cannot be cast to org.apache.solr.request.SolrRequestHandler at org.apache.solr.core.Requ

Re: ExtractingRequestHandler and SolrRequestHandler issue

2009-04-20 Thread Grant Ingersoll
Can you give the full stack trace? On Apr 20, 2009, at 6:49 AM, francisco treacy wrote: Hi all, I am unsuccessfully attempting to use the ExtractingRequestHandler (indexing documents via Tika, Solr cell). I start Solr from the example app (start.jar), but point to my own Solr conf, where I hav

Re: ExtractingRequestHandler Question

2009-04-07 Thread Grant Ingersoll
Can you add the values as literals? http://wiki.apache.org/solr/ExtractingRequestHandler#head-88b9f55989c9878638e88be5d335b5126550f87c On Apr 3, 2009, at 8:29 PM, Venu Mittal wrote: Hi, I am using ExtractingRequestHandler to index rich text documents. The way I am doing it is I get some dat
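To illustrate the literals suggestion, a small sketch that turns a dictionary of field values into literal.* request parameters. The field names and base URL are made up for the example, and the /update/extract path assumes the handler is mapped there; nothing is sent to Solr.

```python
from typing import Mapping, Sequence, Union
from urllib.parse import urlencode

Value = Union[str, Sequence[str]]


def extract_url(base_url: str, literals: Mapping[str, Value]) -> str:
    """Build an extract URL carrying each field value as a literal.<field> parameter."""
    params = [(f"literal.{name}", value) for name, value in literals.items()]
    # doseq=True expands a list value into one repeated parameter per element.
    return f"{base_url}/update/extract?{urlencode(params, doseq=True)}"
```

The rich document itself would still go in the request body as a file upload; the literals simply ride along in the query string and end up as fields on the same document.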

Re: ExtractingRequestHandler Question

2009-04-06 Thread Venu Mittal
is not helping much either. Anyway, I will explore and see if I can come up with anything better (maybe a separate index for rich text docs). Thanks, Venu From: Jacob Singh To: solr-user@lucene.apache.org Sent: Saturday, April 4, 2009 9:59:13 PM Subject: Re

Re: ExtractingRequestHandler Question

2009-04-04 Thread Jacob Singh
Hi TIA, I have the same desired requirement. If you look up in the archives, you might find a similar thread between myself and the always super helpful Erik Hatcher. Basically, it can't be done (right now). You can however use the "ExtractOnly" request handler, and just get the extracted text
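As far as I know, extract-only is a request parameter (extractOnly=true) on the extracting handler rather than a separate handler. A sketch of building such a request follows, assuming the handler lives at /update/extract and that your Solr version supports extractFormat; the base URL is a placeholder and nothing is sent.

```python
from urllib.parse import urlencode


def extract_only_url(base_url: str, extract_format: str = "text") -> str:
    """Build a URL that asks Solr Cell to return extracted content without indexing it."""
    if extract_format not in ("text", "xml"):
        raise ValueError("extract_format must be 'text' or 'xml'")
    params = [("extractOnly", "true"), ("extractFormat", extract_format)]
    return f"{base_url}/update/extract?{urlencode(params)}"
```

The response to such a request carries the extracted text, which the client can keep and later post back as a normal document alongside its other fields.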

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-20 Thread Jacob Singh
On Wed, Dec 17, 2008 at 11:06 AM, Chris Hostetter wrote: > > : > : If I can find the bandwidth, I'd like to make something which allows > : > : file uploads via the XMLUpdateHandler as well... Do you have any ideas > : > > : > the XmlUpdateRequestHandler already supports file uploads ... all reque

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-16 Thread Chris Hostetter
: > : If I can find the bandwidth, I'd like to make something which allows : > : file uploads via the XMLUpdateHandler as well... Do you have any ideas : > : > the XmlUpdateRequestHandler already supports file uploads ... all request : But it doesn't do what Jacob is asking for... he wants (if I

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-16 Thread Jacob Singh
No, I didn't mean storing the binary along with, just that I could send a binary file (or a text file) which tika could process and store along with the XML which describes its literal meta-data. Best, Jacob On Mon, Dec 15, 2008 at 7:17 PM, Grant Ingersoll wrote: > > On Dec 15, 2008, at 8:20 AM,

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Grant Ingersoll
On Dec 15, 2008, at 8:20 AM, Jacob Singh wrote: Hi Erik, Sorry I wasn't totally clear. Some responses inline: If the file is visible from the Solr server, there is no need to actually send the bits through HTTP. Solr's content stream capabilities allow a file to be retrieved from Solr it

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Jacob Singh
Hi Erik, Sorry I wasn't totally clear. Some responses inline: > If the file is visible from the Solr server, there is no need to actually > send the bits through HTTP. Solr's content stream capabilities allow a file > to be retrieved from Solr itself. > Yeah, I know. But in my case not possible

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Erik Hatcher
Jacob, Hmmm... seems the wires are still crossed and confusing. On Dec 15, 2008, at 6:34 AM, Jacob Singh wrote: This is indeed what I was talking about... It could even be handled via some type of transient file storage system. this might even be better to avoid the risks associated with uplo

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Jacob Singh
Hi Erik, This is indeed what I was talking about... It could even be handled via some type of transient file storage system. this might even be better to avoid the risks associated with uploading a huge file across a network and might (have no idea) be easier to implement. So I could send the fi

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Erik Hatcher
On Dec 15, 2008, at 3:13 AM, Chris Hostetter wrote: : If I can find the bandwidth, I'd like to make something which allows : file uploads via the XMLUpdateHandler as well... Do you have any ideas the XmlUpdateRequestHandler already supports file uploads ... all request handlers do using

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-15 Thread Chris Hostetter
: If I can find the bandwidth, I'd like to make something which allows : file uploads via the XMLUpdateHandler as well... Do you have any ideas the XmlUpdateRequestHandler already supports file uploads ... all request handlers do using the ContentStream abstraction... http://wiki.apache

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-14 Thread Jacob Singh
Hey, thanks! This is good stuff. I didn't expect you to just make the fix! If I can find the bandwidth, I'd like to make something which allows file uploads via the XMLUpdateHandler as well... Do you have any ideas here? I was thinking we could just send the XML payload as another POST field.

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-13 Thread Grant Ingersoll
Hi Jacob, I just updated the code such that it should now be possible to send in multiple values as literals, as in an HTML form that looks like: method="POST"> Choose a file to upload: Cheers, Grant On Dec 12, 2008, at 11:53 PM, Jacob Singh wrote: Hi Grant, Thanks for the q

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-13 Thread Grant Ingersoll
On Dec 12, 2008, at 11:53 PM, Jacob Singh wrote: Hi Grant, Thanks for the quick response. My Colleague looked into the code a bit, and I did as well, here is what I see (my Java sucks): http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-12 Thread Jacob Singh
Hi Grant, Thanks for the quick response. My Colleague looked into the code a bit, and I did as well, here is what I see (my Java sucks): http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java //handle the lite

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-12 Thread Grant Ingersoll
Hmmm, I think I see the disconnect, but I'm not sure. Sending to the ERH (ExtractingRequestHandler) is not an XML command at all, it's a file-upload/multi-part encoding. I think you will need an API that does something like: (Just making this up, this is not real code) File file = new File(f

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-12 Thread Jacob Singh
Hi Grant, Happy to. Currently we are sending over documents by building a big XML file of all of the fields of that document. Something like this: $document = new Apache_Solr_Document(); $document->id = apachesolr_document_id($node->nid); $document->title = $node->title; $document->b

Re: ExtractingRequestHandler and XmlUpdateHandler

2008-12-11 Thread Grant Ingersoll
On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote: Hey folks, I'm looking at implementing ExtractingRequestHandler in the Apache_Solr_PHP library, and I'm wondering what we can do about adding meta-data. I saw the docs, which suggests you use different post headers to pass field values alo