Working now - FYI - the "update/extract" from a POST works extracting from a
kmz (zip), but I am still having trouble with the dataimport. I'll move to
another thread for that. THANKS all.
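For reference, here is a minimal SolrJ sketch of that kind of "update/extract"
post - assuming SolrJ 4.x (HttpSolrServer), a hypothetical doc.kmz, and a
schema with a unique "id" field:

    import java.io.File;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class ExtractPost {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            // The content type is a hint for Tika's format detection.
            req.addFile(new File("doc.kmz"), "application/vnd.google-earth.kmz");
            req.setParam("literal.id", "doc1"); // Tika can't invent a unique key
            req.setParam("commit", "true");     // commit immediately (fine for a test)
            server.request(req);
        }
    }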
Thanks for the info, Sergio. I updated my 4.8.1 version with that patch and
SOLR-4216 (which was really the same thing). It took a day to get it to
compile on my network, and it still doesn't work. Did my config file look
correct? I'm wondering if I need another param somewhere.
"Patch has to be
Hi keeblerh,
The patch has to be applied to the source code, and then Solr.war recompiled.
If you do that, it extracts the content of the documents.
Regards,
Sergio
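For reference, the build steps look roughly like this - a sketch; the checkout
directory and the -p0 patch level are guesses, adjust to your setup:

    cd lucene-solr/solr
    patch -p0 < SOLR-2416_ExtractingDocumentLoader.patch
    ant dist    # rebuilds solr.war under dist/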
I am also having the issue where my zip contents (or kmz contents) are not
being processed - only the file names are processed. It seems to recognize
the kmz extension and open the file; it just doesn't recurse into processing
the contents.
The patch you mention has been around for a while. I am ru
Here's an example of what Alexandre is
talking about:
http://searchhub.org/2012/02/14/indexing-with-solrj/
It mixes database fetching in with the
Tika processing, but that should be pretty easy
to pull out.
Best,
Erick
On Mon, Jun 30, 2014 at 8:21 PM, Alexandre Rafalovitch
wrote:
Under the covers, Tika is used. You can use Tika yourself on the
client side and cache its output in the database or a text file. Then,
send that to Solr instead. That puts less load on Solr as well.
Or you can use atomic updates, but then all the primary (not copyField)
fields must be stored="true".
Regards,
Alex.
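A minimal sketch of that client-side approach, assuming Tika and SolrJ 4.x on
the classpath and a hypothetical stored "content" field:

    import java.io.FileInputStream;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ClientSideExtract {
        public static void main(String[] args) throws Exception {
            // Extract locally so Solr never has to run Tika itself.
            BodyContentHandler text = new BodyContentHandler(-1); // -1 = no size limit
            new AutoDetectParser().parse(new FileInputStream(args[0]),
                                         text, new Metadata(), new ParseContext());

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "report-1");
            doc.addField("content", text.toString()); // cache this for cheap reindexing

            // Atomic-update alternative (primary fields must be stored="true"):
            // doc.addField("content", java.util.Collections.singletonMap("set", text.toString()));

            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            solr.add(doc);
            solr.commit();
        }
    }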
I extended ExtractingDocumentLoader with this patch and it works.
https://issues.apache.org/jira/secure/attachment/12473188/SOLR-2416_ExtractingDocumentLoader.patch
It iterates through all documents inside the file and extracts the name and
content of each.
Regards,
Sergio
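Not the patch itself, but a sketch of the idea it implements - walk the
archive entry by entry and run Tika on each one (Tika on the classpath
assumed):

    import java.io.*;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class ZipWalk {
        public static void main(String[] args) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            try (ZipInputStream zin = new ZipInputStream(new FileInputStream(args[0]))) {
                ZipEntry entry;
                while ((entry = zin.getNextEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    // Buffer each entry so Tika can't close the underlying zip stream.
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    byte[] b = new byte[8192];
                    for (int n; (n = zin.read(b)) != -1; ) buf.write(b, 0, n);
                    BodyContentHandler text = new BodyContentHandler(-1);
                    parser.parse(new ByteArrayInputStream(buf.toByteArray()),
                                 text, new Metadata(), new ParseContext());
                    System.out.println(entry.getName() + " -> " + text.toString().length() + " chars");
                }
            }
        }
    }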
Hi Sergio,
you either do the stuff on the caller side (which is probably a good idea,
since you off-load the Solr server) or extend the ExtractingRequestHandler.
Cheers,
Siegfried Goeschl
On 27 May 2014, at 10:37, marotosg wrote:
Hi,
Thanks for your answer Alexandre.
I have zip files with only one document inside per zip file. These documents
are mainly PDF, XML, or HTML.
I tried to index the "tini.txt.gz" file, which is located in the trunk and is
used by the extraction tests:
\trunk\solr\contrib\extraction\src\test-files\extraction\tin
A zip file can contain many files and directories in a nested
structure. With files of any type and size.
What would you expect Solr to do facing a generic Zip file?
And what would you like it to do for _your_ - one assumes more
restricted - scenario?
Regards,
Alex.
Personal website: http://
Hi,
If you like, you can open a JIRA issue on this and provide as much info as
possible. Someone can then look into (potential) memory optimization of this
part of the code.
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com
28. sep.
Hi Jan.
Thank you very much for your advice.
So I understand Solr needs more memory to parse the files:
to parse a file of size x, it needs double the memory (2x). How much heap
should I allocate then - 8x? 16x?
Regards,
Shigeki
2012/9/28 Jan Høydahl
Please try to increase -Xmx and see how much RAM you need for it to succeed.
I believe it is simply a case where this particular file needs double memory
(480Mb) to parse and you have only allocated 1Gb (which is not particularly
much). Perhaps the code could be optimized to avoid the Arrays.cop
These are very large files and this is not enough memory. Do you upload these
as files?
If the CSV file is one document per line, you can split it up. Unix has a
'split' command which does this very nicely.
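For example, to break a big CSV into million-line chunks (note that only the
first chunk keeps the header row, so re-add it or pass the field names
explicitly):

    split -l 1000000 huge.csv chunk_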
- Original Message -
| From: "Shigeki Kobayashi"
| To: solr-user@lucene.apach
(Bit off-topic, but...) I understand that Solr isn't meant to 'store'
everything, but because highlighting matches requires a field to be stored,
I would expect most people end up having to store full document content in
their indexes anyway? I can't think of any good workaround for this...
> Solr Cell is great for proof-of-concept, but for heavy-duty
> applications,
> you're offloading all the processing on the Solr server,
> which can be a
> problem.
Good point!
Thank you
I have had good luck with creating a separate core for just the stored data -
a different core from the indexed one.
Very fast.
Bill Bell
Sent from mobile
On Apr 1, 2012, at 11:15 AM, Erick Erickson wrote:
Ahhh, OK. Sure, anything you store in Solr you can get back. The key
is not Tika, but your schema.xml file, and setting 'stored="true" '
bq: So my question was if I can index the original doc via
ExtractingRequestHandler in Solr AND get back the text output, in a single
call.
I know of no way to
Hi Erik,
I think we have some misunderstanding.
I want to index the text of the docs in Solr (only indexed, NOT stored).
But I want the text (Tika output) back for:
* later faster reindexing (some text extraction like OCR takes really long)
* use the text for other processing
The original doc
Yes, you can. But generally, storing the raw input in Solr is
not the best approach. The problem here is that pretty soon
you get a huge index that contains *everything*. Solr was not
intended to be a data store.
Besides, you then need to store the binary form of the file. Solr
only deals with
: indexed file. The CommonsHttpSolrServer sends the parameters as a HTTP
: GET request. Because of that I'll get a "socket write error". If I
: change the CommonsHttpSolrServer to send the parameters as HTTP POST
: sending will work, but the ExtractingRequestHandler will not recognize
: the parame
One solution to this problem is to change the order of field operations
(http://wiki.apache.org/solr/ExtractingRequestHandler#Order_of_field_operations)
to first do fmap.*= processing, then add the fields from literal.*=. (As it
stands, a literal.author=Smith value gets renamed by an fmap.author= mapping,
even though it was named explicitly.) Why would anyone want to rename a field
they have just explicitly named any
Hi Grant,
After comparing the differences between my solrconfig.xml and that used by
the example, the key difference is that I didn't have true in the defaults for the ERH. Commenting out
this line in the example configuration causes the example to display the
same behaviour as I'm seeing.
I've
On Apr 13, 2011, at 12:06 AM, Liam O'Boyle wrote:
> Afternoon,
>
> After an upgrade to Solr 3.1 which has largely been very smooth and
> painless, I'm having a minor issue with the ExtractingRequestHandler.
>
> The problem is that it's inserting metadata into the extracted
> content, as well as
The Tika integration with the DataImportHandler allows you to control
many aspects of what goes into the index, including solving this
problem:
http://wiki.apache.org/solr/TikaEntityProcessor
(Tika is the extraction library, and ExtractingRequestHandler and the
TikaEntityProcessor both use it.)
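A sketch of the relevant data-config.xml - the file path and Solr field names
here are hypothetical; you pick, column by column, what goes into the index:

    <dataConfig>
      <dataSource type="BinFileDataSource"/>
      <document>
        <entity name="doc" processor="TikaEntityProcessor"
                url="/data/docs/report.pdf" format="text">
          <field column="text"   name="content"/>
          <field column="Author" name="author" meta="true"/>
        </entity>
      </document>
    </dataConfig>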
Thanks Lance,
I have lucid's search as one of my open search tools in my browser.
Generally pretty useful (especially the ability to filter) but it's not
of much help when the tool points out that the best info is on the wiki
and the link to the wiki reveals that it can't be reached. This
For future reference, the Solr & Lucene wikis and mailing lists are
indexed on http://www.lucidimagination.com/search/
On Thu, Oct 1, 2009 at 11:40 AM, Tricia Williams
wrote:
If the wiki isn't working
https://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2
gave me more information. The LucidImagination article helps too.
Now that the wiki is up again it is more obvious that I need to add:
fulltext
text
to my solrconfig.xml
Tricia
On 1 Oct 09, at 12:46 PM, Tricia Williams wrote:
STREAM_SOURCE_INFO
appears to be a constant from this page:
http://lucene.apache.org/solr/api/constant-values.html
This has it embedded as an "arr" in the re
stream.file and stream.body
start working everywhere
So wanted to confirm.
> From: gsing...@apache.org
> To: solr-user@lucene.apache.org
> Subject: Re: ExtractingRequestHandler and local files
> Date: Tue, 9 Jun 2009 14:50:43 -0400
>
> I haven't tried it, but I thought th
I haven't tried it, but I thought the enableRemoteStreaming stuff
should work. That stuff is handled by Solr in other places, if I
recall correctly. Have you tried it?
-Grant
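For the record, a sketch of that setup: enable remote streaming in
solrconfig.xml and pass the path as stream.file (the path and literal here
are hypothetical):

    <!-- solrconfig.xml, inside <requestDispatcher> -->
    <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000"/>

    http://localhost:8983/solr/update/extract?stream.file=/data/docs/report.pdf&literal.id=doc1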
On Jun 9, 2009, at 2:28 PM, doraiswamy thirumalai wrote:
Hi,
I would greatly appreciate a quick response to t
Why is this surprising? *Assuming* that the EnglishPorterFilterFactory
doesn't stem "clai" to "cla", this makes perfect sense - especially since
"clai" isn't an English word in the first place.
Or am I missing something?
Have you looked at your index with Luke to see what actually gets placed
in it (i.e. whe
Hi,
Wondering if somebody could help me understand the following behavior:
if I search on a text field with the query "davi cla", it does not yield
any search results; however, if I search for "davi clai", I get 100+
results.
The field I am searching on is a text fi
Well, the problem seems to be with
> java -Dsolr.solr.home="/my/path/to/solr" -jar start.jar
Everything runs fine if I copy my xmls to the original conf directory
of the example (example/solr/conf) and I execute like
> java -jar start.jar
Some wrong path to libs somewhere - who knows. Couldn't find
Thanks for your answers. Still no success.
>> These need to be in your Solr home lib, not example/lib. I sometimes get
>> confused on this one, too, forgetting that I need to go down a few more
>> directories. The example/lib directory is where the Jetty stuff lives,
>> example/solr/lib is the l
I had problems with this when trying to set this up with multiple
cores - I had to set the shared lib (the sharedLib attribute on the <solr>
element) in example/solr/solr.xml in order for it to find the jars in
example/solr/lib.
-Peter
On Wed, Apr 22, 2009 at 11:43 AM, Grant Ingersoll wrote:
On Apr 20, 2009, at 12:46 PM, francisco treacy wrote:
Additionally, here's what I've got in example/lib:
These need to be in your Solr home lib, not example/lib. I sometimes
get confused on this one, too, forgetting that I need to go down a few
more directories. The example/lib director
Additionally, here's what I've got in example/lib:
apache-solr-cell-nightly.jar
apache-solr-core-nightly.jar
bcmail-jdk14-132.jar
bcprov-jdk14-132.jar
commons-lang-2.1.jar
icu4j-3.8.jar
log4j-1.2.14.jar
poi-3.5-beta5.jar
slf4j-api-1.5.5.jar
xml-apis-1.0.b2.jar
common
Hi Grant,
Here is the full stacktrace:
20-Apr-2009 12:36:39 org.apache.solr.common.SolrException log
SEVERE: java.lang.ClassCastException:
org.apache.solr.handler.extraction.ExtractingRequestHandler cannot be
cast to org.apache.solr.request.SolrRequestHandler
at org.apache.solr.core.Requ
Can you give the full stack trace?
On Apr 20, 2009, at 6:49 AM, francisco treacy wrote:
Hi all,
I am unsuccessfully attempting to use the ExtractingRequestHandler
(indexing documents via Tika, Solr cell). I start Solr from the
example app (start.jar), but point to my own Solr conf, where I hav
Can you add the values as literals?
http://wiki.apache.org/solr/ExtractingRequestHandler#head-88b9f55989c9878638e88be5d335b5126550f87c
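That is, something like this (field names hypothetical) - each literal.*
parameter becomes a field value on the extracted document:

    /solr/update/extract?literal.id=doc1&literal.author=Venu&literal.category=report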
On Apr 3, 2009, at 8:29 PM, Venu Mittal wrote:
Hi,
I am using ExtractingRequestHandler to index rich text documents.
The way I am doing it is I get some dat
is not helping much either.
Anyway, I will explore and see if I can come up with anything better (maybe a
separate index for rich text docs).
Thanks,
Venu
From: Jacob Singh
To: solr-user@lucene.apache.org
Sent: Saturday, April 4, 2009 9:59:13 PM
Subject: Re
Hi TIA,
I have the same desired requirement. If you look up in the archives,
you might find a similar thread between myself and the always super
helpful Erik Hatcher. Basically, it can't be done (right now).
You can however use the "ExtractOnly" request handler, and just get
the extracted text
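In Solr 1.4 syntax that's a parameter on the same handler; the extracted text
comes back in the response and nothing is indexed:

    /solr/update/extract?extractOnly=true&extractFormat=text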
: > : If I can find the bandwidth, I'd like to make something which allows
: > : file uploads via the XMLUpdateHandler as well... Do you have any ideas
: >
: > the XmlUpdateRequestHandler already supports file uploads ... all request
: But it doesn't do what Jacob is asking for... he wants (if I
No, I didn't mean storing the binary along with it, just that I could
send a binary file (or a text file) which Tika could process and store
along with the XML which describes its literal meta-data.
Best,
Jacob
On Mon, Dec 15, 2008 at 7:17 PM, Grant Ingersoll wrote:
Hi Erik,
Sorry I wasn't totally clear. Some responses inline:
> If the file is visible from the Solr server, there is no need to actually
> send the bits through HTTP. Solr's content stream capabilities allow a file
> to be retrieved from Solr itself.
>
Yeah, I know. But in my case not possible
Jacob,
Hmmm... seems the wires are still crossed and confusing.
On Dec 15, 2008, at 6:34 AM, Jacob Singh wrote:
Hi Erik,
This is indeed what I was talking about... It could even be handled
via some type of transient file storage system. This might even be
better to avoid the risks associated with uploading a huge file across
a network and might (have no idea) be easier to implement.
So I could send the fi
: If I can find the bandwidth, I'd like to make something which allows
: file uploads via the XMLUpdateHandler as well... Do you have any ideas
the XmlUpdateRequestHandler already supports file uploads ... all request
handlers do using the ContentStream abstraction...
http://wiki.apache
Hey,
thanks! This is good stuff. I didn't expect you to just make the fix!
If I can find the bandwidth, I'd like to make something which allows
file uploads via the XMLUpdateHandler as well... Do you have any ideas
here? I was thinking we could just send the XML payload as another
POST field.
Hi Jacob,
I just updated the code such that it should now be possible to send in
multiple values as literals, as in an HTML form that looks like:
method="POST">
Choose a file to upload:
Cheers,
Grant
On Dec 12, 2008, at 11:53 PM, Jacob Singh wrote:
Hi Grant,
Thanks for the quick response. My Colleague looked into the code a
bit, and I did as well, here is what I see (my Java sucks):
http://svn.apache.org/repos/asf/lucene/solr/trunk/contrib/extraction/src/main/java/org/apache/solr/handler/extraction/SolrContentHandler.java
//handle the lite
Hmmm, I think I see the disconnect, but I'm not sure. Sending to the
ERH (ExtractingRequestHandler) is not an XML command at all; it's a
file upload with multipart encoding. I think you will need an API that does
something like:
(Just making this up, this is not real code)
File file = new File(f
Hi Grant,
Happy to.
Currently we are sending over documents by building a big XML file of
all of the fields of that document. Something like this:
$document = new Apache_Solr_Document();
$document->id = apachesolr_document_id($node->nid);
$document->title = $node->title;
$document->b
On Dec 10, 2008, at 10:21 PM, Jacob Singh wrote:
Hey folks,
I'm looking at implementing ExtractingRequestHandler in the
Apache_Solr_PHP
library, and I'm wondering what we can do about adding meta-data.
I saw the docs, which suggest you use different post headers to
pass field
values alo