Re: Solr Cell, Tika and UpdateProcessorChains

2019-02-21 Thread Erick Erickson
Several things: 1> Please don’t use add-unknown…. It’s fine for prototyping, but it guesses field definitions. 2> The solrconfig appears to be malformed; I’m surprised it fires up at all. This never terminates, for instance:

Re: Solr Cell Input Parameter tika.config

2018-11-07 Thread Jan Høydahl
The tika.config param is documented here: https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler I notice that the code (https://github.com/apache/lucene-solr/blob/964cc88cee7d62edf03a923e3217809d630af5d5/solr/co

Re: Solr Cell Input Parameter tika.config

2018-10-25 Thread Yasufumi Mizoguchi
Hello, I could not find the code that parses the tika.config parameter from the Solr request. Maybe the tika.config parameter can only be defined in solrconfig.xml, as follows. tika-config.xml true ignored_ true links ignored_ Thanks, Yasufumi On Fri, Oct 26, 2018 at 7:07, Robertson

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Rahul Singh
Lucene (the major underlying tech in Solr) can handle any data, but it’s optimized to be an index, not a file store. Better to put that in another DB or file system like Cassandra, S3, etc. (better than Solr). In our experience, leveraging the tika binary / microservice as a pre-index proce

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Shawn Heisey
On 4/25/2018 4:02 AM, Lee Carroll wrote: *We don't recommend using solr-cell for production indexing.* Ok. Are the reasons for: Performance. I think we have rather modest index requirement (1000 a day... on a busy day) Security. The index workflow is, upload files to public facing server w

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Lee Carroll
> *That's not usually the kind of information you want to have in a Solr index. Most of the time, there will be an entry in the Solr index that tells the system making queries how to locate the actual data -- a filename, a URL, a database lookup key, etc.* Agreed. The app will have a

Re: solr cell: write entire file content binary to index along with metadata

2018-04-24 Thread Shawn Heisey
On 4/24/2018 10:26 AM, Lee Carroll wrote: > Does the solr cell contrib give access to the file's raw content along with > the extracted metadata? That's not usually the kind of information you want to have in a Solr index. Most of the time, there will be an entry in the Solr index that tells the

Re: Solr Cell Tika - date.formats

2014-05-28 Thread ienjreny
Thanks for your fast answer. On Wed, May 28, 2014 at 11:23 PM, Jack Krupansky-2 [via Lucene] < ml-node+s472066n4138505...@n3.nabble.com> wrote: > Pass multiple instances of the date.formats parameter: > > http://server:port/solr/update/extract?date.formats=yyyy-MM-dd'T'HH:mm:ss'Z'&date.formats

Re: Solr Cell Tika - date.formats

2014-05-28 Thread Jack Krupansky
Pass multiple instances of the date.formats parameter: http://server:port/solr/update/extract?date.formats=yyyy-MM-dd'T'HH:mm:ss'Z'&date.formats=yyyy-MM-dd'T'HH:mm:ss But as the doc says, it comes preconfigured with all these formats: yyyy-MM-dd'T'HH:mm:ss'Z' yyyy-MM-dd'T'HH:mm:ss yyyy-MM-dd yy
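
A minimal sketch of building such a URL programmatically, in Python with only the standard library. The host, port, and the helper name `build_extract_url` are assumptions for illustration; only the `date.formats` parameter and the `/solr/update/extract` path come from the message above. The format strings are written with a leading `yyyy` year pattern, as in Java's SimpleDateFormat.

```python
from urllib.parse import urlencode

def build_extract_url(base_url, date_formats):
    """Return an extract URL carrying one date.formats entry per format."""
    # A list of (key, value) pairs lets the same key appear several times.
    pairs = [("date.formats", fmt) for fmt in date_formats]
    return base_url + "?" + urlencode(pairs)

url = build_extract_url(
    "http://localhost:8983/solr/update/extract",  # assumed host and port
    ["yyyy-MM-dd'T'HH:mm:ss'Z'", "yyyy-MM-dd'T'HH:mm:ss"],
)
print(url)
```

Passing a list of pairs rather than a dict is the point here: a dict would keep only one value per key, while the handler expects the parameter to be repeated.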

Re: Solr Cell Question

2013-09-09 Thread Jamie Johnson
Thanks Erick, This is how I was doing it but when I saw the Solr Cell stuff I figured I'd give it a go. What I ended up doing is the following ModifiableSolrParams params = indexer.index(artifact); params.add("fmap.content", "my_custom_field"); params.add("extractFormat", "text"); ContentS

Re: Solr Cell Question

2013-09-06 Thread Erick Erickson
It's always frustrating when someone replies with "Why not do it a completely different way?". But I will anyway :). There's no requirement at all that you send things to Solr to make Solr Cell (aka Tika) do its tricks. Since you're already in SolrJ anyway, why not just parse on the client? This
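
A rough sketch of the "parse on the client" idea, in Python rather than SolrJ. Everything named here is an assumption for illustration: `extract_text` is a hypothetical stand-in for a real parser such as Tika, and the field names `id`, `content`, and `source` are invented. The sketch only builds the documents that a client would then send to Solr; it does not talk to a server.

```python
import json

def extract_text(raw_bytes):
    """Hypothetical stand-in for a real parser such as Tika: decode plain text."""
    return raw_bytes.decode("utf-8", errors="replace")

def build_solr_doc(doc_id, raw_bytes, extra_fields=None):
    """Build one document dict on the client, ready to be sent to Solr."""
    doc = {"id": doc_id, "content": extract_text(raw_bytes)}
    # Metadata gathered on the client is merged in as ordinary fields.
    doc.update(extra_fields or {})
    return doc

docs = [build_solr_doc("doc1", b"hello from the client", {"source": "example"})]
payload = json.dumps(docs)  # a JSON array of documents
print(payload)
```

With this split, the server only indexes ready-made fields, and a parser that hangs or runs out of memory takes down the client job rather than the search server.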

Re: solr cell

2013-03-15 Thread Arcadius Ahouansou
Another option similar to this would be the new file system WatchService available in Java 7: http://docs.oracle.com/javase/tutorial/essential/io/notification.html Arcadius. On 15 March 2013 15:22, Michael Della Bitta wrote: > Niklas, > > In Linux, the API for watching for filesystem changes i

Re: solr cell

2013-03-15 Thread Jack Krupansky
Take a look at ManifoldCF, which has a file system crawler which can track changed files. -- Jack Krupansky -Original Message- From: Niklas Langvig Sent: Friday, March 15, 2013 11:10 AM To: solr-user@lucene.apache.org Subject: solr cell We have all our documents (doc, docx, pdf) on a

Re: solr cell

2013-03-15 Thread Michael Della Bitta
Niklas, In Linux, the API for watching for filesystem changes is called inotify. You'd need to write something to listen to those events and react accordingly. Here's a brief discussion about it: http://stackoverflow.com/questions/4062806/inotify-how-to-use-it-linux Michael Della Bitta ---
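
For comparison, a small sketch of detecting changed files by polling, in Python with only the standard library. This is not inotify, WatchService, or ManifoldCF; it is a simpler snapshot-and-compare technique swapped in for illustration, and the helper names `snapshot` and `changed_files` are hypothetical.

```python
import os
import tempfile

def snapshot(directory):
    """Map each regular file in `directory` to its (mtime_ns, size)."""
    state = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            info = entry.stat()
            state[entry.name] = (info.st_mtime_ns, info.st_size)
    return state

def changed_files(before, after):
    """Return names that are new or whose mtime/size differ between snapshots."""
    return sorted(name for name, sig in after.items() if before.get(name) != sig)

# Demonstration in a throwaway directory: one file appears between snapshots.
with tempfile.TemporaryDirectory() as tmp:
    before = snapshot(tmp)
    with open(os.path.join(tmp, "report.docx"), "wb") as handle:
        handle.write(b"placeholder")
    after = snapshot(tmp)
    new_or_changed = changed_files(before, after)

print(new_or_changed)
```

Polling is cruder than event-based watching and misses nothing only if it runs often enough, but it needs no platform-specific API; the files it reports are the ones a script would then re-submit for indexing.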

Re: Re: Re: Solr Cell Questions

2012-09-25 Thread Erick Erickson
> http://wiki.apache.org/solr/ExtractingRequestHandler#SolrJ > > > Erick Erickson wrote on 25.09.2012 15:47:34: > >> From: >> >> Erick Erickson >> >> To: >> >> solr-user@lucene.apache.org >> >> Date: >> >> 25.09.2012 15:48 >>

Re: Solr Cell Questions

2012-09-25 Thread Jack Krupansky
a separate process) to minimize thread issues, GC issues, hung parsers, etc. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Tuesday, September 25, 2012 10:24 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cell Questions Are you by any chance committing

Re: Solr Cell Questions

2012-09-25 Thread Alexandre Rafalovitch
Are you by any chance committing after every file being indexed? That could cause the speed issues. Also, have you tried to optimize your indexer's Java memory params? I use this for mine, which used to run out of memory as well: java -server -Xms512m -Xmx2048m Regards, Alex. P.s. I may have so
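
To illustrate the commit-frequency point, a minimal Python sketch that plans one request per file but asks for a commit only every few files and once at the end. The helper name `plan_extract_requests` and the batch size are assumptions; the sketch only prepares parameters and does not contact Solr.

```python
def plan_extract_requests(filenames, commit_every=100):
    """Pair each file with request parameters, committing only periodically."""
    plans = []
    last = len(filenames) - 1
    for i, name in enumerate(filenames):
        # Commit after every `commit_every` files and once more at the very end.
        commit = (i + 1) % commit_every == 0 or i == last
        plans.append((name, {"commit": "true" if commit else "false"}))
    return plans

plans = plan_extract_requests(
    ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], commit_every=2
)
commits = [name for name, params in plans if params["commit"] == "true"]
print(commits)  # ['b.pdf', 'd.pdf', 'e.pdf']
```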

Antwort: Re: Re: Solr Cell Questions

2012-09-25 Thread Johannes . Schwendinger
2 15:47:34: > From: > > Erick Erickson > > To: > > solr-user@lucene.apache.org > > Date: > > 25.09.2012 15:48 > > Subject: > > Re: Re: Solr Cell Questions > > bq: how many documents per minute, second, what ever can i put into solr >

Re: Re: Solr Cell Questions

2012-09-25 Thread Erick Erickson
ings? > > > Best > Johannes > > Erick Erickson wrote on 25.09.2012 00:22:26: > >> From: >> >> Erick Erickson >> >> To: >> >> solr-user@lucene.apache.org >> >> Date: >> >> 25.09.2012 00:23 >> >> Bet

Antwort: Re: Solr Cell Questions

2012-09-25 Thread Johannes . Schwendinger
kson > > To: > > solr-user@lucene.apache.org > > Date: > > 25.09.2012 00:23 > > Subject: > > Re: Solr Cell Questions > > If you're concerned about throughput, consider moving all the > SolrCell (Tika) processing off the server. SolrC

Re: Solr Cell Questions

2012-09-24 Thread Erick Erickson
If you're concerned about throughput, consider moving all the SolrCell (Tika) processing off the server. SolrCell is way cool for showing what can be done, but its downside is you're moving all the processing of the structured documents to the same machine doing the indexing. Pretty soon, especiall

Re: Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
In case the exact problem was not clear to somebody: The problem with FileUpload interpreting file data as regular form fields is that Solr thinks there are no content streams in the request and throws a "missing_content_stream" exception. On Thu, Mar 10, 2011 at 10:59 AM, Karthik Shiraly < karth

Re: Solr Cell and encrypted pdf files

2010-05-26 Thread Yiannis Pericleous
I've opened an issue and submitted a patch https://issues.apache.org/jira/browse/SOLR-1929 Chris Hostetter wrote: : I can't seem to get solr cell to index password protected pdf files. : I can't figure out how to pass the password to tika and looking at : ExtractingDocumentLoader, : it doesn't

Re: Solr Cell and encrypted pdf files

2010-05-25 Thread Chris Hostetter
: I can't seem to get solr cell to index password protected pdf files. : I can't figure out how to pass the password to tika and looking at : ExtractingDocumentLoader, : it doesn't seem to pass any pdf password related metadata to the tika parser. I suspect you are correct, i don't think anyone h

Re: Solr Cell. Seems to be only indexing the first N bytes of a text file.

2010-03-20 Thread Ross
Thanks Erick. That was it. All looking good now. Cheers Ross On Sat, Mar 20, 2010 at 9:29 PM, Erick Erickson wrote: > Does your solrconfig file have a line like... > 1 > ? > > Try upping the 1... > > HTH > Erick > > On Sat, Mar 20, 2010 at 8:40 PM, Ross wrote: > >> Hi all >> >> I'm tr

Re: Solr Cell. Seems to be only indexing the first N bytes of a text file.

2010-03-20 Thread Erick Erickson
Does your solrconfig file have a line like... 1 ? Try upping the 1... HTH Erick On Sat, Mar 20, 2010 at 8:40 PM, Ross wrote: > Hi all > > I'm trying to index some text files using Solr Cell. I'm using the > schema from Avi Rappoport's tutorial about indexing html and text > files altho

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-02 Thread Bill Engle
Thanks for the responses. This is exactly what I had to resort to. I will definitely put in a feature request to get the generated ID back from the extract request. I am doing this with PHP cURL for extraction and pecl php solr for querying. I am then saving the unique id and dupe hash in a MyS

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter
: To quote from the wiki, ... That's all true ... but Bill explicitly said he wanted to use SignatureUpdateProcessorFactory to generate a uniqueKey from the content field post-extraction so he could dedup documents with the same content ... his question was how to get that key after ad

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Lance Norskog
To quote from the wiki, http://wiki.apache.org/solr/ExtractingRequestHandler curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F "myfi...@tutorial.html" This runs the extractor on your input file (in this case an HTML file). It then stores the generated document with t
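
A small Python sketch of how a client might assemble the `literal.*` parameters shown in the wiki example. The helper name `extract_url_with_literals` is hypothetical; `literal.id` and `commit=true` are taken from the quoted curl command, and the sketch only builds the URL without uploading a file.

```python
from urllib.parse import urlencode

def extract_url_with_literals(base_url, literals, commit=True):
    """Build an extract URL that sets each given field via literal.<field>."""
    params = [("literal." + field, value) for field, value in literals.items()]
    if commit:
        params.append(("commit", "true"))
    return base_url + "?" + urlencode(params)

url = extract_url_with_literals(
    "http://localhost:8983/solr/update/extract", {"id": "doc1"}
)
print(url)  # http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true
```

Because the client chooses the id before sending the file, it already knows which document to look up afterwards.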

Re: Solr Cell and Deduplication - Get ID of doc

2010-03-01 Thread Chris Hostetter
: You could create your own unique ID and pass it in with the : literal.field=value feature. By which Lance means you could specify a unique value in a different field from your uniqueKey field, and then query on that field:value pair to get the doc after it's been added -- but that query will

Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Lance Norskog
You could create your own unique ID and pass it in with the literal.field=value feature. http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters On Fri, Feb 26, 2010 at 7:56 AM, Bill Engle wrote: > Any thoughts on this? I would like to get the id back in the request after > indexin

Re: Solr Cell and Deduplication - Get ID of doc

2010-02-26 Thread Bill Engle
Any thoughts on this? I would like to get the id back in the request after indexing. My initial thoughts were to do a search to get the docid based on the attr_stream_name after indexing but now that I reread my message I mentioned the attr_stream_name (file_name) may be different so that is unre

Re: Solr Cell RTF Woes

2010-02-26 Thread Bill Engle
Thanks. Headless put me in the right direction. I am running on a headless Mac OSX 10.6 Server. I added the below to my {CATALINA_HOME}/bin/setenv.sh file and now I am indexing RTF. export JAVA_OPTS="-d64 -server -Xmx1024m -XX:MaxPermSize=512m -Djava.awt.headless=true -Dsun.lang.ClassLoader.al

RE: Solr Cell RTF Woes

2010-02-26 Thread David.Dankwerth
Are you running on a Linux/Unix box that has no X? Did you try with headless options? http://java.sun.com/developer/technicalArticles/J2SE/Desktop/headless/ Tika's RTF parser uses Swing and AWT to analyze the RTF; these in turn will attempt to use graphics libraries unless you use headless mode. -

Re: Solr Cell RTF Woes

2010-02-25 Thread Lance Norskog
Ha! http://issues.apache.org/jira/browse/TIKA-282 You're running this on a headless machine and the RTF parser demands an X window. On Thu, Feb 25, 2010 at 11:08 AM, Bill Engle wrote: > Any RTF file I tried to index in Solr 1.4 throws these errors out.  I have > no issues with doc, pdf.  Any th

Re: Solr Cell - PDFs plus literal metadata - GET or POST ?

2010-01-06 Thread Ross
ap.content=attr_content&commit=true"; -F "myfi...@tutorial.html" -F "literal.mydata= > -Original Message- > From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] > Sent: Monday, January 04, 2010 4:28 AM > To: solr-user@lucene.apache.org > Subject

RE: Solr Cell - PDFs plus literal metadata - GET or POST ?

2010-01-05 Thread Giovanni Fernandez-Kincade
-Original Message- From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] Sent: Monday, January 04, 2010 4:28 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cell - PDFs plus literal metadata - GET or POST ? On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote: > Hi all > >

Re: Solr Cell - PDFs plus literal metadata - GET or POST ?

2010-01-04 Thread Shalin Shekhar Mangar
On Wed, Dec 30, 2009 at 7:49 AM, Ross wrote: > Hi all > > I'm experimenting with Solr. I've successfully indexed some PDFs and > all looks good but now I want to index some PDFs with metadata pulled > from another source. I see this example in the docs. > > curl " > http://localhost:8983/solr/upd

Re: Re: Solr Cell and Spellchecking.

2009-12-09 Thread boyleme
I just resolved the issue (fresh coffee == good) ! In my schema, I had added: but missed the copyField definition. Adding these: and a restart and everything is working properly. Thanks for the reply and for LucidImagination -- the only reason I have been able to get Solr integrated int

Re: Solr Cell and Spellchecking.

2009-12-09 Thread Grant Ingersoll
What do your schema and your config look like for the various relevant pieces? On Dec 8, 2009, at 8:04 PM, Michael Boyle wrote: > Following Erik Hatcher's post about using SolrCell and acts_as_solr { > http://www.lucidimagination.com/blog/2009/02/17/acts_as_solr_cell/ }, I have > been able to in

RE: Solr Cell text extraction - non-issue

2009-11-20 Thread Ian Smith
Sorry guys, the bad request seemed to be caused elsewhere, no need to URL encode now. Ian. -Original Message- From: Ian Smith [mailto:ian.sm...@gossinteractive.com] Sent: 20 November 2009 15:26 To: solr-user@lucene.apache.org Subject: Solr Cell text extraction Hi Guys, I am trying to us

Re: Solr Cell on web-based files?

2009-11-02 Thread Alexey Serba
> e.g (doesn't work) > curl http://localhost:8983/solr/update/extract?extractOnly=true > --data-binary @http://myweb.com/mylocalfile.htm -H "Content-type:text/html" > You might try remote streaming with Solr (see > http://wiki.apache.org/solr/SolrConfigXml). Yes, curl example curl 'http://local
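
A hedged sketch of the remote streaming approach in Python: instead of uploading the bytes, the request names the remote document and Solr fetches it. This assumes remote streaming is enabled in solrconfig.xml and that the `stream.url` request parameter is the one used for it; the helper name `remote_stream_url` is hypothetical. The sketch only builds the URL.

```python
from urllib.parse import urlencode

def remote_stream_url(solr_extract_url, document_url, extract_only=True):
    """Build an extract URL that asks Solr to fetch `document_url` itself."""
    # urlencode percent-escapes the nested URL so it survives as one value.
    params = {"stream.url": document_url}
    if extract_only:
        params["extractOnly"] = "true"
    return solr_extract_url + "?" + urlencode(params)

url = remote_stream_url(
    "http://localhost:8983/solr/update/extract",
    "http://myweb.com/mylocalfile.htm",
)
print(url)
```

Escaping the inner URL matters: passed unescaped, its own `?` and `&` characters would be read as part of the outer query string.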

Re: Solr Cell on web-based files?

2009-10-31 Thread Yonik Seeley
On Sat, Oct 31, 2009 at 12:52 PM, Insight 49, LLC wrote: > Is local file URIs a limitation of solr cell, or just curl; All of Solr's interfaces are currently based on HTTP and usable over a network. Curl (like wget) is simply a useful command line tool that can speak HTTP and is nice for testing.

Re: Solr Cell on web-based files?

2009-10-31 Thread Insight 49, LLC
markus.rietz...@rzf.fin-nrw.de wrote: curl reads from local file or stdin, so you could do something like if it only a single file from a webserver curl http://someserver/file.html/ | curl "http://localhost:8983/solr/update/extract?extractOnly=true"; -F na...@- but this way no crawling, no

Re: Solr Cell on web-based files?

2009-10-27 Thread Insight 49, LLC
Andrzej Bialecki wrote: Grant Ingersoll wrote: You might try remote streaming with Solr (see http://wiki.apache.org/solr/SolrConfigXml). Otherwise, look into a crawler such as Nutch or Droids or Heritrix. Additionally, Nutch can be configured to send the crawled/parsed documents to Solr for

Re: Solr Cell on web-based files?

2009-10-27 Thread Andrzej Bialecki
Grant Ingersoll wrote: You might try remote streaming with Solr (see http://wiki.apache.org/solr/SolrConfigXml). Otherwise, look into a crawler such as Nutch or Droids or Heritrix. Additionally, Nutch can be configured to send the crawled/parsed documents to Solr for indexing. -- Best reg

Re: Solr Cell on web-based files?

2009-10-27 Thread Grant Ingersoll
You might try remote streaming with Solr (see http://wiki.apache.org/solr/SolrConfigXml ). Otherwise, look into a crawler such as Nutch or Droids or Heritrix. -Grant On Oct 27, 2009, at 11:14 AM, Insight 49, LLC wrote: Hi, If I use the ExtractingRequestHandler

Re: solr cell/tika: pdf import with xml metatags

2009-10-27 Thread Grant Ingersoll
On Oct 27, 2009, at 6:36 AM, > wrote: hi, we want to use SOLR as our intranet search engine. i downloaded the nightly build of solr 1.4. pdf extraction is done via Solr Cell/Tika. i can send the pdf via curl to solr. we do have a large set of meta-tags to all our intranet documents, includ

Re: Solr Cell

2009-07-23 Thread Matt Weber
Found my own answer, use the literal parameter. Should have dug around before asking. Sorry. Thanks, Matt Weber eSr Technologies http://www.esr-technologies.com On Jul 23, 2009, at 2:26 PM, Matt Weber wrote: Is it possible to supply additional metadata along with the binary file when us

Re: Solr Cell (ExtractingRequestHandler) and plain text files

2009-02-10 Thread Erik Hatcher
On Feb 10, 2009, at 10:57 AM, Grant Ingersoll wrote: So, this seems to be an issue with Tika and its MIME type detection of plain text. For some discussion on it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files and also http

Re: Solr Cell (ExtractingRequestHandler) and plain text files

2009-02-10 Thread Grant Ingersoll
So, this seems to be an issue with Tika and its MIME type detection of plain text. For some discussion on it, see http://www.lucidimagination.com/search/document/64e27546d23e67b9/mime_type_identification_of_plain_text_files and also https://issues.apache.org/jira/browse/TIKA-154, which has

Re: Solr Cell (ExtractingRequestHandler) and plain text files

2009-02-10 Thread Grant Ingersoll
OK, I have reproduced this. Let me debug for a moment and then we can likely file a JIRA On Feb 9, 2009, at 10:17 PM, Erik Hatcher wrote: One other person has reported this to me off-list, and I just encountered it myself. ExtractingRequestHandler does not handle plain text files properl

Re: Solr Cell (ExtractingRequestHandler) and plain text files

2009-02-09 Thread Erik Hatcher
And yes, the file does have textual content :) And I tried both ext.resource.name and stream.contentType to no avail. Erik On Feb 9, 2009, at 10:17 PM, Erik Hatcher wrote: One other person has reported this to me off-list, and I just encountered it myself. ExtractingRequestHandler d