Re: Term Freq Vector with SOLR cell?

2019-05-01 Thread Erik Hatcher
q=doc_content? Try q=id:"" Solr Cell and DIH are comparable (in that they are about getting content into Solr) but "unrelated" to TVRH. TVRH is about inspecting indexed content, regardless of how it got in. Erik > On May 1, 2019, at 3:14 PM, Geoffrey Will
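
For readers following this thread, here is a minimal Python sketch of the kind of request Erik seems to be describing: asking the term vector handler about one already-indexed document by id. The core name (mycore), the document id (doc1) and the /tvrh handler path are assumptions for illustration; the field name doc_content comes from the snippet above, and tv.fl, tv.tf and tv.positions are standard Term Vector Component parameters. The function only builds the URL; it does not contact Solr.

```python
from urllib.parse import urlencode


def tvrh_url(base, core, doc_id, field):
    """Build a term-vector request URL for a single document id.

    base, core and doc_id are hypothetical values supplied by the caller;
    nothing here is specific to the poster's actual setup.
    """
    params = {
        "q": f'id:"{doc_id}"',   # select the document by id, not by content
        "tv.fl": field,          # field whose term vectors should be returned
        "tv.tf": "true",         # include term frequencies
        "tv.positions": "true",  # include term positions
        "wt": "json",
    }
    return f"{base}/{core}/tvrh?{urlencode(params)}"


if __name__ == "__main__":
    print(tvrh_url("http://localhost:8983/solr", "mycore", "doc1", "doc_content"))
```

The field must have been indexed with termVectors (and termPositions) enabled for the response to contain anything, which matches the schema changes mentioned in the original question.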

Term Freq Vector with SOLR cell?

2019-05-01 Thread Geoffrey Willis
I am using Solr in a web app to extract text from .pdf, and docx files. I was wondering if I can access the TermFreq and TermPosition vectors via the HTTP interface exposed by Solr Cell. I’m posting/getting documents fine, I’ve enabled the TV, TFV etc in the managed schema: http://localhost

Re: Solr Cell, Tika and UpdateProcessorChains

2019-02-21 Thread Erick Erickson
Several things: 1> Please don’t use add-unknown…. It’s fine for prototyping, but guesses field definitions. 2> the solrconfig appears to be malformed, I’m surprised it fires up at all. This never terminates for instance:

Solr Cell, Tika and UpdateProcessorChains

2019-02-21 Thread Demian Katz
able to point her in the right direction more quickly than I can. Here is her original inquiry: I am pulling data from a local drive for indexing. I am using solr cell and tika in schemaless mode. I am attempting to rewrite certain field information prior to indexing using

Re: Solr Cell Input Parameter tika.config

2018-11-07 Thread Jan Høydahl
The tika.config param is documented here: https://lucene.apache.org/solr/guide/7_5/uploading-data-with-solr-cell-using-apache-tika.html#configuring-the-solr-extractingrequesthandler I notice that the code (https://github.com/apache/lucene-solr/blob/964cc88cee7d62edf03a923e3217809d630af5d5/solr

Re: Solr Cell Input Parameter tika.config

2018-10-25 Thread Yasufumi Mizoguchi
Robertson, Eric J : > Hello all, > > Currently trying to define a tika config to use when posting a pdf to Solr > Cell as we may want to override the default tika configuration depending on > type of document being ingested. > > In the docs it lists tika.config as an input param

Solr Cell Input Parameter tika.config

2018-10-25 Thread Robertson, Eric J
Hello all, Currently trying to define a tika config to use when posting a pdf to Solr Cell as we may want to override the default tika configuration depending on type of document being ingested. In the docs it lists tika.config as an input parameter to the Solr Cell endpoint. Though in my

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Rahul Singh
process can improve the overall stability of the Solr service. -- Rahul Singh rahul.si...@anant.us Anant Corporation On Apr 25, 2018, 12:49 PM -0400, Shawn Heisey , wrote: > On 4/25/2018 4:02 AM, Lee Carroll wrote: > > *We don't recommend using solr-cell for production indexing.* >

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Shawn Heisey
On 4/25/2018 4:02 AM, Lee Carroll wrote: *We don't recommend using solr-cell for production indexing.* Ok. Are the reasons for: Performance. I think we have rather modest index requirement (1000 a day... on a busy day) Security. The index workflow is, upload files to public facing s

Re: solr cell: write entire file content binary to index along with metadata

2018-04-25 Thread Lee Carroll
Agreed. The app will have a few implementations for storing the binary file. Easiest for a user to configure for proto-typing would be store in index impl. A live impl would probably be fs *We don't recommend using solr-cell for production indexing.* Ok. Are the reasons for: Performa

Re: solr cell: write entire file content binary to index along with metadata

2018-04-24 Thread Shawn Heisey
On 4/24/2018 10:26 AM, Lee Carroll wrote: > Does the solr cell contrib give access to the file's raw content along with > the extracted metadata? That's not usually the kind of information you want to have in a Solr index. Most of the time, there will be an entry in the Solr index

solr cell: write entire file content binary to index along with metadata

2018-04-24 Thread Lee Carroll
Does the solr cell contrib give access to the file's raw content along with the extracted metadata? cheers Lee C

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-12 Thread Allison, Timothy B.
Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ? As a bonus here's a Dropwizard Tika wrapper that gives you a Tika web service https://github.com/mattflax/dropwizard-tika-server written by a colleague of mine at Flax. Hope this is

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-10 Thread David Hastings
includes Charlie's advice and > the link to Erick's blog post whenever Tika is used. 😊 > > > > > > -Original Message- > > From: Charlie Hull [mailto:char...@flax.co.uk] > > Sent: Monday, April 9, 2018 12:44 PM > > To: solr-user@lucene.apac

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-10 Thread Alexandre Rafalovitch
's blog post whenever Tika is used. 😊 > > > -Original Message- > From: Charlie Hull [mailto:char...@flax.co.uk] > Sent: Monday, April 9, 2018 12:44 PM > To: solr-user@lucene.apache.org > Subject: Re: How to use Tika (Solr Cell) to extract content from HTML > document instead of Solr

RE: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Oh this is great! Saves me a whole bunch of manual work. Thanks! -Original Message- From: Charlie Hull [mailto:char...@flax.co.uk] Sent: Monday, April 09, 2018 2:15 PM To: solr-user@lucene.apache.org Subject: [EXT] Re: How to use Tika (Solr Cell) to extract content from HTML document

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
> I will integrate Tika in my Java app and use SolrJ to send data to Solr. > > > -Original Message- > From: Allison, Timothy B. [mailto:talli...@mitre.org] > Sent: Monday, April 09, 2018 11:24 AM > To: solr-user@lucene.apache.org > Subject: [EXT] RE: How to use Tika (So

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Thank you Charlie, Tim. I will integrate Tika in my Java app and use SolrJ to send data to Solr. -Original Message- From: Allison, Timothy B. [mailto:talli...@mitre.org] Sent: Monday, April 09, 2018 11:24 AM To: solr-user@lucene.apache.org Subject: [EXT] RE: How to use Tika (Solr Cell

RE: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Allison, Timothy B.
2018 12:44 PM To: solr-user@lucene.apache.org Subject: Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ? I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent i

Re: How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Charlie Hull
I'd recommend you run Tika externally to Solr, which will allow you to catch this kind of problem and prevent it bringing down your Solr installation. Cheers Charlie On 9 April 2018 at 16:59, Hanjan, Harinder wrote: > Hello! > > Solr (i.e. Tika) throws a "zip bomb" exception with certain docum
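
As an illustration of what "running Tika externally" could look like, here is a minimal Python sketch that sends a file's bytes to a standalone Tika server and gets plain text back, so that a parser failure never happens inside the Solr JVM. It assumes a tika-server instance is already running on its default port 9998 at localhost; the function names and the timeout value are hypothetical, and this is not the approach any particular poster used (Harinder, for instance, mentions embedding Tika in a Java app with SolrJ).

```python
import urllib.request


def tika_text_request(data, server="http://localhost:9998"):
    """Build a PUT request asking a Tika server for plain-text extraction.

    The server URL is an assumption; adjust it to wherever Tika runs.
    """
    return urllib.request.Request(
        f"{server}/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for extracted text only
    )


def extract_text(data, server="http://localhost:9998", timeout=30):
    """Send the document bytes and return the extracted text.

    Any exception (timeout, HTTP error, parser failure) surfaces here,
    in the client, where it can be logged and the document skipped.
    """
    request = tika_text_request(data, server)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

The extracted text would then be sent to Solr as an ordinary field value through the regular update handler, which keeps extraction problems such as the "zip bomb" exception away from the search service.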

How to use Tika (Solr Cell) to extract content from HTML document instead of Solr's MostlyPassthroughHtmlMapper ?

2018-04-09 Thread Hanjan, Harinder
Hello! Solr (i.e. Tika) throws a "zip bomb" exception with certain documents we have in our Sharepoint system. I have used the tika-app.jar directly to extract the document in question and it does _not_ throw an exception and extracts the contents just fine. So it would seem Solr is doing someth

Re: Issue with Solr Cell mixing metadata and content together

2017-12-21 Thread Phillip Rhodes
Solr, using the >> ExtractingRequestHandler. Basically, when indexing a PDF (for >> example) I get all the metadata mixed into the "content" field along >> with the content. See: >> <https://stackoverflow.com/questions/47934257/importing-files-with-

Re: Issue with Solr Cell mixing metadata and content together

2017-12-21 Thread Erick Erickson
eld along > with the content. See: > <https://stackoverflow.com/questions/47934257/importing-files-with-solr-cell-tika-is-mixing-metadata-fields-with-content> > for the gory details. > > I'm guessing this is the same basic issue as > <https://issues.apache.org/jira/browse/SOLR-9178>

Issue with Solr Cell mixing metadata and content together

2017-12-21 Thread Phillip Rhodes
Hi all, I have been having an issue with Solr, using the ExtractingRequestHandler. Basically, when indexing a PDF (for example) I get all the metadata mixed into the "content" field along with the content. See: <https://stackoverflow.com/questions/47934257/importing-files-with-solr

Re: Recursive archive indexing using solr cell

2017-12-13 Thread Erick Erickson
uly wrote: > Hello, > > I have been successfully able to index archive files (zip, tar, and the > like) using solr cell, but the archive is returned as a single document > when I do queries. Is there a way to configure it so that files are > extracted recursively, and indexed sepa

Recursive archive indexing using solr cell

2017-12-13 Thread Sean Gilhuly
Hello, I have been successfully able to index archive files (zip, tar, and the like) using solr cell, but the archive is returned as a single document when I do queries. Is there a way to configure it so that files are extracted recursively, and indexed separately? I know that if I set the

Re: Import html data in mysql and map schemas using only Solr CELL+TIKA+DIH [scottchu]

2016-05-20 Thread Siddhartha Singh Sandhu
You will have to configure your schema.xml in Solr. What version are you using? On Fri, May 20, 2016 at 2:17 AM, scott.chu wrote: > > I have a mysql table with over 300M blog articles. The records are in html > format. Is it possible to import these records using only Solr > CELL

Import html data in mysql and map schemas using only Solr CELL+TIKA+DIH [scottchu]

2016-05-19 Thread scott.chu
I have a mysql table with over 300M blog articles. The records are in html format. Is it possible to import these records using only Solr CELL+TIKA+DIH to some Solr with schema? I mean when importing, I can map schema on mysql to schema in Solr? scott.chu,scott@udngroup.com 2016/5/20 (Fri)

Re: Solr Cell Tika - date.formats

2014-05-28 Thread ienjreny
these formats: > > yyyy-MM-dd'T'HH:mm:ss'Z' > yyyy-MM-dd'T'HH:mm:ss > yyyy-MM-dd > yyyy-MM-dd hh:mm:ss > yyyy-MM-dd HH:mm:ss > EEE MMM d hh:mm:ss z yyyy > EEE, dd MMM yyyy HH:mm:ss zzz > EEEE, dd-MMM-yy HH:mm:ss zzz > EEE MMM d HH:mm:ss yyyy

Re: Solr Cell Tika - date.formats

2014-05-28 Thread Jack Krupansky
HH:mm:ss'Z' yyyy-MM-dd'T'HH:mm:ss yyyy-MM-dd yyyy-MM-dd hh:mm:ss yyyy-MM-dd HH:mm:ss EEE MMM d hh:mm:ss z yyyy EEE, dd MMM yyyy HH:mm:ss zzz EEEE, dd-MMM-yy HH:mm:ss zzz EEE MMM d HH:mm:ss yyyy See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+

Solr Cell Tika - date.formats

2014-05-28 Thread ienjreny
'HH:mm:ss Thanks in advance -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-Cell-Tika-date-formats-tp4138478.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: Using Solr Cell to index the internal structure of a PDF

2013-10-10 Thread Furkan KAMACI
You can have a look here: http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/ 2013/10/10 Peter Bleackley > I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can > get Solr to ingest the entire document as one long string, stored in the > index

Using Solr Cell to index the internal structure of a PDF

2013-10-10 Thread Peter Bleackley
I'm trying to index a set of PDF documents with Solr 4.5.0. So far I can get Solr to ingest the entire document as one long string, stored in the index as "content". However, I want to index structure within the documents. I know that the ExtractingRequestHandler uses Apache Tika to convert the

Re: Solr Cell Question

2013-09-09 Thread Jamie Johnson
Thanks Erick, This is how I was doing it but when I saw the Solr Cell stuff I figured I'd give it a go. What I ended up doing is the following ModifiableSolrParams params = indexer.index(artifact); params.add("fmap.content", "my_custom_field"); params.a
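
The parameters discussed in this thread can also be sketched outside SolrJ. The Python snippet below builds an /update/extract URL that maps the extracted body to a custom field via fmap.content, mirroring the params.add("fmap.content", "my_custom_field") call quoted above. The core name, document id and the uprefix=ignored_ setting are assumptions: the latter only drops unwanted metadata if the schema defines an ignored_* dynamic field that is neither indexed nor stored. The function only builds the URL; posting the file is left out.

```python
from urllib.parse import urlencode


def extract_url(base, core, doc_id, body_field):
    """Build a Solr Cell request URL that keeps only the document body.

    base, core and doc_id are hypothetical example values.
    """
    params = [
        ("literal.id", doc_id),        # unique key for the new document
        ("fmap.content", body_field),  # map Tika's "content" to our field
        ("uprefix", "ignored_"),       # prefix fields unknown to the schema
        ("commit", "true"),
    ]
    return f"{base}/{core}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    print(extract_url("http://localhost:8983/solr", "mycore", "doc1", "my_custom_field"))
```

Metadata fields that the schema does define explicitly would still be populated; excluding those as well needs explicit fmap mappings or the external-extraction route Erick suggests.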

Re: Solr Cell Question

2013-09-06 Thread Erick Erickson
ol over what's done. Here's a skeletal program with indexing from a DB mixed in, but it shouldn't be hard at all to pull the DB parts out. http://searchhub.org/dev/2012/02/14/indexing-with-solrj/ FWIW, Erick On Thu, Sep 5, 2013 at 5:28 PM, Jamie Johnson wrote: > Is it possible to c

Solr Cell Question

2013-09-05 Thread Jamie Johnson
Is it possible to configure solr cell to only extract and store the body of a document when indexing? I'm currently doing the following which I thought would work ModifiableSolrParams params = new ModifiableSolrParams(); params.set("defaultField", "content"); params.

Re: solr cell

2013-03-15 Thread Arcadius Ahouansou
Another option similar to this would be the new file system WatchService available in java 7: http://docs.oracle.com/javase/tutorial/essential/io/notification.html Arcadius. On 15 March 2013 15:22, Michael Della Bitta wrote: > Niklas, > > In Linux, the API for watching for filesystem changes i

Re: solr cell

2013-03-15 Thread Jack Krupansky
Take a look at ManifoldCF, which has a file system crawler which can track changed files. -- Jack Krupansky -Original Message- From: Niklas Langvig Sent: Friday, March 15, 2013 11:10 AM To: solr-user@lucene.apache.org Subject: solr cell We have all our documents (doc, docx, pdf) on a

Re: solr cell

2013-03-15 Thread Michael Della Bitta
Niklas, In Linux, the API for watching for filesystem changes is called inotify. You'd need to write something to listen to those events and react accordingly. Here's a brief discussion about it: http://stackoverflow.com/questions/4062806/inotify-how-to-use-it-linux Michael Della Bitta ---

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-03-05 Thread Divyanand Tiwari
Hi Chris thank you for replying. My "content" field in the schema is stored="true" and indexed="false" because I am copying the "content" field in "text" field which is by default indexed="true". I was having a query that I am able to search in the html documents I had fed to the solr, but as the

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-21 Thread Chris Hostetter
: Hi everyone, i am new to solr technology and not getting a way to get back : the original HTML document with Hits highlighted into it. what : configuration and where i can do to instruct SolrCell/ Tika so that it does : not strips down the tags of HTML document in the content field. I _think_ w

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-19 Thread Divyanand Tiwari
---Original Message- From: Divyanand Tiwari > Sent: Monday, February 18, 2013 10:52 PM > To: solr-user@lucene.apache.org > Subject: Re: How can i instruct the Solr/ Solr Cell to output the original > HTML document which was fed to it.? > > > Thank you for replying sir !!!

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-18 Thread Jack Krupansky
ansky -Original Message- From: Divyanand Tiwari Sent: Monday, February 18, 2013 10:52 PM To: solr-user@lucene.apache.org Subject: Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.? Thank you for replying sir !!! I have two queries related

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-18 Thread Divyanand Tiwari
Thank you for replying sir !!! I have two queries related with this - 1) So in this case which request handler I have to use because 'ExtractingRequestHandler' by default strips the html content and the default handler 'UpdateRequestHandler' does not accept the HTML contents. 2) How can I 'Ext

Re: How can i instruct the Solr/ Solr Cell to output the original HTML document which was fed to it.?

2013-02-18 Thread Jack Krupansky
highlighting. See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory -- Jack Krupansky -Original Message- From: Divyanand Tiwari Sent: Monday, February 18, 2013 7:28 AM To: solr-user@lucene.apache.org Subject: How can i instruct the Solr/ Solr

Re: Re: Re: Solr Cell Questions

2012-09-25 Thread Erick Erickson
roach... FWIW, Erick On Tue, Sep 25, 2012 at 10:04 AM, wrote: > The difference with solr cell is, that I'm sending every single document > to solr cell and don't collect them until I have a couple of them in my > memory. > Using mainly the code from here: >

Re: Solr Cell Questions

2012-09-25 Thread Jack Krupansky
a separate process) to minimize thread issues, GC issues, hung parsers, etc. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Tuesday, September 25, 2012 10:24 AM To: solr-user@lucene.apache.org Subject: Re: Solr Cell Questions Are you by any chance committing

Re: Solr Cell Questions

2012-09-25 Thread Alexandre Rafalovitch
http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Mon, Sep 24, 2012 at 10:04 AM, wrote: > Hi, > > I'm currently experimenting with Solr

Antwort: Re: Re: Solr Cell Questions

2012-09-25 Thread Johannes . Schwendinger
The difference with solr cell is, that I'm sending every single document to solr cell and don't collect them until I have a couple of them in my memory. Using mainly the code from here: http://wiki.apache.org/solr/ExtractingRequestHandler#SolrJ Erick Erickson wrote on 25.09.201

Re: Re: Solr Cell Questions

2012-09-25 Thread Erick Erickson
t? Best Erick On Tue, Sep 25, 2012 at 5:23 AM, wrote: > Thank you Erick for your response, > > I've already tried what you've suggested and got some out of memory > exceptions. Because of this I like the solution with solr Cell where I can > send the file directly t

Antwort: Re: Solr Cell Questions

2012-09-25 Thread Johannes . Schwendinger
Thank you Erick for your response, I've already tried what you've suggested and got some out of memory exceptions. Because of this I like the solution with solr Cell where I can send the file directly to solr via stream and don't collect them in my memory. And another question

Re: Solr Cell Questions

2012-09-24 Thread Erick Erickson
d to do the indexing Best Erick On Mon, Sep 24, 2012 at 10:04 AM, wrote: > Hi, > > I'm currently experimenting with Solr Cell to index files to Solr. During > this some questions came up. > > 1. Is it possible (and wise) to connect to Solr Cell with multiple Threads > at th

Solr Cell Questions

2012-09-24 Thread Johannes . Schwendinger
Hi, I'm currently experimenting with Solr Cell to index files to Solr. During this some questions came up. 1. Is it possible (and wise) to connect to Solr Cell with multiple Threads at the same time to index several documents at the same time? This question came up because my program takes

Re: Indexing PDF-Files using Solr Cell

2012-09-17 Thread Jack Krupansky
ber 17, 2012 1:12 AM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF-Files using Solr Cell Thank you for your response. I'm writing my Bachelor-Thesis about Solr and my company doesn't want me to use a beta-version. I don't want to be annoying, but "how" do I direct the

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Alexander Troost
ng. > > Again, this is all simplified in Solr 4.0-BETA. > > > -- Jack Krupansky > > -Original Message- From: Alexander Troost > Sent: Sunday, September 16, 2012 11:59 PM > To: solr-user@lucene.apache.org > Subject: Re: Indexing PDF-Files using Solr Cell > >

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Jack Krupansky
n Solr 4.0-BETA. -- Jack Krupansky -Original Message- From: Alexander Troost Sent: Sunday, September 16, 2012 11:59 PM To: solr-user@lucene.apache.org Subject: Re: Indexing PDF-Files using Solr Cell Hi, first of all: Thank you for that quick response! But i am not sure if i am doing this r

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Alexander Troost
ost > Sent: Sunday, September 16, 2012 10:16 PM > To: solr-user@lucene.apache.org > Subject: Indexing PDF-Files using Solr Cell > > > Hello *, > > I've got a problem indexing and searching PDF-Files. > > It seems like Solr doesn't index the name of the file. > > In re

Re: Indexing PDF-Files using Solr Cell

2012-09-16 Thread Jack Krupansky
ject: Indexing PDF-Files using Solr Cell Hello *, I've got a problem indexing and searching PDF-Files. It seems like Solr doesn't index the name of the file. In return I only get A28240application/pdfdoc52012-09-17T01:45:39Z It finds the right document, but no content or title is displa

Indexing PDF-Files using Solr Cell

2012-09-16 Thread Alexander Troost
Hello *, I've got a problem indexing and searching PDF-Files. It seems like Solr doesn't index the name of the file. In return I only get A28240application/pdfdoc52012-09-17T01:45:39Z It finds the right document, but no content or title is displayed in the XML-Response. Where do I config tha

Re: scanned pdf with solr cell

2012-08-20 Thread Michael Della Bitta
It's pretty easy to accidentally run into the AWT stuff if you're doing anything that involves image processing, which I would expect a generic RTF parser might do. Michael Della Bitta Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017 w

Re: scanned pdf with solr cell

2012-08-19 Thread Lance Norskog
The backstory here is that Tika uses a library that for some crazy reason is inside the Java AWT graphics toolkit. (I think the RTF parser?) On Wed, Aug 15, 2012 at 5:57 AM, Ahmet Arslan wrote: >> You can try passing >> -Djava.awt.headless=true as one of the arguments >> when you start Jetty to s

Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
> You can try passing > -Djava.awt.headless=true as one of the arguments > when you start Jetty to see if you can get this to go away > with no ill > effects. I started jetty using : 'java -Djava.awt.headless=true -jar start.jar' and successfully indexed two pdf files. That icon didn't appear :

Re: scanned pdf with solr cell

2012-08-15 Thread Michael Della Bitta
You can try passing -Djava.awt.headless=true as one of the arguments when you start Jetty to see if you can get this to go away with no ill effects. Michael Della Bitta Appinions | 18 East 41st St., Suite 1806 | New York, NY 10017 www.appinions.com

Re: scanned pdf with solr cell

2012-08-15 Thread Paul Libbrecht
On 15 August 2012 at 13:03, Ahmet Arslan wrote: > Hi Paul, thanks for the explanation. So is it nothing to worry about? it is nothing to worry about except to remember that you can't run this step in a daemon-like process. (on Linux, I had to set-up a VNC-server for similar tasks) paul

Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
> the dock icon appears when AWT starts, e.g. when a font is > loaded. > You can prevent it using the headless mode but this is > likely to trigger an exception. > Same if your user is not UI-logged-in. Hi Paul, thanks for the explanation. So is it nothing to worry about?

Re: scanned pdf with solr cell

2012-08-15 Thread Paul Libbrecht
Ahmet, the dock icon appears when AWT starts, e.g. when a font is loaded. You can prevent it using the headless mode but this is likely to trigger an exception. Same if your user is not UI-logged-in. hope it helps. Paul On 15 August 2012 at 01:30, Ahmet Arslan wrote: > Hi All, > > I have a set

Re: scanned pdf with solr cell

2012-08-15 Thread Ahmet Arslan
> When I send a scanned pdf to extraction request > handler, below icon appears in my Dock. > > http://tinypic.com/r/2mpmo7o/6 > http://tinypic.com/r/28ukxhj/6 I found that text-extractable pdf files trigger above weird icon too. curl "http://localhost:8983/solr/update/extract?literal.id=solr-

Re: scanned pdf with solr cell

2012-08-14 Thread Jack Krupansky
ostScript fonts. Try a "normal" PDF for comparison. -- Jack Krupansky -Original Message- From: Ahmet Arslan Sent: Tuesday, August 14, 2012 7:30 PM To: solr-user@lucene.apache.org Subject: scanned pdf with solr cell Hi All, I have a set of rich documents. Some of them are scanned

scanned pdf with solr cell

2012-08-14 Thread Ahmet Arslan
Hi All, I have a set of rich documents. Some of them are scanned pdf files. When I send a scanned pdf to extraction request handler, below icon appears in my Dock. http://tinypic.com/r/2mpmo7o/6 http://tinypic.com/r/28ukxhj/6 Does anyone know what this is? curl "http://localhost:8983/solr/docum

[Error] Indexing with solr cell

2012-07-03 Thread savitha sundaramurthy
Hi , I'm using solr cell(solrj) to index plain text files, but am encountering IllegalCharsetNameException: Could you please point out if anything should be added in schema.xml file. I could index the other mime types efficiently. I gave the field type as

Re: Custom content extractor for Solr Cell

2011-12-27 Thread Jan Høydahl
Hi John, See discussion about the issue of indexing contents of ZIP files: https://issues.apache.org/jira/browse/SOLR-2416 Depending on your use case, you may be able to write a Tika parser which handles your specific case, such as uncompressing a GZIP file and using AutoDetect on its contents

Custom content extractor for Solr Cell

2011-12-05 Thread John Bartak
Is it possible to extract content for file types that Tika doesn’t support without changing and rebuilding Tika? Do I need to specify a tika.config file in the solrconfig.xml file, and if so, what is the format of that file? One example that I’m trying to solve is for a document management syst

Re: Can you please guide me through step-by-step installation of Solr Cell ?

2011-11-03 Thread Chris Hostetter
: Caused by: org.apache.solr.common.SolrException: Error loading class 'solr.extraction.ExtractingRequestHandler' : : With the jetty and the provided example, I have no problem. It all happens when I use tomcat and solr. : : My setup is as follows: : : I downloaded the apache-solr-3.3.0 and

Can you please guide me through step-by-step installation of Solr Cell ?

2011-10-17 Thread Sina Fakhraee
Can you please guide me through step-by-step usage of Solr Cell installation? regards, Sina -- Sina Fakhraee , PhD  candidate Department of Computer Science Wayne State University 5057 Woodward Avenue 3rd floor, Suite 3105 Detroit, Michigan 48202 (517)974-8437(Cell) http://uwerg.c

Re: Please help - Solr Cell using 'stream.url'

2011-10-12 Thread Jan Høydahl
Latest version is 3.4, and it is fairly compatible with 1.4.1, but you have to reindex. First step migration can be to continue using your 1.4 schema on new solr.war (and SolrJ), but I suggest you take a few hours upgrading your schema and config as well. -- Jan Høydahl, search solution archite

Re: Please help - Solr Cell using 'stream.url'

2011-10-12 Thread Tod
On 10/10/2011 3:39 PM, Jan Høydahl wrote: Hi, If you have 4Gb on your server total, try giving about 1Gb to Solr, leaving 3Gb for OS, OS caching and mem-allocation outside the JVM. Also, add 'ulimit -v unlimited' and 'ulimit -s 10240' to /etc/profile to increase virtual memory and stack limit. I will

Re: Please help - Solr Cell using 'stream.url'

2011-10-10 Thread Jan Høydahl
Hi, If you have 4Gb on your server total, try giving about 1Gb to Solr, leaving 3Gb for OS, OS caching and mem-allocation outside the JVM. Also, add 'ulimit -v unlimited' and 'ulimit -s 10240' to /etc/profile to increase virtual memory and stack limit. And you should also consider upgrading to

Re: Please help - Solr Cell using 'stream.url'

2011-10-10 Thread Tod
On 10/07/2011 6:21 PM, Jan Høydahl wrote: Hi, What Solr version? Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 18:06:42. It's running on a Suse Linux VM. How often do you do commits, or do you use autocommit? I had been doing commits every 100 documents (the entire set is about 3

Re: Please help - Solr Cell using 'stream.url'

2011-10-07 Thread Jan Høydahl
solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 7 Oct. 2011, at 20:19, Tod wrote: > I'm batching documents into solr using solr cell with the 'stream.url' > parameter. Everything is working fine until I get to about 5k docume

Please help - Solr Cell using 'stream.url'

2011-10-07 Thread Tod
I'm batching documents into solr using solr cell with the 'stream.url' parameter. Everything is working fine until I get to about 5k documents in and then it starts issuing 'read timeout 500' errors on every document. The sysadmin says there's plenty of CP

Question on XPATH use in Solr Cell.

2011-06-15 Thread Koorosh Vakhshoori
I am new to both Solr and Cell, so sorry if I am misusing some of the terminologies. So the problem I am trying to solve is to index a PDF document using Solr Cell where I want to exclude part of it via XPATH. I am using Solr release 3.1. When researching the user list, I came across one entry

Re: Limit data stored from fmap.content with Solr cell

2011-06-01 Thread Erick Erickson
tion from files with Solr Cell. Some of > the files we are indexing are large, and have much content. I would like to > limit the amount of data I index to a specified limit of characters (example > 300 chars) which I will use as a document preview. Is this possible to set as >

Limit data stored from fmap.content with Solr cell

2011-06-01 Thread Greg Georges
Hello everyone, I have just gotten extracting information from files with Solr Cell. Some of the files we are indexing are large, and have much content. I would like to limit the amount of data I index to a specified limit of characters (example 300 chars) which I will use as a document

Re: Indexing files Solr cell and Amazon S3

2011-05-30 Thread Jan Høydahl
- www.cominvent.com On 30 May 2011, at 22:46, Greg Georges wrote: > Hello everyone, > > We have our infrastructure on Amazon cloud servers, and we use the S3 file > system. We need to index files using Solr Cell. From what I have read, we > need to stream files to Solr in order for it

Indexing files Solr cell and Amazon S3

2011-05-30 Thread Greg Georges
Hello everyone, We have our infrastructure on Amazon cloud servers, and we use the S3 file system. We need to index files using Solr Cell. From what I have read, we need to stream files to Solr in order for it to extract the metadata into the index. If we stream data through a public url there

Solr Cell and operations on metadata extracted

2011-05-16 Thread Olivier Tavard
Hi, I have a question about Solr Cell please. I index some files. For example, if I want to extract the filename, then use a hash function on it like MD5 and then store it on Solr ; the correct way is to use Tika « manually » to extract the metadata I want, do the transformations on it and

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Erick Erickson
jel...@openindex.io] > Sent: Friday, March 25, 2011 1:23 PM > To: solr-user@lucene.apache.org > Cc: Upayavira > Subject: Re: Multiple Cores with Solr Cell for indexing documents > > You can only set properties for a lib dir that must be used in solrconfig.xml. > You can use sharedLi

RE: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Brandon Waterloo
__ From: Markus Jelsma [markus.jel...@openindex.io] Sent: Friday, March 25, 2011 1:23 PM To: solr-user@lucene.apache.org Cc: Upayavira Subject: Re: Multiple Cores with Solr Cell for indexing documents You can only set properties for a lib dir that must be used in solrconfig.xml. You can use shared

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Markus Jelsma
solr.xml file is > > sharedLib="lib">. That is housed in .../example/solr/. So, does it > > > look in .../example/lib or .../example/solr/lib? > > > > > > ~Brandon Waterloo > > > ____ > > > From: Markus Jelsma [markus.jel...@openindex.io] > > >

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Upayavira
jel...@openindex.io] > > Sent: Thursday, March 24, 2011 11:29 AM > > To: solr-user@lucene.apache.org > > Cc: Brandon Waterloo > > Subject: Re: Multiple Cores with Solr Cell for indexing documents > > > > Sounds like the Tika jar is not on the class path. Add it to a

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Markus Jelsma
_ > From: Markus Jelsma [markus.jel...@openindex.io] > Sent: Thursday, March 24, 2011 11:29 AM > To: solr-user@lucene.apache.org > Cc: Brandon Waterloo > Subject: Re: Multiple Cores with Solr Cell for indexing documents > > Sounds like the Tika jar is not on the class pat

Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Brandon Waterloo
Markus Jelsma [markus.jel...@openindex.io] Sent: Thursday, March 24, 2011 11:29 AM To: solr-user@lucene.apache.org Cc: Brandon Waterloo Subject: Re: Multiple Cores with Solr Cell for indexing documents Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking f

RE: Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Brandon Waterloo
Markus Jelsma [markus.jel...@openindex.io] Sent: Thursday, March 24, 2011 11:29 AM To: solr-user@lucene.apache.org Cc: Brandon Waterloo Subject: Re: Multiple Cores with Solr Cell for indexing documents Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking f

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Markus Jelsma
Sounds like the Tika jar is not on the class path. Add it to a directory where Solr's looking for libs. On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote: > Hello everyone, > > I've been trying for several hours now to set up Solr with multiple cores > with Sol

Multiple Cores with Solr Cell for indexing documents

2011-03-24 Thread Brandon Waterloo
Hello everyone, I've been trying for several hours now to set up Solr with multiple cores with Solr Cell working on each core. The only items being indexed are PDF, DOC, and TXT files (with the possibility of expanding this list, but for now, just assume the only things in the index shou

Multiple Cores with Solr Cell for indexing documents

2011-03-22 Thread Brandon Waterloo
Hello everyone, I've been trying for several hours now to set up Solr with multiple cores with Solr Cell working on each core. The only items being indexed are PDF, DOC, and TXT files (with the possibility of expanding this list, but for now, just assume the only things in the index shou

Re: Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
In case the exact problem was not clear to somebody: The problem with FileUpload interpreting file data as regular form fields is that Solr thinks there are no content streams in the request and throws a "missing_content_stream" exception. On Thu, Mar 10, 2011 at 10:59 AM, Karthik Shiraly < karth

Solr Cell: Content extraction problem with ContentStreamUpdateRequest and multiple files

2011-03-09 Thread Karthik Shiraly
Hi, I'm using Solr 1.4.1. The scenario involves a user uploading multiple files. These have content extracted using SolrCell, then indexed by Solr along with other information about the user. ContentStreamUpdateRequest seemed like the right choice for this - use addFile() to send file data, and use

Solr Cell & DataImport Tika handler broken - fails to index Zip file contents

2011-03-07 Thread Jayendra Patil
Working with the latest Solr trunk code, it seems the Tika handlers for Solr Cell (ExtractingDocumentLoader.java) and the Data Import Handler (TikaEntityProcessor.java) fail to index the zip file contents again. They just index the file names again. This issue was addressed some time back, late last

Exception being thrown indexing a specific pdf document using Solr Cell

2010-10-15 Thread Shaun Campbell
apache.org/2009-09/msg00037.html Looking at my libraries it seems I am using pdfbox 0.7.3. I am using Maven for building, and pdfbox 0.7.3 appears to have come from the tika-parsers 0.4 pom file, which in turn appears to have come from the solr-cell 1.4.0 pom file. In my project's Maven pom file I hav
