RE: Large Data Set Suggestions

2008-11-07 Thread Lance Norskog
In my DIH tests I ran a nested loop where the outer RSS feed gave a list of feeds, and the inner loop walked each feed. Some of the feeds were bogus, and the DIH loop immediately failed. It would be good to have at least "ignoreerrors=true" the way 'ant' does. This would be set inside each loop

Re: Large Corpus XML Conversion?

2008-11-07 Thread Otis Gospodnetic
Hi, For message parsing you'll either have to write a custom parser or see if you can use JavaMail for that (or some other library if you are not working with Java). As for the second part, that's not directly related to Solr. Extracting meaning out of text would be something that your applic

Re: Solr locking issue? BLOCKED on lock=org.apache.lucene.store.FSDirectory

2008-11-07 Thread Yonik Seeley
Hi Tom, if you're on a non Windows box, could you perhaps try your test on the latest Solr nightly build? We've recently improved this through the use of NIO. -Yonik On Fri, Nov 7, 2008 at 4:23 PM, Burton-West, Tom <[EMAIL PROTECTED]> wrote: > Hello, > > We are testing Solr with a simulation of

Large Corpus XML Conversion?

2008-11-07 Thread Johnny X
I've been asked to look at the Enron e-mail corpus (http://www.cs.cmu.edu/~enron/) and I've decided to use Solr as a means to analyse it. So I have a few questions... First off, how can I convert the flat file text below: Message-ID: <[EMAIL PROTECTED]> Date: Mon, 14 May 2001 16:39:00 -0700 (

RE: Handling proper names

2008-11-07 Thread Nguyen, Joe
Use synonyms. Add these lines to your ../conf/synonym.txt: Stephen,Steven,Steve Bobby,Bob,Robert ... -Original Message- From: news [mailto:[EMAIL PROTECTED] On Behalf Of Jon Drukman Sent: Friday, November 07, 2008 3:19 Joe To: solr-user@lucene.apache.org Subject: Handling proper names Is

Handling proper names

2008-11-07 Thread Jon Drukman
Is there any way to tell Solr that Stephen is the same as Steven and Steve? Carl and Karl? Bobby/Bob/Robert, and so on... -jsd-

Solr locking issue? BLOCKED on lock=org.apache.lucene.store.FSDirectory

2008-11-07 Thread Burton-West, Tom
Hello, We are testing Solr with a simulation of 30 concurrent users. We are getting socket timeouts and the thread dump from the admin tool shows about 100+ threads with a similar message about a lock. (Message appended below). We suspect this may have something to do with one or more phrase que

Displaying stdout from postCommit command

2008-11-07 Thread Jerry Mindek
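Hi all, I would like to see the output from snapshooter as it executes after it has been called via the postCommit event of the solr.RunExecutableListener class. In my solrconfig.xml, the listener is described by: <listener event="postCommit" class="solr.RunExecutableListener"> <str name="exe">snapshooter</str> <str name="dir">solr/bin</str> <bool name="wait">true</bool> </listener> Is there a way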

Re: Getting an AND query when using Solrj.

2008-11-07 Thread Erik Holstad
Hi! Sorry that I was unclear, when I wrote that it works in the web interface I also meant to say that it is set in the schema.xml file and therefore working there. Sorry about that Regards Erik On Fri, Nov 7, 2008 at 11:33 AM, Jorge Solari <[EMAIL PROTECTED]> wrote: > setting in schema.xml > >

Re: Getting an AND query when using Solrj.

2008-11-07 Thread Jorge Solari
setting in schema.xml On Fri, Nov 7, 2008 at 5:21 PM, Erik Holstad <[EMAIL PROTECTED]> wrote: > Hi! > When making a query using the web interface we get the expected > OR function. But when using the java client it looks like it is treating the > query as an AND query. > > Is there a way to s

Getting an AND query when using Solrj.

2008-11-07 Thread Erik Holstad
Hi! When making a query using the web interface we get the expected OR function. But when using the java client it looks like it is treating the query as an AND query. Is there a way to see what operator is used for the query using Solrj? Regards Erik

Re: Very bad performance

2008-11-07 Thread Yonik Seeley
On Fri, Nov 7, 2008 at 12:58 PM, Cedric Houis <[EMAIL PROTECTED]> wrote: > Another remark, why do we have better performance when we use parallel > instances of SOLR that use the same index on the same machine? Internal locking. SOLR-465 was committed yesterday... it may improve some things slight

Re: Different tokenizing algorithms for the same stream

2008-11-07 Thread Erick Erickson
That should be easy enough to test with a trivial bit of logging. Worst case, make a *new* analyzer for each of your fields. PerFieldAnalyzerWrapper is your friend here. Best Erick On Fri, Nov 7, 2008 at 11:23 AM, Yuri Jan <[EMAIL PROTECTED]> wrote: > I'm subclassing my own tokenizer. > I'm not

Re: Very bad performance

2008-11-07 Thread Cedric Houis
Hello Yonik. I’ve run a few more tests today. Here are the results: Start thread every 0.1->0.5 sec. Each thread waits 2->10 sec before starting a new query Each thread runs 5 min. With FastSolrLRU FullText : - 10 users : 998 queries / Average time 0.037 sec - 50 users : 4819 qu

Re: Solr for large volume data processing with minimal full-text search

2008-11-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
If you need anything close to realtime (~ few seconds) hadoop and its ilk is not a choice. Solr is fine. But be prepared to dedicate a lot of hardware for that On Fri, Nov 7, 2008 at 10:53 PM, souravm <[EMAIL PROTECTED]> wrote: > Hi Shalin, > > Thanks for your input. > > Yes I agree that my applic

Solr for large volume data processing with minimal full-text search

2008-11-07 Thread souravm
Hi Shalin, Thanks for your input. Yes I agree that my application is not much about full text search. Hive/Chukwa/Pig (a combination) running on Hadoop can be a good bet. But where they fall short is in online querying of the huge data. I am specifically talking about Pig in this case which ha

Re: Batch and Incremental mode of indexing

2008-11-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Nov 7, 2008 at 5:48 PM, Vaijanath N. Rao <[EMAIL PROTECTED]> wrote: > Hi Solr-Users, > > I am not sure but does there exist any mechanism where-in we can specify > solr as Batch and incremental indexing. > What I mean by batch indexing is solr would delete all the records which > existed in

Re: Large Data Set Suggestions

2008-11-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
OK. You can raise an issue anyway. On Fri, Nov 7, 2008 at 7:03 PM, Steven Anderson <[EMAIL PROTECTED]> wrote: > Ideally, it would be a configuration option. > > Also, it would be great to have a hook to log or process an exception. > > Steve > > > -Original Message- > From: Noble Paul ?

Re: Different tokenizing algorithms for the same stream

2008-11-07 Thread Yuri Jan
I'm subclassing my own tokenizer. I'm not sure though if I can rely on the fact this tokenizer will be used for this field sequentially. I'm going to use it with different fields and don't want the member variable to be used when tokenizing different fields or even the same field on different doc

Re: Very bad performance

2008-11-07 Thread Cedric Houis
Hello Ryantxu. I've checked the cache state; it seems that the cache is well used: cumulative_evictions : 0 Thanks for your help, Regards, Cédric ryantxu wrote: > >> >> Data : >> >> 367380 documents >> >> nGeographicLocations : 39298 distincts values >> nPersonNames : 325142 distincts valu

Different tokenizing algorithms for the same stream

2008-11-07 Thread Jérôme Etévé
Hi, you have to keep track of the character position yourself in your custom Tokenizer. See org.apache.lucene.analysis.CharTokenizer for a starting example. Cheers, J. On Fri, Nov 7, 2008 at 3:33 PM, Yoav Caspi <[EMAIL PROTECTED]> wrote: > Thanks, Jerome. > > My problem is that in Tok

Re: Solr Multicore ...

2008-11-07 Thread Shalin Shekhar Mangar
From what I can understand, you have little full-text search involved here. You should probably look at Hadoop and its contrib and sub-projects such as Pig, Hive and Chukwa. http://wiki.apache.org/hadoop/ http://wiki.apache.org/hadoop/Hive http://wiki.apache.org/hadoop/Chukwa http://incubator.apa

Re: Different tokenizing algorithms for the same stream

2008-11-07 Thread Erick Erickson
Why not just subclass your own tokenizer and use that one? Each call to next could increment a member variable in your new class and you could make your decisions based upon that... Best Erick On Fri, Nov 7, 2008 at 10:33 AM, Yoav Caspi <[EMAIL PROTECTED]> wrote: > Thanks, Jerome. > > My problem
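
A minimal sketch of the counter idea described above, written in plain Python rather than against the Lucene Tokenizer API. The class name `SwitchingTokenizer`, the `switch_after` parameter, and the choice of "lowercase after the switch" as the second algorithm are all hypothetical and only illustrate the pattern of a member variable that is incremented on every call to next.

```python
class SwitchingTokenizer:
    """Toy tokenizer that changes behaviour after `switch_after` tokens.

    Hypothetical illustration only: whitespace splitting stands in for a
    real tokenizing algorithm, and lowercasing stands in for the second one.
    """

    def __init__(self, text, switch_after):
        self._words = iter(text.split())
        self._switch_after = switch_after
        self._emitted = 0  # number of tokens returned so far

    def __iter__(self):
        return self

    def __next__(self):
        word = next(self._words)  # StopIteration ends the stream
        self._emitted += 1
        if self._emitted <= self._switch_after:
            return word          # first algorithm: keep the token as-is
        return word.lower()      # second algorithm: lowercase the token


tokens = list(SwitchingTokenizer("Solr Lucene Tokenizer Stream", 2))
print(tokens)
```

Because the counter lives on the instance, it only resets if a fresh tokenizer is created for each stream, which is the concern raised elsewhere in this thread.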

Re: maxFieldLength

2008-11-07 Thread Andrzej Bialecki
Dan A. Dickey wrote: I just came across the maxFieldLength setting for the mainIndex in solrconfig.xml and have a question or two about it. The default value is 10000. I'm extracting text from pdf documents and storing them into a text field. Is the length of this text field limited to 10000 ch

Re: Different tokenizing algorithms for the same stream

2008-11-07 Thread Yoav Caspi
Thanks, Jerome. My problem is that in Token next(Token result) there is no information about the location inside the stream. I can read characters from the input Reader, but couldn't find a way to know if it's the beginning of the input or not. -J On Fri, Nov 7, 2008 at 6:13 AM, Jérôme Etévé <[E

RE: Solr Multicore ...

2008-11-07 Thread souravm
Hi Guys, I'm struggling to decide whether Solr would be a fitting solution for me. Highly appreciate you The key requirements can be summarized as below - 1. Need to process very high volume of data online from log files of various applications - around 100s of Millions of total size

Re: maxFieldLength

2008-11-07 Thread Erick Erickson
I believe it's 10,000 tokens, not characters, but that's a quibble. Yes, you need to change maxFieldLength to be greater than any doc you expect to index. It can be made huge, I don't think there's a penalty for making this number, say, 100,000,000 and indexing documents with only 10 tokens.
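
To make the tokens-versus-characters point concrete, here is a small sketch of what a token limit does to a field. It is not Solr or Lucene code: the function name `index_tokens` is hypothetical, and whitespace splitting stands in for whatever analyzer the field actually uses.

```python
def index_tokens(text, max_field_length):
    """Return the tokens that would be kept under a token-count limit.

    Assumption: tokens past the limit are dropped, so text that appears
    only after the cutoff cannot be found by a search.
    """
    tokens = text.split()  # stand-in for the field's real analyzer
    return tokens[:max_field_length]


doc = "one two three four five"
kept = index_tokens(doc, 3)
print(kept)
```

A short document is unaffected by a large limit, which matches the remark that setting the value very high carries no obvious penalty for small documents.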

RE: Solr Multicore ...

2008-11-07 Thread souravm
Thanks Noble for your answer. Regards, Sourav -Original Message- From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:[EMAIL PROTECTED] Sent: Thursday, November 06, 2008 7:41 PM To: solr-user@lucene.apache.org Subject: Re: Solr Multicore ... On Fri, Nov 7, 2008 at 3:28 AM, souravm <[EMAIL PROTECTED]>

RE: Distributed Search ...

2008-11-07 Thread souravm
Thanks Otis for clarification. Sourav -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, November 06, 2008 8:18 PM To: solr-user@lucene.apache.org Subject: Re: Distributed Search ... Sourav, Whichever Solr instance you send the request to will dispatch r

maxFieldLength

2008-11-07 Thread Dan A. Dickey
I just came across the maxFieldLength setting for the mainIndex in solrconfig.xml and have a question or two about it. The default value is 10000. I'm extracting text from pdf documents and storing them into a text field. Is the length of this text field limited to 10000 characters? Many pdf doc

combining negation in query

2008-11-07 Thread smadhu
If I have a field with value "foo blah blah blah bar" and "foo blah blah blah". I want to be able to find documents with "foo" NOT "bar" within 5 token positions. Is that possible? -- View this message in context: http://www.nabble.com/combining-negation-in-query-tp20381550p20381550.html Sent fr
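
The behaviour being asked for can be pinned down with a small sketch. This is plain Python over whitespace tokens, not a Solr query; the function name `matches` and its parameters are hypothetical and only spell out one reading of "foo NOT bar within 5 token positions".

```python
def matches(text, include, exclude, slop):
    """True if some occurrence of `include` has no `exclude` within `slop` positions."""
    tokens = text.split()
    include_positions = [i for i, t in enumerate(tokens) if t == include]
    exclude_positions = [i for i, t in enumerate(tokens) if t == exclude]
    return any(
        all(abs(i - j) > slop for j in exclude_positions)
        for i in include_positions
    )


with_bar = matches("foo blah blah blah bar", "foo", "bar", 5)
without_bar = matches("foo blah blah blah", "foo", "bar", 5)
print(with_bar, without_bar)
```

Under this reading the first example value is rejected (bar is 4 positions from foo) and the second is accepted.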

Re: Throughput Optimization

2008-11-07 Thread Yonik Seeley
FYI, SOLR-465 has been committed. Let us know if it improves your scenario. -Yonik On Wed, Nov 5, 2008 at 5:39 PM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > On Wed, Nov 5, 2008 at 5:18 PM, wojtekpia <[EMAIL PROTECTED]> wrote: >> I'd like to integrate this improvement into my deployment. Is it ju

RE: Large Data Set Suggestions

2008-11-07 Thread Steven Anderson
Ideally, it would be a configuration option. Also, it would be great to have a hook to log or process an exception. Steve -Original Message- From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:[EMAIL PROTECTED] Sent: Thu 11/6/2008 11:38 PM To: solr-user@lucene.apache.org Subject: Re: Large Data Set

Re: Calculating peaks - solrj support for facet.date?

2008-11-07 Thread Erik Hatcher
On Nov 7, 2008, at 7:23 AM, [EMAIL PROTECTED] wrote: Sorry, but I have one more question. Does the java client solrj support facet.date? Yeah, but it doesn't have explicit setters for it. A SolrQuery is also a ModifiableSolrParams - so you can call the add/set methods on it using the sam
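
As a rough illustration of the parameters in question, the sketch below builds the request parameters for a date facet and encodes them as a query string. It is plain Python, not Solrj; the field name `timestamp` and the date-math values are assumptions chosen for the example. In Solrj the same name/value pairs would be passed to the query object through its set/add methods.

```python
from urllib.parse import urlencode


def facet_date_params(field, start, end, gap):
    """Return the request parameters for a date facet on `field`."""
    return [
        ("q", "*:*"),
        ("facet", "true"),
        ("facet.date", field),
        ("facet.date.start", start),
        ("facet.date.end", end),
        ("facet.date.gap", gap),
    ]


# Hypothetical field and range: one bucket per day over the last week.
params = facet_date_params("timestamp", "NOW/DAY-7DAYS", "NOW/DAY", "+1DAY")
query_string = urlencode(params)
print(query_string)
```

Note that the leading "+" in the gap value has to be URL-encoded when the parameters are sent over HTTP by hand.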

Re: Calculating peaks - solrj support for facet.date?

2008-11-07 Thread gistolero
Sorry, but I have one more question. Does the java client solrj support facet.date? QueryResponse knows the getFacetDates() method but I don't understand how to set facet.date, facet.date.start, facet.date.end, and facet.date.gap for the query. It seems that SolrQuery doesn't provide functions

Re: Batch and Incremental mode of indexing

2008-11-07 Thread Jérôme Etévé
Hi, For batch indexing, what you could do is to use two cores. One in production and one used for your update. Once your update core is built (delete *:* plus batch insert), you can swap the cores to put it in production: http://wiki.apache.org/solr/CoreAdmin#head-928b872300f1b66748c85cebb12a59bb
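
A small sketch of the swap step, building the CoreAdmin request URL in Python. The base URL, the core names `live` and `staging`, and the `/admin/cores` path are assumptions for illustration; the admin path in particular depends on how multicore is configured.

```python
from urllib.parse import urlencode


def swap_url(base_url, live_core, staging_core):
    """Build a CoreAdmin SWAP request that exchanges the two named cores."""
    query = urlencode({"action": "SWAP", "core": live_core, "other": staging_core})
    return f"{base_url}/admin/cores?{query}"


# Hypothetical host and core names.
url = swap_url("http://localhost:8983/solr", "live", "staging")
print(url)
```

After the swap, requests addressed to the production core name are served by the freshly built index, and the old index remains available under the other name for the next batch run.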

Batch and Incremental mode of indexing

2008-11-07 Thread Vaijanath N. Rao
Hi Solr-Users, I am not sure but does there exist any mechanism where-in we can specify solr as Batch and incremental indexing. What I mean by batch indexing is solr would delete all the records which existed in the index and will create a new index from the given data. For incremental I want

Re: Very bad performance

2008-11-07 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Fri, Nov 7, 2008 at 12:49 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote: > Your problem is most likely the time it takes to facet on those > multi-valued fields. > Help is coming within the month I'd estimate, in the form of faster > faceting for multivalued fields where the number of values per >

delivering customized results using a SearchComponent plugin

2008-11-07 Thread Jérôme Etévé
Hi there, I developed a personalized SearchComponent in which I'm building a docset from a personalized Query, and a personalized Priority Queue. To be short, I'm doing that (in the process method) : HitCollector hitCol = new HitCollector() { @Override public void colle

Using Solr in my application - very important

2008-11-07 Thread Sajith Vimukthi
Hi all, I just want to use Solr in a certain Knowledge Management System that I am going to develop. I basically have PDFs and docs, and these can be converted into suitable forms via the PDFBox and POI frameworks. In the search function I need to obtain data in such a way that I can give the sourc

Re: Different tokenizing algorithms for the same stream

2008-11-07 Thread Jérôme Etévé
Hi, I think you could implement your personalized tokenizer in such a way that it changes its behaviour after it has delivered X tokens. This implies a new tokenizer instance is built from the factory for every string analyzed, which I believe is true. Can this be confirmed? Cheers! Jerome. On Thu

RE: Search based on price range

2008-11-07 Thread Dave Searle
Team Lead EC Software

Search based on price range

2008-11-07 Thread Anto Binish Kaspar
Does anyone have a solution for my problem? I am indexing products. Each product can have multiple prices (Government, Club users, Public, etc.). The number of prices is not limited: one product can have any number of prices, depending on the client's needs. Now I need to index all the p