2012-03-07 Thread Phillip Farber
unsubscribe

Re: solr optimize - no space left on device

2009-10-09 Thread Phillip Farber
Thanks Hoss. Yes, in a separate thread on the list I reported that doing a multi-stage optimize worked around the out of space problem. We use mergefactor=10, maxSegments = 16, 8, 4, 2, 1 iteratively starting at the closest power of two below the number of segments to merge. Works nicely s
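The iterative schedule described above can be sketched as follows. This is only an illustration, not the poster's actual script: the function names are hypothetical, "closest power of two below" is assumed to mean the largest power of two strictly less than the current segment count, and the optimize message assumes the Solr update XML accepts a maxSegments attribute.

```python
def max_segments_schedule(segment_count):
    """Return the maxSegments values to request, largest first, ending at 1."""
    if segment_count <= 1:
        return []  # already a single segment; nothing to merge
    # Assumption: start at the largest power of two strictly below the count.
    target = 1
    while target * 2 < segment_count:
        target *= 2
    schedule = []
    while target >= 1:
        schedule.append(target)
        target //= 2  # halve the target for the next optimize pass
    return schedule


def optimize_message(max_segments):
    """Build the XML body for one optimize pass (assumed message format)."""
    return '<optimize maxSegments="%d"/>' % max_segments


if __name__ == "__main__":
    for n in max_segments_schedule(20):
        print(optimize_message(n))
```

Each pass merges down to a smaller number of segments, so no single merge has to rewrite the whole index at once, which is the point of the workaround reported in the post.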

Optimization of large shard succeeded

2009-10-08 Thread Phillip Farber
I thought I'd summarize a method that solved the problem we were having trying to optimize a large shard that was running out of disk space, df=100% (400g), du=~380g. After we ran out of space, if we restarted tomcat, segment files disappeared from disk leaving 3 segments. What worked: we u

Re: How much disk space does optimize really take

2009-10-07 Thread Phillip Farber
But we're still exceeding 2x. And after the optimize fails, if we then do a commit or bounce tomcat, a bunch of segments disappear. I am stumped. Yonik Seeley wrote: On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber wrote: So this implies that for a "normal" optimize, in every case, due to

Re: How much disk space does optimize really take

2009-10-07 Thread Phillip Farber
Yonik Seeley wrote: Does this mean that there's always a lucene IndexReader holding segment files open so they can't be deleted during an optimize so we run out of disk space > 2x? Yes. A feature could probably be developed now that avoids opening a reader until it's requested. That wa

How much disk space does optimize really take

2009-10-07 Thread Phillip Farber
In a separate thread, I've detailed how an optimize is taking > 2x disk space. We don't use solr distribution/snapshooter. We are using the default deletion policy = 1. We can't optimize a 192G index in 400GB of space. This thread in lucene/java-user http://www.gossamer-threads.com/lists/l

Re: solr optimize - no space left on device

2009-10-07 Thread Phillip Farber
his is interesting. On Tue, Oct 6, 2009 at 5:28 PM, Phillip Farber wrote: I am attempting to optimize a large shard on solr 1.4 and repeatedly get java.io.IOException: No space left on device. The shard, after a final commit before optimize, shows a size of about 192GB on a 400GB volume. I ha

solr optimize - no space left on device

2009-10-06 Thread Phillip Farber
I am attempting to optimize a large shard on solr 1.4 and repeatedly get java.io.IOException: No space left on device. The shard, after a final commit before optimize, shows a size of about 192GB on a 400GB volume. I have successfully optimized 2 other shards that were similarly large without

Re: best way to get the size of an index

2009-10-02 Thread Phillip Farber
Thanks, Mark. I really appreciate your confirmation. Phil Mark Miller wrote: Phillip Farber wrote: Resuming this discussion in a new thread to focus only on this question: What is the best way to get the size of an index so it does not get too big to be optimized (or to allow a very large

best way to get the size of an index

2009-10-01 Thread Phillip Farber
Resuming this discussion in a new thread to focus only on this question: What is the best way to get the size of an index so it does not get too big to be optimized (or to allow a very large segment merge) given space limits? I already have the largest 15,000rpm SCSI direct attached storage

index size before and after commit

2009-10-01 Thread Phillip Farber
I am trying to automate a build process that adds documents to 10 shards over 5 machines and need to limit the size of a shard to no more than 200GB because I only have 400GB of disk available to optimize a given shard. Why does the size (du) of an index typically decrease after a commit? I'v

Re: Writing optimized index to different storage?

2009-09-30 Thread Phillip Farber
Sorry, I should have given more background. We have, at the moment, 3.8 million documents of 0.7MB/doc average so we have extremely large shards. We build about 400,000 documents to a shard, resulting in 200GB/shard. We are also using LVM snapshots to manage a snapshot of the shard which we serve

mergefactor=1 questions

2009-09-30 Thread Phillip Farber
In order to make maximal use of our storage by avoiding the dead 2x overhead needed to optimize the index we are considering setting mergefactor=1 and living with the slow indexing performance which is not a problem in our use case. Some questions: 1) Does mergefactor=1 mean that the size o

Re: Writing optimized index to different storage?

2009-09-28 Thread Phillip Farber
he newly optimized files via a FileSwitchDirectory like implementation that knows which new files are optimized and should "underneath" go to a different physical path. On Mon, Sep 28, 2009 at 7:57 AM, Phillip Farber wrote: Is it possible to tell Solr or Lucene, when optimizing, to write

Writing optimized index to different storage?

2009-09-28 Thread Phillip Farber
Is it possible to tell Solr or Lucene, when optimizing, to write the files that constitute the optimized index to somewhere other than SOLR_HOME/data/index or is there something about the optimize that requires the final segment to be created in SOLR_HOME/data/index? Thanks, Phil

OOM error during merge - index still ok?

2009-09-25 Thread Phillip Farber
Can I expect the index to be left in a usable state after an out of memory error during a merge or is it most likely to be corrupt? I'd really hate to have to start this index build again from square one. Thanks. Thanks, Phil --- Exception in thread "http-8080-Processor2505" java.lang.

com.ctc.wstx.exc.WstxUnexpectedCharException error

2009-08-25 Thread Phillip Farber
I have a valid xml document that begins: mdp.39015052775379 2 Technology transfer and in-house R&D in Indian industry : in the later 1990s / edited and with an introduction by Binay Kumar Pattnaik. v.1 Not found TECHNOLOGY TRANSFER AND IN.HOUSE R&D IN INDIAN INDUSTRY I believe Solr is throwi

Multi-shard query with error on one shard

2009-08-20 Thread Phillip Farber
What will the client receive from the primary solr instance if that instance doesn't get HTTP 200 from all the shards in a multi-shard query? Thanks, Phil

Rotating the primary shard in /solr/select

2009-07-28 Thread Phillip Farber
Is there any value in a round-robin scheme to cycle through the Solr instances supporting a multi-shard index over several machines when sending queries or is it better to just pick one instance and stick with it? I'm assuming all machines in the cluster have the same hardware specs. So sce

Is there a multi-shard optimize message?

2009-07-28 Thread Phillip Farber
Normally to optimize an index you POST to /solr/update. Is there any way to POST an optimize message to one instance and have it propagate to all shards sort of like the select? /solr-shard-1/select?q=dog... shards=shard-1,shard2 Thanks, Phil
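If no propagating optimize message is available, one workaround is to POST the optimize to each shard's own update URL. The sketch below only builds the requests and does not send them; the shard URLs are hypothetical, and the `<optimize/>` body and `text/xml` content type are assumptions about the update handler rather than something confirmed in this thread.

```python
import urllib.request


def build_optimize_requests(shard_update_urls):
    """Return one POST request per shard, each carrying an optimize message."""
    requests = []
    for url in shard_update_urls:
        # Supplying a data payload makes urllib issue a POST instead of a GET.
        requests.append(
            urllib.request.Request(
                url,
                data=b"<optimize/>",
                headers={"Content-Type": "text/xml"},
            )
        )
    return requests


if __name__ == "__main__":
    # Hypothetical shard locations; substitute the real update URLs.
    shards = [
        "http://localhost:8080/solr-shard-1/update",
        "http://localhost:8080/solr-shard-2/update",
    ]
    for request in build_optimize_requests(shards):
        print(request.get_method(), request.full_url)
```

Each request could then be sent with `urllib.request.urlopen(request)`, one shard at a time or in parallel, depending on how much disk and I/O headroom the machines have.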

initialSize of queryResultCache and documentCache

2009-06-30 Thread Phillip Farber
I'm trying to understand the purpose of the initialSize parameter for the queryResultCache and documentCache. Is it correct that it controls how much heap is allocated to each cache at startup? I can see how it makes sense for queryResultCache since it is documented as an "ordered lists of

Re: Entire heap consumed to answer initial ping()

2009-06-30 Thread Phillip Farber
ucene - Solr - Nutch - Original Message From: Phillip Farber To: solr-user Sent: Monday, June 29, 2009 4:20:26 PM Subject: Entire heap consumed to answer initial ping() Jconsole shows the entire 2.1g heap consumed on the first request (a simple ping) to Solr after a Tomcat resta

Entire heap consumed to answer initial ping()

2009-06-29 Thread Phillip Farber
Jconsole shows the entire 2.1g heap consumed on the first request (a simple ping) to Solr after a Tomcat restart. After a Tomcat restart: 13140 tomcat virtual=2255m resident=183m ... jsvc After the ping(): 13140 tomcat virtual=2255m resident=2.0g ... jsvc Jconsole says my "Tenured Gen"

Re: Programatic way to know when an optimize is finished?

2008-11-18 Thread Phillip Farber
e http and then the next command in the script runs after the optimize finishes. Hours later, in our case. Lance -Original Message- From: Phillip Farber [mailto:[EMAIL PROTECTED] Sent: Friday, November 14, 2008 10:04 AM To: solr-user@lucene.apache.org Subject: Programatic way to know w

Programatic way to know when an optimize is finished?

2008-11-14 Thread Phillip Farber
I'd like to automate my indexing processes. Is there a slick method to know when an optimize on an index has completed? Thanks, Phil

Re: Huge increase in index size adding just 2 fields

2008-11-06 Thread Phillip Farber
Hi Otis and Hoss, My dates are not too granular. They're always YYYY-MM-DD 00:00:00 but I see that I did not omitNorms on the date field and hlb field. Thanks for pointing me in the right direction. Phil Chris Hostetter wrote: : We added the following 2 fields to the above schema as foll

Re: Huge increase in index size adding just 2 fields

2008-11-06 Thread Phillip Farber
large number of terms. Thanks, Phil Phillip Farber wrote: Hi, We're indexing a lot of dirty OCR. So the index is really huge due to the size of the position file. We still get ok response time though with a median of 100ms. Phrase queries are a different matter obviously. But

Huge increase in index size adding just 2 fields

2008-11-03 Thread Phillip Farber
Hi, We're indexing a lot of dirty OCR. So the index is really huge due to the size of the position file. We still get ok response time though with a median of 100ms. Phrase queries are a different matter obviously. But we're seeing some really large increases in index size as we add a cou

Re: Practical number of Solr instances per machine

2008-10-14 Thread Phillip Farber
ry latency, etc. etc. So there is no super clear cut answer. If you have some concrete numbers, that will be easier to answer :) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber <[EMAIL PROTECTED]> To: solr-user@l

Practical number of Solr instances per machine

2008-10-08 Thread Phillip Farber
Hello everyone, What is the generally accepted number of solr instances it makes sense to run on a single machine given solr/lucene threading? Servers now commonly have 4 or 8 cpus. Obviously the more instances you run the bigger your JVM heap needs to be and that takes away from OS cache.

Re: Testing query response time

2008-08-21 Thread Phillip Farber
can't give more advice at this time, I don't fully understand what you are trying to test... Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Phillip Farber <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, Au

Testing query response time

2008-08-20 Thread Phillip Farber
e query set (with empty solr caches). Speaking of empty solr caches, is there a way to flush those while solr is running? What other system states do I need to control for to get a handle on response time? Thanks and regards, Phil ------ Phillip Farber - http://www.umdl.umich.edu

Re: shards and performance

2008-08-19 Thread Phillip Farber
PM, Mike Klaas <[EMAIL PROTECTED]> wrote: On 19-Aug-08, at 10:18 AM, Phillip Farber wrote: I'm trying to understand how splitting a monolithic index into shards improves query response time. Please tell me if I'm on the right track here. Where does the increase in performance c

shards and performance

2008-08-19 Thread Phillip Farber
mind that we would want any protocols around distributed search to be as stable as possible? Or just wait for the 1.3 release? Thanks very much, Phil ------ Phillip Farber - http://www.umdl.umich.edu

Shard searching clarifications

2008-08-15 Thread Phillip Farber
Hi, I just want to be clear on how sharding works so I have two questions. If I have 2 solr instances (solr1 and solr2) each serving a shard is it correct I only need to send my query to one of the shards, e.g. solr1:8080/select?shards=solr1,solr2 ... and that I'll get merged results over bot
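The request shape in the question can be sketched as below. The host names, port, and `/solr` path are placeholders, and the helper name is hypothetical; the only thing taken from the post is that the query goes to one instance with a comma-separated `shards` parameter listing every shard.

```python
from urllib.parse import urlencode


def build_sharded_query(primary, shards, query):
    """Build a select URL on one instance that fans out to all listed shards."""
    params = [
        ("q", query),
        ("shards", ",".join(shards)),  # comma-separated list of shard locations
    ]
    # Keep ':', '/' and ',' unescaped so the shard list stays readable.
    return "http://%s/select?%s" % (primary, urlencode(params, safe=":/,"))


if __name__ == "__main__":
    shard_list = ["solr1:8080/solr", "solr2:8080/solr"]  # hypothetical hosts
    print(build_sharded_query("solr1:8080/solr", shard_list, "dog"))
```

Because the instance that receives the request is just another entry in the shard list, the same helper works unchanged if the primary is rotated among the instances.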

Re: Index size vs. number of documents

2008-08-15 Thread Phillip Farber
By "Index size almost never grows linearly with the number of documents" are you saying it increases more slowly than the number of documents, i.e. sub-linearly, or more rapidly? With dirty OCR the number of unique terms is always increasing due to the garbage "words" -Phil Chris Hostetter w

Re: Index size vs. number of documents

2008-08-14 Thread Phillip Farber
non-English text this crudely if you must support native-language searching. yes. I'd be interested in how this changes your index size if you do decide to try it. There's nothing like having somebody else do research for me . Best Erick On Wed, Aug 13, 2008 at 1:45 PM, Phillip Farbe

Index size vs. number of documents

2008-08-13 Thread Phillip Farber
We're indexing the ocr for a large number of books. Our experimental schema is simple: an id field and an ocr text field (not stored). Currently we just have two data points: 3005 documents = 723 MB index 174237 documents = 51460 MB index These indexes are not optimized. If the index size
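The two data points above can be compared directly by dividing index size by document count. This is plain arithmetic on the figures quoted in the post, with hypothetical names; two points are not enough to establish a growth curve, so treat the result as a rough indication only.

```python
# (documents, index size in MB) as reported in the post
data_points = [(3005, 723), (174237, 51460)]


def mb_per_document(documents, size_mb):
    """Average index size per document, in MB."""
    return size_mb / documents


small = mb_per_document(*data_points[0])
large = mb_per_document(*data_points[1])

# How much the document count and the index size each grew between the points.
doc_growth = data_points[1][0] / data_points[0][0]
size_growth = data_points[1][1] / data_points[0][1]

if __name__ == "__main__":
    print("MB/doc at 3005 docs:   %.3f" % small)
    print("MB/doc at 174237 docs: %.3f" % large)
    print("documents grew %.1fx, index size grew %.1fx" % (doc_growth, size_growth))
```

For these two unoptimized indexes the per-document size is somewhat higher at the larger point, meaning the index grew a little faster than the document count between them.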

Re: Accented search

2008-06-20 Thread Phillip Farber
Regarding indexing words with accented and unaccented characters with positionIncrement zero: Chris Hostetter wrote: you don't really need a custom tokenizer -- just a buffered TokenFilter that clones the original token if it contains accent chars, mutates the clone, and then emits it next w

UnicodeNormalizationFilterFactory

2008-06-20 Thread Phillip Farber
Apologies for reposting. I should have posted this in a new thread. I've seen mention of these filters: But I don't see them in the 1.2 distribution. Am I looking in the wrong place? What will the UnicodeNormalizationFilterFactory do for me? I can't find any documentation on it. Thank

Re: Accented search

2008-06-20 Thread Phillip Farber
I've seen mention of these filters: But I don't see them in the 1.2 distribution. Am I looking in the wrong place? What will the UnicodeNormalizationFilterFactory do for me? I can't find any documentation on it. Thanks, Phil

Re: scaling / sharding questions

2008-06-18 Thread Phillip Farber
This may be slightly off topic, for which I apologize, but is related to the question of searching several indexes as Lance describes below, quoting: "We also found that searching a few smaller indexes via the Solr 1.3 Distributed Search feature is actually faster than searching one large ind

field normalization and omitNorms

2008-05-27 Thread Phillip Farber
Hi all, I've been looking without success for a simple explanation of the effect of omitNorms=false for a text field. Can someone point me to the relevant doc? What is the effect of omitNorms=false on index size and query performance for say 200K documents that have a single large text fiel

Solr queuing behavior

2008-04-27 Thread Phillip Farber
Hello, I have a quasi-realtime indexing application where documents are grouped into collections and documents can be added or removed from collections. The document has an id and multiple collection id (collid) fields reflecting the collections that contain that document. The collid field i

Queuing adds and commits

2008-04-27 Thread Phillip Farber
A while back Hoss described Solr queuing behavior: > searches can go on happily while commits/adds are happening, and > multiple adds can happen in parallel, ... but all adds block while a > commit is taking place. i just give all of clients that update the > index a really large timeout value

Re: limit on number of values in a filter query?

2008-04-03 Thread Phillip Farber
eley wrote: 1000 filters, wow! Perhaps you hit a jetty limit on the size of a GET request or something? Perhaps try POST? -Yonik On Thu, Apr 3, 2008 at 11:03 AM, Phillip Farber <[EMAIL PROTECTED]> wrote: I use a filter query (fq) parameter in my requests to limit the select response to a s

limit on number of values in a filter query?

2008-04-03 Thread Phillip Farber
I use a filter query (fq) parameter in my requests to limit the select response to a subset of all document ids. I'm getting a Solr exception when the number of values in the fq approaches 1000. I get the following response from Solr: _header=[6518349,32231686,m=4,g=4096,p=4096,c=4096]={/so

Re: Help with XmlPullParserException

2008-04-02 Thread Phillip Farber
... Thanks for reading. Phil Phillip Farber wrote: Hello all, I'm indexing a body of OCR and encountered this exception. Apparently it's some kind of XML parser error. Out of thousands of documents, which I create with significant processing to make sure they are XML compliant, only

Help with XmlPullParserException

2008-04-02 Thread Phillip Farber
Hello all, I'm indexing a body of OCR and encountered this exception. Apparently it's some kind of XML parser error. Out of thousands of documents, which I create with significant processing to make sure they are XML compliant, only this one appears to have a problem. But can anyone tell me

stopwords and phrase queries

2008-03-21 Thread Phillip Farber
Am I correct that if I index with stop words: "to", "be", "or" and "not" then phrase query "to be or not to be" will not retrieve any documents? Is there any documentation that discusses the interaction of stop words and phrase queries? Thanks. Phil

Re: Solr feasibility with terabyte-scale data

2008-01-23 Thread Phillip Farber
ords" are nonsense). I saw one attempt to OCR a family tree. As in a stylized tree with the data hand-written along the various branches in every orientation. Not a recognizable word in the bunch Best Erick On Jan 22, 2008 2:05 PM, Phillip Farber <[EMAIL PROTECTED]> wrote: Ryan

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber
a single copy of the index that all searchers (and the master) point to? Yes we're thinking a single copy of the index using hardware-based snapshot technology for the readers a dedicated indexing solr instance updates the index. Reasonable? Otis -- Sematext -- http://semate

Re: Solr feasibility with terabyte-scale data

2008-01-22 Thread Phillip Farber
Ryan McKinley wrote: We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed and a large

Solr feasibility with terabyte-scale data

2008-01-18 Thread Phillip Farber
Hello everyone, We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed and a large text fiel

Re: Searching for two terms together in a multiValued TextField

2007-12-06 Thread Phillip Farber
Hello Hoss, I appreciate your detailed response. I think I like your second alternative because I'd like to score whole books rather than pages in books. It seems to me that the more words one has to work with in a "document" the better the scoring would be for the entire book. Here's a qu

Searching for two terms together in a multiValued TextField

2007-12-05 Thread Phillip Farber
Hello, I'm still new to Solr/Lucene. I want to search documents for 2 or more terms that must appear together on a page. I have a multiValued TextField called "page" in a document with uniqueId called "id" that represents an OCR'd book. My default operator is AND. My default field is "page".

What does "Missing sort order" error mean

2007-11-30 Thread Phillip Farber
Hi, Just getting my head around Solr queries. I get HTTP ERROR: 400 Missing sort order. when I issue this query: http://localhost:8983/solr/select/?q=car&sort=title My title field is: stored="false" required="true"/> where "myAlphaOnlySort" is based on the alphaOnlySort from the example
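A likely cause of this error is that the sort parameter names a field without a direction. Assuming Solr expects the form "field asc" or "field desc", the query from the post could be built as in this sketch; the helper name is hypothetical and the host and port are the ones quoted in the post.

```python
from urllib.parse import urlencode


def build_sorted_query(base_url, query, sort_field, direction="asc"):
    """Build a select URL whose sort parameter includes an explicit direction."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    params = [
        ("q", query),
        ("sort", "%s %s" % (sort_field, direction)),  # e.g. "title asc"
    ]
    # urlencode turns the space into '+', which is valid in a query string.
    return "%s?%s" % (base_url, urlencode(params))


if __name__ == "__main__":
    print(build_sorted_query("http://localhost:8983/solr/select/", "car", "title"))
```

The only difference from the failing request is the added direction after the field name.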

Re: Document field data not getting indexed

2007-11-30 Thread Phillip Farber
Well this one falls into the category of bald-faced embarrassment. It's a bug in my process. Thanks to all for taking the time to respond. Have I said how great solr support is? :-) Phil Phillip Farber wrote: Hi Yonik, Hoss, et al. I'm using numItems=2000 in the luke url so I

Re: Document field data not getting indexed

2007-11-30 Thread Phillip Farber
the 22 docs Campeau *is* in the index and I can retrieve it: 0 90 on 0 ocr:campeau 2.2 10 mdp.39015015394847 44 2007-11-30T13:59:45.783Z Luke: 1 1 1 1 1 <<<<<<<<<<<<< 1 1 Yonik Seeley wrote: On Nov 29, 2007 7:29 PM,

Document field data not getting indexed

2007-11-29 Thread Phillip Farber
Hi, I have 22 documents. I index these by posting them using LWP::UserAgent all with http status 200 OK. One of my documents (id=44) contains the word "Campeau" in the "ocr" field. But according to luke this term does not appear in the index. Yet when I delete the index (delete by query *:

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Phillip Farber
Chris Hostetter wrote: : After following Otis' and Thorsten's advice, I still get: : : HTTP ERROR: 500 No Java compiler available Just so i'm clear, you: 1) downloaded solr, tried out the tutorial, and had the url http://localhost:8983/solr/admin/ work when you ran: > cd $

Re: Help with Debian solr/jetty install?

2007-11-21 Thread Phillip Farber
can find it. e.g. cd ~ vim .bashrc ... export JAVA_HOME=/home/thorsten/opt/java export PATH=$JAVA_HOME/bin:$PATH The important thing is that $JAVA_HOME points to the JDK and it is first in your path! salu2 Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch ---

Help with Debian solr/jetty install?

2007-11-20 Thread Phillip Farber
Hi, I've successfully run as far as the example admin page on Debian linux 2.6. So I installed the solr-jetty packaged for Debian testing which gives me Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the Solr home page at http://localhost:8280/solr But I get an error wh

Multiple collections of items

2007-05-17 Thread Phillip Farber
Hello, I'm yet another new solr user and I'll confess that I haven't read the documentation in great depth but hope someone can at least point me in the right direction. I have an application that manages documents in real-time into collections where a given document can live in more than o