Thanks Hoss. Yes, in a separate thread on the list I reported that
doing a multi-stage optimize worked around the out of space problem. We
use mergefactor=10, maxSegments = 16, 8, 4, 2, 1 iteratively starting at
the closest power of two below the number of segments to merge. Works
nicely s
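For anyone who wants to script the same thing, it is just a series of
optimize messages to the update handler with a decreasing maxSegments.
A rough, untested sketch (the host, port and path are placeholders, and
I'm assuming the stock XML update handler):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SteppedOptimize {
    public static void main(String[] args) throws Exception {
        // Walk maxSegments down by powers of two so no single step has to
        // do the work of a one-shot optimize in one pass.
        int[] steps = {16, 8, 4, 2, 1};
        for (int maxSegments : steps) {
            post("http://localhost:8080/solr/update",
                 "<optimize maxSegments=\"" + maxSegments + "\"/>");
        }
    }

    static void post(String url, String body) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        conn.setReadTimeout(0); // a step can run for a long time; never time out
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
        // Solr doesn't respond until the step has finished.
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("optimize step failed: HTTP "
                    + conn.getResponseCode());
        }
        conn.getInputStream().close();
    }
}

The idea is that each intermediate step merges a smaller slice of the
index, so the transient extra disk space stays below what the one-shot
optimize was demanding here.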
I thought I'd summarize a method that solved the problem we were having
trying to optimize a large shard that was running out of disk space,
df=100% (400g), du=~380g. After we ran out of space, if we restarted
tomcat, segment files disappeared from disk leaving 3 segments.
What worked: we u
But we're still exceeding 2x. And after the optimize fails, if we then
do a commit or bounce tomcat, a bunch of segments disappear. I am stumped.
Yonik Seeley wrote:
On Wed, Oct 7, 2009 at 1:50 PM, Phillip Farber wrote:
So this implies that for a "normal" optimize, in every case, due to
Yonik Seeley wrote:
Does this mean that there's always a lucene IndexReader holding segment
files open so they can't be deleted during an optimize so we run out of disk
space > 2x?
Yes.
A feature could probably now be developed that avoids opening a
reader until it's requested.
That wa
In a separate thread, I've detailed how an optimize is taking > 2x disk
space. We don't use solr distribution/snapshooter. We are using the
default deletion policy = 1. We can't optimize a 192G index in 400GB of
space.
This thread in lucene/java-user
http://www.gossamer-threads.com/lists/l
This is interesting.
On Tue, Oct 6, 2009 at 5:28 PM, Phillip Farber wrote:
I am attempting to optimize a large shard on solr 1.4 and repeatedly get
java.io.IOException: No space left on device. The shard, after a final
commit before optimize, shows a size of about 192GB on a 400GB volume.
I have successfully optimized 2 other shards that were similarly large
without
Thanks, Mark. I really appreciate your confirmation.
Phil
Mark Miller wrote:
Phillip Farber wrote:
Resuming this discussion in a new thread to focus only on this question:
What is the best way to get the size of an index so it does not get too
big to be optimized (or to allow a very large segment merge) given space
limits?
I already have the largest 15,000rpm SCSI direct attached storage
I am trying to automate a build process that adds documents to 10 shards
over 5 machines and need to limit the size of a shard to no more than
200GB because I only have 400GB of disk available to optimize a given shard.
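The check itself is nothing fancier than a recursive du of the index
directory plus a free-space test before an optimize is allowed to start.
A sketch (the path is a placeholder, and the 2x headroom figure is just
the rule of thumb from this thread):

import java.io.File;

public class SafeToOptimize {
    public static void main(String[] args) {
        File indexDir = new File("/solr/data/index"); // placeholder shard path
        long indexBytes = sizeOf(indexDir);
        long freeBytes = indexDir.getUsableSpace();
        // An optimize can transiently need more than 2x the index size,
        // so insist on at least that much free space before starting one.
        boolean safe = freeBytes > 2 * indexBytes;
        System.out.printf("index=%dG free=%dG safeToOptimize=%b%n",
                indexBytes >> 30, freeBytes >> 30, safe);
    }

    static long sizeOf(File f) {
        if (!f.isDirectory()) {
            return f.length();
        }
        long total = 0;
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                total += sizeOf(child);
            }
        }
        return total;
    }
}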
Why does the size (du) of an index typically decrease after a commit?
I'v
Sorry, I should have given more background. We have, at the moment, 3.8
million documents of 0.7MB/doc average, so we have extremely large
shards. We build about 400,000 documents to a shard, resulting in
200GB/shard. We are also using LVM snapshots to manage a snapshot of
the shard which we serve
In order to make maximal use of our storage by avoiding the dead 2x
overhead needed to optimize the index we are considering setting
mergefactor=1 and living with the slow indexing performance which is not
a problem in our use case.
Some questions:
1) Does mergefactor=1 mean that the size o
the newly optimized
files via a FileSwitchDirectory-like implementation that knows
which new files are optimized and should "underneath" go to a
different physical path.
On Mon, Sep 28, 2009 at 7:57 AM, Phillip Farber wrote:
Is it possible to tell Solr or Lucene, when optimizing, to write
Is it possible to tell Solr or Lucene, when optimizing, to write the
files that constitute the optimized index to somewhere other than
SOLR_HOME/data/index or is there something about the optimize that
requires the final segment to be created in SOLR_HOME/data/index?
Thanks,
Phil
Can I expect the index to be left in a usable state after an out of
memory error during a merge, or is it most likely to be corrupt? I'd
really hate to have to start this index build again from square one.
Thanks.
Thanks,
Phil
---
Exception in thread "http-8080-Processor2505"
java.lang.
I have a valid xml document that begins:
mdp.39015052775379
2
Technology transfer and in-house R&D in Indian
industry : in the later 1990s / edited and with an introduction by Binay
Kumar Pattnaik. v.1
Not found
TECHNOLOGY TRANSFER AND IN.HOUSE R&D IN INDIAN INDUSTRY
I believe Solr is throwi
What will the client receive from the primary solr instance if that
instance doesn't get HTTP 200 from all the shards in a multi-shard query?
Thanks,
Phil
Is there any value in a round-robin scheme to cycle through the Solr
instances supporting a multi-shard index over several machines when
sending queries, or is it better to just pick one instance and stick with
it? I'm assuming all machines in the cluster have the same hardware specs.
So sce
Normally to optimize an index you POST to /solr/update. Is
there any way to POST an optimize message to one instance and have it
propagate to all shards sort of like the select?
/solr-shard-1/select?q=dog...&shards=shard-1,shard2
Thanks,
Phil
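As far as I can tell the optimize message does not fan out the way a
shards= select does, so the script has to hit each shard's update
handler itself. A sketch (the shard URLs are placeholders):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class OptimizeAllShards {
    public static void main(String[] args) throws Exception {
        // One explicit optimize per shard; unlike the shards= select,
        // commits and optimizes are not forwarded for you.
        String[] shards = {
            "http://solr1:8080/solr-shard-1/update",
            "http://solr2:8080/solr-shard-2/update"
        };
        for (String shard : shards) {
            HttpURLConnection conn = (HttpURLConnection) new URL(shard).openConnection();
            conn.setDoOutput(true);
            conn.setRequestMethod("POST");
            conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
            OutputStream out = conn.getOutputStream();
            out.write("<optimize/>".getBytes("UTF-8"));
            out.close();
            System.out.println(shard + " -> HTTP " + conn.getResponseCode());
        }
    }
}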
I'm trying to understand the purpose of the initialSize parameter for
the queryResultCache and documentCache. Is it correct that it controls
how much heap is allocated to each cache at startup?
I can see how it makes sense for queryResultCache since it is documented
as an "ordered lists of
Lucene - Solr - Nutch
- Original Message
From: Phillip Farber
To: solr-user
Sent: Monday, June 29, 2009 4:20:26 PM
Subject: Entire heap consumed to answer initial ping()
Jconsole shows the entire 2.1g heap consumed on the first request (a
simple ping) to Solr after a Tomcat restart.
After a Tomcat restart:
13140 tomcat virtual=2255m resident=183m ... jsvc
After the ping():
13140 tomcat virtual=2255m resident=2.0g ... jsvc
Jconsole says my "Tenured Gen"
the http and then the next command in the script runs after the
optimize finishes. Hours later, in our case.
Lance
-Original Message-
From: Phillip Farber [mailto:[EMAIL PROTECTED]
Sent: Friday, November 14, 2008 10:04 AM
To: solr-user@lucene.apache.org
Subject: Programatic way to know w
I'd like to automate my indexing processes. Is there a slick method to
know when an optimize on an index has completed?
Thanks,
Phil
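Lance's answer in code form, more or less: send the optimize over plain
HTTP with no read timeout and simply wait; when the call returns, the
optimize is done. A sketch (the URL is a placeholder):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class WaitForOptimize {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8080/solr/update").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=utf-8");
        conn.setReadTimeout(0); // an optimize can run for hours; never time out
        OutputStream out = conn.getOutputStream();
        out.write("<optimize/>".getBytes("UTF-8"));
        out.close();
        // Blocks here until Solr has finished the optimize and responded.
        System.out.println("optimize done, HTTP " + conn.getResponseCode());
    }
}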
Hi Otis and Hoss,
My dates are not too granular. They're always YYYY-MM-DD 00:00:00 but I
see that I did not omitNorms on the date field and hlb field. Thanks
for pointing me in the right direction.
Phil
Chris Hostetter wrote:
: We added the following 2 fields to the above schema as foll
large number of terms.
Thanks,
Phil
Phillip Farber wrote:
Hi,
We're indexing a lot of dirty OCR. So the index is really huge due to
the size of the position file. We still get ok response time though
with a median of 100ms. Phrase queries are a different matter
obviously. But we're seeing some really large increases in index size
as we add a cou
query latency, etc. etc. So
there is no super clear-cut answer. If you have some concrete numbers, that
will be easier to answer :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Phillip Farber <[EMAIL PROTECTED]>
To: solr-user@l
Hello everyone,
What is the generally accepted number of solr instances it makes sense
to run on a single machine given solr/lucene threading? Servers now
commonly have 4 or 8 cpus. Obviously the more instances you run the
bigger your JVM heap needs to be and that takes away from OS cache.
can't give more advice at this time, I don't fully understand what you are
trying to test...
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
From: Phillip Farber <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, Au
the query set (with empty solr caches).
Speaking of empty solr caches, is there a way to flush those while solr
is running?
What other system states do I need to control for to get a handle on
response time?
Thanks and regards,
Phil
------
Phillip Farber - http://www.umdl.umich.edu
PM, Mike Klaas <[EMAIL PROTECTED]> wrote:
On 19-Aug-08, at 10:18 AM, Phillip Farber wrote:
I'm trying to understand how splitting a monolithic index into shards
improves query response time. Please tell me if I'm on the right track here.
Where does the increase in performance c
mind that
we would want any protocols around distributed search to be as stable as
possible? Or just wait for the 1.3 release?
Thanks very much,
Phil
------
Phillip Farber - http://www.umdl.umich.edu
Hi,
I just want to be clear on how sharding works so I have two questions.
If I have 2 solr instances (solr1 and solr2), each serving a shard,
is it correct that I only need to send my query to one of the shards, e.g.
solr1:8080/select?shards=solr1,solr2 ...
and that I'll get merged results over bot
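That matches my understanding: one request to either instance with a
shards parameter listing both, and that instance merges the results
before responding. A sketch (hostnames are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class ShardedQuery {
    public static void main(String[] args) throws Exception {
        // Query one instance; it fans the request out to every shard
        // listed and merges the results before answering.
        String shards = "solr1:8080/solr,solr2:8080/solr";
        URL url = new URL("http://solr1:8080/solr/select?q="
                + URLEncoder.encode("dog", "UTF-8")
                + "&shards=" + URLEncoder.encode(shards, "UTF-8"));
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line); // merged result set across both shards
        }
        in.close();
    }
}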
By "Index size almost never grows linearly with the number of
documents" are you saying it increases more slowly that the number of
documents, i.e. sub-linearly or more rapidly?
With dirty OCR the number of unique terms is always increasing due to
the garbage "words"
-Phil
Chris Hostetter w
non-English text this crudely if you must
support native-language searching.
yes.
I'd be interested in how this changes your index size if you do decide
to try it. There's nothing like having somebody else do research for
me.
Best
Erick
On Wed, Aug 13, 2008 at 1:45 PM, Phillip Farbe
We're indexing the ocr for a large number of books. Our experimental
schema is simple: an id field and an ocr text field (not stored).
Currently we just have two data points:
3005 documents = 723 MB index
174237 documents = 51460 MB index
These indexes are not optimized.
If the index size
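For a rough feel for those two points (assuming the documents in both
samples are comparable):

723 MB / 3005 docs ~= 0.24 MB per doc
51460 MB / 174237 docs ~= 0.30 MB per doc

So the average cost per document is already creeping up between the two
samples, which is one reason a straight linear extrapolation from the
small index would probably understate the final size.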
Regarding indexing words with accented and unaccented characters with
positionIncrement zero:
Chris Hostetter wrote:
you don't really need a custom tokenizer -- just a buffered TokenFilter
that clones the original token if it contains accent chars, mutates the
clone, and then emits it next w
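For what it's worth, here is roughly how I read that suggestion, written
against the old Lucene 2.x Token API (an untested sketch; the accent
stripping is just java.text.Normalizer folding, not one of the stock
accent filters):

import java.io.IOException;
import java.text.Normalizer;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class AccentedDuplicateFilter extends TokenFilter {
    private Token pending; // folded clone waiting to be emitted

    public AccentedDuplicateFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        // First emit any folded clone stacked on the previous token.
        if (pending != null) {
            Token t = pending;
            pending = null;
            return t;
        }
        Token t = input.next();
        if (t == null) {
            return null;
        }
        String original = t.termText();
        String folded = stripAccents(original);
        if (!folded.equals(original)) {
            Token clone = new Token(folded, t.startOffset(), t.endOffset(), t.type());
            clone.setPositionIncrement(0); // same position as the original
            pending = clone;
        }
        return t;
    }

    private static String stripAccents(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }
}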
Apologies for reposting. I should have posted this in a new thread.
I've seen mention of these filters:
But I don't see them in the 1.2 distribution. Am I looking in the wrong
place? What will the UnicodeNormalizationFilterFactory do for me? I
can't find any documentation on it.
Thanks,
Phil
This may be slightly off topic, for which I apologize, but is related to
the question of searching several indexes as Lance describes below, quoting:
"We also found that searching a few smaller indexes via the Solr 1.3
Distributed Search feature is actually faster than searching one large
ind
Hi all,
I've been looking without success for a simple explanation of the effect
of omitNorms=false for a text field. Can someone point me to the
relevant doc?
What is the effect of omitNorms=false on index size and query
performance for, say, 200K documents that have a single large text fiel
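The back-of-the-envelope I've seen for norms, if I remember the Lucene
file format right, is one byte per document per field that keeps them,
and that array is loaded into memory for searching. So for the case
above:

200,000 docs x 1 byte per doc per field ~= 200 KB for the text field

In other words the size cost is small here; the practical difference is
that omitting norms drops length normalization and index-time field
boosts from scoring on that field.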
Hello,
I have a quasi-realtime indexing application where documents are grouped
into collections and documents can be added or removed from collections.
The document has an id and multiple collection id (collid) fields
reflecting the collections that contain that document. The collid field
i
A while back Hoss described Solr queuing behavior:
> searches can go on happily while commits/adds are happening, and
> multiple adds can happen in parallel, ... but all adds block while a
> commit is taking place. i just give all of clients that update the
> index a really large timeout value
Yonik Seeley wrote:
1000 filters, wow!
Perhaps you hit a jetty limit on the size of a GET request or
something? Perhaps try POST?
-Yonik
On Thu, Apr 3, 2008 at 11:03 AM, Phillip Farber <[EMAIL PROTECTED]> wrote:
I use a filter query (fq) parameter in my requests to limit the select
response to a subset of all document ids. I'm getting a Solr exception
when the number of values in the fq approaches 1000. I get the
following response from Solr:
_header=[6518349,32231686,m=4,g=4096,p=4096,c=4096]={/so
... Thanks
for reading.
Phil
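Following up on Yonik's suggestion above, the same select can be sent as
a form POST so the ~1000-value fq no longer has to fit on the request
line. A sketch (the field name and values are made up):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class BigFilterQueryPost {
    public static void main(String[] args) throws Exception {
        // Build a large filter query like: id:(1 2 3 ... 1000)
        StringBuilder fq = new StringBuilder("id:(");
        for (int i = 1; i <= 1000; i++) {
            fq.append(i).append(' ');
        }
        fq.append(')');

        String body = "q=" + URLEncoder.encode("ocr:dog", "UTF-8")
                + "&fq=" + URLEncoder.encode(fq.toString(), "UTF-8");

        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8080/solr/select").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        // As a form body, the parameters are no longer subject to the
        // container's limit on request line / header size.
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + conn.getResponseCode());
    }
}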
Phillip Farber wrote:
Hello all,
I'm indexing a body of OCR and encountered this exception. Apparently
it's some kind of XML parser error. Out of thousands of documents,
which I create with significant processing to make sure they are XML
compliant, only this one appears to have a problem. But can anyone tell
me
Am I correct that if I index with stop words: "to", "be", "or" and "not"
then phrase query "to be or not to be" will not retrieve any documents?
Is there any documentation that discusses the interaction of stop words
and phrase queries? Thanks.
Phil
ords" are nonsense). I saw one attempt
to OCR a family tree. As in a stylized tree with the data
hand-written along the various branches in every orientation. Not a
recognizable word in the bunch
Best
Erick
On Jan 22, 2008 2:05 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:
Ryan
a single copy
of the index that all searchers (and the master) point to?
Yes, we're thinking a single copy of the index using hardware-based
snapshot technology for the readers; a dedicated indexing solr instance
updates the index. Reasonable?
Otis
--
Sematext -- http://semate
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale dataset
of OCR. Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting. The schema is simple too. A document
consists of a numeric id, stored and indexed, and a large text fiel
Hello Hoss,
I appreciate your detailed response. I think I like your second
alternative because I'd like to score whole books rather than pages in
books. It seems to me that the more words one has to work with in a
"document" the better the scoring would be for the entire book.
Here's a qu
Hello,
I'm still new to Solr/Lucene. I want to search documents for 2 or more
terms that must appear together on a page. I have a multiValued
TextField called "page" in a document with uniqueId called "id" that
represents a OCR'd book. My default operator is AND. My default field is
"page".
Hi,
Just getting my head around Solr queries. I get "HTTP ERROR: 400
Missing sort order" when I issue this query:
http://localhost:8983/solr/select/?q=car&sort=title
My title field is:
stored="false" required="true"/>
where "myAlphaOnlySort" is based on the alphaOnlySort from the example
Well this one falls into the category of bald faced embarrassment. It's
a bug in my process. Thanks to all for taking the time to respond.
Have I said how great solr support is? :-)
Phil
Phillip Farber wrote:
Hi Yonik, Hoss, et. al.
I'm using numItems=2000 in the luke url so I
the 22 docs Campeau *is* in the index and
I can retrieve it:
[select response for q=ocr:campeau: one doc returned, id 44, mdp.39015015394847, timestamp 2007-11-30T13:59:45.783Z]
Luke:
[the term counts in question, all showing 1]
Yonik Seeley wrote:
On Nov 29, 2007 7:29 PM,
Hi,
I have 22 documents. I index these by posting them using LWP::UserAgent
all with http status 200 OK.
One of my documents (id=44) contains the word "Campeau" in the "ocr"
field. But according to luke this term does not appear in the index.
Yet when I delete the index (delete by query *:
Chris Hostetter wrote:
: After following Otis' and Thorsten's advice, I still get:
:
: HTTP ERROR: 500 No Java compiler available
Just so i'm clear, you:
1) downloaded solr, tried out the tutorial, and had the
url http://localhost:8983/solr/admin/ work when you ran:
> cd $
can find it.
e.g.
cd ~
vim .bashrc
...
export JAVA_HOME=/home/thorsten/opt/java
export PATH=$JAVA_HOME/bin:$PATH
The important thing is that $JAVA_HOME points to the JDK and it is first
in your path!
salu2
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
---
Hi,
I've successfully run as far as the example admin page on Debian linux 2.6.
So I installed the solr-jetty packaged for Debian testing which gives me
Jetty 5.1.14-1 and Solr 1.2.0+ds1-1. Jetty starts fine and so does the
Solr home page at http://localhost:8280/solr
But I get an error wh
Hello,
I'm yet another new solr user and I'll confess that I haven't read the
documentation in great depth but hope someone can at least point me in
the right direction.
I have an application that manages documents in real-time into
collections where a given document can live in more than o