Thanks wunder,
I really appreciate the help.
Tom
Thanks wunder and Lance,
In the discussions I've seen of Japanese IR in the English language IR
literature, Hiragana is either removed or strings are segmented first by
character class. I'm interested in finding out more about why bigramming
across classes is desirable.
Based on my limited und
I have a few questions about the CJKBigram filter.
About 10% of our queries that contain Han characters are single character
queries. It looks like the CJKBigram filter only outputs single characters
when there are no adjacent bigrammable characters in the input. This means we
would have to
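(Purely as an illustrative sketch of the kind of analysis chain in question, not our actual schema; the tokenizer choice and the per-script flags below are my assumptions:)

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- tokenizer tags each token with its script/character class -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- form overlapping bigrams for Han, Hiragana, Katakana and Hangul runs;
         a lone CJK character with no bigrammable neighbor is emitted as a unigram -->
    <filter class="solr.CJKBigramFilterFactory"
            han="true" hiragana="true" katakana="true" hangul="true"/>
  </analyzer>
</fieldType>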
Hello all,
I'm getting ready to upgrade from Solr 3.4 to Solr 3.6 and I noticed that
maxMergeDocs is no longer in the example solrconfig.xml.
Has maxMergeDocs been deprecated, or does the TieredMergePolicy ignore it?
Since our Docs are about 800K or more and the setting in the old example
solrco
es = true;
}
on TextField. Specifying autoGeneratePhraseQueries explicitly on a field type
overrides whatever the default may be.
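(A hedged illustration rather than anything from an actual schema; the field type name and analyzer are invented, the attribute is the point:)

<!-- an explicit setting here overrides whatever the schema-version default is -->
<fieldType name="text" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>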
Erik
On Feb 23, 2012, at 14:45 , Burton-West, Tom wrote:
> Seems like a change in default behavior like this should be included in the
> chang
Seems like a change in default behavior like this should be included in the
CHANGES.txt for Solr 3.5.
Not sure how to do that.
Tom
-Original Message-
From: Naomi Dushay [mailto:ndus...@stanford.edu]
Sent: Thursday, February 23, 2012 1:57 PM
To: solr-user@lucene.apache.org
Subject: autoG
Hello ,
Searching real-time sounds difficult with that amount of data. With large
documents, 3 million documents, and 5TB of data the index will be very large.
With indexes that large your performance will probably be I/O bound.
Do you plan on allowing phrase or proximity searches? If so, you
Thanks so much for your reply Hoss,
I didn't realize how much more complicated this gets with distributed search.
Do you think it's worth opening a JIRA issue for this?
Is there already some ongoing work on the faceting code that this might fit in
with?
In the meantime, I think I'll go ahead an
ted a similar feature for a categorization suggestion service. I
did the faceting in the client code, which is not the best-performing approach,
but it worked very well.
It would be nice to have the Solr server do the faceting for performance.
Burton-West, Tom wrote:
>
> If relevance ranking is w
If relevance ranking is working well, in theory it doesn't matter how many hits
you get as long as the best results show up in the first page of results.
However, the default in choosing which facet values to show is to show the
facets with the highest count in the entire result set. Is there
r setting ignored?
Tom
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Friday, September 16, 2011 7:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4?
On Fri, Sep 16, 2011 at 6:53 PM, Burton-West, Tom wrote
Hello,
The TieredMergePolicy has become the default with Solr 3.3, but the
configuration in the example uses the mergeFactor setting, which applies to the
LogByteSizeMergePolicy.
How is the mergeFactor interpreted by the TieredMergePolicy?
Is there an example somewhere showing how to configure t
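(I haven't checked this against the 3.3/3.4 example config, but a TieredMergePolicy block can be sketched roughly like this; maxMergeAtOnce and segmentsPerTier together play approximately the role mergeFactor played for the log merge policies:)

<indexDefaults>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- roughly analogous to mergeFactor=10 -->
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexDefaults>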
Hi Jonathan and Markus,
>>Why 3 shards on one machine instead of one larger shard per machine?
Good question!
We made this architectural decision several years ago and I'm not remembering
the rationale at the moment. I believe we originally made the decision due to
some tests showing a sweetsp
Hi Markus,
Just as a data point for a very large sharded index, we have the full text of
9.3 million books with an index size of about 6+ TB spread over 12 shards on 4
machines. Each machine has 3 shards. The size of each shard ranges between
475GB and 550GB. We are definitely I/O bound. Our m
Hello,
On Mon, 2011-07-04 at 13:51 +0200, Jame Vaalet wrote:
> What would be the maximum size of a single SOLR index file for resulting in
> optimum search time ?
How do you define optimum? Do you want the fastest possible response time
at any cost or do you have a specific response time go
Hi Shawn,
Thanks for sharing this information. I also found that in our use case, for
some reason the default settings for the concurrent garbage collector seem to
size the young generation way too small (At least for heap sizes of 1GB or
larger.) Can you also let us know what version of the
Hi Dimitry,
>>The parameters you have mentioned -- termInfosIndexDivisor and
>>termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you
>>using SOLR 3.1?
I'm pretty sure that the termIndexInterval (ratio of tii file to tis file) is
in the 1.4.1 example solrconfig.xml file, alt
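(For what it's worth, sketched from memory rather than quoted from any particular release's example file; the values are illustrative:)

<indexDefaults>
  <!-- index-time setting: how often an entry is written to the tii terms index -->
  <termIndexInterval>128</termIndexInterval>
</indexDefaults>

<!-- search-time setting: load only every Nth tii entry into RAM -->
<indexReaderFactory name="IndexReaderFactory"
                    class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="setTermIndexDivisor">2</int>
</indexReaderFactory>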
Thank you Koji,
I'll take a look at SingleFragListBuilder, LUCENE-2464, and SOLR-1985, and I
will update the wiki on Monday.
Tom
There is SingleFragListBuilder for this purpose. Please see:
https://issues.apache.org/jira/browse/LUCENE-2464
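(Roughly, and only as a sketch: the solrconfig.xml side looks something like the snippet below; it is selected at request time with hl.fragListBuilder=single so the whole field value comes back as one fragment.)

<highlighting>
  <!-- returns the entire field value as a single fragment -->
  <fragListBuilder name="single"
                   class="org.apache.solr.highlight.SingleFragListBuilder"/>
</highlighting>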
> 3) A
According to the documentation on the Solr wiki page, setting the hl.fragsize
parameter to "0" indicates that the whole field value should be used (no
fragmenting). However the FastVectorHighlighter throws an exception
message fragCharSize(0) is too small. It must be 18 or higher.
java.lang.
Hi Koji,
Thank you for your reply.
>> It is the feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery
>> and DisjunctionMaxQuery
>> and Query constructed by those queries.
Sorry, I'm not sure I understand. Are you saying that FVH supports MultiTerm
highlighting?
Tom
Hi Erick,
Thanks for asking, yes we have termVectors=true set:
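(The field definition itself didn't survive the archive; a typical FVH-ready field, with the field name invented here, needs positions and offsets as well:)

<field name="ocr" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>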
I guess I should also mention that highlighting works fine using the
fastVectorHighLighter as long as we don't do a MultiTerm query. For example
see the query and results appended below (using the same hl parameters listed
in t
We are trying to implement highlighting for wildcard (MultiTerm) queries. This
seems to work fine with the regular highlighter, but when we try to use the
fastVectorHighlighter we don't see any results in the highlighting section of
the response. Appended below are the parameters we are using.
Hi Dmitry,
I am assuming you are splitting one very large index over multiple shards
rather than replicating an index multiple times.
Just for a point of comparison, I thought I would describe our experience with
large shards. At HathiTrust, we run a 6 terabyte index over 12 shards. This is
Hi Otis,
Our OCR fields average around 800 KB. My guess is that the largest docs we
index (in a single OCR field) are somewhere between 2 and 10MB. We have had
issues where the in-memory representation of the document (the in memory index
structures being built) is several times the size of t
If I have a query with a filter query such as : " q=art&fq=history" and then
run a second query "q=art&fq=-history", will Solr realize that it can use the
cached results of the previous filter query "history" (in the filter cache) or
will it not realize this and have to actually do a second fi
y idea why this happening and if we now optimize it using SFF it
should be fine in future with CFF= false?
P.S: Increasing the MergeFactor didn't even work.
On Wed, Apr 27, 2011 at 10:09 PM, Burton-West, Tom wrote:
> Hi Salman,
>
> Sounds like somehow you are triggering merge
Hi Salman,
Sounds like somehow you are triggering merges or optimizes. What is your
mergeFactor?
Have you turned on the IndexWriter log?
In solrconfig.xml
<infoStream file="...">true</infoStream>
In our case we feed the directory name as a Java property in our Java startup
script, but you can also hard-code where you want
Don't know your use case, but if you just want a list of the 400 most common
words you can use the Lucene contrib HighFreqTerms.java with the -t flag.
You have to point it at your lucene index. You also probably don't want Solr
to be running and want to give the JVM running HighFreqTerms a l
>> As far as I know, Solr will never arrive to a segment file greater than 2GB,
>>so this shouldn't be a problem.
Solr can easily create a file size over 2GB, it just depends on how much data
you index and your particular Solr configuration, including your
ramBufferSizeMB, your mergeFactor, and
: Thursday, April 14, 2011 5:41 PM
To: solr-user@lucene.apache.org; yo...@lucidimagination.com
Cc: Burton-West, Tom
Subject: Re: Understanding the DisMax tie parameter
: Perhaps the parameter could have had a better name. It's essentially
: max(score of matching clauses) + tie * (sco
Hello,
I'm having trouble understanding the relationship of the word "tie" and
"tiebreaker" to the explanation of this parameter on the wiki.
What two (or more things) are in a tie? and how does the number in the range
from 0 to 1 break the tie?
http://wiki.apache.org/solr/DisMaxQParserPlugin#t
xceed some number to trigger the bug?
I rebuilt lucene-core-3.1-SNAPSHOT.jar with your patch and it fixes the
problem.
Tom
-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Monday, April 11, 2011 1:00 PM
To: Burton-West, Tom
Cc: solr
ess [mailto:luc...@mikemccandless.com]
Sent: Monday, April 11, 2011 8:40 AM
To: solr-user@lucene.apache.org
Cc: Burton-West, Tom
Subject: Re: ArrayIndexOutOfBoundsException with facet query
Tom,
I think I see where this may be -- it looks like another > 2B terms
bug in Lucene (we are using an int in
The query below results in an array out of bounds exception:
select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr
Here is the exception:
Exception during facet.field of
topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149
at
org.apache.lucene.index.TermInfosR
+1 on some kind of simple performance framework that would allow comparing Solr
vs Lucene. Any chance the Lucene benchmark programs in contrib could be
adapted to read Solr config information?
BTW: You probably want to empty the OS cache in addition to restarting Solr
between each run if the in
This page discusses the reasons why it's not a simple one to one mapping
http://www.kanji.org/cjk/c2c/c2cbasis.htm
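(One approach that is sometimes used, offered only as a sketch and not as a summary of that page: normalize both forms at analysis time with the ICU transform filter from the analysis-extras contrib.)

<!-- fold Traditional Chinese characters to Simplified so either form matches -->
<filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>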
Tom
-Original Message-
> I have documents that contain both simplified and traditional Chinese
> characters. Is there any way to search across them? For example, if someone
Hello all,
We are getting intermittent socket timeout errors (see below). Out of about
600,000 indexing requests, 30 returned these socket timeout errors. We haven't
been able to correlate these with large merges, which tends to slow down the
indexing response rate.
Does anyone know where we
riter and
SolrIndexConfig trying to better understand how solrconfig.xml gets
instantiated and how it affects the readers and writers.
Tom
From: Robert Muir [rcm...@gmail.com]
On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom wrote:
>>>Your
>>Your setting isn't being applied to the reader IW uses during
>>merging... its only for readers Solr opens from directories
>>explicitly.
>>I think you should open a jira issue!
Do I understand correctly that this setting in theory could be applied to the
reader IW uses during merging but is no
Thanks Mike,
>>But, if you are doing deletions (or updateDocument, which is just a
>>delete + add under-the-hood), then this will force the terms index of
>>the segment readers to be loaded, thus consuming more RAM.
Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance
a
Hello all,
Are there any general guidelines for determining the main factors in memory use
during merges?
We recently changed our indexing configuration to speed up indexing but in the
process of doing a very large merge we are running out of memory.
Below is a list of the changes and part of t
I see variables used to access java system properties in solrconfig.xml and
schema.xml:
http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution
${solr.data.dir:}
or
${solr.abortOnConfigurationError:true}
Is there a way to access environment variables or does everything have to be
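(Sketching the usual workaround rather than answering the environment-variable question directly: the substitution syntax only sees Java system properties, so an environment variable is normally bridged with a -D flag at startup and then referenced like any other property; the property name and path below are invented.)

<!-- e.g. started with -Dsolr.data.dir="$SOLR_DATA_DIR" -->
<dataDir>${solr.data.dir:/var/lib/solr/data}</dataDir>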
-user@lucene.apache.org
Subject: Re: ramBufferSizeMB not reflected in segment sizes in index
On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom wrote:
> Thanks Mike,
>
> Yes we have many unique terms due to dirty OCR and 400 languages and probably
> lots of low doc freq terms as well (altho
what merges are taking place.
Mike
On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom wrote:
> We are using a recent Solr 3.x (See below for exact version).
>
> We have set the ramBufferSizeMB to 320 in both the indexDefaults and the
> mainIndex sections of our solrconfig.xml:
>
>
We are using a recent Solr 3.x (See below for exact version).
We have set the ramBufferSizeMB to 320 in both the indexDefaults and the
mainIndex sections of our solrconfig.xml:
<ramBufferSizeMB>320</ramBufferSizeMB>
20
We expected that this would mean that the index would not write to disk until
it reached somewhere approximate
If I want to delete an entire index and start over, in previous versions of
Solr, you could stop Solr, delete all files in the index directory and restart
Solr. Solr would then create empty segments files and you could start
indexing. In Solr 3x if I delete all the files in the index directo
optmize automatically?
tks
On Fri, Nov 12, 2010 at 2:39 PM, Burton-West, Tom wrote:
> Hi Claudio,
>
> What's happening when you re-index the documents is that Solr/Lucene
> implements an update as a delete plus an add. Because of the nature of
> inverted indexes, deleting
Hi Claudio,
What's happening when you re-index the documents is that Solr/Lucene implements
an update as a delete plus an add. Because of the nature of inverted indexes,
physically deleting a document would require a rewrite of the entire index. In order to
avoid rewriting the entire index each time one d
es. (Unless someone beats
me to it :)
Tom
-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com]
Sent: Monday, November 01, 2010 12:49 PM
To: solr-user@lucene.apache.org
Subject: Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support
from Solr
On Mon, Nov 1, 2010
We are trying to solve some multilingual issues with our Solr analysis filter
chain and would like to use the new Lucene 3.x filters that are Unicode
compliant.
Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with
UAX#29 support from Solr?
Is it just a matter of writing
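(Hedging on versions: once the analysis-extras contrib jars are on the classpath, the ICU tokenizer is exposed as an ordinary factory, so the schema side is roughly:)

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Unicode UAX#29 word-break rules via ICU -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
  </analyzer>
</fieldType>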
Tom
-Original Message-
From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley
Sent: Friday, October 15, 2010 1:19 PM
To: solr-user@lucene.apache.org
Subject: Re: filter query from external list of Solr unique IDs
On Fri, Oct 15, 2010 at 11:49 AM, Burton-West, Tom wrote:
>
Hi Jonathan,
The advantages of the obvious approach you outline are that it is simple, it
fits in to the existing Solr model, it doesn't require any customization or
modification to Solr/Lucene java code. Unfortunately, it does not scale well.
We originally tried just what you suggest for our
At the Lucene Revolution conference I asked about efficiently building a filter
query from an external list of Solr unique ids.
Some use cases I can think of are:
1) personal sub-collections (in our case a user can create a small subset
of our 6.5 million doc collection and then run filter
Hi Mike,
>>Do you use multiple threads for indexing? Large RAM buffer size is
>>also good, but I think perf peaks out maybe around 512 MB (at least
>>based on past tests)?
We are using Solr; I'm not sure if Solr uses multiple threads for indexing. We
have 30 "producers" each sending documents
Hi all,
At some point we will need to re-build an index that totals about 3 terabytes
in size (split over 12 shards). At our current indexing speed we estimate that
this will take about 4 weeks. We would like to reduce that time. It appears
that our main bottleneck is disk I/O during index m
We are having some memory and GC issues. I'm trying to get a handle on the
contribution of the Solr caches. Is there a way to estimate the amount of
memory used by the documentCache and the queryResultCache?
I assume if we know the average size of our stored fields we can just multiply
the s
Hi Yonik,
>>If the new "autoGeneratePhraseQueries" is off, position doesn't matter, and
>>the query will
>>be treated as "index" OR "reader".
Just wanted to make sure, in Solr does autoGeneratePhraseQueries = "off" treat
the query with the *default* query operator as set in SolrConfig rather t
Hi Jonathan,
>> I'm afraid I'm having trouble understanding "if the analyzer returns more
>> than one position back from a "queryparser token"
>>I'm not sure if "the queryparser forms a phrase query without explicit phrase
>>quotes" is a problem for me, I had no idea it happened until now, ne
Hi all,
The CommonGrams filter is designed to only work on phrase queries. It is
designed to solve the problem of slow phrase queries with phrases containing
common words, when you don't want to use stop words. It would not make sense
for Boolean queries. Boolean queries just get passed throu
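(A sketch of the usual configuration, with the words file name invented: the index-side filter emits the original terms plus the common-word bigrams, while the query-side variant emits only the bigrams so phrase queries hit the pre-built pairs.)

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
</analyzer>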
Thanks Kent for your info.
We are not doing any faceting, sorting, or much else. My guess is that most of
the memory increase is just the data structures created when parts of the frq
and prx files get read into memory. Our frq files are about 77GB and the prx
files are about 260GB per sha
Thanks Robert and everyone!
I'm working on changing our JVM settings today, since putting Solr 1.4.1 into
production will take a bit more work and testing. Hopefully, I'll be able to
test the setTermIndexDivisor on our test server tomorrow.
Mike, I've started the process to see if we can provi
Thanks Mike,
>>Do you use a terms index divisor? Setting that to 2 would halve the
>>amount of RAM required but double (on average) the seek time to locate
>>a given term (but, depending on your queries, that seek time may still
>>be a negligible part of overall query time, ie the tradeoff could
We have noticed that when the first query hits Solr after starting it up,
memory use increases significantly, from about 1GB to about 16GB, and then as
queries are received it goes up to about 19GB at which point there is a Full
Garbage Collection which takes about 30 seconds and then memory use
Hi all,
When we run the first query after starting up Solr, memory use goes up from
about 1GB to 15GB and never goes below that level. In debugging a recent OOM
problem I ran jmap with the output appended below. Not surprisingly, given the
size of our indexes, it looks like the TermInfo and T
+1
I just had occasion to debug something where the interaction between the
queryparser and the analyzer produced *interesting* results. Having a separate
jsp that includes the whole chain (i.e. analyzer/tokenizer/filter and qp) would
be great!
Tom
-Original Message-
From: Michael McC
Hi Peter,
If hits aren't showing up, and you aren't getting any queryResultCache hits
even with the exact query being repeated, something is very wrong. I'd suggest
first getting the query result cache working, and then moving on to look at
other possible bottlenecks.
What are your settings
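(For comparison only, not your actual settings; a fairly ordinary solrconfig.xml block looks like this:)

<queryResultCache class="solr.LRUCache"
                  size="512" initialSize="512" autowarmCount="128"/>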
Hi Peter,
Can you give a few more examples of slow queries?
Are they phrase queries? Boolean queries? prefix or wildcard queries?
If one-word queries are your slow queries, then CommonGrams won't help.
CommonGrams will only help with phrase queries.
How are you using termvectors? That may be
Hi Peter,
A few more details about your setup would help list members to answer your
questions.
How large is your index?
How much memory is on the machine and how much is allocated to the JVM?
Besides the Solr caches, Solr and Lucene depend on the operating system's disk
caching for caching of
A good starting place might be the list of stemming errors for the original
Porter stemmer in this article that describes k-stem:
Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings
of the 16th annual international ACM SIGIR conference on Research and
development in i
Hi Jason,
Are you looking for the total number of unique terms or total number of term
occurrences?
Checkindex reports both, but does a bunch of other work so is probably not the
fastest.
If you are looking for total number of term occurrences, you might look at
contrib/org/apache/lucene/misc
Hi Ken,
This is all very dependent on your documents, your indexing setup and your
hardware. Just as an extreme data point, I'll describe our experience.
We run 5 clients on each of 6 machines to send documents to Solr using the
standard http xml process. Our documents contain about 10 field
Hi all,
We are about to test out various factors to try to speed up our indexing
process. One set of experiments will try various ramBufferSizeMB settings.
Since the factors we will be varying are at the Lucene level, we are
considering using the Lucene Benchmark utilities in Lucene/cont
Thanks Koji,
That was the information I was looking for. I'll be sure to post the test
results to the list. It may be a few weeks before we can schedule the tests
for our test server.
Tom
>>I've never tried it but NoMergePolicy and NoMergeScheduler
>>can be specified in solrconfig.xml:
>>
Hi Kallin,
Given the previous postings on the list about terrible NFS performance we were
pleasantly surprised when we did some tests against a well tuned NFS RAID array
on a private network. We got reasonably good results (given our large index
sizes.) See
http://www.hathitrust.org/blogs/lar
Is it possible to use the NoOpMergePolicy (
https://issues.apache.org/jira/browse/LUCENE-2331 ) from Solr?
We have very large indexes and always optimize, so we are thinking about using
a very large ramBufferSizeMB
and a NoOpMergePolicy and then running an optimize to avoid extra disk reads
a
We are currently indexing 5 million books in Solr, scaling up over the next few
years to 20 million. However we are using the entire book as a Solr document.
We are evaluating the possibility of indexing individual pages as there are
some use cases where users want the most relevant pages rega
Hello all,
We have been running a configuration in production with 3 solr instances under
one tomcat with 16GB allocated to the JVM. (java -Xmx16384m -Xms16384m) I
just noticed the warning in the LucidWorks Certified Distribution Reference
Guide that warns against using more than 2GB (see be
Thanks Robert,
I've been thinking about this since you suggested it on another thread. One
problem is that it would also remove real words. Apparently 40-60% of the words
in large corpora occur only once
(http://en.wikipedia.org/wiki/Hapax_legomenon).
There are a couple of use cases where r
Hello all,
We have been indexing a large collection of OCR'd text. About 5 million books
in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error
rate creates a relatively large number of meaningless unique terms. (See
http://www.hathitrust.org/blogs/large-scale-search/too
Hello all,
At some point we will need to re-build an index that totals about 2 terabytes
in size (split over 10 shards). At our current indexing speed we estimate that
this will take about 3 weeks. We would like to reduce that time. It appears
that our main bottleneck is disk I/O.
We curre
Hello all,
After optimizing rather large indexes on 10 shards (each index holds about
500,000 documents and is about 270-300 GB in size) we started getting
intermittent TermInfosReader.get() ArrayIndexOutOfBounds exceptions. The
exceptions sometimes seem to occur on all 10 shards at the sam
Hello,
We are trying to debug an indexing/optimizing problem and have tried setting
the infoStream file in solrconfig.xml so that the SolrIndexWriter will write a
log file. Here is our setting:
<infoStream file="...">true</infoStream>
After making that change to solrconfig.xml, restarting Solr, we see a message
in the tomc
Hello all,
When I start up Solr from the example directory using start.jar, it seems to
start up, but when I go to the localhost admin url
(http://localhost:8983/solr/admin) I get a 404 (See message appended below).
Has the url for the Solr admin changed?
Tom
Tom Burton-West
---
Here
In trying to understand the various options for WordDelimiterFilterFactory, I
tried setting all options to 0.
This seems to prevent a number of words from being output at all. In particular
"can't" and "99dxl" don't get output, nor do any wods containing hypens. Is
this correct behavior?
Here
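(The configuration itself was cut off by the archive; an everything-off WordDelimiterFilterFactory would presumably look something like this sketch. With every option at 0, a token such as "can't" produces no subwords and, without preserveOriginal, no original token either, so it disappears.)

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="0" generateNumberParts="0"
        catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="0" preserveOriginal="0"/>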
Hello all,
We are experimenting with the ShingleFilter with a very large document set (1
million full-text books). Because the ShingleFilter indexes every word pair as
a token, the number of unique terms increases tremendously. In our experiments
so far the tii and tis files are getting very l
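(For reference, the kind of setup being described, sketched rather than quoted; with unigrams still emitted, every adjacent word pair becomes an additional unique term, which is what makes the tii/tis files balloon.)

<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>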
Hello all,
As I understand distributed Solr, a request for a distributed search
goes to a particular Solr instance with a list of arguments specifying
the addresses of the shards to search. The Solr instance to which the
request is first directed is responsible for distributing the query to
the o
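(A hedged sketch of what that looks like in practice, with the host names invented; the shard list can be supplied per request or fixed in a handler's defaults:)

<requestHandler name="/distributed" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- the receiving instance fans the query out to these shards and merges the results -->
    <str name="shards">shard1.example.org:8983/solr,shard2.example.org:8983/solr</str>
  </lst>
</requestHandler>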
Thanks Yonik,
>>The next nightly build (Dec-01-2008) should have the changes.
The latest nightly build seems to be 30-Nov-2008 08:20,
http://people.apache.org/builds/lucene/solr/nightly/
has the version with the NIO fix been built? Are we looking in the
wrong place?
Tom
Tom Burton-West
Infor
Hello all,
We are having problems with extremely slow phrase queries when the
phrase query contains common words. We are reluctant to just use stop
words due to various problems with false hits and some things becoming
impossible to search with stop words turned on. (For example "to be or
not to
Hello,
We are working with a very large index and with large documents (300+
page books.) It appears that the bottleneck on our system is the disk
IO involved in reading position information from the prx file for
commonly occurring terms.
An example slow query is "the new economics".
To pr
the latest Solr nightly build? We've recently improved this through
the use of NIO.
-Yonik
On Fri, Nov 7, 2008 at 4:23 PM, Burton-West, Tom <[EMAIL PROTECTED]>
wrote:
> Hello,
>
> We are testing Solr with a simulation of 30 concurrent users. We are
> getting socket timeo
Hello,
We are testing Solr with a simulation of 30 concurrent users. We are
getting socket timeouts and the thread dump from the admin tool shows
about 100+ threads with a similar message about a lock. (Message
appended below).
We suspect this may have something to do with one or more phrase que