RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Burton-West, Tom
Thanks wunder, I really appreciate the help. Tom

RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Burton-West, Tom
Thanks wunder and Lance, In the discussions I've seen of Japanese IR in the English language IR literature, Hiragana is either removed or strings are segmented first by character class. I'm interested in finding out more about why bigramming across classes is desirable. Based on my limited und

CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-27 Thread Burton-West, Tom
I have a few questions about the CJKBigram filter. About 10% of our queries that contain Han characters are single character queries. It looks like the CJKBigram filter only outputs single characters when there are no adjacent bigrammable characters in the input. This means we would have to
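
For reference, a minimal schema.xml sketch of the analysis chain under discussion; the fieldType name and the han/hiragana/katakana/hangul flags follow the Solr 3.6 example schema and should be checked against the version in use:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalize half-width/full-width forms before bigramming -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- bigram Han, Hiragana, Katakana and Hangul runs; an isolated character
         with no bigrammable neighbor is emitted as a single-character token -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true"/>
  </analyzer>
</fieldType>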

maxMergeDocs in Solr 3.6

2012-04-19 Thread Burton-West, Tom
Hello all, I'm getting ready to upgrade from Solr 3.4 to Solr 3.6 and I noticed that maxMergeDocs is no longer in the example solrconfig.xml. Has maxMergeDocs been deprecated, or does the TieredMergePolicy ignore it? Since our Docs are about 800K or more and the setting in the old example solrco

RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
es = true; } on TextField. Specifying autoGeneratePhraseQueries explicitly on a field type overrides whatever the default may be. Erik On Feb 23, 2012, at 14:45 , Burton-West, Tom wrote: > Seems like a change in default behavior like this should be included in the > chang
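
The override Erik describes is a per-fieldType attribute in schema.xml; a minimal sketch (fieldType name assumed):

<fieldType name="text" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
  ...
</fieldType>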

RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
Seems like a change in default behavior like this should be included in the changes.txt for Solr 3.5. Not sure how to do that. Tom -Original Message- From: Naomi Dushay [mailto:ndus...@stanford.edu] Sent: Thursday, February 23, 2012 1:57 PM To: solr-user@lucene.apache.org Subject: autoG

RE: Can Apache Solr Handle TeraByte Large Data

2012-01-16 Thread Burton-West, Tom
Hello, Searching real-time sounds difficult with that amount of data. With large documents, 3 million documents, and 5TB of data the index will be very large. With indexes that large your performance will probably be I/O bound. Do you plan on allowing phrase or proximity searches? If so, you

RE: Getting facet counts for 10,000 most relevant hits

2011-10-03 Thread Burton-West, Tom
Thanks so much for your reply Hoss, I didn't realize how much more complicated this gets with distributed search. Do you think it's worth opening a JIRA issue for this? Is there already some ongoing work on the faceting code that this might fit in with? In the meantime, I think I'll go ahead an

RE: Getting facet counts for 10,000 most relevant hits

2011-09-30 Thread Burton-West, Tom
ted a similar feature for a categorization suggestion service. I did the faceting in the client code, which is not exactly the best performing but it worked very well. It would be nice to have the Solr server do the faceting for performance. Burton-West, Tom wrote: > > If relevance ranking is w

Getting facet counts for 10,000 most relevant hits

2011-09-23 Thread Burton-West, Tom
If relevance ranking is working well, in theory it doesn't matter how many hits you get as long as the best results show up in the first page of results. However, the default in choosing which facet values to show is to show the facets with the highest count in the entire result set. Is there

RE: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-19 Thread Burton-West, Tom
r setting ignored? Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Friday, September 16, 2011 7:09 PM To: solr-user@lucene.apache.org Subject: Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4? On Fri, Sep 16, 2011 at 6:53 PM, Burton-West, Tom wrote

Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-16 Thread Burton-West, Tom
Hello, The TieredMergePolicy has become the default with Solr 3.3, but the configuration in the example uses the mergeFactor setting which applies to the LogByteSizeMergePolicy. How is the mergeFactor interpreted by the TieredMergePolicy? Is there an example somewhere showing how to configure t
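
A solrconfig.xml sketch of configuring the TieredMergePolicy explicitly instead of relying on mergeFactor; the parameter names mirror the Lucene setters (maxMergeAtOnce, segmentsPerTier, maxMergedSegmentMB) and the values here are illustrative only:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <int name="maxMergeAtOnce">10</int>
  <int name="segmentsPerTier">10</int>
  <double name="maxMergedSegmentMB">5120.0</double>
</mergePolicy>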

Example for Solr TieredMergePolicy configuration

2011-09-16 Thread Burton-West, Tom
Hello, The TieredMergePolicy has become the default with Solr 3.3, but the configuration in the example uses the mergeFactor setting which applies to the LogByteSizeMergePolicy. How is the mergeFactor interpreted by the TieredMergePolicy? Is there an example somewhere showing how to configure th

Example configuring TieredMergePolicy in Solr

2011-09-16 Thread Burton-West, Tom
Hello, The TieredMergePolicy has become the default with Solr 3.3, but the configuration in the example uses the mergeFactor setting which applies to the LogByteSizeMergePolicy. How is the mergeFactor interpreted by the TieredMergePolicy? Is there an example somewhere showing how to configure th

RE: performance crossover between single index and sharding

2011-08-02 Thread Burton-West, Tom
Hi Jonathan and Markus, >>Why 3 shards on one machine instead of one larger shard per machine? Good question! We made this architectural decision several years ago and I'm not remembering the rationale at the moment. I believe we originally made the decision due to some tests showing a sweetsp

RE: performance crossover between single index and sharding

2011-08-02 Thread Burton-West, Tom
Hi Markus, Just as a data point for a very large sharded index, we have the full text of 9.3 million books with an index size of about 6+ TB spread over 12 shards on 4 machines. Each machine has 3 shards. The size of each shard ranges between 475GB and 550GB. We are definitely I/O bound. Our m

RE: what s the optimum size of SOLR indexes

2011-07-05 Thread Burton-West, Tom
Hello, On Mon, 2011-07-04 at 13:51 +0200, Jame Vaalet wrote: > What would be the maximum size of a single SOLR index file for resulting in > optimum search time? How do you define optimum? Do you want the fastest possible response time at any cost or do you have a specific response time go

RE: Garbage Collection: I have given bad advice in the past!

2011-06-24 Thread Burton-West, Tom
Hi Shawn, Thanks for sharing this information. I also found that in our use case, for some reason the default settings for the concurrent garbage collector seem to size the young generation way too small (At least for heap sizes of 1GB or larger.) Can you also let us know what version of the
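
As a rough illustration of explicitly sizing the young generation for the concurrent collector rather than accepting the small default, something along these lines (flag values are placeholders, not recommendations):

java -Xms16g -Xmx16g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewSize=2g -XX:MaxNewSize=2g -jar start.jar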

RE: huge shards (300GB each) and load balancing

2011-06-15 Thread Burton-West, Tom
Hi Dimitry, >>The parameters you have mentioned -- termInfosIndexDivisor and >>termIndexInterval -- are not found in the solr 1.4.1 config|schema. Are you >>using SOLR 3.1? I'm pretty sure that the termIndexInterval (ratio of tii file to tis file) is in the 1.4.1 example solrconfig.xml file, alt
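
For reference, the two settings live in different places in solrconfig.xml; a sketch modeled on the commented-out examples shipped with Solr (values illustrative):

<!-- in <indexDefaults>: write a sparser terms index at indexing time (Lucene's default interval is 128) -->
<termIndexInterval>1024</termIndexInterval>

<!-- top level: load only every Nth indexed term into RAM at search time -->
<indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="setTermIndexDivisor">2</int>
</indexReaderFactory>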

RE: FastVectorHighlighter and hl.fragsize parameter set to zero causes exception

2011-06-11 Thread Burton-West, Tom
Thank you Koji, I'll take a look at SingleFragListBuilder, LUCENE-2464, and SOLR-1985, and I will update the wiki on Monday. Tom There is SingleFragListBuilder for this purpose. Please see: https://issues.apache.org/jira/browse/LUCENE-2464 > 3) A
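
A sketch of wiring in the SingleFragListBuilder mentioned above: register it in the highlighting section of solrconfig.xml and select it per request; the component name "single" is an assumption:

<fragListBuilder name="single" class="org.apache.solr.highlight.SingleFragListBuilder"/>

and at query time pass hl.useFastVectorHighlighter=true&hl.fragListBuilder=single to get the whole field value back as one fragment.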

FastVectorHighlighter and hl.fragsize parameter set to zero causes exception

2011-06-10 Thread Burton-West, Tom
According to the documentation on the Solr wiki page, setting the hl.fragsize parameter to "0" indicates that the whole field value should be used (no fragmenting). However, the FastVectorHighlighter throws an exception: fragCharSize(0) is too small. It must be 18 or higher. java.lang.

RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Burton-West, Tom
Hi Koji, Thank you for your reply. >> It is the feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery >> and DisjunctionMaxQuery >> and Query constructed by those queries. Sorry, I'm not sure I understand. Are you saying that FVH supports MultiTerm highlighting? Tom

RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Burton-West, Tom
Hi Erick, Thanks for asking, yes we have termVectors=true set: I guess I should also mention that highlighting works fine using the fastVectorHighLighter as long as we don't do a MultiTerm query. For example see the query and results appended below (using the same hl parameters listed in t

Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Burton-West, Tom
We are trying to implement highlighting for wildcard (MultiTerm) queries. This seems to work fine with the regular highlighter but when we try to use the fastVectorHighlighter we don't see any results in the highlighting section of the response. Appended below are the parameters we are using.

RE: huge shards (300GB each) and load balancing

2011-06-08 Thread Burton-West, Tom
Hi Dmitry, I am assuming you are splitting one very large index over multiple shards rather than replicating an index multiple times. Just for a point of comparison, I thought I would describe our experience with large shards. At HathiTrust, we run a 6 terabyte index over 12 shards. This is

RE: 400 MB Fields

2011-06-07 Thread Burton-West, Tom
Hi Otis, Our OCR fields average around 800 KB. My guess is that the largest docs we index (in a single OCR field) are somewhere between 2 and 10MB. We have had issues where the in-memory representation of the document (the in-memory index structures being built) is several times the size of t

filter cache and negative filter query

2011-05-17 Thread Burton-West, Tom
If I have a query with a filter query such as "q=art&fq=history" and then run a second query "q=art&fq=-history", will Solr realize that it can use the cached results of the previous filter query "history" (in the filter cache) or will it not realize this and have to actually do a second fi

RE: CommonGrams indexing very slow!

2011-04-27 Thread Burton-West, Tom
y idea why this happening and if we now optimize it using SFF it should be fine in future with CFF= false? P.S: Increasing the MergeFactor didn't even work. On Wed, Apr 27, 2011 at 10:09 PM, Burton-West, Tom wrote: > Hi Salman, > > Sounds like somehow you are triggering merge

RE: CommonGrams indexing very slow!

2011-04-27 Thread Burton-West, Tom
Hi Salman, Sounds like somehow you are triggering merges or optimizes. What is your mergeFactor? Have you turned on the IndexWriter log? In solrconfig.xml true In our case we feed the directory name as a Java property in our java startup script , but you can also hard code where you want
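
The solrconfig.xml element referred to (its markup was lost in the snippet above) looks like this in the 3.x example config, inside <indexDefaults>; the file name is whatever you want the IndexWriter log written to:

<infoStream file="INFOSTREAM.txt">true</infoStream>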

RE: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

2011-04-26 Thread Burton-West, Tom
Don't know your use case, but if you just want a list of the 400 most common words you can use the lucene contrib. HighFreqTerms.java with the -t flag. You have to point it at your lucene index. You also probably don't want Solr to be running and want to give the JVM running HighFreqTerms a l

RE: QUESTION: SOLR INDEX BIG FILE SIZES

2011-04-18 Thread Burton-West, Tom
>> As far as I know, Solr will never arrive to a segment file greater than 2GB, >>so this shouldn't be a problem. Solr can easily create a file size over 2GB, it just depends on how much data you index and your particular Solr configuration, including your ramBufferSizeMB, your mergeFactor, and

RE: Understanding the DisMax tie parameter

2011-04-15 Thread Burton-West, Tom
: Thursday, April 14, 2011 5:41 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Cc: Burton-West, Tom Subject: Re: Understanding the DisMax tie parameter : Perhaps the parameter could have had a better name. It's essentially : max(score of matching clauses) + tie * (sco
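
A small worked example of the formula quoted above (field names and scores made up for illustration): with tie=0.1 and a term that scores 2.0 against the title field and 0.5 against the body field, the DisMax score is max(2.0, 0.5) + 0.1 * 0.5 = 2.05. With tie=0 only the best-scoring field counts (2.0); with tie=1 the field scores are simply summed (2.5).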

Understanding the DisMax tie parameter

2011-04-14 Thread Burton-West, Tom
Hello, I'm having trouble understanding the relationship of the word "tie" and "tiebreaker" to the explanation of this parameter on the wiki. What two (or more things) are in a tie? and how does the number in the range from 0 to 1 break the tie? http://wiki.apache.org/solr/DisMaxQParserPlugin#t

RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
xceed some number to trigger the bug? I rebuilt lucene-core-3.1-SNAPSHOT.jar with your patch and it fixes the problem. Tom -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, April 11, 2011 1:00 PM To: Burton-West, Tom Cc: solr

RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
ess [mailto:luc...@mikemccandless.com] Sent: Monday, April 11, 2011 8:40 AM To: solr-user@lucene.apache.org Cc: Burton-West, Tom Subject: Re: ArrayIndexOutOfBoundsException with facet query Tom, I think I see where this may be -- it looks like another > 2B terms bug in Lucene (we are using an int in

ArrayIndexOutOfBoundsException with facet query

2011-04-08 Thread Burton-West, Tom
The query below results in an array out of bounds exception: select/?q=solr&version=2.2&start=0&rows=0&facet=true&facet.field=topicStr Here is the exception: Exception during facet.field of topicStr:java.lang.ArrayIndexOutOfBoundsException: -1931149 at org.apache.lucene.index.TermInfosR

RE: Using Solr over Lucene effects performance?

2011-03-14 Thread Burton-West, Tom
+1 on some kind of simple performance framework that would allow comparing Solr vs Lucene. Any chance the Lucene benchmark programs in contrib could be adapted to read Solr config information? BTW: You probably want to empty the OS cache in addition to restarting Solr between each run if the in

RE: How to handle searches across traditional and simplifies Chinese?

2011-03-08 Thread Burton-West, Tom
This page discusses the reasons why it's not a simple one to one mapping http://www.kanji.org/cjk/c2c/c2cbasis.htm Tom -Original Message- > I have documents that contain both simplified and traditional Chinese > characters. Is there any way to search across them? For example, if someone

Solr indexing socket timeout errors

2011-01-07 Thread Burton-West, Tom
Hello all, We are getting intermittent socket timeout errors (see below). Out of about 600,000 indexing requests, 30 returned these socket timeout errors. We haven't been able to correlate these with large merges, which tends to slow down the indexing response rate. Does anyone know where we

RE: Memory use during merges (OOM)

2010-12-18 Thread Burton-West, Tom
riter and SolrIndexConfig trying to better understand how solrconfig.xml gets instantiated and how it affects the readers and writers. Tom From: Robert Muir [rcm...@gmail.com] On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom wrote: >>>Your

RE: Memory use during merges (OOM)

2010-12-16 Thread Burton-West, Tom
>>Your setting isn't being applied to the reader IW uses during >>merging... its only for readers Solr opens from directories >>explicitly. >>I think you should open a jira issue! Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is no

RE: Memory use during merges (OOM)

2010-12-16 Thread Burton-West, Tom
Thanks Mike, >>But, if you are doing deletions (or updateDocument, which is just a >>delete + add under-the-hood), then this will force the terms index of >>the segment readers to be loaded, thus consuming more RAM. Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance a

Memory use during merges (OOM)

2010-12-15 Thread Burton-West, Tom
Hello all, Are there any general guidelines for determining the main factors in memory use during merges? We recently changed our indexing configuration to speed up indexing but in the process of doing a very large merge we are running out of memory. Below is a list of the changes and part of t

access to environment variables in solrconfig.xml and/or schema.xml?

2010-12-13 Thread Burton-West, Tom
I see variables used to access java system properties in solrconfig.xml and schema.xml: http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution ${solr.data.dir:} or ${solr.abortOnConfigurationError:true} Is there a way to access environment variables or does everything have to be
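
A common workaround, assuming the intent is to feed environment values in at startup: export the environment variable into a Java system property on the command line and reference it with the existing ${...} substitution syntax, e.g. java -Dsolr.data.dir=$SOLR_DATA_DIR -jar start.jar together with ${solr.data.dir:} in solrconfig.xml.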

RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Burton-West, Tom
-user@lucene.apache.org Subject: Re: ramBufferSizeMB not reflected in segment sizes in index On Wed, Dec 1, 2010 at 3:16 PM, Burton-West, Tom wrote: > Thanks Mike, > > Yes we have many unique terms due to dirty OCR and 400 languages and probably > lots of low doc freq terms as well (altho

RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
what merges are taking place. Mike On Wed, Dec 1, 2010 at 2:13 PM, Burton-West, Tom wrote: > We are using a recent Solr 3.x (See below for exact version). > > We have set the ramBufferSizeMB to 320 in both the indexDefaults and the > mainIndex sections of our solrconfig.xml: > >

ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
We are using a recent Solr 3.x (See below for exact version). We have set the ramBufferSizeMB to 320 in both the indexDefaults and the mainIndex sections of our solrconfig.xml: 320 20 We expected that this would mean that the index would not write to disk until it reached somewhere approximate
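
The settings quoted above lost their XML markup; based on the surrounding text they were presumably along these lines, with the same block repeated under <mainIndex> (the 320 is stated explicitly, attributing the 20 to mergeFactor is an assumption):

<indexDefaults>
  <ramBufferSizeMB>320</ramBufferSizeMB>
  <mergeFactor>20</mergeFactor>
  ...
</indexDefaults>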

Solr 3x segments file and deleting index

2010-12-01 Thread Burton-West, Tom
If I want to delete an entire index and start over, in previous versions of Solr, you could stop Solr, delete all files in the index directory and restart Solr. Solr would then create empty segments files and you could start indexing. In Solr 3x if I delete all the files in the index directo

RE: Doubt about index size

2010-11-12 Thread Burton-West, Tom
optimize automatically? tks On Fri, Nov 12, 2010 at 2:39 PM, Burton-West, Tom wrote: > Hi Claudio, > > What's happening when you re-index the documents is that Solr/Lucene > implements an update as a delete plus a new index. Because of the nature of > inverted indexes, deleting

RE: Doubt about index size

2010-11-12 Thread Burton-West, Tom
Hi Claudio, What's happening when you re-index the documents is that Solr/Lucene implements an update as a delete plus a new index. Because of the nature of inverted indexes, deleting documents requires a rewrite of the entire index. In order to avoid rewriting the entire index each time one d

RE: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
es. (Unless someone beats me to it :) Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Monday, November 01, 2010 12:49 PM To: solr-user@lucene.apache.org Subject: Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr On Mon, Nov 1, 2010

Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
We are trying to solve some multilingual issues with our Solr analysis filter chain and would like to use the new Lucene 3.x filters that are Unicode compliant. Is it possible to use the Lucene ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr? Is it just a matter of writing
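
For later 3.x releases the ICU factories ship in the analysis-extras contrib, so (as a sketch, assuming those jars are on the classpath) a fieldType can reference them directly; factory names should be verified against the release in use:

<fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>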

RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
Tom -Original Message- From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik Seeley Sent: Friday, October 15, 2010 1:19 PM To: solr-user@lucene.apache.org Subject: Re: filter query from external list of Solr unique IDs On Fri, Oct 15, 2010 at 11:49 AM, Burton-West, Tom wrote: >

RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
Hi Jonathan, The advantages of the obvious approach you outline are that it is simple, it fits in to the existing Solr model, it doesn't require any customization or modification to Solr/Lucene java code. Unfortunately, it does not scale well. We originally tried just what you suggest for our

filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
At the Lucene Revolution conference I asked about efficiently building a filter query from an external list of Solr unique ids. Some use cases I can think of are: 1) personal sub-collections (in our case a user can create a small subset of our 6.5 million doc collection and then run filter
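
The "obvious approach" referred to in the replies above is a single filter query enumerating the ids, along the lines of fq=id:(doc1 OR doc2 OR doc3 ... OR docN) with made-up ids here; as the replies note, this does not scale to very large id lists.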

RE: Experience with large merge factors

2010-10-06 Thread Burton-West, Tom
Hi Mike, >>Do you use multiple threads for indexing? Large RAM buffer size is >>also good, but I think perf peaks out maybe around 512 MB (at least >>based on past tests)? We are using Solr, I'm not sure if Solr uses multiple threads for indexing. We have 30 "producers" each sending documents

Experience with large merge factors

2010-10-05 Thread Burton-West, Tom
Hi all, At some point we will need to re-build an index that totals about 3 terabytes in size (split over 12 shards). At our current indexing speed we estimate that this will take about 4 weeks. We would like to reduce that time. It appears that our main bottleneck is disk I/O during index m

Estimating memory use for Solr caches

2010-10-01 Thread Burton-West, Tom
We are having some memory and GC issues. I'm trying to get a handle on the contribution of the Solr caches. Is there a way to estimate the amount of memory used by the documentCache and the queryResultCache? I assume if we know the average size of our stored fields we can just multiply the s

RE: bi-grams for common terms - any analyzers do that?

2010-09-27 Thread Burton-West, Tom
Hi Yonik, >>If the new "autoGeneratePhraseQueries" is off, position doesn't matter, and >>the query will >>be treated as "index" OR "reader". Just wanted to make sure, in Solr does autoGeneratePhraseQueries = "off" treat the query with the *default* query operator as set in SolrConfig rather t

RE: bi-grams for common terms - any analyzers do that?

2010-09-27 Thread Burton-West, Tom
Hi Jonathan, >> I'm afraid I'm having trouble understanding "if the analyzer returns more >> than one position back from a "queryparser token" >>I'm not sure if "the queryparser forms a phrase query without explicit phrase >>quotes" is a problem for me, I had no idea it happened until now, ne

RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Burton-West, Tom
Hi all, The CommonGrams filter is designed to only work on phrase queries. It is designed to solve the problem of slow phrase queries with phrases containing common words, when you don't want to use stop words. It would not make sense for Boolean queries. Boolean queries just get passed throu
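
For reference, a schema.xml sketch of the usual CommonGrams wiring, with the plain filter at index time and the query variant at query time; the words file name is an assumption:

<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="commongrams.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.CommonGramsQueryFilterFactory" words="commongrams.txt" ignoreCase="true"/>
</analyzer>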

RE: Solr and jvm Garbage Collection tuning

2010-09-13 Thread Burton-West, Tom
Thanks Kent for your info. We are not doing any faceting, sorting, or much else. My guess is that most of the memory increase is just the data structures created when parts of the frq and prx files get read into memory. Our frq files are about 77GB and the prx files are about 260GB per sha

RE: Solr memory use, jmap and TermInfos/tii

2010-09-13 Thread Burton-West, Tom
Thanks Robert and everyone! I'm working on changing our JVM settings today, since putting Solr 1.4.1 into production will take a bit more work and testing. Hopefully, I'll be able to test the setTermIndexDivisor on our test server tomorrow. Mike, I've started the process to see if we can provi

RE: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Burton-West, Tom
Thanks Mike, >>Do you use a terms index divisor? Setting that to 2 would halve the >>amount of RAM required but double (on average) the seek time to locate >>a given term (but, depending on your queries, that seek time may still >>be a negligible part of overall query time, ie the tradeoff could

Solr and jvm Garbage Collection tuning

2010-09-10 Thread Burton-West, Tom
We have noticed that when the first query hits Solr after starting it up, memory use increases significantly, from about 1GB to about 16GB, and then as queries are received it goes up to about 19GB at which point there is a Full Garbage Collection which takes about 30 seconds and then memory use

Solr memory use, jmap and TermInfos/tii

2010-09-10 Thread Burton-West, Tom
Hi all, When we run the first query after starting up Solr, memory use goes up from about 1GB to 15GB and never goes below that level. In debugging a recent OOM problem I ran jmap with the output appended below. Not surprisingly, given the size of our indexes, it looks like the TermInfo and T

RE: analysis tool vs. reality

2010-08-13 Thread Burton-West, Tom
+1 I just had occasion to debug something where the interaction between the queryparser and the analyzer produced *interesting* results. Having a separate jsp that includes the whole chain (i.e. analyzer/tokenizer/filter and qp) would be great! Tom -Original Message- From: Michael McC

RE: Improve Query Time For Large Index

2010-08-12 Thread Burton-West, Tom
Hi Peter, If hits aren't showing up, and you aren't getting any queryResultCache hits even with the exact query being repeated, something is very wrong. I'd suggest first getting the query result cache working, and then moving on to look at other possible bottlenecks. What are your settings

RE: Improve Query Time For Large Index

2010-08-11 Thread Burton-West, Tom
Hi Peter, Can you give a few more examples of slow queries? Are they phrase queries? Boolean queries? prefix or wildcard queries? If one word queries are your slow queries, then CommonGrams won't help. CommonGrams will only help with phrase queries. How are you using termvectors? That may be

RE: Improve Query Time For Large Index

2010-08-10 Thread Burton-West, Tom
Hi Peter, A few more details about your setup would help list members to answer your questions. How large is your index? How much memory is on the machine and how much is allocated to the JVM? Besides the Solr caches, Solr and Lucene depend on the operating system's disk caching for caching of

RE: Good list of English words that get "butchered" by Porter Stemmer

2010-07-30 Thread Burton-West, Tom
A good starting place might be the list of stemming errors for the original Porter stemmer in this article that describes k-stem: Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings of the 16th annual international ACM SIGIR conference on Research and development in i

RE: Total number of terms in an index?

2010-07-27 Thread Burton-West, Tom
Hi Jason, Are you looking for the total number of unique terms or total number of term occurrences? Checkindex reports both, but does a bunch of other work so is probably not the fastest. If you are looking for total number of term occurrences, you might look at contrib/org/apache/lucene/misc

RE: indexing best practices

2010-07-19 Thread Burton-West, Tom
Hi Ken, This is all very dependent on your documents, your indexing setup and your hardware. Just as an extreme data point, I'll describe our experience. We run 5 clients on each of 6 machines to send documents to Solr using the standard http xml process. Our documents contain about 10 field

benchmarking indexing :Use Solr or use Lucene Benchmark?

2010-06-08 Thread Burton-West, Tom
Hi all, We are about to test out various factors to try to speed up our indexing process. One set of experiments will try various maxRamBufferSizeMB settings. Since the factors we will be varying are at the Lucene level, we are considering using the Lucene Benchmark utilities in Lucene/cont

RE: Using NoOpMergePolicy (Lucene 2331) from Solr

2010-04-29 Thread Burton-West, Tom
Thanks Koji, That was the information I was looking for. I'll be sure to post the test results to the list. It may be a few weeks before we can schedule the tests for our test server. Tom >>I've never tried it but NoMergePolicy and NoMergeScheduler >>can be specified in solrconfig.xml: >>

RE: nfs vs sas in production

2010-04-27 Thread Burton-West, Tom
Hi Kallin, Given the previous postings on the list about terrible NFS performance we were pleasantly surprised when we did some tests against a well tuned NFS RAID array on a private network. We got reasonably good results (given our large index sizes.) See http://www.hathitrust.org/blogs/lar

Using NoOpMergePolicy (Lucene 2331) from Solr

2010-04-27 Thread Burton-West, Tom
Is it possible to use the NoOpMergePolicy ( https://issues.apache.org/jira/browse/LUCENE-2331 ) from Solr? We have very large indexes and always optimize, so we are thinking about using a very large ramBufferSizeMB and a NoOpMergePolicy and then running an optimize to avoid extra disk reads a

Experience with indexing billions of documents?

2010-04-02 Thread Burton-West, Tom
We are currently indexing 5 million books in Solr, scaling up over the next few years to 20 million. However we are using the entire book as a Solr document. We are evaluating the possibility of indexing individual pages as there are some use cases where users want the most relevant pages rega

Experience with Solr and JVM heap sizes over 2 GB

2010-03-31 Thread Burton-West, Tom
Hello all, We have been running a configuration in production with 3 solr instances under one tomcat with 16GB allocated to the JVM. (java -Xmx16384m -Xms16384m) I just noticed the warning in the LucidWorks Certified Distribution Reference Guide that warns against using more than 2GB (see be

RE: Cleaning up dirty OCR

2010-03-11 Thread Burton-West, Tom
Thanks Robert, I've been thinking about this since you suggested it on another thread. One problem is that it would also remove real words. Apparently 40-60% of the words in large corpora occur only once (http://en.wikipedia.org/wiki/Hapax_legomenon.) There are a couple of use cases where r

Cleaning up dirty OCR

2010-03-09 Thread Burton-West, Tom
Hello all, We have been indexing a large collection of OCR'd text. About 5 million books in over 200 languages. With 1.5 billion OCR'd pages, even a small OCR error rate creates a relatively large number of meaningless unique terms. (See http://www.hathitrust.org/blogs/large-scale-search/too

What is largest reasonable setting for ramBufferSizeMB?

2010-02-17 Thread Burton-West, Tom
Hello all, At some point we will need to re-build an index that totals about 2 terabytes in size (split over 10 shards). At our current indexing speed we estimate that this will take about 3 weeks. We would like to reduce that time. It appears that our main bottleneck is disk I/O. We curre

TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-08 Thread Burton-West, Tom
Hello all, After optimizing rather large indexes on 10 shards (each index holds about 500,000 documents and is about 270-300 GB in size) we started getting intermittent TermInfosReader.get() ArrayIndexOutOfBounds exceptions. The exceptions sometimes seem to occur on all 10 shards at the sam

IndexWriter InfoStream in solrconfig not working

2009-10-07 Thread Burton-West, Tom
Hello, We are trying to debug an indexing/optimizing problem and have tried setting the infoStream file in solrconfig.xml so that the SolrIndexWriter will write a log file. Here is our setting: true After making that change to solrconfig.xml, restarting Solr, we see a message in the tomc

Solr admin url for example gives 404

2009-08-26 Thread Burton-West, Tom
Hello all, When I start up Solr from the example directory using start.jar, it seems to start up, but when I go to the localhost admin url (http://localhost:8983/solr/admin) I get a 404 (See message appended below). Has the url for the Solr admin changed? Tom Tom Burton-West --- Here

WordDelimiterFilterFactory removes words when options set to 0

2009-04-17 Thread Burton-West, Tom
In trying to understand the various options for WordDelimiterFilterFactory, I tried setting all options to 0. This seems to prevent a number of words from being output at all. In particular "can't" and "99dxl" don't get output, nor do any words containing hyphens. Is this correct behavior? Here
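
The all-zeros configuration being tested would look roughly like this; with every generate/catenate option off and preserveOriginal also 0, the filter has nothing to emit for any token it would otherwise split, which would explain the disappearing words:

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="0" generateNumberParts="0"
        catenateWords="0" catenateNumbers="0" catenateAll="0"
        splitOnCaseChange="0" preserveOriginal="0"/>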

Can TermIndexInterval be set in Solr?

2009-03-25 Thread Burton-West, Tom
Hello all, We are experimenting with the ShingleFilter with a very large document set (1 million full-text books). Because the ShingleFilter indexes every word pair as a token, the number of unique terms increases tremendously. In our experiments so far the tii and tis files are getting very l

Load balancing for distributed Solr

2008-12-02 Thread Burton-West, Tom
Hello all, As I understand distributed Solr, a request for a distributed search goes to a particular Solr instance with a list of arguments specifying the addresses of the shards to search. The Solr instance to which the request is first directed is responsible for distributing the query to the o

RE: NIO not working yet

2008-12-02 Thread Burton-West, Tom
Thanks Yonik, >>The next nightly build (Dec-01-2008) should have the changes. The latest nightly build seems to be 30-Nov-2008 08:20 (http://people.apache.org/builds/lucene/solr/nightly/); has the version with the NIO fix been built? Are we looking in the wrong place? Tom Tom Burton-West Infor

port of Nutch CommonGrams to Solr for help with slow phrase queries

2008-11-24 Thread Burton-West, Tom
Hello all, We are having problems with extremely slow phrase queries when the phrase query contains common words. We are reluctant to just use stop words due to various problems with false hits and some things becoming impossible to search with stop words turned on. (For example "to be or not to

Processing of prx file for phrase queries: Whole position list for term read?

2008-11-18 Thread Burton-West, Tom
Hello, We are working with a very large index and with large documents (300+ page books.) It appears that the bottleneck on our system is the disk IO involved in reading position information from the prx file for commonly occurring terms. An example slow query is "the new economics". To pr

RE: Solr locking issue? BLOCKED on lock=org.apache.lucene.store.FSDirectory

2008-11-11 Thread Burton-West, Tom
the latest Solr nightly build? We've recently improved this through the use of NIO. -Yonik On Fri, Nov 7, 2008 at 4:23 PM, Burton-West, Tom <[EMAIL PROTECTED]> wrote: > Hello, > > We are testing Solr with a simulation of 30 concurrent users. We are > getting socket timeo

Solr locking issue? BLOCKED on lock=org.apache.lucene.store.FSDirectory

2008-11-07 Thread Burton-West, Tom
Hello, We are testing Solr with a simulation of 30 concurrent users. We are getting socket timeouts and the thread dump from the admin tool shows about 100+ threads with a similar message about a lock. (Message appended below). We suspect this may have something to do with one or more phrase que