CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-27 Thread Burton-West, Tom
characters are formed: いろは革命歌 =>“いろ” ”ろは“ “は革” ”革命” “命歌” Is there a way to specify that you don’t want bigrams across character types? Tom Tom Burton-West Digital Library Production Service University of Michigan Library http://www.hathitrust.org/blogs/large-scale-search
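For readers following the bigram question above, here is a minimal Python sketch. It is not the Lucene CJKBigramFilter implementation. It reproduces the overlapping bigrams quoted in the message and shows one possible variant that does not form bigrams across character types. The function names and the simplified Unicode ranges (basic Hiragana, Katakana, and CJK Unified Ideographs blocks only) are assumptions made for illustration.

```python
from itertools import groupby


def script_of(ch):
    """Rough script classification by code point (simplified, assumed ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"


def bigrams(text):
    """Overlapping character bigrams, ignoring script boundaries."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigrams_within_script(text):
    """Overlapping bigrams formed only inside runs of a single script.

    A run of length one is emitted as a unigram (an illustrative choice).
    """
    out = []
    for _, run in groupby(text, key=script_of):
        run = "".join(run)
        out.extend(bigrams(run) if len(run) > 1 else [run])
    return out


print(bigrams("いろは革命歌"))
print(bigrams_within_script("いろは革命歌"))
```

The first call yields the five bigrams from the message, including the mixed Hiragana/Han bigram は革. The second call drops that cross-script bigram and keeps only いろ, ろは, 革命, and 命歌.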

RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Burton-West, Tom
my limited understanding of Japanese, I can see how perhaps bigramming a Han and Hiragana character might make sense but what about Han and Katakana? Lance, how did you weight the unigram vs bigram fields for CJK? or did you just OR them together assuming that idf will give the bigrams more weight? Tom

RE: CJKBigram filter questions: single character queries, bigrams created across script/character types

2012-04-30 Thread Burton-West, Tom
Thanks wunder, I really appreciate the help. Tom

boost not showing up in Solr 3.6 debugQueries?

2012-05-17 Thread Tom Burton-West
log. Tom Burton-West 兵にな^1000 OR hanUnigrams:兵にな 兵にな^1000 OR hanUnigrams:兵にな ((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵 ((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵 0.15685473 = (MATCH) sum of: 0.15684697 = (MATCH) sum of: 0.0067602023 = (MATCH) weight(ocr:兵に in 213594

What is the "docs" number in Solr explain query results for fieldnorm?

2012-05-25 Thread Tom Burton-West
ed below the explain scoring for a couple of documents with tf 50 and 67. 0.6798219 DF9199B7049F8DFE-220 DF9199B7049F8DFE The Aeroplane 0.6798219 = (MATCH) fieldWeight(ocr:the in 16624), product of: 1.0 = tf(termFreq(ocr:the)=1) 1.087715 = idf(docFreq=16219, maxDocs=1

Solr 3x segments file and deleting index

2010-12-01 Thread Burton-West, Tom
then restart Solr. Is this a feature or a bug? What is the rationale? Tom Tom Burton-West

ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
: 3.0.0.2010.11.19.16.00.54 Solr Implementation Version: 3.1-SNAPSHOT 1036094 - root - 2010-11-19 16:00:54 Lucene Specification Version: 3.1-SNAPSHOT Lucene Implementation Version: 3.1-SNAPSHOT 1036094 - 2010-11-19 16:01:10 Tom Burton-West

RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-01 Thread Burton-West, Tom
n the production indexer. If it doesn't I'll turn it on and post here. Tom -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, December 01, 2010 2:43 PM To: solr-user@lucene.apache.org Subject: Re: ramBufferSizeMB not reflected in s

RE: ramBufferSizeMB not reflected in segment sizes in index

2010-12-02 Thread Burton-West, Tom
, 2010 5:40:33 PM IW 0 [Wed Dec 01 17:40:33 EST 2010; http-8091-Processor12]: flushedFiles=[_5h.frq, _5h.tis, _5h.prx, _5h.nrm, _5h.fnm, _5h.tii] Tom -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Wednesday, December 01, 2010 3:43 PM To: solr

access to environment variables in solrconfig.xml and/or schema.xml?

2010-12-13 Thread Burton-West, Tom
stuffed into a java system property? Tom Burton-West

Memory use during merges (OOM)

2010-12-15 Thread Burton-West, Tom
terms of the number or size of segments? Our largest segments prior to the failed merge attempt were between 5GB and 30GB. The memory allocated to the Solr/tomcat JVM is 10GB. Tom Burton-West - Changes to indexing configuration

RE: Memory use during merges (OOM)

2010-12-16 Thread Burton-West, Tom
t is), > how many merges you allow concurrently, and whether you do false or > true deletions Does an optimize do something differently? Tom

RE: Memory use during merges (OOM)

2010-12-16 Thread Burton-West, Tom
a particular term as opposed to having to iterate through all the terms? (Haven't yet dug into the merging/indexing code). Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] > We are setting  termInfosIndexDivisor, w

RE: Memory use during merges (OOM)

2010-12-18 Thread Burton-West, Tom
riter and SolrIndexConfig trying to better understand how solrconfig.xml gets instantiated and how it affects the readers and writers. Tom From: Robert Muir [rcm...@gmail.com] On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom wrote: >>>Your

Solr indexing socket timeout errors

2011-01-07 Thread Burton-West, Tom
where we might look to determine the cause? Tom Tom Burton-West Jan 7, 2011 2:31:07 AM org.apache.solr.common.SolrException log SEVERE: java.lang.RuntimeException: [was class java.net.SocketTimeoutException] Read timed out at com.ctc.wstx.util.ExceptionUtil.throwRuntimeExce

RE: How to handle searches across traditional and simplified Chinese?

2011-03-08 Thread Burton-West, Tom
This page discusses the reasons why it's not a simple one to one mapping http://www.kanji.org/cjk/c2c/c2cbasis.htm Tom -Original Message- > I have documents that contain both simplified and traditional Chinese > characters. Is there any way to search across them? For example,

RE: Using Solr over Lucene effects performance?

2011-03-14 Thread Burton-West, Tom
index is large enough so disk I/O is a factor. Tom -Original Message- From: Glen Newton [mailto:glen.new...@gmail.com] Sent: Friday, March 11, 2011 5:28 PM To: solr-user@lucene.apache.org; yo...@lucidimagination.com Cc: sivaram Subject: Re: Using Solr over Lucene effects performance? I

ArrayIndexOutOfBoundsException with facet query

2011-04-08 Thread Burton-West, Tom
1=2537,nTerms=498975,bigTerms=0,termInstances=1368694,uses=0} Apr 8, 2011 2:01:58 PM org.apache.solr.core.SolrCore execute Is this a known bug? Can anyone provide a clue as to how we can determine what the problem is? Tom Burton-West Appended Below is the exception stack trace: SEVERE: Exceptio

RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
blems with regular searches against the index or with other facet queries. Only with this facet. Is TermInfoAndOrd only used for faceting? I'll go ahead and build the patch and let you know. Tom p.s. Here is the field definition: -Original Message- From: Michael McCandl

RE: ArrayIndexOutOfBoundsException with facet query

2011-04-11 Thread Burton-West, Tom
xceed some number to trigger the bug? I rebuilt lucene-core-3.1-SNAPSHOT.jar with your patch and it fixes the problem. Tom -Original Message- From: Michael McCandless [mailto:luc...@mikemccandless.com] Sent: Monday, April 11, 2011 1:00 PM To: Burton-West, Tom Cc: solr

Understanding the DisMax tie parameter

2011-04-14 Thread Burton-West, Tom
re it doesn't matter what the maximum scoring sub query is, the final score is the sum of the sub scores. Typically a low value (ie: 0.1) is useful." Tom Burton-West
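The tie behavior quoted above can be illustrated with a small Python sketch. It assumes the usual disjunction-max combination, where the final score is the best sub-query score plus tie times the sum of the remaining sub-query scores. The function name and the sample scores are hypothetical, and this is not Solr code.

```python
def dismax_score(sub_scores, tie=0.0):
    """Combine per-field sub-query scores the way a disjunction-max query does.

    tie=0.0 keeps only the maximum sub-score; tie=1.0 gives the plain sum.
    """
    if not sub_scores:
        return 0.0
    best = max(sub_scores)
    return best + tie * (sum(sub_scores) - best)


scores = [0.5, 0.2, 0.1]          # hypothetical sub-query scores
print(dismax_score(scores, 0.0))  # only the best field counts
print(dismax_score(scores, 0.1))  # other fields act as a small tie breaker
print(dismax_score(scores, 1.0))  # every field contributes fully
```

With a low tie such as 0.1, documents matching in several fields get a slight edge over documents matching in one field, without letting the weaker fields dominate the ranking.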

RE: Understanding the DisMax tie parameter

2011-04-15 Thread Burton-West, Tom
Thanks everyone. I updated the wiki. If you have a chance please take a look and check to make sure I got it right on the wiki. http://wiki.apache.org/solr/DisMaxQParserPlugin#tie_.28Tie_breaker.29 Tom -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent

RE: QUESTION: SOLR INDEX BIG FILE SIZES

2011-04-18 Thread Burton-West, Tom
nk there is a new merge policy and parameter that allows you to set the maximum size of the merged segment: https://issues.apache.org/jira/browse/LUCENE-854. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search -Original Message- From: Juan Grande [mailto:juan.gra...

RE: TermsComponent + Dist. Search + Large Index + HEAP SPACE

2011-04-26 Thread Burton-West, Tom
reqTerms a lot of memory. http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=log Tom http://www.hathitrust.org/blogs/large-scale-search -Original Message- From: mdz-munich [mailto:sebastian.lu...@bsb-m

RE: CommonGrams indexing very slow!

2011-04-27 Thread Burton-West, Tom
documents and committing triggers a cascading merge. (But this is a WAG without seeing what's in your indexwriter log) Can you also send your solrconfig so we can see your mergeFactor and ramBufferSizeMB settings? Tom > > All, > > > > We have created index with CommonGram

RE: CommonGrams indexing very slow!

2011-04-27 Thread Burton-West, Tom
Hi Salman, We had a similar problem with the IndexMergeTool in Lucene contrib. I seem to remember having to hack the IndexMergeTool code so that it wouldn't create the CFF automatically. Let me know if you need it and I'll dig up the modified code. Tom. -Original Message

filter cache and negative filter query

2011-05-17 Thread Burton-West, Tom
this and have to actually do a second filter query against the index for "not history"? Tom

RE: 400 MB Fields

2011-06-07 Thread Burton-West, Tom
the text, so I would suspect even with the largest ramBufferSizeMB, you might run into problems. (This is with the 3.x branch. Trunk might not have this problem since it's much more memory efficient when indexing Tom Burton-West www.hathitrust.org/

RE: huge shards (300GB each) and load balancing

2011-06-08 Thread Burton-West, Tom
://www.hathitrust.org/blogs/large-scale-search/too-many-words-again for details) We later ran into memory problems when indexing so instead changed the index time parameter termIndexInterval from 128 to 1024. (More details here: http://www.hathitrust.org/blogs/large-scale-search) Tom Burton-West

Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Burton-West, Tom
using. Tom Burton-West query ocr:tink* highlighting params: true 200 true 200 colored simple ocr true true

RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-08 Thread Burton-West, Tom
d in the previous email) Tom ocr:tinkham John <b style="background:#00">Tinkham</b>, who married Miss Mallie Kingsbury; Mr. William Ashley, and Mr. Leavitt, who, I believe, built the big stone

RE: Does MultiTerm highlighting work with the fastVectorHighlighter?

2011-06-09 Thread Burton-West, Tom
Hi Koji, Thank you for your reply. >> It is the feature of FVH. FVH supports TermQuery, PhraseQuery, BooleanQuery >> and DisjunctionMaxQuery >> and Query constructed by those queries. Sorry, I'm not sure I understand. Are you saying that FVH supports MultiTerm highlighting? Tom

FastVectorHighlighter and hl.fragsize parameter set to zero causes exception

2011-06-10 Thread Burton-West, Tom
uld have a note indicating whether they apply to only the regular highlighter or the FVH? Tom Burton-West

RE: FastVectorHighlighter and hl.fragsize parameter set to zero causes exception

2011-06-11 Thread Burton-West, Tom
Thank you Koji, I'll take a look at SingleFragListBuilder, LUCENE-2464, and SOLR-1985, and I will update the wiki on Monday. Tom There is SingleFragListBuilder for this purpose. Please see: https://issues.apache.org/jira/browse/LUCENE-2464

RE: huge shards (300GB each) and load balancing

2011-06-15 Thread Burton-West, Tom
tly is allocated max 12GB memory. I'm curious about how much memory you leave to the OS for disk caching. Can you give any details about the number of shards per machine and the total memory on the machine. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

RE: Garbage Collection: I have given bad advice in the past!

2011-06-24 Thread Burton-West, Tom
the JVM you are using? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

RE: what s the optimum size of SOLR indexes

2011-07-05 Thread Burton-West, Tom
e times of 200ms (but 99th percentile times of about 2 seconds). Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

RE: performance crossover between single index and sharding

2011-08-02 Thread Burton-West, Tom
so many shards that the overhead of distributing the queries, and consolidating/merging the responses becomes a serious issue. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search * http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5

RE: performance crossover between single index and sharding

2011-08-02 Thread Burton-West, Tom
justify buying 8 more machines:) Tom -Original Message- From: Markus Jelsma [mailto:markus.jel...@openindex.io] Sent: Tuesday, August 02, 2011 2:12 PM To: solr-user@lucene.apache.org Subject: Re: performance crossover between single index and sharding Hi Tom, Very interesting indeed! But

edismax parser ignores mm parameter when tokenizer splits tokens (i.e. CJK)

2012-06-26 Thread Tom Burton-West
oblem also occurs with non-CJK queries; for example, [two-thirds] turns into a Boolean OR query for ( [two] OR [thirds] ). Is there some way to tell the edismax query parser to stick with mm=100%? Appended below is the debugQuery output for these two queries and an excerpt from our schema.xml. To

edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-06-28 Thread Tom Burton-West
something, or is this a bug? I'd like to file a JIRA issue, but want to find out if I am missing something here. Details of several queries are appended below. Tom Burton-West edismax query mm=2 query with hyphenated word [fire-fly] {!edismax mm=2}fire-fly {!edismax mm=2}fire-fly +D

Re: edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-07-02 Thread Tom Burton-West
Opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-3589, which also lists a couple other related mailing list posts. On Thu, Jun 28, 2012 at 12:18 PM, Tom Burton-West wrote: > Hello, > > My previous e-mail with a CJK example has received no replies. I > verifi

Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-21 Thread Tom Burton-West
users the choice of a list of the most relevant pages, or a list of the books containing the most relevant pages. We have approximately 3 billion pages. Does anyone have experience using field collapsing on this sort of scale? Tom Tom Burton-West Information Retrieval Programmer Digital Library

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance, I don't understand enough of how the field collapsing is implemented, but I thought it worked with distributed search. Are you saying it only works if everything that needs collapsing is on the same shard? Tom On Wed, Aug 22, 2012 at 2:41 AM, Lance Norskog wrote: > Ho

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Tirthankar, Can you give me a quick summary of what won't work and why? I couldn't figure it out from looking at your thread. You seem to have a different issue, but maybe I'm missing something here. Tom On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee < tchatte

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
numFound = 325. This shows that the items in the group are distributed between different shards. What am I missing here? What is it that you are saying does not work? Tom Field Collapse query ( IP address changed, and newlines added and shard urls simplified for readability) http://solr

Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Hello, Usually in the example/solr file in Solr distributions there is a populated conf file. However in the distribution I downloaded of solr 4.0.0-BETA, there is no /conf directory. Has this been moved somewhere? Tom ls -l apache-solr-4.0.0-BETA/example/solr total 107 drwxr-sr-x 2 tburtonw

Re: Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Thanks Markus! Should the README.txt file in solr/example be updated to reflect this? Is that something I need to enter a JIRA issue for? Tom On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma wrote: > Hi - The example has been moved to collection1/ > > > > -Original message---

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
ot of docs per group and take a look at the memory use using JConsole. Tom On Wed, Aug 22, 2012 at 4:02 PM, Tirthankar Chatterjee < tchatter...@commvault.com> wrote: > Hi Tom, > > We had an issue where we are keeping millions of docs in a single node and > we were trying to

Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
23, 2012 12:06:02 PM org.apache.solr.core.SolrResourceLoader " I think somehow the previous solr.xml configuration is being stored on disk somewhere and loaded. Any clues? Tom

Re: Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
llection1" specified in some other config file or hardcoded into Solr somewhere? If using a core is mandatory with Solr 4.0 , the CoreAdmin wiki page and the release notes should point this out. Tom

Re: Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
rowse/SOLR-3753 On Thu, Aug 23, 2012 at 1:04 PM, Tom Burton-West wrote: > I did not describe the problems correctly. > > I have 3 solr shards with solr homes .../solrs/4.0/1 .../solrs/4.0/2 and > .../solrs/4.0/2solrs/3 > > For shard 1 I have a solr.xml file with the mo

Re: Solr 4.0 Beta missing example/conf files?

2012-08-23 Thread Tom Burton-West
l file, it was very weird to still get a message about a missing collection1 core directory. See this JIRA issue: https://issues.apache.org/jira/browse/SOLR-3753 Tom On Thu, Aug 23, 2012 at 7:56 PM, Erik Hatcher wrote: > Tom - > > I corrected, on both trunk and 4_x, a reference to s

Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
example file does not explain what the termIndexDivisor does. Would it be appropriate to add these back to the wiki page? If not, could someone add a line or two to the comments in the Solr 4.0 example file explaining what the termIndexDivisor does? Tom

Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
these parameters don't make sense for the default codec, then maybe they need to be commented out or removed from the solr example solrconfig.xml. Tom On Fri, Sep 7, 2012 at 1:33 PM, Robert Muir wrote: > Hi Tom: I already enhanced the javadocs about this for Lucene, putting > war

Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Codec in Solr using the schema.xml or solrconfig.xml? Is there some simple way to specify minBlockSize and maxBlockSize in schema.xml? Once I get this all working and understand it, I'll be happy to draft some documentation. I'm really looking forward to experimenting with 4.0! Tom Tom On

Solr 4.0 Beta: Admin UI does not correctly implement dismax/edismax query

2012-09-13 Thread Tom Burton-West
that gets sent to Solr is not actually a dismax query. 3 ocr^200 true true 0.1 fire-fly xml fire-fly fire-fly text:fire text:fly If a correct dismax query was being sent to Solr the parsedquery would have something like the following: (+DisjunctionMaxQuery(((text:fire text:fly))) Tom Burton-West

Re: Solr 4.0 Beta: Admin UI does not correctly implement dismax/edismax query

2012-09-13 Thread Tom Burton-West
Thanks Erik, Just found out that there is already a bug report for this open as https://issues.apache.org/jira/browse/SOLR-3811. Tom On Thu, Sep 13, 2012 at 12:52 PM, Erik Hatcher wrote: > That's definitely a bug. dismax=true is not the correct parameter to > send. Should be def

Solr 4.0 error message: "Unsupported ContentType: Content-type:text/xml"

2012-11-02 Thread Tom Burton-West
, application/csv, application/javabin, text/xml, application/json] We use exactly the same code without problem with Solr 3.6. We are sending a ContentType 'text/xml'. Is it likely that there is some other problem and this is just not quite the right error message? Tom

Re: Solr 4.0 error message: "Unsupported ContentType: Content-type:text/xml"

2012-11-02 Thread Tom Burton-West
Thanks Jack, That is exactly the problem. Apparently earlier versions of Solr ignored the extra text, which is why we didn't catch the bug in our code earlier. Thanks for the quick response. Tom On Fri, Nov 2, 2012 at 5:34 PM, Jack Krupansky wrote: > That message makes it sounds a

Re: Skewed IDF in multi lingual index

2012-11-08 Thread Tom Burton-West
Hi Markus, No answers, but I am very interested in what you find out. We currently index all languages in one index, which presents different IDF issues, but are interested in exploring alternatives such as the one you describe. Tom Burton-West http://www.hathitrust.org/blogs/large-scale

URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Tom Burton-West
" doesn't work either: mysolr.umich.edu/analysis/field?name=title&q=fire-fly No matter what field I specify, the analysis returned is for the default field. (See repsonse excerpt below) Is there a page somewhere that shows the correct syntax for sending get requests to the FieldAnalysisRequestHandler? Tom

Re: URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Tom Burton-West
Thanks Robert, Somehow I read the doc but still entered the params wrong. Should have been "analysis.fieldname" instead of "analysis.name" Works fine now. Tom On Tue, Nov 13, 2012 at 2:11 PM, Robert Muir wrote: > I think the UI uses this behind the scenes, as in
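As a rough illustration of the corrected request, the Python sketch below builds a GET URL using the "analysis.fieldname" parameter named in the reply, together with the "q" parameter and the "/analysis/field" path that appear in the original question. The base URL and the helper name are assumptions, not taken from the thread.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, field_name, query):
    """Build a field-analysis GET URL (hypothetical helper).

    Uses 'analysis.fieldname', the parameter the reply says was needed
    instead of 'analysis.name'.
    """
    params = {"analysis.fieldname": field_name, "q": query}
    return base_url.rstrip("/") + "/analysis/field?" + urlencode(params)


# The base URL below is a placeholder for a local Solr instance.
print(field_analysis_url("http://localhost:8983/solr", "title", "fire-fly"))
```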

Re: BM25 model for solr 4?

2012-11-15 Thread Tom Burton-West
l (non-research) production implementations that have tested the new ranking models available in Solr. Tom On Wed, Nov 14, 2012 at 9:16 PM, Floyd Wu wrote: > Hi there, > Does anybody can kindly tell me how to setup solr to use BM25? > By the way, are there any experiment or research shows

Re: Limit of Index size per machine..

2009-08-06 Thread Tom Burton-West
index size. We plan to distribute the index across 5 machines. More information on our setup and results is available at:http://www.hathitrust.org/blogs/large-scale-search Tom > > The expected processed log file size per day: 100 GB > > We are expecting to retain these indexes for 30 da

Solr admin url for example gives 404

2009-08-26 Thread Burton-West, Tom
Hello all, When I start up Solr from the example directory using start.jar, it seems to start up, but when I go to the localhost admin url (http://localhost:8983/solr/admin) I get a 404 (See message appended below). Has the url for the Solr admin changed? Tom Tom Burton-West

IndexWriter InfoStream in solrconfig not working

2009-10-07 Thread Burton-West, Tom
rowse/SOLR-1145, but did not see a unit test that I might try to run in our system. Do others have this logging working successfully ? Is there something else that needs to be set up? Tom

Re: Slow Phrase Queries

2009-10-20 Thread Tom Burton-West
query from the Admin tool interface in Solr and then in Lucene to see if the query is being parsed or otherwise interpreted differently. Tom DHast wrote: > > Hello, > I have recently installed Solr as an alternative to our home made lucene > search servers, and while in most

Re: Talk on Solr - Oakland, CA June 18, 2008

2008-06-21 Thread Tom Hill - Solr
me more content. If anyone knows a way to make them visible on Slideshare, please let me know! Comments, feedback and corrections are welcome. Tom -- View this message in context: http://www.nabble.com/Talk-on-Solr---Oakland%2C-CA-June-18%2C-2008-tp17880636p18050829.html Sent from the Solr - Us

Solr locking issue? BLOCKED on lock=org.apache.lucene.store.FSDirectory

2008-11-07 Thread Burton-West, Tom
queries containing common terms since our index is very large and we suspect one or more very large segments of the position index need to be read into memory. Can someone point us to either the possible cause of this problem or what we might change to reduce/eliminate it? Tom Tom Burton-West

RE: Solr locking issue? BLOCKED on lock=org.apache.lucene.store.FSDirectory

2008-11-11 Thread Burton-West, Tom
might help. What confuses me is why multiple searchers are locking the prx index file. I would think that searching is a read-only operation. Perhaps we need to change something to tell Solr we aren't updating the index? Tom -Original Message- From: [EMAIL PROTECTED] [mai

Processing of prx file for phrase queries: Whole position list for term read?

2008-11-18 Thread Burton-West, Tom
her than the term/doc id). On the other hand the documentation for the .prx file states that Positions entries are "ordered by increasing document number (the document number is implicit from the .frq file)" Tom

port of Nutch CommonGrams to Solr for help with slow phrase queries

2008-11-24 Thread Burton-West, Tom
ndexing. Optimize phrase queries to use the n-grams. Single terms are still indexed too, with n-grams overlaid." http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/analysis/CommonGrams.html Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Services University of Michigan Library

RE: NIO not working yet

2008-12-02 Thread Burton-West, Tom
Thanks Yonik, ->>The next nightly build (Dec-01-2008) should have the changes. The latest nightly build seems to be 30-Nov-2008 08:20, http://people.apache.org/builds/lucene/solr/nightly/ has the version with the NIO fix been built? Are we looking in the wrong place? Tom Tom Burto

Load balancing for distributed Solr

2008-12-02 Thread Burton-West, Tom
distributor/response aggregator? Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Services University of Michigan

Re: port of Nutch CommonGrams to Solr for help with slow phrase queries

2009-03-06 Thread Tom Burton-West
? Tom i haven't used Nutch's implementation, but used the current implementation (1.3) of ngrams and shingles to address exactly the same issue ( database of music albums and tracks). We didn't notice any severe performance hit but : - data set isn't huge ( ca 1 MM docs). - r

Can TermIndexInterval be set in Solr?

2009-03-25 Thread Burton-West, Tom
IndexWriter.setTermIndexInterval? Tom Tom Burton-West Digital Library Production Services University of Michigan Library

WordDelimiterFilterFactory removes words when options set to 0

2009-04-17 Thread Burton-West, Tom
20,29 53,56 Here is the schema Tom

Re: Contributors - Solr in Action Case Studies

2010-01-20 Thread Tom Burton-West
://www.hathitrust.org/large_scale_search and our blog: http://www.hathitrust.org/blogs/large-scale-search http://www.hathitrust.org/blogs/large-scale-search (I'll be updating the blog with details of current hardware and performance tests in the next week or so) Tom Tom Burton-West Digital Li

Re: Thanks Robert!

2010-02-05 Thread Tom Burton-West
+1 And thanks to you both for all your work on CommonGrams! Tom Burton-West Jason Rutherglen-2 wrote: > > Robert, thanks for redoing all the Solr analyzers to the new API! It > helps to have many examples to work from, best practices so to speak. > > -- View this mes

TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-08 Thread Burton-West, Tom
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:246) Any suggestions for troubleshooting would be appreciated. Trace from tomcat logs appended below. Tom Burton-West --- F

Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West
n the index by 2 or 3 times. Tom --- Solr Specification Version: 1.3.0.2009.09.03.11.14.39 Solr Implementation Version: 1.4-dev 793569 - root - 2009-09-03 11:14:39 Lucene Specification Version: 2.

Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West
Thanks Michael, I'm not sure I understand. CheckIndex reported a negative number: -16777214. But in any case we can certainly try running CheckIndex from a patched lucene We could also run a patched lucene on our dev server. Tom Yes, the term count reported by CheckIndex is the

Re: persistent cache

2010-02-12 Thread Tom Burton-West
e. A good overview of the issues is the paper by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of Caching on Search Engines ) Tom Burton-West Digital Library Production Service University of Michigan Library -- View this message in context: http://old.nabble.com/persis

Re: persistent cache

2010-02-15 Thread Tom Burton-West
.org/blogs/large-scale-search ) Tom Hi Tom, 1600 warming queries, that's quite many. Do you run them every time a document is added to the index? Do you have any tips on warming? If the index size is more than you can have in RAM, do you recommend to split the index to several servers so i

What is largest reasonable setting for ramBufferSizeMB?

2010-02-17 Thread Burton-West, Tom
to 3200MB? What are people's experiences with very large ramBufferSizeMB sizes? Tom Burton-West University of Michigan Library www.hathitrust.org

Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-18 Thread Tom Burton-West
Thanks Otis, I don't know enough about Hadoop to understand the advantage of using Hadoop in this use case. How would using Hadoop differ from distributing the indexing over 10 shards on 10 machines with Solr? Tom Otis Gospodnetic wrote: > > Hi Tom, > > 32MB is very low

Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-19 Thread Tom Burton-West
size of the ramBuffer and how much heap you need to give the JVM, or is there some reasonable method of finding this out by experimentation? We would rather not find out by decreasing the amount of memory allocated to the JVM until we get an OOM. Tom I've run Lucene with heap sizes as large

Cleaning up dirty OCR

2010-03-09 Thread Burton-West, Tom
the number of languages and the inclusion of proper names, place names, and technical terms. We are considering using some heuristics, such as looking for strings over a certain length or strings containing more than some number of punctuation characters. This paper has a few such heur
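The two heuristics mentioned in this message (overly long strings, and strings with too many punctuation characters) might be sketched in Python as follows. The thresholds and the function name are hypothetical placeholders, not values from the thread or from the paper it refers to, and `string.punctuation` covers ASCII punctuation only.

```python
import string


def looks_like_ocr_noise(token, max_length=25, max_punct=2):
    """Flag a token as probable OCR garbage.

    A token is flagged if it is longer than max_length characters or
    contains more than max_punct ASCII punctuation characters.
    Both thresholds are illustrative assumptions.
    """
    punct_count = sum(1 for ch in token if ch in string.punctuation)
    return len(token) > max_length or punct_count > max_punct


for token in ["aeroplane", "fire-fly", "a.,b;:c", "x" * 30]:
    print(token, looks_like_ocr_noise(token))
```

Any real thresholds would need tuning per language, since long compound words and proper names can be legitimate, as the message notes.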

Re: Solr Performance Issues

2010-03-11 Thread Tom Burton-West
es are you will see serious contention for disk I/O.. Of course if you don't see any waiting on i/o, then your bottleneck is probably somewhere else:) See http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1 for more background on our experience. Tom Burton-

RE: Cleaning up dirty OCR

2010-03-11 Thread Burton-West, Tom
x. I have some other questions about index pruning, but I want to do a bit more reading and then I'll post a question to either the Solr or Lucene list. Can you suggest which list I should post an index pruning question to? Tom -Original Message- From: Robert Muir [mailto:rcm

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
pt we would have to watch out for any Russian-Chinese dictionaries:) Tom > > > There wasn't any completely satisfactory solution; there were a large > number > of two and three letter n-grams so we were able to use a dictionary > approach > to eliminate those (nam

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
Interesting. I wonder though if we have 4 million English documents and 250 in Urdu, if the Urdu words would score badly when compared to ngram statistics for the entire corpus. hossman wrote: > > > > Since you are dealing with multiple languages, and multiple variant usages > of language

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
lock ideas mentioned above. I'm not sure I understand your suggestion. Since real word hapax legomenons are generally pretty common (maybe 40-60% of unique words) wouldn't using them as the "no" set provide mixed signals to the classifier? Tom Walter Underwood-2 wrote: >

Re: Solr RAM Requirements

2010-03-17 Thread Tom Burton-West
files. You also might want to take a look at the free memory when you start up Solr and then watch as it fills up as you get more queries (or send cache-warming queries). Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search KaktuChakarabati wrote: > > My question was m

Experience with Solr and JVM heap sizes over 2 GB

2010-03-31 Thread Burton-West, Tom
cases, “freeze the world” pauses of a minute or more. As a practical matter, this can become a serious problem for heap sizes that exceed about two gigabytes, even if far more physical memory is available.” http://www.lucidimagination.com/search/document/CDRG_ch08_8.4.1?q=memory%20caching Tom

Experience with indexing billions of documents?

2010-04-02 Thread Burton-West, Tom
(http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a limit of about 2 billion document ids. I assume this is the lucene internal document id and would therefore be a per index/per shard limit. Is this correct? Tom Burton-West.

Using NoOpMergePolicy (Lucene 2331) from Solr

2010-04-27 Thread Burton-West, Tom
and writes. Tom Burton-West

RE: nfs vs sas in production

2010-04-27 Thread Burton-West, Tom
management as we scale out. One of the reasons was to reduce contention between indexing/optimizing and search instances for disk I/O. See http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-50-volumes-5-million-volumes-and-beyond for details. Tom -Original

RE: Using NoOpMergePolicy (Lucene 2331) from Solr

2010-04-29 Thread Burton-West, Tom
Thanks Koji, That was the information I was looking for. I'll be sure to post the test results to the list. It may be a few weeks before we can schedule the tests for our test server. Tom >>I've never tried it but NoMergePolicy and NoMergeScheduler >>can be speci
