Re: Does CDCR Bootstrap sync leaves replica's out of sync

2018-04-16 Thread Tom Peters
There are two ways I've gotten around this issue: 1. Add replicas in the target data center after CDCR bootstrapping has completed. -or- 2. After the bootstrapping has completed, restart the replica nodes one at a time in the target data center (restart, wait for replica to catch up, then restar

Re: CDCR Bootstrap

2018-04-26 Thread Tom Peters
I'm not sure under what conditions it will be automatically triggered, but if you manually wanted to trigger a CDCR Bootstrap you need to issue the following query to the leader in your target data center. /solr//cdcr?action=BOOTSTRAP&masterUrl= The masterUrl will look something like (change th
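
As a rough illustration of the request described above, here is a minimal sketch that only builds the bootstrap URL (it does not send it). The `action=BOOTSTRAP` and `masterUrl` parameters come from the message; the host names, the collection name, and the shape of the master URL are hypothetical placeholders, since the original values are elided in the snippet.

```python
from urllib.parse import urlencode


def cdcr_bootstrap_url(target_leader_base, collection, master_url):
    """Build a CDCR BOOTSTRAP request URL for the target data center leader.

    target_leader_base, collection and master_url are illustrative
    assumptions; substitute the values for your own clusters.
    """
    # urlencode percent-encodes the masterUrl value so it survives as a
    # single query parameter.
    query = urlencode({"action": "BOOTSTRAP", "masterUrl": master_url})
    return f"{target_leader_base}/solr/{collection}/cdcr?{query}"


# Hypothetical hosts and collection name, for illustration only.
url = cdcr_bootstrap_url(
    "http://target-leader.example:8983",
    "mycollection",
    "http://source-leader.example:8983/solr/mycollection",
)
print(url)
```

The resulting URL would be issued to the leader in the target data center, for example with curl.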

ExternalFileField management strategy with SolrCloud

2018-04-26 Thread Tom Peters
Is there a recommended way of managing external files with SolrCloud? At first glance it appears that I would need to manually manage the placement of the external_.txt file in each shard's data directory. Is there a better way of managing this (Solr API, interface, etc.)?

mapping and tuning payloads in Solr 8

2020-02-12 Thread Burgmans, Tom
re the override the scorePayload method in WKSimilarity (it is removed from TFIDFSimilarity). I wonder what alternatives there are for mapping string payloads to floats and using them in a tunable formula for boosting. Thanks, Tom Burgmans

Getting to grips with auto-scaling

2020-06-05 Thread Tom Evans
the hotter shards than the colder shards? It seems to add a lot of complexity - should I just instead think that they aren't getting queried much, so won't be using up cache space that the hot shards will be using. Disk space is pretty cheap after all (total size for "items" + "lists" is under 60GB). Cheers Tom

Indexing error when using Category Routed Alias

2020-06-09 Thread Tom Evans
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:336) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:313) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:171) at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.produce(EatWhatYouKill.java:135) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:806) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:938) at java.base/java.lang.Thread.run(Unknown Source) 2020-06-09 02:12:58.507 INFO (qtp90045638-16) [c:products_20200609__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP s:shard1 r:core_node2 x:products_20200609__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP_shard1_replica_n1] o.a.s.c.S.Request [products_20200609__CRA__NEW_CATEGORY_ROUTED_ALIAS_WAITING_FOR_DATA_TEMP_shard1_replica_n1] webapp=/solr path=/update/json/docs params={} status=400 QTime=2422 Cheers Tom

Re: Getting to grips with auto-scaling

2020-06-09 Thread Tom Evans
. Looks like it might be manually setup and managed collections and aliases for now. Cheers Tom On Mon, Jun 8, 2020 at 12:43 PM Radu Gheorghe wrote: > > Hi Tom, > > To your last two questions, I'd like to vent an alternative design: have > dedicated "hot" and

Re: Solr Support for BM25F

2016-04-18 Thread Tom Burton-West
ve 40 different ones, even with different properties). - the same issue applies to length normalization, lucene has a "field length" but really no concept of document length." Tom On Thu, Apr 14, 2016 at 12:41 PM, David Cawley wrote: > Hello, > I am developing

Changing Similarity without re-indexing (for example from default to BM25)

2015-08-19 Thread Tom Burton-West
/package-summary.html#changingSimilarity Has something changed between 4.1 and 5.2 that actually will prevent changing Similarity without re-indexing from working, or is this just a warning in case at some future point someone contributes code so that a particular similarity takes advantage of a different index format? Tom

Clarification of locktype=single and implications of use

2015-02-20 Thread Tom Burton-West
thread, but they each write to their own segments, and (I think) all the threads are in the same Solr process), Are we safe using locktype=single? Tom

Re: Clarification of locktype=single and implications of use

2015-02-20 Thread Tom Burton-West
maybe somehow a correctly configured Solr might have multiple processes writing to the same file. I'm wondering if your explanation above might be added to the documentation. Tom On Fri, Feb 20, 2015 at 1:25 PM, Chris Hostetter wrote: > > : We are using Solr. We would not co

Optimize maxSegments="2" not working right with Solr 4.10.2

2015-02-23 Thread Tom Burton-West
gment, since we gave the argument maxSegments=2. This didn't happen. Any suggestions about how to troubleshoot this issue would be appreciated. Tom --- Excerpt from indexwriter log: TMP][http-8091-Processor5]: findForcedMerges maxSegmentCount=2 ... ... [IW][Lucene Merge Thread #0]

Re: Basic Multilingual search capability

2015-02-25 Thread Tom Burton-West
detection. If you have German, a filter length of 25 might be too low (Because of compounding). You might want to analyze a sample of your German text to find a good length. Tom http://www.hathitrust.org/blogs/Large-scale-Search On Wed, Feb 25, 2015 at 10:31 AM, Rishi Easwaran wrote: >

Re: How to configure Solr PostingsFormat block size

2015-03-12 Thread Tom Burton-West
g the dread "ClassCastException Class.asSubclass(Unknown Source" error (See below). This is looking like a complex classloader issue. Should I put the file somewhere else and/or declare a lib directory in solrconfig.xml? Any suggestions on how to troubleshoot this? Tom

Error in Solr 6.6 Example schemas re: DocValues for StrField type must be single-valued?

2017-08-15 Thread Tom Burton-West
index/DocValuesType.html Is the comment in the example schema file completely wrong, or is there some issue with using a docValues with a multivalued StrField? Tom Burton-West https://www.hathitrust.org/blogs/large-scale-search

Re: Indexing large documents

2014-03-19 Thread Tom Burton-West
se case as Otis suggested. In our use case sometimes this is appropriate, but we are investigating the possibility of other methods of scoring the group based on a more flexible function of the scores of the members (i.e scoring book based on function of scores of chapters). Tom Burton-West http://www

Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
anese character sets. For example the config given in the JavaDocs tells it to make bigrams across 3 of the different Japanese character sets. (Is the issue related to Romaji?) http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html Tom

Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
the test machine when I have time. Tom

Re: Analysis of Japanese characters

2014-04-03 Thread Tom Burton-West
I suspect this is behavior as designed. My guess is that the bigram filter figures that if there was space in the original input (to the whole filter chain), it should not create a bigram across it. Tom BTW: if you can show a few examples of Japanese queries that show the original problem and

Re: tf and very short text fields

2014-04-03 Thread Tom Burton-West
ity and emit 1f in tf() is probably the best way to switch to eliminate using tf counts, assuming that is really what you want. Tom On Tue, Apr 1, 2014 at 4:17 PM, Walter Underwood wrote: > Thanks! We'll try that out and report back. I keep forgetting that I want > to try BM25, s

Re: tf and very short text fields

2014-04-04 Thread Tom Burton-West
the code.) When you set k1 to 0 it does just what you said i.e provides binary tf. That part of the formula returns 1 if the term is present and 0 if not. Which is I think what Wunder was trying to accomplish. Sorry about jumping in without double checking things first. Tom On Fri, Apr 4
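
The effect of setting k1 to 0 can be checked with a small sketch of the textbook BM25 term-frequency component. This is the standard formula, not a copy of Lucene's BM25Similarity code; the function name and the default document lengths are assumptions made for illustration.

```python
def bm25_tf(freq, k1=1.2, b=0.75, doc_len=100.0, avg_doc_len=100.0):
    """Textbook BM25 term-frequency component (illustrative, not Lucene's code).

    freq is the number of times the term occurs in the field.
    """
    if freq <= 0:
        # Term absent: it contributes nothing.
        return 0.0
    # Length normalisation term; it vanishes entirely when k1 == 0.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1.0) / (freq + norm)


# With k1 = 0 the component is binary: 1 if the term is present, 0 if not.
print(bm25_tf(1, k1=0.0), bm25_tf(57, k1=0.0), bm25_tf(0, k1=0.0))
```

With k1 = 0 the expression reduces to freq / freq, so any non-zero term count gives exactly 1, which matches the binary tf behaviour described in the message.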

Re: When not to use NRTCachingDirectory and what to use instead.

2014-04-21 Thread Tom Burton-West
and I've been focused on other things relating to Solr 4. I'd love to hear any results from someone who is testing for a batch indexing use case and has tested various xxxDirectoryFactory implementations. Please let me know your results if you do end up doing some testing. Tom On Sa

Re: Evaluating a SOLR index with trec_eval

2013-10-30 Thread Tom Burton-West
version of something like that for the INEX book track. I'll see if I can find the code and if it is in any shape to share. Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-sc

Default core for updates in multicore setup

2014-02-05 Thread Tom Burton-West
e.org/jira/browse/SOLR-545 Tom

Re: Default core for updates in multicore setup

2014-02-05 Thread Tom Burton-West
s. I'm still trying to sort out the old and new style solr.xml/core configuration stuff. Thanks for your help. Tom On Wed, Feb 5, 2014 at 4:31 PM, Chris Hostetter wrote: > > : I then tried to locate some config somewhere that would specify that the > : default core would be co

Re: How to implement multilingual word components fields schema?

2014-09-05 Thread Tom Burton-West
mers such as the Greek stemmer will pass through any strings that don't contain characters in the Greek script. So it might be possible to at least do stemming on some of your languages/scripts. I'll be very interested to learn what approach you end up using. Tom -- Some

spam detection issue on sending legitimate mail to Solr list

2014-09-15 Thread Tom Burton-West
score (6.2) exceeded threshold (HTML_MESSAGE,RCVD_IN_DNSWL_ LOW,SPF_NEUTRAL,URIBL_SBL Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search

Re: How to implement multilingual word components fields schema?

2014-09-15 Thread Tom Burton-West
ment in information retrieval (SIGIR '08). ACM, New York, NY, USA, 813-814. DOI=10.1145/1390334.1390518 http://doi.acm.org/10.1145/1390334.1390518 I hope this helps. Tom On Mon, Sep 8, 2014 at 1:33 AM, Ilia Sretenskii wrote: > Thank you for the replies, guys! > > Using f

Solr 4.10 termsIndexInterval and termsIndexDivisor not supported with default PostingsFormat?

2014-09-16 Thread Tom Burton-West
g the termIndexInterval. Can someone please confirm that these two parameter settings termIndexInterval and termsIndexDivisor, do not apply to the default PostingsFormat for Solr 4.10? Tom

How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

2014-09-17 Thread Tom Burton-West
rms with the keyword attribute more weight? What am I missing? Tom - "A repeated question is "how can I have the original term contribute more to the score than the stemmed version"? In Solr 4.3, the KeywordRepeatFilterFactory has

queryResultMaxDocsCached vs queryResultWindowSize

2014-09-23 Thread Tom Burton-West
ache any results list that contains over the queryResultMaxDocsCached? If so, I will add a comment to the Cwiki doc and open a JIRA and submit a patch to the example file. Tom . --- http://svn.apache.org/viewvc/lucene/dev/branches/lucene_solr_4_10/solr/example/solr/coll

Re: Solr 4.10 termsIndexInterval and termsIndexDivisor not supported with default PostingsFormat?

2014-09-24 Thread Tom Burton-West
you would instead use Lucene41PostingsFormat.Lucene41PostingsFormat(int, int) <http://lucene.apache.org/core/4_10_0/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Lucene41PostingsFormat(int,%20int)>. which can also be configured on a per-field basis:" Tom On Thu, Sep 18, 2014 at 1:42 PM, Chris Host

Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-10 Thread Tom Burton-West
very large XML documents, and the examples I see all build documents by adding fields in Java code. Is there an example that actually reads XML files from the file system? Tom

Re: Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-11 Thread Tom Burton-West
w to use ConcurrentUpdateSolrServer with XML > documents > > I have very large XML documents, and the examples I see all build documents > by adding fields in Java code. Is there an example that actually reads XML > files from the file system? Tom

Re: Details on why ConccurentUpdateSolrServer is reccommended for maximum index performance

2014-12-12 Thread Tom Burton-West
has metadata. I'm now thinking that for testing purposes it might be sufficient to construct dummy documents as in the examples rather than trying to use our actual documents. If the speed improvements look significant enough, then I'd need to figure out how to test with real documents. Thanks again for all the input. Tom

How to configure Solr PostingsFormat block size

2015-01-12 Thread Tom Burton-West
ed on a per-field basis" How can we configure Solr to use different (i.e. non-default) minimum and maximum block sizes? Tom

Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Michael and Hoss, assuming I've written the subclass of the postings format, I need to tell Solr to use it. Do I just do something like: Is there a way to set this for all fieldtypes or would that require writing a custom CodecFactory? Tom On Mon, Jan 12, 2015 at 4:46 PM,

Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
se case for replacing a TermIndexInterval setting with changing the min and max block size on the 41 postings format? Tom On Tue, Jan 13, 2015 at 3:16 PM, Chris Hostetter wrote: > > : ...the nuts & bolts of it is that the PostingFormat baseclass should take > : care of all the

Solr example for Solr 4.10.2 gives warning about Multiple request handlers with same name

2015-01-16 Thread Tom Burton-West
ndler registered to the same name: /update ignoring: org.apache.solr.handler.UpdateRequestHandler Is this a bug? Is there something wrong with the out of the box example configuration? Tom

When not to use NRTCachingDirectory and what to use instead.

2013-07-10 Thread Tom Burton-West
in indexing as well? Does the NRTCachingDirectory have any benefit for indexing under the use case noted above? I'm guessing we should just use the solrStandardDirectoryFactory instead. Is this correct? Tom ---

What does "too many merges...stalling" in indexwriter log mean?

2013-07-11 Thread Tom Burton-West
Hello, We are seeing the message "too many merges...stalling" in our indexwriter log. Is this something to be concerned about? Does it mean we need to tune something in our indexing configuration? Tom

Re: What does "too many merges...stalling" in indexwriter log mean?

2013-07-12 Thread Tom Burton-West
nvestigate. Tom On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey wrote: > On 7/11/2013 1:47 PM, Tom Burton-West wrote: > >> We are seeing the message "too many merges...stalling" in our indexwriter >> log. Is this something to be concerned about? Does it mean we nee

Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
requested to 100,000, I have no problems. Does Solr have a limit on number of rows that can be requested or is this a bug? Tom INFO: [core] webapp=/dev-1 path=/select params={shards=XXX:8111/dev-1/core,XXX:8111/dev-2/core,XXX:8111/dev-3/core&fl=vol_id&indent=on&start=0&q=*:*&

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
page level, which would result in about 3 billion pages. So testing the scalability of queries used by our current production system, such as the query against the index that is not released to production to get a list of the unique ids that are actually indexed in Solr is part of that testing pro

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
n see the posts that the shards are sending to the head shard and actually get a good measure of how many bytes are being sent around. I'll poke around and look at multipartUploadLimitInKB, and also see if there is some servlet container limit config I might need to mess with. Tom On Thu,

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
} hits=119220943 status=0 QTime=52952 Tom INFO: [core] webapp=/dev-1 path=/select params={fl=vol_id&indent=on&start=700&q=*:*&rows=100} hits=119220943 status=0 QTime=9772 Jul 25, 2013 5:39:43 PM org.apache.solr.core.SolrCore execute INFO: [core] webapp=/dev-1

How to set discountOverlaps="true" in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
If I am using solr.SchemaSimilarityFactory to allow different similarities for different fields, do I set discountOverlaps="true" on the factory or per field? What is the syntax? The below does not seem to work Tom

Re: How to set discountOverlaps="true" in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
rlaps. Is the default for Solr 4 true? 1.2 0.75 false On Thu, Aug 22, 2013 at 4:58 PM, Markus Jelsma wrote: > Hi Tom, > > Don't set it as attributes but as lists as Solr uses everywhere: > > true > > > For BM25 you can also set

Re: How to set discountOverlaps="true" in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
I should have said that I have set it both to "true" and to "false" and restarted Solr each time and the rankings and info in the debug query showed no change. Does this have to be set at index time? Tom >

ICUTokenizer class not found with Solr 4.4

2013-08-27 Thread Tom Burton-West
t says in the README.txt, I am making some kind of a configuration error. I also don't understand the workaround in SOLR-4852. Is this an ICU issue? A java 7 issue? a Solr 4.4 issue, or did I simply no

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
he-box solrconfig.xml. According to the README.txt, all that needs to be done is create the collection1/lib directory and put the jars there. However, I am getting the class not found error. Should I open another bug report or comment on the existing report? Tom On Tue, Aug 27, 2013 at 6:48 PM

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
not explain why out-of-the-box, simply creating a collection1/lib directory and putting the jars there does not work as documented in both the README.txt and in solrconfig.xml. Shawn, should I add these comments to your JIRA issue? Should I open a separate related JIRA issue? Tom Tom On Tue, Aug

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
nk it to yours or just add this information (i.e. other scenarios where class loading not working) to your JIRA? Details below: Tom The documentation in the collection1/conf directory is confusing. For example the collection1/conf/solrconfig.xml file says you should put a ./lib dir in

Re: Slow queries for common terms

2013-03-22 Thread Tom Burton-West
cache warming with your most common terms. On the other hand as Jan pointed out, you may be cpu bound because Solr doesn't have early termination and has to rank all 90 million docs in order to show the top 10 or 25. Did you try the OR search to see if your CPU is at 100%? Tom On Fri, Mar 22,

Solr 4.x replacement for termsIndexDivisor

2013-05-21 Thread Tom Burton-West
IndexWriterConfig.html#setTermIndexInterval%28int%29 This is followed by an example of how to set the min and max block size in Lucene. Is the ability to set the min and max block size available in Solr? If not, should I open a JIRA? Tom -- Excerpt from the Solr 4.3 latest rev of the example/solr

Example configuring TieredMergePolicy in Solr

2011-09-16 Thread Burton-West, Tom
the Solr TieredMergePolicy to set the parameters: setMaxMergeAtOnce, setSegmentsPerTier, and setMaxMergedSegmentMB? Tom Burton-West

Example for Solr TieredMergePolicy configuration

2011-09-16 Thread Burton-West, Tom
the Solr TieredMergePolicy to set the parameters: setMaxMergeAtOnce, setSegmentsPerTier, and setMaxMergedSegmentMB? Tom Burton-West

Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-16 Thread Burton-West, Tom
ng to 'setMaxMergedSegmentMB' in org.apache.lucene.index.TieredMergePolicy" Tom Burton-West

RE: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-19 Thread Burton-West, Tom
l confused about the mergeFactor=10 setting in the example configuration. Took a quick look at the code, but I'm obviously looking in the wrong place. Is mergeFactor=10 interpreted by TieredMergePolicy as segmentsPerTier=10 and maxMergeAtOnce=10? If I specify values for these is the mergeFacto

Getting facet counts for 10,000 most relevant hits

2011-09-23 Thread Burton-West, Tom
vant documents and select the 5 or 30 facet values with the highest counts for those relevant documents. Is this possible or would it require writing some lucene or Solr code? Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

RE: Getting facet counts for 10,000 most relevant hits

2011-09-30 Thread Burton-West, Tom
ntire result set. In my use case the top 10,000 hits versus all 170,000. Tom -Original Message- From: Lan [mailto:dung@gmail.com] Sent: Thursday, September 29, 2011 7:40 PM To: solr-user@lucene.apache.org Subject: Re: Getting facet counts for 10,000 most relevant hits I implemen

RE: Getting facet counts for 10,000 most relevant hits

2011-10-03 Thread Burton-West, Tom
'll go ahead and do some performance tests on my kludge. That might work for us as an interim measure until I have time to dive into the Solr/Lucene distributed faceting code. Tom -Original Message- From: Chris Hostetter [mailto:hossman_luc...@fucit.org] Sent: Friday, September 30,

RE: Can Apache Solr Handle TeraByte Large Data

2012-01-16 Thread Burton-West, Tom
-fly. That way they can search within the document and get page level results. More details about our setup: http://www.hathitrust.org/blogs/large-scale-search Tom Burton-West University of Michigan Library www.hathitrust.org -Original Message-

RE: indexing best practices

2010-07-19 Thread Burton-West, Tom
g the "nomerge" merge policy. I hope to have some results to report on our blog sometime in the next month or so. Tom Burton-West www.hathitrust.org/blogs -Original Message- From: kenf_nc [mailto:ken.fos...@realestate.com] Sent: Sunday, July 18, 2010 8:18 AM To: solr-user@lucene.apa

RE: Total number of terms in an index?

2010-07-27 Thread Burton-West, Tom
't dug in to the code so I don't actually know how the tii file gets loaded into a data structure in memory. If there is api access, it seems like this might be the quickest way to get the number of unique terms. (Of course you would have to do this for each segment). Tom -Origin

RE: Good list of English words that get "butchered" by Porter Stemmer

2010-07-30 Thread Burton-West, Tom
rter2 stemmer page: http://snowball.tartarus.org/algorithms/english/stemmer.html Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search -Original Message- From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] Sent: Friday, July 30, 2010 4:42 PM To: solr-user@lucene.apach

RE: Improve Query Time For Large Index

2010-08-10 Thread Burton-West, Tom
earch/slow-queries-and-common-words-part-2) Tom Burton-West -Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 9:54 AM To: solr-user@lucene.apache.org Subject: Improve Query Time For Large Index Hi, I have 5 Million small documents/tweets (=&

RE: Improve Query Time For Large Index

2010-08-11 Thread Burton-West, Tom
ide to use CommonGrams you definitely need to re-index and you also need to use both the index time filter and the query time filter. Your index will be larger. Tom -Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Tuesday, August 10, 2010 3:32 PM To:

RE: Improve Query Time For Large Index

2010-08-12 Thread Burton-West, Tom
ar.baz). The debug/explain will indicate whether the parsed query is a PhraseQuery. Tom -Original Message- From: Peter Karich [mailto:peat...@yahoo.de] Sent: Thursday, August 12, 2010 5:36 AM To: solr-user@lucene.apache.org Subject: Re: Improve Query Time For Large Index Hi Tom,

RE: analysis tool vs. reality

2010-08-13 Thread Burton-West, Tom
+1 I just had occasion to debug something where the interaction between the queryparser and the analyzer produced *interesting* results. Having a separate jsp that includes the whole chain (i.e. analyzer/tokenizer/filter and qp) would be great! Tom -Original Message- From: Michael

Solr memory use, jmap and TermInfos/tii

2010-09-10 Thread Burton-West, Tom
600,000 full-text books in each shard). In interpreting the jmap output, can we assume that the listings for utf8 character arrays ("[C"), java.lang.String, long arrays ("[J"), and int arrays ("[I") are all part of the data structures involved in representing the tii

Solr and jvm Garbage Collection tuning

2010-09-10 Thread Burton-West, Tom
Solr is waiting on GC? If we could get the time for each GC to take under a second, with the trade-off being that GC would occur much more frequently, that would help us avoid the occasional query taking more than 30 seconds at the cost of a larger number of queries taking at least a second. Tom

RE: Solr memory use, jmap and TermInfos/tii

2010-09-11 Thread Burton-West, Tom
ex we plan to use a more intelligent filter that will truncate extremely long tokens on punctuation and we also plan to do some minimal prefiltering prior to sending documents to Solr for indexing. However, since we now have over 400 languages, we will have to be conservative in our filtering since we would rather index dirty OCR than risk not indexing legitimate content. Tom

RE: Solr memory use, jmap and TermInfos/tii

2010-09-13 Thread Burton-West, Tom
o see if we can provide you with our tii/tis data. I'll let you know as soon as I hear anything. Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Sunday, September 12, 2010 10:48 AM To: solr-user@lucene.apache.org; simon.willna...@gmail.com Subject: Re: Solr

RE: Solr and jvm Garbage Collection tuning

2010-09-13 Thread Burton-West, Tom
'll be testing using termIndexInterval with Solr 1.4.1 on our test server. Tom -Original Message- From: Grant Ingersoll [mailto:gsing...@apache.org] >.What are your current GC settings? Also, I guess I'd look at ways you can >reduce the heap size needed. >> Cac

RE: bi-grams for common terms - any analyzers do that?

2010-09-23 Thread Burton-West, Tom
believe Robert Muir, who is an expert on the various problems involved and opened LUCENE-2458, is working on a better fix. Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

RE: bi-grams for common terms - any analyzers do that?

2010-09-27 Thread Burton-West, Tom
e they have the same position, they are not turned into a phrase query. for "l'art" input position|1 token |l'art output position|1|2 token |l|art In this case there are two tokens with different positions so it treats them as a phrase query. Tom Burton-West
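
The rule described in this message can be sketched in a few lines. This is a deliberately simplified model, not the query parser's actual logic: each analyzed token is represented as a hypothetical (term, position) pair, and a phrase query is assumed to be generated only when one input word yields tokens at more than one position.

```python
def forms_phrase_query(tokens):
    """Return True if the analyzed tokens of one input word span several positions.

    tokens is a list of (term, position) pairs; this representation is an
    assumption made for illustration.
    """
    positions = {position for _term, position in tokens}
    return len(positions) > 1


# Tokens stacked at the same position: not turned into a phrase query.
print(forms_phrase_query([("l'art", 1), ("art", 1)]))
# "l'art" split into l (position 1) and art (position 2): phrase query.
print(forms_phrase_query([("l", 1), ("art", 2)]))
```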

RE: bi-grams for common terms - any analyzers do that?

2010-09-27 Thread Burton-West, Tom
y with the *default* query operator as set in SolrConfig rather than necessarily using the Boolean "OR" operator? i.e. if and autoGeneratePhraseQueries = off then "IndexReader" -> "index" "reader" -> "index" AND "reader" Tom

Estimating memory use for Solr caches

2010-10-01 Thread Burton-West, Tom
docIDs. I assume these are Java ints but the number depends on the number of hits. Is there a good way to estimate (or measure:) the size of this in memory? Tom Burton-West

Experience with large merge factors

2010-10-05 Thread Burton-West, Tom
optimum mergeFactor somewhere between 0 (noMerge merge policy) and 1,000. (We are also planning to raise the ramBufferSizeMB significantly). What experience do others have using a large mergeFactor? Tom

RE: Experience with large merge factors

2010-10-06 Thread Burton-West, Tom
small compared to our huge OCR field. Since we construct our Solr documents programmatically, I'm fairly certain that they are always in the same order. I'll have to look at the code when I get back to make sure. We aren't using term vectors now, but we plan to add them as well as a number of fields based on MARC (cataloging) metadata in the future. Tom

filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
query Can Hoss or someone else point me to more detailed information on what might be involved in the two ideas listed above? Is somehow keeping an up-to-date map of unique Solr ids to internal Lucene ids needed to implement this or is that a separate issue? Tom Burton-West http://www.hathitrust

RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
e kind of in-memory map after we optimize an index and before we mount it in production. In our workflow, we update the index and optimize it before we release it and once it is released to production there is no indexing/merging taking place on the production index (so the internal Lucene i

RE: filter query from external list of Solr unique IDs

2010-10-15 Thread Burton-West, Tom
Thanks Yonik, Is this something you might have time to throw together, or an outline of what needs to be thrown together? Is this something that should be asked on the developer's list or discussed in SOLR 1715 or does it make the most sense to keep the discussion in this thread?

Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
writing the appropriate Solr filter factories? Are there any tricky gotchas in writing such a filter? If so, should I open a JIRA issue or two JIRA issues so the filter factories can be contributed to the Solr code base? Tom

RE: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr

2010-11-01 Thread Burton-West, Tom
es. (Unless someone beats me to it :) Tom -Original Message- From: Robert Muir [mailto:rcm...@gmail.com] Sent: Monday, November 01, 2010 12:49 PM To: solr-user@lucene.apache.org Subject: Re: Using ICUTokenizerFilter or StandardAnalyzer with UAX#29 support from Solr On Mon, Nov 1, 2010

RE: Doubt about index size

2010-11-12 Thread Burton-West, Tom
on that some of them are marked as deleted numDocs is the actual number of undeleted documents If you run an optimize the index will be rewritten, the index size will go down and numDocs will equal maxDocs Tom Burton-West -Original Message- From: Claudio Devecchi [mailto:cdevec...@g
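
The relationship between the two counters can be written out as a tiny sketch. The function name is hypothetical; the only fact taken from the message is that the gap between the maximum document count and numDocs is made up of documents marked as deleted, and that the gap closes after an optimize.

```python
def deleted_doc_count(max_doc, num_docs):
    """Number of documents still marked as deleted in the index.

    max_doc counts all documents including deleted ones; num_docs counts
    only live documents. Both names mirror the statistics discussed above.
    """
    if num_docs > max_doc:
        raise ValueError("num_docs cannot exceed max_doc")
    return max_doc - num_docs


# Made-up numbers for illustration: 1,200 docs in the index, 1,000 live.
print(deleted_doc_count(1200, 1000))
# After an optimize the two counters are equal, so nothing is left to purge.
print(deleted_doc_count(1000, 1000))
```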

RE: Doubt about index size

2010-11-12 Thread Burton-West, Tom
An optimize takes lots of cpu and I/O since it has to rewrite your indexes, so only do it when necessary. You can just use curl to send an optimize message to Solr when you are ready. See: http://wiki.apache.org/solr/UpdateXmlMessages#Passing_commit_parameters_as_part_of_the_URL Tom
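
For reference, a minimal sketch of the URL such a curl call would hit, assuming the `optimize=true` request parameter on the update handler as described on the wiki page linked above. The base URL is a hypothetical placeholder, and the sketch only builds the URL rather than sending the request.

```python
def optimize_url(solr_base):
    """Build the update-handler URL that asks Solr to optimize the index.

    solr_base is an assumed placeholder such as http://localhost:8983/solr;
    adjust it (and add a core name if needed) for your installation.
    """
    return f"{solr_base.rstrip('/')}/update?optimize=true"


url = optimize_url("http://localhost:8983/solr")
print(url)
```

The printed URL can then be passed to curl once you are ready to pay the CPU and I/O cost of the rewrite.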

How to configure termvectors to not store positions/offsets

2012-12-13 Thread Tom Burton-West
frequencies 2) Shows how to configure termvectors in Solr schema.xml to only store term frequencies, and not positions and offsets? Tom

configuring per-field similarity in Solr 4: "the global similarity does not support it"

2012-12-17 Thread Tom Burton-West
ely. I think I'm missing something here. Can someone point me to documentation or examples? Tom Simplified schema.xml excerpt:

Re: configuring per-field similarity in Solr 4: "the global similarity does not support it"

2012-12-17 Thread Tom Burton-West
there a plan to implement coord and queryNorm? Tom On Mon, Dec 17, 2012 at 5:17 PM, Markus Jelsma wrote: > Hi Tom, > > The global similarity must be able to delegate similarity to your > per-field setting. Solr has the SchemaSimilarityFactory that can do this. > Please replace y

ICUTokenizer labels number as Han character?

2012-12-19 Thread Tom Burton-West
n and "年" as type:Single and script: Common. This doesn't seem right. Couldn't fit the whole analysis output on one screen so there are two screenshots attached. Any clues as to what is going on and whether it is a problem? Tom

Best practices for Solr highlighter for CJK

2013-01-02 Thread Tom Burton-West
. i.e. ABC => searched as AB BC, but only AB gets highlighted even if the matching string is ABC. (Where ABC are Chinese characters such as 大亚湾 => searched as 大亚 亚湾, but only 大亚 is highlighted rather than 大亚湾.) Is there some highlighting parameter that might fix this? Tom Burton-West
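
The bigram behaviour in this example can be reproduced with a short sketch. It is a simplified stand-in for what a CJK bigram filter does (overlapping two-character windows over a run of CJK characters), not the Lucene implementation; the function name is hypothetical.

```python
def cjk_bigrams(text):
    """Split a run of CJK characters into overlapping bigrams.

    Simplified illustration: a single character is returned as-is and an
    empty string yields no tokens.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(cjk_bigrams("大亚湾"))  # ['大亚', '亚湾']
```

Because the query is matched as two separate bigram tokens, a highlighter working token by token sees 大亚 and 亚湾 as distinct hits rather than one three-character span.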

coord missing from debugQuery explain?

2013-01-08 Thread Tom Burton-West
Hello, I'm trying to understand some Solr relevance issues using debugQuery=on, but I don't see the coord factor listed anywhere in the explain output. My understanding is that the coord factor is not included in either the querynorm or the fieldnorm. What am I missing? Tom

Why does debugQuery/explain output sometimes include queryNorm and sometimes not for same query?

2013-01-25 Thread Tom Burton-West
e queryNorm be applied to each result (and show up in each explain from the debugQuery?) This is Solr 3.6. Tom - ocr:aardvark 0.4395488 = (MATCH) fieldWeight(ocr:aardvark in 504374), product of: 7.5498343 = tf(termFreq(ocr:aardvark)=57)

Re: Why does debugQuery/explain output sometimes include queryNorm and sometimes not for same query?

2013-01-25 Thread Tom Burton-West
Thanks Hoss, Yes it is a distributed query. Tom On Fri, Jan 25, 2013 at 2:32 PM, Chris Hostetter wrote: > > : I have a one term query: "ocr:aardvark" When I look at the explain > : output, for some matches the queryNorm and fieldWeight are shown and for > : some matche

ngrams or truncation for multilingual searching in Solr

2013-02-05 Thread Tom Burton-West
, New York, NY, USA, 75-82. DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957 Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
Seems like a change in default behavior like this should be included in the changes.txt for Solr 3.5. Not sure how to do that. Tom -Original Message- From: Naomi Dushay [mailto:ndus...@stanford.edu] Sent: Thursday, February 23, 2012 1:57 PM To: solr-user@lucene.apache.org Subject

RE: autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Burton-West, Tom
ink it would help if the change was also noted in changes.txt. Is it possible to revise the changes.txt for 3.5? Do you by any chance know where the change in the default behavior was discussed? I know it has been a contentious issue. Tom -Original Message- From: Erik Hatcher [mailto:erik

maxMergeDocs in Solr 3.6

2012-04-19 Thread Burton-West, Tom
ample solrconfig was 2,147,483,647 we would never hit this limit, but I was wondering about why it is no longer in the example. Tom
