Re: Solr Support for BM25F

2016-04-18 Thread Tom Burton-West
Hi David, It may not matter for your use case but just in case you really are interested in the "real BM25F" there is a difference between configuring K1 and B for different fields in Solr and a "real" BM25F implementation. This has to do with Solr's model of fields being mini-documents (i.e. ea
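
The distinction drawn above can be illustrated with a minimal Python sketch. It is not Solr's or Lucene's implementation: it assumes length normalization is switched off (b = 0), leaves out idf, and uses hypothetical field names, weights, and term frequencies. Per-field BM25 saturates each field's tf separately and then adds the results, whereas BM25F adds the weighted tf across fields first and saturates once.

```python
def saturate(tf, k1=1.2):
    """BM25 term-frequency saturation with b = 0 (no length normalization)."""
    return tf * (k1 + 1) / (tf + k1)


def per_field_bm25(tfs, weights, k1=1.2):
    """Score each field as its own mini-document, then add the weighted scores."""
    return sum(weights[field] * saturate(tf, k1) for field, tf in tfs.items())


def bm25f(tfs, weights, k1=1.2):
    """Combine weighted term frequencies across fields, then saturate once."""
    combined_tf = sum(weights[field] * tf for field, tf in tfs.items())
    return saturate(combined_tf, k1)


# Hypothetical document: the query term occurs once in each field.
tfs = {"title": 1, "body": 1}
weights = {"title": 1.0, "body": 1.0}

print(per_field_bm25(tfs, weights))  # saturation applied once per field
print(bm25f(tfs, weights))           # saturation applied once to the combined tf
```

Because saturation is non-linear, the two approaches give different scores for the same occurrences, which is why tuning k1 and b per field is not equivalent to BM25F.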

Changing Similarity without re-indexing (for example from default to BM25)

2015-08-19 Thread Tom Burton-West
Hello all, The last time I worked with changing Similarities was with Solr 4.1 and at that time, it was possible to simply change the schema to specify the use of a different Similarity without re-indexing. This allowed me to experiment with several different ranking algorithms without having to

Clarification of locktype=single and implications of use

2015-02-20 Thread Tom Burton-West
Hello, We don't want to use locktype=native (we are using NFS) or locktype=simple (we mount a read-only snapshot of the index on our search servers and with locktype=simple, Solr refuses to start up because it sees the lock file.) However, we don't quite understand the warnings about using lockty

Re: Clarification of locktype=single and implications of use

2015-02-20 Thread Tom Burton-West
Thanks Hoss, Protection from misconfiguration and/or starting separate solr instances pointing to the same index dir I can understand. The current documentation on the wiki and in the ref guide (along with just enough understanding of Solr/Lucene indexing to be dangerous) left me wondering if ma

Optimize maxSegments="2" not working right with Solr 4.10.2

2015-02-23 Thread Tom Burton-West
Hello, We normally run an optimize with maxSegments="2" after our daily indexing. This has worked without problem on Solr 3.6. We recently moved to Solr 4.10.2 and on several shards the optimize completed with no errors in the logs, but left more than 2 segments. We send this xml to Solr I've

Re: Basic Multilingual search capability

2015-02-25 Thread Tom Burton-West
Hi Rishi, As others have indicated, multilingual search is very difficult to do well. At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to deal with having materials in 400 languages. We also added the CJKBigramFilter to get better precision on CJK queries. We don't use stop w

Re: How to configure Solr PostingsFormat block size

2015-03-12 Thread Tom Burton-West
Hi Hoss, I created a wrapper class, compiled a jar and included an org.apache.lucene.codecs.Codec file in META-INF/services in the jar file with an entry for the wrapper class :HTPostingsFormatWrapper. I created a collection1/lib directory and put the jar there. (see below) I'm getting the drea

Error in Solr 6.6 Example schemas re: DocValues for StrField type must be single-valued?

2017-08-15 Thread Tom Burton-West
index/DocValuesType.html Is the comment in the example schema file completely wrong, or is there some issue with using docValues with a multivalued StrField? Tom Burton-West https://www.hathitrust.org/blogs/large-scale-search

Re: Indexing large documents

2014-03-19 Thread Tom Burton-West
se case as Otis suggested. In our use case sometimes this is appropriate, but we are investigating the possibility of other methods of scoring the group based on a more flexible function of the scores of the members (i.e., scoring a book based on a function of the scores of its chapters). Tom Burton-West http://www

Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
Hi Shawn, I'm not sure I understand the problem and why you need to solve it at the ICUTokenizer level rather than the CJKBigramFilter. Can you perhaps give a few examples of the problem? Have you looked at the flags for the CJKBigramFilter? You can tell it to make bigrams of different Japanese ch

Re: Analysis of Japanese characters

2014-04-02 Thread Tom Burton-West
Hi Shawn, I may still be missing your point. Below is an example where the ICUTokenizer splits Now, I'm beginning to wonder if I really understand what those flags on the CJKBigramFilter do. The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them back together into bigrams. I t

Re: Analysis of Japanese characters

2014-04-03 Thread Tom Burton-West
Hi Shawn, >>For an input of 田中角栄 the bigram filter works like you described, and what I would expect. If I add a space at the point where the ICU >>tokenizer would have split them anyway, the bigram filter output is very different. If I'm understanding what you are reporting, I suspect this is b

Re: tf and very short text fields

2014-04-03 Thread Tom Burton-West
Hi Markus and Wunder, I'm missing the original context, but I don't think BM25 will solve this particular problem. The k1 parameter sets how quickly the contribution of tf to the score falls off with increasing tf. It would be helpful for making sure really long documents don't get too high a

Re: tf and very short text fields

2014-04-04 Thread Tom Burton-West
Thanks Markus, I was thinking about normalization and was absolutely wrong about setting K1 to zero. I should have taken a look at the algorithm and walked through setting K1=0. (This is easier to do looking at the formula in Wikipedia http://en.wikipedia.org/wiki/Okapi_BM25 than walking through
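
The walk-through mentioned here can also be done numerically. The following is a minimal sketch of the tf component of the Okapi BM25 formula as given on the Wikipedia page cited above; the document lengths are hypothetical and this is not Lucene's BM25Similarity code.

```python
def bm25_tf(tf, k1, b=0.75, doc_len=100, avg_doc_len=100):
    """tf component of Okapi BM25: tf*(k1+1) / (tf + k1*(1 - b + b*dl/avgdl))."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Compare k1 = 0 with a commonly used value of k1 = 1.2 for growing tf.
for tf in (1, 2, 10, 100):
    print(tf, bm25_tf(tf, k1=0.0), round(bm25_tf(tf, k1=1.2), 3))
```

With k1 = 0 the expression reduces to tf/tf, so every matching document gets the same tf contribution and the length-normalization term (which is multiplied by k1) drops out as well. With k1 > 0 the contribution grows with tf but stays below k1 + 1.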

Re: When not to use NRTCachingDirectory and what to use instead.

2014-04-21 Thread Tom Burton-West
Hi Ken, Given the comments which seemed to describe using NRT for the opposite of our use case, I just set our Solr 4 to use the solr.MMapDirectoryFactory. Didn't bother to test whether NRT would be better for our use case, mostly because it didn't sound like there was an advantage and I've bee

Re: Evaluating a SOLR index with trec_eval

2013-10-30 Thread Tom Burton-West
version of something like that for the INEX book track. I'll see if I can find the code and if it is in any shape to share. Tom Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-sc

Default core for updates in multicore setup

2014-02-05 Thread Tom Burton-West
Hello, I'm running the example setup for Solr 4.6.1. In the ../example/solr/ directory, I set up a second core. I wanted to send updates to that core. I looked at .../exampledocs/post.sh and expected to see the URL as: URL= http://localhost:8983/solr/collection1/update However it does not

Re: Default core for updates in multicore setup

2014-02-05 Thread Tom Burton-West
Thanks Hoss, >>hardcoded default of "collection1" is still used for backcompat when there is no "defaultCoreName" configured by the user. Aha, it's hardcoded if there is nothing set in a config. No wonder I couldn't find it by grepping around the config files. I'm still trying to sort out the o

Re: How to implement multilingual word components fields schema?

2014-09-05 Thread Tom Burton-West
SA, 677-686. DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622 Code: http://users.dsic.upv.es/~pgupta/mixed-script-ir.html Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org

spam detection issue on sending legitimate mail to Solr list

2014-09-15 Thread Tom Burton-West
score (6.2) exceeded threshold (HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_NEUTRAL,URIBL_SBL Tom Burton-West Information Retrieval Programmer Digital Library Production Service University of Michigan Library tburt...@umich.edu http://www.hathitrust.org/blogs/large-scale-search

Re: How to implement multilingual word components fields schema?

2014-09-15 Thread Tom Burton-West
Hi Ilia, I see that Trey answered your question about how you might stack language specific filters in one field. If I remember correctly, his approach assumes you have identified the language of the query. That is not the same as detecting the script of the query and is much harder. Trying to

Solr 4.10 termsIndexInterval and termsIndexDivisor not supported with default PostingsFormat?

2014-09-16 Thread Tom Burton-West
Hello, I think the documentation and example files for Solr 4.x need to be updated. If someone will let me know I'll be happy to fix the example and perhaps someone with edit rights could fix the reference guide. Due to dirty OCR and over 400 languages we have over 2 billion unique terms in our

How does KeywordRepeatFilterFactory help giving a higher score to an original term vs a stemmed term

2014-09-17 Thread Tom Burton-West
The Solr wiki says "A repeated question is "how can I have the original term contribute more to the score than the stemmed version"? In Solr 4.3, the KeywordRepeatFilterFactory has been added to assist this functionality. " https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming (

queryResultMaxDocsCached vs queryResultWindowSize

2014-09-23 Thread Tom Burton-West
Hello, queryResultWindowSize sets the number of documents to cache for each query in the queryResult cache. So if you normally output 10 results per page, and users don't go beyond page 3 of results, you could set queryResultWindowSize to 30 and the second and third page requests will read f
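
A simplified model of the windowing described above, written as a Python sketch. It assumes the last requested position (start + rows) is rounded up to the next multiple of queryResultWindowSize; the function name is hypothetical and this illustrates the idea rather than reproducing Solr's code.

```python
def cached_window(start, rows, window_size):
    """Number of top documents collected and cached for a request,
    assuming the end position is rounded up to a multiple of window_size."""
    last_requested = start + rows
    if last_requested <= 0:
        return 0
    return ((last_requested - 1) // window_size + 1) * window_size


# 10 results per page with queryResultWindowSize = 30.
for page in (1, 2, 3, 4):
    start = (page - 1) * 10
    print("page", page, "-> documents cached:", cached_window(start, 10, 30))
```

Under this model pages 1 to 3 all fall inside the first window of 30, so only the first request has to run the query; page 4 needs a larger window.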

Re: Solr 4.10 termsIndexInterval and termsIndexDivisor not supported with default PostingsFormat?

2014-09-24 Thread Tom Burton-West
Thanks Hoss, Just opened SOLR-6560 and attached a patch which removes the offending section from the example solrconfig.xml file. We suspect that with the much more efficient block and FST based Solr 4 default postings format that the need to mess with the parameters in order to reduce memory u

Details on why ConcurrentUpdateSolrServer is recommended for maximum index performance

2014-12-10 Thread Tom Burton-West
Hello all, In the example schema.xml for Solr 4.10.2 this comment is listed under the "PERFORMANCE NOTE" "For maximum indexing performance, use the ConcurrentUpdateSolrServer java client." Is there some documentation somewhere that explains why this will maximize indexing performance? In par

Re: Details on why ConcurrentUpdateSolrServer is recommended for maximum index performance

2014-12-11 Thread Tom Burton-West
Thanks Eric, That is helpful. We already have a process that works similarly. Each thread/process that sends a document to Solr waits until it gets a response in order to make sure that the document was indexed successfully (we log errors and retry docs that don't get indexed successfully), howe

Re: Details on why ConcurrentUpdateSolrServer is recommended for maximum index performance

2014-12-12 Thread Tom Burton-West
Thanks everybody for the information. Shawn, thanks for bringing up the issues around making sure each document is indexed ok. With our current architecture, that is important for us. Yonik's clarification about streaming really helped me to understand one of the main advantages of CUSS: >>When

How to configure Solr PostingsFormat block size

2015-01-12 Thread Tom Burton-West
Hello all, Our indexes have around 3 billion unique terms, so for Solr 3, we set TermIndexInterval to about 8 times the default. The net effect of this is to reduce the size of the in-memory index to about 1/8th. (For background see http://www.hathitrust.org/blogs/large-scale-search/too-many
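
The arithmetic behind this setting can be sketched in a few lines of Python. The sketch assumes that roughly every Nth term is kept in the in-memory terms index and that the Lucene 3.x default interval is 128; both are assumptions to check against your version, and the function name is made up for the example.

```python
def in_memory_terms(unique_terms, term_index_interval):
    """Approximate number of terms held in memory when every Nth term is indexed."""
    return unique_terms // term_index_interval


unique_terms = 3_000_000_000      # figure quoted in the message
default_interval = 128            # assumed Lucene 3.x default
raised_interval = default_interval * 8

at_default = in_memory_terms(unique_terms, default_interval)
at_raised = in_memory_terms(unique_terms, raised_interval)

print("terms in memory at default interval:", at_default)
print("terms in memory at 8x interval:", at_raised)
print("reduction factor:", round(at_default / at_raised, 2))
```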

Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Michael and Hoss, assuming I've written the subclass of the postings format, I need to tell Solr to use it. Do I just do something like: Is there a way to set this for all fieldtypes or would that require writing a custom CodecFactory? Tom On Mon, Jan 12, 2015 at 4:46 PM, Chris Hoste

Re: How to configure Solr PostingsFormat block size

2015-01-13 Thread Tom Burton-West
Thanks Hoss, This is starting to sound pretty complicated. Are you saying this is not doable with Solr 4.10? >>...or at least: that's how it *should* work :) makes me a bit nervous about trying this on my own. Should I open a JIRA issue or am I probably the only person with a use case for repla

Solr example for Solr 4.10.2 gives warning about Multiple request handlers with same name

2015-01-16 Thread Tom Burton-West
Hello, I'm running Solr 4.10.2 out of the box with the Solr example. i.e. ant example cd solr/example java -jar start.jar in /example/log At start-up the example gives this message in the log: WARN - 2015-01-16 12:31:40.895; org.apache.solr.core.RequestHandlers; Multiple requestHandler regist

When not to use NRTCachingDirectory and what to use instead.

2013-07-10 Thread Tom Burton-West
Hello all, The default directory implementation in Solr 4 is the NRTCachingDirectory (in the example solrconfig.xml file, see below). The Javadoc for NRTCachingDirectory ( http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true) says: "This cl

What does "too many merges...stalling" in indexwriter log mean?

2013-07-11 Thread Tom Burton-West
Hello, We are seeing the message "too many merges...stalling" in our indexwriter log. Is this something to be concerned about? Does it mean we need to tune something in our indexing configuration? Tom

Re: What does "too many merges...stalling" in indexwriter log mean?

2013-07-12 Thread Tom Burton-West
nvestigate. Tom On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey wrote: > On 7/11/2013 1:47 PM, Tom Burton-West wrote: > >> We are seeing the message "too many merges...stalling" in our indexwriter >> log. Is this something to be concerned about? Does it mean we nee

Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Hello, I am running solr 4.2.1 on 3 shards and have about 365 million documents in the index total. I sent a query asking for 1 million rows at a time, but I keep getting an error claiming that there is an invalid version or data not in javabin format (see below) If I lower the number of rows re

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
above 1,000. > > So, if rows=10 works for you, consider yourself lucky! > > That said, there is sometimes talk of supporting streaming, which > presumably would allow access to all results, but chunked/paged in some way. > > -- Jack Krupansky > > -Original Messag

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
Thanks Shawn, I was confused by the error message: "Invalid version (expected 2, but 60) or the data in not in 'javabin' format" Your explanation makes sense. I didn't think about what the shards have to send back to the head shard. Now that I look in my logs, I can see the posts that the shard
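
A back-of-the-envelope Python sketch of what the shards have to send back. It assumes each shard returns its own top start + rows entries (document id plus sort value) to the coordinating node, which then merges them; the function name is hypothetical.

```python
def entries_sent_to_coordinator(num_shards, start, rows):
    """Entries the coordinating node must receive and merge, assuming every
    shard returns its own top (start + rows) results."""
    per_shard = start + rows
    return num_shards * per_shard


# Figures from the thread: 3 shards, 1 million rows per request.
print(entries_sent_to_coordinator(num_shards=3, start=0, rows=1_000_000))

# Paging deeper makes it worse, because start is added on every shard.
print(entries_sent_to_coordinator(num_shards=3, start=2_000_000, rows=1_000_000))
```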

Re: Solr 4.2.1 limit on number of rows or number of hits per shard?

2013-07-25 Thread Tom Burton-West
e.solr.core.SolrCore execute INFO: [core] webapp=/dev-1 path=/select params={fl=vol_id&indent=on&start=3300&q=*:*&rows=100} hits=119220943 status=0 QTime=49792 Jul 25, 2013 6:39:43 PM org.apache.solr.core.SolrCore execute INFO: [core] webapp=/dev-1 path=/select params={fl=vol_id&indent

How to set discountOverlaps="true" in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
If I am using solr.SchemaSimilarityFactory to allow different similarities for different fields, do I set discountOverlaps="true" on the factory or per field? What is the syntax? The below does not seem to work Tom

Re: How to set discountOverlaps="true" in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
Thanks Markus, I set it, but it seems to make no difference in the score or statistics listed in the debugQuery or in the ranking. I'm using a field with CommonGrams and a huge list of common words, so there should be a huge difference in the document length with and without discountOverlaps.

Re: How to set discountOverlaps="true" in Solr 4x schema.xml

2013-08-22 Thread Tom Burton-West
I should have said that I have set it both to "true" and to "false" and restarted Solr each time and the rankings and info in the debug query showed no change. Does this have to be set at index time? Tom >

ICUTokenizer class not found with Solr 4.4

2013-08-27 Thread Tom Burton-West
Hello all, According to the README.txt in solr-4.4.0/solr/example/solr/collection1, all we have to do is create a collection1/lib directory and put whatever jars we want in there. ".. /lib. If it exists, Solr will load any Jars found in this directory and use them to resolve any "pl

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
, Shawn Heisey wrote: > On 8/27/2013 4:29 PM, Tom Burton-West wrote: > >> According to the README.txt in solr-4.4.0/solr/example/solr/** >> collection1, >> all we have to do is create a collection1/lib directory and put whatever >> jars we want in there. >> >

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
My point in the previous e-mail was that following the instructions in the documentation does not seem to work. The workaround I found was to simply change the name of the collection1/lib directory to collection1/foobar and then include it in solrconfig.xml. This works, but does not

Re: ICUTokenizer class not found with Solr 4.4

2013-08-28 Thread Tom Burton-West
chema.xml. Any other optional configuration files would also be kept here. data/ This directory is the default location where Solr will keep your ... lib/ On Wed, Aug 28, 2013 at 12:11 PM, Shawn Heisey wrote: > On 8/28/2013 9:34 AM, Tom Burton-West wrote: > >> I thi

Re: Slow queries for common terms

2013-03-22 Thread Tom Burton-West
Hi David and Jan, I wrote the blog post, and David, you are right, the problem we had was with phrase queries because our positions lists are so huge. Boolean queries don't need to read the positions lists. I think you need to determine whether you are CPU bound or I/O bound. It is possible

Solr 4.x replacement for termsIndexDivisor

2013-05-21 Thread Tom Burton-West
Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms ( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again ). In Solr 3.6 and previous we needed to reduce the memory used for storing the in-memory representation of the tii file. We originall

How to configure termvectors to not store positions/offsets

2012-12-13 Thread Tom Burton-West
Hello, As I understand it, MoreLikeThis only requires term frequencies, not positions or offsets. So in order to save disk space I would like to store termvectors, but without positions and offsets. Is there documentation somewhere that 1) would confirm that MoreLikeThis only needs term frequenc

configuring per-field similarity in Solr 4: "the global similarity does not support it"

2012-12-17 Thread Tom Burton-West
Hello, I have Solr 4 configured with several fields using different similarity classes according to: http://wiki.apache.org/solr/SchemaXml#Similarity However, I get this error message: " FieldType 'DFR' is configured with a similarity, but the global similarity does not support it: class org.apac

Re: configuring per-field similarity in Solr 4: "the global similarity does not support it"

2012-12-17 Thread Tom Burton-West
Thanks Markus! Adding fixed the problem. >>Keep in mind that coord and queryNorm (=1.0f) are not implemented now, so you will get different scores for TF-IDF! Can you explain more about this, or is it documented somewhere? Do I need to read the source for solr.SchemaSimilarityFactory? Is there

ICUTokenizer labels number as Han character?

2012-12-19 Thread Tom Burton-West
Hello, Don't know if the Solr admin panel is lying, or if this is a weird bug. The string: "1986年" gets analyzed by the ICUTokenizer with "1986" being identified as type:NUM and script:Han. Then the CJKBigram filter identifies "1986" as type:Num and script:Han and "年" as type:Single and script:

Best practices for Solr highlighter for CJK

2013-01-02 Thread Tom Burton-West
. i.e. ABC => searched as AB BC only AB gets highlighted even if the matching string is ABC. (Where ABC are chinese characters such as 大亚湾 => searched as 大亚 亚湾, but only 大亚 is highlighted rather than 大亚湾) Is there some highlighting parameter that might fix this? Tom Burton-West
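
The example in this message can be reproduced with a short Python sketch. It shows how overlapping bigrams and their character offsets arise, and what merging the matched spans would give; it only illustrates the behaviour being asked about and does not correspond to any particular Solr highlighter parameter. The function names are hypothetical.

```python
def bigrams(text):
    """Overlapping character bigrams, pairing each character with the next one."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_spans(text):
    """(start, end) character offsets of each bigram in the original text."""
    return [(i, i + 2) for i in range(len(text) - 1)]


def merge_spans(spans):
    """Merge spans that overlap or touch into single highlight regions."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


text = "大亚湾"
print(bigrams(text))                    # the two bigrams that get searched
print(bigram_spans(text))               # each bigram covers only part of the string
print(merge_spans(bigram_spans(text)))  # merged, the spans cover the whole match
```

Highlighting only the first span marks 大亚; merging the overlapping spans would mark all of 大亚湾.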

coord missing from debugQuery explain?

2013-01-08 Thread Tom Burton-West
Hello, I'm trying to understand some Solr relevance issues using debugQuery=on, but I don't see the coord factor listed anywhere in the explain output. My understanding is that the coord factor is not included in either the querynorm or the fieldnorm. What am I missing? Tom

Why does debugQuery/explain output sometimes include queryNorm and sometimes not for same query?

2013-01-25 Thread Tom Burton-West
Hello all, I have a one term query: "ocr:aardvark" When I look at the explain output, for some matches the queryNorm and fieldWeight are shown and for some matches only the "weight" is shown with no query norm. (See below) What explains the difference? Shouldn't the queryNorm be applied to e

Re: Why does debugQuery/explain output sometimes include queryNorm and sometimes not for same query?

2013-01-25 Thread Tom Burton-West
Thanks Hoss, Yes it is a distributed query. Tom On Fri, Jan 25, 2013 at 2:32 PM, Chris Hostetter wrote: > > : I have a one term query: "ocr:aardvark" When I look at the explain > : output, for some matches the queryNorm and fieldWeight are shown and for > : some matches only the "weight" is

ngrams or truncation for multilingual searching in Solr

2013-02-05 Thread Tom Burton-West
, New York, NY, USA, 75-82. DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957 Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search

boost not showing up in Solr 3.6 debugQueries?

2012-05-17 Thread Tom Burton-West
log. Tom Burton-West 兵にな^1000 OR hanUnigrams:兵にな 兵にな^1000 OR hanUnigrams:兵にな ((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵 ((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵 0.15685473 = (MATCH) sum of: 0.15684697 = (MATCH) sum of: 0.0067602023 = (MATCH) weight(ocr:兵に in 213594

What is the "docs" number in Solr explain query results for fieldnorm?

2012-05-25 Thread Tom Burton-West
ed below the explain scoring for a couple of documents with tf 50 and 67. 0.6798219 DF9199B7049F8DFE-220 DF9199B7049F8DFE The Aeroplane 0.6798219 = (MATCH) fieldWeight(ocr:the in 16624), product of: 1.0 = tf(termFreq(ocr:the)=1) 1.087715 = idf(docFreq=16219, maxDocs=1

edismax parser ignores mm parameter when tokenizer splits tokens (i.e. CJK)

2012-06-26 Thread Tom Burton-West
oblem also occurs with non-CJK queries for example [two-thirds] turns into a Boolean OR query for ( [two] OR [thirds] ). Is there some way to tell the edismax query parser to stick with mm =100%? Appended below is the debugQuery output for these two queries and an excerpt from our schema.xml. To

edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-06-28 Thread Tom Burton-West
something, or is this a bug? I'd like to file a JIRA issue, but want to find out if I am missing something here. Details of several queries are appended below. Tom Burton-West edismax query mm=2 query with hyphenated word [fire-fly] {!edismax mm=2}fire-fly {!edismax mm=2}fire-fly +D

Re: edismax parser ignores mm parameter when tokenizer splits tokens (hyphenated words, WDF splitting etc)

2012-07-02 Thread Tom Burton-West
Opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-3589, which also lists a couple other related mailing list posts. On Thu, Jun 28, 2012 at 12:18 PM, Tom Burton-West wrote: > Hello, > > My previous e-mail with a CJK example has received no replies. I > verifi

Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-21 Thread Tom Burton-West
users the choice of a list of the most relevant pages, or a list of the books containing the most relevant pages. We have approximately 3 billion pages. Does anyone have experience using field collapsing on this sort of scale? Tom Tom Burton-West Information Retrieval Programmer Digital Library

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
nt work, see my thread on Solr3.6 Field collapsing > > Thanks, > > Tirthankar > > > > -Original Message- > > From: Tom Burton-West > > Date: Tue, 21 Aug 2012 18:39:25 > > To: solr-user@lucene.apache.org > > Reply-To: "solr-user@lucene.apache.o

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Tirthankar, Can you give me a quick summary of what won't work and why? I couldn't figure it out from looking at your thread. You seem to have a different issue, but maybe I'm missing something here. Tom On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee < tchatter...@commvault.com> wr

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Hi Lance and Tirthankar, We are currently using Solr 3.6. I tried a search across our current 12 shards grouping by book id (record_no in our schema) and it seems to work fine (the query with the actual urls for the shards changed is appended below.) I then searched for the record_no of the seco

Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Hello, Usually in the example/solr directory in Solr distributions there is a populated conf directory. However, in the distribution I downloaded of Solr 4.0.0-BETA, there is no /conf directory. Has this been moved somewhere? Tom ls -l apache-solr-4.0.0-BETA/example/solr total 107 drwxr-sr-x 2 tburtonw

Re: Solr 4.0 Beta missing example/conf files?

2012-08-22 Thread Tom Burton-West
Thanks Markus! Should the README.txt file in solr/example be updated to reflect this? Is that something I need to enter a JIRA issue for? Tom On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma wrote: > Hi - The example has been moved to collection1/ > > > > -Original message- > > From:Tom B

Re: Scalability of Solr Result Grouping/Field Collapsing: Millions/Billions of documents?

2012-08-22 Thread Tom Burton-West
Thanks Tirthankar, So the issue is memory use for sorting. I'm not sure I understand how sorting of grouping fields is involved with the defaults and field collapsing, since the default sorts by relevance not grouping field. On the other hand I don't know much about how field collapsing is impl

Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
I removed the string "collection1" from my solr.xml file in solr home and modified my solr.xml file as follows: Then I restarted Solr. However, I keep getting messages about "Can't find resource 'solrconfig.xml' in classpath or '/l/solrs/dev/solrs/4.0/1/collection1/conf/'" And the log

Re: Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
I did not describe the problems correctly. I have 3 solr shards with solr homes .../solrs/4.0/1 .../solrs/4.0/2 and .../solrs/4.0/2solrs/3 For shard 1 I have a solr.xml file with the modifications described in the previous message. For that instance, it appears that the problem is that the sema

Re: Solr 4.0 beta : Is collection1 hard coded somewhere?

2012-08-23 Thread Tom Burton-West
rowse/SOLR-3753 On Thu, Aug 23, 2012 at 1:04 PM, Tom Burton-West wrote: > I did not describe the problems correctly. > > I have 3 solr shards with solr homes .../solrs/4.0/1 .../solrs/4.0/2 and > .../solrs/4.0/2solrs/3 > > For shard 1 I have a solr.xml file with the mo

Re: Solr 4.0 Beta missing example/conf files?

2012-08-23 Thread Tom Burton-West
release. > > Thanks, > Erik > > On Aug 22, 2012, at 16:32 , Tom Burton-West wrote: > > > Thanks Markus! > > > > Should the README.txt file in solr/example be updated to reflect this? > > Is that something I need to enter a JIRA issue for? > > &

Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Hello all, Due to multiple languages and dirty OCR, our indexes have over 2 billion unique terms ( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again). In Solr 3.6 and previous we needed to reduce the memory used for storing the in-memory representation of the tii file. We o

Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
to the codec/implementation. > > In Lucene 4.0 the terms index works completely differently: these > parameters don't make sense for it. > > On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West > wrote: > > Hello all, > > > > Due to multiple languages and dirty

Re: Solr 4.0 Beta, termIndexInterval vs termIndexDivisor vs termInfosIndexDivisor

2012-09-07 Thread Tom Burton-West
Fri, Sep 7, 2012 at 2:58 PM, Robert Muir wrote: > On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West > wrote: > > Thanks Robert, > > > > I'll have to spend some time understanding the default codec for Solr > 4.0. > > Did I miss something in the changes

Solr 4.0 Beta: Admin UI does not correctly implement dismax/edismax query

2012-09-13 Thread Tom Burton-West
that gets sent to Solr is not actually a dismax query. 3 ocr^200 true true 0.1 fire-fly xml fire-fly fire-fly text:fire text:fly If a correct dismax query was being sent to Solr the parsedquery would have something like the following: (+DisjunctionMaxQuery(((text:fire text:fly))) Tom Burton-West

Re: Solr 4.0 Beta: Admin UI does not correctly implement dismax/edismax query

2012-09-13 Thread Tom Burton-West
Type=dismax > > Erik > > On Sep 13, 2012, at 12:22 , Tom Burton-West wrote: > > > Just want to check I am not doing something obviously wrong before I > file a > > bug ticket. > > > > In Solr 4.0Beta, in the admin UI in the Query panel, there is a ch

Solr 4.0 error message: "Unsupported ContentType: Content-type:text/xml"

2012-11-02 Thread Tom Burton-West
Hello all, Trying to get Solr 4.0 up and running with a port of our production 3.6 schema and documents. We are getting the following error message in the logs: org.apache.solr.common.SolrException: Unsupported ContentType: Content-type:text/xml Not in: [application/xml, text/csv, text/json, a

Re: Solr 4.0 error message: "Unsupported ContentType: Content-type:text/xml"

2012-11-02 Thread Tom Burton-West
s if the literal text "Content-type:" is > included in your content type. How exactly are you setting/sending the > content type? > > -- Jack Krupansky > > -Original Message- From: Tom Burton-West > Sent: Friday, November 02, 2012 5:30 PM > To: solr-user@l
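
The diagnosis in the reply (the header value itself containing the literal text "Content-type:") can be illustrated with a small Python sketch. The update URL and the helper name are hypothetical, and no request is actually sent; the two Request objects are only constructed so their headers can be inspected.

```python
from urllib.request import Request


def looks_malformed(content_type):
    """True if the header value itself begins with the header name."""
    return content_type.strip().lower().startswith("content-type:")


# Hypothetical update URL; nothing is sent over the network.
url = "http://localhost:8983/solr/update"
body = b'<add><doc><field name="id">1</field></doc></add>'

good = Request(url, data=body, headers={"Content-Type": "text/xml"})
bad = Request(url, data=body, headers={"Content-Type": "Content-type:text/xml"})

# urllib stores header names capitalized as "Content-type".
for req in (good, bad):
    value = req.get_header("Content-type")
    print(repr(value), "malformed" if looks_malformed(value) else "ok")
```

The second request would present a content type of "Content-type:text/xml", which matches the value shown in the error message and is not in the list of supported types.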

Re: Skewed IDF in multi lingual index

2012-11-08 Thread Tom Burton-West
Hi Markus, No answers, but I am very interested in what you find out. We currently index all languages in one index, which presents different IDF issues, but are interested in exploring alternatives such as the one you describe. Tom Burton-West http://www.hathitrust.org/blogs/large-scale

URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Tom Burton-West
Hello, I would like to send a request to the FieldAnalysisRequestHandler. The javadoc lists the parameter names such as analysis.field, but sending those as URL parameters does not seem to work: mysolr.umich.edu/analysis/field?analysis.name=title&q=fire-fly leaving out the "analysis" doesn't w
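
One way to build such a request URL is sketched below in Python. The parameter names analysis.fieldname and analysis.fieldvalue are taken here to be the ones the handler reads; treat them as an assumption to verify against the javadoc for your Solr version. The http:// scheme and the helper name are also assumptions; the host and path come from the message above.

```python
from urllib.parse import urlencode


def field_analysis_url(base_url, field_name, text):
    """Build a field-analysis request URL.
    Parameter names are assumed, not confirmed by the thread."""
    params = {
        "analysis.fieldname": field_name,   # field whose analyzer should be used
        "analysis.fieldvalue": text,        # text to run through the analyzer
    }
    return f"{base_url}/analysis/field?{urlencode(params)}"


print(field_analysis_url("http://mysolr.umich.edu", "title", "fire-fly"))
```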

Re: URL parameters to use FieldAnalysisRequestHandler

2012-11-13 Thread Tom Burton-West
no more > "analysis.jsp" like before? > > So maybe try using something like burpsuite and just using the > analysis UI in your browser to see what requests its sending. > > On Tue, Nov 13, 2012 at 11:00 AM, Tom Burton-West > wrote: > > Hello, > > > >

Re: BM25 model for solr 4?

2012-11-15 Thread Tom Burton-West
Hello Floyd, There is a ton of research literature out there comparing BM25 to vector space. But you have to be careful interpreting it. BM25 originally beat the SMART vector space model in the early TRECs because it did better tf and length normalization. Pivoted Document Length normalizatio

Re: Limit of Index size per machine..

2009-08-06 Thread Tom Burton-West
Hello, I think you are confusing the size of the data you want to index with the size of the index. For our indexes (large full text documents) the Solr index is about 1/3 of the size of the documents being indexed. For 3 TB of data you might have an index of 1 TB or less. This depends on many

Re: Slow Phrase Queries

2009-10-20 Thread Tom Burton-West
You might try a couple tests in the Solr admin interface to make sure the query is being processed the same in both Solr and raw lucene. 1) use the analysis panel to determine if the Solr filter chain is doing something unexpected compared to your lucene filter chain 2) try running a debug query

Re: port of Nutch CommonGrams to Solr for help with slow phrase queries

2009-03-06 Thread Tom Burton-West
Hi Norberto, After working a bit on trying to port the Nutch CommonGrams code, I ran into lots of dependencies on Nutch and Hadoop. Would it be possible to get more information on how you use shingles (or code)? Are you creating shingles for all two word combinations or using a list of words? T

Re: Contributors - Solr in Action Case Studies

2010-01-20 Thread Tom Burton-West
://www.hathitrust.org/large_scale_search and our blog: http://www.hathitrust.org/blogs/large-scale-search (I'll be updating the blog with details of current hardware and performance tests in the next week or so) Tom Tom Burton-West Digital Li

Re: Thanks Robert!

2010-02-05 Thread Tom Burton-West
+1 And thanks to you both for all your work on CommonGrams! Tom Burton-West Jason Rutherglen-2 wrote: > > Robert, thanks for redoing all the Solr analyzers to the new API! It > helps to have many examples to work from, best practices so to speak. > > -- View this mes

Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West
Thanks Lance and Michael, We are running Solr 1.3.0.2009.09.03.11.14.39 (Complete version info from Solr admin panel appended below) I tried running CheckIndex (with the -ea: switch ) on one of the shards. CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger segment contai

Re: TermInfosReader.get ArrayIndexOutOfBoundsException

2010-02-09 Thread Tom Burton-West
Thanks Michael, I'm not sure I understand. CheckIndex reported a negative number: -16777214. But in any case we can certainly try running CheckIndex from a patched lucene. We could also run a patched lucene on our dev server. Tom Yes, the term count reported by CheckIndex is the total

Re: persistent cache

2010-02-12 Thread Tom Burton-West
e. A good overview of the issues is the paper by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of Caching on Search Engines ) Tom Burton-West Digital Library Production Service University of Michigan Library -- View this message in context: http://old.nabble.com/persis

Re: persistent cache

2010-02-15 Thread Tom Burton-West
Hi Tim, Due to our performance needs we optimize the index early in the morning and then run the cache-warming queries once we mount the optimized index on our servers. If you are indexing and serving using the same Solr instance, you shouldn't have to re-run the cache warming queries when you a

Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-18 Thread Tom Burton-West
Thanks Otis, I don't know enough about Hadoop to understand the advantage of using Hadoop in this use case. How would using Hadoop differ from distributing the indexing over 10 shards on 10 machines with Solr? Tom Otis Gospodnetic wrote: > > Hi Tom, > > 32MB is very low, 320MB is medium, a

Re: What is largest reasonable setting for ramBufferSizeMB?

2010-02-19 Thread Tom Burton-West
Hi Glen, I'd love to use LuSql, but our data is not in a db. It's 6-8TB of files containing OCR (one file per page for about 1.5 billion pages) gzipped on disk which are ungzipped, concatenated, and converted to Solr documents on-the-fly. We have multiple instances of our Solr document producer s

Re: Solr Performance Issues

2010-03-11 Thread Tom Burton-West
es are you will see serious contention for disk I/O. Of course if you don't see any waiting on I/O, then your bottleneck is probably somewhere else :) See http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1 for more background on our experience. Tom Burton-

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
Thanks Simon, We can probably implement your suggestion about runs of punctuation and unlikely mixes of alpha/numeric/punctuation. I'm also thinking about looking for unlikely mixes of unicode character blocks. For example some of the CJK material ends up with Cyrillic characters. (except we wo
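
The heuristics mentioned in this message (runs of punctuation, unlikely mixes of alpha/numeric/punctuation, and unlikely mixes of Unicode character blocks) could be sketched roughly as below. This is a crude illustration, not the approach actually used: the function names and thresholds are hypothetical, and the first word of each character's Unicode name stands in for its block.

```python
import re
import unicodedata

# Three or more consecutive characters that are neither word characters nor whitespace.
PUNCT_RUN = re.compile(r"[^\w\s]{3,}")

# Scripts that legitimately co-occur in Japanese or Korean text are treated as one group.
EAST_ASIAN = {"CJK", "HIRAGANA", "KATAKANA", "HANGUL"}


def script_of(ch: str) -> str:
    """Rough script label: first word of the Unicode character name."""
    try:
        label = unicodedata.name(ch).split()[0]
    except ValueError:  # character has no name
        return "UNKNOWN"
    return "EAST_ASIAN" if label in EAST_ASIAN else label


def looks_dirty(token: str, max_scripts: int = 1) -> bool:
    """Flag a token that matches any of the three heuristics."""
    if PUNCT_RUN.search(token):
        return True
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    has_punct = any(not c.isalnum() for c in token)
    if has_alpha and has_digit and has_punct:
        return True
    scripts = {script_of(c) for c in token if c.isalpha()}
    return len(scripts) > max_scripts
```

A token such as "fire-fly" passes, while a CJK token containing a stray Cyrillic letter is flagged. Legitimate tokens such as "3rd-party" are also flagged by the alpha/numeric/punctuation rule, so a real filter would need tuning against actual OCR output.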

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
Interesting. I wonder though if we have 4 million English documents and 250 in Urdu, if the Urdu words would score badly when compared to ngram statistics for the entire corpus. hossman wrote: > > > > Since you are dealing with multiple languages, and multiple variant usages > of language

Re: Cleaning up dirty OCR

2010-03-11 Thread Tom Burton-West
We've been thinking about running some kind of a classifier against each book to select books with a high percentage of dirty OCR for some kind of special processing. Haven't quite figured out a multilingual feature set yet other than the punctuation/alphanumeric and character block ideas mention

Re: Solr RAM Requirements

2010-03-17 Thread Tom Burton-West
files. You also might want to take a look at the free memory when you start up Solr and then watch as it fills up as you get more queries (or send cache-warming queries). Tom Burton-West http://www.hathitrust.org/blogs/large-scale-search KaktuChakarabati wrote: > > My question was m
