Hi David,
It may not matter for your use case but just in case you really are
interested in the "real BM25F" there is a difference between configuring K1
and B for different fields in Solr and a "real" BM25F implementation. This
has to do with Solr's model of fields being mini-documents (i.e. ea
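Roughly, per-field parameters in Solr 4.x look like the sketch below (from
memory, untested here; it needs the global SchemaSimilarityFactory so that
per-fieldType similarities are honored):

  <similarity class="solr.SchemaSimilarityFactory"/>

  <fieldType name="text_title" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
    </analyzer>
    <!-- per-field k1/b: each field is scored as its own mini-document,
         unlike real BM25F, which combines the field term frequencies
         into one document-level score -->
    <similarity class="solr.BM25SimilarityFactory">
      <float name="k1">1.2</float>
      <float name="b">0.75</float>
    </similarity>
  </fieldType>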
Hello all,
The last time I worked with changing Similarities was with Solr 4.1 and at
that time, it was possible to simply change the schema to specify the use
of a different Similarity without re-indexing. This allowed me to
experiment with several different ranking algorithms without having to
Hello,
We don't want to use locktype=native (we are using NFS) or locktype=simple
(we mount a read-only snapshot of the index on our search servers and with
locktype=simple, Solr refuses to start up because it sees the lock file.)
However, we don't quite understand the warnings about using lockty
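For reference, the setting we are asking about looks like this in
solrconfig.xml (Solr 4.x layout; a sketch of our reading of the docs):

  <indexConfig>
    <!-- "none" disables index locking entirely; only safe if nothing
         else can ever open an IndexWriter on the same directory -->
    <lockType>none</lockType>
  </indexConfig>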
Thanks Hoss,
Protection from misconfiguration and/or starting separate solr instances
pointing to the same index dir I can understand.
The current documentation on the wiki and in the ref guide (along with just
enough understanding of Solr/Lucene indexing to be dangerous) left me
wondering if ma
Hello,
We normally run an optimize with maxSegments="2" after our daily indexing.
This has worked without problem on Solr 3.6. We recently moved to Solr
4.10.2 and on several shards the optimize completed with no errors in the
logs, but left more than 2 segments.
We send this xml to Solr
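It was essentially the standard optimize message (reconstructed here from
memory):

  <!-- asks the merge policy to merge down to at most 2 segments -->
  <optimize maxSegments="2" waitSearcher="true"/>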
I've
Hi Rishi,
As others have indicated Multilingual search is very difficult to do well.
At HathiTrust we've been using the ICUTokenizer and ICUFilterFactory to
deal with having materials in 400 languages. We also added the
CJKBigramFilter to get better precision on CJK queries. We don't use stop
w
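In case it is useful, our analysis chain is roughly the following (a
simplified sketch from memory; the folding filter name is my recollection):

  <fieldType name="text_ocr" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- ICUTokenizer copes with the 400-language mix and emits CJK
           characters as single-character tokens -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <!-- Unicode normalization and case/diacritic folding -->
      <filter class="solr.ICUFoldingFilterFactory"/>
      <!-- joins adjacent CJK unigrams into overlapping bigrams for
           better precision on CJK queries -->
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldType>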
Hi Hoss,
I created a wrapper class, compiled a jar and included an
org.apache.lucene.codecs.Codec file in META-INF/services in the jar file
with an entry for the wrapper class: HTPostingsFormatWrapper. I created a
collection1/lib directory and put the jar there. (see below)
I'm getting the drea
index/DocValuesType.html
Is the comment in the example schema file completely wrong, or is there
some issue with using docValues with a multivalued StrField?
Tom Burton-West
https://www.hathitrust.org/blogs/large-scale-search
se case as Otis suggested. In our use
case sometimes this is appropriate, but we are investigating the
possibility of other methods of scoring the group based on a more flexible
function of the scores of the members (i.e. scoring a book based on a
function of the scores of its chapters).
Tom Burton-West
http://www
Hi Shawn,
I'm not sure I understand the problem and why you need to solve it at the
ICUTokenizer level rather than the CJKBigramFilter
Can you perhaps give a few examples of the problem?
Have you looked at the flags for the CJKBigramFilter?
You can tell it to make bigrams of different Japanese ch
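The flags I have in mind are the per-script attributes (values here are
the defaults as I remember them):

  <filter class="solr.CJKBigramFilterFactory"
          han="true" hiragana="true" katakana="true" hangul="true"
          outputUnigrams="false"/>

Setting, say, hiragana="false" passes hiragana through as unigrams while
still bigramming Han.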
Hi Shawn,
I may still be missing your point. Below is an example where the
ICUTokenizer splits
Now, I'm beginning to wonder if I really understand what those flags on the
CJKBigramFilter do.
The ICUTokenizer spits out unigrams and the CJKBigramFilter will put them
back together into bigrams.
I t
Hi Shawn,
>>For an input of 田中角栄 the bigram filter works like you described, and what
>>I would expect. If I add a space at the point where the ICU tokenizer
>>would have split them anyway, the bigram filter output is very different.
If I'm understanding what you are reporting, I suspect this is b
Hi Markus and Wunder,
I'm missing the original context, but I don't think BM25 will solve this
particular problem.
The k1 parameter sets how quickly the contribution of tf to the score falls
off with increasing tf. It would be helpful for making sure really long
documents don't get too high a
Thanks Markus,
I was thinking about normalization and was absolutely wrong about setting
k1 to zero. I should have taken a look at the algorithm and walked
through setting k1=0. (This is easier to do looking at the formula in
Wikipedia http://en.wikipedia.org/wiki/Okapi_BM25 than walking through
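For the record, the per-term score (in the Wikipedia notation) is

  \mathrm{score}(q_i,D) = \mathrm{IDF}(q_i)\cdot
    \frac{tf\,(k_1+1)}{tf + k_1\bigl(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\bigr)}

With k1 = 0 the fraction is identically 1 for any tf > 0, so tf and length
normalization drop out entirely and each matching term contributes only
its IDF.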
Hi Ken,
Given the comments which seemed to describe using NRT for the opposite of
our use case, I just set our Solr 4 to use the solr.MMapDirectoryFactory.
Didn't bother to test whether NRT would be better for our use case, mostly
because it didn't sound like there was an advantage and I've bee
version of something like that for the INEX book
track. I'll see if I can find the code and if it is in any shape to share.
Tom
Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-sc
Hello,
I'm running the example setup for Solr 4.6.1.
In the ../example/solr/ directory, I set up a second core. I wanted to
send updates to that core.
I looked at .../exampledocs/post.sh and expected to see the URL as: URL=
http://localhost:8983/solr/collection1/update
However it does not
Thanks Hoss,
>>hardcoded default of "collection1" is still used for
backcompat when there is no "defaultCoreName" configured by the user.
Aha, it's hardcoded if there is nothing set in a config. No wonder I
couldn't find it by grepping around the config files.
I'm still trying to sort out the o
SA, 677-686.
DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622
Code:
http://users.dsic.upv.es/~pgupta/mixed-script-ir.html
Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org
Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search
Hi Ilia,
I see that Trey answered your question about how you might stack
language specific filters in one field. If I remember correctly, his
approach assumes you have identified the language of the query. That
is not the same as detecting the script of the query and is much
harder.
Trying to
Hello,
I think the documentation and example files for Solr 4.x need to be
updated. If someone will let me know I'll be happy to fix the example
and perhaps someone with edit rights could fix the reference guide.
Due to dirty OCR and over 400 languages we have over 2 billion unique
terms in our
The Solr wiki says "A repeated question is "how can I have the
original term contribute
more to the score than the stemmed version"? In Solr 4.3, the
KeywordRepeatFilterFactory has been added to assist this
functionality. "
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Stemming
(
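The chain the wiki is describing is roughly (a sketch; any stemmer works
in the middle slot):

  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.PorterStemFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

KeywordRepeat emits each token twice, once keyword-marked so the stemmer
skips it; RemoveDuplicates then drops the second copy wherever stemming
changed nothing.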
Hello,
queryResultWindowSize sets the number of documents to cache for each
query in the queryResultCache. So if you normally output 10 results per
page, and users don't go beyond page 3 of results, you could set
queryResultWindowSize to 30 and the second and third page requests will
read f
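In solrconfig.xml that is just (if I have the element name right):

  <!-- cache 30 documents (3 pages of 10) per query result set -->
  <queryResultWindowSize>30</queryResultWindowSize>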
Thanks Hoss,
Just opened SOLR-6560 and attached a patch which removes the offending
section from the example solrconfig.xml file.
We suspect that with the much more efficient block and FST based Solr 4
default postings format that the need to mess with the parameters in order
to reduce memory u
Hello all,
In the example schema.xml for Solr 4.10.2 this comment is listed under the
"PERFORMANCE NOTE"
"For maximum indexing performance, use the ConcurrentUpdateSolrServer
java client."
Is there some documentation somewhere that explains why this will maximize
indexing performance?
In par
Thanks Eric,
That is helpful. We already have a process that works similarly. Each
thread/process that sends a document to Solr waits until it gets a response
in order to make sure that the document was indexed successfully (we log
errors and retry docs that don't get indexed successfully), howe
Thanks everybody for the information.
Shawn, thanks for bringing up the issues around making sure each document
is indexed ok. With our current architecture, that is important for us.
Yonik's clarification about streaming really helped me to understand one of
the main advantages of CUSS:
>>When
Hello all,
Our indexes have around 3 billion unique terms, so for Solr 3, we set
TermIndexInterval to about 8 times the default. The net effect of this is
to reduce the size of the in-memory index to about 1/8th of its original
size. (For background see
http://www.hathitrust.org/blogs/large-scale-search/too-many
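For reference, the Solr 3.x setting was 8 times the default of 128 (as I
recall, under indexDefaults in solrconfig.xml):

  <indexDefaults>
    <termIndexInterval>1024</termIndexInterval>
  </indexDefaults>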
Thanks Michael and Hoss,
Assuming I've written the subclass of the postings format, I need to tell
Solr to use it.
Do I just do something like:
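(something like this, with the attribute and factory names as my best
guess from the javadocs:)

  <!-- in solrconfig.xml, so per-fieldType codec settings are honored -->
  <codecFactory class="solr.SchemaCodecFactory"/>

  <!-- in schema.xml -->
  <fieldType name="text_ocr" class="solr.TextField"
             postingsFormat="HTPostingsFormatWrapper"/>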
Is there a way to set this for all fieldtypes or would that require writing
a custom CodecFactory?
Tom
On Mon, Jan 12, 2015 at 4:46 PM, Chris Hoste
Thanks Hoss,
This is starting to sound pretty complicated. Are you saying this is not
doable with Solr 4.10?
>>...or at least: that's how it *should* work :)
That makes me a bit nervous about trying this on my own.
Should I open a JIRA issue or am I probably the only person with a use case
for repla
Hello,
I'm running Solr 4.10.2 out of the box with the Solr example.
i.e.
ant example
cd solr/example
java -jar start.jar
At start-up the example gives this message in the log (in /example/log):
WARN - 2015-01-16 12:31:40.895; org.apache.solr.core.RequestHandlers;
Multiple requestHandler regist
Hello all,
The default directory implementation in Solr 4 is the NRTCachingDirectory
(in the example solrconfig.xml file , see below).
The Javadoc for NRTCachingDirectory (
http://lucene.apache.org/core/4_3_1/core/org/apache/lucene/store/NRTCachingDirectory.html?is-external=true)
says:
"This cl
Hello,
We are seeing the message "too many merges...stalling" in our indexwriter
log. Is this something to be concerned about? Does it mean we need to
tune something in our indexing configuration?
Tom
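If the answer is tuning, the knob we would look at is the merge
scheduler's backlog limit (a sketch, names from the javadocs; the stall
happens when pending merges exceed maxMergeCount):

  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <int name="maxMergeCount">6</int>
    <int name="maxThreadCount">3</int>
  </mergeScheduler>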
nvestigate.
Tom
On Thu, Jul 11, 2013 at 5:29 PM, Shawn Heisey wrote:
> On 7/11/2013 1:47 PM, Tom Burton-West wrote:
>
>> We are seeing the message "too many merges...stalling" in our indexwriter
>> log. Is this something to be concerned about? Does it mean we nee
Hello,
I am running solr 4.2.1 on 3 shards and have about 365 million documents in
the index total.
I sent a query asking for 1 million rows at a time, but I keep getting an
error claiming that there is an invalid version or data not in javabin
format (see below)
If I lower the number of rows re
above 1,000.
>
> So, if rows=10 works for you, consider yourself lucky!
>
> That said, there is sometimes talk of supporting streaming, which
> presumably would allow access to all results, but chunked/paged in some way.
>
> -- Jack Krupansky
>
> -Original Messag
Thanks Shawn,
I was confused by the error message: "Invalid version (expected 2, but 60)
or the data in not in 'javabin' format"
Your explanation makes sense. I didn't think about what the shards have to
send back to the head shard.
Now that I look in my logs, I can see the posts that the shard
e.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent=on&start=3300&q=*:*&rows=100}
hits=119220943 status=0 QTime=49792
Jul 25, 2013 6:39:43 PM org.apache.solr.core.SolrCore execute
INFO: [core] webapp=/dev-1 path=/select
params={fl=vol_id&indent
If I am using solr.SchemaSimilarityFactory to allow different similarities
for different fields, do I set discountOverlaps="true" on the factory or
per field?
What is the syntax? The below does not seem to work
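(What I tried was along these lines — reconstructing it here, with the
element placement being my guess:)

  <similarity class="solr.SchemaSimilarityFactory">
    <bool name="discountOverlaps">true</bool>
  </similarity>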
Tom
Thanks Markus,
I set it, but it seems to make no difference in the score or statistics
listed in the debugQuery or in the ranking. I'm using a field with
CommonGrams and a huge list of common words, so there should be a huge
difference in the document length with and without discountOverlaps.
I should have said that I have set it both to "true" and to "false" and
restarted Solr each time and the rankings and info in the debug query
showed no change.
Does this have to be set at index time?
Tom
>
Hello all,
According to the README.txt in solr-4.4.0/solr/example/solr/collection1,
all we have to do is create a collection1/lib directory and put whatever
jars we want in there.
".. /lib.
If it exists, Solr will load any Jars
found in this directory and use them to resolve any "pl
, Shawn Heisey wrote:
> On 8/27/2013 4:29 PM, Tom Burton-West wrote:
>
>> According to the README.txt in solr-4.4.0/solr/example/solr/**
>> collection1,
>> all we have to do is create a collection1/lib directory and put whatever
>> jars we want in there.
>>
>
My point in the previous e-mail was that following the instructions in the
documentation does not seem to work.
The workaround I found was to simply change the name of the collection1/lib
directory to collection1/foobar and then include it in solrconfig.xml.
This works, but does not
chema.xml. Any other optional configuration files would also
be kept here.
data/
This directory is the default location where Solr will keep your
...
lib/
On Wed, Aug 28, 2013 at 12:11 PM, Shawn Heisey wrote:
> On 8/28/2013 9:34 AM, Tom Burton-West wrote:
>
>> I thi
Hi David and Jan,
I wrote the blog post, and David, you are right, the problem we had was
with phrase queries because our positions lists are so huge. Boolean
queries don't need to read the positions lists. I think you need to
determine whether you are CPU bound or I/O bound. It is possible
Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms
( http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
).
In Solr 3.6 and previous we needed to reduce the memory used for storing
the in-memory representation of the tii file. We originall
Hello,
As I understand it, MoreLikeThis only requires term frequencies, not
positions or offsets. So in order to save disk space I would like to store
termvectors, but without positions and offsets. Is there documentation
somewhere that
1) would confirm that MoreLikeThis only needs term frequenc
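i.e., something like this in schema.xml (attribute names from memory):

  <field name="ocr" type="text_ocr" indexed="true" stored="true"
         termVectors="true" termPositions="false" termOffsets="false"/>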
Hello,
I have Solr 4 configured with several fields using different similarity
classes according to:
http://wiki.apache.org/solr/SchemaXml#Similarity
However, I get this error message:
" FieldType 'DFR' is configured with a similarity, but the global
similarity does not support it: class
org.apac
Thanks Markus!
Adding the global similarity element fixed the problem.
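That is, at the top level of schema.xml (if I recall the element
correctly):

  <similarity class="solr.SchemaSimilarityFactory"/>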
>>Keep in mind that coord and queryNorm (=1.0f) are not implemented now, so
you will get different scores for TF-IDF!
Can you explain more about this, or is it documented somewhere?
Do I need to read the source for solr.SchemaSimilarityFactory?
Is there
Hello,
Don't know if the Solr admin panel is lying, or if this is a weird bug.
The string: "1986年" gets analyzed by the ICUTokenizer with "1986" being
identified as type:NUM and script:Han. Then the CJKBigram filter
identifies "1986" as type:Num and script:Han and "年" as type:Single and
script:
. i.e. ABC =>
searched as AB BC only AB gets highlighted even if the matching string is
ABC. (Where ABC are chinese characters such as 大亚湾 => searched as 大亚 亚湾,
but only 大亚 is highlighted rather than 大亚湾)
Is there some highlighting parameter that might fix this?
Tom Burton-West
Hello,
I'm trying to understand some Solr relevance issues using debugQuery=on,
but I don't see the coord factor listed anywhere in the explain output.
My understanding is that the coord factor is not included in either the
querynorm or the fieldnorm.
What am I missing?
Tom
Hello all,
I have a one term query: "ocr:aardvark" When I look at the explain
output, for some matches the queryNorm and fieldWeight are shown and for
some matches only the "weight" is shown with no query norm. (See below)
What explains the difference? Shouldn't the queryNorm be applied to e
Thanks Hoss,
Yes it is a distributed query.
Tom
On Fri, Jan 25, 2013 at 2:32 PM, Chris Hostetter
wrote:
>
> : I have a one term query: "ocr:aardvark" When I look at the explain
> : output, for some matches the queryNorm and fieldWeight are shown and for
> : some matches only the "weight" is
, New York, NY, USA, 75-82.
DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
log.
Tom Burton-West
兵にな^1000 OR hanUnigrams:兵にな
兵にな^1000 OR hanUnigrams:兵にな
((+ocr:兵に +ocr:にな)^1000.0) hanUnigrams:兵
((+ocr:兵に +ocr:にな)^1000.0)
hanUnigrams:兵
0.15685473 = (MATCH) sum of:
0.15684697 = (MATCH) sum of:
0.0067602023 = (MATCH) weight(ocr:兵に in 213594
ed below the explain scoring for a couple of documents with tf
50 and 67.
0.6798219
DF9199B7049F8DFE-220
DF9199B7049F8DFE
The Aeroplane
0.6798219 = (MATCH) fieldWeight(ocr:the in 16624), product of:
1.0 = tf(termFreq(ocr:the)=1)
1.087715 = idf(docFreq=16219, maxDocs=1
oblem also occurs with non-CJK queries for example [two-thirds]
turns into a Boolean OR query for ( [two] OR [thirds] ).
Is there some way to tell the edismax query parser to stick with mm=100%?
Appended below is the debugQuery output for these two queries and an
excerpt from our schema.xml.
To
something, or is this a bug?
I'd like to file a JIRA issue, but want to find out if I am missing
something here.
Details of several queries are appended below.
Tom Burton-West
edismax query mm=2 query with hyphenated word [fire-fly]
{!edismax mm=2}fire-fly
{!edismax mm=2}fire-fly
+D
Opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-3589, which
also lists a couple other related mailing list posts.
On Thu, Jun 28, 2012 at 12:18 PM, Tom Burton-West wrote:
> Hello,
>
> My previous e-mail with a CJK example has received no replies. I
> verifi
users the choice of
a list of the most relevant pages, or a list of the books containing the
most relevant pages. We have approximately 3 billion pages. Does anyone
have experience using field collapsing on this sort of scale?
Tom
Tom Burton-West
Information Retrieval Programmer
Digital Library
nt work, see my thread on Solr3.6 Field collapsing
> > Thanks,
> > Tirthankar
> >
> > -Original Message-
> > From: Tom Burton-West
> > Date: Tue, 21 Aug 2012 18:39:25
> > To: solr-user@lucene.apache.org
> > Reply-To: "solr-user@lucene.apache.o
Hi Tirthankar,
Can you give me a quick summary of what won't work and why?
I couldn't figure it out from looking at your thread. You seem to have a
different issue, but maybe I'm missing something here.
Tom
On Tue, Aug 21, 2012 at 7:10 PM, Tirthankar Chatterjee <
tchatter...@commvault.com> wr
Hi Lance and Tirthankar,
We are currently using Solr 3.6. I tried a search across our current 12
shards grouping by book id (record_no in our schema) and it seems to work
fine (the query with the actual urls for the shards changed is appended
below.)
I then searched for the record_no of the seco
Hello,
Usually in the example/solr file in Solr distributions there is a populated
conf file. However in the distribution I downloaded of solr 4.0.0-BETA,
there is no /conf directory. Has this been moved somewhere?
Tom
ls -l apache-solr-4.0.0-BETA/example/solr
total 107
drwxr-sr-x 2 tburtonw
Thanks Markus!
Should the README.txt file in solr/example be updated to reflect this?
Is that something I need to enter a JIRA issue for?
Tom
On Wed, Aug 22, 2012 at 3:12 PM, Markus Jelsma
wrote:
> Hi - The example has been moved to collection1/
>
>
>
> -Original message-
> > From:Tom B
Thanks Tirthankar,
So the issue is memory use for sorting. I'm not sure I understand how
sorting of grouping fields is involved with the defaults and field
collapsing, since the default sorts by relevance not grouping field. On
the other hand I don't know much about how field collapsing is impl
I removed the string "collection1" from my solr.xml file in solr home and
modified my solr.xml file as follows:
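(Roughly like this — reconstructed from memory in the pre-4.4 solr.xml
format, and the core name here is just a placeholder:)

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="core1" instanceDir="." />
    </cores>
  </solr>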
Then I restarted Solr.
However, I keep getting messages about
"Can't find resource 'solrconfig.xml' in classpath or
'/l/solrs/dev/solrs/4.0/1/collection1/conf/'"
And the log
I did not describe the problems correctly.
I have 3 solr shards with solr homes .../solrs/4.0/1, .../solrs/4.0/2 and
.../solrs/4.0/3
For shard 1 I have a solr.xml file with the modifications described in the
previous message. For that instance, it appears that the problem is that
the sema
rowse/SOLR-3753
On Thu, Aug 23, 2012 at 1:04 PM, Tom Burton-West wrote:
> I did not describe the problems correctly.
>
> I have 3 solr shards with solr homes .../solrs/4.0/1 .../solrs/4.0/2 and
> .../solrs/4.0/3
>
> For shard 1 I have a solr.xml file with the mo
release.
>
> Thanks,
> Erik
>
> On Aug 22, 2012, at 16:32 , Tom Burton-West wrote:
>
> > Thanks Markus!
> >
> > Should the README.txt file in solr/example be updated to reflect this?
> > Is that something I need to enter a JIRA issue for?
> >
&
Hello all,
Due to multiple languages and dirty OCR, our indexes have over 2 billion
unique terms (
http://www.hathitrust.org/blogs/large-scale-search/too-many-words-again).
In Solr 3.6 and previous we needed to reduce the memory used for storing
the in-memory representation of the tii file. We o
to the codec/implementation.
>
> In Lucene 4.0 the terms index works completely differently: these
> parameters don't make sense for it.
>
> On Fri, Sep 7, 2012 at 12:43 PM, Tom Burton-West
> wrote:
> > Hello all,
> >
> > Due to multiple languages and dirty
Fri, Sep 7, 2012 at 2:58 PM, Robert Muir wrote:
> On Fri, Sep 7, 2012 at 2:19 PM, Tom Burton-West
> wrote:
> > Thanks Robert,
> >
> > I'll have to spend some time understanding the default codec for Solr
> 4.0.
> > Did I miss something in the changes
that gets sent to Solr is not actually a
dismax query.
3
ocr^200
true
true
0.1
fire-fly
xml
fire-fly
fire-fly
text:fire text:fly
If a correct dismax query was being sent to Solr the parsedquery would have
something like the following:
(+DisjunctionMaxQuery(((text:fire text:fly)))
Tom Burton-West
Type=dismax
>
> Erik
>
> On Sep 13, 2012, at 12:22 , Tom Burton-West wrote:
>
> > Just want to check I am not doing something obviously wrong before I
> file a
> > bug ticket.
> >
> > In Solr 4.0Beta, in the admin UI in the Query panel, there is a ch
Hello all,
Trying to get Solr 4.0 up and running with a port of our production 3.6
schema and documents.
We are getting the following error message in the logs:
org.apache.solr.common.SolrException: Unsupported ContentType:
Content-type:text/xml Not in: [application/xml, text/csv, text/json, a
s if the literal text "Content-type:" is
> included in your content type. How exactly are you setting/sending the
> content type?
>
> -- Jack Krupansky
>
> -Original Message- From: Tom Burton-West
> Sent: Friday, November 02, 2012 5:30 PM
> To: solr-user@l
Hi Markus,
No answers, but I am very interested in what you find out. We currently
index all languages in one index, which presents different IDF issues, but
are interested in exploring alternatives such as the one you describe.
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale
Hello,
I would like to send a request to the FieldAnalysisRequestHandler. The
javadoc lists the parameter names such as analysis.field, but sending those
as URL parameters does not seem to work:
mysolr.umich.edu/analysis/field?analysis.name=title&q=fire-fly
leaving out the "analysis" doesn't w
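If I am reading the javadoc right, the handler wants analysis.fieldname
and analysis.fieldvalue rather than q, so the request would presumably be
something like:

  mysolr.umich.edu/analysis/field?analysis.fieldname=title&analysis.fieldvalue=fire-fly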
no more
> "analysis.jsp" like before?
>
> So maybe try using something like burpsuite and just using the
> analysis UI in your browser to see what requests its sending.
>
> On Tue, Nov 13, 2012 at 11:00 AM, Tom Burton-West
> wrote:
> > Hello,
> >
> >
Hello Floyd,
There is a ton of research literature out there comparing BM25 to vector
space. But you have to be careful interpreting it.
BM25 originally beat the SMART vector space model in the early TRECs
because it did better tf and length normalization. Pivoted Document
Length normalizatio
Hello,
I think you are confusing the size of the data you want to index with the
size of the index. For our indexes (large full text documents) the Solr
index is about 1/3 of the size of the documents being indexed. For 3 TB of
data you might have an index of 1 TB or less. This depends on many
You might try a couple tests in the Solr admin interface to make sure the
query is being processed the same in both Solr and raw lucene.
1) use the analysis panel to determine if the Solr filter chain is doing
something unexpected compared to your lucene filter chain
2) try running a debug query
Hi Norberto,
After working a bit on trying to port the Nutch CommonGrams code, I ran into
lots of dependencies on Nutch and Hadoop. Would it be possible to get more
information on how you use shingles (or code)? Are you creating shingles for
all two-word combinations or using a list of words?
T
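For comparison, the CommonGrams filter (as it eventually landed in Solr)
is configured like this sketch; unlike ShingleFilter, it forms a two-word
gram only where one member is in the word list:

  <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt"
          ignoreCase="true"/>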
://www.hathitrust.org/large_scale_search and our blog:
http://www.hathitrust.org/blogs/large-scale-search (I'll be updating the
blog with details of current hardware and performance tests in the next week
or so)
Tom
Tom Burton-West
Digital Li
+1
And thanks to you both for all your work on CommonGrams!
Tom Burton-West
Jason Rutherglen-2 wrote:
>
> Robert, thanks for redoing all the Solr analyzers to the new API! It
> helps to have many examples to work from, best practices so to speak.
>
>
--
View this mes
Thanks Lance and Michael,
We are running Solr 1.3.0.2009.09.03.11.14.39 (Complete version info from
Solr admin panel appended below)
I tried running CheckIndex (with the -ea: switch ) on one of the shards.
CheckIndex also produced an ArrayIndexOutOfBoundsException on the larger
segment contai
Thanks Michael,
I'm not sure I understand. CheckIndex reported a negative number:
-16777214.
But in any case we can certainly try running CheckIndex from a patched
lucene. We could also run a patched lucene on our dev server.
Tom
Yes, the term count reported by CheckIndex is the total
e. A good overview of the issues is the paper
by Baeza-Yates ( http://doi.acm.org/10.1145/1277741.125 The Impact of
Caching on Search Engines )
Tom Burton-West
Digital Library Production Service
University of Michigan Library
--
View this message in context:
http://old.nabble.com/persis
Hi Tim,
Due to our performance needs we optimize the index early in the morning and
then run the cache-warming queries once we mount the optimized index on our
servers. If you are indexing and serving using the same Solr instance, you
shouldn't have to re-run the cache warming queries when you a
Thanks Otis,
I don't know enough about Hadoop to understand the advantage of using Hadoop
in this use case. How would using Hadoop differ from distributing the
indexing over 10 shards on 10 machines with Solr?
Tom
Otis Gospodnetic wrote:
>
> Hi Tom,
>
> 32MB is very low, 320MB is medium, a
Hi Glen,
I'd love to use LuSql, but our data is not in a db. Its 6-8TB of files
containing OCR (one file per page for about 1.5 billion pages) gzipped on
disk which are ungzipped, concatenated, and converted to Solr documents
on-the-fly. We have multiple instances of our Solr document producer s
es are you will see serious contention
for disk I/O.
Of course if you don't see any waiting on i/o, then your bottleneck is
probably somewhere else:)
See
http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-1
for more background on our experience.
Tom Burton-
Thanks Simon,
We can probably implement your suggestion about runs of punctuation and
unlikely mixes of alpha/numeric/punctuation. I'm also thinking about
looking for unlikely mixes of unicode character blocks. For example some of
the CJK material ends up with Cyrillic characters. (except we wo
Interesting. I wonder though if we have 4 million English documents and 250
in Urdu, if the Urdu words would score badly when compared to ngram
statistics for the entire corpus.
hossman wrote:
>
>
>
> Since you are dealing with multiple languages, and multiple variant usages
> of language
We've been thinking about running some kind of a classifier against each book
to select books with a high percentage of dirty OCR for some kind of special
processing. Haven't quite figured out a multilingual feature set yet other
than the punctuation/alphanumeric and character block ideas mention
files. You
also might want to take a look at the free memory when you start up Solr and
then watch as it fills up as you get more queries (or send cache-warming
queries).
Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search
KaktuChakarabati wrote:
>
> My question was m