What is the field you are trying to sort
by, and what kinds of values are indexed therein?
-Mike
Any idea?
thanks
Java heap space
java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.SegmentTermEnum.termInfo(SegmentTermEnum.java:170)
at
Have you tried it?
Otherwise, I would be tempted to copy the (java) code from
analysis.jsp and use it directly.
-Mike
r caches in most DBs.
Another reason why people use stored procs is to prevent multiple
round-trips in a multi-stage query operation. This is exactly what
complex RequestHandlers do (and the equivalent to a custom stored
proc would be writing your own handler).
-Mike
queries is to "skip" past a document that
doesn't meet the criteria you want and not return any score for it
at all.
Good to know. Thanks Hoss.
-Mike
tially give you what you want.
Note too that by default solr only indexes the first 10k tokens, so
this should work for all documents in the index.
-Mike
you launch tomcat.
You can also reduce indexing memory usage by reducing maxBufferedDocs
in solrconfig.xml (say, from 1000 to 100), and by committing once in
a while (eg. autoCommit/maxDocs=50)
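Roughly, the relevant solrconfig.xml bits would look like this (values are only illustrative):

<mainIndex>
  <maxBufferedDocs>100</maxBufferedDocs>
  <!-- other index settings unchanged -->
</mainIndex>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>50</maxDocs>
  </autoCommit>
</updateHandler>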
-Mike
't seem to have
any effect.
Does the field contain a match against one of the terms you are
querying for?
-Mike
unicode character directly:
>>> u'\u00e9'
u'\xe9'
>>> print u'\u00e9'
é
This is less complicated in the usual case of reading data from a
file, because the encoding should be known (terminal encoding issues
are much trickier). Use codecs.open() to get a unicode-output text
stream.
-Mike
your jvm while indexing such hugeness. (Note
that other input methods, like csv, might behave better, but I
haven't examined them to verify.)
-Mike
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCap
ed to only have one value per doc, this could greatly
accelerate your faceting. There are probably fewer unique subjects,
so strategy 1 is likely fine.
To use strategy 2, just make sure that multivalued="false" is set for
those fields in schema.xml
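For instance (the field name and type here are just placeholders; note the camelCase attribute):

<field name="category" type="string" indexed="true" stored="true" multiValued="false"/>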
-Mike
On 6-Sep-07, at 3:25 PM, Mike Klaas wrote:
There are essentially two facet computation strategies:
1. cached bitsets: a bitset for each term is generated and
intersected with the query result bitset. This is more general and
performs well up to a few thousand terms.
2. field
ven't heard any suggestions as to how to do this with a
stock Solr install, other than increase vm memory, I'll assume it
will have to be done
with a custom solution.
Well, have you tried the CSV importer?
-Mike
overwrites that
switching to
DUH is probably a win.)
DUH also does not implement many newer update features, like autoCommit.
-Mike
che/solr/update'
)
Patches should be generally applied from the top-level solr directory
with 'patch -p0'
-Mike
button. It is also important to verify that you have the legal
right to grant the code to ASF (since it is probably your employer's
intellectual property).
Legal issues are a hassle, but are unavoidable, I'm afraid.
Thanks again,
-Mike
On 10-Sep-07, at 10:22 AM, Wagner,Har
wn framework that is generating
multiple entries for this input case?
glad to hear you figured it out,
-Mike
ing).
In the future, segment merging will occur in a separate thread,
further improving concurrency.
-Mike
method is filled with clauses like:
} else if ("whatever".equals(fieldName)) {
    return super.lengthNorm(fieldName,
                            Math.max(numTokens, MIN_LENGTH));
where MIN_LENGTH can be quite large for some fields.
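For context, a self-contained sketch of that kind of Similarity (the class name, field name, and MIN_LENGTH value are invented for illustration):

import org.apache.lucene.search.DefaultSimilarity;

public class MinLengthSimilarity extends DefaultSimilarity {
    // hypothetical floor: short fields are normed as if they had this many tokens
    private static final int MIN_LENGTH = 500;

    public float lengthNorm(String fieldName, int numTokens) {
        if ("body".equals(fieldName)) {
            return super.lengthNorm(fieldName, Math.max(numTokens, MIN_LENGTH));
        }
        return super.lengthNorm(fieldName, numTokens);
    }
}

It would then be hooked up via <similarity class="com.example.MinLengthSimilarity"/> in schema.xml, if I remember the config correctly.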
-Mike
ery parser behaviour is mostly designed to make
sense for parsing _user_-entered queries.
You can achieve AND behaviour for filter queries by specifying
multiple fq parameters, or by prepending each in a series of clauses
by +.
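For example (field names invented), both of these restrict results to documents matching both clauses; the first form also caches each filter separately:

fq=type:pdf&fq=inStock:true
fq=+type:pdf +inStock:true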
-Mike
work committed in the last week in trunk, so you may
want to use a snapshot from two weeks ago if you want oozing- rather
than bleeding-edge.
-Mike
akes a few milliseconds, but the commit takes about 1
minute.
Could you please recommend what we should check for? Or perhaps
some tuning parameters?
Could be the cache auto-warming. Try reducing this to zero.
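e.g., in solrconfig.xml, set autowarmCount to 0 on the caches (sizes here are just examples):

<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>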
-Mike
names...
There just might be something like that in 1.3...
-Mike
solr webapps within a single container/
process/jvm
In the future, (1.3 or farther down the line), another option will be:
3. multiple indices within a single solr webapp, added/removed on the
fly.
-Mike
"Repositori".
You are faceting on a field that is analyzed with a stemmer
(PorterFilterStemmer). If you do not want that behaviour (but want
it for searching), use copyField to index in another field that does
not have unexpected analysis (preferably, none).
-Mike
hours over http. Just batch
a few (10) docs per http POST, and use around N+1 threads (N=#
processors).
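That is, each POST to /solr/update would carry a small batch along these lines (field names are made up):

<add>
  <doc>
    <field name="id">1</field>
    <field name="text">first document</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="text">second document</field>
  </doc>
  <!-- up to roughly 10 docs per request -->
</add>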
-Mike
On 14-Sep-07, at 3:38 PM, Tom Hill wrote:
Hi Mike,
Thanks for clarifying what has been a bit of a black box to me.
A couple of questions, to increase my understanding, if you don't
mind.
If I am only using fields with multiValued="false", with a type of
"
ed it does.
see
http://lucene.apache.org/java/docs/queryparsersyntax.html
http://wiki.apache.org/solr/SolrQuerySyntax
-Mike
see above).
2) If docs are sent asynchronously, how well can Solr index?
As long as you don't send 1.7million docs at once, you should see a
performance improvement.
-Mike
On 18-Sep-07, at 5:39 PM, Lance Norskog wrote:
Hi-
In early June Mike Klaas posted a formula for the number of file
descriptors
needed by Solr:
For each segment, 7 + num indexed fields per segment.
There should be log_{base mergefactor}(numDocs) * mergeFactor
segments
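To make the quoted formula concrete, a worked example with made-up numbers (mergeFactor=10, 1,000,000 docs, 10 indexed fields):

segments ≈ log_10(1,000,000) * 10 = 6 * 10 = 60
descriptors ≈ 60 * (7 + 10) = 1020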
CENE-794?
page=com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanel#action_12526803), but it is currently not integrated.
It would make a great project to get one's hands dirty contributing,
though :)
-Mike
On 19-Sep-07, at 2:39 PM, Marc Bechler wrote:
Hi Mike,
thanks for the quick response.
> It would make a great project to get one's hands dirty
contributing, though :)
... sounds like giving a broad hint ;-) Sounds challenging...
I'm not sure about that--it is supposed to
want to know whether there is an existing component that can do
distributed search based on Solr.
https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
regards,
-Mike
syntax (+ == clause is required):
articol_tag:pilonul ii AND articol_tag:facultative
==
+:ii +articol_tag:facultative articol_tag:pilonul
articol_tag:facultative AND articol_tag:pilonul ii
==
+articol_tag:facultative +articol_tag:pilonul :ii
try:
articol_tag:facultative AND articol_tag:"pilonul ii"
-Mike
link, but trying to cover
too many unix basics will clutter up the documentation.
-Mike
department1:
100 (the sum of each value). Is it clear?
Currently this is not possible out of the box with Solr.
-Mike
On 21-Sep-07, at 2:42 PM, Rafael Rossini wrote:
Thanks for the reply Mike. Are there any plans on doing something like
this? Or
some direction anyone could give?
Probably the easiest thing to do is write a custom request handler
that iterates over the field cache and computes the statistics.
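A rough sketch of the field-cache part (class and method are invented, and this is not a complete request handler; it assumes the summed field is indexed, single-valued, and holds integer terms):

import java.io.IOException;
import org.apache.lucene.search.FieldCache;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class SumByField {
    /** Sum an integer field over the documents in a DocList. */
    public static long sum(SolrIndexSearcher searcher, DocList docs, String field)
            throws IOException {
        int[] values = FieldCache.DEFAULT.getInts(searcher.getReader(), field);
        long total = 0;
        for (DocIterator it = docs.iterator(); it.hasNext();) {
            total += values[it.nextDoc()];
        }
        return total;
    }
}

You would call something like this once per department (or accumulate totals against a second field-cache array) inside your handler.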
No search software can search 2.5 billion docs (assuming web-sized
documents) in 5ms on a single server.
You certainly could build such a system with Solr distributed over
100's of nodes, but this is not built into Solr currently.
-Mike
e not
made on behalf of the firm.
Sorry, I'm afraid the above email is already irrevocably publicly
archived.
-Mike
n try? Or, can I adjust the facets
somehow to make this work?
http://wiki.apache.org/solr/SimpleFacetParameters#head-1b281067d007d3fb66f07a3e90e9b1704cbc59a3
cheers,
-Mike
maxBufferedDocs > autoCommit it does not have any effect.
cheers,
-Mike
age-indexing-and-searching-
tf3885324.html#a11012939>
-Mike
Solr's main interface is http, so you can connect to that remotely.
Query each machine and combine the results using your own business logic.
Alternatively, you can try out the query distribution code being
developed in
<http://issues.apache.org/jira/browse/SOLR-303>
-Mike
tokenizer
side of things. If there is a consensus on a sensible way of doing
this, I could contribute the bits of code that I have.
HTH,
-Mike
versa) and
ignores it.
Another approach that I am using locally is to maintain the
transitions, but force tokens to be a minimum size (so r2d2 doesn't
tokenize to four tokens but arrrdeee does).
There is a patch here: http://issues.apache.org/jira/browse/SOLR-293
If you vote for it, I promise to get it in for 1.3
-Mike
/scoring.html to start out, in
particular the link to the Similarity class javadocs.
-Mike
o strip html if you want).
I recommend stripping the html yourself, and putting titles, anchors,
etc in separate fields.
I believe that it would be possible to write this as a Solr update-
handler plugin, if you wanted it to all run in one place.
-Mike
On 2-Oct-07, at 12:52 AM, Ravish Bhagdev wrote:
I see that you're using the HTML analyzer. Unfortunately that does
not play very well with highlighting at the moment. You may get
garbled output.
-Mike
should be found. If you want _only_ that document to match, you
should try something like a phrase query with a bit of slop:
trade1:"the appraisal station"~10
-Mike
for highlighting is:
1. hl=true
2. hl.fl=myfield
_If_ that field matches one of the query terms, you should see
snippets in the generated response. Even if not, you should see a
section of the response (it will be empty).
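e.g., a request along these lines (host, port, and the query itself are placeholders):

http://localhost:8983/solr/select?q=myfield:apache&hl=true&hl.fl=myfield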
regards,
-Mike
somewhat surprised that several people are interested in
this but none have been sufficiently interested to implement a
solution to contribute:
http://issues.apache.org/jira/browse/SOLR-42
-Mike
which is used as the
query if the query string is empty.
To return all documents, set "q.alt=*:*"
-Mike
inst improving Solr's handling of HTML data, but it is the
type of thing that is unlikely to happen unless someone who cares
about it steps up.
Patches welcome :)
-Mike
If you know at index time that the document is shady, the easiest way
to de-emphasize it globally is to set the document boost to some
value other than one.
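e.g., in the update XML (the boost value and fields are arbitrary examples):

<add>
  <doc boost="0.4">
    <field name="id">shady-doc-1</field>
    <field name="text">some content</field>
  </doc>
</add>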
...
cheers,
-Mike
n the value stored in a
field (which could represent a range of 'badness'). This can be used
directly in the dismax handler using the bf (boost function) query
parameter.
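e.g., assuming a numeric (say, sfloat) field named "quality" (name invented):

q=ipod&qt=dismax&bf=quality

The field's value is then added, as a function query, to the score of every matching document.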
-Mike
ugh quite close in space
requirements for a 30-ary field on your index size).
-Mike
On 9-Oct-07, at 7:53 PM, Stu Hood wrote:
Using the filter cache method on the things like media type and
location; this will occupy ~2.3MB of memory _per unique value_
Mike, how did you calculate that value? I'm trying to tune my
caches, and any equations that could be used to dete
how
could this be when it is storing the same information as the
filterCache, but with the addition of sorting?
Solr caches only the top N documents in the queryResultCache (boosted
by queryResultWindowSize), which amounts to 40-odd ints, 40-odd
floats, and change.
-Mike
g minDf to a very high value
should always outperform such an approach.
-Mike
DW
-Original Message-
From: Stu Hood [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 09, 2007 10:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Facets and running out of Heap Space
Using the fi
false? If not, what do I need to do to make sure allowDups is set to
false when I'm adding these docs?
It is the normal mode of operation for Solr, so I'd be surprised if
it wasn't the default in solrj (but I don't actually know).
-Mike
r end. If the deletes are doc
ids, then you can collect a bunch at once and do
id:xxx id:yyy id:zzz id:aaa id:bbb to
perform them all at once.
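i.e., a single delete-by-query posted to /solr/update (assuming the uniqueKey field is named id):

<delete>
  <query>id:xxx id:yyy id:zzz id:aaa id:bbb</query>
</delete>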
-Mike
y have millions of unique values).
It would be helpful to know which field is causing the problem. One
way would be to do a sorted query on a quiescent index for each
field, and see if there are any suspiciously large jumps in memory
usage.
-Mike
-Original Message-
From:
two fields.
Have you tried setting multivalued=true without reindexing? I'm not
sure, but I think it will work.
-Mike
prefix. If you
want to facet multiple doclists from different queries in one
request, just write your own request handler that takes a multi-
valued q param and facets on each.
I didn't answer all the questions in your email, but I hope this
clarifies things a bit. Good luck!
-Mike
y processing:
101.0 total time
1.0 setup/query parsing
68.0 main query
30.0 faceting
0.0 pre fetch
2.0 debug
201.0 total time
1.0 setup/query parsing
138.0 main query
58.0 faceting
0.0 pre fetch
4.0 debug
I can't really think of a plausible explanation. Fortuitous
instruction pipelining? It is hard to imagine a cause that wouldn't
exhibit consistency.
-Mike
On 11-Oct-07, at 2:37 PM, Yonik Seeley wrote:
On 10/11/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
I'm seeing some interesting behaviour when doing benchmarks of query
and facet performance. Note that the query cache is disabled, and
the index is entirely in the OS disk cache. fil
ly: it doesn't
actually make the deletes visible until
-Mike
t there was a way of providing summaries
without storing doc contents, I would pee my pants with happiness and
it would be in Solr faster than you can say "diaper".
cheers,
-Mike
On 11-Oct-07, at 3:48 PM, Ravish Bhagdev wrote:
Hey guys,
Checkout this thread I opened on nutch mail
On 11-Oct-07, at 4:34 PM, Ravish Bhagdev wrote:
Hi Mike,
Thanks for your reply :)
I am not an expert of either! But, I understand that Nutch stores
contents albeit in a separate data structure (they call it a segment, as
discussed in the thread), but what I meant was that this seems like
much more
y reason why you are faceting on a field that you are restricting?
Clearly, the answer will be '1001644' --> numFound, (all other
categories) -> 0. Just use numFound.
Also, if there can only be one category per doc, make sure you are
using the fieldCache method for category_id.
-Mike
e as I'm not the one who originally wrote
the code.
Nevermind. I was thinking of category as being single-valued. For
multi-valued category, it is still necessary to do faceting to find
sub-categories. Sorry!
-Mike
ing documents due to
config changes.
-Mike
easy is changing Solr's interpretation of NOW in DateMath to be
UTC. What is the correct way to go about this?
-Mike
I'm pleased to inform you that DisMax already provides highlighting,
in exactly the same way as StandardRequestHandler.
-Mike
I'm not sure that many people are dynamically taking down/starting up
Solr webapps in servlet containers. I certainly prefer process-level
management of my (many) Solr instances.
-Mike
On 18-Oct-07, at 10:40 AM, Stu Hood wrote:
Any ideas?
Has anyone experienced this problem
If your users are all over the
world, you'd ideally want to round to _their_ timezone, but I don't
see how this is realistic.
thanks,
-Mike
On 18-Oct-07, at 1:01 PM, Stu Hood wrote:
I'm running SVN r583865 (1.3-dev).
Mike: when you say 'process level management', do you mean starting
them statically? Or starting them dynamically, but using a
different container for each instance?
I have a large number o
On 19-Oct-07, at 7:19 AM, Ed Summers wrote:
On 10/18/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
I realize this is a bit off-topic -- but I'm curious what the
rationale was behind having that many solr instances on that many
machines and how they are coordinated. Is it a master/sla
what you want.
If you use dismax, you can add the boost to the 'bq' parameter to
affect scoring only (will not match the doc if it wouldn't have been
matched anyway).
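e.g. (field name and boost are placeholders):

q=laptop&qt=dismax&bq=category:featured^5.0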
-Mike
that point, the information about the
structure of the document is not available.
It is computable given sufficient effort, but certainly not something
Solr should provide by default.
Have you considered storing each section as a separate Solr Document?
-Mike
On 24-Oct-07, at 12:39 PM, Alf Eaton wrote:
Mike Klaas wrote:
On 24-Oct-07, at 7:10 AM, Alf Eaton wrote:
Yes, I was just trying that this morning and it's an improvement,
though
not ideal if the field contains a lot of text (in other words
it's still
a suboptimal workaround).
we are already
close to finishing.
If you mean that there are a lot of small tweaks that the community
doesn't have access to because we haven't done a release, I'm
inclined to agree that that would be ideal. It is more work to
maintain that kind of release schedule (requires work on multiple
branches at once).
-Mike
Indeed--phrase matching uses a completely different part of the
index, so that needs to be warmed too.
One thing to try is solr trunk: it contains some speedups for phrase
queries (though perhaps not as substantial as you hope for).
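One way to warm them is a newSearcher listener in solrconfig.xml with a few representative phrase queries (the queries shown are placeholders):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">"new york city"</str></lst>
    <lst><str name="q">"open source search"</str></lst>
  </arr>
</listener>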
-Mike
Have you tried other queries? 937ms seems a little high,
even for phrase queries.
Anyway I will collect the statistic on linux first and try out
other options.
Have you tried using the performance enhancements present in solr-trunk?
-Mike
ll be better. Does anyone have experience with that?
Unlikely, though it might help you slightly at a high query rate with
high cache hit ratios.
-Mike
25%.
It still feels to me that you are trying to do something unique with
your phrase queries. Unfortunately, you still haven't said what you
are trying to do in general terms, which makes it very difficult for
people to help you.
-Mike
scenes if you aren't using
multiple threads.
Some possible differences:
1. Solr has more aggressive default buffering settings
(maxBufferedDocs, mergeFactor)
2. solr trunk (if that is what you are using) is using a more recent
version of Lucene than the released 2.2
-Mike
ev (lucene)? I think someone
once implemented a solution using hashmaps for sorting, but I can't
recall the issue #.
-Mike
when stemming, you'd store (account accountant)
(account accounts), etc., when filtering, (epee épée) (fantome
fantôme), etc.
Now when querying, transform your query into the analyzed form plus
the original term boosted by ^10:
épée -> epee épée^10
accountant -> account accountant^10
A bit of work to do in general, though.
-Mike
to house the # of unique values
you are faceting on? Check the cache statistics on the admin gui.
Are there large numbers of evictions?
Alternatively, is company_facet multi- or single-valued? If the
latter, the filter cache is not used at all.
-Mike
More generally, does anyone have a
ven't been discovered yet. I'm using it in production.
More important than any claims we make is running it against your own
application's test suite, of course.
-Mike
ments that have the same resulting
token will be considered "the same").
If this is violated, the behaviour is undefined (but I wouldn't be
surprised if the first token was used).
-Mike
Hi Brian,
Found the SVN location, will download from there and give it a try.
Thanks for the help.
On 07/11/2007, Mike Davies <[EMAIL PROTECTED]> wrote:
>
> I'm using 1.2, downloaded from
>
> http://apache.rediris.es/lucene/solr/
>
> Where can i get the trunk ver
8983. Any suggestions?
Also, I'd really like to get hold of the source code to the start.jar but I
can't seem to find it anywhere. Again, any suggestions?
Thanks
Mike
I'm using 1.2, downloaded from
http://apache.rediris.es/lucene/solr/
Where can I get the trunk version?
On 07/11/2007, Brian Whitman <[EMAIL PROTECTED]> wrote:
>
>
> On Nov 7, 2007, at 10:00 AM, Mike Davies wrote:
> > java -Djetty.port=8521 -jar start.jar
> >
On 7-Nov-07, at 2:27 PM, briand wrote:
I need to perform a search against a limited set of documents. I
have the
set of document ids, but was wondering what is the best way to
formulate the
query to SOLR?
add fq=docId:(id1 id2 id3 id4 id5...)
cheers,
-Mike
ld only be possible
in the event of cold process termination (like power loss).
-Mike
-Original Message-
From: David Neubert [mailto:[EMAIL PROTECTED]
Sent: Friday, November 09, 2007 10:56 AM
To: solr-user@lucene.apache.org
Subject: Re: Delete all docs in a SOLR index?
Thanks!
deprecated
methods in external libraries as well.
I don't think so, but I suggest asking this question on
java-[EMAIL PROTECTED], which has a much broader lucene-related
audience.
-Mike
when
doing lots of faceting on huge indices, if N is low (say, 500-1000).
One problem with the implementation above is that it stymies the
query caching in SolrIndexSearcher (since the generated DocList is >
the cache upper bound).
-Mike
Not really--there have been a few threads on this topic recently.
Perhaps in a couple months?
It may depend on the timing of the lucene release.
-Mike
On 13-Nov-07, at 3:41 PM, Dave C. wrote:
Ah...
:(
Is there a timeline for the 1.3 release?
- david
Date: Tue, 13 Nov 2007 18:33:01
web.xml ...etc...
Perhaps check your cache statistics on the admin gui. Is it possible
that you have set the capacity high and they are just filling up?
Another thing to look out for is if you tend to sort on many
different fields, but rarely.
-Mike