g rewrite() before using
the Highlighter?
It is, in trunk/:
NamedList sumData = HighlightingUtils.doHighlighting(
results.docList, query.rewrite(req.getSearcher().getReader()),
req, new String[]{defaultField});
Definitely a bug somewhere. Does anyone more familiar with lucene see
why the above wouldn't be sufficient?
-Mike
On 3/23/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
On 3/23/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> Definitely a bug somewhere. Does anyone more familiar with lucene see
> why the above wouldn't be sufficient?
Perhaps our use of ConstantScorePrefixQuery by default
you want highlighting -- so instead of dn* search for dn?*
Note that you need a recent nightly build for that to work--it
wasn't there for the last release.
-Mike
removes
the : and everything after it
q=trackURL:http%3A//host* <-- doesn't work, same as above
q=trackURL:http*host* <-- TooManyClauses exception, not what I want
anyway
Have you tried:
trackURL:http\://host*
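A quick Python sketch of that escaping, in case it helps (the helper
name is mine; only the ':' needs a backslash, so the trailing '*'
still acts as a wildcard):

def escape_colons(term):
    # backslash-escape ':' so the query parser doesn't read it
    # as a field separator; leave '*' alone as a wildcard
    return term.replace(':', r'\:')

q = 'trackURL:' + escape_colons('http://host') + '*'
# -> trackURL:http\://host*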
-Mike
e completely disjoint: indexing is a lossy
operation, so if you want to be able to retrieve the original contents,
they must be stored separately (i.e., the first option uses the least
space).
-Mike
loss. So you don't need to store it separately. What do you think?
In theory that might be true, but lucene is not implemented that way,
I'm afraid. If this is the a priori situation, it is probably easier
to implement this outside of lucene and "store" the id in your
external index.
-Mike
ly be achieved by using a high
percentage setting, but I'd have to double check how the rounding is
done).
-Mike
a document for each customer then some field
must indicate to which customer the document instance belongs. In
that case, why not index a single copy of each document, with a field
containing a list of customers having access?
-Mike
you clarify what you're looking for Solr to do for you?
-Mike
docs. I don't have much lucene-sort fu, though, so an expert
should chime in...
-Mike
s incorporated into the fieldNorm and so is modified by the
lengthNorm. Further, at query time the term idf and queryNorm come
into play.
You shouldn't expect that the document boost will be returned as the
document score (although you should expect it to affect it).
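Schematically, the factors combine like this (a sketch of Lucene's
DefaultSimilarity, not the exact implementation):

def term_score(tf, idf, query_boost, doc_boost, field_boost,
               length_norm, query_norm):
    # index time: doc boost, field boost, and lengthNorm are multiplied
    # together and quantized into the single-byte fieldNorm
    norm = doc_boost * field_boost * length_norm
    # query time: each matching term contributes tf * idf^2 times the
    # boosts and norms
    return tf * idf ** 2 * query_boost * norm * query_norm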
-Mike
he/solr/util/doc-files/min-should-match.html
Indeed it is, though I wasn't aware of the detailed documentation.
Not that it is that hard to find, but it is three links away from the
main dismax page on the wiki. I might add a link directly from the
main dismax javadoc to help people like me find it.
-Mike
aps via StandardAnalyzer) occurred in the
original index.
-Mike
field.
-Mike
On 3/28/07, escher2k <[EMAIL PROTECTED]> wrote:
Mike,
I am not doing anything custom for this test. I am assuming that the
Default Similarity is used.
Surprisingly, if I remove the document level boost (set to 1.0) and just
have a field level boost, the result
seems to be correct.
A
1.com/msg/91276.html, which doesn't
exactly boast completion of such a feat!
-Mike
On 3/28/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
Hi Mike,
I'm curious about what you said there: "People have constructed (lucene)
indices with over a billion
documents.". Are y
also store doc boosts that span a wider dynamic variance (1
to 15 rather than 1 to 1.5, say), then compensate by applying a
query-time boost of 0.1.
-Mike
On 3/29/07, James liu <[EMAIL PROTECTED]> wrote:
I find Solr always keeps working when I delete the index; I think it
may be cached in memory.
Tricky, eh? Unix doesn't _truly_ delete the files until all processes
have closed them or terminated.
-Mike
re memory after their
datastructures have been built, so it would be odd to see OOM after 48
hours if they were the cause.
-Mike
ately equal values anyway.
-Mike
big enough for faceting.
I could use the same thing!
It would also be useful to (for instance) insert a filter into the
filter cache that could be subsequently used by a query. Obviously,
this is really only useful for filters that aren't constructed from a
query -> docset.
-Mike
On 4/4/07, James liu <[EMAIL PROTECTED]> wrote:
2007/4/5, Mike Klaas <[EMAIL PROTECTED]>:
>
> On 4/4/07, James liu <[EMAIL PROTECTED]> wrote:
> > That means now I can't solve it with Solr?
>
> Not out-of-the-box, no. But you can certainly query your sl
3. put all returned documents into an array, and reverse sort by score
4. select documents [N, N+M) from this array.
This is a relatively simple task. It gets more complicated once
multiple passes, idf compensation, deduplication, etc. are added.
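A sketch of steps 3 and 4 in Python, assuming each partition's hits
arrive as (id, score) pairs:

def merge_and_page(partition_hits, n, m):
    # partition_hits: one list of (doc_id, score) tuples per partition
    merged = [hit for hits in partition_hits for hit in hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)  # step 3
    return merged[n:n + m]                             # step 4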
-Mike
I would be very interested in this. Any idea on when this will be available?
Thanks
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Monday, April 02, 2007 1:44 AM
To: solr-user@lucene.apache.org
Subject: Re: C# API for Solr
Well, I think there will be a lot of
On 4/6/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
A) http://issues.apache.org/jira/secure/attachment/12349897/logo-solr-d.jpg
lable)
from Python objects. Then again, JSON for posting would be
really nice to have :)
It is not documented very well, but you can pass in a multi-map to the
solr.py client:
.add(field_one=['one', 'two', 'three'], field_two='value', ...)
-Mike
d whatnot which makes it also a little bit harder
to do this way.
anyways hope that makes sense,
let me know!
-Mike
k to a page that includes a link to the nightly build and
CHANGES.txt, or the release package for already-released versions.
-Mike
your existing analyzer to recognize it (WordDelimiterFilter if you are
using the standard text field in the Solr example). If it is
complicated, you should look into creating your own analyzer.
-Mike
' within 1000 words of
'delhi', highest score to matches having the words nearby
-Mike
pulate it
during the main query, and grab ids from the cache during the
highlighting step.
-Mike
ecause the new posts must be available in the search as
soon as they are posted.
Do you think there is a way to optimize this?
"As soon as" is a rather vague requirement. If you can specify the
minimum acceptable delay, then you can use Solr's autocommit
functionality to trigger commits.
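If you'd rather drive it from the indexing client, a minimal sketch
(assuming a solr.py-style connection with add() and commit()):

import time

def index_with_bounded_staleness(conn, docs, max_delay=5.0):
    # commit whenever max_delay seconds have elapsed since the last
    # commit, bounding how stale the searchable index can get
    last_commit = time.time()
    for doc in docs:
        conn.add(**doc)
        if time.time() - last_commit >= max_delay:
            conn.commit()
            last_commit = time.time()
    conn.commit()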
-Mike
roximate it by doing something like:
A:"phrase"^10 B:"phrase"^1 C:"phrase"^1000 D:"phrase"^100
E:"phrase"^30
HTH,
-Mike
a few weeks after
1.1 was cut. I suggest using a nightly build from Feb 2 or later, or
waiting until 1.2 is released.
cheers,
-Mike
e.
Sounds good. If it is sufficiently unobtrusive, it probably isn't
even necessary to change it later.
-Mike
I couldn't give you a timeline.
For the time being, consider that
1. utf-8 is the "lingua franca" of xml document encoding
2. it is very easy to convert it yourself (it would be a 3-4 line
python command-line filter, for instance).
-Mike
nt of or limitation to a particular package.
-Mike
the default package with jetty can be used for production.
Do you know that Jetty is the culprit? We've been successfully using
it for production purposes.
-Mike
and since the provided container has never given me
any (significant[1]) issues, I've kept with it.
[1] Aside from XML-escaping irregularities that were discussed on the
list last year.
-Mike
off of windows
server so I haven't even looked into the snappuller etc.. stuff.
Thanks,
Mike
andler.
-Mike
On 4/25/07, Mike Austin <[EMAIL PROTECTED]> wrote:
Could someone give advice on a better way to do this?
I have an index of many merchants and each day I delete merchant products
and re-update my database. After doing this I then re-create the entire
index and move it to production rep
he work. I might actually be able to contribute some code to
this at some point... maybe in conjunction with my solr servlet code and how
I do faceting and category navigation.
Thanks,
Mike
-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 26, 2007
example, if I called Runtime.exec with a command of
"test_program" (which is a bash script), it failed. If I called
Runtime.exec with a command of "/bin/bash test_program" it worked.
Yes, Runtime.exec does not invoke a shell automatically, so shebang
lines, shell built-ins, I/O redirection, etc. cannot be used directly.
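For comparison, the same distinction in Python's subprocess module (a
sketch):

import subprocess

# no shell: the file is exec'd directly, so redirection and other
# shell syntax are unavailable
subprocess.call(['/bin/bash', 'test_program'])

# shell=True routes the command line through /bin/sh, so they work
subprocess.call('test_program > out.log', shell=True)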
-Mike
're hoping to fix that asap.
See http://issues.apache.org/jira/browse/SOLR-102 for my solution to
this problem. The idea is that you'd like to split at sentence
boundaries, but also not stray too far from the desired fragment size.
It would be great to get comments on/improvements to this approach.
-Mike
.
There is some past discussion on the list if you search the archives.
-Mike
browse/SOLR-216
-Mike
actly what you're planning on doing).
Typically, the feature you are talking about is implemented by
analyzing query logs, which are a much more relevant corpus than the
raw documents in this context. I suggest focusing your efforts in
that direction (possibly checking to see if someone has done this
with lucene already...)
cheers,
-Mike
sorted(facet.field_values.items(), key=lambda x: x[1], reverse=True)
or even
from operator import itemgetter
sorted(facet.field_values.items(), key=itemgetter(1), reverse=True)
digressionally,
-Mike
You could easily store all 100 facets, display
the first ten and fill in the rest with some (hidden) javascript when
the user clicks a button (or re-request the facets from Solr with a
higher threshold).
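For the re-request variant, facet.limit controls how many values come
back; a sketch of the parameters (query and field name are made up):

import urllib

params = urllib.urlencode({
    'q': 'ipod',
    'facet': 'true',
    'facet.field': 'category',  # hypothetical facet field
    'facet.limit': 100,         # ask for all 100 values up front
})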
-Mike
records in
the order of results while still maintaining the scoring?
-Mike
TF-8 by default. Any objections?
No--I'm not sure that it'll bring clarity for anyone who isn't aware
of xml encoding issues, but I can't see it hurting.
-Mike
On 5/9/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> +1 on explicit encoding declarations.
Done (even though it really wasn't needed since there were no int'l
chars in the example).
As Mike points out, it only marginally helps... if the user adds
international chars to the
be difficult to
create a patch if you were interested, but I'm curious: What about
XSL makes what seems to me an elementary string-processing task so
difficult?
regards
-Mike
sing delete by query:
docId:XXX OR docId:YYY OR docId:ZZZ ...
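For example, a sketch that builds the update message (field name taken
from the thread; POST it to the update handler, then commit):

ids = ['XXX', 'YYY', 'ZZZ']
q = ' OR '.join('docId:%s' % i for i in ids)
msg = '<delete><query>%s</query></delete>' % q
# -> <delete><query>docId:XXX OR docId:YYY OR docId:ZZZ</query></delete>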
-Mike
readedly if you want some
concurrency).
regards,
-Mike
tition "runs out" of docs before it is done, request a new
round.
-Mike
On 14-May-07, at 6:49 PM, James liu wrote:
2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
On 14-May-07, at 1:35 AM, James liu wrote:
When you get up to 60 partitions, you should make it a multi-stage
process. Assuming your partitions are disjoint and evenly
distributed, estimate the num
e the docs from 0 to N for each partition (whether
through one request or multiple).
-Mike
2007/5/15, James liu <[EMAIL PROTECTED]>:
2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
>
> On 14-May-07, at 1:35 AM, James liu wrote:
>
> > if use multi index box,
order. You have to perform that sort manually.
So it will not be sorted by score correctly.
And if the user clicks page 2, how do I show the data?
Does p1 start from 10, or do I query the other partitions?
Assemble results 1 through 20, then display 11-20 to the user.
-Mike
2007/5/15, Mike Klaas <[EMAIL P
On 14-May-07, at 10:05 PM, James liu wrote:
2007/5/15, Mike Klaas <[EMAIL PROTECTED]>:
I'm not ignoring it: I'm implying that the above is the correct
descending score-sorted order. You have to perform that sort
manually.
I mean merge the results (from the 60 partitions) and sort them,
hat timing/statistics might be
handleable on a larger scale. OTOH, it does give an easy way for
request handlers to insert detailed timing data in a logical place in
the output.
-Mike
categories
- simple xml configuration for the final outputted category configuration
file
I'm sure there are more cool things but that is all for now. Join the
mailing list to see more improvements in the future.
Also.. how do I get added to the Using Solr wiki page?
Thanks,
Mike Austin
need
to use a different query handler?
-Mike
e in programming
languages.
Note that the URL parameter is not a variable in Solr. Your problems
seem to be occurring due to the use of systems that attempt to map
data to variable names, which seems to me like a worse idea than
using '.' in URL parameters.
regards,
-Mike
't appearing. Could you clarify what you mean? What analyzers
are you using?
-Mike
Thanks
-Amit
James liu wrote:
First, try enabling highlighting
(http://wiki.apache.org/solr/HighlightingParameters)
and check the Solr admin GUI to see its output; you will find what
you want.
2007/5/23, s
search and noticed pages were executed through aspx. Are you using
.net to parse the xml results from SOLR? Nice site, just trying to figure
out where SOLR fits into this.
On 5/16/07, Mike Austin <[EMAIL PROTECTED]> wrote:
I just wanted to say thanks to everyone for the creation o
ery nice job!
> It's fast too.
>
> -Yonik
>
> On 5/16/07, Mike Austin <[EMAIL PROTECTED]> wrote:
> > I just wanted to say thanks to everyone for the creation of solr. I've
> been
> > using it for a while now and I have recently brought one of my side
> pro
any non-trivial size index.
-Mike
ess to the unindexed version? My
suggestion would be to copyField into an unanalyzed version, and
facet on that.
cheers,
-Mike
and others not -- while still leaving all other
options open.
Define two fieldTypes, and use one for "tokenized" analysis and
another for "untokenized"?
-Mike
).
Do you really have 1.5M unique values in that field? Are you
analyzing the field (you probably shouldn't be)?
-Mike
suspicious about your application. You have 1.5M
distinct tags for 4M documents? That seems quite dense.
-Mike
this exactitude to carry forth in your highlighting, specify
hl.requireFieldMatch=true.
-Mike
pen
file limit (I upped mine from 1024 to 45000 to handle huge indices).
You can alleviate this by reducing the mergeFactor, but this can
impact indexing performance.
And: is there a way to just hand the XML file to Solr without
having to POST it?
No, but POST'ing shouldn't be a bottleneck.
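For reference, a minimal POST from Python (URL assumed; urllib2 is in
the standard library):

import urllib2

data = open('docs.xml', 'rb').read()
req = urllib2.Request('http://localhost:8983/solr/update', data,
                      {'Content-Type': 'text/xml; charset=utf-8'})
urllib2.urlopen(req).read()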
-Mike
log_{base mergefactor}(numDocs) * mergeFactor segments,
approximately.
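A rough illustration of that estimate (hypothetical numbers):

import math

def approx_segment_count(num_docs, merge_factor=10):
    # about mergeFactor segments per level, with roughly
    # log base mergeFactor of numDocs levels
    levels = math.log(num_docs) / math.log(merge_factor)
    return int(levels * merge_factor)

print(approx_segment_count(1000000))  # ~60 for 1M docs at mergeFactor 10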
-Mike
't even searching them.
One option is to search those fields directly, using dismax. In that
case, the highlight fields will be picked up automatically.
-Mike
ted at all (check disk usage stats).
How is Solr caching better than this?
It is unrelated. Solr can cache certain reusable components of
queries (namely, filters), and provides for fully-customizable schema
and arbitrary query execution on it.
-Mike
on.
Solr is an open-source project, so huge features will get implemented
when there is a person or group of people devoted to leading the
charge on the issue. If you're interested in being that person,
that's great!
-Mike
http://issues.apache.org/jira/browse/SOLR-257
I'll probably commit it in a day or so, at which point it will be
part of the Solr nightly build.
-Mike
hen using a high mm (minimum #clauses match)
setting with dismax, as it effectively requires 'in' to be in the url
column, which was probably not the intent of the query.
-Mike
efix query.
-Mike
2 to specify more than one
default search field, or is the above solution still the way to go?
This is precisely the situation that the dismax handler was designed
for. Plus, you don't have to fiddle around with document boosts.
try:
qt=dismax q=letters qf=keywords^3.0 title^2.0 content
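Assembled into a request URL, that might look like (host and path
assumed):

import urllib

params = urllib.urlencode({
    'qt': 'dismax',
    'q': 'letters',
    'qf': 'keywords^3.0 title^2.0 content',
})
url = 'http://localhost:8983/solr/select?' + params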
-Mike
On 8-Jun-07, at 10:19 AM, Tiong Jeffrey wrote:
Hi all,
I tried to index a document that has '&' using post.jar. But during
the
indexing it causes an error and it won't finish the indexing. Can I
ask why this is and how to prevent it? Thanks!
XML requires &'s to be escaped: & -> &amp;
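If you're generating the post XML from Python, the standard library's
escape() handles &, <, and > (a sketch; the field name is made up):

from xml.sax.saxutils import escape

title = 'Marks & Spencer'
field_xml = '<field name="title">%s</field>' % escape(title)
# -> <field name="title">Marks &amp; Spencer</field>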
-Mike
-connection encoding. I
think the default is 'latin-1'; try googling 'mysql collation'.
You could use python to convert the file:
open('outfile', 'wb').write(open('infile', 'rb').read().decode('latin-1').encode('utf-8'))
regards,
-Mike
creates an inverted index; the storage
system keeps track of the data you give it _before_ analysis/
tokenization. If there is analysis you'd like to do that also
applies to the stored status of the doc, it's probably easier to
apply it before passing the data to Solr.
-Mike
On 08
a month ago. I don't recall seeing any
bugs with the 'fq' param.
er... since the second batch of queries returned no hits, does that
not indicate that the problem _isn't_ with fq? You practically
stripped it down to raw lucene territory here.
-Mike
es to be
storing in a binary float--you're probably comparing mostly the
exponent, which is not necessarily disjoint. Have you tried sdouble?
And this problem seems to occur in most (if not all) of my range
queries. Is there anything that I am doing wrong here?
Is this true on other field types as well?
-Mike
is handling it? I
suspect it is encoded somehow, which could be problematic. Is it
going through a web browser? How is it getting into mysql?
-Mike
) ?
No, the index dir is determined by solrconfig.xml of the Solr
instance. The python client can only be used to connect to an
already-running instance.
-Mike
ascii', 'ignore')) # assuming s is a bytestring
u.encode('ascii', 'ignore') # assuming u is a unicode string
-Mike
On 15-Jun-07, at 2:45 AM, vanderkerkoff wrote:
Hi Mike
The characters that are giving us problems are the old favourites of
apostrophes an
entirely by your
requirements. Since you wanted to create a new subindex, you'll have
to set up another Solr instance somewhere. Another machine, another
webapp, etc.
-Mike
On 18-Jun-07, at 6:27 AM, vanderkerkoff wrote:
Cheers Mike, read the page, it's starting to get into my brain now.
Django was giving me unicode strings, so I did some encoding and
decoding and
now the data is getting into solr, and it's simply not passing the
characters that a
you find that there is a performance
problem.
-Mike
I get this error, I searched the email
archive, it
seems to work for other users. Does anyone know what the problem is?
CJKTokenizerFactory that I am using is appended.
Would you be interested in contributing this class to solr?
-Mike
On 20-Jun-07, at 6:38 AM, vanderkerkoff wrote:
Hello Mike, Brian
My brain is approaching saturation point and I'm reading these two
opinions
as opposing each other.
I'm sure I'm reading it incorrectly, but they seem to contradict
each other.
Are they?
solr.py ta
Niraj: What environment are you using? SQL Server/.NET/Windows? or something
else?
-Mike
-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 20, 2007 4:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Faceted Search!
: define the sub-categories
On 21-Jun-07, at 10:22 PM, Chris Hostetter wrote:
like I said though: I'm in favor of factories like this ... I just don't
think we should do anything to hide their use and make referring to
Tokenizer or TokenFilter class names directly use reflection magically.
What would be the best way to
made configurable.
I'll add it to the future features.
-Mike
re pending documents, is that correct?
That is correct.
-Mike