RE: Best use of wildcard searches

2007-08-11 Thread Jonathan Woods
Thanks, Lance.

I recall reading that Lucene is used in a superfast RDF query engine:

http://www.deri.ie/about/press/releases/details/?uid=55&ref=213

Jon

> -Original Message-
> From: Lance Norskog [mailto:[EMAIL PROTECTED] 
> 
> The Protégé project at Stanford has nice tools for editing 
> knowledge bases, taxonomies, etc.



Re: Spell Check Handler

2007-08-11 Thread climbingrose
That's exactly what I did with my custom version of the SpellCheckerHandler.
However, I didn't handle suggestionCount and only returned the one corrected
phrase containing the "best" corrected terms. There is an issue in the
Lucene issue tracker regarding a multi-word spellchecker:
https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
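For what it's worth, the per-token approach could be sketched like this (a hypothetical Python illustration: difflib stands in for Lucene's SpellChecker, whitespace splitting stands in for the field's analyzer, and the dictionary terms are made up):

```python
import difflib

# Toy vocabulary standing in for the terms indexed in the spellchecker's
# dictionary field (termSourceField); all names here are made up.
DICTIONARY = ["java", "developer", "senior", "engineer"]

def correct_query(query, dictionary=DICTIONARY):
    """Tokenize a multi-word query and correct each token independently,
    then rejoin the best suggestions into one corrected phrase."""
    corrected = []
    for token in query.lower().split():
        # Stand-in for SpellChecker.suggestSimilar(token, 1).
        suggestions = difflib.get_close_matches(token, dictionary, n=1)
        corrected.append(suggestions[0] if suggestions else token)
    return " ".join(corrected)

print(correct_query("Java developar"))  # -> java developer
```

The real handler would of course run the configured analyzer rather than split on whitespace, but the shape of the loop is the same.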


On 8/11/07, Pieter Berkel <[EMAIL PROTECTED]> wrote:
>
> On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
> >
> > The spellchecker handler doesn't seem to work with multi-word queries. For
> > example, when I tried to spellcheck "Java developar", it returned nothing,
> > while if I tried "developar", the spellchecker correctly returned
> > "developer". I followed the setup on the wiki.
>
>
> While I suppose the general case for using the spelling checker would be a
> query containing a single misspelled word, it would be quite useful if the
> handler applied the analyzer specified by the termSourceField fieldType to
> the query input and then checked the spelling of each query token. This
> would seem to be the most flexible way of supporting multi-word queries
> (provided the termSourceField didn't use any stemmer filters I suppose).
>
> Pieter
>



-- 
Regards,

Cuong Hoang


Re: FunctionQuery and boosting documents using date arithmetic

2007-08-11 Thread climbingrose
I'm hitting the same issue with the date boosting function. I'm using this
function: F = recip(rord(creationDate),1,1000,1000)^10. However, since I have
around 10,000 documents added in one day, rord(creationDate) returns very
different values for the same creationDate. For example, the last document
added will have rord(creationDate) = 1, while the first document added that
day will have rord(creationDate) = 10,000. When rord(creationDate) > 10,000,
the value of F approaches 0. Therefore, the boost query doesn't differentiate
between the last document added today and a document added 10 days ago. Now
if I replace 1000 in F with a large number, say 10,  the boost function
suddenly gives the last few documents an enormous boost and makes the other
query scores irrelevant.

So in my case (and many others', I believe), the "true" date value would be
more appropriate. I'm thinking along the same lines of adding a timestamp at
index time. It wouldn't add much overhead that way, would it?
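To make the flattening concrete, here is a quick sketch of Solr's recip function (the formula recip(x,m,a,b) = a/(m*x+b) is the standard one; the sample rord values are illustrative):

```python
def recip(x, m=1, a=1000, b=1000):
    """Solr's recip(x, m, a, b) = a / (m*x + b)."""
    return a / (m * x + b)

# With ~10,000 documents added per day, reverse-ordinal values grow fast,
# and the boost flattens out to almost nothing:
for rord in (1, 10_000, 100_000):
    print(rord, round(recip(rord), 4))
# 1       -> 0.999   (newest document)
# 10,000  -> 0.0909  (roughly one day's worth of documents back)
# 100,000 -> 0.0099  (roughly ten days' worth of documents back)
```

Everything past the first few thousand ordinal positions is squashed into a narrow band near zero, which is why the boost stops discriminating between "yesterday" and "ten days ago".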

Regards,



On 8/11/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
>
> : Actually, just thinking about this a bit more, perhaps adding a function
> : call such as parseDate() might add too much overhead to the actual query,
> : perhaps it would be better to first convert the date to a timestamp at
> : index time and store it in a field type slong?  This might be more
> : efficient but
>
> i would agree with you there, this is where a more robust (ie:
> less efficient) DateField-ish class that supports configuration options
> to specify:
>   1) the output format
>   2) the input format(s)
>   3) the indexed format
> ...as SimpleDateFormat pattern strings would be handy.  The
> ValueSource it uses could return seconds (or some other unit based on
> another config option) since epoch as the intValue.
>
> it's been discussed before, but there are a lot of tricky issues involved
> which is probably why no one has really tackled it.
>
> : that still leaves the problem of obtaining the current timestamp to use
> : in the boost function.
>
> it would be pretty easy to write a ValueSource that just knew about "now"
> as seconds since epoch.
>
> : > While it seems to work pretty well, I've realised that this may not be
> : > quite as effective as i had hoped given that the calculation is based
> : > on the ordinal of the field value rather than the value of the field
> : > itself.  In cases where the field type is 'date' and the actual field
> : > values are not distributed evenly across all documents in the index,
> : > the value returned by rord() is not going to give a true reflection of
> : > document age.  For example,
>
> be careful what you wish for.  you are 100% correct that functions using
> the (r)ord value of a DateField aren't a function of true age, but
> depending on how you look at it that may be better than using the real age
> (i think so anyway).  While it sounds appealing to say that docA should
> score half as high as docB if it is twice as old, that typically isn't all
> that important when dealing with recent dates; and when dealing with older
> dates the ordinal value tends to approximate it decently well ... where a
> true measure of age might screw you up is when you have situations where
> few/no new articles get published on weekends (or late at night).  it's
> also very confusing to people when the ordering of documents changes even
> though no new documents have been published -- that can easily happen if
> you are heavily boosting on a true age calculation, but will never happen
> when dealing with an ordinal ranking of documents by age.
>
> (although, this could be compensated for by doing all of your true age
> calculations relative to the "min age" of all articles in your index -- but
> you would still get really weird 'big' shifts in scores as soon as that
> first article gets published on monday morning.)
>
>
> -Hoss
>
>
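Hoss's weekend scenario can be sketched numerically (the timestamps and the hour-based recip parameters below are invented for illustration, not from anyone's actual config):

```python
from datetime import datetime

# Hypothetical publication times: two articles on a Friday, none over the
# weekend. The a=b=1000 parameters are made up for the example.
pub_times = [datetime(2007, 8, 10, 9), datetime(2007, 8, 10, 17)]

def true_age_boost(pub, now, a=1000.0, b=1000.0):
    """recip over real document age in hours: a / (age_hours + b)."""
    age_hours = (now - pub).total_seconds() / 3600
    return a / (age_hours + b)

friday_evening = datetime(2007, 8, 10, 18)
monday_morning = datetime(2007, 8, 13, 9)

# True-age boosts keep shrinking over the weekend even though nothing new
# was published, so score-based orderings can drift...
print([round(true_age_boost(p, friday_evening), 4) for p in pub_times])
print([round(true_age_boost(p, monday_morning), 4) for p in pub_times])

# ...while reverse-ordinal values stay fixed until a new document arrives.
rord = {p: i + 1 for i, p in enumerate(sorted(pub_times, reverse=True))}
print(sorted(rord.values()))  # -> [1, 2]
```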


-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-11 Thread karl wettin


On 11 Aug 2007, at 10:36, climbingrose wrote:


There is an issue on
Lucene issue tracker regarding multi-word spellchecker:
https://issues.apache.org/jira/browse/LUCENE-550


I think you mean LUCENE-626 that sort of depends on LUCENE-550.


--
karl





Re: Spell Check Handler

2007-08-11 Thread climbingrose
Yeah. How stable is the patch, Karl? Is it possible to use it in a production
environment?

On 8/12/07, karl wettin <[EMAIL PROTECTED]> wrote:
>
>
> On 11 Aug 2007, at 10:36, climbingrose wrote:
>
> > There is an issue on
> > Lucene issue tracker regarding multi-word spellchecker:
> > https://issues.apache.org/jira/browse/LUCENE-550
>
> I think you mean LUCENE-626 that sort of depends on LUCENE-550.
>
>
> --
> karl
>
>
>
>


-- 
Regards,

Cuong Hoang


Re: Spell Check Handler

2007-08-11 Thread Pieter Berkel
On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
>
> That's exactly what I did with my custom version of the
> SpellCheckerHandler.
> However, I didn't handle suggestionCount and only returned the one
> corrected
> phrase which contains the "best" corrected terms. There is an issue on
> Lucene issue tracker regarding multi-word spellchecker:
> https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>


I'd be interested to take a look at your modifications to the
SpellCheckerHandler; how did you handle phrase queries? Maybe we can open a
JIRA issue to expand the spell checking functionality to perform analysis on
multi-word input values.

I did find http://issues.apache.org/jira/browse/LUCENE-626 after looking at
LUCENE-550, but since these patches are not yet included in the Lucene trunk,
it might be a little difficult to justify implementing them in Solr.


Re: Spell Check Handler

2007-08-11 Thread karl wettin


On 12 Aug 2007, at 02:35, climbingrose wrote:

I think you mean LUCENE-626



Yeah. Is it possible to use it in a production environment?


It's been running live for a long time at this one place, but the code is
stuck at Lucene 2.0 and an old version of 550. I don't really do any more
Solr than monitor the forums and use some analysis code, so I couldn't say
how much work it would take you to get it running.


I'm aiming to give the code a review and bring it up to date with the Lucene
trunk any day, week, month or year now, depending on workload and whether I
manage to fix a version of 550 that is accepted to the trunk.


You are welcome to break out the TokenPhraseSuggester and
NgramTokenSuggester, the parts I think you are interested in. If you do, feel
free to report back and post a patch in the issue.


--
karl


RE: Multivalued fields and the 'copyField' operator

2007-08-11 Thread Chris Hostetter

: If I index with this, the spellcheck Analyser class complains that I'm
: putting two values in a multiValued="false" field. Since I have to make it
: multiValued, the same word in successive values is not collapsed into one
: mention of the word.

I think you are misunderstanding the intent of the "multiValued" field
attribute -- it really has nothing to do with collapsing values or
removing duplicates; it just tells Solr whether or not you want to allow
multiple discrete values to ever be added -- whether it's by a copyField
or by you sending multiple explicit values when you add the document.

even if you took copyField out of the equation and sent the values
explicitly, or took multiValued out of the equation and did a string
concat of the values before adding the doc, there would still be no
automatic collapsing of successive instances of a word.
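To illustrate, a schema.xml fragment along these lines (the field names here are made up, not from the original poster's schema) shows why a copyField destination with several sources has to be declared multiValued:

```xml
<!-- Illustrative fields only; "spell" is a hypothetical dictionary field. -->
<field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
<field name="body"  type="text" indexed="true" stored="true" multiValued="false"/>

<!-- Each copyField adds a separate discrete value to the destination, so it
     must allow multiple values; Solr never merges or de-duplicates them. -->
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true"/>

<copyField source="title" dest="spell"/>
<copyField source="body"  dest="spell"/>
```

Any de-duplication of repeated words would have to happen in the analysis chain of the destination field's type, not via the multiValued attribute.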


-Hoss



RE: question: how to divide the indexing into separate domains

2007-08-11 Thread Ben Shlomo, Yatir
Thanks, Yonik!

I do have some unused fields inside the csv file, but they are not empty:
they are numeric and can be anything between 0 and 10,000.
Can I do something like
f.unused.map=*:98765 ?

yatir

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Thursday, August 09, 2007 10:41 PM
To: solr-user@lucene.apache.org
Subject: Re: question: how to divide the indexing into separate domains

Hmmm, I think you can map an empty (zero-length) value to something else via
f.foo.map=:something
but that column does currently need to be there in the CSV.

Specifying default values in a per-request basis is interesting, and
something we could perhaps support in the future.
The quickest way to index your data right now would probably be to change
the file, adding another value at the end of each line.  I think it could
even be an empty value (just add a "," at the end of each line), and then
you could map that via
f.domain.map=:98765
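Yonik's suggestion can be simulated outside Solr (a hypothetical Python sketch; the mapping logic mimics what f.domain.map=:98765 would do to the empty trailing column, not Solr's actual implementation):

```python
import csv
import io

RAW = "foo,12\nbar,7\n"  # hypothetical two-column CSV with no domain column

def add_domain_column(raw_csv):
    """Append an empty trailing column to every line -- the 'just add a ","
    at the end of each line' step."""
    return "\n".join(line + "," for line in raw_csv.splitlines()) + "\n"

def map_empty(value, mapped="98765"):
    """Mimic f.domain.map=:98765 -- replace zero-length values with the
    domain id."""
    return mapped if value == "" else value

prepared = add_domain_column(RAW)
rows = [[map_empty(v) for v in row] for row in csv.reader(io.StringIO(prepared))]
print(rows)  # -> [['foo', '12', '98765'], ['bar', '7', '98765']]
```

A different mapped value per file (one per domain) would then give each CSV its own domain id at index time.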

btw, 300M records is a lot for one Solr instance... I hope you've got
a big box with a lot of memory, and aren't too concerned with your
query latency.  Otherwise you can do some partitioning by domain.

-Yonik

On 8/9/07, Ben Shlomo, Yatir <[EMAIL PROTECTED]> wrote:
> Hi!
>
> say I have 300 csv files that I need to index.
>
> Each one holds millions of lines (each line is a few fields separated by
> commas)
>
> Each csv file represents a different domain of data (e.g. file1 is
> computers, file2 is flowers, etc.)
>
> There is no indication of the domain ID in the data inside the csv file
>
>
>
> When I search I would like to specify the id of a specific domain
>
> And I want Solr to search only in this domain - to save time and reduce
> the number of matches
>
> I need to specify during indexing - the domain id of the csv file being
> indexed
>
> How do I do it?
>
>
>
>
>
> Thanks
>
>
>
>
>
>
>
> p.s.
>
> I wish I could index like this:
>
> curl
> http://localhost:8080/solr/update/csv?stream.file=test.csv&fieldnames=field1,field2&f.domain.value=98765
>
> (where 98765 is the domain id for this specific csv file)
>
>


Re: tomcat and solr multiple instances

2007-08-11 Thread Chris Hostetter
: Message-ID:
: <[EMAIL PROTECTED]>
: In-Reply-To: <[EMAIL PROTECTED]>

http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking




-Hoss