Re: Forced Top Document

2007-10-25 Thread Chris Hostetter

: The typical use case, though, is for the featured document to be on top only
: for certain queries.  Like in an intranet where someone queries 401K or
: retirement or similar, you want to feature a document about benefits that
: would otherwise rank really low for that query.  I have not been able to make
: sorting strategies work very well.

this type of question typically falls into two use cases:
  1) "targeted ads"
  2) "sponsored results"

in the targeted ads case, the "special" matches aren't part of the normal 
flow of results, and don't fit into pagination -- they always appear at 
the top, or to the right, on every page, no matter what the sort order.  this 
kind of usage doesn't really need any special logic, it can be solved as 
easily by a second Solr hit as it can by custom request handler logic.
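A sketch of that two-hit approach (URLs, field names, and the type:featured filter are all invented for illustration):

```
# request 1: fetch the ad/featured slot separately
http://localhost:8983/solr/select?q=content:(401k retirement)&fq=type:featured&rows=3

# request 2: the normal result flow, paginated and sorted as usual
http://localhost:8983/solr/select?q=content:(401k retirement)&start=0&rows=10
```

the two result lists never mix, so pagination and sorting of the main 
results stay untouched.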

in the "sponsored results" use case, the "special" matches should appear 
in the normal flow of results as the #1 (2, 3, etc) matches, so that they 
don't appear on page #2 ... but that also means that it's extremely 
disconcerting for users if those matches are still at the top when the 
users re-sort.  if a user is looking at product listings, sorted by 
"relevancy" and the top 3 results all say they are "sponsored" that's fine 
... but if the user sorts by "price" and those 3 results are still at the 
top of the list, even though they clearly aren't the cheapest, that's just 
going to piss the user off.

in my professional opinion: don't fuck with your users.  default to 
whatever order you want, but if the user specifically requests to sort the 
results by some option, do it.

assuming you follow my professional opinion, then "boosting" docs to have 
an artificially high score will work fine.

if you absolutely *MUST* have certain docs "sorting" before others, 
regardless of which sort option the user picks, then it is still possible 
to do ... i'm hesitant to even say how, but if people insist on knowing...



always sort by score first, then by whatever field the user wants to sort 
by ... but when the user wants to sort on a specific field, move the user's 
main query input into an "fq" (so it doesn't influence the score) ... and 
use an extremely low boost MatchAllDocs query along with your "special doc 
matching query" as the main (scoring) query param.  the key being that 
even though your primary sort is on score, every doc except your special 
matches has an identical score.

(this may not be possible with dismax because it's not trivial to move 
the query into an fq; it might work if you can use "0" as the boost on 
fields in the qf so it still dictates the matches but doesn't influence 
the score enough to throw off the sort)
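a concrete sketch of that request against the standard request handler 
(doc id, field names, and boost values are invented for illustration):

```
sort=score desc, price asc
fq=content:(auto repair)              the user's real query, moved out of scoring
q=id:featured-doc-1^100 (*:*)^0.0001  special doc scores high, everything else ties
```

every non-special document gets the same tiny score from the boosted 
MatchAllDocs clause, so the secondary sort field decides their order, while 
the special doc always wins the primary score sort.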





-Hoss



Re: AW: Converting German special characters / umlaute

2007-10-25 Thread Thomas Traeger

Hi,

the SnowballPorterFilterFactory is a complete stemmer that transforms 
words to their basic form (laufen -> lauf, läufer -> lauf). One part of 
that process is replacing language-specific special characters.


So the SnowballPorterFilterFactory does what you wanted (besides other 
things). I mentioned it because it is a very good starting point when using 
Solr, especially when dealing with documents in languages other than English.


Tom

Matthias Eireiner wrote:

Dear list,

it has been some time, but here is what I did.
I had a look at Thomas Traeger's tip to use the
SnowballPorterFilterFactory, which does not actually do the job.
Its purpose is to convert regular ASCII into special characters. 


And I want it the other way around, such that all special characters are
converted to regular ASCII.
The tip of J.J. Larrea, to use the PatternReplaceFilterFactory, solved
the problem. 
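The filter entries in question look something like this (the exact patterns 
are whatever mappings you need; these two are just examples):

```
<filter class="solr.PatternReplaceFilterFactory"
        pattern="ü" replacement="ue" replace="all"/>
<filter class="solr.PatternReplaceFilterFactory"
        pattern="ä" replacement="ae" replace="all"/>
```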
 
And as Chris Hostetter noted, stored fields always return the initial
value, which made the second part of my question obsolete.

Thanks a lot for your help!

best 
Matthias




-----Original Message-----
From: Thomas Traeger [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 26, 2007 23:44

To: solr-user@lucene.apache.org
Subject: Re: Converting German special characters / umlaute


Try the SnowballPorterFilterFactory described here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You should use the German2 variant that converts ä and ae to a, ö and oe
to o, and so on. More details:
http://snowball.tartarus.org/algorithms/german2/stemmer.html

Every document in Solr can have any number of fields which might have 
the same source but have different field types and are therefore handled 
differently (stored as is, analyzed in different ways...). Use copyField 
in your schema.xml to feed your data into multiple fields. During 
searching you decide which fields you want to search on (usually the 
analyzed ones) and which you retrieve when getting the document back.
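Putting both tips together, a schema sketch (field and type names are made 
up; the German2 stemmer entry is the piece described above):

```
<fieldType name="text_de" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>

<field name="title" type="string" indexed="true" stored="true"/>
<field name="title_de" type="text_de" indexed="true" stored="false"/>
<copyField source="title" dest="title_de"/>
```

Search against title_de, and retrieve title to get the original text back.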


Tom

Matthias Eireiner wrote:
Dear list,

I have two questions regarding German special characters or umlaute.

is there an analyzer which automatically converts all German special 
characters to their dissected form, such as ü to ue and ä to 
ae, etc.?


I would also like the search to always run against the dissected data, 
but when results are returned, the initial, unmodified data should be 
returned.


Does Lucene's GermanAnalyzer do this job? I ran across it, but I could not 
figure out from the documentation whether it does the job or not.


thanks a lot in advance.

Matthias
  





  


RE: extending StandardRequestHandler gives ClassCastException

2007-10-25 Thread Haishan Chen
I am a new Solr user and wonder if anyone can help me with these questions. I
used Solr to index about two million documents and query them using the
standard request handler. I disabled all caches. I found that phrase queries
were substantially slower than ordinary queries. The statistics I collected
are as follows (querying on one field only):

  content:(auto repair)        100348 hits     47 ms   repeatable
  content:("auto repair")       61263 hits    937 ms   repeatable
  content:("auto repair"~1)    100384 hits    766 ms   repeatable

What are the factors affecting phrase query performance? How come the phrase
query ("auto repair") is almost 20 times slower than (auto repair)? I also
noticed that a phrase query with a slop is always faster than the one without
a slop. Is it a performance bottleneck of Lucene or Solr? Any help would be
much appreciated.

Andrew

Score customization

2007-10-25 Thread Victoria Kaganski
Hi!
My system uses Solr to build a searchable archive of documents. I need
to override the default scoring/similarity function because factors
additional to query relevancy have to be considered. For example, each
document has "updated on" and "source" fields, which should influence the
score too. I want the score/similarity function to be a sum of weighted
(according to configuration) query relevancy, update level, and source
rank (calculated at search time). 
What is the best way to achieve the required functionality? Should I
inherit from and override existing code, or should I build some independent
module (something like a requestHandler) and configure Solr to use it? 
If I have to override the defaults, which class should it be? In
general, what am I actually trying to customize: scoring or similarity?
Thank you very much.
Victoria 


Re: Score customization

2007-10-25 Thread Otis Gospodnetic
Victoria,

Either use FunctionQuerys, or hack around HitCollector.collect(int, float) in 
SolrIndexSearcher... and adjust the score using the additional values you 
mentioned.
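For example, with the standard request handler a function query can be mixed 
into the score via the _val_ hook (the field name and constants here are made 
up; recip/rord favor recently updated documents):

```
q=your query _val_:"recip(rord(updated_on),1,1000,1000)"
```

the function's output is added into the relevancy score, so newer documents 
get a boost without any custom Java code.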

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Victoria Kaganski <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, October 25, 2007 7:19:46 AM
Subject: Score customization


Hi!
My system uses solr to build a searchable archive of documents. I need
to override the default scoring/similarity function because the
additional to the query relevancy factors have to be considered. For
example, each document has "updated on" and "source" fields, which
should influence the score too. I want the score/similarity function be
a sum of weighted (according to configuration) query relevancy, update
level and source rank (calculated at search time). 
Which is the best way to achieve required functionality? Should I
inherit and override existing code or should I build some independent
module (something like requestHandler) and to configure Solr to use it?
 
If I have to override the defaults, which class should it be? In
general, what am I actually trying to customize: scoring or similarity?
Thank you very much.
Victoria 





Re: Forced Top Document

2007-10-25 Thread Yonik Seeley
On 10/25/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : The typical use case, though, is for the featured document to be on top only
> : for certain queries.  Like in an intranet where someone queries 401K or
> : retirement or similar, you want to feature a document about benefits that
> : would otherwise rank really low for that query.  I have not been able to make
> : sorting strategies work very well.
>
> this type of question typically falls into two use cases:
>   1) "targeted ads"
>   2) "sponsored results"
>
> in the targeted ads case, the "special" matches aren't part of the normal
> flow of results, and don't fit into pagination -- they always appear at
> the top, or to the right, on every page, no matter what the sort order.  this
> kind of usage doesn't really need any special logic, it can be solved as
> easily by a second Solr hit as it can by custom request handler logic.
>
> in the "sponsored results" use case, the "special" matches should appear
> in the normal flow of results as the #1 (2, 3, etc) matches, so that they
> don't appear on page #2 ... but that also means that it's extremely
> disconcerting for users if those matches are still at the top when the
> users re-sort.  if a user is looking at product listings, sorted by
> "relevancy" and the top 3 results all say they are "sponsored" that's fine
> ... but if the user sorts by "price" and those 3 results are still at the
> top of the list, even though they clearly aren't the cheapest, that's just
> going to piss the user off.
>
> in my professional opinion: don't fuck with your users.  default to
> whatever order you want, but if the user specifically requests to sort the
> results by some option, do it.
>
> assuming you follow my professional opinion, then "boosting" docs to have
> an artificially high score will work fine.
>
> if you absolutely *MUST* have certain docs "sorting" before others,
> regardless of which sort option the user picks, then it is still possible
> to do ... i'm hesitant to even say how, but if people insist on knowing...
>
>
>
> always sort by score first, then by whatever field the user wants to sort
> by ... but when the user wants to sort on a specific field, move the user's
> main query input into an "fq" (so it doesn't influence the score) ... and
> use an extremely low boost MatchAllDocs query along with your "special doc
> matching query" as the main (scoring) query param.  the key being that
> even though your primary sort is on score, every doc except your special
> matches has an identical score.

That sorts by relevance for your sponsored results, right?
What if you want absolute ordering based on dollars spent on that
result, for example?

> (this may not be possible with dismax because it's not trivial to move
> the query into an fq

Should be easier in trunk:

fq=foo bar
  or
fq=

-Yonik


prefix-search ignores the lowerCaseFilter

2007-10-25 Thread Max Scheffler

Hi,

I want to perform a prefix search which ignores case. To do this I 
created a fieldType called suggest.

Entries (terms) could be 'foo', 'bar'...
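A fieldType along these lines should give that behavior (the tokenizer 
choice here is a guess; the part that matters is the LowerCaseFilterFactory 
in the analyzer chain):

```
<fieldType name="suggest" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```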

A request like

http://localhost:8983/solr/select/?rows=0&facet=true&q=*:*&facet.field=suggest&facet.prefix=f

returns something like:

<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="suggest">
      <int name="foo">12</int>
    </lst>
  </lst>
</lst>

But a request like
http://localhost:8983/solr/select/?rows=0&facet=true&q=*:*&facet.field=suggest&facet.prefix=F

returns just:

<lst name="facet_counts">
  <lst name="facet_fields">
    <lst name="suggest"/>
  </lst>
</lst>

That's not what I expected, because the field definition contains a 
LowerCaseFilter.


Is it possible that the prefix-processing ignores the filters?

Max


Re: prefix-search ignores the lowerCaseFilter

2007-10-25 Thread Yonik Seeley
On 10/25/07, Max Scheffler <[EMAIL PROTECTED]> wrote:
> Is it possible that the prefix-processing ignores the filters?

Yes, it's a known limitation that we haven't worked out a fix for yet.
The issue is that you can't just run the prefix through the filters
because of things like stop words, stemming, minimum length filters,
etc.

-Yonik


Re: Forced Top Document

2007-10-25 Thread Walter Underwood
On 10/25/07 12:11 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> this type of question typically falls into two use cases:
>   1) "targeted ads"
>   2) "sponsored results"

3) Best bets (editorial results)

The query "house" should return "House, M.D." as the first hit,
but that is rather hard to achieve with relevance tuning and
synonyms. A manual fix is straightforward.

wunder



Re: multilingual list of stopwords

2007-10-25 Thread Maria Mosolova

Thank you very much Daniel!
Maria

Daniel Alheiros wrote:

If you do want more stopwords sources, there is this one too:
http://snowball.tartarus.org/algorithms/

And I would go for the language identification and then I would apply the
proper set.

Cheers,
Daniel


On 18/10/07 16:18, "Maria Mosolova" <[EMAIL PROTECTED]> wrote:

  

Thanks a lot Peter!
Maria

On 10/18/07, Binkley, Peter <[EMAIL PROTECTED]> wrote:


There's code in Nutch to identify the language of a given text:
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/analysis/lang/LanguageIdentifier.html

Peter

-Original Message-
From: Maria Mosolova [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 18, 2007 8:48 AM
To: solr-user@lucene.apache.org
Subject: Re: multilingual list of stopwords

Thanks a lot to everyone who responded. Yes, I agree that eventually we
need to use separate stopword lists for different languages.
Unfortunately the data we are trying to index at the moment does not
contain any direct country/language information and we need to create
the first version of the index quickly. It does not look like analyzing
documents to determine their language is something which could be
accomplished in a very limited timeframe. Or am I wrong here and there
are existing analyzers one could use?
Maria

On 10/18/07, Walter Underwood <[EMAIL PROTECTED]> wrote:

Also "die" in German and English. --wunder

On 10/18/07 4:16 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

One example that I'm familiar with: words "is" and "by" in English
and in Swedish. Both words are stopwords in English, but they are
content words in Swedish (ice and village, respectively). Similarly,
"till" in Swedish is a stopword (to, towards), but it's a content
word in English.




Re: Payloads for multiValued fields?

2007-10-25 Thread Alf Eaton
Alf Eaton wrote:
> Mike Klaas wrote:
>> On 24-Oct-07, at 7:10 AM, Alf Eaton wrote:
>>> Yes, I was just trying that this morning and it's an improvement, though
>>> not ideal if the field contains a lot of text (in other words it's still
>>> a suboptimal workaround).
>>>
>>> I do think it might be useful for the response to contain an element
>>> saying which fields were matched by the query, including which
>>> sub-sections of a multi-valued field were matched.
>> This isn't readily-accessible information.   Text search engines work by
>> storing a list of documents and occurrence frequency for each document
>> _per term_.  At that point, the information about the structure of the
>> document is not available.
> 
> The highlighting engine seems to know which fields were matched by the
> query though - enough to be able to use hl.requireFieldMatch to only
> return snippets from matched fields. The highlighter seems to have a
> small problem with snippets reaching across multivalued fields, but if
> that was sorted out then in theory the highlighter should be able to
> tell you which field, and which of the multiple values, was matched, no?
> 
>> Have you considered storing each section as a separate Solr Document?
> 
> I have considered this - in theory it would be easy enough to create a
> separate index just for these items, but it adds an extra lump of
> complexity to the search engine that I'd rather avoid. The workaround of
> adding a marked-up value to the indexed field, setting hl.fragsize to 0
> and parsing out the marked-up value from the highlighted fragment should
> be good enough for now.
> 

Actually this is still a problem: with hl.fragsize set to 0 the highlighter 
actually returns the whole of the multi-valued field, with all of the items 
lumped together, so there really is no way to know reliably which of the 
multiple values was matched.

Maybe it will be necessary to build a separate index after all.

alf



RE: My filters are not used

2007-10-25 Thread Norskog, Lance
This search matches up to 8000 records. Does this require a query cache of
8000 records? When is the query cache filled?

This answers a second question: the filter design is intended for small
result sets. I'm interested in selecting maybe 1/10 of a few million
records as a search limiter. Is it possible to create a similar feature
that caches low-level data areas for a query? Let's say that if a query
selects 1/10 of the document space, this means that only 40% of the
total memory area contains data for that 1/10. Is there a cheap way to
record this data? Would it be a feature like filters which records a
much lower-level data structure, like disk blocks?

Thanks,

Lance Norskog

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik
Seeley
Sent: Wednesday, October 24, 2007 8:24 PM
To: solr-user@lucene.apache.org
Subject: Re: My filters are not used

On 10/24/07, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> I am creating a filter that is never used. Here is the query sequence:
>
> q=*:*&fq=contentid:00*&start=0&rows=200
>
> q=*:*&fq=contentid:00*&start=200&rows=200
>
> q=*:*&fq=contentid:00*&start=400&rows=200
>
> q=*:*&fq=contentid:00*&start=600&rows=200
>
> q=*:*&fq=contentid:00*&start=700&rows=200
>
> Accd' to the statistics here is my filter cache usage:
>
> lookups : 1
[...]
>
> I'm completely confused. I thought this should be 1 insert, 4 lookups,

> 4 hits, and a hitratio of 100%.

Solr has a query cache too... the query cache is checked, there's a hit,
and the query process is short circuited.

-Yonik


Re: Forced Top Document

2007-10-25 Thread mark angelillo
Thanks for your thoughts, Chris. I agree with you about the user's 
experience. Snooth doesn't serve any ads/sponsored results -- the goal 
here is to make sure that the most recent document the user has acted on 
shows up on top in searches for recent activity. My aim is to forcibly 
preserve the sort order until the document can be reindexed/updated.

Since the dynamic field is too memory intensive, I'll try boosting on 
the date field -- and boosting more on the date field for the document 
that needs to be on top. If that doesn't end up working I'll just 
perform two queries and be done with it.


Mark

On Oct 25, 2007, at 3:11 AM, Chris Hostetter wrote:



: The typical use case, though, is for the featured document to be on top only
: for certain queries.  Like in an intranet where someone queries 401K or
: retirement or similar, you want to feature a document about benefits that
: would otherwise rank really low for that query.  I have not been able to make
: sorting strategies work very well.

this type of question typically falls into two use cases:
  1) "targeted ads"
  2) "sponsored results"

in the targeted ads case, the "special" matches aren't part of the normal
flow of results, and don't fit into pagination -- they always appear at
the top, or to the right, on every page, no matter what the sort order.  this
kind of usage doesn't really need any special logic, it can be solved as
easily by a second Solr hit as it can by custom request handler logic.

in the "sponsored results" use case, the "special" matches should appear
in the normal flow of results as the #1 (2, 3, etc) matches, so that they
don't appear on page #2 ... but that also means that it's extremely
disconcerting for users if those matches are still at the top when the
users re-sort.  if a user is looking at product listings, sorted by
"relevancy" and the top 3 results all say they are "sponsored" that's fine
... but if the user sorts by "price" and those 3 results are still at the
top of the list, even though they clearly aren't the cheapest, that's just
going to piss the user off.

in my professional opinion: don't fuck with your users.  default to
whatever order you want, but if the user specifically requests to sort the
results by some option, do it.

assuming you follow my professional opinion, then "boosting" docs to have
an artificially high score will work fine.

if you absolutely *MUST* have certain docs "sorting" before others,
regardless of which sort option the user picks, then it is still possible
to do ... i'm hesitant to even say how, but if people insist on knowing...

always sort by score first, then by whatever field the user wants to sort
by ... but when the user wants to sort on a specific field, move the user's
main query input into an "fq" (so it doesn't influence the score) ... and
use an extremely low boost MatchAllDocs query along with your "special doc
matching query" as the main (scoring) query param.  the key being that
even though your primary sort is on score, every doc except your special
matches has an identical score.

(this may not be possible with dismax because it's not trivial to move
the query into an fq, it might work if you can use "0" as the boost on
fields in the qf so it still dictates the matches but doesn't influence
the score enough to throw off the sort)





-Hoss



mark angelillo
snooth inc.
o: 646.723.4328
c: 484.437.9915
[EMAIL PROTECTED]
snooth -- 1.8 million ratings and counting...




indexing one document with different populated fields causes deletion of documents with other populated fields

2007-10-25 Thread Anton Valdstein
Hi,
I have 2 fields defined in the schema.xml. One is named ItalianTitle and the
other is named ItalianOrEnglishTitle_t.
I first wanted to index all the Italian titles into documents having the
Italian text stored and indexed in the ItalianTitle field, while these
documents should have the ItalianOrEnglishTitle_t field empty. I succeeded
in doing so, but then I tried to index the Italian texts into documents
having the ItalianTitle field empty and the text stored in
ItalianOrEnglishTitle_t.
This resulted in my index only having documents with ItalianOrEnglishTitle_t
populated, and the previously indexed documents having the ItalianTitle field
populated were deleted from the index.
I looked at the statistics and found that, though no delete request was
issued, the update handler did delete the number of documents I had added
with the field ItalianTitle.


Does Solr automatically check for duplicate texts in other fields and
delete documents that have the same text stored in other fields?


Re: My filters are not used

2007-10-25 Thread Yonik Seeley
On 10/25/07, Norskog, Lance <[EMAIL PROTECTED]> wrote:
> This search has up to 8000 records.

Does not compute...
Are you saying there are 8000 records matching contentid:00*?

> Does this require a query cache of
> 8000 records?

No, one query == one query cache entry.

>When is the query cache filled?

potentially on any query.

> This answers a second question: the filter design is intended for small
> search sets.

Not sure I understand that.

> I'm interested in selecting maybe 1/10 of a few million
> records as a search limiter. Is it possible to create a similar feature
> that caches low-level data areas for a query? Let's say that if a query
> selects 1/10 of the document space, this means that only 40% of the
> total memory area contains data for that 1/10. Is there a cheap way to
> record this data? Would it be a feature like filters which records a
> much lower-level data structure like disk blocks?

You lost me here too... what issue are you seeing?

-Yonik


Re: indexing one document with different populated fields causes deletion of documents with other populated fields

2007-10-25 Thread Yonik Seeley
On 10/25/07, Anton Valdstein <[EMAIL PROTECTED]> wrote:
> Does solr check automatically for duplicate texts in  other fields and
> delete documents  that have the same text stored  in  other fields?

Solr automatically overwrites (deletes old versions of) documents with
the same uniqueKey field (normally called "id").

Both Lucene and Solr lack the ability to change (or add fields to)
existing documents.
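In other words, if a second add arrives with an id already in the index, the 
old document is replaced wholesale (field names from the question; the id and 
title values are made up):

```
<add>
  <doc>
    <field name="id">42</field>
    <field name="ItalianOrEnglishTitle_t">La Dolce Vita</field>
    <!-- no ItalianTitle field here: the old document's ItalianTitle is NOT kept -->
  </doc>
</add>
```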

-Yonik


Delete index and "commit or optimize"

2007-10-25 Thread Jae Joo
Hi,

I have a 9G index and am trying to delete a couple of documents. The actual
deletion is working fine.

Here is my question:
do I have to OPTIMIZE the index after deleting, or just COMMIT it? The
original index is already optimized.

Thanks,

Jae Joo


Re: Delete index and "commit or optimize"

2007-10-25 Thread Otis Gospodnetic
You don't need to optimize an index after a deletion; just commit and you 
are done.
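For example, POSTed to the update handler (the id value is made up):

```
<delete><id>12345</id></delete>
<commit/>
```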

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Jae Joo <[EMAIL PROTECTED]>
To: solr-user 
Sent: Thursday, October 25, 2007 3:10:49 PM
Subject: Delete index and "commit or optimize"


Hi,

I have 9g index and try to delete a couple of document. The actual
 deletion
is working fine.

Here is my question.
Do I have to OPTIMIZE the index after deleting? or just COMMIT it? The
original index already optimized.

Thanks,

Jae Joo





field name synonyms

2007-10-25 Thread Maria Mosolova

Hello,

I am trying to figure out whether there is a way to specify field name
synonyms in the Solr/Lucene schema. For instance, I have a field with the
name "title" in the database and want to be able to use the queries:

title:query
t:query

to get the data from the same field.
Is there a way to do this?

Thank you in advance,
Maria


Re: field name synonyms

2007-10-25 Thread Otis Gospodnetic
I suppose you could use the copyField functionality, though that feels like 
overkill.  Why not just have a map with field name aliases in your app that 
rewrites the field names before sending the query to Solr?
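The copyField variant, for completeness (assuming "title" already exists and 
"t" reuses its field type):

```
<field name="t" type="text" indexed="true" stored="false"/>
<copyField source="title" dest="t"/>
```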

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Maria Mosolova <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, October 25, 2007 3:38:36 PM
Subject: field name synonyms


Hello,

I am trying to figure out whether there is a way to specify field names
synonyms in Solr/Lucene schema. For instance, I have a field with the
name "title" in the database and want to be able to use queries:

title:query
t:query

to get the data from the same field.
Is there a way to do this?

Thank you in advance,
Maria





Performance Recommendation

2007-10-25 Thread Wagner,Harry
Where is a good place to look for performance recommendations? We have a
2.4G index running on a server with 16G. Overall performance is very good,
but the initial sort on an index is too slow. Any idea what, if anything,
in the solrConfig would help with that?

Thanks... harry


Re: indexing one document with different populated fields causes deletion of documents with other populated fields

2007-10-25 Thread Anton Valdstein
thanks, that explains a lot (:,
I have another question about how the idf is calculated: is the document
frequency the sum of all documents containing the term in one of their
fields, or just in the field the query contained?


On 10/25/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> On 10/25/07, Anton Valdstein <[EMAIL PROTECTED]> wrote:
> > Does solr check automatically for duplicate texts in  other fields and
> > delete documents  that have the same text stored  in  other fields?
>
> Solr automatically overwrites (deletes old versions of) documents with
> the same uniqueKey field (normally called "id").
>
> Both Lucene and Solr lack the ability to change (or add fields to)
> existing documents.
>
> -Yonik
>


SOLR 1.3 Release?

2007-10-25 Thread Matthew Runo
Any ideas on when 1.3 might be released? We're starting a new project  
and I'd love to use 1.3 for it - is SVN head stable enough for use?


++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++




Re: Performance Recommendation

2007-10-25 Thread Erik Hatcher


On Oct 25, 2007, at 4:19 PM, Wagner,Harry wrote:

Where is a good place to look for some performance recommendations? We
have a 2.4G index running on server with 16G. Overall performance is
very good, but the initial sort on an index is too slow. Any idea what,
if anything, in the solrConfig would help that?


One option is to configure a warming query with sorts in  
solrconfig.xml.  Check out the newSearcher feature:
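A newSearcher listener with a sort-warming query might look like this (the 
query text and sort field are placeholders):

```
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">solr</str>
      <str name="sort">your_sort_field asc</str>
    </lst>
  </arr>
</listener>
```

running such a query on each new searcher populates the field cache used for 
sorting, so the first user query doesn't pay that cost.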

Erik



Re: SOLR 1.3 Release?

2007-10-25 Thread Yonik Seeley
On 10/25/07, Matthew Runo <[EMAIL PROTECTED]> wrote:
> Any ideas on when 1.3 might be released? We're starting a new project
> and I'd love to use 1.3 for it - is SVN head stable enough for use?

I think it's stable in the sense of "does the right thing and doesn't
crash", but IMO
isn't stable in the sense that new interfaces (internal and external)
added since 1.2 may still be changing.

Lots of new stuff going in (and has gone in), and I wouldn't expect to
see 1.3 super soon.
Just IMO of course.

-Yonik


Re: field name synonyms

2007-10-25 Thread Maria Mosolova
Thanks Otis. Yes, I can change the application; I just hoped that there 
might be a better way to handle the situation...

Maria

Otis Gospodnetic wrote:

I suppose you could use copyField functionality, though that feels like an 
overkill.  Why not just have a map with field name aliases in your app that 
rewrites the field names before sending the query to Solr?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message 
From: Maria Mosolova <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, October 25, 2007 3:38:36 PM
Subject: field name synonyms


Hello,

I am trying to figure out whether there is a way to specify field names
synonyms in Solr/Lucene schema. For instance, I have a field with the
name "title" in the database and want to be able to use queries:

title:query
t:query

to get the data from the same field.
Is there a way to do this?

Thank you in advance,
Maria



  




Re: indexing one document with different populated fields causes deletion of documents with other populated fields

2007-10-25 Thread Yonik Seeley
On 10/25/07, Anton Valdstein <[EMAIL PROTECTED]> wrote:
> thanks, that explains a lot (:,
> I have another question: about how the idf is calculated:
> is the document frequency the sum of all documents containing the term in
> one of their fields or just in the field the query contained?

idfs are field (fieldname) specific.  So it's based on the count of
documents containing that word in that field.

Things are done on the basis of "term" in Lucene, and a term consists
of the fieldname and the word.

-Yonik
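A toy illustration of the point that a Lucene term is (field, word), so document frequency is tracked per field. The documents here are made up:

```python
from collections import Counter

# Toy "index": each document maps field name -> tokenized text.
docs = [
    {"title": ["solr", "search"], "body": ["lucene", "search", "search"]},
    {"title": ["lucene"],         "body": ["solr"]},
]

# Document frequency is keyed by the full term: (field, word).
df = Counter()
for doc in docs:
    for field, tokens in doc.items():
        for word in set(tokens):   # count each document at most once
            df[(field, word)] += 1

print(df[("title", "solr")])  # 1, even though "solr" occurs in 2 documents
print(df[("body", "solr")])   # 1
```

The idf used when scoring a query clause against `title` only sees `df[("title", word)]`, not the document count across all fields.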


RE: extending StandardRequestHandler gives ClassCastException

2007-10-25 Thread Chris Hostetter

: Subject: RE: extending StandardRequestHandler gives ClassCastException

http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss



Re: SOLR 1.3 Release?

2007-10-25 Thread Ryan McKinley

Yonik Seeley wrote:

On 10/25/07, Matthew Runo <[EMAIL PROTECTED]> wrote:

Any ideas on when 1.3 might be released? We're starting a new project
and I'd love to use 1.3 for it - is SVN head stable enough for use?


I think it's stable in the sense of "does the right thing and doesn't
crash", but IMO
isn't stable in the sense that new interfaces (internal and external)
added since 1.2 may still be changing.



A lot has been added since 1.2 -- if you have the time/temperament to be 
ok with interfaces that may be in flux, it is great to have more feedback on 
how they work / how they should work.  Since 1.2, I think any bugs or 
serious problems that have arisen (not many) have been fixed within a 
day or two.  (as good as paid support!)


Note the public interfaces from 1.2 are (and will be) totally compatible 
with 1.3 - the only interface issues you may run into are if you are 
writing custom code or using new features added since 1.2




Lots of new stuff going in (and has gone in), and I wouldn't expect to
see 1.3 super soon.
Just IMO of course.



I don't think it is soon either -- there are a few big things that need 
to get in and have time to settle before locking into public APIs


ryan









Re: Performance Recommendation

2007-10-25 Thread Chris Hostetter

: Subject: Performance Recommendation
: In-Reply-To: <[EMAIL PROTECTED]>

http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss



sorting on dynamic fields - good, bad, neither?

2007-10-25 Thread Charles Hornberger
Hi --

I'm building a Solr index to replace an existing RDBMS-based system,
and I have one requirement that I'm not sure how to best satisfy.
Documents in our collection can have user-generated ratings associated
with them; these user-generated ratings are aggregated by source
(sources are basically business partners who use our public API to a)
publish content on our system, and to b) allow their users to interact
with -- i.e., rate, comment on, etc. -- content in our system). When
we query the index, it's important to be able to return documents
sorted by the aggregated ratings data for any source.

The simplest solution I could think of was to add some dynamic fields
to the schema:

  
  
  

And when I'm indexing documents, I add one field for each source from
which users have contributed ratings, e.g.:

   3.3
   10
   33
   2.8
   20
   56
   etc...
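A sketch of how such per-source dynamic rating fields might be built by an indexing script; the naming convention (`rating_avg_<source>`, `rating_count_<source>`) and the source IDs are illustrative assumptions, and would need matching `dynamicField` declarations in schema.xml:

```python
# Hypothetical convention: one avg/count pair per rating source, e.g.
# rating_avg_src42 / rating_count_src42, backed by dynamicField patterns
# like rating_avg_* (float) and rating_count_* (int) in the schema.

def rating_fields(doc_id: str, ratings_by_source: dict) -> dict:
    """Flatten per-source rating aggregates into dynamic Solr field names."""
    doc = {"id": doc_id}
    for source_id, (avg, count) in ratings_by_source.items():
        doc[f"rating_avg_{source_id}"] = avg
        doc[f"rating_count_{source_id}"] = count
    return doc

doc = rating_fields("doc1", {"src42": (3.3, 10), "src99": (2.8, 20)})
print(doc["rating_avg_src42"])  # 3.3
```

Sorting by a given source's aggregated rating is then just a query-time parameter, e.g. `sort=rating_avg_src42 desc`.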

So far this seems acceptable. Query performance seems fine when using
the dynamic fields to sort result sets; indexing performance also
seems fine*. That said, there are only 400K documents in the
collection I'm working with, and few external rating sources at the
moment (there are about a dozen, and most documents have no external
ratings data associated with them). But as these fields will be
created from user-generated data, there's nothing to stop those
numbers from ballooning.

What I'm wondering is whether any of the Solr experts on this list
would endorse this solution, or caution against it? Are there any
things I need to know before I proceed with it?

Before this obvious solution occurred to me, I was thinking I would
need to create a custom FieldType of my own, and perhaps my own
SortComparatorSource, so that I could sort records based on query-time
parameters (i.e., the ID of the source whose ratings are to be used as
the sort key). I've got a copy of LIA, and the
DistanceComparatorSource example from the start of chapter 6 seemed a
bit out of date, but it looks like it ought to serve me plenty well. But then
this message made me think that maybe that wasn't going to be quite as
easy as I'd hoped:

  http://www.nabble.com/custom-sorting-tf4521989.html#a12951515

(It also made me think that I ought to take on the project proposed
there -- i.e., "the idea of being able to specify a raw function as a
sort" -- once I've got a better handle on Solr's internals.)

Thanks in advance for any advice you can give.

-Charlie

* I'm adding about 250 docs/sec, though because of how I'm feeding
documents, it's hard to say how much of that time is spent in Solr,
and how much is spent in the Python feeding script I'm using; in any
case, 250 docs/sec is perfectly adequate for now.


Re: Payloads for multiValued fields?

2007-10-25 Thread Mike Klaas

On 24-Oct-07, at 12:39 PM, Alf Eaton wrote:


Mike Klaas wrote:

On 24-Oct-07, at 7:10 AM, Alf Eaton wrote:
Yes, I was just trying that this morning and it's an improvement, though
not ideal if the field contains a lot of text (in other words it's still
a suboptimal workaround).

I do think it might be useful for the response to contain an element
saying which fields were matched by the query, including which
sub-sections of a multi-valued field were matched.


This isn't readily-accessible information.  Text search engines work by
storing a list of documents and occurrence frequency for each document
_per term_.  At that point, the information about the structure of the
document is not available.


The highlighting engine seems to know which fields were matched by the
query though - enough to be able to use hl.requireFieldMatch to only
return snippets from matched fields. The highlighter seems to have a
small problem with snippets reaching across multivalued fields, but if
that was sorted out then in theory the highlighter should be able to
tell you which field, and which of the multiple values, was  
matched, no?


In theory, sure.  The contrib Highlighter (that Solr uses) doesn't  
work based on a Lucene stored field; it is instead fed a single  
String.  This means that Solr has to piece together all the values in  
the field to do highlighting, and in the process, the distinction  
among them is lost (or at least muted---some effort is made to keep a  
position increment gap between them).  So, it isn't trivial to return  
this data.



Have you considered storing each section as a separate Solr Document?


I have considered this - in theory it would be easy enough to create a
separate index just for these items, but it adds an extra lump of
complexity to the search engine that I'd rather avoid. The workaround of
adding a marked-up value to the indexed field, setting hl.fragsize to 0,
and parsing out the marked-up value from the highlighted fragment should
be good enough for now.


It is also important to note that the highlighter _reanalyzes_ the  
document to find the matches.  So, there is nothing stopping you from  
writing a bit of code that accomplishes exactly the same thing, and  
returns the data in a custom way.  TermVector could be used to speed  
this up further.


-Mike
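A sketch of the client-side approach Mike describes: re-analyze each value of the multivalued field yourself and report which values contain a query term. The whitespace tokenizer here is a stand-in for whatever analyzer the field actually uses:

```python
def matching_values(values, query_terms):
    """Return indexes of the multivalued-field entries containing any query term."""
    terms = {t.lower() for t in query_terms}
    hits = []
    for i, value in enumerate(values):
        # Toy analyzer: lowercase whitespace tokens, light punctuation strip.
        tokens = {tok.lower().strip(".,") for tok in value.split()}
        if terms & tokens:
            hits.append(i)
    return hits

field_values = ["Solr highlighting basics", "Payloads in Lucene", "Faceted search"]
print(matching_values(field_values, ["lucene"]))  # [1]
```

As noted above, reading the values from TermVectors instead of re-analyzing stored text would make this cheaper for large fields.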



Re: SOLR 1.3 Release?

2007-10-25 Thread Matthew Runo
I'm mostly interested in using the SOLRj library for now, and the
spellcheck handler & work on per-field updates.


I think I'll just go with 1.3 and report back if something seems broken.

++
 | Matthew Runo
 | Zappos Development
 | [EMAIL PROTECTED]
 | 702-943-7833
++


On Oct 25, 2007, at 3:29 PM, Ryan McKinley wrote:


Yonik Seeley wrote:

On 10/25/07, Matthew Runo <[EMAIL PROTECTED]> wrote:
Any ideas on when 1.3 might be released? We're starting a new project
and I'd love to use 1.3 for it - is SVN head stable enough for use?

I think it's stable in the sense of "does the right thing and doesn't
crash", but IMO
isn't stable in the sense that new interfaces (internal and external)
added since 1.2 may still be changing.



A lot has been added since 1.2 -- if you have the time/temperament  
to be ok interfaces that may be in flux, it is great to have more  
feedback on how they work/ how they should work.  Since 1.2, i  
think any bugs or serious problems that have arised (not many) have  
been fixed within a day or two.  (as good as paid support!)


Note the public interfaces from 1.2 are (and will be) totally  
compatible with 1.3 - the only interface issues you may run into  
are if you are writing custom code or using new features added  
since 1.2



Lots of new stuff going in (and has gone in), and I wouldn't expect to
see 1.3 super soon.
Just IMO of course.



I don't think it is soon either -- there are a few big things that  
need to get in and have time to settle before locking into public APIs


ryan











Re: SOLR 1.3 Release?

2007-10-25 Thread patrick o'leary




It might be good though to have an interim release, say 1.2.x, which
would simply allow patches and valuable contributions to get
added to the trunk.

Right now, there are a few items which are falling behind because the
trunk code is changing rapidly.

A 1.2.x release will give you the opportunity to start defining what
the 1.3 release will look like and include.

P

Ryan McKinley wrote:
> Yonik Seeley wrote:
>> On 10/25/07, Matthew Runo <[EMAIL PROTECTED]> wrote:
>>> Any ideas on when 1.3 might be released? We're starting a new project
>>> and I'd love to use 1.3 for it - is SVN head stable enough for use?
>>
>> I think it's stable in the sense of "does the right thing and doesn't
>> crash", but IMO
>> isn't stable in the sense that new interfaces (internal and external)
>> added since 1.2 may still be changing.
>
> A lot has been added since 1.2 -- if you have the time/temperament to
> be ok interfaces that may be in flux, it is great to have more feedback
> on how they work/ how they should work.  Since 1.2, i think any bugs or
> serious problems that have arised (not many) have been fixed within a
> day or two.  (as good as paid support!)
>
> Note the public interfaces from 1.2 are (and will be) totally
> compatible with 1.3 - the only interface issues you may run into are if
> you are writing custom code or using new features added since 1.2
>
>> Lots of new stuff going in (and has gone in), and I wouldn't expect to
>> see 1.3 super soon.
>> Just IMO of course.
>
> I don't think it is soon either -- there are a few big things that need
> to get in and have time to settle before locking into public APIs
>
> ryan


-- 
Patrick O'Leary


You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles.
 Do you understand this? 
And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that there is no cat.
  - Albert Einstein






Re: Solr and security

2007-10-25 Thread Nick Jenkin
You have to remember that Solr is search, not security; it's not
considered a great idea to have it publicly accessible. If you want a
public instance, any requests to your Solr instance should be "proxied"
by some interface between Solr and the user.

e.g.
user requests http://foobar.com/searchapi?k=foobar&userToken=123456789
and then that page will check the userToken and send the request to
solr and return the result solr gives.

-Nick
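A minimal sketch of such a proxy using only the Python standard library; the Solr URL, the token store, and the `k`/`userToken` parameter names are placeholders matching the example request above:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs, quote
from urllib.request import urlopen

SOLR_URL = "http://localhost:8983/solr/select"   # assumed internal address
VALID_TOKENS = {"123456789"}                     # placeholder token store

def is_authorized(query_string, valid_tokens=VALID_TOKENS):
    """Check the userToken query parameter against the known tokens."""
    token = parse_qs(query_string).get("userToken", [""])[0]
    return token in valid_tokens

class SearchProxy(BaseHTTPRequestHandler):
    """Accepts /searchapi?k=...&userToken=... and forwards only 'k' to Solr."""
    def do_GET(self):
        query = urlparse(self.path).query
        if not is_authorized(query):
            self.send_error(403, "invalid userToken")
            return
        k = parse_qs(query).get("k", [""])[0]
        # Forward only the whitelisted search parameter; Solr itself
        # stays bound to localhost / behind the firewall.
        with urlopen(SOLR_URL + "?q=" + quote(k)) as resp:
            body = resp.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

# To run the proxy:
# HTTPServer(("", 8080), SearchProxy).serve_forever()
```

The important property is that the end user can never reach Solr's update or admin URLs: only the parameters the proxy explicitly forwards ever arrive at Solr.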

On 10/25/07, Cool Coder <[EMAIL PROTECTED]> wrote:
> Thanks. I am trying to implement some sort of authentication mechanism in Solr. 
> This means each request will have a key which can verify whether the 
> request is authentic or not. And do you think I still need to take care of the 
> steps mentioned by you, and why?
>
>   - BR
>
> "Wagner,Harry" <[EMAIL PROTECTED]> wrote:
> One effective method is to block access to the port Solr runs on. Force
> application access to come thru the HTTP server, and let it map to the
> application server (i.e., like mod_jk does for Apache & Tomcat).
> Simple, but effective.
>
> Cheers!
> harry
>
> -Original Message-
> From: Cool Coder [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 24, 2007 12:17 PM
> To: solr-user@lucene.apache.org
> Subject: Solr and security
>
> Hi Group,
> As far as I know, to use solr, we need to deploy it as a
> server and communicate to solr using http protocol. How about its
> security? i.e. how can we ensure that it only accepts request from
> predefined set of users only. Is there any way we can specify this in
> solr or solr depends only on web server security model. I am not sure
> whether my interpretation is right?
> Your suggestion/input?
>
> - BR
>
> __
> Do You Yahoo!?
> Tired of spam? Yahoo! Mail has the best spam protection around
> http://mail.yahoo.com


Re: SOLR 1.3 Release?

2007-10-25 Thread Mike Klaas


On 25-Oct-07, at 4:50 PM, patrick o'leary wrote:


It might be good though to have an interim release say 1.2.x
Which would simply allow patches, and valuable contributions to get  
added to the trunk.


I'm not sure what you mean.  The "problem" is that there are lots of
valuable contributions in trunk that haven't gotten shaken out
sufficiently yet, not that we're holding them up.


If we did a 1.2.x, it should (imo) contain no new features, only
important bugfixes.


Right now, there are a few items which are falling behind because  
the trunk code is changing rapidly.


Again, I'm not sure what you mean by "falling behind".  We've got a
few major code restructuring changes currently in trunk--they need
time to shake out.  I would be on board with a proposal to limit
major things under consideration for 1.3 to the things we are already
close to finishing.


If you mean that there are a lot of small tweaks that the community
doesn't have access to because we haven't done a release, I'm
inclined to agree that that would be ideal.  It is more work to
maintain that kind of release schedule (it requires work on multiple
branches at once).


-Mike

Re: SOLR 1.3 Release?

2007-10-25 Thread Venkatraman S
On 10/26/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>
> If we did a 1.2.x, it shoud (imo) contain no new features, only
> important bugfixes.


I have been having a look at the trunk for quite some time now, and must say
that it's changing pretty fast. Having an interim release now would require
more work, with the committers having to put in extra effort to spruce
things up. I would personally suggest 1-Jan-2008 as the next Solr release, which
would help in sprucing up the code and also in releasing the bunch of new
features that have been lying in the trunk to be used in
*the-next-stable-release*.

-Venkat

-- 
Blog @ http://blizzardzblogs.blogspot.com


Solr Index update - specific field only

2007-10-25 Thread Jae Joo
Hi,

I have an index which has a field that is NOT stored, and I would like to update
another field which is indexed and stored.
Updating the index requires posting all fields the same as the original (before
updating) along with the updated field.
Is there any way to post "JUST THE UPDATED FIELD ONLY"?
Here is an example.
field        indexed  stored
-----------  -------  ------
item_id      yes      yes
searchable   yes      yes
price        yes      yes
title        yes      yes
description  yes      no

The way I know to update the "Searchable" field from Y to N for item_id
"12345" is to post the whole document:

<add>
<doc>
  <field name="item_id">12345</field>
  <field name="searchable">Y</field>
  <field name="price">6699</field>
  <field name="title">title sample</field>
  <field name="description">This is the detail description of item</field>
</doc>
</add>


and I am looking for a way to update the specific field by posting only:

<add>
<doc>
  <field name="item_id">12345</field>
  <field name="searchable">Y</field>
</doc>
</add>
  --> it may keep the unchanged fields.

Thanks,

Jae Joo
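Until field-level updates exist, the common client-side workaround is read-modify-write: fetch all of a document's fields, change one, and re-post the whole document. This only works when every field is stored (a non-stored field like `description` above would have to be re-read from the system of record). A sketch with hypothetical `fetch_doc`/`post_doc` helpers standing in for real Solr calls:

```python
def update_field(doc_id, field, value, fetch_doc, post_doc):
    """Read-modify-write: re-post the full document with one field changed.

    fetch_doc(doc_id) must return ALL fields (i.e. they must be stored,
    or be re-read from the system of record); post_doc re-indexes it.
    """
    doc = dict(fetch_doc(doc_id))   # copy so the caller's dict is untouched
    doc[field] = value
    post_doc(doc)
    return doc

# Toy in-memory stand-ins for the Solr round trip:
store = {"12345": {"item_id": "12345", "searchable": "Y", "price": 6699}}
updated = update_field("12345", "searchable", "N",
                       fetch_doc=store.__getitem__,
                       post_doc=lambda d: store.__setitem__(d["item_id"], d))
print(store["12345"]["searchable"])  # N
print(store["12345"]["price"])       # 6699 (unchanged)
```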


Re: SOLR 1.3 Release?

2007-10-25 Thread James liu
Where can I read about the new features in 1.3?

2007/10/26, Venkatraman S <[EMAIL PROTECTED]>:
>
> On 10/26/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> >
> > If we did a 1.2.x, it shoud (imo) contain no new features, only
> > important bugfixes.
>
>
> I have been having a look at the trunk for quite sometime now, and must
> say
> that its changing pretty fast. Having an  interim release now will require
> more work with the comitters having to put an extra effort to spruce up
> things. I would personally suggest 1-jan-2008 as the next Solr release
> which
> would help in sprucing up the code and also releasing the bunch of new
> features that have been lying in the trunk to be used in
> *the-next-stable-release*.
>
> -Venkat
>
> --
> Blog @ http://blizzardzblogs.blogspot.com
>



-- 
regards
jl


Re: SOLR 1.3 Release?

2007-10-25 Thread Pieter Berkel
On 26/10/2007, James liu <[EMAIL PROTECTED]> wrote:
>
> where i can read 1.3 new features?
>


Take a look at CHANGES.txt in the root directory of svn trunk, or also here:
http://svn.apache.org/viewvc/lucene/solr/trunk/CHANGES.txt

Piete


Re: Solr Index update - specific field only

2007-10-25 Thread Chris Hostetter

there is some work in progress on this, but it isn't ready for prime time 
yet ... you are welcome to be an early adopter and try out some of 
the patches...

https://issues.apache.org/jira/browse/SOLR-139



-Hoss



A question about solr score

2007-10-25 Thread zx zhang
Hi, everyone!
As we know, solr uses lucene scoring.
This score is the raw score. Scores returned from Hits aren't
necessarily the raw score, however. If the top-scoring document scores
greater than 1.0, all scores are normalized by that score, such that
all scores from Hits are guaranteed to be 1.0 or less.
Now my question: I always get scores for some documents which are
above 1.0, and some even get up to 10.0!
Why?
I will really appreciate your reply.
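Worth noting: Solr returns raw Lucene scores and does not apply the Hits-style normalization described above, so scores above 1.0 are expected. If normalized scores are wanted, a client can rescale against the top score; a minimal sketch:

```python
def normalize_scores(scores):
    """Scale scores so the top result is 1.0, as Lucene's old Hits class did."""
    if not scores:
        return []
    top = max(scores)
    if top <= 1.0:   # Hits only normalized when the top score exceeded 1.0
        return list(scores)
    return [s / top for s in scores]

print(normalize_scores([10.0, 4.0, 2.5]))  # [1.0, 0.4, 0.25]
```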


phrase query performance

2007-10-25 Thread Haishan Chen
I am a new Solr user and wonder if anyone can help me with these questions. I used 
Solr to index about two million documents and query them using the standard 
request handler. I disabled all caches. I found phrase queries were substantially 
slower than ordinary queries.  The statistics I collected are as follows (I was 
querying on one field only):

content:(auto repair)          47 ms   repeatable
content:("auto repair")       937 ms   repeatable
content:("auto repair"~1)     766 ms   repeatable

What are the factors affecting phrase query performance? How come the phrase query 
content:("auto repair") is almost 20 times slower than content:(auto repair)? I 
also notice that a phrase query with a slop is always faster than one 
without a slop. Is the difference I observe here a performance problem of 
Lucene or Solr?
 
It will be appreciated if anyone can help
 
Haishan
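The usual explanation (a property of Lucene's query execution, not Solr-specific): a boolean query only intersects document lists, while a phrase query must also read position lists and verify adjacency for every candidate document, which is expensive for frequent terms like "auto" and "repair". A toy model of the extra work:

```python
def boolean_and(positions, t1, t2):
    """Docs containing both terms: just intersect the doc-id sets."""
    return sorted(set(positions[t1]) & set(positions[t2]))

def phrase_match(positions, t1, t2):
    """Docs where t2 immediately follows t1: also reads and checks positions."""
    hits = []
    for doc in set(positions[t1]) & set(positions[t2]):
        p1, p2 = positions[t1][doc], positions[t2][doc]
        if any(p + 1 in p2 for p in p1):   # adjacency test, the extra per-doc work
            hits.append(doc)
    return sorted(hits)

# term -> {doc id -> list of token positions} (made-up data)
positions = {
    "auto":   {1: [0, 7], 2: [3]},
    "repair": {1: [1],    2: [9]},
}
print(boolean_and(positions, "auto", "repair"))   # [1, 2]
print(phrase_match(positions, "auto", "repair"))  # [1]
```

The sloppy variant can sometimes be faster than the exact phrase because a different matching algorithm is used once slop is allowed, but the dominant cost in all phrase cases is reading and scanning the position data.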
 

RE: extending StandardRequestHandler gives ClassCastException

2007-10-25 Thread Haishan Chen
Hi Hoss,
 
I am sorry about that. I know it was not very polite to do so. I was new to the 
community and new to mailing lists, and I was experimenting with how to start a 
discussion.
I tried starting the discussion by sending a new email to [EMAIL PROTECTED] and 
[EMAIL PROTECTED] But it didn't seem to work. I couldn't find my email 
in the mailing list archive, and I didn't know whether people were able to see 
it or not. So I tried replying to one. That seems to work, because my 
reply was shown in the mailing list archive at 
http://www.mail-archive.com/solr-user@lucene.apache.org/.

Now after reading your email I see that the address at which to start a new 
discussion should be solr-user@lucene.apache.org
 
Again I am sorry about that. 
 
Haishan
 
 



> Date: Thu, 25 Oct 2007 15:22:19 -0700
> From: [EMAIL PROTECTED]
> To: solr-user@lucene.apache.org
> Subject: RE: extending StandardRequestHandler gives ClassCastException
>
> : Subject: RE: extending StandardRequestHandler gives ClassCastException
>
> http://people.apache.org/~hossman/#threadhijack
>
> When starting a new discussion on a mailing list, please do not reply to
> an existing message, instead start a fresh email. Even if you change the
> subject line of your email, other mail headers still track which thread
> you replied to and your question is "hidden" in that thread and gets less
> attention. It makes following discussions in the mailing list archives
> particularly difficult.
> See Also: http://en.wikipedia.org/wiki/Thread_hijacking
>
> -Hoss