Re: Query for German "Special Characters" (i.e., ä, ö, ß)

2007-09-15 Thread Marc Bechler

Hi Walter,

good advice -- but you need to know the language of your material ... 
that could be hard for automated processing ;-)


I also stumbled on the "same words in different languages" problem. The 
sole solution might be the dream of an English-only documented world ;-)


Regards from good old Umlaute-Germany ;-)

 marc

Walter Underwood wrote:

You could index into multiple fields with different analyzers
and search all of them.

text_en: uses English stemmer
text_de: uses German stemmer
text_exact: no stemming
text_strip: uses ISOLatin1AccentFilter

You can search all of these and put different boosts on them,
with higher boosts for the more exact matches.
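
A minimal sketch of how such a query might be assembled, assuming the four field names listed above. The boost values and the helper name boosted_query are hypothetical (the thread gives no numbers); field:term^boost is ordinary Lucene query syntax.

```python
# Hypothetical boosts: the more exact the field, the higher the boost.
# The numbers are illustrative only and would need tuning on real data.
FIELD_BOOSTS = {
    "text_exact": 4.0,
    "text_strip": 2.0,
    "text_de": 1.5,
    "text_en": 1.0,
}


def boosted_query(term, boosts=FIELD_BOOSTS):
    """Build a Lucene-syntax query that searches every field with its own boost.

    The term is inserted as-is; escaping of query syntax characters is omitted.
    """
    clauses = [f"{field}:{term}^{boost}" for field, boost in boosts.items()]
    return " OR ".join(clauses)


print(boosted_query("über"))
```

A document that matches in text_exact then contributes more to the score than one that only matches after stemming or accent stripping.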

I don't know if any of these handle "typewriter umlauts", like
"ueber" for "über".

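If none of the existing filters handle typewriter umlauts, one option might be to fold umlauts to their two-letter spellings before both indexing and querying. This is only a sketch of the mapping itself, written outside Solr; the function name to_typewriter is made up for illustration.

```python
# Map each umlaut and the sharp s to its conventional "typewriter" spelling.
TYPEWRITER = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def to_typewriter(text):
    """Fold umlauts so that 'über' and 'ueber' end up as the same term."""
    return text.translate(TYPEWRITER)


print(to_typewriter("über"), to_typewriter("kraßen"))
```

Folding in this direction is the safer one: going from "ue" back to "ü" would wrongly rewrite words such as "Feuer".
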
The German Porter stemmer probably does not break compound words,
like "Feuerwehrmannschaft" into "Feuerwehr" and "Mannschaft"
(but not further). That can cause missed matches.

You can put these in synonyms.txt, but that could be a lot
of work.
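
To picture what a decompounding step would have to do, here is a toy dictionary-based splitter. It is an assumption-laden sketch, not something Solr or the stemmer provides: the vocabulary is a hand-made set, only two-part splits are tried, and German linking elements (such as the "s" in "Arbeitszimmer") are ignored.

```python
# Hypothetical, hand-made vocabulary; a real one would come from a word list.
VOCABULARY = {"feuerwehr", "mannschaft"}


def split_compound(word, vocabulary=VOCABULARY):
    """Split a word into two known parts if possible, else return it whole."""
    lower = word.lower()
    # Require at least three letters on each side of the split point.
    for i in range(3, len(lower) - 2):
        head, tail = lower[:i], lower[i:]
        if head in vocabulary and tail in vocabulary:
            return [head, tail]
    return [lower]


print(split_compound("Feuerwehrmannschaft"))
```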

One problem that I have seen in cross-language searching is
strings that appear in both languages. For example, "die" is
common in German but rare in English, so it will have a higher
IDF when matched against English and the English hits will
score higher. Same for "mit". In English, that is the Massachusetts
Institute of Technology.
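
The effect can be seen with the textbook IDF formula, log(N / df). The document counts below are made up for illustration, and Lucene's actual formula differs in detail, so treat this as a sketch of the trend only.

```python
import math


def idf(total_docs, docs_with_term):
    """Textbook inverse document frequency: rarer terms get larger values."""
    return math.log(total_docs / docs_with_term)


# Made-up counts: "die" appears in most German documents but few English ones.
german_idf = idf(total_docs=1000, docs_with_term=900)
english_idf = idf(total_docs=1000, docs_with_term=10)

print(f"German: {german_idf:.2f}, English: {english_idf:.2f}")
```

The much larger English value is why the rare English hits for "die" can outrank the German ones.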

wunder
==
Walter Underwood
Search Guy, Netflix

On 9/14/07 2:09 PM, "Marc Bechler" <[EMAIL PROTECTED]> wrote:


Hi Tom,

thanks for your professional response -- works fine and looks good :-).
Since I am playing around with mixed texts (English and German), I do
not have any idea whether or not an EnglishPorter will be useful for
German texts. But I will find out by playing around ;-)

Regards from Germany,

  marc



Tom Hill wrote:

Hi Marc,

The searches are going to look for an exact match of the query (after
analysis) in the index (after analysis).

So, realli will not match really.

So you want to have the same stemmer (probably not the English one, given
your examples) in both the index analyzer and the query analyzer. I've
appended the section from the Solr 1.2 example schema.xml; note that
EnglishPorterFilterFactory is in both sections. That is what you want
to do, with the appropriate stemmer for your application.

Or, you could use no stemmer for BOTH, but I think most people go with
stemming. At least, I do. :-)
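
The mismatch can be reproduced outside Solr with a toy model. The sketch below is an assumption, not Solr code: toy_stem only imitates the one Porter rule that turns a final "y" into "i", and analyze stands in for an analyzer chain.

```python
def toy_stem(token):
    """Toy stand-in for a stemmer: only rewrites a final 'y' to 'i'."""
    if len(token) > 2 and token.endswith("y"):
        return token[:-1] + "i"
    return token


def analyze(text, stem):
    """Lowercase, split on whitespace, and optionally stem each token."""
    tokens = text.lower().split()
    return [toy_stem(t) if stem else t for t in tokens]


def matches(doc, query, stem_index, stem_query):
    """A query matches only if every analyzed query term is in the index."""
    indexed = set(analyze(doc, stem_index))
    return all(term in indexed for term in analyze(query, stem_query))


print(matches("This is really", "really", stem_index=False, stem_query=True))
print(matches("This is really", "really", stem_index=True, stem_query=True))
```

With stemming on the query side only, the query term becomes "realli" while the index still holds "really", so nothing matches; with the same analysis on both sides the match comes back.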

Tom

On 9/14/07, Marc Bechler <[EMAIL PROTECTED]> wrote:

Index for "really": 5* really. Query for "really": 5* really, 2* realli
(from: EnglishPorterFilterFactory {protected=protwords.txt},
RemoveDuplicatesTokenFilterFactory {})

For "this" everything is completely fine.

Is a complete match required between index and query, or is a partial
match also okay?

Thanks for helping me

  marc




Tom Hill wrote:

Hi Marc,

Are you using the same stemmer on your queries that you use when
indexing?

Try the analysis function in the admin UI to see how things are stemmed
for indexing vs. querying. If they don't match for really and fünny, and do
match for kraßen, then that's your problem.

Tom


On 9/14/07, Marc Bechler <[EMAIL PROTECTED]> wrote:

Hi,

oops, the URIEncoding was lost during the update to tomcat 6.0.14.
Thanks for the advice.

But now I am really curious. After indexing the document from scratch,
I see that queries for "this" and "is" work fine, whereas
queries for "really" and "fünny" do not return the result. Fünnily ;-) ,
after extending my sometext to "This is really fünny kraßen.", queries
for "really" and "fünny" still do not work, but "kraßen" is found.
Now I am somewhat confused -- hopefully someone has a good explanation ;-)

Regards,

  marc


Tom Hill wrote:

If you are using Tomcat, try adding URIEncoding="UTF-8" to your
Tomcat connector.
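
Why this matters can be shown with the standard library alone. The sketch below assumes the container falls back to ISO-8859-1 when no URIEncoding is configured (the default in Tomcat versions of that era); it only demonstrates the decoding step, not Tomcat itself.

```python
from urllib.parse import quote, unquote

# The browser percent-encodes the query term as UTF-8 bytes.
encoded = quote("fünny")

# Decoded as UTF-8 (what URIEncoding="UTF-8" asks for), the term survives.
as_utf8 = unquote(encoded, encoding="utf-8")

# Decoded as ISO-8859-1, the two UTF-8 bytes become two wrong characters.
as_latin1 = unquote(encoded, encoding="latin-1")

print(encoded, as_utf8, as_latin1)
```

The mis-decoded term no longer equals the indexed token, which would explain why the document is not found even though the URL itself is correct.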



Use the analysis page of the admin interface to check what's
happening to your queries, too.

http://localhost:8080/solr/admin/analysis.jsp?highlight=on  (your
port # may vary)

Tom

On 9/13/07, Marc Bechler < [EMAIL PROTECTED]> wrote:

Hi SOLR kings,

I'm just playing around with queries, but I was not able to query
for any special characters like the German "Umlaute" ( i.e., ä, ö,
ü). Maybe others might have the same effects and already found a
solution ;-)

Here is my example: I have one field called "sometext" of type
"text" (the one delivered with the SOLR example). I indexed a few
words similar to

 

Works fine, and searching for "really" shows the result and fünny
will be displayed correctly. However, the query for "fünny" using
the /solr/admin page is resolved (correctly) to the URL
...q=f%C3%BCnny... but does not find the document.

And now the question: Any ideas? ;-)

Cheers,

marc





Wiki mistake in using 'curl'

2007-09-15 Thread Lance Norskog
In the wiki are various examples of using 'curl' to post data.  Curl
requires "-X POST" arguments to do this. The examples do not have this.
Also the nice way to post a file to 'curl' is with '-T filename'.
 
Will someone with superpowers please fix?
 
Thanks,
 
Lance Norskog
 


Re: Wiki mistake in using 'curl'

2007-09-15 Thread Yonik Seeley
On 9/15/07, Lance Norskog <[EMAIL PROTECTED]> wrote:
> In the wiki are various examples of using 'curl' to post data.  Curl
> requires "-X POST" arguments to do this.

Giving parameters such as -d or --data-binary seems to make curl use POST.

However, there may be some older curl examples somewhere on the wiki
that don't work (perhaps because they don't supply the correct
content-type).

> Also the nice way to post a file to 'curl' is with '-T filename'.

I just tried this (as opposed to the @filename), and something is a
bit weird/different.  Testing with netcat, it seems to pause a second
or two between the headers and the file being displayed.

> Will someone with superpowers please fix?

The wikis can be edited by anyone... just create yourself an account.

-Yonik


Combining Proximity & Range search

2007-09-15 Thread Bharani

Hi,

Is it possible to combine proximity search with a range in a single query?
My document will have a multivalued compound field like

revision_01012007
review_02012007

I am thinking of a query like comp:"type:review date:[02012007 TO
02282007]"~0
where type and date are fields populated by copyField, extracted from the compound
comp field.

Is this possible with solr? 
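
To make the intended semantics concrete, here is a sketch outside Solr of the match being asked for: a document should qualify only if a single compound value has both the wanted type and a date in range. It assumes the dates are MMDDYYYY, which the example values suggest but the post does not state; the function names are hypothetical.

```python
from datetime import date, datetime


def parse_compound(value):
    """Split 'review_02012007' into ('review', date(2007, 2, 1))."""
    kind, raw = value.split("_", 1)
    return kind, datetime.strptime(raw, "%m%d%Y").date()  # assumes MMDDYYYY


def has_match(values, kind, start, end):
    """True only if ONE value has the wanted type and a date in [start, end]."""
    return any(
        k == kind and start <= d <= end
        for k, d in map(parse_compound, values)
    )


doc = ["revision_01012007", "review_02012007"]
print(has_match(doc, "review", date(2007, 2, 1), date(2007, 2, 28)))
print(has_match(doc, "revision", date(2007, 2, 1), date(2007, 2, 28)))
```

The second call is the case that separate type and date fields cannot rule out on their own: the document has a revision and has a February date, but not in the same value.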

Thanks
Bharani



'suggest' query sorting

2007-09-15 Thread Ryan McKinley

Hello-

I'm building an interface where I need to display matching options as a 
user types into a search box.  Something like google suggest, but it 
needs to be a little more flexible in its matches.


At first glance, I thought I just needed to write a filter that chunks 
each token into a set of prefixes.  Check SOLR-357 -- As Hoss points 
out, I may just be able to use the EdgeNGramFilterFactory.
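
For readers unfamiliar with the idea, this is roughly what chunking each token into prefixes means. The sketch is plain Python rather than the Solr filter, and the size limits are arbitrary assumptions.

```python
def edge_ngrams(token, min_size=1, max_size=10):
    """Return the prefixes of a token, from min_size up to max_size characters."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]


def prefix_terms(value):
    """Lowercase a value, split it on whitespace, and expand every token."""
    terms = []
    for token in value.lower().split():
        terms.extend(edge_ngrams(token))
    return terms


print(edge_ngrams("canon"))
print(prefix_terms("iPod Cable"))
```

Indexing these prefixes lets a partial query such as 'ca' match as an ordinary exact term.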


I have the basics working, but need some help getting the details to 
behave properly.


Consider the strings:
 Canon PowerShot
 iPod Cable
 Canon EX PIXMA
 Video Card

If I query for 'ca' I expect to get all these back.  This works fine; 
what I need help with is ordering.


How can I boost matches that are closer to the front of the whole value 
(not just the front of the token)?  That is, I want 'ca' to return:

 1. Canon PowerShot
 2. Canon EX PIXMA
 3. iPod Cable
 4. Video Card
(actually 1&2 could be swapped)

After that works, how do I boost tokens that are closer together?  If I 
search for 'canon p', how can I make sure the results are returned as:

 1. Canon PowerShot
 2. Canon EX PIXMA
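
The two orderings asked for above can be written down as a small ranking function. This is only a sketch of the desired behaviour in plain Python, not a Solr scoring configuration; the names rank_key and suggest are made up.

```python
def rank_key(value, query):
    """Return (earliest match position, spread of matches), or None if no match."""
    tokens = value.lower().split()
    positions = []
    for prefix in query.lower().split():
        hits = [i for i, tok in enumerate(tokens) if tok.startswith(prefix)]
        if not hits:
            return None  # every query prefix must match some token
        positions.append(hits[0])
    if not positions:
        return None  # empty query matches nothing
    return (min(positions), max(positions) - min(positions))


def suggest(values, query):
    """Matches nearest the front first, then matches whose tokens are closer."""
    keyed = [(rank_key(v, query), v) for v in values]
    matches = [(k, v) for k, v in keyed if k is not None]
    matches.sort(key=lambda pair: pair[0])  # stable: ties keep input order
    return [v for _, v in matches]


VALUES = ["Canon PowerShot", "iPod Cable", "Canon EX PIXMA", "Video Card"]
print(suggest(VALUES, "ca"))
print(suggest(VALUES, "canon p"))
```

Both keys sort ascending, so a smaller token position and a smaller distance between matched tokens rank higher.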


thanks
ryan