Re: Sorting TEXT Field problems :-(

2008-10-28 Thread Thomas Traeger

Kraus, Ralf | pixelhouse GmbH wrote:

Hello,

Query:
{wt=json&rows=30&json.nl=map&start=0&sort=RezeptName+asc}

Result :

Doppeldecker
Eiersalat
Curry - Eiersalat
Eiersalat

Why is my second "Curry..." after "Doppeldecker" ???
RezeptName is a normal "text" field defined as:

   [fieldType definition mangled by the mail archive; the surviving fragments show positionIncrementGap="100" and, in both the index and query analyzer chains, a filter with language="German".]


Greets -Ralf-


Hi,

normally you would define at least one dedicated field for sorting: 
http://wiki.apache.org/solr/CommonQueryParameters#head-9f40612b42721ed9e1979a4a80d68f4f8524e9b4

You have to use a single-valued, indexed, but untokenized field (or use a tokenizer that produces only one token).


You might also look at the "alphaOnlySort" field type in the example schema.
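
A minimal sketch of such a setup, modeled on the stock example schema (the sort field name is only illustrative, not from your schema):

   <fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
     <analyzer>
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.TrimFilterFactory"/>
     </analyzer>
   </fieldType>

   <field name="RezeptNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
   <copyField source="RezeptName" dest="RezeptNameSort"/>

Then sort on that field instead: sort=RezeptNameSort+asc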

Tom


Re: Customizing SOLR-236 field collapsing

2009-05-21 Thread Thomas Traeger
Is adding QueryComponent to your SearchComponents an option? Combined 
with the CollapseComponent, this approach would return both the 
collapsed and the complete result set.


i.e. (the components list of your request handler in solrconfig.xml):

   <arr name="components">
     <str>collapse</str>
     <str>query</str>
     <str>facet</str>
     <str>mlt</str>
     <str>highlight</str>
   </arr>

Thomas

Marc Sturlese wrote:

Hey there,
I have been testing the latest adjacent field collapsing patch in trunk and
it seems to work perfectly. I am trying to modify how it works but don't
know exactly how to do it. What I would like to do is, instead of collapsing
the results, send them to the end of the result queue.
Apparently that is not possible due to the way it is implemented. I have
noticed that in CollapseComponent.java you get a DocSet of the ids that
"survived" the collapsing and that match the query and filters
(collapseFilterDocSet = collapseFilter.getDocSet();).
Once that is done, the search is executed again, this time with the DocSet
obtained before passed as a filter:

DocListAndSet results = searcher.getDocListAndSet(rb.getQuery(),
        collapseFilterDocSet == null ? rb.getFilters() : null,
        collapseFilterDocSet,
        rb.getSortSpec().getSort(),
        rb.getSortSpec().getOffset(),
        rb.getSortSpec().getCount(),
        rb.getFieldFlags());

The result of this search will give you the final result (with the correct
offset and start).
I have thought about saving the collapsed docs in another DocSet and doing
something with them afterwards... but I don't know how to manage it.
Any clue about how I could reach this goal?
Thanks in advance




Re: grouping response docs together

2009-05-25 Thread Thomas Traeger

Hello Matt,

the patch should work with trunk, and after a small fix with 1.3 too (see
my comment in SOLR-236). I just made a successful build to be sure.
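
For reference, roughly the steps I used (repository path and build target from memory, so treat this as a sketch):

   svn co http://svn.apache.org/repos/asf/lucene/solr/trunk solr-trunk
   cd solr-trunk
   patch -p0 < SOLR-236.patch
   ant dist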

Do you see any error messages?

Thomas

Matt Mitchell wrote:

Thanks guys. I looked at the dedup stuff, but the documents I'm adding
aren't really duplicates. They're very similar, but different.

I checked out the field collapsing feature patch, applied the patch but
can't get it to build successfully. Will this patch work with a nightly
build?

Thanks!

On Fri, May 15, 2009 at 7:47 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:


Matt - you may also want to detect near duplicates at index time:

http://wiki.apache.org/solr/Deduplication

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message -----

From: Matt Mitchell 
To: solr-user@lucene.apache.org
Sent: Friday, May 15, 2009 6:52:48 PM
Subject: grouping response docs together

Is there a built-in mechanism for grouping similar documents together in
the response? I'd like to make it look like there is only one document with
multiple "hits".

Matt








Re: Search combination?

2009-06-02 Thread Thomas Traeger

I assume you are using the StandardRequestHandler, so this should work:

http://192.168.105.54:8983/solr/itas?q=size:7* AND extension:pdf

Also have a look at the following links:

http://wiki.apache.org/solr/SolrQuerySyntax
http://lucene.apache.org/java/2_4_1/queryparsersyntax.html
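
If the extension constraint is really just a filter, a filter query is also worth considering (fq is a standard Solr parameter, and filter results are cached separately); for example:

http://192.168.105.54:8983/solr/itas?q=size:7*&fq=extension:pdf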

Thomas

Jörg Agatz wrote:

Hi users,

I have a problem.

I want to search for:

http://192.168.105.54:8983/solr/itas?q=size:7*&extension:db

I mean, I search for all documents that have size 7* and extension:pdf,

but it doesn't work.
I get some other files, with extension doc or db.
What is happening here?

Jörg




Re: Converting German special characters / umlaute

2007-09-26 Thread Thomas Traeger

Try the SnowballPorterFilterFactory described here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You should use the German2 variant that converts ä and ae to a, ö and oe 
to o and so on. More details:

http://snowball.tartarus.org/algorithms/german2/stemmer.html

Every document in solr can have any number of fields which might have 
the same source but different field types and are therefore handled 
differently (stored as is, analyzed in different ways...). Use copyField 
in your schema.xml to feed your data into multiple fields. During 
searching you decide which fields you'd like to search on (usually the 
analyzed ones) and which you retrieve when getting the document back.
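
A minimal sketch of this pattern (field and type names are only illustrative):

   <field name="name" type="string" indexed="false" stored="true"/>
   <field name="name_analyzed" type="text" indexed="true" stored="false"/>
   <copyField source="name" dest="name_analyzed"/>

You would then search on name_analyzed but display the stored name.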


Tom

Matthias Eireiner wrote:

Dear list,

I have two questions regarding German special characters or umlaute.

is there an analyzer which automatically converts all german special
characters to their specific dissected form, such as ü to ue and ä to
ae, etc.?

I would also like the search to always run against the dissected data,
but when the results are returned, the initial, unmodified data should
be returned.

Does the Lucene GermanAnalyzer do this job? I ran across it, but I could
not figure out from the documentation whether it does the job or not.

thanks a lot in advance.

Matthias
  


Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-10 Thread Thomas Traeger

in short: use stemming

Try the SnowballPorterFilterFactory with German2 as the language attribute 
first and use synonyms for compound words, e.g. "Herrenhose" => "Herren", 
"Hose".


By using stemming you will maybe get some "interesting" results, but it 
is much better to live with them than to have no or far fewer results ;o)


Find more info on the Snowball stemming algorithms here:

http://snowball.tartarus.org/

Also have a look at the StopFilterFactory; here is a sample stopword list 
for the German language:


http://snowball.tartarus.org/algorithms/german/stop.txt

Good luck,

Tom


Martin Grotzke wrote:

Hello,

with our application we have the issue, that we get different
results for singular and plural searches (german language).

E.g. for "hose" we get 1,000 documents back, but for "hosen"
we get 10,000 docs. The same applies to "t-shirt" vs. "t-shirts",
or e.g. "hut" and "hüte" - lots of cases :)

This is absolutely correct according to the schema.xml, as right
now we do not have any stemming or synonyms included.

Now we want to have similar search results for these singular/plural
searches. I'm thinking of a solution for this, and want to ask, what
are your experiences with this.

Basically I see two options: stemming and the usage of synonyms. Are
there others?

My concern with stemming is that it might produce unexpected results,
so that docs are found that do not match the query from the user's point
of view. I assume that this needs a lot of testing with different data.

The issue with synonyms is that we would have to create a file
containing all synonyms, so we would have to figure out all cases, in
contrast to a solution that is based on an algorithm.
The advantage of the synonym approach is, IMHO, that it is very
predictable which results will be returned for a certain query.

Some background information:
Our documents contain products (id, name, brand, category, producttype,
description, color etc). The singular/plural issue basically applies to
the fields name, category and producttype, so we would like to restrict
the solution to these fields.

Do you have suggestions how to handle this?

Thanx in advance for sharing your experiences,
cheers,
Martin


  


Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-11 Thread Thomas Traeger

Martin Grotzke wrote:
Try the SnowballPorterFilterFactory with German2 as the language attribute 
first and use synonyms for compound words, e.g. "Herrenhose" => "Herren", 
"Hose".


so you use a combined approach?
  
Yes, we define the relevant parts of compound words (keywords only) as 
synonyms and feed them into a special field that is used for searching and 
for the product index. I hope there will be a filter that can split 
compound words sometime in the future...
By using stemming you will maybe get some "interesting" results, but it 
is much better to live with them than to have no or far fewer results ;o)


Do you have an example of what "interesting" results I can expect, just to
get an idea?
  

Find more info on the Snowball stemming algorithms here:

http://snowball.tartarus.org/


Thanx! I had already looked at this site, but what is missing is a
demo where one can see what's happening. I think I'll play a little with
stemming to get a feeling for it.
  
I think the Snowball stemmer is very good so I have no practical example 
for you. Maybe this is of value to see what happens:


http://snowball.tartarus.org/algorithms/german/diffs.txt

If you have mixed languages in your content, which sometimes happens in 
product data, you might get into some trouble.


Also have a look at the StopFilterFactory; here is a sample stopword list 
for the German language:


http://snowball.tartarus.org/algorithms/german/stop.txt


Our application handles products; do you think such stopwords are useful
in this scenario too? I wouldn't expect a user to search for "keine
hose" or something like this :)
  

I have seen much worse queries, so you never know ;o)

Think of a query like this: "Hose in blau für Herren"

You will definitely want to remove "in" and "für" during searching, and it 
reduces index size when they are removed during indexing. Maybe you will 
even get better scores when only relevant terms are used. You should 
optimize the stopword list based on your data.
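
A minimal sketch (this assumes you save the snowball stop list as stopwords_de.txt; the file name is just an example):

   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_de.txt"/>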


Regards,

Tom



Re: AW: Converting German special characters / umlaute

2007-10-25 Thread Thomas Traeger

Hi,

the SnowballPorterFilterFactory is a complete stemmer that transforms 
words to their base form (laufen -> lauf, läufer -> lauf). One part of 
that process is replacing language-specific special characters.


So the SnowballPorterFilterFactory does what you wanted (among other 
things). I mentioned it because it is a very good start when using solr, 
especially when dealing with documents in languages other than English.


Tom

Matthias Eireiner wrote:

Dear list,

it has been some time, but here is what I did.
I had a look at Thomas Traeger's tip to use the
SnowballPorterFilterFactory, which does not actually do the job.
Its purpose is to convert regular ASCII into special characters. 


And I want it the other way around, such that all special characters are
converted to regular ASCII.
The tip from J.J. Larrea, to use the PatternReplaceFilterFactory, solved
the problem.
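
For illustration, roughly the filters that did it, one per character (the exact pattern strings are from memory, so treat this as a sketch):

   <filter class="solr.PatternReplaceFilterFactory" pattern="ä" replacement="ae" replace="all"/>
   <filter class="solr.PatternReplaceFilterFactory" pattern="ö" replacement="oe" replace="all"/>
   <filter class="solr.PatternReplaceFilterFactory" pattern="ü" replacement="ue" replace="all"/>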
 
And as Chris Hostetter noted, stored fields always return the initial
value, which rendered the second part of my question obsolete.

Thanks a lot for your help!

best 
Matthias




-----Original Message-----
From: Thomas Traeger [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 26, 2007 23:44

To: solr-user@lucene.apache.org
Subject: Re: Converting German special characters / umlaute


Try the SnowballPorterFilterFactory described here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

You should use the German2 variant that converts ä and ae to a, ö and oe
to o and so on. More details:
http://snowball.tartarus.org/algorithms/german2/stemmer.html

Every document in solr can have any number of fields which might have 
the same source but different field types and are therefore handled 
differently (stored as is, analyzed in different ways...). Use copyField 
in your schema.xml to feed your data into multiple fields. During 
searching you decide which fields you'd like to search on (usually the 
analyzed ones) and which you retrieve when getting the document back.


Tom

Matthias Eireiner wrote:
  

Dear list,

I have two questions regarding German special characters or umlaute.

is there an analyzer which automatically converts all german special 
characters to their specific dissected form, such as ü to ue and ä to 
ae, etc.?

I would also like the search to always run against the dissected data, 
but when the results are returned, the initial, unmodified data should 
be returned.

Does the Lucene GermanAnalyzer do this job? I ran across it, but I could 
not figure out from the documentation whether it does the job or not.


thanks a lot in advance.

Matthias
  





  


Re: All facet.fields for a given facet.query?

2007-06-19 Thread Thomas Traeger

Hi,

I'm also just at the point where I think I need a wildcard facet.field 
parameter (or someone points out another solution for my problem...). 
Here is my situation:

I have many products of different types with totally different 
attributes; there are currently more than 300 attributes. 
I use dynamic fields to import the attributes into solr without having 
to define a specific field for each attribute. Now when I make a query I 
would like to get back all facet.fields that are relevant for that query.


I think it would be really nice if I didn't have to know which facet 
fields exist at query time; instead I could just import attributes into 
dynamic fields, get the relevant facets back, and decide in the frontend 
which to display and how...


What do the experts think about this?

Tom


Re: All facet.fields for a given facet.query?

2007-06-20 Thread Thomas Traeger

first: sorry for the bad quoting, I found your message in the archive only...


I have many products of different types with totally different
attributes. There are currently more than 300 attributes
I use dynamic fields to import the attributes into solr without having
to define a specific field for each attribute. Now when I make a query I
would like to get back all facet.fields that are relevant for that query.
I think it would be really nice, if I don't have to know which facets
fields are there at query time, instead just import attributes into



The problem is there may be lots of fields you index but don't want to
facet on (full text search fields) and Solr has no easy way of knowing the
difference between those and the fields you think it makes sense to facet
on ... even if a field does make sense to facet on some of the time, that
doesn't mean it makes sense all of the time (as you say, "when I make a
query I would like to get back all facet.fields that are relevant for that
query") ... Solr has no way of knowing which fields make sense for that
query unless it tries them all (can be very expensive) or you tell it.
I solve this problem by having metadata stored in my index which tells
my custom request handler what fields to facet on for each category ...
but I've also got several thousand categories.  If you've got fewer than
100 categories, you could easily enumerate them all with default
facet.field params in your solrconfig using separate request handler
instances.



What do the experts think about this?



you may want to read up on the past discussion of this in SOLR-247 ... in
particular note the link to the mail archive where there was additional
discussion about it as well.  Where we left things is that it
might make sense to support true globbing in both fl and facet.field, so
you can use naming conventions and say things like facet.field=facet_*,
but that in general trying to do something like facet.field=* would be a
very bad idea even if it was supported.
http://issues.apache.org/jira/browse/SOLR-247



to make it clear, I agree that it doesn't make sense to facet on all available 
fields; I only want faceting on those 300 attributes that are stored together 
with the fields for full text search. A product/document typically has only 
5-10 attributes.

I'd like to decide at index time which attributes of a product might be of 
interest for faceting and store those in dynamic fields, using the attribute 
name and some kind of prefix or suffix to identify them at query time as 
facet.fields. Exactly the naming convention you mentioned.
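
A minimal sketch of that convention (names are only illustrative):

   <dynamicField name="facet_*" type="string" indexed="true" stored="true"/>

A product attribute "color" then becomes facet_color at index time and is requested via facet.field=facet_color at query time.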

I will have a closer look at SOLR-247 and the supplied patch; it seems like a 
good starting point to dig deeper into solr... :o)

Tom




Re: All facet.fields for a given facet.query?

2007-06-20 Thread Thomas Traeger

Martin Grotzke wrote:

On Tue, 2007-06-19 at 19:16 +0200, Thomas Traeger wrote:
  

Hi,

I'm also just at the point where I think I need a wildcard facet.field 
parameter (or someone points out another solution for my problem...). 
Here is my situation:


I have many products of different types with totally different 
attributes; there are currently more than 300 attributes. 
I use dynamic fields to import the attributes into solr without having 
to define a specific field for each attribute. Now when I make a query I 
would like to get back all facet.fields that are relevant for that query.


I think it would be really nice if I didn't have to know which facet 
fields exist at query time; instead I could just import attributes into 
dynamic fields, get the relevant facets back, and decide in the frontend 
which to display and how...


Do you really need all facets in the frontend?
  

no, only the subset with matches for the current query.

Would it be a solution to have a facet ranking in the field definitions,
and then decide at query time which fields to facet on? This would
need an additional query parameter like facet.query.count.

E.g. if you have a query with q=foo+AND+prop1:bar+AND+prop2:baz
and you have fields
prop1 with facet-ranking 100
prop2 with facet-ranking 90
prop3 with facet-ranking 80
prop4 with facet-ranking 70
prop5 with facet-ranking 60

then you might decide not to facet on prop1 and prop2, as you already
have constraints on them, but to facet on prop3 and prop4 if
facet.query.count is 2.

Just thinking about that... :)

Cheers,
Martin

  
One step after the other ;o) - the ranking of the facets will be another 
problem I have to solve; counts of facets and matching documents will be 
a starting point. Another idea is to use the score of the documents 
returned by the query to compute a score for the facet.field...


Tom


Re: All facet.fields for a given facet.query?

2007-06-20 Thread Thomas Traeger

Chris Hostetter wrote:

: to make it clear, i agree that it doesn't make sense faceting on all
: available fields, I only want faceting on those 300 attributes that are
: stored together with the fields for full text searches. A
: product/document has typically only 5-10 attributes.
:
: I like to decide at index time which attributes of a product might be of
: interest for faceting and store those in dynamic fields with the
: attribute-name and some kind of prefix or suffix to identify them at
: query time as facet.fields. Exactly the naming convention you mentioned.

but if the facet fields are different for every document, and they use a
simple dynamicField prefix (like "facet_*" for example), how do you know at
query time which fields to facet on? ... even if wildcards work in
facet.field, using facet.field=facet_* would require solr to compute the
counts for *every* field matching that pattern to find out which ones have
positive counts for the current result set -- there may only be 5 that
actually matter, but it's got to try all 300 of them to find out which 5
that is.
I just made a quick test by building a facet query with those 300 
attributes. I realized that the facets are built from the whole index, 
not from the subset returned by the initial query. Therefore I have a 
large number of empty facets which I simply ignore. In my case the 
QueryTime is somewhat higher (of course) but it is still only some 
milliseconds. (wow!!!) :o)

So at this stage of my investigation, and in my use case, I don't have to 
worry about performance, even if I use the system in a way that uses more 
resources than necessary.
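
One way to suppress the empty facets on the solr side would be the facet.mincount parameter (hostname and field names below are only examples):

   http://localhost:8983/solr/select?q=foo&facet=true&facet.mincount=1&facet.field=facet_color&facet.field=facet_size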

this is where custom request handlers that understand the faceting
"metadata" for your documents become key ... so you can say "when
querying across the entire collection, only try to facet on category and
manufacturer. if the search is constrained by category, then look up other
facet options to offer based on that category name from our metadata
store," etc...

Faceting on manufacturers and categories first and then presenting the
corresponding facets might work under some circumstances, but in my case
the category structure is quite deep, detailed and complex. So when
the user enters a query I'd like to say to him: "Look, here are the
manufacturers and categories with matches for your query; choose one if you
want, but maybe there is another one with products that better fit your
needs, or products that you didn't even know about. So maybe you'd like to
filter based on the following attributes." Something like this ;o)

The point is that I currently don't want to know too much about the data;
I just want to feed it into solr, follow some conventions and get the most
out of it as quickly as possible. Optimizations can and will take place at
a later time.

I hope to find some time to dig into solr SimpleFacets this weekend.

Regards,

Tom


Re: All facet.fields for a given facet.query?

2007-06-21 Thread Thomas Traeger



: Faceting on manufacturers and categories first and then presenting the
: corresponding facets might work under some circumstances, but in my case
: the category structure is quite deep, detailed and complex. So when
: the user enters a query I'd like to say to him: "Look, here are the
: manufacturers and categories with matches for your query; choose one if you
: want, but maybe there is another one with products that better fit your
: needs, or products that you didn't even know about. So maybe you'd like to
: filter based on the following attributes." Something like this ;o)

categories was just an example i used because it tends to be a common use
case ... my point is the decision about which facet qualifies for the
"maybe there is another one with products that better fit your needs" part
of the response either requires computing counts for *every* facet
constraint and then looking at them to see which ones provide good
distribution, or by knowing something more about your metadata (ie: having
stats that show the majority of people who search on the word "canon" want
to facet on "megapixels") .. this is where custom biz logic comes in,
because in a lot of situations computing counts for every possible facet
may not be practical (even if the syntax to request it was easier)
I get your point, but how do I know where additional metadata is of value 
if not by just trying? Currently I start with a generic approach to see 
what really is in the product data, to get an overview of the quality of 
the data and of what happens if I use the data in the new search solution. 
Then I can decide what to do to optimize the system, i.e. try to reduce 
the number of attributes, get marketing to split somewhat generic 
attributes into more detailed ones, find a way to display the most 
relevant facets for the current query first, and so on...

Tom