Re: A German Question

2010-07-29 Thread Christian Vogler
On Thursday 29 of July 2010 14:00:21 Eric Grobler wrote:
> But faceting then looks like:
>  molln
>  munchen
>  rossdorf
> 
> How can I enable case-insensitive and German-agnostic character filters and
> output properly formatted names in the facet result?

Just create another field without any filtering or conversions, and use the
copyField mechanism to shadow the city names there. Then, facet on that field.
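
A minimal schema sketch of this pattern (field and type names are
hypothetical):

   <field name="city" type="text_de" indexed="true" stored="true"/>
   <!-- unanalyzed shadow field, used only for faceting -->
   <field name="city_facet" type="string" indexed="true" stored="false"/>
   <copyField source="city" dest="city_facet"/>

Faceting on city_facet then returns the original forms (Mölln, München,
Roßdorf) instead of the folded ones.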

Best regards
- Christian


Re: Solr Security and XSRF

2008-06-26 Thread Christian Vogler
On Fri, Jun 27, 2008 at 1:54 AM, Chris Hostetter
<[EMAIL PROTECTED]> wrote:
> A basic technique that can be used to mitigate the risk of a possible CSRF
> attack like this is to configure your Servlet Container so that access to
> paths which can modify the index (ie: /update, /update/csv, etc...) are
> restricted either to specific client IPs, or using HTTP Authentication.

My understanding is that HTTP authentication is useless against XSRF,
because browsers cache the authentication credentials. Once you have
authenticated, you remain vulnerable to attacks.

Restricting access to the servlet container by IP is probably safer.
To access the admin pages, I proxy the servlet container via Apache,
similar to the snippet given below.

This requires the user to authenticate via SSL for all SOLR-related
pages, and additionally blocks all update queries. If one also would
like to block specific admin pages, one could conceivably do so by
adding <Location> blocks with Deny directives.

Comments, anyone? This configuration is container-agnostic, so if no
serious problems are found with my setup, which Wiki page would be
most appropriate for this snippet?


<VirtualHost *:443>
   ServerName your.server.name
   ServerAdmin [EMAIL PROTECTED]

   SSLEngine on
   SSLCertificateFile /etc/ssl/certs/your_cert.pem
   SSLCertificateKeyFile /etc/ssl/private/your_key.pem

   DocumentRoot /var/webroot/www/webadmin/html

   ErrorLog /var/webroot/www/webadmin/logs/error_ssl.log
   # Possible values include: debug, info, notice, warn, error, crit,
   # alert, emerg.
   LogLevel warn

   CustomLog /var/webroot/www/webadmin/logs/access_ssl.log combined

   # SOLR admin pages
   <Proxy *>
      # Change this to restrict access to specific IP addresses
      Order deny,allow
      Allow from all
   </Proxy>

   ProxyPreserveHost On
   ProxyRequests Off
   ProxyPass /solr/admin http://127.0.0.1:9000/solr/admin
   ProxyPassReverse /solr/admin http://127.0.0.1:9000/solr/admin
   ProxyPass /solr/select http://127.0.0.1:9000/solr/select
   ProxyPassReverse /solr/select http://127.0.0.1:9000/solr/select

   <Location /solr>
      AuthType Basic
      AuthName "SOLR Admin Pages"
      AuthUserFile /var/webroot/www/webadmin/auth/solr-auth
      Require valid-user
   </Location>
</VirtualHost>


Best regards
- Christian


Synonyms and stemming revisited

2008-08-30 Thread Christian Vogler
I apologize for beating a dead horse, but upon searching the archives,
I found no satisfactory resolution. According to the archives, Hoss
recommends in multiple messages that the synonym filter be put before
the stemmer, and that synonym stemming at query time should then work
as expected. Unfortunately, this is only true for the first word that
appears in the synonym list.

Consider the following simplified index-time configuration:

   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.SnowballPorterFilterFactory" language="German"/>
   </analyzer>


Furthermore, consider the following synonym definition:

reise,urlaub

(These mean travel and vacation, respectively)

Both words can appear with many different endings, such as:

reise, reisen, reist, ...
urlaub, urlaube, urlauben, ...

The stemmer reduces all these to "reis" and "urlaub", respectively.

Now, suppose that a document contains "reise" at index time. According
to the filter order, this will be expanded by the synonym filter to:

reise urlaub

and then stemmed as:

reis urlaub.

So far, so good. In this case, queries for urlaube, reisen, etc., will
all hit the indexed document.

However, consider a document that contains "reisen" at index time. As
the synonym filter comes first, there is no match for the synonym, and
the analyzer proceeds to index this document with "reisen" -> "reis"
only, with "urlaub" missing.

Hence, queries such as "reisen" and "reist" will hit, but "urlaub",
"urlaube", etc. will not.

I see two solutions:

Either put all possible endings in the synonym file. I do not really
like this solution, as it would make the file very large, and it is
also too easy to miss some specific ending. Or run the stemmer before
the synonym filter, in which case the synonym definitions need to
appear in their stemmed forms.

Am I missing something, or does the conversion of the synonym text file
need to be done by hand at the moment? I suppose that it would not be
too difficult to write some code that does this conversion
automatically, so that the synonym definition:

reise,urlaub

is converted to:

reis,urlaub

which should then solve all problems.
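
As a rough illustration, such a pre-stemming pass could look like the
sketch below. This assumes NLTK's German Snowball stemmer as a stand-in;
it may not agree with the stemmer in the actual analyzer chain on every
word, so the output should be spot-checked:

# Hypothetical pre-stemming of a Solr synonyms file. NLTK's German
# Snowball stemmer stands in for the stemmer in the analyzer chain.
from nltk.stem.snowball import GermanStemmer

def stem_synonyms(in_path, out_path):
    stemmer = GermanStemmer()
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line or line.startswith("#"):
                fout.write(line + "\n")  # keep comments and blank lines
                continue
            # Handle both "a,b,c" lists and explicit "a,b => c" mappings.
            sides = line.split("=>")
            stemmed = [",".join(stemmer.stem(w.strip())
                                for w in side.split(","))
                       for side in sides]
            fout.write(" => ".join(stemmed) + "\n")

stem_synonyms("synonyms.txt", "synonyms-stemmed.txt")

With this, "reise,urlaub" comes out as "reis,urlaub", as desired.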

Best regards
- Christian
-- 
Christian Vogler, Ph.D.
Institute for Language and Speech Processing
Athens, Greece


Re: Synonyms and stemming revisited

2008-09-07 Thread Christian Vogler
> If I've given different advice in the past, I'm sure I had a good reason
> for it -- possibly due to some aspect of those problems that is subtly
> different from yours ... can you post links to the specific messages
> you're referring to? It might help jog my memory.

One thread is: http://www.nabble.com/synonyms-td16284520.html

Based on my reading of that thread, I believe that the issue raised there is 
the same as the one I just raised, but the original post was not entirely 
clear and perhaps easy to misunderstand.

Another thread is: 
http://www.nabble.com/stemming-the-synonyms-to16945953.html#a16945953

> A recently added feature is that when configuring SynonymFilterFactory
> you can give it the name of a TokenizerFactory to use when parsing the
> synonym file.  This could be used to stem words *if* you write a
> TokenizerFactory that calls out to your Stemmer.

Ah, cool. I will give the SOLR 1.3 nightlies a spin, once I make it past my 
current deadlines and obligations.
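
For reference, the configuration would presumably look something like
the sketch below (the stemming TokenizerFactory class is hypothetical;
tokenizerFactory is the attribute mentioned above):

   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
           expand="true"
           tokenizerFactory="com.example.StemmingTokenizerFactory"/>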

> (see SOLR-319 for the background on why you can only specify a Tokenizer
> and not a full "fieldType" to get the analysis chain from ... in a
> nutshell: 1. it would have been harder to implement; 2. the only use cases
> people could think of were Tokenization-based.)

There probably needs to be a chain of tokenizers, because in the German 
language compound words need to be split before stemming. I will take a stab 
at writing the TokenizerFactory that chains them. Should not be too 
difficult.

Best regards
- Christian


any experiences with running SOLR on OpenJDK?

2009-02-24 Thread Christian Vogler
Hi all,

does anyone have experience with running SOLR on OpenJDK 6.0? Any data points, 
positive or negative, would be appreciated. I am trying to decide whether to 
switch to OpenJDK on Debian Lenny, or whether to stick with the non-free JDK 
5.0 for the time being.

Best regards
- Christian


Re: highlighting html content

2009-04-28 Thread Christian Vogler
Hi Matt,

On Tue, Apr 28, 2009 at 4:24 AM, Matt Mitchell wrote:
> I've been toying with setting custom pre/post delimiters and then removing
> them in the client, but I thought I'd ask the list before I go to far with
> that idea :)

this is what I do. I define the custom highlight delimiters as
[solr:hl] and [/solr:hl], and then do a string replace with the real
HTML highlighting tags (e.g. <em> and </em>) on the search results.

It is simple to implement, and effective.
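
A minimal client-side sketch of this approach (core URL and field name
are hypothetical; hl.simple.pre and hl.simple.post are the standard
Solr parameters for setting the delimiters):

import html
import requests

params = {
    "q": "body:solr",
    "hl": "true",
    "hl.fl": "body",
    "hl.simple.pre": "[solr:hl]",
    "hl.simple.post": "[/solr:hl]",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/select", params=params)

for doc_id, fields in resp.json()["highlighting"].items():
    for snippet in fields.get("body", []):
        # Escape any markup in the stored content first, so that only
        # our own delimiters survive, then swap them for real tags.
        safe = html.escape(snippet)
        safe = safe.replace("[solr:hl]", "<em>").replace("[/solr:hl]", "</em>")
        print(doc_id, safe)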

Best regards
- Christian


Re: Upgrading Tika in Solr

2010-02-18 Thread Christian Vogler
Just a word of caution: I've been bitten by this bug, which affects Tika 0.6: 
https://issues.apache.org/jira/browse/PDFBOX-541

It causes the parser to go into an infinite loop, which isn't exactly great 
for server stability. Tika 0.4 is not affected in the same way - as far as I 
remember, the parser just fails on such PDF files.

According to the Tika folks, PDFBox and Tika releases need to be synchronized,
so it might be wise to hold off on upgrading until the next Tika release,
which will contain the fixed PDFBox.

Best regards
- Christian


On Wednesday 17 February 2010 11:40:50 am Liam O'Boyle wrote:
> I just copied in the newer .jars and got rid of the old ones and
> everything seemed to work smoothly enough.
> 
> Liam
> 
> On Tue, 2010-02-16 at 13:11 -0500, Grant Ingersoll wrote:
> > I've got a task open to upgrade to 0.6.  Will try to get to it this week.
> >  Upgrading is usually pretty trivial.
> >
> > On Feb 14, 2010, at 12:37 AM, Liam O'Boyle wrote:
> > > Afternoon,
> > >
> > > I've got a large collections of documents which I'm attempting to add
> > > to a Solr index using Tika via the ExtractingRequestHandler, but there
> > > are a large number that it has problems with (PDFs, PPTX and XLS
> > > documents mainly).
> > >
> > > I've tried them with the most recent stand-alone version of Tika, and it
> > > handles most of the failing documents correctly.  I tried using a
> > > recent nightly build of Solr, but the same problems still occur.
> > >
> > > Are there instructions somewhere on installing a more recent Tika build
> > > into Solr?
> > >
> > > Thanks,
> > > Liam
> >
> > --
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem using Solr/Lucene:
> > http://www.lucidimagination.com/search
> 

-- 
Christian Vogler, Ph.D.
Institute for Language and Speech Processing, Athens, Greece


Why does highlight use the index analyzer (instead of query)?

2008-02-25 Thread Christian Vogler
Hi,

I am using Solr 1.2.0 with a custom compound word analyzer, which inserts the 
decompositions into the token stream. Because I assume that when the user 
queries for a compound word, he is interested only in whole-word matches, I 
have it enabled only in my index analyzer chain.
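
Schematically, the field type looks something like this (class names
for the custom pieces are illustrative):

   <fieldType name="text_compound" class="solr.TextField">
      <analyzer type="index">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="com.example.CompoundWordFilterFactory"/>
      </analyzer>
      <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
   </fieldType>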

However, due to a bug in the analyzer (entirely my fault), I came to realize 
that when highlighting is enabled, the highlighter uses the index analyzer 
chain to find the matches, instead of the query analyzer chain.

I find this curious, and I was wondering whether this is intentional, and if 
so, what is the rationale for this?

Best regards
- Christian


Seeing strange highlighting in multi-valued field (was: Why does highlight use the index analyzer)

2008-02-27 Thread Christian Vogler
On Wednesday 27 February 2008 03:58:14 Chris Hostetter wrote:
> I'm not much of a highlighter expert, but this *seems* like it was probably
> intentional ... you are talking about the use case where you have a stored
> field, and no term positions, correct? ... so in order to highlight, the
> highlighter needs to analyze the stored text to find the word positions?

Yes, that is correct. I index and store the field, and have term positions 
disabled. Your explanation makes sense, thanks. 

However, to follow up, I have run into some strange highlighter behavior on 
multi-valued text fields. In particular, I have a field like this:

...

The analyzers for indexing and query are identical, except that I put a
compound word splitter in the indexer chain. I use this in a multi-valued
category field:

<field name="category" type="text" indexed="true" stored="true"
       multiValued="true"/>
Typical values from documents are:

Gebärdensprache
Recht

where the indexed terms, after analysis, are "gebard", "sprach", and "recht",
respectively. Now, if I query for "Gebärden" (which the analyzer transforms
into "gebard"), I get matches, as expected, but the highlighter retrieves
only the match on the first token of the first field, like this:

<em>Gebärden</em>

The fragment, snippet, and merging parameters have no effect on this behavior; 
hl.requireFieldMatch is off; hl.fragmenter is gap.

What is a bit strange is that if the field has only one value, then the
highlighter retrieves the entire contents of the field; that is, if we have
indexed

Gebärdensprache

then the highlighter will show

<em>Gebärdensprache</em>

which is the behavior that I expected, irrespective of whether the field has
one or more values.

Any idea what could be going on here?

Best regards
- Christian


Re: Python Client / Parameters with a dot

2008-03-10 Thread Christian Vogler
On Monday 10 March 2008 19:34:09 Eric Falconnier wrote:
> I am beginning to use the python client from the subversion
> repository. Everything works well except if I want to pass a parameter
> with a dot to the search method of the SolrConnection class (for
> example facet.field). The solution I have is to replace the "." by
> "__dot__" ( facet__dot__field ) and then reverse this in the search
> method. The new method look like this :

You may want to take a look at SOLR-216 
(http://issues.apache.org/jira/browse/SOLR-216). I have been using this code 
in production for a while with minor modifications, and so far it has worked 
very well. It allows you to substitute underscores for dots, e.g. hl_fl=...

The code, as posted in the JIRA, has a few minor issues with UTF-8 encoding, 
but these are easily fixed (see the comments, and also a recent post to the 
list). I probably will propose a revised version soon.

Best regards
- Christian 


Re: Plans for a new Solr Python library

2008-03-24 Thread Christian Vogler
On Monday 24 March 2008 01:01:59 Leonardo Santagada wrote:
> I have done some modifications on the solr python client[1], and
> though we kept the same license and my work could be put back in solr
> I think if there are more people interested we could improve the
> module a lot.

Have you taken a look at SOLR-216 on the issue tracker? I've been using this 
version in production, and it is quite nice.

Maybe it is possible to take the best from both versions?

Best regards
- Christian


Re: synonyms

2008-03-29 Thread Christian Vogler
On Friday 28 March 2008 21:44:29 Leonardo Santagada wrote:
> Well his examples are in brazilian portuguese and not spanish and the
> biggest problem is that a spanish stemmer is not goin to work. I
> haven't found a pt_BR steammer, have I overlooked something?

Try the Snowball Porter filter factory. The algorithm is specified in plain 
text files, so adding new stemmers to the codebase is pretty easy. The hard 
part is finding a good specification of the algorithm for Brazilian 
Portuguese.

A Google search reveals some references to Brazilian Portuguese versions of 
the Porter algorithm. Maybe one of these is suitably unencumbered for 
implementation and distribution as free software.

As a last resort, there already is a Snowball Porter stemmer for Portuguese in 
the SOLR codebase. However, I do not know how suitable it would be for 
adaptation to Brazilian Portuguese, as I know zilch about the variant spoken 
in Portugal.
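
For what it is worth, wiring the stock Portuguese stemmer into an
analyzer chain is a one-liner (a minimal sketch; whether the European
Portuguese algorithm fits pt_BR is the open question above):

   <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/>
   </analyzer>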

Best regards
- Christian


Re: solr highlighting

2008-05-14 Thread Christian Vogler
On Wednesday 14 May 2008 09:21:36 Kevin Xiao wrote:
> Hi there,
>
> I am new to solr. I want the search term to be highlighted in the results. I
> thought it would be pretty simple, but could not make it work. I have read a
> lot of solr documents and mail archives (I wish there were a search function
> for this; we are talking about solr, aren't we? ☺).

Take a look at hl.fragsize, hl.snippets, and hl.mergeContiguous, as per 
http://wiki.apache.org/solr/HighlightingParameters.

In particular, setting hl.fragsize to 0 might be what you want if I understand 
your question correctly.
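
For illustration, a request along these lines (host, path, and field
name are hypothetical) returns the entire field value as a single
highlighted fragment:

http://localhost:8983/solr/select?q=content:foo&hl=true&hl.fl=content&hl.fragsize=0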

Best regards
- Christian
-- 
Christian Vogler, Ph.D.
Institute for Language and Speech Processing, Athens, Greece
http://gri.gallaudet.edu/~cvogler/
[EMAIL PROTECTED]


Re: field normalization and omitNorms

2008-05-28 Thread Christian Vogler
On Wednesday 28 May 2008 01:37:57 Otis Gospodnetic wrote:
> If you have tokenized fields of variable size and you want the field length
> to affect the relevance score, then you do not want to omit norms. 
> Omitting norms is good for fields where length is of no importance (e.g.
> gender="Male" vs. gender="Female").  Omitting norms saves you heap/RAM, one
> byte per doc per field without norms, I believe.

I am also toying with the hypothesis that omitting the field norm may be a
good idea for title fields in languages with compound words, since titles
typically consist of only a few words.

On our server we use a German language stemmer in conjunction with a compound
word tokenizer, which inserts extra tokens into the stream. With typical
short titles, such as:

Elterntagung mit Rekordbeteiligung,

which is tokenized as (before stemming):

elterntagung eltern tagung mit rekordbeteiligung rekord beteiligung, 

the title ends up having 7 tokens instead of 3 or even 5, which significantly 
affects the field norms. The reason for retaining the original compound token 
is that it forces compound word queries to return only hits on compound 
words.

In addition, we also have a copied field with just the 3 tokens that skips the 
compound tokenizer, in order to boost queries that match whole words. As a 
consequence, according to the "explain" parameter, the match score for the 
non-compound title fields is *way* out of proportion.

I will have to experiment a bit - one thing that I want to try is moving the 
non-compound field from the qf parameter to the bq parameter, but omitting 
the title field norms is also on my list of things to try.
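
Schematically, the setup under consideration would look something like
this (field and type names are hypothetical), with title_whole being
the copied whole-word field referenced in qf/bq:

   <field name="title" type="text_compound" indexed="true" stored="true"
          omitNorms="true"/>
   <field name="title_whole" type="text_de" indexed="true" stored="true"
          omitNorms="true"/>
   <copyField source="title" dest="title_whole"/>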

Best regards
- Christian