Re: A German Question
On Thursday 29 July 2010 14:00:21, Eric Grobler wrote:
> But faceting then looks like:
> molln
> munchen
> rossdorf
>
> How can I enable case-insensitive and German-agnostic character filters, and
> output properly formatted names in the facet result?

Just create another field without any filtering or conversions, and use the copy field mechanism to shadow the city names there. Then facet on that field.

Best regards
- Christian
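A minimal sketch of this in schema.xml (the field and type names are illustrative assumptions, not from the original post):

    <field name="city" type="text_de" indexed="true" stored="true"/>
    <field name="city_display" type="string" indexed="true" stored="true"/>
    <copyField source="city" dest="city_display"/>

The string type applies no analysis at all, so faceting on city_display (facet.field=city_display) returns the names exactly as entered, while searches still run against the analyzed city field.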
Re: Solr Security and XSRF
On Fri, Jun 27, 2008 at 1:54 AM, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> A basic technique that can be used to mitigate the risk of a possible CSRF
> attack like this is to configure your Servlet Container so that access to
> paths which can modify the index (ie: /update, /update/csv, etc...) are
> restricted either to specific client IPs, or using HTTP Authentication.

My understanding is that HTTP authentication is useless against XSRF, because browsers cache the authentication tokens. Once you have authenticated, you are still vulnerable to attacks. Restricting access to the servlet container by IP is probably safer.

To access the admin pages, I proxy the servlet container via Apache, similar to the snippet given below. This requires the user to authenticate via SSL for all SOLR-related pages, and additionally blocks all update queries, since only /solr/admin and /solr/select are proxied. If one also wanted to block specific admin pages, one could conceivably do so by adding further Location blocks with Deny directives.

Comments, anyone? This configuration is container-agnostic, so if no serious problems are found with my setup, which Wiki page would be most appropriate for this snippet?

<VirtualHost *:443>
    ServerName your.server.name
    ServerAdmin [EMAIL PROTECTED]

    SSLEngine on
    SSLCertificateFile /etc/ssl/certs/your_cert.pem
    SSLCertificateKeyFile /etc/ssl/private/your_key.pem

    DocumentRoot /var/webroot/www/webadmin/html

    ErrorLog /var/webroot/www/webadmin/logs/error_ssl.log
    # Possible values include: debug, info, notice, warn, error, crit,
    # alert, emerg.
    LogLevel warn
    CustomLog /var/webroot/www/webadmin/logs/access_ssl.log combined

    # SOLR admin pages
    <Location /solr>
        Order deny,allow
        # change this to restrict to specific IP addresses
        Allow from all
        AuthType Basic
        AuthName "SOLR Admin Pages"
        AuthUserFile /var/webroot/www/webadmin/auth/solr-auth
        Require valid-user
    </Location>

    ProxyPreserveHost On
    ProxyRequests Off
    ProxyPass /solr/admin http://127.0.0.1:9000/solr/admin
    ProxyPassReverse /solr/admin http://127.0.0.1:9000/solr/admin
    ProxyPass /solr/select http://127.0.0.1:9000/solr/select
    ProxyPassReverse /solr/select http://127.0.0.1:9000/solr/select
</VirtualHost>

Best regards
- Christian
Synonyms and stemming revisited
I apologize for beating a dead horse, but upon searching the archives, I found no satisfactory resolution. According to the archives, Hoss recommends in multiple messages that the synonym filter be put before the stemmer, and that synonym stemming at query time should then work as expected. Unfortunately, this is true only for the first word that appears in the synonym list.

Consider a simplified index-time configuration that runs the synonym filter before the stemmer (sketched below), together with the following synonym definition:

reise,urlaub

(These mean travel and vacation, respectively.) Both words can appear with many different endings, such as:

reise, reisen, reist, ...
urlaub, urlaube, urlauben, ...

The stemmer reduces all of these to "reis" and "urlaub", respectively.

Now, suppose that a document contains "reise" at index time. According to the filter order, this will be expanded by the synonym filter to "reise urlaub", and then stemmed as "reis urlaub". So far, so good. In this case, queries for urlaube, reisen, etc. will all hit the indexed document.

However, consider a document that contains "reisen" at index time. As the synonym filter comes first, there is no match for the synonym, and the analyzer proceeds to index this document with "reisen" -> "reis" only, with "urlaub" missing. Hence, queries such as "reisen" and "reist" will hit, but "urlaub", "urlaube", etc. will not.

I see two solutions. Either put all possible endings in the synonym file - I do not really like this solution, as it would make the file very large, and it is also too easy to miss some specific ending. Or run the stemmer before the synonym filter, in which case the synonym definitions need to appear in their stemmed forms.

Am I missing something, or does the conversion of the synonym text file need to be done by hand at the moment? I suppose that it would not be too difficult to write some code that does this conversion automatically, so that the synonym definition

reise,urlaub

is converted to

reis,urlaub

which should then solve all problems.

Best regards
- Christian

--
Christian Vogler, Ph.D.
Institute for Language and Speech Processing
Athens, Greece
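A minimal sketch of the index-time chain under discussion, with the synonym filter ahead of the stemmer (the field type name is illustrative, and the German Snowball stemmer is an assumption based on the examples above):

    <fieldType name="text_de" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      </analyzer>
    </fieldType>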
Re: Synonyms and stemming revisited
> If I've given different advice in the past, I'm sure I had a good reason
> for it -- possibly due to some aspect of those problems that is subtly
> different from yours ... can you post links to the specific messages
> you're referring to? It might help jog my memory.

One thread is: http://www.nabble.com/synonyms-td16284520.html

Based on my reading of that thread, I believe that the issue raised there is the same as the one I just raised, but the original post was not entirely clear and perhaps easy to misunderstand.

Another thread is: http://www.nabble.com/stemming-the-synonyms-to16945953.html#a16945953

> A recently added feature is that when configuring SynonymFilterFactory
> you can give it the name of a TokenizerFactory to use when parsing the
> synonym file. This could be used to stem words *if* you write a
> TokenizerFactory that calls out to your Stemmer.

Ah, cool. I will give the SOLR 1.3 nightlies a spin once I make it past my current deadlines and obligations.

> (See SOLR-319 for the background on why you can only specify a Tokenizer
> and not a full "fieldType" to get the analysis chain from ... in a
> nutshell: 1. it would have been harder to implement; 2. the only use cases
> people could think of were tokenization-based.)

There probably needs to be a chain of tokenizers, because in German, compound words need to be split before stemming. I will take a stab at writing a TokenizerFactory that chains them; it should not be too difficult.

Best regards
- Christian
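For reference, hooking a custom stemming tokenizer into the synonym file parsing would then look something like this (the com.example class name is hypothetical):

    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            expand="true" tokenizerFactory="com.example.StemmingTokenizerFactory"/>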
any experiences with running SOLR on OpenJDK?
Hi all,

does anyone have experience with running SOLR on OpenJDK 6.0? Any data points, positive or negative, would be appreciated. I am trying to decide whether to switch to OpenJDK on Debian Lenny, or whether to stick with the non-free JDK 5.0 for the time being.

Best regards
- Christian
Re: highlighting html content
Hi Matt,

On Tue, Apr 28, 2009 at 4:24 AM, Matt Mitchell wrote:
> I've been toying with setting custom pre/post delimiters and then removing
> them in the client, but I thought I'd ask the list before I go too far with
> that idea :)

this is what I do. I define the custom highlight delimiters as [solr:hl] and [/solr:hl], and then do a string replace on the search results. It is simple to implement, and effective.

Best regards
- Christian
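A minimal sketch of that approach (assuming the delimiters are set via the standard hl.simple.pre/hl.simple.post request parameters, which the original message does not spell out):

    # Request: ...&hl=true&hl.simple.pre=[solr:hl]&hl.simple.post=[/solr:hl]
    # Client-side replace on each returned snippet:
    snippet = snippet.replace("[solr:hl]", '<em class="hl">')
    snippet = snippet.replace("[/solr:hl]", "</em>")

Since the delimiters cannot occur in ordinary document text, Solr-inserted markup never collides with markup already present in the stored content.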
Re: Upgrading Tika in Solr
Just a word of caution: I've been bitten by this bug, which affects Tika 0.6: https://issues.apache.org/jira/browse/PDFBOX-541

It causes the parser to go into an infinite loop, which isn't exactly great for server stability. Tika 0.4 is not affected in the same way - as far as I remember, the parser just fails on such PDF files. According to the Tika folks, PDFBox and Tika releases need to be synchronized, so it might be wise to hold off on upgrading until the next Tika release, which will contain the fixed PDFBox.

Best regards
- Christian

On Wednesday 17 February 2010 11:40:50 am Liam O'Boyle wrote:
> I just copied in the newer .jars and got rid of the old ones and
> everything seemed to work smoothly enough.
>
> Liam
>
> On Tue, 2010-02-16 at 13:11 -0500, Grant Ingersoll wrote:
> > I've got a task open to upgrade to 0.6. Will try to get to it this week.
> > Upgrading is usually pretty trivial.
> >
> > On Feb 14, 2010, at 12:37 AM, Liam O'Boyle wrote:
> > > Afternoon,
> > >
> > > I've got a large collection of documents which I'm attempting to add
> > > to a Solr index using Tika via the ExtractingRequestHandler, but there
> > > are a large number that it has problems with (PDFs, PPTX and XLS
> > > documents mainly).
> > >
> > > I've tried them with the most recent standalone version of Tika and it
> > > handles most of the failing documents correctly. I tried using a
> > > recent nightly build of Solr, but the same problems seem to occur.
> > >
> > > Are there instructions somewhere on installing a more recent Tika build
> > > into Solr?
> > >
> > > Thanks,
> > > Liam
> >
> > --
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem using Solr/Lucene:
> > http://www.lucidimagination.com/search

--
Christian Vogler, Ph.D.
Institute for Language and Speech Processing, Athens, Greece
Why does highlight use the index analyzer (instead of query)?
Hi,

I am using Solr 1.2.0 with a custom compound word analyzer, which inserts the decompositions into the token stream. Because I assume that when the user queries for a compound word, he is interested only in whole-word matches, I have it enabled only in my index analyzer chain.

However, due to a bug in the analyzer (entirely my fault), I came to realize that when highlighting is enabled, the highlighter uses the index analyzer chain to find the matches, instead of the query analyzer chain. I find this curious, and I was wondering whether it is intentional, and if so, what the rationale for it is.

Best regards
- Christian
Seeing strange highlighting in multi-valued field (was: Why does highlight use the index analyzer)
On Wednesday 27 February 2008 03:58:14 Chris Hostetter wrote:
> I'm not much of a highlighter expert, but this *seems* like it was probably
> intentional ... you are talking about the use case where you have a stored
> field, and no term positions, correct? ... so in order to highlight, the
> highlighter needs to analyze the stored text to find the word positions?

Yes, that is correct. I index and store the field, and have term positions disabled. Your explanation makes sense, thanks.

However, to follow up, I have run into some strange highlighter behavior on multi-valued text fields. In particular, I have a text field type whose analyzers for indexing and querying are identical, except that I put a compound word splitter in the indexer chain, and I use it in a multi-valued category field (sketched below).

Typical values from documents are "Gebärdensprache" and "Recht", where the indexed terms after analysis are "gebard", "sprach", and "recht", respectively.

Now, if I query for "Gebärden" (which the analyzer transforms into "gebard"), I get matches, as expected, but the highlighter retrieves only the match on the first token of the first field value, like this: Gebärden. The fragment, snippet, and merging parameters have no effect on this behavior; hl.requireFieldMatch is off, and hl.fragmenter is gap.

What is a bit strange is that if the field has only one value, then the highlighter retrieves the entire contents of the field; that is, if we have indexed "Gebärdensprache", then the highlighter will show Gebärdensprache, which is the behavior that I expected, irrespective of whether the field has one or more values.

Any idea what could be going on here?

Best regards
- Christian
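A sketch of the field setup described above (the type name is illustrative, and the custom compound word splitter's class name is hypothetical, since it was not given):

    <fieldType name="text_compound" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- custom compound word splitter, index chain only (hypothetical class name) -->
        <filter class="com.example.CompoundWordFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German"/>
      </analyzer>
    </fieldType>

    <field name="category" type="text_compound" indexed="true" stored="true" multiValued="true"/>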
Re: Python Client / Parameters with a dot
On Monday 10 March 2008 19:34:09 Eric Falconnier wrote:
> I am beginning to use the python client from the subversion
> repository. Everything works well except if I want to pass a parameter
> with a dot to the search method of the SolrConnection class (for
> example facet.field). The solution I have is to replace the "." by
> "__dot__" (facet__dot__field) and then reverse this in the search
> method. The new method looks like this:

You may want to take a look at SOLR-216 (http://issues.apache.org/jira/browse/SOLR-216). I have been using this code in production for a while with minor modifications, and so far it has worked very well. It allows you to substitute underscores for dots, e.g. hl_fl=...

The code, as posted in JIRA, has a few minor issues with UTF-8 encoding, but these are easily fixed (see the comments, and also a recent post to the list). I will probably propose a revised version soon.

Best regards
- Christian
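A minimal sketch of the underscore-to-dot convention (illustrative, not the actual SOLR-216 code):

    def translate_params(params):
        """Turn Pythonic keyword names like hl_fl into Solr's hl.fl."""
        return dict((key.replace('_', '.'), value)
                    for key, value in params.items())

With this in place, a call such as conn.search(q='reisen', hl='true', hl_fl='title') would send q=reisen&hl=true&hl.fl=title to Solr, without any __dot__ workarounds.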
Re: Plans for a new Solr Python library
On Monday 24 March 2008 01:01:59 Leonardo Santagada wrote:
> I have done some modifications to the solr python client [1], and
> though we kept the same license and my work could be put back into solr,
> I think if there are more people interested, we could improve the
> module a lot.

Have you taken a look at SOLR-216 on the issue tracker? I've been using this version in production, and it is quite nice. Maybe it is possible to take the best from both versions?

Best regards
- Christian
Re: synonyms
On Friday 28 March 2008 21:44:29 Leonardo Santagada wrote:
> Well, his examples are in Brazilian Portuguese, not Spanish, and the
> biggest problem is that a Spanish stemmer is not going to work. I
> haven't found a pt_BR stemmer; have I overlooked something?

Try the Snowball Porter filter factory. The algorithm is specified in plain text files, so adding new stemmers to the codebase is pretty easy. The hard part is finding a good specification of the algorithm for Brazilian Portuguese. A Google search reveals some references to Brazilian Portuguese versions of the Porter algorithm. Maybe one of these is suitably unencumbered for implementation and distribution as free software.

As a last resort, there is already a Snowball Porter stemmer for Portuguese in the SOLR codebase. However, I do not know how suitable it would be for adaptation to Brazilian Portuguese, as I know zilch about the variant spoken in Portugal.

Best regards
- Christian
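For what it's worth, wiring the existing Portuguese stemmer into an analyzer chain is a one-liner (the surrounding chain is up to you):

    <filter class="solr.SnowballPorterFilterFactory" language="Portuguese"/>

A Brazilian Portuguese variant would slot in the same way, once a Snowball specification for it exists.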
Re: solr highlighting
On Wednesday 14 May 2008 09:21:36 Kevin Xiao wrote:
> Hi there,
>
> I am new to solr. I want the search term to be highlighted in the results. I
> thought it was pretty simple, but could not make it work. I read a lot of
> solr documents and mail archives (I wish there were a search function for
> this - we are talking about solr, aren't we? ☺).

Take a look at hl.fragsize, hl.snippets, and hl.mergeContiguous, as per http://wiki.apache.org/solr/HighlightingParameters. In particular, setting hl.fragsize to 0 might be what you want, if I understand your question correctly.

Best regards
- Christian
--
Christian Vogler, Ph.D.
Institute for Language and Speech Processing, Athens, Greece
http://gri.gallaudet.edu/~cvogler/
[EMAIL PROTECTED]
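For example, a request along these lines returns each matching field whole instead of fragmented (the host and field name are illustrative):

    http://localhost:8983/solr/select?q=travel&hl=true&hl.fl=content&hl.fragsize=0&hl.snippets=1

With hl.fragsize=0, the highlighter skips fragmenting altogether and hands back the entire field value with the matches marked up.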
Re: field normalization and omitNorms
On Wednesday 28 May 2008 01:37:57 Otis Gospodnetic wrote:
> If you have tokenized fields of variable size and you want the field length
> to affect the relevance score, then you do not want to omit norms.
> Omitting norms is good for fields where length is of no importance (e.g.
> gender="Male" vs. gender="Female"). Omitting norms saves you heap/RAM, one
> byte per doc per field without norms, I believe.

I am also toying with the hypothesis that omitting the field norm may be a good idea for title fields in languages with compound words, since such titles typically consist of only a few words. On our server, we use a German language stemmer in conjunction with a compound word tokenizer, which inserts extra tokens into the stream. With typically short titles, such as "Elterntagung mit Rekordbeteiligung", which is tokenized (before stemming) as

elterntagung eltern tagung mit rekordbeteiligung rekord beteiligung

the title ends up having 7 tokens instead of 3, or even 5, which significantly affects the field norms. The reason for retaining the original compound token is that it forces compound word queries to return only hits on compound words. In addition, we also have a copied field with just the 3 tokens that skips the compound tokenizer, in order to boost queries that match whole words.

As a consequence, according to the "explain" parameter, the match score for the non-compound title fields is *way* out of proportion. I will have to experiment a bit - one thing that I want to try is moving the non-compound field from the qf parameter to the bq parameter, but omitting the title field norms is also on my list of things to try.

Best regards
- Christian
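The change under consideration is just a flag on the field definition (the field and type names are illustrative):

    <field name="title" type="text_compound" indexed="true" stored="true" omitNorms="true"/>

This drops the length normalization factor for the title field entirely, so a 7-token compound-split title is no longer penalized relative to a 3-token one.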