Sorry for taking forever to reply but anyway...
We're using Solr 1.2.0 and, for various reasons, can't use the
nightly version.
Version 1.2.0 doesn't have NGramFilterFactory or
EdgeNGramFilterFactory, so the only ones I can use are
EdgeNGramTokenizerFactory and NGramTokenizerFactory.
I've played around with them, but the best result I've gotten so far
is a field type that lets me search for specific letters. For example,
I can search for an item that contains the letters a and x, but it
returns a hit no matter where those letters are in the text; they
don't have to be next to each other. That's not the result I was
going for. If the field contains "monitor", I want a hit on a search for
"onit" but not on "rint", for example.
I've never tried to construct a field type of my own before, and
I'm finding the available documentation somewhat incomplete and not very
helpful, so I really need some pointers from people who know this better
than I do.
If anyone could help me out, maybe even with some example code, I'd be
eternally grateful.
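
To pin down the difference between the behaviour described above and the behaviour wanted (a field containing "monitor" should match "onit" but not "rint"), here is a small Python model. It is only an illustration and not Solr code; the function names are made up for this sketch, and it assumes the field is split into single lowercase letters.

```python
def unigrams(text):
    """Split text into lowercase single-character tokens, keeping their order."""
    return list(text.lower())


def matches_any_position(field, query):
    """True if every query letter occurs somewhere in the field (order ignored)."""
    field_tokens = set(unigrams(field))
    return all(token in field_tokens for token in unigrams(query))


def matches_as_phrase(field, query):
    """True only if the query letters occur consecutively and in order."""
    f, q = unigrams(field), unigrams(query)
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))


print(matches_any_position("monitor", "rint"))  # True: r, i, n and t all occur
print(matches_as_phrase("monitor", "rint"))     # False: they are not adjacent
print(matches_as_phrase("monitor", "onit"))     # True: "onit" is contiguous
```

The first function corresponds to the unwanted result (letters matched anywhere), the second to the wanted one, where token positions matter as well as the tokens themselves.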
//Daniel
Otis Gospodnetic wrote:
Hi Daniel,
Well, searching "inside of words" requires special treatment, because normally
searches work on words/terms/tokens.
Make use of the following:
$ ff \*NGram\*java
./src/java/org/apache/solr/analysis/EdgeNGramTokenizerFactory.java
./src/java/org/apache/solr/analysis/NGramTokenizerFactory.java
./src/java/org/apache/solr/analysis/NGramFilterFactory.java
./src/java/org/apache/solr/analysis/EdgeNGramFilterFactory.java
Use these to create a new field type that makes Solr tokenize and index your terms as, say, uni-grams.
Instead of (or in addition to) indexing "Termobyxa", index "T e r m o b y x a".
Do the same with the query-time analyzer, and you'll be able to search within words.
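
A rough way to see why n-gram indexing makes substrings searchable is to model it outside Solr. The Python sketch below is purely illustrative and not Solr code; the helper name `ngrams` and the gram sizes 1 to 9 are assumptions chosen for the example. Note that it is a variation on the uni-gram suggestion above: it indexes longer grams too, so a whole query string can be looked up as one indexed token.

```python
def ngrams(term, min_size, max_size):
    """Return the set of all substrings of term with min_size..max_size characters."""
    term = term.lower()
    grams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(term) - size + 1):
            grams.add(term[start:start + size])
    return grams


# Hypothetical "index" for a single title, with grams of 1 to 9 characters.
indexed = ngrams("Termobyxa", 1, 9)

for query in ("termo", "ermo", "byxa", "rmox"):
    print(query, query in indexed)
```

Here "termo", "ermo" and "byxa" are all found because each is a contiguous slice of the indexed term, while "rmox" is not. The cost is index size: the number of grams grows quickly with term length and with the range of gram sizes.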
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Daniel Löfquist <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, April 17, 2008 5:46:15 AM
Subject: Searching "inside of words"
Hi,
I'm still pretty new to Solr. We're using it for searching on our site
right now though.
The configuration is however pretty much based on the example-files that
come with Solr and there's one type of search that I can't get to work.
Each item has fields called "title" and "description", both of which are
of type "text".
The type "text" is defined like this in our schema.xml :
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
My problem is that if I have an item with "title"="Termobyxa", a search
for "Termo" gives me a hit but if I search for "ermo" or "byxa" I get no
hit. How do I make it so that this kind of search "inside a word"
returns a hit?
Sincerely,
Daniel Löfquist