Re: Question about textTight

Stephen Weiss Tue, 28 Oct 2008 08:27:03 -0700

Thanks for the reply. I've been looking at the debug page... and I really don't see any clues there (maybe I don't know how to read it).


<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">1</int>
 <lst name="params">
  <str name="wt">standard</str>
  <str name="rows">10</str>
  <str name="start">0</str>
  <str name="explainOther"/>
  <str name="hl.fl"/>
  <str name="indent">on</str>
  <str name="q">name:(stm 0810 m_*)</str>
  <str name="fl">*,score</str>
  <str name="qt">standard</str>
  <str name="debugQuery">on</str>
  <str name="version">2.2</str>
 </lst>
</lst>
<result name="response" numFound="0" start="0" maxScore="0.0"/>
<lst name="debug">
 <str name="rawquerystring">name:(stm 0810 m_*)</str>
 <str name="querystring">name:(stm 0810 m_*)</str>
 <str name="parsedquery">+name:stm +name:0810 +name:m_*</str>
 <str name="parsedquery_toString">+name:stm +name:0810 +name:m_*</str>
 <lst name="explain"/>
</lst>
</response>

I mean, as far as I can tell, that seems right. I think I'm missing something here.

The wiki page is awesome though, thank you. The catenateAll option does seem to do what I thought it did... but should I perhaps just remove any kind of filter or analyzer on this field? It's really not a big deal if someone has to get the dashes and underscores exactly right - it's a worse problem if they do get them right, but it still doesn't work (usually they copy and paste these from an e-mail or something). Just in general, it's never really critical for someone to search by parts of the filename - except for searching with a wildcard (that is, stm0810m_* and the like), and it would be a lot easier if they didn't have to put spaces where letters change to numbers & vice versa.
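
For reference, a minimal sketch of what a field with almost no analysis could look like. This is an untested illustration, not something from the stock schema: the type name "filename" is made up, and it assumes the name field is reindexed after the change.

```xml
<!-- Hypothetical field type; the name "filename" is an assumption for illustration. -->
<fieldType name="filename" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keep the whole value as a single token, so underscores and dashes survive. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Lowercase only: no WordDelimiterFilter, stemming, synonyms, or stopwords. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Same field as before, pointed at the hypothetical type. -->
<field name="name" type="filename"
       indexed="true" stored="true" omitNorms="true"/>
```

The idea is that the whole filename is indexed as one lowercased token, so the underscore is still there for a prefix query such as name:stm0810m_* to match against. One assumption to check: wildcard terms are typically not passed through the analyzer, so the query side would probably have to be typed in lowercase.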


Thanks again for your input.

--
Steve

On Oct 28, 2008, at 10:49 AM, Feak, Todd wrote:

You may want to take a very close look at what the WordDelimiterFilter
is doing. I believe the underscore is dropped entirely during indexing
AND searching as it's not alphanumeric.

Wiki doco here
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=(tokenizer)#head-1c9b83870ca7890cd73b193cefed83c283339089

The admin analysis page and query debug will help a lot to see what's
going on.

-Todd

-----Original Message-----
From: Stephen Weiss [mailto:[EMAIL PROTECTED]
Sent: Monday, October 27, 2008 10:32 PM
To: solr-user@lucene.apache.org
Subject: Question about textTight

Hi,

So I've been using the textTight field to hold filenames, and I've run
into a weird problem.  Basically, people want to search by part of a
filename (say, the filename is stm0810m_ws_001ftws and they want to
find everything starting with stm0810m_, i.e. stm0810m_*).  I'm hoping
someone might have done this before (I bet someone has).

Lots of things work - you can search for stm0810m_ws_001ftws and get a
result, or (stm 0810 m*), or various other combinations.  What does
not work is searching for (stm0810m_*) or (stm 0810 m_*) or anything
like that - a problem, because often they don't want things with ma_
or mx_, but just m_.  It's almost like underscores just break
everything; escaping them does nothing.

Here's the field definition (it should be what came with my solr):

    <fieldType name="textTight" class="solr.TextField"
               positionIncrementGap="100" >
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="0" generateNumberParts="0" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

and usage:

   <field name="name" type="textTight"
          indexed="true" stored="true" omitNorms="true"
          />


Now, I thought textTight would be good because it's the one best
suited for SKUs, but I guess I'm wrong.  What should I be using for
this?  Would changing any of these "generateWordParts" or
"catenateAll" options help?  I can't seem to find any documentation so
I'm really not sure what it would do, but reindexing this whole thing
will take quite some time so I'd rather know what will actually work
before I just start changing things.

Thanks so much for any insight!

--
Steve
