Re: Question about textTight

Stephen Weiss Tue, 28 Oct 2008 12:52:55 -0700

OK, thanks everyone. Since this is the only thing this field is used for, I think we'll just reindex without the filters and go from there... Now if only I could just reindex that field! Oh well.
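
For reference, "without the filters" could look roughly like the fragment below. This is only a sketch: the type name filenameExact is made up for illustration, and it assumes the whitespace tokenizer is kept while every filter from textTight is removed.

```xml
<!-- Hypothetical type name; not part of the stock schema. -->
<fieldType name="filenameExact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Tokenizer only: no synonym, stop word, word delimiter, lowercase, or stemming filters -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- The existing "name" field, pointed at the unfiltered type -->
<field name="name" type="filenameExact" indexed="true" stored="true" omitNorms="true"/>
```

With a chain like this, a filename should be indexed as a single token exactly as written, underscores included, so a prefix query such as stm0810m_* has something to match. The trade-off is that matching becomes case-sensitive and split-up searches like (stm 0810 m*) would presumably stop working.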


--
Steve


On Oct 28, 2008, at 3:32 PM, Yonik Seeley wrote:

I'm wrong: I saw the punctuation being left in for "m_*" and thought
that the WordDelimiterFilter wasn't working.

So as Todd pointed out, underscores are dropped during indexing and
searching.  The limitation you are running into is that things like
prefix and wildcard queries are not analyzed (so the _ won't be
dropped).  You could set up another field for use with wildcard
queries, or you could create separate query and index analyzers for
textTight and set the index analyzer to use a WordDelimiterFilter that
also indexes the original token.
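
A minimal sketch of the second option, assuming the textTight definition quoted further down in this thread. The only change on the index side is preserveOriginal="1" on the WordDelimiterFilter, which is assumed to be available in the Solr version in use; the query analyzer keeps the original chain.

```xml
<!-- Sketch only: textTight split into separate index and query analyzers -->
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <!-- preserveOriginal="1" also emits the unsplit token, underscores included -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <!-- Query side unchanged: no original token needed here -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```

A change like this only takes effect after a reindex. Because wildcard terms skip analysis, they are not lowercased either, so they would need to be typed in lowercase to match the lowercased index. The stemmer still runs on the preserved token, so the tail of a filename may be altered in the index.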

-Yonik
On Tue, Oct 28, 2008 at 2:31 PM, Stephen Weiss <[EMAIL PROTECTED]> wrote:
That's strange then. The schema hasn't changed in well over a month, Solr's been restarted several times since then to reload synonyms, and the whole thing was reindexed just this past week to add in new Chinese translations (the fields were already there but left blank).





I attached the full schema if that helps.
--
Steve

On Oct 28, 2008, at 1:54 PM, Yonik Seeley wrote:
These query parsing results don't match with the config you've posted. Double-check the type of the "name" field and that you have restarted Solr since changing the schema.xml.

-Yonik
On Tue, Oct 28, 2008 at 11:25 AM, Stephen Weiss <[EMAIL PROTECTED]> wrote:
Thanks for the reply.  I've been looking at the debug page... and I really don't see any clues there (maybe I don't know how to read it).

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="wt">standard</str>
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="explainOther"/>
      <str name="hl.fl"/>
      <str name="indent">on</str>
      <str name="q">name:(stm 0810 m_*)</str>
      <str name="fl">*,score</str>
      <str name="qt">standard</str>
      <str name="debugQuery">on</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0"/>
  <lst name="debug">
    <str name="rawquerystring">name:(stm 0810 m_*)</str>
    <str name="querystring">name:(stm 0810 m_*)</str>
    <str name="parsedquery">+name:stm +name:0810 +name:m_*</str>
    <str name="parsedquery_toString">+name:stm +name:0810 +name:m_*</str>
    <lst name="explain"/>
  </lst>
</response>

I mean, as far as I can tell, that seems right. I think I'm missing something here.

The wiki page is awesome though, thank you. The catenateAll option does seem to do what I thought it did... but should I perhaps just remove any kind of filter or analyzer on this field?  It's really not a big deal if someone has to get the dashes and underscores exactly right - it's a worse problem if they do get them right, but it still doesn't work (usually they copy and paste these from an e-mail or something). Just in general, it's never really critical for someone to search by parts of the filename - except for searching with wildcard (that is, stm0810m_* and the like), and it would be a lot easier if they didn't have to put spaces where letters change to numbers & vice versa.

Thanks again for your input.

--
Steve

On Oct 28, 2008, at 10:49 AM, Feak, Todd wrote:
You may want to take a very close look at what the WordDelimiterFilter is doing. I believe the underscore is dropped entirely during indexing AND searching as it's not alphanumeric.

Wiki doco here:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=(tokenizer)#head-1c9b83870ca7890cd73b193cefed83c283339089

The admin analysis page and query debug will help a lot to see what's going on.

-Todd

-----Original Message-----
From: Stephen Weiss [mailto:[EMAIL PROTECTED]
Sent: Monday, October 27, 2008 10:32 PM
To: solr-user@lucene.apache.org
Subject: Question about textTight

Hi,
So I've been using the textTight field to hold filenames, and I've run into a weird problem. Basically, people want to search by part of a filename (say, the filename is stm0810m_ws_001ftws and they want to find everything starting with stm0810m_ (stm0810m_*)).  I'm hoping someone might have done this before (I bet someone has).

Lots of things work - you can search for stm0810m_ws_001ftws and get a result, or (stm 0810 m*), or various other combinations. What does not work is searching for (stm0810m_*) or (stm 0810 m_*) or anything like that - a problem, because often they don't want things with ma_ or mx_, but just m_.  It's almost like underscores just break everything; escaping them does nothing.

Here's the field definition (it should be what came with my solr):

<fieldType name="textTight" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0" catenateWords="1"
            catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

and usage:

<field name="name" type="textTight"
      indexed="true" stored="true" omitNorms="true"
      />


Now, I thought textTight would be good because it's the one best suited for SKUs, but I guess I'm wrong. What should I be using for this?  Would changing any of these "generateWordParts" or "catenateAll" options help? I can't seem to find any documentation, so I'm really not sure what they would do, but reindexing this whole thing will take quite some time so I'd rather know what will actually work before I just start changing things.

Thanks so much for any insight!

--
Steve
