Re: Indexing fieldvalues with dashes and spaces

Erick Erickson Fri, 06 Aug 2010 16:46:46 -0700

See below:

On Fri, Aug 6, 2010 at 9:00 AM, PeterKerk <vettepa...@hotmail.com> wrote:


>
> Ah, I'm glad it does, makes me feel a bit less stupid ;)
>
> So to summarize and see if I understand it now:
> - the analyzers allow for many different ways to index a field, these
> analyzers are placed in a chain
>

Minor terminology nit. An Analyzer consists of a Tokenizer and N Filters.
The Tokenizer breaks up the input stream, then the Filters "do things"
to the tokens. So say you're using WhitespaceTokenizer on "This
time all People are Good". The tokenizer would create the tokens
This, time, all, People, are, Good. LowerCaseFilter would transform
these to
this, time, all, people, are, good
then you could apply, say, a StopFilter which could remove the
tokens this, all, are and you'd have
time, people, good
etc....

See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
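
The chain described above can be sketched in a few lines of plain Python.
This is a toy model for illustration only, not Solr or Lucene code: the
function names and the three-word stop list are assumptions made up for
this example.

```python
# Toy model of an analysis chain: one tokenizer followed by filters.
# Plain Python for illustration only; this is not Solr or Lucene code.

STOP_WORDS = {"this", "all", "are"}  # assumed stop list for this example


def whitespace_tokenize(text):
    """Tokenizer: split the input stream on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens):
    """Filter: lower-case every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens, stop_words=STOP_WORDS):
    """Filter: drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in stop_words]


def analyze(text):
    """Run the tokenizer, then each filter in order."""
    tokens = whitespace_tokenize(text)
    tokens = lowercase_filter(tokens)
    return stop_filter(tokens)


if __name__ == "__main__":
    print(analyze("This time all People are Good"))
```

Order matters in a chain like this: the stop list here is lower case, so
the lowercasing step has to run before the stop step for "This" to be
removed.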


> - when a field is indexed it can be searched
> - a field could also be stored as-is, but I would need to indicate that
> - only if a field is stored, a search query returns that field if the
> search
> query matches a part of it
>
Very close. Besides being stored, the field has to be requested: either
configure your search handler to return it, or specify it with a
parameter like &fl=fieldname.
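
As a rough sketch of the fl parameter, the snippet below builds a request
URL that asks for specific stored fields. The host, port and /select path
are assumptions (a local Solr with a default select handler), and
build_select_url is a hypothetical helper written for this example; the
field names come from the schema quoted further down.

```python
from urllib.parse import urlencode

# Assumed base URL: a local Solr with a default /select handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, fields):
    """Build a select URL that asks for specific stored fields via fl."""
    params = {"q": query, "fl": ",".join(fields)}
    return SOLR_SELECT_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Ask for the id and city fields of documents matching "york".
    print(build_select_url("york", ["id", "city"]))
```

Only fields declared with stored="true" can come back this way; a field
that is indexed but not stored can be searched but not returned.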

>
> If that's the case, it's clear what it all does.
>
> Then still the question remains on how to configure this in the schema.xml.
> There's just so much documentation and examples in so many different places
> that I'm lost. I've used almost the literal example schema.xml which has
> many similarities (e.g. on categories facet) with my use case, but I dont
> know if they allow for the exact operations I require.
>
> If you look at my schema.xml, how would you configure it to do the
> following:
>
> a city field is something that I want users to search on via text input, so
> lets say "New Yo" would give the results for "New York".
> ===> so this field would need to be stored right?
>
No, you don't need to store it at all. You can search anything
that's indexed. Stored is only for returning a copy of the data
as a field. What you *would* have to do is figure out the rules you
want to apply to have "New Yo" match "New York". You could
use either NGramFilterFactory or EdgeNGramFilterFactory.
You could decide to search with wildcards. You could choose to
autocomplete as the user enters data. You could...
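
To make the edge n-gram idea concrete, here is a toy sketch in plain
Python. It is not the Solr filter itself: it assumes the whole city name
is kept as a single lower-cased token at index time and expanded into its
leading substrings, and the gram sizes of 1 and 25 are arbitrary choices
for this example.

```python
def edge_ngrams(term, min_size=1, max_size=25):
    """Return the leading substrings of term, min_size to max_size chars long."""
    longest = min(len(term), max_size)
    return [term[:n] for n in range(min_size, longest + 1)]


def index_terms(value):
    """Index side: one lower-cased token for the whole value, then expand it."""
    return set(edge_ngrams(value.strip().lower()))


def matches(user_input, value):
    """Query side: lower-case the input and look it up as an exact term."""
    return user_input.strip().lower() in index_terms(value)


if __name__ == "__main__":
    print(sorted(index_terms("New York"), key=len))
    print(matches("New Yo", "New York"))
```

The trade-off is index size: every prefix becomes its own indexed term, so
the query side can stay a plain exact-term lookup with no wildcard.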


> But also a facet "Cities" is available in which "New York" is just one of
> the cities that is selectable as a filter/facet.
> ===> for this I need to create a facet
>
> The other facet is "theme", which in my example holds values like
> "Gemeentehuis" and "Strand & Zee", that would not be a thing on which can
> be
> searched via manual input but IS selectable as a filter/facet
> ===> this field would NOT have to be stored right?
>
You don't have to store things that are faceted. See the discussion
here:
http://wiki.apache.org/solr/SolrFacetingOverview
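
Facet counts are computed from the indexed terms, which is why the choice
of tokenizer matters for a value like "Strand & Zee". The toy sketch below
is plain Python, not Solr: the two sample documents are made up for this
example, and the two term functions are stand-ins for a
whitespace-tokenized field and an untokenized string field.

```python
from collections import Counter


def whitespace_terms(value):
    """Stand-in for a whitespace-tokenized field: one term per word."""
    return value.split()


def whole_value(value):
    """Stand-in for an untokenized string field: the value is one term."""
    return [value]


def facet_counts(docs, field, tokenize):
    """Count, per indexed term, how many documents contain it."""
    counts = Counter()
    for doc in docs:
        terms = set()
        for value in doc.get(field, []):
            terms.update(tokenize(value))
        counts.update(terms)
    return dict(counts)


# Made-up sample documents using theme values mentioned in this thread.
DOCS = [
    {"theme": ["Strand & Zee", "Gemeentehuis"]},
    {"theme": ["Strand & Zee"]},
]

if __name__ == "__main__":
    print(facet_counts(DOCS, "theme", whitespace_terms))
    print(facet_counts(DOCS, "theme", whole_value))
```

With the whitespace version, "Strand", "&" and "Zee" each show up as a
separate facet value; with the untokenized version, "Strand & Zee" stays
one value. Neither case depends on the field being stored.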

Best
Erick


> Thanks for your time! :)
>
> Regards,
> Pete
>
>
>
> Erick Erickson wrote:
> >
> > This confuses lots of people. When you index a field, it's Analyzed 10
> > ways from Sunday. Consider "The World is an unknown Entity". When
> > you INDEX it, many things happen, depending upon the analyzer.
> > Stopwords may be removed. Each token may be lower cased. Each token
> > may be stemmed. It all depends on what's in your analyzer chain. Assume
> > a simple chain consisting of breaking up tokens on whitespace,
> > lowercasing,
> > and removing stopwords. The actual tokens INDEXED would be "world",
> > "unknown", and "entity". That is what is searched against.
> >
> > However, the string, unchanged, would be STORED if you specified it so.
> > So when you asked for the field to be returned in a search result, you
> > would
> > get "The World is an unknown Entity" if you asked for the field to be
> > returned as part of a search result that matched on, say, "world".
> >
> > HTH
> > Erick
> >
> > On Thu, Aug 5, 2010 at 4:31 AM, PeterKerk <vettepa...@hotmail.com>
> wrote:
> >
> >>
> >> @Michael, @Erick,
> >>
> >> You both mention interesting things that triggered me.
> >>
> >> @Erick:
> >> Your referenced page is very useful. It seems the whitespace tokenizer
> >> under
> >> the text_ws is causing issues.
> >>
> >> You do mention another interesting thing:
> >> "And do be aware that fields you get back from a request (i.e. a search)
> >> are
> >> the stored fields, NOT what's indexed."
> >>
> >> On the page you provided I see this under the Analyzers section:
> >> "Analyzers
> >> are components that pre-process input text at index time and/or at
> search
> >> time."
> >>
> >> So I dont completely understand how that sentence is in line with your
> >> comment.
> >>
> >>
> >> @Michael:
> >> You say: "use the tokenized field to return results, but have a
> duplicate
> >> field of fieldtype="string" to show the untokenized results. E.g. facet
> >> on
> >> that field."
> >> I think your comment applies on my requirement: "a city field is
> >> something
> >> that I want users to search on via text input, so lets say "New Yo"
> would
> >> give the results for "New York".
> >> But also a facet "Cities" is available in which "New York" is just one
> of
> >> the cities that is clickable.
> >> The other facet is "theme", which in my example holds values like
> >> "Gemeentehuis" and "Strand & Zee", that would not be a thing on which
> can
> >> be
> >> searched via manual input but IS clickable. "
> >>
> >> Could you please indicate (just for the above fields) what needs to be
> >> changed in my schema.xml and if so how that affects the way my request
> is
> >> build up?
> >>
> >>
> >> Thanks so much ahead in getting me started!
> >>
> >>
> >> This is my schema.xml
> >>
> >>
> >> <?xml version="1.0" encoding="UTF-8" ?>
> >>
> >> <schema name="db" version="1.1">
> >>
> >>  <types>
> >>    <fieldType name="string" class="solr.StrField" sortMissingLast="true"
> >> omitNorms="true"/>
> >>    <fieldType name="boolean" class="solr.BoolField"
> >> sortMissingLast="true"
> >> omitNorms="true"/>
> >>    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
> >>    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
> >>    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
> >>    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>
> >>    <fieldType name="sint" class="solr.SortableIntField"
> >> sortMissingLast="true" omitNorms="true"/>
> >>    <fieldType name="slong" class="solr.SortableLongField"
> >> sortMissingLast="true" omitNorms="true"/>
> >>    <fieldType name="sfloat" class="solr.SortableFloatField"
> >> sortMissingLast="true" omitNorms="true"/>
> >>    <fieldType name="sdouble" class="solr.SortableDoubleField"
> >> sortMissingLast="true" omitNorms="true"/>
> >>    <fieldType name="date" class="solr.DateField" sortMissingLast="true"
> >> omitNorms="true"/>
> >>    <fieldType name="random" class="solr.RandomSortField" indexed="true"
> >> />
> >>    <fieldType name="text_ws" class="solr.TextField"
> >> positionIncrementGap="100">
> >>      <analyzer>
> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>      </analyzer>
> >>    </fieldType>
> >>    <fieldType name="text" class="solr.TextField"
> >> positionIncrementGap="100">
> >>      <analyzer type="index">
> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> words="stopwords.txt"/>
> >>        <filter class="solr.WordDelimiterFilterFactory"
> >> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <filter class="solr.EnglishPorterFilterFactory"
> >> protected="protwords.txt"/>
> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>      </analyzer>
> >>      <analyzer type="query">
> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >> ignoreCase="true" expand="true"/>
> >>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> words="stopwords.txt"/>
> >>        <filter class="solr.WordDelimiterFilterFactory"
> >> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <filter class="solr.EnglishPorterFilterFactory"
> >> protected="protwords.txt"/>
> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>      </analyzer>
> >>    </fieldType>
> >>
> >>    <fieldType name="textTight" class="solr.TextField"
> >> positionIncrementGap="100" >
> >>      <analyzer>
> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >> ignoreCase="true" expand="false"/>
> >>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> >> words="stopwords.txt"/>
> >>        <filter class="solr.WordDelimiterFilterFactory"
> >> generateWordParts="0" generateNumberParts="0" catenateWords="1"
> >> catenateNumbers="1" catenateAll="0"/>
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <filter class="solr.EnglishPorterFilterFactory"
> >> protected="protwords.txt"/>
> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>      </analyzer>
> >>    </fieldType>
> >>
> >>    <fieldType name="alphaOnlySort" class="solr.TextField"
> >> sortMissingLast="true" omitNorms="true">
> >>      <analyzer>
> >>        <tokenizer class="solr.KeywordTokenizerFactory"/>
> >>        <filter class="solr.LowerCaseFilterFactory" />
> >>        <filter class="solr.TrimFilterFactory" />
> >>        <filter class="solr.PatternReplaceFilterFactory"
> >> pattern="([^a-z])"
> >> replacement="" replace="all" />
> >>      </analyzer>
> >>    </fieldType>
> >>    <fieldtype name="ignored" stored="false" indexed="false"
> >> class="solr.StrField" />
> >>  </types>
> >>
> >>  <fields>
> >>   <field name="id" type="string" indexed="true" stored="true"
> >> required="true" />
> >>   <field name="title" type="text_ws" indexed="true" stored="true"/>
> >>    <field name="city" type="text_ws" indexed="true" stored="true"/>
> >>    <field name="official" type="integer" indexed="true" stored="true"/>
> >>    <field name="theme" type="text_ws" indexed="true" stored="true"
> >> multiValued="true" omitNorms="true" termVectors="true" />
> >>   <field name="features" type="text_ws" indexed="true" stored="true"
> >> multiValued="true"/>
> >>   <field name="services" type="text_ws" indexed="true" stored="true"
> >> multiValued="true"/>
> >>   <field name="province" type="text_ws" indexed="true" stored="true"/>
> >>    <field name="word" type="string" indexed="true" stored="true"/>
> >>   <field name="text" type="text" indexed="true" stored="false"
> >> multiValued="true"/>
> >>   <field name="timestamp" type="date" indexed="true" stored="true"
> >> default="NOW" multiValued="false"/>
> >>
> >>   <dynamicField name="*_i"  type="sint"    indexed="true"
> >> stored="true"/>
> >>   <dynamicField name="*_s"  type="string"  indexed="true"
> >> stored="true"/>
> >>   <dynamicField name="*_l"  type="slong"   indexed="true"
> >> stored="true"/>
> >>   <dynamicField name="*_t"  type="text"    indexed="true"
> >> stored="true"/>
> >>   <dynamicField name="*_b"  type="boolean" indexed="true"
> >> stored="true"/>
> >>   <dynamicField name="*_f"  type="sfloat"  indexed="true"
> >> stored="true"/>
> >>   <dynamicField name="*_d"  type="sdouble" indexed="true"
> >> stored="true"/>
> >>   <dynamicField name="*_dt" type="date"    indexed="true"
> >> stored="true"/>
> >>   <dynamicField name="random*" type="random" />
> >>
> >>  </fields>
> >>
> >>  <uniqueKey>id</uniqueKey>
> >>
> >>  <defaultSearchField>text</defaultSearchField>
> >>
> >>  <solrQueryParser defaultOperator="OR"/>
> >>
> >>   <copyField source="theme" dest="text"/>
> >>   <copyField source="title" dest="text"/>
> >>   <copyField source="city" dest="text"/>
> >>   <copyField source="official" dest="text" />
> >>   <copyField source="features" dest="text"/>
> >>   <copyField source="services" dest="text"/>
> >> </schema>
> >>
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Indexing-fieldvalues-with-dashes-and-spaces-tp1023699p1025463.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >
> >
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-fieldvalues-with-dashes-and-spaces-tp1023699p1029811.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
