Re: Handling and sorting email addresses

Erick Erickson Mon, 08 Mar 2010 05:54:45 -0800

Well, it's not unfortunate <G>. What would it mean to sort
on a tokenized field? Let's say I index "is testing fun". Removing
stopwords and stemming probably indexes "test" "fun". How
in the world would meaningful sorts happen now? Even if
it was "in order", since the first token was stopped out this
document wouldn't even be in the right part of the alphabet.


The usual solution is to use copyField to index your data untokenized
in a second field, then sort on *that* field.
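
For illustration, a minimal schema.xml sketch of that approach. The field
names "email" and "email_sort" are hypothetical, and it assumes the stock
"string" type (solr.StrField) from the example schema is defined. A sort
field has to be single-valued, so this also assumes one address per document:

```xml
<!-- Hypothetical names; "email" uses the tokenized type from the original post. -->
<field name="email" type="email" indexed="true" stored="true"/>

<!-- Untokenized copy, used only for sorting. -->
<field name="email_sort" type="string" indexed="true" stored="false"/>

<!-- copyField sits directly under <schema>, outside the <fields> section. -->
<copyField source="email" dest="email_sort"/>
```

You would then search on email and sort with something like
sort=email_sort asc. A string field sorts on the raw value, so the
ordering is case-sensitive.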

HTH
Erick

On Mon, Mar 8, 2010 at 6:56 AM, Ian Battersby <ian.batter...@gmail.com> wrote:

> Thanks Mitch, using the analysis page has been a real eye-opener and given
> me a better insight into how Solr was applying the filters (and more
> importantly in which order). Ironically, I've ended up with a charFilter
> mapping file, as this seemed the only route to replacing characters before
> the tokenizer kicked in. Unfortunately, Solr just refused to allow sorting
> on anything tokenized with characters other than whitespace.
>
> Cheers, Ian.
>
> -----Original Message-----
> From: MitchK [mailto:mitc...@web.de]
> Sent: 07 March 2010 22:44
> To: solr-user@lucene.apache.org
> Subject: Re: Handling and sorting email addresses
>
>
> Ian,
>
> did you have a look at Solr's admin analysis.jsp?
> If everything on the analysis page looks fine, you have misunderstood
> Solr's schema.xml file.
>
> You've set two attributes in your schema.xml:
> stored = true
> indexed = true
>
> What you get as a response is the stored field value.
> The stored field value is the original field value, without any
> modifications.
> However, Solr is using the indexed field value to query your data.
>
> Kind regards
> - Mitch
>
>
> Ian Battersby wrote:
> >
> > Forgive what might seem like a newbie question, but I am struggling
> > desperately with this.
> >
> > We have a dynamic field that holds email addresses and we'd like to be
> > able to sort by it. When we try to do this we get an error, as Solr
> > thinks the email address is a tokenized field. We've tried a custom field
> > type using PatternReplaceFilterFactory to specify that @ and . should be
> > replaced with " AT " and " DOT ", but we just can't seem to get it to
> > work: all the fields still contain the unparsed email.
> >
> > We used an example found on the mailing-list for the field type:
> >
> >     <fieldType name="email" class="solr.TextField" positionIncrementGap="100">
> >       <analyzer>
> >         <tokenizer class="solr.StandardTokenizerFactory"/>
> >         <filter class="solr.LowerCaseFilterFactory"/>
> >         <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement=" DOT " replace="all" />
> >         <filter class="solr.PatternReplaceFilterFactory" pattern="@" replacement=" AT " replace="all" />
> >         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> >                 generateNumberParts="1" catenateWords="0" catenateNumbers="0"
> >                 catenateAll="0" splitOnCaseChange="0"/>
> >       </analyzer>
> >     </fieldType>
> >
> > .. our dynamic field looks like ..
> >
> >   <dynamicField name="dynamicemail_*"  type="email"  indexed="true"  stored="true"  multiValued="true" />
> >
> > When writing a document to Solr it still seems to write the original
> > email address (e.g. this.u...@somewhere.com) as opposed to its parsed
> > version (e.g. this DOT user AT somewhere DOT com). Can anyone help?
> >
> > We are running version 1.4 but have even tried the nightly build in an
> > attempt to solve this problem.
> >
> > Thanks.
> >
> >
> >
>
