Hello, Jack, Steve,

Thank you for your answers. I've never used UAX29URLEmailTokenizerFactory,
but I read about it before trying RegExp queries. As far as I know,
UAX29URLEmailTokenizerFactory tokenizes an input text value into tokens that
match URLs, E-mails, etc. Reading the documentation, I haven't found any way
to select just E-mail patterns and not URL ones, for example. I feel that it
would make sense to specify one or more patterns in a configuration file to
be set in the tokenizer definition in schema.xml, but I found nothing.

I just want to retrieve the indexed documents that contain at least one
E-mail in the text. However, even using UAX29URLEmailTokenizerFactory,
and supposing that I store that E-mail data in a field called 'emails' (I
feel creative, hehe), a query like the following appears to be... dirty:

http://localhost:8080/mysolr/select?q=emails:[* TO *]&start=0&rows=10&sort=mydate desc
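
As a side note on the request itself: the raw URL contains spaces and brackets, so a client would normally percent-encode it. Below is a minimal Python sketch, assuming the host and path from the URL above and the field names 'emails' and 'mydate'; the function name build_email_query_url is made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumed endpoint, copied from the example URL; adjust to the real Solr core.
SOLR_SELECT = "http://localhost:8080/mysolr/select"


def build_email_query_url(start=0, rows=10):
    """Build a select URL for documents with any value in the 'emails' field."""
    params = [
        ("q", "emails:[* TO *]"),  # open range: the field has at least one term
        ("start", start),
        ("rows", rows),
        ("sort", "mydate desc"),
    ]
    # urlencode percent-encodes the brackets and turns spaces into '+'.
    return SOLR_SELECT + "?" + urlencode(params)


url = build_email_query_url()
# Round-trip check: decoding gives back the original parameter values.
decoded = parse_qs(urlparse(url).query)
print(url)
print(decoded["q"][0])
```

The open range query only tells whether the 'emails' field has any indexed term, so it relies on that field containing nothing but E-mail tokens.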

What do you think about it?

And Andy... I know many RegExps to find E-mail patterns in a text. That
wasn't my question, and of course there is no perfect one. However, Lucene
RegExp syntax is different from the classic RegExp syntax, so it is not as
easy as copying and pasting any RegExp and, voilà! E-mails everywhere.
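
One of those differences, sketched in Python purely as an analogy: a Lucene regexp query is matched against each whole indexed term rather than searched for inside the field text, which re.fullmatch approximates here. The pattern, the sample text, and the whitespace tokenization are illustrative assumptions, and Python's re syntax is not Lucene's.

```python
import re

# Simplified E-mail pattern in Python syntax (illustrative, not Lucene syntax).
EMAIL = re.compile(r"[a-z0-9_.-]+@[a-z0-9-]+(\.[a-z0-9-]+)*\.[a-z]{2,4}")

text = "please write to luis@example.com today"

# Classic usage: search for the pattern anywhere inside the full text.
found_in_text = EMAIL.search(text) is not None

# Lucene-like usage: the pattern must match an entire indexed term.
# Assumption: whitespace tokenization, which keeps the address as one term.
terms = text.split()
matching_terms = [t for t in terms if EMAIL.fullmatch(t)]

# Matching the pattern against the whole field value fails.
whole_text_matches = EMAIL.fullmatch(text) is not None

print(found_in_text, matching_terms, whole_text_matches)
```

If the analyzer splits the address into several terms, no single term can match the full pattern, which is one reason a copied expression may find nothing.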

Thank you very much in advance,

Best regards,





2013/7/30 Jack Krupansky <j...@basetechnology.com>

> Just use the UAX29URLEmailTokenizerFactory, which recognizes email
> addresses.
>
> Any particular reason that you're trying to reinvent the wheel?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Luis Cappa Banda
> Sent: Tuesday, July 30, 2013 10:53 AM
> To: solr-user@lucene.apache.org
> Subject: Email regular expression.
>
>
> Hello everyone!
>
> Unfortunately I have to search all E-mail addresses found in a text field
> from each document. I've been reading for a while how to use RegExp's in
> Solr, but after trying some of them they didn't work. I've noticed that
> Lucene RegExp syntax sometimes is very different from the classic RegExp
> syntax, so that may be the reason why they didn't work for me, and maybe
> someone more expert can help me.
>
> The syntax is the following:
>
> *E-mail: *
>
> text:/[a-z0-9_\|-]+(\.[a-z0-9_\|-]|)*@[a-z0-9-]|(\.[a-z0-9-]|)*\.([a-z]{2,4})/
>
> Thank you very much in advance!
>
> Best regards,
>
> --
> - Luis Cappa
>



-- 
- Luis Cappa
