Re: Possible to define a field so that substring-search is always used?

Erick Erickson Tue, 24 Jul 2018 09:52:22 -0700

1. the standard way to do this is to use ngrams. The index is larger,
but it gives you much quicker searches than trying to use
pre-and-postfix wildcards.


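2. use a fieldType with KeywordTokenizerFactory + (probably)
LowerCaseFilterFactory + TrimFilterFactory. And, in your case,
NGramFilterFactory (I'd start with bigrams, i.e. min=2 and max=2).
An analyzer takes only one tokenizer, so the ngram step has to be a
filter after KeywordTokenizerFactory.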

3. no. The destination field has its own field type and that's how
the input stream is analyzed. There's no good way to say "don't
analyze input from field X when copied to field Y". Probably best not
to copy it there at all.
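
As a rough sketch of points 2 and 3, the Schema API commands might look
something like the following. The field type name "email_ngram" is made
up for illustration, and the single "analyzer" (applied at both index
and query time) is an assumption; whether ngrams should also be applied
at query time is a separate choice worth testing. The ngram step is
written as NGramFilterFactory because an analyzer takes only one
tokenizer. Changing a field's type requires reindexing.

```json
{
  "add-field-type":{
     "name":"email_ngram",
     "class":"solr.TextField",
     "analyzer":{
        "tokenizer":{ "class":"solr.KeywordTokenizerFactory" },
        "filters":[
           { "class":"solr.LowerCaseFilterFactory" },
           { "class":"solr.TrimFilterFactory" },
           { "class":"solr.NGramFilterFactory",
             "minGramSize":"2",
             "maxGramSize":"2" } ]}},
  "replace-field":{
     "name":"email_address",
     "type":"email_ngram",
     "multiValued":false,
     "stored":true },
  "delete-copy-field":{
     "source":"email_address",
     "dest":"all" }
}
```

With the copy-field removed, email_address would have to be queried
explicitly alongside the "all" field.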

Best,
Erick

On Tue, Jul 24, 2018 at 9:05 AM, Christopher Schultz
<ch...@christopherschultz.net> wrote:
> All,
>
> We are using Solr as a user index, and users have email addresses.
>
> Our old search behavior used a SQL substring match for any search
> terms entered, and so users are used to being able to search for e.g.
> "chr" and finding my email address ("ch...@christopherschultz.net").
>
> By default, Solr doesn't perform substring matches, and it might be
> difficult to re-train users to use *chr* to find email addresses by
> substring.
>
> Is there a way to define the field such that searches are always done
> as a substring? While we are at it, I'd like to define the field to
> avoid tokenization because it's never useful to search for
> "m...@gmail.com" and find a few million search results because many
> users use @gmail.com email addresses.
>
> Here is the current field definition from our create-schema script:
>
>   "add-field":{
>      "name":"email_address",
>      "type":"text_general",
>      "multiValued" : false,
>      "stored":true },
>
> Later, we add the email address to the "all" field (which aggregates
> everything from all useful fields into the field used as the
> default-field):
>
>   "add-copy-field":{
>      "source":"email_address",
>      "dest":"all" },
>
> Is there a way to define these fields such that:
>
> 1. The email_address field is always searched using a substring
> 2. The email_address field is not tokenized
> 3. The copied-email-address is not tokenized in the "all" field
>
> Thanks,
> -chris
