RE: How to use a regex search within a phrase query?

Erez Michalak Mon, 23 May 2016 03:00:19 -0700

Good points, thanks Erick.

As you guessed, the use case is not in the main flow for the general user, but 
an advanced flow for a technical one.

Regarding the performance issue, I thought of a few optimizations for some 
expected expressions I need to support.
For instance, to work around the digits regex in all my examples from the mail 
below, I can simply index terms with '\d' instead of every digit (like '\d\d\d' 
for '123').
This enables a faster search as follows:
* search for "\d\d\d" instead of "/[0-9]{3}/"
* search for "\d\d\d \d\d\d\d" instead of "/[0-9]{3}/ /[0-9]{4}/"
* search for "\d\d\d example" instead of "/[0-9]{3}/ example"
Clearly, this approach supports only a very limited set of expressions, at the 
expense of an increased index size.
For the general case, though, regular expressions may indeed require a full 
index scan. Seems like all I can do in that case is to warn the user in advance 
that this may take a (long) while.
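
A minimal sketch of the digit-masking idea described above, in Python rather than in Solr itself. The function names (mask_digits, rewrite_regex_clause, rewrite_phrase) are hypothetical and only illustrate the mapping. It assumes the masked terms are stored in addition to the original terms (for example in a separate field), so that ordinary searches for '123' keep working, and it assumes query clauses are separated by whitespace, so a regex containing a space is not handled.

```python
import re

MASK = "\\d"  # the literal two-character marker backslash + 'd'

def mask_digits(token: str) -> str:
    """Index-time side: replace every digit in a token with the marker."""
    # A function replacement avoids escape processing of the backslash.
    return re.sub(r"[0-9]", lambda m: MASK, token)

def rewrite_regex_clause(clause: str):
    """Query-time side: turn /[0-9]{n}/ into n markers, else return None."""
    m = re.fullmatch(r"/\[0-9\]\{(\d+)\}/", clause)
    if m is None:
        return None  # unsupported regex: caller falls back to the slow path
    return MASK * int(m.group(1))

def rewrite_phrase(phrase: str):
    """Rewrite a whole phrase; return None if any regex clause is unsupported."""
    out = []
    for clause in phrase.split():
        if len(clause) >= 2 and clause.startswith("/") and clause.endswith("/"):
            masked = rewrite_regex_clause(clause)
            if masked is None:
                return None
            out.append(masked)
        else:
            out.append(clause)  # plain word, kept as-is
    return " ".join(out)
```

With this mapping, the token '123' is indexed as three markers and the phrase "/[0-9]{3}/ example" becomes an ordinary phrase query on masked terms. Any phrase containing another kind of regex returns None and would still need the full scan, with the warning to the user.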

Any further ideas on how to reduce the performance hit and survive the bad 
impact of a full index scan are welcome.
Erez

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, May 22, 2016 7:43 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: How to use a regex search within a phrase query?

Erez:

Before going too far down this path, understand that even if you can get this 
syntax to work, you're going to pay a _very_ significant performance hit if you 
have any decent size corpus. Conceptually, what happens is that all the terms 
that the regex matches are made into clauses. So let's take a very simple 
wildcard case:

field1 has two values f1A and f1B
field2 has two values, f2A and f2B

The result of asking for "field1:f1? field2:f2?" (as a phrase) is
"field1:f1A field2:f2A"
OR
"field1:f1A field2:f2B"
OR
"field1:f1B field2:f2A"
OR
"field1:f1B field2:f2B"

which may take quite a while to execute, and that doesn't even include the time 
that it'll take to enumerate the terms in a field that match your regex, which 
can get very ugly if your regex is such that it has to examine _every_ term in 
the field, i.e. the entire terms list for the field for the entire corpus.
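
The expansion described above can be sketched conceptually in Python. This is not how Lucene implements it internally. It only illustrates the two costs: a pass over every term in the field to find the matches, and a number of clauses that grows as the product of the match counts. The TERMS dictionary and the function names are hypothetical, built from the f1A/f1B/f2A/f2B example.

```python
from fnmatch import fnmatchcase
from itertools import product

# Hypothetical per-field term lists, taken from the example above.
TERMS = {"field1": ["f1A", "f1B"], "field2": ["f2A", "f2B"]}

def matching_terms(field, pattern, terms=TERMS):
    """Enumerate the terms of a field that match a wildcard pattern.

    This is a linear pass over the whole term list for the field.
    """
    return [t for t in terms[field] if fnmatchcase(t, pattern)]

def expand_phrase(clauses, terms=TERMS):
    """Expand each (field, pattern) clause and combine every alternative."""
    per_position = [
        [f"{field}:{t}" for t in matching_terms(field, pattern, terms)]
        for field, pattern in clauses
    ]
    # One concrete phrase per combination of matching terms.
    return [" ".join(combo) for combo in product(*per_position)]

phrases = expand_phrase([("field1", "f1?"), ("field2", "f2?")])
```

Here two matches per position give 2 x 2 = 4 phrases. With a thousand matching terms per position, the same two-clause phrase would expand to a million combinations.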

This might be an XY problem: what problem are you solving with regexes? Might 
you be better off constructing better analysis chains?
The reason I ask is that unless you have technical users, regexes are unlikely 
to be even used....

FWIW,
Erick

On Sun, May 22, 2016 at 8:19 AM, Erez Michalak <emicha...@varonis.com> wrote:
> Thank you, Ahmet, for the JIRA reference - it looks really promising and I'll 
> check it out.
>
> Regarding your question - once a piece of text is tokenized, it seems like 
> there is no way to perform a regex query across term boundaries. The pure 
> regex is good as long as I'm querying for a single term.
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Sunday, May 22, 2016 4:49 PM
> To: solr-user@lucene.apache.org; Erez Michalak <emicha...@varonis.com>
> Subject: Re: How to use a regex search within a phrase query?
>
> Hi Erez,
>
> I don't think it is possible to combine regex with phrase out-of-the-box.
> However, there is https://issues.apache.org/jira/browse/LUCENE-5205 for the 
> task.
>
> Can't you define your query in terms of pure regex?
> something like /[0-9]{3} .* [0-9]{4}/
>
> ahmet
>
>
> On Sunday, May 22, 2016 1:37 PM, Erez Michalak <emicha...@varonis.com> wrote:
> Hey,
> I'm developing a search application based on SOLR 5.3.1, and would like to 
> add to it regex search capabilities on a specific tokenized text field named 
> 'content'.
> Is it possible to combine the default regex syntax within a phrase query (and 
> moreover, within a proximity search)? If so, please instruct me how.
>
> Thanks in advance,
> Erez Michalak
>
> p.s.
> Maybe the following example will make my question clearer:
> The query content:/[0-9]{3}/ returns documents with (at least one) 3-digit 
> token, as expected.
> However,
>
> * the query content:"/[0-9]{3}/ /[0-9]{4}/" doesn't match the contents 
> '123-1234' and '123 1234', even though they are tokenized to two tokens 
> ('123' and '1234') which individually match each part of the query
>
> * the query content:"/[0-9]{3}/ example" doesn't match the content 
> '123 example'
>
> * even the query content:"/[0-9]{3}/" (same as the query that works but 
> surrounded with quotation marks) doesn't return documents with a 3-digit 
> token!
>
> * etc.
>
>
> ________________________________
> This email and any attachments thereto may contain private, confidential, and 
> privileged material for the sole use of the intended recipient. Any review, 
> copying, or distribution of this email (or any attachments thereto) by others 
> is strictly prohibited. If you are not the intended recipient, please contact 
> the sender immediately and permanently delete the original and any copies of 
> this email and any attachments thereto.
