Good points, thanks Erick. As you guessed, the use case is not in the main flow for the general user, but an advanced flow for a technical one.
Regarding the performance issue, I thought of a few optimizations for some expected expressions I need to support. For instance, to work around the digits regex in all my examples from the mail below, I can simply index terms with '\d' in place of every digit (e.g. '\d\d\d' for '123'). This enables a faster search as follows:

* search for "\d\d\d" instead of "/[0-9]{3}/"
* search for "\d\d\d \d\d\d\d" instead of "/[0-9]{3}/ /[0-9]{4}/"
* search for "\d\d\d example" instead of "/[0-9]{3}/ example"

Clearly, this approach supports only a very limited set of expressions, at the cost of an increase in the index size.

For the general case, though, regular expressions may indeed require a full index scan. It seems that all I can do in that case is warn the user in advance that this may take a (long) while.

Any further ideas on how to reduce the performance hit and survive the impact of a full index scan are welcome.

Erez

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Sunday, May 22, 2016 7:43 PM
To: solr-user <solr-user@lucene.apache.org>
Subject: Re: How to use a regex search within a phrase query?

Erez:

Before going too far down this path, understand that even if you can get this syntax to work, you're going to pay a _very_ significant performance hit if you have any decent size corpus. Conceptually, what happens is that all the terms that the regex matches are made into clauses. So let's take a very simple wildcard case:

field1 has two values, f1A and f1B
field2 has two values, f2A and f2B

The result of asking for "field1:f1? field2:f2?" (as a phrase) is

"field1:f1A field2:f2A" OR
"field1:f1A field2:f2B" OR
"field1:f1B field2:f2A" OR
"field1:f1B field2:f2B"

which may take quite a while to execute, and that doesn't even include the time that it'll take to enumerate the terms in a field that match your regex, which can get very ugly if your regex is such that it has to examine _every_ term in the field, i.e.
the entire terms list for the field for the entire corpus.

This might be an XY problem: what problem are you solving with regexes? Might you be better off constructing better analysis chains? The reason I ask is that unless you have technical users, regexes are unlikely to even be used....

FWIW,
Erick

On Sun, May 22, 2016 at 8:19 AM, Erez Michalak <emicha...@varonis.com> wrote:
> Thank you Ahmet for the JIRA reference - it looks really promising and I'll
> check it out.
>
> Regarding your question - once a piece of text is tokenized, it seems like
> there is no way to perform a regex query across term boundaries. The pure
> regex is good as long as I'm querying for a single term.
>
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Sunday, May 22, 2016 4:49 PM
> To: solr-user@lucene.apache.org; Erez Michalak <emicha...@varonis.com>
> Subject: Re: How to use a regex search within a phrase query?
>
> Hi Erez,
>
> I don't think it is possible to combine regex with phrase out-of-the-box.
> However, there is https://issues.apache.org/jira/browse/LUCENE-5205 for the
> task.
>
> Can't you define your query in terms of pure regex?
> something like /[0-9]{3} .* [0-9]{4}/
>
> ahmet
>
>
> On Sunday, May 22, 2016 1:37 PM, Erez Michalak <emicha...@varonis.com> wrote:
> Hey,
> I'm developing a search application based on SOLR 5.3.1, and would like to
> add to it regex search capabilities on a specific tokenized text field named
> 'content'.
> Is it possible to combine the default regex syntax within a phrase query (and
> moreover, within a proximity search)? If so, please instruct me how..
>
> Thanks in advance,
> Erez Michalak
>
> p.s.
> Maybe the following example will make my question clearer:
> The query content:/[0-9]{3}/ returns documents with (at least one) 3 digits
> token as expected.
> However,
>
> * the query content:"/[0-9]{3}/ /[0-9]{4}/" doesn't match the
> contents '123-1234' and '123 1234', even though they are tokenized to two
> tokens ('123' and '1234') which individually match each part of the query
>
> * the query content:"/[0-9]{3}/ example" doesn't match the content
> '123 example'
>
> * even the query content:"/[0-9]{3}/" (same as the query that works
> but surrounded with quotation marks) doesn't return documents with 3 digits
> token!
>
> * etc.
>
>
> ________________________________
> This email and any attachments thereto may contain private, confidential, and
> privileged material for the sole use of the intended recipient. Any review,
> copying, or distribution of this email (or any attachments thereto) by others
> is strictly prohibited. If you are not the intended recipient, please contact
> the sender immediately and permanently delete the original and any copies of
> this email and any attachments thereto.
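
For reference, the digit-class idea from the top of this thread (indexing '\d' in place of each digit) can be sketched roughly as below. This is only an illustration in Python, not Solr configuration: the helper names digit_shape and shape_tokens are made up for the example, and the split on whitespace and hyphens is an assumption standing in for whatever the real analysis chain of the 'content' field does.

```python
import re

# Matches a single ASCII digit, mirroring the [0-9] class used in the thread.
DIGIT = re.compile(r"[0-9]")


def digit_shape(token: str) -> str:
    """Replace every digit in a token with the two literal characters \\d.

    A function is used as the replacement because a plain replacement
    string containing \\d is rejected by re.sub as a bad escape.
    """
    return DIGIT.sub(lambda _match: r"\d", token)


def shape_tokens(text: str) -> list[str]:
    """Split on whitespace and hyphens (an assumed, simplified tokenizer),
    then turn each token into its digit shape."""
    return [digit_shape(tok) for tok in re.split(r"[\s\-]+", text) if tok]


if __name__ == "__main__":
    # Both contents from the original question end up with the same shape,
    # so an ordinary phrase query on the shaped terms could match them.
    print(shape_tokens("123-1234"))
    print(shape_tokens("123 example"))
```

With terms indexed this way (presumably in an additional field, which is where the extra index size comes from), the phrase "\d\d\d \d\d\d\d" becomes a plain two-term phrase lookup with no regex enumeration at query time. The same normalization would have to be applied to the user's query, and it only covers patterns that reduce to fixed digit counts.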
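
The clause expansion Erick describes (every term matched at one phrase position combined with every term matched at the next) can also be illustrated with a few lines of Python. This is a conceptual sketch only, not how Lucene builds the query internally; expand_phrase is a hypothetical helper, and the term lists are the toy values from his message.

```python
from itertools import product


def expand_phrase(term_lists):
    """Return one phrase clause per combination of matching terms.

    term_lists holds, for each phrase position, the terms that the
    wildcard or regex at that position matched.
    """
    return [" ".join(combo) for combo in product(*term_lists)]


# Terms matched by "field1:f1?" and "field2:f2?" in the toy example.
field1_matches = ["field1:f1A", "field1:f1B"]
field2_matches = ["field2:f2A", "field2:f2B"]

clauses = expand_phrase([field1_matches, field2_matches])

if __name__ == "__main__":
    for clause in clauses:
        print(f'"{clause}"')
```

The number of clauses is the product of the match counts at each position, so two positions matching 1,000 terms each would already give 1,000,000 clauses, before counting the cost of enumerating the matching terms in the first place.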