Re: [regex] New enum type syntax_option_type
On Mon, Aug 19, 2013 at 10:26 PM, Tim Shen wrote: > Great, let's test it now, though I don't think that it breaks something. Committed. -- Tim Shen
Re: [Patch] Fix empty grouping problem in regex
Change vector::at() call to vector::operator[](). Booted and tested with -m32, -m64 and -m64 debug-mode under x86_64 GNU/Linux and committed(r201914). -- Tim Shen
Re: [Patch] Replace spaces with tabs in regex files
On Thu, Aug 22, 2013 at 4:56 PM, Paolo Carlini wrote: > Ok, thanks! Committed. -- Tim Shen
Re: [Patch] Fix empty grouping problem in regex
On Thu, Aug 22, 2013 at 9:08 AM, Tim Shen wrote: > Booted and tested with -m32, -m64 and -m64 debug-mode under x86_64 > GNU/Linux and committed(r201914). ...and here's the patch committed. -- Tim Shen emptygroup.patch Description: Binary data
[Patch 2/2] Localization problem in regex
Inspired by this mail: http://gcc.gnu.org/ml/libstdc++/2013-08/msg00131.html Thanks! -- Tim Shen regex-locale.patch Description: Binary data
Re: [Patch 2/2] Localization problem in regex
On Fri, Aug 23, 2013 at 6:29 PM, Paolo Carlini wrote: > Thanks a lot indeed. But: "Fix callers" of *what*?!? And from *where*? To be > really honest this kind of ChangeLog entry doesn't make much sense. Please > get used to list the specific functions you are changing. It's important. I see. Actually I find a mistake by this review /w\, thanks! -- Tim Shen regex-locale.patch Description: Binary data
Re: [Patch 1/2] Rewrite regex scanner
On Fri, Aug 23, 2013 at 5:40 PM, Jonathan Wakely wrote: > I would rearrange the ChangeLog, to put > > * include/bits/regex_executor.tcc: Don't use reference for submatch. > > after regex_scanner.tcc, so that the files the _Scanner is moved from > come immediately before the files it is moved to, otherwise it reads > like "move _Scanner from ... don't use reference ... here" which > doesn't make much sense :-) Hmmm...it turns out there's no specified file order? // I always use default order before. So here it is. By the way, it seems to take surprisingly short time to review; after all, it's a huge patch. -- Tim Shen changelog Description: Binary data
Re: [Patch 2/2] Localization problem in regex
On Fri, Aug 23, 2013 at 10:26 PM, Paolo Carlini wrote: > Great. But please but the names of the functions you are changing between > round brackets. You have tons of examples everywhere. ...like this? -- Tim Shen changelog Description: Binary data
Re: [Patch 2/2] Localization problem in regex
On Fri, Aug 23, 2013 at 11:03 PM, Paolo Carlini wrote: > The assign change is fine. But I would also put between round brackets > (class _Compiler<>) for the _M_traits change and use something like > (_DFSExecutor<>::_DFSExecutor): Adjust. for the last change (remember that > constructors are special member *functions* thus should be normally listed). I see. On Fri, Aug 23, 2013 at 10:57 PM, Jonathan Wakely wrote: > Yes, that makes it clear what part of the file is affected. That's > important when searching ChangeLogs to find what change last touched > e.g. basic_regex::assign. Why don't we `svn blame`? By the way, do we seariously need to keep ChangeLog in files? Is it more like a tradition than a solution? After all, we have SVN or Git now. I remember a mail metioned this but cannot find it >.< -- Tim Shen changelog Description: Binary data
Re: [Patch 2/2] Localization problem in regex
On Sat, Aug 24, 2013 at 1:51 AM, Stefan Schweter wrote: > Am Fri, 23 Aug 2013 17:17:41 +0800 > schrieb Tim Shen : > >> Inspired by this mail: >> http://gcc.gnu.org/ml/libstdc++/2013-08/msg00131.html >> >> Thanks! >> >> > > Hi Tim, > > I applied the patch + compiled everything - it's working now! Thanks! > > But... I found a problem with .imbue(): > > int main() > { > std::setlocale(LC_ALL, ""); > > std::wstring str2 = L"öäü"; > std::wregex re2; > re2.imbue(std::locale("")); > re2.assign(L"([[:lower:]]+)"); > std::wsmatch m2; > > std::wsregex_token_iterator end {}; > > for (std::wsregex_token_iterator p{str2.begin(), str2.end(), > re2, {1}}; p != end; ++p) { std::wcout << *p << std::endl; > } > > return 0; > } > > Works fine! But when there's no imbue() call, a Segmentation > fault occurs. > > int main() > { > std::setlocale(LC_ALL, ""); > > std::wstring str2 = L"öäü"; > std::wregex re2; > //re2.imbue(std::locale("")); > re2.assign(L"([[:lower:]]+)"); > std::wsmatch m2; > > std::wsregex_token_iterator end {}; > > for (std::wsregex_token_iterator p{str2.begin(), str2.end(), > re2, {1}}; p != end; ++p) { std::wcout << *p << std::endl; > } > > return 0; > } > > May be it's better to throw an exception here? Thanks in advance, > > Stefan Fixed. It's not fully tested. I just run the 28_regex testcases. I'll do a full test before committing. Thanks again! -- Tim Shen regex-locale.patch Description: Binary data
Re: [Patch 2/2] Localization problem in regex
> Fixed. > > It's not fully tested. I just run the 28_regex testcases. I'll do a > full test before committing. Committed. -- Tim Shen changelog Description: Binary data regex-locale.patch Description: Binary data
[Patch] Rewrite regex matchers
Rewrite matchers using structs instead of closures. It's more clear and easy to maintain now. Tested under -m32, -m64 and debug mode. Thanks! -- Tim Shen matchers.patch Description: Binary data
Re: [Patch] Rewrite regex matchers
On Fri, Aug 30, 2013 at 9:26 AM, Paolo Carlini wrote: > Hi, > > On 08/30/2013 02:05 PM, Tim Shen wrote: >> >> + const _TraitsT&_M_traits; >> + _FlagT _M_flags; >> + bool _M_is_non_matching; >> + std::set<_CharT> _M_char_set; >> + std::set> _M_range_set; >> + _CharClassT_M_class_set; > > another, very general comment: now that nothing is decided ABI-wise for > these features, let's pay attention to the layouts, let's make sure that the > data members are ordered in the best way to minimize the size. For example, > when I see a bool sandwiched between big objects something seems at least > weird... Let's make measurements and optimize for 64-bit but let's double > check on 32-bit too. I didn't use any tool to check that, but adjust it by hand. It shouldn't break anything, but I'll test it again before committing. On Fri, Aug 30, 2013 at 8:34 AM, Stephen M. Webb wrote: > Other than the test case (breaks on mingw32 target) this is looking very > sweet. Could you please show some details? -- Tim Shen matchers.patch Description: Binary data
Re: [Patch] Rewrite regex matchers
On Fri, Aug 30, 2013 at 10:04 PM, Tim Shen wrote: > I didn't use any tool to check that, but adjust it by hand. It > shouldn't break anything, but I'll test it again before committing. Tested under -m32, -m64 and debug mode and committed. -- Tim Shen
[Patch] Rewrite _StateSeq in regex
This patch fix the optional operator '?' and interval repeating like "a{3,5}" through rewrite _StateSeq. After this patch, the assertion supporting shall be the last significant feature to be implemented. It's booted and tested under x86_64. It'll be fully tested before committing. Thanks! -- Tim Shen stateseq.patch Description: Binary data
Re: [Patch] Rewrite _StateSeq in regex
According to this email(http://gcc.gnu.org/ml/libstdc++/2013-09/msg5.html), I add a testcase. -- Tim Shen stateseq.patch Description: Binary data
Re: [Patch] Rewrite _StateSeq in regex
On Thu, Sep 5, 2013 at 10:07 AM, Paolo Carlini wrote: > On 09/04/2013 05:37 AM, Tim Shen wrote: > Thanks. Apparently there aren't further comments, thus please install it. Committed. Complete unmatched parentheses in last commit :) -- Tim Shen
Re: [Patch] Support assertions and greedy/ungreedy matching in regex
On Thu, Sep 12, 2013 at 11:37 AM, Paolo Carlini wrote: > .. please fix the overlong lines, I spotted quite a few. Quite a few? I see only one line by casting: egrep '^\+.{80}' a.patch Even it's not 81. That'll be fixed before committing. Seriously, I think we need an AST-based formatter? Clang seems to write a good one. -- Tim Shen
Re: [Patch] Support assertions and greedy/ungreedy matching in regex
On Thu, Sep 12, 2013 at 9:57 AM, Paolo Carlini wrote: > Great. A quick-quick comment: if these are the last two features, why we > can't un-xfail the testcase which we added latety? Also, a grep revealed a > couple more xfails. Can you clarify? I say `feature` when I think that, what these xfails reveal are too small to be features, say, regex_search/regex_match flags. Now turns out "feature" is not a good word for them. I do mean all C++ library independent part, or pure regex function part, are done. Next days I will add flags implementation. > Also, much more generally, I would be curious about the remaining work: I > think essentially it boils down to the vagaries for the other non-default > regex dialects? Is it a lot of work? This patch(http://gcc.gnu.org/ml/libstdc++/2013-08/msg00142.html) already supports all dialects(ECMAScript, basic, extended, grep, egrep, awk) specified by standard. Most of the differences between them are eliminated by _Scanner, aka tokenizer, so it's actually not a lot of work to even add one more syntax. But again, I think more testcases are needed, especially from those experienced regex users. I'm actually not a big fan of regex(but of NFA ;), or can I borrow some boost/libc++ testcases without making any licence trouble? Thanks! -- Tim Shen
Re: [Patch] Implement regex_match and regex_search
On Tue, Sep 17, 2013 at 4:35 AM, Paolo Carlini wrote: > Great. Tim, please complete the testing on -m32 etc, if everything goes > well, just wait a day or so and commit. Tested under -m32, -m64 and check-debug, and committed, with changing the date of the new file to today. Thank you! -- Tim Shen
[Patch] match_results::format and regex_replace
This patch complete the last two parts of the whole regex module, but two problems left: 1) regex_traits<>::transform_primary [28.7.7]. I don't know how to implement it correctly. Can anyone give some advice? 2) Digraph support. Is it need to be done, since the standard doesn't specify it? Tested under -m64. I'll do a full test before committing. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] match_results::format and regex_replace
On Sun, Sep 22, 2013 at 4:20 PM, Paolo Carlini wrote: > If testing goes well patch is Ok to commit. Tested under -m32 and -m64 and committed :) I'll learn how locale in glibc works. Thank you all! -- Tim Shen
[Patch] Regex Fixes on constants and no-backref in extended
Just follow the standard :) Booted and tested under -m32, -m64. Thank you! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Regex Fixes on constants and no-backref in extended
On Thu, Sep 26, 2013 at 5:37 PM, Jonathan Wakely wrote: > The ChangeLog entry says "stanard", with that fixed this is OK to > commit, thanks! Committed. -- Tim Shen
[Patch] Let ordinary escaping in POSIX regex be valid
POSIX ERE says that escaping an ordinary char, say R"\n" is not permitted, because 'n' is not a special char. However, they also say that : "Implementations are permitted to extend the language to allow these. Conforming applications cannot use such constructs." So let's support it not to make users surprised. Booted and tested under -m32 and -m64 Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Let ordinary escaping in POSIX regex be valid
On Fri, Sep 27, 2013 at 9:37 AM, Jonathan Wakely wrote: > Ah I see. I definitely agree it's good to accept that instead of > being unnecessarily strict, but other people will want the option of > strict conformance, so I think we can please everyone with something > like: > > else > { > #ifdef __STRICT_ANSI__ > __throw_regex_error(regex_constants::error_escape); > #else >_M_token = _S_token_ord_char; >_M_value.assign(1, __c); > #endif > } Sorry for late >.< Do I need to bootstrap again? -- Tim Shen a.patch Description: Binary data
Re: [Patch] Let ordinary escaping in POSIX regex be valid
On Fri, Sep 27, 2013 at 4:30 PM, Paolo Carlini wrote: > Nah, only double check that the testcase you are un-xfail-ing uses > -std=gnu++11, otherwise will not pass ;) Committed :) Thanks! -- Tim Shen
[Patch] Fix interval quantifier in lookahead subexpr in regex
Forget to let _M_clone and _M_eliminate_dummy follow _S_opcode_subexpr_lookahead, which also have _M_alt. Is it OK not to bootstrap it ? Tested under -m64 and -m32. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Fix interval quantifier in lookahead subexpr in regex
On Tue, Oct 1, 2013 at 1:32 AM, Paolo Carlini wrote: > Ok, thanks. Committed. Thanks! -- Tim Shen
Re: Announcing ?
On Mon, Sep 30, 2013 at 2:25 PM, Paolo Carlini wrote: > 3- Tweak the implementation status page > libstdc++/doc/xml/manual/status_cxx2011.xml (possibly clarify bits still > missing too) Here is the patch for this, along with some interface fixes. Thank you! -- Tim Shen a.patch Description: Binary data
Re: Announcing ?
On Tue, Oct 1, 2013 at 6:08 PM, Paolo Carlini wrote: > What do you mean by "not well defined"? Non confoming? Why? I don't know if the word 'define' is appropriate, maybe I should just use 'implement'. It's indeed the problem that needs extra support from , described by [28.7.7]. More detailed, http://www.open-std.org/Jtc1/sc22/wg21/docs/papers/2003/n1429.htm offered what the function means. -- Tim Shen
Re: Announcing ?
On Tue, Oct 1, 2013 at 6:41 PM, Paolo Carlini wrote: > Ok, thus just say that transform_primary is unimplented. Please also add a > comment inline in the code explaining the issue, but don't word it in term of > , because is just what it is per C++11, and doesn't directly > provide what we need in the public interface (if I understand the issue): > just say that implementing the function appears to require libc support + a > link to the message you sent a couple of weeks ago to the libstdc++-v3 > mailing list. > > Ok with the above changes. Committed :) Thanks! -- Tim Shen a.patch Description: Binary data
Re: Announcing ?
On Tue, Oct 1, 2013 at 7:49 PM, Paolo Carlini wrote: > Thanks. But then the function actually *is* implemented, saying 'not > implemented' is misleading. Please change the xml to something like 'isn't > correctly implemented', or more informally, 'needs work', up to you, but > please adjust it. Committed. I learn that an implementation could be incorrect ;) -- Tim Shen a.patch Description: Binary data
[wwwdocs] Mention libstdc++-v3 in 4.9 changes.html
Hi, libstdc++-v3 is ready for releasing. Is it Ok to apply? By the way, do we need a News entry for this improvement? Thanks! Index: htdocs/gcc-4.9/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v retrieving revision 1.27 diff -r1.27 changes.html 136a137,140 > href="http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2011";> >Improved experimental support for the new ISO C++ standard, C++11, >including support for <regex>. > -- Tim Shen
[Patch] Fix incorrect behavior of "[[=a=]]" in regex
_BracketMatcher<>::_M_add_equivalence_class is misimplemented so I try `git blame regex_compiler.h`...that's me! Booted and tested under -m32, -m64. Thanks ;) -- Tim Shen a.patch Description: Binary data
Re: [wwwdocs] Mention libstdc++-v3 in 4.9 changes.html
I feel little bit uncomfortable with "new ISO C++ standard, C++11", since C++14 is already there, so I removed it. Please check the words, since English is not my first language >.< Thanks! Index: htdocs/index.html === RCS file: /cvs/gcc/wwwdocs/htdocs/index.html,v retrieving revision 1.891 diff -r1.891 index.html 55a56,59 > C++11 <regex> support > [2013-10-02] > Regular expression support in libstdc++-v3 is > now available. > Index: htdocs/gcc-4.9/changes.html === RCS file: /cvs/gcc/wwwdocs/htdocs/gcc-4.9/changes.html,v retrieving revision 1.27 diff -r1.27 changes.html 136a137,139 > href="http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2011";> >Improved support for C++11, including support for <regex>. > -- Tim Shen
Re: [Patch] Fix incorrect behavior of "[[=a=]]" in regex
Committed. Thanks! On Wed, Oct 2, 2013 at 11:10 AM, Jonathan Wakely wrote: > On 2 October 2013 15:26, Tim Shen wrote: >> _BracketMatcher<>::_M_add_equivalence_class is misimplemented so I try >> `git blame regex_compiler.h`...that's me! >> >> Booted and tested under -m32, -m64. > > This is OK to commit, thanks -- Tim Shen
Re: [wwwdocs] Mention libstdc++-v3 in 4.9 changes.html
Committed. Thanks! On Wed, Oct 2, 2013 at 11:12 AM, Jonathan Wakely wrote: > On 2 October 2013 15:52, Tim Shen wrote: >> I feel little bit uncomfortable with "new ISO C++ standard, C++11", >> since C++14 is already there, so I removed it. > > Good idea. > >> Please check the words, since English is not my first language >.< > > The english is fine, please wait a few hours in case anyone else has > comments, but if you don't hear anything else you can go ahead and > commit it. -- Tim Shen
[Patch] Fix wrong backup variable initialization in regex
Tested under -m32, -m64, however didn't do a bootstrap. Is that OK? Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Fix wrong backup variable initialization in regex
On Thu, Oct 3, 2013 at 4:39 PM, Paolo Carlini wrote: > Seems trivial enough, Ok. Committed. > PS: I suppose isn't easy to prepare a testcase which would exercise those > lines? Telling the truth, this bug *is* revealed by rerunning testcases with the non-default DFS matcher. Since the other matcher(BFS) is the default, it can be fully tested every time we run `make check`; this time I force(by editting code) the dispatcher to use DFS, and find this mistake. To test both of them, rewriting every testcase seems not a good idea. So I edit dispatching code by hand and test the non-defuault matcher every 2-3 commits. Do we have a better solution? Thanks! -- Tim Shen
Re: [Patch] Fix wrong backup variable initialization in regex
On Thu, Oct 3, 2013 at 5:40 PM, Paolo Carlini wrote: > Before figuring out a fully general solution, I would anyway add a testcase > which manually switches the executor and tests for the problem. More elegant > would be a testcase which due to its nature automatically leads to an > executor switch. Not sure if it's possible in this specific case. I may find a possible workaround: when set _GLIBCXX_DEBUG, run both matchers and check if they agree. Then `make check-debug` will test them all. I really want a macro _GLIBCXX_TESTSUITE which is set in `make check`. > But in principle yes, since we deliver two, and the users have a way to choose > which one they want (X) Users can "choose" the matcher only by adding/removing brs to/from the regex input string. We can easily convert every no-br regex to a br one, by adding an empty capture and make a br to it(which won't change the regex's meaning) to "choose" the non-default matcher. There's no explicit switch exported to users, because it's designed to be completely transparent to users. > Well, unless we want to keep secret - that is, undocumented - the switching > facility, > but I'm thinking that we should not. Yes I think we should keep secret, because the standard doesn't specify it. They only way to publish the switch to user is making a library extension(is that true?), but there's no obvious benefit to do that(is that true? I shall be humble). -- Tim Shen
[Patch] Fix COMPILE error in regex and remove default parameter in function definition
Stupid errors hidden in some large commit. I don't think they will break boostrap? Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Internal functions for testsuite in regex
On Sat, Oct 5, 2013 at 7:45 AM, Paolo Carlini wrote: > Patch looks great to me. If you don't get more comments over the next day or > so, please go ahead. Committed :) -- Tim Shen
Re: [Patch] Fix COMPILE error in regex and remove default parameter in function definition
On Sun, Oct 6, 2013 at 5:38 AM, Paolo Carlini wrote: > Ok, thanks. Committed :) -- Tim Shen
[Patch] Optimize _BFSExecutor in regex
Here's a simple piece of code https://gist.github.com/innocentim/6849759 the reveals _BFSExecutor's inefficiency. Some optimizations are here to reduce the unecessary time complexity from O(n^2) to O(n). I'll do a bootstrap and full testing before committing. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Optimize _BFSExecutor in regex
Sorry I forgot to reply the mailing list. Here's the discussion: --- Tim On Oct 7, 2013 7:12 AM, "Jonathan Wakely" wrote: > Does _TodoList really need to derive from std::vector<_StateIdT> or > could it just have a second member? I think there's no benefit to > using inheritance here. I use private inheritance here to convey "it's just a wrapper class of vector". I don't see great difference between it and using a second member? > _TodoList::__exists should be named _M_exists to follow the library > coding standards. Sorry for that >.< > Shouldn't _TodoList::_M_empty() also set __exists[n] = false for all n? We can add an assertion here, but _M_empty shouldn't set anything. This function has no side effect. > I think it's guaranteed that __exists[__u] is always in range, right? Of course. Otherwise vector's range checker will help, right? ------- Jonathan On 7 October 2013 14:31, Tim Shen wrote: > On Oct 7, 2013 7:12 AM, "Jonathan Wakely" wrote: >> Does _TodoList really need to derive from std::vector<_StateIdT> or >> could it just have a second member? I think there's no benefit to >> using inheritance here. > > I use private inheritance here to convey "it's just a wrapper class of > vector". I don't see great difference between it and using a second member? Aggregation is simpler, If I see subtyping I wonder if some users of the type need to rely on conversion to the base class, i.e. whether it is important that _TodoList IS-A vector, but as it's private and has no friends, that can't be true. In general, prefer composition to inheritance: http://www.artima.com/cppsource/codestandards3.html >> Shouldn't _TodoList::_M_empty() also set __exists[n] = false for all n? > > We can add an assertion here, but _M_empty shouldn't set anything. This > function has no side effect. Sorry, I meant _M_clear() not _M_empty(). Currently you loop over the vector popping one element at a time until it's empty, why not just use _BaseT::clear()? Does anything even use that member? If it is needed, why is it correct to not reset the flags in __exists to false? >> I think it's guaranteed that __exists[__u] is always in range, right? > > Of course. Otherwise vector's range checker will help, right? Only in Debug Mode, there is no range checking in operator[] in the normal case. But as long as the vector has the right size (from _M_nfa.size()) and can only be accessed with valid indices it's fine. --- Tim On Mon, Oct 7, 2013 at 10:56 AM, Jonathan Wakely wrote: > Aggregation is simpler, OK I'll change it to use a private member, since indeed I don't need to use protected part of vector. > Currently you loop over the vector popping one element at a time until > it's empty, why not just use _BaseT::clear()? Does anything even use > that member? If it is needed, why is it correct to not reset the > flags in __exists to false? I use a loop to clear rather than a _BaseT::clear() then reset the __exists(_M_exists), simply because it's too expensive(O(__exists.size())). With my implementation, I can make it O(this->size()), which is usually far smaller than size of __exists. A better way to do that is to traverse *this and only reset all trues, then _BaseT::clear(). Ah we can remove this member function, because it's used only in early versions of this patch. --- Jonathan On 7 October 2013 16:34, Tim Shen wrote: > On Mon, Oct 7, 2013 at 10:56 AM, Jonathan Wakely > wrote: >> Aggregation is simpler, > > OK I'll change it to use a private member, since indeed I don't need > to use protected part of vector. > >> Currently you loop over the vector popping one element at a time until >> it's empty, why not just use _BaseT::clear()? Does anything even use >> that member? If it is needed, why is it correct to not reset the >> flags in __exists to false? > > I use a loop to clear rather than a _BaseT::clear() then reset the > __exists(_M_exists), Where do you reset __exists? That was my point, I don't see that being reset, but it should have been. > simply because it's too > expensive(O(__exists.size())). With my implementation, I can make it > O(this->size()), which is usually far smaller than size of __exists. A > better way to do that is to traverse *this and only reset all trues, > then _BaseT::clear(). Traversing a vector is not
Re: [Patch] Optimize _BFSExecutor in regex
On Mon, Oct 7, 2013 at 12:22 PM, Jonathan Wakely wrote: > because that turns into the equivalent of a std::memset() on integers. Here I catch your idea. But think about the following example: _M_exists.size() == 1000, but only 3 of the elements are true. Now what I intend to do is assigning these 3 elements to false, rather than reseting the whole vector. Is this called "pay for what you get"? The ideal solution could be: void _M_clear() { if (__exists.size() < _BaseT::size() * CPU_WORD_LENGTH * some_factor) clear_one_by_one(); else reset_them_all(); } -- Tim Shen
Re: [Patch] Optimize _BFSExecutor in regex
On Mon, Oct 7, 2013 at 7:12 AM, Jonathan Wakely wrote: > On 6 October 2013 23:43, Tim Shen wrote: >> Here's a simple piece of code >> https://gist.github.com/innocentim/6849759 the reveals _BFSExecutor's >> inefficiency. Some optimizations are here to reduce the unecessary >> time complexity from O(n^2) to O(n). >> >> I'll do a bootstrap and full testing before committing. > > In the ChangeLog, please fix the spelling of "unnecessary". > > The constructor of _TodoList could be "explicit" > > Does _TodoList really need to derive from std::vector<_StateIdT> or > could it just have a second member? I think there's no benefit to > using inheritance here. > > _TodoList::__exists should be named _M_exists to follow the library > coding standards. > > Shouldn't _TodoList::_M_empty() also set __exists[n] = false for all n? > > I think it's guaranteed that __exists[__u] is always in range, right? Anyway, here's the patch. I intended to write something like a Quine[1] as the testcase but gave up ;) Thank you! [1] http://en.wikipedia.org/wiki/Quine_(computing) -- Tim Shen a.patch Description: Binary data
Re: [Patch] Optimize _BFSExecutor in regex
On Mon, Oct 7, 2013 at 1:28 PM, Jonathan Wakely wrote: > std::memset() on about 125 bytes is not a big deal, I doubt it's > significantly more expensive than the calculations to find the right > bits and do the necessary masking for three elements. > std::vector is a strange beast, remember. OK so I may change to use something else(bitset?) next time I encounter this. Is copying 125 bytes more expensive comparing to 3 assignment and substrction? Because it *is* cheap to find all elements that's true, they are just in _M_base(in the latest patch). So: while (!this->empty()) this->pop(); ...will be executed only 3 times. I think it's cheaper than copying a long-enough and sparse-enough byte chunk? Or at least we should theck the density(_M_base.size() / _M_exists.size()) of _M_exists ? -- Tim Shen
Re: [Patch] Optimize _BFSExecutor in regex
On Mon, Oct 7, 2013 at 1:40 PM, Tim Shen wrote: > while (!this->empty()) > this->pop(); Sorry it's this->_M_empty() and this->_M_pop(); -- Tim Shen
Re: [Patch] Optimize _BFSExecutor in regex
On Mon, Oct 7, 2013 at 1:28 PM, Jonathan Wakely wrote: > std::memset() on about 125 bytes is not a big deal, I doubt it's > significantly more expensive than the calculations to find the right > bits and do the necessary masking for three elements. > std::vector is a strange beast, remember. Probably there could be bitwise operation? I've test the 3 trues out of 1000 elements here : https://gist.github.com/innocentim/6874889 In my machine, choosing "v.assign(v.size(), false);" cost 1.768s and choosing the other cost 1.088s. When we push 7 elements, they cost almost same time. When push more than that, v.assign() shall win. I do this because I need to use _M_clear() in this patch :) Anyway let's use v.assign(). Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Optimize _BFSExecutor in regex
On Mon, Oct 7, 2013 at 7:11 PM, Jonathan Wakely wrote: > The version in your gist (which is *not* what your first patch did!) > is faster for any size of _M_exists and any ratio of _M_states.size() > / _M_exists.size(): Sorry for my original patch, I made a mistake! It's: while (!_BaseT::empty()) _BaseT::pop_back(); but I truely meant: while (!this->_M_empty()) this->_M_pop(); To make all trues reset. In my machine the while loop is slower than assign() when the _M_states.size() / _M_exists.size() is more than even 7/1000. so let's keep the assign(), because the ratio in regex could probably be more than 25%. -- Tim Shen a.patch Description: Binary data
Re: [Patch] Optimize _BFSExecutor in regex
On Mon, Oct 7, 2013 at 8:43 PM, Jonathan Wakely wrote: > OK, this latest patch looks good so please go ahead and commit it - thanks! Committed :) -- Tim Shen
[Patch] Fix PR58737
This memory leak is because forgetting virtual destructor of the base class _Executor. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Fix PR58737
On Tue, Oct 15, 2013 at 11:01 AM, Jonathan Wakely wrote: > Great, if it passes the testsuite please commit it. It surely passed -m32 and -m64, and committed :) -- Tim Shen
[Patch] Fix undefined behaviors in regex
It's detected by `clang++ -g -std=c++11 -fsanitize=undefined` but not detected by g++ with flags `g++ -g -std=c++11 -fsanitize=undefined -lpthread`. Am I on the right way? -m32 and -m64 tested :) Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Fix undefined behaviors in regex
On Wed, Oct 16, 2013 at 6:13 PM, Paolo Carlini wrote: > Thank you! Committed. Thanks! -- Tim Shen
Re: [Patch] Fix undefined behaviors in regex
On Wed, Oct 16, 2013 at 6:33 PM, Paolo Carlini wrote: > By the way, please feel free to prepare a reduced testcase for the > -fsanitize people (Marek, Jakub?) Here it is. And we should get undefined behaviors before this commit(r203732). -- Tim Shen split.cc Description: Binary data
Re: [Patch] Fix undefined behaviors in regex
On Wed, Oct 16, 2013 at 6:42 PM, Paolo Carlini wrote: > On 10/17/2013 12:39 AM, Tim Shen wrote: >> >> On Wed, Oct 16, 2013 at 6:33 PM, Paolo Carlini >> wrote: >>> >>> By the way, please feel free to prepare a reduced testcase for the >>> -fsanitize people (Marek, Jakub?) >> >> Here it is. >> >> And we should get undefined behaviors before this commit(r203732). > > To be honest, I was thinking something much smaller than the whole > ;) But let's add Marek in CC. int work() { } int main() { int a = work(); return a; } /* This is a smaller case to test the sanitizer. It seems that the undefined sanitizer is not merged? I use `g++ (GCC) 4.9.0 20131003`, is that too old? */ -- Tim Shen
[Patch] Fix warnings in regex
...to prevent potential bugs. By the way, I execute `g++ -E file.cc > prep.cc && sed -i 's|^#.*$||' prep.cc && g++ -Wall prep.cc 2> log` to get all warnings. Do we have a better way? Bootstraps and tested under -m64 and -m32. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Fix undefined behaviors in regex
On Thu, Oct 17, 2013 at 5:03 AM, Paolo Carlini wrote: >> Though, in the above case, the question is why people ignore warnings from >> the compiler and need to have special runtime instrumentation to remind them >> instead. I'm not objecting to that sanitization, only find it weird. That's my fault, I forgot that compiling code included from libstdc++ doesn't generate warnings. Then I try to get all warnings, and indeed get the missing-return-statement one. Thanks! -- Tim Shen
Re: [Patch] Fix warnings in regex
On Thu, Oct 17, 2013 at 3:58 PM, Paolo Carlini wrote: > Please explain in the ChangeLog *which* warnings. Committed with attached patch. Thanks! -- Tim Shen a.patch Description: Binary data
[Patch] Fix wide char locale support(like CJK charset) in regex
The bug is because naively calling `map::count(__c)` where __c could be a wchar_t, and an implicit cast(a truncate?) happened. -m32 and -m64 tested. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Fix wide char locale support(like CJK charset) in regex
On Thu, Oct 17, 2013 at 8:46 PM, Paolo Carlini wrote: >> + const wchar_t * s = L"ä½ å¥½, 世+界"; >> + wregex re(s); >> + VERIFY(regex_match_debug(L"ä½ å¥½, 世世世界", re)); Oops that's only perfect Chinese version of "hello, world" under UTF-8 charset. Now it's hex format. I also make the char handling simpler. -- Tim Shen a.patch Description: Binary data
Re: [Patch] Fix wide char locale support(like CJK charset) in regex
On Fri, Oct 18, 2013 at 5:39 AM, Paolo Carlini wrote: >>Oops that's only perfect Chinese version of "hello, world" under UTF-8 >>charset. Now it's hex format. >> >>I also make the char handling simpler. > > Great, thanks! -m32 and -m64 tested and committed. By the way, UTF-8 *encoding* is a more accurate word, instead of `charset`(which is Unicode). -- Tim Shen
[Patch] Change default executor to DFS in regex
As the comment in this patch said, DFS approach is faster in simple regex, but exponentially slower in complex(many quantifier) cases. DFA optimization could be added, but I'm afraid it needs rewriting regex_automaton.*. Tested under -m32 and -m64. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Change default executor to DFS in regex
On Fri, Oct 18, 2013 at 9:13 PM, Tim Shen wrote: > As the comment in this patch said, DFS approach is faster in simple > regex, but exponentially slower in complex(many quantifier) cases. Actually I suggest to use DFS where number of quantifiers < 2, to make this 'optimization' more conservative. Now `split regex`, say "\s+" can be optimized. -- Tim Shen a.patch Description: Binary data
Re: [Patch] Change default executor to DFS in regex
On Sat, Oct 19, 2013 at 4:03 AM, Paolo Carlini wrote: > About the < 2, in general hardcoding a parameter value in the code isn't a > nice idea. Why don't we take it out to a macro, say > _GLIBCXX_REGEX_NFA_QUANTIFIERS_LIMIT? In stl_deque.h we have something > similar and in the present case it would be even safe from the ABI point of > view, if I'm not mistaken. I here use a const static local variable to hide it from other parts. > Finally, you mentioned the DFA optimization: I would like to see some > details about it. Would it be a big win in all cases? Before we freeze the > next ABI we should keep open all the possibilities... It'll be great help to implement partial DFA compilation for all cases, though DFA cannot handle back-references neither. I need more time to investigate. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Change default executor to DFS in regex
On Sat, Oct 19, 2013 at 3:12 PM, Paolo Carlini wrote: > Yes, but giving it a name doesn't buy us much wrt the issue I pointed out. > For comparison, in similar cases, the compiler driver has --params which the > user can fine tune on the command line. The best approximation we have got > in the library - for the time being at least, in principle the driver could > also forward parameters to the library - is a macro, which is defined and > documented in a comment a few lines earlier and then is possibly used to > initialize a const static. Again, see the stl_deque.h example. I can't > imagine any other simple (and conforming! eg, no additional template parms) > solution. I see. Here's the macro version. Thanks! -- Tim Shen a.patch Description: Binary data
Re: [Patch] Change default executor to DFS in regex
On Sun, Oct 20, 2013 at 4:42 AM, Paolo Carlini wrote: > Ok with those changes. Committed. Thanks! -- Tim Shen a.patch Description: Binary data
[Patch] Regex comments
Add comments for last regex commit. Thanks! -- Regards, Tim Shen commit b4aa16a08992866ca6fd60f40e422f551d0d6b2a Author: tim Date: Sun Oct 27 15:50:44 2013 -0400 2013- diff --git a/libstdc++-v3/include/bits/regex_executor.tcc b/libstdc++-v3/include/bits/regex_executor.tcc index d3b9a04..2f677de 100644 --- a/libstdc++-v3/include/bits/regex_executor.tcc +++ b/libstdc++-v3/include/bits/regex_executor.tcc @@ -53,6 +53,52 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return false; } + // This function serves in different modes, DFS mode or BFS mode, indicated + // by template variable __dfs_mode. See _M_main for details. + // + // + // + // DFS mode: + // + // It applys a Depth-First-Search (aka backtracking) on given NFA and input + // string. + // At the very beginning the executor stands in the start state, then it try + // every possible state transition in current state recursively. Some state + // transitions consume input string, say, a single-char-matcher or a + // back-reference matcher; some don't, like assertion or other anchor nodes. + // When the input is exhausted and/or the current state is an accepting state, + // the whole executor returns true. + // + // TODO: This approach is exponentially slow for certain input. + // Try to compile the NFA to a DFA. + // + // Time complexity: o(match_length), O(2^(_M_nfa.size())) + // Space complexity: \theta(match_results.size() + match_length) + // + // + // + // BFS mode: + // + // Russ Cox's article (http://swtch.com/~rsc/regexp/regexp1.html) + // explained this algorithm clearly. + // + // It first computes epsilon clousure for every state that's still matching, + // using the same DFS algorithm, but doesn't reenter states (set true in + // _M_visited), nor follow _S_opcode_match. + // + // Then apply DFS to every _S_opcode_match one by one (in _M_match_queue). + // + // The order of which states needs to be recursively applied DFS matters, + // depend on which greedy mode we use. See _S_opcode_alternative branch in + // _M_dfs. + // + // It significantly reduces potential duplicate states, so have a better + // upper bound; but it deserves more overhead. + // + // Time complexity: o(match_length * match_results.size()) + // O(match_length * _M_nfa.size() * match_results.size()) + // Space complexity: o(_M_nfa.size() + match_results.size()) + // O(_M_nfa.size() * match_results.size()) template template @@ -68,18 +114,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } else { - // Like the DFS approach, it try every possible state transition; - // Unlike DFS, it uses a queue instead of a stack to store matching - // states. It's a BFS approach. - // - // Russ Cox's article(http://swtch.com/~rsc/regexp/regexp1.html) - // explained this algorithm clearly. - // - // Time complexity: o(match_length * match_results.size()) - // O(match_length * _M_nfa.size() - //* match_results.size()) - // Space complexity: o(_M_nfa.size() + match_results.size()) - // O(_M_nfa.size() * match_results.size()) _M_match_queue->push(make_pair(_M_start_state, _M_results)); bool __ret = false; while (1) @@ -132,20 +166,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return false; } - // A _DFSExecutor perform a DFS on given NFA and input string. At the very - // beginning the executor stands in the start state, then it try every - // possible state transition in current state recursively. Some state - // transitions consume input string, say, a single-char-matcher or a - // back-reference matcher; some not, like assertion or other anchor nodes. - // When the input is exhausted and the current state is an accepting state, - // the whole executor return true. - // - // TODO: This approach is exponentially slow for certain input. - // Try to compile the NFA to a DFA. - // - // Time complexity: o(match_length), O(2^(_M_nfa.size())) - // Space complexity: \theta(match_results.size() + match_length) - // template template @@ -164,25 +184,39 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION { case _S_opcode_alternative: // Greedy or not, this is a question ;) + // + // _M_alt branch is "match once more", while _M_next is "get me out + // of this quantifier". + + // Greedy. if (!__state._M_neg) { + // Once more. _M_dfs<__match_mode>(__state._M_alt); + // If it's DFS executor and already accepted, we're done. if (!__dfs_mode || !_M_has_sol) _M_dfs<__match_mode>(__state._M_next); } - else + else // Ungreedy mode { if (__dfs_mode) { + // vice-versa. _M_dfs<__match_mode>(_
Re: [Patch] Regex comments
On Sun, Oct 27, 2013 at 4:20 PM, Paolo Carlini wrote: > Should we move this comment too before the 'case', and aligned with it? Actually this comment is for the if statement below, not the whole case branch. I've made this clearer by moving "Every change..." part to the switch statement, where it should stay. -- Regards, Tim Shen commit 1f13761ba34b13a52b4f41d2ebdeba1114205bac Author: tim Date: Sun Oct 27 15:50:44 2013 -0400 2013-10-27 Tim Shen * regex_executor.tcc: Add comments. diff --git a/libstdc++-v3/include/bits/regex_executor.tcc b/libstdc++-v3/include/bits/regex_executor.tcc index d3b9a04..c4ce362 100644 --- a/libstdc++-v3/include/bits/regex_executor.tcc +++ b/libstdc++-v3/include/bits/regex_executor.tcc @@ -53,6 +53,52 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return false; } + // This function serves in different modes, DFS mode or BFS mode, indicated + // by template variable __dfs_mode. See _M_main for details. + // + // + // + // DFS mode: + // + // It applys a Depth-First-Search (aka backtracking) on given NFA and input + // string. + // At the very beginning the executor stands in the start state, then it try + // every possible state transition in current state recursively. Some state + // transitions consume input string, say, a single-char-matcher or a + // back-reference matcher; some don't, like assertion or other anchor nodes. + // When the input is exhausted and/or the current state is an accepting state, + // the whole executor returns true. + // + // TODO: This approach is exponentially slow for certain input. + // Try to compile the NFA to a DFA. + // + // Time complexity: o(match_length), O(2^(_M_nfa.size())) + // Space complexity: \theta(match_results.size() + match_length) + // + // + // + // BFS mode: + // + // Russ Cox's article (http://swtch.com/~rsc/regexp/regexp1.html) + // explained this algorithm clearly. + // + // It first computes epsilon clousure for every state that's still matching, + // using the same DFS algorithm, but doesn't reenter states (set true in + // _M_visited), nor follow _S_opcode_match. + // + // Then apply DFS to every _S_opcode_match one by one (in _M_match_queue). + // + // The order of which states needs to be recursively applied DFS matters, + // depend on which greedy mode we use. See _S_opcode_alternative branch in + // _M_dfs. + // + // It significantly reduces potential duplicate states, so have a better + // upper bound; but it deserves more overhead. + // + // Time complexity: o(match_length * match_results.size()) + // O(match_length * _M_nfa.size() * match_results.size()) + // Space complexity: o(_M_nfa.size() + match_results.size()) + // O(_M_nfa.size() * match_results.size()) template template @@ -68,18 +114,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } else { - // Like the DFS approach, it try every possible state transition; - // Unlike DFS, it uses a queue instead of a stack to store matching - // states. It's a BFS approach. - // - // Russ Cox's article(http://swtch.com/~rsc/regexp/regexp1.html) - // explained this algorithm clearly. - // - // Time complexity: o(match_length * match_results.size()) - // O(match_length * _M_nfa.size() - //* match_results.size()) - // Space complexity: o(_M_nfa.size() + match_results.size()) - // O(_M_nfa.size() * match_results.size()) _M_match_queue->push(make_pair(_M_start_state, _M_results)); bool __ret = false; while (1) @@ -132,20 +166,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return false; } - // A _DFSExecutor perform a DFS on given NFA and input string. At the very - // beginning the executor stands in the start state, then it try every - // possible state transition in current state recursively. Some state - // transitions consume input string, say, a single-char-matcher or a - // back-reference matcher; some not, like assertion or other anchor nodes. - // When the input is exhausted and the current state is an accepting state, - // the whole executor return true. - // - // TODO: This approach is exponentially slow for certain input. - // Try to compile the NFA to a DFA. - // - // Time complexity: o(match_length), O(2^(_M_nfa.size())) - // Space complexity: \theta(match_results.size() + match_length) - // template template @@ -160,29 +180,45 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } const auto& __state = _M_nfa[__i]; + // Every change on _M_cur_results and _M_current will be rolled back after + // finishing the recursion step. switch (__state._M_opcode) { case _S_opcode_alternative: // Greedy or not, this is a question ;
Re: [Patch] Regex comments
Thank you for figuring out so many syntax errors, I'll be careful next time. On Sun, Oct 27, 2013 at 5:38 PM, Jonathan Wakely wrote: > + // The order of which states needs to be recursively applied DFS matters, > + // depend on which greedy mode we use. > > I don't understand this sentence at all, sorry. Can you explain it in > other terms, and I'll try to suggest better phrasing? +// _M_alt branch is "match once more", while _M_next is "get me out +// of this quantifier". Executing _M_next first or _M_alt first don't +// mean the same thing, and we need to choose the correct order under +// given greedy mode. -- Regards, Tim Shen commit b195603ee8d79d2afd5303bcae363c47e4c89fab Author: tim Date: Sun Oct 27 15:50:44 2013 -0400 2013-10-28 Tim Shen * regex_executor.tcc: Add comments. diff --git a/libstdc++-v3/include/bits/regex_executor.tcc b/libstdc++-v3/include/bits/regex_executor.tcc index d3b9a04..0c42189 100644 --- a/libstdc++-v3/include/bits/regex_executor.tcc +++ b/libstdc++-v3/include/bits/regex_executor.tcc @@ -53,6 +53,49 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return false; } + // This function operates in different modes, DFS mode or BFS mode, indicated + // by template parameter __dfs_mode. See _M_main for details. + // + // + // + // DFS mode: + // + // It applies a Depth-First-Search (aka backtracking) on given NFA and input + // string. + // At the very beginning the executor stands in the start state, then it tries + // every possible state transition in current state recursively. Some state + // transitions consume input string, say, a single-char-matcher or a + // back-reference matcher; some don't, like assertion or other anchor nodes. + // When the input is exhausted and/or the current state is an accepting state, + // the whole executor returns true. + // + // TODO: This approach is exponentially slow for certain input. + // Try to compile the NFA to a DFA. + // + // Time complexity: o(match_length), O(2^(_M_nfa.size())) + // Space complexity: \theta(match_results.size() + match_length) + // + // + // + // BFS mode: + // + // Russ Cox's article (http://swtch.com/~rsc/regexp/regexp1.html) + // explained this algorithm clearly. + // + // It first computes epsilon closure for every state that's still matching, + // using the same DFS algorithm, but doesn't reenter states (set true in + // _M_visited), nor follows _S_opcode_match. + // + // Then apply DFS using every _S_opcode_match (in _M_match_queue) as the start + // state. + // + // It significantly reduces potential duplicate states, so has a better + // upper bound; but it requires more overhead. + // + // Time complexity: o(match_length * match_results.size()) + // O(match_length * _M_nfa.size() * match_results.size()) + // Space complexity: o(_M_nfa.size() + match_results.size()) + // O(_M_nfa.size() * match_results.size()) template template @@ -68,18 +111,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } else { - // Like the DFS approach, it try every possible state transition; - // Unlike DFS, it uses a queue instead of a stack to store matching - // states. It's a BFS approach. - // - // Russ Cox's article(http://swtch.com/~rsc/regexp/regexp1.html) - // explained this algorithm clearly. - // - // Time complexity: o(match_length * match_results.size()) - // O(match_length * _M_nfa.size() - //* match_results.size()) - // Space complexity: o(_M_nfa.size() + match_results.size()) - // O(_M_nfa.size() * match_results.size()) _M_match_queue->push(make_pair(_M_start_state, _M_results)); bool __ret = false; while (1) @@ -132,20 +163,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return false; } - // A _DFSExecutor perform a DFS on given NFA and input string. At the very - // beginning the executor stands in the start state, then it try every - // possible state transition in current state recursively. Some state - // transitions consume input string, say, a single-char-matcher or a - // back-reference matcher; some not, like assertion or other anchor nodes. - // When the input is exhausted and the current state is an accepting state, - // the whole executor return true. - // - // TODO: This approach is exponentially slow for certain input. - // Try to compile the NFA to a DFA. - // - // Time complexity: o(match_length), O(2^(_M_nfa.size())) - // Space complexity: \theta(match_results.size() + match_length) - // template template @@ -160,29 +177,44 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } const auto& __state = _M_nfa[__i]; + // Every change on _M
Re: [Patch] Regex comments
On Sun, Oct 27, 2013 at 9:26 PM, Jonathan Wakely wrote: > I'm happy for that latest patch to be committed, thanks for taking the > time to improve the comments. Committed. Thanks! -- Regards, Tim Shen
[Patch, regex, libstdc++/69794] Unify special character parsing
I did it wrong in r227289 - I ignored the "\n" special case in grep. Turns out using code to handle special cases is error prone, so I turned to use data (_M_grep_spec_char and _M_egrep_spec_char). Bootstrapped and tested on x86_64-pc-linux-gnu. Thanks! -- Regards, Tim Shen commit 03e651ef56e516f1bc7b0d041d93ef657af54791 Author: Tim Shen Date: Sat Feb 13 10:55:38 2016 -0800 PR libstdc++/69794 * include/bits/regex_scanner.h: Add different special character sets for grep and egrep regex. * include/bits/regex_scanner.tcc: Use _M_spec_char more unifiedly. * testsuite/28_regex/regression.cc: Add new testcase. diff --git a/libstdc++-v3/include/bits/regex_scanner.h b/libstdc++-v3/include/bits/regex_scanner.h index bff7366..16071da 100644 --- a/libstdc++-v3/include/bits/regex_scanner.h +++ b/libstdc++-v3/include/bits/regex_scanner.h @@ -95,11 +95,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION : _M_awk_escape_tbl), _M_spec_char(_M_is_ecma() ? _M_ecma_spec_char -: _M_is_basic() +: _M_flags & regex_constants::basic ? _M_basic_spec_char -: _M_extended_spec_char), +: _M_flags & regex_constants::extended +? _M_extended_spec_char +: _M_flags & regex_constants::grep +? _M_grep_spec_char +: _M_flags & regex_constants::egrep +? _M_egrep_spec_char +: _M_flags & regex_constants::awk +? _M_extended_spec_char +: nullptr), _M_at_bracket_start(false) -{ } +{ __glibcxx_assert(_M_spec_char); } protected: const char* @@ -177,6 +185,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION const char* _M_ecma_spec_char = "^$\\.*+?()[]{}|"; const char* _M_basic_spec_char = ".[\\*^$"; const char* _M_extended_spec_char = ".[\\()*+?{|^$"; +const char* _M_grep_spec_char = ".[\\*^$\n"; +const char* _M_egrep_spec_char = ".[\\()*+?{|^$\n"; _StateT _M_state; _FlagT_M_flags; diff --git a/libstdc++-v3/include/bits/regex_scanner.tcc b/libstdc++-v3/include/bits/regex_scanner.tcc index 920cb14..fedba09 100644 --- a/libstdc++-v3/include/bits/regex_scanner.tcc +++ b/libstdc++-v3/include/bits/regex_scanner.tcc @@ -97,9 +97,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _M_scan_normal() { auto __c = *_M_current++; - const char* __pos; - if (std::strchr(_M_spec_char, _M_ctype.narrow(__c, '\0')) == nullptr) + if (std::strchr(_M_spec_char, _M_ctype.narrow(__c, ' ')) == nullptr) { _M_token = _S_token_ord_char; _M_value.assign(1, __c); @@ -177,12 +176,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _M_state = _S_state_in_brace; _M_token = _S_token_interval_begin; } - else if (((__pos = std::strchr(_M_spec_char, _M_ctype.narrow(__c, '\0'))) - != nullptr - && *__pos != '\0' - && __c != ']' - && __c != '}') - || (_M_is_grep() && __c == '\n')) + else if (__c != ']' && __c != '}') { auto __it = _M_token_tbl; auto __narrowc = _M_ctype.narrow(__c, '\0'); diff --git a/libstdc++-v3/testsuite/28_regex/regression.cc b/libstdc++-v3/testsuite/28_regex/regression.cc index f95bef9..c9a3402 100644 --- a/libstdc++-v3/testsuite/28_regex/regression.cc +++ b/libstdc++-v3/testsuite/28_regex/regression.cc @@ -33,10 +33,26 @@ test01() regex re("((.)", regex_constants::basic); } +void +test02() +{ + bool test __attribute__((unused)) = true; + + std::string re_str +{ + "/abcd" "\n" + "/aecf" "\n" + "/ghci" +}; + auto rx = std::regex(re_str, std::regex_constants::grep | std::regex_constants::icase); + VERIFY(std::regex_search("/abcd", rx)); +} + int main() { test01(); + test02(); return 0; }
Re: [Patch, regex, libstdc++/69794] Unify special character parsing
On Mon, Feb 15, 2016 at 4:26 AM, Jonathan Wakely wrote: > Those new members change the size of the type, so are an ABI change. > > Couldn't they be static members? Ahh right. Since they are just used for once, use them in the line. -- Regards, Tim Shen commit 4db39da3091a33e4125ffd8b55da37859277d0d2 Author: Tim Shen Date: Sat Feb 13 10:55:38 2016 -0800 PR libstdc++/69794 * include/bits/regex_scanner.h: Add different special character sets for grep and egrep regex. * include/bits/regex_scanner.tcc: Use _M_spec_char more unifiedly. * testsuite/28_regex/regression.cc: Add new testcase. diff --git a/libstdc++-v3/include/bits/regex_scanner.h b/libstdc++-v3/include/bits/regex_scanner.h index bff7366..37dea84 100644 --- a/libstdc++-v3/include/bits/regex_scanner.h +++ b/libstdc++-v3/include/bits/regex_scanner.h @@ -95,11 +95,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION : _M_awk_escape_tbl), _M_spec_char(_M_is_ecma() ? _M_ecma_spec_char -: _M_is_basic() +: _M_flags & regex_constants::basic ? _M_basic_spec_char -: _M_extended_spec_char), +: _M_flags & regex_constants::extended +? _M_extended_spec_char +: _M_flags & regex_constants::grep +? ".[\\*^$\n" +: _M_flags & regex_constants::egrep +? ".[\\()*+?{|^$\n" +: _M_flags & regex_constants::awk +? _M_extended_spec_char +: nullptr), _M_at_bracket_start(false) -{ } +{ __glibcxx_assert(_M_spec_char); } protected: const char* @@ -137,6 +145,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION { return _M_flags & regex_constants::awk; } protected: +// TODO: Make them static in the next abi change. const std::pair _M_token_tbl[9] = { {'^', _S_token_line_begin}, diff --git a/libstdc++-v3/include/bits/regex_scanner.tcc b/libstdc++-v3/include/bits/regex_scanner.tcc index 920cb14..fedba09 100644 --- a/libstdc++-v3/include/bits/regex_scanner.tcc +++ b/libstdc++-v3/include/bits/regex_scanner.tcc @@ -97,9 +97,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _M_scan_normal() { auto __c = *_M_current++; - const char* __pos; - if (std::strchr(_M_spec_char, _M_ctype.narrow(__c, '\0')) == nullptr) + if (std::strchr(_M_spec_char, _M_ctype.narrow(__c, ' ')) == nullptr) { _M_token = _S_token_ord_char; _M_value.assign(1, __c); @@ -177,12 +176,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _M_state = _S_state_in_brace; _M_token = _S_token_interval_begin; } - else if (((__pos = std::strchr(_M_spec_char, _M_ctype.narrow(__c, '\0'))) - != nullptr - && *__pos != '\0' - && __c != ']' - && __c != '}') - || (_M_is_grep() && __c == '\n')) + else if (__c != ']' && __c != '}') { auto __it = _M_token_tbl; auto __narrowc = _M_ctype.narrow(__c, '\0'); diff --git a/libstdc++-v3/testsuite/28_regex/regression.cc b/libstdc++-v3/testsuite/28_regex/regression.cc index f95bef9..c9a3402 100644 --- a/libstdc++-v3/testsuite/28_regex/regression.cc +++ b/libstdc++-v3/testsuite/28_regex/regression.cc @@ -33,10 +33,26 @@ test01() regex re("((.)", regex_constants::basic); } +void +test02() +{ + bool test __attribute__((unused)) = true; + + std::string re_str +{ + "/abcd" "\n" + "/aecf" "\n" + "/ghci" +}; + auto rx = std::regex(re_str, std::regex_constants::grep | std::regex_constants::icase); + VERIFY(std::regex_search("/abcd", rx)); +} + int main() { test01(); + test02(); return 0; }
[Patch] Add comments for future regex work
...for optimization purpose. Should be done in one month. Thanks! -- Regards, Tim Shen commit cc7d58128e68455498d0257c4796cb70a9e24990 Author: tim Date: Mon Dec 2 15:49:15 2013 -0500 2013-12-02 Tim Shen * regex_compiler.h: Add todo comment. * regex_executor.tcc: Likewise. diff --git a/libstdc++-v3/include/bits/regex_compiler.h b/libstdc++-v3/include/bits/regex_compiler.h index b9f8127..5759d48 100644 --- a/libstdc++-v3/include/bits/regex_compiler.h +++ b/libstdc++-v3/include/bits/regex_compiler.h @@ -237,6 +237,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION }; /// Matches a character range (bracket expression) + // TODO: Convert used _M_flags fields to template parameters, including + // collate and icase. Avoid using std::set, could use flat_set + // (sorted vector and binary search) instead; use an fixed sized (256) + // vector for char specialization if necessary. template struct _BracketMatcher { diff --git a/libstdc++-v3/include/bits/regex_executor.tcc b/libstdc++-v3/include/bits/regex_executor.tcc index 22fd67c..150adb4 100644 --- a/libstdc++-v3/include/bits/regex_executor.tcc +++ b/libstdc++-v3/include/bits/regex_executor.tcc @@ -162,6 +162,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return false; } + // TODO: Use a function vector to dispatch, instead of using switch-case. template template
Re: [Patch] Add comments for future regex work
On Tue, Dec 3, 2013 at 4:47 AM, Paolo Carlini wrote: > Great. Consider that we are now in Stage 3, until mid of February. > Considering that is a new feature, I think we have some leeway in > terms of timing, but definitely we can't commit invasive changes not fixing > regressions while in Stage 4. ...committed. -- Regards, Tim Shen
[Patch] Regex bracket matcher cache optimization
The data structure _BracketMatcher (storing regex like [123a-z]) could be quite slow mainly because of regex_traits. So a result of cache for small range of inputs (char) sounds reasonable. It iterates all 256 inputs and calculate them at regex compile time. Booted and tested under -m64 and -m32. Thanks! -- Regards, Tim Shen commit de1e43989be9c9dffebf488e118549cfe2987b9e Author: tim Date: Fri Jan 3 11:43:12 2014 -0500 2014-01-03 Tim Shen * include/bits/regex_compiler.h (_AnyMatcher<>::_AnyMatcher(), _AnyMatcher<>::operator(), _AnyMatcher<>::_M_apply(), _CharMatcher<>::_CharMatcher(), _CharMatcher<>::_M_translate(), _BracketMatcher<>::_BracketMatcher(), _BracketMatcher<>::operator(), _BracketMatcher<>::_M_add_char(), _BracketMatcher<>::_M_add_collating_element(), _BracketMatcher<>::_M_add_equivalence_class(), _BracketMatcher<>::_M_add_character_class(), _BracketMatcher<>::_M_make_range(), _BracketMatcher<>::_M_ready(), _BracketMatcher<>::_M_apply(), _BracketMatcher<>::_M_make_cache()): Fix _AnyMatcher behavior of POSIX style and move _M_flags to template parameter; Add cache for _BracketMatcher. Adjust declarations from here... * include/bits/regex.h (basic_regex<>::imbue()): ...to here. Also, imbuing a regex will trigger a recompilation to rebuild the cache. * include/bits/regex_compiler.tcc (_Compiler<>::_M_atom(), _Compiler<>::_M_bracket_expression()): Adjust matchers' caller for different template bool parameters. * include/bits/regex_executor.h: Remove unnecessary declarations. * include/std/regex: Adjust including orders. * testsuite/28_regex/traits/char/user_defined.cc: New. * testsuite/28_regex/traits/wchar_t/user_defined.cc: New. diff --git a/libstdc++-v3/include/bits/regex.h b/libstdc++-v3/include/bits/regex.h index 4e091e0..ae8e1f5 100644 --- a/libstdc++-v3/include/bits/regex.h +++ b/libstdc++-v3/include/bits/regex.h @@ -30,6 +30,15 @@ namespace std _GLIBCXX_VISIBILITY(default) { +_GLIBCXX_BEGIN_NAMESPACE_VERSION + template +class basic_regex; + + template +class match_results; + +_GLIBCXX_END_NAMESPACE_VERSION + namespace __detail { _GLIBCXX_BEGIN_NAMESPACE_VERSION @@ -48,6 +57,56 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION const basic_regex<_CharT, _TraitsT>& __re, regex_constants::match_flag_type __flags); + template +class _Executor; + + template +struct __has_contiguous_iter : std::false_type { }; + + template +struct __has_contiguous_iter> +: std::true_type // string storage is contiguous +{ }; + + template +struct __has_contiguous_iter> +: std::true_type // vector storage is contiguous +{ }; + + template +struct __has_contiguous_iter> +: std::false_type // vector storage is not contiguous +{ }; + + template +struct __is_contiguous_normal_iter : std::false_type { }; + + template +struct +__is_contiguous_normal_iter<__gnu_cxx::__normal_iterator<_Tp, _Cont>> +: __has_contiguous_iter<_Cont>::type +{ }; + + template +using __enable_if_contiguous_normal_iter + = typename enable_if< __is_contiguous_normal_iter<_Iter>::value, + std::shared_ptr<_NFA<_TraitsT>> >::type; + + template +using __disable_if_contiguous_normal_iter + = typename enable_if< !__is_contiguous_normal_iter<_Iter>::value, + std::shared_ptr<_NFA<_TraitsT>> >::type; + + template +__disable_if_contiguous_normal_iter<_FwdIter, _TraitsT> +__compile_nfa(_FwdIter __first, _FwdIter __last, const _TraitsT& __traits, + regex_constants::syntax_option_type __flags); + + template +__enable_if_contiguous_normal_iter<_Iter, _TraitsT> +__compile_nfa(_Iter __first, _Iter __last, const _TraitsT& __traits, + regex_constants::syntax_option_type __flags); + _GLIBCXX_END_NAMESPACE_VERSION } @@ -501,6 +560,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION basic_regex(_FwdIter __first, _FwdIter __last, flag_type __f = ECMAScript) : _M_flags(__f), + _M_original_str(__first, __last), _M_automaton(__detail::__compile_nfa(__first, __last, _M_traits, _M_flags)) { } @@ -637,6 +697,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION flag_type __flags = ECMAScript) { _M_flags = __flags; + _M_original_str.assign(__s.begin(), __s.end()); _M_automaton = __detail::__compile_nfa(__s.begin(), __s.end(), _M_traits, _M_flags); return *this; @@ -701,7 +762,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION */ locale_type imbue(locale_type __loc) - { return _M_traits.imbue(__loc); } + { + auto __ret = _M_traits.imbue(__loc); + this->assign(_M_original_str, _M_flags); + return __ret; + }
Re: [Patch] Regex bracket matcher cache optimization
On Sat, Jan 4, 2014 at 4:25 AM, Paolo Carlini wrote: > Good. Could we actually measure something in the performance testsuite? Before: split.cc 108r 107u 0s0mem0pf split_bfs.cc 192r 191u 0s0mem0pf After: split.cc14r 14u 0s0mem0pf split_bfs.cc78r 77u 0s0mem0pf The cache obviously works. -- Regards, Tim Shen
Re: [Patch] Regex bracket matcher cache optimization
On Fri, Jan 3, 2014 at 9:35 PM, Tim Shen wrote: > The data structure _BracketMatcher (storing regex like [123a-z]) could > be quite slow mainly because of regex_traits. So a result of cache for > small range of inputs (char) sounds reasonable. It iterates all 256 > inputs and calculate them at regex compile time. In addition, boost failed the test case (testsuite/28_regex/traits/char/user_defined.cc), which can be seen here: https://gist.github.com/innocentim/8257944 -- Regards, Tim Shen
Re: [Patch] Regex bracket matcher cache optimization
I'd prefer to propose another patch that should be commited with this one, which fix bugs (say _M_flags & regex_constants::icase, not "&&"), and do more "matcher optimization". It is now more DRY (_RegexTranslator<>) and efficient, but takes longer to compile, mainly because now we have 4 times more _Compiler<> versions :) Booted and tested with -m32 and -m64 respectively. Thank you! -- Regards, Tim Shen commit 9dcc87dd782c7a09b5e075e746ceb3d0c6f1f4db Author: tim Date: Mon Jan 6 00:03:41 2014 -0500 2014-01-07 Tim Shen * bits/regex_automaton.tcc: Indentation fix. * bits/regex_compiler.h (__compile_nfa<>(), _Compiler<>, _RegexTranslator<> _AnyMatcher<>, _CharMatcher<>, _BracketMatcher<>): Add bool option template parameter and specialization to make matching more efficient and space saving. * bits/regex_compiler.tcc: Likewise. diff --git a/libstdc++-v3/include/bits/regex_automaton.tcc b/libstdc++-v3/include/bits/regex_automaton.tcc index 7edc67f..e222803 100644 --- a/libstdc++-v3/include/bits/regex_automaton.tcc +++ b/libstdc++-v3/include/bits/regex_automaton.tcc @@ -134,9 +134,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _NFA<_TraitsT>::_M_dot(std::ostream& __ostr) const { __ostr << "digraph _Nfa {\n" - " rankdir=LR;\n"; + " rankdir=LR;\n"; for (size_t __i = 0; __i < this->size(); ++__i) -(*this)[__i]._M_dot(__ostr, __i); + (*this)[__i]._M_dot(__ostr, __i); __ostr << "}\n"; return __ostr; } diff --git a/libstdc++-v3/include/bits/regex_compiler.h b/libstdc++-v3/include/bits/regex_compiler.h index 4ac67df..4873166 100644 --- a/libstdc++-v3/include/bits/regex_compiler.h +++ b/libstdc++-v3/include/bits/regex_compiler.h @@ -39,11 +39,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION * @{ */ - template + template struct _BracketMatcher; /// Builds an NFA from an input iterator interval. - template + template class _Compiler { public: @@ -63,7 +63,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION typedef typename _ScannerT::_TokenT _TokenT; typedef _StateSeq<_TraitsT>_StateSeqT; typedef std::stack<_StateSeqT, std::vector<_StateSeqT>> _StackT; - typedef _BracketMatcher<_TraitsT> _BMatcherT; + typedef _BracketMatcher<_TraitsT, __icase, __collate> _BMatcherT; typedef std::ctype_CtypeT; // accepts a specific token or returns false. @@ -95,12 +95,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _M_expression_term(_BMatcherT& __matcher); bool - _M_range_expression(_BMatcherT& __matcher); - - bool - _M_collating_symbol(_BMatcherT& __matcher); - - bool _M_equivalence_class(_BMatcherT& __matcher); bool @@ -134,8 +128,20 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION __compile_nfa(_FwdIter __first, _FwdIter __last, const _TraitsT& __traits, regex_constants::syntax_option_type __flags) { - using _Cmplr = _Compiler<_FwdIter, _TraitsT>; - return _Cmplr(__first, __last, __traits, __flags)._M_get_nfa(); + if (!(__flags & regex_constants::icase)) + if (!(__flags & regex_constants::collate)) + return _Compiler<_FwdIter, _TraitsT, false, false> + (__first, __last, __traits, __flags)._M_get_nfa(); + else + return _Compiler<_FwdIter, _TraitsT, false, true> + (__first, __last, __traits, __flags)._M_get_nfa(); + else + if (!(__flags & regex_constants::collate)) + return _Compiler<_FwdIter, _TraitsT, true, false> + (__first, __last, __traits, __flags)._M_get_nfa(); + else + return _Compiler<_FwdIter, _TraitsT, true, true> + (__first, __last, __traits, __flags)._M_get_nfa(); } template @@ -148,16 +154,110 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return __compile_nfa(__cfirst, __cfirst + __len, __traits, __flags); } - template -struct _AnyMatcher + // [28.13.14] + template +class _RegexTranslator { - typedef typename _TraitsT::char_type _CharT; +public: + typedef typename _TraitsT::char_type _CharT; + typedef typename _TraitsT::string_type _StringT; + typedef typename std::conditional<__collate, + _StringT, + _CharT>::type _StrTransT; explicit - _AnyMatcher(const _TraitsT& __traits) + _RegexTranslator(const _TraitsT& __traits) : _M_traits(__traits) { } + _CharT + _M_translate(_CharT __ch) con
Re: [Patch] Regex bracket matcher cache optimization
On Tue, Jan 7, 2014 at 4:02 AM, Paolo Carlini wrote: > Ideally, I would suggest committing first the improvements in your previous > patch (by the way, thanks for the numbers!) + the pure bug fixes and > separate the further performance improvements which have compile-time > performance implications (how big?), see if, eg, Jon has something to > recommend. Can we do that? First patch committed. I later found that the second patch "b.diff" is based on the committed version (the attach, which fixed the "&&" problem); Here's an example to test compile time: #include using namespace std; int main() { regex a("a"); regex b("a", regex_constants::ECMAScript | regex_constants::icase); regex c("a", regex_constants::ECMAScript | regex_constants::collate); regex d("a", regex_constants::ECMAScript | regex_constants::icase | regex_constants::collate); return 0; } Before the second patch: g++ -g -Wall -std=c++11 -O3 regextest.cc 3.30s user 0.13s system 99% cpu 3.435 total After it: g++ -g -Wall -std=c++11 -O3 regextest.cc 4.35s user 0.10s system 99% cpu 4.454 total I didn't noticed that's so time consuming. I think reducing the compile time is possible (by templating several member functions instead of whole _Compiler<> class). -- Regards, Tim Shen Index: include/std/regex === --- include/std/regex (revision 206399) +++ include/std/regex (revision 206400) @@ -56,11 +56,11 @@ #include #include +#include +#include #include -#include #include #include -#include #endif // C++11 Index: include/bits/regex_compiler.h === --- include/bits/regex_compiler.h (revision 206399) +++ include/bits/regex_compiler.h (revision 206400) @@ -129,43 +129,6 @@ _StackT _M_stack; }; - template -struct __has_contiguous_iter : std::false_type { }; - - template -struct __has_contiguous_iter> -: std::true_type // string storage is contiguous -{ }; - - template -struct __has_contiguous_iter> -: std::true_type // vector storage is contiguous -{ }; - - template -struct __has_contiguous_iter> -: std::false_type // vector storage is not contiguous -{ }; - - template -struct __is_contiguous_normal_iter : std::false_type { }; - - template -struct -__is_contiguous_normal_iter<__gnu_cxx::__normal_iterator<_Tp, _Cont>> -: __has_contiguous_iter<_Cont>::type -{ }; - - template -using __enable_if_contiguous_normal_iter - = typename enable_if< __is_contiguous_normal_iter<_Iter>::value, - std::shared_ptr<_NFA<_TraitsT>> >::type; - - template -using __disable_if_contiguous_normal_iter - = typename enable_if< !__is_contiguous_normal_iter<_Iter>::value, - std::shared_ptr<_NFA<_TraitsT>> >::type; - template inline __disable_if_contiguous_normal_iter<_FwdIter, _TraitsT> __compile_nfa(_FwdIter __first, _FwdIter __last, const _TraitsT& __traits, @@ -185,7 +148,7 @@ return __compile_nfa(__cfirst, __cfirst + __len, __traits, __flags); } - template + template struct _AnyMatcher { typedef typename _TraitsT::char_type _CharT; @@ -197,25 +160,55 @@ bool operator()(_CharT __ch) const + { return _M_apply(__ch, typename is_same<_CharT, char>::type()); } + + bool + _M_apply(_CharT __ch, true_type) const { - return _M_traits.translate(__ch) != '\n' - && _M_traits.translate(__ch) != '\r' - && _M_traits.translate(__ch) != u'\u2028' - && _M_traits.translate(__ch) != u'\u2029'; + auto __c = _M_traits.translate(__ch); + if (__is_ecma) + { + static auto __n = _M_traits.translate('\n'); + static auto __r = _M_traits.translate('\r'); + return __c != __n && __c != __r; + } + else + { + static auto __nul = _M_traits.translate('\0'); + return __c != __nul; + } } + bool + _M_apply(_CharT __ch, false_type) const + { + auto __c = _M_traits.translate(__ch); + if (__is_ecma) + { + static auto __n = _M_traits.translate('\n'); + static auto __r = _M_traits.translate('\r'); + static auto __u2028 = _M_traits.translate(u'\u2028'); + static auto __u2029 = _M_traits.translate(u'\u2029'); + return __c != __n && __c != __r && __c != __u2028 + && __c != __u2029; + } + else
Re: [Patch] Regex bracket matcher cache optimization
On Wed, Jan 8, 2014 at 5:20 AM, Paolo Carlini wrote: > On 01/08/2014 10:24 AM, Jonathan Wakely wrote: >> Ouch! Yes, that's quite a bit slower, and this code is already very >> slow to compile. With this patch (who is based on a-fixed.diff, committed earlerly), who use templated member functions instead of templating the whole _Compiler, time consumption is: g++ -g -Wall -std=c++11 -g -Wall -std=c++11 -O3 regextest.cc 3.79s user 0.14s system 98% cpu 3.981 total Comparing to 4.5s it's better and probably fine. Booted and tested with -m32 and -m64 respectively. > I only want to add that, besides keeping compile-time under control for > 4.9.0 - please investigate a bit more along the mentioned lines - we should > also start experimenting with exporting the instantiations. I don't know > what the other implementations are doing, but in general it definitely makes > sense, for compile-time performance too. I think we already said that some > time ago, but the issue seems more important now. Maybe it's really > unavoidable if we need template complexity for first class run-time > performance. After this patch I plan to instantiate _Compiler and _Executor. -- Regards, Tim Shen
Re: [Patch] Regex bracket matcher cache optimization
On Wed, Jan 8, 2014 at 5:38 PM, Paolo Carlini wrote: > I agree, it's probably fine for now, but please actually attach the patch ;) Oops sorry >.< So my plan is to instantiate _Compiler and _Executor instead of user interfaces like basic_regex or regex_match, because the implementation may change (say add a new executor) later. Is that Ok? -- Regards, Tim Shen commit d9f47e783680a1cab86bd704e67236025cbdff18 Author: tim Date: Mon Jan 6 00:03:41 2014 -0500 2014-01-08 Tim Shen * bits/regex_automaton.tcc: Indentation fix. * bits/regex_compiler.h (__compile_nfa<>(), _Compiler<>, _RegexTranslator<> _AnyMatcher<>, _CharMatcher<>, _BracketMatcher<>): Add bool option template parameters and specializations to make matching more efficient and space saving. * bits/regex_compiler.tcc: Likewise. diff --git a/libstdc++-v3/include/bits/regex_automaton.tcc b/libstdc++-v3/include/bits/regex_automaton.tcc index 7edc67f..e222803 100644 --- a/libstdc++-v3/include/bits/regex_automaton.tcc +++ b/libstdc++-v3/include/bits/regex_automaton.tcc @@ -134,9 +134,9 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _NFA<_TraitsT>::_M_dot(std::ostream& __ostr) const { __ostr << "digraph _Nfa {\n" - " rankdir=LR;\n"; + " rankdir=LR;\n"; for (size_t __i = 0; __i < this->size(); ++__i) -(*this)[__i]._M_dot(__ostr, __i); + (*this)[__i]._M_dot(__ostr, __i); __ostr << "}\n"; return __ostr; } diff --git a/libstdc++-v3/include/bits/regex_compiler.h b/libstdc++-v3/include/bits/regex_compiler.h index 4ac67df..b73fe30 100644 --- a/libstdc++-v3/include/bits/regex_compiler.h +++ b/libstdc++-v3/include/bits/regex_compiler.h @@ -39,7 +39,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION * @{ */ - template + template struct _BracketMatcher; /// Builds an NFA from an input iterator interval. @@ -63,7 +63,6 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION typedef typename _ScannerT::_TokenT _TokenT; typedef _StateSeq<_TraitsT>_StateSeqT; typedef std::stack<_StateSeqT, std::vector<_StateSeqT>> _StackT; - typedef _BracketMatcher<_TraitsT> _BMatcherT; typedef std::ctype_CtypeT; // accepts a specific token or returns false. @@ -91,20 +90,30 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION bool _M_bracket_expression(); - void - _M_expression_term(_BMatcherT& __matcher); + template + void + _M_insert_any_matcher_ecma(); - bool - _M_range_expression(_BMatcherT& __matcher); + template + void + _M_insert_any_matcher_posix(); - bool - _M_collating_symbol(_BMatcherT& __matcher); + template + void + _M_insert_char_matcher(); - bool - _M_equivalence_class(_BMatcherT& __matcher); + template + void + _M_insert_character_class_matcher(); - bool - _M_character_class(_BMatcherT& __matcher); + template + void + _M_insert_bracket_matcher(bool __neg); + + template + void + _M_expression_term(_BracketMatcher<_TraitsT, __icase, __collate>& + __matcher); int _M_cur_int_value(int __radix); @@ -148,16 +157,110 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION return __compile_nfa(__cfirst, __cfirst + __len, __traits, __flags); } - template -struct _AnyMatcher + // [28.13.14] + template +class _RegexTranslator { - typedef typename _TraitsT::char_type _CharT; +public: + typedef typename _TraitsT::char_type _CharT; + typedef typename _TraitsT::string_type _StringT; + typedef typename std::conditional<__collate, + _StringT, + _CharT>::type _StrTransT; explicit - _AnyMatcher(const _TraitsT& __traits) + _RegexTranslator(const _TraitsT& __traits) : _M_traits(__traits) { } + _CharT + _M_translate(_CharT __ch) const + { + if (__icase) + return _M_traits.translate_nocase(__ch); + else if (__collate) + return _M_traits.translate(__ch); + else + return __ch; + } + + _StrTransT + _M_transform(_CharT __ch) const + { + return _M_transform_impl(__ch, typename integral_constant::type()); + } + +private: + _StrTransT + _M_transform_impl(_CharT __ch, false_type) const + { return __ch; } + + _StrTransT + _M_transform_impl(_CharT __ch, true_type) const + { + _StrTransT __str = _StrTransT(1, _M_translate(__ch)); + return _M_traits.transform(__str
Re: [GSoC] Patches for shared_ptr array and polymorphic_allocator
On Fri, Jul 17, 2015 at 7:16 PM, Fan You wrote: > Hi, > > According to > <http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4335.html#memory.smartptr> > > Here is my implementation of > > [8.2] Extend shared_ptr to support arrays Please don't resend the shared_ptr patch, since it's already tracked in another thread. > [8.3] Type-Erased allocator Please send a working patch with tests and (probably with Makefile.am changes). Format: please replace leading consecutive spaces with tabs. L70: static std::atomic s_default_resource; naming: _S_default_resource. L43: virtual ~memory_resource() { } Please break the line after virtual/return type. This also applies for other places in the patch. L81: template class __constructor_helper_imp This doesn't work correctly for at least following case: std::__constructor_helper_imp(std::allocator_arg, std::allocator(), std::true_type(), std::true_type(), std::true_type()); Based on [allocator.uses.construction], this falls into the second rule, because there exists priorities in those rules; but overloading resolution thinks it's ambiguous. To enforce the order, you can do this: template struct __uses_allocator_construction_helper; ... { constexpr bool __uses_alloc = uses_allocator<...>::value; constexpr bool __normally_constructible = is_constructible<_Tp, _Args...>::value; constexpr bool __constructible_alloc_before = is_constructible<_Tp, allocator_tag_t, _Alloc, _Args...>::value; constexpr bool __constructible_alloc_after = is_constructible<_Args..., _Alloc>::value; constexpr bool __uses_rule1 = !__uses_alloc && __normally_constructible; constexpr bool __uses_rule2 = __uses_alloc && __constructible_alloc_before; constexpr bool __uses_rule3 = __uses_alloc && __constructible_alloc_after; __uses_allocator_construction_helper<__uses_rule1 ? 1 : (__uses_rule2 ? 2 : (__uses_rule3 ? 3 : 0))>::_S_apply(...); } Consider use a more readable helper name, like __uses_allocator_construction_helper and document it. L73: bool operator==(const memory_resource& __a, const memory_resource& __b) noexcept { return &__a == &__b || __a.is_equal(__b); } Make all non-template functions inlined. L178: { } // used here What does this mean? L180: polymorphic_allocator(memory_resource* __r) : _M_resource(__r ? __r : get_default_resource()) { } [8.6.2.3] describes __r != nullptr as a precondition, which is guaranteed by the caller, so we don't have to check here. Alternatively you can use _GLIBCXX_ASSERT. L262: memory_resource* _M_resource; private member variables should be defined after private member functions. L286: template bool operator!=(const polymorphic_allocator<_Tp1>& __a, const polymorphic_allocator<_Tp2>& __b) noexcept { return ! (__a == __b); } Remove extra space after "!". L340: auto __p = dynamic_cast(&__other); What if the user turns off RTTI? L345: // Calculate Aligned Size size_t _Aligned_size(size_t __size, size_t __alignment) { return ((__size - 1)|(__alignment - 1)) + 1; } bool _M_supported (size_t __x) { return ((__x != 0) && (__x != 0) && !(__x & (__x - 1))); } Document these two functions' behaviors, e.g.: Returns a size that is larger than or equal to __size and divided by __alignment, where __alignment is required to be the power of 2. L355: // Global memory resources atomic memory_resource::s_default_resource; L386: // The default memory resource memory_resource* get_default_resource() noexcept { memory_resource *__ret = memory_resource::s_default_resource.load(); if (__ret == nullptr) { __ret = new_delete_resource(); } return __ret; } According to [8.8.5], memory resource pointer should be intialized with new_delete_resource(), not nullptr; and get_default_resource should only return the pointer. L396: memory_resource* set_default_resource(memory_resource* __r) noexcept { if ( __r == nullptr) { __r = new_delete_resource(); } memory_resource* __prev = get_default_resource(); memory_resource::s_default_resource.store(__r); return __prev; } We shouldn't care if it's nullptr or not. Your get-then-set may cause a data race. I think std::atomic<>::exchange will work, but we should confirm with Jon. -- Regards, Tim Shen
Re: [Patch] Small refactor on _State<>
On Sat, Jul 25, 2015 at 12:11 AM, Tim Shen wrote: > It's not a very necessary refactoring, but simply can't resist. :) > > I'm not sure of the ::memcpy calls. It looks not very idiomatic, but > std::copy on char* looks even more weird? :/ > > Bootstrapped and tested. > > Thanks! > > > -- > Regards, > Tim Shen -- Regards, Tim Shen
[libstdc++/67015, patch] Fix regex POSIX bracket parsing
Kinda important, since "[a-z0-9-]" may be a common case. Bootstrapped and tested. Guess it can also be backported to 5, or even 4.9? Thanks! -- Regards, Tim Shen
Re: [libstdc++/67015, patch] Fix regex POSIX bracket parsing
On Sun, Jul 26, 2015 at 5:19 AM, Tim Shen wrote: > Kinda important, since "[a-z0-9-]" may be a common case. > > Bootstrapped and tested. Actual patch... -- Regards, Tim Shen commit e0e6c2e3b722e1453d29ad3a56d0de80046453b0 Author: Tim Shen Date: Sun Jul 26 04:37:45 2015 -0700 PR libstdc++/67015 * include/bits/regex_compiler.h (_Compiler<>::_M_expression_term, _BracketMatcher<>::_M_add_collating_element): Change signature to make checking the and of bracket expression easier. * include/bits/regex_compiler.tcc (_Compiler<>::_M_expression_term): Treat '-' as a valid literal if it's at the end of bracket expression. * testsuite/28_regex/algorithms/regex_match/cstring_bracket_01.cc: New testcases. diff --git a/libstdc++-v3/include/bits/regex_compiler.h b/libstdc++-v3/include/bits/regex_compiler.h index 4472116..6d9799e 100644 --- a/libstdc++-v3/include/bits/regex_compiler.h +++ b/libstdc++-v3/include/bits/regex_compiler.h @@ -116,8 +116,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION void _M_insert_bracket_matcher(bool __neg); + // Returns true if successfully matched one term and should continue. + // Returns false if the compiler should move on. template - void + bool _M_expression_term(pair& __last_char, _BracketMatcher<_TraitsT, __icase, __collate>& __matcher); @@ -389,7 +391,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION #endif } - void + _StringT _M_add_collating_element(const _StringT& __s) { auto __st = _M_traits.lookup_collatename(__s.data(), @@ -400,6 +402,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION #ifdef _GLIBCXX_DEBUG _M_is_ready = false; #endif + return __st; } void diff --git a/libstdc++-v3/include/bits/regex_compiler.tcc b/libstdc++-v3/include/bits/regex_compiler.tcc index 33d7118..f48e3c1 100644 --- a/libstdc++-v3/include/bits/regex_compiler.tcc +++ b/libstdc++-v3/include/bits/regex_compiler.tcc @@ -424,8 +424,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION __last_char.first = true; __last_char.second = _M_value[0]; } - while (!_M_match_token(_ScannerT::_S_token_bracket_end)) - _M_expression_term(__last_char, __matcher); + while (_M_expression_term(__last_char, __matcher)); __matcher._M_ready(); _M_stack.push(_StateSeqT( *_M_nfa, @@ -434,21 +433,31 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template template -void +bool _Compiler<_TraitsT>:: _M_expression_term(pair& __last_char, _BracketMatcher<_TraitsT, __icase, __collate>& __matcher) { + if (_M_match_token(_ScannerT::_S_token_bracket_end)) + return false; + if (_M_match_token(_ScannerT::_S_token_collsymbol)) - __matcher._M_add_collating_element(_M_value); + { + auto __symbol = __matcher._M_add_collating_element(_M_value); + if (__symbol.size() == 1) + { + __last_char.first = true; + __last_char.second = __symbol[0]; + } + } else if (_M_match_token(_ScannerT::_S_token_equiv_class_name)) __matcher._M_add_equivalence_class(_M_value); else if (_M_match_token(_ScannerT::_S_token_char_class_name)) __matcher._M_add_character_class(_M_value, false); - // POSIX doesn't permit '-' as a start-range char (say [a-z--0]), - // except when the '-' is the first character in the bracket expression - // ([--0]). ECMAScript treats all '-' after a range as a normal character. - // Also see above, where _M_expression_term gets called. + // POSIX doesn't allow '-' as a start-range char (say [a-z--0]), + // except when the '-' is the first or last character in the bracket + // expression ([--0]). ECMAScript treats all '-' after a range as a + // normal character. Also see above, where _M_expression_term gets called. // // As a result, POSIX rejects [-], but ECMAScript doesn't. // Boost (1.57.0) always uses POSIX style even in its ECMAScript syntax. @@ -459,10 +468,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION { if (!__last_char.first) { + __matcher._M_add_char(_M_value[0]); if (_M_value[0] == '-' && !(_M_flags & regex_constants::ECMAScript)) - __throw_regex_error(regex_constants::error_range); - __matcher._M_add_char(_M_value[0]); + { + if (_M_match_token(_ScannerT::_S_token_bracket_end)) + return false; + __throw_regex_error(regex_constants:
Re: [libstdc++/67015, patch] Fix regex POSIX bracket parsing
On Mon, Jul 27, 2015 at 4:45 AM, Jonathan Wakely wrote: > On 26/07/15 05:20 -0700, Tim Shen wrote: >> >> @@ -389,7 +391,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION >> #endif >> } >> >> - void >> + _StringT >> _M_add_collating_element(const _StringT& __s) >> { >> auto __st = _M_traits.lookup_collatename(__s.data(), > > > Changing this return type is an ABI change. When compiled with older > versions of GCC this function will not return anything, so code > compiled with newer versions that links to that older code will read > garbage off the stack if linked to the old version. > > (This isn't a problem for _M_expression_term because the return type > is part of the mangled name for function templates). > > One solution would be to rename the function, so that code compiled > with the new GCC will use the new function, not link to the old > version that doesn't return anything. Done by s/_M_add_collating_element/_M_add_collate_element/. -- Regards, Tim Shen commit 4a3b4ab285b0ecabb706a776551dabf5a37059aa Author: Tim Shen Date: Sun Jul 26 04:37:45 2015 -0700 PR libstdc++/67015 * include/bits/regex_compiler.h (_Compiler<>::_M_expression_term, _BracketMatcher<>::_M_add_collating_element): Change signature to make checking the and of bracket expression easier. * include/bits/regex_compiler.tcc (_Compiler<>::_M_expression_term): Treat '-' as a valid literal if it's at the end of bracket expression. * testsuite/28_regex/algorithms/regex_match/cstring_bracket_01.cc: New testcases. diff --git a/libstdc++-v3/include/bits/regex_compiler.h b/libstdc++-v3/include/bits/regex_compiler.h index 4472116..0cb0c04 100644 --- a/libstdc++-v3/include/bits/regex_compiler.h +++ b/libstdc++-v3/include/bits/regex_compiler.h @@ -116,8 +116,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION void _M_insert_bracket_matcher(bool __neg); + // Returns true if successfully matched one term and should continue. + // Returns false if the compiler should move on. template - void + bool _M_expression_term(pair& __last_char, _BracketMatcher<_TraitsT, __icase, __collate>& __matcher); @@ -389,8 +391,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION #endif } - void - _M_add_collating_element(const _StringT& __s) + _StringT + _M_add_collate_element(const _StringT& __s) { auto __st = _M_traits.lookup_collatename(__s.data(), __s.data() + __s.size()); @@ -400,6 +402,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION #ifdef _GLIBCXX_DEBUG _M_is_ready = false; #endif + return __st; } void diff --git a/libstdc++-v3/include/bits/regex_compiler.tcc b/libstdc++-v3/include/bits/regex_compiler.tcc index 33d7118..9a62311 100644 --- a/libstdc++-v3/include/bits/regex_compiler.tcc +++ b/libstdc++-v3/include/bits/regex_compiler.tcc @@ -424,8 +424,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION __last_char.first = true; __last_char.second = _M_value[0]; } - while (!_M_match_token(_ScannerT::_S_token_bracket_end)) - _M_expression_term(__last_char, __matcher); + while (_M_expression_term(__last_char, __matcher)); __matcher._M_ready(); _M_stack.push(_StateSeqT( *_M_nfa, @@ -434,21 +433,31 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template template -void +bool _Compiler<_TraitsT>:: _M_expression_term(pair& __last_char, _BracketMatcher<_TraitsT, __icase, __collate>& __matcher) { + if (_M_match_token(_ScannerT::_S_token_bracket_end)) + return false; + if (_M_match_token(_ScannerT::_S_token_collsymbol)) - __matcher._M_add_collating_element(_M_value); + { + auto __symbol = __matcher._M_add_collate_element(_M_value); + if (__symbol.size() == 1) + { + __last_char.first = true; + __last_char.second = __symbol[0]; + } + } else if (_M_match_token(_ScannerT::_S_token_equiv_class_name)) __matcher._M_add_equivalence_class(_M_value); else if (_M_match_token(_ScannerT::_S_token_char_class_name)) __matcher._M_add_character_class(_M_value, false); - // POSIX doesn't permit '-' as a start-range char (say [a-z--0]), - // except when the '-' is the first character in the bracket expression - // ([--0]). ECMAScript treats all '-' after a range as a normal character. - // Also see above, where _M_expression_term gets called. + // POSIX doesn't allow '-' as a start-range char
Re: [libstdc++/67015, patch] Fix regex POSIX bracket parsing
On Tue, Jul 28, 2015 at 3:27 AM, Jonathan Wakely wrote: > On 27/07/15 19:40 -0700, Tim Shen wrote: >> >> Done by s/_M_add_collating_element/_M_add_collate_element/. > > > Great, thanks. OK for trunk and gcc-5-branch. Committed. Is there no need for gcc-4_9-branch? What's the general policy for backporting fixes? Thanks! -- Regards, Tim Shen
Re: [Patch] Small refactor on _State<>
On Wed, Jul 29, 2015 at 1:32 AM, Jonathan Wakely wrote: > Apologies, you have a user-declared move constructor, so assignment is > already deleted. It wouldn't hurt to make that explicit though: I'm just bad at memorizing when they are implicitly declared/defined/deleted and fully unware of the deleted copy assignment :) It's better to make it explicit, since I don't want to be clever on pretending knowing all ctor/assigment rules :) > _State& operator=(const _State&) = delete; > > So it's just the alignment issue that I'm concerned about now. > Will alignof(std::function) be all the same across instantiations, including user-defined C, now and in the future? I created template parameter _Matcher_size (which should be __matcher_size) because I don't want to make this assumption on sizeof(). If they (both alignment and size) are expected to be the same, we can remove that template parameter (and all indentation changes!); otherwise, is alignment more unlikely to change than size the reason we always stick on alignof(std::function)? -- Regards, Tim Shen
Re: [libstdc++/67015, patch] Fix regex POSIX bracket parsing
On Wed, Jul 29, 2015 at 1:35 AM, Jonathan Wakely wrote: > The 4.9 branch is the oldest one we still support, so should be very > stable now, although C++11 support in 4.9 is still labelled > experimental, and this only changes C++11 code. How important is it to > fix? Um... I'd say medium importance, for [a-z0-9-] not working in POSIX syntax (while ECMAScript seems to be more popular, which works correctly). If 4.9 is `stable` and expected to be experimetal, which by my definition should only include important changes, we may leave it as is? -- Regards, Tim Shen
Re: [Patch] Small refactor on _State<>
On Wed, Jul 29, 2015 at 2:15 AM, Jonathan Wakely wrote: > Yes, that makes sense. See the code in for how > to set the alignment of the buffer appropriately. You can use the size > and alignment of std::function even though it will > sometimes be a different std::function specialization. Done. Also change _Executor::_M_lookahead(_State<>) to _Executor::_M_lookahead(_StateIdT __next_state). -- Regards, Tim Shen commit 52c70e70bdbef15c787f81e83722bfc119543ff0 Author: Tim Shen Date: Wed Jul 29 21:08:43 2015 -0700 * include/bits/regex_automaton.h (_State_base, _State<>): Remove _TraitsT dependency from _State<>; Make matcher member into the union to reduce struct size. * include/bits/regex_automaton.tcc (_State_base<>::_M_print, _State_base<>::_M_dot, _StateSeq<>::_M_clone): Adjust to fit the interface. Factor out common parts in _M_clone as _State<>::_M_has_alt. * include/bits/regex_executor.h (_Executer<>::_M_lookahead): Only pass state id instead of the whole state. * include/bits/regex_executor.tcc (_Executer<>::_M_dfs, _Executer<>::_M_lookahead): Adjust to fit the interface. * include/std/regex: Include diff --git a/libstdc++-v3/include/bits/regex_automaton.h b/libstdc++-v3/include/bits/regex_automaton.h index fc0eb41..0ef3896 100644 --- a/libstdc++-v3/include/bits/regex_automaton.h +++ b/libstdc++-v3/include/bits/regex_automaton.h @@ -72,7 +72,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION struct _State_base { + protected: _Opcode _M_opcode; // type of outgoing transition + + public: _StateIdT_M_next; // outgoing transition union // Since they are mutually exclusive. { @@ -87,16 +90,24 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION // quantifiers (ungreedy if set true) bool _M_neg; }; + // For _S_opcode_match + __gnu_cxx::__aligned_membuf<_Matcher> _M_matcher_storage; }; + protected: explicit _State_base(_Opcode __opcode) : _M_opcode(__opcode), _M_next(_S_invalid_state_id) { } - protected: -~_State_base() = default; - public: +bool +_M_has_alt() +{ + return _M_opcode == _S_opcode_alternative + || _M_opcode == _S_opcode_repeat + || _M_opcode == _S_opcode_subexpr_lookahead; +} + #ifdef _GLIBCXX_DEBUG std::ostream& _M_print(std::ostream& ostr) const; @@ -107,14 +118,64 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION #endif }; - template + template struct _State : _State_base { - typedef _Matcher _MatcherT; + typedef _Matcher<_Char_type> _MatcherT; + static_assert(sizeof(_MatcherT) == sizeof(_Matcher), + "The aussmption std::function has " + "the same size as std::function is violated"); + static_assert(alignof(_MatcherT) == alignof(_Matcher), + "The aussmption std::function has " + "the same alignment as std::function is violated"); + + explicit + _State(_Opcode __opcode) : _State_base(__opcode) + { + if (_M_opcode() == _S_opcode_match) + new (this->_M_matcher_storage._M_addr()) _MatcherT(); + } + + _State(const _State& __rhs) : _State_base(__rhs) + { + if (__rhs._M_opcode() == _S_opcode_match) + new (this->_M_matcher_storage._M_addr()) + _MatcherT(__rhs._M_get_matcher()); + } + + _State(_State&& __rhs) : _State_base(__rhs) + { + if (__rhs._M_opcode() == _S_opcode_match) + new (this->_M_matcher_storage._M_addr()) + _MatcherT(std::move(__rhs._M_get_matcher())); + } + + _State& + operator=(const _State&) = delete; + + ~_State() + { + if (_M_opcode() == _S_opcode_match) + _M_get_matcher().~_MatcherT(); + } + + // Since correct ctor and dtor rely on _M_opcode, it's better not to + // change it over time. + _Opcode + _M_opcode() const + { return _State_base::_M_opcode; } + + bool + _M_matches(_Char_type __char) const + { return _M_get_matcher()(__char); } - _MatcherT _M_matches;// for _S_opcode_match + _MatcherT& + _M_get_matcher() + { return *reinterpret_cast<_MatcherT*>(this->_M_matcher_storage._M_addr()); } - explicit _State(_Opcode __opcode) : _State_base(__opcode) { } + const _MatcherT& + _M_get_matcher() const + { return *reinterpret_cast(this->_M_matcher_storage._M_addr()); } }; struct _NFA_base @@ -155,10 +216,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template struct _NFA -: _NFA_base, std::vector<_State<_TraitsT>> +: _NFA_base, std::vector<_State> { - typede
Re: [Patch] Small refactor on _State<>
On Wed, Jul 29, 2015 at 9:21 PM, Tim Shen wrote: > On Wed, Jul 29, 2015 at 2:15 AM, Jonathan Wakely wrote: >> Yes, that makes sense. See the code in for how >> to set the alignment of the buffer appropriately. You can use the size >> and alignment of std::function even though it will >> sometimes be a different std::function specialization. > > Done. Oops, fixed one typo: s/_Matcher/_Matcher/. -- Regards, Tim Shen commit d9fb4e3ec5eb9fcaf08f757c2a9ddcf57289684f Author: Tim Shen Date: Wed Jul 29 21:08:43 2015 -0700 * include/bits/regex_automaton.h (_State_base, _State<>): Remove _TraitsT dependency from _State<>; Make matcher member into the union to reduce struct size. * include/bits/regex_automaton.tcc (_State_base<>::_M_print, _State_base<>::_M_dot, _StateSeq<>::_M_clone): Adjust to fit the interface. Factor out common parts in _M_clone as _State<>::_M_has_alt. * include/bits/regex_executor.h (_Executer<>::_M_lookahead): Only pass state id instead of the whole state. * include/bits/regex_executor.tcc (_Executer<>::_M_dfs, _Executer<>::_M_lookahead): Adjust to fit the interface. * include/std/regex: Include diff --git a/libstdc++-v3/include/bits/regex_automaton.h b/libstdc++-v3/include/bits/regex_automaton.h index fc0eb41..e153d42 100644 --- a/libstdc++-v3/include/bits/regex_automaton.h +++ b/libstdc++-v3/include/bits/regex_automaton.h @@ -72,7 +72,10 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION struct _State_base { + protected: _Opcode _M_opcode; // type of outgoing transition + + public: _StateIdT_M_next; // outgoing transition union // Since they are mutually exclusive. { @@ -87,16 +90,24 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION // quantifiers (ungreedy if set true) bool _M_neg; }; + // For _S_opcode_match + __gnu_cxx::__aligned_membuf<_Matcher> _M_matcher_storage; }; + protected: explicit _State_base(_Opcode __opcode) : _M_opcode(__opcode), _M_next(_S_invalid_state_id) { } - protected: -~_State_base() = default; - public: +bool +_M_has_alt() +{ + return _M_opcode == _S_opcode_alternative + || _M_opcode == _S_opcode_repeat + || _M_opcode == _S_opcode_subexpr_lookahead; +} + #ifdef _GLIBCXX_DEBUG std::ostream& _M_print(std::ostream& ostr) const; @@ -107,14 +118,64 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION #endif }; - template + template struct _State : _State_base { - typedef _Matcher _MatcherT; + typedef _Matcher<_Char_type> _MatcherT; + static_assert(sizeof(_MatcherT) == sizeof(_Matcher), + "The aussmption std::function has " + "the same size as std::function is violated"); + static_assert(alignof(_MatcherT) == alignof(_Matcher), + "The aussmption std::function has " + "the same alignment as std::function is violated"); + + explicit + _State(_Opcode __opcode) : _State_base(__opcode) + { + if (_M_opcode() == _S_opcode_match) + new (this->_M_matcher_storage._M_addr()) _MatcherT(); + } + + _State(const _State& __rhs) : _State_base(__rhs) + { + if (__rhs._M_opcode() == _S_opcode_match) + new (this->_M_matcher_storage._M_addr()) + _MatcherT(__rhs._M_get_matcher()); + } + + _State(_State&& __rhs) : _State_base(__rhs) + { + if (__rhs._M_opcode() == _S_opcode_match) + new (this->_M_matcher_storage._M_addr()) + _MatcherT(std::move(__rhs._M_get_matcher())); + } + + _State& + operator=(const _State&) = delete; + + ~_State() + { + if (_M_opcode() == _S_opcode_match) + _M_get_matcher().~_MatcherT(); + } + + // Since correct ctor and dtor rely on _M_opcode, it's better not to + // change it over time. + _Opcode + _M_opcode() const + { return _State_base::_M_opcode; } + + bool + _M_matches(_Char_type __char) const + { return _M_get_matcher()(__char); } - _MatcherT _M_matches;// for _S_opcode_match + _MatcherT& + _M_get_matcher() + { return *reinterpret_cast<_MatcherT*>(this->_M_matcher_storage._M_addr()); } - explicit _State(_Opcode __opcode) : _State_base(__opcode) { } + const _MatcherT& + _M_get_matcher() const + { return *reinterpret_cast(this->_M_matcher_storage._M_addr()); } }; struct _NFA_base @@ -155,10 +216,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template struct _NFA -: _NFA_base, std::vector<_State<_TraitsT>> +: _NFA_base, std::vector<_State>
[v3 patch] refactoring - pull out common data members as _Context
In next few weeks I'm gonna sending patches for refactoring . The goal is to make the gigantic _Executor class into several smaller ones and hopefully more readable. After that, there are several correctness/performance issues to be fixed. We may also need more documentation, but self-documenting code is even more important. I've almost finished my drafting (with a lot of off-history back and forth, but stablized finally): https://github.com/innocentim/gcc/commits/executer, but I don't want to commit this branch, since it's more like a mind experiment (and trust me, you won't follow the commit history of this branch ;). This patch doesn't change any behavior. I've tried my best not to sacrifies performance for this _Context abstraction (as all C++ programmers do :), so far so good. I let _Executor privately inherits from _Context; but alternatively we can just make it a member. Bootstrapped and tested. Take your time. This isn't urgent. My plan is to finish them by the next big release. Thanks! -- Regards, Tim Shen commit 007f04c627b0e0ea83b4e2e818354254f4ce649a Author: Tim Shen Date: Thu Aug 6 00:37:21 2015 -0700 * include/bits/regex.tcc: Adjust caller side for interface change. * include/bits/regex_executor.h: Created __regex namespace. Created _Context for common data members for all executers. Do not pass around __match_mode, but make it a _Context member (_M_search_mode). Deleted unnecessary member _State_info::_M_start. * include/bits/regex_executor.tcc: Adjust for interface/member change. * include/std/regex: Reorder included headers. diff --git a/libstdc++-v3/include/bits/regex.tcc b/libstdc++-v3/include/bits/regex.tcc index fedc2b9..361ad32 100644 --- a/libstdc++-v3/include/bits/regex.tcc +++ b/libstdc++-v3/include/bits/regex.tcc @@ -61,25 +61,21 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION __it.matched = false; bool __ret; + constexpr auto __mode = __match_mode + ? __regex::_Search_mode::_Match : __regex::_Search_mode::_Search; if ((__re.flags() & regex_constants::__polynomial) || (__policy == _RegexExecutorPolicy::_S_alternate && !__re._M_automaton->_M_has_backref)) { - _Executor<_BiIter, _Alloc, _TraitsT, false> - __executor(__s, __e, __m, __re, __flags); - if (__match_mode) - __ret = __executor._M_match(); - else - __ret = __executor._M_search(); + __regex::_Executor<_BiIter, _Alloc, _TraitsT, false> __executor( + __s, __e, *__re._M_automaton, __flags, __mode, __res); + __ret = __executor.template _M_match<__mode>(); } else { - _Executor<_BiIter, _Alloc, _TraitsT, true> - __executor(__s, __e, __m, __re, __flags); - if (__match_mode) - __ret = __executor._M_match(); - else - __ret = __executor._M_search(); + __regex::_Executor<_BiIter, _Alloc, _TraitsT, true> __executor( + __s, __e, *__re._M_automaton, __flags, __mode, __res); + __ret = __executor.template _M_match<__mode>(); } if (__ret) { diff --git a/libstdc++-v3/include/bits/regex_executor.h b/libstdc++-v3/include/bits/regex_executor.h index f3f8876..5310bb8 100644 --- a/libstdc++-v3/include/bits/regex_executor.h +++ b/libstdc++-v3/include/bits/regex_executor.h @@ -32,15 +32,71 @@ namespace std _GLIBCXX_VISIBILITY(default) { +_GLIBCXX_BEGIN_NAMESPACE_VERSION +_GLIBCXX_BEGIN_NAMESPACE_CXX11 + + template +class sub_match; + +_GLIBCXX_END_NAMESPACE_CXX11 +_GLIBCXX_END_NAMESPACE_VERSION namespace __detail { +namespace __regex +{ _GLIBCXX_BEGIN_NAMESPACE_VERSION - /** * @addtogroup regex-detail * @{ */ + enum class _Search_mode : unsigned char { _Match, _Search }; + + template +struct _Context +{ + using _Char_type = typename iterator_traits<_Bi_iter>::value_type; + + _Context(_Bi_iter __begin, _Bi_iter __end, const _NFA<_Traits>& __nfa, + regex_constants::match_flag_type __flags, + _Search_mode __search_mode) + : _M_begin(__begin), _M_end(__end), _M_match_flags(__flags), + _M_nfa(__nfa), _M_search_mode(__search_mode) { } + + bool + _M_is_word(_Char_type __ch) const + { + static const _Char_type __s[2] = { 'w' }; + return _M_nfa._M_traits.isctype + (__ch, _M_nfa._M_traits.lookup_classname(__s, __s+1)); + } + + bool + _M_at_begin() const + { + return _M_current == _M_begin + && !(_M_match_flags & (regex_constants::match_not_bol +| regex_constants::match_prev_avail)); + } + + bool + _M_at_end() const + { + return _M_current == _M_end +
Re: [Patch, libstdc++] Add specific error message into exceptions
On Wed, Sep 16, 2015 at 10:38 AM, Jonathan Wakely wrote: > On 12/09/15 01:57 +0000, Tim Shen wrote: >> >> Ok then, let's not appending dynamic location string, but only throw a >> string literal pointer. > > > This looks great, and a *huge* improvement on the current errors even > without more precise location info. I'm glad to hear this :). > OK for trunk, thanks very much for doing this. Tested & Committed as r227936. -- Regards, Tim Shen
Re: [v3 patch] refactoring - pull out common data members as _Context
Hi, As the changes grow (https://github.com/innocentim/gcc/commits/master), it's getting harder to rebase them onto svn trunk. Can we start slowly reviewing and checking these in? Should I post them one by one to the lis? These patches typically break one giant piece of code (mainly _Executor::_M_dfs) into different smaller ones and fix mostly known issues and standard conformance issues. -- Regards, Tim Shen
Re: [v3 patch] Fix spelling in regex headers
On Wed, Nov 6, 2013 at 4:47 AM, Jonathan Wakely wrote: > This simply fixes s/boundry/boundary/, which I wanted to do first > before some more regex refactoring I'm planning. Thanks a lot Jonathan! -- Regards, Tim Shen
[Patch, libstdc++/63497] Avoid dereferencing invalid iterator in regex_executor
Bootstrapped and tested. Thanks! -- Regards, Tim Shen commit 95c73ab6280c1f8182d018ee29a44230965dd4ef Author: timshen Date: Sun Oct 19 15:14:55 2014 -0700 PR libstdc++/63497 include/bits/regex_executor.h (_Executor::_M_word_boundary): Remove const qualifier. include/bits/regex_executor.tcc (_Executor::_M_dfs, _Executor::_M_word_boundary): Avoid dereferecing _M_current at _M_end or other invalid position. diff --git a/libstdc++-v3/include/bits/regex_executor.h b/libstdc++-v3/include/bits/regex_executor.h index cd9e55d..b867951 100644 --- a/libstdc++-v3/include/bits/regex_executor.h +++ b/libstdc++-v3/include/bits/regex_executor.h @@ -145,7 +145,7 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION } bool - _M_word_boundary(_State<_TraitsT> __state) const; + _M_word_boundary(_State<_TraitsT> __state); bool _M_lookahead(_State<_TraitsT> __state); diff --git a/libstdc++-v3/include/bits/regex_executor.tcc b/libstdc++-v3/include/bits/regex_executor.tcc index 5eab852..9655c7a 100644 --- a/libstdc++-v3/include/bits/regex_executor.tcc +++ b/libstdc++-v3/include/bits/regex_executor.tcc @@ -284,9 +284,11 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION _M_dfs(__match_mode, __state._M_next); break; case _S_opcode_match: + if (_M_current == _M_end) + break; if (__dfs_mode) { - if (_M_current != _M_end && __state._M_matches(*_M_current)) + if (__state._M_matches(*_M_current)) { ++_M_current; _M_dfs(__match_mode, __state._M_next); @@ -407,25 +409,28 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION template bool _Executor<_BiIter, _Alloc, _TraitsT, __dfs_mode>:: -_M_word_boundary(_State<_TraitsT>) const +_M_word_boundary(_State<_TraitsT>) { - // By definition. - bool __ans = false; - auto __pre = _M_current; - --__pre; - if (!(_M_at_begin() && _M_at_end())) + bool __left_is_word = false; + if (_M_current != _M_begin + || (_M_flags & regex_constants::match_prev_avail)) { - if (_M_at_begin()) - __ans = _M_is_word(*_M_current) - && !(_M_flags & regex_constants::match_not_bow); - else if (_M_at_end()) - __ans = _M_is_word(*__pre) - && !(_M_flags & regex_constants::match_not_eow); - else - __ans = _M_is_word(*_M_current) - != _M_is_word(*__pre); + --_M_current; + if (_M_is_word(*_M_current)) + __left_is_word = true; + ++_M_current; } - return __ans; + bool __right_is_word = false; + if (_M_current != _M_end && _M_is_word(*_M_current)) + __right_is_word = true; + + if (__left_is_word == __right_is_word) + return false; + if (__left_is_word && !(_M_flags & regex_constants::match_not_eow)) + return true; + if (__right_is_word && !(_M_flags & regex_constants::match_not_bow)) + return true; + return false; } _GLIBCXX_END_NAMESPACE_VERSION