[The context: *very basic* header validation of e-mail messages]

On 2015-04-28 10:27:40 +0200, Nicolas George wrote:
> On Octidi 8 Floréal, year CCXXIII, Vincent Lefevre wrote:
> > I don't understand the point. Accumulating in strings (which involves
> > copies and possible reallocations) and doing a split is much slower
> > than reading lines one by one and treating them separately.
>
> First: not necessarily, because once the header is loaded in a string, you
> can apply regexps to the whole header at once instead of using a loop. This
> may prove faster.

I've finally tried this solution (i.e. accumulating, then applying
regexps to the full string), and it takes about 60% more time when the
data are in the disk cache. This is not surprising, IMHO, for the
following reasons:

First, as I've said, accumulating lines in a string may involve copies
and reallocations because the string grows (I don't know whether there
is a way to avoid that without obfuscating the code).

Second, I don't think that, in the particular case of header
validation, there is much to gain from applying regexps to the full
header at once: my regexps use the end of line as a separator (things
like /\n[^:\s]+\s/ and /^Message-ID:.../im). So, when I read the file
line by line, I already do part of the regexp-matching job.

Finally, for each test, the header has to be read several times.
However, I'm not sure that this is a problem here, because it could be
seen as merely reordering the reads from the L1 cache[*] and the tests.
So it is not clear which is better.

[*] Each header should fit in it.

> The gist of it is the usual saying: "profile, don't speculate". You had a
> particular issue that made your program immensely slower. Now that this
> problem is resolved and your program run-time is acceptable, you may want to
> trade a bit of CPU consumption for simplicity: having the whole header in a
> string makes a lot of things easier and/or more robust, especially
> everything that has to do with folded headers. And remember you already
> traded A LOT of CPU for simplicity: you are using Perl, not assembly.

In my case, I don't need to deal with folded headers, except for
validating the format, which is very easy with line-by-line parsing.
I may have other scripts that need to deal with them, but in that case,
I accumulate the physical lines into a single logical one. AFAIK, this
is what mail processors do (postfix header filtering, procmail...).
But there is no need to accumulate the full header in a single string.
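
For illustration only, here is a minimal sketch of the two approaches
compared above. It is not the actual script: the sub names
(check_by_line, check_whole), the sample headers and the exact set of
tests are assumptions, and only LF line endings are handled. Both subs
apply the same three checks: no continuation line at the very start, no
line that begins with a colon-less word followed by whitespace, and a
Message-ID field present.

```perl
use strict;
use warnings;

# Hypothetical sample headers, made up for this sketch.
my $good = <<'EOF';
From: alice@example.org
Message-ID: <1234@example.org>
Subject: a folded
 subject line

body text
EOF

my $bad = <<'EOF';
From: alice@example.org
this line has no field name
Message-ID: <1234@example.org>

body text
EOF

# Approach 1: test each physical line as soon as it is read.
sub check_by_line {
    my ($text) = @_;
    open my $fh, '<', \$text or die "open: $!";   # in-memory filehandle
    my ($first, $has_msgid) = (1, 0);
    while (my $line = <$fh>) {
        last if $line eq "\n";                    # blank line ends the header
        return 0 if $first && $line =~ /^[ \t]/;  # continuation cannot come first
        return 0 if $line =~ /^[^:\s]+\s/;        # word without ':' before a space
        $has_msgid = 1 if $line =~ /^Message-ID:/i;
        $first = 0;
    }
    return $has_msgid;
}

# Approach 2: accumulate the header, then match against the whole string.
sub check_whole {
    my ($text) = @_;
    open my $fh, '<', \$text or die "open: $!";
    my $header = '';
    while (my $line = <$fh>) {
        last if $line eq "\n";
        $header .= $line;                         # the string grows at each line
    }
    return 0 if $header =~ /^[ \t]/;              # no /m: start of string only
    return 0 if $header =~ /^[^:\s]+\s/m;         # /m: '^' matches at each line
    return $header =~ /^Message-ID:/im ? 1 : 0;
}

for my $case ([good => $good], [bad => $bad]) {
    my ($name, $text) = @$case;
    printf "%s: by_line=%d whole=%d\n",
        $name, check_by_line($text), check_whole($text);
}
```

Both subs accept $good and reject $bad. The second one scans the
accumulated string once per test, whereas the first one gets the line
boundaries for free from the read loop.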

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Archive: https://lists.debian.org/20150518153809.ga2...@ypig.lip.ens-lyon.fr