On Fri, Jan 14, 2011 at 3:53 PM, krueger, Andreas (Andreas Krüger, DV-RATIO) <andreas.krue...@hp.com> wrote:
> Hello, Johan and all,
>
> first, for the record, here is another comparison between binary and
> text merge performance, this time with the files generated by my
> script (repeated below):
>
> Binary merge took 3.56 seconds, text merge took 3:45:45.202 hours.
> In this particular case, binary merge performance was 3805 times
> faster than text merge performance.
>
>
>> Textual merging in svn makes use of a variant of the standard diff
>> algorithm, namely diff3. Just a couple of days ago, I finally
>> succeeded in making diff3 take advantage of ... performance
>> improvements ... .
>
> Good news! Excellent! Thank you!
>
> But... does this relate to my problem?
>
> The improved diff3 will give a nice performance improvement in the
> *general* case.
>
> I certainly want that improvement!
>
>
> Another nice performance improvement, by a factor of several hundred
> (or thousand), could be obtained for big files in the *trivial* case,
> if SVN didn't run diff3 at all but simply copied the result.
>
> I also want this other improvement!
>
>
> Finally:
>
> SVN already contains the intelligence needed to find out whether a
> merge is trivial or not. For, in the binary case, the trivial merges
> are precisely the ones that SVN knows how to do.
>
>
> Johan (or whoever else), please kindly enlighten me, should I be
> missing something!
Ok, after rereading this thread, I'm starting to understand what you
mean: why should "merge" run an expensive diff algorithm when it can be
100% sure that it can simply copy the contents from the source to the
target (because the target file hasn't changed at all since it was
branched)?

I think it's a good suggestion, but I can't really comment further on
its feasibility, because I'm not that familiar with that part of the
codebase. I've only concentrated on the diff algorithm itself (and on
how it's used by "svn diff" and "svn merge" for text files). Maybe
someone else can chime in to comment on that?

Of course, as soon as there is any change to the target file, it's no
longer a trivial merge (a plain copy of the contents), so you would
immediately fall back to the very expensive case. But I agree that's
hardly a reason not to take the shortcut when you can...

A couple more thoughts on the diff side of the story:

- Your Perl script presents an extremely hard case for the diff
  algorithm, because:

  * Files A and B differ in every one of the 1,000,000 lines (so my
    prefix/suffix scanning optimization will not help at all).
  * All lines in each file are also unique within that file (this
    makes the diff more expensive).
  * Most lines have the same length as other lines (which also makes
    it more expensive).
  * The lines are very short (the cost of the diff algorithm is mainly
    proportional to the number of lines).

  These characteristics are very atypical compared with real-world
  examples. Usually there is some identical part at the beginning
  and/or the end, and lines vary a lot in length. If a significant
  portion of the lines at the beginning and/or the end is identical,
  the optimizations on the diff-optimizations-bytes branch will help a
  lot.

- Interestingly, GNU diff calculates the diff between these files in 7
  seconds on my machine. But if I give it the option '--minimal', it
  also runs for hours (I started it 2 hours ago and it's still
  running).

- Can you try the merge on your original example (big.xml) with the
  option -x-b (ignore changes in the amount of whitespace)? Just to
  see whether it makes a difference. In my tests this made diff *much*
  faster, so I'm guessing the same holds for merge. Of course, this
  depends entirely on the example data (it won't help at all for the
  Perl-generated files; if anything it will slow them down even more).

Cheers,
-- 
Johan
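
P.S.: To make the "just copy it" idea a bit more concrete, here is a
minimal sketch in Python of the kind of check I understand is being
proposed. All names here are made up for illustration; this is not how
the actual merge code in Subversion is structured, and the real check
would presumably use the pristine text-base and checksums rather than
a raw byte comparison:

    import filecmp
    import shutil

    def merge_file(base, source, target, run_diff3):
        # Trivial case: the target is byte-for-byte identical to the
        # common base, i.e. it hasn't been touched since it was
        # branched, so the merge result is simply the source file.
        if filecmp.cmp(base, target, shallow=False):
            shutil.copyfile(source, target)   # O(file size), no diffing
            return "clean copy"
        # Non-trivial case: the target has local changes, so fall back
        # to the full (potentially very expensive) 3-way text merge.
        return run_diff3(base, source, target)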
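
And since I keep referring to the prefix/suffix scanning: the idea is
simply to strip identical lines from both ends before the expensive
part of the diff runs, so only the differing middle section is fed to
the real algorithm. A rough line-based sketch (the code on the
diff-optimizations-bytes branch works at the byte level and in C, so
take this only as an illustration of the principle):

    def strip_common_prefix_suffix(a, b):
        # a and b are lists of lines; return the differing middle parts.
        n = min(len(a), len(b))
        p = 0
        while p < n and a[p] == b[p]:
            p += 1                      # identical prefix
        s = 0
        while s < n - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
            s += 1                      # identical suffix
        # For the generated files above, p and s stay 0, so nothing is
        # saved; for typical real-world files they often cover most of
        # the lines.
        return a[p:len(a) - s], b[p:len(b) - s]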
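
Finally, for anyone who wants to reproduce the pathological case,
something along these lines should show the same behaviour. This is
not the original Perl script (which isn't repeated here), just a
reconstruction from the characteristics listed above:

    def write_pathological_pair(path_a="file_A", path_b="file_B",
                                n=1000000):
        # Every corresponding line differs, every line is unique within
        # its file, and all lines are short and of identical length, so
        # none of the shortcuts above apply.
        with open(path_a, "w") as fa, open(path_b, "w") as fb:
            for i in range(n):
                fa.write("a%07d\n" % i)
                fb.write("b%07d\n" % i)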