Hi Achim,
Thank you very much for the detailed instructions, and also for the Linux vs. Cygwin comparison data for all those testcases.

Achim Gratz wrote:
> ASSI writes:
>> I have a Cygwin malloc speedup patch that *might* help the m-t part.
>> I'll prepare and submit that to cygwin-patches shortly.
>
> Well, if you want to test it with the new ZStandard, give it a spin…
> I'll check how far I can strip that test down so you can use the Cygwin
> source tree for testing.

I've now done this.  And I don't see any improvement.  Reasons below...

> OK, it's actually pretty simple, do this inside a checkout of
> newlib-cygwin:
>
> $ find newlib winsup texinfo -type f > flist
> $ zstd --train-cover --ultra -22 -T0 -vv --filelist=flist -o dict-cover
>
> On Linux, it reads in all the files in about two seconds, while it takes
> quite a while longer on Cygwin.  But the real bummer is that
> constructing the partial suffix arrays (which is single-threaded) will
> seemingly take forever, while it's done much faster on Linux.  You can
> pare down the number of files like this:
>
> $ shuf -n 320 flist > slist

I've settled on '-n 1600' for testing. I'm running these Cygwin tests on a 2C/4T i3-something with 8GB memory and an SSD holding both the filesystem and the page file. Not a dog, but clearly not a dire-wolf either.

The page fault numbers are comparable to what you've shown for Cygwin on your system. The long pause after zstd prints "Constructing partial suffix array" is because zstd is CPU-bound in qsort() for a long time; there's no paging during that stretch. The paging insanity starts once the statistics begin printing.
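
If anyone wants to instrument a test program directly, one way to read the cumulative per-process fault count is Win32's GetProcessMemoryInfo(). This is just a minimal sketch of mine, not anything zstd does; build it on Cygwin with -lpsapi:

/* Print this process's cumulative page-fault count as Windows
   reports it.  Illustration only. */
#include <stdio.h>
#include <windows.h>
#include <psapi.h>

int main(void)
{
    PROCESS_MEMORY_COUNTERS pmc;

    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof pmc))
        printf("page faults: %lu\n", (unsigned long) pmc.PageFaultCount);
    return 0;
}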

What I discovered is that zstd repeatedly asks malloc() for large memory blocks, presumably to hold file contents, then free()s them. Any malloc request of 256K or larger is fulfilled by mmap() rather than by enlarging the heap. But crucially, our malloc has no mechanism for hanging on to freed mmap()ed pages for future use: if you free an mmap()ed block, it is munmap()ed immediately. So for zstd's usage pattern you get an incredible number of page faults satisfying the mmap()s, and Windows seems to take a non-trivial amount of time for each mmap().
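
To make that concrete, here's a minimal standalone sketch (my own, not code from zstd) approximating that allocation pattern. If the analysis above is right, on Cygwin every iteration pays for a fresh mmap()/munmap() pair plus the faults to populate it, whereas a malloc that cached freed mmap()ed regions would fault mostly on the first pass:

/* Allocate and free a block well above the 256K mmap threshold,
   touching every page so the faults actually happen.  Time it with
   e.g. "time ./a.out" and compare Linux vs. Cygwin. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCKSIZE (4 * 1024 * 1024)
#define ITERATIONS 1000

int main(void)
{
    unsigned long sum = 0;

    for (int i = 0; i < ITERATIONS; i++) {
        char *p = malloc(BLOCKSIZE);
        if (!p)
            return 1;
        memset(p, i & 0xff, BLOCKSIZE);   /* touch every page */
        sum += p[BLOCKSIZE - 1];          /* keep the work observable */
        free(p);                          /* munmap()ed right away on Cygwin */
    }
    printf("checksum %lu\n", sum);        /* defeats dead-code elimination */
    return 0;
}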

I will be looking at our malloc implementation to see whether tuning something can fix this behavior; adding code is the last resort. One candidate knob is sketched below.
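
For instance, assuming our dlmalloc-derived malloc actually wires up dlmalloc's M_MMAP_THRESHOLD parameter through mallopt() (an assumption I still have to verify), one experiment would be raising the threshold so zstd-sized requests stay on the ordinary heap, where freed space gets reused instead of munmap()ed:

/* Hypothetical tuning experiment, not a committed fix.  The #define
   fallback uses dlmalloc's value for M_MMAP_THRESHOLD in case the
   header doesn't expose it. */
#include <stdio.h>
#include <malloc.h>

#ifndef M_MMAP_THRESHOLD
#define M_MMAP_THRESHOLD (-3)
#endif

int main(void)
{
    /* Ask malloc to keep requests below 64M on the heap. */
    if (mallopt(M_MMAP_THRESHOLD, 64 * 1024 * 1024))
        puts("threshold raised");
    else
        puts("mallopt() didn't accept M_MMAP_THRESHOLD here");
    return 0;
}

Whether our mallopt() honors that parameter is exactly the sort of thing I'll be checking.
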
Thanks again for the great testcase.

..mark
