Re: crc-x86_64: undefined behaviour

2025-01-18 Thread Sam Russell
IIRC we test values from 0-16 bytes in the unit test and the pclmul implementation doesn't get hit for smaller values? On Sat, Jan 18, 2025, 13:16 Simon Josefsson wrote: > Thanks for discussion, Lasse. I suppose the current situation is > difficult to change due to mainly licensing... > > Lasse

Re: crc-x86_64: undefined behaviour

2025-01-17 Thread Sam Russell
Worth noting I've seen faults at avx2/512 for reads that aren't correctly aligned, unless I've made a mistake somewhere along the way

Re: crc-x86_64: undefined behaviour

2025-01-17 Thread Sam Russell
please read "we could not do" as "we could avoid doing"

Re: crc-x86_64: undefined behaviour

2025-01-17 Thread Sam Russell
We discussed this previously, we decided since AVX1 supports unaligned accesses we could not do an alignment check at the start of the function, but as you've discovered, this memcpy issue creates undefined behaviour. Most performant would probably be an alignment check at the start and then manua

Re: crc-x86_64: Fix compilation error with clang

2024-12-21 Thread Sam Russell
this is a shame, the pragma option seemed super clean if clang worked with it. ad hoc library seems like a close second given we can't change CFLAGS without changing the whole repo On Sat, Dec 21, 2024, 23:14 Bruno Haible wrote: > The gnulib CIs also exercise compilation with clang, and it faile

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-17 Thread Sam Russell
> Thanks - I have pushed this now. Thanks :) > My point was that it would be better to use the same mechanism the .c > implementation uses in the ./configure check, to make sure it tests for > what will be used. That is, drop the CFLAGS changes and add: I agree from a testability perspective, I

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-16 Thread Sam Russell
> End sentences with '.' done > Same here. done > That filename comment is now wrong. fixed > Is that true? If this new original code, shouldn't it say 2024? If > this isn't new original code (does it come from coreutils), it make > sense to add a comment saying where it is coming from, lik

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-16 Thread Sam Russell
Thanks for the quick reply Bruno, updated patch attached On Mon, 16 Dec 2024 at 17:25, Bruno Haible wrote: > Sam Russell wrote: > > crc-x86_64 is added as its own package with minimal bindings in crc.c to > > support, flagged off by the GL_CRC_PCLMUL flag which is only set if the

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-16 Thread Sam Russell
:40, Sam Russell wrote: > Sounds good, thanks for explaining. I'll get going on the autoconf stuff > then. > > On Fri, Dec 13, 2024, 20:03 Jim Meyering wrote: > >> That's not the way it usually works. >> Typically, the author and/or person most familiar with th

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-13 Thread Sam Russell
you don't want to list yourself and no one else steps up, just mark > it as "all". > > On Fri, Dec 13, 2024 at 10:52 AM Sam Russell > wrote: > > > > Jim, would you be willing to be the maintainer for this in gnulib? If so > then I can fix the autoconf this wee

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-13 Thread Sam Russell
it. > > On Fri, Dec 13, 2024 at 9:45 AM Sam Russell > wrote: > > > > I should also add that I'm planning long term to try and get the major > open source tools (e.g. gzip) using the most efficient algorithms for > CRC32, so there will be more patches coming in f

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-13 Thread Sam Russell
these? On Fri, Dec 13, 2024, 17:52 Sam Russell wrote: > > I'd prefer of the crc PCLMUL feature is "opt-in" from a package > > maintainer point of view. > > fine by me, this just requires a change to the build script then? the > binding in crc.c has #ifdefs ar

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-13 Thread Sam Russell
> I'd prefer of the crc PCLMUL feature is "opt-in" from a package > maintainer point of view. fine by me, this just requires a change to the build script then? the binding in crc.c has #ifdefs around it already > So this would be removed. ok, so we just need crc pclmul to be explicitly flagged o

Re: [PATCH] crc: Add PCLMUL implementation

2024-12-12 Thread Sam Russell
Hi guys, Just bumping this, is this a change we would like to include? Are there any issues with the patch that you'd like me to fix? Cheers Sam On Tue, Nov 26, 2024, 23:01 Sam Russell wrote: > > It's your choice: 3 compilation units for x86_64, or 1 compilation unit >

Memory requirements for CRC32 calculation

2024-12-05 Thread Sam Russell
Hi, I've been looking into other CRC32 implementations for architectures that don't have accelerated polynomial multiplication (PCLMUL/VMULL for x86-SSE/AVX or ARM-NEON), and there are improvements we can make, but they make tradeoffs for memory (both for lookup tables and for working set). The "b

Re: [PATCH] crc: Add PCLMUL implementation

2024-11-26 Thread Sam Russell
hoose in the end. Final patch for now with alignment removed based on Jeff's feedback that _mm_loadu_si128() accepts unaligned data. bench-crc, test-crc and gzip all happy. Am I correct in thinking that we are just waiting for Simon's input now? On Tue, 26 Nov 2024 at 22:53, Bruno H

Re: [PATCH] crc: Add PCLMUL implementation

2024-11-26 Thread Sam Russell
de. > I believe the way to zero a __m128i is using _mm_setzero_si128(). It Perfect, it's outside the main loop so I'm not worried but this does look more correct. On Tue, 26 Nov 2024 at 22:47, Jeffrey Walton wrote: > On Tue, Nov 26, 2024 at 4:27 PM Sam Russell > wrote: > >

Re: [PATCH] crc: Add PCLMUL implementation

2024-11-26 Thread Sam Russell
as well as the alignment check above. test-crc and bench-crc are fine, gzip builds with this gnulib and decompresses my test file with no hash error On Tue, 26 Nov 2024 at 20:29, Sam Russell wrote: > > Cool. But it even gets better: one can use these target options on a per- > > functi

Re: [PATCH] crc: Add PCLMUL implementation

2024-11-26 Thread Sam Russell
> Cool. But it even gets better: one can use these target options on a per- > function basis, via __attribute__. See > https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc/x86-Function-Attributes.html#index-target_0028_0022avx_0022_0029-function-attribute_002c-x86 > https://gcc.gnu.org/onlinedocs/gcc-14

Re: [PATCH] crc: Add PCLMUL implementation

2024-11-26 Thread Sam Russell
> * I would suggest to rename the main source file from crc-pclmul.c to > crc-x86_64.c. > Rationale: So that immediately clear that the code is specific to the > x86_64 CPUs. Not everyone is an assembly language hacker, and even some > assembly language hackers (like me) don't know ab

[PATCH] crc: Add PCLMUL implementation

2024-11-26 Thread Sam Russell
85% time reduction on AMD Ryzen 5 5600: $ ./gltests/bench-crc 100 real 1.740296 user 1.740 sys0.000 $ ../bench-crc-pclmul 100 real 0.248324 user 0.248 sys0.000 This translates to a 13% time reduction for gzip: $ time ./gzip_sliceby8 -k -d -c large_file.gz > /dev/null re

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-11-03 Thread Sam Russell
gzip just took my patch to initialise MAINTAINERCLEANFILES, just built from HEAD with gnulib pulled from HEAD, everything looking good make check Testsuite summary for gzip 1.13.36-46e9-dirty

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-31 Thread Sam Russell
especially reluctant to add the existing cksum crc variant to gnulib, > since I don't see anything but cksum itself using that. I need to research a little further, but bzip2 uses the same implementation as cksum afaik On Thu, 31 Oct 2024 at 18:33, Pádraig Brady wrote: > On 31/10/2024 16

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-31 Thread Sam Russell
I can extend coreutils cksum if you like, what are your thoughts on adding some parameters to the commandline? Even if it's just an extra flag for --rfc1952 or --gzip to change all the parameters just between these two variants? On Thu, Oct 31, 2024, 17:41 Sam Russell wrote: > These

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-31 Thread Sam Russell
These are all solvable: Polynomial: technically the same but one could argue it's the reversed version (we reverse the polynomial so we can read LSB first). We now generate the tables at compile time so this could be a parameter to table generation Initial/final value: easy enough to configure at

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-31 Thread Sam Russell
> I believe this should be fixed through the second patch of mine. Yes, thanks > I think you should add 'MAINTAINERCLEANFILES =' to gzip instead. Adding to lib/Makefile.am on line 24 fixes this Patch attached to mark functions in crc.c and crc-generate-tables.c as static, GZIP now builds $ ti

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-31 Thread Sam Russell
Thanks for merging Simon. I've fixed a couple of issues with building gzip, marking the functions as static fixes the -Werror=missing-declarations error, and the MAINTAINERCLEANFILES variable isn't set when building gzip so there's an error when using += with it. There's another issue with the way

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-29 Thread Sam Russell
ere's anything outstanding please let me know and I'm happy to add whatever is missing. On Mon, 28 Oct 2024 at 22:06, Sam Russell wrote: > Is there anything else we're missing on this patch? From what I can tell > the configure flag is just the standard behaviour of AC_ARG_ENABLE

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-28 Thread Sam Russell
23:21 Simon Josefsson wrote: > Thanks - I am happy with this version, or even without the user-visible > logic too (not sure if this is something users should ever tune). My > question about selection logic was more for maintainers rather than > end-users. > > /Simon > > 27

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-27 Thread Sam Russell
I had a play with autoconf (inspired heavily by m4/debug.m4), can confirm the behaviour works as follows: # build with slice-by-8 (default enabled) ./configure make gltests/bench-crc 100 real 1.714850 user 1.715 sys0.000 # build without slice-by-8 ./configure --disable-crc-slice-by-8 m

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-27 Thread Sam Russell
thanks, moved the xors to the front of the lines On Sun, Oct 27, 2024, 13:41 Bruno Haible wrote: > Sam Russell wrote: > > > Also, in GNU coding style, when line breaking is needed within > > expressions, > > > we do the line break before the operator, not after t

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-27 Thread Sam Russell
Done. > I'll then deal with the cross-compiling situation. Will need to experiment > a bit, since that's the first time, in Gnulib, that we have a built file > generated by a C program (as opposed to a pre-installed tool). Glad to be on the bleeding edge :) On Sun, 27 Oct 2024

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-27 Thread Sam Russell
> You can fix most of these by simply running 'indent' on your *.h and *.c Thanks, this is super helpful. I'm happy to follow the coding style but a linter would make it easier for me to track down these whitespace issues. > Regarding the Automake stuff, please try the attached patch, on top of y

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-26 Thread Sam Russell
. If it is compliant then can someone please merge it. Cheers Sam On Thu, 24 Oct 2024 at 11:56, Sam Russell wrote: > Sorry for the bump, confirming that I've tested on s390x locally (WSL > amd64 with docker-qemu running s390x) > > root@3d86c908e4f5:/gnulib# ./gnulib-tool --with

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-25 Thread Sam Russell
w feature, it's currently disabled by default. On Tue, 22 Oct 2024 at 22:38, Sam Russell wrote: > Tests pass with GL_CRC_SLICE_BY_8 set > > > > On Wed, 16 Oct 2024 at 13:52, Simon Josefsson wrote: > >> Sam Russell writes: >> >> >>

Re: [PATCH] tests/test-crc.c: New tests for future optimised implementation.

2024-10-25 Thread Sam Russell
il if we read more than requested. We also have 0-length reads in the mix which is a no-op and returns 0x as a result. On Thu, 24 Oct 2024 at 23:08, Bruno Haible wrote: > Sam Russell wrote: > > I think we are massively overengineering this. We shouldn't need to > > fu

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-22 Thread Sam Russell
Tests pass with GL_CRC_SLICE_BY_8 set On Wed, 16 Oct 2024 at 13:52, Simon Josefsson wrote: > Sam Russell writes: > > >> People using the crc module who want to disable the faster/larger CRC32 > >> implementation can add 'gl_crc_slice_by_8=yes' to their

Re: [PATCH] tests/test-crc.c: New tests for future optimised implementation.

2024-10-22 Thread Sam Russell
Thanks, my mistake on the dropped CC On Tue, 22 Oct 2024 at 22:24, Simon Josefsson wrote: > Thanks, pushed in your name -- next time please attach git-format-patch > output that includes the commit message, and keep cc to the list. > > /Simon > > Sam Russell writes: &g

Re: [PATCH] tests/test-crc.c: New tests for future optimised implementation.

2024-10-22 Thread Sam Russell
> I'd opt for reducing this set of expected values to 32 values, > by reintroducing the memcpy() trick from your previous patch: Attached is a patch with this implementation On Tue, 22 Oct 2024 at 14:34, Sam Russell wrote: > I think we are massively overengineering this. We sho

Re: [PATCH] tests/test-crc.c: New tests for future optimised implementation.

2024-10-22 Thread Sam Russell
I think we are massively overengineering this. We shouldn't need to future-proof tests for changes that *might* be added in the future, and having 32x32 hashes in a table at the start of the test is not a big deal (it took less than a second to generate and it'll take the same amount of time if we

Re: [PATCH] tests/test-crc.c: New tests for future optimised implementation.

2024-10-22 Thread Sam Russell
ntations to match the current implementation On Tue, 22 Oct 2024 at 12:06, Bruno Haible wrote: > Sam Russell wrote: > > is there anymore feedback on this? > > + char plaintext[] = "Gnulib crc testsGnulib crc tests" > + "Gnulib crc testsGnuli

Re: [PATCH] tests/test-crc.c: New tests for future optimised implementation.

2024-10-21 Thread Sam Russell
, endianness) Cheers Sam On Tue, Oct 15, 2024, 10:44 Simon Josefsson wrote: > Thanks this looks better. Let's keep it around for more feedback/review > until your copyright papers arrive. > > /Simon > > Sam Russell writes: > > >> Or repeat "Gnulib c

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-17 Thread Sam Russell
> I have one use-case (although not published so less > important) where the table size increase is a problem, and would prefer > to use the current smaller crc.c (or even better, a smaller one that > generate the tables on the fly, I forgot that it is possible). There's another approach for table

Re: crc tests: Add a benchmark program

2024-10-16 Thread Sam Russell
Thank you for including explicit usage cmdline :) Results with current slice_by_8 code GL_CRC_SLICE_BY_8 not set $ gltests/bench-crc 100 real 7.502692 user 7.503 sys0.000 $ gltests/bench-crc 100 real 7.481411 user 7.481 sys0.000 $ gltests/bench-crc 100 real 7.525393

Re: [PATCH] crc: New optimised slice-by-8 implementation

2024-10-16 Thread Sam Russell
alignment and do the upcasting ourselves) You are correct that this breaks under big endian though, thanks for the tip for endian.h. I've added a le64toh() call and this should make it work correctly on big endian systems (how do we go about testing these btw? I don't have one o

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-16 Thread Sam Russell
> I seem to remember there was something special with the polynomial that forced you byteswap the data. That's correct, coreutils uses the normal polynomial, gnulib uses the bit-reversed polynomial, and gzip uses gnulib directly as the RFC 1952 specifies the bit-reversed polynomial. Both use the s

[PATCH] crc: New optimised slice-by-8 implementation

2024-10-15 Thread Sam Russell
This is my implementation of the slice-by-8 algorithm for CRC32 generation. I've added a flag CRC_ENABLE_SLICE_BY_8, I'd appreciate if someone can give me a hint on how to set this up in the makefile config. I get the impression that we want this to be on by default or for the major architectures,

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-15 Thread Sam Russell
e end. On Tue, 15 Oct 2024 at 16:59, Pádraig Brady wrote: > On 15/10/2024 12:58, Sam Russell wrote: > > I'm happy with the slice-by-8 code I have < > https://github.com/samrussell/gnulib/blob/slice_by_8/lib/crc.c < > https://github.com/samrussell/gnulib/blob/slice_by_8/lib

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-15 Thread Sam Russell
I'm happy with the slice-by-8 code I have < https://github.com/samrussell/gnulib/blob/slice_by_8/lib/crc.c> but the cksum_pclmul implementation is quite detailed so it would be useful if we could relicense that for gnulib. There's a minor optimisation around dealing with a non-round number of byte

Re: [PATCH] tests/test-crc.c: New tests for future optimised implementation.

2024-10-15 Thread Sam Russell
ld get away with a single off-by-1 test but here we are testing everything from 0-15 bytes of misalignment; a 49-byte misalignment should create the same problem as a 1-byte misalignment. On Tue, 15 Oct 2024 at 09:57, Simon Josefsson wrote: > Sam Russell writes: > > > + char p

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-15 Thread Sam Russell
t module, seems like a good improvement. You should start on > the copyright assignment paper process first. > > /Simon > > Sam Russell writes: > > > IANAL but it appears the source (paper, not code) for both > implementations > > in coreutils is free, I'm happy

[PATCH] tests/test-crc.c: New tests for future optimised implementation.

2024-10-15 Thread Sam Russell
Patch file attached. As far as I can tell, the main concern with the new optimisations is that we're moving from processing byte-by-byte to processing in blocks of up to 128 bytes, and these will be read in either 8 or 16 byte chunks. This new test calls the inner CRC update function with data ali

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-15 Thread Sam Russell
gt; Otherwise another approach is to keep a simple LGPL crc module and > performant crc-gpl module that uses the coreutils code. > > /Simon > > Sam Russell writes: > > > That sounds good. It looks like there some subtle differences anyway, the > > gzip version does ev

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
ing looks on the wire, but I am trying my best to do things by the book here for portability. If there's a kosher way of telling when we're aligned then that would be the holy grail here. On Mon, 14 Oct 2024 at 23:25, Collin Funk wrote: > Sam Russell writes: > > >

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
it though... It would make sense to also offer a second function to allow callers to dump in a block of guaranteed-aligned memory. On Mon, Oct 14, 2024, 18:30 Bruno Haible wrote: > Sam Russell wrote: > > I built from HEAD, named it gzip_vanilla, rebuilt with my CRC code and > > nam

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
en source projects in the last 2 days. On Mon, 14 Oct 2024 at 20:26, Simon Josefsson wrote: > Sam Russell writes: > > > I've noticed that GZIP trails behind zlib in performance and part of this > > is down to the fact that zlib is using a more efficient CRC32 &

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
, 2024, 19:05 Pádraig Brady wrote: > On 14/10/2024 15:53, Sam Russell wrote: > > > For reference, coreutils' cksum uses slice by 8 unconditionally since: > > > https://github.com/coreutils/coreutils/commit/a7533917e < > https://github.com/coreutils/coreutils/commi

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
user0m0.478s sys 0m0.010s Looks to be about a ~35% reduction in time On Mon, 14 Oct 2024 at 17:44, Bruno Haible wrote: > Sam Russell wrote: > > As for your question on speed, I noticed between zstd (which uses zlib > as a > > backend) and gzip there seems to be an improvemen

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
Hi Bruno, Presumably you've read Pádraig's comment in the other thread that I mistakenly created, there are two interesting things from this: - coreutils is already GNU so no copyright review required, although the code appears to be inside the cksum utility so it's not in a position to be includ

Re: Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
4 at 15:50, Pádraig Brady wrote: > On 14/10/2024 11:16, Sam Russell wrote: > > Hi, > > > > First off, this is my first GNU contribution so I have *no idea* what > I'm doing, feedback is appreciated. > > > > I've noticed that GZIP trails behind zlib i

Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
Hi, First off, this is my first GNU contribution so I have *no idea* what I'm doing, feedback is appreciated. I've noticed that GZIP trails behind zlib in performance and part of this is down to the fact that zlib is using a more efficient CRC32 implementation. I've written an implementation of t

Adding slice-by-4 and slice-by-8 to CRC32

2024-10-14 Thread Sam Russell
Hi, First off, this is my first GNU contribution so I have *no idea* what I'm doing, feedback is appreciated. I've noticed that GZIP trails behind zlib in performance and part of this is down to the fact that zlib is using a more efficient CRC32 implementation. I've written an implementation of t