Thursday, May 9, 2024

CRC64 in Valkey got faster

As teased in my last blog post, CRC64 in Redis was supposed to get faster, but it didn't. About a year after I posted my PR, Redis the company (formerly Garantia Data, then Redis Labs, whom I've worked a booth with at AWS, written articles for, gave several talks for, etc.) changed the license of Redis the software. I don't like the new license. Subsequently, Madelyn Olson asked me to port my changes over to Valkey, and I want to make the world better, so I did.

When I originally ported to Valkey, it was using the generally unused `SERVER_TEST` define, which was not exposed in a convenient manner. Since then, Madelyn has migrated those tests over to the new C-level unittest framework, so instructions have changed since my changes, and like most software engineers, I like to be able to test my things, and tell you how to test them too.

Testing updated crc64 performance in Valkey

The first thing you'll want is to get the most recent unstable branch of Valkey (if you want to get all of Valkey / Redis history, omit the '--depth 1' argument), which will require (at minimum) git, make, and a recent gcc or clang:

 

% git clone --depth 1 https://github.com/valkey-io/valkey.git
%
cd valkey/src
%
make valkey-unit-tests
%
./valkey-unit-tests --single test_crc64combine.c --crc 10000000
%
./valkey-unit-tests --single test_crc64combine.c --crc 10000000 --combine

The first unit test should output something like...

[START] - test_crc64combine.c
test size=10000000 algorithm=crc_1byte 375 M/sec matches=1
test size=10000000 algorithm=crcspeed 1345 M/sec matches=1
test size=10000000 algorithm=crcdual 2246 M/sec matches=1
test size=10000000 algorithm=crctri 2428 M/sec matches=1
test size=6093750 algorithm=crc_1byte 402 M/sec matches=1
test size=6093750 algorithm=crcspeed 1343 M/sec matches=1
test size=6093750 algorithm=crcdual 2254 M/sec matches=1
test size=6093750 algorithm=crctri 2423 M/sec matches=1
...

Examining the output, wherever you see crcdual and crctri start being slower than crcspeed, is where you should set your dual / tri cutoffs. At present, these cutoffs are set at default levels that seem to work well with Intel processors in the Sandy Bridge era or later (OEM shipments starting in 2013). I don't know about performance on AMD cpus, or even ARM cpus on the new macs, but if someone wants to test and let me know, I'd be appreciative.

Similarly, we can see how our crc extension performs with the additional --combine argument, producing something like:

[START] - test_crc64combine.c
init_64 size=18446744073709551615 in 689000 nsec
hash_as_size_combine size=10000000 in 1036 nsec
largest_combine size=18446744073709551615 in 10187 nsec
combine size=10000000 in 1049 nsec
combine size=6093750 in 2333 nsec
combine size=3713381 in 1576 nsec
combine size=2262843 in 1560 nsec
combine size=1378922 in 1217 nsec
combine size=840282 in 1431 nsec
combine size=512048 in 1030 nsec
...

We have generally set our cutoffs higher for non-x86 (arm-64 on MacOS / Raspberry Pi) because we don't have an explicitly vectorized combine function for them. Mostly the combine function is a matter of curiosity, but for longer-term scaling, it is important, as it sets the minimum size where combining makes sense vs. doing the regular calculation.

How did we get here, and can we go faster?

I originally added combine functionality in 2019 as I was building my own threaded Redis fork called StructD (now defunct). Because I had threads, spawning multiple threads in addition to optimizing individual thread performance would allow me to process entire files in parallel, chunked out to workers, merging and verifying the results at the end.

While it might not seem necessary to perform CRC64 verification with such speed, we were both creating and loading 10-100 gigabyte snapshots. As I was building StructD, the crcspeed changes from Matt Standcliff had not yet been merged in, and we were seeing 4+ minutes to verify our largest snapshots.

With 1.4 gigs/second after merging crcspeed, that pushed timing down to ~75 seconds just to verify. But with my updated 2.4 gig/second crctri, plus being able to go multi-threaded, my record for verifying a 40 gig snapshot on my local machine (file stored in shared memory to eliminate disk IO) was about 2.5 seconds, compared with about 30 seconds for April 25, 2020 crcspeed Redis, and 4+ minutes with one byte at a time classic. Overall, seeing a 100x performance improvement just in verifying snapshots.

Testing updated performance in smhasher

Clocking in at about 15 gigs/second CRC64 computation using 8 threads is not quite a linear speedup, but it beats every other real CRC of any bit width available in the smhasher suite of hashes, hardware or software. But to compare properly, our single-threaded CRC64 variant is about half the speed of hardware CRC64 on my machine, using only software. Available as crc64_jones, crc64_jones1, crc64_jones2, and crc64_jones3 for the Valkey automatic 1-3 pipeline, 1, 2, and 3-pipeline versions, respectively (you will need git, make, cmake, gcc, and g++ to build smhasher via 'sh build.sh' then './SMHasher --test=Speed crc64_jones').

On the whole, my changes demonstrate a path for single-threaded CRC seeing near-hardware performance for any polynomial on modern CPUs with enough spare pipelines, L1/L2 cache, and memory bandwidth. When combined with threads directly, we become limited by data source IO speeds and memory bandwidth - not CRC algorithm performance.

For more free software, please feel free to sponsor me over on Github.