> On Aug 8, 2020, at 12:03 PM, George Spelvin <l...@sdf.org> wrote:
> 
> On Sat, Aug 08, 2020 at 10:07:51AM -0700, Andy Lutomirski wrote:
>>>   - Cryptographically strong ChaCha, batched
>>>   - Cryptographically strong ChaCha, with anti-backtracking.
>> 
>> I think we should just anti-backtrack everything.  With the "fast key 
>> erasure" construction, already implemented in my patchset for the 
>> buffered bytes, this is extremely fast.
> 
> The problem is that this is really *amortized* key erasure, and
> requires large buffers to amortize the cost down to a reasonable
> level.
> 
> E.g., if using 256-bit (32-byte) keys, 5% overhead would require
> generating 640 bytes at a time.
> 
> Are we okay with ~1K per core for this, which we might have to
> throw away occasionally to incorporate fresh seed material?
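
(That figure checks out: with fast key erasure the overhead is key bytes
divided by batch bytes, so 32/640 = 5%, and each 640-byte batch yields
640 - 32 = 608 bytes of usable output.)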

I don’t care about throwing this stuff away. My plan (not quite implemented 
yet) is to have a percpu RNG stream and never to do anything resembling 
mixing anything in. The stream is periodically discarded and reinitialized 
from the global “primary” pool instead.  The primary pool has a global lock. 
We do some vaguely clever trickery to arrange for all the percpu pools to 
reseed from the primary pool at different times.

Meanwhile, the primary pool gets reseeded from the input pool on a schedule, 
for catastrophic reseeding.
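
For concreteness, here is a minimal sketch of the shape I have in mind 
(every name below is made up for illustration; this is not from an actual 
patch):

#include <linux/types.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>
#include <linux/jiffies.h>
#include <linux/smp.h>

#define RESEED_INTERVAL	(300 * HZ)	/* made-up interval */
#define RESEED_SKEW	(HZ / 4)	/* made-up per-CPU stagger */

struct percpu_crng {
	u8		key[32];	/* this CPU's ChaCha20 key */
	unsigned long	reseed_at;	/* jiffies of the next reseed */
};

static DEFINE_PER_CPU(struct percpu_crng, percpu_crng);
static DEFINE_SPINLOCK(primary_lock);
static u8 primary_key[32];	/* catastrophically reseeded from the
				 * input pool on a schedule, elsewhere */

/* Called with preemption disabled, before generating output. */
static void crng_maybe_reseed(struct percpu_crng *crng)
{
	if (time_before(jiffies, crng->reseed_at))
		return;

	spin_lock(&primary_lock);
	/* Throw the old percpu state away entirely; nothing is mixed in. */
	derive_subkey(crng->key, primary_key);	/* hypothetical helper */
	spin_unlock(&primary_lock);

	/* Stagger each CPU's next reseed so the cores don't all take
	 * the primary lock at the same time. */
	crng->reseed_at = jiffies + RESEED_INTERVAL
			  + smp_processor_id() * RESEED_SKEW;
}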

5% overhead to make a fresh ChaCha20 key each time sounds totally fine to me. 
The real issue is that the bigger we make this thing, the bigger the latency 
spike each time we run it.
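
The per-batch step itself is simple. A sketch of the fast key erasure 
construction (chacha20_expand() is a stand-in name for whatever block 
function we end up using):

#include <linux/types.h>
#include <linux/string.h>	/* memcpy(), memzero_explicit() */

#define FKE_BATCH	640	/* keeping 32 of 640 bytes = 5% overhead */

static void fke_refill(u8 key[32], u8 out[FKE_BATCH - 32])
{
	u8 batch[FKE_BATCH];

	/* Expand the current key into a full batch of keystream. */
	chacha20_expand(batch, sizeof(batch), key);	/* hypothetical */

	/* The first 32 bytes become the next key and are never handed
	 * out, so compromising future state can't recover past output. */
	memcpy(key, batch, 32);
	memcpy(out, batch + 32, sizeof(batch) - 32);
	memzero_explicit(batch, sizeof(batch));
}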

Do we really need 256 bits of key erasure?  I suppose if we only replace half 
the key each time, we’re just asking for some cryptographer to run the numbers 
on a break-one-of-many attack and come up with something vaguely alarming.

I wonder if we could get good performance by spreading out the work. We 
could, for example, have a 320-byte output buffer that get_random_bytes() 
uses and a 320+32-byte “next” buffer that is generated incrementally as the 
output buffer is used. When we finish the output buffer, the first 320 bytes 
of the next buffer become the current buffer and the extra 32 bytes become 
the new key (or nonce). This will have lower worst-case latency, but it will 
hit more cache lines, potentially hurting throughput.
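
Sketched out (again with made-up names; chacha20_generate() here is assumed 
to produce len bytes of keystream at a given stream offset, and the bounds 
handling is simplified):

#include <linux/types.h>
#include <linux/string.h>
#include <linux/kernel.h>	/* min() */

#define CUR_BYTES	320
#define NEXT_BYTES	(CUR_BYTES + 32)

struct rng_buf {
	u8		cur[CUR_BYTES];		/* handed out to callers */
	u8		next[NEXT_BYTES];	/* generated incrementally */
	u8		key[32];
	unsigned int	pos;		/* bytes of cur consumed */
	unsigned int	filled;		/* bytes of next generated */
};

static void rng_buf_extract(struct rng_buf *b, u8 *out, unsigned int n)
{
	n = min(n, CUR_BYTES - b->pos);
	memcpy(out, b->cur + b->pos, n);
	memzero_explicit(b->cur + b->pos, n);	/* anti-backtracking */
	b->pos += n;

	/* Do one slice of work toward "next" per call, so the cost is
	 * spread out rather than paid as a single latency spike. */
	if (b->filled < NEXT_BYTES) {
		unsigned int chunk = min(64u, NEXT_BYTES - b->filled);

		chacha20_generate(b->key, b->filled,	/* hypothetical */
				  b->next + b->filled, chunk);
		b->filled += chunk;
	}

	if (b->pos == CUR_BYTES) {
		/* If consumption was bursty, finish generating next. */
		if (b->filled < NEXT_BYTES)
			chacha20_generate(b->key, b->filled,
					  b->next + b->filled,
					  NEXT_BYTES - b->filled);
		/* Swap: next becomes cur, and its trailing 32 bytes
		 * become the new key (fast key erasure, spread out). */
		memcpy(b->cur, b->next, CUR_BYTES);
		memcpy(b->key, b->next + CUR_BYTES, 32);
		memzero_explicit(b->next, NEXT_BYTES);
		b->pos = 0;
		b->filled = 0;
	}
}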
