draft-ietf-codec-oggopus-07 adds some concrete limits to the packet sizes that must be accepted, which I agree is a good idea, however there are some issues with the specific text.
Section 6 states: "Decoders SHOULD reject packets larger than 60 kB per channel, and display a warning message, and MAY reject packets larger than 7.5 kB per channel." It is unclear whether "channel" here refers to output channels, decoded channels, or physical channels (although output channels makes the most sense). Also, while the next paragraph appears to be a justification for these numbers, the numbers in that paragraph are per stream and not per channel. The number of output channels may be higher or lower than the number of streams so the connection is unclear. Additionally it is unclear whether this section applies only to audio data packets or also to comment packets, which do not have channels or streams but could be larger than audio data packets. It is also unclear how this interacts with the Section 5.1.1.5 requirement that an Ogg Opus player play *any* Ogg Opus stream with a channel mapping family of 0 or 1 (apparently including those with large packets). In order to comply with that, does that mean that large packets may only be rejected when the channel mapping family is not 0 or 1, or by applications that are not an "Ogg Opus player"? It appears that the intent is to require Ogg Opus players to play any valid Ogg Opus stream with channel mapping family 0 or 1 (maximum 8 channels) within limits that should be reasonable for any such stream, allowing for resource-constrained implementations with fixed buffers and a reasonably reliable test for ensuring that a particular stream is likely to play even on resource constrained players as intended, while still allowing streams that are limited only by what should be considered malicious which may not be playable on all players. To address these issues, I propose that the packet size required to be processed by any implementation be stated as a simple 60 kB size and be connected to the Section 5.1.1.5 requirement. This is the size suggested at the end of Section 6 and simplifies the requirement without increasing the packet size that could be interpreted to already be required to be played by a minimal player. I further suggest that the larger size at which audio data packets SHOULD be rejected be set at 60 kB per stream rather than per "channel", aligning with the largest possible valid packet without padding as described at the beginning of the second paragraph of Section 6. Because packets exceeding 60 kB per stream necessarily use padding in excess of what would be required to make a VBR stream CBR, this also aligns with Section 6 encoder recommendations and avoids the situation where decoders SHOULD reject certain packets that an encoder not violating any SHOULD or MUST might produce. The new text about decoders displaying a warning message is also worrying. Typically a decoder does not have access to the application's preferred display or localization facilities so this seems inappropriate. Even considering the higher layers it seems unlikely that typical usage, such as background audio decoding in a portable music player, game engine, or the like would have a facility for displaying this kind of warning message during decoding. Also if a warning should be displayed when packets larger than 60 kB per channel are rejected, why not when smaller packets are rejected? For these reasons, and because applications are already free to indicate any issues that are important to their users in the manner most appropriate for their application, I suggest omitting this text. On another note, it appears that in the packet sizes in Section 6, kB refers to 1024 bytes and MB refers to 1024*1024 bytes, whereas everywhere else in the document (including kbps and Mbps in the same paragraph) the standard SI decimal multiples are intended. Because the usage is limited I suggest just sticking with bytes/octets to avoid confusion, rather than adding a sentence to clarify the mixed usage. If a 1024-byte unit is desired I believe that the traditional designation is KB (or the newer KiB). Below is a precise proposal for comment: In Section 5.1.1.5 replace: An Ogg Opus player MUST play any Ogg Opus stream with a channel mapping family of 0 or 1, even if the number of channels does not match the physically connected audio hardware. with: An Ogg Opus player MUST play any valid Ogg Opus stream with a channel mapping family of 0 or 1 that contains no packet larger than 61,440 octets, even if the number of channels does not match the physically connected audio hardware. In Section 6 replace: Technically, valid Opus packets can be arbitrarily large due to the padding format, although the amount of non-padding data they can contain is bounded. These packets might be spread over a similarly enormous number of Ogg pages. Encoders SHOULD use no more padding than is necessary to make a variable bitrate (VBR) stream constant bitrate (CBR). Decoders SHOULD avoid attempting to allocate excessive amounts of memory when presented with a very large packet. Decoders SHOULD reject packets larger than 60 kB per channel, and display a warning message, and MAY reject packets larger than 7.5 kB per channel. The presence of an extremely large packet in the stream could indicate a memory exhaustion attack or stream corruption. with: Technically, valid Opus packets can be arbitrarily large due to padding, although the amount of non-padding data that an audio data packet can contain is bounded. These packets might be spread over a similarly enormous number of Ogg pages. In an audio data packet, encoders SHOULD use no more padding than is necessary to make a variable bitrate (VBR) stream constant bitrate (CBR). Decoders SHOULD reject audio data packets larger than 61,440 octets per stream; such packets necessarily contain more padding than needed for this purpose. Decoders SHOULD avoid attempting to allocate excessive amounts of memory when presented with a very large packet, and MAY reject or partially process any packet larger than 61,440 octets. The presence of an extremely large packet in the stream could indicate a memory exhaustion attack or stream corruption. Replace: In an Ogg Opus stream, the largest possible valid packet that does not use padding has a size of (61,298*N - 2) octets, or about 60 kB per Opus stream. With 255 streams, this is 15,630,988 octets (14.9 MB) and can span up to 61,298 Ogg pages, all but one of which will have a granule position of -1. This is of course a very extreme packet, consisting of 255 streams, each containing 120 ms of audio encoded as 2.5 ms frames, each frame using the maximum possible number of octets (1275) and stored in the least efficient manner allowed (a VBR code 3 Opus packet). Even in such a packet, most of the data will be zeros as 2.5 ms frames cannot actually use all 1275 octets. The largest packet consisting of entirely useful data is (15,326*N - 2) octets, or about 15 kB per stream. This corresponds to 120 ms of audio encoded as 10 ms frames in either SILK or Hybrid mode, but at a data rate of over 1 Mbps, which makes little sense for the quality achieved. A more reasonable limit is (7,664*N - 2) octets, or about 7.5 kB per stream. This corresponds to 120 ms of audio encoded as 20 ms stereo CELT mode frames, with a total bitrate just under 511 kbps (not counting the Ogg encapsulation overhead). With N=8, the maximum number of channels currently defined by mapping family 1, this gives a maximum packet size of 61,310 octets, or just under 60 kB. This is still quite conservative, as it assumes each output channel is taken from one decoded channel of a stereo packet. An implementation could reasonably choose any of these numbers for its internal limits. with: In an Ogg Opus stream, the largest possible valid audio data packet that does not use padding has a size of (61,298*N - 2) octets, although a valid comment header packet may be larger. With 255 Opus streams, this is 15,630,988 octets and can span up to 61,298 Ogg pages, all but one of which will have a granule position of -1. This is of course a very extreme packet, consisting of 255 Opus streams, each containing 120 ms of audio encoded as 2.5 ms frames, each frame using the maximum possible number of octets (1275) and stored in the least efficient manner allowed (a VBR code 3 Opus packet). Even in such a packet, most of the data will be zeros as 2.5 ms frames cannot actually use all 1275 octets. The largest audio data packet consisting of entirely useful data is (15,326*N - 2) octets. This corresponds to 120 ms of audio encoded as 10 ms frames in either SILK or Hybrid mode, but at a data rate of over 1 Mbps, which makes little sense for the quality achieved. A more reasonable limit is (7,664*N - 2) octets. This corresponds to 120 ms of audio encoded as 20 ms stereo CELT mode frames, with a total bitrate just under 511 kbps (not counting the Ogg encapsulation overhead). For mapping family 1, N=8 provides a reasonable upper bound, as it allows for each of the 8 possible output channels to be encoded in a separate stereo Opus stream. This gives a packet size of 61,310 octets, which is rounded up to a multiple of 1024 octets to give the packet size of 61,440 octets that is expected to be successfully processed by any implementation. I also found that the fragment identifiers are no longer valid in the informative references to the Vorbis specification. This fixes those and makes the host name consistent. In Section 14.2 replace: https://www.xiph.org/vorbis/doc/Vorbis_I_spec.html#x1-800004.3.9 with: https://www.xiph.org/vorbis/doc/Vorbis_I_spec.html#x1-810004.3.9 Replace: https://xiph.org/vorbis/doc/Vorbis_I_spec.html#x1-130000A.2 with: https://www.xiph.org/vorbis/doc/Vorbis_I_spec.html#x1-132000A.2 - Mark _______________________________________________ codec mailing list [email protected] https://www.ietf.org/mailman/listinfo/codec
