Mark Harris wrote:
It would be nice to have guidelines for encoding these gaps in the
Ogg Opus draft, including:
I took a stab at drafting some text that answers these questions (XML
diff attached):
4.1. Repairing Gaps in Real-time Streams
In order to support capturing a real-time stream that has lost
packets, or that uses discontinuous transmission (DTX), a muxer
SHOULD emit packets that explicitly request the use of Packet Loss
Concealment (PLC) in place of the packets that were not transmitted.
Only gaps that are a multiple of 2.5 ms are repairable, as these are
the only durations that can be created by packet loss or DTX. Muxers
need not handle other gap sizes. Creating the necessary packets
involves synthesizing a TOC byte (defined in Section 3.1 of
[RFC6716]), plus whatever additional internal framing is needed, to
indicate the packet duration for each stream. The actual length of
each missing Opus frame inside the packet is zero bytes, as defined
in Section 3.2.1 of [RFC6716].
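The packet construction described above can be sketched in Python. This is a minimal illustration, not a reference implementation. The function names are hypothetical, frame durations are counted in 48 kHz samples (2.5 ms = 120 samples), and only unpadded packets are built. Because every frame is zero bytes long, the packet consists only of the TOC byte and, for code 3, the frame count byte.

```python
# Minimal sketch: build an Opus packet in which every frame has length zero,
# which asks the decoder to run its PLC (RFC 6716, Sections 3.1 and 3.2).
# Function names are hypothetical; durations are in 48 kHz samples.


def frame_samples(config: int) -> int:
    """Frame duration for a TOC configuration number (RFC 6716, Table 2)."""
    if config < 12:                                 # SILK-only
        return (480, 960, 1920, 2880)[config % 4]   # 10, 20, 40, 60 ms
    if config < 16:                                 # Hybrid
        return (480, 960)[config % 2]               # 10, 20 ms
    return (120, 240, 480, 960)[config % 4]         # CELT-only: 2.5, 5, 10, 20 ms


def toc_byte(config: int, stereo: bool, code: int) -> int:
    """TOC byte: config (5 bits) | stereo flag (1 bit) | frame count code (2 bits)."""
    if not 0 <= config <= 31 or not 0 <= code <= 3:
        raise ValueError("config must be 0..31 and code must be 0..3")
    return (config << 3) | (int(stereo) << 2) | code


def empty_frames_packet(config: int, stereo: bool, nframes: int) -> bytes:
    """Packet carrying `nframes` zero-length frames of one configuration."""
    if not 1 <= nframes <= 48:
        raise ValueError("a packet holds 1..48 frames")
    if nframes * frame_samples(config) > 5760:
        raise ValueError("a packet may not exceed 120 ms")
    if nframes == 1:
        # Code 0: a single frame; with no payload the packet is just the TOC.
        return bytes([toc_byte(config, stereo, 0)])
    if nframes == 2:
        # Code 1: two frames of equal size (0 + 0 bytes), again just the TOC.
        return bytes([toc_byte(config, stereo, 1)])
    # Code 3, CBR, no padding: frame count byte is v=0, p=0, M=nframes.
    return bytes([toc_byte(config, stereo, 3), nframes])
```

For instance, three lost 20 ms SILK narrowband frames (config 1) become the two-byte packet `empty_frames_packet(1, False, 3)`.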
[RFC6716] does not impose any requirements on the PLC, but this
section outlines choices that are expected to have a positive
influence on most PLC implementations, including the reference
implementation. When possible, creating the TOC byte using the same
mode, audio bandwidth, channel count, and frame size as the previous
packet (if any) covers all losses that do not include a configuration
switch, as defined in Section 4.5 of [RFC6716]. This is the simplest
and usually the most well-tested case for the PLC to handle. If
there is no previous packet, reasonable decoders will not emit
anything other than silence regardless of the mode. Using the CELT-
only mode for this case (with any audio bandwidth) allows maximum
flexibility, since a single packet can represent any duration up to
120 ms that is a multiple of 2.5 ms using at most two bytes.
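For the case with no previous packet, the following hedged Python sketch shows why CELT-only mode suffices. With 2.5 ms frames, any gap that is a multiple of 2.5 ms up to 120 ms fits in one packet of at most two bytes. Defaulting to fullband (config 28) is an arbitrary assumption; configs 16, 20, and 24 (the other 2.5 ms CELT-only configurations) work the same way. The function name is hypothetical.

```python
# Sketch: fill a gap that has no preceding packet with a single CELT-only
# packet of 2.5 ms zero-length frames. Units are 48 kHz samples.
CELT_2_5MS_CONFIGS = (16, 20, 24, 28)  # CELT NB, WB, SWB, FB with 2.5 ms frames
SAMPLES_2_5MS = 120


def celt_gap_packet(gap: int, stereo: bool = False, config: int = 28) -> bytes:
    if config not in CELT_2_5MS_CONFIGS:
        raise ValueError("config must be a 2.5 ms CELT-only configuration")
    if gap <= 0 or gap % SAMPLES_2_5MS:
        raise ValueError("gap must be a positive multiple of 2.5 ms")
    nframes = gap // SAMPLES_2_5MS
    if nframes > 48:
        raise ValueError("one packet covers at most 120 ms")
    toc = (config << 3) | (int(stereo) << 2)
    if nframes == 1:
        return bytes([toc | 0])          # code 0: one empty frame
    if nframes == 2:
        return bytes([toc | 1])          # code 1: two empty frames
    return bytes([toc | 3, nframes])     # code 3 CBR: frame count byte M


# Every multiple of 2.5 ms from 2.5 ms to 120 ms fits in at most two bytes.
assert all(len(celt_gap_packet(k * SAMPLES_2_5MS)) <= 2 for k in range(1, 49))
```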
When a previous packet is available, keeping the audio bandwidth and
channel count the same allows the PLC to provide maximum continuity
in the concealment data it generates. However, if the size of the
gap is not a multiple of the most recent frame size, then the frame
size will have to change for at least some frames. Delaying such
changes as long as possible simplifies things for PLC
implementations. For example, a 95 ms gap could be encoded as
nineteen 5 ms frames in two bytes with a single CBR code 3 packet.
If the previous frame size was 20 ms, using four 20 ms frames (80 ms)
followed by three 5 ms frames instead requires 4 bytes in two
packets (plus an extra byte of Ogg lacing overhead), but allows the
PLC to use its well-tested steady state behavior for
as long as possible. The total bitrate of the latter approach,
including Ogg overhead, is about 0.4 kbps, so the impact on file size
is minimal.
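The frame-size strategy in the 95 ms example can be sketched as a small planner. This Python sketch is one possible approach under stated assumptions, not a method the text prescribes. It repeats the previous frame size as long as it fits, then covers the remainder with a single smaller frame size. It stays in the previous mode when that mode has a size dividing the remainder, and otherwise switches to CELT at the end of the gap. Names are hypothetical, durations are in 48 kHz samples, and modes are plain strings.

```python
# Sketch of a gap planner; frame sizes are in 48 kHz samples (2.5 ms = 120).
MODE_SIZES = {
    "silk": (2880, 1920, 960, 480),    # 60, 40, 20, 10 ms
    "hybrid": (960, 480),              # 20, 10 ms
    "celt": (960, 480, 240, 120),      # 20, 10, 5, 2.5 ms
}
MAX_PACKET = 5760   # 120 ms, the longest Opus packet
MAX_FRAMES = 48     # largest frame count code 3 can signal


def plan_gap(gap, prev_mode, prev_frame):
    """Cover `gap` samples with runs of (mode, frame_size, count)."""
    if gap <= 0 or gap % 120:
        raise ValueError("gap must be a positive multiple of 2.5 ms")
    runs = []
    head, rest = divmod(gap, prev_frame)
    if head:
        runs.append((prev_mode, prev_frame, head))   # keep the steady state
    if rest:
        # Prefer the previous mode; fall back to CELT, whose 2.5 ms frames
        # divide any remainder that is a multiple of 2.5 ms.
        for mode in (prev_mode, "celt"):
            size = next((s for s in MODE_SIZES[mode] if rest % s == 0), None)
            if size:
                runs.append((mode, size, rest // size))
                break
    return runs


def packetize(runs):
    """Split runs into packets of at most 120 ms and 48 frames each."""
    packets = []
    for mode, size, count in runs:
        per_packet = min(MAX_FRAMES, MAX_PACKET // size)
        while count:
            n = min(count, per_packet)
            packets.append((mode, size, n))
            count -= n
    return packets


def payload_bytes(frames):
    """Empty-frame packet size: TOC only (codes 0/1) or TOC + count (code 3)."""
    return 1 if frames <= 2 else 2


runs = plan_gap(95 * 48, "silk", 20 * 48)
packets = packetize(runs)
total = sum(payload_bytes(n) for _, _, n in packets)
print(runs)                  # [('silk', 960, 4), ('celt', 240, 3)]
print(total, len(packets))   # 4 2
```

On the bitrate arithmetic: counting only the one extra lacing byte gives 5 bytes per 95 ms (about 0.42 kbps), while counting one lacing byte per packet gives 6 bytes (about 0.51 kbps). Either way the overhead stays well under 1 kbps.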
Changing modes is discouraged, since this causes some decoder
implementations to reset their PLC state. However, SILK and Hybrid
modes cannot fill gaps that are not a multiple of 10 ms. If
switching to CELT mode is needed to match the gap size, doing so at
the end of the gap allows the PLC to function for as long as
possible. Since CELT does not support medium-band audio, using
wideband when switching from medium-band SILK ensures that any PLC
implementation that does try to migrate state between the modes will
not be forced to artificially reduce the bandwidth.
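The bandwidth rule above amounts to a small lookup. The sketch below, with hypothetical names, decodes the audio bandwidth from the previous packet's TOC configuration (RFC 6716, Table 2). It then picks the CELT-only configuration for the tail of the gap, widening medium-band to wideband because CELT has no medium-band configuration. The channel count is preserved by copying the stereo bit from the previous TOC byte.

```python
# Sketch: choose the CELT-only TOC byte for the end of a gap from the previous
# packet's TOC byte. Frame sizes are in 48 kHz samples.
CELT_BASE = {"nb": 16, "wb": 20, "swb": 24, "fb": 28}  # first config per bandwidth
CELT_FRAME_INDEX = {120: 0, 240: 1, 480: 2, 960: 3}   # 2.5, 5, 10, 20 ms


def config_bandwidth(config: int) -> str:
    """Audio bandwidth of a TOC configuration (RFC 6716, Table 2)."""
    if config < 12:                                       # SILK-only: NB, MB, WB
        return ("nb", "mb", "wb")[config // 4]
    if config < 16:                                       # Hybrid: SWB, FB
        return ("swb", "fb")[(config - 12) // 2]
    return ("nb", "wb", "swb", "fb")[(config - 16) // 4]  # CELT-only


def celt_tail_toc(prev_toc: int, tail_frame: int, code: int) -> int:
    """TOC byte for CELT-only tail frames, keeping bandwidth and channel count."""
    bandwidth = config_bandwidth(prev_toc >> 3)
    if bandwidth == "mb":
        bandwidth = "wb"   # CELT has no medium-band: widen rather than narrow
    config = CELT_BASE[bandwidth] + CELT_FRAME_INDEX[tail_frame]
    stereo_bit = prev_toc & 0x04
    return (config << 3) | stereo_bit | code
```

For example, after a mono medium-band SILK packet with 20 ms frames (config 5), `celt_tail_toc(5 << 3, 240, 3)` selects config 21, CELT-only wideband with 5 ms frames.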
The synthetic TOC byte MAY use any of codes 0, 1, 2, or 3 to pack the
frame(s) into a packet. If the TOC configuration matches, the muxer
MAY combine the empty frames with previous or subsequent non-zero-
length frames (using code 2 or VBR code 3).
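Combining empty frames with real ones relies on the frame length coding of Section 3.2.1 of [RFC6716]. The hedged Python sketch below builds code 2 and unpadded VBR code 3 packets from a list of frame payloads, where an empty `bytes` object marks a frame to be concealed. The function names are hypothetical. The caller is assumed to have already checked that all frames share one TOC configuration and that the packet stays within 120 ms.

```python
# Sketch: pack zero-length (concealed) frames together with real frames.
# Frames are bytes objects; b"" means "let the decoder run its PLC".


def encode_length(n: int) -> bytes:
    """Frame length coding from RFC 6716, Section 3.2.1."""
    if not 0 <= n <= 1275:
        raise ValueError("frame length must be 0..1275 bytes")
    if n < 252:
        return bytes([n])                    # one byte: 0..251
    first = 252 + (n & 3)                    # two bytes: n = first + 4 * second
    return bytes([first, (n - first) >> 2])


def code2_packet(config: int, stereo: bool, frame1: bytes, frame2: bytes) -> bytes:
    """Code 2: two frames of different sizes; only the first length is coded."""
    if len(frame2) > 1275:
        raise ValueError("frame length must be 0..1275 bytes")
    toc = (config << 3) | (int(stereo) << 2) | 2
    return bytes([toc]) + encode_length(len(frame1)) + frame1 + frame2


def vbr_code3_packet(config: int, stereo: bool, frames: list) -> bytes:
    """Unpadded VBR code 3: lengths of all but the last frame are coded."""
    if not 1 <= len(frames) <= 48:
        raise ValueError("code 3 carries 1..48 frames")
    if len(frames[-1]) > 1275:
        raise ValueError("frame length must be 0..1275 bytes")
    toc = (config << 3) | (int(stereo) << 2) | 3
    out = bytearray([toc, 0x80 | len(frames)])   # v=1 (VBR), p=0, M frames
    for frame in frames[:-1]:
        out += encode_length(len(frame))         # last length is implicit
    for frame in frames:
        out += frame
    return bytes(out)


# Two concealed 5 ms CELT WB frames (config 21), then a 3-byte frame.
# The payload bytes are placeholders, not a real Opus frame.
pkt = vbr_code3_packet(21, False, [b"", b"", b"\x01\x02\x03"])
```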
It's a little bit long, but if people think it's useful (or have
suggestions for shortening it), we should include it.
diff --git a/doc/draft-ietf-codec-oggopus.xml b/doc/draft-ietf-codec-oggopus.xml
index 6131e69..d7cca9f 100644
--- a/doc/draft-ietf-codec-oggopus.xml
+++ b/doc/draft-ietf-codec-oggopus.xml
@@ -240,23 +240,91 @@ The one exception is the last page in the stream, as described below.
<t>
All other pages with completed packets after the first MUST have a granule
position equal to the number of samples contained in packets that complete on
that page plus the granule position of the most recent page with completed
packets.
This guarantees that a demuxer can assign individual packets the same granule
position when working forwards as when working backwards.
For this to work, there cannot be any gaps.
-In order to support capturing a stream that uses discontinuous transmission
- (DTX), an encoder SHOULD emit packets that explicitly request the use of
- Packet Loss Concealment (PLC) (i.e., with a frame length of 0, as defined in
- Section 3.2.1 of <xref target="RFC6716"/>) in place of the packets that were
- not transmitted.
</t>
+<section anchor="gap-repair" title="Repairing Gaps in Real-time Streams">
+<t>
+In order to support capturing a real-time stream that has lost packets, or that
+ uses discontinuous transmission (DTX), a muxer SHOULD emit packets that
+ explicitly request the use of Packet Loss Concealment (PLC) in place of the
+ packets that were not transmitted.
+Only gaps that are a multiple of 2.5 ms are repairable, as these are the
+ only durations that can be created by packet loss or DTX.
+Muxers need not handle other gap sizes.
+Creating the necessary packets involves synthesizing a TOC byte (defined in
+ Section 3.1 of <xref target="RFC6716"/>)---and whatever additional
+ internal framing is needed---to indicate the packet duration for each stream.
+The actual length of each missing Opus frame inside the packet is zero bytes,
+ as defined in Section 3.2.1 of <xref target="RFC6716"/>.
+</t>
+
+<t>
+<xref target="RFC6716"/> does not impose any requirements on the PLC, but this
+ section outlines choices that are expected to have a positive influence on
+ most PLC implementations, including the reference implementation.
+When possible, creating the TOC byte using the same mode, audio bandwidth,
+ channel count, and frame size as the previous packet (if any) covers all
+ losses that do not include a configuration switch, as defined in
+ Section 4.5 of <xref target="RFC6716"/>.
+This is the simplest and usually the most well-tested case for the PLC to
+ handle.
+If there is no previous packet, reasonable decoders will not emit anything
+ other than silence regardless of the mode.
+Using the CELT-only mode for this case (with any audio bandwidth) allows
+ maximum flexibility, since a single packet can represent any duration up to
+ 120 ms that is a multiple of 2.5 ms using at most two bytes.
+</t>
+
+<t>
+When a previous packet is available, keeping the audio bandwidth and channel
+ count the same allows the PLC to provide maximum continuity in the concealment
+ data it generates.
+However, if the size of the gap is not a multiple of the most recent frame
+ size, then the frame size will have to change for at least some frames.
+Delaying such changes as long as possible simplifies things for PLC
+ implementations.
+A 95 ms gap could be encoded as nineteen 5 ms frames in two bytes
+ with a single CBR code 3 packet.
+If the previous frame size was 20 ms, using four 20 ms frames (80 ms)
+ followed by three 5 ms frames requires 4 bytes (plus an extra byte
+ of Ogg lacing overhead), but allows the PLC to use its well-tested steady
+ state behavior for as long as possible.
+The total bitrate of the latter approach, including Ogg overhead, is about
+ 0.4 kbps, so the impact on file size is minimal.
+</t>
+
+<t>
+Changing modes is discouraged, since this causes some decoder implementations
+ to reset their PLC state.
+However, SILK and Hybrid modes cannot fill gaps that are not a multiple of
+ 10 ms.
+If switching to CELT mode is needed to match the gap size, doing so at the end
+ of the gap allows the PLC to function for as long as possible.
+Since CELT does not support medium-band audio, using wideband when switching
+ from medium-band SILK ensures that any PLC implementation that does try to
+ migrate state between the modes will not be forced to artificially reduce the
+ bandwidth.
+</t>
+
+<t>
+The synthetic TOC byte MAY use any of codes 0, 1, 2, or 3 to pack the
+ frame(s) into a packet.
+If the TOC configuration matches, the muxer MAY combine the empty frames with
+ previous or subsequent non-zero-length frames (using code 2 or
+ VBR code 3).
+</t>
+</section>
+
<section anchor="preskip" title="Pre-skip">
<t>
There is some amount of latency introduced during the decoding process, to
allow for overlap in the MDCT modes, stereo mixing in the LP modes, and
resampling, and the encoder will introduce even more latency (though the exact
amount is not specified).
Therefore, the first few samples produced by the decoder do not correspond to
real input audio, but are instead composed of padding inserted by the encoder
_______________________________________________
codec mailing list
https://www.ietf.org/mailman/listinfo/codec