AVI timing and audio sync

Last time I promised I would write up some information about how VBR audio is popularly implemented in AVI files; I'm going to generalize this slightly and talk about the timing of AVI streams. I'm not going to weigh in on the propriety of VBR audio in AVI, because almost everyone knows how I feel about it, and in any case that doesn't change the fact that VBR files are out in the wild and will be encountered by any application that accepts the AVI format. Instead, here are the technical details, so you will at least know how it works and what issues arise as a result.

I should note that I didn't devise the VBR scheme; I simply reverse engineered it from Nandub's output after I started receiving reports that newer versions of VirtualDub were suddenly mishandling audio sync on some files. The technique I describe below differs slightly from Nandub's output, as I omit some settings that, as far as I can tell, are not necessary to get VBR-in-AVI working.

As usual, any and all corrections are welcome.

AVI streams, both audio and video, are composed of a series of samples which are evenly spaced in time. For a video stream, a stream sample is a video frame, and the stream rate is the frame rate; for an audio stream, a stream sample is an audio block, which for PCM is equivalent to an audio sample. These stream samples are in turn stored in chunks, where there is generally one sample per chunk for a video stream, and multiple samples per chunk for an audio stream. These chunks are then pointed to by the index, which lists all chunks in the file in their stream order.
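
For reference, here's the layout of a classic AVI 1.0 index entry as it appears in the 'idx1' chunk -- essentially the AVIINDEXENTRY layout from the Microsoft headers, with my own comments:

    #include <cstdint>

    // Classic AVI 1.0 index entry, one per chunk, stored in the 'idx1' chunk
    // in file order. (Same layout as AVIINDEXENTRY in the Microsoft headers;
    // DWORD = uint32_t, little-endian on disk.)
    struct AviIndexEntry {
        uint32_t ckid;          // chunk FOURCC: '00dc' = stream 0 video,
                                // '01wb' = stream 1 audio, etc.
        uint32_t dwFlags;       // e.g. AVIIF_KEYFRAME (0x10)
        uint32_t dwChunkOffset; // usually relative to the start of the 'movi' list
        uint32_t dwChunkLength; // chunk size in bytes, minus the 8-byte header
    };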

Timing of an AVI stream is governed by several variables, all in the stream header:

- dwScale and dwRate: the sampling rate of the stream is dwRate/dwScale samples per second. For a video stream this is the frame rate; for an audio stream, the block rate.
- dwStart: the starting offset of the stream, in samples. This is usually zero; a nonzero value delays the stream by that many sample periods.
- dwSampleSize: the fixed size of each sample in bytes, or zero if samples vary in size, in which case each chunk holds exactly one sample. When dwSampleSize is nonzero, the number of samples in a chunk is the chunk's size in bytes divided by dwSampleSize. Video streams typically use zero; audio streams typically set dwSampleSize equal to nBlockAlign.
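
Abridged from the on-disk 'strh' layout in Microsoft's Aviriff.h (RIFF chunk preamble omitted, comments mine):

    #include <cstdint>

    // On-disk 'strh' stream header layout, per Microsoft's Aviriff.h
    // (DWORD = uint32_t, WORD = uint16_t, little-endian on disk).
    struct AviStreamHeader {
        uint32_t fccType;               // 'vids' for video, 'auds' for audio
        uint32_t fccHandler;
        uint32_t dwFlags;
        uint16_t wPriority;
        uint16_t wLanguage;
        uint32_t dwInitialFrames;
        uint32_t dwScale;               // rate = dwRate/dwScale samples per second
        uint32_t dwRate;
        uint32_t dwStart;               // starting offset, in samples
        uint32_t dwLength;              // stream length, in samples
        uint32_t dwSuggestedBufferSize;
        uint32_t dwQuality;
        uint32_t dwSampleSize;          // bytes per sample, or 0 = one sample/chunk
        struct { int16_t left, top, right, bottom; } rcFrame;
    };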

There is one last tidbit missing: where exactly each sample starts and ends. The standard set by DirectShow is that the start time for the initial sample is zero, so assuming dwStart=0, the first sample in a 25/sec stream would occupy [0ms, 40ms), the second [40ms, 80ms), etc. This can be interpreted as nearest neighbor sampling, which means that an interpolator would consider the samples to be in the center of each interval at 20ms, 60ms, and so on.

Note that, based on the above, the timing of a sample is determined solely by its position in the stream -- that is, sample N always has a start time of (dwStart + N)*(dwScale/dwRate) seconds, regardless of its position in the file. In particular, neither the grouping of samples into chunks nor the position of one stream's chunks relative to another's matters. This means that the interleaving of a file doesn't affect synchronization between two streams. Interleaving does affect performance, however: if a player has strict playback constraints, as hardware devices often do, poor interleaving may leave it unable to maintain correct sync or even uninterrupted playback. A non-realtime conversion from a hard disk (or other random-access medium) on a PC should not have such constraints, though.
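
In code, with 64-bit math to avoid overflow on long streams (this is just my illustration of the formula, not any particular parser's code):

    #include <cstdint>

    // Start time of stream sample N, in milliseconds:
    //   (dwStart + N) * (dwScale / dwRate) seconds
    uint64_t SampleStartMs(uint32_t dwScale, uint32_t dwRate,
                           uint32_t dwStart, uint32_t n) {
        return ((uint64_t)dwStart + n) * dwScale * 1000 / dwRate;
    }

    // For a 25/sec video stream (dwRate=25, dwScale=1, dwStart=0):
    //   SampleStartMs(1, 25, 0, 0) == 0    -> sample 0 occupies [0ms, 40ms)
    //   SampleStartMs(1, 25, 0, 1) == 40   -> sample 1 occupies [40ms, 80ms)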

Now, about VBR....

You might think that setting dwSampleSize=0 for an audio stream would allow it to be encoded as variable bitrate (VBR) like a video stream, where each sample has a different size. Unfortunately, this is not the case -- Microsoft AVI parsers simply ignore dwSampleSize for audio streams and use nBlockAlign from the WAVEFORMATEX audio structure instead, which cannot be zero. Nuts. So how is it done, then?

The key is in the translation from chunks to samples.

Earlier, I said that the number of samples in a chunk is determined from the size of the chunk in bytes, since samples are a fixed size. But what happens if the chunk size is not evenly divisible by the sample size? Well, DirectShow, the engine behind Windows Media Player and a number of third-party video players on Windows, rounds up. This means that if you set nBlockAlign higher than the size of any chunk in the stream, DirectShow will regard every chunk as holding exactly one sample, even though they are all different sizes. Thus, to encode VBR MP3, you simply set nBlockAlign to at least the largest possible frame size -- 960 bytes covers MPEG layer III at 48 kHz, though frames at 32 kHz can reach 1441 bytes -- and then store each MPEG audio frame in its own chunk. Since each audio frame encodes a constant amount of audio data -- 1152 samples at 32 kHz and above, 576 samples at 24 kHz and below -- this permits proper timing and seeking despite the variable bitrate. The same can be done for other compressed audio formats, provided that the encoding application can determine the compressed block boundaries and the maximum block size, and that the decoders accept non-standard values for the nBlockAlign header field.
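
Putting the two halves together -- the round-up on the parser side, and the header setup on the writer side -- here's a sketch. This is my own illustration of the scheme, not Nandub's actual code; kMaxFrameSize is whatever maximum the encoder determined for its stream:

    #include <cstdint>

    // Parser side: DirectShow-style chunk-to-sample conversion for audio.
    // dwSampleSize is ignored; nBlockAlign is used, rounding up. With
    // nBlockAlign >= the largest chunk, every chunk counts as one sample.
    uint32_t SamplesInChunk(uint32_t chunkBytes, uint16_t nBlockAlign) {
        return (chunkBytes + nBlockAlign - 1) / nBlockAlign;
    }

    // Writer side: audio format for one-MP3-frame-per-chunk VBR. Plain
    // WAVEFORMATEX layout shown (Nandub writes the longer MPEGLAYER3WAVEFORMAT,
    // but the extra fields don't appear necessary to get VBR-in-AVI working).
    struct WaveFormatEx {
        uint16_t wFormatTag;        // 0x0055 = WAVE_FORMAT_MPEGLAYER3
        uint16_t nChannels;
        uint32_t nSamplesPerSec;
        uint32_t nAvgBytesPerSec;   // informational average byte rate
        uint16_t nBlockAlign;       // set >= any chunk => one sample per chunk
        uint16_t wBitsPerSample;    // 0 for compressed formats
        uint16_t cbSize;
    };

    // Stream header timing: one chunk = one MP3 frame = one stream sample, so
    //   dwRate  = nSamplesPerSec          (e.g. 44100)
    //   dwScale = samples per MP3 frame   (1152 at 32 kHz+, 576 at 24 kHz-)
    // which gives sample N a start time of N*1152/44100 seconds, per the
    // formula above, regardless of how many bytes each frame actually took.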

The advantages of this VBR encoding:

- It requires no extensions to the AVI file format itself, so the files play in existing DirectShow-based players without parser changes.
- Timing stays sample-accurate: since each frame decodes a fixed number of samples, seeking and synchronization work exactly as they do for CBR streams.
- It generalizes to other compressed audio formats, as long as block boundaries and a maximum block size can be determined.

Now, the downsides:

- Storing one audio frame per chunk adds noticeable overhead: each frame costs an 8-byte chunk header plus a 16-byte index entry, which is significant at low bitrates.
- nBlockAlign no longer describes a real block size, so decoders and parsers that trust it -- for buffer sizing, byte-rate math, or seeking -- can misbehave.
- There is no explicit flag marking the file as VBR, so parsers must detect the scheme heuristically, typically by noticing audio chunks smaller than nBlockAlign.

As I mentioned in the introduction, I will refrain from saying whether VBR audio should or shouldn't be used, as I've already done that subject to death. Hopefully, though, those of you trying to write AVI parsers now have some idea of how to read and detect VBR files.

Comments

This blog was originally open for comments when this entry was first posted, but the comments were later closed and then removed, due to spam and a migration away from the original blog software. Unfortunately, reformatting the comments for republication would have been a lot of work. The author thanks everyone who posted comments and added to the discussion.