Multimedia dedicated weblog.

B-Frames in DirectShow

June 8th, 2008 Posted in Uncategorized

Ever wondered about B-frame support in DirectShow? … or do you think that DirectShow is perfect? Unfortunately it’s not :) Read the rest of the article for more info on this issue.

Frame types

It can be said that all modern video codecs reduce the amount of bits required to describe a video sequence by exploiting spatial and temporal redundancy. A frame encoded using only the spatial compression tools is usually called an I frame (intra frame). In most cases this type of frame can also be referred to as a ‘key frame’ because it is not derived from any other frame and can serve as a reference frame for the following frames. Frame types that also utilize the temporal compression tools are called ‘inter frames’ or P frames (predicted) because they are derived from at least one other frame. Recent codecs such as H.264 offer the possibility to use multiple reference frames, but the general rule for P frames is that they are derived only from frames that precede them in chronological order. See figure 1.

Figure 1. Simple frame structure

Starting with the early MPEG standards, a new frame type was introduced - the bi-directionally predicted frame, or B frame. As the name suggests, the frame is derived from at least two other frames - one from the past and one from the future (Figure 2).


Figure 2. Sequence containing a B frame

Since the B1 frame is derived from the I0 and P2 frames, both of them must be available to the decoder before the decoding of B1 can start. This means the transmission/decoding order is not the same as the presentation order. That’s why there are two types of time stamps for video frames - PTS (presentation time stamp) and DTS (decoding time stamp).
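The reordering for the sequence in Figure 2 can be sketched in a few lines of standalone C++. This is purely illustrative (the `Frame` struct and the timestamp values are made up for the example, not DirectShow code): frames arrive in decoding order (ascending DTS) and the decoder outputs them in presentation order (ascending PTS).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Illustrative only: a frame with its two timestamps.
struct Frame { char type; int pts; int dts; };

// The decoder receives frames in DTS order and must emit
// them sorted by PTS before they reach the renderer.
std::vector<Frame> ToPresentationOrder(std::vector<Frame> frames) {
    std::sort(frames.begin(), frames.end(),
              [](const Frame& a, const Frame& b) { return a.pts < b.pts; });
    return frames;
}
```

For the Figure 2 sequence the decoder receives I0, P2, B1 (DTS 0, 1, 2) and presents I0, B1, P2.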


Unfortunately, not all existing file containers are suitable for storing encoded video containing B frames, because they lack the capability to store two types of time stamps. The AVI file is a typical example of a container that allows only one time stamp per access unit. MP4 and Matroska support B frames natively, and that’s why they are better suited for storing modern high-quality video.


Unfortunately, the DirectShow technology also lacks the ability to store separate PTS/DTS information for media samples, and various techniques are used to work around this limitation.

AVI files with DivX/XviD

Depending on the authoring tool and the codec used to create the AVI file, the content might be stored in one of several ways.

  1. Frames are stored in decoding order and time stamps represent DTS.
  2. Frames are stored in a "packed bitstream" mode where multiple encoded frames are stored in one AVI sample. This mode also implies the usage of empty "delay frames". See figure 3.


Figure 3. AVI samples containing B frames

To find out more about the AVI and VFW way of dealing with B frames you can read this post at the doom9 board. All of these techniques require that both encoder and decoder know what they are doing. If you are developing a filter that should be capable of reading and processing an encoded stream containing B frames, you should be aware of this.

Guess the times by yourself

If you are certain that the timestamps passed along with the IMediaSample object are PTS values (and not DTS), you can implement a simple algorithm to derive the DTS value. You will need one variable to remember the maximum timestamp value seen so far. This simple algorithm has one disadvantage - it assumes you CAN discard the first received frame (which should not be a problem, e.g. for network transmissions that run for a long time and need to be restarted only on special occasions :) ).

PTS_IN - PTS of the input frame
PTS_OUT - PTS of the output frame
DTS_OUT - DTS of the output frame
TEMP_TS - Temporary timestamp value

The algorithm is as follows:

  1. Receive frame and get its PTS_IN value
  2. If it was the first frame, assign TEMP_TS = PTS_IN and discard the frame. Jump to step 1.
  3. If (PTS_IN < TEMP_TS) then { DTS_OUT = PTS_IN } else { DTS_OUT = TEMP_TS; TEMP_TS = PTS_IN; }
  4. Assign PTS_OUT = PTS_IN
  5. Deliver frame
  6. Jump to step 1.
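The steps above can be sketched as a small standalone C++ class. The `DtsGuesser` name is made up for this example; `REFERENCE_TIME` is stood in by a plain typedef (in DirectShow it is a LONGLONG counting 100-ns units), and the DirectShow plumbing around IMediaSample is omitted.

```cpp
#include <cassert>
#include <optional>
#include <utility>

typedef long long REFERENCE_TIME;  // stand-in for the DirectShow type

// Hypothetical helper implementing steps 1-6 above.
// Process() returns {PTS_OUT, DTS_OUT}, or nothing when
// the frame must be discarded (the very first frame).
class DtsGuesser {
public:
    std::optional<std::pair<REFERENCE_TIME, REFERENCE_TIME>>
    Process(REFERENCE_TIME pts_in) {
        if (!has_temp_) {            // step 2: first frame seeds TEMP_TS
            temp_ts_ = pts_in;
            has_temp_ = true;
            return std::nullopt;     // ... and is discarded
        }
        REFERENCE_TIME dts_out;
        if (pts_in < temp_ts_) {     // step 3: a B frame arrived
            dts_out = pts_in;
        } else {                     // forward reference frame
            dts_out = temp_ts_;
            temp_ts_ = pts_in;       // remember the new maximum PTS
        }
        return std::make_pair(pts_in, dts_out);  // steps 4-5
    }

private:
    bool has_temp_ = false;
    REFERENCE_TIME temp_ts_ = 0;
};
```

Feeding it the PTS sequence 0, 2, 1, 4, 3, 6, 5 reproduces the table below: the first frame is dropped, then (2,0), (1,1), (4,2), (3,3), (6,4), (5,5) come out.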

Consider the following sequence: I0, P2, B1, P4, B3, P6, B5

Input Frame | TEMP_TS (at start) | Output Frame (PTS, DTS) | TEMP_TS (at end)
I (0)       | -                  | - (discarded)           | 0
P (2)       | 0                  | P (2, 0)                | 2
B (1)       | 2                  | B (1, 1)                | 2
P (4)       | 2                  | P (4, 2)                | 4
B (3)       | 4                  | B (3, 3)                | 4
P (6)       | 4                  | P (6, 4)                | 6
B (5)       | 6                  | B (5, 5)                | 6

Simple, huh? It will also work nicely if the first received frame is not an I frame, and even if some frames are dropped, lost, or missing from the sequence. The only two problems are that the input frames must carry PTS values and that you lose one frame at the start.

The "my-beloved" solution

Another solution to the timestamp issue could be to introduce PTS/DTS timestamps into the media sample object. Since media samples are instances of CUnknown(IUnknown)-derived classes, they can also expose interfaces. That means we could define an IMediaSample3 interface containing Get/SetTimePTS/DTS(REFERENCE_TIME *, REFERENCE_TIME *) methods, so both times could be attached to the media sample. The problem is that all filters supporting this new transport would have to provide custom allocators as well, and it would be difficult to ensure compatibility with other 3rd party components that might not have a clue about this mechanism. However, if I were to develop a solution that contained filters made by one party for the Encoding/Mux/Demux/Decode parts of the graph, I would go for this one for sure.
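A minimal sketch of that idea, stripped of the COM machinery: `IMediaSample3` is NOT a real DirectShow interface - it is the hypothetical extension described above, with `REFERENCE_TIME` and `HRESULT` stood in by plain typedefs and IUnknown/QueryInterface/GUID plumbing omitted.

```cpp
#include <cassert>

typedef long long REFERENCE_TIME;  // stand-in for the DirectShow type
typedef long HRESULT;              // stand-in; real code uses winerror.h
const HRESULT S_OK = 0;

// Hypothetical extension interface carrying both timestamps.
struct IMediaSample3 {
    virtual HRESULT SetTimePTSDTS(REFERENCE_TIME* pts, REFERENCE_TIME* dts) = 0;
    virtual HRESULT GetTimePTSDTS(REFERENCE_TIME* pts, REFERENCE_TIME* dts) = 0;
    virtual ~IMediaSample3() {}
};

// A sample class that a custom allocator could hand out.
class CMediaSampleEx : public IMediaSample3 {
public:
    HRESULT SetTimePTSDTS(REFERENCE_TIME* pts, REFERENCE_TIME* dts) override {
        pts_ = *pts;
        dts_ = *dts;
        return S_OK;
    }
    HRESULT GetTimePTSDTS(REFERENCE_TIME* pts, REFERENCE_TIME* dts) override {
        *pts = pts_;
        *dts = dts_;
        return S_OK;
    }

private:
    REFERENCE_TIME pts_ = 0, dts_ = 0;
};
```

In a real filter graph the sample would additionally implement IMediaSample (or derive from CMediaSample), and downstream filters would QueryInterface for the extension, falling back to the single classic timestamp when it is absent.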

8 Responses to “B-Frames in DirectShow”

By Sina™ on Jun 9, 2008

    Hi Igor…
    How you doing?
    Any news about LAME filter?

By Zion on Jul 15, 2008


    I have a question..
    Why do you need such an algorithm to obtain the DTS from the PTS stamps of the input frames? Isn’t the DTS just the sequence of frames?

By Igor Janos on Jul 15, 2008

    Yes, DTS is “just a sequence of frames”, but for every frame PTS must be greater than or equal to DTS. If you just started counting frames as you receive them, that condition might not be fulfilled.

By Jamie Fenton on Nov 30, 2008

    Two ideas popped into my head when I read your blog entry:

    1) DirectShow uses 100ns clock resolution - an application can thus hide a code inside the jitter, particularly in video signals where jitter is a feature and not a bug :-). I.e. truncate the timestamps to 0.1 ms and use the remaining 3 digits of precision to hide flags. A form of temporary timestamping like you describe.

    2) Use an empirical trick: try decoding both ways and take whichever route is less ghastly. This requires the codec to be bulletproof (which it has to be anyway these days) and enough preroll space/time. So we have a variant on the “discard first frame” theme here.

By Igor Janos on Dec 1, 2008

    Nice ideas indeed, although I like the second one much more than the first one. Using dirty tricks with timestamp precision can lead to incompatibility issues when used with 3rd party filters.

    In my latest filter, the x264 encoder, I’ve tried to implement a derived allocator class that creates IMediaSampleEx samples providing methods for setting additional PTS, DTS and Clock values. This makes the filter compatible both with filters that can only read one timestamp value and with filters that can take advantage of the extended sample interface. IMHO this might prove useful.

By Jamie Fenton on Dec 6, 2008

    The derived allocator approach is probably the best way to go - Microsoft recommends something like it for a back channel for managing dynamic format changes.

    Even better would be for Microsoft to add a property list to their media sample API, so you can tag your properties, and I can tag mine, to the same sample and know that your upstream will get it, as would mine. Then DirectShow could get away from the “lots of independent allocators” architecture, and closer to a “shared-pool with routing” architecture, that only copied something, changed formats, etc., when it really needed to.

    One way to get that benefit started would be to release the design/code for such a facility as open-source, and code what you can to it. (Which I know you have done with your x264 encoder). Still, having it be separate, minimal, and free to use by anybody might get it officially approved.

By Igor Janos on Dec 6, 2008

    The mechanism of shared allocators is implemented in Trans-In-Place filters. However, it seems that this usage scenario is not very common. In a typical situation - playback of an AVI file - you usually have an async source and a splitter, which use their own specific way of data delivery. Then you have the Splitter->Decoder connection, which uses the classic memory allocator. And the Decoder->Video Renderer connection works best with the allocator provided by the renderer filter, so the frame buffer is decoded and copied only once, directly into video memory.

    As for the opensource design :) - yupp, that’s what I’m trying to do. I also have a set of muxers nearly finished so let’s hope the mechanism will become popular.

By Enemo on Nov 21, 2009

    I just got an iPhone recently, and I would like to transfer my contacts from my previous phone to my iPhone. However, there is no obvious button that I can press. How do you do this?

