
Project: ffmpeg and multimedia processing
Ao Shen
June 11, 2022
Contents

1 Functional Requirements
  1.1 Micro-benchmarking: virtual function and template dispatch
  1.2 Video Contact Sheet
    1.2.1 Frame Extraction
    1.2.2 Resize and Compose a sheet
    1.2.3 Add text
2 Non-functional Requirements for Your Code
  2.1 Code style
  2.2 Memory Management
  2.3 Performance
3 Submission and Report
  3.1 Submission
  3.2 Project Report
A Background Information
  A.1 Anatomy of a video file
  A.2 ffmpeg API Overview
  A.3 In one pixel: color and gamma
  A.4 Pixels in one frame: planar and packed format
In this project, you will develop a series of small tools that process media
files, mostly video files. Multimedia encoding algorithms are among the most
complex in existence, so we will use a state-of-the-art library, ffmpeg, to do
the actual encoding and decoding. We still believe this is a worthwhile
learning experience: modern computing systems have become so complex that you
have to integrate many libraries to program them for work, research, or even
your own personal use.
1 Functional Requirements
1.1 Micro-benchmarking: virtual function and template dispatch
(15%) As you will see in Section A.4, pixel data in decoded video frames can
be stored in different arrangements. When using C++, one way to encapsulate
the difference is a virtual function, redirecting every pixel access through
it:
class FrameViewBase {
public:
    virtual uint8_t data_from_pixel(int w, int h) = 0;
    virtual double calculate() { /* calls data_from_pixel() */ }
protected:
    uint8_t *data;
};

class YUV444PFrame : public FrameViewBase {
public:
    YUV444PFrame(uint8_t *buffer /*, ... */) { data = buffer; }
    uint8_t data_from_pixel(int w, int h) override {
        // get the required data at the given point
    }
};

double calculate_frame(AVPixelFormat fmt, uint8_t *buffer) {
    FrameViewBase *frame = nullptr;
    // factory pattern
    if (fmt == AV_PIX_FMT_YUV444P) frame = new YUV444PFrame(buffer /*, ... */);
    return frame->calculate(); // virtual dispatch
}
With the factory pattern, you only need to write common processing functions
like calculate once, in the base class.
On the other hand, if you would like to play with templates a bit, the same
functionality can be achieved as follows:
struct YUV444PFrame {
    uint8_t *data; /* ... */
};
struct YUV420PFrame {
    uint8_t *data; /* ... */
};

template <typename T>
uint8_t data_from_pixel(T *frame, int w, int h) = delete;

template <>
uint8_t data_from_pixel(YUV444PFrame *frame, int w, int h) {
    // get the required data at the given point
}

template <typename T>
double calculate(T *frame) { /* calls data_from_pixel(frame, ...) */ }

double calculate_frame(AVPixelFormat fmt, uint8_t *buffer) {
    YUV444PFrame frame444{buffer /*, ... */};
    YUV420PFrame frame420{buffer /*, ... */};
    if (fmt == AV_PIX_FMT_YUV444P)
        return calculate(&frame444); // template function, resolved at compile time
    /* ... */
}
Figure 1: A video contact sheet, generated by https://github.com/amietn/vcsi
Read through the code given in the microbench directory and run the benchmark.
Does performance differ between the two approaches? If it does, why?
Requirement. Run the benchmark with the CMake Release profile. Provide your
benchmark results, the specification of the computer the benchmark ran on,
and your explanation of the difference (or lack thereof).
Remark. The given code uses the ffmpeg library to read input data. Pay
attention to ffmpeg_decode_sample.c and read the description of the ffmpeg API
in Section A.2; this will be helpful for the following tasks.
Please refer to README.md in your code repository for how to run the
benchmark code.
1.2 Video Contact Sheet
A video contact sheet is a picture with several snapshots taken at different
time points of a given video (see Figure 1). It is often used as a preview of
a video before downloading. In this project, you will build such a tool in
C/C++ step by step.
1.2.1 Frame Extraction
(20%) First, you will need to extract frames at different time points from a
video.
Requirement. The filename of the input video file is given as a command-line
parameter (argv[1] of the main function). Extract 6 different frames, at about
the beginning, 1/5, 2/5, 3/5, 4/5, and near the end of the video. Save them as
frame_%d.png in the same directory as the video file, where %d is the index of
the saved frame. The exact time points you pick are not important, as the
project is graded by hand.
Remark. You may need av_seek_frame or avformat_seek_file. To write a PNG
image, use the stb_image_writer.h provided in the external directory. On
Windows and macOS, dragging and dropping a file onto an executable invokes the
executable with the path of the dropped file as a command-line parameter.
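As a rough sketch (not the reference solution), the seek step might look like
the following; error handling is omitted, and fmt_ctx and video_stream_index
are assumed to have been set up as described in Section A.2.

extern "C" {
#include <libavformat/avformat.h>
}

// Seek to a given fraction of a stream's duration before decoding.
int64_t seek_to_fraction(AVFormatContext *fmt_ctx, int video_stream_index,
                         double fraction) {
    AVStream *st = fmt_ctx->streams[video_stream_index];
    // AVStream::duration is expressed in the stream's own time base.
    int64_t target = (int64_t)(st->duration * fraction);
    // Jump to the nearest key frame at or before the target, then decode
    // forward until the desired time point is reached.
    av_seek_frame(fmt_ctx, video_stream_index, target, AVSEEK_FLAG_BACKWARD);
    return target;
}

For writing the PNG, stbi_write_png(path, w, h, 4, rgba, w * 4) can be called
on an RGBA buffer, assuming the provided stb_image_writer.h exposes the
standard stb_image_write API.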
Most image formats expect pixels encoded as RGB values, but videos often
encode pixels differently. In this project, you only need to consider the
following pixel layouts, as described by the format member of the AVFrame
struct: AV_PIX_FMT_YUV420P, AV_PIX_FMT_YUV444P, AV_PIX_FMT_RGBA.
Refer to Section A.3 for how to convert between them. Your program should give
a warning and exit cleanly when it encounters an unknown format.
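A sketch of that "warn and exit cleanly" check (the function name is
illustrative; a real tool should also release any contexts it has already
opened before exiting):

extern "C" {
#include <libavutil/pixfmt.h>
}
#include <cstdio>
#include <cstdlib>

void require_supported_format(AVPixelFormat fmt) {
    switch (fmt) {
    case AV_PIX_FMT_YUV420P:
    case AV_PIX_FMT_YUV444P:
    case AV_PIX_FMT_RGBA:
        return;
    default:
        std::fprintf(stderr, "unsupported pixel format %d\n", (int)fmt);
        std::exit(EXIT_FAILURE);
    }
}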
1.2.2 Resize and Compose a sheet
Then, combine the 6 extracted pictures and arrange them as a 2 × 3 grid on one
sheet (5%). You should scale down each image so that the output does not
exceed 2160 × 2160 pixels (5%). Moreover, if the unscaled image grid is
already smaller than that, do not scale the images up; reduce the size of the
output sheet instead (5%).
Requirement. Your input is the same as before. Save the output picture as
combined.png in the same directory as the video file.
Remark. You may want to look at the provided stb_image_resize.h, or use more
powerful libraries such as https://github.com/dtschump/CImg. You can also
write the resizing code yourself. Whichever way you choose, please modify the
CMake project structure accordingly and describe it in the report. A sketch of
the layout and resizing step is given below.
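The following is a minimal sketch of the layout computation and of resizing
one frame into its tile. It assumes the frames were already converted to RGBA
and all share the same size, that the provided header exposes the standard
stbir_resize_uint8 API, and it reads the 2 × 3 grid as 2 rows by 3 columns.

#include <algorithm>
#include <cstdint>
#include "stb_image_resize.h" // assumed to match the standard stb API

struct SheetLayout { int tile_w, tile_h, sheet_w, sheet_h; };

// Pick a per-tile size so a 2-row x 3-column grid fits in 2160 x 2160,
// never scaling small inputs up.
SheetLayout layout_sheet(int frame_w, int frame_h) {
    const int cols = 3, rows = 2, limit = 2160;
    double scale = std::min({1.0,
                             (double)limit / (cols * frame_w),
                             (double)limit / (rows * frame_h)});
    SheetLayout s;
    s.tile_w = (int)(frame_w * scale);
    s.tile_h = (int)(frame_h * scale);
    s.sheet_w = cols * s.tile_w;
    s.sheet_h = rows * s.tile_h;
    return s;
}

// Resize one RGBA frame directly into its tile inside the sheet buffer.
void resize_into_tile(const uint8_t *rgba, int frame_w, int frame_h,
                      uint8_t *sheet, const SheetLayout &s, int col, int row) {
    uint8_t *dst = sheet + 4 * (row * s.tile_h * s.sheet_w + col * s.tile_w);
    stbir_resize_uint8(rgba, frame_w, frame_h, 0,
                       dst, s.tile_w, s.tile_h, s.sheet_w * 4, 4);
}

Note that stbir_resize_uint8 averages the encoded values directly; if you want
a gamma-correct result, look for the sRGB-aware variants in the same header or
convert to linear light yourself, as discussed in Section A.3.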
Note that if you decide to write the resizing yourself, be sure not to average
RGB values directly. If you don't understand why, read Section A.3 and its
references again, and think it over.
We suggest that, for this task, you don't try to read back the output of the
previous task. Instead, reuse code from the previous task directly. Try to
encapsulate common operations (e.g. decoding the 6 frames) into your own
library. Refer to the util folder for an example of how.
1.2.3 Add text
(15%) Then, put the timestamp of each extracted frame on the output sheet.
Requirement. Your input is still the same. Save the output picture as
contactsheet.png in the same directory as the video file.
Remark. This should be easy if you have completed the previous task. You don't
need to do fancy transparency effects like those in Figure 1. All you need to
do is hardcode eleven pixel patterns (one for each digit and the ":" symbol)
and copy them below the border of each frame, as sketched below.
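A minimal sketch of stamping one hardcoded digit onto an RGBA sheet; the 3 × 5
glyph shape and the scale parameter are arbitrary illustrative choices, not
part of the assignment.

#include <cstdint>

// 3x5 bitmap for the digit '1'; each row stores 3 bits, MSB = leftmost pixel.
static const uint8_t kGlyph1[5] = { 0b010, 0b110, 0b010, 0b010, 0b111 };

void draw_digit_1(uint8_t *rgba, int sheet_w, int x0, int y0, int scale) {
    for (int row = 0; row < 5; ++row)
        for (int col = 0; col < 3; ++col)
            if (kGlyph1[row] & (1 << (2 - col)))
                // Fill a scale x scale block with opaque white.
                for (int dy = 0; dy < scale; ++dy)
                    for (int dx = 0; dx < scale; ++dx) {
                        int x = x0 + col * scale + dx;
                        int y = y0 + row * scale + dy;
                        uint8_t *p = rgba + 4 * (y * sheet_w + x);
                        p[0] = p[1] = p[2] = 255;
                        p[3] = 255;
                    }
}

The other nine digits and the ":" glyph follow the same pattern, and a
timestamp is just a sequence of such glyphs drawn at increasing x offsets.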
2 Non-functional Requirements for Your Code
You can choose C or C++ to finish this project. The exact version of the
standard does not matter, as long as you don't use compiler-specific
extensions or not-yet-standardized C++23 features.
Note that because ffmpeg is a C library, when including its headers in C++ you
should use an extern "C" declaration. Refer to avframe_wrapper.cpp for an
example.

extern "C" {
#include <libavformat/avformat.h>
}
In addition to the functionality outlined above, your code will be examined
and graded against the following standards.
2.1 Code style
(5%) Your code must be readable by the TA, which means the following.
• You should use a code formatter to keep a consistent style throughout all
your files. A .clang-format configuration is provided, and most IDEs already
support it.
• There is a lot of repetitive work in these tasks. Do not copy and paste your
code all over the place. Instead, try to encapsulate it into functions or
classes.
2.2 Memory Management
(10%) There should be no memory leaks in your program, even when invalid input
is provided.
You can use AddressSanitizer to detect memory leaks. Also, you can try to
encapsulate alloc/free function pairs in a C++ class, as sketched below.
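For instance, a minimal RAII sketch (one possible approach; the provided
avframe_wrapper.cpp may differ) that ties an AVFrame's lifetime to a C++
object, so the frame is freed even when an error path returns early:

extern "C" {
#include <libavutil/frame.h>
}

class FramePtr {
public:
    FramePtr() : frame_(av_frame_alloc()) {}
    ~FramePtr() { av_frame_free(&frame_); }
    FramePtr(const FramePtr &) = delete;            // no accidental copies
    FramePtr &operator=(const FramePtr &) = delete;
    AVFrame *get() const { return frame_; }
private:
    AVFrame *frame_;
};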
2.3 Performance
(Should the TA fail to reproduce your result in a reasonable time [more than
10x slower than the reference implementation], your score for the respective
task will be reduced.)
If you find your code running too slowly, you probably have too much
unnecessary memory copying or floating-point calculation. Try to optimize it!
3 Submission and Report
3.1 Submission
All your code should be pushed to the assigned GitHub repository. The deadline
is determined by the time of your git push to GitHub.
3.2 Project Report
(20%) In addition to code, you should also submit a report to
learn.tsinghua.edu.cn. It should contain the following information:
• Screenshots of your programs' output.
• How the TA should compile your project and reproduce your results.
• The CPU and RAM specification of the computer on which you ran the
micro-benchmark (e.g. "The benchmark was carried out on a desktop computer
with a 4.5 GHz AMD 5900X processor and 64 GB of DDR4 memory at 3200 MT/s").
• Your micro-benchmark results, and your answer to the question posed in that
section.
• Any interesting bugs you encountered, and how you solved them.
Please do not submit code with your report! Only code committed and pushed to
GitHub will be graded.
Figure 2: Workflow of the ffmpeg library. Source: ffmpeg and libav tutorial
A Background Information
A.1 Anatomy of a video file
While most strings are not compressible, as the Computing Theory class told
us, videos are definitely among the most compressible ones. Consider a video
with a resolution of 1920 × 1080 at 60 fps. The size of one minute of
uncompressed video (3 bytes per pixel, 60 frames per second, 60 seconds) is

$$3 \times 1920 \times 1080 \times 60 \times 60 \approx 2.2 \times 10^{10} \ \text{(bytes)}$$

But a video file of that length usually takes less than $10^8$ bytes. Video
encoding algorithms use many tricks to achieve this result; however, in this
project we don't need to dig into them.
Such encoding only provides a single stream of data. In a video file, we may
want many streams of data: video (possibly with multiple chapters), audio (in
different languages), and subtitles. So video files are containers of these
data streams. For a list of these container formats, you can refer to
Wikipedia:
https://en.wikipedia.org/wiki/Comparison_of_video_container_formats.
A.2 ffmpeg API Overview
The API of ffmpeg is designed around the aforementioned "container-stream"
model, as shown in Figure 2.
This section only provides a high-level overview of which functions you may
want to look for. The exact usage of the ffmpeg library is left to you to
learn from the documentation in the comments of the associated header files.
FFmpeg is divided into several parts; for this project, the relevant ones are
libavformat, libavutil and libavcodec. The interfaces to these libraries are
described in the C headers:
• libavformat/avformat.h
• libavcodec/avcodec.h
• libavutil/avutil.h
Note that these are C headers. If you are writing C++, include them inside
extern "C", otherwise linking errors may occur.
The handle to a video file is AVFormatContext, which must be allocated with
avformat_alloc_context and freed with avformat_free_context. The other structs
mentioned here usually have similar alloc and free functions.
Once allocated, a video file can be opened for decoding with
avformat_open_input. The streams can then be examined via
pFormat->streams[i]; the length of this array is given in ->nb_streams.
Each stream exposes its codec information in its codecpar member. To look up a
codec in the libavcodec library, use the
avcodec_find_decoder(codecpar->codec_id) function. A codec works within a
context, AVCodecContext, which is linked to the codec by the avcodec_open2
function.
When all contexts are set up, the decoding pipeline can be operated manually.
Read a packet from a stream into an AVPacket with av_read_frame, and forward
it to the codec with avcodec_send_packet. The decoded frame can then be
received into an AVFrame with avcodec_receive_frame. Mind the
reference-counting mechanism to avoid memory leaks.
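Putting these calls together, a rough sketch of the pipeline (error handling
and cleanup heavily abbreviated; video_idx is an illustrative name) might look
like:

extern "C" {
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>
}

bool decode_first_frame(const char *path, AVFrame *out) {
    AVFormatContext *fmt = avformat_alloc_context();
    if (avformat_open_input(&fmt, path, nullptr, nullptr) < 0) return false;
    avformat_find_stream_info(fmt, nullptr);

    // Locate the first video stream and open a codec context for it.
    int video_idx = -1;
    for (unsigned i = 0; i < fmt->nb_streams; ++i)
        if (fmt->streams[i]->codecpar->codec_type == AVMEDIA_TYPE_VIDEO) {
            video_idx = (int)i;
            break;
        }
    const AVCodec *codec =
        avcodec_find_decoder(fmt->streams[video_idx]->codecpar->codec_id);
    AVCodecContext *ctx = avcodec_alloc_context3(codec);
    avcodec_parameters_to_context(ctx, fmt->streams[video_idx]->codecpar);
    avcodec_open2(ctx, codec, nullptr);

    // Pump packets into the decoder until a frame comes out.
    AVPacket *pkt = av_packet_alloc();
    bool got = false;
    while (!got && av_read_frame(fmt, pkt) >= 0) {
        if (pkt->stream_index == video_idx) {
            avcodec_send_packet(ctx, pkt);
            got = (avcodec_receive_frame(ctx, out) == 0);
        }
        av_packet_unref(pkt); // drop the packet's reference-counted buffer
    }

    av_packet_free(&pkt);
    avcodec_free_context(&ctx);
    avformat_close_input(&fmt); // also frees the format context
    return got;
}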
A small example program has been provided to give you an idea of what this
looks like. The header files are commented with usage instructions, and can
often be recognized by your IDE.
A.3 In one pixel: color and gamma
In order to deal with frame data, you have to know how a color picture is
encoded.
As we all know, humans have three different kinds of cone cells, sensitive to
short-, middle- and long-wavelength light. Their activation strengths can be
denoted as a vector $(a, b, c)$. Intuitively, the length of this vector
corresponds to the strength of the incoming light, so the "color" part can be
represented on a two-dimensional plane.
Further research showed that this is indeed the case. Figure 3 shows a diagram
of "all colors". The numbers on the outer curve denote the wavelengths of pure
(monochromatic) light, and the interior of the curve contains the colors of
mixtures of light with different wavelengths.
To actually encode a color, we have to choose three colors in the diagram as
our basis (called primaries). The sRGB standard, mostly used in pictures, and
the BT.709 standard, mostly used in HD videos, choose the same three colors,
as shown in Figure 3. There are "wide gamut" color spaces that choose a
different basis to allow more colors to be encoded, such as Display-P3 and
Adobe RGB, but in this project we will never encounter them.
Figure 3: CIE 1931 chromaticity diagram with the sRGB color gamut. Source:
Wikipedia
With the basis chosen, every representable color can be denoted as
$(r, g, b) \in [0, 1]^3$. However, in most image formats, the stored value in
the range $0 \sim 255$ is not simply this vector multiplied by 255 and
converted to an integer, because:
• Human eyes are more sensitive to differences between dark colors, as shown
in Figure 4. So in order to give a consistent perceived lightness difference,
we need more values representing darker colors. Hence, a non-linear transform
$x^{\gamma}$ with $\gamma < 1$ is needed.
• CRT displays emit light with a power-law relationship to the input voltage
level, i.e. the emitted light $L \approx V^{\gamma}$ with $\gamma > 1$, where
$V$ is the input voltage.
Figure 4: Perceived lightness vs. physical lightness. Source: Learn OpenGL
Therefore, a step called "gamma correction" is often needed, named after the
symbol $\gamma$ conventionally used for the exponent of this transformation.
When an image is saved, you want a non-linear transform so that more precision
is given to the darker side, where humans are more sensitive. And when an
image is displayed, a non-linear transform before output is needed to produce
the correct voltage. (While modern LCD displays are digital and no longer have
this voltage-to-light relationship, all the existing video output equipment
means LCD displays include a conversion circuit to emulate this behaviour.)
Moreover, in video signals, it is common to split the RGB signal into luma
($Y'$) and chroma ($C_bC_r$) signals, resulting in the YUVxxx formats you will
often see.
The BT.709 standard uses the following encoding transformation:

$$E'_e = \begin{cases} 4.500\,e & 0 \le e \le 0.018 \\ 1.099\,e^{0.45} - 0.099 & 0.018 < e \le 1 \end{cases} \qquad (e \text{ is one of } r, g, b \in [0, 1])$$

$$E'_{Y'} = 0.2126\,E'_R + 0.7152\,E'_G + 0.0722\,E'_B \qquad D_{Y'} = [219\,E'_{Y'} + 16]$$

$$E'_{C_b} = (E'_B - E'_{Y'})/1.8556 \qquad D_{C_b} = [224\,E'_{C_b} + 128]$$

$$E'_{C_r} = (E'_R - E'_{Y'})/1.5748 \qquad D_{C_r} = [224\,E'_{C_r} + 128]$$

where $(D_{Y'}, D_{C_b}, D_{C_r})$ is what is saved in the video frame, and $[\cdot]$ denotes rounding to the nearest integer.
The sRGB standard is used for most images today, and it saves RGB information
with the following transformation:

$$E'_e = \begin{cases} 12.92\,e & 0 \le e \le 0.00304 \\ 1.055\,e^{1/2.4} - 0.055 & 0.00304 < e \le 1 \end{cases} \qquad D_e = [255\,E'_e]$$

where $(D_r, D_g, D_b)$ is the familiar RGB value between 0 and 255.
Note that $E'_{\mathrm{BT.709}}(e) \approx e^{1/1.92}$ and
$E'_{\mathrm{sRGB}}(e) \approx e^{1/2.2}$, which can be used to simplify the
calculation a bit. Also, you can pre-calculate a lookup table.
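As an illustration, the quantization and matrix above can be inverted to
recover gamma-encoded R'G'B' values from one $(D_{Y'}, D_{C_b}, D_{C_r})$
triple, as in the sketch below. Strictly, the BT.709-encoded values should
also be converted to the sRGB transfer curve before being written to a PNG;
the approximations above (or a pre-computed lookup table) can handle that
step.

#include <algorithm>
#include <cmath>
#include <cstdint>

void ycbcr709_to_rgb(uint8_t dy, uint8_t dcb, uint8_t dcr,
                     uint8_t *r, uint8_t *g, uint8_t *b) {
    double ey  = (dy  - 16.0) / 219.0;  // undo D_Y' = [219 E'_Y' + 16]
    double ecb = (dcb - 128.0) / 224.0; // undo D_Cb = [224 E'_Cb + 128]
    double ecr = (dcr - 128.0) / 224.0; // undo D_Cr = [224 E'_Cr + 128]
    double eb = ey + 1.8556 * ecb;      // from E'_Cb = (E'_B - E'_Y') / 1.8556
    double er = ey + 1.5748 * ecr;      // from E'_Cr = (E'_R - E'_Y') / 1.5748
    double eg = (ey - 0.2126 * er - 0.0722 * eb) / 0.7152;
    auto to8 = [](double e) {
        return (uint8_t)std::lround(255.0 * std::clamp(e, 0.0, 1.0));
    };
    *r = to8(er);
    *g = to8(eg);
    *b = to8(eb);
}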
This is why you cannot average pixels directly: the average of encoded RGB
values has little to do with the average of the actual light you will see when
an image is scaled down.
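A tiny illustration using the $e^{1/2.2}$ approximation above: decode to
(approximately) linear light, average there, then re-encode.

#include <cmath>
#include <cstdint>

uint8_t average_srgb(uint8_t a, uint8_t b) {
    double la = std::pow(a / 255.0, 2.2); // encoded -> approximately linear
    double lb = std::pow(b / 255.0, 2.2);
    double mean = (la + lb) / 2.0;        // average the actual light
    return (uint8_t)std::lround(255.0 * std::pow(mean, 1.0 / 2.2));
}
// average_srgb(0, 255) gives about 186, while the naive (0 + 255) / 2 = 128
// is visibly too dark.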
Misunderstanding of RGB codes in images has caused a lot of confusion. In
fact, there is even a CVPR paper criticizing incorrect interpretations of
images: see R. M. H. Nguyen and M. S. Brown, "Why You Should Forget Luminance
Conversion and Do Something Better," CVPR 2017.
A.4 Pixels in one frame: planar and packed format
Another important point about image formats is that pixels can be stored
"packed" or "planar", as illustrated in the following pseudo-code together
with the corresponding ffmpeg format names:
typedef struct rgba_pixel {
    uint8_t r; uint8_t g; uint8_t b; uint8_t a;
} rgba_pixel_t;
rgba_pixel_t packed_image[linesize * HEIGHT]; // AV_PIX_FMT_RGBA

struct {
    uint8_t y_plane[linesize * HEIGHT];
    uint8_t u_plane[(linesize / 2) * (HEIGHT / 2)]; // Cb is also denoted as U
    uint8_t v_plane[(linesize / 2) * (HEIGHT / 2)]; // Cr is also denoted as V
} planar_image; // AV_PIX_FMT_YUV420P

Figure 5: Memory layout of a frame in the YUV420P pixel format and
illustration of chroma subsampling. YUV420 means 4:2:0, YUV444 means 4:4:4,
and so on. "Byte stream" means the layout in a linear array. Source: Wikipedia
You may have noticed that in the YUV420P example, the U and V planes are
smaller than the image. This is because of a compression trick called chroma
subsampling.
Because human eyes are more sensitive to changes in brightness (luma) than in
color (chroma), you can encode the chroma planes at a lower resolution and let
several pixels share the same chroma information.
For example, the layout of YUV420P is shown in the code above and in Figure 5.
While each pixel has its own luma ($Y'$) component, four pixels in a square
share one pair of chroma ($C_bC_r$) components, as if the chroma were the same
for all of them.
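In ffmpeg, the planes of a decoded frame are exposed through AVFrame::data[i],
with per-plane row strides in AVFrame::linesize[i]. Below is a sketch (one
possible shape for the data_from_pixel function of Section 1.1) of reading the
three samples of pixel (x, y) from a YUV420P frame.

extern "C" {
#include <libavutil/frame.h>
}
#include <cstdint>

void yuv420p_at(const AVFrame *frame, int x, int y,
                uint8_t *Y, uint8_t *Cb, uint8_t *Cr) {
    *Y  = frame->data[0][y * frame->linesize[0] + x];
    // The chroma planes are subsampled by 2 in both directions, so four
    // neighbouring luma samples map onto the same chroma sample.
    *Cb = frame->data[1][(y / 2) * frame->linesize[1] + (x / 2)];
    *Cr = frame->data[2][(y / 2) * frame->linesize[2] + (x / 2)];
}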