VLSI Design for Video Coding: H.264/AVC Encoding from Standard Specification to Chip
Ebook, 304 pages, 2 hours

About this ebook

High definition video requires substantial compression in order to be transmitted or stored economically. Advances in video coding standards from MPEG-1, MPEG-2, and MPEG-4 to H.264/AVC have provided ever-increasing coding efficiency, at the expense of greatly increased computational complexity, which can only be handled through massively parallel processing.


This book presents VLSI architectural design and chip implementation for high-definition H.264/AVC video encoding, a state-of-the-art video application, complete with a VLSI prototype realized in FPGA/ASIC. It serves as an invaluable reference for anyone interested in VLSI design and high-level (EDA) synthesis for video.

Language: English
Publisher: Springer
Release date: Dec 29, 2009
ISBN: 9781441909596

    Book preview

    VLSI Design for Video Coding - Youn-Long Steve Lin

    Youn-Long Steve Lin, Chao-Yang Kao, Huang-Chih Kuo, and Jian-Wen Chen, VLSI Design for Video Coding: H.264/AVC Encoding from Standard Specification to Chip, 1st ed., DOI 10.1007/978-1-4419-0959-6_1, © Springer Science+Business Media, LLC 2010

    1. Introduction to Video Coding and H.264/AVC

    Youn-Long Steve Lin¹, Chao-Yang Kao¹, Huang-Chih Kuo¹, and Jian-Wen Chen¹

    (1) Dept. of Computer Science, National Tsing Hua University, 101 Kuang Fu Road Section 2, HsinChu 300, Taiwan, R.O.C.

    Abstract

    A video signal is represented as a sequence of frames of pixels. There exists a vast amount of redundant information that can be eliminated with video compression technology so that transmission and storage become more efficient. To facilitate interoperability between compression at the video-producing source and decompression at the consumption end, several generations of video coding standards have been defined and adopted. For low-end applications, software solutions are adequate. For high-end applications, dedicated hardware solutions are needed. This chapter gives an overview of the principles behind video coding in general and the advanced features of the H.264/AVC standard in particular. It serves as an introduction to the remaining chapters, each of which covers an important coding tool of an H.264/AVC encoder and its VLSI architectural design.

    1.1 Introduction

    A video encoder takes as its input a video sequence, performs compression, and produces as its output a bit-stream that can be decoded back into a video sequence by a standard-compliant video decoder.

    A video signal is a sequence of frames. It has a frame rate defined as the number of frames per second (fps). For typical consumer applications, 30 fps is adequate. However, it could be as high as 60 or 72 for very high-end applications or as low as 10 or 15 for video conferencing over a low-bandwidth communication link.

    A frame consists of a two-dimensional array of color pixels. Its size is called the frame resolution. A standard-definition (SD) frame has 720 × 480 pixels, whereas a full high-definition (FullHD) one has 1,920 × 1,088. There is a large number of frame-size variations defined for various applications such as computer monitors.

    A color pixel is composed of three elementary components: R, G, and B. Each component is digitized to 8 bits for consumer applications or 12 bits for high-end applications.

    The data rate of a raw video signal is huge. For example, a 30-fps FullHD signal has a data rate of 30 × 1,920 × 1,088 × 3 × 8 ≈ 1.5 Gbps, which is impractical for today's communication or storage infrastructure.

    Fortunately, by taking advantage of the characteristics of the human visual system and the redundancy in the video signal, we can compress the data by two orders of magnitude without sacrificing the quality of the decompressed video.

    1.1.1 Basic Coding Unit

    In order for a video encoding or decoding system to handle video of different frame sizes and to simplify the implementation, a basic unit of 16 × 16 pixels has been widely adopted. Every mainstream coding standard, from MPEG-1 and MPEG-2 to H.264, has chosen a macroblock of 16 × 16 pixels as its basic unit of processing. Hence, for video of different resolutions, we just have to process a different number of macroblocks. For every 720 × 480 SD frame, we process 45 × 30 macroblocks, while for every FullHD frame, we process 120 × 68 macroblocks.

    1.1.2 Video Encoding Flow

    Algorithm 1.1 depicts a typical flow of video encoding. frame(t) is the current frame to be encoded. frame′(t−1) is the reconstructed previous frame used for referencing, also called the reference frame. frame′(t) is the reconstructed current frame. We encode frame(t) one macroblock (MB) at a time, starting from the leftmost MB of the topmost row. We call the MB being encoded Curr_MB. It can be encoded in one of three modes: I for intra prediction, P for unidirectional interprediction, and B for bidirectional interprediction. The MB resulting from prediction is called Pred_MB, and the difference between Curr_MB and Pred_MB is called Res_MB, for residuals. Res_MB goes through a space-to-frequency transformation and then quantization to become Res_Coef, the residual coefficients. Entropy coding then compresses Res_Coef to produce the final bit-stream. In order to prepare the reconstructed current frame for future reference, we perform inverse quantization and inverse transformation on Res_Coef to obtain the reconstructed residuals, called Reconst_res. Adding Reconst_res and Pred_MB together, we obtain the reconstructed MB for insertion into frame′(t).
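    The flow can be summarized as a structural sketch in C. The stage functions below (load_mb, intra_or_inter_predict, and so on) are hypothetical placeholders standing in for the coding tools covered in the rest of this chapter; only the data flow between Curr_MB, Pred_MB, Res_MB, Res_Coef, and Reconst_res follows the text, and the sketch is not the book's Algorithm 1.1.

        /* Structural sketch of the per-macroblock encoding loop described above.
         * All stage functions are assumed placeholders; only the data flow
         * between them follows the text. */
        typedef struct { short pix[16][16]; }  MB;    /* one 16x16 macroblock  */
        typedef struct { short coef[16][16]; } Coef;  /* transformed residuals */
        typedef enum { MODE_I, MODE_P, MODE_B } MbMode;

        /* Placeholder stage prototypes (assumed, not from the book). */
        MB    load_mb(const MB *frame, int mb_x, int mb_y);
        MB    intra_or_inter_predict(const MB *curr, MbMode mode);  /* Pred_MB */
        MB    subtract(const MB *a, const MB *b);                   /* Res_MB  */
        Coef  transform_and_quantize(const MB *res);                /* Res_Coef*/
        void  entropy_encode(const Coef *c, unsigned char *bitstream);
        MB    inv_quantize_and_transform(const Coef *c);        /* Reconst_res */
        MB    add(const MB *a, const MB *b);
        void  store_reconstructed(MB *frame_t_prime, int mb_x, int mb_y,
                                  const MB *mb);

        void encode_frame(const MB *frame_t, MB *frame_t_prime,
                          int mb_cols, int mb_rows, unsigned char *bitstream)
        {
            for (int y = 0; y < mb_rows; y++) {        /* topmost MB row first */
                for (int x = 0; x < mb_cols; x++) {    /* leftmost MB first    */
                    MB   curr = load_mb(frame_t, x, y);              /* Curr_MB */
                    /* mode decision (I/P/B) is omitted; P is hard-coded here */
                    MB   pred = intra_or_inter_predict(&curr, MODE_P);
                    MB   res  = subtract(&curr, &pred);
                    Coef coef = transform_and_quantize(&res);
                    entropy_encode(&coef, bitstream);
                    MB   rres  = inv_quantize_and_transform(&coef);
                    MB   recon = add(&rres, &pred);       /* reconstructed MB */
                    store_reconstructed(frame_t_prime, x, y, &recon);
                }
            }
        }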

    1.1.3 Color Space Conversion

    Naturally, each pixel is composed of R, G, and B 8-bit components. By applying the following conversion, it can be represented as one luminance (luma) component Y and two chrominance (chroma) components Cb and Cr. Since the human visual system is more sensitive to the luminance component than to the chrominance ones, we can subsample Cb and Cr to reduce the amount of data without sacrificing the video quality. Usually one-out-of-two or one-out-of-four subsampling is applied. The former is called the 4:2:2 format and the latter the 4:2:0 format. In this book, we assume that the 4:2:0 format is chosen. Of course, the inverse conversion gives us the R, G, B components back from a set of Y, Cb, Cr components.

    $$\begin{array}{rl} Y & = 0.299R + 0.587G + 0.114B, \\ \mathrm{Cb}& = 0.564(B - Y ), \\ \mathrm{Cr}& = 0.713(R - Y ).\end{array}$$

    (1.1)
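    As a small illustration, the conversion in (1.1) translates directly into C. The +128 offset that recenters the chroma components into the 8-bit range and the final clamp are common conventions assumed here; they are not part of (1.1) itself.

        /* Per-pixel RGB -> YCbCr conversion following (1.1).  The +128 chroma
         * offset and the clamp to [0,255] are common 8-bit conventions assumed
         * here; they are not part of the equation itself. */
        static unsigned char clamp8(double v)
        {
            if (v < 0.0)   return 0;
            if (v > 255.0) return 255;
            return (unsigned char)(v + 0.5);          /* round to nearest */
        }

        void rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b,
                          unsigned char *y, unsigned char *cb, unsigned char *cr)
        {
            double Y = 0.299 * r + 0.587 * g + 0.114 * b;
            *y  = clamp8(Y);
            *cb = clamp8(0.564 * (b - Y) + 128.0);    /* Cb = 0.564(B - Y) */
            *cr = clamp8(0.713 * (r - Y) + 128.0);    /* Cr = 0.713(R - Y) */
        }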

    Algorithm 1.1 Typical flow of video encoding (shown as a figure in the original)

    1.1.4 Prediction of a Macroblock

    A macroblock M has 16 × 16 = 256 pixels. It takes 256 × 3 = 768 bytes to represent it in RGB format and 256 × (1 + 1/4 + 1/4) = 384 bytes in 4:2:0 format. If we can find, during decoding, a macroblock M′ which is similar to M, then we only have to get from the encoding end the difference between M and M′. If M and M′ are very similar, the difference becomes very small, and so does the amount of data that needs to be transmitted or stored. Another way to interpret similarity is redundancy. There exist two types of redundancy: spatial and temporal. Spatial redundancy results from the similarity between a pixel (region) and its surrounding pixels (regions) within a frame. Temporal redundancy results from the slow change of video content from one frame to the next. Redundant information can be identified and removed with prediction tools.

    1.1.5 Intraframe Prediction

    In an image region with smooth change, a macroblock is likely to be similar to its neighboring macroblocks in color or texture. For example, if all its neighbors are red, we can predict that a macroblock is also red. Generally, we can define several prediction functions; each takes pixel values from neighboring macroblocks as its input and produces a predicted macroblock as its output. To carry out intraframe prediction, every function is evaluated and the one resulting in the smallest error is chosen. Only the function type and the error need to be encoded and stored/transmitted. This tool is also called intra prediction and a prediction function is also called a prediction mode.
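    The following sketch makes the mode-selection idea concrete for one 4 × 4 luma block. It evaluates three of the simplest prediction functions (vertical, horizontal, and DC, which are among the nine H.264/AVC luma 4 × 4 modes) and keeps the one with the smallest sum of absolute differences; it is an illustration of the principle, not the encoder's actual mode-decision logic, and the function names are ours.

        #include <stdlib.h>

        /* Simplified intra-mode selection for one 4x4 block: try vertical,
         * horizontal and DC prediction from already-reconstructed neighbours
         * and keep the mode with the smallest SAD. */
        enum { MODE_VERTICAL, MODE_HORIZONTAL, MODE_DC };

        static int sad4x4(unsigned char a[4][4], unsigned char b[4][4])
        {
            int sad = 0;
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    sad += abs((int)a[i][j] - (int)b[i][j]);
            return sad;
        }

        /* top[4]: reconstructed pixels just above the block,
         * left[4]: reconstructed pixels just left of the block. */
        int choose_intra_mode(unsigned char cur[4][4],
                              const unsigned char top[4],
                              const unsigned char left[4],
                              unsigned char best_pred[4][4])
        {
            unsigned char pred[3][4][4];
            int dc = 0;
            for (int j = 0; j < 4; j++) dc += top[j] + left[j];
            dc = (dc + 4) >> 3;                   /* average of 8 neighbours */

            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++) {
                    pred[MODE_VERTICAL][i][j]   = top[j];  /* copy pixel above */
                    pred[MODE_HORIZONTAL][i][j] = left[i]; /* copy pixel left  */
                    pred[MODE_DC][i][j]         = (unsigned char)dc;
                }

            int best_mode = MODE_VERTICAL;
            int best_sad  = sad4x4(cur, pred[MODE_VERTICAL]);
            for (int m = MODE_HORIZONTAL; m <= MODE_DC; m++) {
                int s = sad4x4(cur, pred[m]);
                if (s < best_sad) { best_sad = s; best_mode = m; }
            }
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    best_pred[i][j] = pred[best_mode][i][j];
            return best_mode;  /* only the mode and the residual need coding */
        }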

    1.1.6 Interframe Prediction

    Interframe prediction, also called interprediction, identifies temporal redundancy between neighboring frames. We call the frame currently being processed the current frame and the neighboring one the reference frame. We try to find from the reference frame a reference macroblock that is very similar to the current macroblock of the current frame. The process is called motion estimation. A motion estimator compares the current macroblock with candidate macroblocks within a search window in the reference frame. After finding the best-matched candidate macroblock, only the displacement and the error need to be encoded and stored/transmitted. The displacement from the location of the current macroblock to that of the best candidate block is called motion vector (MV). In other words, motion estimation determines the MV that results in the smallest interprediction error. A bigger search window will give better prediction at the expense of longer estimation time.
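    A minimal full-search motion estimator is sketched below, using the sum of absolute differences (SAD) as the matching cost over a square search window of plus/minus range pixels. Real designs use far more elaborate search strategies and cost functions, and frame-border handling is omitted here; the function names are illustrative only.

        #include <limits.h>
        #include <stdlib.h>

        #define MB_SIZE 16                     /* macroblock width/height */

        typedef struct { int x, y; } MV;

        /* SAD between the current macroblock at (cx,cy) and a reference-frame
         * candidate displaced by (dx,dy).  Frames are stored row-major. */
        static int sad_mb(const unsigned char *cur, const unsigned char *ref,
                          int width, int cx, int cy, int dx, int dy)
        {
            int sad = 0;
            for (int i = 0; i < MB_SIZE; i++)
                for (int j = 0; j < MB_SIZE; j++)
                    sad += abs((int)cur[(cy + i) * width + cx + j] -
                               (int)ref[(cy + dy + i) * width + cx + dx + j]);
            return sad;
        }

        /* Exhaustive (full) search over a window of +/-range pixels.  The
         * caller must keep the window inside the reference frame. */
        MV full_search(const unsigned char *cur, const unsigned char *ref,
                       int width, int cx, int cy, int range)
        {
            MV  best = { 0, 0 };
            int best_sad = INT_MAX;
            for (int dy = -range; dy <= range; dy++)
                for (int dx = -range; dx <= range; dx++) {
                    int s = sad_mb(cur, ref, width, cx, cy, dx, dy);
                    if (s < best_sad) { best_sad = s; best.x = dx; best.y = dy; }
                }
            return best;          /* motion vector with the smallest SAD */
        }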

    1.1.7 Motion Vector

    An MV obtained from motion estimation is adequate for retrieving a block from the reference frame. Yet, we do not have to encode/transmit the whole of it, because there exists similarity (or redundancy) among the MVs of neighboring blocks. Instead, we can form a motion vector prediction (MVP) as a function of the neighboring blocks' MVs and just process the difference, called the motion vector difference (MVD), between the MV and its MVP. In most cases, the MVD is much smaller than its associated MV.
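    The sketch below computes an MVP as the component-wise median of the MVs of the left, top, and top-right neighboring blocks, which is the common case in H.264/AVC, and then forms the MVD; the standard's special rules for unavailable neighbors and particular block sizes are left out.

        typedef struct { int x, y; } MV;

        static int median3(int a, int b, int c)
        {
            /* middle value of the three */
            if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
            if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
            return c;
        }

        /* MVP as the component-wise median of the left, top and top-right
         * neighbours' MVs (common H.264/AVC case; special cases omitted). */
        MV predict_mv(MV left, MV top, MV top_right)
        {
            MV mvp = { median3(left.x, top.x, top_right.x),
                       median3(left.y, top.y, top_right.y) };
            return mvp;
        }

        /* Only the difference between the MV and its prediction is coded. */
        MV mv_difference(MV mv, MV mvp)
        {
            MV mvd = { mv.x - mvp.x, mv.y - mvp.y };
            return mvd;
        }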

    1.1.8 Prediction Error

    We call the difference between the current macroblock and the predicted one the prediction error. It is also called the residual error, or just the residual.

    1.1.9 Space-Domain to Frequency-Domain Transformation of Residual Error

    Residual error is in the space domain; it can be represented in the frequency domain by applying the discrete cosine transform (DCT). The DCT can be viewed as representing an image block as a weighted sum of elementary patterns, where the weights are termed coefficients. For computational feasibility, a macroblock of residual errors is usually divided into smaller 4 × 4 or 8 × 8 blocks, and the DCT is applied to each block in turn.
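    For the 4 × 4 case, H.264/AVC replaces the floating-point DCT with an integer core transform, computed as W = C X C^T with a small constant matrix; the normalization factors are folded into quantization and are omitted in the sketch below.

        /* 4x4 integer core transform W = C * X * C^T, the integer
         * approximation of the DCT used by H.264/AVC for 4x4 residual blocks.
         * The scaling factors folded into quantization are omitted. */
        static const int C[4][4] = {
            { 1,  1,  1,  1 },
            { 2,  1, -1, -2 },
            { 1, -1, -1,  1 },
            { 1, -2,  2, -1 },
        };

        void forward_transform_4x4(int x[4][4], int w[4][4])
        {
            int t[4][4];

            /* t = C * x */
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++) {
                    t[i][j] = 0;
                    for (int k = 0; k < 4; k++)
                        t[i][j] += C[i][k] * x[k][j];
                }

            /* w = t * C^T; w[0][0] is the lowest-frequency (DC) coefficient */
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++) {
                    w[i][j] = 0;
                    for (int k = 0; k < 4; k++)
                        w[i][j] += t[i][k] * C[j][k];
                }
        }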

    1.1.10 Coefficient Quantization

    Coefficients generated by the DCT carry image components of various frequencies. Since the human visual system is more sensitive to low-frequency components and less sensitive to high-frequency ones, we can treat them with different resolution by means of quantization. Quantization effectively discards certain least significant bits (LSBs) of a coefficient. By assigning smaller quantization steps to low-frequency components and larger quantization steps to high-frequency ones, we can reduce the amount of data without sacrificing the visual quality.
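    A heavily simplified scalar quantizer is sketched below: each coefficient is divided by a step size and rounded toward zero, and the step size roughly doubles every six quantization-parameter (QP) steps, as in H.264/AVC. The standard's actual multiplier tables, rounding offsets, and per-frequency scaling matrices are intentionally left out.

        #include <stdlib.h>

        /* Approximate H.264/AVC step size Qstep, scaled by 16 to stay in
         * integers: the base values cover QP 0..5 and the step doubles for
         * every further increment of 6 in QP. */
        static int qstep_times_16(int qp)
        {
            static const int base[6] = { 10, 11, 13, 14, 16, 18 };
            return base[qp % 6] << (qp / 6);
        }

        /* Simplified quantization of a 4x4 coefficient block: divide by the
         * step size, i.e. discard the least significant part of each value. */
        void quantize_4x4(int coef[4][4], int level[4][4], int qp)
        {
            int step16 = qstep_times_16(qp);
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++) {
                    int c = coef[i][j];
                    int q = (abs(c) * 16) / step16;
                    level[i][j] = (c < 0) ? -q : q;
                }
        }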

    1.1.11 Reconstruction

    Both the encoding and decoding ends have to reconstruct the video frame. At the encoding end, the reconstructed frame, rather than the original one, should be used as the reference, because no original frame is available at the decoding end. To reconstruct, we perform inverse quantization and inverse DCT to obtain the reconstructed residual. Note that the reconstructed residual is not identical to the original residual, since quantization is irreversible; this is where distortion is introduced. We then add the prediction data to the reconstructed residual to obtain the reconstructed image. For an intrapredicted macroblock, we apply the prediction function to its neighboring reconstructed macroblocks, while for an interpredicted one we perform motion compensation. Both methods give a reconstructed version of the current macroblock.
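    In code, the reconstruction path mirrors the forward path: scale the quantized levels back up, run the inverse transform, add the prediction, and clip to the 8-bit pixel range. In the sketch below, inverse_transform_4x4 is an assumed helper (the inverse of the core transform above) and is not shown.

        /* Reconstruction of one 4x4 block: inverse-quantize the levels,
         * inverse-transform them back to the space domain, add the
         * prediction, and clip to [0,255]. */
        void inverse_transform_4x4(int w[4][4], int res[4][4]);  /* assumed */

        static unsigned char clip255(int v)
        {
            return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }

        void reconstruct_4x4(int level[4][4], int step16,
                             unsigned char pred[4][4], unsigned char recon[4][4])
        {
            int w[4][4], res[4][4];

            /* inverse quantization: scale the levels back by the step size;
             * the bits discarded by the quantizer are lost, hence distortion */
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    w[i][j] = (level[i][j] * step16) / 16;

            inverse_transform_4x4(w, res);    /* back to the space domain */

            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    recon[i][j] = clip255(res[i][j] + pred[i][j]);
        }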

    1.1.12 Motion Compensation

    Given an MV, the motion compensator retrieves from the reference frame the reconstructed macroblock pointed to by the integer part of the MV. If the MV has a fractional part, it interpolates the retrieved image to obtain the final reconstructed image. Usually, interpolation is done in two steps, one for half-pixel accuracy and the other for quarter-pixel accuracy.
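    The one-dimensional sketch below shows the two interpolation steps: half-pixel samples come from the standard's six-tap filter (1, -5, 20, 20, -5, 1)/32, and quarter-pixel samples come from averaging the two nearest samples. The full two-dimensional interpolation and the frame-border handling are omitted.

        static unsigned char clip255(int v)
        {
            return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
        }

        /* Half-pixel sample between ref[pos] and ref[pos+1] using the
         * H.264/AVC six-tap filter (1, -5, 20, 20, -5, 1) / 32.  One
         * dimension only; the caller keeps pos-2 .. pos+3 inside the line. */
        unsigned char half_pel(const unsigned char *ref, int pos)
        {
            int v = ref[pos - 2] - 5 * ref[pos - 1] + 20 * ref[pos] +
                    20 * ref[pos + 1] - 5 * ref[pos + 2] + ref[pos + 3];
            return clip255((v + 16) >> 5);
        }

        /* Quarter-pixel sample: rounded average of the two nearest (integer
         * or half-pixel) samples, i.e. the two-tap filter. */
        unsigned char quarter_pel(unsigned char a, unsigned char b)
        {
            return (unsigned char)((a + b + 1) >> 1);
        }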

    1.1.13 Deblocking Filtering

    After every macroblock of a frame has been reconstructed, we obtain a reconstructed frame. Since the encoding/decoding process is done macroblock by macroblock, blocking artifacts appear at the boundaries between adjacent macroblocks or subblocks. A deblocking filter is used to eliminate these artificial edges.
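    As a flavor of what such a filter does, the toy sketch below smooths the two pixels on either side of a vertical block boundary only when the step across the boundary is small enough to be a coding artifact rather than a genuine image edge. The real H.264/AVC deblocking filter is adaptive and considerably more involved; this is an illustration of the idea only.

        #include <stdlib.h>

        /* Toy boundary smoothing across a vertical block edge in one pixel
         * row.  Pixels are filtered only when the step across the edge is
         * small, so genuine image edges are preserved.  This is NOT the
         * H.264/AVC deblocking filter. */
        void smooth_vertical_edge(unsigned char *row, int edge_col, int threshold)
        {
            unsigned char p1 = row[edge_col - 2], p0 = row[edge_col - 1];
            unsigned char q0 = row[edge_col],     q1 = row[edge_col + 1];

            if (abs((int)p0 - (int)q0) >= threshold)
                return;              /* likely a real edge: leave it alone */

            /* simple low-pass across the boundary */
            row[edge_col - 1] = (unsigned char)((p1 + 2 * p0 + q0 + 2) >> 2);
            row[edge_col]     = (unsigned char)((p0 + 2 * q0 + q1 + 2) >> 2);
        }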

    1.2 Book Organization

    This book describes a VLSI implementation of a hardware H.264/AVC encoder, as depicted in Fig. 1.1.

    Fig. 1.1 Top-level block diagram of the proposed design

    In Chap. 2, we present intra prediction. Intra prediction is the first process of H.264/AVC intra encoding. It predicts a macroblock by referring to its neighboring macroblocks in order to eliminate spatial redundancy. There are 17 prediction modes for a macroblock: nine modes for each of the sixteen luma 4 × 4 blocks, four modes for a luma 16 × 16 block, and four modes for each of the two chroma 8 × 8 blocks. Because the equations for generating prediction pixels are very similar across prediction modes, effective hardware resource sharing is the main design consideration. Moreover, there exists a long data-dependency loop among the luma 4 × 4 blocks during encoding. Increasing parallelism and skipping some modes are two popular methods for designing a high-performance architecture for high-end applications. However, increasing throughput requires more hardware area, and skipping modes degrades video quality. We will present a novel VLSI implementation for intra prediction in this chapter.

    In Chap. 3, we present integer motion estimation. Interframe prediction in H.264/AVC is carried out in three phases: integer motion estimation (IME), fractional motion estimation (FME), and motion compensation (MC). We will discuss these functions in Chaps. 3, 4, and 5, respectively. Because motion estimation in H.264/AVC supports variable block sizes and multiple reference frames, high computational complexity and huge data traffic are the main difficulties in VLSI implementation. Moreover, high-resolution video applications, such as HDTV, make these problems even more critical. Therefore, current VLSI designs usually adopt parallel architectures to increase total throughput and cope with the computational complexity, while many data-reuse schemes aim to increase the data-reuse ratio and, hence, reduce the required data traffic. We will introduce several key points of VLSI implementation for IME.

    In Chap. 4, we present fractional motion estimation. Motion estimation in H.264/AVC supports quarter-pixel precision and is usually carried out in two phases: IME and FME. We have discussed IME in Chap. 3. After IME finds an integer motion vector (IMV) for each of the 41 subblocks, FME performs a motion search around the refinement center pointed to by the IMV and further refines the 41 IMVs into fractional MVs (FMVs) of quarter-pixel precision. FME interpolates half-pixels using a six-tap filter and then quarter-pixels using a two-tap one. Nine positions are searched in both the half refinement (one integer-pixel search center pointed to by the IMV and eight half-pixel positions) and the quarter refinement (one half-pixel position and eight quarter-pixel positions). The position with the minimum residual error is chosen as the best match. FME can
