VLSI Design for Video Coding: H.264/AVC Encoding from Standard Specification to Chip
About this ebook
High-definition video requires substantial compression in order to be transmitted or stored economically. Advances in video coding standards from MPEG-1, MPEG-2, and MPEG-4 to H.264/AVC have provided ever-increasing coding efficiency, at the expense of great computational complexity, which can only be handled through massively parallel processing.
This book presents VLSI architectural design and chip implementation for high-definition H.264/AVC video encoding, taking a state-of-the-art video application through to a complete VLSI prototype verified on FPGA/ASIC. It will serve as an invaluable reference for anyone interested in VLSI design and high-level (EDA) synthesis for video.
VLSI Design for Video Coding - Youn-Long Steve Lin
Youn-Long Steve Lin, Chao-Yang Kao, Huang-Chih Kuo and Jian-Wen Chen, VLSI Design for Video Coding: H.264/AVC Encoding from Standard Specification to Chip, 1st ed., DOI 10.1007/978-1-4419-0959-6_1, © Springer Science+Business Media, LLC 2010
1. Introduction to Video Coding and H.264/AVC
Youn-Long Steve Lin¹ , Chao-Yang Kao¹, Huang-Chih Kuo¹ and Jian-Wen Chen¹
(1)
Dept. Computer Science, National Tsing Hua University, 101 Kuang Fu Road Section 2, HsinChu, 300, Taiwan R.O.C.
Abstract
A video signal is represented as a sequence of frames of pixels. It contains a vast amount of redundant information that can be eliminated with video compression technology so that its transmission and storage become more efficient. To facilitate interoperability between compression at the video-producing source and decompression at the consumption end, several generations of video coding standards have been defined and adopted. For low-end applications, software solutions are adequate; for high-end applications, dedicated hardware solutions are needed. This chapter gives an overview of the principles behind video coding in general and the advanced features of the H.264/AVC standard in particular. It serves as an introduction to the remaining chapters, each of which covers an important coding tool and its VLSI architectural design in an H.264/AVC encoder.
1.1 Introduction
A video encoder takes as its input a video sequence, performs compression, and then produces as its output a bit-stream that can be decoded back to a video sequence by a standard-compliant video decoder.
A video signal is a sequence of frames. Its frame rate is defined as the number of frames per second (fps). For typical consumer applications, 30 fps is adequate; however, it can be as high as 60 or 72 fps for very high-end applications, or as low as 10 or 15 fps for video conferencing over a low-bandwidth communication link.
A frame consists of a two-dimensional array of color pixels; its size is called the frame resolution. A standard-definition (SD) frame has 720 × 480 pixels, whereas a full high-definition (FullHD) one has 1,920 × 1,088. Many other frame sizes have been developed for various applications such as computer monitors.
A color pixel is composed of three elementary components: R, G, and B. Each component is digitized to 8 bits for consumer applications or 12 bits for high-end applications.
The data rate of a raw video signal is huge. For example, a 30-fps FullHD signal has a data rate of 30 × 1,920 × 1,088 × 3 × 8 ≈ 1.5 Gbps, which is impractical for today's communication or storage infrastructure.
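The arithmetic behind this figure can be checked directly with a throwaway calculation:

```python
# Raw data rate of 30-fps FullHD video: three 8-bit color
# components per pixel, no compression.
fps, width, height = 30, 1920, 1088
bits_per_pixel = 3 * 8
rate_bps = fps * width * height * bits_per_pixel
print(rate_bps / 1e9)  # ~1.5 Gbps
```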
Fortunately, by taking advantage of the characteristics of the human visual system and the redundancy in the video signal, we can compress the data by two orders of magnitude without sacrificing the quality of the decompressed video.
1.1.1 Basic Coding Unit
In order for a video encoding or decoding system to handle video of different resolutions and to simplify the implementation, a basic unit size of 16 × 16 pixels has been widely adopted. Every mainstream coding standard from MPEG-1 and MPEG-2 to H.264 has chosen a macroblock of 16 × 16 pixels as its basic unit of processing. Hence, for video of different resolutions, we simply process a different number of macroblocks. For a 720 × 480 SD frame, we process 45 × 30 macroblocks, while for a FullHD frame we process 120 × 68 macroblocks.
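The macroblock counts above follow directly from dividing the frame dimensions by 16; a small check:

```python
# Macroblocks per frame: frame dimensions divided by the 16x16 unit size.
def mb_count(width, height, mb_size=16):
    return (width // mb_size, height // mb_size)

print(mb_count(720, 480))    # SD: 45 x 30 macroblocks
print(mb_count(1920, 1088))  # FullHD: 120 x 68 macroblocks
```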
1.1.2 Video Encoding Flow
Algorithm 1.1 depicts a typical flow of video encoding. frame(t) is the current frame to be encoded, frame′(t−1) is the reconstructed frame used for referencing (also called the reference frame), and frame′(t) is the reconstructed current frame. We encode frame(t) one macroblock (MB) at a time, starting from the leftmost MB of the topmost row. We call the MB being encoded Curr_MB. It can be encoded in one of three modes: I for intra prediction, P for unidirectional interprediction, and B for bidirectional interprediction. The MB resulting from prediction is called Pred_MB, and the difference between Curr_MB and Pred_MB is called Res_MB, for residuals. Res_MB goes through a space-to-frequency transformation and then quantization to become Res_Coef, the residual coefficients. Entropy coding then compresses Res_Coef to produce the final bit-stream. In order to prepare the reconstructed current frame for future reference, we perform inverse quantization and an inverse transformation on Res_Coef to obtain reconstructed residuals called Reconst_res. Adding Reconst_res and Pred_MB together, we obtain the reconstructed MB for insertion into frame′(t).
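The data flow just described can be sketched in Python. This is only a toy illustration, not Algorithm 1.1 itself: the transform is the identity, the quantizer is a uniform scalar one with an arbitrary step, entropy coding is omitted, and macroblocks are flattened to short lists.

```python
QP = 8  # illustrative uniform quantization step (not an H.264 QP)

def encode_mb(curr_mb, pred_mb):
    # residual: Res_MB = Curr_MB - Pred_MB
    res = [c - p for c, p in zip(curr_mb, pred_mb)]
    # "transform" (identity, for brevity) + quantization -> Res_Coef
    coef = [round(r / QP) for r in res]
    # inverse quantization + inverse "transform" -> Reconst_res
    rec_res = [c * QP for c in coef]
    # reconstructed MB, inserted into frame'(t) for future reference
    rec_mb = [p + r for p, r in zip(pred_mb, rec_res)]
    return coef, rec_mb

curr = [100, 104, 96, 101]   # Curr_MB (just 4 pixels for brevity)
pred = [98, 98, 98, 98]      # Pred_MB from intra/inter prediction
coef, rec = encode_mb(curr, pred)
```

Note that `rec` differs from `curr`: quantization is irreversible, which is exactly the distortion discussed in Sect. 1.1.11.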
1.1.3 Color Space Conversion
Naturally, each pixel is composed of 8-bit R, G, and B components. By applying the following conversion, it can be represented as one luminance (luma) component Y and two chrominance (chroma) components Cb and Cr. Since the human visual system is more sensitive to the luminance component than to the chrominance ones, we can subsample Cb and Cr to reduce the amount of data without sacrificing video quality. Usually, one-out-of-two or one-out-of-four subsampling is applied; the former is called the 4:2:2 format and the latter the 4:2:0 format. In this book, we assume that the 4:2:0 format is chosen. Of course, the inverse conversion recovers the R, G, B components from a set of Y, Cb, Cr components.
$$\begin{array}{rl} Y & = 0.299R + 0.587G + 0.114B, \\ \mathrm{Cb}& = 0.564(B - Y ), \\ \mathrm{Cr}& = 0.713(R - Y ).\end{array}$$(1.1)
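Equation (1.1) translates directly into code; a minimal sketch:

```python
# RGB -> YCbCr conversion per Eq. (1.1): Y is luma, Cb/Cr are chroma.
def rgb_to_ycbcr(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 0.564 * (b - y)
    cr = 0.713 * (r - y)
    return y, cb, cr

# a pure gray pixel carries no chroma: Cb and Cr come out (near) zero
y, cb, cr = rgb_to_ycbcr(128, 128, 128)
```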
1.1.4 Prediction of a Macroblock
A macroblock M has 16 × 16 = 256 pixels. It takes 256 × 3 = 768 bytes to represent it in RGB format and 256 × (1 + 1∕4 + 1∕4) = 384 bytes in 4:2:0 format. If, during decoding, we can find a macroblock M′ that is similar to M, then we only have to obtain from the encoding end the difference between M and M′. If M and M′ are very similar, the difference becomes very small, and so does the amount of data that needs to be transmitted/stored. Another way to interpret similarity is redundancy. There exist two types of redundancy: spatial and temporal. Spatial redundancy results from the similarity between a pixel (region) and its surrounding pixels (regions) within a frame. Temporal redundancy results from the slow change of video contents from one frame to the next. Redundant information can be identified and removed with prediction tools.
1.1.5 Intraframe Prediction
In an image region with smooth change, a macroblock is likely to be similar to its neighboring macroblocks in color or texture. For example, if all its neighbors are red, we can predict that a macroblock is also red. Generally, we can define several prediction functions; each takes pixel values from neighboring macroblocks as its input and produces a predicted macroblock as its output. To carry out intraframe prediction, every function is evaluated and the one resulting in the smallest error is chosen. Only the function type and the error need to be encoded and stored/transmitted. This tool is also called intra prediction and a prediction function is also called a prediction mode.
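The mode-decision idea can be sketched as follows. The three modes shown (vertical, horizontal, DC) and the sum-of-absolute-differences (SAD) cost are illustrative choices for a 4 × 4 block, not the full H.264/AVC mode set:

```python
# Toy intra-prediction mode decision: evaluate every prediction
# function against the actual block and keep the one with minimum SAD.
def sad(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def intra_predict(block, top, left):
    # block: flat 4x4 block; top/left: the 4 neighbors above / to the left
    n = 4
    preds = {
        "vertical":   [top[j] for i in range(n) for j in range(n)],
        "horizontal": [left[i] for i in range(n) for j in range(n)],
        "dc":         [round(sum(top + left) / (2 * n))] * (n * n),
    }
    return min(preds.items(), key=lambda kv: sad(kv[1], block))

block = [10] * 16                 # flat block, same as its top neighbors
top, left = [10, 10, 10, 10], [50, 50, 50, 50]
mode, pred = intra_predict(block, top, left)
# vertical prediction matches exactly, so only the mode id (and a
# zero residual) would need to be encoded
```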
1.1.6 Interframe Prediction
Interframe prediction, also called interprediction, identifies temporal redundancy between neighboring frames. We call the frame currently being processed the current frame and the neighboring one the reference frame. We try to find from the reference frame a reference macroblock that is very similar to the current macroblock of the current frame. The process is called motion estimation. A motion estimator compares the current macroblock with candidate macroblocks within a search window in the reference frame. After finding the best-matched candidate macroblock, only the displacement and the error need to be encoded and stored/transmitted. The displacement from the location of the current macroblock to that of the best candidate block is called motion vector (MV). In other words, motion estimation determines the MV that results in the smallest interprediction error. A bigger search window will give better prediction at the expense of longer estimation time.
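A minimal full-search motion estimator following this description might look like the sketch below; the block size, search range, and SAD cost are illustrative choices:

```python
# Exhaustive (full-search) motion estimation: try every displacement
# (dx, dy) within the search window and keep the minimum-SAD candidate.
def sad_block(cur, ref, bx, by, dx, dy, n):
    return sum(abs(cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j])
               for i in range(n) for j in range(n))

def motion_search(cur, ref, bx, by, n, r):
    best = None
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # skip candidates falling outside the reference frame
            if not (0 <= by + dy and by + dy + n <= len(ref)
                    and 0 <= bx + dx and bx + dx + n <= len(ref[0])):
                continue
            cost = sad_block(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best  # (minimum SAD, motion vector)

ref = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for i in range(2):
    for j in range(2):
        ref[2 + i][2 + j] = 9   # bright object in the reference frame
        cur[2 + i][3 + j] = 9   # same object, shifted right by one pixel
cost, mv = motion_search(cur, ref, bx=3, by=2, n=2, r=2)
# best match lies one pixel to the left in the reference: MV = (-1, 0)
```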
1.1.7 Motion Vector
An MV obtained from motion estimation is adequate for retrieving a block from the reference frame. Yet, we do not have to encode/transmit the whole of it, because there exists similarity (or redundancy) among the MVs of neighboring blocks. Instead, we can compute a motion vector prediction (MVP) as a function of the neighboring blocks' MVs and process only the difference between the MV and its MVP, called the motion vector difference (MVD). In most cases, the MVD is much smaller than its associated MV.
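A common MVP function, and the one H.264/AVC uses in its typical case, is the component-wise median of the MVs of three neighboring blocks (left, top, top-right); a sketch with the neighbor-selection details omitted:

```python
# MVP as the component-wise median of three neighbors' MVs;
# only MVD = MV - MVP needs to be entropy coded.
def median3(a, b, c):
    return sorted((a, b, c))[1]

def mv_predict(mv_left, mv_top, mv_topright):
    return (median3(mv_left[0], mv_top[0], mv_topright[0]),
            median3(mv_left[1], mv_top[1], mv_topright[1]))

mv = (5, -2)                                 # MV found by motion estimation
mvp = mv_predict((4, -2), (6, -1), (5, -3))  # neighboring blocks' MVs
mvd = (mv[0] - mvp[0], mv[1] - mvp[1])       # (0, 0): nothing left to send
```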
1.1.8 Prediction Error
We call the difference between the current macroblock and the predicted one the prediction error. It is also called the residual error, or simply the residual.
1.1.9 Space-Domain to Frequency-Domain Transformation of Residual Error
The residual error is in the space domain and can be represented in the frequency domain by applying the discrete cosine transform (DCT). The DCT can be viewed as representing an image block with a weighted sum of elementary patterns; the weights are termed coefficients. For computational feasibility, a macroblock of residual errors is usually divided into smaller 4 × 4 or 8 × 8 blocks before the DCT is applied to each block in turn.
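A direct (non-fast) implementation of the 2-D DCT on a 4 × 4 block, straight from the textbook definition; H.264/AVC actually uses an integer approximation, but the principle is the same:

```python
import math

# Separable 2-D DCT-II of an n x n block (n = 4 here). A flat block
# concentrates all of its energy in the DC coefficient out[0][0].
def dct_2d(block):
    n = len(block)
    def c(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = sum(block[i][j]
                    * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                    * math.cos((2 * j + 1) * v * math.pi / (2 * n))
                    for i in range(n) for j in range(n))
            out[u][v] = c(u) * c(v) * s
    return out

flat = [[8] * 4 for _ in range(4)]   # constant residual block
coef = dct_2d(flat)
# DC = (1/2)*(1/2)*sum(block) = 0.25 * 128 = 32; all AC terms ~ 0
```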
1.1.10 Coefficient Quantization
Coefficients generated by the DCT carry image components of various frequencies. Since the human visual system is more sensitive to low-frequency components and less sensitive to high-frequency ones, we can treat them with different resolutions by means of quantization. Quantization effectively discards certain least significant bits (LSBs) of a coefficient. By giving smaller quantization steps to low-frequency components and larger quantization steps to high-frequency ones, we can reduce the amount of data without sacrificing visual quality.
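The idea can be sketched with a hypothetical step-size matrix that grows with frequency; real codecs derive their step sizes from a quantization parameter rather than this ad-hoc table:

```python
# Scalar quantization with frequency-dependent steps: fine steps for
# low frequencies (small u+v), coarse steps for high frequencies.
STEPS = [[4 + 4 * (u + v) for v in range(4)] for u in range(4)]

def quantize(coef):
    return [[round(coef[u][v] / STEPS[u][v]) for v in range(4)]
            for u in range(4)]

def dequantize(q):
    return [[q[u][v] * STEPS[u][v] for v in range(4)]
            for u in range(4)]

# strong DC component plus small AC components everywhere else
coef = [[100 if (u, v) == (0, 0) else 10 for v in range(4)]
        for u in range(4)]
q = quantize(coef)
# the DC coefficient survives (100/4 -> 25, reconstructs exactly),
# while the highest-frequency one (step 28) is rounded away to zero
```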
1.1.11 Reconstruction
Both the encoding and decoding ends have to reconstruct the video frames. At the encoding end, the reconstructed frame, rather than the original one, should be used as the reference, because no original frame is available at the decoding end. To reconstruct, we perform inverse quantization and an inverse DCT to obtain the reconstructed residual. Note that the reconstructed residual is not identical to the original residual, since quantization is irreversible; this is where distortion is introduced. We then add the prediction data to the reconstructed residual to obtain the reconstructed image. For an intrapredicted macroblock, we apply the prediction function to its neighboring reconstructed macroblocks, while for an interpredicted one we perform motion compensation. Both methods give a reconstructed version of the current macroblock.
1.1.12 Motion Compensation
Given an MV, the motion compensator retrieves from the reference frame a reconstructed macroblock pointed to by the integer part of the MV. If the MV has a fractional part, it performs interpolation over the retrieved image to obtain the final reconstructed image. Usually, interpolation is done in two steps: first to half-pixel accuracy and then to quarter-pixel accuracy.
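As a concrete sketch of the two interpolation steps: H.264/AVC generates half-pixel samples with a six-tap filter with taps (1, −5, 20, 20, −5, 1) and quarter-pixel samples by two-tap bilinear averaging. A one-dimensional illustration:

```python
# Half-pixel sample via the H.264/AVC six-tap filter, rounded and
# clipped to the 8-bit range; quarter-pixel via a two-tap average.
def clip255(x):
    return max(0, min(255, x))

def half_pel(e, f, g, h, i, j):
    # half-sample located between integer samples g and h
    return clip255((e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5)

def quarter_pel(a, b):
    # two-tap bilinear average with rounding
    return (a + b + 1) >> 1

row = [10, 10, 10, 50, 50, 50]   # a step edge in one reference row
h = half_pel(*row)               # half-pixel on the 10 -> 50 edge
q = quarter_pel(10, h)           # quarter-pixel between 10 and h
```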
1.1.13 Deblocking Filtering
After every macroblock of a frame has been reconstructed, we obtain a reconstructed frame. Since the encoding/decoding process is done macroblock-wise, blocking artifacts appear at the boundaries between adjacent macroblocks or subblocks. A deblocking filter is used to eliminate these artificial edges.
1.2 Book Organization
This book describes a VLSI implementation of a hardware H.264/AVC encoder as depicted in Fig. 1.1.
Fig. 1.1
Top-level block diagram of the proposed design
In Chap. 2, we present intra prediction, the first process of H.264/AVC intra encoding. It predicts a macroblock by referring to its neighboring macroblocks in order to eliminate spatial redundancy. There are 17 prediction modes for a macroblock: nine modes for each of the 16 luma 4 × 4 blocks, four modes for the luma 16 × 16 block, and four modes for each of the two chroma 8 × 8 blocks. Because there is great similarity among the equations generating prediction pixels across prediction modes, effective hardware resource sharing is the main design consideration. Moreover, there exists a long data-dependency loop among the luma 4 × 4 blocks during encoding. Increasing parallelism and skipping some modes are two popular methods for designing a high-performance architecture for high-end applications. However, increasing throughput requires more hardware area, and skipping modes degrades video quality. We will present a novel VLSI implementation of intra prediction in this chapter.
In Chap. 3, we present integer motion estimation. Interframe prediction in H.264/AVC is carried out in three phases: integer motion estimation (IME), fractional motion estimation (FME), and motion compensation (MC). We will discuss these functions in Chaps. 3, 4, and 5, respectively. Because motion estimation in H.264/AVC supports variable block sizes and multiple reference frames, high computational complexity and huge data traffic are the main difficulties in VLSI implementation. Moreover, high-resolution video applications, such as HDTV, make these problems even more critical. Therefore, current VLSI designs usually adopt parallel architectures to increase the total throughput and cope with the high computational complexity. In addition, many data-reuse schemes try to increase the data-reuse ratio and, hence, reduce the required data traffic. We will introduce several key points of VLSI implementation for IME.
In Chap. 4, we present fractional motion estimation. Motion estimation in H.264/AVC supports quarter-pixel precision and is usually carried out in two phases: IME and FME. We discussed IME in Chap. 3. After IME finds an integer motion vector (IMV) for each of the 41 subblocks, FME performs a motion search around the refinement center pointed to by the IMV and further refines the 41 IMVs into fractional MVs (FMVs) of quarter-pixel precision. FME interpolates half-pixels using a six-tap filter and then quarter-pixels using a two-tap one. Nine positions are searched in both the half refinement (the integer-pixel search center pointed to by the IMV and eight half-pixel positions) and the quarter refinement (one half-pixel position and eight quarter-pixel positions). The position with the minimum residual error is chosen as the best match. FME can