VIDEO ENCODING 101

Improve quality and keep your bandwidth down

Maximum PC - FRONT PAGE - BY NEIL MOHR

A NEW GENERATION of video encoders is upon us! Some sound familiar, such as H.265, some sound exotic and new, like VP9 or AV1, but no matter what, you’re likely to be tripping over one or the other soon, and cursing that your CPU use has shot through the roof.

Why do we need a raft of new encoders? Why has your hardware acceleration stopped working? Is that 4K file going to play on your 4K UHDTV? What about HDR on the PC? How can you encode your own video, and can you even enjoy some GPGPU acceleration, too? Can we make it all simple for your perplexed brain?

We’re going to take a deep dive into the inner workings of video encoders to see how gigabytes of data can be compressed 100 times over without the human eye being able to tell the difference. It’s going to be like the math degree you never got around to taking at community college, but probably a lot more useful!

If you want to control your encodes better, it really does help to have an understanding of what’s going on inside the encoder. Ironically, none of these standards defines the encoder, just what’s required of the decoder. As we’ll see, this enables the encoder writer to use all manner of techniques and user settings to enhance encodes, usually trading longer encode times and more processor usage for either a smaller file or enhanced quality.

ONCE UPON A TIME, there were animated GIFs, and boy, those things were terrible. You can see the problem: People wanted to store moving images on their computers, and a fixed 256-color palette standard that stored full frames was not the future; this was back in 1987.

What we’ve had since then is 30 years of steady encoding enhancements from a well-organized international standards committee. Well, up until the last decade, but we’ll come to that.

We’re not going to dwell on the history, but it’s worth noting a couple of things that have fixed naming conventions and systems in place. Before moving images, there were still images. The first working JPEG code was released at the end of 1991, and the standard in 1992, out of the work by the Joint Photographic Experts Group, which started working on JPEG in 1986.

Similarly, the International Telecommunication Union (ITU) formed the Video Coding Experts Group (VCEG) way back in 1984 to develop an international digital video standard for teleconferencing. That first standard was H.261, finalized in 1990, and designed for sending moving images at a heady 176x144 resolution over 64kb/s ISDN links. From it, H.262, H.263, H.264, and H.265 were all born, and after H.261, the next video standard to be worked on became known as MPEG-1.

So, you can see where these standards came from. We’ve mentioned JPEG for two reasons: First, MPEG (the Moving Picture Experts Group) was formed after the success of JPEG. Second, it used some of the same techniques to reduce encoded file size, which marked a huge shift from the existing lossless digital storage to lossy techniques.

One of the most important things to understand about modern lossy compression is how the image is handled and compressed. We’re going to start with the basics, and how JPEG compression [Image A] works—it’s at the heart of the whole industry—and expand that knowledge into motion pictures and MPEG. Once we’ve got that under our belt, we’ll see how these were improved and expanded to create the H.264 and H.265 we have today.

SPACE: THE COLORFUL FRONTIER

The first thing to get your head around is the change in color space. You’re probably used to thinking of everything on PCs being stored in 16-, 24-, or 32-bit color, spread over the red, green, and blue channels. This puts equal emphasis on storing all the color, hue, and brightness information. The fact is, the human eye is far more sensitive to brightness, aka luminance, than anything else, then hue, and finally color saturation.

The color space is changed from RGB to YCbCr, aka a luminance channel plus two chrominance channels (blue-difference and red-difference here)—if you’re wondering, this is related to YUV. The reason is to enable chroma subsampling of the image; that is, we’re going to reduce the resolution of the Cb and Cr channels to save space without any perceivable drop in quality.

This is expressed as a ratio of the luminance samples to the chroma samples. The highest quality—the equivalent of full RGB—would be 4:4:4, with every pixel having Y, Cb, and Cr data; it’s only used in high-end video equipment. The widely used 4:2:2 has Y for every pixel, but halves the horizontal resolution of the chroma data; this is used on most video hardware. 4:2:0 keeps the horizontal resolution halved, but also skips every other line vertically [Image B]—this is the encoding used in MPEG, every H.26x standard, DVD/Blu-ray, and more. The reason? Compared to 4:4:4, it halves the bandwidth required.
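To make that concrete, here’s a minimal sketch in Python, assuming NumPy is available: it converts an RGB image to YCbCr using the common full-range BT.601/JFIF weightings, then averages each 2x2 block of the chroma planes to mimic 4:2:0. The function names are ours, and real encoders use properly specified filtering and chroma siting rather than a plain average.

import numpy as np

def rgb_to_ycbcr(rgb):
    # Full-range BT.601/JFIF conversion: Y carries brightness, Cb/Cr carry color.
    rgb = rgb.astype(np.float32)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return y, cb, cr

def subsample_420(plane):
    # 4:2:0: halve the chroma plane in both directions by averaging each 2x2 block.
    h, w = plane.shape
    return plane[:h // 2 * 2, :w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# A 16x16 test image: Y stays 16x16, Cb and Cr shrink to 8x8, so we keep
# 16*16 + 2*8*8 = 384 samples instead of 3*16*16 = 768 -- half the data.
img = np.random.randint(0, 256, (16, 16, 3), dtype=np.uint8)
y, cb, cr = rgb_to_ycbcr(img)
print(y.shape, subsample_420(cb).shape, subsample_420(cr).shape)  # (16, 16) (8, 8) (8, 8)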

THE SCIENCE BIT

Now the heavy math kicks in, so prepare your gray matter…. For each channel—Y, Cb, and Cr—the image is split into 8x8 pixel blocks (in video, these sit inside the larger macroblocks we’ll meet shortly). Each block is processed with a forward discrete cosine transform (DCT)—of the type two, in fact. “What the heck is that?” we hear you ask. We unleashed our tamed Maximum PC mathematician from his box, who said it was pretty simple. Well, he would! Then he scurried off, and installed Arch Linux on all our PCs. Gah!

The DCT process transforms the discrete spatial-domain pixels into a frequency-domain representation. But of what? Take a look at the 8x8 cosine grid—the black and white interference-like patterns [Image C]. The DCT spits out an 8x8 matrix of numbers; each number represents how closely the original 8x8 image block matches the corresponding basis function element. Combine the basis functions with the weightings specified, and you get a close enough match to the original that the human eye can’t tell the difference.

As you’ll spot, the top-left basis function element is plain (described as low frequency), and describes the overall intensity of the 8x8 grid. The progressively higher-frequency elements play less and less of a role in describing the overall pattern. The idea being that we’re ultimately able to drop much of the high-frequency data (the fine detail) without any loss of perceived quality.
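If you want to see the transform in action, here’s a rough sketch, assuming NumPy and SciPy are installed (SciPy’s dctn defaults to the type-II DCT we’re talking about); the variable names are ours.

import numpy as np
from scipy.fft import dctn, idctn  # type-II DCT by default

# One 8x8 single-channel block (0-255), level-shifted around zero as JPEG does.
block = np.random.randint(0, 256, (8, 8)).astype(np.float32) - 128.0

# Forward 2-D DCT: an 8x8 grid of frequency coefficients.
coeffs = dctn(block, norm='ortho')

# coeffs[0, 0] is the low-frequency "DC" term that sets the block's overall
# intensity; entries toward the bottom-right describe progressively finer detail.
print(np.round(coeffs, 1))

# The transform on its own loses nothing: the inverse DCT gets the block back.
assert np.allclose(idctn(coeffs, norm='ortho'), block, atol=1e-3)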

This DCT matrix is run through an 8x8 quantizer matrix; standard matrices are provided (based on perceptual research), but software encoders can create their own for, say, 50 percent quality, as they go along. The quantizer is devised to retain low-frequency detail—which the eye sees more easily—over high-frequency detail. Each DCT element is divided by the corresponding quantizer element, and the result rounded to the nearest integer. The end result is a matrix with the low-frequency elements remaining in the top-left, and more and more of the high-frequency elements reduced to zero toward the bottom-right; this rounding is where JPEG actually throws information away, making it lossy [Image D].
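As a sketch of that divide-and-round step (continuing from the DCT snippet above), the table here is the example luminance quantizer from the JPEG spec (ITU-T T.81, Annex K), often described as roughly 50 percent quality; the helper names are ours.

import numpy as np

# Example luminance quantization table from the JPEG spec (ITU-T T.81, Annex K).
# Bigger divisors toward the bottom-right squash high-frequency detail harder.
Q50 = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
])

def quantize(coeffs, q):
    # Divide each DCT coefficient by its quantizer entry and round to an integer.
    # This rounding is the lossy step: small high-frequency terms become zero.
    return np.round(coeffs / q).astype(np.int32)

def dequantize(qcoeffs, q):
    # The decoder simply multiplies back; whatever was rounded away stays lost.
    return qcoeffs * q

quantized = quantize(coeffs, Q50)  # 'coeffs' from the DCT sketch above
print(np.count_nonzero(quantized), "of 64 coefficients survive")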

Due to this top-left weighting, the matrix is rearranged into a list using a “zig-zag” pattern, processing the matrix in diagonals, starting top-left, which pushes most of the zeros to the end [Image E]. Huffman encoding then compresses this final data losslessly. The quality of the JPEG image is controlled by how aggressive the quantizer matrix is: The more strongly it rounds elements down to zero, the more artifacts. Congratulations—you’ve compressed a still image!
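Here’s a quick sketch of that zig-zag reordering, again with our own helper name; real JPEG encoders then run-length code the zero runs before the Huffman stage.

import numpy as np

def zigzag(block):
    # Walk the 8x8 matrix along its anti-diagonals from the top-left, alternating
    # direction, so the mostly zero high-frequency entries pile up at the end.
    coords = [(r, c) for r in range(8) for c in range(8)]
    coords.sort(key=lambda rc: (rc[0] + rc[1],
                                rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))
    return np.array([block[r, c] for r, c in coords])

scanned = zigzag(quantized)  # 'quantized' from the sketch above
print(scanned)  # a long tail of zeros, which run-length and Huffman coding handle cheaply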

If JPEG is so good at compressing, why not stick a load together to make a moving image format? Well, you can—it’s called Motion JPEG—but it’s hugely inefficient compared to the moving picture encoding schemes.

MOTION COMPRESSION

For MPEG (and other compression schemes), each frame is split into macroblocks of 16x16 pixels. Why not the 8x8 that’s used by JPEG, and indeed used within the MPEG standard? Well, MPEG uses the 4:2:0 color space, which halves the chroma resolution both horizontally and vertically, so a 16x16 patch of luma lines up with one 8x8 block each of Cb and Cr, and everything still divides neatly into the 8x8 blocks the DCT works on.
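A quick back-of-the-envelope check of why 16x16 works out so neatly under 4:2:0 (plain arithmetic, nothing encoder-specific):

# One 16x16 macroblock in 4:2:0: full-resolution luma, chroma halved both ways.
y_samples  = 16 * 16         # 256 luma samples -> four 8x8 DCT blocks
cb_samples = (16 // 2) ** 2  # 64 Cb samples    -> one 8x8 DCT block
cr_samples = (16 // 2) ** 2  # 64 Cr samples    -> one 8x8 DCT block
print(y_samples + cb_samples + cr_samples, "samples, six 8x8 blocks per macroblock")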

The MPEG standard uses three types of frame to create a moving image. I (intra) frames are effectively a full-frame JPEG image. P (predicted) frames store only the difference between themselves and the previous I- or P-frame. In between these is a series of B (bidirectional, or inter) frames, which store motion vector details for each macroblock—B-frames are able to reference both earlier and upcoming frames.

So, what’s happening with motion vectors and the macroblocks? The encoder compares the current frame’s macroblock with nearby areas of the anchor frame, which can be an I- or P-frame, searching up to a predefined radius around the macroblock’s position. If a match is found, the motion vector—direction and distance data—is stored in the B-frame. The decoding of this is called motion compensation, aka MC.

However, the prediction won’t match exactly with the actual frame, so prediction error data is also stored. The larger the error, the more data has to be stored, so an efficient encoder must perform good motion estimation, aka ME.
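To give a feel for motion estimation, here’s a deliberately naive sketch in Python with NumPy: exhaustive block matching scored by sum of absolute differences (SAD) over a small search radius. The function and frame names are ours, and production encoders use far smarter search patterns, sub-pixel refinement, and rate-distortion tricks.

import numpy as np

def motion_estimate(ref, cur, top, left, size=16, radius=8):
    # Find the offset (dy, dx) into the reference (anchor) frame whose block best
    # matches the current macroblock, scored by sum of absolute differences (SAD).
    block = cur[top:top + size, left:left + size].astype(np.int32)
    best_mv, best_sad = (0, 0), None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + size > ref.shape[0] or c + size > ref.shape[1]:
                continue  # candidate block would fall outside the frame
            cand = ref[r:r + size, c:c + size].astype(np.int32)
            sad = int(np.abs(block - cand).sum())
            if best_sad is None or sad < best_sad:
                best_mv, best_sad = (dy, dx), sad
    return best_mv, best_sad  # the motion vector, and how big the prediction error is

# Toy frames: the current frame is just the reference shifted 3 pixels to the right,
# so the best match for the block at (16, 16) sits 3 pixels to the left in the anchor.
ref = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
cur = np.roll(ref, 3, axis=1)
print(motion_estimate(ref, cur, 16, 16))  # motion vector (0, -3) with zero prediction error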

Because most macroblocks in the same frame have the same or similar motion vectors, these can be compressed well. In MPEG-1, P-frames store one motion vector per macroblock; B-frames are able to have two—one from the previous frame, and one from the future frame. The encoder packs groups of I, P, and B frames together out of order, and the decoder rebuilds the video in the correct order from the MPEG bitstream. As you can imagine, there’s scope for the encoder to implement a wide range of quality and efficiency variables, from the ratio of I, P, and B frames to its motion compensation efficiency.
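Here’s a tiny illustration of that reordering, using a made-up IBBP group of pictures; the exact pattern is up to the encoder.

# Display order of a short group of pictures: each B-frame needs the anchor
# that comes *after* it, so the encoder transmits the anchors first.
display_order = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
coded_order   = ["I0", "P3", "B1", "B2", "P6", "B4", "B5"]

# The decoder reads frames in coded order, then hands them back for display
# in the order given by the frame numbers carried in the bitstream.
assert sorted(coded_order, key=lambda f: int(f[1:])) == display_order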

At this point, we’ve covered the basics of a compressed MPEG-1 video. Audio is integrated into the bitstream, and MPEG-1 supports stereo MP3 audio. We’re not going to go into the audio compression. MPEG-2 (aka H.262) [Image F] uses largely the same system, but was enhanced to support a wider range of profiles, including DVD, interlaced video, higher-quality color spaces up to 4:4:4, and multi-channel audio. MPEG-3 was rolled into MPEG-2 with the inclusion of 1080p HD profiles.

MPEG-4: ZOOM AND ENHANCE

MPEG-4—best known in video terms as H.264, aka MPEG-4 Part 10/AVC—started off with the aim of enhancing the standard for digital streaming: getting better image quality from half the bitrate. But with the advent of Blu-ray and HD DVD (remember that?), the extension

Image captions: Here’s how the color channels are sub-sampled for different color spaces. You can see the JPEG compression struggling to retain detail at low-quality settings. Every image can be roughly made up from this. Math! Take the values of a single-channel 8x8 pixel grid into a matrix. DCT produces a frequency-domain matrix; we quantize this to get the final lossy compressed matrix. MeGUI, it’s all about me, me, me, me, me!
