Google Summer of Code 2019: Hardware-Accelerated Deshake Filter for FFmpeg
August 26, 2019
Back in March of this year I carved time out of my busy school and work schedule to dive into the FFmpeg codebase and explore working with OpenCL to write hardware-accelerated video filters. It was interesting work (you can read about my experience, as well as some information on how to write a filter of your own, here). I squeezed in a proposal to write a deshake filter using OpenCL just before the deadline in early April; it was accepted a month later, leading to a fantastic summer of research, learning, and writing exciting code with very visual results.
In this post I'll talk about what I did, provide an overview of the current state of the filter, and discuss some future steps that could be taken to make it even better.
The Code
If you came here looking for the code, look no further. It resides in these three commits to the FFmpeg repository:
- lavfi: add utilities to reduce OpenCL boilerplate code
- lavfi: modify avfilter_get_matrix to support separate scale factors
- lavfi: add deshake_opencl filter
What is Video Stabilization?
Videos captured handheld, or without appropriate gear, can have an excessive amount of shaking (whether high-frequency jitter or low-frequency erratic motion) that distracts the viewer from the actual subject of the clip. Video stabilization refers to the processes that have been developed to remove such motion and produce a more pleasing result for the videographer.
There are many different approaches and methods to perform such stabilization. Modern smartphones have both Optical Image Stabilization (OIS), a hardware feature that moves the camera sensor around to compensate for the motion of the user, and Electronic Image Stabilization (EIS), a software feature that utilizes data from sensors (such as the gyroscope) as well as data drawn from analysis of the video itself in order to warp individual frames and remove camera shake. Google has an excellent blog post on the topic that can be found here.
Many other cameras, however, don't have such technology built-in. Thus, it is important to have software available to stabilize videos after the fact, as a purely post-processing step in the pipeline. There are various methodologies within this sub-space of video stabilization to consider, including three-dimensional reconstruction of the camera path within a shot. All approaches, however, function the same way at a basic level: analyze the video to determine its motion, figure out which motion should not be present, and then remove said motion.
My Approach
The filter that I wrote this summer functions as follows at a high level:
For each video frame:
- Find corners (good points to track)
- Create "descriptors" for those points
- Match points to their locations in the previous frame using the descriptors
- Determine which matches are correct and represent the motion of the camera
- Use the correct matches to estimate an affine transform between each frame
- Decompose the transform into translation, rotation, and scale
The filter then buffers the motion data behind and ahead of the current frame being processed so as to be able to use a Gaussian filter on it (a filter that preserves low-frequency data).
Using the buffered data, the filter:
- Convolves the motion data with a Gaussian filter centered on the current frame
- Transforms the frame appropriately to eliminate the difference between where the frame is and where it should be according to the result of the convolution
Some of these points deserve more detailed explanation.
Finding Corners
In order to figure out the camera's motion, we have to be able to determine how far each frame has moved relative to the one before it. We accomplish this by finding certain points in each frame with characteristics that make them easy to locate reliably from frame to frame and to match with one another.
It turns out that corners make good candidates for such points: they have high contrast and are difficult to confuse with one another. Lines are very difficult to reliably choose the same point on (every point along a line looks extremely similar), while corners leave no such room for confusion.
In order to actually find corners in an image, I take the derivative of the image's grayscale representation (essentially the rate of change of the luminance at each point) and figure out which points have a high rate of change in both the x and y directions, which indicates a corner.
This computation outputs what's called the Harris response (a measure of "cornerness") at each point, which I then threshold and run through non-maximum suppression (gridding the image and keeping only the maximum value in each block of the grid) to pick out only the best corners.
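To make the idea concrete, here is a rough plain-C sketch of the standard Harris measure. It is not the filter's actual harris_response OpenCL kernel; it assumes the luminance derivatives are precomputed by the caller and that the window stays inside the image, and the parameter names are illustrative.

```c
/* Compute a Harris "cornerness" response at pixel (x, y), given
 * precomputed luminance derivatives ix and iy (e.g. from a Sobel
 * operator), stored row-major. The window must stay inside the image;
 * k is the usual Harris constant (around 0.04). */
static float harris_response(const float *ix, const float *iy, int width,
                             int x, int y, int radius, float k)
{
    float sxx = 0.0f, syy = 0.0f, sxy = 0.0f;

    /* Sum the gradient products (the structure tensor) over a small
     * window around the point. */
    for (int dy = -radius; dy <= radius; dy++) {
        for (int dx = -radius; dx <= radius; dx++) {
            float gx = ix[(y + dy) * width + (x + dx)];
            float gy = iy[(y + dy) * width + (x + dx)];
            sxx += gx * gx;
            syy += gy * gy;
            sxy += gx * gy;
        }
    }

    /* The response is large only when luminance changes strongly in
     * both directions, i.e. at a corner rather than along an edge. */
    float det   = sxx * syy - sxy * sxy;
    float trace = sxx + syy;
    return det - k * trace * trace;
}
```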
Point Descriptors
Figuring out how to match points to each other was particularly interesting to me. I had absolutely no clue how best to do it when I first set out to get it working.
As it would turn out, one excellent (and semi-recent) method for doing this is called a binary descriptor. I chose to use BRIEF (Binary Robust Independent Elementary Features) descriptors, but all of the various flavors function in essentially the same way. A descriptor is a bit vector where the value of each bit is determined by luminance comparisons between two pixels within a window around the point you're building a descriptor for. In order to determine which pixels to compare and ensure that the same pixels are compared every time, you first initialize a sampling pattern (in my case I simply use a random number generator with a fixed seed to choose pixels to compare).
Figuring out how similar two descriptors are to each other is then as simple as calculating the Hamming distance between them (accomplished by XORing one descriptor with the other and then counting the number of ones, which represent differing bits, in the result). A threshold on the number of ones allowed determines whether or not two points are considered the same.
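Here is a small plain-C sketch of the idea, not the filter's actual code; the descriptor length, struct names, and sampling-pattern handling are all illustrative.

```c
#include <stdint.h>

#define DESC_BITS 256  /* illustrative descriptor length */

/* One luminance comparison: the pixel at offset (dx1, dy1) vs the pixel
 * at (dx2, dy2) from the feature point. The pattern is generated once
 * with a fixed-seed RNG so every descriptor samples the same locations. */
typedef struct { int dx1, dy1, dx2, dy2; } SamplePair;

/* Build a BRIEF-style descriptor for the point (x, y) of a grayscale
 * image stored row-major. `desc` must be zero-initialized by the caller,
 * and the sampling window around (x, y) must lie inside the image. */
static void brief_descriptor(const uint8_t *luma, int width, int x, int y,
                             const SamplePair *pattern,
                             uint64_t desc[DESC_BITS / 64])
{
    for (int i = 0; i < DESC_BITS; i++) {
        const SamplePair *p = &pattern[i];
        uint8_t a = luma[(y + p->dy1) * width + (x + p->dx1)];
        uint8_t b = luma[(y + p->dy2) * width + (x + p->dx2)];
        if (a < b)
            desc[i / 64] |= 1ULL << (i % 64);
    }
}

/* Two descriptors match if their Hamming distance (the number of
 * differing bits, counted after an XOR) falls below a threshold.
 * __builtin_popcountll is a GCC/Clang builtin. */
static int descriptors_match(const uint64_t *a, const uint64_t *b, int threshold)
{
    int dist = 0;
    for (int i = 0; i < DESC_BITS / 64; i++)
        dist += __builtin_popcountll(a[i] ^ b[i]);
    return dist < threshold;
}
```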
This simple approach produces quite effective results, although it's not perfect (when a video cut occurs, mismatches happen more frequently than I would like). You can read more about it here.
Determining Correct Matches
Despite the usage of binary descriptors, some of the found point correspondences are still going to be incorrect. Additionally, some of the points may be on moving objects in the shot; such objects do not represent the camera's motion and need to be discarded.
This is accomplished using RANSAC (RANdom SAmple Consensus). RANSAC works by repeatedly choosing a random subset of the given data, calculating a model from each subset, and then figuring out how well each model represents the data as a whole. In the case of point matches, I use RANSAC to choose subsets of three point correspondences, calculate an affine transformation from each, and then determine how much error each transform produces when applied to all of the other points. This error is used both to filter out outliers (bad matches, moving objects) and, minimized across the inliers, to obtain the most accurate model possible.
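The sketch below shows the general shape of such a RANSAC loop in plain C. It is not the filter's implementation: the three-point affine solve, the inlier threshold, and the helper names are all illustrative, and a real implementation would typically re-fit the final model from all of the inliers rather than keeping the best sample as-is.

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

typedef struct { float x, y; } Point;

/* 3x3 determinant helper. */
static float det3(const float m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* Fit t = { a, b, tx, c, d, ty }, mapping (x, y) to
 * (a*x + b*y + tx, c*x + d*y + ty), from three correspondences via
 * Cramer's rule. Returns 0 if the points are (nearly) collinear. */
static int affine_from_3(const Point src[3], const Point dst[3], float t[6])
{
    float A[3][3] = {
        { src[0].x, src[0].y, 1.0f },
        { src[1].x, src[1].y, 1.0f },
        { src[2].x, src[2].y, 1.0f },
    };
    float d = det3(A);
    if (fabsf(d) < 1e-6f)
        return 0;

    for (int row = 0; row < 2; row++) {       /* x equations, then y */
        for (int col = 0; col < 3; col++) {
            float M[3][3];
            memcpy(M, A, sizeof(M));
            for (int i = 0; i < 3; i++)
                M[i][col] = row == 0 ? dst[i].x : dst[i].y;
            t[row * 3 + col] = det3(M) / d;
        }
    }
    return 1;
}

/* RANSAC over point matches: src[i] corresponds to dst[i], n matches in
 * total. Writes the best model to best_t and returns its inlier count. */
static int ransac_affine(const Point *src, const Point *dst, int n,
                         int iterations, float inlier_thresh, float best_t[6])
{
    int best_inliers = 0;

    for (int it = 0; it < iterations; it++) {
        Point s[3], d[3];
        float t[6];

        /* Randomly sample three correspondences (a real implementation
         * would avoid picking the same match twice). */
        for (int i = 0; i < 3; i++) {
            int idx = rand() % n;
            s[i] = src[idx];
            d[i] = dst[idx];
        }
        if (!affine_from_3(s, d, t))
            continue;

        /* Count how many matches this candidate model explains well. */
        int inliers = 0;
        for (int i = 0; i < n; i++) {
            float px = t[0] * src[i].x + t[1] * src[i].y + t[2];
            float py = t[3] * src[i].x + t[4] * src[i].y + t[5];
            if (hypotf(px - dst[i].x, py - dst[i].y) < inlier_thresh)
                inliers++;
        }
        if (inliers > best_inliers) {
            best_inliers = inliers;
            memcpy(best_t, t, sizeof(float) * 6);
        }
    }
    return best_inliers;
}
```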
Decomposing the Transform
In order to get simpler transformation data to work with I decompose the affine transformation into its individual parts (translation, rotation, scaling) using the QR method as described here.
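As an illustration of this step, here is a minimal plain-C sketch, not the filter's code, assuming a 2x3 matrix convention that maps (x, y) to (a*x + b*y + tx, c*x + d*y + ty); the struct name is illustrative and skew is ignored.

```c
#include <math.h>

/* Per-frame motion extracted from a 2x3 affine matrix laid out as
 * m = { a, b, tx, c, d, ty }. */
typedef struct {
    float tx, ty;      /* translation */
    float rotation;    /* radians */
    float scale_x, scale_y;
} DecomposedTransform;

static DecomposedTransform decompose_affine(const float m[6])
{
    DecomposedTransform out;
    float a = m[0], b = m[1], c = m[3], d = m[4];

    out.tx       = m[2];
    out.ty       = m[5];
    /* QR-style factorization: peel off the rotation first, then read
     * the per-axis scale factors from what remains. */
    out.rotation = atan2f(c, a);
    out.scale_x  = hypotf(a, c);
    out.scale_y  = (a * d - b * c) / out.scale_x;  /* det / scale_x */
    return out;
}
```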
This information then gets buffered and convolved with a Gaussian filter to smooth the camera's motion.
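To make the smoothing step concrete, here is a minimal plain-C sketch, not the filter's actual code, of convolving one buffered motion component (say, the x translation) with a Gaussian window centered on the current frame. As described earlier, the correction to apply is the difference between the smoothed path and the measured one; the window radius and sigma are illustrative parameters.

```c
#include <math.h>

/* Smooth one buffered motion component with a Gaussian window centered
 * on the current frame. `motion` holds raw per-frame values for
 * 2*radius + 1 frames, with the current frame at index `radius`.
 * Returns the correction to apply to the current frame. */
static float smoothed_correction(const float *motion, int radius, float sigma)
{
    float weight_sum = 0.0f, smoothed = 0.0f;

    for (int i = -radius; i <= radius; i++) {
        float w = expf(-(float)(i * i) / (2.0f * sigma * sigma));
        smoothed   += w * motion[radius + i];
        weight_sum += w;
    }
    smoothed /= weight_sum;

    return smoothed - motion[radius];
}
```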
Current State of the Filter
Here are some before / after comparisons showing the results that my filter produces on a couple of videos:
The smooth strength was set to a fixed value of 0.75 for this one:
The command used to create them:
./ffmpeg -i input.mp4 -init_hw_device opencl=gpu -filter_hw_device gpu -filter_complex "split[a][b]; [a]pad=iw*2:ih[src]; [b]format=yuv420p, hwupload, deshake_opencl, hwdownload, format=yuv420p[filt]; [src][filt]overlay=w" output.mp4
Here are the results that the CPU deshake filter produces for contrast:
The filter is reasonably performant (faster than the CPU deshake filter, rough numbers below) and, as observed above, generally produces better results. It provides a number of configurable settings, including the strength of the smoothing applied, and also supports debug output to view the point matches as well as statistics about transformations and kernel execution times.
Performance Numbers
Performance numbers were captured using the bridge video from the first example above.
Commands used:
time ./ffmpeg -i input.mp4 -init_hw_device opencl=gpu -filter_hw_device gpu -filter_complex "[0:v]format=yuv420p, hwupload, deshake_opencl, hwdownload, format=yuv420p" output.mp4
time ./ffmpeg -i input.mp4 -vf deshake output.mp4
On my dev machine with a 980 Ti and a 6700k on Arch Linux:
deshake_opencl filter:
real 0m14.717s
user 1m37.036s
sys 0m1.707s
deshake filter:
real 0m35.266s
user 1m51.744s
sys 0m0.307s
On a 2019 13-inch base model MacBook Pro with Iris Plus Graphics 645:
deshake_opencl filter:
real 0m29.969s
user 2m42.721s
sys 0m3.233s
deshake filter:
real 0m42.789s
user 2m13.579s
sys 0m1.213s
Future Work
The two largest improvements that could be made, in my opinion, are:
- Implement a non-linear least squares solver (the Levenberg-Marquardt algorithm would be one option) in order to more accurately determine a transform from the output of RANSAC (the points that are considered inliers). This would reduce the jitter in the output and produce even smoother videos. I had not yet studied linear algebra, so I wasn't at the mathematical level this summer to take this on.
- Improve the performance of the harris_response kernel. While the performance of the filter is already very good (especially on my NVIDIA GPU), it could be even better with some more optimization. The harris_response kernel in particular takes much longer than I believe it should on weaker hardware, likely due to poor memory access patterns (although confirming this is difficult due to the lack of supported profiling tools for OpenCL).
From there, there are millions of tiny improvements that could be made and an endless supply of potential techniques that could be implemented to improve results for specific subsets of videos. Video stabilization is a very open research topic and will likely stay that way for quite some time.
Final Thoughts
The Google Summer of Code program proved to be a fantastic experience and gave me the opportunity to learn and grow over the last few months in ways that I never would have been able to otherwise. Thanks to the generous pay, I was able to work much less at my part-time job in the service industry and focus on programming instead, which is one of the main goals of the program. Even better, I got to work on an exciting project with very visual results!
I want to extend my gratitude to the FFmpeg project, and in particular my mentor, Mark Thompson, for accepting my project proposal and guiding me throughout the summer as I worked my way toward the finish line.
If you are reading this as a student considering applying for a future year of GSoC, I would highly recommend that you do; the ability to work remotely with flexible hours as a student over the summer is priceless, and contributing to open-source software is a great way to ensure that many people are able to benefit from your work.
So long, GSoC 2019!