By: Kylee Wall, Workflow Supervisor
Simply put, a checksum is a method for detecting errors when copying a file from one place to another. When a file is copied or moved between two different storage media, an error can occur that is undetectable at a glance but can cause big issues later on, like two bits being swapped inside the file so that it no longer plays back.
A checksum can be a difficult concept to understand both in function and importance, only in part because it’s pretty boring on the surface. In my experience, many people in post production beyond on set media managers don’t really know what a checksum is — and it’s hard to blame them, because it’s not simple to investigate in the context of video production. If a checksum is done at all, it’s a button that spits out data that says “verified” or not, a not-that-interesting chore done in the process of getting media from one place to another in order to get to the good part. Who would ever want to know more?
But if your media is compromised, that's the end of the road. No assembling, no cutting, no finishing. That's a lot of trust to put in a boring button press that may as well be run by magic, for all you know right now. What is actually happening in there? Does verified actually mean verified all the time?
In video post production, a checksum is used to prevent as many of these copy errors from slipping by as possible. There are different checksum types that accomplish this error detection in their own ways, and with varying degrees of accuracy depending on, well, variables.
From there, things quickly get much trickier.
The concept of a checksum is complicated by its origins in cryptography. Its technology is used to verify data authenticity (“has this been maliciously tampered with?”) as well as data integrity (“did this copy to my hard drive as I expected?”). In post, we have other needs and methods for keeping media safe from potential harm, so complicated cryptography-related checksum algorithms offer us features we don’t use. An MD5 checksum has become industry standard, and studios that require checksums with media offload on set typically require it.
And further research leads you down two rabbit holes: a whole lot of forum posts by internet mathematicians arguing about equations, or a brief summary which blindly accepts checksum methods as “totally safe” with no actual explanation of how or why it works.
Besides the industry standard MD5, there are many different kinds of checksums. A newer checksum called “xxHash” has been recently(ish) introduced as a faster alternative to MD5 that does not check data authenticity, and data wranglers are pushing for its adoption while studios push back. Is this disruption on the checksum scene worth looking at, or is it an attempt to cut corners on media management?
“Maintaining the integrity of media” being the most important job description that any of us have at Sim Digital, I wasn’t happy with choosing between not-human-readable descriptions of algorithms by internet strangers and oversimplified blog posts that come down to “Hey, whatever, it works!” I want to know more about the processes that I’m trusting to check the bits I can’t see because I am not a robot, unfortunately.
It is important to note that none of these checksums are perfect. Each checksum brings with it variables that determine both safety and speed. The reality of post production today is MORE media, LESS time. How likely is a lapse in data integrity, and how can you balance risk with time? What are the most important considerations for video professionals who manage data?
What is a checksum? How does it work?
A checksum is a method for detecting errors in copying files from one place to another — from a camera card to a hard drive, or maybe from a portable drive to local shared storage. A checksum consists of a hash function, which “takes an item of a given type and generates an integer hash value within a given range.”
Again, in English: the hash function looks at a video clip bit by bit, and from that it generates a string of random letters and numbers to represent that clip, kind of like a fingerprint for the file. When a checksum happens, it’s comparing the string of letters and numbers — the hash value — for the clip on both the source and destination locations. Even one bit out of place will mean the hash value (or fingerprint) for the file will be completely different. If the hash values match, the checksum believes the data has been copied correctly.
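As a rough sketch of what that button press is doing under the hood, here's the fingerprint-and-compare step in Python using the standard library's hashlib. The function names and the chunked-read approach are mine, not any particular offload tool's:

```python
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Fingerprint a file by reading it in 1 MB chunks, so a huge
    clip never has to fit in RAM all at once."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()  # 32 hex characters representing 128 bits

def copy_verified(source_path, destination_path):
    """Compare the fingerprints on both sides of the copy."""
    return file_md5(source_path) == file_md5(destination_path)
```

If even one bit differs between the two files, the two hex strings come out completely different and the compare fails.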
A good hash function will almost never generate the same hash value from different inputs. The way the algorithm is programmed to work, your computer hardware, and the length of the hash value (measured in bits; a longer hash means more possible fingerprints) will determine the speed of the checksum, as well as the likelihood of a "hash collision."
What’s a hash collision?
In every hash function we’ll use, there is some possibility of unique inputs generating the same exact hash value. This is called a “collision.” If a collision occurs, that means two or more of the clips you’ve verified with a checksum have the same hash value, so one of them has not actually been verified. This file could have been damaged in the copy process, but there’s no way you’ll know because your checksum saw that the hash values matched on both sides, so it believes everything is fine.
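To see a collision happen for real, you can shrink the hash down to a toy size. The sketch below truncates MD5 to a single byte, giving only 256 possible fingerprints, then hashes made-up clip names (invented for illustration) until two of them collide:

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    # A deliberately terrible 8-bit hash: just the first byte of the MD5 digest.
    return hashlib.md5(data).digest()[0]

seen = {}
collision = None
for i in range(100_000):
    name = f"clip_{i:05d}.mov".encode()
    value = tiny_hash(name)
    if value in seen:
        collision = (seen[value], name, value)
        break
    seen[value] = name

print(collision)  # two different "clips" sharing the same tiny fingerprint
```

With only 256 buckets, a collision typically shows up within the first couple dozen names. Real checksums have astronomically more buckets, which is the whole game: the question is how many.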
(Hashing algorithms are incredibly complex, and while I want to understand how they work so I can make informed decisions about how best to utilize them to protect media, I know there are limits to my expertise. I believe that it’s enough for me to know they exist and that there are different algorithms that work different ways. In terms of collisions, I think it’s more important to talk about hash length, especially given that the checksum options available in most media offloading software have comparably “safe” algorithms.)
You already know a hash value contains a string of random-looking letters and numbers generated after the function looks at all the bits inside a video clip. What I didn't mention before is that these strings are different lengths depending on the checksum you choose. This "bit length" of a hash value can vary widely, from 32 bits to 256 bits or more. Each algorithm has a fixed output size: MD5 always produces 128 bits, SHA-256 produces 256 bits, and xxHash comes in 32-bit and 64-bit variants. Watch out for the difference between bits and displayed characters, though. A 128-bit MD5 hash is written as 32 hexadecimal characters, which is why you'll sometimes see it mislabeled as "32 bit."
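One quick way to keep bits and characters straight is to ask the algorithms themselves. Python's hashlib reports each digest's fixed size:

```python
import hashlib

for name in ("md5", "sha1", "sha256"):
    h = hashlib.new(name, b"some clip data")
    # digest_size is in bytes; one byte is 8 bits, or 2 hex characters
    print(f"{name}: {h.digest_size * 8} bits, {len(h.hexdigest())} hex characters")
```

That prints 128 bits / 32 characters for MD5, 160 / 40 for SHA-1, and 256 / 64 for SHA-256.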
What's the bit length of the checksum your media offloading software is using? Why does it matter? Setting aside the algorithm, there is a 1 in 1,000 chance of a hash collision in a 160-bit hash if you have 5.41×10^22 clips. With a 32-bit hash, you reach that probability with just 2,932 clips. For reference, that's roughly the same order of odds as being dealt four of a kind in poker. Even if you're the worst card player, you've played a hand at least once where someone won with four of a kind. In the same way, you should remember that collisions aren't merely a theoretical possibility; given enough clips, they're inevitable.
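Those clip counts come from the standard birthday-bound approximation, and you can check them yourself. This sketch solves for the number of random hash values needed before the collision probability reaches a target p:

```python
import math

def clips_for_collision(bits, p):
    """Approximate number of random hash values before the chance of
    at least one collision reaches probability p (birthday bound)."""
    buckets = 2.0 ** bits
    return math.sqrt(2.0 * buckets * math.log(1.0 / (1.0 - p)))

print(f"{clips_for_collision(32, 0.001):,.0f} clips")   # ~2,932 for a 32-bit hash
print(f"{clips_for_collision(160, 0.001):.3g} clips")   # ~5.41e+22 for a 160-bit hash
```

Every extra bit doubles the number of possible fingerprints, which is why the two answers are more than eighteen orders of magnitude apart.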
A Nerdy Side Note on Probability Theory
It’s important to consider a couple aspects of probability theory when it comes to thinking about hash collisions. This is going to feel a little high school math, but love yourself and stick with it this time.
You remember the probability of a random event, right? If you flip a coin 50 times and it comes up heads every time, what is the probability it will come up heads again? It’s 50-50, and it always will be. In the same way, each hashing function is a completely independent event. It doesn’t know what the value generated on the previous clip was, and doesn’t care what the value of the next clip will be.
You may have forgotten about the "birthday paradox." This is the probability that in a set of randomly chosen people, some pair of them will have the same birthday. You might believe that this probability hits 100% when you have 367 people, given that there are 366 possible birthdays in a year, including February 29th. However, you actually hit 99% probability with just 57 people. This is an important way to illustrate how probability works in terms of hash collisions: the number of clips needed to make a collision likely is much lower than you might anticipate at first glance, because what matters is the number of possible pairs, and the number of pairs grows much faster than the number of people (or clips) itself.
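The number 57 is easy to verify with a few lines of Python. This computes the exact probability, assuming every one of the 366 possible birthdays is equally likely:

```python
def prob_shared_birthday(people, days=366):
    """Exact probability that at least two of `people` share a birthday,
    assuming each of `days` birthdays is equally likely."""
    p_all_unique = 1.0
    for k in range(people):
        p_all_unique *= (days - k) / days
    return 1.0 - p_all_unique

print(f"{prob_shared_birthday(57):.1%}")
```

Run it and 57 people lands right around 99%, while the famous tipping point of better-than-even odds arrives at just 23 people. Swap "people" for clips and "days" for possible hash values and you have the collision math in miniature.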
Why did I share math lessons? Context. It’s all numbers and math, and there’s a reason why this works and doesn’t work. If you have a firm grasp on the very basics of this process and the simple limitations of it, you are more informed and appropriately trusting of your checksum. We want to trust what computers tell us because they’re sitting all around us all the time, buzzing us when we need to do a thing or sending a notification when we’ve forgotten to do another thing. We want to trust them, but they’re all numbers and math. We have to decide how to interpret them based on the inputs we’ve provided.
Back to Collisions and Hash Length
Again, a shorter hash, 32 bits instead of 128 for example, will increase your chances of a collision significantly. So why not just run a 512-bit checksum on every clip? Well, it takes longer. Like, it can take a LOT longer, especially if you're using a computationally complex checksum like MD5. More complexity means more number-crunching for every byte. So yes, you could run the biggest hash you've got. But in the world of on-set media management or dailies, the clock is running constantly and it's not realistic to allow hours and hours for the verification of data when smaller hashes are probably capable of doing the job.
But it's important to understand the risk you're accepting. With a 32-bit hash value, you hit a 1 in 100 million chance of a collision basically right away, within the first dozen clips or so. With a 160-bit hash value, you'd need on the order of 10^20 clips before you reach those odds, far more than any production will ever generate. (That's a simplified way to think about how collisions can come up, but keep in mind that one can happen on clip #1 or clip #95,909, since like a coin flip, collisions are independent events.)
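You can also flip the calculation around and ask how likely a collision is after a given number of clips. This sketch uses the same birthday approximation (math.expm1 keeps the tiny probabilities from rounding to zero):

```python
import math

def collision_probability(clips, bits):
    """Approximate chance of at least one collision among `clips` random
    hash values of the given bit length (birthday approximation)."""
    buckets = 2.0 ** bits
    # -expm1(-x) computes 1 - e^-x without losing precision for tiny x
    return -math.expm1(-clips * (clips - 1) / (2.0 * buckets))

print(f"32-bit hash,  10 clips:      {collision_probability(10, 32):.1e}")
print(f"32-bit hash,  100,000 clips: {collision_probability(100_000, 32):.0%}")
print(f"160-bit hash, 100,000 clips: {collision_probability(100_000, 160):.1e}")
```

On my math, that's about a 1-in-100-million chance within the first ten clips for a 32-bit hash, roughly a two-in-three chance by 100,000 clips, and a vanishingly small 10^-39 for a 160-bit hash at the same clip count.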
You also have a 1 in 100 million chance of dying in a shark attack. Do you still swim in the ocean? Some people don’t, but most of us probably do while we remain vigilant…just in case. Jaws was popular for a reason.
Checksums and Speed
The speed at which checksums run is limited by the hardware you're using. Some people will tell you that xxHash is considerably faster than MD5 in an attempt to sway you from one to the other, citing 5.4 GB/s compared to MD5's 0.28 GB/s. But these speeds come from benchmarking tests, not real-world use. Among other things, your checksum process is limited by your CPU's computing power, your RAM, the speed of your hard drives, and the way you're connected to devices. That xxHash checksum is fast, but it's only faster than MD5 in practice if the transfer speed of the source and destination exceeds MD5's own hashing rate of roughly 280 MB/s; below that, the drives are the bottleneck and both checksums simply wait on the data. You can only accurately compare checksum speeds using the same hardware and software setup. And if you give a complex checksum process too much of your CPU, you won't leave enough for running other tasks on your system. Like everything else in video production, the details and peripherals matter greatly.
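A crude way to see why the drive matters more than the benchmark: the effective offload rate is capped by whichever is slower, the pipe or the hash. The hash rates below are the benchmark figures quoted above; the drive speeds are my own hypothetical examples:

```python
def effective_rate(drive_mb_s, hash_mb_s):
    """Whichever is slower wins: the hash can't run faster than the
    drive can feed it, and the copy can't outrun the hash."""
    return min(drive_mb_s, hash_mb_s)

# Benchmark hash rates: xxHash ~5,400 MB/s, MD5 ~280 MB/s.
for drive in (85, 280, 1000):  # USB3 spinning disk, fast SSD, RAID
    xx = effective_rate(drive, 5400)
    md5 = effective_rate(drive, 280)
    print(f"{drive} MB/s storage: xxHash ~{xx} MB/s, MD5 ~{md5} MB/s")
```

On an 85 MB/s portable drive, both checksums crawl along at 85 MB/s; xxHash only pulls ahead once the storage outruns MD5's hashing rate. This is a deliberately simplified model that ignores CPU contention and overlapping I/O, but it captures the bottleneck logic.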
In ShotPut Pro, three of the checksum types you can choose from are SHA-256, MD5, and xxHash. Given that these all have algorithms considered similarly safe for transfer verification and are commonly used in video production, and given what we know about hash lengths, I decided to run a test to see how quickly media could be verified with each.
I used a late 2013 Mac Pro with a 2.7 GHz 12-Core Intel Xeon E5 and 64 GB 1866 MHz DDR3 ECC.
The source is a LaCie Rugged drive connected via USB3 directly to the Mac Pro. The destination is the Mac's desktop. The read/write speed from this drive, tested with Blackmagic's Disk Speed Test, is roughly 85 MB/s. These drives are very commonly used for media transfer and transport.
Using ShotPut Pro 5.3.4, I verified a "small" 2.5 gig ProRes file and a "large" 10 gig ProRes file — both files generated by an ARRI Alexa XT.
xxHash
Hash Length: 64 bit (16 hex characters)
Time Elapsed: 47.19 seconds

MD5
Hash Length: 128 bit (32 hex characters)
Time Elapsed: 85.77 seconds

SHA-256
Hash Length: 256 bit (64 hex characters)
Time Elapsed: 112.92 seconds
On a common video production hardware setup, the MD5 checksum took nearly twice as long as xxHash, and the SHA-256 checksum took about two and a half times as long.
A terabyte might take 2.5 hours to verify with SHA-256. With xxHash, it might take only about an hour. MD5 falls somewhere in between, at a little under two hours. I didn't do any testing to see how speeds compare between doing a terabyte all at once or several hundred gigabytes at a time, which is far more likely in an on-set media management situation. (But I don't have any reason to believe there are significant speed variances between big chunks and small chunks.)
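Those per-terabyte figures are straight-line extrapolations from the 12.5 GB test above. Assuming the three timings map to xxHash, MD5, and SHA in that order, the arithmetic looks like this:

```python
test_gb = 12.5  # the 2.5 GB + 10 GB test files
timings_s = {"xxHash": 47.19, "MD5": 85.77, "SHA": 112.92}

for name, seconds in timings_s.items():
    # scale the test time up to 1,000 GB, then convert seconds to hours
    hours_per_tb = (1000 / test_gb) * seconds / 3600
    print(f"{name}: ~{hours_per_tb:.1f} hours per terabyte")
```

That works out to roughly 1.0, 1.9, and 2.5 hours per terabyte respectively, assuming the verification rate stays constant as volume grows.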
The difference between an hour and 2.5 hours on set or in dailies turnarounds is significant. The difference in “safety” may or may not be as significant depending on your outlook. Setting aside the algorithm each checksum is using, you can see how they might be different.
So, What Do I Do?
I know the odds of being struck by lightning are fairly low, something like 1 in 500,000. But when I'm outside during a thunderstorm, I don't hang out under a tree. And I'm guessing you don't either. But I'm also guessing you don't sit inside your home, refusing to go outside for even a moment until the storm is long gone. Similarly, I think there's a middle ground to be found in checksums depending on the level of risk you're willing to accept.
In video production, we accept all kinds of risks on a day to day basis. We trust that every person in the chain is doing their job. We allow people to drive cars with hard drives inside. We assume the set isn’t going to be struck by a meteor. Acceptable risk is a part of doing business. But we need to know exactly what kinds of things we’re choosing to trust and how they work. As the industry continues to move in the direction of faster speeds, higher capacities, and increased footage, xxHash will likely be more widely adopted. What about after that? Shorter bit lengths? Less complex but less robust algorithms? Will our amount of acceptable risk shift as expectations shift?
We need to remain vigilant that our need for speed isn’t tossing aside essential verification processes that can make or break the completion of a project. Particularly as an advocate for digital negative, the safety of bits is near and dear to me. What you do on your own productions depends on your own needs, but knowing what you’re getting yourself into — bit for bit — puts you ahead of the curve. My own data verification choices will continue to evolve depending on the needs of clients and studios, combined with the fail-safes available to ensure the safekeeping of the digital negative.