Verifying and Uploading Large Archives of Photos with ZipVerifier and S3BigUpload
29-April-2015; Updated 22-July-2016, 26-Aug-2016, 30-Aug-2017, 9-Feb-2019
ZIP Archives for Long-Term Storage
It's best to have several layers of backup for important files, ranging from an online copy, to a nearby offline copy, to offsite copies (perhaps on DVDs), to copies in the cloud. The idea is to protect against increasingly rare but geographically-wide disasters with increasingly delayed, but reliable, retrieval. For example, it may take hours to retrieve the offsite backups, or days to download the cloud backups, but a disaster (such as a building explosion, as occurred in March, 2015, in New York) that would necessitate going to them is extremely rare, so the time isn't a problem.
You could just copy the files themselves to the storage medium, but it's important to know that the files copied correctly, as there are many things that could go wrong, such as disk or memory errors, software bugs, or human error. There are various methods that work, such as keeping checksums, but if you have 75,000 images, it's difficult to check every checksum and re-copy or re-upload the failed files until they're all correct.
It's much easier to first put the files into an archive, verify that the archive is correct, and then there is only one file to fret over. Even if you have several archives, the number of archives is much less than the number of files. (I have about 10 archives.)
So the question becomes: What kind of archive should you use? You should be able to do these operations on the archive:
Verify that the archive is correct by processing its internal structure (internal verification).
Verify by comparing each internal file with its external original (external verification).
Recover from truncation or damage to the archive. Specifically, extraction should be able to recover from data corruption that affects only part of the file.
Extract files in the distant future, even if current computers and operating systems are no longer available.
The ubiquitous ZIP archive format mostly meets these requirements, but no ZIP utilities I've come across can do external verification, although there's no theoretical reason why not. I've accomplished the same thing by extracting the entire archive into a fresh folder and comparing the extracted files with the originals using a file comparison tool, but that's a lot of extra work. ZipVerifier does it automatically.
Corrupted ZIP files can be recovered in principle (minus the damaged internal files), using the 4-byte signature that marks the internal header for each file, and there are a few utilities that can do this.
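To illustrate the recovery idea, here is a minimal Python sketch (not taken from any of the recovery utilities mentioned) that scans raw archive bytes for the local file header signature, PK\x03\x04, and reads the file name stored after each one. The names find_local_headers and demo_damaged_archive are made up for this example. A real recovery tool would also have to expand each entry, check its CRC, and cope with the signature bytes turning up by chance inside file data.

```python
import io
import struct
import zipfile

# 4-byte signature that starts every local file header in a ZIP archive.
LOCAL_HEADER_SIG = b"PK\x03\x04"
FIXED_HEADER_LEN = 30  # size of the fixed part of a local file header


def find_local_headers(data: bytes):
    """Return a list of (offset, file name) for each local header found in data."""
    found = []
    pos = data.find(LOCAL_HEADER_SIG)
    while pos != -1:
        header = data[pos:pos + FIXED_HEADER_LEN]
        if len(header) == FIXED_HEADER_LEN:
            # The file-name length is a little-endian 16-bit value at offset 26.
            (name_len,) = struct.unpack("<H", header[26:28])
            start = pos + FIXED_HEADER_LEN
            name = data[start:start + name_len]
            found.append((pos, name.decode("utf-8", errors="replace")))
        pos = data.find(LOCAL_HEADER_SIG, pos + len(LOCAL_HEADER_SIG))
    return found


def demo_damaged_archive():
    """Build a small archive in memory, chop off its end, and list what survives."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:  # stored (uncompressed) entries
        zf.writestr("a.jpg", b"first image bytes" * 100)
        zf.writestr("b.jpg", b"second image bytes" * 100)
    # Cutting 40 bytes removes the end-of-archive record and part of the
    # central directory, which is enough to make ordinary tools reject the file.
    damaged = buf.getvalue()[:-40]
    return [name for _, name in find_local_headers(damaged)]
```

Even though the truncated archive no longer has a usable central directory, the per-file headers are still there to be found, which is the property that makes this kind of recovery possible.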
Another common archive format is tar, which doesn't have a per-file internal signature, so recovery is much more difficult, and in some cases impossible. The GNU tar utility does external verification (the tar that comes with current versions of Mac OS X does not), but only for files in the archive. If the archiving skipped a file, the verification won't discover that.
So, ZIP is the best format for long-term storage.
Creating Large ZIP Archives
The traditional ZIP format is limited to 4GB archives, not nearly large enough for collections of images. The newer ZIP64 format goes as big as you need. Unfortunately, many ZIP utilities won't handle ZIP64 properly. The OS X built-in compression utility, Archive Utility, doesn't deal with ZIP64, and just produces useless archives when the size is >4GB. (The Finder context-menu Compress command uses Archive Utility and has the same problem.) The unzip command that comes with OS X won't handle ZIP64 either, as it's an old version. (It's possible, but troublesome, to install the newer version of unzip.)
The zip terminal command does deal correctly with ZIP64, as does the commercial BetterZip app. (But it seems that the zip command's test feature invokes unzip and therefore doesn't handle ZIP64.) So, those are the two to use. There might be other choices, but these are the two I tested.
On Windows, I've tried both the built-in facility (on the Windows Explorer "Send to" menu) and the free utility 7-Zip, and they both work OK.
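If you'd rather script the archiving, Python's standard zipfile module can also write ZIP64 archives with Deflate compression. The following is only a sketch: the function name zip_single_folder is made up, the list of Mac-specific files to skip (.DS_Store and names starting with "._") is my assumption, and I haven't tested archives made this way with ZipVerifier.

```python
import os
import zipfile


def zip_single_folder(folder: str, archive_path: str) -> int:
    """Archive one folder with Deflate, allowing ZIP64; return the number of files added."""
    folder = os.path.abspath(folder)
    parent = os.path.dirname(folder)
    count = 0
    with zipfile.ZipFile(
        archive_path, "w", compression=zipfile.ZIP_DEFLATED, allowZip64=True
    ) as zf:
        for root, dirs, files in os.walk(folder):
            dirs.sort()  # visit subfolders in a predictable order
            for name in sorted(files):
                # Assumption: these are the Mac-specific files worth excluding.
                if name == ".DS_Store" or name.startswith("._"):
                    continue
                full = os.path.join(root, name)
                # Store paths relative to the parent, so the archive has a
                # single top-level folder.
                zf.write(full, arcname=os.path.relpath(full, parent))
                count += 1
    return count
```

For example, zip_single_folder("Photos", "Photos.zip") would produce an archive whose entries all begin with "Photos/".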
Using BetterZip (Mac)
(See the note a few paragraphs down about versions earlier than 3.)
There are two rules to observe when you're using BetterZip to create archives that ZipVerifier can handle. First, for external verification (not internal), you can only archive a single folder, like this:
You can't do this:
Second, the compression has to be Normal or Fast, and you have to exclude Mac-specific stuff, or ZipVerifier will get the file counts wrong.
That's for BetterZip version 3 or later; for earlier versions, you have to set the Compression to "No compression".
If you like, you can also test the archive with BetterZip. It's essentially the same as ZipVerifier's internal verification, but it doesn't compute the MD5/ETag checksums.
Calculating the Checksum and Verifying the Archive (Mac)
Once you've created the archive, you want to verify it before you store it away. You also want a checksum that you can compare to the one calculated by Amazon S3 (what I use; see below) so you can ensure that it got uploaded correctly. It's important to calculate the checksum before you verify the archive; otherwise, there's a small chance that it could go bad between the time you verify it and the time you calculate the checksum. If you do the checksum first, the file may go bad after it's been checksummed, but the verification will detect that.
The OS X md5 command works on large files, and that's what S3 uses as its ETag if you upload the archive all at once. However, large uploads need to be done as so-called multipart uploads, since they can take days to complete and you want to be able to restart them if anything stops the upload. Calculating the correct ETag for a multipart upload is more complex, so it's built into the ZipVerifier app, and, if you request it, it's calculated prior to verification.
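For the curious, here is a rough Python sketch of how the two values can be computed in one pass. It relies on the commonly observed (but, as far as I know, not officially guaranteed) S3 behavior that a multipart ETag is the MD5 of the concatenated binary MD5 digests of the parts, followed by a hyphen and the number of parts. This is not ZipVerifier's code; the function name and the default part size of 5 * 1024 * 1024 bytes are assumptions, and the result only matches S3 if the uploader uses the same part size (and the object isn't encrypted in a way that changes the ETag).

```python
import hashlib


def md5_and_multipart_etag(path, part_size=5 * 1024 * 1024):
    """Return (whole-file MD5 hex digest, multipart-style ETag) for the file at path."""
    whole = hashlib.md5()
    part_digests = []
    with open(path, "rb") as f:
        while True:
            part = f.read(part_size)  # one upload part at a time
            if not part:
                break
            whole.update(part)
            part_digests.append(hashlib.md5(part).digest())
    # MD5 over the concatenated binary digests of the parts, then "-<part count>".
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return whole.hexdigest(), f"{combined}-{len(part_digests)}"
```

Calling md5_and_multipart_etag("XYZ.zip") returns the plain MD5 and the multipart-style ETag as a pair of strings; passing a different part_size gives a different ETag for the same file.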
The overall MD5 checksum and the ETag are written to a text file named XYZ.zip.md5, where XYZ.zip is the archive to be verified. For example,
ZipVerifier has no preferences or menu commands other than Help, and only one button to start the verification:
You can download the free ZipVerifier from here.
You're prompted to choose whether the verification should be external or internal (external also performs the internal checks), and whether the ETag should be calculated. If you choose external, you're prompted for the comparison folder, which should be the folder that you archived, or an exact copy of it. Then you're prompted for the archive to be verified. You can stop processing with the Stop button.
A few restrictions:
External verification works only for archives built from a single folder, because anything more complicated would make it burdensome to choose the source folders in the ZipVerifier app.
Only the traditional ZIP compression method, called Deflate, is supported, along with no compression at all. Newer methods, such as Deflate64, are not supported.
A single file must be <4GB. If you have a file bigger than that, just upload it as it is, without ZIPping it. You can still use ZipVerifier to calculate the ETag prior to uploading. It will complain that it's not a ZIP file, but only after correctly generating the ETag.
Note that for large archives ZipVerifier takes a long time, sometimes a half-hour or longer.
Here's what ZipVerifier verification checks (and doesn't check):
That the various internal directory structures are intact, have the correct signatures, and the various sizes and CRC checksums match one another. (They appear twice: once in the central directory, at the end of the ZIP file, and once again adjacent to each file's data.)
That compressed files can be successfully expanded.
That the stored CRC checksums match a checksum calculated for the expanded data.
For external verification, that the expanded data exactly matches the data in the external file.
For external verification, that the number of archived files is the same as the number in the comparison (external) folder.
Dates and times aren't verified, as I don't consider these important for long-term archiving. If everything else checks out, these are very likely to be OK, too, and, for images, all that matters is the EXIF data within the image, not what the file system thinks the dates and times are. In the event of a disaster, which is what long-term archiving is designed to recover from, you'll still get your images.
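The checks listed above can be approximated with Python's standard zipfile module. This is a sketch of the idea, not ZipVerifier's implementation: the function names are made up, it assumes the archive holds a single top-level folder with the same name as the comparison folder, and it relies on zipfile to reject a bad central directory, a bad Deflate stream, or a CRC mismatch rather than comparing the two copies of each header field itself.

```python
import hashlib
import os
import zipfile
import zlib


def _digest(stream, chunk_size=1 << 20):
    """Hash a stream in chunks so large files never have to fit in memory."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.digest()


def verify_against_folder(archive_path, folder):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    folder = os.path.abspath(folder)
    parent = os.path.dirname(folder)
    # Opening the archive fails if the central directory can't be read.
    with zipfile.ZipFile(archive_path) as zf:
        members = [m for m in zf.infolist() if not m.is_dir()]
        for m in members:
            original = os.path.join(parent, *m.filename.split("/"))
            try:
                with zf.open(m) as packed:
                    # Reading to the end expands the data and checks the stored CRC.
                    packed_digest = _digest(packed)
            except (zipfile.BadZipFile, zlib.error, EOFError) as exc:
                problems.append(f"{m.filename}: {exc}")
                continue
            if not os.path.isfile(original):
                problems.append(f"{m.filename}: no matching original")
                continue
            with open(original, "rb") as f:
                if _digest(f) != packed_digest:
                    problems.append(f"{m.filename}: differs from original")
    # The number of archived files should equal the number in the folder.
    on_disk = sum(len(files) for _, _, files in os.walk(folder))
    if on_disk != len(members):
        problems.append(f"file count: {len(members)} archived, {on_disk} in folder")
    return problems
```

Comparing SHA-256 digests instead of raw bytes keeps the sketch short; a byte-for-byte comparison would be the stricter equivalent. As in the text, dates and times are ignored.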
Calculating the Checksum and Verifying the Archive (Windows)
On Windows, you can build the ZIP file with 7-Zip, which is free, or using the built-in Windows feature (right-click, choose "Compressed (zipped) folder" on the "Send to" menu), or using any other utility that can correctly ZIP very large archives.
Then, to test the archive, unzip it into a fresh folder and use the free WinMerge utility to compare the two folders. If it's OK and you have the ETag, go ahead and upload it to S3 (see next section).
ZipVerifier is now available for Windows; see the link at the top of this page.
Uploading the Archive to S3 (Mac, Windows, or Linux)
Now that you've got the archive and its MD5/ETag checksums, and it's verified, you can save it to a hard drive or upload it to cloud storage. If you copy it to a hard drive, you can use ZipVerifier to verify the copy, perhaps with internal verification only, to save time. If you upload it to the cloud, you can use the checksums to verify that the upload went OK. S3 does report the checksums; I don't know about other cloud services.
If you use a simple upload utility or the Amazon website to copy a large archive to S3, the upload is unlikely to complete, as it may take several days, and there are too many ways it can be interrupted. You're much better off with what Amazon calls a multipart upload, which can be restarted where it left off if the process stops or if you just want to pause it.
The commercial app BucketExplorer does multipart uploads, and much more.
I've also written my own Chrome App, S3BigUpload, which is the one I use, and it's proven to be extremely reliable, easily handling uploads that take almost a week to complete, with numerous stops and starts. You can get S3BigUpload from the Chrome Web Store. To use S3BigUpload, you first log in with your AWS credentials, then you choose the bucket, then you choose the file to be uploaded. You can pause at any time and resume. Any application or computer crash is also considered to be a pause, and when you restart S3BigUpload it will pick up where it left off.
Note that you can install Chrome Apps from the Chrome Web Store only from within Chrome, and you have to have Chrome installed to run S3BigUpload, although you don't have to use Chrome as your web browser if you don't want to.
UPDATE: My app S3BigUpload is no longer available, because Google no longer supports Chrome Apps on platforms other than ChromeOS. But, for Windows, I've recently been using the free version of S3 Browser, with great success.
Once the upload is finished, you can access the file's properties from the Amazon S3 Management Console and compare its ETag to the one that ZipVerifier calculated.
VERY IMPORTANT: On Windows, if you've generated an ETag with ZipVerifier, make sure your part size for multipart uploads is set to 10MB, something you have to do with the options in S3 Browser. I don't know about the other multipart uploaders. On the Mac, ZipVerifier still uses 5MB parts.
Storing Archives on Amazon Glacier
Glacier is much cheaper than S3 for storage, but a lot more expensive for retrieval. Archives for long-term storage will be accessed only if all other backups have been destroyed, which is to say probably never, so Glacier is ideal. I haven't found any good multipart uploaders for Glacier, but you don't need one. Just set up your S3 bucket to automatically transfer files to Glacier, and then you just need to upload to that S3 bucket.
To do that, you first need a Lifecycle Rule:
Then you set that rule as a Lifecycle Rule for the bucket:
Now anything you upload to that S3 bucket will automatically go to Glacier, and you'll be charged at the Glacier rates. In my case, I have 436GB in Glacier, and that costs me about $1.75 a month.
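If you prefer to define the rule outside the console, S3 also accepts a lifecycle configuration as a structured document. The Python sketch below only builds that document and prints it as JSON. The rule ID, the empty prefix (meaning every object in the bucket), and the 0-day delay are assumptions on my part, not the settings from my own bucket. As far as I know, the resulting structure is what boto3's put_bucket_lifecycle_configuration call expects as its LifecycleConfiguration argument, but check the current AWS documentation before relying on it.

```python
import json


def glacier_lifecycle_config(rule_id="archive-to-glacier", days=0, prefix=""):
    """Build a lifecycle configuration that moves objects to Glacier after `days` days."""
    return {
        "Rules": [
            {
                "ID": rule_id,                 # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # "" applies the rule to the whole bucket
                "Transitions": [
                    {"Days": days, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(glacier_lifecycle_config(), indent=2))
```

As a sanity check on the cost, the figures above imply a rate of about $0.004 per GB per month (1.75 / 436).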
Entire site ©2006-16 Marc Rochkind unless otherwise noted. All rights reserved.