Marc Rochkind's
Apps and Books
 

The most powerful and easy-to-use image verifier.

MacOS 10.4-10.10       Windows XP/7/Vista/8/10

Note: In Version 1.3 and later, all raws are treated alike, and there is no longer a checkbox for DNGs. The "Verify Non-DNG Raws" checkbox has been relabeled "Verify Raws".

ImageVerifier (IV for short) traverses a hierarchy of folders looking for image files to verify. It can verify TIFFs, JPEGs, PSDs, DNGs, and non-DNG raws (e.g., NEF, CR2).

IV is designed to process large numbers of images. Folder hierarchies with 100,000 images or more should be no problem. In one test run, IV ran for 14 hours.

There are two kinds of verification that IV performs: structure checking and hash checking. They are described in detail below.

To download IV, click the Download link at the top of this page.

Basic Operation

All structure verification other than for non-DNG raws is built-in; for DNGs IV uses Adobe's DNG SDK directly. JPEGs and TIFFs are verified using built-in libraries as well. PSDs are validated using a method designed specifically for ImageVerifier that follows Adobe documentation for the PSD format. Non-DNG raws (e.g., NEFs) are verified by running them through Adobe DNG Converter.

For all image files, structure checking is performed by reading the actual image data, decompressing as necessary. This can find many errors, but not all, as some errors are indistinguishable from image data. See below for more information.

The list of extensions for raw files is the same as ImageIngester's. (If your raw files end in TIF you can set an option for that on the main window.)

The real work is done by subprocesses, so IV can take advantage of multiple CPU (or multiple core) computers. If you have 4 CPUs, it should be capable of fully loading all 4 at once.

For each verification run, called a job, you can choose the folders, whether to process subfolders or just the top level, what kinds of images to process (TIFF, JPEG, PSD, DNG, and/or non-DNG raw), the maximum number of errors to report, and whether to store the results in a built-in database.
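Here's a minimal sketch of the kind of traversal a job describes: walk a folder, optionally descend into subfolders, and pick out only the selected kinds of images. The extension table and the find_images name are assumptions for illustration; IV's actual extension list is not reproduced on this page.

```python
import os

# Hypothetical extension table; IV's real list is not given in this document.
KINDS = {
    ".tif": "TIFF", ".tiff": "TIFF",
    ".jpg": "JPEG", ".jpeg": "JPEG",
    ".psd": "PSD",
    ".dng": "DNG",
    ".nef": "raw", ".cr2": "raw",
}

def find_images(top, subfolders=True, kinds=("TIFF", "JPEG", "PSD", "DNG", "raw")):
    """Yield (path, kind) for each file under top whose kind was selected."""
    for folder, dirnames, filenames in os.walk(top):
        if not subfolders:
            dirnames.clear()  # stop os.walk from descending below the top level
        for name in sorted(filenames):
            kind = KINDS.get(os.path.splitext(name)[1].lower())
            if kind in kinds:
                yield os.path.join(folder, name), kind
```

Calling find_images(folder, subfolders=False, kinds=("JPEG",)) would list only the JPEGs at the top level of that folder.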

You can save the settings in a named job, which acts something like ImageIngester's presets, except saving is automatic.

There's a built-in scheduler on OS X (not on Windows) that allows you to schedule jobs to be run once at a specified time; daily at a specified time on specified days (e.g., Tuesdays and Saturdays at 2am); and monthly on a specified day and time (e.g., the 3rd of every month at 5am).

The scheduler uses the "cron" facility built into OS X. IV doesn't have to be running for a scheduled job to run, nor does it keep its own daemon process running.
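To show how the daily and monthly examples above map onto cron, here's a small sketch that builds the five time fields of a crontab line (minute, hour, day of month, month, day of week). The cron_fields helper is hypothetical, and the lines IV actually writes are not documented here. The one-time case is left out because cron has no one-shot field and this page doesn't say how IV handles it.

```python
def cron_fields(minute, hour, day_of_month="*", month="*", days_of_week="*"):
    """Return the five time fields of a crontab line as one string."""
    if not isinstance(days_of_week, str):
        days_of_week = ",".join(str(d) for d in days_of_week)
    return f"{minute} {hour} {day_of_month} {month} {days_of_week}"

# Day-of-week numbering in cron: 0 = Sunday ... 6 = Saturday.
TUESDAY, SATURDAY = 2, 6

daily = cron_fields(0, 2, days_of_week=[TUESDAY, SATURDAY])  # Tue and Sat at 2am
monthly = cron_fields(0, 5, day_of_month=3)                  # the 3rd at 5am
print(daily)    # 0 2 * * 2,6
print(monthly)  # 0 5 3 * *
```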

More About Detection of Invalid Files With Structure Checking

In one test of 100 defective images produced by "punching" holes of 40,000 zero bytes at random points, ImageVerifier reported 88% of them as invalid, which is exactly the same result produced by DNG Converter running standalone.
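The tool used to make those defective images isn't described, but the idea can be sketched in a few lines. This is an assumed reconstruction: punch_hole is a hypothetical name, and the random offset is chosen uniformly so the whole hole fits inside the file.

```python
import random

def punch_hole(data, hole_size=40_000, rng=random):
    """Return (damaged copy, offset) with hole_size bytes zeroed at a random offset."""
    if hole_size > len(data):
        raise ValueError("hole is larger than the file")
    start = rng.randrange(len(data) - hole_size + 1)
    damaged = bytearray(data)
    damaged[start:start + hole_size] = bytes(hole_size)  # bytes(n) is n zero bytes
    return bytes(damaged), start
```

Passing rng=random.Random(seed) makes a test set reproducible, so the same damaged files can be fed to different verifiers.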

For DNGs, IV uses Adobe's DNG SDK (software development kit). It produced exactly the same results on a test of 100 defective DNGs as did DNG Converter.

Detection of invalid JPEGs was even better (about 98%). The very high compression of JPEGs makes even small errors detectable.

Detection of invalid TIFFs with an 800,000 byte hole punched in them was poor: Only 13%. However, 6MB TIFFs with 100,000, 200,000, 400,000, and 800,000 bytes lost from the end (i.e., truncated files) were all reported as invalid. Perhaps truncated files and files with their headers clobbered are the most commonly found bad files, and those are all detected.

(For some additional insight into the problem of detecting invalid files, imagine a file format that consists of a 4-byte width, a 4-byte height, and then width-times-height 3-byte RGB pixels. In this arrangement, as long as the width and height accurately describe the file, since any RGB combination is valid, no holes punched into the file would be detectable unless they affect the first 8 bytes. That's the worst case, and it's close to the TIFF case if no compression is used.)
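The imaginary format above is simple enough to try out. In this sketch the big-endian byte order and the function names are assumptions; the only structural check available is whether the file size agrees with the header.

```python
import struct

def make_toy_image(width, height, pixel=b"\x10\x20\x30"):
    """Build a file: 4-byte width, 4-byte height, then width*height RGB pixels."""
    return struct.pack(">II", width, height) + pixel * (width * height)

def toy_image_is_valid(data):
    """The only structural check possible: does the size match the header?"""
    if len(data) < 8:
        return False
    width, height = struct.unpack(">II", data[:8])
    return len(data) == 8 + width * height * 3

good = make_toy_image(100, 50)
holed = good[:1000] + bytes(5000) + good[6000:]   # hole in the pixel data
truncated = good[:-300]                           # bytes lost from the end

print(toy_image_is_valid(good))       # True
print(toy_image_is_valid(holed))      # True: the damage is invisible
print(toy_image_is_valid(truncated))  # False
```

The holed file passes because zeroed pixels are still legal pixels, while the truncated file fails because its size no longer matches the header.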

Hash Checking

Structure checking means verifying the image file by reading through its various structures and decompressing any compressed image data, looking for errors. This can be effective in finding damage if the damage is large and/or the image is compressed. For highly compressed images like JPEGs, damage detection is very good. It's not so good for uncompressed raws, such as the DNGs that come straight from a Leica M8. It's better for compressed DNGs, but not as good as it is for JPEGs.

Another approach entirely is hash checking, which means keeping, for each image known to be good, a fixed-length hash computed from all the bytes in the file, so that it's unlikely that two different files will produce the same hash. (Not impossible, since the hash is of fixed length and the number of possible image files is infinite.) If the two files are the good one and a copy (or even the original) that's been damaged, then comparing hashes of the two files will show that the files are not the same.

Comparing the actual files is even better, but in the case of a single file that's been damaged you don't have two files. All you have is the damaged file and the hash from when it used to be good. Also, reading one file to compute its hash takes half as long as reading two files.

The nice thing about structure checking is that no bookkeeping is involved: each file stands on its own. Hash checking, however, does create complications because you need to put the hash somewhere, and you need a way of associating the image with its hash. This is easy for a DAM system that controls all the assets, but much harder with a passive utility like ImageVerifier. Putting the hash inside the file is one approach, but this has two problems. First, it's safe only for certain formats for which it's allowed, such as DNG. Second, it requires IV to write into the file, which I don't want to do because it raises the possibility of damage to the file during verification and because many photographers don't want to use any utilities that write into their files.

So, here's the scheme that IV uses: For each file, a key is generated that's rich enough so that two different images won't have the same key. The key is the concatenation of the filename (not the path, just the last component), the size, the modification date/time of the file, the EXIF DateTimeDigitized, the EXIF SubSecTimeDigitized, and the EXIF DateTimeOriginal (also called plain DateTime). It's still possible for two different images to have identical keys, but the worst that will happen in that case is that IV will erroneously say that they are different, and then later you can determine that they are not.

After the key is generated, a 512-bit hash is computed from the entire file, and both are stored in the IV database. Note that the location of the file (what folder it's in) is stored for reference purposes, but plays no role at all in associating keys (and therefore files) and their hashes.

Here's an example key:

DSC_0003.JPG - 2591469 - 2007-02-05 22:56:43 - 2004:06:23 18:59:27.70 - 2007:02:03 14:00:20

The corresponding example hash is a 128-character string of hex digits.
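Here's a sketch of how a key and hash of this shape could be produced. Two things are assumptions: that the ".70" in the example key is the SubSecTimeDigitized value appended to DateTimeDigitized, and that the 512-bit hash is SHA-512 (this page says only "512-bit"). The EXIF values are passed in as strings because reading them requires an EXIF library, which is outside this sketch.

```python
import hashlib

def make_key(filename, size, modified, digitized, subsec, original):
    """Join the key components in the order the example key shows them."""
    return f"{filename} - {size} - {modified} - {digitized}.{subsec} - {original}"

def file_hash(path, chunk_size=1 << 20):
    """Hash the whole file in chunks so large images need not fit in memory."""
    h = hashlib.sha512()  # assumption: the document says only "512-bit"
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()  # 128 hex digits

key = make_key("DSC_0003.JPG", 2591469, "2007-02-05 22:56:43",
               "2004:06:23 18:59:27", "70", "2007:02:03 14:00:20")
print(key)
```

Note that nothing in the key depends on the folder, which is why a moved or copied file still finds its hash.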

If the file is copied to a backup, or moved to another folder, IV will still find its hash, as long as none of the components of the key have changed. (Copying normally doesn't change the modification date, and you need to be careful not to rename your files when you back them up. Once a file is named, ideally during ingestion, it ought to keep that name forever.)

A key and hash take up around 150 bytes in the database, depending on the length of the key. (Recall that the hash represents all the bytes in the file, which for most of us is around 10MB or more per image.) In addition, there's space needed for the parent folder of each image, but these are shared by all the images in that folder, so if you have, say, 50 images in a folder, storing the parent path adds only around 1 or 2 bytes per image. Doing the math, the space to store the keys and hashes for 200,000 images is only about 30MB, which is about the space of 3 images. A lot of overhead space is also needed for indexing, so the 30MB should perhaps be tripled. Still, the space cost is reasonable.
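The arithmetic above can be checked directly. The 75-byte parent path is an assumed typical length; the other figures come from the text.

```python
IMAGES = 200_000
BYTES_PER_KEY_AND_HASH = 150   # figure given in the text
PATH_BYTES = 75                # assumed length of a typical parent-folder path
IMAGES_PER_FOLDER = 50

records_mb = IMAGES * BYTES_PER_KEY_AND_HASH / 1_000_000
path_bytes_per_image = PATH_BYTES / IMAGES_PER_FOLDER
with_index_mb = records_mb * 3  # the text's rough allowance for indexing

print(records_mb)            # 30.0
print(path_bytes_per_image)  # 1.5
print(with_index_mb)         # 90.0
```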

When you run IV, you have the option of doing a hash check, a structure check, or both. To run a hash check, you must have previously generated the hashes for the files you want to check. If during a hash check no stored hash can be found for an image (that is, no match for the key), you get a failure message that's different from the message you get when the hashes fail to match.
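The two kinds of failure can be pictured with a small sketch. A plain dictionary stands in for the IV database, SHA-512 is assumed as the hash, and the result strings are illustrative rather than IV's actual messages.

```python
import hashlib

def check_hash(key, data, stored):
    """Return 'ok', 'mismatch', or 'no stored hash' for one image."""
    expected = stored.get(key)
    if expected is None:
        return "no stored hash"  # the key was never stored, or has changed
    actual = hashlib.sha512(data).hexdigest()
    return "ok" if actual == expected else "mismatch"

stored = {"a.jpg - 3": hashlib.sha512(b"abc").hexdigest()}
print(check_hash("a.jpg - 3", b"abc", stored))  # ok
print(check_hash("a.jpg - 3", b"abx", stored))  # mismatch
print(check_hash("b.jpg - 3", b"abc", stored))  # no stored hash
```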

The hash check is pretty fast: Less than a half-second per image, compared to 2 to 4 seconds for a structure check.

There's a Keep Existing Hashes option that restricts verification and storing of hashes to those files that don't already have a hash stored. This is a quick way of processing just files that were added (or which failed verification) since the last time hashes were stored on that folder hierarchy.

A Store Hashes for Invalid TIFFs option causes a hash to be stored even if a TIFF file failed the validity check, since many TIFFs deviate from the standard and therefore can't be checked by ImageVerifier. Don't use this option unless you know the images to be good. (Hashes are also stored for valid TIFFs and for any other types you're also processing.)

In practice, the workflow is something like this: You take a folder of images known to be good and run IV on them with the Store Hashes checkbox checked. This automatically checks the Check Structure checkbox, because a hash is stored in the database only if the file passed the structure check. Then, after the hashes are stored, you can run a hash check on the same files now, or a year from now, or 5 years from now.

Here's a simple test I ran: I stored hashes from a folder of JPEGs, and then took one of them, copied it to another folder, and wrote a single "x" byte into it, 123 bytes from the end, being careful to reset the modification time back to its original value. The resulting damaged JPEG passed the structure check, but it failed the hash check.
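A test like this one is easy to reproduce on a scratch copy of a file. The sketch below is one assumed way of doing it, not the procedure actually used: it overwrites a single byte a fixed distance from the end and then restores the file's timestamps, so the size and modification time in the key stay the same.

```python
import os

def damage_one_byte(path, offset_from_end=123, byte=b"x"):
    """Overwrite one byte near the end of the file, then restore its timestamps."""
    before = os.stat(path)
    with open(path, "r+b") as f:
        f.seek(-offset_from_end, os.SEEK_END)
        f.write(byte)
    # Put the access and modification times back so the key is unchanged.
    os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
```

Run it only on a copy, since it deliberately corrupts the file it is given.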

In practice, IV seems to magically just remember all your images and what their bytes are supposed to be, and then dips into its memory whenever it sees a repeat image. All this works no matter what folder structure you use, how many backups you have, or how often you rearrange things. (Just don't change the file names.)

A Manage Hashes menu choice opens a panel where you can see the folders whose images were hashed, along with a list of the keys for those images. You can purge a single folder, although you don't have to, since storing new hashes with the same keys will overwrite the old hashes.  

 

©2006-2016 by Marc Rochkind. All rights reserved. Downloaded trials and demos may not be redistributed without written permission of the copyright holder.

ImageIngester, ImageIngesterPro, ImageVerifier, ImageReporter, Ingestamatic, ProofSheet, ExifChanger, ExifExtreme, SpanBurner, LRViewer, LRVmaker, PhotoSelectLink, PhotoApp, PhotoMag, and PhotoAppMaker are trademarks of Marc Rochkind.

WARRANTY: THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Privacy Policy: The email list is never shared with any other organization and is never used for any purpose other than informing you of developments related to products available on this site. The apps here and this website never upload data to any server without your explicit permission, other than requests for content that are a necessary part of web interaction and a record of your entering an unlock code.