Page updated
25-July-2008

 

 


  

 ImageVerifier                  

      

Note: ImageVerifier is still called a beta version, but it's quite stable and ready for use.

ImageVerifier (IV for short) traverses a hierarchy of folders looking for image files to verify. It can verify TIFFs, JPEGs, PSDs, DNGs, and non-DNG raws (e.g., NEF, CR2).

IV is designed to process large numbers of images. Folder hierarchies with 100,000 images or more should be no problem. In one test run, IV ran for 14 hours.

There are two kinds of verification that IV performs: Structure checking and hash checking. They are described in detail below.

To download IV, use the Download link in the site navigation.

Basic Operation

Structure verification is built in for every file type except non-DNG raws. DNGs are verified using Adobe's DNG SDK directly, and JPEGs and TIFFs are verified using built-in libraries. PSDs are validated using a method designed specifically for ImageVerifier that follows Adobe's documentation for the PSD format. Non-DNG raws (e.g., NEFs) are verified by running them through Adobe DNG Converter.

For all image files, structure checking is performed by reading the actual image data, decompressing as necessary. This can find many errors, but not all, as some errors are indistinguishable from image data. See below for more information.

The list of extensions for raw files is the same as ImageIngester's. (If your raw files end in TIF you can set an option for that on the main window.)

The real work is done by subprocesses, so IV can take advantage of multi-CPU (or multi-core) computers. If you have 4 CPUs, it should be capable of fully loading all 4 at once.

For each verification run, called a job, you can choose the folders, whether to process subfolders or just the top level, what kinds of images to process (TIFF, JPEG, PSD, DNG, and/or non-DNG raw), the maximum number of errors to report, and whether to store the results in a built-in database.
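As a rough illustration of the job options above, the following sketch walks a folder tree and picks out files by kind. The extension map, the function name, and the parameter names are assumptions made for the example; IV's real raw-extension list is longer and is not reproduced here.

```python
import os

# Hypothetical extension map for illustration only.
KINDS = {
    ".tif": "TIFF", ".tiff": "TIFF",
    ".jpg": "JPEG", ".jpeg": "JPEG",
    ".psd": "PSD",
    ".dng": "DNG",
    ".nef": "raw", ".cr2": "raw",
}

def find_images(top, recurse=True,
                kinds=("TIFF", "JPEG", "PSD", "DNG", "raw"),
                tif_is_raw=False):
    """Yield (path, kind) for each file of a selected kind under top."""
    for folder, subfolders, files in os.walk(top):
        if not recurse:
            subfolders[:] = []  # tell os.walk not to descend any further
        for name in sorted(files):
            ext = os.path.splitext(name)[1].lower()
            kind = KINDS.get(ext)
            if tif_is_raw and ext == ".tif":
                kind = "raw"  # some cameras use a TIF extension for raws
            if kind in kinds:
                yield os.path.join(folder, name), kind
```

Calling `find_images(folder, recurse=False, kinds=("JPEG",))` would list only the JPEGs in the top level of `folder`.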

You can save the settings in a named job, which acts something like ImageIngester's presets, except saving is automatic.

There's a built-in scheduler on OS X (not on Windows) that allows you to schedule jobs to be run once at a specified time; daily at a specified time on specified days (e.g., Tuesdays and Saturdays at 2am); and monthly on a specified day and time (e.g., the 3rd of every month at 5am).

The scheduler uses the "cron" facility built into OS X. IV doesn't have to be running for a scheduled job to run, nor does it keep its own daemon process running.
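For readers unfamiliar with cron, the two recurring examples above correspond to crontab lines like the ones this sketch produces. The helper function and the command path are hypothetical; the actual entries IV writes are not documented on this page.

```python
def cron_line(minute, hour, command, weekdays=None, day_of_month=None):
    """Build a crontab line: minute hour day-of-month month day-of-week command.

    weekdays uses cron numbering (0 = Sunday ... 6 = Saturday).
    """
    dom = "*" if day_of_month is None else str(day_of_month)
    dow = "*" if not weekdays else ",".join(str(d) for d in sorted(weekdays))
    return f"{minute} {hour} {dom} * {dow} {command}"

# Hypothetical command, used only to make the lines complete.
CMD = "/path/to/run-iv-job nightly"

print(cron_line(0, 2, CMD, weekdays=[2, 6]))  # Tuesdays and Saturdays at 2am
print(cron_line(0, 5, CMD, day_of_month=3))   # the 3rd of every month at 5am
```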

More About Detection of Invalid Files With Structure Checking

In one test of 100 defective images produced by "punching" holes of 40,000 zero bytes at random points, ImageVerifier reported 88% of them as invalid, which is exactly the same result produced by DNG Converter running standalone.

For DNGs, IV uses Adobe's DNG SDK (software development kit). It produced exactly the same results on a test of 100 defective DNGs as did DNG Converter.

Detection of invalid JPEGs was even better (roughly 98%). The very high compression of JPEGs makes even small errors detectable.

Detection of invalid TIFFs with an 800,000 byte hole punched in them was poor: Only 13%. However, 6MB TIFFs with 100,000, 200,000, 400,000, and 800,000 bytes lost from the end (i.e., truncated files) were all reported as invalid. Perhaps truncated files and files with their headers clobbered are the most commonly found bad files, and those are all detected.

(For some additional insight into the problem of detecting invalid files, imagine a file format that consists of a 4-byte width, a 4-byte height, and then width-times-height 3-byte RGB pixels. In this arrangement, as long as the width and height accurately describe the file, any RGB combination is valid, so no hole punched into the file would be detectable unless it affected the first 8 bytes. That's the worst case, and it's close to the TIFF case if no compression is used.)
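The imaginary format can be written out in a few lines to make the point concrete. This is a sketch of the thought experiment only, not of any real format; the function names and the big-endian byte order are assumptions.

```python
import struct

def make_toy(width, height, rgb=(10, 20, 30)):
    """Build a toy image: 4-byte width, 4-byte height, then RGB pixels."""
    return struct.pack(">II", width, height) + bytes(rgb) * (width * height)

def toy_is_valid(data):
    """The only structure check possible: does the header match the size?"""
    if len(data) < 8:
        return False
    width, height = struct.unpack(">II", data[:8])
    return len(data) == 8 + 3 * width * height

def punch_hole(data, offset, length):
    """Overwrite up to length bytes at offset with zeros, keeping the size."""
    length = min(length, len(data) - offset)
    return data[:offset] + bytes(length) + data[offset + length:]

good = make_toy(100, 50)
print(toy_is_valid(good))                          # True
print(toy_is_valid(punch_hole(good, 1000, 4000)))  # True: pixel damage goes unnoticed
print(toy_is_valid(punch_hole(good, 0, 8)))        # False: header clobbered
print(toy_is_valid(good[:-100]))                   # False: truncated
```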

Hash Checking

Structure checking verifies the image file by reading through its various structures and decompressing any compressed image data, looking for errors. This can be effective in finding damage if the damage is large and/or the image is compressed. For highly compressed images like JPEGs, damage detection is very good. It's not so good for uncompressed raws, such as the DNGs that come straight from a Leica M8. It's better for compressed DNGs, but not as good as it is for JPEGs.
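The effect of compression on detectability can be tried out with any compressed stream. The sketch below uses zlib purely as a stand-in; it is not the compression used by JPEG or DNG, and zlib streams also carry a checksum that helps detection, so treat it as an illustration of the principle rather than a model of IV's results. The function names are made up for the example.

```python
import random
import zlib

def survives_decompression(blob):
    """Return True if blob decompresses without error (damage undetected)."""
    try:
        zlib.decompress(blob)
        return True
    except zlib.error:
        return False

def flip_bytes(data, offset, length):
    """Damage data by inverting every bit of length bytes starting at offset."""
    damaged = bytearray(data)
    for i in range(offset, offset + length):
        damaged[i] ^= 0xFF
    return bytes(damaged)

# Deterministic stand-in for image data (values 0..15, so it compresses).
rng = random.Random(0)
pixels = bytes(rng.randrange(16) for _ in range(20000))
packed = zlib.compress(pixels)

print(survives_decompression(packed))                                    # True
print(survives_decompression(flip_bytes(packed, len(packed) // 2, 16)))  # False
```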

Another approach entirely is hash checking. For each image known to be good, IV maintains a fixed-length hash computed from all the bytes in the file, so that it's unlikely that two different files will produce the same hash. (Not impossible, since the hash is of fixed length and the number of possible image files is infinite.) If the two files are the good one and a copy (or even the original) that's been damaged, then comparing their hashes will show that the files are not the same.

Comparing the actual files is even better, but in the case of a single file that's been damaged you don't have two files. All you have is the damaged file and the hash from when it used to be good. Also, reading one file to compute its hash takes half as long as reading two files.

The nice thing about structure checking is that no bookkeeping is involved: each file stands on its own. Hash checking, however, does create complications because you need to put the hash somewhere, and you need a way of associating the image with its hash. This is easy for a DAM system that controls all the assets, but much harder with a passive utility like ImageVerifier. Putting the hash inside the file is one approach, but this has two problems. First, it's safe only for formats that allow it, such as DNG. Second, it requires IV to write into the file, which I don't want to do, both because it raises the possibility of damaging the file during verification and because many photographers don't want to use any utility that writes into their files.

So, here's the scheme that IV uses: For each file, a key is generated that's rich enough so that two different images won't have the same key. The key is the concatenation of the filename (not the path, just the last component), the size, the modification date/time of the file, the EXIF DateTimeDigitized, the EXIF SubSecTimeDigitized, and the EXIF DateTimeOriginal (also called plain DateTime). It's still possible for two different images to have identical keys, but the worst that will happen in that case is that IV will erroneously say that they are different, and then later you can determine that they are not.
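Building such a key amounts to joining the components into one string. The sketch below assumes the " - " separator and date layouts seen in the example key on this page; the function name and parameters are hypothetical, and reading the EXIF values out of the file is left out.

```python
from datetime import datetime

def make_key(filename, size, mtime, digitized, subsec, original):
    """Join the key components with ' - ' separators.

    mtime is a datetime; the EXIF values are passed in as the strings
    stored in the file (extracting them is outside this sketch).
    """
    digitized_full = f"{digitized}.{subsec}" if subsec else digitized
    return " - ".join([
        filename,                             # last path component only
        str(size),                            # file size in bytes
        mtime.strftime("%Y-%m-%d %H:%M:%S"),  # file modification time
        digitized_full,                       # DateTimeDigitized + SubSecTimeDigitized
        original,                             # DateTimeOriginal
    ])

key = make_key(
    "DSC_0003.JPG", 2591469, datetime(2007, 2, 5, 22, 56, 43),
    "2004:06:23 18:59:27", "70", "2007:02:03 14:00:20",
)
print(key)
```

Note that the folder path is deliberately not an input, which is what lets a moved or backed-up copy produce the same key.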

After the key is generated, a 512-bit hash is computed from the entire file, and both are stored in the IV database. Note that the location of the file (what folder it's in) is stored for reference purposes, but plays no role at all in associating keys (and therefore files) and their hashes.

Here's an example key:

DSC_0003.JPG - 2591469 - 2007-02-05 22:56:43 - 2004:06:23 18:59:27.70 - 2007:02:03 14:00:20

The corresponding example hash is a 128-character string of hex digits.
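A 512-bit hash printed in hex is 128 characters because each hex digit carries 4 bits. The sketch below uses SHA-512 as an assumed stand-in, since this page says only that the hash is 512 bits long and does not name the algorithm; the function name is hypothetical.

```python
import hashlib

def file_hash(path, chunk_size=1 << 20):
    """Return a 512-bit hash of every byte in the file as 128 hex digits."""
    h = hashlib.sha512()  # assumption: the algorithm IV uses is not stated
    with open(path, "rb") as f:
        # Read in chunks so large image files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```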

If the file is copied to a backup, or moved to another folder, IV will still find its hash, as long as none of the components of the key have changed. (Copying normally doesn't change the modification date, and you need to be careful not to rename your files when you back them up. Once a file is named, ideally during ingestion, it ought to keep that name forever.)

A key and hash take up around 150 bytes in the database, depending on the length of the key. (Recall that the hash represents all the bytes in the file, which for most of us is around 10MB or more per image.) In addition, there's space needed for the parent folder of each image, but these are shared by all the images in that folder, so if you have, say, 50 images in a folder, storing the parent path adds only around 1 or 2 bytes per image. Doing the math, the space to store the keys and hashes for 200,000 images is only about 30MB, which is about the space of 3 images. A lot of overhead space is also needed for indexing, so the 30MB should perhaps be tripled. Still, the space cost is reasonable.
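The arithmetic behind that estimate, spelled out. The 75-byte folder path is an assumed typical length chosen to match the "1 or 2 bytes per image" figure; the other numbers come from the paragraph above.

```python
IMAGES = 200_000
BYTES_PER_ENTRY = 150    # key plus hash, from the estimate above
PARENT_PATH_BYTES = 75   # assumed typical folder path length
IMAGES_PER_FOLDER = 50

per_image = BYTES_PER_ENTRY + PARENT_PATH_BYTES / IMAGES_PER_FOLDER
raw_mb = IMAGES * BYTES_PER_ENTRY / 1_000_000
with_index_mb = raw_mb * 3   # rough allowance for indexing overhead

print(per_image)      # 151.5 bytes per image including its share of the path
print(raw_mb)         # 30.0 MB for keys and hashes
print(with_index_mb)  # 90.0 MB if indexing triples it
```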

When you run IV, you have the option of doing a hash check, a structure check, or both. To run a hash check, you must have previously generated the hashes for the files you want to check. If during a hash check no stored hash can be found for an image (that is, no match for the key), you get a failure message that's different from the message you get when the hashes fail to match.
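The three possible outcomes of a hash check can be sketched with a dictionary standing in for the IV database. The names, the result wording, and the use of SHA-512 are all assumptions for the example.

```python
import hashlib
from enum import Enum

class Result(Enum):
    OK = "hash matches"
    MISMATCH = "hash does not match"
    NO_STORED_HASH = "no stored hash for this key"

def check_hash(store, key, data):
    """Compare the hash of data against the one stored under key."""
    stored = store.get(key)
    if stored is None:
        # Different failure from a mismatch: nothing was ever stored.
        return Result.NO_STORED_HASH
    actual = hashlib.sha512(data).hexdigest()
    return Result.OK if actual == stored else Result.MISMATCH

store = {"a.jpg - 5": hashlib.sha512(b"hello").hexdigest()}
print(check_hash(store, "a.jpg - 5", b"hello"))  # Result.OK
print(check_hash(store, "a.jpg - 5", b"hellx"))  # Result.MISMATCH
print(check_hash(store, "b.jpg - 5", b"hello"))  # Result.NO_STORED_HASH
```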

The hash check is pretty fast: Less than a half-second per image, compared to 2 to 4 seconds for a structure check.

There's a Keep Existing Hashes option that restricts verification and storing of hashes to those files that don't already have a hash stored. This is a quick way of processing just files that were added (or which failed verification) since the last time hashes were stored on that folder hierarchy.

A Store Hashes for Invalid TIFFs option causes a hash to be stored even if a TIFF file failed the validity check, since many TIFFs deviate from the standard and therefore can't be checked by ImageVerifier. Don't use this option unless you know the images to be good. (Hashes are also stored for valid TIFFs and for any other types you're also processing.)

In practice, the workflow is something like this: You take a folder of images known to be good and run IV on them with the Store Hashes checkbox checked. This automatically checks the Check Structure checkbox, because a hash is stored in the database only if the file passed the structure check. Then, after the hashes are stored, you can run a hash check on the same files now, or a year from now, or 5 years from now.

Here's a simple test I ran: I stored hashes from a folder of JPEGs, and then took one of them, copied it to another folder, and wrote a single "x" byte into it, 123 bytes from the end, being careful to reset the modification time back to its original value. The resulting damaged JPEG passed the structure check, but it failed the hash check.
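A test along the same lines can be repeated with a short script. This sketch uses a generated stand-in file rather than a real JPEG and SHA-512 as an assumed hash, and it keeps the copy in the same temporary folder for brevity; the helper names are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile

def sha512_of(path):
    with open(path, "rb") as f:
        return hashlib.sha512(f.read()).hexdigest()

def damage_one_byte(path, bytes_from_end=123, value=b"x"):
    """Overwrite one byte near the end, then restore the file's timestamps."""
    st = os.stat(path)
    with open(path, "r+b") as f:
        f.seek(-bytes_from_end, os.SEEK_END)
        f.write(value)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "good.bin")
    with open(original, "wb") as f:
        f.write(bytes(range(256)) * 40)   # stand-in for a JPEG
    stored_hash = sha512_of(original)

    copy = os.path.join(tmp, "copy.bin")
    shutil.copy2(original, copy)          # copy2 carries over the timestamps
    mtime_before = os.stat(copy).st_mtime_ns
    damage_one_byte(copy)

    same_size = os.path.getsize(copy) == os.path.getsize(original)
    same_mtime = os.stat(copy).st_mtime_ns == mtime_before
    hash_matches = sha512_of(copy) == stored_hash
    print(same_size, same_mtime, hash_matches)   # True True False
```

Size and modification time are unchanged, so the key would still match, yet the hash comparison exposes the single altered byte.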

In practice, IV seems to magically just remember all your images and what their bytes are supposed to be, and then dips into its memory whenever it sees a repeat image. All this works no matter what folder structure you use, how many backups you have, or how often you rearrange things. (Just don't change the file names.)

A Manage Hashes menu choice opens a panel where you can see the folders whose images were hashed, along with a list of the keys for those images. You can purge a single folder, although you don't have to, since storing new hashes with the same keys will overwrite the old hashes.

ImageVerifier Version History

Note that at any given time the latest Mac and Windows version numbers may be different, since minor updates may only affect one platform.

Version          New Features          Bug Fixes
1.2.01 and 1.2.02 for OS X and Windows: See the II Blog entries for these versions.
1.1.02B4 for Windows
  • Now runs as a non-admin user.
  • Update check may now be turned off.
  • Small icon fixed.
 
1.1.02B2 for Windows
  • Expiration date removed. (1.1.02B1 did not work after 31-Dec-2007.)
  • MOS files added to list of raws.
 
1.1.02B3 for Mac
  • Improvements to PSD validation.
  • Several minor bugs fixed.
1.1.02B1 for Windows
  • Same feature set as 1.1.02B1 for Mac. Scheduler still unimplemented.
  • New algorithm for computing hashes (Windows only). Existing hashes will need to be recomputed.
  • Improvements to PSD validation.
  • Several minor bugs fixed.
1.1.02B1 for Mac
  • Requires registration using same Username and Registration Code as ImageIngesterPro (both applications for a single $40 license fee). Runs in trial mode, allowing up to 50 files per run, for free. No free version.
  • Verifies PSDs. Verification is very strict and may report errors on valid files. Please email support@ImageIngester.com to report your experiences.
  • New Keep Existing Hashes option limits verification and storing of hashes to those files that do not already have a hash stored. Usually very fast, as only new files and files that failed verification are processed.
  • Enhanced Examine panel. Examines PSD files.
  • Redesigned Hashed Images panel with hierarchical view of folders. Renamed Manage Hashes.
  • Longer list of raw types (SR2 added).
  • Several minor bugs fixed.
1.1.01B1 for Windows
  • First Windows beta. Structure and hash checking are implemented, but should be used with caution, as this is the first Windows version. Results stored in the database may be viewed (Tools-Results). Other menu commands aren't implemented.
 
1.1.01B7 for Mac  
  • More testing, minor bug fixes, and performance improvements. (Successfully tested about 240,000 files during a 31-hour run.)
1.1.01B6  
  • Fixed problem that caused checkboxes to remain disabled.
  • Fixed some internal memory problems and made some efficiency improvements.
1.1.01B5
  • Hash checking. See ImageVerifier page (link at left).
  • The scheduler failed to run a job correctly if its name contains spaces. Now fixed.
1.1.01B4
  • There's a new Preferences panel, for preferences that apply to all jobs. Currently there are only two: The path to Adobe DNG Converter and an Update Check checkbox.
  • The Update Check mechanism is implemented. The check is once a week or when the Help-Update Check menu item is chosen.
  • There are three new items on the Help menu: Web Site, Update Check, and View Internal Log. The Internal Log sometimes contains information useful in diagnosing problems with IV (but not problems with images). If you have a problem, I may ask you to email the internal log to me, in addition to the contents of the Run Log that appears on the main window. (The two logs are different.)
  • A count of skipped extensions and how many files of each type were skipped is shown in the Run Log.
  • The Run Log is now stored in the database.
  • There's a new checkbox on the main window to specify whether files with a TIF extension should be considered as raws rather than as TIFFs. Use it if your camera uses a TIF extension for raws.
 
1.1.01B3
  • Quality of the verification for non-DNG raws has been greatly improved by running them through DNG Converter, instead of an internal method based on the open-source program dcraw.
 
1.1.01B2  
  • TIFFs were erroneously reported as invalid on Intel Macs.
1.1.01B1 (First Mac beta)    
 

 


©2006-2008 by Marc Rochkind. All rights reserved. ImageReporter, and LRViewer may be freely copied and used provided they're not altered in any way. ImageIngester, ImageIngesterPro, ImageVerifier, ImageReporter, SpanBurner, LRViewer, LRVmaker, and PhotoSelectLink are trademarks of Marc Rochkind. Hosted by A2Hosting.com.