
Hunting Down a Mac Hardware Problem

December 20, 2007

I'm developing ImageVerifier, an application that verifies the internal structure of image files to ensure that archives and backups are intact. It can also calculate a hash for each file, store it in a database, and then compare it to a re-hash of the file later, as an even better integrity check.

(A hash is a long sequence of characters, 128 in the case of ImageVerifier, that's computed so that every byte of the file participates. It's possible, but very rare, for two files to hash to the same sequence. Of course, the same file always produces the same hash, unless its contents have been changed.)
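To make the idea concrete, here is a minimal Python sketch of hashing a file so that every byte participates. It is not ImageVerifier's code. SHA-512 is an assumption on my part, chosen because its hexadecimal form is 128 characters long, and the name file_hash is made up for the example.

```python
import hashlib

def file_hash(path, algorithm="sha512", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`.

    The file is read in chunks so that large images never have to fit
    in memory at once. SHA-512 is an assumed choice; its hex digest is
    128 characters long.
    """
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Feed every byte of the file into the hash, one chunk at a time.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling file_hash twice on an unchanged file should give the same string both times, and that is the property the integrity check relies on.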

About a week ago I was finishing my latest version of ImageVerifier when it started to act erratically. My test files would fail the hash comparison, even though they hadn't changed. It wasn't always the same files, and they didn't always fail. But, consistently, out of a dozen files, one or two would fail.

A Bug in My Program?

My first thought was that I had a bug, since a lot of the code was pretty new. Bugs are sometimes hard to find, but when the problem is reproducible, as it was here, they never take me more than a few hours to find. Yet, after three days, I hadn't found the problem. I kept thinking I had it because the problem went away (all the hashes compared OK), but then it would come back.

Time to bring out the heavy artillery. I put the large ImageVerifier application aside and wrote a new program from scratch containing just the part that calculates the hashes and compares them to a re-hash of the files. I skipped the database (keeping the hashes in memory) and the entire user interface. It was just the barest of command-line programs. Even the paths to be tested were built-in.
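The shape of that stripped-down test can be sketched in a few lines of Python. This is an illustration of the approach, not the program I actually wrote: the function names, the choice of SHA-512, and the passes parameter are all assumptions made for the example.

```python
import hashlib

def digest(path, algorithm="sha512"):
    """Hash one file in 1 MB chunks and return the hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_mismatches(paths, passes=1):
    """Hash each file once as a baseline, then re-hash and compare.

    The baseline lives in an ordinary dict, so there is no database
    and no user interface that could be to blame for a mismatch.
    Returns a list of (pass_number, path, baseline, rehash) tuples.
    """
    baseline = {p: digest(p) for p in paths}
    mismatches = []
    for n in range(1, passes + 1):
        for p in paths:
            again = digest(p)
            if again != baseline[p]:
                mismatches.append((n, p, baseline[p], again))
    return mismatches
```

On a healthy machine, find_mismatches should return an empty list for files that nobody is modifying. Anything else points at the hashing code or at something underneath it.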

It failed. Same problem. That didn't mean I didn't have a bug, since some of the code, such as the hashing calculation itself, was still present. But the rest of ImageVerifier was off the hook.

I was using SHA-2 hashing code written by Olivier Gay. After eliminating everything else in the smaller program that was taken from ImageVerifier, that was all that was left. So, I replaced the SHA-2 hash with an MD5 hash, this time using code from Ronald Rivest.

It failed with MD5 hashing.

If It's Not a Software Bug, What Is It?

Now it was time to start thinking the unthinkable, or at least the unusual. I made a list of what could be wrong:
  1. Compiler or library bug. (I was using Apple's Xcode 3.0, which uses the GNU compilers underneath.) Or, maybe some bug shared by both ImageVerifier and the smaller test program I wrote, even though they shared no code of mine.
  2. Bug in OS X Leopard (10.5), which I had installed a few weeks before. (This was uppermost in my mind.)
  3. Hardware problem. I ran the TechTool Deluxe hardware diagnostics supplied by Apple, but they didn't report any problems. I didn't think that meant anything since the problem was elusive. My Mac had been behaving perfectly, staying up for days, if not weeks, with no problems at all, other than with ImageVerifier and my other hashing tests.

The way to hunt down computer problems is to conduct a series of experiments, each of which is designed to eliminate a possibility. As Sherlock Holmes said, "When you have eliminated the impossible, whatever remains, however improbable, must be the truth."

Proving It's Not My Program

My first experiment was to write a shell script that used the well-known openssl command, shipped with every Mac, to do the hashing, thus eliminating any coding on my part or my Xcode installation. (The shipped openssl was probably built with Xcode by somebody at Apple, but at least it wasn't me. Unlike my own test programs, openssl is very widely used.) My script simply ran the command "openssl md5" on each of 14 files ranging in size from 7 to 190 MB. Each file was processed 20 times and then the results were sorted, so that it was easy to tell if any of the 20 hashes were different.

That test failed, too. Here's some sample output:


MD5(/TestData/file1.psd)= 6f2014e423f3933cb1f2d81be9d1ce6b
MD5(/TestData/file1.psd)= 6f2014e423f3933cb1f2d81be9d1ce6b
MD5(/TestData/file1.psd)= 6f2014e423f3933cb1f2d81be9d1ce6b
MD5(/TestData/file1.psd)= bb2e4be6080e3821655daf38d8862afa
MD5(/TestData/file2.psd)= 9a6a0f5c2ff1153a210e013b666a2df1
MD5(/TestData/file2.psd)= 9a6a0f5c2ff1153a210e013b666a2df1
MD5(/TestData/file2.psd)= 9a6a0f5c2ff1153a210e013b666a2df1
MD5(/TestData/file2.psd)= 9a6a0f5c2ff1153a210e013b666a2df1

Clearly, the 4th hash of file1.psd is wrong.
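Spotting the odd hash by eye works for eight lines, but not for 14 files times 20 runs. Here is a small Python sketch that does the same check mechanically on output in the format shown above. The function name and the regular expression are my own for this example, and the sample lines are the ones from this article.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as: MD5(/TestData/file1.psd)= 6f20...ce6b
LINE = re.compile(r"^MD5\((?P<path>.+)\)= (?P<digest>[0-9a-f]{32})$")

def inconsistent_files(lines):
    """Return {path: Counter of digests} for files whose runs disagree."""
    seen = defaultdict(Counter)
    for line in lines:
        m = LINE.match(line.strip())
        if m:  # silently skip blank or unrelated lines
            seen[m.group("path")][m.group("digest")] += 1
    return {path: counts for path, counts in seen.items() if len(counts) > 1}

sample = """\
MD5(/TestData/file1.psd)= 6f2014e423f3933cb1f2d81be9d1ce6b
MD5(/TestData/file1.psd)= 6f2014e423f3933cb1f2d81be9d1ce6b
MD5(/TestData/file1.psd)= 6f2014e423f3933cb1f2d81be9d1ce6b
MD5(/TestData/file1.psd)= bb2e4be6080e3821655daf38d8862afa
MD5(/TestData/file2.psd)= 9a6a0f5c2ff1153a210e013b666a2df1
MD5(/TestData/file2.psd)= 9a6a0f5c2ff1153a210e013b666a2df1
MD5(/TestData/file2.psd)= 9a6a0f5c2ff1153a210e013b666a2df1
MD5(/TestData/file2.psd)= 9a6a0f5c2ff1153a210e013b666a2df1
""".splitlines()

for path, counts in inconsistent_files(sample).items():
    print(path, dict(counts))
```

For the sample, only file1.psd is reported, with one digest seen three times and the other seen once. The digest in the minority is the likely bad read, although the counts alone can't prove which value is the correct one.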

That ruled out, or at least made very unlikely, a bug in my code or in Xcode. Even Apple's own utility openssl demonstrated the problem.

Is It the Drive?

Next I tried the script on an external FireWire drive. That failed too, which ruled out a drive problem. I don't know exactly how FireWire works, but I think that ruled out a disk controller problem, too.

The Mac I'd been using was a PowerPC G5 iMac. I also tried the script on another Mac, an Intel model running Tiger (OS X 10.4), and it ran correctly every time.

Is It Leopard?

At this point I strongly suspected Leopard, still very new and known to have problems, as all new operating systems do, although nothing as bad as what I was seeing had been reported. Fortunately, I still had a mirror image of my Tiger system on an external drive, so it was very easy to reboot under Tiger and run the script.

It failed. Leopard was exonerated.

Must Be Hardware

It had to be hardware, but not the drive itself, since I had gotten it to fail on two different drives.

Aside from the drive, the only replaceable hardware parts that could be causing the problem were the memory modules. My Mac had two, 1GB each. It came with two 512MB modules that I had replaced a couple of years ago when I first got the Mac.

So, I opened up the case, pulled out one of the 1GB modules and replaced it with 512MB. This introduced another variable, a 1.5GB system instead of a 2GB one, but I didn't have another 1GB module on hand.

Proving It's Memory

I rebooted and started up the tests, both the shell script and ImageVerifier (which apparently had been doing its job all along). The tests now were going to be more time-consuming if they ran successfully, because the task was to prove that the problem was gone. It's always harder to prove a negative, but I decided that 20 successful runs of the script and 10 runs of ImageVerifier would be enough to implicate the module I pulled out. (There was a 50/50 chance it was the wrong module; if the test failed, I would replace the other one and start over.)
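A rough way to see why 30 clean runs is persuasive: if the fault were still present and each run failed independently with some probability, the chance of 30 passes in a row by luck shrinks very quickly. The Python sketch below works that out. The failure rates in it are hypothetical, not measurements, and treating script runs and ImageVerifier runs as equivalent and independent is a simplifying assumption.

```python
def chance_of_clean_streak(p_fail, runs):
    """Probability that `runs` independent runs all pass by luck,
    given that each run fails with probability `p_fail` while the
    fault is still present."""
    return (1.0 - p_fail) ** runs

# 20 script runs plus 10 ImageVerifier runs, lumped together.
RUNS = 30

# Hypothetical per-run failure rates; none of these were measured.
for p_fail in (0.5, 0.2, 0.1):
    print(f"p_fail={p_fail}: {chance_of_clean_streak(p_fail, RUNS):.2e}")
```

Even with a failure rate as low as one run in ten, 30 clean runs in a row would be unlikely by chance, and before the swap the failures had been far more frequent than that.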

All the tests succeeded. I kept running the ImageVerifier test throughout the day, and it worked every time. It had never even worked one time with that bad memory module in the system.

Problem Solved... I Hope!

So I ordered a new 1GB module (about $90 from Newegg.com). There's still a slim chance it won't fix the problem, since the memory subsystem runs a bit differently with 1.5GB than it does with a full and balanced 2GB (two equal-sized modules). If that's the case, it will be worth writing about, and I'll do so in another blog article.

What does all this mean, aside from my system being fixed? I'll talk about that in the next article, which I've posted simultaneously with this one.

Entire contents of this web site Copyright 2006-2008 by Marc Rochkind. All rights reserved.