Today at the data center, I was scheduled to upgrade one of our systems. I was pretty well prepared and was just going through some final checks as I waited for the download of the new software to complete. I had already shut down the services and made a final backup, so I decided to try the restore feature to make sure it would work later after the upgrade was complete. It didn't work, and neither did the other 3 sets of backups I tried!
As mistakes go, not testing this backup would have been a huge one! All of the data on the server would have been completely lost. Recovering would have meant not just the time to recreate some of the documents and settings, but also work from several different groups, and it would have left the server down for at least 3-4 days!
A call to the manufacturer's tech support line was helpful, and it turned out that there was a hidden failure in the backups that can only be discovered by running a test restore like the one I did. This wasn't in their documentation for the backup feature, nor was it in their documentation for the upgrade procedure. If I hadn't been diligent enough to test out the restore procedure, I would have had a major problem on my hands. The fact that my boss had just left for vacation and the vendor was about to close for the weekend would have made the problem worse.
I trusted that the backup would work, but it didn't. The reason why is a bit too obscure for this blog, but it is something I never could have imagined. In fact, the engineers for the vendor seemed perplexed by it, too. With all of the pieces still intact, it is easy to figure out why the system failed. But if I had already installed the upgrade, there would be only guesses as to why the failure had happened. This is why you should always test your backups to make sure they will do what you think they will.
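For anyone who wants to automate that kind of check, here is a minimal sketch in Python. It is not the vendor's restore feature or the procedure I ran; it simply assumes you can restore a backup into a scratch directory and still have the original files to compare against. The names sha256_of and compare_trees are made up for this illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(original: Path, restored: Path) -> list[str]:
    """Return a list of files that are missing or differ after a restore.

    Hypothetical helper: 'original' is the live data directory and
    'restored' is wherever the test restore was unpacked.
    """
    problems = []
    for src in sorted(p for p in original.rglob("*") if p.is_file()):
        rel = src.relative_to(original)
        dst = restored / rel
        if not dst.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(src) != sha256_of(dst):
            problems.append(f"differs: {rel}")
    return problems
```

An empty list means every original file came back byte-for-byte. Anything else is worth chasing down while the original system is still intact.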
And you should check other systems that you use, too. In the computer security world, penetration testers help check out our security. Shows like "It Takes A Thief" do the same thing for people's home security. In some cases, even the most well-planned and well-implemented systems can be broken. But most of the time there are holes in the design or execution that make the difference.
Effective systems have verification designed into them. This is one of the strengths of peer review in the scientific process. It helps ensure that others can replicate the results that one researcher finds. Because fraud, mistakes, and interpretation can distort the facts, the scientific community needs to make sure that these things are minimized. This moves our body of knowledge closer and closer to the truth of the world.
This reminds me of a scene from "Road Trip" where the kids are talking about jumping over a broken bridge in a car. They do a lot of calculating and thinking about it and decide that the distance really isn't that far and they can make it. And then they just happen to test it out by putting just a little bit of weight on the other side of the bridge. The bridge collapses. That was the easy way to find out their plan would have fallen apart with the bridge. Of course in the movie they went for it anyway, but at least they KNEW it wouldn't work.
Update: It turns out that the backups were failing because they were being uploaded to an FTP server in the wrong mode. The client didn't properly change to "binary" mode and instead was sending files as "ASCII". Many servers automatically determine that the file is binary and transmit in the correct mode, even though setting the mode is the client's responsibility. However, ours doesn't.
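To show why the wrong mode is so damaging, here is a rough sketch in Python. It is an illustration under assumptions, not a reproduction of what our backup client or server actually does: simulate_ascii_transfer is a made-up, simplified model in which every line-feed byte gets rewritten as a carriage return plus line feed, which is the kind of newline translation ASCII mode performs. The upload_binary function uses Python's standard ftplib, whose storbinary call switches to binary ("TYPE I") before sending; the host, credentials, and file names are placeholders.

```python
import hashlib
from ftplib import FTP


def simulate_ascii_transfer(data: bytes) -> bytes:
    """Simplified model of an ASCII-mode transfer (an assumption, not a trace):
    each LF byte is rewritten as CRLF, whether or not it was really a newline."""
    return data.replace(b"\n", b"\r\n")


def upload_binary(host: str, user: str, password: str,
                  local_path: str, remote_name: str) -> None:
    """Upload a file in binary mode. All arguments are placeholders."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(local_path, "rb") as fh:
            # storbinary issues TYPE I (binary) before the STOR command.
            ftp.storbinary(f"STOR {remote_name}", fh)


# A few bytes standing in for a binary backup; 0x0A happens to appear twice.
payload = bytes([0x1F, 0x8B, 0x08, 0x0A, 0x00, 0x0A])
mangled = simulate_ascii_transfer(payload)

print(len(payload), len(mangled))
print(hashlib.sha256(payload).hexdigest() == hashlib.sha256(mangled).hexdigest())
```

In this model the six-byte payload grows to eight bytes and the checksums no longer match. A text file would survive that treatment, but a compressed or otherwise binary backup would not restore.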