Using a server vs. individual desktops
Folks,
I hear the following argument a lot from small organizations:
"I like the concept of having a server, but it becomes a single point of failure. If it dies, everyone is dead in the water until it's fixed."
In response to this, it's instructive to work the math for disk drive failure rates and how they relate to your expected probability of failure. The statistic we need is the Mean Time Between Failures (MTBF), which Seagate defines on its website as device power-on hours between failures. That is, for a disk drive with an MTBF of 1,000,000 hours, the expected number of failures in 1,000,000 hours of running is one. This number of hours could be reached by running a single disk drive for 1,000,000 hours, but that's 114 years -- an unrealistic length of time that far exceeds the expected service life of any disk drive (which is around 5 years). A more realistic way of accumulating 1,000,000 hours of running time is to run 100 disk drives for 10,000 hours each, or slightly more than one year.
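Here is a minimal Python sketch of that drive-hours arithmetic, for anyone who wants to check it. The function and variable names are my own for illustration, and the 1,000,000-hour MTBF is just the example figure used above, not a spec for any particular drive.

```python
# Example figures from the text above; not a spec for any particular drive.
MTBF_HOURS = 1_000_000
HOURS_PER_YEAR = 8760


def expected_failures(drives, hours_each, mtbf_hours=MTBF_HOURS):
    """Expected number of failures for a set of drives, each run for hours_each."""
    return drives * hours_each / mtbf_hours


# One drive would need this many years to accumulate MTBF_HOURS on its own.
years_for_one_drive = MTBF_HOURS / HOURS_PER_YEAR

# 100 drives running 10,000 hours each accumulate the same 1,000,000 hours.
fleet_failures = expected_failures(drives=100, hours_each=10_000)

print(f"One drive: {years_for_one_drive:.0f} years to reach {MTBF_HOURS:,} hours")
print(f"100 drives x 10,000 hours: {fleet_failures:.1f} expected failure(s)")
```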
Modern disk drives have MTBF specs in the range of 1,000,000 hours, and there are 8,760 hours in one year. Thus, assuming a flat probability distribution (that is, a drive is equally likely to fail in any given hour of its life) we get:
The likelihood of a single disk drive failing in a year = 8,760/1,000,000 = 0.00876
The likelihood of a single disk drive *not* failing in a year = 1 - 0.00876 = 0.99124
Raise 0.99124 to the power of the number of disk drives to get the probability of no failures in a set of disk drives in one year. (This assumes the drives fail independently of one another.)
Likelihood of no failures in one year for:
1 disk drive = 0.99124^1 = 0.99124
10 disk drives = 0.99124^10 = 0.915773749
20 disk drives = 0.99124^20 = 0.83864156
50 disk drives = 0.99124^50 = 0.644081687
100 disk drives = 0.99124^100 = 0.414841219
200 disk drives = 0.99124^200 = 0.172093237
500 disk drives = 0.99124^500 = 0.012285972
Subtracting that from 1 yields the probability of at least one disk drive failure in a year for a given number of drives:
1 drive 0.00876
10 drives 0.084226251
20 drives 0.16135844
50 drives 0.355918313
100 drives 0.585158781
200 drives 0.827906763
500 drives 0.987714028
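The two tables above can be reproduced with a few lines of Python. This is only a sketch under the same assumptions as the text (a 1,000,000-hour MTBF, a flat failure distribution, and drives that fail independently); the function names are mine.

```python
HOURS_PER_YEAR = 8760
MTBF_HOURS = 1_000_000  # example MTBF figure used in the text


def p_drive_fails(hours=HOURS_PER_YEAR, mtbf_hours=MTBF_HOURS):
    """Probability that a single drive fails within the given hours."""
    return hours / mtbf_hours


def p_no_failures(n_drives):
    """Probability that none of n_drives fails in one year (independent drives)."""
    return (1 - p_drive_fails()) ** n_drives


def p_at_least_one_failure(n_drives):
    """Probability that at least one of n_drives fails in one year."""
    return 1 - p_no_failures(n_drives)


print("drives  P(no failures)  P(at least one)")
for n in (1, 10, 20, 50, 100, 200, 500):
    print(f"{n:>6}  {p_no_failures(n):.9f}     {p_at_least_one_failure(n):.9f}")
```

Change MTBF_HOURS or the list of drive counts to see how the numbers move for your own shop.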
As you can see, the likelihood of at least one failure starts to grow pretty quickly. By the time you hit 20 disk drives you have about a 16% chance, or roughly 1 in 6, of at least one disk drive failing in the course of one year. By the time you hit 500 disk drives you're pretty much guaranteed a failure.
20 disk drives isn't all that many computers. Factor in data loss failures from other causes (disk drive corruption, power supply failures, bad motherboards, etc.) and I'd estimate that the probability of a data loss failure in one year is roughly double the probability from a hardware failure of a disk drive. For an organization with 20 computers that works out to about 32%, or roughly 1 in 3.
OK, enough math, what does this mean? For an organization with 20 desktop and laptop computers without a backup system, there is a 1 in 3 likelihood in one year that someone is going to lose all of his or her data. I think that most managers would consider that an unacceptable level of risk. Remember that the cost of the failure is not just the cost of replacing the disk drive; it's also the cost of a tech's time to come in and troubleshoot and repair the system, the cost of having an employee out of work for one or more days, and the cost of re-creating the possibly irreplaceable data.
What happens with a server and network home directories? Take an Xserve and mirror two of the drives, leaving the third as a hot spare (or use a hardware RAID5 card). Set up network home directories for each user and have all users store their files on the server. Any single disk drive on a client or on the server can fail without serious data loss. Only if you had two simultaneous disk failures on the server would you lose anything. An alternative path to data loss would be drive corruption, bad motherboard, or other hardware failure -- but on the clients this would not matter, only on the server. Thus, the probability of a data loss failure is reduced from 0.32271688, or 32% per year, to 0.00876^2 + 0.00876 = 0.00883674, or less than 1% per year. The first term is the chance of both mirrored drives failing in the same year; the second is the same rough allowance for other causes that I used above, applied to the one server.
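For those who would rather script the comparison, here is a rough Python sketch of the two scenarios. It uses only the estimates from this post: the "double it for other causes" factor is my rule of thumb rather than a measured figure, the server's non-disk risk is taken to equal one drive's annual failure probability, and the function names are made up for illustration.

```python
P_DRIVE = 8760 / 1_000_000  # annual failure probability of one drive (from above)


def p_loss_desktops(n_computers, other_causes_factor=2.0):
    """Rough annual probability that someone loses data with no server or backup.

    other_causes_factor is the rule-of-thumb doubling for corruption,
    power supplies, motherboards, etc. It is an estimate, not a measurement.
    """
    p_any_drive_fails = 1 - (1 - P_DRIVE) ** n_computers
    return other_causes_factor * p_any_drive_fails


def p_loss_mirrored_server():
    """Rough annual probability of data loss on a server with two mirrored drives.

    First term: both mirrored drives fail in the same year.
    Second term: allowance for other hardware failures on the server,
    assumed equal to one drive's failure probability.
    """
    return P_DRIVE ** 2 + P_DRIVE


desktops = p_loss_desktops(20)
server = p_loss_mirrored_server()
print(f"20 desktops, no server: {desktops:.4f}")
print(f"Mirrored server:        {server:.6f}")
print(f"Risk reduced by a factor of about {desktops / server:.0f}")
```

Note that the server term treats "both drives fail in the same year" as data loss, which is pessimistic if the hot spare rebuilds the mirror before the second drive goes.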
What about recovery time? You now have two considerations. First, since all of the user data are now stored on the server you have no need to worry about salvaging anything from any disk on a desktop or laptop computer. You can just replace the drive and slap your standard image onto it, followed up with any additional bits that might need doing. The user then logs in and keeps on working. In fact, the user doesn't even have to wait that long -- he or she can just log in to another computer (a spare, or maybe one that is normally used by someone who is out sick or on vacation) and get to work right away while you fix things. Second, since it's easier to maintain complete backups, getting the server back up and running is less of a problem as well. Yes, in the latter case everyone in the office is stalled, but your recovery is made much easier and quicker.
Some food for thought.
Here's an Excel spreadsheet with the formulas if you want to play with the numbers on your own.
--Paul