TideLog Archive for November, 2013

Many electronics manufacturers, including HDD manufacturers like Seagate, have been using the industry standard “Mean Time Between Failures” (MTBF) to quantify disk drive average failure rates. MTBF has proven useful in the past, but it is flawed.

To address issues of reliability, Seagate is changing to another standard: “Annualized Failure Rate” (AFR).

MTBF is a statistical term relating to reliability as expressed in power on hours (p.o.h.) and is often a specification associated with hard drive mechanisms.
It was originally developed for the military and can be calculated in several different ways, each yielding substantially different results. It is common to see MTBF ratings of between 300,000 and 1,200,000 hours for hard disk drive mechanisms, which might lead one to conclude that the specification promises roughly 34 to 137 years of continuous operation. This is not the case! The specification is based on a large (statistically significant) number of drives running continuously at a test site, with the data extrapolated according to various known statistical models to yield the results.

The MTBF is estimated from the error rate observed over a few weeks or months, so it is not representative of how long your individual drive, or any individual product, is likely to last. Nor is the MTBF a warranty: it represents the relative reliability of a family of products. A higher MTBF merely suggests a generally more reliable and robust family of mechanisms (depending upon the consistency of the statistical models used). Historically, the field MTBF, which includes all returns regardless of cause, has typically been 50-60% of the projected MTBF.

Seagate’s new standard is AFR. AFR is similar to MTBF and differs only in units. While MTBF is the probable average number of service hours between failures, AFR is the probable percent of failures per year, based on the manufacturer’s total number of installed units of similar type. AFR is an estimate of the percentage of products that will fail in the field due to a supplier cause in one year. Seagate has transitioned from average measures to percentage measures.
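As a rough illustration of how the two figures relate, the sketch below converts an MTBF into an AFR and back. It assumes a constant failure rate (an exponential model) and a drive that is powered on all year; both are simplifying assumptions of this example, not a confirmed description of how any manufacturer calculates its figures. The function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours x 365 days of continuous operation


def afr_from_mtbf(mtbf_hours, poh_per_year=HOURS_PER_YEAR):
    """Return the annualized failure rate as a fraction (0.01 means 1%).

    Assumes a constant failure rate, so the chance of surviving a year
    is exp(-power_on_hours / MTBF).
    """
    return -math.expm1(-poh_per_year / mtbf_hours)


def mtbf_from_afr(afr, poh_per_year=HOURS_PER_YEAR):
    """Inverse of afr_from_mtbf: return the MTBF in hours for a given AFR."""
    return -poh_per_year / math.log1p(-afr)


for mtbf in (300_000, 1_200_000):
    print(f"MTBF {mtbf:,} h -> AFR {afr_from_mtbf(mtbf) * 100:.2f}%")
```

Under these assumptions an MTBF of 1,200,000 hours works out to an AFR of about 0.73%, and 300,000 hours to about 2.88%.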

MTBF quantifies the probability of failure for a product. However, when a product is first introduced, this rate is often only a predicted number; a manufacturer can provide demonstrated or actual MTBF measurements only after a substantial amount of testing or extensive use in the field. AFR will better allow service plans and spare unit strategies to be set.
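To show why a percentage is handy for planning spares, here is a minimal sketch that turns an AFR into an expected number of failures for a fleet. The fleet size of 1,000 drives is a hypothetical figure chosen for the example, and the function names are invented here; the 0.73% AFR is the Barracuda ES.2 figure quoted later in this post.

```python
import math


def expected_failures(fleet_size, afr_percent, years=1):
    """Expected number of failed drives in a fleet over the given period."""
    return fleet_size * (afr_percent / 100.0) * years


def spares_to_stock(fleet_size, afr_percent, years=1):
    """Round the expectation up, since you cannot stock a fraction of a drive."""
    return math.ceil(expected_failures(fleet_size, afr_percent, years))


fleet = 1000  # hypothetical number of installed drives
print(f"Expected failures per year: {expected_failures(fleet, 0.73):.1f}")
print(f"Spares to keep on the shelf: {spares_to_stock(fleet, 0.73)}")
```

This is only an average: in any given year the real number of failures can come out higher or lower, so a cautious plan would keep a margin on top.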

Hard drive reliability is closely related to temperature. Drives are designed to operate at an ambient temperature of 86°F. Temperatures above 122°F or below 41°F decrease reliability. Directed airflow of up to 150 linear feet/min is recommended for high-speed drives.
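For readers who think in metric, the sketch below converts the figures above and checks whether a reading falls inside the 41°F to 122°F band. The function names are invented for this example, and the band is simply the one stated above, not a specification for any particular drive.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


def feet_per_min_to_metres_per_sec(speed):
    """Convert an airflow speed from linear feet/min to metres/second."""
    return speed * 0.3048 / 60


def in_reliable_range(temp_f, low_f=41, high_f=122):
    """True if the temperature is within the band where reliability holds up."""
    return low_f <= temp_f <= high_f


for temp_f in (41, 86, 122):
    print(f"{temp_f}°F = {fahrenheit_to_celsius(temp_f):.0f}°C")

print(f"150 ft/min = {feet_per_min_to_metres_per_sec(150):.3f} m/s")
print(in_reliable_range(130))  # a reading above the band
```

So the design temperature of 86°F is 30°C, and the band runs from 5°C to 50°C.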

The failure rate does not include drive returns with “no trouble found”, excessive shock failure, or handling damage.

Here is an example excerpt from a Product Manual, in this case for the Barracuda ES.2 Near-Line Serial ATA drive, which we installed in a backup server at Kana’s datacentre:

The product shall achieve an Annualized Failure Rate – AFR – of 0.73% (Mean Time Between Failures – MTBF – of 1.2 Million hrs) when operated in an environment that ensures the HDA case temperatures do not exceed 40°C. Operation at case temperatures outside the specifications in Section 2.9 may increase the product Annualized Failure Rate (decrease MTBF). AFR and MTBF are population statistics that are not relevant to individual units.
AFR and MTBF specifications are based on the following assumptions for business critical storage system environments:

  • 8,760 power-on-hours per year.
  • 250 average motor start/stop cycles per year.
  • Operations at nominal voltages.
  • Systems will provide adequate cooling to ensure the case temperatures do not exceed 40°C. Temperatures outside the specifications in Section 2.9 will increase the product AFR and decrease MTBF.

1.2 million hours MTBF? I’d have expected that kind of lifetime from an older hard drive, from the days when they were made to LAST by manufacturers like Conner and ExcelStor, but you certainly won’t get THAT kind of running time from a modern drive, and certainly not 1.2 million hours of CONSTANT running!


Bad sectors are small areas of your hard disk that cannot be read. More than that, though, they have the potential to cause real damage to your hard drive (catastrophic failure) if they build up over time, because they stress the drive’s arm, which carries the read/write heads. There are two heads for each platter, one for each side. Bad sectors are fairly common with normal computer use and the imperfections of the world we live in. Like chip fabrication and LCD panel manufacturing, HDD manufacture is a very critical, precise process, and just as a TFT can come from the factory with bad pixels, an HDD can come with bad sectors due to imperfections when it’s made. The manufacturers make legal allowances for a certain number of these imperfections before warranty claims can be made, like the legal limit of 5 dead pixels on a TFT.

However, there are several simple steps you can take to prevent HDD bad sectors and to repair any that you do have. Bad sectors also slow down your computer, as the drive spends time attempting to read them. Here is a step-by-step guide. The most common questions I get as a computer engineer are “What is a sector?” and “How are HDD bad sectors created?”

A sector is simply a unit of information stored on your hard disk. Rather than being a mass of fluid information, your hard disk stores things neatly in “sectors”, a bit like us humans putting things into boxes: each box only holds so much, and all the boxes are the same size. The standard sector size is 512 bytes.
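To make the “boxes” idea concrete, this small sketch works out how many 512-byte sectors a piece of data needs and how much of the last sector is left unused. It deliberately ignores file system details such as clusters, and the function names are made up for the example.

```python
SECTOR_SIZE = 512  # bytes per sector, as stated above


def sectors_needed(size_bytes, sector_size=SECTOR_SIZE):
    """Number of whole sectors needed to hold size_bytes (rounded up)."""
    return (size_bytes + sector_size - 1) // sector_size


def unused_bytes(size_bytes, sector_size=SECTOR_SIZE):
    """Bytes left empty in the last sector, since boxes cannot be shared."""
    return sectors_needed(size_bytes, sector_size) * sector_size - size_bytes


for size in (1, 512, 513, 1_000_000):
    print(f"{size:,} bytes -> {sectors_needed(size):,} sectors, "
          f"{unused_bytes(size)} bytes unused")
```

A single byte still takes up a whole sector, and 513 bytes spill over into a second one.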

There are various problems that can cause HDD bad sectors:

  • Improper shutdown of Windows, especially power loss while the HDD is writing data;
  • Defects of the hard disk, including general surface wear, pollution of the air inside the unit due to a dirty or clogged air filter, or the head touching the surface of the disk;
  • Other poor quality or aging hardware, including dodgy data cables, an overheated hard drive, and even a power supply problem, if your drive’s power is erratic;
  • Malware.

Hard and soft bad sectors

There are two types of bad sectors – hard and soft.

Hard bad sectors are the ones that are physically damaged (for example by a head crash, if your drive is dropped while running and writing data) or stuck in a fixed magnetic state. If your computer is bumped while the hard disk is writing data, is exposed to extreme heat, or simply has a faulty mechanical part that allows the head to contact the disk surface, a “hard bad sector” might be created. Hard bad sectors cannot be repaired, but they can be prevented. The heads of a hard drive float on the air cushion generated by the spinning platters, flying less than the width of a human hair above them. At that scale even a small speck of dust is like a mountain, so knocks are definitely to be avoided.

Soft bad sectors occur when the error correction code (ECC) stored with a sector does not match the content of the sector. Whenever data is written to a sector, the drive calculates a “checksum”, which is used to verify the data. If it doesn’t match when the sector is read, the drive knows the sector is weak. A soft bad sector is sometimes explained as the “hard drive formatting wearing out”; in other words, the magnetic field is weakening, like on an old video cassette. These are logical errors, not physical damage. They are repairable by overwriting everything on the disk with zeros. Like tapes and CDs, the magnetic surface on a hard disk does not last forever, and it is affected by other magnetic fields around it, so data recovery guys like me recommend regularly imaging a drive directly to another to keep the data fresh and readable.
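The checksum idea can be sketched in a few lines. Note the assumptions: this uses CRC-32 from Python’s standard zlib module as a stand-in, whereas real drives use much stronger ECC in firmware that can also correct errors, and the function names here are invented for the illustration.

```python
import zlib

SECTOR_SIZE = 512  # bytes per sector


def write_sector(data):
    """Return the data together with its checksum, as if stored side by side."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("a sector must hold exactly 512 bytes")
    return bytes(data), zlib.crc32(data)


def read_ok(stored_data, stored_checksum):
    """Recompute the checksum on read and compare it with the stored one."""
    return zlib.crc32(stored_data) == stored_checksum


# Write a sector, then read it back untouched.
data, checksum = write_sector(bytes(range(256)) * 2)
print(read_ok(data, checksum))

# Simulate a weakened magnetic field by flipping a single bit.
weak = bytearray(data)
weak[10] ^= 0x01
print(read_ok(bytes(weak), checksum))

# Overwriting with zeros stores fresh data and a fresh, matching checksum.
data, checksum = write_sector(bytes(SECTOR_SIZE))
print(read_ok(data, checksum))
```

The first and last reads pass and the middle one fails, which is the “soft” failure described above: nothing is physically broken, and a fresh write brings the sector back.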

Preventing bad sectors

You can help prevent bad sectors (always better than trying to repair them, as they say prevention is better than cure!) by paying attention to both the hardware and the software on your computer.

Preventing bad sectors caused by hardware:

  • Make sure your computer is kept cool and dust free;
  • Make sure you buy good quality hardware from respected brands. In my experience, cheap RAM and power supplies are the biggest culprits;
  • Always move your computer carefully, and make sure it is TURNED OFF, not in Sleep mode, as it can wake up while being moved (especially a laptop);
  • Keep your data cables as short as possible;
  • Always shut down your computer correctly, and use an uninterruptible power supply (UPS) if your house is prone to blackouts.

Preventing bad sectors using software:

  • Use a quality disk defragmenter program with automated scheduling to help prevent head crashes (head crashes can create hard bad sectors). Disk defragmentation reduces hard drive wear and tear, thus prolonging its lifetime and preventing bad sectors;
  • Run quality anti-virus and anti-malware software and keep the programs updated.

Monitoring bad sectors

If you use a tool like HD Sentinel or CrystalDiskInfo and you notice bad sectors on your drive, keep an eye on it. A few bad sectors are not normally a problem. As I mentioned at the start of the article, up to 5 bad pixels are allowed on a new TFT before it becomes a warranty claim, and in the same way hard drives are allowed a few bad sectors due to the imperfections of their manufacturing process. Drives are manufactured with what are known as “reserved sectors”, a spare area of the disk only accessible by the controller board. If a sector is weak, the controller will attempt to move the data to the reserved area. If this is successful, it then attempts a quick read/write test on the old sector (this takes no more than a few milliseconds). If the test fails, the controller marks the sector as bad in the sector map, which is also stored in the drive’s reserved area along with the drive firmware, so that it doesn’t attempt to use it again.

If the number of bad sectors starts increasing, or you start to experience other symptoms, such as the drive dropping out completely as if you had unplugged it, clicking noises, or data taking longer to read or copy, this could indicate a fault with the read/write heads or the control circuitry. Stop using the drive immediately and back up any important data to another drive. If the failing drive is under warranty, print off a log from HD Sentinel and take it with you as evidence when you return the drive.
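If you jot down the bad sector count each time you check, spotting a rising trend is simple. The sketch below is only an illustration: the readings are invented, they are typed in by hand rather than pulled from HD Sentinel or CrystalDiskInfo, and the function name is made up.

```python
def bad_sector_trend(readings):
    """Classify a chronological list of bad sector counts.

    Returns "unknown" with fewer than two readings, "increasing" if the
    latest count is higher than the first, and "stable" otherwise.
    """
    if len(readings) < 2:
        return "unknown"
    if readings[-1] > readings[0]:
        return "increasing"
    return "stable"


# Hypothetical weekly readings noted down from a monitoring tool.
healthy_drive = [2, 2, 2, 2]
failing_drive = [2, 5, 11, 40]

print(bad_sector_trend(healthy_drive))
print(bad_sector_trend(failing_drive))
```

A drive whose count sits still is the “keep an eye on it” case; one whose count keeps climbing is the one to back up and replace.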

S.M.A.R.T. values to look for

When looking at S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) data, the two main values to look out for are:

Reallocated Sector Count

This shows how many of the drive’s reserved sectors have been used. If too many of these are used, it generally indicates a problem with the disk surface.

Current Pending Sector

This shows how many bad sectors are currently waiting to be rewritten. A hard drive will always try to rewrite the sector first. If the rewrite fails, the data is reallocated to the reserved area, the drive adds the sector to the Reallocated Sector Count, and the original sector is marked as unusable. If the rewrite is successful, the Pending Sector count will drop.
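The way the two counters move can be modelled with a toy class. This is a simplified sketch of the behaviour described above, not real drive firmware: the class and method names are invented, and the number of reserved sectors is a made-up figure.

```python
class DriveCounters:
    """Toy model of the pending and reallocated sector counters."""

    def __init__(self, reserved_sectors):
        self.reserved_free = reserved_sectors  # spare sectors still available
        self.pending = 0                       # Current Pending Sector
        self.reallocated = 0                   # Reallocated Sector Count

    def mark_pending(self):
        """A sector failed to read cleanly and is now waiting for a rewrite."""
        self.pending += 1

    def resolve_pending(self, rewrite_succeeded):
        """Settle one pending sector after the drive has tried to rewrite it."""
        if self.pending == 0:
            raise ValueError("no pending sectors to resolve")
        if not rewrite_succeeded:
            if self.reserved_free == 0:
                raise RuntimeError("no reserved sectors left")
            # The rewrite failed, so a spare sector takes over.
            self.reserved_free -= 1
            self.reallocated += 1
        # Either way, the sector is no longer pending.
        self.pending -= 1


drive = DriveCounters(reserved_sectors=100)  # hypothetical spare pool
drive.mark_pending()
drive.mark_pending()
drive.resolve_pending(rewrite_succeeded=True)   # soft error, rewrite worked
drive.resolve_pending(rewrite_succeeded=False)  # hard error, sector remapped
print(drive.pending, drive.reallocated, drive.reserved_free)
```

After one successful and one failed rewrite, nothing is left pending, one sector has been reallocated, and the spare pool is one sector smaller.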

 
