How to Tackle RAID Failures

By CIOReview | Thursday, August 18, 2016


Data storage has evolved from early recording media into modern electronic storage devices and solid-state drives. Many technologies have been introduced along the way, but most offered little protection against disk failure; this is where RAID made its mark.


RAID, short for redundant array of independent disks, uses multiple disks to provide fault tolerance, improve overall performance, and increase storage capacity. Unlike older storage systems that kept data on a single drive, RAID stores data redundantly (in multiple places) and distributes it across disks in a balanced way. One of the main reasons for RAID's widespread use over other technologies is the improved reliability it offers. With mirroring (such as RAID 1 or RAID 10) or striping with parity (such as RAID 5), the system can recover from a single hard disk failure with almost no loss of data. RAID can be implemented in hardware or in software; the two approaches differ in whether a dedicated RAID controller is present. In a simple RAID system, a number of independent hard disks are combined into a single, often larger, virtual volume.
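To make the idea of striping with parity concrete, here is a minimal Python sketch of how a lost block can be rebuilt from the surviving blocks plus a parity block. It is a simplification for illustration only: the block contents and the helper name xor_blocks are made up, and a real RAID 5 array rotates parity across all member disks rather than keeping it on one.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, as if striped across three member disks (sample data).
data = [b"RAID", b"disk", b"fail"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild its block from the
# two surviving data blocks and the parity block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the survivors with the parity block yields exactly the missing block. The same property explains why losing two blocks of the same stripe cannot be repaired with a single parity block.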

RAID Failure

Although RAID’s fault tolerance and automatic rebuild functions are claimed to be highly effective, RAID can still fail under certain circumstances. There are three general scenarios: failure of one RAID member disk, failure of several member disks, and failures not associated with the member disks at all. Depending on the RAID configuration, the fault tolerance features may also increase the amount of simultaneous reading and writing on the drives. In a typical RAID 5 configuration, the RAID controller can rebuild the data volume onto a hot standby drive, or onto a replacement drive installed through hot swap, without even a scheduled power-off. The array is lost only when a second disk fails before the rebuild completes, an event often described as having one-in-a-million odds.
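How likely a second failure really is depends on the number of drives, their failure rate, and how long the rebuild takes. The Python sketch below gives a rough estimate under assumptions that are not from this article: drive failures are independent, each drive fails at a constant rate derived from an annual failure rate, and the function and parameter names are invented for illustration.

```python
import math

HOURS_PER_YEAR = 8760.0


def second_failure_probability(surviving_drives, annual_failure_rate, rebuild_hours):
    """Estimate the chance that at least one surviving drive fails during a rebuild.

    Assumes independent failures at a constant rate (an idealization).
    """
    # Convert the annual failure rate into a constant hourly hazard rate.
    hourly_rate = -math.log(1.0 - annual_failure_rate) / HOURS_PER_YEAR
    # Probability that one given drive fails within the rebuild window.
    p_single = 1.0 - math.exp(-hourly_rate * rebuild_hours)
    # Probability that at least one of the surviving drives fails.
    return 1.0 - (1.0 - p_single) ** surviving_drives


# Hypothetical example: 4 surviving drives, 2% annual failure rate, 24-hour rebuild.
estimate = second_failure_probability(4, 0.02, 24.0)
```

The estimate grows with the number of surviving drives and with the rebuild time. Drives from the same batch that have seen the same workload are unlikely to fail independently, so a figure computed this way is better read as a lower bound than as a prediction.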

Types of RAID Malfunctions

Although RAID arrays are generally stable and reliable, they are built on mechanical devices and, like all such devices, can and do break down from time to time, which can result in significant loss of valuable files and information.

1) Controller Failure: Most RAID servers rely on a single controller. If this controller fails for any reason, the result can be boot problems and other difficulties.

2) HDD Failure: According to reports from HDD manufacturer Seagate, about 140,000 hard drives fail every week. When a hard disk fails and no hot standby is available, the RAID array runs in degraded mode, which increases the likelihood of another drive failure. It is reasonable to assume that all the drives in the array come from the same batch and have been subjected to the same working stress. So if one disk fails, the others may also be close to failure, and often are.

3) Power Surge: Frequent power surges can corrupt the RAID configuration settings stored in the NVRAM of the controller card. A power surge can also damage the controller or disk sectors, resulting in data loss.

It is also worth noting that a fault-tolerant RAID configuration is intended to protect only against physical failure, not against logical corruption such as system corruption, virus infection, or inadvertent deletion. Even an improper disk replacement procedure can cause a complete failure of the RAID system.

The Way Out

Taking the right preliminary measures is the most important part of protecting data when a RAID array fails. The first step is to turn off the array immediately. The longer you run an array in degraded mode, the more likely you are to do further damage. It is highly recommended that the user get expert guidance when replacing the failed component. Repairing a failed disk requires a clean room environment, as the dust in normal atmospheric air can do irreparable damage to the affected disk and possibly destroy the data permanently.

Since “prevention is better than cure,” here are a few basic strategies that can go a long way toward preventing RAID failure.

1) Avoid setting up RAID volumes using sequentially numbered drives that came off the same manufacturing line one after another. This reduces the risk of multiple drive failures when the failures result from a manufacturing defect that affects every drive on the line during a particular production cycle. Using drives with non-consecutive serial numbers, or drives from different manufacturers, might reduce the chances of a catastrophic RAID volume failure.

2) Use Self-Monitoring, Analysis and Reporting Technology (SMART) or other disk monitoring technologies to collect data that can help predict a disk failure. The faster a disk failure is detected, the sooner the rebuild process can be initiated and completed.

3) Do not force a drive marked as failed or as having bad sectors back into a RAID system, as this can increase the chances of complete failure. Even though vendors suggest that such drives may not actually have errors, the best course is to avoid taking the chance once the software reports that the disk may fail. Forcing a bad drive back into operation only exposes the RAID set to subsequent failures of the questionable drive and to secondary or tertiary drive failures before the situation can be fixed.
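As an illustration of the monitoring strategy in item 2, the Python sketch below flags drives whose SMART readings suggest trouble. It is only a sketch under stated assumptions: the readings are hypothetical sample data supplied as a dictionary rather than read from real hardware, the function name suspect_drives is invented, and treating any nonzero raw value of these three sector-related attributes as a warning is a conservative rule of thumb, not a vendor threshold.

```python
# Sector-related SMART attributes commonly watched as early warning signs.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def suspect_drives(readings):
    """Return the drives that report a nonzero raw value for a watched attribute.

    `readings` maps a drive name to a dict of attribute name -> raw value.
    """
    suspects = {}
    for drive, attrs in readings.items():
        bad = {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}
        if bad:
            suspects[drive] = bad
    return suspects


# Hypothetical readings for a three-disk array.
sample = {
    "disk0": {"Reallocated_Sector_Ct": 0, "Current_Pending_Sector": 0},
    "disk1": {"Reallocated_Sector_Ct": 12, "Current_Pending_Sector": 3},
    "disk2": {"Reallocated_Sector_Ct": 0, "Offline_Uncorrectable": 0},
}
flagged = suspect_drives(sample)
```

In this sample only disk1 is flagged. In practice the readings would come from a disk monitoring tool, and a flagged drive would be scheduled for replacement rather than forced back into the array.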

Data is the lifeblood of every business, and even a minor loss of sensitive data can cause both financial and reputational harm, depending on the nature of the information. In case of a disk failure, get in touch with a data recovery specialist who has the tools and facilities to properly repair the system and retrieve your files as soon as possible. The likelihood of recovering your data varies with the cause and severity of the problem, but in most cases data can be recovered from RAID array failures with professional assistance. Reports of unusual noises, reduced processing speed, and power surges that could have damaged the array can be valuable in accelerating the repair process.