Expert tip: How to troubleshoot a server RAID 5 failure


As server technology has advanced, different generations of servers call for different handling after a RAID 5 failure.

Nowadays, the network topology of large-scale applications generally follows a C/S (client/server) or B/S (browser/server) structure, with at least one server hosting a large database placed in the central computer room. For security and reliability, the server's disks are usually protected by a Redundant Array of Inexpensive Disks (RAID). RAID 5 is a parity-based array level with no dedicated parity disk: parity is distributed across all member disks. It uses data striping and independent access to process multiple access requests in parallel on the same array, while tolerating the failure of any single hard disk in the array.
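The single-disk fault tolerance described above comes from XOR parity. The following is a minimal sketch in Python, not the firmware of any particular controller: the function names, the block contents, and the parity rotation order are all assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks, stripe_index):
    """Lay out one stripe across len(data_blocks) + 1 disks.

    The parity block moves to a different disk on each stripe, so no
    single disk is a dedicated parity disk. The rotation order used
    here is an assumption; real controllers choose their own layout.
    """
    n_disks = len(data_blocks) + 1
    parity_disk = (n_disks - 1 - stripe_index) % n_disks
    stripe = list(data_blocks)
    stripe.insert(parity_disk, xor_blocks(data_blocks))
    return stripe, parity_disk


def recover_block(stripe, lost_disk):
    """Recompute the block of one lost disk from the surviving blocks."""
    survivors = [block for i, block in enumerate(stripe) if i != lost_disk]
    return xor_blocks(survivors)


# Four disks: three data blocks plus one parity block per stripe.
stripe, parity_disk = make_stripe([b"AAAA", b"BBBB", b"CCCC"], stripe_index=0)

# Whichever single disk is lost, its block can be recomputed.
print(all(recover_block(stripe, lost) == stripe[lost] for lost in range(len(stripe))))
# prints True
```

Because the XOR of all blocks in a stripe is zero, any one block equals the XOR of the others. Losing two blocks of the same stripe leaves nothing to recompute them from, which is why a second failed disk is so serious.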

In practice, array failures sometimes occur for reasons that cannot be avoided. The most common situation is that a hard disk goes offline and its status is displayed as DDD (Defunct Disk Drive), meaning the disk has a physical or logical fault. If the fault is physical, the only remedy is to replace the hard disk. If it is logical, targeted repair can bring the disk back online, preserving the striped layout of the data in the original array and the consistency of the data storage system.

However, recovering data on some of HP's older servers (such as the HP LH6000) differs from recovering data on newer ones (such as the HP ProLiant series), so different servers handle RAID 5 failures differently. I have dealt with RAID 5 array failures caused by accidental power loss on two servers, and solved each one with a different strategy.

Troubleshooting

One is an HP LH6000 server purchased in 2000, with four 18GB hard disks configured as a RAID 5 array on a NetRaid array card. The other is an HP ProLiant ML370 server purchased in 2006, with four 146GB hard disks configured as a RAID 5 array on a Smart Array 642 array card, plus a hot spare hard disk (Hot Spare). Both run Windows 2000 as the operating system and SQL Server 2000 as the database.

The fault on the HP LH6000 was as follows: the red light on one hard disk started flashing while the machine kept running normally. Before long, however, the system could no longer operate normally, and only then was the red light on a second hard disk found to be flashing.

The solution is as follows:

1. Start the server and, when the self-test reaches the array card, press Ctrl+M to enter the NetRaid management program. Check the array information; the hard disk status shows Failed. Modify the configuration to force one of the hard disks OnLine, then restart the server. In this case the hardware self-test before the system loads did not pass, and startup failed.

2. Start the server and press Ctrl+M during the array self-test to enter the NetRaid management program again. Select the disk array, manually set the hard disk that was forced OnLine in step 1 back to Failed, and then manually set the other Failed hard disk to OnLine. Restart the server; this time it boots into the system.

3. Once the system and database are running normally, enter the array configuration tool and manually set the remaining Failed hard disk to Rebuild. Restart the server after the rebuild reaches 100%. The array and the system are then restored to their original state.
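The steps above do not say why forcing one disk OnLine failed while forcing the other succeeded. One common explanation, offered here as an assumption rather than a diagnosis of this incident, is that the disk that dropped out first missed later writes and held stale data, while the disk that dropped out second was still current. The sketch below models a single stripe under that assumption. The disk numbering, block contents, and function names are hypothetical, and parity is kept on the last disk for simplicity.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_with_parity(data_blocks):
    """Three data blocks plus their parity block (parity on the last disk)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def read_degraded(disks, failed):
    """Reconstruct the block of the one Failed disk from the other members."""
    return xor_blocks([block for i, block in enumerate(disks) if i != failed])


old = stripe_with_parity([b"old0", b"old1", b"old2"])
new = stripe_with_parity([b"new0", b"new1", b"new2"])

disks = list(old)
# Assumption: disk 1 drops out first. The array keeps running degraded,
# and a later write reaches only the members that are still online.
for i in (0, 2, 3):
    disks[i] = new[i]
# Disk 2 is then marked Failed too, but its contents are still current.

# Step 1: force the stale disk 1 OnLine and leave disk 2 Failed.
step1_view = read_degraded(disks, failed=2)
print(step1_view == new[2])  # prints False: reconstructed data is wrong

# Step 2: force the current disk 2 OnLine and leave disk 1 Failed.
step2_view = read_degraded(disks, failed=1)
print(step2_view == new[1])  # prints True: degraded but consistent

# Step 3: Rebuild recomputes disk 1 from the other three members.
disks[1] = read_degraded(disks, failed=1)
print(disks == new)  # prints True: stripe is fully consistent again
```

In this model, the order in which disks are forced OnLine matters because only one of the two Failed disks agrees with the parity written after the first failure.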

The other server, an HP ProLiant ML370 running my organization's ERP system, has four 146GB hot-swappable hard disks configured as a RAID 5 array through a Smart Array card. One of the hard disks suddenly failed during operation. The array automatically brought in the hot spare (Hot Spare) to logically replace the damaged drive. Disk reads and writes continued in their original sequence, and the application and database were not affected.

Checking the hard disks with ACU (Array Configuration Utility), the tool HP ships with the server, showed that the disk with the red alarm light was offline. If two hard disks in the RAID 5 array of an HP ProLiant server show red, the system has crashed and the database cannot be accessed, although the server does not shut down automatically. Once the second hard disk lights red, the data cannot be recovered by conventional means; the only option is to pay a professional third-party data recovery company.

The array design of HP's older LH6000 series therefore differs considerably from that of the HP ProLiant series. In terms of operation, the HP LH6000 offers many array options, including deleting and re-creating the array after a failure, and initialization is selected manually. On the HP ProLiant series, by contrast, initialization runs automatically in the background once the array is configured, so a ProLiant server cannot have its array reconfigured this way after an array error.

On the HP LH6000, disks may drop out of the array for unexpected reasons other than real damage, and maintenance personnel can manually select Online, Offline, Rebuild, and so on to recover the data. Current HP ProLiant servers do not drop disks from the array the way the old server does, so when a hard disk shows red it is basically damaged and needs to be replaced. You can, of course, try re-seating the hard drive to trigger a rebuild (Rebuild) and see whether the drive can stay in service a while longer.

Technical backup

The two examples above show that servers of the same brand but different series handle RAID 5 disk faults differently because of their different technologies. In both cases the data was preserved after the rebuild, and the following lessons can be drawn:

No technology, however advanced, is foolproof. To keep data safe you must back it up, ideally with an offsite backup of the database once a day. Keep at least one new hard drive on hand as a spare. Note that a hard disk added to the array must have a capacity greater than or equal to that of the failed hard disk.
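The capacity rule above is easy to check before ordering a spare. The helpers below are a small sketch with hypothetical names. They work on nominal capacities in GB, and that is a simplification: two drives with the same label can differ slightly in real sector count, so the controller's own report is what finally counts.

```python
def replacement_fits(failed_gb, replacement_gb):
    """A replacement disk must be at least as large as the disk it replaces."""
    return replacement_gb >= failed_gb


def raid5_usable_gb(disk_sizes_gb):
    """Usable RAID 5 capacity: one disk's worth of space goes to parity,
    and every member is limited to the size of the smallest disk."""
    if len(disk_sizes_gb) < 3:
        raise ValueError("RAID 5 needs at least three disks")
    return (len(disk_sizes_gb) - 1) * min(disk_sizes_gb)


print(replacement_fits(146, 146))    # prints True
print(replacement_fits(146, 72))     # prints False
print(raid5_usable_gb([146] * 4))    # prints 438
```

For the four 146GB disks in the ProLiant example, usable space is three disks' worth. A larger replacement disk is accepted, but the extra space goes unused.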

If conditions permit, a RAID 5 plus hot spare configuration is recommended. This gives us two chances to replace a hard drive before data is lost. For general applications, RAID 5 on its own offers a good balance of data access performance, reliability, and usable disk space.
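The "two chances" claim can be illustrated with a toy state model. This sketch is a deliberate simplification with hypothetical class and state names: it assumes disks fail one at a time and that the rebuild onto the hot spare finishes before the next failure, which is not guaranteed in practice.

```python
class Raid5Array:
    """Toy model of a RAID 5 array with optional hot spares."""

    def __init__(self, members, spares=0):
        self.required = members   # disks the array needs to be optimal
        self.healthy = members    # disks currently holding valid data
        self.spares = spares      # idle hot spares still available

    def fail_disk(self):
        """One member fails; a hot spare, if any, takes its place."""
        self.healthy -= 1
        if self.spares > 0 and self.required - self.healthy == 1:
            # Assumption: the rebuild onto the spare completes in time.
            self.spares -= 1
            self.healthy += 1
        return self.state()

    def state(self):
        missing = self.required - self.healthy
        if missing == 0:
            return "optimal"
        if missing == 1:
            return "degraded"
        return "data lost"


with_spare = Raid5Array(members=4, spares=1)
print([with_spare.fail_disk() for _ in range(3)])
# prints ['optimal', 'degraded', 'data lost']

without_spare = Raid5Array(members=4)
print([without_spare.fail_disk() for _ in range(2)])
# prints ['degraded', 'data lost']
```

With a spare, the array survives two sequential failures before a third one is fatal; without one, the second failure already loses data. In either case the failed drive should be replaced as soon as it is noticed.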

Administrators must keep a constant watch on the state of the array, including the yellow warning lights on the disk array and the drive status shown in the management software. When a failure occurs, it should be dealt with promptly. Whatever the array level, back up the data before starting any troubleshooting.

Copyright © Windows knowledge All Rights Reserved