Bulletproof Storage
Disk systems will repair themselves or can be left unrepaired for years.
You can fly a two-engine plane with one engine, but how many passengers would want to be on it?
That’s the idea behind “bulletproof storage,” a concept that IBM has been developing for two years and plans to begin unveiling incrementally over the next one to three years.
IBM’s technology initiative deals with fault tolerance in every part of a storage system: disk, controller, network cards, power supplies and software. By building more-robust storage systems whose redundant components let administrators defer replacement of failed parts for up to three years, IBM believes it can also eliminate many of the human errors that happen when failing components are replaced.
According to Stanley Zaffos, an analyst at Gartner Inc., the bulletproof storage concept is still five to 10 years from being broadly embraced by users. But once it is, storage systems will require less maintenance and, therefore, cost less to maintain.
“We know how to build very reliable code. We use appliances every day that have software built into them that work forever: your automobile, your calculator, the disk drive in your PC, your telephone,” Zaffos says.
But IBM is looking to attack far more complex systems than telephones or calculators.
Under its bulletproof initiative, IBM is addressing disk-sector failures, which become more likely as disk capacity grows. While disk capacities double every 12 to 18 months, uncorrectable read/write error rates haven’t improved, so the probability of an uncorrectable error occurring on any given disk read hasn’t decreased. Today’s disks have more sectors and, therefore, a greater chance of an uncorrectable error somewhere on the disk.
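The arithmetic behind that claim can be sketched in a few lines. This is only an illustration: the per-bit error rate below is an assumed figure chosen for the example, not a number from IBM or Gartner, and the function name is hypothetical. If each bit read fails independently with probability p, reading n bits hits at least one uncorrectable error with probability 1 - (1 - p)^n, which rises as capacity rises even though p stays flat.

```python
import math

# Assumed per-bit uncorrectable error rate, for illustration only.
ASSUMED_BIT_ERROR_RATE = 1e-14


def p_unrecoverable_read(capacity_bytes: float, bit_error_rate: float) -> float:
    """Probability of at least one uncorrectable error in one full-disk read.

    Computes 1 - (1 - p)**n using log1p/expm1 so that very small
    per-bit rates do not vanish in floating-point rounding.
    """
    bits = capacity_bytes * 8
    return -math.expm1(bits * math.log1p(-bit_error_rate))


# The per-bit rate stays constant while the capacity doubles.
for terabytes in (0.5, 1, 2, 4):
    p = p_unrecoverable_read(terabytes * 1e12, ASSUMED_BIT_ERROR_RATE)
    print(f"{terabytes:>4} TB: {p:.3f}")
```

Each doubling of capacity roughly doubles the chance of an unrecoverable read while that chance is still small, which is the trend the paragraph above describes.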
The answer is to create self-healing capabilities for storage management software and more-robust RAID configurations.
IBM says that in about a year it will release storage systems that can support three simultaneous disk-drive failures in a single array by introducing additional parity disks into RAID configurations, offering many times the resiliency of a RAID configuration with two parity disks. Today, standard systems allow for only two disk failures.
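To see why a third parity disk matters, here is a rough sketch under a deliberately simple assumption: every disk in the array fails independently with the same probability during some window (for example, the time until a failed part is replaced). Real failures are often correlated, and this is not IBM's reliability model. The function name and the numbers are hypothetical. An array with m parity disks loses data only when more than m disks fail together.

```python
from math import comb


def p_array_loss(n_disks: int, n_parity: int, p_disk_fail: float) -> float:
    """Probability that more than n_parity of n_disks fail in the same window.

    Assumes independent, identically likely disk failures (a simplification).
    """
    return sum(
        comb(n_disks, k) * p_disk_fail**k * (1 - p_disk_fail) ** (n_disks - k)
        for k in range(n_parity + 1, n_disks + 1)
    )


# Hypothetical 12-disk array, 2% chance that any one disk fails in the window.
for parity in (1, 2, 3):
    loss = p_array_loss(12, parity, 0.02)
    print(f"{parity} parity disk(s): P(data loss) = {loss:.2e}")
```

Under these assumptions, each added parity disk cuts the chance of data loss by a large factor, which is consistent with the "many times the resiliency" claim, though the exact ratio depends entirely on the assumed failure probability.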
But Zaffos argues that 80% of downtime today is caused by user error and software failures, not hardware failures. He says that the failures resulting from software are created by complexity and that there is an almost infinite number of failures that can occur in a complex system.
IBM is addressing those code failures with a software project called N-Version Programming, in which two independently written pieces of code in the same application each save the data and then compare the results to ensure that there are no errors.
In N-Version Programming, two copies of data are protected using different means. One copy might be protected by standard RAID-5 programming coded by Programmer A.
The second copy is protected by a different algorithm coded by Programmer B. That way, if the first copy gets corrupted due to a particular bug in the program written by Programmer A, then the second copy can be used.
The second copy may have its own bugs, but they will manifest in different ways at different times, and when they do, the first copy will still be good and can be used instead. It’s like having a second person check the first person’s work and fix mistakes whenever they find them.
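The idea can be sketched as follows. This is a toy illustration, not IBM's code: every function name is hypothetical, version A stands in for "Programmer A" with a RAID-5-style XOR parity block, and version B stands in for "Programmer B" with per-block SHA-256 digests, a protection scheme picked here only because it is different from A's.

```python
import hashlib
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def protect_a(blocks):
    """Version A (hypothetical Programmer A): keep a RAID-5-style parity block."""
    return {"blocks": list(blocks), "parity": xor_parity(blocks)}


def verify_a(copy):
    return xor_parity(copy["blocks"]) == copy["parity"]


def protect_b(blocks):
    """Version B (hypothetical Programmer B): keep a SHA-256 digest per block."""
    digests = [hashlib.sha256(block).digest() for block in blocks]
    return {"blocks": list(blocks), "digests": digests}


def verify_b(copy):
    return all(
        hashlib.sha256(block).digest() == digest
        for block, digest in zip(copy["blocks"], copy["digests"])
    )


def read(copy_a, copy_b):
    """Return the data from whichever copy still passes its own check."""
    if verify_a(copy_a):
        return b"".join(copy_a["blocks"])
    if verify_b(copy_b):
        return b"".join(copy_b["blocks"])
    raise IOError("both copies failed verification")


data = [b"abcd", b"efgh", b"ijkl"]
copy_a, copy_b = protect_a(data), protect_b(data)
copy_a["blocks"][1] = b"eXgh"  # simulate a bug in version A corrupting its copy
recovered = read(copy_a, copy_b)
```

Because the two versions share no protection logic, a bug that corrupts copy A is unlikely to corrupt copy B in the same way at the same moment, so `read` can fall back to the copy that still verifies.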
One way IBM plans to detect and correct corrupted data is to create more-resilient storage software with repairable data structures. The code checks that certain conditions, which are described in rules, are met. For example, in a file system with multiple files, the sum of the space taken by the files plus the free space in the system must be equal to the total available space. The code will check this property automatically at various times and run a repair procedure if the property isn’t met.
In this case, the software isn’t checking the code to see that it’s functioning properly, and it isn’t checking data contents. It checks only the properties that the data structures must satisfy, and if certain properties aren’t met, it knows how to fix the data structures.
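A minimal sketch of such a rule, using the file-system example above. The `Volume` class and its fields are hypothetical, and the repair policy (trust the per-file sizes and recompute the free-space counter) is an assumption made for the example. The article does not say which side IBM's software would treat as authoritative.

```python
from dataclasses import dataclass, field


@dataclass
class Volume:
    """Toy file-system bookkeeping: total space, free space and file sizes."""

    total: int
    free: int
    files: dict = field(default_factory=dict)  # file name -> size

    def used(self) -> int:
        return sum(self.files.values())

    def invariant_holds(self) -> bool:
        # Rule: space taken by files + free space == total available space.
        return self.used() + self.free == self.total

    def repair(self) -> bool:
        """Restore the rule; return True if a repair was actually needed.

        Assumed policy: the file sizes are trusted, the free counter is not.
        """
        if self.invariant_holds():
            return False
        if self.used() > self.total:
            raise ValueError("file sizes exceed total space; cannot repair")
        self.free = self.total - self.used()
        return True


volume = Volume(total=100, free=50, files={"a.dat": 30, "b.dat": 30})
was_repaired = volume.repair()  # the free counter was stale, so it is recomputed
```

Note that the check never opens the files or inspects the code that wrote them. It looks only at the bookkeeping, which is the distinction the paragraph above draws.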
But don’t expect to see fruit from N-Version Programming or checkable data structures for another two to three years.