Single Points of Failure in Virtual Infrastructure and Means to Overcome Them

By CIOReview | Thursday, July 7, 2016

Virtual infrastructure plays a pivotal role in meeting the needs of small and medium-sized businesses that cannot afford their own physical infrastructure, such as servers, applications, and other enterprise-grade technologies. It spares organizations the burden of maintaining an actual data center, along with the large capital outlay for hardware and software licenses. For instance, according to a TechTarget report, an online financial firm was running out of space to house physical servers. The company was spending valuable time configuring systems and balancing power distribution for the additional physical systems that new workloads required. Eventually, the company adopted server virtualization to improve the efficiency of its workflows. The firm is now 75 percent virtualized and operates 200 VMs on just 10 physical servers, which in turn cuts power use by 33 percent.

Although known failures can be controlled by taking appropriate measures, other parts of the infrastructure can also be single points of failure, such as routers, data centers, proxies, power, and even the credit card used to sign up for a SaaS solution that the platform depends on. Single points of failure (SPOFs), such as failures in virtual database servers, virtual web or application nodes, and the physical machines that host the virtualized environments, can shut down the entire system.

SPOFs can be split into several categories: virtual hardware and software failures, database corruption, operator error, and failures in mass storage devices.

Hardware Problems

According to Kyas' Network Troubleshooting, around 25 percent of network errors are due to hardware problems in computer systems. For example, outages of the physical servers on which a directory server or directory proxy server runs, load balancer failures, storage subsystem failures, and power supply failures can all cause the failure of virtual infrastructure that works over the network.

Server crashes, network failures, power failures, and disk drive crashes can be categorized as virtual hardware failures. Similarly, directory server or directory proxy server crashes can be labeled as virtual software failures.

Software Problems

Software applications connect large numbers of servers across enterprise networks that are widely and geographically distributed. Networks such as LANs and WLANs, which connect systems within and between parts of an organization, provide all connectivity between diverse platforms and clients. Software failures can be caused by subtle differences in protocol implementation and handling, and by faulty device drivers and operating systems. It is difficult to forecast service demands on the network, even with careful monitoring, planning, and assessment. In addition, existing software may not work correctly in a new virtualized environment because of incompatibility issues.

Network Problems

Both software and hardware problems are directly related to issues in the network that connects the virtual infrastructure. To understand the nature of these failures clearly, it is important to study them in the context of the OSI model. The figure below shows the probability of errors across the layers of the OSI model in Local Area Networks (LANs), according to the Kyas report; these errors are responsible for the damage to the virtual infrastructure.

Failures within the physical layer are often caused by defective cables and connections, defective Network Interface Cards (NICs), failures in routers and switches, beacon failures (in Token Ring networks), packet size errors, and checksum errors.

Although Ethernet technologies have improved over time, decreasing failure rates in the lower layers of the OSI model, Application Layer malfunctions have grown as software complexity continues to explode.

For example, the failure of a single NIC is less likely to be a SPOF for the enterprise network. However, a failure in a core router without appropriate switchover and redundancy can cripple an entire network.
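The contrast between an edge component and a core router can be illustrated with a small sketch. It assumes the network is modeled as an undirected, initially connected graph and flags any node whose removal disconnects the remaining nodes. The topology, the node names, and the function names are hypothetical and serve only as an illustration.

```python
from collections import deque


def reachable(graph, start, removed):
    """Return the nodes reachable from `start` while `removed` is down."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor != removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def single_points_of_failure(graph):
    """Return the nodes whose removal disconnects the remaining nodes."""
    spofs = []
    for candidate in graph:
        others = [n for n in graph if n != candidate]
        if not others:
            continue
        # If some surviving node can no longer be reached, candidate is a SPOF.
        if len(reachable(graph, others[0], candidate)) < len(others):
            spofs.append(candidate)
    return sorted(spofs)


# Hypothetical topology: one core router, two switches, three hosts.
topology = {
    "core-router": {"switch-a", "switch-b"},
    "switch-a": {"core-router", "host-1", "host-2"},
    "switch-b": {"core-router", "host-3"},
    "host-1": {"switch-a"},
    "host-2": {"switch-a"},
    "host-3": {"switch-b"},
}

# The same topology with a second core router attached to both switches.
redundant = {name: set(links) for name, links in topology.items()}
redundant["core-router-2"] = {"switch-a", "switch-b"}
redundant["switch-a"].add("core-router-2")
redundant["switch-b"].add("core-router-2")

print(single_points_of_failure(topology))
print(single_points_of_failure(redundant))
```

In this sketch no single host is flagged, while the lone core router is. Once the second core router is added, the core router drops off the list, although each switch still isolates the hosts attached to it.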

Apart from the errors and failures of machines and components, human errors are also considered a point of failure. Known as operator error, these are subdivided into unintentional and intentional mistakes. The mistakes people make vary from company to company, depending on the degree of training and other factors such as corporate culture and procedures.

Conclusion

Classifying errors in this way is useful for analyzing possible outages in the virtual environment. However, it probably isn't realistic for most organizations to eliminate every potential single point of failure. A better strategy is to identify the single points of failure and then assess each one based on the risk that it poses. To understand how localized failures affect the virtual infrastructure, it is important to consider the scale and size of the malfunctions caused by individual network components, and to use that to frame a comprehensive analysis methodology for mitigating them.
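One simple way to assess each single point of failure by risk is to score it as likelihood multiplied by impact and rank the results. The sketch below assumes a 1 to 5 scale for both factors. The component names and scores are invented for illustration and do not come from any report cited here.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (entire system down)
    redundant: bool  # True if a standby or duplicate already exists

    @property
    def risk(self):
        # Common likelihood-times-impact score; other weightings are possible.
        return self.likelihood * self.impact


def rank_spofs(components):
    """Return non-redundant components, highest risk first."""
    spofs = [c for c in components if not c.redundant]
    return sorted(spofs, key=lambda c: c.risk, reverse=True)


# Hypothetical inventory with made-up scores.
inventory = [
    Component("core router", likelihood=2, impact=5, redundant=False),
    Component("load balancer", likelihood=3, impact=4, redundant=False),
    Component("database server", likelihood=2, impact=5, redundant=True),
    Component("power feed", likelihood=1, impact=5, redundant=False),
    Component("host NIC", likelihood=3, impact=1, redundant=False),
]

for component in rank_spofs(inventory):
    print(f"{component.name}: risk {component.risk}")
```

A ranking like this lets an organization spend its redundancy budget on the highest-risk items first, rather than trying to eliminate every single point of failure at once.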

Consider the following redundancy options to mitigate single points of failure at the hardware, software, and network levels:

To mitigate load balancer failure, include a redundant load balancer in your architecture. To guard against database corruption, have a database failover strategy to ensure availability. To overcome storage subsystem failures, use redundant server controllers, redundant cabling between controllers and storage subsystems, and redundant arrays of independent disks (RAID). For power supply failures, apart from adding redundant power supplies, use additional power providers, surge protectors, and local battery backups.
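The effect of these redundancy options can be estimated with the standard availability formulas for components in series and in parallel. The sketch below assumes that component failures are independent, which shared power or shared storage can violate in practice. The 99 percent availability figures are placeholders, not measurements.

```python
import math


def parallel(availability, copies):
    """Availability of `copies` redundant units where any one suffices."""
    return 1.0 - (1.0 - availability) ** copies


def series(*availabilities):
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


# Placeholder figures: load balancer, database, and storage at 99% each.
single = series(0.99, 0.99, 0.99)

# The same chain with each component duplicated.
duplicated = series(parallel(0.99, 2), parallel(0.99, 2), parallel(0.99, 2))

print(f"no redundancy:   {single:.4f}")      # about 0.9703
print(f"with duplicates: {duplicated:.4f}")  # about 0.9997
```

Under these assumptions, duplicating each component raises the availability of the chain from roughly 97 percent to roughly 99.97 percent, which is why redundancy is usually applied first to the components that every request must pass through.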