Introduction to Fault Tolerance and Study of Current Systems


Chapter 2

2.1.1. Introduction to Fault Tolerance

As a result of the complex nature of heterogeneous networks, fault tolerance is a major concern for network administrators, and such faults can be detected in various ways.

When a fault occurs, it is important to:

- rapidly determine exactly where the fault is,

- isolate the rest of the network from the failure so that it can continue to function without interference,

- reconfigure or modify the network to minimize the impact of operating without the failed component or components, and

- repair or replace the failed components to restore the network to its initial state.

2.1.2. Function of Fault Tolerance

The function of fault tolerance is "...to preserve the delivery of expected services despite the presence of fault-caused errors within the system itself. Errors are detected and corrected, and permanent faults are located and removed while the system continues to deliver acceptable service."

From a user's point of view, a distributed application should continue to run despite failures, and fault tolerance has therefore become a major research topic. To date, there is no single system that handles all the faults that can occur in grids: a grid is a dynamic system in which nodes can join and leave voluntarily. For a fault tolerance system to succeed, we must consider:

- How new nodes join the system,

- How computing resources are shared,

- How the resources are managed and distributed.

While considering these factors, we must also take into account the desirable properties of failure detectors and correctors, which are summarized as follows:

Completeness: Any failure (process or machine crash) is eventually detected by all normal processes.

Accuracy: The probability of false positives is low.

Consistency: All processes that are not declared as failed obtain consistent failure information including false positives.

Detection Latency: The latency between a failure and its detection is low.

Scalability: CPU and network loads are low for a large number of participating processes.

Flexibility: Various network settings are tolerated; in particular, firewalls or NATs may restrict connectivity between some pairs of live nodes. These properties must be achieved without tedious and error-prone manual configuration.

Adaptiveness: Systems should bootstrap with only a small amount of manually supplied information about network settings.

2.1.3 Fault Tree Analysis

Because of computational Grid heterogeneity, scale and complexity, faults become likely. Therefore, Grid infrastructure must have mechanisms to deal with faults while also providing efficient and reliable services to its end users.

Fault tree analysis classifies the faults that may occur in Grid computing; Figure 2.1 shows the various kinds of faults that can arise. There are six main classes of faults, as discussed below:

1) Hardware faults:

Hardware failures take place due to faulty hardware components such as CPU, memory, and storage devices.

CPU faults arise due to faulty processors, resulting in incorrect output.

Memory faults are errors that occur due to faulty memory in the RAM, ROM or cache.

Storage faults occur for instance in secondary storage devices with bad disk sectors.

Faults can also arise when components are operated beyond their specifications.

2) Application and Operating System Faults:

Application and operating system failures occur due to application or operating system specific faults.

Memory leaks are an application specific problem, where the application consumes a large amount of memory and never releases it.

Operating system faults include deadlock or inefficient or improper resource management.

Resource unavailability: sometimes an application fails to execute because a required resource is unavailable, that is, the resource it needs is in use by other applications.

3) Network Faults:

In Grid computing, resources are connected over multiple distributed networks of different types. If physical damage or operational faults occur in the network, it may exhibit significant packet loss or packet corruption; as a result, individual nodes or the whole network may go down.

Node failure: In node failure, individual nodes may go down due to operating system faults, network faults or physical damage.

Packet loss: Broken links or congested networks may result in significant packet loss.

Corrupted packet: Packets can be corrupted in transfer from one end to another.

Figure 2.1: Fault tree analysis. The fault tree decomposes a system fault into hardware faults (CPU, memory, storage), application and operating system faults (application-specific faults such as memory leaks, and OS faults), network faults (node failure, network connection faults, corrupt packets, packet loss), software faults (unexpected input, unhandled exceptions), response faults (value faults, Byzantine faults), and timeout faults (job execution and run execution timeouts).

4) Software Faults:

Several resource-intensive applications typically run on a Grid for a particular task, and while these applications are running, various software failures can take place, such as:

Unhandled exception: Software faults occur because of unhandled exceptions such as division by zero or incorrect type casting.

Unexpected input: Operators may enter incorrect or unexpected values into a system, causing software faults, for example by specifying an incorrect input file or an invalid location for an input file.

5) Response Faults:

Several kinds of response faults like the following can occur:

Value fault: If some lower-level system or application-level fault has been overlooked (due to an unknown problem), an individual processor or application may emit incorrect output.

Byzantine error: Byzantine errors take place due to failed or corrupted processors that behave arbitrarily.

6) Timeout Faults:

Timeout faults are higher-level faults that take place due to lower-level faults at the hardware, operating system, application, software or network level.

2.1.4 Fault Tolerance Mechanism

In addition to ad-hoc mechanisms based on users' complaints and log-file analysis, grid users have adopted automatic ways to deal with failures in their Grid environment. Various fault tolerance mechanisms exist for this purpose, including:

Application Dependent

Monitoring Systems

Checkpointing-recovery

Fault Tolerant Scheduling

1) Application-dependent:

Grids are increasingly used for applications requiring high levels of performance and reliability, so the ability to tolerate failures while effectively exploiting the resources in a scalable and transparent manner must be an integral part of grid computing resource management systems.

Support for the development of fault-tolerant applications has been identified as one of the major technical challenges to address for the successful deployment of computational grids. To date, there has been limited support for application-level fault tolerance in computational grids. Support has consisted mainly of failure detection services or fault-tolerance capabilities in specialized grid toolkits. Neither solution is satisfactory in the long run. The former places the burden of incorporating fault-tolerance techniques into the hands of application programmers, while the latter only works for specialized applications. Even in cases where fault-tolerance techniques have been integrated into programming tools, these solutions have generally been point solutions, i.e., tool developers have started from scratch in implementing their solution and have not shared, nor reused, any fault-tolerance code. A better way is to use the compositional approach, in which fault-tolerance experts write algorithms and encapsulate them into reusable code artifacts, or modules.

2) Monitoring Systems:

In this approach a fault monitoring unit is attached to the grid. The basic technique that most monitoring units follow is heartbeating, which is further classified into three types (a rough comparison of their message loads follows the list):

Centralized Heartbeating - Sending heartbeats to a central member creates a hot spot, an instance of high asymptotic complexity.

Ring-based heartbeating - sending heartbeats along a virtual ring suffers from unpredictable failure detection times when there are multiple failures, an instance of the perturbation effect.

All-to-all heartbeating - sending heartbeats to all members causes the message load in the network to grow quadratically with group size, again an instance of high asymptotic complexity.
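As a rough illustration of these asymptotic differences, the following sketch (an illustrative assumption, not a measurement of any particular middleware) counts the heartbeat messages sent per period for a group of n members under each topology:

```python
# Back-of-the-envelope comparison of heartbeat message load per period.
def messages_per_period(n: int) -> dict:
    return {
        "centralized": n - 1,       # every member reports to one collector (hot spot)
        "ring": n,                  # each member pings its successor on a virtual ring
        "all_to_all": n * (n - 1),  # every member heartbeats every other member
    }

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, messages_per_period(n))
```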

3) Checkpointing-recovery:

Checkpointing and rollback recovery provide an effective technique for tolerating transient resource failures and avoiding total loss of results. Checkpointing involves saving enough state information of an executing program on stable storage so that, if required, the program can be re-executed starting from the state recorded in the checkpoint (a minimal checkpoint/restart sketch is given after the list below). Checkpointing distributed applications is more complicated than checkpointing non-distributed ones: when an application is distributed, the checkpointing algorithm has to capture not only the state of all individual processes but also the state of all communication channels. Checkpointing is basically divided into two types:

Uncoordinated Checkpoint: In this approach, each of the processes that are part of the system determines their local checkpoints individually. During restart, these checkpoints have to be searched in order to construct a consistent global checkpoint.

Coordinated Checkpoint: In this approach, the Checkpointing is orchestrated such that the set of individual checkpoints always results in a consistent global checkpoint. This minimizes the storage overhead, since only a single global checkpoint needs to be maintained on stable storage. Algorithms used in this approach are blocking and non- blocking.
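The following minimal, single-process sketch illustrates the basic checkpoint/restart idea described above. The file name state.ckpt and the loop being checkpointed are illustrative assumptions only; a real grid checkpointer would also have to capture communication-channel state.

```python
# Minimal checkpoint/restart sketch: periodically save state to stable
# storage and resume from the last committed checkpoint after a failure.
import os
import pickle

CKPT = "state.ckpt"   # assumed location of the checkpoint on stable storage

def run(limit=1_000_000, every=100_000):
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"i": 0, "total": 0}

    while state["i"] < limit:
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % every == 0:            # periodic checkpoint
            with open(CKPT + ".tmp", "wb") as f:
                pickle.dump(state, f)
            os.replace(CKPT + ".tmp", CKPT)    # atomic commit of the checkpoint
    # Work done after the last checkpoint is re-computed on restart.
    return state["total"]

if __name__ == "__main__":
    print(run())
```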

4) Fault Tolerant Scheduling:

With the momentum gaining behind grid computing systems, deploying support for integrated scheduling and fault-tolerant approaches becomes of paramount importance. Most fault-tolerant scheduling algorithms therefore couple scheduling policies with job replication schemes so that jobs are executed efficiently and reliably. Scheduling policies are further classified on the basis of time sharing and space sharing.
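As a sketch of coupling scheduling with job replication (the resource list, the simulated failure rate and the function names are illustrative assumptions, not a real scheduler API), the same job is dispatched to two resources and the first replica to finish successfully supplies the result:

```python
# Replicated job submission: the first successful replica wins.
from concurrent.futures import ThreadPoolExecutor, as_completed
import random

random.seed(1)                                # fixed seed for a reproducible demo
RESOURCES = ["node-a", "node-b"]              # assumed grid resources

def run_on(resource, job):
    if random.random() < 0.3:                 # simulated resource failure
        raise RuntimeError(f"{resource} crashed")
    return f"{job} finished on {resource}"

def submit_with_replication(job):
    with ThreadPoolExecutor(max_workers=len(RESOURCES)) as pool:
        futures = [pool.submit(run_on, r, job) for r in RESOURCES]
        errors = []
        for fut in as_completed(futures):
            try:
                return fut.result()           # first replica to succeed wins
            except RuntimeError as exc:
                errors.append(exc)
        raise RuntimeError(f"all replicas failed: {errors}")

if __name__ == "__main__":
    print(submit_with_replication("job-42"))
```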

Application-dependent mechanisms use the reuse concept to integrate fault tolerance within the application code. Most present middleware uses a heartbeat-based monitoring system to monitor the status of the grid. Checkpointing-recovery and fault-tolerant scheduling, for their part, can only deal with crash-failure semantics for hardware and software components; software faults with more malign failure semantics, such as timing or omission faults, which are even more difficult to deal with, are not covered by them.

2.1.4.1 What is Checkpointing?

A checkpoint is a local state of a process saved on stable storage. A local checkpoint is an event that records the state of a process on a processor at a given instant. A checkpoint may be local or global, depending on how the checkpoints are taken.

2.1.4.2. Need of Checkpointing

To recover from failures.

Checkpointing is also used in debugging distributed programs and in migrating processes in multiprocessor systems. In debugging distributed programs, the state changes of a process during execution are monitored at various time instants; checkpoints assist in such monitoring.

To balance the load of processors in the distributed system, processes are moved from heavily loaded processors to lightly loaded ones. Checkpointing a process periodically provides the information necessary to move it from one processor to another.

With checkpointing, an arbitrary temporal section of a program's runtime can be extracted for exhaustive analysis without the need to restart the program from the beginning.

2.1.4.3. Checkpointing Definitions

Local checkpoints:

A process may take a local checkpoint any time during the execution. The local checkpoints of different processes are not coordinated to form a global consistent checkpoint.

Global checkpoints:

A global checkpoint is a collection of local checkpoints, one from each processor.

2.1.4.4 Types of checkpointing

1. Uncoordinated Checkpoints

2. Coordinated Checkpoints

3. Communication Induced

4. Diskless Checkpoint

5. Double Checkpoint

1) Uncoordinated Checkpointing

In this scheme, each process independently saves its checkpoints, from which a consistent state can be determined and execution resumed. However, it is susceptible to rollback propagation: the domino effect can cause the system to roll back to the beginning of the computation. Rollback propagation also makes it necessary for each processor to store multiple checkpoints, potentially leading to a large storage overhead.

2) Coordinated Checkpointing

It requires processes to coordinate their checkpoints so that they form a consistent global state. This minimizes the storage overhead, since only a single global checkpoint needs to be maintained on stable storage. The algorithms used in this approach are blocking (used to take system-level checkpoints) and non-blocking (using application-level checkpointing). It does not suffer from rollback propagation.
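As a sketch of the blocking, coordinated style (the worker function, the barrier and the in-memory "stable storage" are illustrative assumptions only), all workers quiesce at a barrier, save their local states, and the resulting set forms one consistent global checkpoint:

```python
# Blocking coordinated checkpoint: quiesce, save local states, commit as one
# consistent global checkpoint.
import threading

N = 3
barrier = threading.Barrier(N)
global_checkpoint = {}
lock = threading.Lock()

def worker(rank):
    local_state = {"rank": rank, "step": 10 * rank}
    barrier.wait()                      # quiesce: no messages cross the checkpoint line
    with lock:
        global_checkpoint[rank] = dict(local_state)   # save the local checkpoint
    barrier.wait()                      # all saved: the global checkpoint is consistent

threads = [threading.Thread(target=worker, args=(r,)) for r in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_checkpoint)
```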

3) Communication-induced Checkpointing

In communication-induced checkpointing, processes in a distributed environment take independent checkpoints, and the domino effect is prevented by forcing the processes to take additional checkpoints based on protocol-related information piggybacked on the application messages received from other processes.
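A minimal sketch of this idea, assuming an index-based scheme in which every message piggybacks the sender's checkpoint sequence number (the class and method names are illustrative, not taken from any specific protocol):

```python
# Communication-induced checkpointing sketch: forced checkpoints before
# delivering messages whose piggybacked index is ahead of the receiver's.
class Process:
    def __init__(self, name):
        self.name = name
        self.ckpt_index = 0          # sequence number of the latest local checkpoint
        self.delivered = []

    def take_checkpoint(self):
        """Voluntary (basic) checkpoint, taken independently."""
        self.ckpt_index += 1
        print(f"{self.name}: basic checkpoint #{self.ckpt_index}")

    def send(self, msg):
        return (msg, self.ckpt_index)            # piggyback protocol information

    def receive(self, packet):
        msg, sender_index = packet
        if sender_index > self.ckpt_index:
            # Forced checkpoint before delivery: catch up to the sender's index,
            # which prevents the domino effect.
            self.ckpt_index = sender_index
            print(f"{self.name}: forced checkpoint #{self.ckpt_index}")
        self.delivered.append(msg)

p, q = Process("P"), Process("Q")
p.take_checkpoint()          # P checkpoints on its own
q.receive(p.send("m1"))      # Q is forced to checkpoint before delivering m1
q.receive(p.send("m2"))      # indices now match, so no forced checkpoint is needed
```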

4) Diskless Checkpointing

Diskless checkpointing is a technique for distributed systems that relies on memory and processor redundancy rather than stable storage. It requires extra processors for storing parity as well as standby processors. Its process-migration feature can save a process image, so a process can be resumed on a new node without having to kill the entire application and start it over again; the checkpoint is held in memory rather than on disk. In order to restore a process image after a failure, a new processor has to be available to replace the crashed one, which requires a pool of standby processors in case of multiple unexpected failures.
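A minimal sketch of the parity idea behind diskless checkpointing, assuming equal-sized in-memory checkpoints and a single parity processor (all names and values are illustrative):

```python
# Diskless checkpointing sketch: a lost in-memory checkpoint is rebuilt by
# XOR-ing the parity block with the surviving checkpoints, with no disk I/O.
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

checkpoints = {                       # equal-sized in-memory checkpoints
    "p0": b"state-00",
    "p1": b"state-11",
    "p2": b"state-22",
}
parity = xor_blocks(list(checkpoints.values()))   # held by the parity processor

# Processor p1 crashes; its checkpoint is reconstructed from the survivors.
survivors = [v for k, v in checkpoints.items() if k != "p1"]
recovered = xor_blocks(survivors + [parity])
assert recovered == checkpoints["p1"]
```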

5) Double Checkpoint

Double checkpointing targets applications with a relatively small memory footprint running on a very large number of processors, and handles one fault at a time. Each checkpoint is stored in two different locations so that one copy remains available if the other is lost: two buddy processors hold identical checkpoints, stored either in the memory or on the local disks of the two processors (the double in-memory and double in-disk checkpointing schemes, respectively). Storing checkpoints in this distributed fashion avoids a network bottleneck at a central server.
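A minimal sketch of the buddy idea, assuming a simple ring of buddy pairs and per-node in-memory stores (all names are illustrative):

```python
# Double in-memory checkpointing sketch: each node keeps its own checkpoint
# plus a copy on a buddy node, so one copy survives a single failure.
NODES = ["n0", "n1", "n2", "n3"]
stores = {n: {} for n in NODES}                 # per-node in-memory checkpoint store

def buddy(node):
    i = NODES.index(node)
    return NODES[(i + 1) % len(NODES)]          # simple ring of buddy pairs

def checkpoint(node, state):
    stores[node][node] = state                  # local copy
    stores[buddy(node)][node] = state           # buddy copy

def recover(failed):
    return stores[buddy(failed)][failed]        # restore from the buddy's copy

checkpoint("n1", {"step": 42})
del stores["n1"]                                # n1 crashes and loses its memory
print(recover("n1"))                            # -> {'step': 42}
```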

Table 2.1 Comparison of different Checkpoint Schemes

| Checkpointing method | Uncoordinated checkpointing | Coordinated checkpointing | Communication-induced checkpointing | Diskless checkpointing | Double checkpointing |
|---|---|---|---|---|---|
| Efficiency | High for small processes | Low | Low | High | High |
| Performance | Low | Low | Low | Higher for distributed applications | Faster |
| Portability | High | High | High | Low | Low |
| Cost | High | Low; negligible for low-memory-usage applications | High | High | Very high |
| Scalability | No | Minimal | Does not scale to large numbers of processors | Difficult to scale to large numbers of processors | Highly scalable |
| Flexibility | Low | All processes save their states at the same time | A process can be moved from one node to another by writing the process image directly to a remote node | Replaces stable storage with memory and processor redundancy | Handles one fault at a time; one checkpoint remains available if the other is lost |
| Overhead | Large storage and very high log-management overhead; works only for small processes | Minimal storage overhead and negligible overhead in failure-free executions | High latency; memory and disk overhead | High memory overhead for storing checkpoints | Low memory overhead |
| Advantages | Most convenient; processes save their checkpoints individually | Does not suffer from rollback propagation; processes save their states together | Prevents the domino effect by piggybacking information on the regular messages exchanged by processes | Improves performance of distributed/parallel applications; process migration saves the process image | Suits small memory footprints on large numbers of processors, e.g. scientific applications |
| Recovery | The checkpoint of the faulty process is restored | Processes stop regular message activity to take their checkpoints and restore the last set of checkpoints in a coordinated way | A large number of forced checkpoints may be needed, nullifying the benefit of autonomous local checkpoints; a new process restores the process image after failure | Uses parity/backup and extra processors to store parity and to replace failed application processors | Automatic restart and synchronization by two identical checkpoint buddy processors |
| Disadvantages | Often unsuitable; domino effect; wasted memory; unbounded and complex garbage collection | Requires a consistent checkpoint; large latency for saving checkpoints to storage | Deteriorated parallel performance; requires standby processors | Communication bottleneck | Depends on central reliable storage; requires additional hardware |

2.1.4.5 Phases of Checkpointing

Checkpointing has two phases:

Saving a checkpoint and

Checkpoint recovery following the failure.

To save a checkpoint, the memory and system state necessary to recover from a failure are sent to storage. Checkpoint recovery involves restoring the system state and memory from the checkpoint and restarting the computation from the point at which the checkpoint was stored. The time lost in saving checkpoints, in restoring a checkpoint after a failure, and in re-computing the work performed after the checkpoint but before the failure is called overhead time. Although this loss contributes to computer unavailability, it is still better than restarting the whole job after a failure occurs. This applies both to system and application checkpoints. The basic actions are given below:

Checkpoint saving steps:

First, the checkpoint is decided on the basis of the application code, the scheduler and site policies.

Next, all processors must reach a safe point, which includes appropriate handling of outstanding reads and writes at that time; detected errors should be handled before initiating a checkpoint.

Then a copy of the memory needed for the system checkpoint is written to storage.

The checkpoint is then committed, meaning that there were no errors during the data copy, so the checkpoint can safely be used for recovery.

Finally, computation resumes from the processors' safe points.

Checkpoint recovery steps:

First, the need to recover from a failure is determined, and the computation is rerun from a certain point, for which the most recent checkpoint is selected.

Next, the processors are halted and reset to a known state.

Then the system configuration is updated to determine the resources to use for failure recovery, e.g. a spare processor to use in place of a failed processor.

Next, all memory for the system checkpoint is copied back from storage and the checkpoint recovery is committed, which requires being certain that there were no errors during the data copy.

Finally, computation resumes from the recovered system state.

Each action may involve multiple messages, responses and context switches. The decision to commit a checkpoint or a checkpoint recovery requires successful acknowledgment from all the system resources involved. A minimal sketch of the two procedures is given below.
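The sketch condenses the saving and recovery steps above into a toy, self-contained System class; the class, its fields and the single-dictionary "memory" are illustrative assumptions, not an actual checkpointing API:

```python
# Toy model of the checkpoint-save and checkpoint-recovery sequences.
class System:
    def __init__(self):
        self.memory = {"step": 0}
        self.stable_storage = None        # committed checkpoint lives here
        self.running = True

    def save_checkpoint(self):
        self.running = False              # all processors reach a safe point
        image = dict(self.memory)         # copy memory/system state
        self.stable_storage = image       # commit: the copy completed without errors
        self.running = True               # resume computing from the safe point

    def recover(self):
        self.running = False              # halt processors, reset to a known state
        # (reconfiguration, e.g. substituting a spare processor, is omitted here)
        self.memory = dict(self.stable_storage)   # copy the checkpoint back and commit
        self.running = True               # resume from the recovered state

node = System()
node.memory["step"] = 7
node.save_checkpoint()
node.memory["step"] = 99                  # work after the checkpoint is lost on failure
node.recover()
print(node.memory)                        # -> {'step': 7}
```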

2.1.4.6 Fault Tolerance in different Grid middleware

Fault Tolerance in Globus

Globus provides a heartbeat service to monitor running processes to detect faults. The application is notified of the failure and expected to take appropriate recovery action.

Many tools are available to handle faults in Globus. The basic entities used by each system are a local monitor and a data collector (a minimal sketch of these two entities follows the list).

1. A local monitor is responsible for observing the state of both the computer on which it is located and any monitored processes on that computer. It generates periodic "i-am-alive" messages or heartbeats, summarizing this status information.

2. A data collector receives heartbeat messages generated by local monitors and identifies failed components based on missing heartbeats.
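A minimal sketch of these two entities, assuming a one-second heartbeat period and a deadline of three missed periods (the names and the deadline policy are illustrative assumptions, not the Globus HBM API):

```python
# Local monitor / data collector sketch: components whose "i-am-alive"
# messages stop arriving are flagged as failed.
import time

PERIOD = 1.0                                   # assumed heartbeat interval in seconds

class DataCollector:
    def __init__(self, deadline=3 * PERIOD):
        self.deadline = deadline
        self.last_heartbeat = {}

    def on_heartbeat(self, component):
        self.last_heartbeat[component] = time.monotonic()

    def failed_components(self):
        now = time.monotonic()
        return [c for c, t in self.last_heartbeat.items()
                if now - t > self.deadline]    # missing heartbeats => failed

class LocalMonitor:
    def __init__(self, host, collector):
        self.host, self.collector = host, collector

    def tick(self):
        # Summarize host/process status and send it as a heartbeat.
        self.collector.on_heartbeat(self.host)

collector = DataCollector()
monitor = LocalMonitor("compute-node-7", collector)
monitor.tick()                                 # one "i-am-alive" message
print(collector.failed_components())           # -> [] while heartbeats keep arriving
```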

The Globus Heartbeat Monitor (HBM) provides a fault detection service for applications developed with the Globus toolkit. HBM comprises three components:

1. A local monitor, responsible for monitoring the computer on which it runs, as well as selected processes on that computer.

2. A client registration API, which an application uses to specify the processes to be monitored by the local monitor, and to whom heartbeats are sent.

3. A data collector API, which enables an application to be notified of relevant events concerning monitored processes.

Fault Tolerance in Legion

Legion provides mechanisms to support fault tolerance such as Checkpointing, but the policy is left to the application. A component-based reflexive architecture allows fault tolerance methods to be encapsulated in reusable components that user applications may choose from.

Two properties characterize reflective systems: introspection and causal connection. Introspection allows a computational process to have access to its own internal structures. Causal connection enables the process to modify its behavior directly by modifying its internal data structures—there is a cause-and-effect relationship between changing the values of the data structures and the behavior of the process.

The implementation of the reflexive architecture relies on common basic primitives such as:

1. Intercepting the message stream, piggybacking information on the message stream, acting upon the information contained in the message stream, saving and restoring state, and detecting failures.

2. Exchanging protocol information between participants of an algorithm.

Fault Tolerance in Sun N1 Grid Engine

N1GE6 has several fault tolerance features, the main one being checkpointing. Checkpointing in N1GE6 can be classified into two different classes:

Kernel-level

o Such tools are built into the kernel of the operating system. During a checkpoint, the entire process space (which tends to be huge) is written to physical storage.

o The user does not need to recompile/re-link their applications. Checkpointing and restarting of application is usually done through OS commands.

o A checkpointed application usually cannot be restarted on a different host.

User-level

o These "tools" are built into the application which will periodically write their status information into physical storage.

o Checkpointing of such applications is usually done by sending a specific signal to the application.

o Restarting of such applications is usually done by calling the application with additional parameters pointing to the location of restart files.

N1GE6 has built-in support for the integration of 3rd party checkpointing tools. Certain checkpointing tools (mostly user-level) allow the restart of applications on different hosts. These tools, coupled with the migration support on the N1GE6 and with proper configuration of the queue threshold levels, allow the administrator to finely load balance the N1GE6 cluster. One such tool is Berkeley Lab Checkpoint/Restart (BLCR). BLCR is a kernel module that allows you to save a process to a file and restore the process from the file. This file is called a context file. A context file is similar to a core file, but a context file holds enough information to continue running the process. A context file can be created at any point in a process's execution. The process may be resumed from that point at a later time, or even on a different workstation.

Fault Tolerance in Alchemi

Alchemi is a .NET-based grid computing framework that provides the runtime machinery and programming environment required to construct desktop grids and develop grid applications. To date, little work has been done on fault tolerance in Alchemi. The basic technique Alchemi uses for fault tolerance is heartbeating: the Executors send heartbeat signals to the Manager at regular intervals, and on receiving a signal the Manager assumes that the corresponding Executor node is still working. The fault tolerance procedure is therefore not well developed. In later chapters we examine Alchemi's fault tolerance procedure in more detail, identify the challenges in its fault tolerance system, and discuss how they can be removed.

2.1.4.7 Desirable Features of Checkpoint Algorithm

The time taken by the checkpoint algorithm during a failure-free run should be minimal.

Recovery should be fast in the event of a failure. Availability of a consistent global state in stable storage expedites recovery.

Rollback propagation should be eliminated completely.

Selective rollback should be possible.

Resource requirements (memory and processor) for checkpointing should be minimal.

Problems and Weaknesses of the Chandy-Lamport Algorithm

In the Chandy-Lamport algorithm, each node saves its context in stable storage, so the system incurs a context-saving overhead that the basic algorithm does not reduce. In the current system a number of checkpoints are taken during the computation, which requires more space and bandwidth to store them. The current system does not handle lost messages or garbage collection in the distributed system, and the algorithm is weak in terms of task completion time in both fault-free and faulty situations.

2.3 Features of the New System

A new checkpointing algorithm has been proposed that has a minimum checkpoint count equivalent to that of the periodic checkpointing algorithm and a relatively short rollback distance in faulty situations. The proposed algorithm is better than Chandy and Lamport's checkpointing algorithm in terms of task completion time in both fault-free and faulty situations.


