Failures in a Distributed System Essay
A distributed system is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, where all components work together to perform a single set of related tasks.
A distributed system can be much larger and more powerful, given the combined capabilities of its distributed components, than a combination of stand-alone systems. But this is not easy: for a distributed system to be useful, it must be reliable. This is a difficult goal to achieve because of the complexity of the interactions between simultaneously running components.
A distributed system must have the following characteristics:
* Fault-Tolerant: It can recover from component failures without performing incorrect actions.
* Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
* Recoverable: Failed components can restart themselves and rejoin the system after the cause of failure has been repaired.
* Consistent: The system can coordinate actions by multiple components, often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
* Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a “non-scalable” system. Similarly, we might increase the number of users or servers, or the overall load on the system. In a scalable system, this should not have a significant effect.
* Predictable Performance: The system can provide the desired responsiveness in a timely manner.
* Secure: The system authenticates access to data and services.
These are high standards, which are challenging to achieve. Probably the most difficult challenge is that a distributed system must be able to continue operating correctly even when components fail.
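As a rough illustration of the fault-tolerance and availability goals listed above, the following Python sketch tries a list of replicas in order and returns the first successful response. The names call_with_failover, broken_replica, and healthy_replica are hypothetical and used only for illustration; a real system would also need timeouts, retries, and consistency checks that are left out here.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # A failed component is skipped instead of stopping the whole service.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed for {request!r}")


# Hypothetical stand-ins for real servers, used only to exercise the sketch.
def broken_replica(request: str) -> str:
    raise ConnectionError("replica is down")


def healthy_replica(request: str) -> str:
    return f"ok: {request}"


if __name__ == "__main__":
    print(call_with_failover([broken_replica, healthy_replica], "read x"))
```

The service stays available as long as at least one replica responds. Only when every replica fails does the caller see an error.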
Four types of failures that can occur in a distributed system are:
* Halting failures: A component simply stops. There is no way to detect the failure except by timeout: the component either stops sending “I’m alive” (heartbeat) messages or fails to respond to requests. Your computer freezing is a halting failure.
* Omission failures: A message fails to be sent or received, primarily due to a lack of buffering space, which causes the message to be discarded with no notification to either the sender or the receiver. This can happen when routers become overloaded.
* Network failures: A network link breaks.
* Timing failures: A temporal property of the system is violated. For example, clocks on different computers that are used to coordinate processes are not synchronized, or a message is delayed longer than a threshold period. (www.code.google.com)
Timing failures: There is suspicion of a possible failure related to the motherboard. This can be the result of a specific message strongly implicating the motherboard in some sort of erratic system behavior. It may also be the case that the motherboard probably is not the problem, but that we want to rule it out as a possible cause. Since the motherboard is where all the other components meet and connect, a bad motherboard can affect virtually any other part of the PC. For this reason the motherboard must often be checked to ensure it is working properly, even if it is unlikely to be the cause of whatever is happening. Outright motherboard failure is fairly rare in a new system and extremely rare in a system that is already up and running. Usually, the problem is that the motherboard has been misconfigured or that one or more of the components connected to it has failed.
Receiving a system in the mail that has a loose component or a disconnected cable is very common. In fact, there is a surprisingly large number of possible causes for what may appear to be a motherboard failure.
Omission failures: Since an omission failure causes a message to be discarded, I would check each of the following possible causes in turn to find the specific problem:
* Router overload
* Transmitter malfunction
* Buffer overflow
* Receiver out of range
Troubleshooting each of these areas eliminates the causes one by one until the problem is found. Checking the router is probably the most important step in resolving this problem. The router should be unplugged and then plugged back in to reboot it. Moving the router to a different area within the room or building may also help by providing better reception.
The best way to design a distributed system is to design it for failure. This leads to a careful design that makes no assumptions about the reliability of the components of the system. A good way to begin the design of a distributed system is to focus on a client-server model that uses mostly standard protocols.
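Designing for failure includes not assuming that a peer is still alive. Since a halting failure can only be detected by timeout, one minimal sketch of a heartbeat-based detector in Python is shown below. The class name HeartbeatMonitor, the component names, and the 5-second timeout are assumptions made for illustration, not part of any particular system.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Suspects a component has halted when its heartbeats stop arriving."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the sketch can be tested without sleeping
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, component: str) -> None:
        """Call whenever an "I'm alive" message arrives from a component."""
        self.last_seen[component] = self.clock()

    def suspected_failures(self) -> List[str]:
        """Return components whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


if __name__ == "__main__":
    fake_now = [0.0]  # hypothetical manual clock for the demonstration
    monitor = HeartbeatMonitor(timeout_seconds=5.0, clock=lambda: fake_now[0])
    monitor.record_heartbeat("server-a")
    monitor.record_heartbeat("server-b")
    fake_now[0] = 4.0
    monitor.record_heartbeat("server-a")  # server-a is still alive
    fake_now[0] = 8.0
    print(monitor.suspected_failures())  # server-b has been silent for 8 seconds
```

A timeout only produces a suspicion. A component whose messages are merely delayed, as in a timing failure, looks the same to the monitor as one that has halted, so the timeout value is a trade-off between fast detection and false alarms.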
References
Introduction to Distributed Systems Design. Retrieved from http://www.code.google.com/edu/parallel/dsd-tutorial.html
Concurrent Reading. Retrieved from http://www. s.