In distributed computing environments, checkpointing is a technique for tolerating failures that would otherwise force a long-running application to restart from the beginning.
The most basic way to implement checkpointing is to stop the application, copy all the required data from memory to reliable storage (e.g., a parallel file system), and then continue execution.
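The stop, copy, continue cycle described above can be illustrated with a minimal single-process sketch. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `fail_at` parameter used to simulate a crash, and the use of a local file in place of a parallel file system are all assumptions made for illustration. Writing to a temporary file and renaming it is one common way to avoid leaving a half-written checkpoint behind.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Copy the state to storage atomically: temp file, fsync, rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic, so a crash never exposes a partial checkpoint.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, resuming from the last checkpoint if present."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")  # hypothetical crash
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            # Computation is paused here while the state is written out.
            save_checkpoint(state, path)
    return state["total"]
```

If a run that checkpoints every 5 steps fails at step 7, a second call to `run` with the same path resumes from step 5 rather than from step 0, repeating only the two steps performed after the last checkpoint.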
It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency.
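One way to see why uncoordinated checkpoints can be inconsistent is to look for messages that a set of checkpoints records as received but not as sent. The sketch below is illustrative only: the `Message` type, the function names, and the use of a single global time axis to describe events are assumptions introduced here, not part of any particular protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    send_time: float
    receiver: str
    recv_time: float


def orphan_messages(checkpoint_time, messages):
    """Messages whose receipt is in the receiver's checkpoint but whose
    send happened after the sender's checkpoint."""
    return [
        m for m in messages
        if m.send_time > checkpoint_time[m.sender]
        and m.recv_time <= checkpoint_time[m.receiver]
    ]


def is_consistent(checkpoint_time, messages):
    """A set of checkpoints is consistent here if it has no orphan messages."""
    return not orphan_messages(checkpoint_time, messages)


# P sends to Q at t=10.5; Q receives at t=11.0.
msgs = [Message("P", 10.5, "Q", 11.0)]

# Both processes checkpoint at t=10: the message is entirely after the cut.
aligned = {"P": 10.0, "Q": 10.0}

# Q's timer fires later (t=12): Q's state reflects a message that P's
# checkpoint says was never sent.
skewed = {"P": 10.0, "Q": 12.0}
```

With the `aligned` checkpoints `is_consistent` returns `True`, while with the `skewed` ones it returns `False`: restarting from them would leave Q having received a message that P, after rollback, has not yet sent.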
This functionality became extremely important for usability in applications in which a particular task could not be completed in one sitting (such as playing a video game expected to take dozens of hours) or in which the work was done over a long period of time (such as entering rows of data into a spreadsheet).
[6] FTI leverages local storage plus multiple replication and erasure techniques to provide several levels of reliability and performance.
FTI provides application-level checkpointing that allows users to select which data needs to be protected, in order to improve efficiency and avoid space, time and energy waste.
If desired, users can dedicate one process per node to overlap fault tolerance workload and scientific computation, so that post-checkpoint tasks are executed asynchronously.
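The idea of letting the application choose which data is protected can be sketched as a small registry of named objects. The class below is a hypothetical illustration and is not FTI's API: the name `SelectiveCheckpointer`, its methods, and the restriction to Python lists and dicts are all assumptions made to keep the example short.

```python
import pickle


class SelectiveCheckpointer:
    """Hypothetical helper: checkpoints only explicitly registered objects."""

    def __init__(self):
        self._protected = {}

    def protect(self, name, obj):
        """Register a mutable list or dict that must survive a failure."""
        self._protected[name] = obj

    def checkpoint(self):
        """Serialize only the protected objects and return the bytes."""
        return pickle.dumps(self._protected)

    def recover(self, blob):
        """Restore protected objects in place from a previous checkpoint."""
        saved = pickle.loads(blob)
        for name, obj in self._protected.items():
            if isinstance(obj, dict):
                obj.clear()
                obj.update(saved[name])
            else:
                obj[:] = saved[name]


solution = [1.0, 2.0, 3.0]      # needed to resume the computation
scratch = [0] * 10_000          # can be recomputed, so it is left unprotected

ckpt = SelectiveCheckpointer()
ckpt.protect("solution", solution)
blob = ckpt.checkpoint()

solution[0] = -999.0            # state is damaged after the checkpoint
ckpt.recover(blob)              # solution is restored; scratch is untouched
```

Because `scratch` is never registered, it adds nothing to the checkpoint size, which is the saving in space and time that selective protection is meant to provide.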
The Future Technologies Group at Lawrence Berkeley National Laboratory is developing a hybrid kernel/user implementation of checkpoint/restart called BLCR.
Mementos was designed for batteryless embedded devices, such as RFID tags and smart cards, which rely on harvesting energy from ambient background sources.
Mementos frequently senses the available energy in the system and decides whether to checkpoint the program because of impending power loss or to continue computation.
When the energy becomes sufficient for reboot, the data is retrieved from non-volatile memory and the program continues from the stored state.
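The sense, decide, checkpoint, and resume cycle can be simulated in a few lines. This is a simplified sketch and not Mementos itself: the voltage trace, the `threshold` and `brownout` parameters, the dictionary standing in for non-volatile memory, and the assumption that the device reboots at the next sample are all hypothetical. The trace must contain enough samples for the computation to finish.

```python
def run_intermittent(n_items, voltages, threshold, brownout):
    """Sum 0..n_items-1 across power failures.

    Returns (total, reboots). `voltages` is a sequence of simulated energy
    readings, one consumed per loop iteration.
    """
    nonvolatile = {}   # stands in for flash; survives power loss
    reboots = 0
    trace = iter(voltages)
    while True:
        # (Re)boot: restore the stored state if a checkpoint exists.
        i, total = nonvolatile.get("ckpt", (0, 0))
        powered = True
        while powered and i < n_items:
            v = next(trace)            # sense the available energy
            if v < brownout:
                powered = False        # power lost; volatile i and total vanish
                reboots += 1
            else:
                if v < threshold:
                    # Energy is running low: save state before doing more work.
                    nonvolatile["ckpt"] = (i, total)
                total += i
                i += 1
        if powered:
            return total, reboots


trace = [3.0, 3.0, 2.1, 2.0, 1.5, 3.0, 3.0, 3.0, 3.0]
result = run_intermittent(6, trace, threshold=2.2, brownout=1.8)
```

In this trace the supply dips below the threshold twice, triggering two checkpoints, and then drops below the brownout level. After the reboot the loop resumes from the last checkpoint instead of from the first item, and the final sum matches an uninterrupted run.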
It uses a dynamic programming approach to locate low-overhead points in the state machine of the design.