Methods and approaches to improving the reliability of the parallel dataflow computing system

Zmejev, D.N.; Levchenko, N.N.; Okunev, A.S.

Authors

Zmejev D.N.

Levchenko N.N.

Okunev A.S.

Date of publication

2020

DOI

10.31114/2078-7707-2020-2-87-94

Abstract

Creating hardware and software tools that restore the operation of dataflow computing systems after a fault or failure has its own specifics, owing to the complex structure of parallel computing processes and the difficulty of localizing the source of an error.
The article describes methods and approaches that increase the reliability of the parallel dataflow computing system. They address overflow of the content-addressable memory of keys in the matching processor, local dynamic redistribution of computations between execution units, the operation of recovery tools after an execution unit fault or failure, hardware support for creating local checkpoints, and the implementation of a global distributed associative computing environment.
Overflow of the content-addressable memory of keys in the matching processor critically affects the reliability of the computing system. The hardware and software methods for preventing such overflow rely on spooling (swapping) tokens dynamically and on dividing the task into stages.
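The spooling idea can be illustrated with a minimal sketch. It is not the mechanism from the paper: the class `MatchingStore`, its fields, and the two-operand matching rule are hypothetical, and the store is modeled as a bounded Python dictionary rather than a content-addressable memory. A token that finds its partner fires; a token that finds no partner and no free slot is spooled to a queue and re-injected once a slot is freed.

```python
from collections import deque


class MatchingStore:
    """Toy model of a bounded key store with a spool queue (hypothetical names)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.waiting = {}        # key -> operand waiting for its partner
        self.spooled = deque()   # tokens set aside because the store was full
        self.fired = []          # (key, first, second) pairs ready to execute

    def _place(self, key, value):
        """Match or store a token; return False if the store has no room."""
        if key in self.waiting:
            # Partner found: the pair leaves the store and frees a slot.
            self.fired.append((key, self.waiting.pop(key), value))
            return True
        if len(self.waiting) < self.capacity:
            self.waiting[key] = value
            return True
        return False

    def accept(self, key, value):
        """Take a new token, spooling it instead of dropping it on overflow."""
        if self._place(key, value):
            self._drain()
        else:
            self.spooled.append((key, value))

    def _drain(self):
        """Re-inject spooled tokens for as long as any of them can be placed."""
        progress = True
        while progress and self.spooled:
            progress = False
            for _ in range(len(self.spooled)):
                key, value = self.spooled.popleft()
                if self._place(key, value):
                    progress = True
                else:
                    self.spooled.append((key, value))
```

With a capacity of 2, a third unmatched key is spooled rather than lost, and it enters the store as soon as an earlier pair fires.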
Local dynamic redistribution of computations between execution units allows the computational module to continue functioning even when individual communication channels fail. Options for recovering from a fault or failure of an execution unit are also described.
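A rough sketch of redistribution after an execution unit is lost is given below. It is an illustration under stated assumptions, not the scheme from the paper: the `Dispatcher` class, the least-loaded placement rule, and the per-unit task queues are all hypothetical, and a failed communication channel is modeled simply as the unit behind it becoming unavailable.

```python
class Dispatcher:
    """Toy dispatcher that reassigns work when an execution unit fails."""

    def __init__(self, unit_count):
        self.queues = {unit: [] for unit in range(unit_count)}
        self.healthy = set(self.queues)

    def submit(self, task):
        """Queue a task on the least-loaded healthy unit and return its id."""
        if not self.healthy:
            raise RuntimeError("no healthy execution units left")
        # sorted() makes tie-breaking deterministic: the lowest unit id wins.
        unit = min(sorted(self.healthy), key=lambda u: len(self.queues[u]))
        self.queues[unit].append(task)
        return unit

    def mark_failed(self, unit):
        """Exclude a unit and redistribute the tasks it still held."""
        self.healthy.discard(unit)
        orphaned, self.queues[unit] = self.queues[unit], []
        for task in orphaned:
            self.submit(task)
```

The module keeps working as long as at least one unit remains healthy; only the tasks queued on the failed unit are moved.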
Hardware support for local checkpoints is proposed so that local recovery tools can operate in the computational core and the computational module. The developed mechanism for forming local checkpoints solves the problem of recovering the system after a fault in dynamic mode.
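The local checkpoint idea can be sketched in software as follows. The paper proposes hardware support, so this is only an analogy: the `ComputeCore` class, its accumulate-by-key state, and the token log replayed on recovery are assumptions introduced for illustration.

```python
import copy


class ComputeCore:
    """Toy core with a local snapshot and a replay log (hypothetical design)."""

    def __init__(self):
        self.state = {}          # key -> accumulated value
        self._snapshot = None    # state saved at the last local checkpoint
        self._log = []           # tokens processed since that checkpoint

    def process(self, key, value):
        """Apply one token to the state and remember it for possible replay."""
        self.state[key] = self.state.get(key, 0) + value
        self._log.append((key, value))

    def checkpoint(self):
        """Save a local snapshot; earlier tokens no longer need replaying."""
        self._snapshot = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        """Roll back to the snapshot, then replay the tokens logged since."""
        if self._snapshot is None:
            raise RuntimeError("no local checkpoint available")
        replay = list(self._log)
        self.state = copy.deepcopy(self._snapshot)
        self._log.clear()
        for key, value in replay:
            self.process(key, value)
```

Because the snapshot and the log are local to the core, recovery in this sketch does not involve any other unit.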
To increase the reliability of the system, it is also proposed to move to a global distributed associative computing environment, in which the distribution of computations is handled by a separate processor. This reduces the overhead of restoring operation when individual local associative elements fail.
The main methods and approaches proposed in the article for creating recovery tools after a fault or failure, increasing fault tolerance and the architectural flexibility of the system, and improving chip manufacturability will make it possible to build highly reliable specialized and general-purpose supercomputers based on the parallel dataflow computing system.

Keywords

parallel dataflow computing system, computation reliability, dataflow computing model, local checkpoint.

Library reference

Zmejev D.N., Levchenko N.N., Okunev A.S. Methods and approaches to improving the reliability of the parallel dataflow computing system // Problems of Perspective Micro- and Nanoelectronic Systems Development - 2020. Issue 2. P. 87-94. doi:10.31114/2078-7707-2020-2-87-94

URL of paper

http://www.mes-conference.ru/data/year2020/pdf/D039.pdf