
Impact of features of the computing model and architecture on the reliability of the parallel dataflow computing system  

Authors
 Zmejev D.N.
 Levchenko N.N.
 Okunev A.S.
 Stempkovsky A.L.
Date of publication
 2020
DOI
 10.31114/2078-7707-2020-1-64-69

Abstract
 When creating high-performance computing systems, especially ones with a non-traditional architecture, the problem of increasing reliability is tied to the massive degree of parallelism involved; as the number of computational cores grows, the problem only worsens. When distributing computations across hundreds of thousands or millions of computational cores, the dataflow computing model offers significant advantages over the traditional von Neumann approach: it increases real performance and makes parallelization of tasks with large volumes of poorly structured data more efficient. Features of this non-traditional computing model, and of the architecture that implements it, strongly influence the new methods being developed to ensure system reliability.
The architecture of the parallel dataflow computing system (PDCS) under development at IPPM RAS is based on a dataflow computing model with a dynamically formed context, in which program nodes are activated by data readiness. Because the parallel computing processes in such systems have a complex structure, it is difficult to localize the source of an error. It is also necessary to ensure reliable restoration of normal operation after a fault, and reconfiguration of the system after a failure.
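Activation by data readiness can be illustrated with a minimal sketch (not the PDCS implementation; the class and method names here are hypothetical): tokens carry a (node, context) key, and a matching unit fires a node only once all operand tokens with the same key have arrived.

```python
from collections import defaultdict

class MatchingUnit:
    """Toy token-matching unit: pairs operand tokens by (node, context) key."""

    def __init__(self, arity):
        self.arity = arity                # operands required per node instance
        self.pending = defaultdict(dict)  # key -> {port: value}

    def receive(self, node, context, port, value):
        """Store a token; return the operand set if the node is ready to fire."""
        key = (node, context)
        self.pending[key][port] = value
        if len(self.pending[key]) == self.arity:
            return self.pending.pop(key)  # all operands ready: activate node
        return None

mu = MatchingUnit(arity=2)
assert mu.receive("add", 7, 0, 10) is None  # first operand: node must wait
ready = mu.receive("add", 7, 1, 32)         # second operand arrives
print(sum(ready.values()))                  # node fires: 42
```

The dynamically formed context (here just the integer `7`) is what lets many instances of the same program node be matched independently in parallel.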
The article describes the features of the dataflow computing model and architecture from the point of view of their impact on computational reliability. The reliability of the computing system is enhanced by hardware and software solutions developed for the matching processor: reducing the size of the content-addressable key memory, using a separate token memory, a command system with an extended set of AVOST (emergency-stop) flags, hardware control of forbidden states, a memory hierarchy, etc.
These and other features of the parallel dataflow computing system call for original algorithms for gathering the information needed to form local checkpoints, methods for registering faults and failures, and new hardware within the computational core for automatically restoring operation after a fault. These tools, implemented in the PDCS, will provide the high degree of reliability required of such computing systems.
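The idea of a local checkpoint with rollback after a fault can be sketched as follows (a hypothetical illustration, not the authors' algorithm; the `Core` class and its methods are assumptions): each computational core snapshots its local token state and, after a fault, restores the last snapshot.

```python
import copy

class Core:
    """Toy computational core with a local checkpoint of its token state."""

    def __init__(self):
        self.tokens = []        # in-flight tokens local to this core
        self.checkpoint = None  # last saved snapshot, if any

    def save_checkpoint(self):
        """Snapshot the current local token state."""
        self.checkpoint = copy.deepcopy(self.tokens)

    def restore(self):
        """Roll back to the last local checkpoint after a fault."""
        if self.checkpoint is not None:
            self.tokens = copy.deepcopy(self.checkpoint)

core = Core()
core.tokens = [("add", 7, 0, 10)]
core.save_checkpoint()
core.tokens.append(("garbage", None))  # a fault corrupts local state
core.restore()
print(core.tokens)                     # [('add', 7, 0, 10)]
```

In a real dataflow system the hard part, as the abstract notes, is collecting a *consistent* set of such local snapshots across cores while tokens are in flight; this sketch shows only the per-core rollback step.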
Keywords
 dataflow computing model, parallel dataflow computing system, computation reliability
Library reference
 Zmejev D.N., Levchenko N.N., Okunev A.S., Stempkovsky A.L. Impact of features of the computing model and architecture on the reliability of the parallel dataflow computing system // Problems of Perspective Micro- and Nanoelectronic Systems Development - 2020. Issue 1. P. 64-69. doi:10.31114/2078-7707-2020-1-64-69
URL of paper
 http://www.mes-conference.ru/data/year2020/pdf/D038.pdf
