In my Master’s thesis I worked with data from CINECA’s Marconi 100 supercomputer. I propose several autoencoder models for anomaly detection on supercomputers, which allows the problem to be approached with semi-supervised learning techniques. The project uses a sequence of real data from Marconi 100, collected over several months. Two model training approaches are compared. In the first, a single model is trained on data from all the nodes of the supercomputer. In the second, motivated by the significant differences observed between nodes, one model is trained for each node. The results were analysed by weighing the advantages and drawbacks of each approach.

Figure: Anomaly detection autoencoder architecture used in the project.
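To make the two training approaches concrete, here is a minimal sketch of reconstruction-error anomaly detection. It is not the thesis architecture: as a stand-in it uses a linear autoencoder fitted in closed form with an SVD (equivalent to PCA), and it assumes the model is fitted on normal samples only, with the 99th percentile of the training error as the threshold. The telemetry is synthetic, and every name (`fit_linear_autoencoder`, `node_a`, the six features) is hypothetical.

```python
import numpy as np

def fit_linear_autoencoder(X_normal, n_latent):
    """Fit a linear autoencoder (equivalent to PCA) on normal-only samples."""
    mean = X_normal.mean(axis=0)
    std = X_normal.std(axis=0) + 1e-8  # avoid division by zero
    Z = (X_normal - mean) / std
    # Rows of Vt are principal directions; the top n_latent form the bottleneck.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return {"mean": mean, "std": std, "W": Vt[:n_latent].T}

def reconstruction_error(model, X):
    """Mean squared error between each standardised sample and its reconstruction."""
    Z = (X - model["mean"]) / model["std"]
    Z_hat = (Z @ model["W"]) @ model["W"].T  # encode, then decode
    return np.mean((Z - Z_hat) ** 2, axis=1)

def fit_threshold(model, X_normal, quantile=0.99):
    """Assumed rule: flag anything above a high quantile of the training error."""
    return float(np.quantile(reconstruction_error(model, X_normal), quantile))

def detect(model, threshold, X):
    """Return a boolean array, True where a sample is flagged as anomalous."""
    return reconstruction_error(model, X) > threshold

rng = np.random.default_rng(0)

def synthetic_node(offset, n=500):
    # Hypothetical telemetry: 6 correlated features driven by 2 hidden factors.
    factors = rng.normal(size=(n, 2))
    mixing = np.array([[1.0, 0.5, 0.0, 1.0, 0.2, 0.0],
                       [0.0, 0.5, 1.0, 0.3, 1.0, 0.8]])
    return factors @ mixing + offset + 0.05 * rng.normal(size=(n, 6))

node_data = {"node_a": synthetic_node(0.0), "node_b": synthetic_node(5.0)}

# Approach 1: one model trained on data from all nodes.
X_all = np.vstack(list(node_data.values()))
global_model = fit_linear_autoencoder(X_all, n_latent=2)
global_threshold = fit_threshold(global_model, X_all)

# Approach 2: one model (and one threshold) per node.
per_node = {}
for name, X in node_data.items():
    model = fit_linear_autoencoder(X, n_latent=2)
    per_node[name] = (model, fit_threshold(model, X))

# A sample that breaks the usual correlation between features.
anomaly = node_data["node_a"][:1].copy()
anomaly[0, 2] += 10.0

model_a, threshold_a = per_node["node_a"]
flagged = detect(model_a, threshold_a, anomaly)
print("per-node model flags the sample:", bool(flagged[0]))
```

A non-linear autoencoder would replace the SVD step with a trained encoder and decoder, while the thresholding and the global versus per-node split would stay the same.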
In addition, a replica of the Marconi 100 supercomputer was developed in a virtual reality environment, allowing the data from each node to be visualised at different points in time. Real-time visualisation of information such as power consumption, core temperatures (GPU and CPU) or the anomaly predictions of the autoencoder model can be of great help in early error detection and in preventive maintenance of this type of high-performance system. The replica was developed in Python with the VTK library, which made it possible to build a full-scale model in a virtual environment, accessible through virtual reality devices.
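As an illustration of how per-node readings could drive such a view, the sketch below picks the latest temperature reading at a chosen time and maps it to a colour, with nodes flagged by the anomaly model highlighted separately. The source does not describe the actual mapping, so the temperature range, the blue-to-red ramp, the yellow highlight and all function names are assumptions.

```python
from bisect import bisect_right

ANOMALY_COLOUR = (1.0, 1.0, 0.0)  # hypothetical highlight for flagged nodes

def temperature_to_rgb(temp_c, t_min=30.0, t_max=90.0):
    """Map a temperature to an (r, g, b) triple in [0, 1], blue (cool) to red (hot)."""
    frac = (temp_c - t_min) / (t_max - t_min)
    frac = min(1.0, max(0.0, frac))  # clamp readings outside the assumed range
    return (frac, 0.0, 1.0 - frac)

def reading_at(series, t):
    """Return the latest value recorded at or before time t, or None if there is none.

    `series` is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [ts for ts, _ in series]
    i = bisect_right(times, t)
    return series[i - 1][1] if i > 0 else None

def node_colours(telemetry, anomalous, t):
    """Compute one colour per node for the scene at time t."""
    colours = {}
    for node, series in telemetry.items():
        temp = reading_at(series, t)
        if temp is None:
            continue  # no data yet for this node at time t
        if node in anomalous:
            colours[node] = ANOMALY_COLOUR
        else:
            colours[node] = temperature_to_rgb(temp)
    return colours

# Hypothetical readings: (seconds since start, temperature in degrees Celsius).
telemetry = {
    "node_a": [(0, 40.0), (60, 60.0), (120, 95.0)],
    "node_b": [(0, 35.0), (60, 36.0)],
}
colours = node_colours(telemetry, anomalous={"node_b"}, t=90)
print(colours)
```

In a VTK scene, each triple could then be applied to the actor representing the node, for example with `actor.GetProperty().SetColor(r, g, b)`.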