Hardware specification and provision, installation, configuration, commissioning, and maintenance of a compute cluster
Andra carries out numerous numerical simulations to assess the performance and safety of radioactive waste repositories by quantifying various physical phenomena, coupled or uncoupled (thermal, chemical, hydraulic, mechanical, etc.). Andra develops, maintains and operates a wide range of computational codes, some of which use parallel resources, validated in different system environments. The Agency mainly uses a Linux-based computing cluster, in operation since 2018, and wants to renew a large part of its hardware, with the new equipment operational by the end of the first half of 2024. Four compute nodes acquired in 2020 and the storage acquired in 2022 (disk array and data servers) are retained. The cluster occupies two racks within a cube located in an air-cooled computer room.
The equipment to be renewed is:
- 2 management servers;
- 41 compute nodes totalling 1312 cores (*);
- 2 login nodes;
- 2 visualization nodes (**);
- 6 InfiniBand switches (***);
- 1 backup robot with 4 drives (****);
- 4 power distribution units (PDU).
(*): Renewal with the same number of cores (x86 64-bit processors). Given the increase in cores per processor since 2018, this should correspond to about twenty nodes at most.
(**): Renewal with one node dedicated to visualization and one node dedicated to GPU computing.
(***): The number of new InfiniBand switches is to be determined from their port count and the number of devices to be connected in a fat-tree topology (in particular the number of compute nodes).
(****): Renewal with an overall throughput greater than or equal to that of the 4 current LTO-7 drives used simultaneously.
Notes:
- The current backup robot and management servers, although to be renewed, are also retained. Both servers will be reconfigured so that their sole role is to provide access to the backup robot;
- Liquid cooling is excluded;
- The current racks are powered by 16 A three-phase supplies, with the possibility of doubling the amperage.
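As an illustration of the sizing notes (*) and (***), the following Python sketch estimates how many nodes are needed to reach 1312 cores and how many switches a non-blocking two-level fat-tree would require. It is a rough aid under stated assumptions, not part of the specification: the per-node core counts, the switch port count and the device counts used in the example calls are hypothetical, and the function names are introduced here for illustration only.

```python
import math


def nodes_for_cores(total_cores: int, cores_per_node: int) -> int:
    """Smallest number of identical nodes providing at least total_cores."""
    return math.ceil(total_cores / cores_per_node)


def fat_tree_switches(endpoints: int, ports: int) -> tuple:
    """Return (leaf, spine) switch counts for a non-blocking fat-tree.

    Assumptions (not from the notice): identical switches with an even
    number of ports, at most two levels, half of each leaf switch's
    ports facing devices and half facing the spine level.
    """
    if endpoints <= ports:
        # Everything fits on a single switch: no spine level needed.
        return (1, 0)
    down = ports // 2                      # device-facing ports per leaf
    leaves = math.ceil(endpoints / down)
    if leaves > ports:
        raise ValueError("more than two levels would be required")
    # Each leaf has `down` uplinks; each spine switch offers `ports` ports.
    spines = math.ceil(leaves * down / ports)
    return (leaves, spines)


if __name__ == "__main__":
    # Current cluster, from the notice: 41 nodes of 32 cores each.
    print(nodes_for_cores(1312, 32))
    # Hypothetical per-node core counts for the renewed nodes.
    for cores in (64, 96, 128):
        print(cores, nodes_for_cores(1312, cores))
    # Hypothetical device counts on hypothetical 40-port switches.
    print(fat_tree_switches(32, 40))
    print(fat_tree_switches(50, 40))
```

The device count passed to fat_tree_switches should include every device attached to the fabric (new and retained compute nodes, login, visualization, GPU, management and storage servers), which is why the final switch count depends on the proposed configuration.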
The service covers, for a lump sum:
- the supply and delivery of the equipment;
- a warranty covering parts, labour and travel for the equipment, for a period of seven years from the signing of the final acceptance report;
- the integration of the hardware into the racks currently occupied by the operating cluster, with the related documentation. The integration will be carried out in two stages, with the possible temporary use of an additional rack, in order to keep part of the computing power of the current cluster operational and to minimize downtime;
- the provision of hardware documentation (manufacturer documentation) and of installation, configuration and operation documentation;
- documentation of the installed software: the software sources and the configuration, installation and operation documentation;
- carrying out and documenting acceptance tests, which will include application benchmarks;
- knowledge transfer to Andra or to third parties designated by it;
- support and corrective maintenance covering the hardware, the software configuration and the documentation throughout the duration of the contract.