# Creating my own Beowulf cluster
###### tags: `Experiments & Records`
:::danger
:crying_cat_face: :crying_cat_face: :crying_cat_face: This project has been temporarily suspended due to funding problems. :crying_cat_face: :crying_cat_face: :crying_cat_face:
:::
### Contents
1. Introduction
2. Hardware and environment
3. Theoretical estimating and calculation
4. Software setup
5. Benchmarking and test
6. Epilogue
:::warning
:warning: DISCLAIMER :warning:
This Beowulf cluster was built for personal and experimental purposes only and is not intended for practical or commercial use. Its performance may not be suitable for serious high-performance computing tasks, and the actual performance of the system will vary depending on the specific workload and software optimization. As the builder of this system, I do not assume any liability for any damages or losses resulting from any imitation or adaptation of this project.
:::
----------------------------------------------------------
### Introduction
Welcome to my cluster project! This write-up documents my exploration of the topics involved in building a small Beowulf cluster: the hardware and environment, the theory behind the expected performance, the software setup, and benchmarking, along with some insights and recommendations for future development. By the end you should have a reasonable picture of how these pieces fit together and what trade-offs are involved. Also, as a poor student, I cannot afford high-end CPUs for my hobby and my research, so I built this cluster as an alternative solution for my other projects, such as mathematical simulation and computation.
The cluster itself is built from six nodes, each featuring an Intel Core i5-760 processor and 4GB of DDR3 RAM, networked over gigabit Ethernet. Whether the job is processing large datasets, running simulations, or other compute-intensive tasks, it gives me a reasonably capable and scalable platform to get the work done. The six CPUs alone are rated at 95W TDP each (about 570W total), so "cost-effective" here refers to the purchase price rather than the power bill. The design also leaves room for growth: more nodes can be added on demand if I ever need extra scalability. As for storage, I decided to install an individual disk in every node rather than attaching them to a NAS, to keep extra load off the Ethernet and avoid a possible performance bottleneck.
--------------------------------------------------------
### Hardware and Environment
Here are the hardware specifications of the cluster. It is comprised of six nodes, each built around an Intel Core i5-760 with four cores at a base clock of 2.8GHz. Each node is equipped with 4GB of DDR3 RAM, for a total of 24GB across the cluster, and the nodes are connected by gigabit Ethernet for fast and reliable networking. For storage, every node has its own local disk rather than sharing a NAS (as explained in the introduction), giving roughly 2TB of capacity across the cluster. With this configuration the cluster is a cost-effective and reasonably scalable platform for accelerating my workloads.
#### CPU in every computing node
:::info
### Intel Core i5-760
Number of Cores: ==4==
Clock Speed (Turbo Boosted): ==2.8 (3.3) GHz==
Hyper-Threading: ==No==
L3 Cache: ==8MB==
Manufacturing Process: ==45nm==
Socket Compatibility: ==LGA 1156==
Memory Support: ==DDR3 up to 1333 MHz==
Thermal Design Power (TDP): ==95 watts==
<font color="#f00">(Please note that CPU in every computing node is the same)</font>
:::
#### Details for every single node
:::success
### Node 1
Power Supply:
Motherboard:
Hard drive:
RAM module:
:::
:::success
### Node 2
Power Supply:
Motherboard:
Hard drive:
RAM module:
:::
:::success
### Node 3
Power Supply:
Motherboard:
Hard drive:
RAM module:
:::
:::success
### Node 4
Power Supply:
Motherboard:
Hard drive:
RAM module:
:::
:::success
### Node 5
Power Supply:
Motherboard:
Hard drive:
RAM module:
:::
:::success
### Node 6
Power Supply:
Motherboard:
Hard drive:
RAM module:
:::
#### Network Utilities (here are some possible choices)
There are actually many ways to connect the nodes together; here are three options I might be able to build:
A. Gigabit Ethernet
B. Fibre-based 4 Gbps network (Fibre Channel)
C. 10/100 Mbps Fast (is it?) Ethernet
The details will be explained in the estimating chapter, and a quick transfer-time comparison is sketched at the end of this section. The hardware below is colour-coded like this:
:::danger
red for fibre network hardware
:::
:::warning
yellow for gigabit ethernet hardware
:::
:::success
green for 10/100 ethernet hardware
:::
:::warning
### Switch
<font color="#33F">Mercusys MS108G 8-port 10/100/1000 Mbps gigabit switch</font> with auto-negotiation & auto MDI/MDIX; IEEE 802.3, IEEE 802.3u, IEEE 802.3x and CSMA/CD supported
10/100/1000 Mbps half duplex
20/200/2000 Mbps full duplex
64 KB buffer
Jumbo frames up to 9 KB
### Wires and cables
Just some CAT6 copper cables with RJ-45 connectors, nothing more to explain :no_good:
### Network interface cards
Depends on what each motherboard has on board;
might add an external one lol
:::
:::success
### Fibre host bus adapter (HBA): QLogic QLE2460
Single-port, 4 Gbps Fibre Channel to PCI Express host bus adapter, FC or LC connector.
##### Data rate
- 4/2/1 Gbps auto-negotiation (4.2480 / 2.1240 / 1.0625 Gbps)
##### Performance
- 150,000 IOPS
##### Topology
- Point-to-point (N_Port), arbitrated loop (NL_Port), switched fabric (N_Port)
##### Logins
- Support for F_Port and FL_Port login: 2,048 concurrent logins and 2,048 active exchanges
##### Class of service
- Class 2 and 3
##### Protocols
- FCP (SCSI-FCP), FC-TAPE (FCP-2)

Cable runs of about 70 m at the 4 Gbps rate.
:::
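As a teaser for the estimating chapter, here is the quick transfer-time comparison mentioned above. It is naive arithmetic only: the 2 GB payload is an arbitrary example size I picked, and the rates are nominal line rates that ignore protocol overhead, so real transfers will be slower.

```python
# Naive transfer-time comparison for the three interconnect candidates.
# The 2 GB payload is an arbitrary example; rates are nominal line rates
# and ignore protocol overhead, so real-world numbers will be worse.

PAYLOAD_GB = 2.0

options_gbps = {
    "10/100 Fast Ethernet": 0.1,
    "Gigabit Ethernet":     1.0,
    "4 Gbps Fibre Channel": 4.0,
}

for name, gbps in options_gbps.items():
    seconds = PAYLOAD_GB * 8 / gbps   # GB -> Gb, divided by line rate
    print(f"{name:22s} ~{seconds:6.1f} s for {PAYLOAD_GB:.0f} GB")
```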
#### Software
------------------------------------------------------------
### Theoretical Estimating
#### About the computing performance
The peak theoretical performance of this six-node cluster is 537.6 GFLOPS (single precision), based on the per-processor estimate for the i5-760 below.
:::info
:bulb: The Intel Core i5-760 has four cores with a base clock speed of 2.8 GHz. It does not support AVX (that arrived with Sandy Bridge), but its 128-bit SSE units can issue 4 single-precision additions and 4 multiplications per cycle, so assume 8 single-precision FLOPs per core per clock. The peak theoretical performance per processor is then:
4 cores × 2.8 GHz × 8 FLOPs per clock cycle = 89.6 GFLOPS
Multiplying the peak theoretical performance per processor by the number of processors in the cluster (6) gives the total peak theoretical performance:
Total peak theoretical performance = 6 processors × 89.6 GFLOPS per processor = <font color="#f00">537.6 GFLOPS</font>
:::
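For the record, the arithmetic above can be redone in a couple of lines. The 8 FLOPs/cycle figure is the SSE assumption stated in the note, not a measured value.

```python
# Back-of-the-envelope peak FLOPS estimate for the cluster.
# Assumption: 8 single-precision FLOPs per cycle per core (SSE: 4 adds + 4 muls),
# matching the estimate in the note above -- not a measured number.

CORES_PER_CPU = 4          # Intel Core i5-760
BASE_CLOCK_GHZ = 2.8       # base clock, ignoring Turbo Boost
FLOPS_PER_CYCLE = 8        # assumed SIMD throughput per core (single precision)
NODES = 6

per_cpu_gflops = CORES_PER_CPU * BASE_CLOCK_GHZ * FLOPS_PER_CYCLE
cluster_gflops = per_cpu_gflops * NODES

print(f"Peak per CPU:     {per_cpu_gflops:.1f} GFLOPS")   # 89.6
print(f"Peak for cluster: {cluster_gflops:.1f} GFLOPS")   # 537.6
```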
Based on the total peak theoretical performance of this Beowulf cluster (537.6 GFLOPS), here is a comparison with some modern CPUs in terms of their single-precision floating-point performance:
|Processor|GFLOPS (single precision)|
|---------|------|
|This Beowulf cluster (6× Intel Core i5-760)|<font color="#f00">537.6</font>|
|Intel Xeon E5-2699 v3|540|
|AMD Opteron 6378|541|
|Intel Core i7-5960X|469.4|
|AMD Ryzen 9 5950X|988|
|AMD EPYC 7763|7056 :exploding_head:|
<font color="#f00">EPYC 7763 is just here for fun.... As you can see, the performance of this cluster is not very good, as the previous statement said, this is only built for learning. </font>The actual performance may be lower due to factors such as network latency and communication overhead.Also I'm gonig to run MATLAB on this cluster.MATLAB is a parallelizable application and can take advantage of the multiple cores and nodes in cluster to accelerate computations. The degree of acceleration will depend on the specific computations being performed and how well they can be parallelized.The size of the dataset being processed may also impact the performance of the cluster. If the data is stored on the local disks of each node, then I/O bandwidth may become a bottleneck, especially if the data needs to be transferred between nodes frequently.
#### About the fibre network solution
There are actually three options for networking between the nodes: a fibre-based network, gigabit Ethernet, and 10/100 Mbps Fast Ethernet. Here is some estimation. My first thought was the fibre network, which provides very fast data exchange; however, the motherboard I'm using does not have a PCIe x4 slot for the QLE2460 fibre HBA, which forces me to use an x1-to-x16 riser adapter to fit the card. That might create a bottleneck on transfer speed: the QLogic QLE2460 Fibre Channel host bus adapter can reach up to 4 Gbps (auto-negotiating 4/2/1 Gbps link rates), but the actual transfer speed depends on the configuration of the server, the storage device, and the network. The following table shows the theoretical limits of the PCIe link the HBA could end up on.
Transfer speed| PCIe 2.0 x1 link| PCIe 2.0 x4 link
--------------|------------|----------------
Maximum link rate| 4 Gbps|16 Gbps
Usable bandwidth| ~0.5 GB/s|~2 GB/s

Note: the actual transfer speed of the QLogic QLE2460 may vary with the configuration of the server, the storage device, and the network. The speeds listed here are the theoretical maximums for each link width.
Please also note that LGA 1156 boards only go up to PCIe 2.0, which means there's no way to make this any faster :cry:
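To see whether the x1 riser actually matters, here is the back-of-the-envelope check I used. The per-lane PCIe figures are the usual published values (encoding overhead already included), and ~400 MB/s per direction is the commonly quoted payload rate for 4 Gbps Fibre Channel; the idea that the HBA might only negotiate first-generation PCIe rates on the riser is my own assumption and worth verifying.

```python
# Rough check of whether an x1 riser throttles the 4 Gbps FC HBA.
# Per-lane PCIe throughput values are the usual published figures
# (encoding overhead included); ~400 MB/s per direction is the commonly
# quoted payload rate for 4 Gbps Fibre Channel.

fc_payload_mbs = 400  # ~4 Gbps Fibre Channel, per direction

pcie_links_mbs = {
    "PCIe 1.x x1 (riser, assumed worst case)": 250,
    "PCIe 2.0 x1": 500,
    "PCIe 2.0 x4 (what the card expects)": 2000,
}

for link, mbs in pcie_links_mbs.items():
    verdict = "bottleneck" if mbs < fc_payload_mbs else "OK"
    print(f"{link:42s} {mbs:5d} MB/s -> {verdict}")
```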
Also, fibre on every node can cost a lot, and because of that I haven't even decided which fibre switch to buy yet. They are just expensive as fuck. Maybe I'll try something from China rather than from Cisco or Netgear or whatever.
:::spoiler some notes
Parts still not arrived:
1. motherboards
2. cases
3. CPUs
4. RAM modules (still missing 24)
5. 2.5" disks
6. another shelf to hold those nodes

Find another way to estimate the performance.
:::
{"metaMigratedAt":"2023-06-17T23:11:59.283Z","metaMigratedFrom":"Content","title":"Creating my own Beowulf cluster","breaks":true,"description":" This project has temporarily terminated due to the funding problem. ","contributors":"[{\"id\":\"315b273b-a5a4-4658-bbd2-b72a80822ad0\",\"add\":11253,\"del\":471}]"}