# Copying, backup and archiving data from HPC clusters
Quite often, the storage provided by HPC or cloud services is limited and, even though it is stored reliably, data can still be accidentally deleted or overwritten. Make sure your data is also kept on storage specifically designed for this purpose (provided by the University, your lab or a third party), and follow the routines for storing your data there.
Here we will discuss what you can do at a personal level and advise only on **simple**, free, Linux-based solutions.
## Storage media types
- external HDD/SSD: the easiest way to keep your files relatively safe.
- NAS ([Network Attached Storage](https://en.wikipedia.org/wiki/Network-attached_storage)) - more robust if running some reasonable [RAID](https://en.wikipedia.org/wiki/RAID) configuration, bulkier, needs maintenance (firmware and software updates)...
- Cloud storage - reasonable sizes usually cost money. A common problem is that the limitations are not easily seen (bandwidth, price for download, etc.).
- DVD - avoid this - it only causes trouble (my humble opinion).
## Copy / backup
The easiest option is to copy your files to your own disk. If your files are on a compute cluster, physically attaching the disk is essentially impossible, and even if you could, I advise against the regular "drag and drop" or `cp`. For a remote location you can use `scp` or `sftp`, but, if you can, take one step further and use `rsync`.
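> For a one-off transfer, plain `scp` is often enough. A minimal sketch with an illustrative hostname and paths (the `rsync` equivalent, preferable for repeated transfers, is shown in the next section):

``` bash
# Recursively copy a project folder from the cluster to a local disk.
# Hostname and paths are illustrative - replace with your own.
scp -r username@cluster.uu.se:/project/folder /external_disk/backups/
```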
### `rsync`
The most common and rather robust tool available for Linux and Mac. It is also ported to Windows, but try to avoid such ports - there are better ways to use Linux tools under Windows [link](https://hackmd.io/@pmitev/Linux4WinUsers). It is installed and available on every HPC cluster...
- pros:
  - simple and robust - transferred data is check-summed during the entire process, which makes it rather reliable
  - easy to restart interrupted transfers - just run the command again.
  - allows for efficient incremental updates or mirroring of the data.
  - with some tricks one can keep versioned backups of the data on an ext4 file system (or any file system that supports hard links)
  - effortless access to the copied/backed-up data and easy restoration in case of data loss
- cons:
  - be careful with the `/` at the end of the source and target
The command-line syntax is very similar to `scp`. Make sure you know how a trailing `/` on the source and the destination affects the result (see the examples below).
> Simple example:
``` bash
rsync -av --delete username@cluster.uu.se:/project/folder /external_disk/backups/
```
The same command can be used again and again to bring the modified files under `/project/folder` to the copy/backup location. The `--delete` option ensures that files deleted (between syncs) on the source will be deleted on the target as well.
This simple approach will keep the copy on the disk in sync with your data. Be careful: it is not a bi-directional synchronization tool, so do not edit data on the disk - it will be overwritten the next time you sync. Do not change the direction of the sync (i.e. from the disk to your data on the cluster) - it requires extra caution!
Use the same command if you want to transfer different folders - just change the source and destination.
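> To avoid surprises, the sketch below (with the same illustrative paths) shows how a trailing `/` on the source changes the result, and how `--dry-run` lets you preview a transfer before running it with `--delete`:

``` bash
# No trailing / on the source: the folder itself is copied,
# ending up as /external_disk/backups/folder/
rsync -av username@cluster.uu.se:/project/folder /external_disk/backups/

# Trailing / on the source: only the *contents* of the folder are copied,
# ending up directly under /external_disk/backups/
rsync -av username@cluster.uu.se:/project/folder/ /external_disk/backups/

# --dry-run (-n) shows what would be transferred or deleted without
# changing anything - useful before any command with --delete.
rsync -avn --delete username@cluster.uu.se:/project/folder /external_disk/backups/
```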
> For Windows users, [WinSCP](https://winscp.net) is a very good GUI alternative for transferring data (FTP, FTPS, SCP, SFTP, WebDAV or S3 file transfer protocols), but repeated data updates might lack some of the neat `rsync` features.
A last word: a copy of your data can almost be considered a backup. You should be able to recover your data easily from this copy/backup with minimal effort.
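> Restoring is just a transfer in the opposite direction - exactly the case that requires extra caution mentioned above. A minimal sketch, reusing the illustrative paths from the example and a hypothetical `results` subfolder:

``` bash
# Preview first: -n (--dry-run) shows what would be transferred.
rsync -avn /external_disk/backups/folder/results/ username@cluster.uu.se:/project/folder/results/

# Then run it for real. Note: no --delete - we only bring files back.
rsync -av /external_disk/backups/folder/results/ username@cluster.uu.se:/project/folder/results/
```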
## Backup
Having a **proper backup** requires a bit more (it is a rather elaborate discussion - we will not go through it here). A copy of your data is the first step. It gives you the option to recover files as they were before the last sync, but when you sync, all changes propagate to the copy as well, i.e. you have only the latest copy/backup of your files (still better than nothing).
A better backup approach is to keep **snapshots** of the backup as it was at each point in time. An obvious solution is to make a new, complete copy of the data every time... and that is, perhaps, just fine for small data.
Having multiple (almost identical) copies of large data is not a good idea (also rather obvious).
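> A common trick (hinted at in the pros list above) is `rsync`'s `--link-dest` option: each new snapshot hard-links files that are unchanged since the previous one, so only modified files take extra space. A minimal sketch with illustrative paths; on the very first run the `latest` link does not exist yet and `rsync` simply makes a full copy:

``` bash
#!/bin/bash
# Illustrative snapshot backup: each run creates a new dated snapshot,
# hard-linking unchanged files against the previous one.
today=$(date +%Y-%m-%d)
backup_root=/external_disk/backups

rsync -av --delete \
  --link-dest="$backup_root/latest" \
  username@cluster.uu.se:/project/folder/ \
  "$backup_root/$today/"

# Point "latest" at the newest snapshot for the next run.
rm -f "$backup_root/latest"
ln -s "$backup_root/$today" "$backup_root/latest"
```

Each snapshot directory then looks like a complete copy, while unchanged files are stored on disk only once.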
Without going into too many details, here are some points that can help you improve your copy/backup solution.
- Keep the copy/backup on a **different disk device**, i.e. do not make the backup on the same disk as the original data.
- If you use an external disk, **do not keep the disk constantly connected** to a computer and power. There are enough reasons not to do so.
- **Do not** use disks that are **FAT32** formatted - a common case for disks that need to be used with both Mac and Windows.
- Consider archiving finished projects (see the sketch after this list).
- Try to have **2 copies/backups/archives** at two different physical locations - this always sounds like too much until you end up in one of these unfortunate situations...
- Small, non-sensitive data can easily be kept online on cloud storage, GitHub/GitLab, etc.
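> For the archiving point above, a minimal sketch with illustrative paths: pack a finished project into a single compressed archive, check that it is readable, and record a checksum so corruption can be detected later.

``` bash
# Pack the finished project into one compressed archive (illustrative paths).
tar czf finished_project.tar.gz -C /project finished_project

# List the archive contents to verify it is readable before deleting anything.
tar tzf finished_project.tar.gz > /dev/null && echo "archive OK"

# Record a checksum to detect corruption (bit rot) later.
sha256sum finished_project.tar.gz > finished_project.tar.gz.sha256
```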
Ask your colleagues how they handle their data - there is no universal or perfect solution. Adapt something that is reasonable for your situation; time and experience will tell you how much (time, effort, etc.) you want to invest in this yourself.
## Contacts:
- [Pavlin Mitev](https://katalog.uu.se/profile/?id=N3-1425)
- [UPPMAX](https://www.uppmax.uu.se/)
- [SNIC AE@UPPMAX - related documentation](/8sqXISVRRquPDSw9o1DizQ)


###### tags: `UPPMAX`, `SNIC`, `backup`, `archive`