# Copying, backup and archiving data from HPC clusters
Quite often, the storage provided by HPC or cloud services is limited and, even though it is stored reliably, data can still be accidentally deleted or overwritten. Make sure your data is also kept on storage specifically designed for this purpose (provided by the University, your lab or a third party), and follow the routines for storing your data there.
Here we will discuss what you can do at a personal level and advise only on **simple**, free, Linux-based solutions.
## Storage media types
- external HDD/SSD: the easiest way to keep your files relatively safe.
- NAS ([Network Attached Storage](https://en.wikipedia.org/wiki/Network-attached_storage)) - more robust if running some reasonable [RAID](https://en.wikipedia.org/wiki/RAID) configuration, bulkier, needs maintenance (firmware and software updates)...
- Cloud storage - reasonable sizes usually cost money. A common problem is that the limitations are not easily seen (bandwidth, price for download, etc.).
- DVD - avoid this - it only causes trouble (my humble opinion).
## Copy / backup
The easiest option is to copy your files to your own disk. If your files are on a compute cluster, physically attaching the disk is essentially impossible, and even if you could, I advise against the regular "drag and drop" or `cp`. For a remote location you can use `scp` or `sftp`, but, if you can, take one step further and use `rsync`.
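> For a one-off transfer, plain `scp` is often enough. A minimal sketch with an illustrative hostname and paths (the `rsync` equivalent, preferable for repeated transfers, is shown in the next section):

``` bash
# Recursively copy a project folder from the cluster to a local disk.
# Hostname and paths are illustrative - replace with your own.
scp -r username@cluster.uu.se:/project/folder /external_disk/backups/
```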
### `rsync`
The most common and rather robust tool available for Linux and Mac. It is also ported to Windows, but try to avoid such ports - there are better ways to use Linux tools under Windows [link](https://hackmd.io/@pmitev/Linux4WinUsers). It is installed and available on every HPC cluster...
- pros:
  - simple and robust - transferred data is check-summed during the entire process, which makes it rather reliable
  - easy to restart interrupted transfers - just run the command again.
  - allows for efficient incremental updates or mirroring of the data.
  - with some tricks one can keep versioned backups of the data on an ext4 file system (or any file system that supports hard links)
  - effortless access to the copied/backed-up data and easy restoration in case of data loss
- cons:
  - be careful with the `/` at the end of the source and target
The command-line syntax is very similar to `scp`. Make sure you know how a trailing `/` on the source and the destination affects the result (see the examples below).
> Simple example:
``` bash
rsync -av --delete username@cluster.uu.se:/project/folder /external_disk/backups/
```
The same command can be used again and again to bring the modified files under `/project/folder` to the copy/backup location. The `--delete` option ensures that files deleted (between syncs) on the source will be deleted on the target as well.
This simple approach will keep the copy on the disk in sync with your data. Be careful: it is not a bi-directional synchronization tool, so do not edit data on the disk - it will be overwritten the next time you sync. Do not change the direction of the sync (i.e. from the disk to your data on the cluster) - it requires extra caution!
Use the same command if you want to transfer different folders - just change the source and destination.
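> To avoid surprises, the sketch below (with the same illustrative paths) shows how a trailing `/` on the source changes the result, and how `--dry-run` lets you preview a transfer before running it with `--delete`:

``` bash
# No trailing / on the source: the folder itself is copied,
# ending up as /external_disk/backups/folder/
rsync -av username@cluster.uu.se:/project/folder /external_disk/backups/

# Trailing / on the source: only the *contents* of the folder are copied,
# ending up directly under /external_disk/backups/
rsync -av username@cluster.uu.se:/project/folder/ /external_disk/backups/

# --dry-run (-n) shows what would be transferred or deleted without
# changing anything - useful before any command with --delete.
rsync -avn --delete username@cluster.uu.se:/project/folder /external_disk/backups/
```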
> For Windows users, [WinSCP](https://winscp.net) is a very good GUI alternative for transferring data (FTP, FTPS, SCP, SFTP, WebDAV or S3 file transfer protocols), but repeated data updates might lack some of the neat `rsync` features.
A last word: a copy of your data can almost be considered a backup. You should be able to recover your data easily from this copy/backup with minimal effort.
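> Restoring is just a transfer in the opposite direction - exactly the case that requires extra caution mentioned above. A minimal sketch, reusing the illustrative paths from the example and a hypothetical `results` subfolder:

``` bash
# Preview first: -n (--dry-run) shows what would be transferred.
rsync -avn /external_disk/backups/folder/results/ username@cluster.uu.se:/project/folder/results/

# Then run it for real. Note: no --delete - we only bring files back.
rsync -av /external_disk/backups/folder/results/ username@cluster.uu.se:/project/folder/results/
```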
## Backup
Having a **proper backup** requires a bit more (it is a rather elaborate discussion - we will not go through it here). A copy of your data is the first step. It gives you the option to recover files as they were before the last sync, but when you sync, all changes propagate to the copy as well, i.e. you have only the latest copy/backup of your files (still better than nothing).
A better backup approach is to keep **snapshots** of the backup as it was at each point in time. An obvious solution is to make a new, complete copy of the data every time... and that is, perhaps, just fine for small data.
Having multiple (almost identical) copies of large data is not a good idea (also rather obvious).
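> A common trick (hinted at in the pros list above) is `rsync`'s `--link-dest` option: each new snapshot hard-links files that are unchanged since the previous one, so only modified files take extra space. A minimal sketch with illustrative paths; on the very first run the `latest` link does not exist yet and `rsync` simply makes a full copy:

``` bash
#!/bin/bash
# Illustrative snapshot backup: each run creates a new dated snapshot,
# hard-linking unchanged files against the previous one.
today=$(date +%Y-%m-%d)
backup_root=/external_disk/backups

rsync -av --delete \
  --link-dest="$backup_root/latest" \
  username@cluster.uu.se:/project/folder/ \
  "$backup_root/$today/"

# Point "latest" at the newest snapshot for the next run.
rm -f "$backup_root/latest"
ln -s "$backup_root/$today" "$backup_root/latest"
```

Each snapshot directory then looks like a complete copy, while unchanged files are stored on disk only once.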
Without going into too many details, here are some points that can help you improve your copy/backup solution.
- Keep the copy/backup on a **different disk device**, i.e. do not make the backup on the same disk as the original data.
- If you use an external disk, **do not keep the disk constantly connected** to a computer and power. There are enough reasons not to do so.
- **Do not** use disks that are **FAT32** formatted - a common case for disks that need to be used with both Mac and Windows.
- Consider archiving finished projects (see the sketch after this list).
- Try to have **2 copies/backups/archives** at two different physical locations - this always sounds like too much until you end up in one of these unfortunate situations...
- Small, non-sensitive data can easily be kept online on cloud storage, GitHub/GitLab, etc.
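> For the archiving point above, a minimal sketch with illustrative paths: pack a finished project into a single compressed archive, check that it is readable, and record a checksum so corruption can be detected later.

``` bash
# Pack the finished project into one compressed archive (illustrative paths).
tar czf finished_project.tar.gz -C /project finished_project

# List the archive contents to verify it is readable before deleting anything.
tar tzf finished_project.tar.gz > /dev/null && echo "archive OK"

# Record a checksum to detect corruption (bit rot) later.
sha256sum finished_project.tar.gz > finished_project.tar.gz.sha256
```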
Ask your colleagues how they handle their data - there is no universal or perfect solution. Adapt something that is reasonable for your situation; time and experience will tell you how much (time, effort, etc.) you want to invest in this yourself.
## Contacts:
- [Pavlin Mitev](https://katalog.uu.se/profile/?id=N3-1425)
- [UPPMAX](https://www.uppmax.uu.se/)
- [SNIC AE@UPPMAX - related documentation](/8sqXISVRRquPDSw9o1DizQ)


###### tags: `UPPMAX`, `SNIC`, `backup`, `archive`