Tags: puhti, mahti, allas
This is the common "notebook" for the "Using CSC HPC environment efficiently" course organised in April 2021 at CSC - IT Center for Science. The course page is on the e-Lena platform.
This is the place to ask questions about the workshop content! We use the Zoom chat only for posting links, reporting Zoom problems and such.
Use backticks (`) for the code blocks. Otherwise, it's like a Google Doc: it allows simultaneous editing.
We strive to follow the Code of Conduct developed by The Carpentries organisation to foster a welcoming environment for everyone.
You can type your questions here. We will answer them, and this document will store the answers for you for later use!
Q1: I have difficulty pasting my questions into HackMD (here). Do you have some instructions on how to write here?
Q2: I cannot access most of the slides from the course page. These are marked with an earth logo.
Q3: Will the slides be available after the course?
Q4: Has the Zoom meeting already started? I'm just getting a "The meeting has not started" note.
Q5: Do we get ECTS credits from this course for our university transcripts?
Q6: How long will project_2004306 be available for use after the course?
Q1: I have a CSC password as well as a Haka password. I guess the CSC password should work for Puhti?
Q2: I can see the course's project in the list of projects in my MyCSC profile, but I can't access it, i.e. it doesn't show up under csc-workspaces.
Q3: Is the memory available per core the total memory divided by the number of cores in that node?
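As a rough worked example (standard Puhti CPU node specs from the CSC docs): 40 cores and 192 GiB of memory give an even split of about 4.8 GiB per core, but in practice you request memory explicitly (e.g. with `--mem-per-cpu`) rather than automatically getting that share.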
Q4: In general, how many compute nodes can be used for a typical molecular dynamics simulation? I have read that it might take several days to complete a job. Can we use many compute nodes at the same time (e.g. for a 69 kDa protein)?
Q5: Any tips on optimizing an application for memory usage? I also need a good tool to monitor my application's memory usage.
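On Slurm systems like Puhti, a common way to see what a job actually used is to inspect it after it finishes; a minimal sketch (the job ID is hypothetical):

```bash
seff 1234567                                        # CPU/memory efficiency summary
sacct -j 1234567 -o JobID,JobName,MaxRSS,Elapsed    # peak memory per job step
```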
Q6: Is there any possibility to run GPU jobs for more than 3 days on Puhti when submitting sbatch jobs, or is that simply the maximum duration?
Q7: Do you also recommend using a conda environment for running Python scripts with sbatch on Puhti? (I need deep learning libraries such as PyTorch and TensorFlow, so conda makes it easier to load libraries.)
Q8: How do I know how many cores (CPUs) I need to optimize my sbatch jobs? Is it a trial-and-error task, or is there a logic I could use to calculate it in advance?
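In practice a short scaling test is the usual approach; a minimal sketch (the script name is hypothetical, and the program must actually use the allocated cores, e.g. via `$SLURM_CPUS_PER_TASK`):

```bash
# Submit the same job with increasing core counts, then compare
# elapsed times with seff/sacct to see where the speedup levels off.
for n in 1 2 4 8 16; do
    sbatch --cpus-per-task=$n --job-name=scaling_$n my_batch_job.sh
done
```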
Q9: Is NoMachine something like `ssh user@machine -XY` in Linux?
Q10: How do I see which node I am on? A login node or a compute node?
A: If your prompt looks like `[username@puhti-login1 ~]$`, you're on login node 1. You can also run the `hostname` command in the terminal: on a login node the output is `puhti-login1.bullx` or `puhti-login2.bullx`, while on other (compute or interactive) nodes the name looks like `r07c49.bullx`.
Q1: Is automatic deletion of scratch active on Mahti or Puhti?
A: Not yet, but `/scratch` is filling up fast, so we will have to turn it on soon.
Q2: Should I store my data in Allas, move it to Puhti for the calculation, and then move the results back from Puhti to Allas, correct?
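A minimal sketch of that staging workflow, assuming the `allas` module and a configured connection (bucket and file names are hypothetical; see the a-commands docs for the exact options):

```bash
module load allas
allas-conf                      # authenticate to the Allas project

a-get mybucket/input-data.tar   # 1. fetch input data from Allas to /scratch
# ... run the computation on Puhti ...
a-put results/ -b mybucket      # 2. upload the results back to Allas
```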
Q3: Should I use `$TMPDIR` (the temporary directory) for testing, and then the `/scratch` disk for the calculations themselves?
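Related to temporary space: per the CSC docs, Puhti batch jobs can also request fast node-local NVMe disk; a minimal sketch (the size and file name are hypothetical):

```bash
#SBATCH --gres=nvme:100    # request 100 GB of node-local NVMe disk

# During the job, the allocated space is available via $LOCAL_SCRATCH
cp big_input.dat $LOCAL_SCRATCH/
```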
Q4: Do you recommend saving my dataset (~900 GB, < 1.1 TB) in /scratch/proj_num on Puhti, or is it better to save it in Allas regardless of the size of the dataset?
Q5: What is a 'small' file in this context?
Q6: Please, can you remind us what the connection between Lustre and Puhti is? What is Lustre useful for?
Q7: How do I access the temp folder on Puhti from Windows? `echo $TMDIR` prints nothing.
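Note that the variable is conventionally spelled `$TMPDIR`, so the missing "P" above may be the whole problem:

```bash
echo $TMPDIR    # note the P: TMPDIR, not TMDIR
```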
Q8: Should we have access to Mahti for the Disk-areas exercises? I seem to have a pretty red X over Mahti on my Project_2004306 page. :)
Q1: I can see that on Puhti Python 3.7.3 is installed by default. Would you recommend using tools like pyenv to manage other Python versions?
Q2: Yes, I was wondering what the best way to install and manage different Python versions on Puhti is.
Q3: How can I launch a Jupyter notebook when using a Bioconda package?
A: Bioconda does not include Jupyter by default, but the python-data, keras, pytorch and geoconda modules do.
Jupyter Notebook instructions: https://docs.csc.fi/computing/running/interactive-usage/#example-running-a-jupyter-notebook-server-via-sinteractive
For Jupyter Notebook you need to set up SSH keys first: https://docs.csc.fi/computing/connecting/#setting-up-ssh-keys
If you want to use Bioconda packages inside a Jupyter notebook, it is a bit involved: one has to create a conda environment and install ipykernel to be able to view your packages in Jupyter. Here is a generic example with the GROMACS software:
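A minimal sketch of that setup, assuming conda is available and using hypothetical environment and version names:

```bash
# Create an environment with the software plus a Jupyter kernel package
conda create -n gmx-env python=3.8 gromacs -c bioconda -c conda-forge
conda activate gmx-env
pip install ipykernel

# Register the environment as a kernel that Jupyter can see
python -m ipykernel install --user --name gmx-env --display-name "gmx-env"
```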
Q1: Still, does selecting a partition, the number of hours, cores, etc. appropriately likely require a trial-and-error approach?
Q2: How do I comment in batch files? I tried using `::` but this doesn't work.
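Slurm batch scripts are ordinary bash scripts, so comments start with `#` (`::` is Windows cmd syntax); note that lines beginning with `#SBATCH` are special comments that Slurm parses as directives:

```bash
#!/bin/bash -l
#SBATCH --time=00:10:00   # an #SBATCH directive, parsed by Slurm

# This is a normal bash comment, ignored by both bash and Slurm
echo "hello"
```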
Q3: How many projects can I have at CSC? Can I remove my test projects? I can't find a way to do this.
Q4: Can I run open-source computing software (FDS for CFD simulation) in the HPC environment?
Q4b: The executable files are available on the developer's website. I believe users from Aalto University may have installed it before. How can I check this? Should I contact the service desk? If I need to install it, how? (FDS = Fire Dynamics Simulator)
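One quick way to check for an existing installation is to search the module system (the module name is a guess):

```bash
module spider fds    # search all installed module trees for FDS
module avail         # list modules available in the current environment
```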
Q5: The examples did not allocate a certain amount of memory. How much memory is reserved if it is not defined in the bash script?
Q6a: How do I run batch jobs under /scratch/<project> or /projappl/<project>, or direct their output there?
A: The simplest way to run your batch job under `/scratch/project_xxxx` is to change from your current directory to the scratch directory before submitting the job:
`cd /scratch/project_xxx/$USER`
(This assumes you have your own directory, `$USER`, under your project.) Prepare all the data and scripts needed to run your job in that directory; finally, submit the job with the `sbatch` command. Unless you redirect your output elsewhere in the script, the output files from the batch job are written to that scratch directory.
Please note that the `/projappl/<project>` directories are meant for installing (or sharing) software tools for your project. Do NOT run batch jobs in that directory.
Q6b: How do I access /scratch/<project> and /projappl/<project> via a graphical interface (e.g. FileZilla or VS Code Remote - SSH)?
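For VS Code Remote - SSH, an entry in your local `~/.ssh/config` is usually enough (a sketch; the username is hypothetical, puhti.csc.fi is the standard login address):

```
Host puhti
    HostName puhti.csc.fi
    User myusername
```

After connecting you can open /scratch/<project> directly; in FileZilla, an SFTP connection to puhti.csc.fi with the same credentials works similarly.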
Q7a: How do I direct all input (and output) of a batch job to a job-specific folder? As the file must start with `#!/bin/bash -l`, I do not know how to create directories or change the working directory. I am looking for something like `#SBATCH mkdir output_%j` or `#SBATCH cd output_%j`, which would precede `#SBATCH --output=x_output_%j.txt` and the like, so that everything related to that job would end up in the same folder.
A: Only the resource requests need the `#SBATCH` prefix; the rest is just a normal bash script. By default the working directory of the job is the directory where you submit the job, and you can create directories and move to them during the job, as in the sketch below.
Q7b: How do I name the folders and files with the job ID? I tried this but it did not work:
A: Can you paste here what worked and what didn't? I would not use `tee` in the command; a simple redirect should be enough (`srun ... < input > output`). Otherwise, the output that would appear on screen goes to the file given by `#SBATCH -o <filename>`, or, if that is not defined, to `slurm-${SLURM_JOB_ID}.out`.
Follow-up to Q7b: I figured it out. The problem was that the R script was not in the newly created directory, so I had to use an absolute path. I could pipe through `| tee output_${SLURM_JOB_ID}.R` and use `#SBATCH --output=x_output_%j.R`, but that output would have to be moved manually to the new folder. That's why I originally asked about changing the working directory before, or as part of, the `#SBATCH` commands in Q7a. The idea was to have the `my_batch_job.sh` file in a directory and, at the very beginning of that file, create a subdirectory with the job-specific name.
Q8: Is `watch -n 60 sacct` (to check the job status every minute) heavy on the login node?
Q1: Where should I store, for example, an 80 GB GeoPackage (SQLite-based) so that it has the fastest possible public HTTP access (queryable with GDAL/OGR SQL commands)?
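One option is a public bucket in Allas; a sketch assuming the `a-publish` helper (which uploads a file to a publicly readable bucket) and the a3s.fi public endpoint, with hypothetical bucket and layer names:

```bash
a-publish mydata.gpkg -b mybucket    # upload to a public bucket in Allas

# GDAL/OGR can then query the object over HTTP without downloading it
ogrinfo /vsicurl/https://a3s.fi/mybucket/mydata.gpkg -sql "SELECT COUNT(*) FROM mylayer"
```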
Q2: Can I access one single object from the bucket?
A: Yes, with the `a-get` command, like this: `a-get bucketname/objectname`. There are many interfaces to Allas; please check https://docs.csc.fi/data/Allas/clients/
Q3: Is there a way to sync data using the rsync command without deleting the existing data in the destination folder?
A: `rsync` does not delete anything in the destination unless you explicitly pass `--delete`; it compares source and destination and transfers only new or changed data from the source (`-P` shows progress and keeps partially transferred files).
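For example (a sketch; the username, project and paths are hypothetical):

```bash
# Copy only new/changed files to Puhti scratch; nothing at the
# destination is deleted because --delete is not given
rsync -avP mydata/ myuser@puhti.csc.fi:/scratch/project_xxx/myuser/mydata/
```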
Q4: How do I close the connection to a project in Allas?
Q5: What would be the best way to take automatic backups of all/some files and directories in /scratch/project_XXX/user to Allas? Preferably as part of a batch job.
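One possible pattern is a small batch job that uploads with the a-commands; a minimal sketch, assuming an already-configured Allas connection (the project number, partition and paths are hypothetical):

```bash
#!/bin/bash -l
#SBATCH --account=project_XXX
#SBATCH --partition=small
#SBATCH --time=02:00:00

module load allas
# a-put packs a directory and uploads it to Allas; the bucket name is a guess
a-put /scratch/project_XXX/$USER/results -b project_XXX-backup
```

Note that Allas authentication tokens expire, so a fully unattended setup needs the connection to be re-established; check the CSC Allas docs for the supported way to do this.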