
Stat Analysis documentation

Version: 2020.01.24

Python environment setup

The instructions are for setting up a user environment on Rackham, but they can easily be adapted for personal computers as well. The memory requirement is minimized, but access to more memory will speed up the execution of the analysis.

This needs to be done only once for each user.

$ module load python/3.6.8
$ python3 -m pip install --user --upgrade "dask[complete]" crick
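As a quick sanity check (not part of the original instructions), you can confirm that the user-site installation is picked up before running the analysis:

```shell
# verify that the freshly installed dask is importable
python3 -c "import dask; print(dask.__version__)"
```

If this prints a version number, the `--user` installation is on the Python path.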

Setting up the calculation

The pattern expression used to read multiple files needs to be specified in the Python script.

...

import dask.dataframe as dd

# read only the 5th column (index 4) of the files to save memory,
# and reduce it to float32 to save memory as well
#
# data1 = dd.read_csv("BEDdata/chr10.3*.bed",

data1 = dd.read_csv("BEDdata/*.bed",
    delim_whitespace=True,
    header=None,
    usecols=[4],
    dtype={4: "float32"},
    verbose=False).persist()

...

Once the file mask is set up, the program is ready to run.

Submitting job to SLURM

  • Testing the speed with respect to the number of dedicated CPUs shows no significant improvement above 6 CPUs: the analysis takes about 35 min. on 6 CPUs and about 30 min. on 20 CPUs.
  • Using 6 CPUs (on Rackham) appears to be the most efficient setup to run the analysis.

Here is a SLURM script to submit a job to the queue.

#!/bin/bash -l
#SBATCH -A snic2017-11-16
#SBATCH -J stats
#SBATCH -p core -n 6 
#SBATCH -t 8:00:00

module load python/3.6.8

date
env > env.txt

./stat_dask_v03.py

date

Please edit the parameters to match your project: account (-A), number of CPUs (-n), and time limit (-t).
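Assuming the SLURM script above is saved under a name of your choice (here, hypothetically, `stats.sh`), it is submitted and monitored with the standard SLURM commands:

```shell
# submit the job script to the queue
sbatch stats.sh

# check the state of your queued and running jobs
squeue -u $USER
```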

The program will calculate the properties in question, then save a plot of the histogram in histogram.png and the raw data used for the plot in histogram.dat.
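The histogram itself can be computed out-of-core with `dask.array.histogram`, so the full 3-billion-value column never has to fit in memory at once. A hedged sketch of this step, using synthetic data in place of the BED column (the file names match the script's outputs, but the details here are illustrative):

```python
import numpy as np
import dask.array as da

# synthetic stand-in for the column of values read from the BED files
values = da.random.normal(size=1_000_000, chunks=100_000)

# histogram accumulated chunk by chunk; range roughly matches the
# min/max seen in the sample output below
counts, edges = da.histogram(values, bins=100, range=(-20, 10))
counts = counts.compute()
centers = 0.5 * (edges[:-1] + edges[1:])

# raw plot data: two columns, bin center and count
np.savetxt("histogram.dat", np.column_stack([centers, counts]))
```

The saved `histogram.dat` can then be plotted with any tool (matplotlib, gnuplot, etc.) to produce `histogram.png`.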

stat_dask_v03.py will show a progress bar if it is run interactively, for example, while testing on smaller subsets.

pattern matching files...
[########################################] | 100% Completed | 18min 38.7s
[########################################] | 100% Completed |  8min 29.0s
                  4
count  2.999793e+09
mean   1.232942e-01
std    1.300316e+00
min   -2.000000e+01
25%   -3.141763e-01
50%    1.269944e-01
75%    4.864117e-01
max    9.448000e+00
[########################################] | 100% Completed |  8min  9.9s
                  4
count  2.999793e+09
mean   1.232942e-01
std    1.300316e+00
min   -2.000000e+01
10%   -1.015454e+00
25%   -3.141763e-01
50%    1.269944e-01
75%    4.864117e-01
90%    1.062279e+00
max    9.448000e+00
[########################################] | 100% Completed | 49.6s
Thu Jan 23 12:13:11 CET 2020

This is handy, but when the output from the progress bar is redirected, it generates an unnecessarily large amount of printout in slurm-123456.out. Despite this, one can still monitor the progress by filtering out the "noise" and collecting only the relevant data.

$ grep "Completed\|50%" -B 10 -A 10  slurm-123456.out

To remove the progress bar, comment out these lines in the script.

...
# If running interactively, uncommenting these 2 lines will show a progress bar
pbar= ProgressBar()
pbar.register()
...

Note that some of the results are calculated twice, which adds an additional 8 min. The calculation is doubled to check for inconsistencies due to the approximations in the method. One can skip the first function call and use only the second one, i.e. comment out the first call.

Stat Analysis documentation - performance tests
Dask vs. discrete histogram



tags: SNIC UPPMAX python dask