Biowulf

From 太極
Jump to navigation Jump to search

Swarm fig 1.png

helix and data transfer

  • Data transfer and intensive file management tasks should not be performed on the Biowulf login node, biowulf.nih.gov. Instead such tasks should be performed on the dedicated interactive data transfer node (DTN), helix.nih.gov
  • Examples of such tasks include:
    • cp, mv, rm commands on large numbers of files or directories
    • file compression/uncompression (tar, zip, etc.)
    • file transfer (sftp, scp, rsync, etc.)
  • https://helix.nih.gov/ Helix is an interactive system for short jobs. Moving large data transfers to Helix, which is now designated for interactive data transfers. For instance, use Helix when transferring hundreds of gigabytes of data or more using any of these commands: cp, scp, rsync, sftp, ascp, etc.
  • https://hpc.nih.gov/docs/rhel7.html#helix Helix transitioned to becoming the dedicated interactive data transfer and file management node [1] and its hardware was later upgraded to support this role [2]. Running processes such as scp, rsync, tar, and gzip on the Biowulf login node has been discouraged ever since.

Linux distribution

$ ls /etc/*release  # login mode
$ cat /etc/redhat-release  
Red Hat Enterprise Linux Server release 6.8 (Santiago)

$ sinteractive      # switch to biowulf2 computing nodes
$ cat /etc/redhat-release 
CentOS release 6.6 (Final)
$ cat /etc/centos-release 
CentOS release 6.6 (Final)

Training notes

Storage

https://hpc.nih.gov/storage/

/scratch and /lscratch

The /scratch area on Biowulf is a large, low-performance shared area meant for the storage of temporary files.

  • Each user can store up to a maximum of 10 TB in /scratch. However, 10 TB of space is not guaranteed to be available at any particular time.
  • If the /scratch area is more than 80% full, the HPC staff will delete files as needed, even if they are less than 10 days old.
  • Files in /scratch are automatically deleted 10 days after last access.
  • Touching files to update their access times is inappropriate and the HPC staff will monitor for any such activity.
  • Use /lscratch (not /scratch) when data is to be accessed from large numbers of compute nodes or large swarms.
  • The central /scratch area should NEVER be used as a temporary directory for applications -- use /lscratch instead.
  • fasterq-dump command in SRA-Toolkit where a temporary directory is needed. See HowTo page.
  • Running RStudio interactively. It is generally recommended to allocate at least a small amount of lscratch for temporary storage for R.
  • See the slides of Using the NIH HPC Storage Systems Effectively from NIH HPC classes

An example of using /lscratch space

fasterq-dump -t /lscratch/$SLURM_JOB_ID \
  --split-files \
  -O /lscratch/$SLURM_JOBID SRR13526458; \
  pigz -p6 /lscratch/$SLURM_JOBID/SRR13526458*.fastq; \
  cp /lscratch/$SLURM_JOBID/SRR13526458*.fastq.gz ~/PDACPDX

Transfer files

Alternatives:

User Dashboard

https://hpc.nih.gov/dashboard/

Create a shared folder

See Uppercase S in permissions of a folder and setGID. chmod -R 2770 ShareFolder.

Local disk and temporary files

See https://hpc.nih.gov/docs/b2-userguide.html#local and https://hpc.nih.gov/storage/

Dashboard

User dashboard Unlock account, disk usage, job info

Quota

checkquota

Environment modules

# What modules are available
module avail
module -d avail # default
module avail STAR
module spider bed # search by a case-insensitive keyword

# Load a module
module list # loaded modules
module load STAR
module load STAR/2.4.1a
module load plinkseq macs bowtie # load multiple modules
# if we try to load a module in a bash script, we can use the following
module load STAR || exit 1

# Unload a module
module unload STAR/2.4.1a

# Switch to a different version of an application
# If you load a module, then load another version of the same module, the first one will be unloaded.

# Examine a modulefile
$ module display STAR
-----------------------------------------------------------------------------
   /usr/local/lmod/modulefiles/STAR/2.5.1b.lua:
-----------------------------------------------------------------------------
help([[This module sets up the environment for using STAR.
Index files can be found in /fdb/STAR
]])
whatis("STAR: ultrafast universal RNA-seq aligner")
whatis("Version: 2.5.1b")
prepend_path("PATH","/usr/local/apps/STAR/2.5.1b/bin")

# Set up personal modulefiles

# Using modules in scripts

# Shared Modules

Single file - sbatch

Don't use the swarm command on a single script file since swarm will treat each line of the script file as an independent command.

sbatch --cpus-per-task=2 --mem=4g --time 24:00:00 MYSCRIPT
# Use --time 24:00:00 to increase the wall time from the default 2 hours to 24 hours

An example of the script file (Slurm environment variable $SLURM_CPUS_PER_TASK within your script was used to specify the number of threads to the program)

#!/bin/bash

module load novocraft
novoalign -c $SLURM_CPUS_PER_TASK  -f s_1_sequence.txt -d celegans -o SAM > out.sam

rslurm

rslurm package. Functions that simplify submitting R scripts to a Slurm workload manager, in part by automating the division of embarrassingly parallel calculations across cluster nodes.

Multiple files - swarm

swarm -f run.sh --time 24:00:00

swarm -t 3 -g 20 -f run_seqtools_vc.sh --module samtools,picard,bwa --verbose 1 --devel
# 3 commands run in 3 subjobs, each command requiring 20 gb and 3 threads, allocating 6 cores and 12 cpus
swarm -t 3 -g 20 -f run_seqtools_vc.sh --module samtools,picard,bwa --verbose 1

# To change the default walltime, use --time 24:00:00

swarm -t 8 -g 24 --module tophat,samtools,htseq -f run_master.sh
cat sw3n17156.o

Environment variables $SLURM

Swarm on Biowulf. Some examples: fmriprep, qsiprep.

  • $SLURM_MEM_PER_NODE: 32768
  • $SLURM_JOB_ID
  • $SLURM_NNODES: 1
  • $SLURM_CPUS_ON_NODE: 2
  • $SLURM_JOB_CPUS_PER_NODE: 2
  • $SLURM_NODELIST: cn2448

Why a job is pending

Partition and freen

Biowulf nodes are grouped into partitions. A partition can be specified when submitting a job. The default partition is 'norm'. The freen command can be used to see free nodes and CPUs, and available types of nodes on each partition.

We may need to run swarm commands on non-default partitions. For example, not many free CPUs are available in 'norm' partition. Or Total time for bundled commands is greater than partition walltime limit. Or because the default partition norm has nodes with a maximum of 120GB memory.

We can run the swarm command on different partition (the default is 'norm'). For example, to run on b1 parition (the hardware in b1 looks inferior to norm)

swarm -f outputs/run_seqtools_dge_align -g 20 -t 16 \
         --module tophat,samtools,htseq \
         --time 6:00:00 --partition b1 --verbose 1

If we want to restrict the output of freen to the norm nodes, we can use freen | grep -E 'Partition|----|norm'  ; see the User Guide.

Partition      FreeNds      FreeCPUs      FreeGPUs  Cores  CPUs  GPUs    Mem   Disk Features
-------------------------------------------------------------------------------------------------------
norm*           8 / 216    3702 / 14748                36    72         373g  3200g cpu72,core36,g384,ssd3200,x6140,ibhdr100
norm*         101 / 522   20030 / 29232                28    56         247g   800g cpu56,core28,g256,ssd800,x2680,ibfdr
norm*         497 / 529   16696 / 16928                16    32         121g   800g cpu32,core16,g128,ssd800,x2650,10g
norm*         504 / 539   29192 / 30184                28    56         247g   400g cpu56,core28,g256,ssd400,x2695,ibfdr
norm*           6 / 7       342 / 392                  28    56         247g  2400g cpu56,core28,g256,ssd2400,x2680,ibfdr

jobhist and track the resource usage

  • jobhist XXXXXXXX will show the resource usage. The Jobid Runtime, MemUsed columns can be used to identify the job that was using most resource
  • To identify the command run by a job, check out the log file <swarm_XXXXXX_Y.o>

Running R scripts

https://hpc.nih.gov/apps/R.html

Running a swarm of R batch jobs on Biowulf

$ cat Rjobs
R --vanilla < /data/username/R/R1  > /data/username/R/R1.out
R --vanilla < /data/username/R/R2  > /data/username/R/R2.out
# swarm -g 32 -t 10 -f /home/username/Rjobs --module R/3.6.0 --time 6:00:00

Pay attention to the default wall time (eg 2 hours) and various swarm options. See Swarm.

Parallelizing with parallel.

  • The following is modified from biowulf R's webpage. I change it so it works on non-biowulf system (like local Linux, Mac, or Windows).
  • The number of allocated CPUs and available memory are related.
    • Here I assume the Windows and Mac only has a modest RAM (eg 16GB) and the local Linux has enough RAM.
    • The RAM size on the local system can be obtained through the commented lines.
detectBatchCPUs <- function() {
  ncores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))
  if (is.na(ncores)) {
    ncores <- as.integer(Sys.getenv("SLURM_JOB_CPUS_PER_NODE"))
  }
  if (is.na(ncores)) {
    # the system is not biowulf
    if (grepl("linux", R.version$os)) {
      # if it is linux, we assume there are enough ram
      # mem <- system('grep MemTotal /proc/meminfo', intern = TRUE)
      # mem <- strsplit(mem, " ")[[1]]
      # mem <- as.integer(mem[length(mem) -1])
      ncores <- future::availableCores()
      # } else if (grepl("darwin", R.version$os)) {
      #   ncores <- 2
    } else ncores <- 2
  }
  return(ncores)
}
ncpus <- detectBatchCPUs() 

options(mc.cores = ncpus) 
mclapply(..., mc.cores = ncpus) 
makeCluster(ncpus)

Some experiences

jobhist show significant less memory use than jobload. It appears that novoalign uses shared memory, which we account for in jobload but not in jobhist. In this case, you will want to use the memory value reported by jobload to avoid running out of allocated memory and thus getting killed by Slurm.

What is shared memory? Each process in Linux gets its own private application memory space. Linux makes available a portion of memory where multiple processes can also share the same memory space so these multiple processes can work on the same datasets.

Is there a way I can tell "shared memory" is in use when I was running a job? At the moment, you would have to be on a node running one of your jobs and run the 'ipcs -m' command. This lists shared memory segments for any processes using shared memory. This can be cumbersome, especially for a large multimode job or swarm.

freen command shows the maximum threads is 56 and the memory is 246GB.

When I run an R script (foreach is employed to loop over simulation runs), I find

  • Assign 56 threads can guarantee 56 simulations run at the same time (check by the jobload command).
  • We need to worry about the RAM size. The larger the threads, the more memory we need. If we don't assign enough memory, weird error message will be spit out.
  • Even assigning 56 threads can help to run 56 simulations at the same time, the actual execution time is longer than when I run fewer simulations.
    allocated threads allocated memory number of runs memory used time (min)
    56 64 10 30 10
    56 64 20 36 13
    56 64 56 58 27

Monitor jobs/Delete/Kill jobs

https://hpc.nih.gov/docs/userguide.html#monitor

sjobs
watch -n 125 jobload
watch -n 125 "jobload | tail"
scancel -u XXXXX
scancel NNNNN
scancel --name=JobName
scancel --state=PENDING 
scancel --state=RUNNING  
squeue -u XXXX

jobhist 17500  # report the CPU and memory usage of completed jobs.

The other two commands are very useful too jobhist and swarmhist (temporary).

$ cat /usr/local/bin/swarmhist
#!/bin/bash
usage="usage: $0 jobid"
jobid=$1
[[ -n $jobid ]] || { echo $usage; exit 1; }
ret=$(grep "jobid=$jobid" /usr/local/logs/swarm_on_slurm.log)
[[ -n $ret ]] || { echo "no swarm found for jobid = $jobid"; exit; }
echo $ret | tr ';' '\n'

$ jobhist 22038972
$ swarmhist 22038972
date=(SKIP)
 host=(SKIP)
 jobid=22038972
 user=(SKIP)
 pwd=(SKIP)
 ncmds=1
 soptions=--array=0-0 --job-name=swarm --output(SKIP)
 njobs=1
 job-name=swarm
 command=/usr/local/bin/swarm -t 16 -g 20 -f outputs/run_seqtools_vc --module samtools,picard --verbose 1

Show how busy is one node: jobload -n cnXXXX

This will show many jobs running in a node. It'll show USER, JOBID, NODECPUS, TOTALCPUS and TOTALNODES (1).

Show properties of a node: freen -n

Use freen -n.

This is helpful if we want to know the node that is allocated from the output (Nodelist column) of the sjobs command.

$ freen -n
                                           ........Per-Node Resources........  
Partition  FreeNds       FreeCPUs     Cores CPUs   Mem    Disk    Features                                            Nodelist
----------------------------------------------------------------------------------------------------------------------------------
norm*      160/454      17562/25424       28    56    248g   400g   cpu56,core28,g256,ssd400,x2695,ibfdr            cn[1721-2203,2900-2955]
norm*      0/476        5900/26656        28    56    250g   800g   cpu56,core28,g256,ssd800,x2680,ibfdr            cn[3092-3631]       
norm*      278/309      8928/9888         16    32    123g   800g   cpu32,core16,g128,ssd800,x2650,10g              cn[0001-0310]       
norm*      281/281      4496/4496          8    16     21g   200g   cpu16,core8,g24,sata200,x5550,1g                cn[2589-2782,2799-2899]
norm*      10/10        160/160            8    16     68g   200g   cpu16,core8,g72,sata200,x5550,1g                cn[2783-2798]       
...

$ sjobs
User   JobId     JobNa   Part         St  Reason  Runtime     Walltime     Nodes  CPUs   Memory Dependey  Nodelist
=============================================================================================================================
XXX 51944300  sinteracti interactive  R           1:13:32      8:00:00      1     16   32 GB              cn0862
XXX 51946396  myjob      norm         R              2:57     12:00:00      1     10   32 GB              cn0925
=============================================================================================================================

Exit code

https://hpc.nih.gov/docs/b2-userguide.html#exitcodes

Walltime limits

$ batchlim
Max jobs per user: 4000
Max array size:    1001

Partition        MaxCPUsPerUser     DefWalltime     MaxWalltime                
---------------------------------------------------------------
norm                     7360         02:00:00     10-00:00:00 
multinode                7560         08:00:00     10-00:00:00 
	turbo qos        15064                         08:00:00
interactive                64         08:00:00      1-12:00:00 (3 simultaneous jobs)
quick                    6144         02:00:00        04:00:00 
largemem                  512         02:00:00     10-00:00:00 
gpu                       728         02:00:00     10-00:00:00 (56 GPUs per user)
unlimited                 128        UNLIMITED       UNLIMITED 
student                    32         02:00:00        08:00:00 (2 GPUs per user)
ccr                      3072         04:00:00     10-00:00:00 
ccrgpu                    448         04:00:00     10-00:00:00 (32 GPUs per user)
forgo                    5760       1-00:00:00      3-00:00:00 

Interactive debugging

Default is 2 CPUs, 4G memory (too small) and 8 hours walltime.

Increase them to 60 GB and more cores if we run something like STAR for rna-seq reads alignment.

sinteractive --mem=32g -c 16 --gres=lscratch:100 --time=24:00:00

The '--gres' option will allocate a local disk, 100GB in this case. The local disk directory will be /lscratch/$SLURM_JOBID.

For RStudio, the example shows to allocate 5GB of temporary space.

For R, it also recommends to allocate a minimal amount of lscratch of 1GB plus whatever lscratch storage is required by your code.

ssh cnXXX

If we have an interactive job on a certain node, we can directly ssh into that node to check for example the /lscratch/$SLURM_JOBID usage. The JobID can be obtained from the jobload command.

We can also use freen -n to see some properties of the node.

Parallel jobs

Parallel (MPI) jobs that run on more than 1 node: Use the environment variable $SLURM_NTASKS within the script to specify the number of MPI processes.

Reproduciblity/Pipeline

Singularity

Snakemake (and Singularity)

Apps

Scientific Applications on NIH HPC Systems

R program

https://hpc.nih.gov/apps/R.html

Find available R versions:

module -r avail '^R$'

where -r means to use regular expression match. This will match "R/3.5.2" or "R/3.5" but not "Rstudio/1.1.447".

(Self-installed) R package directory

On our systems, the default path to the library is ~/R/<ver>/library where where ver is the two digit version of R (e.g. 3.5). However, R won't automatically create that directory and in its absence will try to install to the central packge library which will fail. To install packages in your home directory manually create ~/R/<ver>/library first.

The directory ~/R/x86_64-pc-linux-gnu-library/ was not used anymore in Biowulf.

RStudio

  • Connecting to Biowulf with NoMachine (NX)
    • It works though the resolution is not great on a Mac screen. This assume we are in the NIH network.
    • One thing is about adjusting the window size. Click Ctrl + Alt + 0 to bring up the setting if we accept all the default options when we connect to a remote machine. Click 'Resize remote display'. Now we can use mouse to drag and adjust the display size. This will keep the resolution as it showed up originally. It is easier when I choose the 'key-based authorization method with a key you provide'.
    • Desktop environment is XFCE ($DESKTOP_SESSION)
  • RStudio on Biowulf.
    • We need to open a terminal and follow the instruction there.
    • The desktop IDE program is installed in /usr/local/apps/rstudio/rstudio-1.4.1103/.
    • Problems: The copy/paste keyboard shortcuts do not work even I checked the option "Grab the keyboard input". Current resolution: Use two ssh connections (one to run R, and another one to edit file). Open noMachine and use the desktop to view graph files (gv XXX.pdf OR change the default program to 'GNU GV Postscript/PDF Viewer' from the default 'LibreOffice Draw').
sinteractive --mem=6g --gres=lscratch:5
module load Rstudio R
rstudio &

Visual Studio Code

  • VS Code on Biowulf. Install the "Remote Development" VScode extension.
  • Run Jupyter notebook on a compute node in VS Code.
  • Remote Development using SSH. Basically once we connect to a remote host, a new VScode instance will be created and we can open any files from the remote host on VScode.

Exit an SSH session

Today after I issued a "cd /data/USERNAME" command, it just hung there even the Ctrl+c won't exit.

One solution is to close the current terminal. If we like to keep the current terminal, a solution is to open another terminal, give an SSH connection and run pkill -u USERNAME. This will also kill all the SSH connections including the current one.

SSH tunnel

https://hpc.nih.gov/docs/tunneling/

The use of interactive application servers (such as Jupyter notebooks) on Biowulf compute nodes requires establishing SSH tunnels to make the service accessible to your local workstation.

Jupyter

Jupyter on Biowulf

Terminal customization

  • ssh add authorized_keys
  • ~/.bashrc:
    • change PS1
    • add an alias for nano
  • ~/.bash_profile: no change
  • ~/.nanorc, ~/r.nanorc and ~/bin/nano/bin/nano (4.2)
  • ~/.emacs: global-display-line-numbers-mode
  • ~/.vimrc: set number
  • .Rprofile: options(editor="emacs")
  • .bash_logout: no change

tmux for keeping SSH sessions

tmux