Revision as of 14:24, 18 March 2016

Shell Programming

Some Resources

Redirect

Redirecting output. File descriptor 1 is standard output and file descriptor 2 is standard error.

./myProgram > stdout.txt        # redirect std out to <stdout.txt>
./myProgram 2> stderr.txt       # redirect std err to <stderr.txt> by using the 2> operator
./myProgram > stdout.txt 2> stderr.txt # combination of above two
./myProgram > stdout.txt 2>&1   # redirect std err to std out <stdout.txt>
./myProgram >& /dev/null        # prevent writing std out and std err to the screen
ps >> output.txt                # append to the file instead of overwriting it

Redirecting input

./myProgram < input.txt

>&

&> file is not part of the official POSIX shell spec, but has been added to many Bourne shells as a convenience extension (it originally comes from csh). In a portable shell script (and if you don't need portability, why are you writing a shell script?), use > file 2>&1 only.

Redirect Output and Errors To /dev/null

http://www.cyberciti.biz/faq/how-to-redirect-output-and-errors-to-devnull/

command > /dev/null 2>&1
# OR
command &>/dev/null
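As a small sketch of why the order in > file 2>&1 matters: redirections are processed from left to right, so 2>&1 copies whatever standard output points to at that moment. The helper emit and the file names below are made up for illustration, and mktemp -d is assumed to be available.

```shell
# emit is a hypothetical helper: one line to stdout, one line to stderr
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

tmpdir=$(mktemp -d)

# stdout is sent to the file first, then stderr is pointed at the same file
emit > "$tmpdir/both.txt" 2>&1

# stderr is attached to the current stdout (captured in $leaked) before
# stdout is redirected, so the file receives only the stdout line
leaked=$(emit 2>&1 > "$tmpdir/stdout_only.txt")

echo "lines in both.txt:        $(grep -c '' "$tmpdir/both.txt")"
echo "lines in stdout_only.txt: $(grep -c '' "$tmpdir/stdout_only.txt")"
echo "leaked past the file:     $leaked"
```

With the reversed order, the error message escapes the file, which is usually not what was intended.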

tee - redirect to both a file and the screen at the same time

To redirect output to both a file and the screen at the same time, use the tee command. The |& operator is a Bash shorthand for 2>&1 |.

command1 |& tee log.txt
## or ##
command1 -arg |& tee log.txt
## or ##
command1 2>&1 | tee log.txt

Pipe

The operator is |.

ps > psout.txt
sort psout.txt > pssort.out

can be simplified to

ps | sort > pssort.out

For example,

$ head /etc/passwd
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync

$ cat /etc/passwd | cut -d: -f7 | sort | uniq -c | sort -nr
     18 /bin/sh
     13 /bin/false
      2 /bin/bash
      1 /bin/sync

Here the cut command extracts the 7th field of each line, using : as the field separator, and writes it to its output stream. The first sort command sorts the lines it reads alphabetically, so that identical lines end up next to each other. The uniq -c command collapses adjacent duplicate lines and prefixes each remaining line with its count. The final sort command sorts its input numerically in reverse order.
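To watch each stage work on a small, predictable input, here is a sketch that feeds four made-up passwd-style lines through the same pipeline. The helper names sample_passwd and count_shells are illustrative only.

```shell
# sample_passwd prints made-up passwd-style records (illustrative data only)
sample_passwd() {
    printf '%s\n' \
        'root:x:0:0:root:/root:/bin/bash' \
        'daemon:x:1:1:daemon:/usr/sbin:/bin/sh' \
        'bin:x:2:2:bin:/bin:/bin/sh' \
        'sync:x:4:65534:sync:/bin:/bin/sync'
}

# count_shells reads passwd-style lines on stdin and prints one
# "count shell" line per distinct shell, most frequent first
count_shells() {
    cut -d: -f7 | sort | uniq -c | sort -nr
}

sample_passwd | count_shells
```

Wrapping the pipeline in a function makes it easy to reuse on the real file, for example count_shells < /etc/passwd.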

Process substitution

https://en.wikipedia.org/wiki/Process_substitution

Powerfulness of pipes

Consider the following commands

samtools mpileup -go temp.bcf -uf genome.fa  dedup.bam
bcftools call -vmO v -o sample1_raw.vcf temp.bcf

The disadvantage of this approach is that it creates a temporary file (temp.bcf in this case). If the temporary file is very large (several hundred GB), it wastes disk space, not to mention the time spent writing it and reading it back. With a pipe, we save both the time and the disk space.

samtools mpileup -uf genome.fa  dedup.bam | bcftools call -vmO v -o sample1_raw.vcf

Pipe vs redirect

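A pipe passes the output of one program to another program or utility.
A redirect passes the output to a file or stream.

In other words, thing1 | thing2 produces the same result as thing1 > temp_file && thing2 < temp_file, except that the pipe needs no temporary file, runs both programs concurrently, and runs thing2 regardless of the exit status of thing1.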
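A minimal sketch comparing the two forms on a tiny made-up input (the three fruit names are illustrative data, and mktemp is assumed to be available):

```shell
tmp=$(mktemp)

# Two-step version: write everything to a temporary file, then read it back
printf '%s\n' pear apple fig > "$tmp" && via_file=$(sort < "$tmp")

# Pipe version: sort reads directly from printf, no file on disk
via_pipe=$(printf '%s\n' pear apple fig | sort)

rm -f "$tmp"
[ "$via_file" = "$via_pipe" ] && echo "same result"
```

Both variables hold the same sorted list; only the pipe version avoids the intermediate file.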

Shebang (#!)

A shebang is the character sequence consisting of the characters number sign and exclamation mark (that is, "#!") at the beginning of a script. See the Wikipedia page.

The syntax looks like

#! interpreter [optional-arg]

For example,

#!/bin/sh — Execute the file using sh, the Bourne shell, or a compatible shell
#!/bin/csh -f — Execute the file using csh, the C shell, or a compatible shell, and suppress the execution of the user’s .cshrc file on startup
#!/usr/bin/perl -T — Execute using Perl with the option for taint checks

Comments

For a single line, we can use the '#' sign.

For a block of code, we use

#!/bin/bash
echo before comment
: <<'END'
bla bla
blurfl
END
echo after comment

Variables

food=Banana
echo $food
food="Apple"
echo $food

Concatenate string variables

http://stackoverflow.com/questions/4181703/how-can-i-concatenate-string-variables-in-bash

a='hello'
b='world'
c=$a$b
echo $c

# Bash also supports a += operator 
$ A="X Y"
$ A+="Z"
$ echo "$A"

When a variable may contain spaces (directory and file names often do), put double quotes around it so that the shell does not split the value into several words.

mkdir "tmp 1"
touch "tmp 1/tmpfile"

tmpvar="tmp 1"
echo "$tmpvar"
# tmp 1

ls $tmpvar
ls: cannot access tmp: No such file or directory
ls: cannot access 1: No such file or directory
ls "$tmpvar"
# tmpfile

For integers, += inside (( )) performs arithmetic addition instead of string concatenation:

a=24
echo $a
24
((a+=12))
echo $a
36

Environment variables

$HOME
$PATH
$0 -- name of the shell script
$# -- number of parameters passed (the script name $0 is not counted)
$$ -- process ID of the shell script, often used inside a script for generating unique temp filenames
$? -- the exit value of the last run command

Example 1 (check if a command ran successfully):

some_command
if [ $? -eq 0 ]; then
    echo OK
else
    echo FAIL
fi
# OR
if some_command; then
    printf 'some_command succeeded\n'
else
    printf 'some_command failed\n'
fi

$ tabix -f -p vcf ~/SeqTestdata/usefulvcf/hg19/CosmicCodingMuts.vcf.gz
$ echo $?
0
$ tabix -f -p vcf ~/Downloads/CosmicCodingMuts.vcf.gz
Not a BGZF file: /home/brb/Downloads/CosmicCodingMuts.vcf.gz
tbx_index_build failed: /home/brb/Downloads/CosmicCodingMuts.vcf.gz
$ echo $?
1

Example 2 (check if the user has supplied the correct number of parameters):

#!/bin/bash
if [ $# -ne 2 ]; then
  echo "Usage: $0 match_text filename"
  exit 1
fi

match_text=$1
filename=$2

Parameter variables

$1, $2, .... -- parameters given to the script
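$* -- list of all the parameters; when written as "$*" they are joined into a single word, separated by the first character of IFS
$@ -- like $*, but when written as "$@" each parameter stays a separate word, even if it contains spaces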
$! -- the process id of the last command run in the background.

For example,

$ touch /tmp/tmpfile_$$

$ set foo bar bam
$ echo $#
3
$ echo $@
foo bar bam
$ set foo bar bam &
[1] 28212
$ echo $!
28212
[1]+  Done                    set foo bar bam
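The difference between "$@" and "$*" is easiest to see by counting the words each one produces. This is a small sketch; count_args and show_difference are hypothetical helper names.

```shell
# count_args prints how many arguments it received
count_args() {
    echo "$#"
}

show_difference() {
    count_args "$@"   # every parameter stays a separate word
    count_args "$*"   # all parameters are joined into a single word
}

show_difference "a b" c
```

The first count is 2 and the second is 1, which is why "$@" is normally the right choice when passing parameters on to another command.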

We can also use braces around the variable name to separate it from the text that follows.

QT_ARCH=x86_64
QT_SDK_BINARY=QtSDK-4.8.0-${QT_ARCH}.tar.gz
QT_SD_URL=https://xxx.com/$QT_SDK_BINARY

Conditions

We can use the test command to check if a file exists. The command is test -f <filename>.

[ is another name for the test command, so [ -f fred.c ] is the same as test -f fred.c. Always leave a space after [ and before ].

if test -f fred.c; then ...; fi

if [ -f fred.c ]
then
...
fi

if [ -f fred.c ]; then
...
fi

Arithmetic comparison

expr1 -eq expr2  ==> check equal
expr1 -ne expr2  ==> check not equal
expr1 -gt expr2  ==> expr1 > expr2
expr1 -ge expr2  ==> expr1 >= expr2
expr1 -lt expr2  ==> expr1 < expr2
expr1 -le expr2  ==> expr1 <= expr2
! expr  ==> opposite of expr

File conditionals

-d file  ==> True if the file is a directory
-e file  ==> True if the file exists
-f file  ==> True if the file is a regular file
-r file  ==> True if the file is readable
-s file  ==> True if the file has non-zero size
-w file  ==> True if the file is writable
-x file  ==> True if the file is executable
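The file conditionals can be combined in an if/elif chain. The following is a sketch only; describe_path and the file names are made up, and mktemp -d is assumed to be available.

```shell
# describe_path classifies its argument using the tests listed above
describe_path() {
    if [ -d "$1" ]; then
        echo "directory"
    elif [ -f "$1" ] && [ -s "$1" ]; then
        echo "non-empty file"
    elif [ -f "$1" ]; then
        echo "empty file"
    else
        echo "missing"
    fi
}

workdir=$(mktemp -d)
: > "$workdir/empty.txt"          # create an empty file
echo data > "$workdir/full.txt"   # create a file with content

describe_path "$workdir"
describe_path "$workdir/empty.txt"
describe_path "$workdir/full.txt"
describe_path "$workdir/no_such_file"
```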

Example: Suppose we want to know if the first argument (if given) matches a specific string. We can use the following (note the space before and after '=='; the quotes around $1 keep the test valid when no argument is given)

#!/bin/bash
if [ "$1" == "console" ]; then
  echo 'Console'
else
  echo 'Non-console'
fi

Control Structures

if

if condition
then
  statements
elif [ condition ]; then
  statements
else 
  statements
fi

For example, we can run a cp command if two files are different.

if ! cmp -s "$filesrc" "$filecur"
then
     cp "$filesrc" "$filecur"
fi

while

while condition
do
  statements
done

until

until condition
do 
  statements
done

AND list

statement1 && statement2 && statement3 && ...

Each statement runs only if the one before it finished successfully, so statement2 runs only if statement1 succeeds.

OR list

statement1 || statement2 || statement3 || ...

Each statement runs only if the one before it failed, so statement2 runs only if statement1 fails.

For example,

codename=$(lsb_release -s -c)
if [ "$codename" == "rafaela" ] || [ "$codename" == "rosa" ]; then
  codename="trusty"
fi
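A short sketch of both list types using a made-up helper, is_even, whose exit status drives the lists:

```shell
# is_even succeeds (exit status 0) when its argument is an even integer
is_even() {
    [ $(( $1 % 2 )) -eq 0 ]
}

# AND list: the echo runs only if is_even succeeded
is_even 4 && echo "4 is even"

# OR list: the echo runs only if is_even failed
is_even 7 || echo "7 is odd"

# Combined form; the || branch would also run if the command after &&
# failed, so this is not a full replacement for if/else
parity() {
    is_even "$1" && echo even || echo odd
}

parity 10
parity 3
```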

for

for variable in values
do 
  statements
done

Example 1

To convert pdfs to tifs using ImageMagick (for looping over files, check cyberciti.biz)

outdir="../plosone"
indir="../fig"

if [[ ! -d  $outdir ]];
then
   mkdir $outdir
fi

in=(file1.pdf file2.pdf file3.pdf)

for (( i=0; i<${#in[@]} ; i++ ))
do
  convert -strip -units PixelsPerInch -density 300 -resample 300 \
          -alpha off -colorspace RGB -depth 8 -trim -bordercolor white \
          -border 1% -resize '2049x2758>' -resize '980x980<' +repage \
          -compress lzw "$indir/${in[$i]}" "$outdir/Figure$((i+1)).tiff"
done

Example 2

A second example downloads all the (Ontario gasoline price) data with wget, then parses and concatenates the data with other *nix tools like 'sed':

# Download data
for i in $(seq 1990 2014)
        do wget http://www.energy.gov.on.ca/fuelupload/ONTREG$i.csv
done

# Retain the header
head -n 2 ONTREG1990.csv | sed 1d > ONTREG_merged.csv

# Loop over the files and use sed to extract the relevant lines
for i in $(seq 1990 2014)
        do
        tail -n 15 ONTREG$i.csv | sed 13,15d | sed 's/./-01-'$i',/4' >> ONTREG_merged.csv
        done

Example 3

Download all 20 sra files (60GB in total) from SRP032789.

for x in $(seq 1027175 1027180) 
   do wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP032/SRP032789/SRR$x/SRR$x.sra
done

Example 4

Convert all files from DOS to Unix format

for f in *.txt; do   tr -d '\r' < "$f" > tmp.txt;   mv tmp.txt "$f"  ; done
# Or, for the files given as arguments to a script
for f in "$@"; do   tr -d '\r' < "$f" > tmp.txt;   mv tmp.txt "$f"  ; done

Example 5

Loop over all matching files in a directory

for f in /etc/*.conf
do
   echo "$f"
done

Functions

set -e, set -x and trap

set -e makes the shell exit immediately if a command exits with a non-zero status, and set -x prints each command before it is executed. Type help set on the command line for details. Very useful!

See also the trap command that is related to non-zero exit.
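A minimal sketch of the effect of set -e. The two-step script is made up for illustration and is run in a child sh process, so that the failure stops the child and not the calling shell:

```shell
# A tiny script whose second command fails
script='set -e
echo "step 1"
false
echo "step 2"'

status=0
output=$(sh -c "$script") || status=$?

echo "output: $output"
echo "exit status: $status"
```

Only step 1 is printed, because the child shell exits as soon as false returns a non-zero status. Note that set -e is ignored for commands that are part of an if condition or an && / || list, which is why the script is started as a separate process here.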


Commands

break  ==> escaping from an enclosing for, while or until loop
:      ==> null command
continue ==> make the enclosing for, while or until loop continue at the next iteration
.      ==> executes the commands from a file in the current shell
eval   ==> evaluate arguments
exec   ==> replacing the current shell with a different program
export ==> make the variable named as its parameter available to child processes
expr   ==> evaluate its arguments as an expression
printf ==> similar to echo
set    ==> sets the parameter variables for the shell. Useful for using fields in commands that output space-separated values
shift  ==> moves all the parameter variables down by one.
trap   ==> specify the actions to take on receipt of signals.
unset  ==> remove variables or functions from the environment.
mktemp ==> create a temporary file

trap

The syntax to use trap command is

trap command signal

For example,

$ cat traptest.sh
#!/bin/sh

trap 'rm -f /tmp/tmp_file_$$' INT
echo creating file /tmp/tmp_file_$$
date > /tmp/tmp_file_$$

echo 'press interrupt to interrupt ...'
while [ -f /tmp/tmp_file_$$ ]; do
  echo file exists
  sleep 1
done
echo the file no longer exists

trap - INT
echo creating file /tmp/tmp_file_$$
date > /tmp/tmp_file_$$
echo 'press interrupt to interrupt ...'
while [ -f /tmp/tmp_file_$$ ]; do
  echo file exists
  sleep 1
done
echo we never get here
exit 0

will get an output like

$ ./traptest.sh
creating file /tmp/tmp_file_21389
press interrupt to interrupt ...
file exists
file exists
^Cthe file no longer exists
creating file /tmp/tmp_file_21389
press interrupt to interrupt ...
file exists
file exists
^C

The first time we use trap, it registers a command that deletes the file when we hit Ctrl+C. The second time, trap - INT resets the handler, so no command is executed when an INT signal occurs and the default behavior applies: the script is terminated, and the final echo and exit statements are never executed.

Note that the following two are different.

trap - INT
trap '' INT

The second command will IGNORE signals (Ctrl+C in this case) so if we apply this statement above, we will not be able to use Ctrl+C to kill the execution.
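Interrupts are hard to show without a keyboard, so the following sketch swaps in the EXIT pseudo-signal, which fires whenever the shell exits, to demonstrate the same trap command signal syntax. The child sh process and the temporary file are illustrative; mktemp is assumed to be available.

```shell
tmpfile=$(mktemp)

# The child shell registers a cleanup command for EXIT;
# "$1" is the temporary file name passed in as the first argument
sh -c 'trap "rm -f \"$1\"" EXIT; echo "working with a temporary file"' sh "$tmpfile"

if [ -e "$tmpfile" ]; then
    echo "temporary file still exists"
else
    echo "temporary file was removed on exit"
fi
```

Cleaning up on EXIT is a common companion to the INT handler above, since it also covers a normal end of the script.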

Command Execution

$(command)
`command`    # ` is a backquote/backtick, not a single quotation sign
# Example
sudo apt-get install linux-headers-$(uname -r)

Note all new scripts should use the $(...) form, which was introduced to avoid some rather complex rules.

Example

#!/bin/sh
echo The current directory is $PWD
echo The current users are $(who)
sudo chown `id -u` SomeDir  # change the ownership to the current user. Dangerous!
                            # Or sudo chown `whoami` SomeDirOrSomeFile
exit 0

Note that $(your expression) is the better way because it allows you to nest expressions. For example,

cd $(dirname $(type -P touch))

will cd you into the directory containing the 'touch' command.

The concept of putting the result of a command into a script variable is very powerful, as it makes it easy to use existing commands in scripts and capture their output.
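A small sketch of capturing command output in variables, including one nested substitution. The path is illustrative and does not need to exist, since basename and dirname only manipulate the string.

```shell
path=/usr/local/bin/tool     # illustrative path

# Nested substitution: the inner dirname runs first, then basename
parent=$(basename "$(dirname "$path")")
echo "parent directory: $parent"

# Capture a number computed by a pipeline
n=$(printf '%s\n' a b c | grep -c '')
echo "lines: $n"
```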

Arithmetic Expansion

$((...))

is a better alternative to the expr command. More examples:

for i in $(seq 1 3)
  do echo SRR$(( i + 1027170 ))'_1'.fastq 
done

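The single quotes around _1 are optional here because $((...)) is self-delimiting; with a plain variable they (or braces, as in ${i}_1) would be needed so that the shell does not look up a variable named i_1. The above will output SRR1027171_1.fastq, SRR1027172_1.fastq and SRR1027173_1.fastq.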

Parameter Expansion

${parameter}
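Beyond the plain ${parameter} form, a few POSIX expansions are handy for trimming file names and supplying defaults. This is a sketch; the path and the variable names file and outdir are made up.

```shell
file=/data/run1/sample.fastq.gz   # illustrative path

echo "${file##*/}"        # remove the longest prefix matching */   -> sample.fastq.gz
echo "${file%/*}"         # remove the shortest suffix matching /*  -> /data/run1
echo "${file%.gz}"        # remove a known extension                -> /data/run1/sample.fastq

unset outdir
echo "${outdir:-results}" # use a default when unset or empty       -> results
```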

Bash shell find out if a command exists or not

http://www.cyberciti.biz/faq/unix-linux-shell-find-out-posixcommand-exists-or-not/

POSIX command

# command -v will return >0 when the command1 is not found
command -v command1 >/dev/null && echo "command1 Found In \$PATH" || echo "command1 Not Found in \$PATH"

$ help command
command: command [-pVv] command [arg ...]
    Execute a simple command or display information about commands.
    
    Runs COMMAND with ARGS suppressing  shell function lookup, or display
    information about the specified COMMANDs.  Can be used to invoke commands
    on disk when a function with the same name exists.
    
    Options:
      -p	use a default value for PATH that is guaranteed to find all of
    	the standard utilities
      -v	print a description of COMMAND similar to the `type' builtin
      -V	print a more verbose description of each COMMAND
    
    Exit Status:
    Returns exit status of COMMAND, or failure if COMMAND is not found.
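The command -v test is convenient to wrap in a function. A minimal sketch, where have is a hypothetical helper name and the second tool name is deliberately one that should not exist:

```shell
# have succeeds if its argument can be run as a command
have() {
    command -v "$1" >/dev/null 2>&1
}

for tool in sh definitely_not_a_real_command_12345; do
    if have "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: not found"
    fi
done
```

Using if/else here avoids the pitfall of the && ... || one-liner, where the || branch also runs if the command after && fails.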

type -P

type -P command1 &>/dev/null && echo "Found" || echo "Not Found"

$ help type
type: type [-afptP] name [name ...]
    Display information about command type.
    
    For each NAME, indicate how it would be interpreted if used as a
    command name.
    
    Options:
      -a	display all locations containing an executable named NAME;
    	includes aliases, builtins, and functions, if and only if
    	the `-p' option is not also used
      -f	suppress shell function lookup
      -P	force a PATH search for each NAME, even if it is an alias,
    	builtin, or function, and returns the name of the disk file
    	that would be executed
      -p	returns either the name of the disk file that would be executed,
    	or nothing if `type -t NAME' would not return `file'.
      -t	output a single word which is one of `alias', `keyword',
    	`function', `builtin', `file' or `', if NAME is an alias, shell
    	reserved word, shell function, shell builtin, disk file, or not
    	found, respectively
    
    Arguments:
      NAME	Command name to be interpreted.
    
    Exit Status:
    Returns success if all of the NAMEs are found; fails if any are not found.
typeset: typeset [-aAfFgilrtux] [-p] name[=value] ...
    Set variable values and attributes.
    
    Obsolete.  See `help declare'.

pause by read -p command

http://www.cyberciti.biz/tips/linux-unix-pause-command.html

read -p "Press [Enter] key to start backup..."

If we want to ask users a yes/no question, we can use this method

while true; do
    read -p "Do you wish to install this program? " yn
    case $yn in
        [Yy]* ) make install; break;;
        [Nn]* ) exit;;
        * ) echo "Please answer yes or no.";;
    esac
done

OR

echo "Do you wish to install this program?"
select yn in "Yes" "No"; do
    case $yn in
        Yes ) make install; break;;
        No ) exit;;
    esac
done

Keyboard input and Arithmetic

http://linuxcommand.org/wss0110.php

read

#!/bin/bash

echo -n "Enter some text > "
read text
echo "You entered: $text"

Arithmetic

#!/bin/bash

# An application of the simple command
# echo $((2+2))
# That is, when you surround an arithmetic expression with the double parentheses, 
# the shell will perform arithmetic evaluation.
first_num=0
second_num=0

echo -n "Enter the first number --> "
read first_num
echo -n "Enter the second number -> "
read second_num

echo "first number + second number = $((first_num + second_num))"
echo "first number - second number = $((first_num - second_num))"
echo "first number * second number = $((first_num * second_num))"
echo "first number / second number = $((first_num / second_num))"
echo "first number % second number = $((first_num % second_num))"
echo "first number raised to the"
echo "power of the second number   = $((first_num ** second_num))"

and a program that formats an arbitrary number of seconds into hours and minutes:

#!/bin/bash

seconds=0

echo -n "Enter number of seconds > "
read seconds

# use the division operator to get the quotient
hours=$((seconds / 3600))
# use the modulo operator to get the remainder
seconds=$((seconds % 3600))
minutes=$((seconds / 60))
seconds=$((seconds % 60))

echo "$hours hour(s) $minutes minute(s) $seconds second(s)"

Here documents

Debugging Scripts

http://www.cyberciti.biz/tips/debugging-shell-script.html

Run a shell script with the -x option. Each command of the script is then printed (to standard error) before it is executed, so we can see which line takes a long time or which line breaks the code (by default the script still continues after a failing command).

$ bash -x script-name
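The trace produced by -x goes to standard error, so it can be separated from the normal output. A small sketch using a made-up one-line script run in a child sh process:

```shell
# Keep the trace (stderr) and the normal output (stdout) in separate variables
trace=$(sh -x -c 'x=1; echo "value $x"' 2>&1 >/dev/null)
output=$(sh -x -c 'x=1; echo "value $x"' 2>/dev/null)

echo "normal output: $output"
echo "trace:"
echo "$trace"
```

Each traced command is prefixed with the value of PS4, which is "+ " by default; the exact quoting shown in the trace differs between shells.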

Use of set builtin command
Use of intelligent DEBUG function

To run a bash script line by line:

Bash Debugger
Use Geany. See the next section.

Geany

(Ubuntu 12.04 only): By default, it does not have the terminal tab. Install virtual terminal emulator. Run

sudo apt-get install libvte-dev

Step 1: Keyboard shortcut. Select a region of code, then choose Edit -> Commands -> Send Selection to Terminal. You can also assign a keybinding for this: go to Edit -> Preferences and pick the Keybindings tab. I assign F12 (without quotes) as the shortcut.

Step 2: Newline character. Another issue is that the last line of the sent code does not have a newline character, so I need to switch to the Terminal and press Enter. The solution is to modify <geany.conf> (find its location using locate geany.conf; on my Ubuntu 14 with geany 1.26 it is under ~/.config/geany/geany.conf) and set send_selection_unsafe=true.

Step 3: PATH variable.

$ tmpname=$(basename $inputVCF)
Command 'basename' is available in '/usr/bin/basename'
The command could not be located because '/usr/bin' is not included in the PATH environment variable.

The solution is to run PATH=$PATH:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin in the Terminal window before running our script.

Step 4 (optional): Change background color.

Another handy change to geany is to change its background to black. To do that, go to Edit -> Preferences -> Editor. Once on the Editor options level, select the Display tab to the far right of the dialog, and you will notice a checkbox marked invert syntax highlighting colors.

See this post about changing the default terminal in the Terminal window. The default is xterm (see the output of echo $TERM).

Examples

<upgrade8.sh> file from BioLinux installation page
Install required R packages using a mixture of bash and R.

Text processing

Regular Expression

A summary table
https://regexper.com/ You can type for example '[a-z]*.[0-9]' to see what it is doing.
- ( ?[a-zA-Z]+ ?) match all words in a given text
- [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3} match an IP address
Linux command line: grep PATTERN FILENAME or grep -E PATTERN FILENAME (extended regular expression)

echo -e "today is Monday\nHow are you" | grep Monday

grep -E "[a-z]+" filename
# or
egrep "[a-z]+" filename

grep -i PATTERN FILENAME # ignore case

grep -v PATTERN FILENAME # inverse match

grep -c PATTERN FILENAME # count the number of lines in which a matching string appears

grep -n PATTERN FILENAME # print the line number

grep -R PATTERN DIR      # recursively search many files
grep -r PATTERN DIR      # recursively search many files

grep -e "pattern1" -e "pattern2" FILENAME # multiple patterns
grep -f PATTERNFILE FILENAME # PATTERNFILE contains patterns line-by-line

grep -F PATTERN FILENAME # Interpret PATTERN as a  list  of  fixed  strings,  separated  by
                         # newlines,  any  of  which is to be matched.

grep -r --include=*.{c,cpp} PATTERN DIR # search only in files matching the given names
grep -r --exclude "README" PATTERN DIR  # excluding files in which to search

grep -o '<dt>.*</dt>' FILENAME # print only the matched string (<dt> .... </dt>)

Extract columns or fields from text files: cut

http://www.thegeekstuff.com/2013/06/cut-command-examples/

To extract fixed columns (say columns 5-7 of a file):

cut -c5-7 somefile

If the field delimiter is different from TAB you need to specify it using -d:

cut -d' ' -f100-105 myfile > outfile
#
cut -d: -f6 somefile   # colon-delimited file
# 
grep "/bin/bash" /etc/passwd | cut -d':' -f1-4,6,7    # field 1 through 4, 6 and 7

cut -f3 --complement somefile # print all the columns except the third column

To specify the output delimiter, we shall use --output-delimiter. NOTE that to specify the Tab delimiter in cut, we shall use $'\t'. See http://www.computerhope.com/unix/ucut.htm. For example,

cut -f 1,3 -d ':' --output-delimiter=$'\t' somefile

If I am not sure about the number of the final field, I can leave the number off.

cut -f 1- -d ':' --output-delimiter=$'\t' somefile

Substitution of text: sed (stream editor)

By default, sed prints the result to standard output and leaves the input file unchanged. To save the substitutions back to the same file, use the -i option.

sed 's/text/replace/' file > newfile
mv newfile file
# OR better
sed -i 's/text/replace/' file

The sed command will replace the first occurrence of the pattern in each line. If we want to replace every occurrence, we need to add the g parameter at the end, as follows:

sed 's/pattern/replace/g' file

To remove blank lines

sed '/^$/d' filename

To replace all three-digit numbers with another specified word in a file

sed -i 's/\b[0-9]\{3\}\b/NUMBER/g' filename

echo -e "I love 111 but not 1111." | sed 's/\b[0-9]\{3\}\b/NUMBER/g'

where \{3\} matches the preceding expression exactly three times. In basic regular expressions, the backslashes in \{3\} give { and } their special meaning. \b is the word boundary marker (a GNU extension).

Variable string and quoting

text=hello
echo hello world | sed "s/$text/HELLO/"

With double quotes, the shell expands $text before sed sees the expression; with single quotes, sed would receive the literal text $text.
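A quick sketch contrasting the two quoting styles on the same input (the variable names double and single are made up for the comparison):

```shell
text=hello

# Double quotes: the shell replaces $text, so sed runs s/hello/HELLO/
double=$(echo "hello world" | sed "s/$text/HELLO/")

# Single quotes: sed looks for the literal characters $text and finds none
single=$(echo "hello world" | sed 's/$text/HELLO/')

echo "double quotes: $double"
echo "single quotes: $single"
```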

Substitution of text: perl

Add or remove 'chr' from vcf file https://www.biostars.org/p/18530/

awk

awk is a tool designed to work with data streams. It can operate on columns and rows. It supports many built-in features, such as arrays and functions, with a syntax similar to the C programming language. Its biggest advantage is its flexibility.

https://en.wikipedia.org/wiki/AWK

Structure of an awk script

awk ' BEGIN{ print "start" } pattern { commands } END { print "end" } ' file

The three components (BEGIN, END and a common statements block with the pattern match option) are optional and any of them can be absent in the script.

By default, fields are separated by whitespace (spaces or tabs).

Some examples:

awk 'BEGIN { i=0 } { i++ } END { print i}' filename
echo -e "line1\nline2" | awk 'BEGIN { print "start" } { print } END { print  "End" }'

seq 5 | awk 'BEGIN { sum=0; print "Summation:" } { print $1"+"; sum+=$1 } END { print "=="; print sum }'

awk -F : '{print $6}' somefile   # colon-delimited file, print the 6th field (cut can do it)
#
awk --field-separator="\\t" '{print $6}' filename    # tab-delimited, GNU awk long option (cut can do it)
 
awk -F":" '{ print $1 " " $3 }' /etc/passwd  # (cut can do it)
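To show the three blocks working together with a pattern, here is a sketch that sums the second field of made-up colon-delimited records, keeping only the lines where that field is greater than 2. The records helper and its data are illustrative.

```shell
# Made-up colon-delimited records: name:count
records() {
    printf '%s\n' 'apple:3' 'pear:5' 'fig:2'
}

# BEGIN runs before any input is read, the middle block runs for each
# line that matches its pattern, and END runs after the last line
summary=$(records | awk -F: '
    BEGIN { total = 0; kept = 0 }
    $2 > 2 { total += $2; kept++ }
    END { print kept " records, total " total }
')

echo "$summary"
```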

How to wrap a long linux command

Use the backslash character, and make sure it is the last character on the line. For example, the first example below does not work because there is an extra space character after the \ on its second line.

Example 1 (does not work)

sudo apt-get install libcap-dev libbz2-dev libgcrypt11-dev libpci-dev libnss3-dev libxcursor-dev \
   libxcomposite-dev libxdamage-dev libxrandr-dev libdrm-dev libfontconfig1-dev libxtst-dev \ 
   libcups2-dev libpulse-dev libudev-dev

vs example 2 (works)

sudo apt-get install libcap-dev libbz2-dev libgcrypt11-dev libpci-dev libnss3-dev libxcursor-dev \
   libxcomposite-dev libxdamage-dev libxrandr-dev libdrm-dev libfontconfig1-dev libxtst-dev \
   libcups2-dev libpulse-dev libudev-dev

Command line path navigation

pushd and popd are used to switch between multiple directories without copying and pasting directory paths. They operate on a stack, a last in, first out (LIFO) data structure.

pushd /var/www
pushd /usr/src
dirs
pushd +2
popd

When we have only two locations, an alternative and easier way is cd -.

cd /usr/src
# Do something
cd /var/www
cd -     # /usr/src

Web

Reference: Linux Shell Scripting Cookbook

Copy a complete website

wget --mirror --convert-links URL
# OR
wget -r -N -k -l DEPTH URL

HTTP or FTP authentication

wget --user username --password pass URL

Download a web page as plain text (instead of HTML text)

lynx URL -dump > TextWebPage.txt

cURL

curl http://google.com -o index.html --progress-bar
curl http://google.com --silent -o index.html

# Cookies
curl http://example.com --cookie "user=ABCD;pass=EFGH"
curl URL --cookie-jar cookie_file

# Setting a user agent string
# http://www.useragentstring.com/pages/useragentstring.php
curl URL --user-agent "Mozilla/5.0"

# Authenticating 
curl -u user:pass http://test_auth.com
curl -u user http://test_auth.com

# Printing response headers excluding the data
# For example, to check whether a page is reachable or not
# by checking the 'Content-length' parameter.
curl -I URL

Image crawler and downloader

#!/bin/bash
#Desc: Images downloader
#Filename: img_downloader.sh

if [ $# -ne 3 ];
then
  echo "Usage: $0 URL -d DIRECTORY"
  exit 1
fi

for i in {1..4}
do
  case $1 in
  -d) shift; directory=$1; shift ;;
   *) url=${url:-$1}; shift;;
  esac
done

mkdir -p $directory;
baseurl=$(echo $url | egrep -o "https?://[a-z.]+")

echo Downloading $url
curl -s $url | egrep -o "<img src=[^>]*>" | 
sed 's/<img src=\"\([^"]*\).*/\1/g' > /tmp/$$.list

sed -i "s|^/|$baseurl/|" /tmp/$$.list

cd $directory;

while read filename;
do
  echo Downloading $filename
  curl -s -O "$filename" --silent

done < /tmp/$$.list

Find broken links in a website by lynx -traversal

#!/bin/bash 
#Desc: Find broken links in a website

if [ $# -ne 1 ]; 
then 
  echo -e "Usage: $0 URL\n" 
  exit 1; 
fi 

echo Broken links: 

mkdir /tmp/$$.lynx 
cd /tmp/$$.lynx 

lynx -traversal $1 > /dev/null 
count=0; 

sort -u reject.dat > links.txt 

while read link; 
do 
  output=`curl -I $link -s | grep "HTTP/.*OK"`; 
  if [[ -z $output ]]; 
  then 
    echo $link; 
    let count++ 
  fi 
done < links.txt 

[ $count -eq 0 ] && echo No broken links found.

Track changes to a website

#!/bin/bash
#Desc: Script to track changes to webpage

if [ $# -ne 1 ];
then 
  echo -e "Usage: $0 URL\n"
  exit 1;
fi

first_time=0
# Not first time

if [ ! -e "last.html" ];
then
  first_time=1
  # Set it is first time run
fi

curl --silent $1 -o recent.html

if [ $first_time -ne 1 ];
then
  changes=$(diff -u last.html recent.html)
  if [ -n "$changes" ];
  then
    echo -e "Changes:\n"
    echo "$changes"
  else
    echo -e "\nWebsite has no changes"
  fi
else
  echo "[First run] Archiving.."

fi
  
cp recent.html last.html

POST/GET

Look at a web site source and look for the 'name' field in a <input> tag.

http://www.w3schools.com/html/html_forms.asp

# -d is used for posting in curl
curl URL -d "postvar1=var1&postvar2=var2"
# OR the wget command with the --post-data option
wget URL --post-data "postvar1=var1&postvar2=var2" -O out.html

Working with Files

nl command

Add line numbers to a text file

$ cat demo_file
THIS LINE IS THE 1ST UPPER CASE LINE IN THIS FILE.
this line is the 1st lower case line in this file.
This Line Has All Its First Character Of The Word With Upper Case.

Two lines above this line is empty.
And this is the last line.
$ nl demo_file
     1	THIS LINE IS THE 1ST UPPER CASE LINE IN THIS FILE.
     2	this line is the 1st lower case line in this file.
     3	This Line Has All Its First Character Of The Word With Upper Case.
       
     4	Two lines above this line is empty.
     5	And this is the last line.

file command

$ file thumbs/g7.jpg 
thumbs/g7.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=10, orientation=upper-left, xresolution=134, yresolution=142, resolutionunit=2, software=Adobe Photoshop CS Windows, datetime=2004:03:31 22:28:58], baseline, precision 8, 100x75, frames 3

$ file index.html
index.html: HTML document, ASCII text

$ file 2742OS_5_01.sh 
2742OS_5_01.sh: Bourne-Again shell script, ASCII text executable

$ file R-3.2.3.tar.gz 
R-3.2.3.tar.gz: gzip compressed data, last modified: Thu Dec 10 03:12:50 2015, from Unix

tail -f command

When we use the '-f' (follow) option, we can monitor a growing file. For example, we can create a new file called tmp.txt and run 'tail -f tmp.txt'. Now we open another terminal and run 'for i in {0..100}; do sleep 2; echo $i >> tmp.txt ; done' in the same directory. We will see in the 1st terminal that new lines appear as tmp.txt grows.

Practical examples:

Monitor system changes

sudo tail -f /var/log/syslog

Monitor a file and stop automatically when a given process dies

PID=$(pidof Foo)
tail -f textfile --pid $PID

If a process Foo (e.g. gedit) is appending data to a file, tail -f keeps running until the process Foo dies.

Low-level File Access

File descriptors: 0 means standard input, 1 means standard output, 2 means standard error.
ssize_t write(int fildes, const void *buf, size_t nbytes);

#include <unistd.h>
#include <stdlib.h>
int main()
{
  if ((write(1, "Here is some data\n", 18)) != 18)
    write(2, "A write error has occurred on file descriptor\n", 46);
  exit(0);
}

ssize_t read(int fildes, void *buf, size_t nbytes); returns the number of data bytes actually read. If a read call returns 0, it had nothing to read; it reached the end of the file. An error on the call will cause it to return -1.
To create a new file descriptor we use the open system call. int open(const char *path, int oflags, mode_t mode);

The next program will do file copy.

#include <unistd.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdlib.h>
int main()
{
  char c;
  int in, out;
  in = open("file.in", O_RDONLY);
  out = open("file.out", O_WRONLY|O_CREAT, S_IRUSR|S_IWUSR);
  while(read(in,&c,1) == 1)
    write(out,&c,1);
  exit(0);
}

The Standard I/O Library

fopen, fclose
fread, fwrite
fflush
fseek
fgetc, getc, getchar
fputc, putc, putchar
fgets, gets
printf, fprintf and sprintf
scanf, fscanf and sscanf

Formatted Input and Output

printf, fprintf and sprintf
scanf, fscanf and sscanf

Stream Errors

How do You Run a Command in the Background with No Output Unless There is an Error?
