Python
Basic
Think Python (Free Ebook)
http://www.greenteapress.com/thinkpython/
How to run a Python script
python mypython.py
Install a new module
The Python Package Index (PyPI) is the definitive list of packages (or modules).
sudo apt-get install python-pip
pip install SomePackage
pip show --files SomePackage
pip install --upgrade SomePackage
pip uninstall SomePackage
If a package has been bundled by its creator using the standard approach to bundling modules (with Python’s distutils tool), all you need to do is download the package, uncompress it and type:
python setup.py install
How to list all installed modules
help('modules')
How to find the location of installed modules
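There are different ways:
- Start the interpreter with python -v, then run import MODULENAME; the verbose output shows the file each module is loaded from
- help('MODULENAME') (the FILE section shows the path)
Using this approach, I found that the 'RPi' module is installed under /usr/lib/python2.7/dist-packages.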
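A module's location can also be read from inside Python. The lines below are a minimal sketch; the standard-library json module is only a stand-in (an assumption for illustration) for whatever module you are looking for.

```python
import importlib.util
import json

# A module loaded from a file records its path in __file__.
print(json.__file__)

# find_spec locates a module without importing it; origin holds the file path.
spec = importlib.util.find_spec("json")
print(spec.origin)
```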
if __name__ == "__main__":
http://stackoverflow.com/questions/419163/what-does-if-name-main-do
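A short sketch of the idiom; the function name main and its return value are placeholders chosen for illustration.

```python
def main():
    """Stand-in for the script's real work."""
    return "running as a script"


# __name__ equals "__main__" only when the file is executed directly
# (python mypython.py), not when it is imported from another module.
if __name__ == "__main__":
    print(main())
```

Code under the if block is skipped on import, so the same file can serve as both a script and a module.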
Import a compiled C module
- An example based on SWIG compiler.
string and string operators
Reference: Python for Genomic Data Science from Coursera.
- Strings can be defined with single or double quotes; double quotes are handy when the text itself contains an apostrophe
- Use triple double quotes """ to write a long string spanning multiple lines, or a multi-line comment, in a Python script
- if dna="gatagc", then
dna[0] # 'g'
dna[-1] # 'c' (negative indices count from the right)
dna[-2] # 'g'
dna[0:3] # 'gat' (the end index is always excluded)
dna[:3] # 'gat'
dna[2:] # 'tagc'
len(dna) # 6
type(dna)
print(dna)
dna.count('c') # 1
dna.upper() # 'GATAGC'
dna.find('ag') # 3 (only the first occurrence of 'ag' is reported)
dna.find('ag', 2) # 3 (start looking from position 2)
dna.rfind('ag') # 3 (search backwards in the string)
dna.islower() # True
dna.isupper() # False
dna.replace('a', 'A') # 'gAtAgc'
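The operations listed above can be checked in one runnable snippet. This is only a sketch using the same example string; the search for "xx" is added to show what find returns when nothing matches.

```python
dna = "gatagc"

print(dna[0], dna[-1], dna[-2])      # first, last, second-to-last character
print(dna[0:3], dna[:3], dna[2:])    # slices exclude the end index
print(len(dna), dna.count("c"))
print(dna.find("ag"), dna.rfind("ag"))
print(dna.find("xx"))                # find returns -1 when the substring is absent
print(dna.upper(), dna.replace("a", "A"))
```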
User's input
dna=raw_input("Enter a DNA sequence: ") # python 2
dna=input("Enter a DNA sequence: ") # python 3
To convert a user's input (a string) to other types:
int(x[, base]) # converts x to an integer
float(x) # converts x to a floating-point number
str(x) # converts x to a string
str(65) # '65'
chr(x) # converts an integer to a character
chr(65) # 'A'
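A few concrete conversions, as a small sketch. The input values are made up for illustration, and ord is included only as the inverse of chr.

```python
print(int("42") + 1)      # string to integer
print(int("ff", 16))      # string in base 16 to integer
print(float("0.5") * 2)   # string to float
print(str(65))            # integer to string
print(chr(65), ord("A"))  # integer to character and back
```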
Fancy Output
print("The DNA's GC content is", gc, "%") # gives too many digits after the decimal point
print("The DNA's GC content is %5.3f %%" % gc)
# the percent operator separates the format string from the value that
# replaces the format placeholder
print("%d" % 10.6) # 10
print("%e" % 10.6) # 1.060000e+01
print("%s" % dna) # gatagc
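The same output can also be produced with str.format. The sketch below computes the GC content of the short example string from earlier in these notes and formats it both ways; the variable names old_style and new_style are illustrative.

```python
dna = "gatagc"
gc = (dna.count("g") + dna.count("c")) * 100.0 / len(dna)

# %5.3f: at least 5 characters wide, 3 digits after the decimal point.
old_style = "The DNA's GC content is %5.3f %%" % gc
new_style = "The DNA's GC content is {:5.3f} %".format(gc)

print(old_style)
print(new_style)
```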
List
A list is an ordered sequence of values.
gene_expr = ['gene', 5.16e-08, 0.001385, 7.33e-08]
print(gene_expr[2])
gene_expr[0] = 'Lif'
Slice a list (a slice creates a new list)
gene_expr[-3:] # [5.16e-08, 0.001385, 7.33e-08]
gene_expr[1:3] = [6.09e-07] # assigning to a slice changes the list in place
Clear the list
gene_expr[:] = []
Size of the list
len(gene_expr)
Delete an element
del gene_expr[1]
Extend/append to a list
gene_expr.extend([5.16e-08, 0.00123])
Count the number of times an element appears in a list
print(gene_expr.count('Lif'), gene_expr.count('gene'))
Reverse all elements in a list
gene_expr.reverse()
print(gene_expr)
help(list)
Lists as Stacks
stack=['a', 'b', 'c', 'd']
stack.append('e')
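A stack also needs a way to take elements off the top. A minimal sketch, using the list method pop together with append:

```python
stack = ["a", "b", "c", "d"]
stack.append("e")   # push onto the top of the stack
top = stack.pop()   # pop removes and returns the last element

print(top)
print(stack)
```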
Sorting lists
mylist = [3, 31, 123, 1, 5]
sorted(mylist) # returns a new sorted list
mylist # not changed
mylist.sort() # sorts the list in place
mylist = ['c', 'g', 'T', 'a', 'A']
mylist.sort()
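Sorting a list of mixed-case letters puts all uppercase letters before lowercase ones. The sketch below shows that, plus one possible way around it with a key function (key=str.lower is a suggestion, not part of the original notes).

```python
numbers = [3, 31, 123, 1, 5]
print(sorted(numbers))   # new list; numbers itself is unchanged
print(numbers)

letters = ["c", "g", "T", "a", "A"]
letters.sort()           # in place; uppercase sorts before lowercase
print(letters)

# Case-insensitive ordering: compare the lowercased form of each element.
print(sorted(["c", "g", "T", "a", "A"], key=str.lower))
```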
Strings are immutable: don't assign to an element of a string!
motif = 'nacggggtc'
motif[0] = 'a' # ERROR (TypeError: strings do not support item assignment)
Tuples
A tuple consists of a number of values separated by commas, and is another standard sequence data type, like strings and lists.
t = 1, 2, 3
t # (1, 2, 3)
t = (1, 2, 3) # tuples may be written with or without surrounding parentheses
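Two things worth knowing about tuples are unpacking and immutability. A small sketch; the variable names are illustrative.

```python
t = 1, 2, 3
x, y, z = t          # unpacking: one name per element
print(x, y, z)

try:
    t[0] = 9         # tuples, like strings, cannot be modified in place
except TypeError as err:
    print("tuples are immutable:", err)

single = (1,)        # a one-element tuple needs the trailing comma
print(type(single))
```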
Sets
A set is an unordered collection with no duplicate elements.
brca1={'DNA repair', 'zinc ion binding'}
brca2={'protein binding', 'H4 histone'}
brca1 | brca2 # union
brca1 & brca2 # intersection
brca1 - brca2 # difference
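A runnable sketch of the three set operators. The annotation sets here are hypothetical: 'protein binding' has been added to brca1 purely so that the intersection is not empty.

```python
# Hypothetical annotation sets, chosen so that the two sets overlap.
brca1 = {"DNA repair", "zinc ion binding", "protein binding"}
brca2 = {"protein binding", "H4 histone"}

print(brca1 | brca2)   # union: terms in either set
print(brca1 & brca2)   # intersection: terms in both sets
print(brca1 - brca2)   # difference: terms only in brca1
```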
Dictionaries
A dictionary is a collection of key-value pairs (unordered before Python 3.7, insertion-ordered since), with the requirement that the keys are unique (within one dictionary).
TF_motif = {'SP1' : 'gggcgg',
            'C/EBP' : 'attgcgcaat',
            'ATF' : 'tgacgtca',
            'c-Myc' : 'cacgtg',
            'Oct-1' : 'atgcaaat'}
# Access
print("The recognition sequence for the ATF transcription factor is %s." % TF_motif['ATF'])
# Update
TF_motif['AP-1'] = 'tgagtca'
# Delete
del TF_motif['SP1']
# Size of a dictionary
len(TF_motif)
# Get a list of all the 'keys' in a dictionary
list(TF_motif.keys())
# Get a list of all the 'values'
list(TF_motif.values())
# sort
sorted(TF_motif.keys())
sorted(TF_motif.values())
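Looking up a missing key with square brackets raises a KeyError. The sketch below shows two common ways to guard against that (the in operator and the get method) and how to loop over the keys in sorted order; the default value "unknown" is an arbitrary choice.

```python
TF_motif = {"SP1": "gggcgg",
            "C/EBP": "attgcgcaat",
            "ATF": "tgacgtca",
            "c-Myc": "cacgtg",
            "Oct-1": "atgcaaat"}

# Iterate over the keys in sorted order and look up each value.
for name in sorted(TF_motif):
    print("%-6s %s" % (name, TF_motif[name]))

print("AP-1" in TF_motif)               # membership test on the keys
print(TF_motif.get("AP-1", "unknown"))  # get returns a default instead of raising
```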
In summary, strings, lists and dictionaries are the most useful data types for bioinformatics.
if statement
dna=input('Enter DNA sequence: ')
if 'n' in dna :
    nbases=dna.count('n')
    print("dna sequence has %d undefined bases " % nbases)
if condition 1:
    do action 1
elif condition 2:
    do action 2
else:
    do action 3
Logical operators
Use and, or, not.
dna=input('Enter DNA sequence: ')
if 'n' in dna or 'N' in dna:
    nbases=dna.count('n')+dna.count('N')
    print("dna sequence has %d undefined bases " % nbases)
else:
    print("dna sequence has no undefined bases")
Loops
while
dna=input('Enter DNA sequence:')
pos=dna.find('gt', 0)
while pos>-1 :
    print("Donor splice site candidate at position %d" %pos)
    pos=dna.find('gt', pos+1)
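The same while loop can be wrapped in a function so that it runs without keyboard input. This is a sketch: the helper name find_all and the test sequence are made up for illustration.

```python
def find_all(dna, motif):
    """Return every start position of motif in dna (overlapping hits included)."""
    positions = []
    pos = dna.find(motif)
    while pos > -1:
        positions.append(pos)
        # Resume the search one character past the previous hit.
        pos = dna.find(motif, pos + 1)
    return positions


print(find_all("aggtcgtaagt", "gt"))
```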
for
motifs=["attccgt", "aggggggttttttcg", "gtagc"]
for m in motifs:
    print(m, len(m))
range
for i in range(4):
    print(i)
for i in range(1,10,2):
    print(i)
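Wrapping range in list shows at a glance which numbers each call produces. A short sketch; the third call, with a negative step, is an extra example not in the notes above.

```python
print(list(range(4)))          # start defaults to 0, stop is excluded
print(list(range(1, 10, 2)))   # start, stop, step
print(list(range(10, 0, -3)))  # a negative step counts down
```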
Problem: check whether all characters in a given protein sequence are valid amino acids.
protein='SDVIHRYKUUPAKSHGWYVCJRSRFTWMVWWRFRSCRA'
for i in range(len(protein)):
    if protein[i] not in 'ABCDEFGHIKLMNPQRSTVWXYZ':
        print("this is not a valid protein sequence!")
        break
continue
protein='SDVIHRYKUUPAKSHGWYVCJRSRFTWMVWWRFRSCRA'
corrected_protein=''
for i in range(len(protein)):
    if protein[i] not in 'ABCDEFGHIKLMNPQRSTVWXYZ':
        continue
    corrected_protein=corrected_protein+protein[i]
print("Corrected protein seq is %s" % corrected_protein)
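Both loops can also be written without index arithmetic. The sketch below is an alternative formulation, not the course's version: it uses all with a generator expression for the validity check and join for the filtering. The constant name VALID is an assumption.

```python
VALID = "ABCDEFGHIKLMNPQRSTVWXYZ"   # same alphabet as in the loops above
protein = "SDVIHRYKUUPAKSHGWYVCJRSRFTWMVWWRFRSCRA"

# all() stops at the first invalid character, like the loop with break.
is_valid = all(aa in VALID for aa in protein)
print(is_valid)

# Keep only the valid characters, like the loop with continue.
corrected_protein = "".join(aa for aa in protein if aa in VALID)
print(corrected_protein)
```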
else Statement used with loops
- If used with a for loop, the else block is executed when the loop has finished iterating over the list without hitting a break
- If used with a while loop, the else block is executed when the condition becomes false; it is skipped if the loop exits through break
# Find all prime numbers smaller than a given integer
N=10
for y in range(2, N):
    for x in range(2, y):
        if y % x == 0:
            print(y, 'equals', x, '*', y//x)
            break
    else:
        # loop fell through without finding a factor
        print(y, 'is a prime number')
The pass statement is a placeholder
if motif not in dna:
    pass
else:
    some_function_here()
Functions
def function_name(arguments) :
    function_code_block
    return output
For example,
def gc(dna) :
    "this function computes the gc perc of a dna seq"
    nbases=dna.count('n')+dna.count('N')
    gcpercent=float(dna.count('c')+dna.count('C')+dna.count('g')
        +dna.count('G'))*100.0/(len(dna)-nbases)
    return gcpercent
gc('AAAAGTNNAGTCC')
help(gc)
Boolean functions
Problem: check if a given dna seq contains an in-frame stop codon
def has_stop_codon(dna) :
    "This function checks if given dna seq has in frame stop codons."
    stop_codon_found=False
    stop_codons=['tga', 'tag', 'taa']
    for i in range(0, len(dna), 3) :
        codon=dna[i:i+3].lower()
        if codon in stop_codons :
            stop_codon_found=True
            break
    return stop_codon_found
# the function must be defined before it is called
dna=input("Enter a dna seq: ")
if (has_stop_codon(dna)) :
    print("input seq has an in frame stop codon.")
else :
    print("input seq has no in frame stop codon.")
Function default parameter values
Suppose the has_stop_codon function also accepts a frame argument (equal to 0, 1, or 2) which specifies in what frame we want to look for stop codons.
def has_stop_codon(dna, frame=0) :
    "This function checks if given dna seq has in frame stop codons."
    stop_codon_found=False
    stop_codons=['tga', 'tag', 'taa']
    for i in range(frame, len(dna), 3) :
        codon=dna[i:i+3].lower()
        if codon in stop_codons :
            stop_codon_found=True
            break
    return stop_codon_found
dna="atgagcggccggct"
has_stop_codon(dna) # False
has_stop_codon(dna, 0) # False
has_stop_codon(dna, 1) # True
has_stop_codon(frame=0, dna=dna)
More examples
Reverse complement of a dna sequence
def reversecomplement(seq):
    """Return the reverse complement of the dna string."""
    seq = reverse_string(seq)
    seq = complement(seq)
    return seq
# call this only after reverse_string and complement (below) have been defined
reversecomplement('CCGGAAGAGCTTACTTAG')
To reverse a string
def reverse_string(seq):
    return seq[::-1]
reverse_string(dna)
Complement a DNA Sequence
def complement(dna):
    """Return the complementary sequence string."""
    basecomplement = {'A':'T', 'C':'G', 'G':'C', 'T':'A',
        'N':'N', 'a':'t', 'c':'g', 'g':'c', 't':'a', 'n':'n'} # dictionary
    letters = list(dna) # split the string into a list of characters
    letters = [basecomplement[base] for base in letters] # list comprehension
    return ''.join(letters)
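As an alternative to the dictionary and list comprehension, the built-in str.maketrans and str.translate can do the complement in one step. This is a sketch of that different technique, not the course's method; the names COMPLEMENT_TABLE and reverse_complement are made up here.

```python
# Translation table mapping each base to its complement (both cases, plus N).
COMPLEMENT_TABLE = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    # Reverse with a slice, then complement every base via the table.
    return seq[::-1].translate(COMPLEMENT_TABLE)


print(reverse_complement("CCGGAAGAGCTTACTTAG"))
```

Unlike the dictionary version, translate leaves characters that are not in the table unchanged instead of raising a KeyError.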
Split and Join functions
sentence="enzymes and other proteins come in many shapes"
sentence.split() # split on any run of whitespace
sentence.split('and') # use 'and' as the separator
'-'.join(['enzymes', 'and', 'other', 'proteins', 'come', 'in', 'many', 'shapes'])
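The results of the calls above, as a runnable sketch. Note that the separator 'and' matches only the standalone word here, because 'many' contains 'an' but not 'and'.

```python
sentence = "enzymes and other proteins come in many shapes"

words = sentence.split()        # split on whitespace
print(words)

parts = sentence.split("and")   # the separator itself is removed
print(parts)

print("-".join(words))          # join is the inverse of split
```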
Variable number of function arguments
def newfunction(fi, se, th, *rest):
    print("First: %s" % fi)
    print("Second: %s" % se)
    print("Third: %s" % th)
    print("Rest... %s" % (rest,)) # rest is a tuple, so wrap it to format it as one value
    return
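Inside the function, the starred parameter is an ordinary tuple holding every extra positional argument. A minimal sketch; the function name describe_args is hypothetical.

```python
def describe_args(first, *rest):
    """Return the first argument and the tuple of all remaining ones."""
    return first, rest


print(describe_args(1, 2, 3))   # extra arguments are collected into a tuple
print(describe_args(1))         # with no extras, rest is an empty tuple

values = [10, 20, 30]
print(describe_args(*values))   # a star at the call site unpacks a list
```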
Projects based on python
- pithos: Pandora on Linux
- Many Raspberry Pi GPIO projects
- GeneScissors It also requires pip and scikit-learn packages.
- KeepNote It depends on Python 2.X, sqlite and PyGTK.
- Zim It depends on Python, Gtk and the python-gtk bindings.
- Cherrytree It depends on Python2, Python-gtk2, Python-gtksourceview2, p7zip-full, python-enchant and python-dbus.
Qt for GUI development
- http://zetcode.com/gui/pyqt4/
- http://wiki.wildsong.biz/index.php/PyQt Create GUI in Qt Designer and convert/use it in PyQt.