Python: Difference between revisions

From 太極
Jump to navigation Jump to search
Line 822: Line 822:
= Python 3 =
= Python 3 =
* [https://github.com/arogozhnikov/python3_with_pleasure Migrating to Python 3 with pleasure]
* [https://github.com/arogozhnikov/python3_with_pleasure Migrating to Python 3 with pleasure]
* [https://github.com/wesm/pydata-book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython] by Wes McKinney.


= R and Python =
= R and Python =
* [https://github.com/rstudio/reticulate R Interface to Python]: '''reticulate''' package
* [https://github.com/rstudio/reticulate R Interface to Python]: '''reticulate''' package
* [https://www.listendata.com/2018/03/run-python-from-r.html?m=1 Run Python from R]
* [https://www.listendata.com/2018/03/run-python-from-r.html?m=1 Run Python from R]

Revision as of 12:32, 16 April 2018

Basic

Think Python (Free Ebook)

http://www.greenteapress.com/thinkpython/

Cheat sheet

The Most Frequently Asked Questions About Python Programming

https://www.makeuseof.com/tag/python-programming-faq/

How to run a python code

python mypython.py

Install a new module

See an example of installing HTSeq.

pip

The Python Package Index (PyPI) is the definitive list of packages (or modules)

sudo apt-get install python-pip
pip install SomePackage
pip show --files SomePackage
pip install --upgrade SomePackage
pip uninstall SomePackage

Don't use sudo + pip

https://askubuntu.com/questions/802544/is-sudo-pip-install-still-a-broken-practice

"--user" option in pip

https://github.com/pypa/pip/issues/4186

$ pip install Pygments
...
OSError: [Errno 13] Permission denied: '/usr/local/lib/python2.7/dist-packages/Pygments-2.2.0.dist-info'
/usr/local/lib/python2.7/dist-packages/pip-9.0.1-py2.7.egg/pip/_vendor/requests/packages/urllib3/util/ssl_.py:122: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
$ pip install --user Pygments
Collecting Pygments
  Using cached Pygments-2.2.0-py2.py3-none-any.whl
Installing collected packages: Pygments
Successfully installed Pygments-2.2.0

python setup.py

If a package has been bundled by its creator using the standard approach to bundling modules (with Python’s distutils tool), all you need to do is download the package, uncompress it and type:

python setup.py build
sudo python setup.py install

For Python 2, the packages are installed under /usr/local/lib/python2.7/dist-packages/.

$ ls -l /usr/local/lib/python2.7/dist-packages/
total 12
-rw-r--r-- 1 root staff  273 Jan 12 13:45 easy-install.pth
drwxr-sr-x 4 root staff 4096 Jan 12 13:45 HTSeq-0.6.1p1-py2.7-linux-x86_64.egg
drwxr-sr-x 4 root staff 4096 Jan 12 13:42 pysam-0.9.1.4-py2.7-linux-x86_64.egg

Get a list of installed modules

http://stackoverflow.com/questions/739993/how-can-i-get-a-list-of-locally-installed-python-modules

pydoc modules

Check an installed package version

If you install packages through pip, use

$ pip list
...
pyOpenSSL (0.13.1)
pyparsing (2.0.1)
pysam (0.10.0)
python-dateutil (1.5)
pytz (2013.7)
rudix (2016.12.13)
scipy (0.13.0b1)
setuptools (1.1.6)
singledispatch (3.4.0.3)
six (1.4.1)
tornado (4.4.2)
vboxapi (1.0)
xattr (0.6.4)
zope.interface (4.1.1)

And more information about a package by using pip show PACKAGE.

$ pip show pysam
Name: pysam
Version: 0.10.0
Summary: pysam
Home-page: https://github.com/pysam-developers/pysam
Author: Andreas Heger
Author-email: [email protected]
License: MIT
Location: /Users/XXX/Library/Python/2.7/lib/python/site-packages
Requires: 

The following method works whether the package is installed by source or binary package

>>> import pysam
>>> print(pysam.__version__)
0.10.0
>>> print pysam.__version__
0.10.0

See http://hammelllab.labsites.cshl.edu/tetoolkit-faq/

Install a specific version of package through pip

https://stackoverflow.com/questions/5226311/installing-specific-package-versions-with-pip

For example, pysam package was actively released. But the new release (0.11.2.2) may introduce some bugs. So I have to install an older version (0.10.0 works for me on Mac El Capitan and Sierra).

$ sudo -H pip uninstall pysam
Uninstalling pysam-0.11.2.2:
......
$ sudo -H pip install pysam==0.10.0
Collecting pysam==0.10.0
  Downloading pysam-0.10.0.tar.gz (2.3MB)
    100% |████████████████████████████████| 2.3MB 418kB/s 
Installing collected packages: pysam
  Running setup.py install for pysam ... done
Successfully installed pysam-0.10.0

warning: Please check the permissions and owner of that directory

I got this message when I use root to run the 'sudo pip install PACKAGE' command.

See

python3-pip installed but pip3 command not found?

sudo apt-get remove python3-pip; sudo apt-get install python3-pip

How to list all installed modules

help('modules')

How to find the location of installed modules

There are different ways

  1. python -v
  2. import MODULENAME
  3. help('MODULENAME')

Using this way, I find the 'RPi' module is installed under /usr/lib/python2.7/dist-packages.

if __name__ == "__main__":

http://stackoverflow.com/questions/419163/what-does-if-name-main-do

Import a compiled C module

string and string operators

Reference:

  1. Python for Genomic Data Science from coursera.
  2. Python Hello World and String Manipulation
dna[0]='g' 
dna[-1]='c' (start counting from the right)
dna[-2]='g'
dna[0:3]='gat' (the end always excluded)
dna[:3]='gat'
dna[2:]='tgc'
len(dna)=6
type(dna)
print(dna)
dna.count('c')
dna.upper()
dna.find('ag')=3  (only the first occurrence of 'ag' is reported)
dna.find('17', 2) (start looking from pos 17)
dna.rfind('ag')   ( search backwards in string)
dna.islower()    (True)
dna.isupper()    (False)
dna.replace('a', 'A')

User's input

dna=raw_input("Enter a DNA sequence: ")  # python 2
dna=input("Enter a DNA sequence: ")      # python 3

To convert a user's input (a string) to others

int(x, [, base])
flaot(x)
str(x) #converts x to a string
str(65) # '65'

chr(x)  # converts an integer to a character
chr(65) # 'A'

Fancy Output

print("THE DNA's GC content is ", gc, "%") # gives too many digits following the dot
print("THE DNA's GC content is %5.3f %%" % " % gc) 
# the percent operator separating the formatting string and the value to
# replace the format placeholder
print("%d" % 10.6)  # 10
print("%e" % 10.6)  # 10.060000e+01
print("%s" % dna)   # gatagc

List

A list is an ordered set of values

gene_expr=['gene', 5.16e-08, 0.001385, 7.33e-08]
print(gene_expr[2]
gene_expr[0]='Lif'

Slice a list (it will create a new list)

gene_expr[-3:]  # [5.16e-08, 0.001385, 7.33e-08]
gene_expr[1:3] = [6.09e-07]

Clear the list

gene_expr[]=[]

Size of the list

len(gene_expr)

Delete an element

del gene_expr[1]

Extend/append to a list

gene_expr).extend([5.16e-08, 0.00123])

Count the number of times an element appears in a list

print(gene_expr.count('Lif'), gene_expr.count('gene'))

Reverse all elements in a list

gene_expr.reverse()
print(gene_expr)
help(list)

Lists as Stacks

stack=['a', 'b', 'c', 'd']
stack.append('e')

Sorting lists

mylist=[3, 31, 123, 1, 5]
sorted(mylist)
mylist  # not changed
mylist.sort()

mylist=['c', 'g', 'T', 'a', 'A']
mylist.sort()

Don't change an element in a string!


motif = 'nacggggtc'
motif[0] = 'a'    # ERROR

Tuples

A tuple consists of a number of values separated by commas, and is another standard sequence data type, like strings and lists.

t=1,2,3
t
t=(1,2,3)  # we may input tuples with or without surrounding parentheses

Sets

A set is an unordered collection with no duplicate elements.

brca1={'DNA repair', 'zine ion binding'}
brca2={protein binding', 'H4 histone'}
brca1 | brca2
brca1 & brca2
brca1 - brca2

Dictionaries

A dictionary is an unordered set of key and value pairs, with the requirement that the keys are unique (within on dictionary).

TF_motif = {'SP1' : 'gggcgg', 
            'C/EBP' : 'attgcgcaat',
            'ATF' : 'tgacgtca',
            'c-Myc' : 'cacgtg',
            'Oct-1' : 'atgcaaat'}
# Access
print("The recognition sequence for the ATF transcription is %s." % TF_motif['ATF']) 
# Update
TF_motif['AP-1'] = 'tgagtca'
# Delete
del TF_motif['SP1']
# Size of a list
len(TF_motif)
# Get a list of all the 'keys' in a dictionary
list(TF_motif.keys())
# Get a list of all the 'values'
list(TF_motif.values())
# sort
sorted(TF_motif.keys())
sorted(TF_motif.values())

We can retrieve data from dictionaries using the items() method.

for name,seq in seqs.item():
    print(name, seq)

In summary, strings, lists and dictionaries are most useful data types for bioinformatics.

if statement

dna=input('Enter DNA sequence: ')
if 'n' in dna :
  nbases=dna.count('n')
  print("dna sequence has %d undefined bases " % nbases)

if condtion 1:
  do action 1
elif condition 2:
  do action 2
else:
  do action 3

Logical operators

Use and, or, not.

dna=input('Enter DNA sequence: ')
if 'n' in dna or 'N' in dna:
    nbases=dna.count('n')+dna.count('N')
    print("dna sequence has %d undefined bases " % nbases)
else:
    print("dna sequence has no undefined bases)

Loops

while

dna=input('Enter DNA sequence:')
pos=dna.find('gt', 0)

while pos>-1 :
    print("Donar splice site candidate at position %d" %pos)
    pos=dna.find('gt', pos+1)

for

motifs=["attccgt", "aggggggttttttcg", "gtagc"]
for m in motifs:
    print(m, len(m))

range

for i in range(4):
    print(i)
for i in range(1,10,2):
    print(i)

Problem: find all characters in a given protein sequence are valid amino acids.

protein='SDVIHRYKUUPAKSHGWYVCJRSRFTWMVWWRFRSCRA'
for i in range(len(protein)):
    if protein[i] not in 'ABCDEFGHIKLMNPQRSTVWXYZ': 
        print("this is not a valid protein sequence!")
        break

continue

protein='SDVIHRYKUUPAKSHGWYVCJRSRFTWMVWWRFRSCRA'
corrected_protein=''
for i in range(len(protein)):
    if protein[i] not in 'ABCDEFGHIKLMNPQRSTVWXYZ': 
        continue
    corrected_protein=corrected_protein+protein[i]
print("COrrected protein seq is %s" % corrected_protein)

else Statement used with loops

  • If used with a for loop, the else statement is executed when the loop has exhausted iterating the list
  • If used with a while loop, the else statement is executed when the condition becomes false
# Find all prime numbers smaller than a given integer
N=10
for y in range(2, N):
    for x in range(2, y):
        if y %x == 0:
            print(y, 'equals', x, '*', y//x)
            break
        else:
            // loop fell through without finding a factor
            print(y, 'is a prime number')

The pass statement is a placeholder

if motif not in dna:
    pass
else:
    some_function_here()

Functions

def function_name(arguments) :
    function_code_block
    return output

For example,

def gc(dna) :
    "this function computes the gc perc of a dna seq"
    nbases=dna.count('n')+dna.count('n')
    gcpercent=float(dna.count('c')+dna.count('C')+dna.count('g)
+dna.count('G'))*100.0/(len(dna)-nbases)
    return gcpercent
gc('AAAAGTNNAGTCC')
help(gc)

Boolean functions

Problem: checks if a given dna seq contains an in-frame stop condon

dna=input("Enter a dna seq: ")
if (has_stop_codon(dna)) :
    print("input seq has an in frame stop codon.")
else :
    print("input seq has no in frame stop codon.")

def has_stop_codon(dna) :
    "This function checks if given dna seq has in frame stop codons."
    stop_codon_found=False
    stop_codons=['tga', 'tag', 'taa']
    for i in range(0, len(dna), 3) :
        codon=dna[i:i+3].lower()
        if codon in stop_codons : 
            stop_codon_found=True
            break
    return stop_codon_found

Function default parameter values

Suppose the has_stop_codon function also accepts a frame argument (equal to 0, 1, or 2) which specifies in what frame we want to look for stop codons.

def has_stop_codon(dna, frame=0) :
    "This function checks if given dna seq has in frame stop codons."
    stop_codon_found=False
    stop_codons=['tga', 'tag', 'taa']
    for i in range(frame, len(dna), 3) :
        codon=dna[i:i+3].lower()
        if codon in stop_codons : 
            stop_codon_found=True
            break
    return stop_codon_found

dna="atgagcggccggct"
has_stop_codon(dna)    # False
has_stop_codon(dna, 0) # False
has_stop_codon(dna, 1) # True
has_stop_codon(frame=0, dna=dna)

More examples

Reverse complement of a dna sequence

def reversecomplement(seq):
    """Return the reverse complement of the dna string."""
    seq = reverse_string(seq)
    seq = complement(seq)
    return seq

reversecomplement('CCGGAAGAGCTTACTTAG')

To reverse a string

def reverse_string(seq):
    return seq[::-1]

reverse_string(dna)

Complement a DNA Sequence

def complement(dna):
    """Return the complementary sequence string."""
    basecomplement = {'A':'T', 'C':'G', 'G':'C', 'T':'A', 
                      'N':'N', 'a':t', 'c':'g', 'g':'c', 't':'a', 'n':'n'} # dictionary
    letters = list(dna) # list comprehensions
    letters = [basecomplement[base] for base in letters]
    return ''.join(letters)

Split and Join functions

sentence="enzymes and other proteins come in many shapes"
sentence.split()  # split on all whitespaces
sentence.split('and') # use 'and' as the separator

'-'.join(['enzymes', 'and', 'other', 'proteins', 'come', 'in', 'many', 'shapes'])

Variable number of function arguments

def newfunction(fi, se, th, *rest):
  print("First: %s" % fi)
  print("Second: %s" % se)
  print("Third: %s" % th)
  print("Rest... %s" % rest)
  return

Modules and packages

<dnautil.py>

#!/usr/bin/python
"""
dnautil module contains a few useful functions for dna seq
"""
def gc(dna) :
    blah
    blah
    return gcpercent

When a module is imported, Python first searches for a built-in module with that name.

If built-in module is not found, Python then searches for a file obtained by adding the extension .py to the name of the module that it's imported:

  • in your current directory,
  • the directory where Python has been installed
  • in a path, i.e., a colon(':') separated list of file paths, stored in the environment variable PYTHONPATH.

You can use the sys.path variable from the sys built-in module to check the list of all directories where Python look for files

import sys
sys.path

If the sys.path variable does not contains the directory where you put your module you can extend it:

sys.path.append("/home/$USER/python")
sys.path

Using modules

import dnautil
dna="atgagggctaggt"
gc(dna)          # gc is not defined
dnautil.gc(dna)  # Good

Import Names from a Module

from dnautil import *
gc(dna)          # OK

from dnautil import gc, has_stop_codon

Packages

Packages group multiple modules under on name, by using "dotted module names". For example, the module name A.B designates a submodule named B in a package named A.

Each package in Python is a directory which MUST contain a special file __init__.py. This file can be empty and it indicates that the directory it contains is a Python package, so it can be imported the same way a module can be imported. https://docs.python.org/2/tutorial/modules.html

Example: suppose you have several modules dnautil.py, rnautil.py , and proteinutil.py. You want to group them in a package called "bioseq" which processes all types of biological sequences. The structure of the package:

bioseq/
  __init__.py
  dnautil.py
  rnautil.py
  proteinutil.py
  fasta/
    __init__.py
    fastautil.py
  fastq/
    __init__.py
    fastqutil.py

Loading from packages:

import bioseq.dnautil
bioseq.dnautil.gc(dna)

from bioseq import dnautil
dnautil.gc(dna)

from bioseq.fasta.fastautil import fastqseqread

Files - Communicate with the outside

f=open('myfile', 'r') # read
f=open('myfile')
f=open('myfile', 'w') # write
f=open('myfile', 'a') # append

Take care if a file does not exists

try:
    f = open('myfile')
except IOError:
    print("the file myfile does not exist!!")

Reading

for line in f:
    print(line)

Change positions within a file object

f.seek(0)  # go to the beginning of the file
f.read()

Read a single line

f.seek(0)
f.readline()

Write into a file

f=open("/home/$USER/myfile, 'a)
f.write("this is a new line")
f.close()

Command line arguments

Suppose we run 'python processfasta.py myfile.fa' in the command line, then

import sys
print(sys.argv)  #  ['processfasta.py', 'myfile.fa']

More completely

#!/usr/bin/python
"""
processfasta.py builds a dictionary with all sequences from a FASTA file.
"""

import sys
filename=sys.argv[1]

try:
  f = open(filename)
except IOError:
    print("File %s does not exist!" % filename)

Parsing command line arguments with getopt. Suppose we want to store in the dictionary the sequences bigger than a given length provided in the command line: 'processfasta.py -l 250 myfile.fa'

#!/usr/bin/python
import sys
import getopt

def usage():
    print """
processfasta.py: reads a FASTA file and builds a 
dictionary with all sequence bigger than a given length

processfasta.py [-h] [-l <length>] <filename>

 -h           print this message
 -l <length>  filter all sequences with a length
              smaller than <length>
              (default <length>=0)
 <filename>   the file has to be in FASTA format

o, a = getopt.getopt(sys.argv[1:], '1:h')
opts = {} # empty dictionary
seqlen=0;

for k,v in o:
    opts[k] = v
if 'h' in opts.keys():  # he means the user wants help
    usage(); sys.exit()
if len(a) < 1:
    usage(); sys.exit("input fasta file is missing")
if 'l' in opts.keys():
    if opts['l'] <0 :
        print("length of seq should be positive!"); sys.exit(0);
    seqlen=opts['l']

stdin and stdout

sys.stdin.read()

sys.stdout.write("Some useful ouput.\n")

sys.stderr.write("Warning: input file was not found\n")

Call external programs

import subprocess
subprocess.call('["ls", "-l"]) # return code indicates the success or failure of the execution

subprocess.call('["tophat", "genome_mouse_idx", "PE_reads_1.fq.gz", "PE_reads_2.fq.gz"])

Biopython & Pubmed

  • Parsers for various bioinformatics file formats (FASTA, Genbank)
  • Access to online services like NCBI Entrez or Pubmed databases
  • Interfaces to common bioinformatics programs such as BLAST, Clustalw and others.
import Bio
print(Bio.__version__)

Running BLAST over the internet

from Bio.Blast import NCBIWWW
fasta_string = open("myseq.fa").read()
result_handle = NCBIWWW.qblast("blastn":, "nt", fasta_string)
# blastn is the program to use
# nt is the database to search against
# default output is xml
help(NCBIWWW.qblast)

The BLAST record

from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)

Parse BLAST output

len(blast_record.alignments)

E_VALUE_THRESH = 0.01
for alignment in blas_record.alignments:
  for hsp in alignment.hsps:
    if hsp.expect < E_VALUE_THRESH:
      print('***Alignment***')
      print('sequence:', alignment.title)
      print('length:', alignment.length)
      print('e value:', hsp.expect)
      print(hsp.query)
      print(hsp.match)
      print(hsp.sbjct)

More help with Biopython

pubmed_parser

Parser for Pubmed Open-Access XML Subset and MEDLINE XML Dataset

Popular python libraries

20 Python libraries you can’t live without

numpy

scipy

Hypergeometric test

feedparser

Never miss a Magazine article — build your own RSS notification system

Projects based on python

  • pithos Pandora on linux
  • Many Raspberry Pi GPIO projects
  • GeneScissors It also requires pip and scikit-learn packages.
  • KeepNote It depends on Python 2.X, sqlite and PyGTK.
  • Zim It depends on Python, Gtk and the python-gtk bindings.
  • Cherrytree It depends on Python2, Python-gtk2, Python-gtksourceview2, p7zip-full, python-enchant and python-dbus.

Qt for GUI development

Python 3

R and Python