Resources

https://en.wikipedia.org/wiki/Regular_expression
Learning Regular Expressions
Regular Expressions Tutorial
- https://ryanstutorials.net/regular-expressions-tutorial/regular-expressions-basics.php
- http://gnosis.cx/publish/programming/regular_expressions.html
- http://regextutorials.com/ Online interactive tutorials
Regular Expression testing
- http://rubular.com/
See ?grep (returns integer indices of matching elements), ?grepl (returns a logical vector) and ?regexpr (returns integer match positions) in R.
http://www.regular-expressions.info/rlanguage.html
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
http://www.johndcook.com/r_language_regex.html
http://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions
http://rpubs.com/Lionel/19068
http://ucfagls.wordpress.com/2012/08/15/processing-sample-labels-using-regular-expressions-in-r/
http://www.dummies.com/how-to/content/how-to-use-regular-expressions-in-r.html
http://www.r-bloggers.com/example-8-27-using-regular-expressions-to-read-data-with-variable-number-of-words-in-a-field/
http://www.r-bloggers.com/using-regular-expressions-in-r-case-study-in-cleaning-a-bibtex-database/
http://cbio.ensmp.fr/~thocking/papers/2011-08-16-directlabels-and-regular-expressions-for-useR-2011/2011-useR-named-capture-regexp.pdf
http://stackoverflow.com/questions/5214677/r-find-the-last-dot-in-a-string
http://stackoverflow.com/questions/10294284/remove-all-special-characters-from-a-string-in-r

Specific to R

Handling Strings with R by Gaston Sanchez
https://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions
https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
PCRE and newlines explains the differences between \r\n (newline on Windows), \n (newline on UNIX, hex 0A) and \r (newline on old Mac, hex 0D). The tab \t has hex 09.
http://www.autohotkey.com/docs/misc/RegEx-QuickRef.htm
http://opencompany.org/download/regex-cheatsheet.pdf
http://r-exercises.com/2016/10/30/regular-expressions-part-1/

Syntax

The following table is from endmemo.com.

Syntax	Description
\\d	Digit, 0,1,2 ... 9
\\D	Not Digit
\\s	Whitespace character
\\S	Non-whitespace character
\\w	Word character (letter, digit or underscore)
\\W	Non-word character
\\t	Tab
\\n	New line
^	Beginning of the string
$	End of the string
\	Escape special characters, e.g. \\ is "\", \+ is "+"
|	Alternation, e.g. /(e|d)n/ matches "en" and "dn"
.	Any character, except \n or line terminator
[ab]	a or b
[^ab]	Any character except a and b
[0-9]	Any digit
[A-Z]	Any uppercase letter A to Z
[a-z]	Any lowercase letter a to z
[A-Za-z]	Any uppercase or lowercase letter; note that [A-z] also matches the characters [ \ ] ^ _ ` that lie between Z and a in ASCII
i+	i at least one time
i*	i zero or more times
i?	i zero or 1 time
i{n}	i occurs n times in sequence
i{n1,n2}	i occurs n1 - n2 times in sequence
i{n1,n2}?	Non-greedy match: i occurs n1 - n2 times, using as few repetitions as possible
i{n,}	i occurs >= n times
[:alnum:]	Alphanumeric characters: [:alpha:] and [:digit:]
[:alpha:]	Alphabetic characters: [:lower:] and [:upper:]
[:blank:]	Blank characters: e.g. space, tab
[:cntrl:]	Control characters
[:digit:]	Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:]	Graphical characters: [:alnum:] and [:punct:]
[:lower:]	Lower-case letters in the current locale
[:print:]	Printable characters: [:alnum:], [:punct:] and space
[:punct:]	Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:]	Space characters: tab, newline, vertical tab, form feed, carriage return, space
[:upper:]	Upper-case letters in the current locale
[:xdigit:]	Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
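
A few of the table entries can be tried out directly in R. The following is only a minimal sketch; the sample strings are made up for illustration. It shows that POSIX classes must sit inside a bracket expression, that backslashes are doubled in R strings, and how greedy and non-greedy quantifiers differ.

```r
x <- c("abc123", "hello world", "A_1")   # made-up sample strings

# POSIX classes are only valid inside a bracket expression:
# write [[:digit:]], not [:digit:]
has_digit_posix <- grepl("[[:digit:]]", x)
has_digit_posix
# [1]  TRUE FALSE  TRUE

# In an R string the backslash itself must be escaped, so \d is typed "\\d"
has_digit_escape <- grepl("\\d", x)
has_digit_escape
# [1]  TRUE FALSE  TRUE

# Greedy: .* grabs as much as possible, so the whole string is removed
greedy <- sub("a.*b", "", "aXbYb")
greedy
# [1] ""

# Non-greedy: .*? grabs as little as possible, so only "aXb" is removed
lazy <- sub("a.*?b", "", "aXbYb")
lazy
# [1] "Yb"
```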

grep()

Using value = TRUE returns the matching elements themselves instead of their indices.
Using invert = TRUE returns the indices or values of the elements that do not match.
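
A short sketch of these options, using a made-up vector of file names (pkgs here is hypothetical, not taken from any real package list):

```r
pkgs <- c("foo.zip", "bar.tar.gz", "baz.zip", "README")  # hypothetical names

idx <- grep("\\.zip$", pkgs)                 # indices of matching elements
idx
# [1] 1 3

vals <- grep("\\.zip$", pkgs, value = TRUE)  # the matching elements themselves
vals
# [1] "foo.zip" "baz.zip"

# elements that do NOT match
others <- grep("\\.zip$", pkgs, invert = TRUE, value = TRUE)
others
# [1] "bar.tar.gz" "README"

flags <- grepl("\\.zip$", pkgs)              # one logical value per element
flags
# [1]  TRUE FALSE  TRUE FALSE
```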

sub() and gsub()

The sub function changes only the first occurrence of the regular expression, while the gsub function performs the substitution on all occurrences within the string.
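
The difference can be seen on a small made-up string. The last line additionally illustrates backreferences (\\1, \\2) in the replacement, which are not discussed above but are part of the same functions.

```r
s <- "a-b-c"                 # made-up input

first_only <- sub("-", "+", s)    # only the first "-" is replaced
first_only
# [1] "a+b-c"

every <- gsub("-", "+", s)        # every "-" is replaced
every
# [1] "a+b+c"

# Backreferences: \\1 and \\2 refer to the first and second parenthesized group
swapped <- sub("(\\w+) (\\w+)", "\\2 \\1", "hello world")
swapped
# [1] "world hello"
```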

regexpr() and gregexpr()

The output from these functions is a vector of starting positions of the regular expressions which were found; if no match occurred, a value of -1 is returned.

The regexpr function will only provide information about the first match in its input string(s), while the gregexpr function returns information about all matches found.

Note that in C++, std::string::find() (for literal substrings) and Qt's QRegExp::indexIn() can do what R's regexpr() does. For a gregexpr()-like result, C++11's std::sregex_iterator can be used to iterate over all matches.

The following example comes from the book 'Data Manipulation with R' by Phil Spector, Chapter 7, Character Manipulation.

tst = c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine')
wh = regexpr('[a-z][0-9]', tst)
wh
# [1] 5 7 -1 1
# attr(,"match.length")
# [1] 2 2 -1 2

wh1 = gregexpr('[a-z][0-9]',tst) # return a list just like strsplit()
wh1

# [[1]]
# [1]  5 12
# attr(,"match.length")
# [1] 2 2
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# [1]  7 15
# attr(,"match.length")
# [1] 2 2
# attr(,"useBytes")
# [1] TRUE
#
# [[3]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"useBytes")
# [1] TRUE
#
# [[4]]
# [1] 1
# attr(,"match.length")
# [1] 2
# attr(,"useBytes")
# [1] TRUE

gregexpr("'", "|3'-5'") # find the apostrophe character
# [[1]]
# [1] 3 6
# attr(,"match.length")
# [1] 1 1
# attr(,"useBytes")
# [1] TRUE
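
The positions returned by regexpr() and gregexpr() are often passed straight to the base R function regmatches() to pull out the matched text. A minimal sketch, reusing the tst vector from the example above:

```r
tst <- c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine')

# First match per element; elements without a match are dropped
first <- regmatches(tst, regexpr('[a-z][0-9]', tst))
first
# [1] "x7" "c5" "a8"

# All matches per element, returned as a list with one entry per input string;
# an element without a match yields character(0)
all_matches <- regmatches(tst, gregexpr('[a-z][0-9]', tst))
all_matches[[1]]
# [1] "x7" "b1"
length(all_matches[[3]])
# [1] 0
```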

Examples

sub("^.*boundary=", "", string) will replace everything from the start of the string up to and including 'boundary=' with an empty string. Here ^ means the beginning of the string, the dot means any character and the star means the preceding item zero or more times.
grep("\\.zip$", pkgs) or grep("\\.tar\\.gz$", pkgs) will search for strings ending with .zip or .tar.gz.
biocLite(suppressUpdates=c("^org\\.", "^BSgenome\\.")) will not update any package whose name starts with "org." or "BSgenome.".
grep("9.11", string) will search for strings containing '9', then any single character (separating 9 and 11), then '11'.
The pipe metacharacter | is translated to 'or'. flood|fire will match strings containing flood or fire.
[^?.]$ will match strings whose last character ($) is anything other than (^ inside []) a question mark (?) or a period (.).
^[Gg]ood|[Bb]ad will match strings starting with Good/good and anywhere containing Bad/bad.
^([Gg]ood|[Bb]ad) will look for strings beginning with Good/good/Bad/bad.
? character; it means optional. [Gg]eorge( [Ww]\.)? [Bb]ush will match strings like 'george bush', 'George W. Bush' or 'george bushes'. Note that we escape the metacharacter dot by '\.' so it becomes a literal period.
Star and plus sign: the star means any number of repetitions, including none, and the plus means at least one. For example, \(.*\) matches a pair of literal parentheses with anything between them, as in 'abc(222 )' and '()'.
[0-9]+ (.*) [0-9]+ will match a number, followed by a space, any number of characters (.*), a space and another number; e.g. '4 by 5 size'. A string such as 'afda1080 p' does not match because it has no second number.
gsub("[[:space:]]+", " ", "  ab   c  ") will replace each run of whitespace with a single space.
{} are called interval quantifiers; they specify the minimum and maximum number of matches of an expression.
The trimws() function removes leading/trailing whitespace. The function is used in several places; a definition of it is:

trimws <-
function(x, which = c("both", "left", "right"))
{
    which <- match.arg(which)
    mysub <- function(re, x) sub(re, "", x, perl = TRUE)
    if(which == "left")
        return(mysub("^[ \t\r\n]+", x))
    if(which == "right")
        return(mysub("[ \t\r\n]+$", x))
    mysub("[ \t\r\n]+$", mysub("^[ \t\r\n]+", x))
}

Another solution to trim leading/trailing space is

# returns string w/o leading whitespace
trim.leading <- function (x)  sub("^\\s+", "", x)

# returns string w/o trailing whitespace
trim.trailing <- function (x) sub("\\s+$", "", x)

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
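
Several of the patterns listed in this section can be checked with grepl() and gsub(). The input strings below are made up for illustration, so this is a sketch rather than an exhaustive test.

```r
x <- c("Good morning", "not so good", "too bad", "Bad start")  # made-up strings

# Without parentheses, ^ applies only to the first alternative
unanchored <- grepl("^[Gg]ood|[Bb]ad", x)
unanchored
# [1]  TRUE FALSE  TRUE  TRUE

# With parentheses, ^ applies to both alternatives
anchored <- grepl("^([Gg]ood|[Bb]ad)", x)
anchored
# [1]  TRUE FALSE FALSE  TRUE

# Optional group: the middle initial may or may not be present
bush <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush",
              c("george bush", "George W. Bush", "george bushes", "George Walker Bush"))
bush
# [1]  TRUE  TRUE  TRUE FALSE

# Collapse runs of whitespace, then trim both ends
squeezed <- gsub("[[:space:]]+", " ", "  ab   c  ")
squeezed
# [1] " ab c "
trimmed <- gsub("^\\s+|\\s+$", "", squeezed)
trimmed
# [1] "ab c"

# Interval quantifier: between 2 and 4 digits, nothing else
interval <- grepl("^[0-9]{2,4}$", c("1", "12", "12345"))
interval
# [1] FALSE  TRUE FALSE
```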

Special case: match the dot character

See Chapter 11: Strings with stringr in 'R for Data Science' by Hadley Wickham.

The printed representation of a string shows the escapes. To see the raw contents of the string, use writeLines().

x <- c("\"", "\\") # escape ", \
x
# [1] "\"" "\\"
writeLines(x)
# "
# \

"." matches any character. To match the dot character literally we shall use "\\.".

# We want to match the dot character literally
writeLines("\.")
# Error: '\.' is an unrecognized escape in character string starting ""\."

# . should be represented as \. but \ itself should be escaped so
# to escape ., we should use \\.
writeLines("\\.")
# \.
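
To see the effect on actual matching, here is a small sketch with made-up file names. It compares the unescaped dot with three ways of matching a literal dot: "\\.", the bracket expression "[.]", and fixed = TRUE, which turns off regular expression interpretation altogether.

```r
files <- c("a.txt", "abtxt", "a-txt")   # made-up names

any_char <- grepl("a.txt", files)       # unescaped dot matches any character
any_char
# [1] TRUE TRUE TRUE

escaped <- grepl("a\\.txt", files)      # escaped dot matches only "."
escaped
# [1]  TRUE FALSE FALSE

bracket <- grepl("a[.]txt", files)      # inside [] the dot is literal
bracket
# [1]  TRUE FALSE FALSE

literal <- grepl("a.txt", files, fixed = TRUE)  # pattern treated as plain text
literal
# [1]  TRUE FALSE FALSE
```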

Special case: match the backslash \

x <- "a\\b"
writeLines(x)
# a\b

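library(stringr)     # str_view() comes from the stringr package
str_view(x, "\\\\")  # the regex \\ (typed as "\\\\" in R) matches one literal backslash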
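
For readers without stringr, the same idea can be sketched in base R. The second element of x is a made-up string without a backslash, added only for contrast.

```r
x <- c("a\\b", "ab")   # first element contains one backslash, second has none

# Regex: four backslashes in R source -> two in the string -> one literal backslash
found <- grepl("\\\\", x)
found
# [1]  TRUE FALSE

# With fixed = TRUE the pattern is plain text, so "\\" (a single backslash) is enough
found_fixed <- grepl("\\", x, fixed = TRUE)
found_fixed
# [1]  TRUE FALSE

# Replace each backslash with a forward slash
slashed <- gsub("\\\\", "/", x)
slashed
# [1] "a/b" "ab"
```

In R 4.0.0 or later, the raw-string literal r"(\\)" is another way to type the same two-character regular expression without doubling the backslashes.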
