Revision as of 09:03, 7 June 2023

Resources

https://en.wikipedia.org/wiki/Regular_expression
Learning Regular Expressions
Regular Expressions Tutorial
- https://ryanstutorials.net/regular-expressions-tutorial/regular-expressions-basics.php
- http://gnosis.cx/publish/programming/regular_expressions.html
- http://regextutorials.com/ Online interactive tutorials
Regular Expression testing
- http://rubular.com/
?grep (returns numeric values), ?grepl (returns a logical vector) and ?regexpr (returns numeric values) in R.
http://www.regular-expressions.info/rlanguage.html
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
http://www.johndcook.com/r_language_regex.html
http://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions
http://rpubs.com/Lionel/19068
http://ucfagls.wordpress.com/2012/08/15/processing-sample-labels-using-regular-expressions-in-r/
http://www.dummies.com/how-to/content/how-to-use-regular-expressions-in-r.html
http://www.r-bloggers.com/example-8-27-using-regular-expressions-to-read-data-with-variable-number-of-words-in-a-field/
http://www.r-bloggers.com/using-regular-expressions-in-r-case-study-in-cleaning-a-bibtex-database/
http://cbio.ensmp.fr/~thocking/papers/2011-08-16-directlabels-and-regular-expressions-for-useR-2011/2011-useR-named-capture-regexp.pdf
http://stackoverflow.com/questions/5214677/r-find-the-last-dot-in-a-string

Remove all special characters from a string in R?

gsub("[^[:alnum:]]", "", x)
gsub("[[:punct:]]", " ", x)

PCRE and newlines explains the differences between \r\n (newline for Windows), \n (newline for UNIX, hex 0A) and \r (newline for old Mac, hex 0D). The tab \t has hex 09.
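A minimal sketch of normalizing mixed line endings to \n in R. The input string is made up for illustration.

```r
# Hypothetical input mixing the three line-ending conventions
x <- "one\r\ntwo\rthree\nfour"

# Turn every \r\n pair and every lone \r into \n
normalized <- gsub("\r\n|\r", "\n", x)

lines <- strsplit(normalized, "\n", fixed = TRUE)[[1]]
lines
# [1] "one"   "two"   "three" "four"
```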
How to Use Regular Expressions (regexes) on Linux

Specific to R

https://stringr.tidyverse.org/articles/regular-expressions.html, Cheat sheet from RStudio
- Basic matches
- Escaping, "\\."
- Special characters: "\e", "\f", "\n", "\r", "\t"
- Matching multiple characters, "\d" matches any digit, "\s" matches any whitespace, "\w" matches any word character, "\p", "\b", ...
- Alternation/ Or: "|"
- Grouping: parentheses
- Anchors: "^" matches the start of string, "$" matches the end of the string.
- Repetition: "?", "+", "*", "[]", "{n}", "{n, }", "{n, m}", "??", "+?", "*?"
- Look arounds: "(?=...)", "(?!...)", "(?<=...)"
- Comments: "(?#...)"
Handling Strings with R ebook by Gaston Sanchez
https://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions
https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
http://r-exercises.com/2016/10/30/regular-expressions-part-1/
Demystifying Regular Expressions in R
A Beginners Guide to Match Any Pattern Using Regular Expressions in R
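The look arounds listed above also work in base R, but only with perl = TRUE (the default TRE engine does not support them). A small sketch with made-up strings:

```r
x <- c("price: 100 USD", "weight: 20 kg")

# Lookahead: digits only when followed by " USD"
usd <- regmatches(x, regexpr("[0-9]+(?= USD)", x, perl = TRUE))
usd
# [1] "100"

# Lookbehind: digits only when preceded by "weight: "
kg <- regmatches(x, regexpr("(?<=weight: )[0-9]+", x, perl = TRUE))
kg
# [1] "20"

# Negative lookahead: strings that do not start with "price"
not_price <- grepl("^(?!price)", x, perl = TRUE)
not_price
# [1] FALSE  TRUE
```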

Specific to Terminal

Using Grep & Regular Expressions to Search for Text Patterns in Linux & Extended Regular Expressions.

Grep Multiple Strings from a File in Linux

Online tools

RegExplain & RStudio Addin https://github.com/gadenbuie/regexplain

Metacharacters

There are 12 metacharacters. It seems "]", "}" and "-" are not metacharacters.

.   \   |   (   )   [   {   $   ^   *   +   ?

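If we want to match them literally, we need to precede them with a double backslash in an R string: the first backslash escapes the second one for R, and the remaining backslash escapes the metacharacter for the regex engine.

gsub("$", ".", "abc$def") # "abc$def." because an unescaped $ matches the end of the string
gsub("\\$", ".", "abc$def") # "abc.def"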

metachar <- scan(textConnection(".   \\   |   (   )   [   ]   {   }   $   -    ^   *   +   ?"), "")
# Read 15 items
metachar
# [1] "."  "\\" "|"  "("  ")"  "["  "]"  "{"  "}"  "$"  "-"  "^"  "*"  "+"  "?" 
nchar(metachar[2])
# [1] 1
grep("\\.", metachar, value = TRUE) # "."
grep("\\.", metachar) # 1
grep("\\\\", metachar) # 2
grep("]", metachar)  #  7
grep("}", metachar)  #  9
grep("{", metachar) 
# Error in grep("{", metachar) : 
#  invalid regular expression '{', reason 'Missing '}''
grep("\\$", metachar) # 10
grep("-", metachar)  # 11

strsplit("abc.def", "\\.")
strsplit("abc.def", ".", fixed = TRUE)

"." matches any single character.
"+" the preceding item will be matched one or more times.
"*" the preceding item will be matched zero or more times.
"^" matches the empty string at the beginning of a line. When used as the first character inside a character class, it negates the class, i.e. it matches any character except the listed ones.
"$" matches the empty string at the end of a line.
"|" infix operator: OR
"(", ")" parentheses for grouping.
"[", "]" character class brackets

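A quick sketch of the metacharacters described above, using grepl() on a made-up vector:

```r
x <- c("abc", "a.c", "ac", "abbc", "xabc")

dot    <- grepl("a.c", x)      # "." is any single character
plus   <- grepl("ab+c", x)     # one or more "b"
star   <- grepl("^ab*c$", x)   # zero or more "b", anchored at both ends
either <- grepl("^x|bb", x)    # starts with "x" OR contains "bb"
class  <- grepl("a[.b]c", x)   # inside [], the dot is a literal dot

dot     # [1]  TRUE  TRUE FALSE FALSE  TRUE
plus    # [1]  TRUE FALSE FALSE  TRUE  TRUE
star    # [1]  TRUE FALSE  TRUE  TRUE FALSE
either  # [1] FALSE FALSE FALSE  TRUE  TRUE
class   # [1]  TRUE  TRUE FALSE FALSE  TRUE
```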
+

gsub(pattern = "\\.\\.", replace = ".", "id..of...patient")
# [1] "id.of..patient"   NOT RIGHT, need to apply the command multiple times

gsub(pattern = "\\.+", replace = ".", "id..of...patient")
# [1] "id.of.patient"

Character classes

Replacing all values that do not contain letters or digits with NA value. The following example is from here.

s <- c("", "  ", "3 times a day after meal", "once a day", "       ","  one per day ", "\t", "\n  ")
# Method 1
s[s==""|s=="  "|s=="       "|s=="\t"|s=="\n"]  # BAD

# Method 2
allIndices = 1:length(s)
letOrDigIndices = grep("[a-zA-Z0-9]", s)
blankInd = setdiff(allIndices, letOrDigIndices)
s[blankInd]

# Method 3
gsub("^$|^( +)$|[\t\n\r\f\v]+", NA, s)

# Method 4. Get rid of extra blank spaces
s1 = gsub("^([ \t\n\r\f\v]+)|([ \t\n\r\f\v]+)$", "", s)
gsub("^$", NA, s1)
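Another possible approach (not from the source of the example) is to test for the absence of any alphanumeric character with a POSIX class, then trim what is left with base R's trimws():

```r
s <- c("", "  ", "3 times a day after meal", "once a day",
       "       ", "  one per day ", "\t", "\n  ")

cleaned <- s
# Elements without a single letter or digit become NA
cleaned[!grepl("[[:alnum:]]", cleaned)] <- NA
# Strip leading/trailing whitespace from the remaining values
cleaned <- trimws(cleaned)
cleaned
# [1] NA  NA  "3 times a day after meal"  "once a day"  NA  "one per day"  NA  NA
```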

Syntax

The following table is from endmemo.com.

Syntax	Description
\\d	Digit, 0,1,2 ... 9
\\D	Not Digit
\\s	Space eg: sub('\\s', '\n', "abc ABC")
\\S	Not Space
\\w	Word character (letter, digit or underscore)
\\W	Not a word character
\\t	Tab
\\n	New line
^	Beginning of the string
$	End of the string
^KEY1.*KEY2$	Beginning and end of a string
\	Escape special characters, e.g. \\ is "\", \+ is "+"
|	Alternation match, e.g. (e|d)n matches "en" and "dn"
.	Any single character (in some engines, except \n or another line terminator); .* matches any run of such characters
[ab]	a or b
[^ab]	Any character except a and b
[0-9]	All Digit
[A-Z]	All uppercase A to Z letters
[a-z]	All lowercase a to z letters
[A-Za-z]	All uppercase and lowercase a to z letters (in ASCII order, [A-z] also matches [ \ ] ^ _ ` which lie between Z and a)
i+	i at least one time (Repetition)
i*	i zero or more times (Repetition)
i?	i zero or 1 time (Repetition)
i{n}	i occurs n times in sequence
i{n1,n2}	i occurs n1 - n2 times in sequence
i{n1,n2}?	non-greedy match: i occurs n1 - n2 times, preferring as few repetitions as possible
i{n,}	i occurs >= n times
[:alnum:]	Alphanumeric characters: [:alpha:] and [:digit:]
[:alpha:]	Alphabetic characters: [:lower:] and [:upper:]
[:blank:]	Blank characters: e.g. space, tab
[:cntrl:]	Control characters
[:digit:]	Digits: 0 1 2 3 4 5 6 7 8 9
[:graph:]	Graphical characters: [:alnum:] and [:punct:]
[:lower:]	Lower-case letters in the current locale
[:print:]	Printable characters: [:alnum:], [:punct:] and space
[:punct:]	Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[:space:]	Space characters: tab, newline, vertical tab, form feed, carriage return, space
[:upper:]	Upper-case letters in the current locale
[:xdigit:]	Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f
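A small sketch of greedy versus non-greedy repetition from the table above. The strings are made up; perl = TRUE is used for the interval case only as a cautious choice.

```r
x <- "<b>bold</b> and <i>italic</i>"

greedy <- sub("<.*>", "", x)     # from the first "<" to the LAST ">"
lazy   <- sub("<.*?>", "", x)    # from the first "<" to the NEXT ">"
all    <- gsub("<.*?>", "", x)   # strip every tag

greedy  # [1] ""
lazy    # [1] "bold</b> and <i>italic</i>"
all     # [1] "bold and italic"

# Interval quantifier: greedy takes 4 a's, non-greedy takes 2
int_greedy <- sub("a{2,4}", "X", "aaaaa", perl = TRUE)
int_lazy   <- sub("a{2,4}?", "X", "aaaaa", perl = TRUE)
int_greedy  # [1] "Xa"
int_lazy    # [1] "Xaaa"
```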

File globs

Asterisk: * matches any number of any characters

pattern	match
file.*	file.txt, file.jpg, file.tar.gz
file*.txt	file1.txt, file123.txt

Question mark: ? matches one of any character

pattern	match
file?.txt	file1.txt, filea.txt
file??.txt	file10.txt, fileab.txt
?.jpg	a.jpg, 2.jpg

Character sets: [] matches one character in the list

pattern	match
file[0-9].txt	file1.txt, file2.txt
file[a-z].txt	filea.txt, fileb.txt
file[abc123].jpg	filea.jpg, file1.jpg

Character sets: [-] matches a hyphen

pattern	match
file[-0-9].txt	file-.txt, file1.txt

Character sets: [! ] negates a match

pattern	match
file[!0-9].txt	filea.txt, fileb.txt

Character classes: [: :] matches one character of a certain type

pattern	match
[:digit:]	numbers
[:upper:]	upper case characters
[:lower:]	lower case characters
[:alpha:]	upper and lower case letters
[:alnum:]	upper and lower case plus numbers
[:space:]	spaces, tabs, and newlines
[:graph:]	printable characters, not including spaces
[:print:]	printable characters, including spaces
[:punct:]	punctuation
[:cntrl:]	nonprintable control characters
[:xdigit:]	hexadecimal characters

Using character classes

ls file[0-9].txt
ls file[[:digit:]].txt
ls file[[:digit:][:space:]].txt

Negating character

ls file[![:digit:]].txt
ls file[![:digit:][:space:]].txt

Brace expansion

ls {*.jpg,*.gif,*.png}
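Back in R, the simple * and ? wildcards can be turned into a regular expression with utils::glob2rx(). This is only a sketch with made-up file names; it does not cover character sets or brace expansion.

```r
files <- c("file.txt", "file1.txt", "file10.txt", "filea.jpg", "notes.txt")

# "*" in a glob becomes ".*", "?" becomes ".", and the literal dot is escaped
rx_star <- utils::glob2rx("file*.txt")
rx_q    <- utils::glob2rx("file?.txt")

star_hits <- files[grepl(rx_star, files)]
q_hits    <- files[grepl(rx_q, files)]

star_hits  # [1] "file.txt"   "file1.txt"  "file10.txt"
q_hits     # [1] "file1.txt"
```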

Extended globs

To turn it on, shopt -s extglob

?(match): 0 or 1 occurrence of pattern

pattern	match
file?(abc).txt	file.txt, fileabc.txt

+(match): 1 or more occurrences of pattern

pattern	match
file+(abc).txt	fileabc.txt, fileabcabc.txt

(match|match): match one or the other

pattern	match
+(photo|Photo)*+(.jpg|.gif)	photo.jpg, Photo.jpg, photo.gif, Photo.gif

*(match): 0 or more occurrences of pattern

pattern	match
photo*(abc).jpg	photo.jpg, photoabc.jpg, photoabcabc.jpg

!(match): inverts the match

pattern	match
!(*.gif)	file.txt, fileabc.txt, fileabcabc.txt

grep()

Use value = TRUE to return the matching elements instead of their indices
Use invert = TRUE to return the indices or values of the elements that do not match
Use ignore.case = TRUE to match case-insensitively
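The three options side by side, on a made-up vector:

```r
x <- c("apple", "Banana", "cherry", "banana split")

idx      <- grep("banana", x)                               # indices (default)
vals     <- grep("banana", x, value = TRUE)                 # matching elements
nocase   <- grep("banana", x, ignore.case = TRUE)           # case-insensitive
inverted <- grep("banana", x, invert = TRUE, value = TRUE)  # non-matching elements

idx       # [1] 4
vals      # [1] "banana split"
nocase    # [1] 2 4
inverted  # [1] "apple"  "Banana" "cherry"
```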

Ref:

Regular Expressions In grep examples

grepl() and fixed parameter

Test if characters are in a string
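A short sketch of what fixed = TRUE changes: the pattern is taken as a literal string, so metacharacters lose their special meaning. The vector is made up.

```r
x <- c("a.b", "axb", "a+b")

dot_regex <- grepl("a.b", x)                 # "." matches any character
dot_fixed <- grepl("a.b", x, fixed = TRUE)   # literal "a.b" only
plus_regex <- grepl("a+b", x)                # one or more "a" directly followed by "b"
plus_fixed <- grepl("a+b", x, fixed = TRUE)  # literal "a+b" only

dot_regex   # [1] TRUE TRUE TRUE
dot_fixed   # [1]  TRUE FALSE FALSE
plus_regex  # [1] FALSE FALSE FALSE
plus_fixed  # [1] FALSE FALSE  TRUE
```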

sub() and gsub()

The sub function changes only the first occurrence of the regular expression, while the gsub function performs the substitution on all occurrences within the string.

To extract the filename without extension,

sub('\\..*$', '', Filename)
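A sketch of sub() versus gsub(), plus a variant of the filename pattern. Note that '\\..*$' cuts at the first dot; if only the last extension should go, one option is to match a dot followed by non-dot characters. The file names are made up.

```r
first_only <- sub("a", "A", "banana")    # only the first "a"
every_one  <- gsub("a", "A", "banana")   # every "a"
first_only  # [1] "bAnana"
every_one   # [1] "bAnAnA"

f <- c("report.txt", "archive.tar.gz")
from_first_dot <- sub("\\..*$", "", f)     # drop everything from the first dot
last_ext_only  <- sub("\\.[^.]*$", "", f)  # drop only the last extension
from_first_dot  # [1] "report"  "archive"
last_ext_only   # [1] "report"      "archive.tar"
```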

regexpr() and gregexpr()

The output from these functions is a vector of starting positions of the regular expressions which were found; if no match occurred, a value of -1 is returned.

The regexpr function will only provide information about the first match in its input string(s), while the gregexpr function returns information about all matches found.

Note that in C++, std::string::find() and Qt's QRegExp::indexIn() can do what R's regexpr() does. I am not aware of any gregexpr()-equivalent function in C++.

The following example is coming from the book 'Data Manipulation with R' by Phil Spector, Chapter 7, Character Manipulation.

tst = c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine')
wh = regexpr('[a-z][0-9]', tst)
wh
# [1] 5 7 -1 1
# attr(,"match.length")
# [1] 2 2 -1 2

wh1 = gregexpr('[a-z][0-9]',tst) # return a list just like strsplit()
wh1

# [[1]]
# [1]  5 12
# attr(,"match.length")
# [1] 2 2
# attr(,"useBytes")
# [1] TRUE
#
# [[2]]
# [1]  7 15
# attr(,"match.length")
# [1] 2 2
# attr(,"useBytes")
# [1] TRUE
#
# [[3]]
# [1] -1
# attr(,"match.length")
# [1] -1
# attr(,"useBytes")
# [1] TRUE
#
# [[4]]
# [1] 1
# attr(,"match.length")
# [1] 2
# attr(,"useBytes")
# [1] TRUE

gregexpr("'", "|3'-5'") # find the apostrophe character
# [[1]]
# [1] 3 6
# attr(,"match.length")
# [1] 1 1
# attr(,"useBytes")
# [1] TRUE
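The positions returned by regexpr() and gregexpr() can be turned into the matched text with regmatches(). A sketch reusing the tst vector from above:

```r
tst <- c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine')

# First match per element; elements without a match are dropped
first <- regmatches(tst, regexpr('[a-z][0-9]', tst))
first
# [1] "x7" "c5" "a8"

# All matches per element, returned as a list (character(0) when there is none)
all_matches <- regmatches(tst, gregexpr('[a-z][0-9]', tst))
all_matches[[1]]
# [1] "x7" "b1"
length(all_matches[[3]])
# [1] 0
```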

Examples

Replace a substring that starts with 0 or more characters followed by 'boundary=' with an empty string. Here ^ means beginning, dot means any character and star means the preceding item 0 or more times.

sub("^.*boundary=", "", string)

Search for the string ending with .zip or .tar.gz

grep("\\.zip$", pkgs) # or 
grep("\\.tar\\.gz$", pkgs)

Do not update any package whose name starts with "org." or "BSgenome."

biocLite(suppressUpdates=c("^org\\.", "^BSgenome\\."))

search for the string containing '9', any character (to split 9 & 11) and '11'.

grep("9.11", string)

pipe metacharacter; it is translated to 'or'. flood|fire will match strings containing flood or fire.

[^?.]$ will match any string whose last character ($) is neither (^ inside []) a question mark (?) nor a period (.).
^[Gg]ood|[Bb]ad will match strings starting with Good/good and anywhere containing Bad/bad.
^([Gg]ood|[Bb]ad) will look for strings beginning with Good/good/Bad/bad.
? character; it means optional. [Gg]eorge( [Ww]\.)? [Bb]ush will match strings like 'george bush', 'George W. Bush' or 'george bushes'. Note that we escape the metacharacter dot by '\.' so it becomes a literal period.
star and plus sign. star means any number including none and plus means at least one. For example, (.*) matches 'abc(222 )' and '()'.
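The anchor and optional-group patterns above can be checked quickly in R (the backslash is doubled inside an R string). The test strings are made up.

```r
x <- c("Good morning", "not good", "too bad", "Bad idea")

# Without grouping, ^ applies only to the first alternative
ungrouped <- grepl("^[Gg]ood|[Bb]ad", x)
# With grouping, ^ applies to both alternatives
grouped <- grepl("^([Gg]ood|[Bb]ad)", x)

ungrouped  # [1]  TRUE FALSE  TRUE  TRUE
grouped    # [1]  TRUE FALSE FALSE  TRUE

# Optional middle initial: " W." may or may not be present
y <- c("george bush", "George W. Bush", "george w bush")
optional <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", y)
optional   # [1]  TRUE  TRUE FALSE
```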

Regular Expression in Base R Regex to identify email address & How to extract expression matching an email address in a text file using R or Command Line?
Extract digits from a string.

gsub("\\D", "", c("i have 10 app", "call for 2 cups") ) # c("10", "2")

[0-9]+ (.*) [0-9]+ will match a number followed by a space, any number of characters (.*), another space and a number; e.g. '4 by 5 size'.
replace multiple spaces with 1 space.

gsub("[[:space:]]+", " ", "  ab  c  ")

remove characters after period

gsub("\\..*", "", string)

remove all characters that are not digits (0-9)

gsub("[^0-9]", "", "Jan.-Feb. 1973") # Jan.-Feb. 1973 -> 1973

remove everything before the last forward slash in the URL. ".*/", matches any character (.) occurring zero or more times (*) followed by a forward slash (/)

gsub(".*/", "", url) # https://abc.def/file.xlsx -> file.xlsx

{} are referred to as interval quantifiers; they specify the minimum and maximum number of matches of an expression.
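A sketch of interval quantifiers on made-up digit strings. The anchors ^ and $ are added so that the whole string has to fit the interval.

```r
x <- c("7", "42", "2023", "12345")

two_to_four   <- grepl("^[0-9]{2,4}$", x)  # between 2 and 4 digits
three_or_more <- grepl("^[0-9]{3,}$", x)   # at least 3 digits
exactly_two   <- grepl("^[0-9]{2}$", x)    # exactly 2 digits

two_to_four    # [1] FALSE  TRUE  TRUE FALSE
three_or_more  # [1] FALSE FALSE  TRUE  TRUE
exactly_two    # [1] FALSE  TRUE FALSE FALSE
```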

trimws() function to remove trailing/leading whitespace. The function is used in several places.

trimws <-
function(x, which = c("both", "left", "right"))
{
    which <- match.arg(which)
    mysub <- function(re, x) sub(re, "", x, perl = TRUE)
    if(which == "left")
        return(mysub("^[ \t\r\n]+", x))
    if(which == "right")
        return(mysub("[ \t\r\n]+$", x))
    mysub("[ \t\r\n]+$", mysub("^[ \t\r\n]+", x))
}

Another solution to trim leading/trailing space is

# returns string w/o leading whitespace
trim.leading <- function (x)  sub("^\\s+", "", x)

# returns string w/o trailing whitespace
trim.trailing <- function (x) sub("\\s+$", "", x)

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

Extract/replace text between parentheses https://stackoverflow.com/a/13498914, stringr::str_replace()
```
gsub("\\(.*\\)", "", c("0.1385(+)", "0.33", "0.12(-)"))
```
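To extract rather than remove the text between parentheses, one possible base R sketch uses regmatches() for the whole "(...)" part, or a capture group for the content only. The helper names here are illustrative.

```r
x <- c("0.1385(+)", "0.33", "0.12(-)")

# The parenthesized part, for elements that have one
with_parens <- regmatches(x, regexpr("\\(.*\\)", x))
with_parens
# [1] "(+)" "(-)"

# Only the content inside the parentheses; NA when there are none
inside <- ifelse(grepl("\\(", x),
                 sub(".*\\((.*)\\).*", "\\1", x),
                 NA_character_)
inside
# [1] "+" NA  "-"
```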

Replace "\n", "?" or ":" character. rsync will not be able to copy files if these characters appeared in the filename. Below is an R snippet to fix this problem.

dirns <- dir("~/Documents", full.names = TRUE)
for(dirn in dirns) {
  setwd(dirn)
  x <- list.files(".")
  y <- gsub("\n|\\?", "", x) # remove \n or ? character
  y <- gsub(":", ". ", y)    # replace : with ". " (a period and a space)
  #   y <- gsub("\\s+", "_", y)  # replace space with _
  file.rename(x, y)
}

Special case: match the dot character

See Chapter 11: Strings with stringr in 'R for Data Science' by Hadley Wickham.

The printed representation of a string shows the escapes. To see the raw contents of the string, use writeLines().

x <- c("\"", "\\") # escape ", \
x
# [1] "\"" "\\"
writeLines(x)
# "
# \

"." matches any character. To match the dot character literally we shall use "\\.".

# We want to match the dot character literally
writeLines("\.")
# Error: '\.' is an unrecognized escape in character string starting ""\."

# . should be represented as \. but \ itself should be escaped so
# to escape ., we should use \\.
writeLines("\\.")
# \.

Suppose we have a string like "UCEC.transcriptome__unc_edu__Level_3__unc_lowess_normalization_gene_level__data.data.txt" and we want to get "UCEC" in output. We can use

mystring %>% gsub(x=., pattern = "\\..*", replacement = "")

Special case: match the backslash \

x <- "a\\b"
writeLines(x)
# a\b

str_view(x, "\\\\")

Replace single backslash in R. Note the fixed=TRUE option.

gsub("\\", "", str2, fixed=TRUE)
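A sketch comparing the two ways of matching a backslash: four backslashes for a regex pattern, two with fixed = TRUE. The path is made up.

```r
x <- "C:\\temp\\file.txt"   # the actual contents are C:\temp\file.txt
n <- nchar(x)
n
# [1] 16

# Regex: "\\\\" in R source is the two-character pattern \\, i.e. one literal backslash
via_regex <- gsub("\\\\", "/", x)
# Fixed: "\\" in R source is a single literal backslash
via_fixed <- gsub("\\", "/", x, fixed = TRUE)

via_regex  # [1] "C:/temp/file.txt"
via_fixed  # [1] "C:/temp/file.txt"
```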

Approximate matching

TRE library

> names <- c("Konrad", "Conrad", "Konard", "Connard", "con rat", "Conga rat")
> grep("(Konrad){~2}", names, value = TRUE)
[1] "Konrad" "Conrad" "Konard"
> grep("(Konrad){~1}", names, value = TRUE)
[1] "Konrad" "Conrad"
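A different way to look at the same names (not the TRE {~n} syntax used above) is to compute edit distances with utils::adist() and filter on them. This is a sketch; the vector is called nm here to avoid masking base::names().

```r
nm <- c("Konrad", "Conrad", "Konard", "Connard", "con rat", "Conga rat")

# Generalized Levenshtein distance from "Konrad" to each name
d <- drop(utils::adist("Konrad", nm))

within_1 <- nm[d <= 1]
within_2 <- nm[d <= 2]

within_1  # [1] "Konrad" "Conrad"
within_2  # [1] "Konrad" "Conrad" "Konard"
```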

stringr package

# Repetition
stringr::str_detect(c("OID01216", "OID01493"), "OID[0-9]{5}")

# the 2nd element has 1 more numerical character after OID; still matched
stringr::str_detect(c("OID01216", "OID012165"), "OID[0-9]{5}")
# [1] TRUE TRUE

# the 2nd element has 1 less character; not matched
stringr::str_detect(c("OID01216", "OID0121"), "OID[0-9]{5}")
# [1]  TRUE FALSE
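If the extra digit should make the match fail, the pattern can be anchored with ^ and $. A base R sketch with grepl(), which avoids the stringr dependency:

```r
ids <- c("OID01216", "OID012165", "OID0121")

unanchored <- grepl("OID[0-9]{5}", ids)    # 5 digits somewhere after "OID"
anchored   <- grepl("^OID[0-9]{5}$", ids)  # exactly "OID" plus 5 digits

unanchored  # [1]  TRUE  TRUE FALSE
anchored    # [1]  TRUE FALSE FALSE
```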

RegExplain package

RegExplain is an RStudio addin slash utility belt for regular expressions. Interactively build your regexp.

Shell

How to sed remove last character from each line

$ echo "I love donuts two times a dayz" | sed 's/.$//'  # remove the last char
I love donuts two times a day

$ echo "I love donuts two times a dayz" | sed 's/[[:alpha:]]$//'     
I love donuts two times a day

$ echo "I love donuts two times a day1" | sed 's/[[:alpha:]]$//'   
I love donuts two times a day1

$ echo "I love donuts two times a dayz" | sed 's/.[[:alpha:]]$//'    
I love donuts two times a da

$ echo "I love donuts two times a day" | sed 's/[[:blank:]]//'  # only the 1st instance
Ilove donuts two times a day
$ echo "I love donuts two times a day" | sed 's/[[:blank:]]//g' # all instances
Ilovedonutstwotimesaday

 * remove everything before the last forward slash in the URL. ".*/", matches any character (.) occurring zero or more times (*) followed by a forward slash (/)
 :<syntaxhighlight lang='rsplus'>