Regular expression: Difference between revisions
* http://cbio.ensmp.fr/~thocking/papers/2011-08-16-directlabels-and-regular-expressions-for-useR-2011/2011-useR-named-capture-regexp.pdf | * http://cbio.ensmp.fr/~thocking/papers/2011-08-16-directlabels-and-regular-expressions-for-useR-2011/2011-useR-named-capture-regexp.pdf | ||
* http://stackoverflow.com/questions/5214677/r-find-the-last-dot-in-a-string | * http://stackoverflow.com/questions/5214677/r-find-the-last-dot-in-a-string | ||
<ul>
<li>[https://stackoverflow.com/a/10294818 Remove all special characters from a string in R?]
<pre>
gsub("[^[:alnum:]]", "", x)
gsub("[[:punct:]]", " ", x)
</pre></li></ul>
* [https://nikic.github.io/2011/12/10/PCRE-and-newlines.html PCRE and newlines] explains the differences between \r\n (newline on Windows), \n (newline on UNIX, hex 0A) and \r (newline on old Mac, hex 0D). The tab \t has hex 09.
* [https://www.howtogeek.com/661101/how-to-use-regular-expressions-regexes-on-linux/ How to Use Regular Expressions (regexes) on Linux] | |||
== Specific to R == | |||
* https://stringr.tidyverse.org/articles/regular-expressions.html, [https://github.com/rstudio/cheatsheets/blob/master/strings.pdf Cheat sheet from RStudio] | |||
** Basic matches | |||
** Escaping, "\\." | |||
** Special characters: "\e", "\f", "\n", "\r", "\t" | |||
** Matching multiple characters, "\d" matches any digit, "\s" matches any whitespace, "\w" matches any word character, "\p", "\b", ...
** Alternation/ Or: "|" | |||
** Grouping: parentheses | |||
** Anchors: "^" matches the start of string, "$" matches the end of the string. | |||
** Repetition: "?", "+", "*", "[]", "{n}", "{n, }", "{n, m}", "??", "+?", "*?" | |||
** Look arounds: "(?=...)", "(?!...)", "(?<=...)" | |||
** Comments: "(?#...)" | |||
* [http://www.gastonsanchez.com/r4strings/regex1.html Handling Strings with R] ebook by Gaston Sanchez | |||
* https://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions | * https://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions | ||
* https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf | * https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf | ||
* http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf | * http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf | ||
* http://r-exercises.com/2016/10/30/regular-expressions-part-1/ | * http://r-exercises.com/2016/10/30/regular-expressions-part-1/ | ||
* [https://blog.rsquaredacademy.com/regular-expression-in-r/ Demystifying Regular Expressions in R] | |||
* [https://regenerativetoday.com/a-beginners-guide-to-match-any-pattern-using-regular-expressions-in-r/ A Beginners Guide to Match Any Pattern Using Regular Expressions in R] | |||
== Specific to Terminal == | |||
[https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux#grouping Using Grep & Regular Expressions to Search for Text Patterns in Linux] & Extended Regular Expressions. | |||
[https://www.putorius.net/grep-multiple-strings-file.html Grep Multiple Strings from a File in Linux] | |||
== Online tools == | |||
* [https://regexr.com/ RegExplain] & RStudio Addin https://github.com/gadenbuie/regexplain | |||
= Metacharacters = | = Metacharacters = | ||
Line 62: | Line 86: | ||
strsplit("abc.def", "\\.") | strsplit("abc.def", "\\.") | ||
strsplit("abc.def", ".", fixed = TRUE) | strsplit("abc.def", ".", fixed = TRUE) | ||
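strsplit("abc,def ghi -jk lm", "[,\\s-]+") # "\\s" inside [] is not treated as whitespace by the default (TRE) engine, so spaces are not split
# [1] "abc" "def ghi " "jk lm"
strsplit("abc,def ghi -jk lm", "[,[:space:]-]+") # split with multiple delimiters (or keep "[,\\s-]+" and add perl = TRUE)
# [1] "abc" "def" "ghi" "jk" "lm"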
strsplit("abc123def456ghi789jkl", "[0-9]+") # split with numbers as delimiters | |||
# [1] "abc" "def" "ghi" "jkl" | |||
</syntaxhighlight> | </syntaxhighlight> | ||
Line 72: | Line 100: | ||
* "(", ")" brackets for grouping. | * "(", ")" brackets for grouping. | ||
* "[", "]" character class brackets | * "[", "]" character class brackets | ||
* [,] matches a comma | |||
* \\s matches any whitespace character (spaces, tabs, line breaks). | |||
* [-] matches a hyphen | |||
== + == | == + == | ||
Line 103: | Line 134: | ||
</syntaxhighlight> | </syntaxhighlight> | ||
= List of regular expressions =
The following table is from [http://www.endmemo.com/program/R/grep.php endmemo.com]. | The following table is from [http://www.endmemo.com/program/R/grep.php endmemo.com]. | ||
See also the regular expression article in [https://stringr.tidyverse.org/articles/regular-expressions.html stringr] package. | |||
{| class="wikitable" | {| class="wikitable" | ||
! Syntax | ! Syntax | ||
Line 117: | Line 149: | ||
|- | |- | ||
| \\s | | \\s | ||
| Space | | Space eg: sub('\\s', '\n', "abc ABC") | ||
|- | |- | ||
| \\S | | \\S | ||
Line 149: | Line 181: | ||
| Alternation match. e.g. <nowiki>/(e|d)n/ </nowiki> matches "en" and "dn" | | Alternation match. e.g. <nowiki>/(e|d)n/ </nowiki> matches "en" and "dn" | ||
|- | |- | ||
| | | . OR .* | ||
| Any character, except \n or line terminator | | Any character, except \n or line terminator | ||
|- | |- | ||
Line 171: | Line 203: | ||
|- | |- | ||
| i+ | | i+ | ||
| i at least one time | | i at least one time (Repetition) | ||
|- | |- | ||
| i* | | i* | ||
| i zero or more times | | i zero or more times (Repetition) | ||
|- | |- | ||
| i? | | i? | ||
| i zero or 1 time | | i zero or 1 time (Repetition) | ||
|- | |- | ||
| i{n} | | i{n} | ||
Line 226: | Line 258: | ||
| [:xdigit:] | | [:xdigit:] | ||
| Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f | | Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f | ||
|} | |||
== File globs == | |||
Asterisk: * matches any number of any characters | |||
{| class="wikitable" | |||
! pattern
! match | |||
|- | |||
| file.* | |||
| file.txt, file.jpg, file.tar.gz | |||
|- | |||
| file*.txt | |||
| file1.txt, file123.txt | |||
|} | |||
Question mark: ? matches one of any character | |||
{| class="wikitable" | |||
! pattern
! match | |||
|- | |||
| file?.txt | |||
| file1.txt, filea.txt | |||
|- | |||
| file??.txt | |||
| file10.txt, fileab.txt | |||
|- | |||
| ?.jpg | |||
| a.jpg, 2.jpg | |||
|} | |||
Character sets: [] matches one character in the list | |||
{| class="wikitable" | |||
! pattern
! match | |||
|- | |||
| file[0-9].txt | |||
| file1.txt, file2.txt | |||
|- | |||
| file[a-z].txt | |||
| filea.txt, fileb.txt | |||
|- | |||
| file[abc123].jpg | |||
| filea.jpg, file1.jpg
|} | |||
Character sets: [-] matches a hyphen | |||
{| class="wikitable" | |||
! pattern
! match | |||
|- | |||
| file[-0-9].txt | |||
| file-.txt, file1.txt | |||
|} | |||
Character sets: [! ] negates a match | |||
{| class="wikitable" | |||
! pattern
! match | |||
|- | |||
| file[!0-9].txt | |||
| filea.txt, fileb.txt | |||
|} | |||
Character classes: [: :] matches one character of a certain type
{| class="wikitable"
! pattern
! match | |||
|- | |||
| [:digit:] | |||
| numbers | |||
|- | |||
| [:upper:] | |||
| upper case characters | |||
|- | |||
| [:lower:] | |||
| lower case characters
|- | |||
| [:alpha:] | |||
| upper and lower case characters
|- | |||
| [:alnum:] | |||
| upper and lower case plus numbers | |||
|- | |||
| [:space:] | |||
| spaces, tabs, and newlines | |||
|- | |||
| [:graph:] | |||
| printable characters, not including spaces | |||
|- | |||
| [:print:] | |||
| printable characters, including spaces | |||
|- | |||
| [:punct:] | |||
| punctuation | |||
|- | |||
| [:cntrl:] | |||
| nonprintable control characters | |||
|- | |||
| [:xdigit:] | |||
| hexadecimal characters | |||
|} | |||
Using character classes | |||
<pre> | |||
ls file[0-9].txt | |||
ls file[[:digit:]].txt | |||
ls file[[:digit:][:space:]].txt
</pre> | |||
Negating character | |||
<pre> | |||
ls file[![:digit:]].txt | |||
ls file[![:digit:][:space:]].txt
</pre> | |||
Brace expansion | |||
<pre> | |||
ls {*.jpg,*.gif,*.png} | |||
</pre> | |||
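The wildcard patterns above can also be used from R. As a minimal sketch (the file names are made up for illustration), '''glob2rx()''' from the utils package translates a glob into the equivalent anchored regular expression, which can then be passed to grepl() or to the pattern argument of list.files().

<syntaxhighlight lang='rsplus'>
files <- c("file1.txt", "file10.txt", "filea.txt")  # hypothetical file names

glob2rx("file*.txt")
# [1] "^file.*\\.txt$"

# "?" in a glob stands for exactly one character
grepl(glob2rx("file?.txt"), files)
# [1]  TRUE FALSE  TRUE
</syntaxhighlight>

Note that the dot is literal in a glob, so glob2rx() escapes it in the regular expression.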
== Extended globs == | |||
To turn it on, '''shopt -s extglob''' | |||
?(match): 0 or 1 occurrence of pattern | |||
{| class="wikitable" | |||
! pattern | |||
! match | |||
|- | |||
| file?(abc).txt | |||
| file.txt, fileabc.txt | |||
|} | |||
+(match): 1 or more occurrence of pattern | |||
{| class="wikitable" | |||
! pattern | |||
! match | |||
|- | |||
| file+(abc).txt | |||
| fileabc.txt, fileabcabc.txt | |||
|} | |||
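<nowiki>@(match|match)</nowiki>: match one or the other. The example below uses the related <nowiki>+(match|match)</nowiki>, which matches one or more occurrences of either pattern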
{| class="wikitable" | |||
! pattern | |||
! match | |||
|- | |||
| <nowiki>ls +(photo|Photo)*+(.jpg|.gif)</nowiki>
| photo.jpg, Photo.jpg, photo.gif, Photo.gif | |||
|} | |||
<nowiki>*(match)</nowiki>: 0 or more occurrences of pattern
{| class="wikitable" | |||
! pattern | |||
! match | |||
|- | |||
| photo*(abc).jpg | |||
| photo.jpg, photoabc.jpg, photoabcabc.jpg | |||
|} | |||
!(match): inverts the match | |||
{| class="wikitable" | |||
! pattern | |||
! match | |||
|- | |||
| <nowiki>ls !(*.jpg|*.gif)</nowiki>
| file.txt, fileabc.txt, fileabcabc.txt | |||
|} | |} | ||
Line 235: | Line 433: | ||
Ref: | Ref: | ||
* [https://www.cyberciti.biz/faq/grep-regular-expressions/ Regular Expressions In grep examples] | * [https://www.cyberciti.biz/faq/grep-regular-expressions/ Regular Expressions In grep examples] | ||
== grepl() and fixed parameter ==
[https://stackoverflow.com/a/10128682 Test if characters are in a string] | |||
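As a minimal sketch of what the '''fixed''' argument changes (the input strings are made up for illustration and are not taken from the linked answer): with fixed = TRUE the pattern is treated as a literal string, so metacharacters such as the dot need no escaping.

<syntaxhighlight lang='rsplus'>
x <- c("abc.def", "abcdef")    # illustrative input

grepl(".", x)                  # "." is a metacharacter and matches any character
# [1] TRUE TRUE
grepl("\\.", x)                # escaped dot matches a literal period
# [1]  TRUE FALSE
grepl(".", x, fixed = TRUE)    # pattern taken literally, no escaping needed
# [1]  TRUE FALSE
</syntaxhighlight>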
= [https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html sub() and gsub()] = | = [https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html sub() and gsub()] = | ||
The sub function changes only the first occurrence of the regular expression, while the gsub function performs the substitution on all occurrences within the string. | The sub function changes only the first occurrence of the regular expression, while the gsub function performs the substitution on all occurrences within the string. | ||
To [https://stackoverflow.com/a/29114007 extract the filename without extension], <pre>sub('\\..*$', '', Filename)</pre> | |||
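The pattern above removes everything from the '''first''' dot onwards, which matters for names with several dots. A small sketch with hypothetical file names, comparing it with a pattern that strips only the last extension and with '''tools::file_path_sans_ext()''':

<syntaxhighlight lang='rsplus'>
f <- c("report.txt", "archive.tar.gz")  # hypothetical file names

sub("\\..*$", "", f)             # strip from the first dot
# [1] "report"  "archive"
sub("\\.[^.]*$", "", f)          # strip only the last extension
# [1] "report"      "archive.tar"
tools::file_path_sans_ext(f)     # helper shipped with R; also strips the last extension
# [1] "report"      "archive.tar"
</syntaxhighlight>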
= [https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html regexpr() and gregexpr()] = | = [https://stat.ethz.ch/R-manual/R-devel/library/base/html/grep.html regexpr() and gregexpr()] = | ||
<ul>
<li>The output from these functions is a vector of '''starting positions''', with the matched lengths stored in the '''match.length''' attribute; if no match occurred, a value of -1 is returned.
<li>The '''regexpr''' function will only provide information about the first match in its input string(s), while the '''gregexpr''' function returns information about all matches found.
<li>Note that in C++, '''std::string::find()''' and Qt's '''QRegExp::indexIn()''' can do what R's '''regexpr()''' does. I am not aware of any gregexpr()-equivalent function in C++.
<li>The following example comes from the book 'Data Manipulation with R' by [http://www.stat.berkeley.edu/~spector/ Phil Spector], Chapter 7, Character Manipulation.
<syntaxhighlight lang='rsplus'> | <syntaxhighlight lang='rsplus'> | ||
tst = c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine') | tst = c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine') | ||
Line 295: | Line 499: | ||
# [1] TRUE | # [1] TRUE | ||
</syntaxhighlight> | </syntaxhighlight> | ||
<li>[https://www.r-bloggers.com/2024/09/how-to-use-grep-and-return-only-substring-in-r-a-comprehensive-guide/ How to Use grep() and Return Only Substring in R: A Comprehensive Guide]. The example uses '''regexpr()''' to find the position of the match, and then '''substr()''' to extract the matched portion. | |||
<pre> | |||
text <- c("file1.txt", "file2.csv", "file3.doc") | |||
pattern <- "\\.[^.]+$" | |||
matches <- regexpr(pattern, text) | |||
result <- substr(text, matches, matches + attr(matches, "match.length") - 1) | |||
print(result) | |||
# [1] ".txt" ".csv" ".doc" | |||
</pre> | |||
</ul> | |||
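Instead of combining the positions with substr() by hand, the match data can be passed to '''regmatches()'''. A minimal sketch, reusing the ''tst'' vector from the example above; the pattern "[a-z][0-9]" is only an illustration:

<syntaxhighlight lang='rsplus'>
tst <- c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine')

# gregexpr(): all matches, one list element per input string
all_matches <- regmatches(tst, gregexpr("[a-z][0-9]", tst))
lengths(all_matches)
# [1] 2 2 0 1

# regexpr(): first match only; strings without a match are dropped
regmatches(tst, regexpr("[a-z][0-9]", tst))
# [1] "x7" "c5" "a8"
</syntaxhighlight>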
= Examples = | = Examples = | ||
* | * Search '''".*" ''' for all example depending on '''".*"''' | ||
* Substitute a substring which starts with 0 or more characters followed by 'boundary=' with an empty string. Here ^ means beginning, dot means any character and star means the preceding item 0 or more times.
:<syntaxhighlight lang='rsplus'>
sub("^.*boundary=", "", string)
</syntaxhighlight>

* Delete all characters up to the '''last''' appearance of the dollar sign
:<syntaxhighlight lang='rsplus'>
gsub(".*\\$", "", "abc$de$fg")
# [1] "fg"
</syntaxhighlight>Note that "\\$" in an R string is the regular expression \$, which matches a literal dollar sign; an unescaped $ is the end-of-string anchor.
* Delete all characters up to the '''first''' appearance of the dollar sign | |||
:<syntaxhighlight lang='rsplus'> | |||
gsub("^[^$]*\\$", "", "abc$de$fg") | |||
[1] "de$fg" | |||
</syntaxhighlight>'''"[^$]"''' matches '''any''' (one) character except the dollar sign. '''"^[^$]*"''': Matches '''all''' (zero or more) characters from the start of the string (^) up to but not including the first dollar sign ($). The [^$]* part matches any sequence of characters that are not dollar signs. | |||
* Search for the string ending with .zip or .tar.gz | |||
:<syntaxhighlight lang='rsplus'> | |||
grep("\\.zip$", pkgs) # or
grep("\\.tar\\.gz$", pkgs)
</syntaxhighlight> | |||
* Not update any package whose name starts with "org." or "BSgenome." | |||
:<syntaxhighlight lang='rsplus'> | |||
biocLite(suppressUpdates=c("^org\\.", "^BSgenome\\."))
</syntaxhighlight> | |||
* search for the string containing '9', any character (to split 9 & 11) and '11'. | |||
:<syntaxhighlight lang='rsplus'> | |||
grep("9.11", string) | |||
</syntaxhighlight> | |||
* pipe metacharacter; it is translated to 'or'. <code style="display:inline-block;">flood|fire</code> will match strings containing flood or fire.
* <code style="display:inline-block;">[^?.]$</code> will match any string whose last character ($) is not (^) one of the characters in the class ([]), here the question mark (?) or period (.).
* <code style="display:inline-block;">^[Gg]ood|[Bb]ad</code> will match strings starting with Good/good or containing Bad/bad anywhere.
* <code style="display:inline-block;">^([Gg]ood|[Bb]ad)</code> will look for strings beginning with Good/good/Bad/bad. | |||
* ? character; it means optional. <code style="display:inline-block;">[Gg]eorge( [Ww]\.)? [Bb]ush</code> will match strings like 'george bush', 'George W. Bush' or 'george bushes'. Note that we escape the metacharacter dot by '\.' so it becomes a literal period. | |||
* star and plus sign. star means any number including none and plus means at least one. For example, <code style="display:inline-block;">(.*)</code> matches 'abc(222 )' and '()'. | |||
* [https://stackoverflow.com/a/19342107 Regular Expression in Base R Regex to identify email address] & [https://stackoverflow.com/a/25077704 How to extract expression matching an email address in a text file using R or Command Line?] | |||
* Extract digits from a string. | |||
:<syntaxhighlight lang='rsplus'> | |||
gsub("\\D", "", c("i have 10 app", "call for 2 cups") ) # c("10", "2") | |||
</syntaxhighlight> | |||
* <code style="display:inline-block;">[0-9]+ (.*) [0-9]+</code> will match a number followed by a space, any number of characters (.*), another space and a number. Matching strings must contain at least two numbers separated in this way.
* replace multiple spaces with 1 space. | |||
:<syntaxhighlight lang='rsplus'> | |||
gsub("[[:space:]]+", " ", " ab c ") | |||
</syntaxhighlight> | |||
* remove characters after period | |||
:<syntaxhighlight lang='rsplus'> | |||
gsub("\\..*", "", string) | |||
</syntaxhighlight> | |||
* remove all characters that are not digits (0-9) | |||
:<syntaxhighlight lang='rsplus'> | |||
gsub("[^0-9]", "", "Jan.-Feb. 1973") # Jan.-Feb. 1973 -> 1973 | |||
</syntaxhighlight> | |||
* remove everything before the last forward slash in the URL. ".*/", matches any character (.) occurring zero or more times (*) followed by a forward slash (/) | |||
:<syntaxhighlight lang='rsplus'> | |||
gsub(".*/", "", url) # https://abc.def/file.xlsx -> file.xlsx | |||
</syntaxhighlight> | |||
* {} are referred to as interval quantifiers; they specify the minimum and maximum number of matches of an expression.
* [https://github.com/wch/r-source/blob/trunk/src/library/base/R/strwrap.R#L201-L211 trimws()] function to [https://github.com/wch/r-source/blob/e36b7044ba5ca3e9caebdb0fc6302675a954ae47/doc/NEWS.Rd#L599-L600 remove trailing/leading whitespace]. The function is used in [https://github.com/wch/r-source/search?p=2&q=trimws&utf8=%E2%9C%93 several places]. | |||
< | * [https://github.com/wch/r-source/blob/trunk/src/library/base/R/strwrap.R#L201-L211 trimws()] function to [https://github.com/wch/r-source/blob/e36b7044ba5ca3e9caebdb0fc6302675a954ae47/doc/NEWS.Rd#L599-L600 remove trailing/leading whitespace]. The function is used in [https://github.com/wch/r-source/search?p=2&q=trimws&utf8=%E2%9C%93 several places]. <syntaxhighlight lang="rsplus"> | ||
trimws <- | trimws <- | ||
function(x, which = c("both", "left", "right")) | function(x, which = c("both", "left", "right")) | ||
Line 323: | Line 602: | ||
mysub("[ \t\r\n]+$", mysub("^[ \t\r\n]+", x)) | mysub("[ \t\r\n]+$", mysub("^[ \t\r\n]+", x)) | ||
} | } | ||
</ | </syntaxhighlight> | ||
* [http://stackoverflow.com/questions/2261079/how-to-trim-leading-and-trailing-whitespace-in-r Another solution to trim leading/trailing space] is | * [http://stackoverflow.com/questions/2261079/how-to-trim-leading-and-trailing-whitespace-in-r Another solution to trim leading/trailing space] is | ||
< | :<syntaxhighlight lang="rsplus"> | ||
# returns string w/o leading whitespace | # returns string w/o leading whitespace | ||
trim.leading <- function (x) sub("^\\s+", "", x) | trim.leading <- function (x) sub("^\\s+", "", x) | ||
Line 334: | Line 614: | ||
# returns string w/o leading or trailing whitespace | # returns string w/o leading or trailing whitespace | ||
trim <- function (x) gsub("^\\s+|\\s+$", "", x) | trim <- function (x) gsub("^\\s+|\\s+$", "", x) | ||
</ | </syntaxhighlight> | ||
* Extract/replace text between parentheses https://stackoverflow.com/a/13498914, [https://stackoverflow.com/a/24174655 stringr::str_replace()] | |||
:<syntaxhighlight lang='rsplus'> | |||
gsub("\\(.*\\)", "", c("0.1385(+)", "0.33", "0.12(-)"))
</syntaxhighlight> | |||
<ul> | |||
<li>Replace "\n", "?" or ":" character. '''rsync''' will not be able to copy files if these characters appeared in the filename. Below is an R snippet to fix this problem. | |||
<pre> | |||
dirns <- dir("~/Documents", full.names = TRUE) | |||
for(dirn in dirns) { | |||
setwd(dirn) | |||
x <- list.files(".") | |||
y <- gsub("\n|\\?", "", x) # remove \n or ? character | |||
y <- gsub(":", ". ", y) # replace : with . | |||
# y <- gsub("\\s+", "_", y) # replace space with _ | |||
file.rename(x, y) | |||
} | |||
</pre> | |||
</li> | |||
</ul> | |||
* Replace multiple spaces with one space. The regular expression <code style="display:inline-block;">\\s+</code> matches one or more whitespace characters.
:<syntaxhighlight lang='rsplus'> | |||
gsub("\\s+", " ", "This is an example") # "This is an example" | |||
</syntaxhighlight> | |||
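The difference between the anchored alternation patterns listed above can be checked directly with grepl(). This is only a sketch; the test strings are made up for illustration.

<syntaxhighlight lang='rsplus'>
s <- c("Good morning", "not bad at all", "Bad idea", "so good")  # made-up test strings

# ^ binds only to the first alternative: "starts with Good/good" OR "contains Bad/bad"
grepl("^[Gg]ood|[Bb]ad", s)
# [1]  TRUE  TRUE  TRUE FALSE

# the parentheses make ^ apply to both alternatives
grepl("^([Gg]ood|[Bb]ad)", s)
# [1]  TRUE FALSE  TRUE FALSE
</syntaxhighlight>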
= Special case: match the dot character = | = Special case: match the dot character = | ||
See Chapter 11: Strings with stringr in 'R for Data Science' by Hadley Wickham. | See Chapter 11: Strings with stringr in '[https://r4ds.hadley.nz/ R for Data Science]' by Hadley Wickham. | ||
The printed representation of a string shows the escapes. To see the raw contents of the string, use '''writeLines()'''. | The printed representation of a string shows the escapes. To see the raw contents of the string, use '''writeLines()'''. | ||
Line 360: | Line 665: | ||
# \. | # \. | ||
</syntaxhighlight> | </syntaxhighlight> | ||
Suppose we have a string like "UCEC.transcriptome__unc_edu__Level_3__unc_lowess_normalization_gene_level__data.data.txt" and we want to get "UCEC" in output. We can use | |||
<pre> | |||
mystring %>% gsub(x=., pattern = "\\..*", replacement = "") | |||
</pre> | |||
= Special case: match the backslash \ = | = Special case: match the backslash \ = | ||
Line 368: | Line 678: | ||
str_view(x, "\\\\") | str_view(x, "\\\\") | ||
</syntaxhighlight> | |||
[https://stackoverflow.com/questions/25424382/replace-single-backslash-in-r Replace single backslash in R]. Note the fixed=TRUE option. | |||
<pre> | |||
gsub("\\", "", str2, fixed=TRUE) | |||
</pre> | |||
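A short sketch of the escaping levels involved, using a made-up string: the R literal "a\\b" contains a single backslash, the regular expression needs that backslash escaped again, and fixed = TRUE avoids the second level of escaping.

<syntaxhighlight lang='rsplus'>
x <- "a\\b"        # illustrative string; it holds three characters: a, \ and b
writeLines(x)
# a\b
nchar(x)
# [1] 3

gsub("\\\\", "/", x)               # regex \\ matches one literal backslash
# [1] "a/b"
gsub("\\", "/", x, fixed = TRUE)   # literal backslash, no regex escaping
# [1] "a/b"
</syntaxhighlight>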
= Approximate matching = | |||
[https://laurikari.net/tre/documentation/regex-syntax/ TRE library] | |||
<syntaxhighlight lang='rsplus'> | |||
> names <- c("Konrad", "Conrad", "Konard", "Connard", "con rat", "Conga rat") | |||
> grep("(Konrad){~2}", names, value = TRUE) | |||
[1] "Konrad" "Conrad" "Konard" | |||
> grep("(Konrad){~1}", names, value = TRUE) | |||
[1] "Konrad" "Conrad" | |||
</syntaxhighlight> | |||
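The number in {~n} is the maximum number of edits (insertions, deletions, substitutions) allowed. As a rough cross-check, and assuming whole-string comparison is what is wanted, '''adist()''' from the utils package reports the edit distance between the pattern and each name; it is a different tool from the TRE syntax above, shown here only for illustration.

<syntaxhighlight lang='rsplus'>
candidates <- c("Konrad", "Conrad", "Konard")

# generalized Levenshtein distance from "Konrad" to each candidate
drop(adist("Konrad", candidates))
# [1] 0 1 2
</syntaxhighlight>

"Konard" is two substitutions away from "Konrad", which is why it is matched by {~2} but not by {~1}.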
= stringr package = | |||
* https://stringr.tidyverse.org/articles/regular-expressions.html | |||
* [http://stat545.com/block022_regular-expression.html Regular Expression in R] from http://stat545.com | |||
<syntaxhighlight lang='rsplus'> | |||
# Repetition | |||
stringr::str_detect(c("OID01216", "OID01493"), "OID[0-9]{5}") | |||
# the 2nd element has 1 more numerical character after OID; still matched | |||
stringr::str_detect(c("OID01216", "OID012165"), "OID[0-9]{5}") | |||
# [1] TRUE TRUE | |||
# the 2nd element has 1 less character; not matched | |||
stringr::str_detect(c("OID01216", "OID0121"), "OID[0-9]{5}") | |||
# [1] TRUE FALSE | |||
</syntaxhighlight> | |||
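The second case above matches because the pattern is not anchored, so any five consecutive digits after OID are enough. A minimal sketch, assuming the intent is to accept exactly five digits: add the ^ and $ anchors.

<syntaxhighlight lang='rsplus'>
ids <- c("OID01216", "OID012165", "OID0121")

stringr::str_detect(ids, "^OID[0-9]{5}$")
# [1]  TRUE FALSE FALSE

grepl("^OID[0-9]{5}$", ids)   # base R equivalent
# [1]  TRUE FALSE FALSE
</syntaxhighlight>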
= RegExplain package = | |||
[https://github.com/gadenbuie/regexplain RegExplain] is an RStudio addin slash utility belt for regular expressions. Interactively build your regexp. | |||
= Shell = | |||
*[https://www.cyberciti.biz/faq/sed-remove-last-character-from-each-line/ How to sed remove last character from each line] | |||
:<syntaxhighlight lang='bash'> | |||
$ echo "I love donuts two times a dayz" | sed 's/.$//' # remove the last char | |||
I love donuts two times a day | |||
$ echo "I love donuts two times a dayz" | sed 's/[[:alpha:]]$//' | |||
I love donuts two times a day | |||
$ echo "I love donuts two times a day1" | sed 's/[[:alpha:]]$//' | |||
I love donuts two times a day1 | |||
$ echo "I love donuts two times a dayz" | sed 's/.[[:alpha:]]$//' | |||
I love donuts two times a da | |||
$ echo "I love donuts two times a day" | sed 's/[[:blank:]]//' # only the 1st instance | |||
Ilove donuts two times a day | |||
$ echo "I love donuts two times a day" | sed 's/[[:blank:]]//g' # all instances | |||
Ilovedonutstwotimesaday | |||
</syntaxhighlight> | </syntaxhighlight> |
Latest revision as of 08:01, 10 September 2024
Resources
- https://en.wikipedia.org/wiki/Regular_expression
- Learning Regular Expressions
- Regular Expressions Tutorial
- Regular Expression testing
- ?grep (returns numeric values), ?grepl (returns a logical vector) and ?regexpr (returns numeric values) in R.
- http://www.regular-expressions.info/rlanguage.html
- http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
- http://www.johndcook.com/r_language_regex.html
- http://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions
- http://rpubs.com/Lionel/19068
- http://ucfagls.wordpress.com/2012/08/15/processing-sample-labels-using-regular-expressions-in-r/
- http://www.dummies.com/how-to/content/how-to-use-regular-expressions-in-r.html
- http://www.r-bloggers.com/example-8-27-using-regular-expressions-to-read-data-with-variable-number-of-words-in-a-field/
- http://www.r-bloggers.com/using-regular-expressions-in-r-case-study-in-cleaning-a-bibtex-database/
- http://cbio.ensmp.fr/~thocking/papers/2011-08-16-directlabels-and-regular-expressions-for-useR-2011/2011-useR-named-capture-regexp.pdf
- http://stackoverflow.com/questions/5214677/r-find-the-last-dot-in-a-string
- Remove all special characters from a string in R?
gsub("[^[:alnum:]]", "", x)
gsub("[[:punct:]]", " ", x)
- PCRE and newlines explains the differences between \r\n (newline on Windows), \n (newline on UNIX, hex 0A) and \r (newline on old Mac, hex 0D). The tab \t has hex 09.
- How to Use Regular Expressions (regexes) on Linux
Specific to R
- https://stringr.tidyverse.org/articles/regular-expressions.html, Cheat sheet from RStudio
- Basic matches
- Escaping, "\\."
- Special characters: "\e", "\f", "\n", "\r", "\t"
- Matching multiple characters, "\d" matches any digit, "\s" matches any whitespace, "\w" matches any word character, "\p", "\b", ...
- Alternation/ Or: "|"
- Grouping: parentheses
- Anchors: "^" matches the start of string, "$" matches the end of the string.
- Repetition: "?", "+", "*", "[]", "{n}", "{n, }", "{n, m}", "??", "+?", "*?"
- Look arounds: "(?=...)", "(?!...)", "(?<=...)"
- Comments: "(?#...)"
- Handling Strings with R ebook by Gaston Sanchez
- https://en.wikibooks.org/wiki/R_Programming/Text_Processing#Regular_Expressions
- https://journal.r-project.org/archive/2010-2/RJournal_2010-2_Wickham.pdf
- http://biostat.mc.vanderbilt.edu/wiki/pub/Main/SvetlanaEdenRFiles/regExprTalk.pdf
- http://r-exercises.com/2016/10/30/regular-expressions-part-1/
- Demystifying Regular Expressions in R
- A Beginners Guide to Match Any Pattern Using Regular Expressions in R
Specific to Terminal
Using Grep & Regular Expressions to Search for Text Patterns in Linux & Extended Regular Expressions.
Grep Multiple Strings from a File in Linux
Online tools
- RegExplain & RStudio Addin https://github.com/gadenbuie/regexplain
Metacharacters
There are 12 metacharacters. It seems "]", "}" and "-" are not metacharacters.
. \ | ( ) [ { $ ^ * + ?
If we want to match them, we need to precede them with a double backslash.
gsub("$", ".", "abc$def") # "abc$def." (an unescaped $ is the end-of-string anchor)
gsub("\\$", ".", "abc$def") # "abc.def"
metachar <- scan(textConnection(". \\ | ( ) [ ] { } $ - ^ * + ?"), "")
# Read 15 items
metachar
# [1] "." "\\" "|" "(" ")" "[" "]" "{" "}" "$" "-" "^" "*" "+" "?"
nchar(metachar[2]) # [1] 1
grep("\\.", metachar, value = TRUE) # "."
grep("\\.", metachar) # 1
grep("\\\\", metachar) # 2
grep("]", metachar) # 7
grep("}", metachar) # 9
grep("{", metachar)
# Error in grep("{", metachar) :
# invalid regular expression '{', reason 'Missing '}''
grep("\\$", metachar) # 10
grep("-", metachar) # 11
strsplit("abc.def", "\\.")
strsplit("abc.def", ".", fixed = TRUE)
strsplit("abc,def ghi -jk lm", "[,\\s-]+") # "\\s" inside [] is not treated as whitespace by the default (TRE) engine, so spaces are not split
# [1] "abc" "def ghi " "jk lm"
strsplit("abc,def ghi -jk lm", "[,[:space:]-]+") # split with multiple delimiters
# [1] "abc" "def" "ghi" "jk" "lm"
strsplit("abc123def456ghi789jkl", "[0-9]+") # split with numbers as delimiters
# [1] "abc" "def" "ghi" "jkl"
- "." matches everything except for the empty string "".
- "+" the preceding item will be matched one or more times.
- "*" the preceding item will be matched zero or more times.
- "^" matches the empty string at the beginning of a line. When used at the start of a character class, it means to match any character but the following ones.
- "$" matches empty string at the end of a line.
- "|" infix operator: OR
- "(", ")" brackets for grouping.
- "[", "]" character class brackets
- [,] matches a comma
- \\s matches any whitespace character (spaces, tabs, line breaks).
- [-] matches a hyphen
+
gsub(pattern = "\\.\\.", replace = ".", "id..of...patient")
# [1] "id.of..patient" NOT RIGHT, need to apply the command multiple times
gsub(pattern = "\\.+", replace = ".", "id..of...patient")
# [1] "id.of.patient"
Character classes
Replacing all values that do not contain letters or digits with NA value. The following example is from here.
s <- c("", " ", "3 times a day after meal", "once a day", " "," one per day ", "\t", "\n ")
# Method 1
s[s==""|s==" "|s==" "|s=="\t"|s=="\n"] # BAD
# Method 2
allIndices = 1:length(s)
letOrDigIndices = grep("[a-zA-Z0-9]", s)
blankInd = setdiff(allIndices, letOrDigIndices)
s[blankInd]
# Method 3
gsub("^$|^( +)$|[\t\n\r\f\v]+", NA, s)
# Method 4. Get rid of extra blank spaces
s1 = gsub("^([ \t\n\r\f\v]+)|([ \t\n\r\f\v]+)$", "", s)
gsub("^$", NA, s1)
List of regular expressions
The following table is from endmemo.com.
See also the regular expression article in stringr package.
Syntax | Description |
---|---|
\\d | Digit, 0,1,2 ... 9 |
\\D | Not Digit |
\\s | Space eg: sub('\\s', '\n', "abc ABC") |
\\S | Not Space |
\\w | Word |
\\W | Not Word |
\\t | Tab |
\\n | New line |
^ | Beginning of the string |
$ | End of the string |
^KEY1.*KEY2$ | Beginning and end of a string |
\ | Escape special characters, e.g. \\ is "\", \+ is "+" |
| | Alternation match. e.g. /(e|d)n/ matches "en" and "dn" |
. OR .* | Any character, except \n or line terminator |
[ab] | a or b |
[^ab] | Any character except a and b |
[0-9] | All Digit |
[A-Z] | All uppercase A to Z letters |
[a-z] | All lowercase a to z letters |
[A-z] | All uppercase and lowercase letters, but in ASCII also the characters [ \ ] ^ _ ` that lie between Z and a; [A-Za-z] is safer |
i+ | i at least one time (Repetition) |
i* | i zero or more times (Repetition) |
i? | i zero or 1 time (Repetition) |
i{n} | i occurs n times in sequence |
i{n1,n2} | i occurs n1 - n2 times in sequence |
i{n1,n2}? | non greedy match, see above example |
i{n,} | i occurs >= n times |
[:alnum:] | Alphanumeric characters: [:alpha:] and [:digit:] |
[:alpha:] | Alphabetic characters: [:lower:] and [:upper:] |
[:blank:] | Blank characters: e.g. space, tab |
[:cntrl:] | Control characters |
[:digit:] | Digits: 0 1 2 3 4 5 6 7 8 9 |
[:graph:] | Graphical characters: [:alnum:] and [:punct:] |
[:lower:] | Lower-case letters in the current locale |
[:print:] | Printable characters: [:alnum:], [:punct:] and space |
[:punct:] | Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ |
[:space:] | Space characters: tab, newline, vertical tab, form feed, carriage return, space |
[:upper:] | Upper-case letters in the current locale |
[:xdigit:] | Hexadecimal digits: 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f |
File globs
Asterisk: * matches any number of any characters
pattern | match |
---|---|
file.* | file.txt, file.jpg, file.tar.gz |
file*.txt | file1.txt, file123.txt |
Question mark: ? matches one of any character
pattern | match |
---|---|
file?.txt | file1.txt, filea.txt |
file??.txt | file10.txt, fileab.txt |
?.jpg | a.jpg, 2.jpg |
Character sets: [] matches one character in the list
pattern | match |
---|---|
file[0-9].txt | file1.txt, file2.txt |
file[a-z].txt | filea.txt, fileb.txt |
file[abc123].jpg | filea.jpg, fileb.jpg |
Character sets: [-] matches a hyphen
pattern | match |
---|---|
file[-0-9].txt | file-.txt, file1.txt |
Character sets: [! ] negates a match
pattern | match |
---|---|
file[!0-9].txt | filea.txt, fileb.txt |
Character classes: [: :] matches one character of a certain type
pattern | match |
---|---|
[:digit:] | numbers |
[:upper:] | upper case characters |
[:lower:] | lower case characters |
[:alpha:] | upper and lower case characters |
[:alnum:] | upper and lower case plus numbers |
[:space:] | spaces, tabs, and newlines |
[:graph:] | printable characters, not including spaces |
[:print:] | printable characters, including spaces |
[:punct:] | punctuation |
[:cntrl:] | nonprintable control characters |
[:xdigit:] | hexadecimal characters |
Using character classes
    ls file[0-9].txt
    ls file[[:digit:]].txt
    ls file[[:digit:][:space:]].txt
Negating character
    ls file[![:digit:]].txt
    ls file[![:digit:][:space:]].txt
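From R, the same selections can be sketched in two ways: list.files() takes a regular expression (not a glob) in its pattern argument, while Sys.glob() takes a glob. The temporary directory and file names below are assumptions made for the example.

```r
# Create a scratch directory with a few hypothetical files
tmp <- tempfile("globdemo")
dir.create(tmp)
invisible(file.create(file.path(tmp, c("file1.txt", "file2.txt",
                                       "filea.txt", "file10.txt"))))

# Regular expression: note the anchors and the escaped dot
digit_files    <- list.files(tmp, pattern = "^file[[:digit:]]\\.txt$")
nondigit_files <- list.files(tmp, pattern = "^file[^[:digit:]]\\.txt$")

# Glob: same idea as `ls file[0-9].txt`
globbed <- sort(basename(Sys.glob(file.path(tmp, "file[0-9].txt"))))

unlink(tmp, recursive = TRUE)  # clean up

print(digit_files)
print(nondigit_files)
print(globbed)
```

In a regular expression the negation inside brackets is written with ^, whereas the shell glob above uses !.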
Brace expansion
ls {*.jpg,*.gif,*.png}
Extended globs
To turn it on, shopt -s extglob
?(match): 0 or 1 occurrence of pattern
pattern | match |
---|---|
file?(abc).txt | file.txt, fileabc.txt |
+(match): 1 or more occurrence of pattern
pattern | match |
---|---|
file+(abc).txt | fileabc.txt, fileabcabc.txt |
@(match|match): match one or the other
pattern | match |
---|---|
@(photo|Photo)*+(.jpg|.gif) | photo.jpg, Photo.jpg, photo.gif, Photo.gif |
*(match): 0 or more occurrence of pattern
pattern | match |
---|---|
photo*(abc).jpg | photo.jpg, photoabc.jpg, photoabcabc.jpg |
!(match): inverts the match
pattern | match |
---|---|
!(...|*.gif) | file.txt, fileabc.txt, fileabcabc.txt |
grep()
- value = TRUE returns the matching elements instead of their indices
- invert = TRUE returns the indices or values of the elements that do not match
- ignore.case = TRUE makes the match case-insensitive
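A minimal sketch of these three options, using an invented character vector:

```r
# Hypothetical input, made up for this illustration
x <- c("apple", "Banana", "cherry", "banana split")

idx      <- grep("banana", x)                               # indices of matches
vals     <- grep("banana", x, value = TRUE)                 # the matching elements
others   <- grep("banana", x, value = TRUE, invert = TRUE)  # the non-matching elements
any_case <- grep("banana", x, value = TRUE, ignore.case = TRUE)

print(idx)
print(vals)
print(others)
print(any_case)
```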
grepl() and fixed parameter
Test if characters are in a string
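The effect of fixed = TRUE can be sketched as follows: the pattern is then taken as a literal string, so metacharacters such as . and + lose their special meaning. The input vector is invented for the example.

```r
# Hypothetical input, made up for this illustration
x <- c("a.b", "acb", "a+b")

dot_regex   <- grepl("a.b", x)                # '.' matches any character
dot_literal <- grepl("a.b", x, fixed = TRUE)  # '.' must be a literal dot
plus_regex   <- grepl("a+b", x)               # one or more 'a' directly followed by 'b'
plus_literal <- grepl("a+b", x, fixed = TRUE) # the three characters a, +, b

print(dot_regex)
print(dot_literal)
print(plus_regex)
print(plus_literal)
```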
sub() and gsub()
The sub function changes only the first occurrence of the regular expression, while the gsub function performs the substitution on all occurrences within the string.
To extract the filename without extension,
sub('\\..*$', '', Filename)
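The difference between the two functions, and one caveat of the pattern above, can be sketched like this. The file names are invented; note that '\\..*$' removes everything from the first dot, which matters for names with several dots.

```r
s <- "banana"
first_only <- sub("a", "A", s)    # only the first "a" is replaced
all_hits   <- gsub("a", "A", s)   # every "a" is replaced

# Hypothetical file names, made up for this illustration
f <- c("report.txt", "archive.tar.gz")
from_first_dot <- sub("\\..*$", "", f)        # strips from the first dot
from_last_dot  <- sub("\\.[^.]*$", "", f)     # strips only the last extension
via_tools      <- tools::file_path_sans_ext(f)  # base R helper, last extension only

print(first_only)
print(all_hits)
print(from_first_dot)
print(from_last_dot)
print(via_tools)
```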
regexpr() and gregexpr()
- These functions return a vector of starting positions of the matches, with the matched lengths stored in the match.length attribute; if no match occurred, a value of -1 is returned.
- The regexpr function will only provide information about the first match in its input string(s), while the gregexpr function returns information about all matches found.
- Note that in C++, std::string::find() and Qt's QRegExp::indexIn() can do what R's regexpr() does. I am not aware of any gregexpr()-equivalent function in C++.
- The following example comes from the book 'Data Manipulation with R' by Phil Spector, Chapter 7, Character Manipulation.
    tst = c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine')
    wh = regexpr('[a-z][0-9]', tst)
    wh
    # [1]  5  7 -1  1
    # attr(,"match.length")
    # [1]  2  2 -1  2

    wh1 = gregexpr('[a-z][0-9]',tst) # return a list just like strsplit()
    wh1
    # [[1]]
    # [1]  5 12
    # attr(,"match.length")
    # [1] 2 2
    # attr(,"useBytes")
    # [1] TRUE
    #
    # [[2]]
    # [1]  7 15
    # attr(,"match.length")
    # [1] 2 2
    # attr(,"useBytes")
    # [1] TRUE
    #
    # [[3]]
    # [1] -1
    # attr(,"match.length")
    # [1] -1
    # attr(,"useBytes")
    # [1] TRUE
    #
    # [[4]]
    # [1] 1
    # attr(,"match.length")
    # [1] 2
    # attr(,"useBytes")
    # [1] TRUE

    gregexpr("'", "|3'-5'") # find the apostrophe character
    # [[1]]
    # [1] 3 6
    # attr(,"match.length")
    # [1] 1 1
    # attr(,"useBytes")
    # [1] TRUE
- How to Use grep() and Return Only Substring in R: A Comprehensive Guide. The example uses regexpr() to find the position of the match, and then substr() to extract the matched portion.
    text <- c("file1.txt", "file2.csv", "file3.doc")
    pattern <- "\\.[^.]+$"
    matches <- regexpr(pattern, text)
    result <- substr(text, matches, matches + attr(matches, "match.length") - 1)
    print(result)
    # [1] ".txt" ".csv" ".doc"
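As an alternative to computing substr() offsets by hand, base R's regmatches() can take the output of regexpr() or gregexpr() and return the matched text directly. A short sketch, reusing the test vector from the Spector example:

```r
tst <- c('one x7 two b1', 'three c5 four b9', 'five six seven', 'a8 eight nine')

# First match per element; elements without a match are dropped
first_match <- regmatches(tst, regexpr('[a-z][0-9]', tst))

# All matches per element, returned as a list (empty vector when nothing matched)
all_matches <- regmatches(tst, gregexpr('[a-z][0-9]', tst))

print(first_match)
print(all_matches)
```

Because regexpr() results drop non-matching elements, first_match can be shorter than the input; the gregexpr() form keeps one list entry per input string.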
Examples
- Search this page for ".*" to find all the examples that depend on ".*".
- Substitute a substring which starts with 0 or more characters followed by 'boundary=' with an empty string. Here ^ means the beginning of the string, dot means any character and star means the preceding item 0 or more times.
sub("^.*boundary=", "", string)
- Delete all characters up to the last appearance of the dollar sign
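    gsub(".*\\$", "", "abc$de$fg")
    # [1] "fg"
Note that \\$ is used to match a literal dollar sign: the regular expression is \$, and the backslash has to be doubled inside an R string. A single \$ is not a valid escape in an R string and raises an error.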
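The pattern above reaches the last dollar sign because .* is greedy. As a sketch of the alternatives, a lazy quantifier (here with perl = TRUE) or a negated character class stops at the first dollar sign instead:

```r
x <- "abc$de$fg"

greedy   <- sub(".*\\$", "", x)                # .* grabs as much as possible
lazy     <- sub(".*?\\$", "", x, perl = TRUE)  # .*? grabs as little as possible
negclass <- sub("^[^$]*\\$", "", x)            # [^$]* cannot cross a dollar sign

print(greedy)
print(lazy)
print(negclass)
```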
- Delete all characters up to the first appearance of the dollar sign
    gsub("^[^$]*\\$", "", "abc$de$fg")
    # [1] "de$fg"
"[^$]" matches any (one) character except the dollar sign. "^[^$]*": Matches all (zero or more) characters from the start of the string (^) up to but not including the first dollar sign ($). The [^$]* part matches any sequence of characters that are not dollar signs.
- Search for the string ending with .zip or .tar.gz
    grep("\\.zip$", pkgs)
    # or
    grep("\\.tar\\.gz$", pkgs)
- Not update any package whose name starts with "org." or "BSgenome."
biocLite(suppressUpdates=c("^org\\.", "^BSgenome\\."))
- search for the string containing '9', any character (to split 9 & 11) and '11'.
grep("9.11", string)
- pipe metacharacter; it is translated to 'or'.
flood|fire
will match strings containing flood or fire.
[^?.]$
will match any string whose last character ($) is not (^) a question mark (?) or a period (.).
^[Gg]ood|[Bb]ad
will match strings starting with Good/good and anywhere containing Bad/bad.
^([Gg]ood|[Bb]ad)
will look for strings beginning with Good/good/Bad/bad.
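The difference between the last two patterns comes from precedence: without parentheses, ^ applies only to the first alternative. A quick check in base R, with sample strings invented for the example:

```r
# Hypothetical input, made up for this illustration
x <- c("good day", "too bad", "a good day", "Bad luck")

ungrouped <- grepl("^[Gg]ood|[Bb]ad", x)    # (^good) OR (bad anywhere)
grouped   <- grepl("^([Gg]ood|[Bb]ad)", x)  # ^ applies to both alternatives

print(ungrouped)
print(grouped)
```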
- ? character; it means optional.
[Gg]eorge( [Ww]\.)? [Bb]ush
will match strings like 'george bush', 'George W. Bush' or 'george bushes'. Note that we escape the metacharacter dot by '\.' so it becomes a literal period.
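In an R string the backslash itself has to be escaped, so the pattern is written with "\\.". A short sketch with invented sample strings:

```r
# Hypothetical input, made up for this illustration
x <- c("george bush", "George W. Bush", "george bushes", "George H. Bush")

# The group ( [Ww]\.) is optional because of the trailing ?
hits <- grepl("[Gg]eorge( [Ww]\\.)? [Bb]ush", x)

print(hits)
```

The last string is not matched because the optional group only allows the initial W.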
- star and plus sign. star means any number including none and plus means at least one. For example,
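\(.*\)
matches a parenthesized run of any characters, including none, as in 'abc(222 )' and '()'. The parentheses are escaped so that they are literal; an unescaped (.*) is just a group that matches anything.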
- Regular Expression in Base R Regex to identify email address & How to extract expression matching an email address in a text file using R or Command Line?
- Extract digits from a string.
gsub("\\D", "", c("i have 10 app", "call for 2 cups") ) # c("10", "2")
[0-9]+ (.*) [0-9]+
will match a number followed by a space, any number of characters (.*), another space and a second number; e.g. '4 by 5 size'. It does not match 'afda1080 p', which has no second number.
- replace multiple spaces with 1 space.
gsub("[[:space:]]+", " ", " ab c ")
- remove characters after period
gsub("\\..*", "", string)
- remove all characters that are not digits (0-9)
gsub("[^0-9]", "", "Jan.-Feb. 1973") # Jan.-Feb. 1973 -> 1973
- remove everything before the last forward slash in the URL. ".*/", matches any character (.) occurring zero or more times (*) followed by a forward slash (/)
gsub(".*/", "", url) # https://abc.def/file.xlsx -> file.xlsx
- {} are called interval quantifiers; they specify the minimum and maximum number of matches of an expression.
- trimws() function to remove trailing/leading whitespace. The function is used in several places.
    trimws <- function(x, which = c("both", "left", "right")) {
        which <- match.arg(which)
        mysub <- function(re, x) sub(re, "", x, perl = TRUE)
        if(which == "left") return(mysub("^[ \t\r\n]+", x))
        if(which == "right") return(mysub("[ \t\r\n]+$", x))
        mysub("[ \t\r\n]+$", mysub("^[ \t\r\n]+", x))
    }

    # returns string w/o leading whitespace
    trim.leading <- function (x) sub("^\\s+", "", x)
    # returns string w/o trailing whitespace
    trim.trailing <- function (x) sub("\\s+$", "", x)
    # returns string w/o leading or trailing whitespace
    trim <- function (x) gsub("^\\s+|\\s+$", "", x)
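Since R 3.2.0, trimws() is part of base R, so the definitions above are mainly useful for seeing how the regular expressions work. A minimal usage sketch with invented input:

```r
# Hypothetical input, made up for this illustration
x <- c("  ab c  ", "\tfoo\n")

both  <- base::trimws(x)                   # strip both ends
left  <- base::trimws(x, which = "left")   # strip leading whitespace only
right <- base::trimws(x, which = "right")  # strip trailing whitespace only

print(both)
print(left)
print(right)
```

Whitespace inside the string, such as the space in "ab c", is left untouched.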
- Extract/replace text between parentheses https://stackoverflow.com/a/13498914, stringr::str_replace()
gsub("\\(.*\\)", "", c("0.1385(+)", "0.33", "0.12(-)"))
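The call above removes the parenthesized part. To extract it instead, one possible sketch uses lookbehind and lookahead with perl = TRUE together with regmatches(); this is one approach among several, not necessarily the one in the linked answer.

```r
x <- c("0.1385(+)", "0.33", "0.12(-)")

# Remove the parentheses and whatever is inside them
outside <- gsub("\\(.*\\)", "", x)

# Keep only the text between "(" and ")"; elements without a match are dropped
inside <- regmatches(x, regexpr("(?<=\\().*?(?=\\))", x, perl = TRUE))

print(outside)
print(inside)
```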
- Replace "\n", "?" or ":" character. rsync will not be able to copy files if these characters appeared in the filename. Below is an R snippet to fix this problem.
    dirns <- dir("~/Documents", full.names = TRUE)
    for(dirn in dirns) {
        setwd(dirn)
        x <- list.files(".")
        y <- gsub("\n|\\?", "", x) # remove \n or ? character
        y <- gsub(":", ". ", y) # replace : with ". "
        # y <- gsub("\\s+", "_", y) # replace space with _
        file.rename(x, y)
    }
- Replace multiple spaces to one space.
\\s+
regular expression matches one or more spaces,
gsub("\\s+", " ", "This is an example") # "This is an example"
Special case: match the dot character
See Chapter 11: Strings with stringr in 'R for Data Science' by Hadley Wickham.
The printed representation of a string shows the escapes. To see the raw contents of the string, use writeLines().
    x <- c("\"", "\\") # escape ", \
    x
    # [1] "\"" "\\"
    writeLines(x)
    # "
    # \
"." matches any character. To match the dot character literally we shall use "\\.".
    # We want to match the dot character literally
    writeLines("\.")
    # Error: '\.' is an unrecognized escape in character string starting ""\."

    # . should be represented as \. but \ itself should be escaped so
    # to escape ., we should use \\.
    writeLines("\\.")
    # \.
Suppose we have a string like "UCEC.transcriptome__unc_edu__Level_3__unc_lowess_normalization_gene_level__data.data.txt" and we want to get "UCEC" in output. We can use
mystring %>% gsub(x=., pattern = "\\..*", replacement = "")
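The same result can be reached without the pipe. The sketch below compares three equivalent ways of treating the dot literally, plus the unescaped pattern to show what goes wrong; it uses sub() since only the first match matters here.

```r
mystring <- "UCEC.transcriptome__unc_edu__Level_3__unc_lowess_normalization_gene_level__data.data.txt"

escaped   <- sub("\\..*", "", mystring)                   # backslash-escaped dot
bracketed <- sub("[.].*", "", mystring)                   # dot inside a bracket expression
splitted  <- strsplit(mystring, ".", fixed = TRUE)[[1]][1] # literal split, no regex

# Without escaping, "." matches any character, so the whole string is removed
unescaped <- sub("..*", "", mystring)

print(escaped)
print(bracketed)
print(splitted)
print(unescaped)
```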
Special case: match the backslash \
    x <- "a\\b"
    writeLines(x)
    # a\b
    stringr::str_view(x, "\\\\")
Replace single backslash in R. Note the fixed=TRUE option.
gsub("\\", "", str2, fixed=TRUE)
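To make the escaping levels concrete, here is a small base R sketch. The replacement character "/" is chosen only for illustration.

```r
x <- "a\\b"   # three characters: a, backslash, b

n_chars     <- nchar(x)
has_bslash  <- grepl("\\\\", x)                  # regex \\ matches one backslash
via_regex   <- gsub("\\\\", "/", x)              # four backslashes in the R string
via_literal <- gsub("\\", "/", x, fixed = TRUE)  # two suffice with fixed = TRUE

print(n_chars)
print(has_bslash)
print(via_regex)
print(via_literal)
```

With fixed = TRUE only the R string escaping applies, which is why the pattern needs half as many backslashes.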
Approximate matching
    > names <- c("Konrad", "Conrad", "Konard", "Connard", "con rat", "Conga rat")
    > grep("(Konrad){~2}", names, value = TRUE)
    [1] "Konrad" "Conrad" "Konard"
    > grep("(Konrad){~1}", names, value = TRUE)
    [1] "Konrad" "Conrad"
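A related way to see why these names match is to compute the edit distances explicitly. The sketch below uses utils::adist(), which returns the generalized Levenshtein distance; it compares whole strings, so it is a complement to the {~n} syntax above rather than a drop-in replacement. The variable name `candidates` is chosen here to avoid masking base::names().

```r
candidates <- c("Konrad", "Conrad", "Konard", "Connard", "con rat", "Conga rat")

# Edit distance from "Konrad" to each candidate (case-sensitive by default)
d <- as.vector(utils::adist("Konrad", candidates))

within_two <- candidates[d <= 2]
within_one <- candidates[d <= 1]

print(within_two)
print(within_one)
```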
stringr package
- https://stringr.tidyverse.org/articles/regular-expressions.html
- Regular Expression in R from http://stat545.com
    # Repetition
    stringr::str_detect(c("OID01216", "OID01493"), "OID[0-9]{5}")

    # the 2nd element has 1 more numerical character after OID; still matched
    stringr::str_detect(c("OID01216", "OID012165"), "OID[0-9]{5}")
    # [1] TRUE TRUE

    # the 2nd element has 1 less character; not matched
    stringr::str_detect(c("OID01216", "OID0121"), "OID[0-9]{5}")
    # [1] TRUE FALSE
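The longer ID still matches because the pattern is not anchored: {5} only requires five digits somewhere after OID. If exactly five digits are wanted, anchoring with ^ and $ is one way to do it. The sketch uses base R's grepl(); the same pattern should work with stringr::str_detect().

```r
ids <- c("OID01216", "OID012165", "OID0121")

unanchored <- grepl("OID[0-9]{5}", ids)    # five digits somewhere after OID
anchored   <- grepl("^OID[0-9]{5}$", ids)  # exactly OID plus five digits

print(unanchored)
print(anchored)
```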
RegExplain package
RegExplain is an RStudio addin slash utility belt for regular expressions. Interactively build your regexp.
Shell
    $ echo "I love donuts two times a dayz" | sed 's/.$//' # remove the last char
    I love donuts two times a day
    $ echo "I love donuts two times a dayz" | sed 's/[[:alpha:]]$//'
    I love donuts two times a day
    $ echo "I love donuts two times a day1" | sed 's/[[:alpha:]]$//'
    I love donuts two times a day1
    $ echo "I love donuts two times a dayz" | sed 's/.[[:alpha:]]$//'
    I love donuts two times a da
    $ echo "I love donuts two times a day" | sed 's/[[:blank:]]//' # only the 1st instance
    Ilove donuts two times a day
    $ echo "I love donuts two times a day" | sed 's/[[:blank:]]//g' # all instances
    Ilovedonutstwotimesaday
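For comparison, the same edits can be sketched in R: sub() plays the role of sed's s/// without the g flag, and gsub() the role of s///g.

```r
s <- "I love donuts two times a dayz"

drop_last        <- sub(".$", "", s)                 # remove the last character
drop_last_alpha  <- sub("[[:alpha:]]$", "", s)       # remove a trailing letter
keep_digit       <- sub("[[:alpha:]]$", "", "I love donuts two times a day1")
drop_last_two    <- sub(".[[:alpha:]]$", "", s)      # any char plus a trailing letter

t <- "I love donuts two times a day"
first_blank <- sub("[[:blank:]]", "", t)    # only the first blank
all_blanks  <- gsub("[[:blank:]]", "", t)   # every blank

print(drop_last)
print(keep_digit)
print(drop_last_two)
print(first_blank)
print(all_blanks)
```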