diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 0000000..e250a0e
--- /dev/null
+++ b/.nojekyll
@@ -0,0 +1 @@
+95f80280
\ No newline at end of file
diff --git a/Makefile b/Makefile
deleted file mode 100644
index 472cf45..0000000
--- a/Makefile
+++ /dev/null
@@ -1,13 +0,0 @@
-regex.md: regex.Rmd
- Rscript -e "rmarkdown::render('regex.Rmd', rmarkdown::md_document(preserve_yaml = TRUE, variant = 'gfm', pandoc_args = '--markdown-headings=atx'))"
-
-text-manipulation.md: text-manipulation.Rmd
- Rscript -e "rmarkdown::render('text-manipulation.Rmd', rmarkdown::md_document(preserve_yaml = TRUE, variant = 'gfm', pandoc_args = '--markdown-headings=atx'))" ## atx headers ensures headers are all like #, ##, etc. Shouldn't be necessary as of pandoc >= 2.11.2
-## 'gfm' ensures that the 'r' tag is put on chunks, so code coloring/highlighting will be done when html is produced.
-
-
-# text-manipulation.md: text-manipulation.qmd
-# quarto render text-manipulation.qmd --to html
-
-#regex.html: regex.qmd
-# quarto render regex.qmd --to html
diff --git a/README.md b/README.md
deleted file mode 100644
index ae90b84..0000000
--- a/README.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# tutorial-string-processing
-Tutorial on string processing, including regular expressions, in R and Python.
-
-Please see the overview page at the [GitHub pages site](https://berkeley-scf.github.io/tutorial-string-processing) to easily view the materials in a browser.
-
diff --git a/_config.yml b/_config.yml
deleted file mode 100644
index a4fd5b4..0000000
--- a/_config.yml
+++ /dev/null
@@ -1,7 +0,0 @@
-remote_theme: pages-themes/minimal@v0.2.0
-plugins:
-- jekyll-remote-theme # add this line to the plugins list if you already have one
-title: String processing tutorial
-description: Training materials for string processing in R and Python.
-show_downloads: false
-logo: assets/img/logo.svg
\ No newline at end of file
diff --git a/_includes/toc.html b/_includes/toc.html
deleted file mode 100644
index 4a40ecc..0000000
--- a/_includes/toc.html
+++ /dev/null
@@ -1,182 +0,0 @@
-{% capture tocWorkspace %}
- {% comment %}
- Copyright (c) 2017 Vladimir "allejo" Jimenez
-
- Permission is hereby granted, free of charge, to any person
- obtaining a copy of this software and associated documentation
- files (the "Software"), to deal in the Software without
- restriction, including without limitation the rights to use,
- copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Software, and to permit persons to whom the
- Software is furnished to do so, subject to the following
- conditions:
-
- The above copyright notice and this permission notice shall be
- included in all copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES
- OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
- NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
- HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
- WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
- FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
- OTHER DEALINGS IN THE SOFTWARE.
- {% endcomment %}
- {% comment %}
- Version 1.2.0
- https://github.com/allejo/jekyll-toc
-
- "...like all things liquid - where there's a will, and ~36 hours to spare, there's usually a/some way" ~jaybe
-
- Usage:
- {% include toc.html html=content sanitize=true class="inline_toc" id="my_toc" h_min=2 h_max=3 %}
-
- Parameters:
- * html (string) - the HTML of compiled markdown generated by kramdown in Jekyll
-
- Optional Parameters:
- * sanitize (bool) : false - when set to true, the headers will be stripped of any HTML in the TOC
- * class (string) : '' - a CSS class assigned to the TOC
- * id (string) : '' - an ID to assigned to the TOC
- * h_min (int) : 2 - the minimum TOC header level to use; any header lower than this value will be ignored
- * h_max (int) : 6 - the maximum TOC header level to use; any header greater than this value will be ignored
- * ordered (bool) : false - when set to true, an ordered list will be outputted instead of an unordered list
- * item_class (string) : '' - add custom class(es) for each list item; has support for '%level%' placeholder, which is the current heading level
- * submenu_class (string) : '' - add custom class(es) for each child group of headings; has support for '%level%' placeholder which is the current "submenu" heading level
- * base_url (string) : '' - add a base url to the TOC links for when your TOC is on another page than the actual content
- * anchor_class (string) : '' - add custom class(es) for each anchor element
- * skip_no_ids (bool) : false - skip headers that do not have an `id` attribute
-
- Output:
- An ordered or unordered list representing the table of contents of a markdown block. This snippet will only
- generate the table of contents and will NOT output the markdown given to it
- {% endcomment %}
-
- {% capture newline %}
- {% endcapture %}
- {% assign newline = newline | rstrip %}
-
- {% capture deprecation_warnings %}{% endcapture %}
-
- {% if include.baseurl %}
- {% capture deprecation_warnings %}{{ deprecation_warnings }}{{ newline }}{% endcapture %}
- {% endif %}
-
- {% if include.skipNoIDs %}
- {% capture deprecation_warnings %}{{ deprecation_warnings }}{{ newline }}{% endcapture %}
- {% endif %}
-
- {% capture jekyll_toc %}{% endcapture %}
- {% assign orderedList = include.ordered | default: false %}
- {% assign baseURL = include.base_url | default: include.baseurl | default: '' %}
- {% assign skipNoIDs = include.skip_no_ids | default: include.skipNoIDs | default: false %}
- {% assign minHeader = include.h_min | default: 2 %}
- {% assign maxHeader = include.h_max | default: 6 %}
- {% assign nodes = include.html | strip | split: ' maxHeader %}
- {% continue %}
- {% endif %}
-
- {% assign _workspace = node | split: '' | first }}>{% endcapture %}
- {% assign header = _workspace[0] | replace: _hAttrToStrip, '' %}
-
- {% if include.item_class and include.item_class != blank %}
- {% capture listItemClass %} class="{{ include.item_class | replace: '%level%', currLevel | split: '.' | join: ' ' }}"{% endcapture %}
- {% endif %}
-
- {% if include.submenu_class and include.submenu_class != blank %}
- {% assign subMenuLevel = currLevel | minus: 1 %}
- {% capture subMenuClass %} class="{{ include.submenu_class | replace: '%level%', subMenuLevel | split: '.' | join: ' ' }}"{% endcapture %}
- {% endif %}
-
- {% capture anchorBody %}{% if include.sanitize %}{{ header | strip_html }}{% else %}{{ header }}{% endif %}{% endcapture %}
-
- {% if htmlID %}
- {% capture anchorAttributes %} href="{% if baseURL %}{{ baseURL }}{% endif %}#{{ htmlID }}"{% endcapture %}
-
- {% if include.anchor_class %}
- {% capture anchorAttributes %}{{ anchorAttributes }} class="{{ include.anchor_class | split: '.' | join: ' ' }}"{% endcapture %}
- {% endif %}
-
- {% capture listItem %}{{ anchorBody }}{% endcapture %}
- {% elsif skipNoIDs == true %}
- {% continue %}
- {% else %}
- {% capture listItem %}{{ anchorBody }}{% endcapture %}
- {% endif %}
-
- {% if currLevel > lastLevel %}
- {% capture jekyll_toc %}{{ jekyll_toc }}<{{ listModifier }}{{ subMenuClass }}>{% endcapture %}
- {% elsif currLevel < lastLevel %}
- {% assign repeatCount = lastLevel | minus: currLevel %}
-
- {% for i in (1..repeatCount) %}
- {% capture jekyll_toc %}{{ jekyll_toc }}{{ listModifier }}>{% endcapture %}
- {% endfor %}
-
- {% capture jekyll_toc %}{{ jekyll_toc }}{% endcapture %}
- {% else %}
- {% capture jekyll_toc %}{{ jekyll_toc }}{% endcapture %}
- {% endif %}
-
- {% capture jekyll_toc %}{{ jekyll_toc }}
This tutorial covers tools for manipulating text data in R and Python, including the use of regular expressions. The tutorial is somewhat more focused on R than Python. Please click on the links on the left for the various sections of this tutorial.
+
Please see the side menu bar for the various sections of this tutorial, of which this document is the introduction.
+
If you have a standard R or Python installation and can install the stringr package for R and the re package for Python, you should be able to reproduce the results in this document.
+
This tutorial assumes you have a working knowledge of R or Python.
+
Materials for this tutorial, including the Markdown file that was used to create this document, are available on GitHub.
+
\ No newline at end of file
diff --git a/index.qmd b/index.qmd
deleted file mode 100644
index 3755a5a..0000000
--- a/index.qmd
+++ /dev/null
@@ -1,16 +0,0 @@
----
-title: String processing
-date: 2025-06-20
----
-
-This tutorial covers tools for manipulating text data in R and Python, including the use of regular expressions. The tutorial is somewhat more focused on R than Python. Please click on the links on the left for the various sections of this tutorial.
-
-Please see the side menu bar for the various sections of this tutorial, of which this document is the introduction.
-
-If you have a standard R or Python installation and can install the `stringr` package for R and the `re` package for Python, you should be able to reproduce the results in this document.
-
-This tutorial assumes you have a working knowledge of R or Python.
-
-Materials for this tutorial, including the Markdown file that was used to create this document, are [available on GitHub](https://github.com/berkeley-scf/tutorial-string-processing).
-
-
diff --git a/license.html b/license.html
new file mode 100644
index 0000000..994bfce
--- /dev/null
+++ b/license.html
@@ -0,0 +1,595 @@
+
+
+
+
+
+
+
+
+
+License
Using regular expression in R and Python

1 Overview
Regular expressions are a domain-specific language for finding patterns and are one of the key functionalities in scripting languages such as Perl and Python, as well as the UNIX utilities grep, sed, and awk.
+
The basic idea of regular expressions is that they allow us to find matches of strings or patterns in strings, as well as do substitution. Regular expressions are good for tasks such as:
+
+
extracting pieces of text;
+
creating variables from information found in text;
+
cleaning and transforming text into a uniform format; and
+
mining text by treating documents as data.
+
+
+
+
2 Regular expression syntax
+
Please use one or more of the following resources to learn regular expression syntax.

Our tutorial on using the bash shell
The nice tutorial written by Duncan Temple Lang (UC Davis Statistics), which is part of the repository for this tutorial
Sections 9.9 and 11 of Paul Murrell’s book

Also, see the back/second page of RStudio’s stringr cheatsheet for a regular expression cheatsheet. And regex101.com is a website where you can interactively test regular expressions on example strings.
3 Versions of regular expressions

One thing that can cause headaches is differences in the version of regular expression syntax being used. As discussed in man grep, extended regular expressions are standard, with basic regular expressions providing somewhat less functionality and Perl regular expressions providing additional functionality. In R, stringr provides ICU regular expressions (see help(regex)), which are based on Perl regular expressions. More details can be found in the regex Wikipedia page.
+
+
The bash shell tutorial provides full documentation of the extended regular expression syntax, which we’ll focus on here. This should be sufficient for most usage and should be usable in R and Python, but if you notice something funny going on, it might be due to differences between regular expression versions.
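As a small, concrete illustration of these version differences (a sketch, not from the original tutorial): Python's re module does not recognize POSIX character classes such as [[:digit:]]. It quietly parses the brackets as an ordinary character set (newer Pythons may even warn about a "possible nested set"), while the Perl-style \d works as expected.

```python
import re

s = "call 919-543-3300"

# Perl-style \d matches digits, as expected.
print(re.findall(r"\d+", s))

# POSIX class [[:digit:]] is NOT special in re: the outer brackets
# define a set of the literal characters "[", ":", "d", "i", "g", "t",
# which must then be followed by a literal "]" -- so nothing matches here.
print(re.findall("[[:digit:]]", s))
```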
+
+
+
4 General principles for working with regex
+
The syntax is very concise, so it’s helpful to break down individual regular expressions into the component parts to understand them. As Murrell notes, since regex are their own language, it’s a good idea to build up a regex in pieces as a way of avoiding errors just as we would with any computer code. str_detect in R’s stringr and re.findall in Python are particularly useful in seeing what was matched to help in understanding and learning regular expression syntax and debugging your regex. As with many kinds of coding, I find that debugging my regex is usually what takes most of my time.
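To make the build-it-up-in-pieces advice concrete, here is one possible sketch in Python (the intermediate variable names are ours, purely for illustration), checking what each piece matches with re.findall before assembling the full pattern:

```python
import re

text = "Here's my number: 919-543-3300. Please call 919.554.3800"

# Step 1: a block of exactly three digits.
block = r"\d{3}"
print(re.findall(block, text))

# Step 2: a three-digit block followed by a separator (- or .).
block_sep = r"\d{3}[-.]"
print(re.findall(block_sep, text))

# Step 3: assemble the full phone-number pattern from the tested pieces.
phone = block_sep + block_sep + r"\d{4}"
print(re.findall(phone, text))
```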
+
+
+
5 Using regex in R
+
The grep, gregexpr and gsub functions and their stringr analogs are more powerful when used with regular expressions. In the following examples, we’ll illustrate usage of stringr functions, but with their base R analogs as comments.
+
+
5.1 Working with patterns
+
First let’s see the use of character sets and character classes.
+
+
library(stringr)
+text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
+          "They bought 731 bananas", "Please call 919.554.3800")
+str_detect(text, "[[:digit:]]") ## search for a digit
+
+
[1] TRUE FALSE TRUE TRUE
+
+
## Base R equivalent:
+## grep("[[:digit:]]", text, perl = TRUE)
+
+
+
str_detect(text, "[:,\t.]") # search for various punctuation symbols
+
+
[1] TRUE TRUE FALSE TRUE
+
+
## Base R equivalent:
+## grep("[:,\t.]", text)
+
+
+
str_locate_all(text, "[:,\t.]")
+
+
[[1]]
+ start end
+[1,] 17 17
+[2,] 31 31
+
+[[2]]
+ start end
+[1,] 8 8
+
+[[3]]
+ start end
+
+[[4]]
+ start end
+[1,] 16 16
+[2,] 20 20
+
+
## Base R equivalent:
+## gregexpr("[:,\t.]", text)
+
+
+
str_extract_all(text, "[[:digit:]]+") # extract one or more digits
## Base R equivalent:
+## matches <- gregexpr("[[:digit:]]+", text)
+## regmatches(text, matches)
+
+
+
+
+
+
+
+Challenge
+
+
+
+
How would I extract an email address from an arbitrary text string?
+
+
+
Next consider grouping.
+
For example, the phone number detection problem could have been done a bit more compactly (and more generally, in case the area code is omitted or a 1 is included) as:

str_extract_all(text, "(1[-.])?([[:digit:]]{3}[-.]){1,2}[[:digit:]]{4}")

## Base R equivalent:
+## matches <- gregexpr("(1[-.])?([[:digit:]]{3}[-.]){1,2}[[:digit:]]{4}", text)
+## regmatches(text, matches)
+
+
+
+
+
+
+
+Challenge
+
+
+
+
The above pattern would actually match something that is not a valid phone number. What can go wrong?
+
+
+
Here’s a basic example of using grouping via parentheses with the OR operator.
+
+
text <- c("at the site http://www.ibm.com", "other text", "ftp://ibm.com")
+str_locate(text, "(http|ftp):\\/\\/") # http or ftp followed by ://
+
+
start end
+[1,] 13 19
+[2,] NA NA
+[3,] 1 6
+
+
## Base R equivalent:
+## gregexpr("(http|ftp):\\/\\/", text)
+
+
Parentheses are also used for referencing back to a detected pattern when doing a replacement. For example, here we’ll find any numbers and add underscores before and after them:
+
+
text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
+          "They bought 731 bananas", "Please call 919.554.3800")
+str_replace_all(text, "([0-9]+)", "_\\1_") # place underscores around all numbers
+
+
[1] "Here's my number: _919_-_543_-_3300_."
+[2] "hi John, good to meet you"
+[3] "They bought _731_ bananas"
+[4] "Please call _919_._554_._3800_"
+
+
+
One uses the \\1 to refer back to the first group that was matched based on the parentheses. One can have multiple groups and refer to them with \\2, \\3, etc.
+
Here we’ll remove commas not used as field separators.
+
+
text <- ('"H4NY07011","ACKERMAN, GARY L.","H","$13,242",,,')
+clean_text <- str_replace_all(text, "([^\",]),", "\\1")
+clean_text
+
+
[1] "\"H4NY07011\",\"ACKERMAN GARY L.\",\"H\",\"$13242\",,,"
+
+
cat(clean_text)
+
+
"H4NY07011","ACKERMAN GARY L.","H","$13242",,,
+
+
## Base R equivalent:
+## gsub("([^\",]),", "\\1", text)
+
+
+
+
+
+
+
+Challenge
+
+
+
+
Suppose a text string has dates in the form “Aug-3”, “May-9”, etc. and I want them in the form “3 Aug”, “9 May”, etc. How would I do this search/replace?
+
+
+
Finally, let’s consider where a match ends when there is ambiguity.
+
As a simple example consider that if we try this search, we match as many digits as possible, rather than returning the first “9” as satisfying the request for “one or more” digits.
+
+
text <- "See the 998 balloons."
+str_extract(text, "[[:digit:]]+")
+
+
[1] "998"
+
+
+
That behavior is called greedy matching, and it’s the default. That example also shows why it is the default. What would happen if it were not the default?
+
However, sometimes greedy matching doesn’t get us what we want.
+
Consider this attempt to remove multiple html tags from a string.
+
+
text <- "Do an internship <b> in place </b> of <b> one </b> course."
+str_replace_all(text, "<.*>", "")
+
+
[1] "Do an internship course."
+
+
## Base R equivalent:
+## gsub("<.*>", "", text)
+
+
Notice what happens because of greedy matching.
+
One solution is to append a ? to the repetition syntax to cause the matching to be non-greedy. Here’s an example.
+
+
str_replace_all(text, "<.*?>", "")
+
+
[1] "Do an internship in place of one course."
+
+
## Base R equivalent:
+## gsub("<.*?>", "", text)
+
+
However, one can often avoid greedy matching by being more clever.
+
+
+
+
+
+
+Challenge
+
+
+
+
How could we change our regex to avoid the greedy matching without using the “?”?
+
+
+
+
+
5.2 ‘Escaping’ special characters
+
Using backslashes to ‘escape’ particular characters can be tricky. One rule of thumb is to just keep adding backslashes until you get what you want!
+
+
## last case here is literally a backslash and then 'n'
+strings <- c("Hello", "Hello.", "Hello\nthere", "Hello\\nthere")
+cat(strings, sep = "\n")
+
+
Hello
+Hello.
+Hello
+there
+Hello\nthere
+
+
+
+
str_detect(strings, ".") ## . means any character
+
+
[1] TRUE TRUE TRUE TRUE
+
+
## This would fail because \. looks for the special symbol \.
+## (which doesn't exist):
## str_detect(strings, "\.")
+
+str_detect(strings, "\\.") ## \\ says treat \ literally, which then escapes the .
+
+
[1] FALSE TRUE FALSE FALSE
+
+
str_detect(strings, "\n") ## \n looks for the special symbol \n
+
+
[1] FALSE FALSE TRUE FALSE
+
+
## \\ says treat \ literally, but \ is not meaningful regex
+try(str_detect(strings, "\\"))
+
+
Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
+ Unrecognized backslash escape sequence in pattern. (U_REGEX_BAD_ESCAPE_SEQUENCE, context=`\`)
+
+
## R parser removes two \ to give \\; then in regex \\ treats second \ literally
+str_detect(strings, "\\\\")
+
+
[1] FALSE FALSE FALSE TRUE
+
+
+
+
+
5.3 Other comments
+
If we are working with newlines embedded in a string, we can include the newline character as a regular character that is matched by a “.” by first creating the regular expression with stringr::regex with the dotall argument set to TRUE:
+
+
myex <- regex("<p>.*</p>", dotall = TRUE)
+html_string <- "And <p>here is some\ninformation</p> for you."
+str_extract(html_string, myex)
+
+
[1] "<p>here is some\ninformation</p>"
+
+
+
+
str_extract(html_string, "<p>.*</p>") # doesn't work because \n is not matched
+
+
[1] NA
+
+
+
Regular expressions can be used in a variety of places, e.g., to split by any number of whitespace characters:
+
+
line <- "a dog\tjumped\nover \tthe moon."
+cat(line)
+
+
a dog jumped
+over the moon.
+
+
str_split(line, "[[:space:]]+")
+
+
[[1]]
+[1] "a" "dog" "jumped" "over" "the" "moon."
+
+
str_split(line, "[[:blank:]]+")
+
+
[[1]]
+[1] "a" "dog" "jumped\nover" "the" "moon."
+
+
+
+
+
+
6 Using regex in Python
+
+
6.1 Working with patterns
+
For working with regex in Python, we’ll need the re package. It provides Perl-style regular expressions, but it doesn’t seem to support named character classes such as [:digit:]. Instead use classes such as \d and [0-9].
+
Again, in the code chunks that follow, all the explicit print statements are needed for R Markdown to print out the values.
+
In Python, you apply a matching function and then query the result to get information about what was matched and where in the string.
+
+
import re
+text = "Here's my number: 919-543-3300."
+m = re.search("\d+", text)
+m
+
+
<re.Match object; span=(18, 21), match='919'>
+
+
m.group()
+
+
'919'
+
+
m.start()
+
+
18
+
+
m.end()
+
+
21
+
+
m.span()
+
+
(18, 21)
+
+
+
Notice that this showed us only the first match.
+
We can instead use findall to get all the matches.
+
+
re.findall("\d+", text)
+
+
['919', '543', '3300']
+
+
+
To ignore case, do the following:
+
+
import re
+s = "That cat in the Hat"
+re.findall("hat", s, re.IGNORECASE)
+
+
['hat', 'Hat']
+
+
+
There are several other regex flags (also called compilation flags) that can control the behavior of the matching engine in interesting ways (check out re.VERBOSE and re.MULTILINE for instance).
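For instance, here is a brief sketch of two of those flags: re.VERBOSE allows whitespace and comments inside a pattern, which makes longer regexes readable, and re.MULTILINE makes ^ match at the start of every line rather than only at the start of the string.

```python
import re

# re.VERBOSE: whitespace and # comments inside the pattern are ignored.
phone = re.compile(r"""
    \d{3} [-.]   # area code plus separator
    \d{3} [-.]   # exchange plus separator
    \d{4}        # line number
    """, re.VERBOSE)
print(phone.findall("call 919-543-3300 or 919.554.3800"))

# re.MULTILINE: ^ anchors at the start of every line.
text = "first line\nsecond line"
print(re.findall(r"^\w+", text, re.MULTILINE))
```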
+
We can of course use list comprehension to work with multiple strings. But we need to be careful to check whether a match was found.
+
+
import re
+text = ["Here's my number: 919-543-3300.", "hi John, good to meet you",
+        "They bought 731 bananas", "Please call 919.554.3800"]
+
+def return_group(pattern, txt):
+    m = re.search(pattern, txt)
+    if m:
+        return m.group()
+    else:
+        return None
+
+[return_group("\d+", s) for s in text]
+
+
['919', None, '731', '919']
+
+
+
Next, let’s look at replacing patterns, using re.sub.
+
+
import re
+text = ["Here's my number: 919-543-3300.", "hi John, good to meet you",
+        "They bought 731 bananas", "Please call 919.554.3800"]
+re.sub("\d", "Z", text[0])
+
+
"Here's my number: ZZZ-ZZZ-ZZZZ."
+
+
+
Next let’s consider grouping using ().
+
Here’s a basic example of using grouping via parentheses with the OR operator (|).
+
+
text = "At the site http://www.ibm.com. Some other text. ftp://ibm.com"
+re.search("(http|ftp):\\/\\/", text).group()
+
+
'http://'
+
+
+
However, if we want all the matches and try to use findall, we see that it returns only the captured groups when grouping operators are present (as discussed briefly in help(re.findall)). To capture the full pattern with findall, we'd need to add an additional grouping operator around the whole pattern:
+
+
re.findall("(http|ftp):\\/\\/", text)
+
+
['http', 'ftp']
+
+
re.findall("((http|ftp):\\/\\/)", text)
+
+
[('http://', 'http'), ('ftp://', 'ftp')]
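An alternative not shown above is a non-capturing group, (?:...), which groups the alternation without creating a captured group, so findall returns the full match directly:

```python
import re

text = "At the site http://www.ibm.com. Some other text. ftp://ibm.com"

# (?:...) groups http|ftp for the alternation but captures nothing,
# so findall reports the whole matched text.
print(re.findall(r"(?:http|ftp)://", text))
```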
+
+
+
When you are searching for all occurrences of a pattern in a large text object, it may be beneficial to use finditer:
+
+
it = re.finditer("(http|ftp):\\/\\/", text) # http or ftp followed by ://
+
+for match in it:
+ match.span()
+
+
(12, 19)
+(49, 55)
+
+
+
This method behaves lazily and returns an iterator that gives us one match at a time, and only scans for the next match when we ask for it.
+
Groups are also used when we need to reference back to a detected pattern when doing a replacement. This is why they are sometimes referred to as “capturing groups”. For example, here we’ll find any numbers and add underscores before and after them:
+
+
text = "Here's my number: 919-543-3300. They bought 731 bananas. Please call 919.554.3800."
+re.sub("([0-9]+)", "_\\1_", text)
+
+
"Here's my number: _919_-_543_-_3300_. They bought _731_ bananas. Please call _919_._554_._3800_."
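The same backreference idea extends to multiple capturing groups, which are referenced as \\2, \\3, and so on, in the order their opening parentheses appear. A minimal sketch (the name string here is made up for illustration):

```python
import re

# Two capturing groups; \2 and \1 in the replacement swap their order.
name = "ACKERMAN, GARY"
print(re.sub(r"(\w+), (\w+)", r"\2 \1", name))
```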
+
+
+
Here we’ll remove commas not used as field separators by replacing all commas except those occurring after another comma or after a quotation mark. This is an attempt to remove all commas not used as field delimiters.
+
+
text = '"H4NY07011","ACKERMAN, GARY L.","H","$13,242",,,'
+re.sub("([^\",]),", "\\1", text)
+
+
'"H4NY07011","ACKERMAN GARY L.","H","$13242",,,'
+
+
+
How does that work? Consider that “[^\",]” matches a character that is not a quote and not a comma. The regex is therefore such a non-quote, non-comma character followed by a comma, with the matched character saved in \\1 because of the grouping operator.
+
Groups can also be given names, instead of having to refer to them by their numbers, but we will not demonstrate this here.
+
Finally, let’s consider where a match ends when there is ambiguity.
+
As a simple example consider that if we try this search, we match as many digits as possible, rather than returning the first “9” as satisfying the request for “one or more” digits.
+
+
text = "See the 998 balloons."
+re.findall("\d+", text)
+
+
['998']
+
+
+
That behavior is called greedy matching, and it’s the default. That example also shows why it is the default. What would happen if it were not the default?
+
However, sometimes greedy matching doesn’t get us what we want.
+
Consider this attempt to remove multiple html tags from a string.
+
+
text = "Do an internship <b> in place </b> of <b> one </b> course."
+re.sub("<.*>", "", text)
+
+
'Do an internship course.'
+
+
+
Notice what happens because of greedy matching.
+
One way to avoid greedy matching is to use a ? after the repetition specifier.
+
+
re.sub("<.*?>", "", text)
+
+
'Do an internship in place of one course.'
+
+
+
However, that syntax is a bit frustrating because ? is also used to indicate 0 or 1 repetitions, making the regex a bit hard to read/understand.
+
+
+
+
+
+
+Challenge
+
+
+
+
Suppose I want to strip out HTML tags but without using the ? to avoid greedy matching. How can I be more careful in constructing my regex?
+
+
+
+
+
6.2 ‘Escaping’ special characters
+
For reasons explained in the re documentation, to match an actual backslash, such as "\section", you’d need "\\\\section". This can be avoided by using raw strings: r"\\section".
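To see the backslash counting concretely, here is a small sketch (the \section example string follows the one mentioned above):

```python
import re

text = r"\section{Introduction}"  # contains a literal backslash

# As an ordinary string, the regex \\section must be written with
# four backslashes: two string-level backslashes per regex-level one.
print(re.search("\\\\section", text).group())

# A raw string avoids the string-literal doubling; only the
# regex-level escaping remains.
print(re.search(r"\\section", text).group())
```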
+
Here are some more examples of escaping characters, using the same example strings as in the R section.

strings = ["Hello", "Hello.", "Hello\nthere", "Hello\\nthere"]
re.search(".", strings[0]) ## . means any character
+
+
<re.Match object; span=(0, 1), match='H'>
+
+
re.search("\.", strings[0]) ## \. escapes the period and treats it literally
+re.search("\.", strings[1]) ## \. escapes the period and treats it literally
+
+
<re.Match object; span=(5, 6), match='.'>
+
+
re.search("\n", strings[2]) ## \n looks for the special symbol \n
+
+
<re.Match object; span=(5, 6), match='\n'>
+
+
re.search("\n", strings[3]) ## \n looks for the special symbol \n
+re.search("\\\\", strings[3]) ## string parser removes two \ to give \\;
+## then in regex \\ treats second \ literally
+
+
+
+
6.3 Other comments
+
You can also compile regex patterns for faster processing when working with a pattern multiple times.
+
+
import re
+text = ["Here's my number: 919-543-3300.", "hi John, good to meet you",
+        "They bought 731 bananas", "Please call 919.554.3800"]
+p = re.compile('\d+')
+m = p.search(text[0])
+m.group()
+
+
'919'
+
+
\ No newline at end of file
diff --git a/regex.qmd b/regex.qmd
deleted file mode 100644
index 3682476..0000000
--- a/regex.qmd
+++ /dev/null
@@ -1,525 +0,0 @@
----
-title: Using regular expression in R and Python
-format:
- html:
- theme: cosmo
- css: assets/styles.css
- toc: true
- code-copy: true
- code-block-bg: true
- code-block-border-left: "#31BAE9"
-ipynb-shell-interactivity: all
-code-overflow: wrap
-execute:
- freeze: auto
----
-
-
-## 1 Overview
-
-Regular expressions are a domain-specific language for
-finding patterns and are one of the key functionalities in scripting
-languages such as Perl and Python, as well as the UNIX utilities `grep`, `sed`, and
-`awk`.
-
-The basic idea of regular expressions is that they allow us to find
-matches of strings or patterns in strings, as well as do substitution.
-Regular expressions are good for tasks such as:
-
- - extracting pieces of text;
- - creating variables from information found in text;
- - cleaning and transforming text into a uniform format; and
- - mining text by treating documents as data.
-
-## 2 Regular expression syntax
-
-Please use one or more of the following resources to learn regular expression syntax.
-
-- [Our tutorial on using the bash shell](https://computing.stat.berkeley.edu/tutorial-using-bash/regex)
-- Duncan Temple Lang (UC Davis Statistics) has written a [nice tutorial](regexpr-Lang.pdf) that is part of the repository for this tutorial
-- Check out Sections 9.9 and 11 of [Paul Murrell's book](http://www.stat.auckland.ac.nz/~paul/ItDT)
-
-Also, see the back/second page of [RStudio's stringr cheatsheet](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf) for a regular expression cheatsheet. And here is a [website where you can interactively test regular expressions on example strings](https://regex101.com).
-
-## 3 Versions of regular expressions
-
-::: {.callout-warning title="Regex versions"}
-
-One thing that can cause headaches is differences in the version of regular expression syntax being used. As discussed in `man grep`, *extended regular expressions* are standard, with *basic regular expressions* providing somewhat less functionality and *Perl regular expressions* providing additional functionality.
-In R, `stringr` provides *ICU regular expressions* (see `help(regex)`), which are based on Perl regular expressions. More details can be found in the [regex Wikipedia page](https://en.wikipedia.org/wiki/Regular_expression).
-
-:::
-
-The [bash shell tutorial](https://computing.stat.berkeley.edu/tutorial-using-bash/regex) provides full documentation of the *extended regular expression* syntax, which we'll focus on here. This should be sufficient for most usage and should be usable in R and Python, but if you notice something funny going on, it might be due to differences between regular expression versions.
-
-## 4 General principles for working with regex
-
-The syntax is very concise, so it's helpful to break down
-individual regular expressions into the component parts to understand
-them. As Murrell notes, since regex are their own language, it's
-a good idea to build up a regex in pieces as a way of avoiding errors
-just as we would with any computer code. `str_detect` in R's `stringr` and `re.findall` in Python are particularly
-useful in seeing *what* was matched to help in understanding
-and learning regular expression syntax and debugging your regex. As with
-many kinds of coding, I find that debugging my regex is usually what takes
-most of my time.
-
-## 5 Using regex in R
-
-
-The `grep`, `gregexpr` and `gsub` functions and
-their `stringr` analogs are more powerful when used with regular
-expressions. In the following examples, we'll illustrate usage of `stringr` functions, but
-with their base R analogs as comments.
-
-### 5.1 Working with patterns
-
-First let's see the use of character sets and character classes.
-
-
-```{r}
-library(stringr)
-text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
- "They bought 731 bananas", "Please call 919.554.3800")
-str_detect(text, "[[:digit:]]") ## search for a digit
-
-## Base R equivalent:
-## grep("[[:digit:]]", text, perl = TRUE)
-```
-
-
-```{r}
-str_detect(text, "[:,\t.]") # search for various punctuation symbols
-
-## Base R equivalent:
-## grep("[:,\t.]", text)
-```
-
-```{r}
-str_locate_all(text, "[:,\t.]")
-
-## Base R equivalent:
-## gregexpr("[:,\t.]", text)
-```
-
-```{r}
-str_extract_all(text, "[[:digit:]]+") # extract one or more digits
-
-## Base R equivalent:
-## matches <- gregexpr("[[:digit:]]+", text)
-## regmatches(text, matches)
-```
-
-```{r}
-str_replace_all(text, "[[:digit:]]", "X")
-
-## Base R equivalent:
-## gsub("[[:digit:]]", "X", text)
-```
-
-::: {.callout-tip title="Challenge"}
-
-How would we find a spam-like pattern with digits or non-letters inside a word? E.g., I want to find "V1agra" or "Fancy repl!c@ted watches".
-:::
-
-Next let's consider location-specific matches.
-
-```{r}
-str_detect(text, "^[[:upper:]]") # text starting with upper case letter
-
-## Base R equivalent:
-## grep("^[[:upper:]]", text)
-```
-
-```{r}
-str_detect(text, "[[:digit:]]$") # text ending with a digit
-
-## Base R equivalent:
-## grep("[[:digit:]]$", text)
-```
-
-
-Now let's make use of repetitions.
-
-Let's search for US/Canadian/Caribbean phone numbers in the example text we've been using:
-
-
-
-```{r}
-text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
- "They bought 731 bananas", "Please call 919.554.3800")
-pattern <- "[[:digit:]]{3}[-.][[:digit:]]{3}[-.][[:digit:]]{4}"
-str_extract_all(text, pattern)
-
-## Base R equivalent:
-## matches <- gregexpr(pattern, text)
-## regmatches(text, matches)
-```
-
-::: {.callout-tip title="Challenge"}
-
-How would I extract an email address from an arbitrary text string?
-:::
-
-Next consider grouping.
-
-For example, the phone number detection problem could have been done a bit more compactly (and more generally, in case the area code is omitted or a 1 is included) as:
-
-
-```{r}
-str_extract_all(text, "(1[-.])?([[:digit:]]{3}[-.]){1,2}[[:digit:]]{4}")
-
-## Base R equivalent:
-## matches <- gregexpr("(1[-.])?([[:digit:]]{3}[-.]){1,2}[[:digit:]]{4}", text)
-## regmatches(text, matches)
-```
-
-::: {.callout-tip title="Challenge"}
-The above pattern would actually match something that is not a valid phone number. What can go wrong?
-:::
-
-Here's a basic example of using grouping via parentheses with the OR operator.
-
-
-```{r}
-text <- c("at the site http://www.ibm.com", "other text", "ftp://ibm.com")
-str_locate(text, "(http|ftp):\\/\\/") # http or ftp followed by ://
-
-## Base R equivalent:
-## gregexpr("(http|ftp):\\/\\/", text)
-```
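The same alternation with grouping works in Python; here each span is a 0-based (start, end) position pair:

```python
import re

text = ["at the site http://www.ibm.com", "other text", "ftp://ibm.com"]
matches = [re.search(r"(http|ftp)://", s) for s in text]
[m.span() if m else None for m in matches]  # position of each match, or None
```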
-
-Parentheses are also used for referencing back to a detected pattern when doing a replacement. For example, here we'll find any numbers and add underscores before and after them:
-
-
-```{r}
-text <- c("Here's my number: 919-543-3300.", "hi John, good to meet you",
- "They bought 731 bananas", "Please call 919.554.3800")
-str_replace_all(text, "([0-9]+)", "_\\1_") # place underscores around all numbers
-```
-
-One uses the `\\1` to refer back to the first group that was matched based on the parentheses. One can have multiple groups and refer to them with `\\2`, `\\3`, etc.
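A rough Python analog: the replacement string uses `\1` (in a raw string) to refer back to the captured group:

```python
import re

text = "They bought 731 bananas for 12 dollars"
re.sub(r"([0-9]+)", r"_\1_", text)  # wrap each number in underscores
```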
-
-Here we'll remove commas not used as field separators.
-
-
-```{r}
-text <- ('"H4NY07011","ACKERMAN, GARY L.","H","$13,242",,,')
-clean_text <- str_replace_all(text, "([^\",]),", "\\1")
-clean_text
-
-cat(clean_text)
-
-## Base R equivalent:
-## gsub("([^\",]),", "\\1", text)
-```
-
-::: {.callout-tip title="Challenge"}
-Suppose a text string has dates in the form “Aug-3”, “May-9”, etc. and I want them in the form “3 Aug”, “9 May”, etc. How would I do this search/replace?
-:::
-
-Finally, let's consider where a match ends when there is ambiguity.
-
-As a simple example consider that if we try this search, we match as many digits as possible, rather than returning the first "9" as satisfying the request for "one or more" digits.
-
-```{r}
-text <- "See the 998 balloons."
-str_extract(text, "[[:digit:]]+")
-```
-
-That behavior is called *greedy* matching, and it's the default. That example also shows why it
-is the default. What would happen if it were not the default?
-
-However, sometimes greedy matching doesn't get us what we want.
-
-Consider this attempt to remove multiple html tags from a string.
-
-
-```{r}
-text <- "Do an <b>internship</b> in place of <b>one</b> course."
-str_replace_all(text, "<.*>", "")
-
-## Base R equivalent:
-## gsub("<.*>", "", text)
-```
-
-Notice what happens because of greedy matching.
-
-One solution is to append a ? to the repetition syntax to cause the matching to be non-greedy. Here's an example.
-
-```{r}
-str_replace_all(text, "<.*?>", "")
-
-## Base R equivalent:
-## gsub("<.*?>", "", text)
-```
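Python's `re` module supports the same `?` non-greedy modifier (a sketch, with hypothetical tags):

```python
import re

text = "Do an <b>internship</b> in place of <b>one</b> course."
re.sub(r"<.*>", "", text)   # greedy: matches from the first '<' to the last '>'
re.sub(r"<.*?>", "", text)  # non-greedy: matches each tag separately
```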
-
-However, one can often avoid greedy matching by being more clever.
-
-::: {.callout-tip title="Challenge"}
-How could we change our regex to avoid the greedy matching without using the “?”?
-:::
-
-### 5.2 'Escaping' special characters
-
-Using backslashes to 'escape' particular characters can be tricky. One rule of thumb is to just keep adding backslashes until you get what you want!
-
-
-```{r}
-## last case here is literally a backslash and then 'n'
-strings <- c("Hello", "Hello.", "Hello\nthere", "Hello\\nthere")
-cat(strings, sep = "\n")
-```
-
-
-```{r}
-str_detect(strings, ".") ## . means any character
-
-## This would fail because \. looks for the special symbol \.
-## (which doesn't exist):
-## str_detect(strings, "\.")
-
-str_detect(strings, "\\.") ## \\ says treat \ literally, which then escapes the .
-
-str_detect(strings, "\n") ## \n looks for the special symbol \n
-
-## \\ says treat \ literally, but \ is not meaningful regex
-try(str_detect(strings, "\\"))
-## R parser removes two \ to give \\; then in regex \\ treats second \ literally
-str_detect(strings, "\\\\")
-```
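Python has the same two layers of escaping (the string literal, then the regex); raw strings sidestep the first layer. A sketch with the same example strings:

```python
import re

strings = ["Hello", "Hello.", "Hello\nthere", "Hello\\nthere"]
[bool(re.search(r"\.", s)) for s in strings]  # a literal period
[bool(re.search(r"\n", s)) for s in strings]  # the newline character
[bool(re.search(r"\\", s)) for s in strings]  # one literal backslash
```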
-
-
-### 5.3 Other comments
-
-If we are working with newlines embedded in a string, we can include the newline character as a regular character that is matched by a “.” by first creating the regular expression with `stringr::regex` with the `dotall` argument set to `TRUE`:
-
-
-```{r}
-myex <- regex("<p>.*</p>", dotall = TRUE)
-html_string <- "<p>And here is some\ninformation</p> for you."
-str_extract(html_string, myex)
-```
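The analogous flag in Python's `re` module is `re.DOTALL` (a sketch, with hypothetical `<p>` tags):

```python
import re

html_string = "<p>And here is some\ninformation</p> for you."
re.search(r"<p>.*</p>", html_string)                     # None: '.' stops at the newline
re.search(r"<p>.*</p>", html_string, re.DOTALL).group()  # matches across the newline
```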
-
-```{r}
-str_extract(html_string, "<p>.*</p>")  ## without dotall, '.' won't match the newline
-```
diff --git a/text-manipulation.qmd b/text-manipulation.qmd
deleted file mode 100644
index 06e89f2..0000000
--- a/text-manipulation.qmd
+++ /dev/null
@@ -1,261 +0,0 @@
----
-title: Basic text manipulation in R and Python
-format:
- html:
- theme: cosmo
- css: assets/styles.css
- toc: true
- code-copy: true
- code-block-bg: true
- code-block-border-left: "#31BAE9"
-ipynb-shell-interactivity: all
-code-overflow: wrap
-execute:
- freeze: auto
----
-
-## 1 Overview
-
-Text manipulations in R, Python, Perl, and bash have a number of things
-in common, as many of these evolved from UNIX. When I use the
-term *string* here, I'll be referring to any sequence of characters
-that may include numbers, white space, and special characters. Note that in R
-a character vector is a vector of one or more such strings.
-
-Some of the basic things we need to do are paste/concatenate strings together,
-split strings apart, take subsets of strings, and replace characters within strings.
-Often these operations are done based on patterns rather than a fixed string
-sequence. This involves the use of [regular expressions](regex.qmd).
-
-## 2 R
-
-In general, strings in R are stored in character vectors. R's functions for string manipulation are fully vectorized and will work on all of the strings in a vector at once.
-
-One can do string manipulation in base R or using the `stringr` package. In general, I'd suggest using `stringr` functions in place of R's base string functions.
-
-
-
-### 2.1 String manipulation in base R
-
-A few of the basic R functions for manipulating strings are `paste`,
-`strsplit`, and `substring`. `paste` and `strsplit`
-are basically inverses of each other:
-
-- `paste` concatenates together an arbitrary set of strings (or a vector, if using the `collapse` argument) with a user-specified separator character
-- `strsplit` splits apart based on a delimiter/separator
-- `substring` splits apart the elements of a character vector based on fixed widths
-- `nchar` returns the number of characters in a string.
-
-Note that all of these operate in a vectorized fashion.
-
-
-```{r}
-out <- paste("My", "name", "is", "Chris", ".", sep = " ")
-paste(c("My", "name", "is", "Chris", "."), collapse = " ") # equivalent
-
-nchar(out)
-
-strsplit(out, split = ' ')
-```
-
-::: {.callout-warning}
-
-Some string processing functions (such as `strsplit` above) can return multiple values for each input string (each element of the character vector). As a result, the functions will return a list, which will be a list with one element when the function operates on a single string.
-
-```{r}
-out <- c("Her name is Maya", "Hello everyone")
-strsplit(out, split = ' ')
-```
-:::
-
-Here are some examples of using `substring`:
-
-```{r}
-times <- c("04:18:04", "12:12:53", "13:47:00")
-substring(times, 7, 8)
-```
-
-```{r}
-substring(times[3], 1, 2) <- '01' ## replacement
-times
-```
-
-
-To identify particular subsequences in strings, there are several
-closely-related R functions. `grep` will look for a specified string
-within an R character vector and report back indices identifying the
-elements of the vector in which the string was found. Note that using the
-`fixed=TRUE` argument ensures that regular expressions are NOT
-used. `gregexpr` will indicate the position in each string
-that the specified string is found (use `regexpr` if you only
-want the first occurrence). `gsub` can be used to replace a
-specified string with a replacement string (use `sub` if you
-only want to replace the first occurrence).
-
-
-```{r}
-dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
-grep("2016", dates)
-```
-
-```{r}
-gregexpr("2016", dates)
-```
-
-```{r}
-gsub("2016", "16", dates)
-```
-
-### 2.2 String manipulation using `stringr`
-
-The `stringr` package wraps the various core string manipulation
-functions to provide a common interface. It also removes some of the
-clunkiness involved in some of the string operations with the base
-string functions, such as having to call `gregexpr` and
-then `regmatches` to pull out the matched strings.
-
-Here's a [cheatsheet from RStudio](https://raw.githubusercontent.com/rstudio/cheatsheets/main/strings.pdf) on manipulating strings using the `stringr` package in R.
-
-First let's see `stringr`'s versions of some of the base R string functions mentioned in the previous section.
-
-The basic interface to `stringr` functions is `function(character_vector, pattern, [replacement])`.
-
-Table 1 provides an overview of the key functions related to working with patterns, which are basically
-wrappers for `grep`, `gsub`, `gregexpr`, etc.
-
-
-| Function | What it does
-|-----------------------------------|---------------------------------------------------------------------
-| str_detect | detects pattern, returning TRUE/FALSE
-| str_count | counts matches
-| str_locate/str_locate_all | detects pattern, returning positions of matching characters
-| str_extract/str_extract_all | detects pattern, returning matches
-| str_replace/str_replace_all | detects pattern and replaces matches
-
-The analog of `regexpr` vs. `gregexpr` and `sub`
-vs. `gsub` is that most of the functions have versions that
-return all the matches, not just the first match, e.g., `str_locate_all`,
-`str_extract_all`, etc. Note that the `_all` functions return
-lists while the non-`_all` functions return vectors.
-
-
-To specify options, you can wrap these functions around the pattern
-argument: `fixed(pattern, ignore_case)` and `regex(pattern, ignore_case)`.
-The default is `regex`, so you only need to specify that if you also want to
-specify additional arguments, such as `ignore_case` or others listed under `help(regex)` (invoke the help after loading `stringr`).
-
-Here's an example:
-
-```{r}
-library(stringr)
-str <- c("Apple Computer", "IBM", "Apple apps")
-
-str_detect(str, fixed("app", ignore_case = TRUE))
-str_count(str, fixed("app", ignore_case = TRUE))
-```
-
-```{r}
-str_locate(str, fixed("app", ignore_case = TRUE))
-str_locate_all(str, fixed("app", ignore_case = TRUE))
-```
-
-```{r}
-dates <- c("2016-08-03", "2007-09-05", "2016-01-02")
-str_locate(dates, "20[^0][0-9]") ## regular expression: years 2010 and later
-```
-
-```{r}
-str_extract_all(dates, "20[^0][0-9]")
-str_replace_all(dates, "20[^0][0-9]", "XXXX")
-```
-
-## 3 Python
-
-Let's see basic concatenation, splitting, working with substrings, and searching/replacing
-substrings. Notice that Python's string functionality is object-oriented (though `len` is not).
-
-Here, we'll just cover the basic methods for the `str` type. There's lots of additional functionality for [working with strings using regular expressions in the `re` package](regex.qmd#using-regex-in-python). Of course in many cases of working with strings, one would need the full power of regular expressions to do what one needs to do.
-
-First let's look at combining/concatenating strings. We can do this with the `+` operator
-or using the `join` method, which is (perhaps confusingly) called based on the separator
-of interest with the input strings as arguments.
-
-```{python}
-out = "My" + "name" + "is" + "Chris" + "."
-out
-```
-
-```{python}
-out = ' '.join(("My", "name", "is", "Chris", "."))
-out
-```
-
-`len` simply returns the number of characters in the string.
-
-```{python}
-len(out)
-```
-
-```{python}
-out.split(' ')
-```
-
-To see the various string methods, we can hit tab after typing `str.` or based on any specific string:
-
-```{python, eval=FALSE}
-out.
-```
-
-```
-out.capitalize() out.index( out.isspace() out.removesuffix( out.startswith(
-out.casefold() out.isalnum() out.istitle() out.replace( out.strip(
-out.center( out.isalpha() out.isupper() out.rfind( out.swapcase()
-out.count( out.isascii() out.join( out.rindex( out.title()
-out.encode( out.isdecimal() out.ljust( out.rjust( out.translate(
-out.endswith( out.isdigit() out.lower() out.rpartition( out.upper()
-out.expandtabs( out.isidentifier() out.lstrip( out.rsplit( out.zfill(
-out.find( out.islower() out.maketrans( out.rstrip(
-out.format( out.isnumeric() out.partition( out.split(
-out.format_map( out.isprintable() out.removeprefix( out.splitlines(
-```
-
-Unlike in R, you cannot use the string methods directly on a list or tuple of strings, but you of course can do things like list comprehension to easily process multiple strings.
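For example, applying a method to each element with a list comprehension:

```python
strings = ["Her name is Maya", "Hello everyone"]
[s.lower() for s in strings]      # one lower-cased string per element
[s.split(' ') for s in strings]   # one list of words per element
```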
-
-Working with substrings relies on the fact that Python works with strings as if they are vectors of individual characters.
-
-
-```{python}
-var = "13:47:00"
-var[3:5]
-```
-
-However, strings are immutable: you cannot alter a subset of characters within an existing string. One workaround is to work with strings as lists.
-
-```{python, error=TRUE}
-var[0:2] = "01"
-```
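A sketch of that list-based workaround: convert the string to a list of characters, modify the list, and join it back:

```python
var = "13:47:00"
chars = list(var)   # lists, unlike strings, are mutable
chars[0:2] = "01"   # slice assignment accepts any iterable of characters
''.join(chars)
```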
-
-
-Now let's consider finding substrings. Here Python tells us that '2016' starts at index 6 (with 0-based indexing).
-
-
-```{python}
-var = "08-03-2016"
-var.find("2016")
-```
-
-We can count occurrences with `.count()`:
-
-```{python}
-var = "08-03-2016; 07-09-2016"
-var.count("2016")
-```
-
-And we can replace like this:
-
-```{python}
-var = "13:47:00"
-var.replace("13", "01")
-```
-
-