Character vector basics
This package is for operating on three key kinds of things:
Strings
Sequences of characters, delimited by either double (") or single (') quotes.
"Hello"
#> [1] "Hello"
'Everyone'
#> [1] "Everyone"
Characters
Any letter or number like A or 1, but more specifically, any individual Unicode code point:
"A"
#> [1] "A"
# Whitespace and punctuation are characters too
" "
#> [1] " "
# Or even emojis...
"😮"
#> [1] "😮"
# There are other ways to represent a single character
"\u42"
#> [1] "B"
Of course, those were all strings.
# How many characters are in each of these strings?
str_length("Hello")
#> [1] 5
str_length('Everyone')
#> [1] 8
str_length("A")
#> [1] 1
str_length(" ")
#> [1] 1
str_length("😮")
#> [1] 1
str_length("\u42")
#> [1] 1
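To make the idea of a code point more concrete, here is a small sketch in base R. It uses base functions only for comparison; none of them belong to suitestrings. It converts between a character and its code point number, and contrasts counting characters with counting bytes.

```r
# Base R only; these are not suitestrings functions.
# A character corresponds to one Unicode code point, which is an integer.
code_a <- utf8ToInt("A")     # code point of "A" as an integer
char_b <- intToUtf8(0x42)    # character for code point U+0042

# One code point may occupy several bytes once encoded as UTF-8.
e_acute <- enc2utf8("\u00e9")              # "é", forced to UTF-8
n_chars <- nchar(e_acute, type = "chars")  # counts code points
n_bytes <- nchar(e_acute, type = "bytes")  # counts bytes of the encoding

print(code_a)
#> [1] 65
print(char_b)
#> [1] "B"
print(n_chars)
#> [1] 1
print(n_bytes)
#> [1] 2
```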
Character vectors
Sequences of strings.
c("Hello", "world")
#> [1] "Hello" "world"
# What is the object type?
typeof(c("Hello", "world"))
#> [1] "character"
# How many elements are in the vector?
length(c("Hello", "world"))
#> [1] 2
Our function names use the shorthand str for strings and chr for character vectors; the significance of the distinction is detailed further in vignette("suitestrings-conventions").
It’s useful to keep in mind that for many purposes the three are all the same: as R objects, they are all stored as character vectors. That is, a string is simply a character vector with a single element, which may contain zero, one or many characters. Usually, when we say something operates on a string, it will work on every individual string within a character vector.
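The relationship between strings and character vectors can be checked directly. The following is a minimal sketch in base R (not suitestrings functions), showing that a single string is a character vector of length one and that a per-string operation is applied to every element in turn.

```r
# Base R only; these are not suitestrings functions.
single   <- "Hello"
is_chr   <- is.character(single)  # a "string" is a character vector ...
n_single <- length(single)        # ... with exactly one element

# An empty string is still one element; it simply holds zero characters.
words   <- c("Hello", "", "everyone")
n_words <- length(words)          # number of strings in the vector
n_chars <- nchar(words)           # characters in each individual string

print(is_chr)
#> [1] TRUE
print(n_single)
#> [1] 1
print(n_words)
#> [1] 3
print(n_chars)
#> [1] 5 0 8
```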
suitestrings operations
There are a few families of operations for working with strings:
- Character-based transformation: Change strings based on individual characters within them. Extend strings by adding characters, shorten strings by removing them. Handle specific kinds of characters like whitespace.
- String combination: Concatenate strings with other strings, or even with the results of R expressions converted to character.
- Pattern matching operations: Manipulate strings based on a pattern of characters, which is usually defined by a regular expression.
- Character vector organisation: Sort a character vector alphabetically or remove duplicate elements of a character vector.
Character-based transformation
Shorten strings
There are some functions useful for quickly cleaning a string. str_trim() removes whitespace from the ends, and str_squish() additionally reduces whitespace in the middle to a single character:
str_trim(" Get rid of spaces at the ends ")
#> [1] "Get rid of spaces at the ends"
str_squish(" Get these spaces \u020 under control ")
#> [1] "Get these spaces under control"
If you want to cut a string down to size, but also have the string indicate that it has been shortened, you can truncate it to a specified number of characters:
str_truncate("Sometimes we just need to make a string smaller", 20)
#> [1] "Sometimes we just..."
# To cut it off without an ellipsis
str_truncate("Sometimes we just need to make a string smaller", 20, ellipsis = "")
#> [1] "Sometimes we just ne"
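The truncation logic can be sketched in base R as follows. The helper name truncate_sketch() is hypothetical and this is not the suitestrings implementation; it only assumes, as the output above suggests, that the ellipsis counts towards the requested width.

```r
# Hypothetical helper in base R; NOT the suitestrings implementation.
truncate_sketch <- function(x, width, ellipsis = "...") {
  # Only touch non-missing strings that are longer than the width
  too_long <- !is.na(x) & nchar(x) > width
  # Reserve room for the ellipsis so the result is exactly `width` long
  keep <- width - nchar(ellipsis)
  x[too_long] <- paste0(substr(x[too_long], 1, keep), ellipsis)
  x
}

sentence  <- "Sometimes we just need to make a string smaller"
with_dots <- truncate_sketch(sentence, 20)
no_dots   <- truncate_sketch(sentence, 20, ellipsis = "")
untouched <- truncate_sketch("short", 20)

print(with_dots)
#> [1] "Sometimes we just..."
print(no_dots)
#> [1] "Sometimes we just ne"
print(untouched)
#> [1] "short"
```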
Extend strings
Going the other way, str_pad() fills a string to a minimum length, and str_indent() adds a specific number of spaces:
str_pad("hello", 10)
#> [1] "     hello"
# Though they can both extend with other characters
str_indent("hello", 3, indent = ".")
#> [1] "...hello"
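For readers who know base R, roughly comparable results can be produced with formatC() and strrep(). This is only a comparison sketch using base functions; it says nothing about how str_pad() or str_indent() are implemented.

```r
# Base R only; these are not suitestrings functions.
# Fill to a minimum width of 10 by adding spaces on the left ...
padded_left  <- formatC("hello", width = 10)
# ... or on the right, using the "-" flag for left justification.
padded_right <- formatC("hello", width = 10, flag = "-")

# Prefix a fixed number of a chosen character.
dotted <- paste0(strrep(".", 3), "hello")

print(padded_left)
#> [1] "     hello"
print(padded_right)
#> [1] "hello     "
print(dotted)
#> [1] "...hello"
```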
Convert Case
Functions to change case can help with consistent formatting:
str_to_upper_case("hello")
#> [1] "HELLO"
str_to_snake_case(c("nO FUNny buSSIneSs", " TIDY this UP "))
#> [1] "no_funny_bussiness" "tidy_this_up"
String combination
Concatenate strings together with str_concat() or str_glue().
str_concat("abc", "def")
#> [1] "abcdef"
# Both can take a custom separator argument to place between strings
str_glue("abc", "def", separator = " ")
#> [1] "abc def"
They can also combine strings with the results of R expressions (coerced with as.character()). str_glue() is designed to handle this more elegantly by treating text in curly braces {} as R code.
str_concat(
"hello",
10 * 10,
"worlds",
separator = " "
)
#> [1] "hello 100 worlds"
str_glue("hello {10*10} worlds")
#> [1] "hello 100 worlds"
Concatenate the elements of a character vector into a single string with chr_collapse():
chr_collapse(c("abc", "def"))
#> [1] "abcdef"
chr_collapse(1:5, separator = ", ")
#> [1] "1, 2, 3, 4, 5"
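The difference between combining several strings and collapsing one vector mirrors the sep and collapse arguments of base R's paste(). The sketch below uses base R only, as a point of comparison rather than a description of suitestrings internals.

```r
# Base R only; these are not suitestrings functions.
first  <- c("a", "b")
second <- c("x", "y")

# Element-wise combination: one result per pair of input strings.
pairwise <- paste(first, second, sep = "-")

# Collapsing: all elements of one vector are joined into a single string.
collapsed <- paste(c("abc", "def"), collapse = "")
numbers   <- paste(1:5, collapse = ", ")  # integers are coerced to character

print(pairwise)
#> [1] "a-x" "b-y"
print(collapsed)
#> [1] "abcdef"
print(numbers)
#> [1] "1, 2, 3, 4, 5"
```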
Pattern matching operations
There are functions to detect, locate, extract, replace, remove and split patterns of characters in strings. Except for the detect family, the suffixes _first(), _nth() and _last() specify which occurrence of the pattern within each string to act on, and _all() works with every occurrence.
By default, a pattern is a regular expression: a string in which certain characters have special meanings, allowing it to represent many different strings. If you wish a pattern to be matched literally, character for character, you can supply the argument fixed = TRUE.
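The distinction between a regular expression and a literal pattern is easiest to see with a special character such as the dot. Base R's grepl() has a fixed argument that plays an analogous role, so the sketch below uses it for illustration; it is base R, not a suitestrings function, and the example values are made up.

```r
# Base R only; these are not suitestrings functions.
prices <- c("3.50", "3050", "3-50")

# As a regular expression, "." matches any single character,
# so "3.5" also matches "305" and "3-5".
as_regex <- grepl("3.5", prices)

# As a fixed (literal) pattern, "." only matches an actual dot.
as_fixed <- grepl("3.5", prices, fixed = TRUE)

# Escaping the dot gives the same literal meaning inside a regular expression.
escaped <- grepl("3\\.5", prices)

print(as_regex)
#> [1] TRUE TRUE TRUE
print(as_fixed)
#> [1]  TRUE FALSE FALSE
print(escaped)
#> [1]  TRUE FALSE FALSE
```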
To get started, we’ll just work with a simple pattern:
strings <- c(
"in the middle of the day, I eat lunch",
"today's the day, forget yesterday",
"only weeks to be found here",
"days like this make me love Thursdays"
)
# Define a regular expression pattern for words containing "day"
pattern <- "\\w*day\\w*"
Detect
# Does the string contain the pattern?
str_detect(strings, pattern)
#> [1] TRUE TRUE FALSE TRUE
str_detect_starts_with(strings, pattern)
#> [1] FALSE TRUE FALSE TRUE
str_detect_ends_with(strings, pattern)
#> [1] FALSE TRUE FALSE TRUE
Extract
# Pull out the matching words from each string
str_extract_first(strings, pattern)
#> [1] "day" "today" NA "days"
str_extract_nth(strings, pattern, 2)
#> [1] NA "day" NA "Thursdays"
str_extract_last(strings, pattern)
#> [1] "day" "yesterday" NA "Thursdays"
str_extract_all(strings, pattern)
#> [[1]]
#> [1] "day"
#>
#> [[2]]
#> [1] "today" "day" "yesterday"
#>
#> [[3]]
#> character(0)
#>
#> [[4]]
#> [1] "days" "Thursdays"
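One way to think about the suffixes is that the first, nth and last variants each pick a single element from what the all variant returns, with NA where that occurrence does not exist. The base R sketch below reproduces the results above under that reading; the helper pick() is hypothetical, and nothing here is taken from the suitestrings source.

```r
# Base R only; these are not suitestrings functions.
strings <- c(
  "in the middle of the day, I eat lunch",
  "today's the day, forget yesterday",
  "only weeks to be found here",
  "days like this make me love Thursdays"
)
pattern <- "\\w*day\\w*"

# Every match per string, as a list with one element per input string
all_matches <- regmatches(strings, gregexpr(pattern, strings))

# Hypothetical helper: take the n-th match of each string, or NA if absent
pick <- function(matches, n) {
  vapply(
    matches,
    function(m) if (length(m) >= n) m[n] else NA_character_,
    character(1)
  )
}

first_match  <- pick(all_matches, 1)
second_match <- pick(all_matches, 2)
# The last match sits at a different position in each string
last_match <- vapply(
  all_matches,
  function(m) if (length(m) > 0) m[length(m)] else NA_character_,
  character(1)
)

print(first_match)
#> [1] "day"   "today" NA      "days"
print(second_match)
#> [1] NA          "day"       NA          "Thursdays"
print(last_match)
#> [1] "day"       "yesterday" NA          "Thursdays"
```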
Replace and Remove
str_remove_first(strings, pattern)
#> [1] "in the middle of the , I eat lunch" "'s the day, forget yesterday"
#> [3] "only weeks to be found here" " like this make me love Thursdays"
str_replace_all(strings, pattern, "night")
#> [1] "in the middle of the night, I eat lunch"
#> [2] "night's the night, forget night"
#> [3] "only weeks to be found here"
#> [4] "night like this make me love night"
Split
# Let's just split on spaces to get the words
str_split_all(strings, " ")
#> [[1]]
#> [1] "in" "the" "middle" "of" "the" "day," "I" "eat"
#> [9] "lunch"
#>
#> [[2]]
#> [1] "today's" "the" "day," "forget" "yesterday"
#>
#> [[3]]
#> [1] "only" "weeks" "to" "be" "found" "here"
#>
#> [[4]]
#> [1] "days" "like" "this" "make" "me" "love"
#> [7] "Thursdays"