suitestrings conventions
Source:vignettes/suitestrings-conventions.Rmd
A consistent approach is taken to naming functions and arguments. This vignette explains the structure, compares it with base R and popular string packages, and points out any exceptions to the rules. Once you’re familiar with the structure, it should be easier to understand what a function does just from its name.
The basic function name structure is as follows, with each function having at least the first two parts:
<operation-type>_<action>_<action-location>
Prefixes - <operation-type>
The prefix indicates the general type of behaviour:
- str_ for vectorised operations on strings.
- chr_ for vector-wide operations over strings.
str_ for operations on strings
In a character vector, the same operation is applied to each string. These functions are useful for single strings, but also for chains of inputs and outputs where you map from one vector to another.
An example:
strings <- c("Hello, world.", "Happy birthday!", "Nice to meet you.")
strings
#> [1] "Hello, world." "Happy birthday!" "Nice to meet you."
# Transform each string
snake_strings <- str_to_snake_case(strings)
snake_strings
#> [1] "hello_world" "happy_birthday" "nice_to_meet_you"
# Is the text "day" in each string?
day_pattern <- "day"
day_in_strings <- str_detect(strings, day_pattern)
day_in_strings
#> [1] FALSE TRUE FALSE
In most cases the output will have the same length() as the input strings.
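To make the length-preserving behaviour concrete, here is a minimal sketch using base R only. It is an analogy rather than the suitestrings implementation: grepl() and toupper() stand in for any function that is vectorised over strings.

```r
# Base R stand-ins for "one result per input string" (not suitestrings code)
strings <- c("Hello, world.", "Happy birthday!", "Nice to meet you.")

# grepl() tests each string separately: one logical value per string
day_in_strings <- grepl("day", strings)

# toupper() transforms each string separately: one output string per input
upper_strings <- toupper(strings)

# Both results line up element by element with the input
same_length <- length(day_in_strings) == length(strings) &&
  length(upper_strings) == length(strings)
same_length
```

Because input and output line up, results like these can be stored as a new column next to the original strings.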
chr_ for vector-wide operations
These functions operate on the character vector as a whole. They may give a single piece of information about the vector, a re-organised vector of the same length, or perhaps elements extracted from every string. The result may be of any length(), but will always be an atomic vector, i.e. it won’t be a list.
For example, a single fact about the vector:
# Is the text "day" in any of the strings?
chr_detect_any(strings, day_pattern)
#> [1] TRUE
# Is the text "day" in all of the strings?
chr_detect_all(strings, day_pattern)
#> [1] FALSE
Or a re-organised vector:
# Rearrange the vector in alphabetical order
chr_sort(strings)
#> [1] "Happy birthday!" "Hello, world." "Nice to meet you."
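For readers coming from base R, the vector-wide idea can be sketched with familiar functions. This is an illustration under the assumption that the chr_ functions above behave like these base R combinations; it does not show how suitestrings implements them.

```r
# Base R analogues of vector-wide operations (not suitestrings code)
strings <- c("Hello, world.", "Happy birthday!", "Nice to meet you.")

# Per-string detection first...
has_day <- grepl("day", strings)

# ...then reduce to a single fact about the whole vector
any_day <- any(has_day)
all_day <- all(has_day)

# Re-organise the vector; method = "radix" gives a locale-independent order
sorted_strings <- sort(strings, method = "radix")
```

In each case the result is an atomic vector, whether it has one element or as many as the input.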
str_ or chr_
In a few cases, the only difference between two function names is the prefix. When this happens, the str_ function will give you a list (equal in length() to the input vector) with a result for each string. The chr_ function will give a single atomic vector containing all of the same elements, as if you had used unlist() on the result of the str_ version.
For example, extracting every match for a pattern:
word_pattern <- "\\w+"
str_result <- str_extract_all(strings, word_pattern)
str_result
#> [[1]]
#> [1] "Hello" "world"
#>
#> [[2]]
#> [1] "Happy" "birthday"
#>
#> [[3]]
#> [1] "Nice" "to" "meet" "you"
chr_result <- chr_extract_all(strings, word_pattern)
chr_result
#> [1] "Hello" "world" "Happy" "birthday" "Nice" "to" "meet"
#> [8] "you"
length(str_result) # Length is preserved here
#> [1] 3
length(chr_result) # Not here
#> [1] 8
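The list-versus-atomic relationship described above can be reproduced in base R. The sketch below uses regmatches() and gregexpr() as stand-ins for the extraction step; only the unlist() relationship comes from the text, and the base R calls are an assumption about an equivalent, not the package internals.

```r
# Base R sketch of the list vs. atomic vector relationship
strings <- c("Hello, world.", "Happy birthday!", "Nice to meet you.")
word_pattern <- "\\w+"

# One character vector of matches per input string, held in a list
matches_per_string <- regmatches(strings, gregexpr(word_pattern, strings))

# Flattening the list gives one atomic vector with the same elements
all_matches <- unlist(matches_per_string)

length(matches_per_string) # one entry per input string
length(all_matches)        # one entry per match
```

The list form keeps track of which string each match came from; the flattened form loses that link but is easier to count, sort or tabulate.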
Suffixes - <action-location>
Pattern Occurrence - first, nth, last or all
The majority of functions that match patterns (typically regular expressions) include a suffix indicating which occurrence of the pattern within a string you want to work with. The following apply for the action verbs locate, extract, replace, remove and split:
- _first matches the first occurrence.
- _nth matches a specific occurrence, as set in the additional argument n.
- _last matches the last occurrence.
- _all matches every occurrence.
You can expect a single vector from _first, _last and _nth, whereas str_<action>_all() will return a list. Again, you can use the chr_<action>_all() variant to get a single atomic vector.
str_extract_first(strings, word_pattern) # the first word
#> [1] "Hello" "Happy" "Nice"
str_extract_nth(strings, word_pattern, 2) # the second word
#> [1] "world" "birthday" "to"
str_extract_last(strings, word_pattern) # the last word
#> [1] "world" "birthday" "you"
str_extract_all(strings, word_pattern) # all words
#> [[1]]
#> [1] "Hello" "world"
#>
#> [[2]]
#> [1] "Happy" "birthday"
#>
#> [[3]]
#> [1] "Nice" "to" "meet" "you"
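One way to think about the occurrence suffixes is as different selections from the list of all matches. The base R sketch below follows that idea. The helpers pick_nth() and pick_last() are hypothetical names made up for this illustration, and returning NA when a string has too few matches is an assumed convention, not documented suitestrings behaviour.

```r
# Base R sketch: first, nth and last as selections from all matches
strings <- c("Hello, world.", "Happy birthday!", "Nice to meet you.")
word_pattern <- "\\w+"

# All matches per string (a list, one element per input string)
matches <- regmatches(strings, gregexpr(word_pattern, strings))

# Hypothetical helper: the n-th match of each string,
# or NA when a string has fewer than n matches (assumed convention)
pick_nth <- function(matches, n) {
  vapply(
    matches,
    function(m) if (length(m) >= n) m[[n]] else NA_character_,
    character(1)
  )
}

# Hypothetical helper: the final match of each string
pick_last <- function(matches) {
  vapply(
    matches,
    function(m) if (length(m) > 0) m[[length(m)]] else NA_character_,
    character(1)
  )
}

first_words  <- pick_nth(matches, 1)
second_words <- pick_nth(matches, 2)
last_words   <- pick_last(matches)
```

Each selection yields exactly one value per input string, which is why these variants can return a plain vector while the all-matches form needs a list.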
Function arguments
You may have noticed a repeating pattern in our examples: function(strings, some_pattern). This was quite deliberate, to highlight the most common first and second arguments.
Chaining the first argument
- strings - Almost every function takes this as the first argument. It’s a character vector containing all the strings to manipulate. The main reason to have this as the first argument is to make it suitable for piping with |> or magrittr::%>%.
# Easily perform a chain of operations on strings
# Like these unnecessary steps to make a happy hello
strings |>
chr_sort() |>
str_extract_first(word_pattern) |>
chr_collapse(" ") |>
str_remove_last(word_pattern) |>
str_concat(":)")
#> [1] "Happy Hello :)"
# You can probably find better uses
Pattern matching
- pattern - The second argument to any function that uses pattern matching. This is a single string to be matched, by default interpreted as a regular expression.
- fixed - Accompanying every pattern argument is a logical argument to optionally interpret the pattern as a literal string to be matched exactly.
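The difference between a regular expression and a literal pattern matters most when the pattern contains special characters. The sketch below uses base R's grepl(), which has its own fixed argument, as an analogy. The file names are made up for illustration, and it is an assumption that the suitestrings fixed argument behaves the same way.

```r
# Base R sketch: regular expression vs. literal matching
# (illustrative file names, not from the package)
files <- c("report.csv", "report_csv", "notes.txt")

# As a regular expression, "." matches any single character,
# so "_csv" is matched as well as ".csv"
regex_hits <- grepl(".csv", files)

# As a literal string, "." only matches an actual full stop
literal_hits <- grepl(".csv", files, fixed = TRUE)

regex_hits
literal_hits
```

Literal matching avoids having to escape characters such as ".", "(" or "+" when you simply want to find them as written.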
String combination
- ... - A minority of functions take several character vectors as the first arguments instead of strings: str_concat() and str_glue(). When you pass multiple character vectors, the resulting vector will have the length() of the longest input vector.
- separator - A string to be placed between input elements in the output string.
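The longest-input rule can be illustrated with base R's paste(), which recycles shorter vectors to the length of the longest one. Whether suitestrings recycles in exactly this way is an assumption; the text above only states the length of the result. The vectors are made up for illustration.

```r
# Base R sketch: combining vectors of different lengths
greetings <- c("Hello", "Goodbye")
people <- c("Ann", "Bob", "Cat", "Dev")

# paste() recycles the shorter vector; sep plays the role of a separator
combined <- paste(greetings, people, sep = ", ")

combined
length(combined) == max(length(greetings), length(people))
```

Here the two greetings are reused in turn, so the output has four elements, one per element of the longest input.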
Comparison to other packages
Compared to stringr
We’ve broken away from the stringr convention of starting every function with str_. If you’re used to that, you might not immediately find equivalent functions. For example, sort has moved from str_sort() to chr_sort(). However, once you get used to it, the chr_ prefix should be more informative, and useful for immediately identifying whether a function is applied vector-wide or not.
stringr::str_split_1(), for instance, was added as something of an afterthought to the str_split family, for splitting a single string into an atomic vector, whereas the equivalent suitestrings::chr_split_all() sits as a natural part of the naming scheme.
chr_split_all(strings, " ")
#> [1] "Hello," "world." "Happy" "birthday!" "Nice" "to"
#> [7] "meet" "you."
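For comparison with base R, the same flattened result can be sketched with strsplit() followed by unlist(). This is a base R equivalent of the output shown above, not a description of how chr_split_all() is implemented.

```r
# Base R sketch: split every string, then flatten to one atomic vector
strings <- c("Hello, world.", "Happy birthday!", "Nice to meet you.")

# strsplit() returns a list with one character vector per input string
pieces_per_string <- strsplit(strings, " ", fixed = TRUE)

# unlist() drops the per-string grouping
all_pieces <- unlist(pieces_per_string)

all_pieces
```

The two-step base R form makes the intermediate list explicit, which is the step the chr_ prefix signals has already been flattened.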