Character vector basics

This package is for operating on three key kinds of things:

Strings

Sequences of characters, delimited by either double quotes (") or single quotes (').

"Hello"
#> [1] "Hello"
'Everyone'
#> [1] "Everyone"

Characters

Any letter or number like A or 1, but more specifically, any individual Unicode code point:

"A"
#> [1] "A"

# Whitespace and punctuation are characters too
" "
#> [1] " "

# Or even emojis...
"😮"
#> [1] "😮"

# There are other ways to represent a single character
"\u42"
#> [1] "B"

Of course, those were all strings.

# How many characters are in each of these strings?
str_length("Hello")
#> [1] 5
str_length('Everyone')
#> [1] 8
str_length("A")
#> [1] 1
str_length(" ")
#> [1] 1
str_length("😮")
#> [1] 1
str_length("\u42")
#> [1] 1
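
To make the idea of a code point more concrete, here is a minimal sketch that uses only base R (none of these are suitestrings functions) to move between a character and its code point number. It assumes UTF-8 strings, which is what the \U escape produces.

```r
# Base R only; these are not suitestrings functions.

# A character maps to an integer code point, and back again
utf8ToInt("A")
#> [1] 65
intToUtf8(66)
#> [1] "B"

# "\u42" is the hexadecimal escape for code point 66, so it is just "B"
identical("\u42", "B")
#> [1] TRUE

# One code point is one character, even if it takes several bytes in UTF-8
emoji <- "\U0001F62E"
nchar(emoji, type = "chars")
#> [1] 1
nchar(emoji, type = "bytes")
#> [1] 4
```

This is why a length measured in characters, as in the str_length() results above, can differ from the number of bytes a string occupies.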

Character vectors

Sequences of strings.

c("Hello", "world")
#> [1] "Hello" "world"

# What is the object type?
typeof(c("Hello", "world"))
#> [1] "character"

# How many elements are in the vector?
length(c("Hello", "world"))
#> [1] 2

Our function names use the shorthand str for strings and chr for character vectors; the significance of the distinction is explained in more detail in vignette("suitestrings-conventions").

It’s useful to keep in mind that for many purposes the three are the same: as R objects, they are all stored as character vectors. A string is simply a character vector with a single element, which may contain zero, one or many characters. Usually, when we say something operates on a string, it works on every individual string within a character vector.

# Viewed for their vector properties,
# a character and a string look much the same:

# A character vector
typeof("A")
#> [1] "character"
typeof("Hello")
#> [1] "character"

# With one element
length("A")
#> [1] 1
length("Hello")
#> [1] 1
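
The element-by-element behaviour described above can be sketched with base R's nchar(), used here only as a stand-in for a string function (it is not part of suitestrings):

```r
# Base R only; nchar() stands in for any function that "operates on a string".
words <- c("Hello", "Everyone", "")

# A single string gives a single result
nchar("Hello")
#> [1] 5

# A character vector gives one result per element,
# including the empty string, which has zero characters
nchar(words)
#> [1] 5 8 0

# The result is always as long as the input vector
length(nchar(words)) == length(words)
#> [1] TRUE
```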

suitestrings operations

There are a few families of operations for working with strings:

  1. Character-based transformation: Change strings based on individual characters within them. Extend strings by adding characters, shorten strings by removing them. Handle specific kinds of characters like whitespace.
  2. String combination: Concatenate strings with other strings, or even with the results of R expressions converted to character.
  3. Pattern matching operations: Manipulate strings based on a pattern of characters, which is usually defined by a regular expression.
  4. Character vector organisation: Sort a character vector alphabetically or remove duplicate elements of a character vector.

Character-based transformation

Shorten strings

A few functions are useful for quickly cleaning up a string. str_trim() removes whitespace from both ends, and str_squish() additionally reduces each run of whitespace in the middle to a single space:

str_trim("  Get rid of spaces at the ends          ")
#> [1] "Get rid of spaces at the ends"
str_squish("  Get    these    spaces \u020         under       control    ")
#> [1] "Get these spaces under control"

If you want to cut a string down to size, but also have the string indicate that it has been shortened, you can truncate it to a specified number of characters:

str_truncate("Sometimes we just need to make a string smaller", 20)
#> [1] "Sometimes we just..."

# To cut it off without an ellipsis
str_truncate("Sometimes we just need to make a string smaller", 20, ellipsis = "")
#> [1] "Sometimes we just ne"

Extend strings

Going the other way, str_pad() fills a string out to a minimum length, and str_indent() adds a specific number of spaces:

str_pad("hello", 10)
#> [1] "     hello"

# Though they can both extend with other characters
str_indent("hello", 3, indent = ".")
#> [1] "...hello"

Convert case

Functions to change case can help with consistent formatting:

str_to_upper_case("hello")
#> [1] "HELLO"

str_to_snake_case(c("nO  FUNny  buSSIneSs", "    TIDY  this  UP  "))
#> [1] "no_funny_bussiness" "tidy_this_up"

String combination

Concatenate strings together with str_concat() or str_glue().

str_concat("abc", "def") 
#> [1] "abcdef"

# Both can take a custom separator argument to place between strings
str_glue("abc", "def", separator = " ")
#> [1] "abc def"

They can also combine strings with the results of R expressions (coerced with as.character()). str_glue() is designed to handle this more elegantly by treating text in curly braces {} as R code.

str_concat(
  "hello", 
  10 * 10,
  "worlds",
  separator = " "
)
#> [1] "hello 100 worlds"

str_glue("hello {10*10} worlds")
#> [1] "hello 100 worlds"

Concatenate the elements of a character vector into a single string with chr_collapse():

chr_collapse(c("abc", "def"))
#> [1] "abcdef"
chr_collapse(1:5, separator = ", ")
#> [1] "1, 2, 3, 4, 5"
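
For readers coming from base R, the difference between joining several strings and collapsing one vector resembles the sep and collapse arguments of paste(). The sketch below is offered only as an analogy; it makes no claim about how suitestrings is implemented.

```r
# Base R only; an analogy for the functions above, not their implementation.

# Joining separate arguments, with and without a separator
paste0("abc", "def")
#> [1] "abcdef"
paste("hello", 10 * 10, "worlds", sep = " ")
#> [1] "hello 100 worlds"

# Collapsing the elements of one vector into a single string
paste(c("abc", "def"), collapse = "")
#> [1] "abcdef"
paste(1:5, collapse = ", ")
#> [1] "1, 2, 3, 4, 5"
```

In both cases non-character values such as 10 * 10 or 1:5 are converted to character before being joined.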

Pattern matching operations

There are functions to detect, locate, extract, replace, remove and split patterns of characters in strings. Except for the detect family, the suffixes _first(), _nth() and _last() are used to select a particular occurrence of the pattern within each string, and _all() to work with every occurrence.

By default, a pattern is a regular expression: a type of string in which certain characters have a special meaning, allowing it to represent many different strings. If you wish to use a string literally as a pattern, you can supply the argument fixed = TRUE.
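
The difference between a regular expression and a literal pattern can be illustrated with base R's grepl(), which also happens to take a fixed argument. This is only a sketch of the concept; it is not a suitestrings function.

```r
# Base R only; grepl() is used here to illustrate regex vs literal matching.
x <- c("a.c", "abc")

# As a regular expression, "." matches any single character
grepl("a.c", x)
#> [1] TRUE TRUE

# As a fixed string, "." only matches a literal dot
grepl("a.c", x, fixed = TRUE)
#> [1]  TRUE FALSE
```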

To get started we’ll just work with a simple pattern:

strings <- c(
  "in the middle of the day, I eat lunch",
  "today's the day, forget yesterday",
  "only weeks to be found here",
  "days like this make me love Thursdays"
)

# Define a regular expression pattern for words containing "day"
pattern <- "\\w*day\\w*"
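
It may help to see what this pattern means before using it. In the sketch below, which relies only on base R and not on suitestrings, cat() shows the regular expression once R's string escaping is removed, and regmatches() with gregexpr() lists every match in one of the example strings. The variable names are chosen for illustration.

```r
# Base R only; a way to inspect the pattern, not a suitestrings function.
day_pattern <- "\\w*day\\w*"

# "\\w" in R source is the two-character regex token \w (a word character)
cat(day_pattern)
#> \w*day\w*

# \w* : zero or more word characters, then "day", then more word characters
example <- "today's the day, forget yesterday"
regmatches(example, gregexpr(day_pattern, example, perl = TRUE))[[1]]
#> [1] "today"     "day"       "yesterday"
```

The apostrophe and the comma are not word characters, which is why the matches stop at "today" and "day" rather than including the punctuation.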

Detect

# Does the string contain the pattern?
str_detect(strings, pattern)
#> [1]  TRUE  TRUE FALSE  TRUE
str_detect_starts_with(strings, pattern)
#> [1] FALSE  TRUE FALSE  TRUE
str_detect_ends_with(strings, pattern)
#> [1] FALSE  TRUE FALSE  TRUE

Extract

# Pull out the matching words from each string
str_extract_first(strings, pattern)
#> [1] "day"   "today" NA      "days"
str_extract_nth(strings, pattern, 2)
#> [1] NA          "day"       NA          "Thursdays"
str_extract_last(strings, pattern)
#> [1] "day"       "yesterday" NA          "Thursdays"
str_extract_all(strings, pattern)
#> [[1]]
#> [1] "day"
#> 
#> [[2]]
#> [1] "today"     "day"       "yesterday"
#> 
#> [[3]]
#> character(0)
#> 
#> [[4]]
#> [1] "days"      "Thursdays"

Replace and remove

str_remove_first(strings, pattern)
#> [1] "in the middle of the , I eat lunch" "'s the day, forget yesterday"      
#> [3] "only weeks to be found here"        " like this make me love Thursdays"

str_replace_all(strings, pattern, "night")
#> [1] "in the middle of the night, I eat lunch"
#> [2] "night's the night, forget night"        
#> [3] "only weeks to be found here"            
#> [4] "night like this make me love night"

Split

# Let's just split on spaces to get the words
str_split_all(strings, " ")
#> [[1]]
#> [1] "in"     "the"    "middle" "of"     "the"    "day,"   "I"      "eat"   
#> [9] "lunch" 
#> 
#> [[2]]
#> [1] "today's"   "the"       "day,"      "forget"    "yesterday"
#> 
#> [[3]]
#> [1] "only"  "weeks" "to"    "be"    "found" "here" 
#> 
#> [[4]]
#> [1] "days"      "like"      "this"      "make"      "me"        "love"     
#> [7] "Thursdays"

Character vector organisation

Vectors are ordered, so you might like to reorder them.

chr_sort(c("cherry", "apple", "date", "banana"))
#> [1] "apple"  "banana" "cherry" "date"
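
Removing duplicate elements, the other organisation task mentioned earlier, is not demonstrated above. As a rough sketch of the idea using base R only (the suitestrings function for this is not shown here, so none is assumed):

```r
# Base R only; illustrates de-duplication, not a suitestrings function.
fruit <- c("cherry", "apple", "cherry", "banana", "apple")

# Keep the first occurrence of each element, in original order
unique(fruit)
#> [1] "cherry" "apple"  "banana"

# Flag which elements repeat an earlier one
duplicated(fruit)
#> [1] FALSE FALSE  TRUE FALSE  TRUE

# De-duplicate, then sort
sort(unique(fruit))
#> [1] "apple"  "banana" "cherry"
```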