Intro to R: Day 1

stats
Author

Isaac Vock

Published

January 21, 2025

This worksheet will walk you through some basic concepts in R. I would suggest copying code shown here into an R script and running it yourself so that you can play around with the presented examples.

Pre-requisite knowledge for simple_calc()

Math in R

The simplest use case of R is using it to do math:

1 + 1 # Addition
[1] 2
10 - 1.5 # Subtraction
[1] 8.5
2 * 3 # Multiplication
[1] 6
10.12 / 17.99 # Division
[1] 0.5625347
5 ^ 2 # Exponentiation
[1] 25

Numeric variables in R

You can store numbers in “variables”. This is like a special box in your computer’s memory labeled with a name (like my_number). When you put a number into this box (for example, 10), we say you have assigned the value 10 to the variable my_number.

In R, you’d do this by writing:

my_number <- 10
my_number
[1] 10

Typing and executing print(my_number) or just my_number will print out the value of the variable to your console.

Here is what’s happening in this code:

  1. my_number is the label on the box in memory.
  2. <- is like an arrow pointing from the value to the box, meaning “put this value in that box”.
  3. 10 is the actual number you are storing.

You can then do math with this variable, just like with regular numbers:

my_number * 2
[1] 20
my_number ^ 3
[1] 1000
my_number - 4
[1] 6

my_number does not change value in any of the above lines. To change the value of my_number, you would have to assign it the new value with <- again:

my_number # my_number is 10
[1] 10
my_number <- 1001 # my_number is now 1001
my_number # Confirm new value of my_number
[1] 1001

Strings in R

You can store more than numbers in variables. For example, you can store text, which is referred to as a “string”:

my_string <- "Hello"
my_string2 <- 'Bye'

my_string
[1] "Hello"
my_string2
[1] "Bye"

You tell R that you are storing text by wrapping that text in "" or ''.

Below are some useful tools that R provides for working with strings. These are called functions, a concept discussed later.

  1. paste(..., sep = " "): paste() allows you to stitch together multiple strings, with a chosen separator text between strings (sep argument). Having no separator (sep = "") is identical to using a different function paste0():
string1 <- "Hello"
string2 <- "friend."
string3 <- "It's been too long"

paste(string1, string2)
[1] "Hello friend."
paste(string1, string2, sep = "")
[1] "Hellofriend."
paste0(string1, string2)
[1] "Hellofriend."
paste(string1, string2, string3)
[1] "Hello friend. It's been too long"
paste(string1, string2, collapse = "_")
[1] "Hello friend."
  2. nchar(): This will give you the number of individual characters in your text string:
string1 <- "Hello"
nchar(string1)
[1] 5
  3. gsub(pattern, replacement, x): This allows you to look for the string pattern in the query string x, and replace every occurrence of it with the string replacement:
text <- "Hello, Hello, Hello!"
gsub("Hello", "Hi", text)
[1] "Hi, Hi, Hi!"
  4. grepl(pattern, x): This is similar to gsub(), but it only searches for the string pattern in the string x and returns TRUE if it finds it (and FALSE otherwise):
text <- "Hello, Hello, Hello!"
grepl("Hello", text)
[1] TRUE
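
The collapse argument in the paste() example above had no visible effect because it only comes into play when a single argument holds several strings. Here is a small sketch of the difference; the vector words is made-up example data, built with c(), which is covered later in this worksheet:

```r
# A vector holding three strings (hypothetical example data)
words <- c("Hello", "friend.", "Bye")

# sep goes between the separate arguments of paste()
with_sep <- paste("Hello", "friend.", sep = "_")
with_sep

# collapse joins the elements of a single vector into one string
collapsed <- paste(words, collapse = "_")
collapsed
```

Here with_sep is "Hello_friend." and collapsed is "Hello_friend._Bye".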

There is a whole R package called stringr devoted to making working with strings in R easier and more intuitive, so you might want to look into that as well!

Booleans in R

Another thing that is commonly stored in variables is logical values (TRUE or FALSE), otherwise known as “booleans”:

my_bool <- TRUE
my_bool2 <- FALSE

my_bool
[1] TRUE
my_bool2
[1] FALSE

You can do a sort of math with booleans, referred to as “boolean logic”. This takes as input two (in the case of AND and OR) or one (in the case of NOT) boolean variables and outputs a new boolean. The most common examples are:

AND (&)

  • Both of the booleans must be TRUE for the output to be TRUE:
TRUE & TRUE # This is TRUE
[1] TRUE
TRUE & FALSE # This is FALSE
[1] FALSE
FALSE & TRUE # This is FALSE
[1] FALSE
FALSE & FALSE # This is FALSE
[1] FALSE

OR (|)

  • At least one of the booleans must be TRUE for the output of this to be TRUE
TRUE | TRUE # This is TRUE
[1] TRUE
TRUE | FALSE # This is TRUE
[1] TRUE
FALSE | TRUE # This is TRUE
[1] TRUE
FALSE | FALSE # This is FALSE
[1] FALSE

NOT (!)

  • Unlike AND and OR, this takes a single boolean value as input
  • This reverses the value of the boolean:
!TRUE # This is FALSE
[1] FALSE
!FALSE # This is TRUE
[1] TRUE

Finally, you can compare the values of two variables to see if they are the same. If they are, variable_1 == variable_2 will return TRUE; otherwise it will return FALSE:

"Hello" == "Hello" # TRUE
[1] TRUE
"Hi" == "Bye" # FALSE
[1] FALSE
1 == 1 # TRUE
[1] TRUE
my_number <- 1
my_number2 <- 2
my_number == my_number2
[1] FALSE
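
Because a comparison returns a boolean, comparisons can be combined with the boolean operators described above. A short sketch, using made-up variable names:

```r
favorite_word <- "Hello"
favorite_number <- 7

# AND: both comparisons must be TRUE
both_match <- favorite_word == "Hello" & favorite_number == 7
both_match

# OR: at least one comparison must be TRUE
either_match <- favorite_word == "Bye" | favorite_number == 7
either_match

# NOT: reverse the result of a comparison
not_bye <- !(favorite_word == "Bye")
not_bye
```

All three results are TRUE for these values.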

Functions in R

A function in R is like a “recipe” for a mini “machine” that does one specific job. You give it some inputs (called arguments), it follows the steps you’ve defined, and then it gives you a result.

Functions help you organize your code so you can reuse it instead of writing the same steps again and again. Here is a simple example:

# Function name: my_function
# Arguments: x and y
# Output: x + y
my_function <- function(x, y){
  
  # 1. Inside the curly braces, write the steps of what you will do with x and y
  
  # We will add x and y
  result <- x + y
  
  # 2. Tell the function what to output (i.e., its "return value")
  return(result)
  
}
  • my_function is the name of the function (like a label on the mini machine).
  • function(x, y) { ... } says “I am creating a function that expects two inputs, called x and y.”
  • Inside the { ... }, you can write as much code as you want; these are the instructions for what you want the function to do with the inputs.
  • return(result) sends the output of the function back to you.

After creating my_function, you can call it (computer science lingo meaning “use the function”) by typing:

my_function(3,5)
[1] 8
my_new_number <- my_function(2, 2)
my_new_number
[1] 4

Sometimes, you want one (or more) of your function’s inputs to have a “fallback” value if the user doesn’t supply one. That’s where default arguments come in. For example:

my_new_function <- function(x, y = 10){

  result <- x + y
  
  return(result)
  
}

my_new_function now only needs you to supply x. You can supply both x and y, but if you don’t supply y, it will take the default value of 10:

my_new_function(x = 1)
[1] 11
my_new_function(x = 2, y = 20)
[1] 22
my_new_function(2, 20) # Will fill arguments in order, so x = 2 and y = 20 here
[1] 22
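
Filling arguments in order only applies to unnamed arguments; named arguments can be supplied in any order. A minimal sketch, using a hypothetical function subtract() (not part of this worksheet) in which the order of x and y matters:

```r
# Hypothetical example function: the order of x and y changes the result
subtract <- function(x, y = 10){
  result <- x - y
  return(result)
}

# Unnamed arguments are filled in order: x = 2, y = 20
in_order <- subtract(2, 20)
in_order

# Named arguments can be given in any order: still x = 2, y = 20
by_name <- subtract(y = 20, x = 2)
by_name

# Swapping the unnamed arguments changes the answer: x = 20, y = 2
swapped <- subtract(20, 2)
swapped
```

The first two calls both give -18, while the last gives 18.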

Tip 1: argument with small set of possible values

Sometimes, one of your function’s arguments has a fixed set of possible values that you intend for a user to choose from. You can list these as the argument’s default value and check the user’s choice with match.arg():

my_options <- function(a, b, greeting = c("Hi", "Bye", "Huh?")){
  
  # Check to make sure the user supplied one of the valid options
  greeting <- match.arg(greeting)
  
  print(greeting)
  
  result <- a + b
  
  return(result)
  
}

# Uses first option by default
my_options(2, 2)
[1] "Hi"
[1] 4
my_options(2, 2, "Huh?")
[1] "Huh?"
[1] 4
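
If the user supplies a value that is not one of the listed options, match.arg() throws an error. The sketch below shows this using tryCatch(), a base R function not otherwise covered in this worksheet, which lets the script keep running when an error occurs; the fallback text "invalid greeting" is just a placeholder chosen for this example:

```r
my_options <- function(a, b, greeting = c("Hi", "Bye", "Huh?")){
  greeting <- match.arg(greeting)
  print(greeting)
  return(a + b)
}

# "Howdy" is not one of the allowed options, so match.arg() throws an error.
# tryCatch() intercepts that error and returns our placeholder text instead.
outcome <- tryCatch(
  my_options(2, 2, "Howdy"),
  error = function(e) "invalid greeting"
)
outcome
```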

Tip 2: Catching errors

In all of our examples so far, we have assumed that the user supplies the right kind of data for each argument. Mostly, we have assumed that our example functions are given numbers that we can add. What if the user makes a mistake and passes a string instead? We can catch this and throw an error message:

my_valuecheck <- function(a, b){
  
  # check if a is a number
  stopifnot(is.numeric(a))
  
  # check if b is a number, but with a slightly different strategy
  # if-statements are discussed more later.
  if(!is.numeric(b)){
    
    stop("b isn't a number")
    
  }
  
  result <- a + b
  
  return(result)
}

This function will work as normal if a and b are numbers, but will throw informative error messages if not. Earlier versions of this function, which lacked these checks, would also throw an error for bad input, but that error can be far more cryptic and hard to understand. The error may also differ depending on what is wrong with a and/or b, further confusing you or other users of the function.
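
To see the difference, the sketch below calls the function with a string and captures the error message with tryCatch() and conditionMessage(), two base R functions not otherwise covered in this worksheet; they are used here only so that the example runs to the end:

```r
my_valuecheck <- function(a, b){
  stopifnot(is.numeric(a))
  if(!is.numeric(b)){
    stop("b isn't a number")
  }
  return(a + b)
}

# Works as normal with numbers
my_valuecheck(1, 2)

# b is a string: capture our custom error message instead of halting
message_b <- tryCatch(
  my_valuecheck(1, "2"),
  error = function(e) conditionMessage(e)
)
message_b

# Without any checks, R's own (less specific) error message appears
message_raw <- tryCatch(
  1 + "2",
  error = function(e) conditionMessage(e)
)
message_raw
```

Here message_b holds the custom text "b isn't a number", while message_raw holds R’s built-in message for adding a number and a string.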

Control flow (if-else statements)

An if-else statement is one of the most common ways to control the flow of a program. It lets your code make decisions based on whether a condition is TRUE or FALSE.

  • if checks if something is TRUE
  • else covers what happens if it is not TRUE
  • You can add extra steps in between using else if to handle different possible conditions

The basic structure looks like:

if (condition1){
  # This code runs if 'condition1' is TRUE
}else if(condition2){
  # This code runs if 'condition2' is TRUE
}else{
  # This code runs if both 'condition1' and 'condition2' are FALSE
}

Think of this code as asking a set of questions:

  • If condition1 is TRUE, do something.
  • Else if condition2 is TRUE, do something else.
  • Else, if neither condition1 nor condition2 is TRUE, do a default thing.

A real example might look like:

x <- 5

if(x < 3){
  print("x is less than 3")
}else if(x < 5){
  print("x is at least 3 but less than 5")
}else{
  print("x is greater than or equal to 5")
}
[1] "x is greater than or equal to 5"

Conditions in R must evaluate to a single TRUE or FALSE. Common ways to form conditions are comparison operators:

  • ==: Check if two things are equal (e.g., a == b). a and b can be numbers, strings, booleans, etc.
  • !=: Check if two things are not equal.
  • <, >, <=, >=: Less than, greater than, less than or equal to, or greater than or equal to, respectively.
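
As a quick sketch of these operators in action (the values of a and b are arbitrary):

```r
a <- 3
b <- 5

a != b   # TRUE: 3 and 5 are not equal
a < b    # TRUE
a >= b   # FALSE
b <= 5   # TRUE: "or equal to" includes 5 itself

# A comparison can be used directly as the condition of an if statement
if(a < b){
  smaller <- "a"
}else{
  smaller <- "b"
}
smaller
```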

Here is an example of how you might use control flow in a function:

greetUser <- function(user_input){
  
  # Check if user_input equals "Hello"
  if (user_input == "Hello"){
    return("Hi there! Nice to meet you.")
  }else if(user_input == "Goodbye"){
    return("See you later! Take care.")
  }else{
    return("I'm not sure how to respond to that...")
  }
  
}

greetUser("Hello")
[1] "Hi there! Nice to meet you."
greetUser("Comment allez-vous?")
[1] "I'm not sure how to respond to that..."

Pre-requisite knowledge for vector_calc()

Vectors

In R, a vector is a container that holds multiple values of the same data type (such as numbers, strings, or booleans). You can think of it like a row of boxes, each containing a value of the same kind.

You can create a vector with the c() function (short for “combine” or “concatenate”). Here are a few examples:

# A numeric vector
numbers <- c(10, 20, 30, 40)

# A character (string) vector
words <- c("cat", "dog", "bird")

# A boolean vector
bools <- c(TRUE, FALSE, TRUE)

Often, you will want to access specific elements or sets of elements of a vector. To do this, you can use square brackets [ ]:

# Get the first element of 'numbers'
numbers[1] # 10
[1] 10
# Get the second element of 'words'
words[2]
[1] "dog"
# Get multiple elements at once
numbers[c(1, 3)] # This gives the 1st and 3rd elements: c(10, 30)
[1] 10 30
# Exclude specific elements
bools[-1] # This gives everything but the 1st element
[1] FALSE  TRUE

You can also change values of specific elements:

# See what 'numbers' is now
numbers
[1] 10 20 30 40
# Change a value
numbers[2] <- 99

# Check 'numbers' now
numbers
[1] 10 99 30 40

Sometimes, it will be useful to check what kind of data is in a vector. This can be done with the class() function:

class(numbers) # "numeric"
[1] "numeric"
class(words) # "character"
[1] "character"
class(bools) # "logical" (another word for boolean)
[1] "logical"
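
Since a vector can only hold one data type, R converts the values to a common type if you mix them. A small sketch with made-up values:

```r
# Mixing numbers and strings: everything becomes a string
mixed <- c(1, "cat", 2)
mixed
class(mixed)

# Mixing numbers and booleans: TRUE becomes 1 and FALSE becomes 0
mixed2 <- c(10, TRUE, FALSE)
mixed2
class(mixed2)
```

The first vector ends up with class "character" and the second with class "numeric".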

You can also check the type with functions like is.numeric(), is.character(), or is.logical():

is.numeric(numbers) # TRUE
[1] TRUE
is.numeric(words) # FALSE
[1] FALSE
is.character(numbers) # FALSE
[1] FALSE
is.logical(numbers) # FALSE
[1] FALSE
is.character(words) # TRUE
[1] TRUE
is.logical(bools) # TRUE
[1] TRUE

Below are some useful functions that allow you to create vectors or lookup some information about a vector:

  1. length(v): returns the number of elements in the vector v:
length(c(1, 2, 3))
[1] 3
  2. seq(from, to, length.out) or seq(from, to, by): Creates a vector starting from the number from (default value of 1) and running up to the number to (default value of 1). If you set length.out, you will get a vector of length.out evenly spaced elements. If you set by, you specify the distance between adjacent elements:
seq(from = 1, to = 5, length.out = 5)
[1] 1 2 3 4 5
seq(from = 1, to = 5, by = 1)
[1] 1 2 3 4 5
  3. rep(x, times): Creates a vector containing the value x repeated times times:
rep(x = 1, times = 10)
 [1] 1 1 1 1 1 1 1 1 1 1
  4. start:end: Same as seq(from = start, to = end, by = 1) when start is no larger than end:
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
0.5:2.5
[1] 0.5 1.5 2.5

Loops

A loop is a way to tell R to “do something multiple times”. This unlocks one of the powerful aspects of computers: their ability to do multiple things quickly.

There are two commonly used types of loops: for loops and while loops.

A for loop in R iterates (or “loops”) over each element of a vector and does something with it. For example, if we want to print every element of a numeric vector:

numbers <- c(10, 20, 30, 40)

# Loop over the values
for(value in numbers){
  print(value)
}
[1] 10
[1] 20
[1] 30
[1] 40
# Loop over the indices from 1 to the length of the vector
for(i in 1:length(numbers)){
  
  print(numbers[i])
 
}
[1] 10
[1] 20
[1] 30
[1] 40
# Fancier alternative to the above code
for(i in seq_along(numbers)){
  
  print(numbers[i])
  
}
[1] 10
[1] 20
[1] 30
[1] 40

What’s happening here?

  1. for(value in numbers) means “go through each element of numbers and temporarily call that element value”. for(i in 1:length(numbers)) first creates the vector 1:length(numbers), which holds the whole numbers from 1 to the length of numbers; each of these whole numbers is then temporarily called i. seq_along(numbers) does pretty much the same thing as 1:length(numbers).
  2. print(value) means we display the current value on the screen.
  3. R will do this until it has gone through all elements in numbers.
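
One place where seq_along(numbers) and 1:length(numbers) differ is when the vector is empty. The sketch below, using a made-up empty vector, counts how many times each loop runs:

```r
empty <- c()  # a vector with no elements, so length(empty) is 0

# 1:length(empty) is 1:0, which counts down: 1, 0
1:length(empty)

# seq_along(empty) has no elements at all
seq_along(empty)

count_colon <- 0
for(i in 1:length(empty)){
  count_colon <- count_colon + 1
}

count_seq <- 0
for(i in seq_along(empty)){
  count_seq <- count_seq + 1
}

count_colon # 2: the loop ran twice even though the vector is empty
count_seq   # 0: the loop was skipped entirely
```

This is one reason seq_along() is often the safer choice when a vector might be empty.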

A while loop keeps going as long as some condition is TRUE. Suppose we want to keep adding numbers from a vector until the total sum exceeds 50:

numbers <- c(10, 20, 30, 40, 50)
total <- 0 # Start total at 0
i <- 1 # Start index at 1

while(i <= length(numbers) && total <= 50){
  
  # Add to total
  total <- total + numbers[i]
  
  # Track which element we are on
  i <- i + 1
  
}

print(total)
[1] 60
print(i)
[1] 4

What’s going on here?

  1. while(i <= length(numbers) && total <= 50) - The loop will continue running while two conditions are both TRUE:
  • We haven’t reached the end of the vector (i <= length(numbers)) and
  • The total hasn’t exceeded 50 (total <= 50).
  2. Inside the loop, we add the i-th element of numbers to total.
  3. We then move i to the next element by adding 1.
  4. As soon as one of the conditions in 1. becomes FALSE, the loop stops.

Pre-requisite knowledge for calc_df_stats()

Reading a file with readr

The readr package (part of the tidyverse collection of packages) provides user-friendly functions for reading in data. For example, you can read a CSV file like so:

library(readr) # Load readr so that read_csv() is available
my_data <- read_csv("path/to/mydata.csv")
  • read_csv("path/to/mydata.csv") reads the CSV file located at the specified path (either a relative or absolute path) and creates a data frame (more on those soon).
  • We’re storing that data frame in a variable called my_data.
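
To try this without a CSV file of your own, the sketch below first writes a tiny made-up CSV to a temporary file and then reads it back. It assumes the readr package is installed; the file contents and the name csv_path are invented for this example:

```r
library(readr)

# Create a small CSV file in a temporary location (hypothetical example data)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("Name,Age", "Alice,30", "Bob,25"),
  csv_path
)

# Read it back in; show_col_types = FALSE hides the column type message
my_data <- read_csv(csv_path, show_col_types = FALSE)
my_data
```

The result has one row per line of data in the file (two here) and one column per header name.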

Data Frames

A data frame is a table-like structure with rows and columns, commonly used for storing datasets in R. Each column is usually a vector of a particular type (numeric, character, boolean, etc.), and all columns have the same length.

To create a data frame you can run code like this:

ages <- c(30, 25, 35)

# You can either specify the vector directly
# or provide the name of a vector you previously created
people_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = ages,
  Score = c(100, 95, 90)
)

people_df
     Name Age Score
1   Alice  30   100
2     Bob  25    95
3 Charlie  35    90

Here are some ways you can interact with the data inside of a data frame:

  1. You can grab an entire column with $ or [[<col name as string>]]:
# This will give you a vector
people_df$Name
[1] "Alice"   "Bob"     "Charlie"
# This will be the same vector
people_df[["Name"]]
[1] "Alice"   "Bob"     "Charlie"
  2. You can grab one or more columns with [ , <column number(s)>]:
# This will give you a vector (a single column is simplified to a vector by default)
people_df[, 1]
[1] "Alice"   "Bob"     "Charlie"
# This will give you a data frame with multiple columns
people_df[,c(1, 2)]
     Name Age
1   Alice  30
2     Bob  25
3 Charlie  35
  3. You can get all columns in a specific row with [<row number>, ]:
# This will give you a data frame with one row
people_df[1,]
   Name Age Score
1 Alice  30   100
# This will give you a data frame with multiple rows
people_df[c(1, 3), ]
     Name Age Score
1   Alice  30   100
3 Charlie  35    90
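
A few related variations, sketched below, can be handy: the drop = FALSE argument keeps a single column as a data frame, and a row and column can be combined to pull out a single value. The data frame is recreated here so the example runs on its own:

```r
people_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(30, 25, 35),
  Score = c(100, 95, 90)
)

# A single column is returned as a vector by default...
name_vector <- people_df[, 1]
is.data.frame(name_vector) # FALSE

# ...but drop = FALSE keeps it as a one-column data frame
name_df <- people_df[, 1, drop = FALSE]
is.data.frame(name_df) # TRUE

# Row and column can be combined to get a single value
bob_age <- people_df[2, "Age"]
bob_age # 25
```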

Lists

A list is a container in R that, like a data frame, can hold a mix of different types of items. Lists are more flexible, though: their elements can also have different sizes. A list can hold:

  • A numeric vector
  • A string vector
  • A single number
  • An entire data frame
  • Another list

All at once!

Here is how you can create a list:

my_list <- list(
  name = "Alice",
  age = 30,
  scores = c(100, 95, 90),
  is_student = FALSE,
  df = data.frame(a = c(1, 2, 3), b = c("a", "b", "c"))
)

To access elements of a list, you can:

  1. Use the $ operator (if the elements have names):
my_list$name
[1] "Alice"
my_list$scores
[1] 100  95  90
  2. Use the [[ ]] operator with the element’s name (if it has one), or its position:
my_list[["name"]]
[1] "Alice"
my_list[[1]]
[1] "Alice"
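
Single square brackets [ ] also work on lists, but they behave differently from [[ ]]: they return a smaller list rather than the element itself. A brief sketch, using a shortened version of my_list:

```r
my_list <- list(
  name = "Alice",
  age = 30,
  scores = c(100, 95, 90)
)

# [[ ]] pulls out the element itself
inner <- my_list[["scores"]]
class(inner) # "numeric"

# [ ] returns a smaller list that still wraps the element
wrapped <- my_list["scores"]
class(wrapped) # "list"

# Math works on the element, not on the wrapping list
total_score <- sum(my_list[["scores"]])
total_score # 285
```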

Often, you will want to go through a list element by element and do something with each one. The same applies to data frame columns, which behave like elements of a list (under the hood, a data frame is just a list that forces its elements to be the same size). You could write a for loop, but there are popular alternatives, often called “map” functions, that can make your code cleaner and easier to read. Base R has its own versions of these (such as lapply() and sapply()), but I prefer the improved versions provided by the R package purrr.

  1. map(): takes a single list as input
library(purrr)

numbers <- list(
  c(1, 2, 3),
  c(4, 5, 6),
  c(10, 20, 30, 40, 50)
)

# Outputs a list, one element per element of the original list
map(numbers, function(x) sum(x))
[[1]]
[1] 6

[[2]]
[1] 15

[[3]]
[1] 150
# Outputs a numeric vector, one element per original list element
# Also using an alternative notation
map_dbl(numbers, ~ sum(.x))
[1]   6  15 150
  2. map2(): takes two lists as input
numbers <- list(
  c(1, 2, 3),
  c(4, 5, 6),
  c(10, 20, 30, 40, 50)
)

numbers2 <- list(
  c(-1, -2, -3),
  c(12, 13),
  c(2, 4, 6, 8, 10, 12)
)

# Outputs a list, one element per pair of elements from the original lists
map2(numbers, numbers2, ~ sum(.x) + sum(.y))
[[1]]
[1] 0

[[2]]
[1] 40

[[3]]
[1] 192
# Outputs a numeric vector, one element per pair of original list elements
map2_dbl(numbers, numbers2, function(x, y) sum(x) + sum(y))
[1]   0  40 192
  3. pmap(): allows you to provide a named list of inputs
# A list of vectors
lists_of_inputs <- list(
  a = c(1, 3, 5),
  b = c(2, 4, 6),
  c = c(10, 20, 30)
)

pmap(lists_of_inputs, function(a, b, c) a + b + c)
[[1]]
[1] 13

[[2]]
[1] 27

[[3]]
[1] 41
pmap_dbl(lists_of_inputs, function(a, b, c) a + b + c)
[1] 13 27 41
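
For comparison, here is a rough sketch of the same idea with base R’s own functions, lapply() and vapply(), which need no extra packages. It also illustrates the point that a data frame is a list of columns, so the same tools can loop over columns; the data frame df is made-up example data:

```r
numbers <- list(
  c(1, 2, 3),
  c(4, 5, 6),
  c(10, 20, 30, 40, 50)
)

# lapply() returns a list, like map()
sums_list <- lapply(numbers, function(x) sum(x))

# vapply() returns a vector of a declared type (here, one number per element),
# like map_dbl()
sums_vec <- vapply(numbers, function(x) sum(x), numeric(1))
sums_vec

# A data frame is a list of columns, so the same tools loop over its columns
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
is.list(df)
col_means <- vapply(df, mean, numeric(1))
col_means
```

Here sums_vec holds 6, 15, and 150, matching the map_dbl() output above, and col_means holds the mean of each column (2 for a and 20 for b).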