Chapter 5 The Fundamentals of R
5.1 Four Fundamentals
The essence of R:
R <- c(1:4)
R
## [1] 1 2 3 4
(See Vectors later).
[One] important difference about R:
- Vector-based: R operates on whole vectors at once, so explicit loops are rarely needed
[Two] reasons to use R for Data Science:
- Designed for data: R can manipulate big data sets
- Graphics Are Graspable: people understand graphical data
[Three] fundamental principles of R per John Chambers:
- Objects: Everything that exists in R is an object
- Functions: Everything that happens in R is a function call
- Interfaces: Interfaces to other software are an integral part of R
[Four] ways of programming R:
- Command line: entering R commands in a terminal
- Source file: running a set of commands from a saved file
- R GUI: available for Mac, Windows, and Linux
- Code chunks in RStudio: allows debugging as you write
5.2 Basic Maths
R has all the basic mathematical functions:
1 + 1
## [1] 2
1 + 2 + 3
## [1] 6
3 * 7 * 2
## [1] 42
4 / 3
## [1] 1.333333
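R obeys the standard order of mathematical operations (PEMDAS). Multiplication and division share one precedence level, as do addition and subtraction, and each pair is evaluated left to right:
- Parentheses ( )
- Exponents ^
- Multiplication *
- Division /
- Addition +
- Subtraction -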
(2 ^ 5) + (2 * 5)
## [1] 42
The use of white space between operators is recommended.
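As a small sketch of how precedence changes a result, the same numbers can be grouped in different ways. The values below are illustrative and are not taken from the text above:

```r
# Multiplication binds tighter than addition
2 + 3 * 4      # 14
(2 + 3) * 4    # 20

# Exponentiation binds tighter than unary minus
-2^2           # -4
(-2)^2         # 4

# Exponentiation is evaluated right to left
2^3^2          # 512, i.e. 2^(3^2)
```

When in doubt, adding parentheses makes the intended order explicit.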
5.3 Variables
Unlike statically-typed languages such as C++, R does not require variable types to be declared. An R variable can represent any data type or R object, such as a function, result, or graphical plot. R variables can be reassigned.
- Variable names can contain alphanumeric characters, periods (.), and underscores (_)
- They cannot start with a number or underscore
- Variable names are case sensitive
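The naming rules can be checked with a short sketch. The variable names here are hypothetical, and base R's make.names() is used only as a convenient validity test: it returns its input unchanged when the input is already a valid name.

```r
# Periods and underscores are allowed inside a name
my.value <- 1
my_value <- 2

# Names are case sensitive: these are two different variables
total <- 10
Total <- 20
total == Total                          # FALSE

# make.names() leaves a valid name untouched and alters an invalid one
make.names("my.value") == "my.value"    # TRUE
make.names("2fast") == "2fast"          # FALSE: starts with a number
make.names("_hidden") == "_hidden"      # FALSE: starts with an underscore
```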
5.3.1 Assigning variables
R variable assignment operators are <- (default) and = (acceptable).
x <- 2
x
## [1] 2
y = 5
y
## [1] 5
You can also assign left-to-right with ->, but variables are not often assigned that way.
7 -> z
z
## [1] 7
Assignment operations can be chained to assign one value to multiple variables:
a <- b <- 42
a
## [1] 42
b
## [1] 42
You can also use the built-in assign function:
assign("q", 4)
q
## [1] 4
5.3.2 Removing variables
rm(variablename) removes a variable.
rm(q)
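One way to confirm that a variable is gone is base R's exists(). This is a minimal sketch; tmp_value is a hypothetical name chosen for the illustration. Passing inherits = FALSE restricts the check to the current environment, which matters for a name like q because base R also defines a function called q().

```r
# tmp_value is a hypothetical name used only for this illustration
tmp_value <- 4
exists("tmp_value", inherits = FALSE)   # TRUE

rm(tmp_value)
exists("tmp_value", inherits = FALSE)   # FALSE
```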
5.4 Data Types
R has four main data types:
- Numeric
- Character (a.k.a Nominal)
- Date
- Logical
You can check the type of a variable with class(variablename):
x <- "eh?"
x
## [1] "eh?"
class(x)
## [1] "character"
y <- 99
y
## [1] 99
class(y)
## [1] "numeric"
5.4.1 Numeric data types
Numeric data includes both integers and decimals (positive, negative, and zero), similar to float or double in other languages. A number assigned to a variable is stored as numeric by default.
You can test whether data is numeric with is.numeric():
is.numeric(y)
## [1] TRUE
And whether it is an integer with is.integer():
is.integer(y)
## [1] FALSE
The response is FALSE because, to store a value as an integer, you must append L to it:
y <- 99L
is.integer(y)
## [1] TRUE
R promotes integers to numeric when needed.
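A short sketch of when that promotion happens, using class() to inspect each result. The variable a is hypothetical:

```r
a <- 5L
class(a)          # "integer"
class(a * 2L)     # "integer": both operands are integers
class(a + 2.5)    # "numeric": the integer is promoted
class(a / 2L)     # "numeric": division with / always returns numeric
a / 2L            # 2.5
```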
5.4.2 Character data types
R handles Character data in two primary ways: as character and as factor. They are treated differently:
x <- "data"
x
## [1] "data"
class(x)
## [1] "character"
and
y <- factor("data")
y
## [1] data
## Levels: data
The levels are attributes of that factor.
To find the number of characters in a character (or numeric) value:
nchar(x)
## [1] 4
This does not work for factor data.
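If the character count of a factor is needed, one workaround (a sketch, not the only approach) is to convert the factor to character first, or to count the characters of its levels. The variable f is hypothetical:

```r
f <- factor("data")

levels(f)                 # "data"
nchar(as.character(f))    # 4: convert the factor to character first
nchar(levels(f))          # 4: levels are already character strings
```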
5.4.3 Date data types
R has numerous types of dates. Date and POSIXct are the most useful.
date1 <- as.Date("2018-03-28")
date1
## [1] "2018-03-28"
class(date1)
## [1] "Date"
as.numeric(date1)
## [1] 17618
and
date2 <- as.POSIXct("2018-03-28 10:45")
date2
## [1] "2018-03-28 10:45:00 PDT"
class(date2)
## [1] "POSIXct" "POSIXt"
as.numeric(date2)
## [1] 1522259100
as.numeric() returns the days (Date) or seconds (POSIXct) elapsed since 1 January 1970, and it also changes the underlying type:
class(date1)
## [1] "Date"
class(as.numeric(date1))
## [1] "numeric"
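Because a Date is stored as a count of days, ordinary arithmetic works on it. The following sketch reuses the date from the example above; the variable d is hypothetical:

```r
d <- as.Date("2018-03-28")

d + 7                                         # "2018-04-04"
as.numeric(d - as.Date("2018-01-01"))         # 86 days
as.Date(17618, origin = "1970-01-01") == d    # TRUE: round trip from the number
```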
5.4.4 Logical data types
Logicals can be either TRUE (T or 1) or FALSE (F or 0). T and F are not recommended as they are simply shortcuts to TRUE and FALSE and can be overwritten, causing woe, anguish, mayhem, and rioting. (TRUE or F?)
Logical data types have a similar test function, is.logical():
k <- TRUE
class(k)
## [1] "logical"
is.logical(k)
## [1] TRUE
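The risk with T and F can be shown in a few lines. This is an illustrative sketch; votes is a hypothetical vector. It also shows that logicals behave as 1 and 0 in arithmetic:

```r
# TRUE and FALSE are reserved words; T and F are ordinary variables
T <- 0             # legal, and silently breaks any code that relies on T
isTRUE(T)          # FALSE
rm(T)              # remove the override; T refers to base R's TRUE again
isTRUE(T)          # TRUE

# Logicals count as 1 and 0 in arithmetic
votes <- c(TRUE, FALSE, TRUE, TRUE)
sum(votes)         # 3
mean(votes)        # 0.75
```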
5.5 Data Structures
R data structures are containers for data elements:
- Vectors – collections of only same-type elements
- Matrices – rectangular containers of only same-type elements
- Data Frames – contain vectors of differing types, all of the same length
- Arrays – multidimensional containers of same-type elements
- Lists – containers for elements of any type
5.5.1 Vectors
Vectors are the heart of R; it is a vectorised language. An R Vector is:
A collection of elements of the same type.
Operations are applied to each element of a vector without the need to loop through them. This separates R from many other programming languages and makes it well suited to manipulation and graphical presentation of data.
Vectors do not have a dimension: there is no column or row vector. Unlike mathematical vectors, there is no difference between column and row orientation.
5.5.1.1 Creating a vector
Vectors are created with c, meaning “combine”:
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
x
## [1] 1 2 3 4 5 6 7 8
Operations are applied to all elements at once:
x + 2
## [1] 3 4 5 6 7 8 9 10
x - 3
## [1] -2 -1 0 1 2 3 4 5
x * 2
## [1] 2 4 6 8 10 12 14 16
x / 4
## [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00
x^2
## [1] 1 4 9 16 25 36 49 64
sqrt(x)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
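To make the contrast with an explicit loop concrete, the sketch below computes the same result both ways. The names looped and vectorised are hypothetical:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Explicit loop: compute each element in turn
looped <- numeric(length(x))
for (i in seq_along(x)) {
  looped[i] <- x[i] * 2
}

# Vectorised form: one expression over the whole vector
vectorised <- x * 2

identical(looped, vectorised)   # TRUE
```

Both forms give the same answer; the vectorised form is shorter and is the idiomatic choice in R.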
5.5.1.2 Vector creation shortcuts
1:8
## [1] 1 2 3 4 5 6 7 8
8:1
## [1] 8 7 6 5 4 3 2 1
-3:4
## [1] -3 -2 -1 0 1 2 3 4
4:-3
## [1] 4 3 2 1 0 -1 -2 -3
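The colon operator always steps by 1. For other patterns, base R's seq() and rep() functions (not covered in the text above) are a common choice; a brief sketch:

```r
seq(1, 8, by = 2)            # 1 3 5 7
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00
rep(c(1, 2), times = 3)      # 1 2 1 2 1 2
rep(c(1, 2), each = 3)       # 1 1 1 2 2 2
```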
5.5.1.3 Accessing vector elements
Any element of a Vector can be directly accessed using [square brackets] to point to it:
x
## [1] 1 2 3 4 5 6 7 8
x[4]
## [1] 4
x[8]
## [1] 8
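Square brackets also accept a vector of positions, negative positions, or a logical condition. The sketch below uses a hypothetical vector v whose values differ from their positions so the two are easy to tell apart:

```r
v <- c(10, 20, 30, 40, 50)

v[c(1, 3)]    # 10 30: several positions at once
v[2:4]        # 20 30 40: a range of positions
v[-1]         # 20 30 40 50: everything except the first element
v[v > 25]     # 30 40 50: elements that satisfy a condition
```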
5.5.1.4 Counting within Vectors
You can check the length of a vector:
x
## [1] 1 2 3 4 5 6 7 8
length(x)
## [1] 8
y
## [1] data
## Levels: data
length(y)
## [1] 1
length(x + y)
## Warning in Ops.factor(x, y): '+' not meaningful for factors
## [1] 8
and count the number of characters in each element of a vector:
q <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight")
q
## [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"
nchar(q)
## [1] 3 3 5 4 4 3 5 5
5.5.1.5 Combining Vectors
Two vectors of the same or different length can be combined:
5.5.1.5.1 Vectors of the same length
x <- 1:8
x
## [1] 1 2 3 4 5 6 7 8
y <- -3:4
y
## [1] -3 -2 -1 0 1 2 3 4
x + y
## [1] -2 0 2 4 6 8 10 12
x - y
## [1] 4 4 4 4 4 4 4 4
x * y
## [1] -3 -4 -3 0 5 12 21 32
x / y
## [1] -0.3333333 -1.0000000 -3.0000000 Inf 5.0000000 3.0000000
## [7] 2.3333333 2.0000000
x^y
## [1] 1.0000000 0.2500000 0.3333333 1.0000000 5.0000000
## [6] 36.0000000 343.0000000 4096.0000000
5.5.1.5.2 Vectors of different lengths
For two vectors of different lengths, the shorter vector is recycled; R issues a warning when the longer length is not a multiple of the shorter:
x + c(1, 2)
## [1] 2 4 4 6 6 8 8 10
x + c(1, 2, 3)
## Warning in x + c(1, 2, 3): longer object length is not a multiple of
## shorter object length
## [1] 2 4 6 5 7 9 8 10
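Recycling can be made explicit with base R's rep_len(), which repeats a vector until it reaches a target length. This sketch (the names short, recycled, manual, and automatic are hypothetical) shows that the manual version matches what R does automatically:

```r
x <- 1:8
short <- c(1, 2, 3)

# Repeat the shorter vector until it is as long as x
recycled <- rep_len(short, length(x))
recycled                        # 1 2 3 1 2 3 1 2

manual <- x + recycled
automatic <- suppressWarnings(x + short)
identical(manual, automatic)    # TRUE
```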
5.5.1.6 Comparison of two Vectors
x <- c(1:8)
x
## [1] 1 2 3 4 5 6 7 8
x > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
y <- c(3:10)
y
## [1] 3 4 5 6 7 8 9 10
x > y
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
The all() function tests whether all elements are TRUE:
x <- 10:1
y <- -4:5
x
## [1] 10 9 8 7 6 5 4 3 2 1
y
## [1] -4 -3 -2 -1 0 1 2 3 4 5
all(x < y)
## [1] FALSE
The any() function tests whether any element is TRUE:
any(x < y)
## [1] TRUE
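Beyond all() and any(), it is often useful to know where and how many elements satisfy a comparison. A minimal sketch with base R's which() and sum(), using the same x and y:

```r
x <- 10:1
y <- -4:5

which(x < y)    # 9 10: positions where the comparison is TRUE
sum(x < y)      # 2: number of TRUE results
```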
5.5.1.7 Factor Vectors
Factors are an important concept in R. Factors contain levels, which are the unique values of that factor variable.
q
## [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"
qFactor <- as.factor(q)
qFactor
## [1] One Two Three Four Five Six Seven Eight
## Levels: Eight Five Four One Seven Six Three Two
Note that the order of levels does not matter unless the ordered argument is set to TRUE:
factor(x=c("High School", "Doctorate", "Masters", "College"),
levels=c("High School", "College", "Masters", "Doctorate"),
ordered=TRUE)
## [1] High School Doctorate Masters College
## Levels: High School < College < Masters < Doctorate
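Once a factor is ordered, its values can be compared by rank. The sketch below stores the same factor in a hypothetical variable edu:

```r
edu <- factor(c("High School", "Doctorate", "Masters", "College"),
              levels = c("High School", "College", "Masters", "Doctorate"),
              ordered = TRUE)

edu[2] > edu[3]           # TRUE: Doctorate ranks above Masters
edu[4] < edu[1]           # FALSE: College ranks above High School
as.integer(edu)           # 1 4 3 2: position of each value among the levels
as.character(max(edu))    # "Doctorate"
```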
5.5.2 Matrices
A familiar mathematical structure, matrices are essential to statistics.
A Matrix is a rectangular structure of rows and columns in which every element is of the same type, often all numerics.
Matrices can be acted upon similarly to Vectors, with element-by-element addition, subtraction, multiplication, division, and equality.
5.5.2.1 Creating a Matrix
A <- matrix(1:12, nrow=3)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
Any element of a matrix can be directly accessed using [square bracket] co-ordinates:
A[2,3]
## [1] 8
A[3,4]
## [1] 12
5.5.2.2 Dimensions of a Matrix
nrow(A)
## [1] 3
ncol(A)
## [1] 4
dim(A)
## [1] 3 4
5.5.2.3 Adding Matrices
A
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
B <- matrix(13:24, nrow=3)
B
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
A + B
## [,1] [,2] [,3] [,4]
## [1,] 14 20 26 32
## [2,] 16 22 28 34
## [3,] 18 24 30 36
5.5.2.4 Multiplying Matrices
A * B
## [,1] [,2] [,3] [,4]
## [1,] 13 64 133 220
## [2,] 28 85 160 253
## [3,] 45 108 189 288
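Note that * multiplies element by element. For the matrix product of linear algebra, R provides the %*% operator, which requires the column count of the left matrix to equal the row count of the right one. A sketch using the same A and B, with B transposed by t() so that the shapes conform (elementwise and product are hypothetical names):

```r
A <- matrix(1:12, nrow = 3)
B <- matrix(13:24, nrow = 3)

elementwise <- A * B      # same 3x4 shape as A and B
product <- A %*% t(B)     # (3x4) %*% (4x3) gives a 3x3 matrix

dim(elementwise)          # 3 4
dim(product)              # 3 3
product[1, 1]             # 430 = 1*13 + 4*16 + 7*19 + 10*22
```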
5.5.2.5 Logical querying
A == B
## [,1] [,2] [,3] [,4]
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
5.5.2.6 Naming rows and columns
colnames(A) <- c("A1", "A2", "A3", "A4")
rownames(A) <- c("First", "Second", "Third")
A
## A1 A2 A3 A4
## First 1 4 7 10
## Second 2 5 8 11
## Third 3 6 9 12
A["First", "A2"]
## [1] 4
A[1,2]
## [1] 4
Two special vectors, letters and LETTERS, hold the lowercase and UPPERCASE alphabets and are handy for naming matrix columns or rows:
C <- matrix(21:40, nrow=2)
colnames(C) <- LETTERS[1:10]
rownames(C) <- c(letters[1:2])
C
## A B C D E F G H I J
## a 21 23 25 27 29 31 33 35 37 39
## b 22 24 26 28 30 32 34 36 38 40
5.5.3 Dataframes
The data.frame is perhaps the primary reason for R’s growing popularity as a powerful, focussed, and flexible language for use in all aspects of Data Science.
A data.frame is a rectangular collection of vectors, all of which are of the same length but which may differ in data type.
A Data Frame looks like an Excel spreadsheet in that the data is organised into columns and rows. In statistical terms, each column is a variable while each row contains specific observations. Similar to a Matrix only in that it is also rectangular, a data.frame is a much more flexible and comprehensive data structure.
5.5.3.1 Creating a Dataframe
Starting from three vectors of equal length:
(x <- 8:1)
## [1] 8 7 6 5 4 3 2 1
(y <- -3:4)
## [1] -3 -2 -1 0 1 2 3 4
(q <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight"))
## [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"
The simplest way of creating a Dataframe is with the data.frame() function:
theDF <- data.frame(x, y, q)
theDF
## x y q
## 1 8 -3 One
## 2 7 -2 Two
## 3 6 -1 Three
## 4 5 0 Four
## 5 4 1 Five
## 6 3 2 Six
## 7 2 3 Seven
## 8 1 4 Eight
This creates an 8x3 data.frame consisting of three vectors. Notice that the column headings are taken from the names of the vectors.
To assign names to the vectors:
theDF <- data.frame(First=x, Second=y, Third=q)
theDF
## First Second Third
## 1 8 -3 One
## 2 7 -2 Two
## 3 6 -1 Three
## 4 5 0 Four
## 5 4 1 Five
## 6 3 2 Six
## 7 2 3 Seven
## 8 1 4 Eight
To assign names to the rows:
rownames(theDF) <- c("One", "Two", "Three", "Four", "Five", "Six", "Seven", "Eight")
theDF
## First Second Third
## One 8 -3 One
## Two 7 -2 Two
## Three 6 -1 Three
## Four 5 0 Four
## Five 4 1 Five
## Six 3 2 Six
## Seven 2 3 Seven
## Eight 1 4 Eight
5.5.3.2 Examining a Dataframe
The nrow(), ncol(), dim(), rownames(), and names() functions are available to investigate the properties of a data.frame:
(nrow(theDF))
## [1] 8
(ncol(theDF))
## [1] 3
(dim(theDF))
## [1] 8 3
(rownames(theDF))
## [1] "One" "Two" "Three" "Four" "Five" "Six" "Seven" "Eight"
(names(theDF))
## [1] "First" "Second" "Third"
Elements of any vector of a data.frame can be directly accessed using the $ or [row, col] operators:
(theDF$Second)
## [1] -3 -2 -1 0 1 2 3 4
(theDF[7, 3])
## [1] Seven
## Levels: Eight Five Four One Seven Six Three Two
To specify an entire row, leave out the column specification; to specify an entire column, leave out the row specification:
(theDF[2, ])
## First Second Third
## Two 7 -2 Two
(theDF[, 2])
## [1] -3 -2 -1 0 1 2 3 4
To specify more than one row or column, use a vector of indices:
(theDF[3:5, 2:3])
## Second Third
## Three -1 Three
## Four 0 Four
## Five 1 Five
To specify multiple columns by name, use a character vector of the column names:
(theDF[, c("First", "Third")])
## First Third
## One 8 One
## Two 7 Two
## Three 6 Three
## Four 5 Four
## Five 4 Five
## Six 3 Six
## Seven 2 Seven
## Eight 1 Eight
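Rows can also be selected by a logical condition on a column, in the same way a vector is filtered. The sketch below rebuilds theDF so that it stands alone; big and both are hypothetical names:

```r
theDF <- data.frame(First = 8:1, Second = -3:4,
                    Third = c("One", "Two", "Three", "Four",
                              "Five", "Six", "Seven", "Eight"))

# Rows where First is greater than 5, keeping all columns
big <- theDF[theDF$First > 5, ]
nrow(big)     # 3

# Combine conditions with & and pick columns by name
both <- theDF[theDF$First > 2 & theDF$Second > 0, c("First", "Second")]
both$First    # 4 3
```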
To find the class of the entire data.frame:
(class(theDF))
## [1] "data.frame"
or the class of any vector:
(class(theDF$Third))
## [1] "factor"
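The "factor" result reflects versions of R before 4.0.0, where data.frame() converted strings to factors by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the same column would be "character". Setting the argument explicitly, as in this sketch with hypothetical names, gives the same behaviour on any version:

```r
words <- c("One", "Two", "Three")

as_text <- data.frame(n = 1:3, w = words, stringsAsFactors = FALSE)
as_fact <- data.frame(n = 1:3, w = words, stringsAsFactors = TRUE)

class(as_text$w)    # "character"
class(as_fact$w)    # "factor"
```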
5.5.3.3 Displaying a Dataframe
data.frames can be small, large, big, huge, or ginormous, depending on their size. The head() and tail() functions print only the first or last few rows (six by default), or the number of rows you set:
(head(theDF))
## First Second Third
## One 8 -3 One
## Two 7 -2 Two
## Three 6 -1 Three
## Four 5 0 Four
## Five 4 1 Five
## Six 3 2 Six
(head(theDF, n=5))
## First Second Third
## One 8 -3 One
## Two 7 -2 Two
## Three 6 -1 Three
## Four 5 0 Four
## Five 4 1 Five
(tail(theDF, n=5))
## First Second Third
## Four 5 0 Four
## Five 4 1 Five
## Six 3 2 Six
## Seven 2 3 Seven
## Eight 1 4 Eight
5.5.4 Arrays
An Array is a multidimensional Vector whose elements are all the same type. It also has a dimensions attribute (dim), and its dimensions can be named (dimnames).
5.5.4.1 Creating Arrays
To create an Array, pass a dim vector to array(): its first element is the number of rows, the second the number of columns, and the remaining elements the sizes of the outer dimensions (here: rows, columns, number of matrices):
theArray <- array(1:12, dim = c(2, 3, 2))
theArray
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
5.5.4.2 Accessing Arrays
Individual elements of an Array are accessed using square brackets, similar to a Vector, but in this case by [row, column, array #]. Leaving an index blank selects everything along that dimension:
theArray[1, , ]
## [,1] [,2]
## [1,] 1 7
## [2,] 3 9
## [3,] 5 11
theArray[2, , ]
## [,1] [,2]
## [1,] 2 8
## [2,] 4 10
## [3,] 6 12
theArray[1, , 1]
## [1] 1 3 5
theArray[1, , 2]
## [1] 7 9 11
theArray[, , 1]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
theArray[, , 2]
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
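Supplying all three indices returns a single element, and dimnames allow the same lookup by name. A sketch with hypothetical dimension names (r1, c1, m1, and so on):

```r
theArray <- array(1:12, dim = c(2, 3, 2))

theArray[1, 2, 2]     # 9: row 1, column 2, second matrix
theArray[2, 3, 1]     # 6: row 2, column 3, first matrix

# Name each dimension, then index by name instead of position
dimnames(theArray) <- list(c("r1", "r2"),
                           c("c1", "c2", "c3"),
                           c("m1", "m2"))
theArray["r1", "c2", "m2"]   # 9
```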
5.5.5 Lists
Lists are used to store any number of items of any type: all numeric or all character vectors, or a mix of them; complete data.frames; and even other lists.
5.5.5.1 Creating Lists
Lists are created with the list() function. Each argument to the function becomes an element of the list:
list(1, 2, 3)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 2
##
## [[3]]
## [1] 3
Single-element lists can contain multi-element vectors:
list(c(1, 2, 3))
## [[1]]
## [1] 1 2 3
Here’s a two-element list with the second element a five-element vector:
list1 <- list(c(1, 2, 3), 3:7)
list1
## [[1]]
## [1] 1 2 3
##
## [[2]]
## [1] 3 4 5 6 7
A two-element list with the first element an array, the second element a ten-element vector:
list2 <- list(theArray, 1:10)
list2
## [[1]]
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
5.5.5.2 Creating Empty Lists
Empty lists of a determined length are created using vector():
(emptyList <- vector(mode = "list", length = 4))
## [[1]]
## NULL
##
## [[2]]
## NULL
##
## [[3]]
## NULL
##
## [[4]]
## NULL
Note: Enclosing an expression in round brackets displays the results immediately after execution.
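A typical use of an empty list is to create it at its final length and then fill it element by element. A minimal sketch, with squares as a hypothetical name:

```r
squares <- vector(mode = "list", length = 4)

# Fill each slot in turn; [[i]] addresses the i-th element itself
for (i in seq_along(squares)) {
  squares[[i]] <- i^2
}

length(squares)    # 4
squares[[3]]       # 9
```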
5.5.5.3 Naming Lists
Lists can have names, and each element of a list can have a unique name:
names(list2)
## NULL
(names(list2) <- c("The Array", "The Vector"))
## [1] "The Array" "The Vector"
list2
## $`The Array`
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $`The Vector`
## [1] 1 2 3 4 5 6 7 8 9 10
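Named elements can then be retrieved with double brackets or $. The sketch below builds a small list with the same names (lst is a hypothetical variable) and shows the difference between single and double brackets:

```r
lst <- list("The Array" = array(1:12, dim = c(2, 3, 2)),
            "The Vector" = 1:10)

lst$`The Vector`[3]           # 3: backticks are needed because of the space
class(lst["The Vector"])      # "list": single brackets return a sub-list
class(lst[["The Vector"]])    # "integer": double brackets return the element
lst[[1]][2, 3, 2]             # 12: index into the array stored in the list
```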
5.5.5.4 Naming List Elements
Names can also be assigned to list elements during creation using name-value pairs. An element can itself be a named list:
(list3 <- list(theARR=theArray, theVECT=1:10, List3=list2))
## $theARR
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $theVECT
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $List3$`The Vector`
## [1] 1 2 3 4 5 6 7 8 9 10
5.5.5.5 Adding To A List
New elements can be added to a list by assigning to a numeric or named index that does not yet exist:
length(list3)
## [1] 3
Adding a numeric index:
list3[[4]] <- 11
length(list3)
## [1] 4
list3
## $theARR
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $theVECT
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $List3$`The Vector`
## [1] 1 2 3 4 5 6 7 8 9 10
##
##
## [[4]]
## [1] 11
Adding a named index:
list3[["AddedElement"]] <- 12:16
length(list3)
## [1] 5
list3
## $theARR
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $theVECT
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $List3
## $List3$`The Array`
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
##
## $List3$`The Vector`
## [1] 1 2 3 4 5 6 7 8 9 10
##
##
## [[4]]
## [1] 11
##
## $AddedElement
## [1] 12 13 14 15 16
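The reverse operation, removing an element, is done by assigning NULL to it. A short sketch on a hypothetical list lst:

```r
lst <- list(a = 1, b = "two", c = 3:5)
length(lst)          # 3

lst[["b"]] <- NULL   # removes the element named b
length(lst)          # 2
names(lst)           # "a" "c"
```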