R Introduction


BIO401-01/598-02

Mar. 17 2021

R is a script language for statistical data analysis and manipulation.

home page : https://cran.r-project.org/

How to run

  1. batch mode

    $ R CMD BATCH script.R

help for the batch mode

$ R CMD command --help

for example

$ R CMD INSTALL --help
  1. interactive mode

console : “>” as the R prompt

common IDE (integrated development environment) for R

  • R studio

  • Jupyter-notebook

  • ESS(emacs speaks statistics)

  • R commander, etc.

Installing packages

> install.packages("package_name")

> install.packages(c("ggplot2","sp","raster"))

Interfacing with the system

> dir() #  show files in the current directory

> getwd() #  is asking for the current working directory

Getting help

R provides help with function and commands. On-line help gives useful information as well. Getting used to R help is a key to successful statistical modelling. The online help can be accessed in HTML format by typing:

> help.start()

A keyword search is possible using the Search Engine and Keywords link. You can also use the help() or ? functions.

> help(dir)

or

> ? dir

As a caculator

[1]:
3 + 5
8
[2]:
sin(pi/2)
1
[13]:
2^3
8
[31]:
5/3
1.66666666666667
[33]:
5%%3
2
[1]:
5%/%3
1

Logic Operations

  • AND operator &

  • OR operator |

  • NOT operator !

[1]:
12 > 5 & 12 < 15
TRUE
[2]:
! TRUE
FALSE
[3]:
TRUE | FALSE
TRUE

Data structures

R objects

The entities R operates on are technically known as objects. Examples are “vectors of numeric (real)” or “complex values”, “vectors of logical” values and “vectors of character strings”. These are known as “atomic” structures since their components are all of the same type, or mode, namely numeric, complex, logical, character and raw. R also operates on objects called “lists”, which are of mode list. These are ordered sequences of objects which individually can be of any mode. Lists are known as “recursive” rather than atomic structures since their components can themselves be lists in their own right. The other recursive structures are those of mode function and expression.

By the “mode” of an object we mean the basic type of its fundamental constituents. This is a special case of a “property” of an object. Another property of every object is its “length.” The functions mode(object) and length(object) can be used to find out the mode and length of any defined structure 10.

Further properties of an object are usually provided by attributes(object). Because of this, mode and length are also called “intrinsic attributes” of an object. For example, if z is a complex vector of length 100, then in an expression mode(z) is the character string “complex” and length(z) is 100.

vectors : the heart of R

Vectors are combinations of scalars in a string structure. Vectors must have all values of the same mode. Thus any given vector must be unambiguously either logical, numeric, complex, character or raw. (The only apparent exception to this rule is the special “value” listed as NA for quantities not available, but in fact there are several types of NA). Note that a vector can be empty and still have a mode. For example the empty character string vector is listed as character(0) and the empty numeric vector as numeric(0).

> help(vector)
[6]:
# "<-" is the  assignment operator in R.
# "c" stands for concatenate.

x <- c(1,5,8)
[7]:
# subsetting operation
# "[]" used to retrieve vector elements
# ":" used to give a range for retrieval
x[1]
x[2:3]
1
  1. 5
  2. 8
[8]:
length(x)
3
[9]:
mode(x)
'numeric'
[30]:
# scalar not really exist in R
# individual numbers actually single element vectors

x <- 3.5
x
x[1]
3.5
3.5

Two common functions we can use to generate a sequence.

The seq() function “seq(from = number, to = number, by = number)” allow to create a vector starting from a value to another by a defined increment:

The replicate function “rep(x,times)” enables you to replicate a vector several times in a more complex vector. Calculations can be included to form vectors as well and functions can be combined in the same command.

[16]:
1:10
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
[11]:
seq(1,3,0.25)
  1. 1
  2. 1.25
  3. 1.5
  4. 1.75
  5. 2
  6. 2.25
  7. 2.5
  8. 2.75
  9. 3
[1]:
one2three <- 1:3
rep(one2three,5)
  1. 1
  2. 2
  3. 3
  4. 1
  5. 2
  6. 3
  7. 1
  8. 2
  9. 3
  10. 1
  11. 2
  12. 3
  13. 1
  14. 2
  15. 3

Missing Values

In some cases the components of a vector or of an R object more in general, may not be completely known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. Any operation on an NA becomes an NA.

The function is.na(x) gives a logical vector of the same size as x with value TRUE if and only if the corresponding element in x is NA.

Functions : dealing with NAs

na.fail returns the object if it does not contain any missing values, and signals an error otherwise.

na.omit returns the object with incomplete cases removed.

na.pass returns the object unchanged.

[90]:
z <- c(1:3,NA)
ind <- is.na(z)
ind
  1. FALSE
  2. FALSE
  3. FALSE
  4. TRUE

There is a second kind of “missing” values which are produced by numerical computation, the so-called Not a Number, NaN , values.

Examples are 0/0 or Inf - Inf which both give NaN since the result cannot be defined sensibly.

[22]:
Inf-Inf
0/0
NaN
NaN

is.na(xx) is TRUE both for NA and NaN values. To differentiate these, is.nan(xx) is only TRUE for NaNs. Missing values are sometimes printed as when character vectors are printed without quotes.

[24]:
z <- c(1:3,NA,0/0)
is.not.available <- is.na(z)
is.not.a.number <-is.nan(z)

is.not.a.number
is.not.available
  1. FALSE
  2. FALSE
  3. FALSE
  4. FALSE
  5. TRUE
  1. FALSE
  2. FALSE
  3. FALSE
  4. TRUE
  5. TRUE

On the other hand, NULL represent the value does not exit.

[3]:
x <- c(1,2,3,NA)
mean(x)
<NA>
[4]:
mean(x,na.rm=TRUE)
2
[1]:
y <- c(1,2,3,NULL)
mean(y)
2
[99]:
na.fail(x)
Error in na.fail.default(x): missing values in object
Traceback:

1. na.fail(x)
2. na.fail.default(x)
3. stop("missing values in object")
[100]:
na.fail(y)
  1. 1
  2. 2
  3. 3

Exercise

creat a sequence as below and compute its mean

10 20 30 10 20 30 50 50 50 NA NA NA 10 10 10

Working with strings

character strings : single element vectors

mode : check datatype

[38]:
x <- "hello"
mode(x)
length(x)
'character'
1
[3]:
# concatenate strings

paste("hello","world")
'hello world'
[2]:
# if you don't want any space between the strings, you can use paste0

paste0 ("hello","world")
'helloworld'
[5]:
# substr is used to get parts of a string

substr("Hello world", 6,11)
' world'
[6]:
# grep : retrieval based on certain pattern

timespan <- c("day","month","year")
grep('y', timespan)
  1. 1
  2. 3
[7]:
# ^ : start with

grep('^m', timespan)
2
[41]:
numeric.vector <- c(rep(c (5*10:1, 5, 6), 2))
numeric.vector
  1. 50
  2. 45
  3. 40
  4. 35
  5. 30
  6. 25
  7. 20
  8. 15
  9. 10
  10. 5
  11. 5
  12. 6
  13. 50
  14. 45
  15. 40
  16. 35
  17. 30
  18. 25
  19. 20
  20. 15
  21. 10
  22. 5
  23. 5
  24. 6
[43]:
character.vector  <- as.character(numeric.vector)
character.vector
  1. '50'
  2. '45'
  3. '40'
  4. '35'
  5. '30'
  6. '25'
  7. '20'
  8. '15'
  9. '10'
  10. '5'
  11. '5'
  12. '6'
  13. '50'
  14. '45'
  15. '40'
  16. '35'
  17. '30'
  18. '25'
  19. '20'
  20. '15'
  21. '10'
  22. '5'
  23. '5'
  24. '6'

Exercise

retreive the position of the 2nd “to” in the quatation in the sentence below, replace it with “T” and reprint the whole sentence.

“All we have to decide is what to do with the time that is given us.”

― J.R.R. Tolkien, The Fellowship of the Ring

Factors

A factor is a variable to represent categories of a set. They can created using the function as.factor().

[9]:
countries <- c("Korea", "China", "UK", "UK", "USA", "Japan", "Korea")
as.factor(countries)
  1. Korea
  2. China
  3. UK
  4. UK
  5. USA
  6. Japan
  7. Korea
Levels:
  1. 'China'
  2. 'Japan'
  3. 'Korea'
  4. 'UK'
  5. 'USA'
[10]:
# factor can work on numbers also.
fnum <- c(1:3,2:5,1:8)
as.factor(fnum)
  1. 1
  2. 2
  3. 3
  4. 2
  5. 3
  6. 4
  7. 5
  8. 1
  9. 2
  10. 3
  11. 4
  12. 5
  13. 6
  14. 7
  15. 8
Levels:
  1. '1'
  2. '2'
  3. '3'
  4. '4'
  5. '5'
  6. '6'
  7. '7'
  8. '8'

Set Operations

union(x,y)
intersect(x,y)
setdiff(x,y) : all elements of x NOT in y
setequal(x,y)
c %in% x : membership testing
[27]:
x <- c(1, 3, 5, 8, 10)
y <- c(3, 4, 5, 6, 7)
[29]:
union(x,y)
intersect(x,y)
setdiff(x,y)
setequal(x,y)
  1. 1
  2. 3
  3. 5
  4. 8
  5. 10
  6. 4
  7. 6
  8. 7
  1. 3
  2. 5
  1. 1
  2. 8
  3. 10
FALSE
[30]:
7 %in% x
FALSE

List : contains that can hold different types

Lists are a general form of vector in which the various elements need not be of the same type, and are often themselves vectors or lists. Lists provide a convenient way to return the results of a statistical computation.

[26]:
x <- list(w=2, v="GIS")
x
$w
2
$v
'GIS'
[27]:
# $ sign used to access list elements by names
x$w
2
[62]:
mode(x)
mode(c(x$w,x$v))
'list'
'character'
[65]:
x <- c(2,"GIS")
mode(x)
mode(x[1])
'character'
'character'
[ ]:
# show internal datasets

data()
[3]:
hn <- hist(Nile)
../_images/R_R_Intro_64_0.png
[67]:
print (hn)
$breaks
 [1]  400  500  600  700  800  900 1000 1100 1200 1300 1400

$counts
 [1]  1  0  5 20 25 19 12 11  6  1

$density
 [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001

$mids
 [1]  450  550  650  750  850  950 1050 1150 1250 1350

$xname
[1] "Nile"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"
[68]:
# str : structure
str(hn)
List of 6
 $ breaks  : int [1:11] 400 500 600 700 800 900 1000 1100 1200 1300 ...
 $ counts  : int [1:10] 1 0 5 20 25 19 12 11 6 1
 $ density : num [1:10] 0.0001 0 0.0005 0.002 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001
 $ mids    : num [1:10] 450 550 650 750 850 950 1050 1150 1250 1350
 $ xname   : chr "Nile"
 $ equidist: logi TRUE
 - attr(*, "class")= chr "histogram"
[69]:
mode(hn)
'list'
[5]:
# summary is a generic function in R. It outputs a concise and more friendly representation of a list.
summary(hn)
         Length Class  Mode
breaks   11     -none- numeric
counts   10     -none- numeric
density  10     -none- numeric
mids     10     -none- numeric
xname     1     -none- character
equidist  1     -none- logical

Matrices

Matrices, or more generally arrays, are multi-dimensional generalizations of vectors. In fact, they are vectors that can be indexed by two or more indices and will be printed in a special way.

The matrix() function creates a matrix from the given set of values. We use the matrix(x, nrow=, ncol=) function to set the matrix cell values, the number of rows and the number of columns. We can use the colnames() and rownames() functions to set the column and row names of the matrix-like object.

[29]:
matrix(data = NA, nrow = 2, ncol = 3)
A matrix: 2 × 3 of type lgl
NANANA
NANANA
[11]:
example.matrix <- matrix(1:6,2,3)
example.matrix
A matrix: 2 × 3 of type int
135
246
[12]:
# data distributed by columns by default, can be changed using the keyword byrow
example.matrix <- matrix(1:6,2,3, byrow=TRUE)
example.matrix
A matrix: 2 × 3 of type int
123
456
[13]:
# retrieval by rows
example.matrix[1,]
  1. 1
  2. 2
  3. 3
[14]:
# retrival by columns
example.matrix[,2]
  1. 2
  2. 5
[42]:
# changing values

example.matrix[1,] <- 1:3
example.matrix[2,] <- c(5,10,4)
example.matrix
A matrix: 2 × 3 of type dbl
1 23
5104
[45]:
# apply () function is a machanism to apply a function across a vector.

# by column
apply(example.matrix,2,sum)
  1. 6
  2. 12
  3. 7
[46]:
# by row
apply(example.matrix,1,sum)
  1. 6
  2. 19
[44]:
matrix.head <- c("col a","col b","column c")
matrix.side <- c("first raw","second raw")
[45]:
colnames(example.matrix) = matrix.head
rownames(example.matrix) = matrix.side
example.matrix
A matrix: 2 × 3 of type dbl
col acol bcolumn c
first raw1 23
second raw5104
[52]:
# The structure function str(object.name) informs you of the structure of a
# specific object

str(example.matrix)
 num [1:2, 1:3] 1 5 2 10 3 4
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "first raw" "second raw"
  ..$ : chr [1:3] "col a" "col b" "column c"
[17]:
# combine matrix using rbind (by row) and cbind (by column)

matrix1 <- matrix(1:6,2,3)
matrix2 <- matrix(11:16,2,3)
rbind(matrix1,matrix2)
cbind(matrix1,matrix2)
A matrix: 4 × 3 of type int
1 3 5
2 4 6
111315
121416
A matrix: 2 × 6 of type int
135111315
246121416

Exercise

create a matrix with the following elements and then change the first column values to all 0.

1 1 1
2 2 2
3 3 3

Arrays

An array can be considered a multiple subscripted collection of data entries, for example numeric. R allows simple facilities for creating and handling arrays, and in particular the special case of matrices.

As well as giving a vector structure a dim attribute, arrays can be constructed from vectors by the array function, which has the form array(data_vector, dim_vector)

[60]:
Z <- array(1:24, c(3,4,2))
Z
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
  23. 23
  24. 24
[56]:
str(Z)
 int [1:3, 1:4, 1:2] 1 2 3 4 5 6 7 8 9 10 ...

list and delete R objects

The list function ls() outputs a list of existing R objects.

rm() removes the object.

[57]:
ls()
  1. 'character.vector'
  2. 'example.matrix'
  3. 'hn'
  4. 'ind'
  5. 'is.not.a.number'
  6. 'is.not.available'
  7. 'matrix.head'
  8. 'matrix.side'
  9. 'numeric.vector'
  10. 'one2three'
  11. 'x'
  12. 'z'
  13. 'Z'
[61]:
rm(Z)
ls()
  1. 'character.vector'
  2. 'example.matrix'
  3. 'hn'
  4. 'ind'
  5. 'is.not.a.number'
  6. 'is.not.available'
  7. 'matrix.head'
  8. 'matrix.side'
  9. 'numeric.vector'
  10. 'one2three'
  11. 'x'
  12. 'z'

Data Frames

Data frames are matrix-like structures, in which the columns can be of different types. Think of data frames as data matrices with one row per observational unit but with (possibly) both numerical and categorical variables. Many experiments are best described by data frames: the treatments are categorical but the response is numeric.

As a result R dataframes are tightly coupled collections of variables which share many of the properties of matrices and of lists. Data frames are used as the fundamental data structure by most of R’s modeling software.

A data frame is a list with class “data.frame”. There are restrictions on lists that may be made into data frames, namely :

The components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames. Matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively. Numeric vectors, logicals and factors are included, and character vectors are coerced to be factors, whose levels are the unique values appearing in the vector. Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same row size.

Dataframe construction

[62]:
my.data.frame = data.frame(v = 1:4, ch = c("a", "b", "c", "d"), n = 10)
my.data.frame
A data.frame: 4 × 3
vchn
<int><chr><dbl>
1a10
2b10
3c10
4d10
[18]:
# Or:

my.data.frame = data.frame(vector = 1:4,
                           character = c("a", "b", "c", "d"),
                           const.vector = 10,
                           row.names =c("data1", "data2", "data3", "data4"))
my.data.frame
A data.frame: 4 × 3
vectorcharacterconst.vector
<int><chr><dbl>
data11a10
data22b10
data33c10
data44d10

Data selection and manipulation

[64]:
# You can extract data from dataframes using the    [ [    ] ]  and  $  sign:

my.data.frame[["character"]]

my.data.frame[[2]]
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
  1. 'a'
  2. 'b'
  3. 'c'
  4. 'd'
[66]:
# Call the 3rd value of the character vector:

my.data.frame[[2]][3]
'c'
[67]:
# Or using the $ syntax:

my.data.frame$vector

my.data.frame$character[2:3]
  1. 1
  2. 2
  3. 3
  4. 4
  1. 'b'
  2. 'c'
[68]:
# You can add single arguments to a data frame, query information, select and
# manipulate arguments or single values from a dataframe

my.data.frame$new

my.data.frame$new = c(10,11,20,40)
my.data.frame
NULL
A data.frame: 4 × 4
vectorcharacterconst.vectornew
<int><chr><dbl><dbl>
data11a1010
data22b1011
data33c1020
data44d1040
[69]:
# length(object.name) returns the number of elements in an object such as
# matrix vector or dataframes:

length(my.data.frame$new)
4
[70]:
# which(object.name) and which.max(object.name) return the index of a specific
# or of the greatest element of an object

which.max(my.data.frame$new)

which(my.data.frame$new == 20)
4
3
[71]:
# max(object.name) returns the value of the greatest element

max(my.data.frame$new)
40
[72]:
# sort(object.name) sort from small to big

sort(my.data.frame$new)
  1. 10
  2. 11
  3. 20
  4. 40
[73]:
# rev(object.name) sorts from big to small

rev(sort(my.data.frame$new))
  1. 40
  2. 20
  3. 11
  4. 10
[74]:
# subset(object.name, ...) returns a selection of an R-object with respect to
# criteria (typically comparisons: x$V1 < 10).

subset(my.data.frame, my.data.frame$new == 20)
A data.frame: 1 × 4
vectorcharacterconst.vectornew
<int><chr><dbl><dbl>
data33c1020
[79]:
# If the R-object is a data frame,
# the option select gives the variables to be kept or dropped using a minus
# sign

my.data.frame[-1]
A data.frame: 4 × 3
characterconst.vectornew
<chr><dbl><dbl>
data1a1010
data2b1011
data3c1020
data4d1040

Reading and Writing Files

read.table(“filename”)** Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields within the file. The default separator sep=”” is any whitespace. You might need sep=“,” or “;” and so on. Use header=TRUE to read the first line as a header of column names. The as.is=TRUE specification is used to prevent character vectors from being converted to factors.

The comment.char=”” specification is used to prevent “#” from being interpreted as a comment and use “skip=n” to skip n lines before reading data. For more details: ?read.table

read.csv(“filename”) is set to read comma separated files. Example usage is: read.csv(file.name, header = TRUE, sep = “,”, quote=“””, dec=“.”, fill =TRUE, comment.char=””, …)

read.delim(“filename”) is used for reading tab-delimited files

read.fwf() reads a table of fixed width formatted data into a ’data.frame’. Widths is an integer vector, giving the widths of the fixed-width fields

write.table(Robj, “filename”, sep=“,”, row.names=FALSE, quote=FALSE)

[36]:
df <- read.csv("./geodata/shp/point_stat.csv")
[5]:
head(df)
A data.frame: 6 × 4
FIDmeanstdevmin
<int><dbl><dbl><dbl>
102.5507250.258913422.396264
212.5507250.258913422.396264
322.5507250.258913422.396264
432.3837110.123817612.050272
542.3831370.096621222.133288
652.3831370.096621222.133288
[33]:
# subsetting
dfSel <- df[which(df$mean>7),]
dfSel
A data.frame: 2 × 4
FIDmeanstdevmin
<int><dbl><dbl><dbl>
1061057.5365150.18950117.16515
1071067.5365150.18950117.16515
[34]:
write.csv(dfSel,"./geodata/shp/point_stat_sel.csv", row.names=FALSE)
[39]:
df <- read.csv("./geodata/shp/point_stat_sel.csv")
df
A data.frame: 2 × 4
FIDmeanstdevmin
<int><dbl><dbl><dbl>
1057.5365150.18950117.16515
1067.5365150.18950117.16515

Exercise

Print the data rows with the column min value less than 1. Do NOT use the “which” function in R.

Functions

Functions are themselves objects in R which can be stored in the project’s workspace. This provides a simple and convenient way to extend R.

Usage: in writing your own function you provide one or more arguments or names for the function, an expression and a value is produced equal to the output function result.

function(arglist) expr function definition
return(value)
[112]:
# Example

myfunction <- function(x) x^5
myfunction(3)
243
[115]:
# oddcount function

oddcount <- function(x){
    k <- 0
    for (i in x) {
        if (i %% 2 == 1) k <- k + 1
    }
    return (k)
}
[116]:
oddcount(c(1,3,5,7))
4
[117]:
oddcount(c(1,3,5,8))
3

break and next

break : break out of a loop
next : skip a step
[48]:
for (i in 1:10){
    if (i %in% c(1,3,5,7)) {
        next
    }
    else if (i > 8){
        break
    }
    print (i)
}
[1] 2
[1] 4
[1] 6
[1] 8

Exercise

write a function to test if a number is a prime number, a natural number that is NOT a multiplication of two smaller numbers.

Basic Graphs

The plot() function forms the foundation for much of R based graphing operations.

[11]:
plot(c(1,3,5,7),c(2,4,6,8))
../_images/R_R_Intro_121_0.png
[17]:
x <- seq(0,10,length.out=100)
y <- exp(-x)
[18]:
plot(x,y)
../_images/R_R_Intro_123_0.png
[20]:
plot(x,y,type="l",lwd=3)
../_images/R_R_Intro_124_0.png
[23]:
plot(x,y,type="l",lwd=3, xlab="time", cex.lab=1.5,cex.axis=1.5)
grid(lwd=3)
../_images/R_R_Intro_125_0.png

Assignment

  • individual effort or collaboration by maximum two people

  • if in collaboration, please specify the contribution of each contributor

  • please turn in a set of .pdf and .ipynb files

  • please label the question number before giving the answer

Q1. Read in file “./geodata/shp/point_stat.csv” and output the rows that their values in the “stdev” column are less than those in their “min” column and their “mean” values are all greater than 7. (20%)

Q2. Create a 3x3 matrix with element values in the range of 1 to 9 and then change the diagonal elements to 0. (20%)

1   4   7           0   4   7
2   5   8    ==>    2   0   8
3   6   9           3   6   0

Q3. Create a dataframe to store the calendar of March 2021 with the day of the week as the column name. (20%)

Mo Tu We Th Fr Sa Su
 1  2  3  4  5  6  7
 8  9 10 11 12 13 14
15 16 17 18 19 20 21
22 23 24 25 26 27 28
29 30 31

QB (bonus) (10%)

According to Gregorian calendar, leap years can be determined by the criteria below.

Years that are divisible by 4 and not divisible by 100 are leap years.
Years that are divisible by 100 but not divisible by 400 are NOT leap years.
Years that are divisible by 400 are leap years.

Please write a R script to print all the leap years between 1800 and 2020 into a file named “leap_years.dat” and calculate the ratio of the number of leap years over this period of time.