R Introduction

BIO401-01/598-02

Mar. 17 2021

R is a script language for statistical data analysis and manipulation.

How to run

batch mode

$ R CMD BATCH script.R

help for the batch mode

$ R CMD command --help

for example

$ R CMD INSTALL --help

interactive mode

console : “>” as the R prompt

common IDE (integrated development environment) for R

R studio
Jupyter-notebook
ESS(emacs speaks statistics)
R commander, etc.

Installing packages

> install.packages("package_name")

> install.packages(c("ggplot2","sp","raster"))

Interfacing with the system

> dir() #  show files in the current directory

> getwd() #  is asking for the current working directory

Getting help

R provides help with function and commands. On-line help gives useful information as well. Getting used to R help is a key to successful statistical modelling. The online help can be accessed in HTML format by typing:

> help.start()

A keyword search is possible using the Search Engine and Keywords link. You can also use the help() or ? functions.

> help(dir)

or

> ? dir

As a caculator

[1]:

3 + 5

8

[2]:

sin(pi/2)

1

[13]:

2^3

8

[31]:

5/3

1.66666666666667

[33]:

5%%3

2

[1]:

5%/%3

1

Logic Operations

AND operator &
OR operator |
NOT operator !

[1]:

12 > 5 & 12 < 15

TRUE

[2]:

! TRUE

FALSE

[3]:

TRUE | FALSE

TRUE

Data structures

R objects

The entities R operates on are technically known as objects. Examples are “vectors of numeric (real)” or “complex values”, “vectors of logical” values and “vectors of character strings”. These are known as “atomic” structures since their components are all of the same type, or mode, namely numeric, complex, logical, character and raw. R also operates on objects called “lists”, which are of mode list. These are ordered sequences of objects which individually can be of any mode. Lists are known as “recursive” rather than atomic structures since their components can themselves be lists in their own right. The other recursive structures are those of mode function and expression.

By the “mode” of an object we mean the basic type of its fundamental constituents. This is a special case of a “property” of an object. Another property of every object is its “length.” The functions mode(object) and length(object) can be used to find out the mode and length of any defined structure 10.

Further properties of an object are usually provided by attributes(object). Because of this, mode and length are also called “intrinsic attributes” of an object. For example, if z is a complex vector of length 100, then in an expression mode(z) is the character string “complex” and length(z) is 100.

vectors : the heart of R

Vectors are combinations of scalars in a string structure. Vectors must have all values of the same mode. Thus any given vector must be unambiguously either logical, numeric, complex, character or raw. (The only apparent exception to this rule is the special “value” listed as NA for quantities not available, but in fact there are several types of NA). Note that a vector can be empty and still have a mode. For example the empty character string vector is listed as character(0) and the empty numeric vector as numeric(0).

> help(vector)

[6]:

# "<-" is the  assignment operator in R.
# "c" stands for concatenate.

x <- c(1,5,8)

[7]:

# subsetting operation
# "[]" used to retrieve vector elements
# ":" used to give a range for retrieval
x[1]
x[2:3]

1

5
8

[8]:

length(x)

3

[9]:

mode(x)

'numeric'

[30]:

# scalar not really exist in R
# individual numbers actually single element vectors

x <- 3.5
x
x[1]

3.5

Two common functions we can use to generate a sequence.

The seq() function “seq(from = number, to = number, by = number)” allow to create a vector starting from a value to another by a defined increment:

The replicate function “rep(x,times)” enables you to replicate a vector several times in a more complex vector. Calculations can be included to form vectors as well and functions can be combined in the same command.

[16]:

1:10

1
2
3
4
5
6
7
8
9
10

[11]:

seq(1,3,0.25)

1
1.25
1.5
1.75
2
2.25
2.5
2.75
3

[1]:

one2three <- 1:3
rep(one2three,5)

1
2
3
1
2
3
1
2
3
1
2
3
1
2
3

Missing Values

In some cases the components of a vector or of an R object more in general, may not be completely known. When an element or value is “not available” or a “missing value” in the statistical sense, a place within a vector may be reserved for it by assigning it the special value NA. Any operation on an NA becomes an NA.

The function is.na(x) gives a logical vector of the same size as x with value TRUE if and only if the corresponding element in x is NA.

Functions : dealing with NAs

na.fail returns the object if it does not contain any missing values, and signals an error otherwise.

na.omit returns the object with incomplete cases removed.

na.pass returns the object unchanged.

[90]:

z <- c(1:3,NA)
ind <- is.na(z)
ind

FALSE
FALSE
FALSE
TRUE

There is a second kind of “missing” values which are produced by numerical computation, the so-called Not a Number, NaN , values.

Examples are 0/0 or Inf - Inf which both give NaN since the result cannot be defined sensibly.

[22]:

Inf-Inf
0/0

NaN

is.na(xx) is TRUE both for NA and NaN values. To differentiate these, is.nan(xx) is only TRUE for NaNs. Missing values are sometimes printed as when character vectors are printed without quotes.

[24]:

z <- c(1:3,NA,0/0)
is.not.available <- is.na(z)
is.not.a.number <-is.nan(z)

is.not.a.number
is.not.available

FALSE
FALSE
FALSE
FALSE
TRUE

FALSE
FALSE
FALSE
TRUE
TRUE

On the other hand, NULL represent the value does not exit.

[3]:

x <- c(1,2,3,NA)
mean(x)

<NA>

[4]:

mean(x,na.rm=TRUE)

2

[1]:

y <- c(1,2,3,NULL)
mean(y)

2

[99]:

na.fail(x)

Error in na.fail.default(x): missing values in object
Traceback:

1. na.fail(x)
2. na.fail.default(x)
3. stop("missing values in object")

[100]:

na.fail(y)

1
2
3

Exercise

creat a sequence as below and compute its mean

10 20 30 10 20 30 50 50 50 NA NA NA 10 10 10

Working with strings

character strings : single element vectors

mode : check datatype

[38]:

x <- "hello"
mode(x)
length(x)

'character'

1

[3]:

# concatenate strings

paste("hello","world")

'hello world'

[2]:

# if you don't want any space between the strings, you can use paste0

paste0 ("hello","world")

'helloworld'

[5]:

# substr is used to get parts of a string

substr("Hello world", 6,11)

' world'

[6]:

# grep : retrieval based on certain pattern

timespan <- c("day","month","year")
grep('y', timespan)

1
3

[7]:

# ^ : start with

grep('^m', timespan)

2

[41]:

numeric.vector <- c(rep(c (5*10:1, 5, 6), 2))
numeric.vector

50
45
40
35
30
25
20
15
10
5
5
6
50
45
40
35
30
25
20
15
10
5
5
6

[43]:

character.vector  <- as.character(numeric.vector)
character.vector

'50'
'45'
'40'
'35'
'30'
'25'
'20'
'15'
'10'
'5'
'5'
'6'
'50'
'45'
'40'
'35'
'30'
'25'
'20'
'15'
'10'
'5'
'5'
'6'

Exercise

retreive the position of the 2nd “to” in the quatation in the sentence below, replace it with “T” and reprint the whole sentence.

“All we have to decide is what to do with the time that is given us.”

― J.R.R. Tolkien, The Fellowship of the Ring

Factors

A factor is a variable to represent categories of a set. They can created using the function as.factor().

[9]:

countries <- c("Korea", "China", "UK", "UK", "USA", "Japan", "Korea")
as.factor(countries)

Korea
China
UK
UK
USA
Japan
Korea

Levels:

'China'
'Japan'
'Korea'
'UK'
'USA'

[10]:

# factor can work on numbers also.
fnum <- c(1:3,2:5,1:8)
as.factor(fnum)

1
2
3
2
3
4
5
1
2
3
4
5
6
7
8

Levels:

'1'
'2'
'3'
'4'
'5'
'6'
'7'
'8'

Set Operations

union(x,y)
intersect(x,y)
setdiff(x,y) : all elements of x NOT in y
setequal(x,y)
c %in% x : membership testing

[27]:

x <- c(1, 3, 5, 8, 10)
y <- c(3, 4, 5, 6, 7)

[29]:

union(x,y)
intersect(x,y)
setdiff(x,y)
setequal(x,y)

1
3
5
8
10
4
6
7

3
5

1
8
10

FALSE

[30]:

7 %in% x

FALSE

List : contains that can hold different types

Lists are a general form of vector in which the various elements need not be of the same type, and are often themselves vectors or lists. Lists provide a convenient way to return the results of a statistical computation.

[26]:

x <- list(w=2, v="GIS")
x

$w: 2
$v: 'GIS'

[27]:

# $ sign used to access list elements by names
x$w

2

[62]:

mode(x)
mode(c(x$w,x$v))

'list'

'character'

[65]:

x <- c(2,"GIS")
mode(x)
mode(x[1])

'character'

[ ]:

# show internal datasets

data()

[3]:

hn <- hist(Nile)

[67]:

print (hn)

$breaks
 [1]  400  500  600  700  800  900 1000 1100 1200 1300 1400

$counts
 [1]  1  0  5 20 25 19 12 11  6  1

$density
 [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001

$mids
 [1]  450  550  650  750  850  950 1050 1150 1250 1350

$xname
[1] "Nile"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

[68]:

# str : structure
str(hn)

List of 6
 $ breaks  : int [1:11] 400 500 600 700 800 900 1000 1100 1200 1300 ...
 $ counts  : int [1:10] 1 0 5 20 25 19 12 11 6 1
 $ density : num [1:10] 0.0001 0 0.0005 0.002 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001
 $ mids    : num [1:10] 450 550 650 750 850 950 1050 1150 1250 1350
 $ xname   : chr "Nile"
 $ equidist: logi TRUE
 - attr(*, "class")= chr "histogram"

[69]:

mode(hn)

'list'

[5]:

# summary is a generic function in R. It outputs a concise and more friendly representation of a list.
summary(hn)

         Length Class  Mode
breaks   11     -none- numeric
counts   10     -none- numeric
density  10     -none- numeric
mids     10     -none- numeric
xname     1     -none- character
equidist  1     -none- logical

Matrices

Matrices, or more generally arrays, are multi-dimensional generalizations of vectors. In fact, they are vectors that can be indexed by two or more indices and will be printed in a special way.

The matrix() function creates a matrix from the given set of values. We use the matrix(x, nrow=, ncol=) function to set the matrix cell values, the number of rows and the number of columns. We can use the colnames() and rownames() functions to set the column and row names of the matrix-like object.

[29]:

matrix(data = NA, nrow = 2, ncol = 3)

A matrix: 2 × 3 of type lgl
NA	NA	NA
NA	NA	NA

[11]:

example.matrix <- matrix(1:6,2,3)
example.matrix

A matrix: 2 × 3 of type int
1	3	5
2	4	6

[12]:

# data distributed by columns by default, can be changed using the keyword byrow
example.matrix <- matrix(1:6,2,3, byrow=TRUE)
example.matrix

A matrix: 2 × 3 of type int
1	2	3
4	5	6

[13]:

# retrieval by rows
example.matrix[1,]

1
2
3

[14]:

# retrival by columns
example.matrix[,2]

2
5

[42]:

# changing values

example.matrix[1,] <- 1:3
example.matrix[2,] <- c(5,10,4)
example.matrix

A matrix: 2 × 3 of type dbl
1	2	3
5	10	4

[45]:

# apply () function is a machanism to apply a function across a vector.

# by column
apply(example.matrix,2,sum)

6
12
7

[46]:

# by row
apply(example.matrix,1,sum)

6
19

[44]:

matrix.head <- c("col a","col b","column c")
matrix.side <- c("first raw","second raw")

[45]:

colnames(example.matrix) = matrix.head
rownames(example.matrix) = matrix.side
example.matrix

A matrix: 2 × 3 of type dbl
	col a	col b	column c
first raw	1	2	3
second raw	5	10	4

[52]:

# The structure function str(object.name) informs you of the structure of a
# specific object

str(example.matrix)

 num [1:2, 1:3] 1 5 2 10 3 4
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:2] "first raw" "second raw"
  ..$ : chr [1:3] "col a" "col b" "column c"

[17]:

# combine matrix using rbind (by row) and cbind (by column)

matrix1 <- matrix(1:6,2,3)
matrix2 <- matrix(11:16,2,3)
rbind(matrix1,matrix2)
cbind(matrix1,matrix2)

A matrix: 4 × 3 of type int
1	3	5
2	4	6
11	13	15
12	14	16

A matrix: 2 × 6 of type int
1	3	5	11	13	15
2	4	6	12	14	16

Exercise

create a matrix with the following elements and then change the first column values to all 0.

1 1
2 2
3 3

Arrays

An array can be considered a multiple subscripted collection of data entries, for example numeric. R allows simple facilities for creating and handling arrays, and in particular the special case of matrices.

As well as giving a vector structure a dim attribute, arrays can be constructed from vectors by the array function, which has the form array(data_vector, dim_vector)

[60]:

Z <- array(1:24, c(3,4,2))
Z

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

[56]:

str(Z)

 int [1:3, 1:4, 1:2] 1 2 3 4 5 6 7 8 9 10 ...

list and delete R objects

The list function ls() outputs a list of existing R objects.

rm() removes the object.

[57]:

ls()

'character.vector'
'example.matrix'
'hn'
'ind'
'is.not.a.number'
'is.not.available'
'matrix.head'
'matrix.side'
'numeric.vector'
'one2three'
'x'
'z'
'Z'

[61]:

rm(Z)
ls()

'character.vector'
'example.matrix'
'hn'
'ind'
'is.not.a.number'
'is.not.available'
'matrix.head'
'matrix.side'
'numeric.vector'
'one2three'
'x'
'z'

Data Frames

Data frames are matrix-like structures, in which the columns can be of different types. Think of data frames as data matrices with one row per observational unit but with (possibly) both numerical and categorical variables. Many experiments are best described by data frames: the treatments are categorical but the response is numeric.

As a result R dataframes are tightly coupled collections of variables which share many of the properties of matrices and of lists. Data frames are used as the fundamental data structure by most of R’s modeling software.

A data frame is a list with class “data.frame”. There are restrictions on lists that may be made into data frames, namely :

The components must be vectors (numeric, character, or logical), factors, numeric matrices, lists, or other data frames. Matrices, lists, and data frames provide as many variables to the new data frame as they have columns, elements, or variables, respectively. Numeric vectors, logicals and factors are included, and character vectors are coerced to be factors, whose levels are the unique values appearing in the vector. Vector structures appearing as variables of the data frame must all have the same length, and matrix structures must all have the same row size.

Dataframe construction

[62]:

my.data.frame = data.frame(v = 1:4, ch = c("a", "b", "c", "d"), n = 10)
my.data.frame

A data.frame: 4 × 3
v	ch	n
<int>	<chr>	<dbl>
1	a	10
2	b	10
3	c	10
4	d	10

[18]:

# Or:

my.data.frame = data.frame(vector = 1:4,
                           character = c("a", "b", "c", "d"),
                           const.vector = 10,
                           row.names =c("data1", "data2", "data3", "data4"))
my.data.frame

A data.frame: 4 × 3
	vector	character	const.vector
	<int>	<chr>	<dbl>
data1	1	a	10
data2	2	b	10
data3	3	c	10
data4	4	d	10

Data selection and manipulation

[64]:

# You can extract data from dataframes using the    [ [    ] ]  and  $  sign:

my.data.frame[["character"]]

my.data.frame[[2]]

'a'
'b'
'c'
'd'

'a'
'b'
'c'
'd'

[66]:

# Call the 3rd value of the character vector:

my.data.frame[[2]][3]

'c'

[67]:

# Or using the $ syntax:

my.data.frame$vector

my.data.frame$character[2:3]

1
2
3
4

'b'
'c'

[68]:

# You can add single arguments to a data frame, query information, select and
# manipulate arguments or single values from a dataframe

my.data.frame$new

my.data.frame$new = c(10,11,20,40)
my.data.frame

NULL

A data.frame: 4 × 4
	vector	character	const.vector	new
	<int>	<chr>	<dbl>	<dbl>
data1	1	a	10	10
data2	2	b	10	11
data3	3	c	10	20
data4	4	d	10	40

[69]:

# length(object.name) returns the number of elements in an object such as
# matrix vector or dataframes:

length(my.data.frame$new)

4

[70]:

# which(object.name) and which.max(object.name) return the index of a specific
# or of the greatest element of an object

which.max(my.data.frame$new)

which(my.data.frame$new == 20)

4

3

[71]:

# max(object.name) returns the value of the greatest element

max(my.data.frame$new)

40

[72]:

# sort(object.name) sort from small to big

sort(my.data.frame$new)

10
11
20
40

[73]:

# rev(object.name) sorts from big to small

rev(sort(my.data.frame$new))

40
20
11
10

[74]:

# subset(object.name, ...) returns a selection of an R-object with respect to
# criteria (typically comparisons: x$V1 < 10).

subset(my.data.frame, my.data.frame$new == 20)

A data.frame: 1 × 4
	vector	character	const.vector	new
	<int>	<chr>	<dbl>	<dbl>
data3	3	c	10	20

[79]:

# If the R-object is a data frame,
# the option select gives the variables to be kept or dropped using a minus
# sign

my.data.frame[-1]

A data.frame: 4 × 3
	character	const.vector	new
	<chr>	<dbl>	<dbl>
data1	a	10	10
data2	b	10	11
data3	c	10	20
data4	d	10	40

Reading and Writing Files

read.table(“filename”)** Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields within the file. The default separator sep=”” is any whitespace. You might need sep=“,” or “;” and so on. Use header=TRUE to read the first line as a header of column names. The as.is=TRUE specification is used to prevent character vectors from being converted to factors.

The comment.char=”” specification is used to prevent “#” from being interpreted as a comment and use “skip=n” to skip n lines before reading data. For more details: ?read.table

read.csv(“filename”) is set to read comma separated files. Example usage is: read.csv(file.name, header = TRUE, sep = “,”, quote=“””, dec=“.”, fill =TRUE, comment.char=””, …)

read.delim(“filename”) is used for reading tab-delimited files

read.fwf() reads a table of fixed width formatted data into a ’data.frame’. Widths is an integer vector, giving the widths of the fixed-width fields

write.table(Robj, “filename”, sep=“,”, row.names=FALSE, quote=FALSE)

[36]:

df <- read.csv("./geodata/shp/point_stat.csv")

[5]:

head(df)

A data.frame: 6 × 4
	FID	mean	stdev	min
	<int>	<dbl>	<dbl>	<dbl>
1	0	2.550725	0.25891342	2.396264
2	1	2.550725	0.25891342	2.396264
3	2	2.550725	0.25891342	2.396264
4	3	2.383711	0.12381761	2.050272
5	4	2.383137	0.09662122	2.133288
6	5	2.383137	0.09662122	2.133288

[33]:

# subsetting
dfSel <- df[which(df$mean>7),]
dfSel

A data.frame: 2 × 4
	FID	mean	stdev	min
	<int>	<dbl>	<dbl>	<dbl>
106	105	7.536515	0.1895011	7.16515
107	106	7.536515	0.1895011	7.16515

[34]:

write.csv(dfSel,"./geodata/shp/point_stat_sel.csv", row.names=FALSE)

[39]:

df <- read.csv("./geodata/shp/point_stat_sel.csv")
df

A data.frame: 2 × 4
FID	mean	stdev	min
<int>	<dbl>	<dbl>	<dbl>
105	7.536515	0.1895011	7.16515
106	7.536515	0.1895011	7.16515

Exercise

Print the data rows with the column min value less than 1. Do NOT use the “which” function in R.

Functions

Functions are themselves objects in R which can be stored in the project’s workspace. This provides a simple and convenient way to extend R.

Usage: in writing your own function you provide one or more arguments or names for the function, an expression and a value is produced equal to the output function result.

function(arglist) expr function definition

return(value)

[112]:

# Example

myfunction <- function(x) x^5
myfunction(3)

243

[115]:

# oddcount function

oddcount <- function(x){
    k <- 0
    for (i in x) {
        if (i %% 2 == 1) k <- k + 1
    }
    return (k)
}

[116]:

oddcount(c(1,3,5,7))

4

[117]:

oddcount(c(1,3,5,8))

3

break and next

break : break out of a loop
next : skip a step

[48]:

for (i in 1:10){
    if (i %in% c(1,3,5,7)) {
        next
    }
    else if (i > 8){
        break
    }
    print (i)
}

[1] 2
[1] 4
[1] 6
[1] 8

Exercise

write a function to test if a number is a prime number, a natural number that is NOT a multiplication of two smaller numbers.

Basic Graphs

The plot() function forms the foundation for much of R based graphing operations.

[11]:

plot(c(1,3,5,7),c(2,4,6,8))

[17]:

x <- seq(0,10,length.out=100)
y <- exp(-x)

[18]:

plot(x,y)

[20]:

plot(x,y,type="l",lwd=3)

[23]:

plot(x,y,type="l",lwd=3, xlab="time", cex.lab=1.5,cex.axis=1.5)
grid(lwd=3)

Assignment

individual effort or collaboration by maximum two people
if in collaboration, please specify the contribution of each contributor
please turn in a set of .pdf and .ipynb files
please label the question number before giving the answer

Q1. Read in file “./geodata/shp/point_stat.csv” and output the rows that their values in the “stdev” column are less than those in their “min” column and their “mean” values are all greater than 7. (20%)

Q2. Create a 3x3 matrix with element values in the range of 1 to 9 and then change the diagonal elements to 0. (20%)

 4   7           0   4   7
 5   8    ==>    2   0   8
 6   9           3   6   0

Q3. Create a dataframe to store the calendar of March 2021 with the day of the week as the column name. (20%)

Mo Tu We Th Fr Sa Su
2  3  4  5  6  7
9 10 11 12 13 14
16 17 18 19 20 21
23 24 25 26 27 28
30 31

QB (bonus) (10%)

According to Gregorian calendar, leap years can be determined by the criteria below.

Years that are divisible by 4 and not divisible by 100 are leap years.
Years that are divisible by 100 but not divisible by 400 are NOT leap years.
Years that are divisible by 400 are leap years.

Please write a R script to print all the leap years between 1800 and 2020 into a file named “leap_years.dat” and calculate the ratio of the number of leap years over this period of time.