Introduction to R

The aim of this document is to help you start using the R environment for statistical analysis and graphics.
You can read and follow the text while copying the commands shown in the framed parts of this document and pasting them into an interactive R session.
Once you are familiar with the general functioning of R and of R's objects you can further advance in learning R with online manuals and guides. There is a great variety of documentation available at:


As an effective way of learning, we recommend that you experiment with the commands by, for example, trying options different from those shown. This experimentation is an important part of learning R with this condensed document.

Starting R, getting help, stopping R

Start R

From a shell window, type:

R

In the bash terminal the following text will appear:

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

The > sign followed by a blinking cursor indicates that you are in the R environment. If you like, you can start R in administrative mode by typing sudo R, which lets you install packages system-wide:

stefano@stefano-linux:~$ sudo R



basic_r.r
# Getting help 
 
# R provides help for functions and commands. The online help gives useful 
# information as well. Getting used to the R help is key to successful 
# statistical modelling. The online help can be accessed in HTML format by 
# typing:
 
help.start()
 
 
# A keyword search is possible using the Search Engine and Keywords link.
# You can also use the help() or ? functions. For example, if we want to 
# know how to use the matrix() function, the following two commands are 
# equivalent:
 
help(matrix)
? matrix
 
# The str(object.name) command displays the internal structure of an R 
# object. The summary(object.name) command gives a summary of an object, 
# usually a statistical summary; it is generic, meaning that it behaves 
# differently for different classes of object.
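 
# As a minimal illustration of the two commands above, the following sketch
# applies them to a small made-up vector (demo.values is a hypothetical name,
# not part of the course data):
 
demo.values <- c(2, 4, 4, 10)
str(demo.values)      # compact description: type, length and first values
summary(demo.values)  # minimum, quartiles, mean and maximum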
 
dir()    # show the files in the current directory
ls.str() # apply str() to each variable in the current environment
getwd()  # return the current working directory
 
# When you quit, R will ask you if you want to save the workspace 
# (that is, all of the variables you have defined in this session). 
# Say “no” in order to avoid clutter.
 
# Should an R command seem to be stuck or take longer than you’re willing
# to wait, type Control-C.
 
# Calling Linux shell commands 
# system("...") is used to call any Linux shell command from within the R
# environment.
 
system("pwd")
 
# is equivalent to:
 
getwd()
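 
# The output of a shell command can also be captured in an R variable.
# A minimal sketch, assuming a Linux shell where pwd is available
# (shell.dir is a hypothetical name): with intern = TRUE, system() returns
# the command output as a character vector instead of only printing it.
 
shell.dir <- system("pwd", intern = TRUE)
shell.dir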
 
# Inputs and outputs
# Once you have opened an R session and, if needed, loaded the libraries you
# require, you can start exploring your data.
 
# Loading data
 
# The load(file.name) function loads R objects written with the save() 
# function: load(file.name) 
 
# Saving data 
 
# The save(object.name.1, object.name.2, ..., file = file.name) function 
# saves the specified objects in the XDR platform-independent binary format
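 
# A minimal sketch of a save/load round trip. It writes to a temporary file
# so that nothing in your working directory is touched; demo.object and
# demo.file are hypothetical names used only for this illustration.
 
demo.object <- 1:5
demo.file <- tempfile(fileext = ".RData")
save(demo.object, file = demo.file)  # write the object to disk
rm(demo.object)                      # remove it from the workspace
load(demo.file)                      # restore it under its original name
demo.object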
 
# Reading tables
 
# read.table("filename") reads a file in table format and creates a data 
# frame from it, with cases corresponding to lines and variables to fields 
# within the file. The default separator sep="" is any whitespace. 
# You might need sep="," or sep=";" and so on. 
# Use header=TRUE to read the first line as a header of column names. 
# The as.is=TRUE specification prevents character vectors from being
# converted to factors (this conversion was the default before R 4.0.0). 
 
# The comment.char="" specification prevents "#" from being interpreted 
# as a comment, and skip=n skips n lines before reading the data. 
# For more details:
 
?read.table
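 
# A minimal sketch of the options described above, reading from an inline
# string rather than from a file. The text argument is available in current
# versions of R; the table content and the name demo.table are made up for
# this illustration. Fields are separated by ";", decimals are written with
# "," and ":" marks a missing value.
 
demo.table <- read.table(text = "name;value\na;1,5\nb;:",
                         header = TRUE, sep = ";", dec = ",",
                         na.strings = ":", as.is = TRUE)
str(demo.table)   # name stays character, value is numeric with one NA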
 
landuse04=read.csv("~/ost4sem/exercise/basic_adv_r/inputs/2004_landuse.csv", 
                   header=TRUE, sep=",", dec=".", na.strings=":")
 
# read.csv("filename")  is set to read comma separated files. Example usage is:
# read.csv(file.name, header = TRUE, sep = ",", quote="\"", dec=".", 
# fill =TRUE,  comment.char="", ...)
 
# read.delim("filename") is used for reading tab-delimited files
 
# read.fwf() reads a table of fixed-width formatted data into a data frame.
# The widths argument is an integer vector giving the widths of the fields.
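 
# A minimal sketch of read.fwf(), assuming a made-up file with two fields per
# line: two characters followed by three digits. The file is written to a
# temporary location; demo.fwf.file and demo.fwf are hypothetical names.
 
demo.fwf.file <- tempfile()
writeLines(c("AB123", "CD456"), demo.fwf.file)
demo.fwf <- read.fwf(demo.fwf.file, widths = c(2, 3))
demo.fwf   # V1 holds AB and CD, V2 holds 123 and 456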
 
# Show variables and data in your workspace 
 
# The ls() function returns the names of the existing R objects
 
ls()
 
# The structure function str(object.name) informs you of the structure of a
# specific object; the summary function summary(object.name) gives you the 
# basic statistics of a specific object.
 
# Save and remove data or R objects 
# save(..., file) saves the specified objects (...) in the XDR 
# platform-independent binary format
 
save(landuse04, file="~/ost4sem/exercise/basic_adv_r/outputs/landuse2004")
 
# save.image(file) saves all objects in the workspace
 
save.image(file="~/ost4sem/exercise/basic_adv_r/outputs/landuse2004_and_more")
 
# rm(...) removes the objects you created or the data you loaded
 
rm(landuse04)
 
# If landuse04 was the only object in the workspace, no objects are present 
# in memory now. Use the ls() function to check:
 
ls()
# character(0)
 
# But since you saved the landuse2004 data you can reload it using the load() 
# function and check its structure using the str() function
 
load("~/ost4sem/exercise/basic_adv_r/outputs/landuse2004")
str(landuse04)
 
#  Variables and calculations
 
# R can be used as an interactive calculator: the command is executed and 
# the result is displayed. R uses +, -, *, / and ^ for addition, subtraction,
# multiplication, division and exponentiation, respectively
 
2+2 
 
# The [1] at the beginning of the line is just R printing an index of element 
# numbers. If you print a result that appears on multiple lines, R will put 
# an index at the beginning of each line.
 
2*5
 
10/2
 
2^3
 
 
# Variable settings 
 
# You can simply create a variable by typing: variable name = function, 
# constant or calculation.
 
 
x =3*2
 
# The result of 3*2 is not displayed: the value of x is stored in memory
# without being printed. To display the value of x you can use: 
 
print(x) 
 
# Or 
 
x
 
# Most users write the same assignment with the '<-' operator instead 
# of the = character.
 
x <- 3
x
 
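# Also remember that R is case sensitive: print(X) or X is different from x. 
# For instance:
 
a <- 3
a
A   # Error: object 'A' not found, because only lowercase a was defined
 
# Variable names in R must begin with a letter (or a dot not followed by a 
# digit), followed by letters, digits, "." or "_". The following assignment 
# is therefore a syntax error (remove the leading "#" to try it):
 
# 3e = 2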
 
# In long names you can use "." or "_", as in very.long.variable.name.X or 
# very_long_variable_name_Y, but you can't use blank spaces in variable names.
# Avoid single-letter names such as c, l, q, t, C, D, F, I, and T, which are 
# either built-in R functions or hard to tell apart.
 
very.long.variable.name.X3 = 3
very.long.variable.name.X3
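 
# If you are unsure whether a name is valid, the base function make.names()
# shows how R would repair it. A small sketch with two made-up names:
 
make.names("3e")      # "X3e": a name cannot start with a digit
make.names("my var")  # "my.var": the blank is replaced by a dot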
 
 
# Interactive calculations
 
# Once defined,  you can use variables in interactive calculations :
 
b = 2*2
a = 2*3
a*b
 
# And you can use variables in formulas :
 
c = 60 /(a+b)
c
 
# typing a;b you can display a and b variables at the same time:
 
a;b
 
# If you forget to close a parenthesis, R will display a + sign as a 
# continuation prompt. 
 
# c = 60 /(a+b
 
# In this case you can either close the parenthesis on the next line or press 
# Ctrl+C to go back to a new prompt. 
 
# Order of operations
 
# When using more complex formulas be aware of the importance of the order of 
# operators. Parentheses have the highest priority, followed by exponentiation
# (powers), then multiplication and division, and finally addition and 
# subtraction. 
 
# The following command:
 
C = ((a + 2 * sqrt(b))/(a + 8 * sqrt(b)))/2
C
 
# is different from:
 
C = a + 2 * sqrt(b) / a + 8 * sqrt(b) / 2
C
 
# as well as 
 
100-40/2^4
 
# is different from:
 
(100-40)/2^4 
 
# and 
 
-2^4
 
# is different from: 
 
(-2)^4
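 
# For reference, the values of the last expressions above can be checked by
# hand: 2^4 is 16, so 100-40/2^4 is 100-2.5 and (100-40)/2^4 is 60/16.
 
100 - 40/2^4 == 97.5    # TRUE
(100 - 40)/2^4 == 3.75  # TRUE
-2^4 == -16             # TRUE: the power is computed before the minus sign
(-2)^4 == 16            # TRUE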
 
 
# Logical values
 
# R can perform conditional tests and generate TRUE or FALSE values as results.
# The comparison operators are <, <=, >, >=, == for exact equality and
# != for inequality. 
 
x = 60
x > 100
 
x == 70
 
x >   3
 
x = 100
 
# Logical values can be stored as variables: 
 
x = 60
logical.value =  x >  3
logical.value
 
# In addition, if c1 and c2 are logical expressions, then c1 & c2 is their 
# intersection ("and"), c1 | c2 is their union ("or"), and !c1 is the 
# negation of c1. 
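 
# A minimal sketch of the three operators, with c1 and c2 built from two
# simple comparisons chosen only for illustration:
 
c1 <- 60 > 3     # TRUE
c2 <- 60 == 70   # FALSE
c1 & c2          # FALSE: both must be TRUE
c1 | c2          # TRUE: at least one is TRUE
!c1              # FALSE: negation of TRUE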
 
 
 
# R objects
 
# The entities R operates on are technically known as objects. 
# Examples are vectors of numeric (real) or complex values, vectors of 
# logical values and vectors of character strings. 
# These are known as "atomic" structures since their components are all of 
# the same type, or mode, namely numeric, complex, logical, character and raw.
# R also operates on objects called "lists", which are of mode list. 
# These are ordered sequences of objects which individually can be of any mode. 
# Lists are known as  “recursive”  rather than atomic structures since their 
# components can themselves be lists in their own right.
 
# The other recursive structures are those of mode function and  expression. 
# Functions are the objects that form part of the R system along with similar 
# user written functions, which we discuss in some detail later. Expressions 
# as objects form an advanced part of R which will not be discussed in this 
# guide, except indirectly when we discuss formulae used with modeling in R.
 
# By the "mode" of an object we mean the basic type of its fundamental 
# constituents. This is a special case of a "property" of an object. Another
# property of every object is its "length". The functions mode(object) and 
# length(object) can be used to find out the mode and length of any defined 
# structure.
 
# Further properties of an object are usually provided by attributes(object), 
# (see 'Getting and setting attributes'). Because of this, mode and length are 
# also called “intrinsic attributes” of an object. For example, if z is a 
# complex vector of length 100, then in an expression mode(z) is the character
# string "complex" and length(z) is 100.
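 
# The example in the text can be reproduced as follows. This is a small
# sketch: z.complex is a hypothetical name, and the other calls are added
# only to show the mode of a few more kinds of object.
 
z.complex <- complex(real = 1:100, imaginary = 0)
mode(z.complex)        # "complex"
length(z.complex)      # 100
mode(c(TRUE, FALSE))   # "logical"
mode("text")           # "character"
mode(list(1, "a"))     # "list"
attributes(z.complex)  # NULL: a plain vector has no further attributes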
 
 
# Vectors
 
# Vectors are ordered sequences of scalar values. Vectors must have 
# all values of the same mode. Thus any given vector must be unambiguously 
# either logical, numeric, complex, character or raw. (The only apparent 
# exception to this rule is the special “value” listed as NA for quantities not
# available, but in fact there are several types of NA). Note that a vector can
# be empty and still have a mode. For example the empty character string vector
# is listed as character(0) and the empty numeric vector as numeric(0). 
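 
# A short sketch of what "same mode" means in practice: when values of
# different modes are combined, R converts them all to a common mode. The
# name mixed is hypothetical.
 
mixed <- c(1, "a", TRUE)
mixed                  # "1" "a" "TRUE"
mode(mixed)            # "character"
length(character(0))   # 0: an empty vector
mode(numeric(0))       # "numeric": it still has a mode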
 
 
# c(...) is the generic function that combines its arguments, by default 
# forming a vector; with recursive=TRUE it descends through lists, combining 
# all elements into one vector. For details on the generic function c(...) 
# and how it combines arguments into a vector: 
 
? c 
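 
# A minimal sketch of the recursive argument of c(), using a made-up nested
# list (nested is a hypothetical name):
 
nested <- list(1, list(2, 3))
c(nested, 4)                    # still a list, now with three components
c(nested, 4, recursive = TRUE)  # the numeric vector 1 2 3 4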
 
# As an example we can create a simple vector of seven values typing: 
 
c(2, 3, 4, 5, 10, 5, 8)
 
# We can generate a sequence using the syntax:
 
1:10
 
# We can generate the same sequence of  1:10  command using the seq() function. 
# The syntax will be :
 
seq(1,10)
 
# The seq() function "seq(from = number, to = number, by = number)" allows you 
# to create a vector running from one value to another by a defined increment:
 
seq(1,10, 0.25)
 
seq(from = 1, to = 10, by =  0.25)
 
# The rep(x, times) function enables you to repeat a vector several times 
# to build a longer vector. Calculations can be included to form vectors 
# as well, and functions can be combined in the same command:
 
one2three = 1:3 
rep(one2three,10) 
 
c(10*0:10)
 
rep(c (5*40:1, 5*1:40, 5, 6,7,8, 3, 2001:2014), 2)
 
rep(seq(1,3,0.5),3)
 
# Missing Values
 
# In some cases the components of a vector, or of an R object more generally, 
# may not be completely known. When an element or value is "not available" 
# or a "missing value" in the statistical sense, a place within a vector may
# be reserved for it by assigning it the special value NA. In general, any 
# operation on an NA gives an NA. 
 
# The function is.na(x)  gives a logical vector of the same size as x with 
# value TRUE if and only if the corresponding element in x is NA.
 
z <- c(1:3,NA)
ind <- is.na(z)
ind
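 
# A short sketch of how NA propagates through calculations and how it can be
# skipped. z.demo is a hypothetical name; na.rm is an argument of mean().
 
z.demo <- c(1, 2, NA)
z.demo + 1                  # 2 3 NA
mean(z.demo)                # NA: the missing value propagates
mean(z.demo, na.rm = TRUE)  # 1.5: missing values are removed first
sum(is.na(z.demo))          # 1: number of missing values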
 
# There is a second kind of “missing” values which are produced by numerical 
# computation, the so-called Not a Number,  NaN , values. Examples are 0/0 
# or Inf - Inf which both give NaN since the result cannot be defined sensibly.
 
Inf-Inf
0/0
 
# In summary, is.na(xx) is TRUE both for NA and NaN values. To differentiate 
# these, is.nan(xx) is only TRUE for NaNs. Missing values are sometimes printed
# as <NA> when character vectors are printed without quotes. 
 
z <- c(1:3,NA)
is.not.available <- is.na(z) 
is.not.a.number <-is.nan(z)
 
is.not.a.number
is.not.available 
 
 
# Matrices
 
# Matrices, or more generally arrays, are multi-dimensional generalizations of 
# vectors. In fact, they are vectors that can be indexed by two or more indices
# and will be printed in a special way. See Arrays and matrices.
# Factors provide compact ways to handle categorical data. See Factors.
# Lists are a general form of vector in which the various elements need not be 
# of the same type, and are often themselves vectors or lists. Lists provide a 
# convenient way to return the results of a statistical computation. See Lists.
 
# The matrix() function creates a matrix from the given set of values. We use 
# the matrix(x, nrow=, ncol=) function to set the matrix cell values, the 
# number of rows and the number of columns. We can use the colnames() and 
# rownames() functions to set the column and row names of the matrix-like 
# object.
 
matrix(data = NA, nrow = 2, ncol = 3) 
example.matrix = matrix(0,2,3)
example.matrix
 
example.matrix[1,]
 
example.matrix[,2]
 
example.matrix[1,] = 1:3
example.matrix[2,] = c(5,10,4)
example.matrix
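 
# When matrix() receives a vector of values, it fills the matrix column by
# column unless byrow = TRUE is given. A minimal sketch with made-up names
# (by.column and by.row):
 
by.column <- matrix(1:6, nrow = 2, ncol = 3)
by.row    <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by.column[1, ]   # 1 3 5
by.row[1, ]      # 1 2 3
dim(by.row)      # 2 3: number of rows and columns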
 
matrix.head = c("col a","col b","column c")
matrix.side = c("first row","second row")
str(matrix.side)
 
# Values written between double quotes " " are of character type ("chr")
 
numeric.vector = c(rep(c (5*10:1, 5, 6), 2))
character.vector  = as.character(numeric.vector)
str(character.vector)
 
colnames(example.matrix) = matrix.head
rownames(example.matrix) = matrix.side
example.matrix
 
str(example.matrix)
 
# Array
 
# An array can be considered a multiple subscripted collection of data 
# entries, for example numeric. R allows simple facilities for creating 
# and handling arrays, and in particular the special case of matrices. 
 
# As well as giving a vector structure a dim attribute, arrays can be 
# constructed from vectors by the array function, which has the form 
# array(data_vector, dim_vector)
 
Z <- array(1:24, c(3,4,2))
Z
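 
# The other route mentioned above, giving a vector a dim attribute, can be
# sketched as follows (v.demo is a hypothetical name). The values are laid
# out in the same order as in array(1:24, c(3,4,2)).
 
v.demo <- 1:24
dim(v.demo) <- c(3, 4, 2)   # the vector becomes a 3 x 4 x 2 array
v.demo[2, 3, 1]             # 8: row 2, column 3 of the first 3 x 4 slice
dim(v.demo)                 # 3 4 2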
 
# Data Frames
 
# Data frames are matrix-like structures, in which the columns can be 
# of different types. Think of data frames as  data matrices  with one row per 
# observational unit but with (possibly) both numerical and categorical 
# variables. Many experiments are best described by data frames: the treatments
# are categorical but the response is numeric. 
 
# As a result R dataframes are tightly coupled collections of variables which 
# share many of the properties of matrices and of lists. Data frames are used 
# as the fundamental data structure by most of R's modeling software.
 
# A data frame is a list with class "data.frame". There are restrictions on 
# lists that may be made into data frames, namely:
 
# - The components must be vectors (numeric, character, or logical), factors, 
#   numeric matrices, lists, or other data frames.
# - Matrices, lists, and data frames provide as many variables to the new data 
#   frame as they have columns, elements, or variables, respectively.
# - Numeric vectors, logicals and factors are included as is. Before R 4.0.0, 
#   character vectors were coerced to factors by default, with levels equal to
#   the unique values appearing in the vector; since R 4.0.0 they stay 
#   character unless stringsAsFactors=TRUE is given.
# - Vector structures appearing as variables of the data frame must all have 
#   the same length, and matrix structures must all have the same number of 
#   rows. 
# See:
 
? data.frame
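 
# A minimal sketch of the points above, with made-up data (demo.df is a
# hypothetical name). stringsAsFactors = TRUE is given explicitly so that the
# character column becomes a factor whatever the R version.
 
demo.df <- data.frame(id = 1:3, group = c("a", "b", "a"),
                      stringsAsFactors = TRUE)
str(demo.df)
levels(demo.df$group)   # "a" "b": the unique values of the column
length(demo.df)         # 2: a data frame is a list of columns
nrow(demo.df)           # 3: all columns have the same length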
 
# To construct a dataframe:
 
my.data.frame = data.frame(v = 1:4, ch = c("a", "b", "c", "d"), n = 10)
my.data.frame
 
# Or:
 
my.data.frame = data.frame(vector = 1:4,
                           character = c("a", "b", "c", "d"),
                           const.vector = 10,
                           row.names =c("data1", "data2", "data3", "data4"))
my.data.frame
 
 
# Data selection and manipulation
 
# You can extract data from data frames using the [[ ]] and $ operators:
 
my.data.frame[["character"]]
 
my.data.frame[[2]]
 
# Extract the 3rd value of the character column:
 
my.data.frame[[2]][3]
 
# Or using the $ syntax:
 
my.data.frame$vector
 
my.data.frame$character[2:3]
 
# You can add single columns to a data frame, query information, and select 
# or manipulate columns or single values of a data frame. The column new 
# does not exist yet, so the next command returns NULL:
 
my.data.frame$new
 
my.data.frame$new = c(10,11,20,40)
my.data.frame
 
# length(object.name) returns the number of elements in an object such as a 
# vector or a matrix; for a data frame it returns the number of columns:
 
length(my.data.frame$new) 
 
# which(condition) returns the indices of the elements for which a condition 
# is TRUE; which.max(object.name) returns the index of the greatest element
 
which.max(my.data.frame$new) 
 
which(my.data.frame$new == 20) 
 
# max(object.name) returns the value of the greatest element
 
max(my.data.frame$new) 
 
# sort(object.name) sorts in increasing order 
 
sort(my.data.frame$new) 
 
# rev(object.name) reverses the order, so rev(sort()) sorts in decreasing order
 
rev(sort(my.data.frame$new)) 
 
# subset(object.name, ...) returns a selection of an R-object with respect to 
# criteria (typically comparisons: x$V1 < 10). If the R-object is a data frame,
# the option select gives the variables to be kept or dropped using a minus 
# sign
 
subset(my.data.frame, my.data.frame$new == 20)
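 
# The select option described above can be sketched as follows. The data
# frame demo.sub and its columns are made up for this illustration; inside
# subset() the columns can be referred to by name.
 
demo.sub <- data.frame(id = 1:4, new = c(10, 11, 20, 40), const = 10)
subset(demo.sub, new > 10, select = c(id, new))  # keep two named columns
subset(demo.sub, new > 10, select = -const)      # drop one column instead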
 
# sample() allows random sampling from a set of values, so each call below 
# may return a different result.
 
sample(my.data.frame$new, 3)
sample(my.data.frame$new, 3)
sample(my.data.frame$new, 3)
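 
# If you need the same random draw every time, you can fix the random seed
# with set.seed() before sampling. A minimal sketch: the seed 42 is arbitrary
# and first.draw and second.draw are hypothetical names.
 
set.seed(42)
first.draw <- sample(c(10, 11, 20, 40), 3)
set.seed(42)
second.draw <- sample(c(10, 11, 20, 40), 3)
identical(first.draw, second.draw)  # TRUE: same seed, same sample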
 
# More examples
 
# The following R commands give an example of the simple procedure of importing 
# data, cleaning a table by extracting relevant information, and checking for 
# missing data.
 
landuse04=read.csv("~/ost4sem/studycase/Lab_scripts/inputs/2004_landuse.csv", 
                   header=TRUE, sep=",", dec=".", na.strings=":")
 
forests04 = subset(landuse04, landuse04$forest.Wooded.area >= 0 )
forests04$landuse = NULL
forests04.check=na.fail(forests04)
forests04$total.Total.area[1] = NA
forests04.check=na.fail(forests04)
 
# The last line above will throw an error, because na.fail() stops when the 
# data contain an NA. We can resolve the situation by starting again from the 
# beginning, with no NA
 
forests04 = subset(landuse04, landuse04$forest.Wooded.area >=0 )
forests04$landuse = NULL
forests04.check=na.fail(forests04)
str(forests04)
 
# Do you see something strange? Look at the number of factor levels of 
# forests04$geographic.Unit and at the number of variables in the data frame!
 
# Let's fix it now!
 
library(gdata)
forests04 = drop.levels(forests04)
str(forests04)
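 
# Recent versions of base R also provide droplevels(), which can be used
# without loading an extra package. A minimal sketch on a made-up factor
# (demo.factor and kept are hypothetical names):
 
demo.factor <- factor(c("a", "b", "c"))
kept <- demo.factor[demo.factor != "c"]
levels(kept)               # "a" "b" "c": the unused level "c" is still listed
levels(droplevels(kept))   # "a" "b"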
 
 
# Functions
 
# Functions are themselves objects in R which can be stored in the project's 
# workspace. This provides a simple and convenient way to extend R.  
# Usage: when writing your own function you provide an argument list 
# (arglist) and an expression (the body of the function); the function 
# returns the value of that expression, or the value passed to return().
 
# function(arglist) expr   function definition  
# return(value)            value returned by the function
 
# Example
 
myfunction <- function(x) x^5
myfunction(3)
 
body(myfunction) <- quote(5^x)
 
## or equivalently  body(myfunction) <- expression(5^x)
 
myfunction(3) 
 
body(myfunction)
 
myfunction
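 
# A further sketch, assuming a made-up function with two arguments, one of
# which has a default value, and an explicit return(). The names power.sum
# and exponent are hypothetical.
 
power.sum <- function(x, exponent = 2) {
  powered <- x^exponent   # raise every element of x to the given power
  return(sum(powered))    # return the sum of the powered values
}
power.sum(1:3)                # 14: 1 + 4 + 9
power.sum(1:3, exponent = 3)  # 36: 1 + 8 + 27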
wiki/basicr.txt · Last modified: 2018/05/10 15:25 (external edit)