This is a quick introduction to R. It’s not meant to be a comprehensive tutorial, but rather a small, sturdy foothold to get you started. We will cover concepts such as vectors, functions, and packages.
R is a programming language, but above all, it’s a statistical programming language that’s interactive. The basic usage of R is to use it like a calculator.
When you open R, the first thing you’ll see is the version, license and then the R prompt (>):
R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
>
The R prompt tells us that R is ready for us to type commands. R is designed to be used interactively: you type a command and see the result right away. As an example, we’ll do some simple calculations showing the relationship between 47 and 2. In your R console, type 47 + 2 and then hit Enter. After that, type 47 * 2 and hit Enter. You should see something like this:
> 47 + 2
[1] 49
> 47 * 2
[1] 94
Unlike a calculator, you can save the results of your calculations into variables by using the assignment operator (<-). This looks like an arrow pointing the results of the calculation on the right to the variable on the left.
> # 49 - 2 and 94/2 are both 47. It's magic!
> its <- 49 - 2
> magic <- 94 / 2
The ‘#’ symbol starts a “comment”. R ignores everything after the ‘#’ on that line, allowing you to write notes to yourself in your R script.
Now that we have saved the results into variables, how do we look at them? All we have to do is type the name of the variable:
> its
[1] 47
> magic
[1] 47
Variables can be used in calculations.
> oh <- its + 2
> oh
[1] 49
> wow <- magic * 2
> wow
[1] 94
You can even overwrite variables.
> oh <- "oh"
> oh
[1] "oh"
> wow <- "wow"
> wow
[1] "wow"
But single values are not all that you can work with in R. Read on to find out about vectors!
R can store more than single values. It has vectors. These are sequences of data that are all of the same type (e.g. integers, decimals, text (characters), or logical values (TRUE/FALSE)). To construct a vector, you can use the c() function (more about functions later):
> some_numbers <- c(10.0, pi, 47.5362, 3.50, 1.1111)
> some_numbers
[1] 10.000000 3.141593 47.536200 3.500000 1.111100
> some_integers <- c(NA, 1, 1, 2, 3) # NA is code for "missing" in R
> some_integers
[1] NA 1 1 2 3
> more_integers <- 1:5 # same as c(1, 2, 3, 4, 5), but a lot easier to type!
> more_integers
[1] 1 2 3 4 5
> some_characters <- c("work it", "harder", "make it", "better", # You can wrap commands on
+ "do it", "faster", "makes us", "stronger") # multiple lines.
> some_characters
[1] "work it" "harder" "make it" "better" "do it" "faster"
[7] "makes us" "stronger"
> some_logic <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
> some_logic
[1] TRUE FALSE TRUE TRUE FALSE
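Because every element of a vector must share one type, R converts mixed input to a single common type. The short sketch below illustrates this with the base functions c() and class(); the variable names are made up for illustration only.

```r
# Each vector has a single type, reported by class().
decimals <- c(10.0, 3.5, 1.1111)
words    <- c("work it", "harder")
flags    <- c(TRUE, FALSE, TRUE)

class(decimals)    # "numeric"
class(words)       # "character"
class(flags)       # "logical"

# Mixing types forces R to convert everything to one common type.
mixed <- c(1, "two", TRUE)
class(mixed)       # "character"
mixed              # "1" "two" "TRUE"

# 1:5 creates integers, while c(1, 2, 3) creates decimal numbers.
class(1:5)         # "integer"
class(c(1, 2, 3))  # "numeric"
```

If a vector unexpectedly prints with quotes around its numbers, a stray piece of text in the input is a likely cause.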
You can think about these as being similar to a column in an Excel spreadsheet. For example, if you wanted to take column A and square each number, you would start by setting cell B1 to =A1^2 and then you would drag that down to apply it to the entire column.
Math works with vectors the same way in R:
> some_integers ^ 2
[1] NA 1 1 4 9
Note: “NA” is a missing value in R.
Again, if you wanted to add two columns together in Excel, you would type something like =A1+B1 and then drag to apply it to the column. R behaves the same way:
> some_integers + some_numbers
[1] NA 4.141593 48.536200 5.500000 4.111100
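As a small sketch of how element-wise math and missing values interact, the example below uses a made-up vector of temperatures (the name temps_c and its values are assumptions for illustration). It relies only on base R: arithmetic operators, mean() with its na.rm argument, and is.na().

```r
temps_c <- c(NA, 12.5, 18.0, 21.5, 9.0)  # hypothetical temperatures, one missing

# Arithmetic is applied to every element at once.
temps_f <- temps_c * 9 / 5 + 32
temps_f                      # NA 54.5 64.4 70.7 48.2

# A missing value makes most summaries missing too...
mean(temps_c)                # NA

# ...unless you ask R to drop missing values first.
mean(temps_c, na.rm = TRUE)  # 15.25

# is.na() reports which elements are missing.
is.na(temps_c)               # TRUE FALSE FALSE FALSE FALSE
```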
Notice that each vector starts with [1] when you print it? This tells you the position in the vector of the first element shown on that line. You can get specific elements of a vector by supplying an integer or a sequence of integers in square brackets:
> some_numbers[2]
[1] 3.141593
> some_characters[c(1, 3, 5, 7)]
[1] "work it" "make it" "do it" "makes us"
> some_logic[1:3]
[1] TRUE FALSE TRUE
> fib <- some_integers[-1] # remove the first element
> fib
[1] 1 1 2 3
> some_characters[fib] # you can also use other vectors!
[1] "work it" "work it" "harder" "make it"
Notice that in the last example, we took our integer vector with the Fibonacci sequence and used it to subset the character vector.
We can also use logical values to subset a vector:
> some_numbers[some_logic]
[1] 10.0000 47.5362 3.5000
This is a very powerful method of subsetting a vector because we can use logical comparisons such as >, <, and ==. For example, we can ask R to return only the numbers less than ten:
> some_numbers # the first and third elements are not less than 10.
[1] 10.000000 3.141593 47.536200 3.500000 1.111100
> some_numbers[some_numbers < 10]
[1] 3.141593 3.500000 1.111100
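To make the mechanism visible, the sketch below stores the result of the comparison before using it. It recreates some_numbers so that it runs on its own; is_small is a name chosen here for illustration, and which() is a base R function that turns TRUE/FALSE values into positions.

```r
some_numbers <- c(10.0, pi, 47.5362, 3.50, 1.1111)

# A comparison returns one TRUE/FALSE per element.
is_small <- some_numbers < 10
is_small                      # FALSE TRUE FALSE TRUE TRUE

# Using that logical vector inside [ ] keeps only the TRUE positions.
some_numbers[is_small]

# Conditions can be combined with & (and) and | (or).
some_numbers[some_numbers > 2 & some_numbers < 10]  # pi and 3.5

# which() converts the logical vector into positions.
which(is_small)               # 2 4 5
```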
Of course, no one expects you to enter all of your data in R by hand. When you have data, it’s usually in tabular format. In this example, we’ll use data from the agricolae package assessing potato varieties for resistance to late blight at two locations in Peru (see help("ComasOxapampa", package = "agricolae") for details).
To read these data into R, you can use read.table():
> ComasOxapampa <- read.table("ex_data/ComasOxapampa.csv", sep = ",", header = TRUE)
Here we are telling R to read a table from the file “ComasOxapampa.csv” that is in the folder called “ex_data” using the function read.table() (more on functions in the next section). The extra arguments tell R that the values are separated by commas and that the first line contains the column names. What we get back is a data frame: a table whose columns are vectors that all have the same length.
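If you do not have the example file at hand, the following self-contained sketch shows the same idea with a tiny, made-up CSV file written to a temporary location. The file contents and the names csv_path, toy, and toy2 are assumptions for illustration; tempfile(), writeLines(), read.table(), and read.csv() are all base R functions.

```r
# Write a tiny, made-up CSV file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("cultivar,replication,score",
             "A,I,0.57",
             "B,I,0.32",
             "A,II,0.50"),
           csv_path)

# Read it back. header = TRUE says the first line holds the column names.
toy <- read.table(csv_path, sep = ",", header = TRUE)

# read.csv() is a shortcut that uses sep = "," and header = TRUE by default.
toy2 <- read.csv(csv_path)

toy          # print the whole table
dim(toy)     # 3 rows, 3 columns

unlink(csv_path)  # remove the temporary file
```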
We can look at the structure of a data frame using the str() function:
> str(ComasOxapampa)
'data.frame': 168 obs. of 4 variables:
$ cultivar : Factor w/ 56 levels "Amarilis-INIA",..: 1 2 3 4 5 6 7 8 9 10 ...
$ replication: Factor w/ 3 levels "I","II","III": 1 1 1 1 1 1 1 1 1 1 ...
$ comas : num 0.57 0.319 0.5 0.423 0.555 0.305 0.707 0.602 0.655 0.471 ...
$ oxapampa : num 0.6 0.661 0.627 0.511 0.683 0.397 0.68 0.618 0.723 0.625 ...
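This output tells us that we have 168 rows and 4 columns named “cultivar”, “replication”, “comas”, and “oxapampa”. We can see that “cultivar” and “replication” are both factors (a way of representing categorical variables in R), while “comas” and “oxapampa” are both numeric values of the AUDPC (area under the disease progress curve). Note that in R 4.0.0 and later, read.table() reads text columns as character vectors by default, so you will see “chr” instead of “Factor” unless you set stringsAsFactors = TRUE. We can access each column using the “$”: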
> ComasOxapampa$comas # vector of AUDPC values for Comas
[1] 0.570 0.319 0.500 0.423 0.555 0.305 0.707 0.602 0.655 0.471 0.637
[12] 0.148 0.524 0.245 0.586 0.087 0.726 0.465 0.411 0.309 0.270 0.268
[23] 0.341 0.133 0.605 0.509 0.166 0.684 0.195 0.240 0.251 0.547 0.393
[34] 0.030 0.370 0.500 0.405 0.655 0.619 0.377 0.280 0.035 0.303 0.373
[45] 0.679 0.587 0.162 0.534 0.669 0.725 0.140 0.555 0.433 0.234 0.564
[56] 0.710 0.596 0.357 0.498 0.334 0.499 0.236 0.646 0.555 0.706 0.420
[67] 0.706 0.164 0.570 0.062 0.663 0.293 0.734 0.489 0.390 0.305 0.346
[78] 0.293 0.356 0.283 0.541 0.537 0.162 0.670 0.177 0.269 0.280 0.524
[89] 0.353 0.011 0.359 0.559 0.505 0.729 0.557 0.500 0.445 0.035 0.377
[100] 0.541 0.583 0.596 0.328 0.655 0.689 0.704 0.203 0.642 0.515 0.461
[111] 0.781 0.725 0.641 0.400 0.600 0.323 0.576 0.332 0.732 0.570 0.711
[122] 0.386 0.688 0.194 0.620 0.202 0.632 0.271 0.776 0.534 0.560 0.000
[133] 0.227 0.345 0.507 0.326 0.578 0.555 0.249 0.725 0.285 0.264 0.248
[144] 0.000 0.512 0.011 0.328 0.640 0.505 0.724 0.541 0.428 0.430 0.019
[155] 0.364 0.000 0.639 0.596 0.383 0.560 0.641 0.736 0.225 0.593 0.490
[166] 0.424 0.000 0.720
You can even create new columns with the “$” symbol:
> ComasOxapampa$difference <- ComasOxapampa$comas - ComasOxapampa$oxapampa
> head(ComasOxapampa) # Look at the top of the data (6 rows by default)
cultivar replication comas oxapampa difference
1 Amarilis-INIA I 0.570 0.600 -0.030
2 Andinita I 0.319 0.661 -0.342
3 Atahualpa I 0.500 0.627 -0.127
4 Baseko I 0.423 0.511 -0.088
5 BIRRIS I 0.555 0.683 -0.128
6 C91-906-(Primavera) I 0.305 0.397 -0.092
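The same column operations can be tried on a small data frame built by hand. This is only a sketch: the data frame trial and its values are invented for illustration, and data.frame() is the base R function for constructing one.

```r
# A small made-up data frame: each argument becomes a column.
trial <- data.frame(
  cultivar = c("A", "B", "C"),
  site1    = c(0.57, 0.32, 0.50),
  site2    = c(0.60, 0.66, 0.63)
)

trial$site1                                    # one column, returned as a vector
trial$difference <- trial$site1 - trial$site2  # new column from existing ones

# Rows can be filtered with a logical vector, just like a plain vector.
# The empty space after the comma means "keep all columns".
trial[trial$difference < -0.1, ]
```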
Another feature of R is its functions. These are self-contained pieces of code that can be run over and over. Functions can do anything from reading and writing files to drawing graphics and performing calculations. We’ve seen a couple of them in the previous section: read.table() and str(). Function calls take the form do_this(to_this_thing), so in the case of read.table(), it’s “read the table in the file called ex_data/ComasOxapampa.csv”.
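The do_this(to_this_thing) pattern can be extended with additional arguments that adjust what the function does. The sketch below uses the base functions round() and mean(); the helper percent() is a hypothetical function written here only to show that you can define your own.

```r
values <- c(0.57, 0.319, NA, 0.423)  # made-up values, one missing

# Arguments can be passed by position or by name.
round(3.14159, 2)               # 3.14
round(x = 3.14159, digits = 2)  # same call with named arguments

# Extra arguments change how a function behaves.
mean(values)                    # NA, because one value is missing
mean(values, na.rm = TRUE)      # mean of the three remaining values

# A hypothetical helper of our own (not from any package).
percent <- function(proportion, digits = 1) {
  round(proportion * 100, digits)
}
percent(0.4396786)              # 44
```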
There are other functions that help you assess your data. For example, with data frames and matrices, you can ask R to give you the number of rows and number of columns with nrow() and ncol():
> nrow(ComasOxapampa)
[1] 168
> ncol(ComasOxapampa)
[1] 5
You can also nest functions, using the output of one as the input for another. Here, we can find out how many varieties we have by first finding the unique cultivars and then reporting the length of the result:
> length(unique(ComasOxapampa$cultivar))
[1] 56
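A nested call is evaluated from the inside out. The sketch below spells out the intermediate step on a made-up vector of names (cultivars and distinct are names chosen for illustration); unique(), length(), and table() are base R functions.

```r
cultivars <- c("Andinita", "Baseko", "Andinita", "BIRRIS", "Baseko")  # made-up sample

# Step by step, storing the intermediate result:
distinct <- unique(cultivars)
length(distinct)             # 3

# Nested: the inner call runs first and its result feeds the outer call.
length(unique(cultivars))    # 3

# table() counts how often each value occurs.
table(cultivars)
```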
Some of the most useful functions in R are those for calculating statistics. For example, you can test whether there is a significant difference in disease between the two regions with a t-test by using the function t.test():
> region_t <- t.test(ComasOxapampa$comas, ComasOxapampa$oxapampa)
> region_t
Welch Two Sample t-test
data: ComasOxapampa$comas and ComasOxapampa$oxapampa
t = -7.3095, df = 320.91, p-value = 2.148e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1871475 -0.1077692
sample estimates:
mean of x mean of y
0.4396786 0.5871369
Note: we are using a t-test for this for the sake of simplicity. This is not the appropriate test for these data.
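Because the result of t.test() was saved in a variable, its individual pieces can be pulled out with “$”, just like the columns of a data frame. The sketch below uses simulated numbers rather than the potato data (site_a, site_b, and result are names invented here), so the values it prints are not the ones shown above.

```r
set.seed(1)                                 # make the simulated numbers reproducible
site_a <- rnorm(30, mean = 0.45, sd = 0.2)  # simulated, not the potato data
site_b <- rnorm(30, mean = 0.60, sd = 0.2)

result <- t.test(site_a, site_b)

names(result)     # the pieces stored in the result
result$p.value    # the p-value as a plain number
result$estimate   # the two sample means
result$conf.int   # confidence interval for the difference in means
```

Extracting values this way is handy when you want to reuse a statistic in a later calculation instead of copying it from the printed output.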
R would not be so widely used without packages. A package is a set of functions, documentation, and data sets that can be distributed to anyone using R. If you can think of something you want R to do, there’s probably a package for that. Packages can be downloaded from online repositories such as CRAN or Bioconductor.
The easiest way to install a package is through your R console with the function install.packages():
> install.packages("agricolae", repos = "https://cran.rstudio.org")
Installing package into ‘/Users/zhian/R’
(as ‘lib’ is unspecified)
trying URL 'https://cran.r-project.org/bin/macosx/mavericks/contrib/3.3/agricolae_1.2-4.tgz'
Content type 'application/x-gzip' length 923645 bytes (901 KB)
==================================================
downloaded 901 KB
The downloaded binary packages are in
/var/folders/qd/dpdhfsz12wb3c7wz0xdm6dbm0000gn/T//RtmpWCcCSI/downloaded_packages
You should only have to install a package once and then you can use it as many times as you need.
This package is stored in your R library, which is a folder on your computer where your downloaded packages live. When you want to use the functions in a package, you load it with the library() function. This function tells R to look in your library for a package and load it.
> library("agricolae")
Pro Tip! For better organization, load all of the packages you need at the beginning of your R script.
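Since a package only needs to be installed once, a script can check the library before downloading anything. The sketch below is one possible way to do that: install_if_missing() is a hypothetical helper written for this example, built on the base functions requireNamespace() and install.packages(). It is called here on "stats", which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is not already in the library.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available now.
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded here.
install_if_missing("stats")

# package::function() calls a function from an installed package without library().
stats::median(c(3, 1, 2))   # 2

# .libPaths() shows the folder(s) that make up your R library.
.libPaths()
```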
One way that R shines is that packages on CRAN are documented and easy to install. Help files give the user a brief overview of what a function does, how to call it, what its arguments mean, and what it returns, usually with runnable examples.
To get help on any R function, type a question mark followed by the function name, without the parentheses. Here’s an example of how to get help about the audpc() function from the agricolae package:
> library("agricolae") # The package with the audpc() function.
> ?audpc # open the R documentation of the function audpc()
audpc                                                    R Documentation

Description:
    Area Under Disease Progress Curve. The AUDPC measures the disease throughout a period. The AUDPC is the area that is determined by the sum of trapezes under the curve.

Usage:
    audpc(evaluation, dates, type = "absolute")

Arguments:
    evaluation  Table of data of the evaluations: Data frame
    dates       Vector of dates corresponding to each evaluation
    type        relative, absolute

Details:
    AUDPC. For the illustration one considers three evaluations (14, 21 and 28 days) and percentage of damage in the plant 40, 80 and 90 (interval between dates of evaluation 7 days). AUDPC = 1045. The evaluations can be at different interval.

Value:
    evaluation  data frame, matrix or numeric vector
    dates       a numeric vector
    type        text

Author(s):
    Felipe de Mendiburu

References:
    Campbell, C. L., L. V. Madden. (1990): Introduction to Plant Disease Epidemiology. John Wiley & Sons, New York City.

Examples:
    library(agricolae)
    dates<-c(14,21,28) # days
    # example 1: evaluation - vector
    evaluation<-c(40,80,90)
    audpc(evaluation,dates)
    # example 2: evaluation: dataframe nrow=1
    evaluation<-data.frame(E1=40,E2=80,E3=90) # percentages
    plot(dates,evaluation,type="h",ylim=c(0,100),col="red",axes=FALSE)
    title(cex.main=0.8,main="Absolute or Relative AUDPC\nTotal area = 100*(28-14)=1400")
    lines(dates,evaluation,col="red")
    text(dates,evaluation+5,evaluation)
    text(18,20,"A = (21-14)*(80+40)/2")
    text(25,60,"B = (28-21)*(90+80)/2")
    text(25,40,"audpc = A+B = 1015")
    text(24.5,33,"relative = audpc/area = 0.725")
    abline(h=0)
    axis(1,dates)
    axis(2,seq(0,100,5),las=2)
    lines(rbind(c(14,40),c(14,100)),lty=8,col="green")
    lines(rbind(c(14,100),c(28,100)),lty=8,col="green")
    lines(rbind(c(28,90),c(28,100)),lty=8,col="green")
    # It calculates audpc absolute
    absolute<-audpc(evaluation,dates,type="absolute")
    print(absolute)
    rm(evaluation, dates, absolute)
    # example 3: evaluation dataframe nrow>1
    data(disease)
    dates<-c(1,2,3) # week
    evaluation<-disease[,c(4,5,6)]
    # It calculates audpc relative
    index <-audpc(evaluation, dates, type = "relative")
    # Correlation between the yield and audpc
    correlation(disease$yield, index, method="kendall")
    # example 4: days infile
    data(CIC)
    comas <- CIC$comas
    oxapampa <- CIC$oxapampa
    dcomas <- names(comas)[9:16]
    days<- as.numeric(substr(dcomas,2,3))
    AUDPC<- audpc(comas[,9:16],days)
    relative<-audpc(comas[,9:16],days,type = "relative")
    h1<-graph.freq(AUDPC,border="red",density=4,col="blue")
    table.freq(h1)
    h2<-graph.freq(relative,border="red",density=4,col="blue",
                   frequency=2, ylab="relative frequency")
Other ways of getting help:
> help(package = "agricolae") # Get help for a package.
> help("audpc") # Get help for the audpc function
> ?audpc # same as above
> ??disease # Search for help with the keyword 'disease' in all packages
If you want to run the examples, you can either copy and paste the commands to your R console, or you can run them all with:
> example("audpc", package = "agricolae")
audpc> library(agricolae)
audpc> dates<-c(14,21,28) # days
audpc> # example 1: evaluation - vector
audpc> evaluation<-c(40,80,90)
audpc> audpc(evaluation,dates)
evaluation
1015
audpc> # example 2: evaluation: dataframe nrow=1
audpc> evaluation<-data.frame(E1=40,E2=80,E3=90) # percentages
audpc> plot(dates,evaluation,type="h",ylim=c(0,100),col="red",axes=FALSE)
audpc> title(cex.main=0.8,main="Absolute or Relative AUDPC\nTotal area = 100*(28-14)=1400")
audpc> lines(dates,evaluation,col="red")
audpc> text(dates,evaluation+5,evaluation)
audpc> text(18,20,"A = (21-14)*(80+40)/2")
audpc> text(25,60,"B = (28-21)*(90+80)/2")
audpc> text(25,40,"audpc = A+B = 1015")
audpc> text(24.5,33,"relative = audpc/area = 0.725")
audpc> abline(h=0)
audpc> axis(1,dates)
audpc> axis(2,seq(0,100,5),las=2)
audpc> lines(rbind(c(14,40),c(14,100)),lty=8,col="green")
audpc> lines(rbind(c(14,100),c(28,100)),lty=8,col="green")
audpc> lines(rbind(c(28,90),c(28,100)),lty=8,col="green")
audpc> # It calculates audpc absolute
audpc> absolute<-audpc(evaluation,dates,type="absolute")
audpc> print(absolute)
[1] 1015
audpc> rm(evaluation, dates, absolute)
audpc> # example 3: evaluation dataframe nrow>1
audpc> data(disease)
audpc> dates<-c(1,2,3) # week
audpc> evaluation<-disease[,c(4,5,6)]
audpc> # It calculates audpc relative
audpc> index <-audpc(evaluation, dates, type = "relative")
audpc> # Correlation between the yield and audpc
audpc> correlation(disease$yield, index, method="kendall")
Kendall's rank correlation tau
data: disease$yield and index
z-norm = -3.326938 p-value = 0.0008780595
alternative hypothesis: true rho is not equal to 0
sample estimates:
tau
-0.5436832
audpc> # example 4: days infile
audpc> data(CIC)
audpc> comas <- CIC$comas
audpc> oxapampa <- CIC$oxapampa
audpc> dcomas <- names(comas)[9:16]
audpc> days<- as.numeric(substr(dcomas,2,3))
audpc> AUDPC<- audpc(comas[,9:16],days)
audpc> relative<-audpc(comas[,9:16],days,type = "relative")
audpc> h1<-graph.freq(AUDPC,border="red",density=4,col="blue")
audpc> table.freq(h1)
audpc> h2<-graph.freq(relative,border="red",density=4,col="blue",
audpc+ frequency=2, ylab="relative frequency")
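The value 1015 reported for the first example can be checked by hand with the vector tools from earlier in this introduction. The sketch below is a base R re-derivation of the trapezoid sum described in the plot labels (A + B); it is not the code that agricolae itself uses, and the names widths and heights are chosen here for illustration.

```r
dates      <- c(14, 21, 28)  # days, as in example 1
evaluation <- c(40, 80, 90)  # percent damage

# Width of each interval between evaluation dates.
widths <- diff(dates)                                         # 7 7

# Average height of each trapezoid: mean of neighbouring evaluations.
heights <- (head(evaluation, -1) + tail(evaluation, -1)) / 2  # 60 85

# Absolute AUDPC is the summed area of the trapezoids.
absolute <- sum(widths * heights)                             # 420 + 595 = 1015

# Relative AUDPC divides by the largest possible area (100% over the whole period).
relative <- absolute / (100 * (max(dates) - min(dates)))      # 1015 / 1400 = 0.725
```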
Some packages include vignettes: longer documents such as introductions, tutorials, or reference cards, often in PDF or HTML format. You can browse the vignettes from all installed packages by typing:
> browseVignettes() # see vignettes from all packages
> browseVignettes(package = "tidyr") # see vignettes from a specific package.
and to look at a specific vignette you can type:
> vignette('mlg') # Multilocus Genotype vignette from the poppr package
assignment operator (<-): assigns the result of whatever’s to the right of the operator to whatever is on the left
data frame: a representation of a table in R
function: a self-contained set of code that can be used many times
packages: a set of functions and data sets that are bundled together for distribution and download
R library: a folder/directory on your computer where your R packages are stored
R prompt (>): a symbol that R uses to let you know it’s ready for you to enter commands
vectors: a sequence of data elements of the same type