Why use R?
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. – http://cran.csiro.au/doc/manuals/r-release/R-intro.html
This is a valid question, considering that most languages and frameworks, including CUDA, have statistical analysis libraries built in. Hopefully, running through some introductory exercises will reveal the benefits.
Associated GUIs and extensions:
- Weka – Specific for machine learning algorithms
- R Commander – Data analysis GUI
Install on Ubuntu 12.04:
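# "sudo echo ... >> file" fails because the redirect runs as the normal user, so use tee instead
echo "deb http://cran.csiro.au/bin/linux/ubuntu precise/" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get install r-base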
Then, to enter the R command line interface:
$ R
For starters, we will run through an intro from UCLA: http://www.ats.ucla.edu/stat/r/seminars/intro.htm
Within the R command line interface, a package must be installed before it can be used:
install.packages()
- foreign – package to read data files from other stats packages
- xlsx – package to read and write Excel files (requires Java to be installed with the same architecture as your R version, plus the rJava and xlsxjars packages)
- reshape2 – package to easily melt data to long form
- ggplot2 – package for elegant data visualization using the Grammar of Graphics
- GGally – package for scatter plot matrices
- vcd – package for visualizing and analyzing categorical data
install.packages("xlsx")
install.packages("reshape2")
install.packages("ggplot2")
install.packages("GGally")
install.packages("vcd")
Pre-requisites (Java, for the xlsx package):
sudo apt-get install openjdk-7-*
sudo ln -s /usr/lib/jvm/java-7-openjdk-amd64/bin/java /etc/alternatives/java
sudo R CMD javareconf
Preparing session:
Installing a package does not make it available automatically. Each package needed in the current session must be attached first:
require(foreign)
require(xlsx)
After attaching all of the required packages, you can confirm that they are loaded via:
sessionInfo()
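As a small sketch of the attach-and-confirm step, the snippet below uses the tools package, chosen here only because it ships with every R installation; substitute whichever package you actually need. It relies on require returning TRUE or FALSE rather than stopping on failure, so the result can be checked directly.

```r
# 'tools' is used purely for illustration: it is part of base R,
# so this runs without installing anything.
loaded <- require(tools)

# require() returns TRUE if the package was attached, FALSE otherwise
if (!loaded) {
  stop("package 'tools' could not be loaded")
}

# search() lists everything attached to the current session
attached <- "package:tools" %in% search()
print(attached)

# sessionInfo() reports attached packages along with R version details;
# base packages such as 'tools' are listed under basePkgs
info <- sessionInfo()
print(info$basePkgs)
```

For a package installed from CRAN, the same check would look in info$otherPkgs instead of info$basePkgs.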
R code can be entered into the command line directly, or saved to a script which can be run inside a session using the ‘source’ function.
Help can be obtained by typing ? before a function name.
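The two points above can be sketched as follows. The script contents and the variable names (x, total) are made up for illustration, and the script is written to a temporary file so the example does not depend on any particular path.

```r
# Write a two-line script to a temporary file (hypothetical content)
script <- tempfile(fileext = ".R")
writeLines(c("x <- 1:10", "total <- sum(x)"), script)

# Run the script inside the current session; the objects it creates
# are available afterwards
source(script)
print(total)

# Two equivalent ways of opening the help page for a function
# (uncomment to run interactively)
# ?read.csv
# help("read.csv")
```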
Entering Data:
R is most compatible with datasets stored as text files, e.g. CSV.
Base R contains the functions read.table and read.csv; see the help files for these functions for their many options.
# comma separated values
dat.csv <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
# tab separated values
dat.tab <- read.table("http://www.ats.ucla.edu/stat/data/hsb2.txt", header=TRUE, sep = "\t")
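To try these functions without a network connection, the sketch below feeds a tiny made-up dataset to read.csv and read.table through their text argument. The column names and values are invented for illustration and are not taken from hsb2.

```r
# A tiny made-up dataset, supplied inline instead of from a file or URL
csv_text <- "id,female,read
1,0,57
2,1,68
3,0,44"

# read.csv assumes a header row and a comma separator by default
dat <- read.csv(text = csv_text)

# The same data, tab separated, read with read.table;
# header and sep must be given explicitly here
tab_text <- gsub(",", "\t", csv_text)
dat2 <- read.table(text = tab_text, header = TRUE, sep = "\t")

# Inspect the structure and check that both routes agree
str(dat)
print(all.equal(dat, dat2))
```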
Datasets from other statistical analysis software can be imported using the foreign package:
require(foreign)
# SPSS files
dat.spss <- read.spss("http://www.ats.ucla.edu/stat/data/hsb2.sav", to.data.frame=TRUE)
# Stata files
dat.dta <- read.dta("http://www.ats.ucla.edu/stat/data/hsb2.dta")
If converting Excel spreadsheets to CSV is too much of a hassle, the xlsx package we attached earlier will do the job:
# these two steps only needed to read excel files from the internet
f <- tempfile("hsb2", fileext=".xls")
download.file("http://www.ats.ucla.edu/stat/data/hsb2.xls", f, mode="wb")
dat.xls <- read.xlsx(f, sheetIndex=1)
Viewing Data:
# first few rows
head(dat.csv)
# last few rows
tail(dat.csv)
# variable names
colnames(dat.csv)
# pop-up view of entire data set (uncomment to run)
# View(dat.csv)
Datasets that have been read in are stored as data frames, which have a matrix-like structure. The most common method of indexing is object[row,column], but many others are available.
# single cell value
dat.csv[2, 3]
# omitting row value implies all rows; here all rows in column 3
dat.csv[, 3]
# omitting column values implies all columns; here all columns in row 2
dat.csv[2, ]
# can also use ranges - rows 2 and 3, columns 2 and 3
dat.csv[2:3, 2:3]
Variables can also be accessed via their names:
# get first 10 rows of variable female using two methods
dat.csv[1:10, "female"]
dat.csv$female[1:10]
The c function is used to combine values of common type together to form a vector:
# get column 1 for rows 1, 3 and 5
dat.csv[c(1, 3, 5), 1]
## [1]  70  86 172
# get row 1 values for variables female, prog and socst
dat.csv[1, c("female", "prog", "socst")]
##   female prog socst
## 1      0    1    57
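The indexing forms above, plus two of the "many others" (negative and logical indices), can be tried on a small data frame built by hand. The name toy and its values are hypothetical and chosen only so the example runs without downloading hsb2.

```r
# A hand-built data frame standing in for a real dataset
toy <- data.frame(
  id     = c(11, 12, 13, 14, 15),
  female = c(0, 1, 0, 1, 1),
  score  = c(57, 68, 44, 63, 47)
)

# rows 1, 3 and 5 of the first column, selected with a vector from c()
picked <- toy[c(1, 3, 5), 1]
print(picked)

# negative indices drop rows: everything except row 1
rest <- toy[-1, ]

# logical indexing: keep only rows where score is above 50
high <- toy[toy$score > 50, ]
print(high)

# a single column by name, kept as a data frame rather than a vector
one_col <- toy[, "score", drop = FALSE]
```

Without drop = FALSE, selecting a single column returns a plain vector, which is often what you want but can surprise code that expects a data frame.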
Setting column names:
colnames(dat.csv) <- c("ID", "Sex", "Ethnicity", "SES", "SchoolType", "Program", "Reading", "Writing", "Math", "Science", "SocialStudies")
# to change one variable name, just use indexing
colnames(dat.csv)[1] <- "ID2"
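Renaming by position works, but it silently renames the wrong column if the column order ever changes. One possible alternative, sketched here on a made-up data frame named toy, is to pick the column by its current name.

```r
# Hypothetical data frame for illustration
toy <- data.frame(id = 1:3, female = c(0, 1, 0), read = c(57, 68, 44))

# Rename the column currently called "female", wherever it sits
colnames(toy)[colnames(toy) == "female"] <- "Sex"

print(colnames(toy))
```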
Saving data:
# write.csv(dat.csv, file = "path/to/save/filename.csv")
# write.table(dat.csv, file = "path/to/save/filename.txt", sep = "\t", na=".")
# write.dta(dat.csv, file = "path/to/save/filename.dta")
# write.xlsx(dat.csv, file = "path/to/save/filename.xlsx", sheetName="hsb2")
# save to binary R format (can save multiple datasets and R objects)
# save(dat.csv, dat.dta, dat.spss, dat.tab, file = "path/to/save/filename.RData")
# change working directory
setwd("/home/a/Desktop/R/testspace1")
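A minimal round trip, assuming a made-up data frame and temporary files rather than real paths, shows what the commented-out calls above do when run. The names toy, extra, back and restored are hypothetical.

```r
# Hypothetical data to save
toy <- data.frame(id = 1:3, score = c(57, 68, 44))
extra <- c(a = 1, b = 2)

# CSV: write to a temporary file and read it back
# (row.names = FALSE avoids an extra column of row numbers)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, file = csv_path, row.names = FALSE)
back <- read.csv(csv_path)
print(back)

# Binary R format: several objects can go into one file
rdata_path <- tempfile(fileext = ".RData")
save(toy, extra, file = rdata_path)

# Remove the objects, then restore them from the file;
# load() returns the names of the objects it restored
rm(toy, extra)
restored <- load(rdata_path)
print(restored)
```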