# Introduction to R
# Any line that starts with the pound sign is a comment, and is ignored when it
# gets run.
# R makes many simple and advanced statistical methods readily available and
# easy to use. As a beginner in R, you need to think a bit like a programmer,
# which is why R can be considered "hard."
# R can be used as a calculator:
1+1
# When you run a line of code, you can see the line show up in the console
# (bottom left screen) and then any output is diplayed.
# You can use <- to assign a value to a variable:
x <- 5
# You need to run lines for code in order for them to be executed. Once you
# run the line, you see the value for x show up in the upper right hand pane in
# the environment tab.
# We could now use that variable:
x + 9
# Usually, we will want to store more than one value of a variable, we will get
# to that shortly.
# Your working directory is where R will look for files (datasets and R
# scripts) by default. You can find where that location is currently using the
# getwd() function:
getwd()
# You can set your working directory using the setwd() function or going to:
# Session -> Set Working Directory -> Choose Directory... and navigating to the
# location your want. It is good to create a dedicated folder for your work on
# a particular project. Keep the folder name simple and put it in a location
# it is easy for you to locate.
# Now, you can see all the files in this folder in the Files tab in the lower
# right hand pane. Let's save this file as firstscript.R. You may need to
# refresh the bottom right pane, but now you should be able to see your script
# firstscript.R. Looking at the tab in the upper left pane, any time the
# filename is red, you know you have made changes that have not yet been saved.
# When you save, the filename turns black.
# A vector is the building block for handling a collection of observations of a
# single variable. For a numeric vector:
numvec <- c(1,23,506,22)
# We have used the c() function to create this vector. We pass information to
# the function via arguments, separated by commas.
# Notice that the numvec vector showed up in the top right pane under the
# environment tab. If we type numvec, the elements will be returned in the
# console:
numvec
# The [1] in front of the list of numbers is the element number of the first
# element on that line. You can typically ignore this value. It can be
# helpful if the number of elements is large.
# The length() function will tell you how many elements are in the numvec
# vector:
length(numvec)
# We have used the length() function and passed the argument numvec to
# determine the length (number of elements) in the vector numvec. In the
# bottom right pane, click on the Help tab and type length in the search window
# and click enter. The help tab provides information about every function that
# is currently installed. The information can get a little computer
# programming heavy, but it provides a great deal of information. Googling
# various R functions can also be extremely helpful, and sometimes provides
# answers in more "layman's" terms. Notice, help gives us a description, gives
# us the possible arguments of the function, and examples of code using the
# function, among other things.
# For a vector of character strings (words), the same command it used, and we
# need to put double quotes around each element:
charvec <- c("Sol","Mary","D","Kim")
# Notice that the charvec vector showed up in the top right pane under the
# environment tab. If we type charvec, the elements will be returned:
charvec
# R is case sensitive, CharVEC is different than charvec. Keep things simple.
# It will save you lots of headaches.
# We can use the length() function for the charvec object as well:
length(charvec)
# Often, we will import data from another source. One common filetype is a .csv
# file. Get the iqlead.csv file from http://faculty.cascadia.edu/rburke/ and
# put it in your working directory (the folder you created earlier to put your
# R stuff in)
# Now, we are going to import the data contained in this .csv file into an
# object (specifically a dataframe) we are going to call iqlead using the
# read.csv() function.
iqlead <- read.csv("iqlead.csv")
# Data are measured from children in two consecutive years, and the children
# were living close to a lead smelter. LEAD is a blood lead level group,
# 1 = low lead level, 2 = medium lead level, 3 = high lead level. AGE is age
# in years. SEX is sex of subject 1 = male, 2 = female. YEAR1 is blood lead
# level in first year and YEAR2 is blood lead level in second year. IQV is
# measured verbal IQ score. IQP is measured performance IQ score. IQF is
# measured full IQ score.
# Notice the object iqlead has shown up in the upper right pane in the
# Environment Tab. If you click the little arrow to the left of iqlead, you
# can see that this data set is made up of 121 observations of 8 variables.
# The variables are described above and are all integer values.
# If you double click the word iqlead, the dataset opens in tabular form in the
# upper left window so you can browse the data. You can close this tab
# whenever you want by clicking the x next to the tab name in this window.
# To calculate the mean, we use the fuction mean(). If we want to calculate
# the mean of the variable IQF, we need to indicate the dataset we want
# (iqlead) and the variable we want (IQF) with a dollar sign between:
mean(iqlead$IQF)
# If we want to calculate the mean for just the low lead blood level group
# (LEAD = 1):
mean(iqlead$IQF[iqlead$LEAD==1])
# If we want to know how many observations are in the low lead blood level
# group:
length(iqlead$IQF[iqlead$LEAD==1])
# If we want to calculate the mean for just the medium lead blood level group
# (LEAD = 2):
mean(iqlead$IQF[iqlead$LEAD==2])
# If we want to know how many observations are in the medium blood level group:
length(iqlead$IQF[iqlead$LEAD==2])
# If we want to calculate the mean for just the high lead blood level group
# (LEAD = 3):
mean(iqlead$IQF[iqlead$LEAD==3])
# If we want to know how many observations are in the high lead blood level
# group:
length(iqlead$IQF[iqlead$LEAD==3])
# It might be interesting to determine if the differences in the means of these
# differenct blood level groups are statistically significant.
# PRACTICE
# 1. Calculate the mean age.
# 2. Calculate the mean age of the females.
# 3. Calculate how many of the ages are from females.
# Let's take a look at the mean() function in the help. In the bottom right
# pane, click on the help tab and type mean into the search window and click
# return. We have been providing only one argument, named x in the help, we
# have been passing the function the vector name we want to find the length of.
# Our dataset is not missing any values, there are 121 values for each of the 8
# variables. This is not always the case. There is a dataset automatically
# loaded with R called airquality. It contains daily air quality measurements
# in New York in 1973. Let's looks at this dataset:
airquality
# Notice that there are some NA's in Ozone column. Let's try to take the mean
# Ozone:
mean(airquality$Ozone)
# An NA is retuned, indicating there is at least one NA value. The mean()
# function has an argument called na.rm whose default value is FALSE (don't
# ignore). You can see this in the Uage and Arguments sections of the help for
# the function mean. Let's add the argement na.rm=T to the function. You can
# use T instead of typing out TRUE and F instead of typing out False.
mean(airquality$Ozone,na.rm=T)
# And now the mean is calculated, ignoring the NA's. Another argument is
# available, trim, whose default is 0. This argument allows you to take the
# trimmed mean.
# PRACTICE
# 4. How many observations of the Wind variable are there in the airquality
# dataset?
# 5. Calculate the mean of the Wind variable in the airquality dataset (ignore
# NA's if necessary)
# 6. Trim 5% of the of the observations from each end and calculate the trimmed
# mean of the Wind variable.
# If we want to calculate the median, we use the median() function:
median(iqlead$IQF)
# To calculate the sample standard deviation, we use the sd() function:
sd(iqlead$IQF)
# The median() and sd() functions also have the na.rm arguement available.
# Feel free to look them up in the Help tab to confirm this.
# We can create a histogram using the hist() function.
hist(iqlead$IQF)
# Histograms can help us get a sense of whether or not our dataset has an
# approximately normal distribution.
# Look up hist in the help tab. Take a quick look at the arguments available.
# We could label our x-axis and give our graph a title using the main and xlab
# arguments of hist(). Notice that to keep things neat, I have put each
# argument on a new line so I do not have to scroll right and left to see the
# whole line of code. You can run multiple lines of code at the same time
# by highlighting them before clicking run:
hist(iqlead$IQF,
main = "Full IQ Scores of Children Living Near Lead Smelter",
xlab = "Full IQ Score (in IQ points)")
# Let's make a histogram but only for the low lead level group
hist(iqlead$IQF[iqlead$LEAD==1],
main = "Full IQ cores of Children in the Low Lead Level Group",
xlab = "Full IQ Score (in IQ points)")
# We can also set the boundaries of the bins of our histogram ourselves. Let's
# take another look setting our own bin boundaries using the breaks argument:
hist(iqlead$IQF[iqlead$LEAD==1],
main = "Full IQ cores of Children in the Low Lead Level Group",
xlab = "Full IQ Score (in IQ points)",
breaks=c(30,40,50,60,70,80,90,100,110,120,130,140,150,160))
# Notice, we used the c() function to create a vector of the numbers of our bin
# breaks.
# PRACTICE
# 7. Generate a histogram for the Wind variable in the airquality dataset.
# 8. Generate a histogram for the Wind variable in the airquality dataset but
# only for the month of June (Month==6).
# 9. Add the argument freq=F to either of your histograms. What did this
# change on your histogram?
# Let's re-run the histogram for the IQF variable in the iqlead dataset for the
# low lead group:
hist(iqlead$IQF[iqlead$LEAD==1])
# This data looks somewhat normally distributed. A plot known as a Normal
# Quantile Plot, or a Quantile-Quantile Plot, can help with that. We can use
# the function qqnorm() to plot one:
qqnorm(iqlead$IQF[iqlead$LEAD==1])
# If we follow this up with the command qqline():
qqline(iqlead$IQF[iqlead$LEAD==1])
# If the data is normally distributed, the QQ plot points should be reasonably
# close to the qqline.
# For comparison purposes, let's have R pick 78 numbers at random from a
# normal distribution with a mean of 93 and and standard deviation of 15 and
# store them in a vector called norm78. I picked these values for the
# arguments to be similar to the IQF variable in the low lead data group.
norm78 <- rnorm(78, mean=93, sd=15)
# Feel free to look up the rnorm() function in the help tab for more
# information. Now, let's make a histogram (each of ours will be different
# since we will each have a different set of random numbers selected from the
# normal distribution.)
hist(norm78)
# Now, let's run qqnorm() and qqline() on norm78.
qqnorm(norm78)
qqline(norm78)
# PRACTICE
# 10. Re-run your histogram for the Wind variable in the airquality dataset.
# Does the distribution look approximately normal?
# 11. Run qqnorm() and qqline() for the Wind variable in the airquality
# dataset.
# 12. Create a vector of values selected randomly from a normal distribution
# that has approximately the same mean and standard deviation as the Wind
# variable in the airquality dataset. Make the sample size the same as well.
# Then generate a histogram and q qqplot with the qqline added for this data.
# The summary() function gives us nice summary information about our variables
# within our dataset:
summary(iqlead)
# It is a nice place to start when we are looking at a dataset for the first
# time. It can also help us spot errors with out import into R.
# We can make boxplots using the boxplot() function:
boxplot(iqlead$IQF)
# And we can break our boxplots up by our lead level groups:
boxplot(iqlead$IQF~iqlead$LEAD)
# We can add titles and labeling the same way we did for the histogram
# function:
boxplot(iqlead$IQF~iqlead$LEAD,
main = "Full IQ scores of Children Living Near a Lead Smelter",
xlab = "Lead Blood Level (1 = Low, 2 = Medium, 3 = High)",
ylab = "Full IQ Score (in IQ points)")
# For a closer look at any graph we can go to View -> Panes -> Zoom Plots.
# We can use the Export option in this window to save the graphic as an image
# file or pdf. OR copy it to the clipboard to paste into a word document,
# powerpoint, etc.
# To get out of this view, we can go to View -> Panes - > Show All Panes
# It looks like perhaps the low lead level group has a higher mean IQ than the
# high lead level group. Is this diference statistically significant?
# Performing a t-test on the two means could be helpful.
t.test(iqlead$IQF[iqlead$LEAD == 1],iqlead$IQF[iqlead$LEAD == 3])
# We have not checked if our dataset meets the conditions for running a t test,
# but assuming the conditions for running the t-test are satisfied:
# null hypothesis: true difference in means is equal to 0
# alternate hypothesis: true difference of means is not equal to 0
# With a p-value of 0.02638, there is a probability of about 2.6% that these
# random samples could be collected from populations where the two means were
# equal. For a common significance level of .1 or .05, we would reject the
# null hypothesis and conclude there is sufficient evidence to claim there is a
# difference in the mean IQ for the two groups. For a significance level of
# .01, we would fail to reject the null and not have enough evidence to support
# the claim.
# You can read more about the t.test function by typing t.test in the Help
# search box.
# PRACTICE
# 13. Run the summary() function on the airquality dataset.
# 14. Make boxplots of the Wind variable in the airquality dataset broken out
# by Month.
# 15. Run a t-test on the Wind variable for May and July, then one for July and
# August. What do you observe?
# The t.test function does a lot. If you look at the help for t.test, you can
# do one and two tailed tests, one or two variable tests, paired or unpaired,
# variances unequal or pooled, etc. You can also have the t.test function
# calculate a confidence interval for a particular confidence level. For
# example:
t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.9)
t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.95)
t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.99)
# You can ignore the other parts of the output and focus on the confidence
# interval section. As you would expect, as the increase the confidence level,
# the width of the confidence interval increases.