# Introduction to R
# Any line that starts with the pound sign is a comment, and is ignored when it
# gets run.

# R makes many simple and advanced statistical methods readily available and 
# easy to use.  As a beginner in R, you need to think a bit like a programmer, 
# which is why R can be considered "hard."

# R can be used as a calculator:

1+1

# When you run a line of code, you can see the line show up in the console 
# (bottom left screen) and then any output is diplayed.

# You can use <- to assign a value to a variable:

x <- 5

# You need to run lines for code in order for them to be executed.  Once you 
# run the line, you see the value for x show up in the upper right hand pane in 
# the environment tab.

# We could now use that variable:

x + 9

# Usually, we will want to store more than one value of a variable, we will get 
# to that shortly.

# Your working directory is where R will look for files (datasets and R 
# scripts) by default. You can find where that location is currently using the 
# getwd() function:

getwd()

# You can set your working directory using the setwd() function or going to:
# Session -> Set Working Directory -> Choose Directory... and navigating to the
# location your want.  It is good to create a dedicated folder for your work on
# a particular project.  Keep the folder name simple and put it in a location 
# it is easy for you to locate.

# Now, you can see all the files in this folder in the Files tab in the lower 
# right hand pane.  Let's save this file as firstscript.R.  You may need to 
# refresh the bottom right pane, but now you should be able to see your script 
# firstscript.R.  Looking at the tab in the upper left pane, any time the 
# filename is red, you know you have made changes that have not yet been saved.  
# When you save, the filename turns black.

# A vector is the building block for handling a collection of observations of a
# single variable. For a numeric vector:

numvec <- c(1,23,506,22)

# We have used the c() function to create this vector.  We pass information to 
# the function via arguments, separated by commas.

# Notice that the numvec vector showed up in the top right pane under the 
# environment tab.  If we type numvec, the elements will be returned in the 
# console:

numvec

# The [1] in front of the list of numbers is the element number of the first 
# element on that line.  You can typically ignore this value.  It can be 
# helpful if the number of elements is large.

# The length() function will tell you how many elements are in the numvec 
# vector:

length(numvec)

# We have used the length() function and passed the argument numvec to 
# determine the length (number of elements) in the vector numvec.  In the 
# bottom right pane, click on the Help tab and type length in the search window
# and click enter.  The help tab provides information about every function that
# is currently installed.  The information can get a little computer 
# programming heavy, but it provides a great deal of information.  Googling 
# various R functions can also be extremely helpful, and sometimes provides 
# answers in more "layman's" terms. Notice, help gives us a description, gives
# us the possible arguments of the function, and examples of code using the
# function, among other things.

# For a vector of character strings (words), the same command it used, and we 
# need to put double quotes around each element:

charvec <- c("Sol","Mary","D","Kim")

# Notice that the charvec vector showed up in the top right pane under the 
# environment tab.  If we type charvec, the elements will be returned:

charvec

# R is case sensitive, CharVEC is different than charvec.  Keep things simple.
# It will save you lots of headaches.

# We can use the length() function for the charvec object as well:

length(charvec)

# Often, we will import data from another source. One common filetype is a .csv
# file. Get the iqlead.csv file from http://faculty.cascadia.edu/rburke/ and 
# put it in your working directory (the folder you created earlier to put your
# R stuff in)

# Now, we are going to import the data contained in this .csv file into an 
# object (specifically a dataframe) we are going to call iqlead using the 
# read.csv() function.

iqlead <- read.csv("iqlead.csv")

# Data are measured from children in two consecutive years, and the children 
# were living close to a lead smelter.  LEAD is a blood lead level group, 
# 1 = low lead level, 2 = medium lead level, 3 = high lead level.  AGE is age 
# in years. SEX is sex of subject 1 = male, 2 = female.  YEAR1 is blood lead 
# level in first year and YEAR2 is blood lead level in second year.  IQV is 
# measured verbal IQ score.  IQP is measured performance IQ score.  IQF is 
# measured full IQ score.

# Notice the object iqlead has shown up in the upper right pane in the 
# Environment Tab.  If you click the little arrow to the left of iqlead, you 
# can see that this data set is made up of 121 observations of 8 variables.  
# The variables are described above and are all integer values.

# If you double click the word iqlead, the dataset opens in tabular form in the
# upper left window so you can browse the data.  You can close this tab 
# whenever you want by clicking the x next to the tab name in this window.

# To calculate the mean, we use the fuction mean().  If we want to calculate 
# the mean of the variable IQF, we need to indicate the dataset we want 
# (iqlead) and the variable we want (IQF) with a dollar sign between:

mean(iqlead$IQF)

# If we want to calculate the mean for just the low lead blood level group 
# (LEAD = 1):

mean(iqlead$IQF[iqlead$LEAD==1])

# If we want to know how many observations are in the low lead blood level 
# group:

length(iqlead$IQF[iqlead$LEAD==1])

# If we want to calculate the mean for just the medium lead blood level group 
# (LEAD = 2):

mean(iqlead$IQF[iqlead$LEAD==2])

# If we want to know how many observations are in the medium blood level group:

length(iqlead$IQF[iqlead$LEAD==2])

# If we want to calculate the mean for just the high lead blood level group 
# (LEAD = 3):

mean(iqlead$IQF[iqlead$LEAD==3])

# If we want to know how many observations are in the high lead blood level 
# group:

length(iqlead$IQF[iqlead$LEAD==3])

# It might be interesting to determine if the differences in the means of these
# differenct blood level groups are statistically significant.

# PRACTICE 
# 1. Calculate the mean age.
# 2. Calculate the mean age of the females.
# 3. Calculate how many of the ages are from females.

# Let's take a look at the mean() function in the help.  In the bottom right 
# pane, click on the help tab and type mean into the search window and click 
# return.  We have been providing only one argument, named x in the help, we 
# have been passing the function the vector name we want to find the length of.
# Our dataset is not missing any values, there are 121 values for each of the 8
# variables. This is not always the case.  There is a dataset automatically 
# loaded with R called airquality.  It contains daily air quality measurements
# in New York in 1973. Let's looks at this dataset:

airquality

# Notice that there are some NA's in Ozone column.  Let's try to take the mean
# Ozone:

mean(airquality$Ozone)

# An NA is retuned, indicating there is at least one NA value.  The mean() 
# function has an argument called na.rm whose default value is FALSE (don't 
# ignore).  You can see this in the Uage and Arguments sections of the help for
# the function mean. Let's add the argement na.rm=T to the function.  You can 
# use T instead of typing out TRUE and F instead of typing out False.

mean(airquality$Ozone,na.rm=T)

# And now the mean is calculated, ignoring the NA's.  Another argument is 
# available, trim, whose default is 0. This argument allows you to take the 
# trimmed mean.

# PRACTICE 
# 4. How many observations of the Wind variable are there in the airquality 
# dataset?
# 5. Calculate the mean of the Wind variable in the airquality dataset (ignore
# NA's if necessary)
# 6. Trim 5% of the of the observations from each end and calculate the trimmed
# mean of the Wind variable.

# If we want to calculate the median, we use the median() function:

median(iqlead$IQF)

# To calculate the sample standard deviation, we use the sd() function:

sd(iqlead$IQF)

# The median() and sd() functions also have the na.rm arguement available.  
# Feel free to look them up in the Help tab to confirm this.

# We can create a histogram using the hist() function.

hist(iqlead$IQF)  

# Histograms can help us get a sense of whether or not our dataset has an 
# approximately normal distribution.

# Look up hist in the help tab. Take a quick look at the arguments available.

# We could label our x-axis and give our graph a title using the main and xlab
# arguments of hist().  Notice that to keep things neat, I have put each 
# argument on a new line so I do not have to scroll right and left to see the 
# whole line of code.  You can run multiple lines of code at the same time
# by highlighting them before clicking run:

hist(iqlead$IQF,
     main = "Full IQ Scores of Children Living Near Lead Smelter",
     xlab = "Full IQ Score (in IQ points)")

# Let's make a histogram but only for the low lead level group

hist(iqlead$IQF[iqlead$LEAD==1],
     main = "Full IQ cores of Children in the Low Lead Level Group",
     xlab = "Full IQ Score (in IQ points)")

# We can also set the boundaries of the bins of our histogram ourselves.  Let's
# take another look setting our own bin boundaries using the breaks argument:

hist(iqlead$IQF[iqlead$LEAD==1], 
     main = "Full IQ cores of Children in the Low Lead Level Group",
     xlab = "Full IQ Score (in IQ points)",
     breaks=c(30,40,50,60,70,80,90,100,110,120,130,140,150,160))

# Notice, we used the c() function to create a vector of the numbers of our bin
# breaks.

# PRACTICE 
# 7. Generate a histogram for the Wind variable in the airquality dataset.
# 8. Generate a histogram for the Wind variable in the airquality dataset but 
# only for the month of June (Month==6).
# 9. Add the argument freq=F to either of your histograms.  What did this 
# change on your histogram?

# Let's re-run the histogram for the IQF variable in the iqlead dataset for the
# low lead group:

hist(iqlead$IQF[iqlead$LEAD==1])

# This data looks somewhat normally distributed.  A plot known as a Normal 
# Quantile Plot, or a Quantile-Quantile Plot, can help with that.  We can use 
# the function qqnorm() to plot one:

qqnorm(iqlead$IQF[iqlead$LEAD==1])

# If we follow this up with the command qqline():

qqline(iqlead$IQF[iqlead$LEAD==1])

# If the data is normally distributed, the QQ plot points should be reasonably
# close to the qqline.

# For comparison purposes, let's have R pick 78 numbers at random from a 
# normal distribution with a mean of 93 and and standard deviation of 15 and 
# store them in a vector called norm78.  I picked these values for the 
# arguments to be similar to the IQF variable in the low lead data group.

norm78 <- rnorm(78, mean=93, sd=15)

# Feel free to look up the rnorm() function in the help tab for more 
# information.  Now, let's make a histogram (each of ours will be different 
# since we will each have a different set of random numbers selected from the 
# normal distribution.)

hist(norm78)

# Now, let's run qqnorm() and qqline() on norm78.

qqnorm(norm78)

qqline(norm78)

# PRACTICE
# 10. Re-run your histogram for the Wind variable in the airquality dataset.  
# Does the distribution look approximately normal?
# 11. Run qqnorm() and qqline() for the Wind variable in the airquality 
# dataset.
# 12. Create a vector of values selected randomly from a normal distribution 
# that has approximately the same mean and standard deviation as the Wind 
# variable in the airquality dataset.  Make the sample size the same as well.
# Then generate a histogram and q qqplot with the qqline added for this data.

# The summary() function gives us nice summary information about our variables 
# within our dataset:

summary(iqlead)

# It is a nice place to start when we are looking at a dataset for the first 
# time. It can also help us spot errors with out import into R.

# We can make boxplots using the boxplot() function:

boxplot(iqlead$IQF)

# And we can break our boxplots up by our lead level groups:

boxplot(iqlead$IQF~iqlead$LEAD)

# We can add titles and labeling the same way we did for the histogram 
# function:

boxplot(iqlead$IQF~iqlead$LEAD,
        main = "Full IQ scores of Children Living Near a Lead Smelter",
        xlab = "Lead Blood Level (1 = Low, 2 = Medium, 3 = High)",
        ylab = "Full IQ Score (in IQ points)")

# For a closer look at any graph we can go to View -> Panes -> Zoom Plots.
# We can use the Export option in this window to save the graphic as an image 
# file or pdf. OR copy it to the clipboard to paste into a word document, 
# powerpoint, etc.

# To get out of this view, we can go to View -> Panes - > Show All Panes

# It looks like perhaps the low lead level group has a higher mean IQ than the 
# high lead level group.  Is this diference statistically significant? 
# Performing a t-test on the two means could be helpful.

t.test(iqlead$IQF[iqlead$LEAD == 1],iqlead$IQF[iqlead$LEAD == 3])

# We have not checked if our dataset meets the conditions for running a t test,
# but assuming the conditions for running the t-test are satisfied:
# null hypothesis: true difference in means is equal to 0
# alternate hypothesis: true difference of means is not equal to 0
# With a p-value of 0.02638, there is a probability of about 2.6% that these 
# random samples could be collected from populations where the two means were 
# equal.  For a common significance level of .1 or .05, we would reject the 
# null hypothesis and conclude there is sufficient evidence to claim there is a
# difference in the mean IQ for the two groups.  For a significance level of 
# .01, we would fail to reject the null and not have enough evidence to support
# the claim.

# You can read more about the t.test function by typing t.test in the Help 
# search box.

# PRACTICE
# 13. Run the summary() function on the airquality dataset.
# 14. Make boxplots of the Wind variable in the airquality dataset broken out 
# by Month.
# 15. Run a t-test on the Wind variable for May and July, then one for July and
# August. What do you observe?

# The t.test function does a lot.  If you look at the help for t.test, you can
# do one and two tailed tests, one or two variable tests, paired or unpaired, 
# variances unequal or pooled, etc.  You can also have the t.test function 
# calculate a confidence interval for a particular confidence level.  For 
# example:

t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.9)

t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.95)

t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.99)

# You can ignore the other parts of the output and focus on the confidence 
# interval section.  As you would expect, as the increase the confidence level,
# the width of the confidence interval increases.