# Introduction to R # Any line that starts with the pound sign is a comment, and is ignored when it # gets run. # R makes many simple and advanced statistical methods readily available and # easy to use. As a beginner in R, you need to think a bit like a programmer, # which is why R can be considered "hard." # R can be used as a calculator: 1+1 # When you run a line of code, you can see the line show up in the console # (bottom left screen) and then any output is diplayed. # You can use <- to assign a value to a variable: x <- 5 # You need to run lines for code in order for them to be executed. Once you # run the line, you see the value for x show up in the upper right hand pane in # the environment tab. # We could now use that variable: x + 9 # Usually, we will want to store more than one value of a variable, we will get # to that shortly. # Your working directory is where R will look for files (datasets and R # scripts) by default. You can find where that location is currently using the # getwd() function: getwd() # You can set your working directory using the setwd() function or going to: # Session -> Set Working Directory -> Choose Directory... and navigating to the # location your want. It is good to create a dedicated folder for your work on # a particular project. Keep the folder name simple and put it in a location # it is easy for you to locate. # Now, you can see all the files in this folder in the Files tab in the lower # right hand pane. Let's save this file as firstscript.R. You may need to # refresh the bottom right pane, but now you should be able to see your script # firstscript.R. Looking at the tab in the upper left pane, any time the # filename is red, you know you have made changes that have not yet been saved. # When you save, the filename turns black. # A vector is the building block for handling a collection of observations of a # single variable. For a numeric vector: numvec <- c(1,23,506,22) # We have used the c() function to create this vector. We pass information to # the function via arguments, separated by commas. # Notice that the numvec vector showed up in the top right pane under the # environment tab. If we type numvec, the elements will be returned in the # console: numvec # The [1] in front of the list of numbers is the element number of the first # element on that line. You can typically ignore this value. It can be # helpful if the number of elements is large. # The length() function will tell you how many elements are in the numvec # vector: length(numvec) # We have used the length() function and passed the argument numvec to # determine the length (number of elements) in the vector numvec. In the # bottom right pane, click on the Help tab and type length in the search window # and click enter. The help tab provides information about every function that # is currently installed. The information can get a little computer # programming heavy, but it provides a great deal of information. Googling # various R functions can also be extremely helpful, and sometimes provides # answers in more "layman's" terms. Notice, help gives us a description, gives # us the possible arguments of the function, and examples of code using the # function, among other things. # For a vector of character strings (words), the same command it used, and we # need to put double quotes around each element: charvec <- c("Sol","Mary","D","Kim") # Notice that the charvec vector showed up in the top right pane under the # environment tab. If we type charvec, the elements will be returned: charvec # R is case sensitive, CharVEC is different than charvec. Keep things simple. # It will save you lots of headaches. # We can use the length() function for the charvec object as well: length(charvec) # Often, we will import data from another source. One common filetype is a .csv # file. Get the iqlead.csv file from http://faculty.cascadia.edu/rburke/ and # put it in your working directory (the folder you created earlier to put your # R stuff in) # Now, we are going to import the data contained in this .csv file into an # object (specifically a dataframe) we are going to call iqlead using the # read.csv() function. iqlead <- read.csv("iqlead.csv") # Data are measured from children in two consecutive years, and the children # were living close to a lead smelter. LEAD is a blood lead level group, # 1 = low lead level, 2 = medium lead level, 3 = high lead level. AGE is age # in years. SEX is sex of subject 1 = male, 2 = female. YEAR1 is blood lead # level in first year and YEAR2 is blood lead level in second year. IQV is # measured verbal IQ score. IQP is measured performance IQ score. IQF is # measured full IQ score. # Notice the object iqlead has shown up in the upper right pane in the # Environment Tab. If you click the little arrow to the left of iqlead, you # can see that this data set is made up of 121 observations of 8 variables. # The variables are described above and are all integer values. # If you double click the word iqlead, the dataset opens in tabular form in the # upper left window so you can browse the data. You can close this tab # whenever you want by clicking the x next to the tab name in this window. # To calculate the mean, we use the fuction mean(). If we want to calculate # the mean of the variable IQF, we need to indicate the dataset we want # (iqlead) and the variable we want (IQF) with a dollar sign between: mean(iqlead$IQF) # If we want to calculate the mean for just the low lead blood level group # (LEAD = 1): mean(iqlead$IQF[iqlead$LEAD==1]) # If we want to know how many observations are in the low lead blood level # group: length(iqlead$IQF[iqlead$LEAD==1]) # If we want to calculate the mean for just the medium lead blood level group # (LEAD = 2): mean(iqlead$IQF[iqlead$LEAD==2]) # If we want to know how many observations are in the medium blood level group: length(iqlead$IQF[iqlead$LEAD==2]) # If we want to calculate the mean for just the high lead blood level group # (LEAD = 3): mean(iqlead$IQF[iqlead$LEAD==3]) # If we want to know how many observations are in the high lead blood level # group: length(iqlead$IQF[iqlead$LEAD==3]) # It might be interesting to determine if the differences in the means of these # differenct blood level groups are statistically significant. # PRACTICE # 1. Calculate the mean age. # 2. Calculate the mean age of the females. # 3. Calculate how many of the ages are from females. # Let's take a look at the mean() function in the help. In the bottom right # pane, click on the help tab and type mean into the search window and click # return. We have been providing only one argument, named x in the help, we # have been passing the function the vector name we want to find the length of. # Our dataset is not missing any values, there are 121 values for each of the 8 # variables. This is not always the case. There is a dataset automatically # loaded with R called airquality. It contains daily air quality measurements # in New York in 1973. Let's looks at this dataset: airquality # Notice that there are some NA's in Ozone column. Let's try to take the mean # Ozone: mean(airquality$Ozone) # An NA is retuned, indicating there is at least one NA value. The mean() # function has an argument called na.rm whose default value is FALSE (don't # ignore). You can see this in the Uage and Arguments sections of the help for # the function mean. Let's add the argement na.rm=T to the function. You can # use T instead of typing out TRUE and F instead of typing out False. mean(airquality$Ozone,na.rm=T) # And now the mean is calculated, ignoring the NA's. Another argument is # available, trim, whose default is 0. This argument allows you to take the # trimmed mean. # PRACTICE # 4. How many observations of the Wind variable are there in the airquality # dataset? # 5. Calculate the mean of the Wind variable in the airquality dataset (ignore # NA's if necessary) # 6. Trim 5% of the of the observations from each end and calculate the trimmed # mean of the Wind variable. # If we want to calculate the median, we use the median() function: median(iqlead$IQF) # To calculate the sample standard deviation, we use the sd() function: sd(iqlead$IQF) # The median() and sd() functions also have the na.rm arguement available. # Feel free to look them up in the Help tab to confirm this. # We can create a histogram using the hist() function. hist(iqlead$IQF) # Histograms can help us get a sense of whether or not our dataset has an # approximately normal distribution. # Look up hist in the help tab. Take a quick look at the arguments available. # We could label our x-axis and give our graph a title using the main and xlab # arguments of hist(). Notice that to keep things neat, I have put each # argument on a new line so I do not have to scroll right and left to see the # whole line of code. You can run multiple lines of code at the same time # by highlighting them before clicking run: hist(iqlead$IQF, main = "Full IQ Scores of Children Living Near Lead Smelter", xlab = "Full IQ Score (in IQ points)") # Let's make a histogram but only for the low lead level group hist(iqlead$IQF[iqlead$LEAD==1], main = "Full IQ cores of Children in the Low Lead Level Group", xlab = "Full IQ Score (in IQ points)") # We can also set the boundaries of the bins of our histogram ourselves. Let's # take another look setting our own bin boundaries using the breaks argument: hist(iqlead$IQF[iqlead$LEAD==1], main = "Full IQ cores of Children in the Low Lead Level Group", xlab = "Full IQ Score (in IQ points)", breaks=c(30,40,50,60,70,80,90,100,110,120,130,140,150,160)) # Notice, we used the c() function to create a vector of the numbers of our bin # breaks. # PRACTICE # 7. Generate a histogram for the Wind variable in the airquality dataset. # 8. Generate a histogram for the Wind variable in the airquality dataset but # only for the month of June (Month==6). # 9. Add the argument freq=F to either of your histograms. What did this # change on your histogram? # Let's re-run the histogram for the IQF variable in the iqlead dataset for the # low lead group: hist(iqlead$IQF[iqlead$LEAD==1]) # This data looks somewhat normally distributed. A plot known as a Normal # Quantile Plot, or a Quantile-Quantile Plot, can help with that. We can use # the function qqnorm() to plot one: qqnorm(iqlead$IQF[iqlead$LEAD==1]) # If we follow this up with the command qqline(): qqline(iqlead$IQF[iqlead$LEAD==1]) # If the data is normally distributed, the QQ plot points should be reasonably # close to the qqline. # For comparison purposes, let's have R pick 78 numbers at random from a # normal distribution with a mean of 93 and and standard deviation of 15 and # store them in a vector called norm78. I picked these values for the # arguments to be similar to the IQF variable in the low lead data group. norm78 <- rnorm(78, mean=93, sd=15) # Feel free to look up the rnorm() function in the help tab for more # information. Now, let's make a histogram (each of ours will be different # since we will each have a different set of random numbers selected from the # normal distribution.) hist(norm78) # Now, let's run qqnorm() and qqline() on norm78. qqnorm(norm78) qqline(norm78) # PRACTICE # 10. Re-run your histogram for the Wind variable in the airquality dataset. # Does the distribution look approximately normal? # 11. Run qqnorm() and qqline() for the Wind variable in the airquality # dataset. # 12. Create a vector of values selected randomly from a normal distribution # that has approximately the same mean and standard deviation as the Wind # variable in the airquality dataset. Make the sample size the same as well. # Then generate a histogram and q qqplot with the qqline added for this data. # The summary() function gives us nice summary information about our variables # within our dataset: summary(iqlead) # It is a nice place to start when we are looking at a dataset for the first # time. It can also help us spot errors with out import into R. # We can make boxplots using the boxplot() function: boxplot(iqlead$IQF) # And we can break our boxplots up by our lead level groups: boxplot(iqlead$IQF~iqlead$LEAD) # We can add titles and labeling the same way we did for the histogram # function: boxplot(iqlead$IQF~iqlead$LEAD, main = "Full IQ scores of Children Living Near a Lead Smelter", xlab = "Lead Blood Level (1 = Low, 2 = Medium, 3 = High)", ylab = "Full IQ Score (in IQ points)") # For a closer look at any graph we can go to View -> Panes -> Zoom Plots. # We can use the Export option in this window to save the graphic as an image # file or pdf. OR copy it to the clipboard to paste into a word document, # powerpoint, etc. # To get out of this view, we can go to View -> Panes - > Show All Panes # It looks like perhaps the low lead level group has a higher mean IQ than the # high lead level group. Is this diference statistically significant? # Performing a t-test on the two means could be helpful. t.test(iqlead$IQF[iqlead$LEAD == 1],iqlead$IQF[iqlead$LEAD == 3]) # We have not checked if our dataset meets the conditions for running a t test, # but assuming the conditions for running the t-test are satisfied: # null hypothesis: true difference in means is equal to 0 # alternate hypothesis: true difference of means is not equal to 0 # With a p-value of 0.02638, there is a probability of about 2.6% that these # random samples could be collected from populations where the two means were # equal. For a common significance level of .1 or .05, we would reject the # null hypothesis and conclude there is sufficient evidence to claim there is a # difference in the mean IQ for the two groups. For a significance level of # .01, we would fail to reject the null and not have enough evidence to support # the claim. # You can read more about the t.test function by typing t.test in the Help # search box. # PRACTICE # 13. Run the summary() function on the airquality dataset. # 14. Make boxplots of the Wind variable in the airquality dataset broken out # by Month. # 15. Run a t-test on the Wind variable for May and July, then one for July and # August. What do you observe? # The t.test function does a lot. If you look at the help for t.test, you can # do one and two tailed tests, one or two variable tests, paired or unpaired, # variances unequal or pooled, etc. You can also have the t.test function # calculate a confidence interval for a particular confidence level. For # example: t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.9) t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.95) t.test(iqlead$IQF[iqlead$LEAD == 1],conf.level=0.99) # You can ignore the other parts of the output and focus on the confidence # interval section. As you would expect, as the increase the confidence level, # the width of the confidence interval increases.