# (0) this is a first tutorial on R, focusing on getting data into R # from outside and doing simple descriptive analyses # (1) the comment character in R is the sharp or hash-tag sign # in what follows, lines beginning with no comment character # are commands you can run in your own R session, and lines # beginning with a comment character are the results of those commands # (2) the ls( ) function lists all objects defined in the current # R environment ls( ) # character(0) # this just means that nothing is there yet # (3) you can find out about any command with the help command: help( ls ) # ls package:base R Documentation # List Objects # Description: # 'ls' and 'objects' return a vector of character strings giving the # names of the objects in the specified environment. When invoked # with no argument at the top level prompt, 'ls' shows what data # sets and functions a user has defined. When invoked with no # argument inside a function, 'ls' returns the names of the # function's local variables. This is useful in conjunction with # 'browser'. # and so on # this is helpful if you already know the name of an R command and # you want to know what it does; the reverse type of help -- e.g., # 'how do i mess around with a matrix?' -- is not as well provided in R help( matrices ) # No documentation for 'matrices' in specified packages and libraries: # you could try '??matrices' ??matrices # Vignettes with name or keyword or title matching 'matrices' using fuzzy # matching: # Matrix::sparseModels Sparse Model Matrices # Type 'vignette("FOO", package="PKG")' to inspect entries 'PKG::FOO'. # Help files with alias or concept or title matching 'matrices' using # fuzzy matching: # base::matrix Matrices # base::prmatrix Print Matrices, Old-style # base::qr.X Reconstruct the Q, R, or X Matrices from a QR # Object # and so on # for more help, # (a) google the search string # R tutorial # and *many* helpful web pages and PDF files instantly appear # (b) inside R, run the command # demo( ) # to get some demonstrations of what R can do # (4) now i'll assign the IHGA-experiment control values to an object y.ihga.C <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 7, 7 ) # (5) i like to use naming conventions in R of the form # something.somethingelse.somethingyetelse. and so on # the dot . and the underscore _ are the two most popular delimeters # R allows you to use to create names with multiple parts; i could # have named this object # y_ihga_C # and R would have been equally pleased, but you have to use the shift key # to create the underscore and that's not needed for the dot, so # i use dots # in this case y.ihga.C means to me that this object stores the control (C) # outcome (y) values from the IHGA (ihga) experiment (note ihga in # lower case to again avoid having to use the shift key) # you can invent your own naming conventions, but whatever you do # i urge you to use long descriptive names: it's a form of commenting # your code, so that you know what things mean when you come back to them # a day or a week or a month or a year later # there's nothing worse than returning to a stored R environment four months # after you last looked at it and see something like this: # ls( ) # [1] "quack" "x.1" "x.2" "y.version.7" # when you've long forgotten what those variables mean # (6) the assigment operator # <- # assigns whatever's on the right side of the operator to the object # on the left side # if the thing on the left side doesn't previously exist in the # current R environment, it's created; if it does previously exist, # the new value writes over top of the old value # for example, if there were no object called x in the current # R environment, the command # x <- 1 # creates a constant or scalar called x in the current R environment # and stores the value 1 in it # if x already exists, the same command ( x <- 1 ) destroys whatever # value it used to have and replaces it with the integer 1 # an alternative assignment operator is available; instead of saying # x <- 1 # you can say # x = 1 # i don't recommend this alternative assignment operator, because # it can readily be confused with the other use of the = sign, # namely to ask R the true/false status of propositions such as ( x = 1 ) # (7) the c( ) function combines its arguments into a set; # if all of the arguments are numerical the result will look like a vector, # but it's not actually a vector inside R until you declare it to # have vector status (e.g., for the purpose of vector multiplication) # if you name an object on the command line without any other input, # R tells you its value in the current R environment y.ihga.C # [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 # [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 # [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 # [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 # [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 # [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 # [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 # [260] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 7 7 # the indices ([1], [38], and so on) are R's way of telling you that # (in this case) it thinks y.ihga.C is a set of elements whose first value # is 0, whose 38th value is also 0, ..., whose 223rd value is 2, and so on # (8) how many control values are there in the ihga data set? length( y.ihga.C ) # [1] 287 # the length( ) function returns the number of elements in a set # the dim( ) function similarly returns the dimension of a matrix, # but if you apply it to a set of numbers, such as y.ihga.C, which was # created with the assignment operator, it will tell you that the set # does not have vector or matrix status: dim( y.ihga.C ) # NULL # NULL is the null object in R # a natural command at this point would be n.y.ihga.C <- length( y.ihga.C ) # to create the new object n.y.ihga.C, which keeps track of the # control-group sample size # (9) you can combine creating an object and finding out what its value is # into a single command line, by chaining R commands as follows: print( n.y.ihga.C <- length( y.ihga.C ) ) # [1] 287 # this both creates n.y.ihga.C and prints out its value # (10) now i'm going to create a new object called x_3 and then get rid # of it, to show you how to delete objects from the current R environment x_3 <- 0 rm( x_3 ) # the rm( ) function removes objects (they can't be recovered unless # you saved them in some other R environment previously) y.ihga.T <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6 ) ls( ) # [1] "n.y.ihga.C" "y.ihga.C" "y.ihga.T" # (11) there's a number of ways to get data into R; here's another one: y.ihga <- list( C = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 7, 7 ), n.C = 287, T = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6 ), n.T = 285 ) ls( ) # [1] "n.y.ihga.C" "y.ihga" "y.ihga.C" "y.ihga.T" # (12) the str( ) function tells you the internal structure of the # objects listed as arguments in the function: str( y.ihga.C ) # num [1:287] 0 0 0 0 0 0 0 0 0 0 ... # this is R's way of saying that y.ihga.C is a list of numbers of length # 287 that begins with a lot of 0s str( y.ihga.T ) # num [1:285] 0 0 0 0 0 0 0 0 0 0 ... str( y.ihga ) # List of 4 # $ C : num [1:287] 0 0 0 0 0 0 0 0 0 0 ... # $ n.C: num 287 # $ T : num [1:285] 0 0 0 0 0 0 0 0 0 0 ... # $ n.T: num 285 # y.ihga is not a set, it's a type of data structure that R calls a *list* # the list( ) function used to create y.ihga tells R that the list structure # of y.ihga has 4 components # as R is trying to say with notation of the form # $ C : num [1:287] 0 0 0 0 0 0 0 0 0 0 ... , # you can access the C component of the y.ihga list with the $ operator: y.ihga$C # [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 # [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 # [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 # [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 # [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 # [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 # [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 # [260] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 7 7 ls( ) # [1] "n.y.ihga.C" "y.ihga" "y.ihga.C" "y.ihga.T" # to see if y.ihga$C and y.ihga.C are the same, you can use the == operator: y.ihga$C == y.ihga.C # [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE # [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE # and so on # [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE # [286] TRUE TRUE # == asks if the stuff on the left side of == is the same as the stuff on # the right; it was necessary to create the == operator in R because the # command # y.ihga$C = y.ihga.C # would have *assigned* the values of y.ihga.C to the values of y.ihga$C # rather than comparing them to see if they're the same # (13) yet another way to get data into R proceeds in four steps: (I) in # a text file outside R, write R code that reads the data in, either in the # set format in item (4) above with a command of the form # y.ihga.C <- c( 0, 0, and so on, 7 ) # or in the list format in item (11) above with a command of the form # y.ihga <- list( C = c( 0, 0, and so on, 7 ) and so on ) # (II) save the file in step (I) in a directory of your choosing (let's say # the file is called test-1.txt ); (III) start R in the directory specified # in step (II); and (IV) issue the command source( "test-1.txt" ) # this will read in and execute all of the commands in the file test-1.txt # (14) yet *another* way to get data into R proceeds in four steps: (I) in # a text file outside R, store a single set of data values (separated by # spaces or tab characters) that you want R to read and put into an object # (let's say (a) the file is called test-2.txt , (b) it contains the # control-group values in the ihga case study and (c) you want the object # into which these values are read to be called y.ihga.C ); (II) save # the file in step (I) in a directory of your choosing; (III) start R # in the directory specified in step (II); and (IV) issue the command y.ihga.C <- scan( "test-2.txt" ) # Read 287 items # this will read the data values in the file test-2.txt and store them # in the object called y.ihga.C # (15) inside R, to find out what directory you're in, you can issue the # command getwd( ) # "/projects/draper/Teaching/206/Winter-2014" # as the name implies, this command gets the name of the # current nworking directory and prints it out # to change the current working directory inside an R session, you # can use a command like the following: # setwd( "quack" ) # where quack is the full name of the directory you want to read from # and write into from now on # if you're running R under an operating system that has pull-down menus, # you can also change the current working directory that way; for example, # under windows in the pull-down File menu there's an option called # Change dir ... # (16) now how about some simple descriptive analyses of the ihga data set? # let's start with graphical summaries of the control data values # (16.1) i'll begin with an exploratory-data-analysis tool invented by # the great (non-bayesian) statistician john tukey (1915-2000) stem( y.ihga.C ) # The decimal point is at the | # 0 | 00000000000000000000000000000000000000000000000000000000000000000000+58 # 0 | # 1 | 00000000000000000000000000000000000000000000000000000000000000000000 # 1 | # 2 | 0000000000000000000000000000000000000000000000 # 2 | # 3 | 000000000000 # 3 | # 4 | 00000000 # 4 | # 5 | 0000 # 5 | # 6 | # 6 | # 7 | 00 # this creates a stem-and-leaf plot, which is like a histogram rotated # 90 degrees clockwise hist( y.ihga.C ) # this produces a histogram of the y.ihga.C variable, but it's not # a very good histogram, in two ways: (a) it's more useful to plot # histograms on the density scale (recall that on this scale, # relative frequency corresponds to the area under the histogram) # and (b) R's default choice of the bar widths and placements # doesn't correspond well to the actual values (0, 1, ..., 7) # i also don't like the main title (in bold), which adds no new information # to fix the first and third defects, just plot the histogram on the # density scale and omit the main title with the improved command hist( y.ihga.C, probability = T, main = '' ) # to fix the second defect, ask for help about the hist command: help( hist ) # hist package:graphics R Documentation # Histograms # Description: # The generic function 'hist' computes a histogram of the given data # values. If 'plot=TRUE', the resulting object of class # '"histogram"' is plotted by 'plot.histogram', before it is # returned. # Usage: # hist(x, ...) # ## Default S3 method: # hist(x, breaks = "Sturges", # freq = NULL, probability = !freq, # include.lowest = TRUE, right = TRUE, # density = NULL, angle = 45, col = NULL, border = NULL, # main = paste("Histogram of" , xname), # xlim = range(breaks), ylim = NULL, # xlab = xname, ylab, # and so on # i can use the breaks option to force the histogram bars to be # centered at the data values # first you need to know something about the colon : operator 0:8 # [1] 0 1 2 3 4 5 6 7 8 # in the form i:j it will generate the integers between i and j inclusive ( 0:8 ) - 0.5 # [1] -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 # by putting the breaks at -0.5, 0.5, ..., 7.5, this will force R to # center the bars correctly at the data values hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5 ) # this is now a correct histogram; i can make it slightly more # user-friendly by giving the horizontal axis a more informative name: hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5, xlab = 'Control Outcome Values in the IHGA Case Study' ) # note that you can continue an R command onto additional lines # the distribution has a long right-hand tail (otherwise known as # positive skew); since it keeps track of the number of occurrences # of a fairly rare phenomenon, it's natural to think of a Poisson # distribution as a possible first stochastic model for the # control data # (16.2) it's easy to do the same thing with the treatment data set: hist( y.ihga.T, probability = T, main = '', breaks = ( 0:8 ) - 0.5, xlab = 'Treatment Outcome Values in the IHGA Case Study' ) # it also has a vaguely Poisson shape # it's hard to compare the two distributions with two separate hist # commands, since the second histogram writes over top of the first # this can be fixed by plotting the two histograms on the same graph, # by putting one above the other # the par( ) function controls *many* aspects of plotting in R; # you'll want to get familiar with its options # this is the one we need at present: par( mfrow = c( 2, 1 ) ) # this tells R to reformat the plotting window into a matrix of plots # with 2 rows (the first argument to mfrow) and 1 column (the second # argument) # now i just reissue the two previous hist commands: hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5, xlab = 'Control Outcome Values in the IHGA Case Study' ) hist( y.ihga.T, probability = T, main = '', breaks = ( 0:8 ) - 0.5, xlab = 'Treatment Outcome Values in the IHGA Case Study' ) # this is a good plot: the horizontal and vertical axes are the # same in the two graphs, making them directly comparable # even so, it's a bit difficult to see how the distributions differ, # but if you look carefully you'll see that the treatment densities # at the low numbers of hospitalizations are a bit higher than # the control densities, and vice versa for the high numbers of # hospitalizations; this indicates that the mean in the treatment group # should be smaller than the mean in the control group (let's see) # (16.3) some simple numerical descriptive analyses: print( mean.y.ihga.C <- mean( y.ihga.C ) ) # [1] 0.9442509 print( mean.y.ihga.T <- mean( y.ihga.T ) ) # [1] 0.7684211 # the mean( ) function does just what its name implies ( mean.y.ihga.T - mean.y.ihga.C ) / mean.y.ihga.C # [1] -0.1862109 # this shows that the mean rate of hospitalizations per two years # in the T group was lower (descriptively) than the corresponding rate # for the C group by 18.6% =. 19% (=. means is approximately equal to) # note, however, the following oddity with relative change: ( mean.y.ihga.C - mean.y.ihga.T ) / mean.y.ihga.T # [1] 0.2288196 # it is also simultaneously true to conclude that the mean rate of # hospitalizations in the C group was higher (descriptively) than the # corresponding rate for the T group by 22.8% =. 23% # there's no contradiction here: ( B - A ) / A is not the same thing # as - ( A - B ) / B = ( B - A ) / B # ( B - A ) / A represents the amount by which B is larger (if + sign) # or smaller (if - sign) than A (the denominator) in relative terms # ( A - B ) / B represents the amount by which A is larger (if + sign) # or smaller (if - sign) than B (the denominator) in relative terms # the fact that these two things are different (but may not be # too different) is related to the fact that the taylor series expansion # of the function 1 / ( 1 + x ) around x = 0 has first-order # approximation ( 1 - x ): # 1 # ----- =. 1 - x # 1 + x # you can see that this is true by multiplying both sides by ( 1 + x ), # which gives the approximation 1 =. 1 - x**2, clearly only good if # x is close to 0 # the point of all of this is that the relative amount by which B is # larger (smaller) than A will only be close to the negative of the # relative amount by which A is smaller (larger) than B if A and B # are not too far apart, i.e., if ( B - A ) and ( A - B ) are close to 0 # (16.4) another simple descriptive analysis in this case study # casts doubt on the Poisson modeling idea mentioned above: c( sd( y.ihga.C ), sd( y.ihga.T ) ) # [1] 1.239089 1.008268 c( var( y.ihga.C ), var( y.ihga.T ) ) # [1] 1.535343 1.016605 # the sd( ) and var( ) (variance) functions do what their names suggest # so here's a little numerical descriptive summary table: # group # control treatment # mean 0.944 0.768 # sd 1.239 1.008 # variance 1.535 1.017 # to see why the Poisson is not a good fit in either group, you # have to remember an important fact about that distribution: # for the Poisson, the mean and variance are the same; in other words, # its variance-to-mean-ratio (VTMR) is 1 var( y.ihga.C ) / mean( y.ihga.C ) # [1] 1.62599 var( y.ihga.T ) / mean( y.ihga.T ) # [1] 1.322979 # the VTMRs in the two groups are 63% and 32% bigger than # they should be if the data values were like random draws from # Poisson distributions; this is pretty good evidence that # the Poisson is not a good probability model here