# (0) this is a first tutorial on R, focusing on getting data into R
# from outside and doing simple descriptive analyses
# (1) the comment character in R is the sharp or hash-tag sign
# in what follows, lines beginning with no comment character
# are commands you can run in your own R session, and lines
# beginning with a comment character are the results of those commands
# (2) the ls( ) function lists all objects defined in the current
# R environment
ls( )
# character(0)
# this just means that nothing is there yet
# (3) you can find out about any command with the help command:
help( ls )
# ls package:base R Documentation
# List Objects
# Description:
# 'ls' and 'objects' return a vector of character strings giving the
# names of the objects in the specified environment. When invoked
# with no argument at the top level prompt, 'ls' shows what data
# sets and functions a user has defined. When invoked with no
# argument inside a function, 'ls' returns the names of the
# function's local variables. This is useful in conjunction with
# 'browser'.
# and so on
# this is helpful if you already know the name of an R command and
# you want to know what it does; the reverse type of help -- e.g.,
# 'how do i mess around with a matrix?' -- is not as well provided in R
help( matrices )
# No documentation for 'matrices' in specified packages and libraries:
# you could try '??matrices'
??matrices
# Vignettes with name or keyword or title matching 'matrices' using fuzzy
# matching:
# Matrix::sparseModels Sparse Model Matrices
# Type 'vignette("FOO", package="PKG")' to inspect entries 'PKG::FOO'.
# Help files with alias or concept or title matching 'matrices' using
# fuzzy matching:
# base::matrix Matrices
# base::prmatrix Print Matrices, Old-style
# base::qr.X Reconstruct the Q, R, or X Matrices from a QR
# Object
# and so on
# for more help,
# (a) google the search string
# R tutorial
# and *many* helpful web pages and PDF files instantly appear
# (b) inside R, run the command
# demo( )
# to get some demonstrations of what R can do
# (4) now i'll assign the IHGA-experiment control values to an object
y.ihga.C <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
4, 5, 5, 5, 5, 7, 7 )
# (5) i like to use naming conventions in R of the form
# something.somethingelse.somethingyetelse. and so on
# the dot . and the underscore _ are the two most popular delimeters
# R allows you to use to create names with multiple parts; i could
# have named this object
# y_ihga_C
# and R would have been equally pleased, but you have to use the shift key
# to create the underscore and that's not needed for the dot, so
# i use dots
# in this case y.ihga.C means to me that this object stores the control (C)
# outcome (y) values from the IHGA (ihga) experiment (note ihga in
# lower case to again avoid having to use the shift key)
# you can invent your own naming conventions, but whatever you do
# i urge you to use long descriptive names: it's a form of commenting
# your code, so that you know what things mean when you come back to them
# a day or a week or a month or a year later
# there's nothing worse than returning to a stored R environment four months
# after you last looked at it and see something like this:
# ls( )
# [1] "quack" "x.1" "x.2" "y.version.7"
# when you've long forgotten what those variables mean
# (6) the assigment operator
# <-
# assigns whatever's on the right side of the operator to the object
# on the left side
# if the thing on the left side doesn't previously exist in the
# current R environment, it's created; if it does previously exist,
# the new value writes over top of the old value
# for example, if there were no object called x in the current
# R environment, the command
# x <- 1
# creates a constant or scalar called x in the current R environment
# and stores the value 1 in it
# if x already exists, the same command ( x <- 1 ) destroys whatever
# value it used to have and replaces it with the integer 1
# an alternative assignment operator is available; instead of saying
# x <- 1
# you can say
# x = 1
# i don't recommend this alternative assignment operator, because
# it can readily be confused with the other use of the = sign,
# namely to ask R the true/false status of propositions such as ( x = 1 )
# (7) the c( ) function combines its arguments into a set;
# if all of the arguments are numerical the result will look like a vector,
# but it's not actually a vector inside R until you declare it to
# have vector status (e.g., for the purpose of vector multiplication)
# if you name an object on the command line without any other input,
# R tells you its value in the current R environment
y.ihga.C
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
# [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
# [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [260] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 7 7
# the indices ([1], [38], and so on) are R's way of telling you that
# (in this case) it thinks y.ihga.C is a set of elements whose first value
# is 0, whose 38th value is also 0, ..., whose 223rd value is 2, and so on
# (8) how many control values are there in the ihga data set?
length( y.ihga.C )
# [1] 287
# the length( ) function returns the number of elements in a set
# the dim( ) function similarly returns the dimension of a matrix,
# but if you apply it to a set of numbers, such as y.ihga.C, which was
# created with the assignment operator, it will tell you that the set
# does not have vector or matrix status:
dim( y.ihga.C )
# NULL
# NULL is the null object in R
# a natural command at this point would be
n.y.ihga.C <- length( y.ihga.C )
# to create the new object n.y.ihga.C, which keeps track of the
# control-group sample size
# (9) you can combine creating an object and finding out what its value is
# into a single command line, by chaining R commands as follows:
print( n.y.ihga.C <- length( y.ihga.C ) )
# [1] 287
# this both creates n.y.ihga.C and prints out its value
# (10) now i'm going to create a new object called x_3 and then get rid
# of it, to show you how to delete objects from the current R environment
x_3 <- 0
rm( x_3 )
# the rm( ) function removes objects (they can't be recovered unless
# you saved them in some other R environment previously)
y.ihga.T <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 5, 6 )
ls( )
# [1] "n.y.ihga.C" "y.ihga.C" "y.ihga.T"
# (11) there's a number of ways to get data into R; here's another one:
y.ihga <- list(
C = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
4, 5, 5, 5, 5, 7, 7 ),
n.C = 287,
T = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 5, 6 ),
n.T = 285
)
ls( )
# [1] "n.y.ihga.C" "y.ihga" "y.ihga.C" "y.ihga.T"
# (12) the str( ) function tells you the internal structure of the
# objects listed as arguments in the function:
str( y.ihga.C )
# num [1:287] 0 0 0 0 0 0 0 0 0 0 ...
# this is R's way of saying that y.ihga.C is a list of numbers of length
# 287 that begins with a lot of 0s
str( y.ihga.T )
# num [1:285] 0 0 0 0 0 0 0 0 0 0 ...
str( y.ihga )
# List of 4
# $ C : num [1:287] 0 0 0 0 0 0 0 0 0 0 ...
# $ n.C: num 287
# $ T : num [1:285] 0 0 0 0 0 0 0 0 0 0 ...
# $ n.T: num 285
# y.ihga is not a set, it's a type of data structure that R calls a *list*
# the list( ) function used to create y.ihga tells R that the list structure
# of y.ihga has 4 components
# as R is trying to say with notation of the form
# $ C : num [1:287] 0 0 0 0 0 0 0 0 0 0 ... ,
# you can access the C component of the y.ihga list with the $ operator:
y.ihga$C
# [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
# [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
# [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [260] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 7 7
ls( )
# [1] "n.y.ihga.C" "y.ihga" "y.ihga.C" "y.ihga.T"
# to see if y.ihga$C and y.ihga.C are the same, you can use the == operator:
y.ihga$C == y.ihga.C
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# and so on
# [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [286] TRUE TRUE
# == asks if the stuff on the left side of == is the same as the stuff on
# the right; it was necessary to create the == operator in R because the
# command
# y.ihga$C = y.ihga.C
# would have *assigned* the values of y.ihga.C to the values of y.ihga$C
# rather than comparing them to see if they're the same
# (13) yet another way to get data into R proceeds in four steps: (I) in
# a text file outside R, write R code that reads the data in, either in the
# set format in item (4) above with a command of the form
# y.ihga.C <- c( 0, 0, and so on, 7 )
# or in the list format in item (11) above with a command of the form
# y.ihga <- list( C = c( 0, 0, and so on, 7 ) and so on )
# (II) save the file in step (I) in a directory of your choosing (let's say
# the file is called test-1.txt ); (III) start R in the directory specified
# in step (II); and (IV) issue the command
source( "test-1.txt" )
# this will read in and execute all of the commands in the file test-1.txt
# (14) yet *another* way to get data into R proceeds in four steps: (I) in
# a text file outside R, store a single set of data values (separated by
# spaces or tab characters) that you want R to read and put into an object
# (let's say (a) the file is called test-2.txt , (b) it contains the
# control-group values in the ihga case study and (c) you want the object
# into which these values are read to be called y.ihga.C ); (II) save
# the file in step (I) in a directory of your choosing; (III) start R
# in the directory specified in step (II); and (IV) issue the command
y.ihga.C <- scan( "test-2.txt" )
# Read 287 items
# this will read the data values in the file test-2.txt and store them
# in the object called y.ihga.C
# (15) inside R, to find out what directory you're in, you can issue the
# command
getwd( )
# "/projects/draper/Teaching/206/Winter-2014"
# as the name implies, this command gets the name of the
# current nworking directory and prints it out
# to change the current working directory inside an R session, you
# can use a command like the following:
# setwd( "quack" )
# where quack is the full name of the directory you want to read from
# and write into from now on
# if you're running R under an operating system that has pull-down menus,
# you can also change the current working directory that way; for example,
# under windows in the pull-down File menu there's an option called
# Change dir ...
# (16) now how about some simple descriptive analyses of the ihga data set?
# let's start with graphical summaries of the control data values
# (16.1) i'll begin with an exploratory-data-analysis tool invented by
# the great (non-bayesian) statistician john tukey (1915-2000)
stem( y.ihga.C )
# The decimal point is at the |
# 0 | 00000000000000000000000000000000000000000000000000000000000000000000+58
# 0 |
# 1 | 00000000000000000000000000000000000000000000000000000000000000000000
# 1 |
# 2 | 0000000000000000000000000000000000000000000000
# 2 |
# 3 | 000000000000
# 3 |
# 4 | 00000000
# 4 |
# 5 | 0000
# 5 |
# 6 |
# 6 |
# 7 | 00
# this creates a stem-and-leaf plot, which is like a histogram rotated
# 90 degrees clockwise
hist( y.ihga.C )
# this produces a histogram of the y.ihga.C variable, but it's not
# a very good histogram, in two ways: (a) it's more useful to plot
# histograms on the density scale (recall that on this scale,
# relative frequency corresponds to the area under the histogram)
# and (b) R's default choice of the bar widths and placements
# doesn't correspond well to the actual values (0, 1, ..., 7)
# i also don't like the main title (in bold), which adds no new information
# to fix the first and third defects, just plot the histogram on the
# density scale and omit the main title with the improved command
hist( y.ihga.C, probability = T, main = '' )
# to fix the second defect, ask for help about the hist command:
help( hist )
# hist package:graphics R Documentation
# Histograms
# Description:
# The generic function 'hist' computes a histogram of the given data
# values. If 'plot=TRUE', the resulting object of class
# '"histogram"' is plotted by 'plot.histogram', before it is
# returned.
# Usage:
# hist(x, ...)
# ## Default S3 method:
# hist(x, breaks = "Sturges",
# freq = NULL, probability = !freq,
# include.lowest = TRUE, right = TRUE,
# density = NULL, angle = 45, col = NULL, border = NULL,
# main = paste("Histogram of" , xname),
# xlim = range(breaks), ylim = NULL,
# xlab = xname, ylab,
# and so on
# i can use the breaks option to force the histogram bars to be
# centered at the data values
# first you need to know something about the colon : operator
0:8
# [1] 0 1 2 3 4 5 6 7 8
# in the form i:j it will generate the integers between i and j inclusive
( 0:8 ) - 0.5
# [1] -0.5 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5
# by putting the breaks at -0.5, 0.5, ..., 7.5, this will force R to
# center the bars correctly at the data values
hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5 )
# this is now a correct histogram; i can make it slightly more
# user-friendly by giving the horizontal axis a more informative name:
hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5,
xlab = 'Control Outcome Values in the IHGA Case Study' )
# note that you can continue an R command onto additional lines
# the distribution has a long right-hand tail (otherwise known as
# positive skew); since it keeps track of the number of occurrences
# of a fairly rare phenomenon, it's natural to think of a Poisson
# distribution as a possible first stochastic model for the
# control data
# (16.2) it's easy to do the same thing with the treatment data set:
hist( y.ihga.T, probability = T, main = '', breaks = ( 0:8 ) - 0.5,
xlab = 'Treatment Outcome Values in the IHGA Case Study' )
# it also has a vaguely Poisson shape
# it's hard to compare the two distributions with two separate hist
# commands, since the second histogram writes over top of the first
# this can be fixed by plotting the two histograms on the same graph,
# by putting one above the other
# the par( ) function controls *many* aspects of plotting in R;
# you'll want to get familiar with its options
# this is the one we need at present:
par( mfrow = c( 2, 1 ) )
# this tells R to reformat the plotting window into a matrix of plots
# with 2 rows (the first argument to mfrow) and 1 column (the second
# argument)
# now i just reissue the two previous hist commands:
hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5,
xlab = 'Control Outcome Values in the IHGA Case Study' )
hist( y.ihga.T, probability = T, main = '', breaks = ( 0:8 ) - 0.5,
xlab = 'Treatment Outcome Values in the IHGA Case Study' )
# this is a good plot: the horizontal and vertical axes are the
# same in the two graphs, making them directly comparable
# even so, it's a bit difficult to see how the distributions differ,
# but if you look carefully you'll see that the treatment densities
# at the low numbers of hospitalizations are a bit higher than
# the control densities, and vice versa for the high numbers of
# hospitalizations; this indicates that the mean in the treatment group
# should be smaller than the mean in the control group (let's see)
# (16.3) some simple numerical descriptive analyses:
print( mean.y.ihga.C <- mean( y.ihga.C ) )
# [1] 0.9442509
print( mean.y.ihga.T <- mean( y.ihga.T ) )
# [1] 0.7684211
# the mean( ) function does just what its name implies
( mean.y.ihga.T - mean.y.ihga.C ) / mean.y.ihga.C
# [1] -0.1862109
# this shows that the mean rate of hospitalizations per two years
# in the T group was lower (descriptively) than the corresponding rate
# for the C group by 18.6% =. 19% (=. means is approximately equal to)
# note, however, the following oddity with relative change:
( mean.y.ihga.C - mean.y.ihga.T ) / mean.y.ihga.T
# [1] 0.2288196
# it is also simultaneously true to conclude that the mean rate of
# hospitalizations in the C group was higher (descriptively) than the
# corresponding rate for the T group by 22.8% =. 23%
# there's no contradiction here: ( B - A ) / A is not the same thing
# as - ( A - B ) / B = ( B - A ) / B
# ( B - A ) / A represents the amount by which B is larger (if + sign)
# or smaller (if - sign) than A (the denominator) in relative terms
# ( A - B ) / B represents the amount by which A is larger (if + sign)
# or smaller (if - sign) than B (the denominator) in relative terms
# the fact that these two things are different (but may not be
# too different) is related to the fact that the taylor series expansion
# of the function 1 / ( 1 + x ) around x = 0 has first-order
# approximation ( 1 - x ):
# 1
# ----- =. 1 - x
# 1 + x
# you can see that this is true by multiplying both sides by ( 1 + x ),
# which gives the approximation 1 =. 1 - x**2, clearly only good if
# x is close to 0
# the point of all of this is that the relative amount by which B is
# larger (smaller) than A will only be close to the negative of the
# relative amount by which A is smaller (larger) than B if A and B
# are not too far apart, i.e., if ( B - A ) and ( A - B ) are close to 0
# (16.4) another simple descriptive analysis in this case study
# casts doubt on the Poisson modeling idea mentioned above:
c( sd( y.ihga.C ), sd( y.ihga.T ) )
# [1] 1.239089 1.008268
c( var( y.ihga.C ), var( y.ihga.T ) )
# [1] 1.535343 1.016605
# the sd( ) and var( ) (variance) functions do what their names suggest
# so here's a little numerical descriptive summary table:
# group
# control treatment
# mean 0.944 0.768
# sd 1.239 1.008
# variance 1.535 1.017
# to see why the Poisson is not a good fit in either group, you
# have to remember an important fact about that distribution:
# for the Poisson, the mean and variance are the same; in other words,
# its variance-to-mean-ratio (VTMR) is 1
var( y.ihga.C ) / mean( y.ihga.C )
# [1] 1.62599
var( y.ihga.T ) / mean( y.ihga.T )
# [1] 1.322979
# the VTMRs in the two groups are 63% and 32% bigger than
# they should be if the data values were like random draws from
# Poisson distributions; this is pretty good evidence that
# the Poisson is not a good probability model here