# (0) this is a first tutorial on R, focusing on getting data into R
# from outside and doing simple descriptive analyses

# (1) the comment character in R is the sharp or hash-tag sign

# in what follows, lines beginning with no comment character
# are commands you can run in your own R session, and lines
# beginning with a comment character are the results of those commands

# (2) the ls( ) function lists all objects defined in the current
# R environment

ls( )

# character(0)

# this just means that nothing is there yet

# (3) you can find out about any command with the help command:

help( ls )

# ls                package:base             R Documentation

# List Objects

# Description:

# 'ls' and 'objects' return a vector of character strings giving the
# names of the objects in the specified environment. When invoked
# with no argument at the top level prompt, 'ls' shows what data
# sets and functions a user has defined. When invoked with no
# argument inside a function, 'ls' returns the names of the
# function's local variables. This is useful in conjunction with
# 'browser'.

# and so on

# this is helpful if you already know the name of an R command and
# you want to know what it does; the reverse type of help -- e.g.,
# 'how do i mess around with a matrix?' -- is not as well provided in R

help( matrices )

# No documentation for 'matrices' in specified packages and libraries:
# you could try '??matrices'

??matrices

# Vignettes with name or keyword or title matching 'matrices' using fuzzy
# matching:

# Matrix::sparseModels    Sparse Model Matrices

# Type 'vignette("FOO", package="PKG")' to inspect entries 'PKG::FOO'.

# Help files with alias or concept or title matching 'matrices' using
# fuzzy matching:

# base::matrix            Matrices
# base::prmatrix          Print Matrices, Old-style
# base::qr.X              Reconstruct the Q, R, or X Matrices from a QR
#                         Object
# and so on

# for more help, 

# (a) google the search string

#   R tutorial

# and *many* helpful web pages and PDF files instantly appear

# (b) inside R, run the command

# demo( )

# to get some demonstrations of what R can do

# (4) now i'll assign the IHGA-experiment control values to an object

y.ihga.C <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
               4, 5, 5, 5, 5, 7, 7 )

# (5) i like to use naming conventions in R of the form

#   something.somethingelse.somethingyetelse. and so on

# the dot . and the underscore _ are the two most popular delimeters 
# R allows you to use to create names with multiple parts; i could
# have named this object

#  y_ihga_C

# and R would have been equally pleased, but you have to use the shift key
# to create the underscore and that's not needed for the dot, so
# i use dots

# in this case y.ihga.C means to me that this object stores the control (C)
# outcome (y) values from the IHGA (ihga) experiment (note ihga in 
# lower case to again avoid having to use the shift key)

# you can invent your own naming conventions, but whatever you do
# i urge you to use long descriptive names: it's a form of commenting 
# your code, so that you know what things mean when you come back to them 
# a day or a week or a month or a year later

# there's nothing worse than returning to a stored R environment four months
# after you last looked at it and see something like this:

#   ls( )

#   [1] "quack" "x.1" "x.2" "y.version.7"

# when you've long forgotten what those variables mean

# (6) the assigment operator 

#   <- 

# assigns whatever's on the right side of the operator to the object 
# on the left side

# if the thing on the left side doesn't previously exist in the
# current R environment, it's created; if it does previously exist,
# the new value writes over top of the old value

# for example, if there were no object called x in the current
# R environment, the command 

#   x <- 1

# creates a constant or scalar called x in the current R environment 
# and stores the value 1 in it

# if x already exists, the same command ( x <- 1 ) destroys whatever
# value it used to have and replaces it with the integer 1

# an alternative assignment operator is available; instead of saying

#  x <- 1

# you can say

#  x = 1

# i don't recommend this alternative assignment operator, because
# it can readily be confused with the other use of the = sign,
# namely to ask R the true/false status of propositions such as ( x = 1 )

# (7) the c( ) function combines its arguments into a set;
# if all of the arguments are numerical the result will look like a vector,
# but it's not actually a vector inside R until you declare it to
# have vector status (e.g., for the purpose of vector multiplication)

# if you name an object on the command line without any other input,
# R tells you its value in the current R environment

y.ihga.C

#   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
# [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
# [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [260] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 7 7

# the indices ([1], [38], and so on) are R's way of telling you that
# (in this case) it thinks y.ihga.C is a set of elements whose first value 
# is 0, whose 38th value is also 0, ..., whose 223rd value is 2, and so on

# (8) how many control values are there in the ihga data set?

length( y.ihga.C )

# [1] 287

# the length( ) function returns the number of elements in a set

# the dim( ) function similarly returns the dimension of a matrix,
# but if you apply it to a set of numbers, such as y.ihga.C, which was
# created with the assignment operator, it will tell you that the set
# does not have vector or matrix status:

dim( y.ihga.C )

# NULL

# NULL is the null object in R

# a natural command at this point would be

n.y.ihga.C <- length( y.ihga.C )

# to create the new object n.y.ihga.C, which keeps track of the
# control-group sample size

# (9) you can combine creating an object and finding out what its value is
# into a single command line, by chaining R commands as follows:

print( n.y.ihga.C <- length( y.ihga.C ) )

# [1] 287

# this both creates n.y.ihga.C and prints out its value

# (10) now i'm going to create a new object called x_3 and then get rid 
# of it, to show you how to delete objects from the current R environment

x_3 <- 0

rm( x_3 )

# the rm( ) function removes objects (they can't be recovered unless
# you saved them in some other R environment previously)

y.ihga.T <- c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
               0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
               1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
               2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
               4, 4, 4, 5, 6 )

ls( )

# [1] "n.y.ihga.C" "y.ihga.C" "y.ihga.T"

# (11) there's a number of ways to get data into R; here's another one:

y.ihga <- list( 

  C = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4,
         4, 5, 5, 5, 5, 7, 7 ), 

  n.C = 287,

  T = c( 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         4, 4, 4, 5, 6 ), 

  n.T = 285 

)

ls( )

# [1] "n.y.ihga.C" "y.ihga"   "y.ihga.C" "y.ihga.T"

# (12) the str( ) function tells you the internal structure of the
# objects listed as arguments in the function:

str( y.ihga.C )

#  num [1:287] 0 0 0 0 0 0 0 0 0 0 ...

# this is R's way of saying that y.ihga.C is a list of numbers of length
# 287 that begins with a lot of 0s

str( y.ihga.T )

# num [1:285] 0 0 0 0 0 0 0 0 0 0 ...

str( y.ihga )

# List of 4
#  $ C  : num [1:287] 0 0 0 0 0 0 0 0 0 0 ...
#  $ n.C: num 287
#  $ T  : num [1:285] 0 0 0 0 0 0 0 0 0 0 ...
#  $ n.T: num 285

# y.ihga is not a set, it's a type of data structure that R calls a *list*

# the list( ) function used to create y.ihga tells R that the list structure
# of y.ihga has 4 components

# as R is trying to say with notation of the form

#  $ C  : num [1:287] 0 0 0 0 0 0 0 0 0 0 ... ,

# you can access the C component of the y.ihga list with the $ operator:

y.ihga$C

#   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#  [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1
# [149] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# [186] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2
# [223] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
# [260] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5 7 7

ls( )

# [1] "n.y.ihga.C" "y.ihga"     "y.ihga.C"   "y.ihga.T"

# to see if y.ihga$C and y.ihga.C are the same, you can use the == operator:

y.ihga$C == y.ihga.C

#   [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#  [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#    and so on
# [271] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [286] TRUE TRUE

# == asks if the stuff on the left side of == is the same as the stuff on
# the right; it was necessary to create the == operator in R because the
# command

# y.ihga$C = y.ihga.C

# would have *assigned* the values of y.ihga.C to the values of y.ihga$C
# rather than comparing them to see if they're the same

# (13) yet another way to get data into R proceeds in four steps: (I) in
# a text file outside R, write R code that reads the data in, either in the 
# set format in item (4) above with a command of the form 

# y.ihga.C <- c( 0, 0, and so on, 7 )

# or in the list format in item (11) above with a command of the form

# y.ihga <- list( C = c( 0, 0, and so on, 7 ) and so on )

# (II) save the file in step (I) in a directory of your choosing (let's say
# the file is called test-1.txt ); (III) start R in the directory specified 
# in step (II); and (IV) issue the command

source( "test-1.txt" )

# this will read in and execute all of the commands in the file test-1.txt

# (14) yet *another* way to get data into R proceeds in four steps: (I) in
# a text file outside R, store a single set of data values (separated by
# spaces or tab characters) that you want R to read and put into an object
# (let's say (a) the file is called test-2.txt , (b) it contains the 
# control-group values in the ihga case study and (c) you want the object
# into which these values are read to be called y.ihga.C ); (II) save 
# the file in step (I) in a directory of your choosing; (III) start R 
# in the directory specified in step (II); and (IV) issue the command

y.ihga.C <- scan( "test-2.txt" )

# Read 287 items

# this will read the data values in the file test-2.txt and store them
# in the object called y.ihga.C

# (15) inside R, to find out what directory you're in, you can issue the
# command

getwd( )

# "/projects/draper/Teaching/206/Winter-2014"

# as the name implies, this command gets the name of the 
# current nworking directory and prints it out

# to change the current working directory inside an R session, you
# can use a command like the following:

# setwd( "quack" )

# where quack is the full name of the directory you want to read from
# and write into from now on

# if you're running R under an operating system that has pull-down menus, 
# you can also change the current working directory that way; for example,
# under windows in the pull-down File menu there's an option called

# Change dir ...

# (16) now how about some simple descriptive analyses of the ihga data set?
# let's start with graphical summaries of the control data values

# (16.1) i'll begin with an exploratory-data-analysis tool invented by
# the great (non-bayesian) statistician john tukey (1915-2000)

stem( y.ihga.C )

#   The decimal point is at the |

#   0 | 00000000000000000000000000000000000000000000000000000000000000000000+58
#   0 | 
#   1 | 00000000000000000000000000000000000000000000000000000000000000000000
#   1 | 
#   2 | 0000000000000000000000000000000000000000000000
#   2 | 
#   3 | 000000000000
#   3 | 
#   4 | 00000000
#   4 | 
#   5 | 0000
#   5 | 
#   6 | 
#   6 | 
#   7 | 00

# this creates a stem-and-leaf plot, which is like a histogram rotated
# 90 degrees clockwise

hist( y.ihga.C )

# this produces a histogram of the y.ihga.C variable, but it's not
# a very good histogram, in two ways: (a) it's more useful to plot
# histograms on the density scale (recall that on this scale,
# relative frequency corresponds to the area under the histogram)
# and (b) R's default choice of the bar widths and placements
# doesn't correspond well to the actual values (0, 1, ..., 7)

# i also don't like the main title (in bold), which adds no new information

# to fix the first and third defects, just plot the histogram on the 
# density scale and omit the main title with the improved command

hist( y.ihga.C, probability = T, main = '' )

# to fix the second defect, ask for help about the hist command:

help( hist )

# hist                 package:graphics                  R Documentation

# Histograms

# Description:

#      The generic function 'hist' computes a histogram of the given data
#      values.  If 'plot=TRUE', the resulting object of class
#      '"histogram"' is plotted by 'plot.histogram', before it is
#      returned.

# Usage:

#      hist(x, ...)
     
#      ## Default S3 method:
#      hist(x, breaks = "Sturges",
#           freq = NULL, probability = !freq,
#           include.lowest = TRUE, right = TRUE,
#           density = NULL, angle = 45, col = NULL, border = NULL,
#           main = paste("Histogram of" , xname),
#           xlim = range(breaks), ylim = NULL,
#           xlab = xname, ylab,

# and so on

# i can use the breaks option to force the histogram bars to be
# centered at the data values

# first you need to know something about the colon : operator

0:8

# [1] 0 1 2 3 4 5 6 7 8

# in the form i:j it will generate the integers between i and j inclusive

( 0:8 ) - 0.5

# [1] -0.5  0.5  1.5  2.5  3.5  4.5  5.5  6.5  7.5

# by putting the breaks at -0.5, 0.5, ..., 7.5, this will force R to
# center the bars correctly at the data values

hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5 )

# this is now a correct histogram; i can make it slightly more
# user-friendly by giving the horizontal axis a more informative name:

hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5,
  xlab = 'Control Outcome Values in the IHGA Case Study' )

# note that you can continue an R command onto additional lines

# the distribution has a long right-hand tail (otherwise known as
# positive skew); since it keeps track of the number of occurrences
# of a fairly rare phenomenon, it's natural to think of a Poisson
# distribution as a possible first stochastic model for the
# control data

# (16.2) it's easy to do the same thing with the treatment data set:

hist( y.ihga.T, probability = T, main = '', breaks = ( 0:8 ) - 0.5,
  xlab = 'Treatment Outcome Values in the IHGA Case Study' )

# it also has a vaguely Poisson shape

# it's hard to compare the two distributions with two separate hist
# commands, since the second histogram writes over top of the first

# this can be fixed by plotting the two histograms on the same graph,
# by putting one above the other

# the par( ) function controls *many* aspects of plotting in R; 
# you'll want to get familiar with its options

# this is the one we need at present:

par( mfrow = c( 2, 1 ) )

# this tells R to reformat the plotting window into a matrix of plots
# with 2 rows (the first argument to mfrow) and 1 column (the second
# argument)

# now i just reissue the two previous hist commands:

hist( y.ihga.C, probability = T, main = '', breaks = ( 0:8 ) - 0.5,
  xlab = 'Control Outcome Values in the IHGA Case Study' )

hist( y.ihga.T, probability = T, main = '', breaks = ( 0:8 ) - 0.5,
  xlab = 'Treatment Outcome Values in the IHGA Case Study' )

# this is a good plot: the horizontal and vertical axes are the
# same in the two graphs, making them directly comparable

# even so, it's a bit difficult to see how the distributions differ,
# but if you look carefully you'll see that the treatment densities
# at the low numbers of hospitalizations are a bit higher than
# the control densities, and vice versa for the high numbers of
# hospitalizations; this indicates that the mean in the treatment group
# should be smaller than the mean in the control group (let's see)

# (16.3) some simple numerical descriptive analyses:

print( mean.y.ihga.C <- mean( y.ihga.C ) )

# [1] 0.9442509 

print( mean.y.ihga.T <- mean( y.ihga.T ) )

# [1] 0.7684211

# the mean( ) function does just what its name implies

( mean.y.ihga.T - mean.y.ihga.C ) / mean.y.ihga.C

# [1] -0.1862109

# this shows that the mean rate of hospitalizations per two years
# in the T group was lower (descriptively) than the corresponding rate
# for the C group by 18.6% =. 19% (=. means is approximately equal to)

# note, however, the following oddity with relative change:

( mean.y.ihga.C - mean.y.ihga.T ) / mean.y.ihga.T

# [1] 0.2288196

# it is also simultaneously true to conclude that the mean rate of
# hospitalizations in the C group was higher (descriptively) than the
# corresponding rate for the T group by 22.8% =. 23%

# there's no contradiction here: ( B - A ) / A is not the same thing
# as - ( A - B ) / B = ( B - A ) / B

# ( B - A ) / A represents the amount by which B is larger (if + sign)
# or smaller (if - sign) than A (the denominator) in relative terms

# ( A - B ) / B represents the amount by which A is larger (if + sign)
# or smaller (if - sign) than B (the denominator) in relative terms

# the fact that these two things are different (but may not be 
# too different) is related to the fact that the taylor series expansion
# of the function 1 / ( 1 + x ) around x = 0 has first-order
# approximation ( 1 - x ):

#    1
#  ----- =. 1 - x
#  1 + x

# you can see that this is true by multiplying both sides by ( 1 + x ),
# which gives the approximation 1 =. 1 - x**2, clearly only good if
# x is close to 0

# the point of all of this is that the relative amount by which B is
# larger (smaller) than A will only be close to the negative of the 
# relative amount by which A is smaller (larger) than B if A and B
# are not too far apart, i.e., if ( B - A ) and ( A - B ) are close to 0

# (16.4) another simple descriptive analysis in this case study
# casts doubt on the Poisson modeling idea mentioned above:

c( sd( y.ihga.C ), sd( y.ihga.T ) )

# [1] 1.239089 1.008268

c( var( y.ihga.C ), var( y.ihga.T ) )

# [1] 1.535343 1.016605

# the sd( ) and var( ) (variance) functions do what their names suggest

# so here's a little numerical descriptive summary table:

#                 group
#            control treatment
# mean        0.944    0.768
# sd          1.239    1.008
# variance    1.535    1.017

# to see why the Poisson is not a good fit in either group, you 
# have to remember an important fact about that distribution:
# for the Poisson, the mean and variance are the same; in other words,
# its variance-to-mean-ratio (VTMR) is 1

var( y.ihga.C ) / mean( y.ihga.C )

# [1] 1.62599

var( y.ihga.T ) / mean( y.ihga.T )

# [1] 1.322979

# the VTMRs in the two groups are 63% and 32% bigger than
# they should be if the data values were like random draws from
# Poisson distributions; this is pretty good evidence that
# the Poisson is not a good probability model here