11: Hypothesis Testing

using computer simulation. Based on examples from the infer package. Code for Quiz 13.

Load the R packages we will use.
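The package list is not given here. As an assumption based on the functions used below (read_csv, skim, specify, ggplot), a minimal setup might be:

```r
# Assumed package set, inferred from the functions this document calls
library(tidyverse)  # read_csv(), %>%, group_by(), ggplot()
library(skimr)      # skim()
library(infer)      # specify(), hypothesize(), generate(), calculate()
```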

Question: t-test

set.seed(???)

SEE QUIZ is the name of your data subset

hr  <- read_csv("https://estanny.com/static/week13/data/???", 
                col_types = "fddfff") 

use skim to summarize the data in hr

???(???)

The mean hours worked per week is: ???

Q: Is the mean number of hours worked per week 48?

specify that hours is the variable of interest

hr  %>% 
  ???(response = ???)

hypothesize that the average hours worked is 48

???  %>% 
  specify(response = hours)  %>% 
  ???(null = "point", mu = ???)

generate 1000 replicates representing the null hypothesis

hr %>% 
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)  %>% 
  ???(reps = ???, type = "bootstrap") 

The output has ??? rows
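How the row count comes about can be checked on a different dataset. This sketch assumes the gss data that ships with infer (not the quiz data), a null mean of 40, and only 100 replicates to keep it quick. Each bootstrap replicate has as many rows as the original sample, so the output has reps times n rows.

```r
library(infer)
library(dplyr)

set.seed(123)  # arbitrary seed, not the quiz seed

# gss is an example dataset bundled with infer (assumption: not the quiz data)
spec <- gss %>%
  specify(response = hours)

n_obs <- nrow(spec)  # rows in the original sample

boot <- spec %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 100, type = "bootstrap")

# each replicate resamples n_obs rows, so the total is reps * n_obs
nrow(boot) == 100 * n_obs
```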


calculate the distribution of statistics from the generated data

???  <- hr  %>% 
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)  %>% 
  generate(reps = 1000, type = "bootstrap")  %>% 
  ???(stat = "???")

???

visualize the simulated null distribution

???(???)

calculate the statistic from your observed data

???  <- hr  %>%
  specify(response = hours)  %>% 
  hypothesize(null = "point", mu = 48)  %>%
  calculate(stat = "t")

observed_t_statistic

get_p_value from the simulated null distribution and the observed statistic

null_t_distribution  %>% 
  ???(obs_stat = ??? , direction = "two-sided")

shade_p_value on the simulated null distribution

null_t_distribution  %>% 
  visualize() +
  ???(obs_stat = ???, direction = "two-sided")

Is the p-value < 0.05? ??? (yes/no)

Does your analysis support the null hypothesis that the true mean number of hours worked was 48? ??? (yes/no)
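
For reference, the same one-sample steps can be run end to end on another dataset. This is a sketch only: it assumes the gss data bundled with infer and a null mean of 40 hours, and the object names (null_dist, obs_t, p_val, p_plot) are made up for the illustration. It does not give the quiz answers.

```r
library(infer)
library(dplyr)
library(ggplot2)

set.seed(123)  # arbitrary seed, not the quiz seed

# simulated null distribution of t statistics (gss and mu = 40 are assumptions)
null_dist <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "t")

# observed t statistic from the original sample
obs_t <- gss %>%
  specify(response = hours) %>%
  hypothesize(null = "point", mu = 40) %>%
  calculate(stat = "t")

# two-sided p-value: share of simulated statistics at least as extreme
p_val <- null_dist %>%
  get_p_value(obs_stat = obs_t, direction = "two-sided")

# null distribution with the p-value region shaded
p_plot <- visualize(null_dist) +
  shade_p_value(obs_stat = obs_t, direction = "two-sided")

p_val
```

A p-value below 0.05 would count as evidence against the null hypothesis at the 5% level. A larger one means the data are consistent with the null, which is not the same as proving it.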


Question: 2 sample t-test

SEE QUIZ is the name of your data subset

hr_2 <- read_csv("https://estanny.com/static/week13/data/???", 
                col_types = "fddfff") 

Q: Is the average number of hours worked the same for both genders?

use skim to summarize the data in hr_2 by gender

hr_2 %>% 
  group_by(???)  %>% 
  ???()

Use geom_boxplot to plot distributions of hours worked by gender

hr_2 %>% 
  ggplot(aes(x = gender, y = hours)) + 
  ???()
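
A grouped summary and boxplot of this kind can be tried on another dataset first. The sketch below assumes the gss data from infer, grouped by its college column instead of gender. The names by_college and hours_boxplot are made up for the example.

```r
library(infer)    # provides the gss example data
library(dplyr)
library(ggplot2)
library(skimr)

# summary statistics of hours within each level of college
by_college <- gss %>%
  group_by(college) %>%
  skim(hours)

# side-by-side boxplots of hours for each level of college
hours_boxplot <- gss %>%
  ggplot(aes(x = college, y = hours)) +
  geom_boxplot()

by_college
```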

specify that the variables of interest are hours and gender

hr_2 %>% 
  ???(response = ???, explanatory = gender)

hypothesize that the number of hours worked and gender are independent

???  %>% 
  specify(response = hours, explanatory = gender)  %>% 
  ???(null = "???")

generate 1000 replicates representing the null hypothesis

hr_2 %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")  %>% 
  ???(reps = ???, type = "permute") 

The output has ??? rows


calculate the distribution of statistics from the generated data

???  <- hr_2 %>% 
  specify(response = hours, explanatory = gender)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute")  %>% 
  ???(stat = "???", order = c("female", "male"))

???

visualize the simulated null distribution

???(???)

calculate the statistic from your observed data

???  <- hr_2 %>%
  specify(response = hours, explanatory = gender)  %>% 
  calculate(stat = "t", order = c("female", "male"))

???

get_p_value from the simulated null distribution and the observed statistic

null_t_distribution  %>% 
  ???(obs_stat = ??? , direction = "two-sided")

shade_p_value on the simulated null distribution

null_t_distribution  %>% 
  visualize() +
  ???(obs_stat = ???, direction = "two-sided")

Is the p-value < 0.05? ??? (yes/no)

Does your analysis support the null hypothesis that the true mean number of hours worked by female and male employees was the same? ??? (yes/no)
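
The two-sample steps can also be run end to end on another dataset. This sketch assumes the gss data bundled with infer and compares hours between its two college levels ("degree" and "no degree") instead of gender. The object names are made up, and none of this reflects the quiz data.

```r
library(infer)
library(dplyr)

set.seed(123)  # arbitrary seed, not the quiz seed

# permuting the response breaks any link between hours and college,
# which is what the independence null hypothesis says
null_dist_2 <- gss %>%
  specify(response = hours, explanatory = college) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "t", order = c("degree", "no degree"))

# observed t statistic, using the same group order
obs_t_2 <- gss %>%
  specify(response = hours, explanatory = college) %>%
  calculate(stat = "t", order = c("degree", "no degree"))

p_val_2 <- null_dist_2 %>%
  get_p_value(obs_stat = obs_t_2, direction = "two-sided")

p_val_2
```

The order argument fixes which group is subtracted from which. It has to match between the simulated and the observed statistic, otherwise the sign of the observed t is flipped relative to the null distribution.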


Question: ANOVA

SEE QUIZ is the name of your data subset

hr_anova <- read_csv("https://estanny.com/static/week13/data/???", 
                col_types = "fddfff") 

Q: Is the average number of hours worked the same for all three statuses (fired, ok, and promoted)?

use skim to summarize the data in hr_anova by status

hr_anova %>% 
  group_by(???)  %>% 
  ???()

Use geom_boxplot to plot distributions of hours worked by status

hr_anova %>% 
  ggplot(aes(x = ???, y = hours)) + 
  ???()

specify that the variables of interest are hours and status

hr_anova %>% 
  ???(response = ???, explanatory = status)

hypothesize that the number of hours worked and status are independent

???  %>% 
  specify(response = hours, explanatory = status)  %>% 
  ???(null = "???")

generate 1000 replicates representing the null hypothesis

hr_anova %>% 
  specify(response = hours, explanatory = status)  %>% 
  hypothesize(null = "independence")  %>% 
  ???(reps = ???, type = "permute") 

The output has ??? rows


calculate the distribution of statistics from the generated data

???  <- hr_anova %>% 
  specify(response = hours, explanatory = status)  %>% 
  hypothesize(null = "independence")  %>% 
  generate(reps = 1000, type = "permute")  %>% 
  ???(stat = "F")

???

visualize the simulated null distribution

???(???)

calculate the statistic from your observed data

???  <- hr_anova %>%
  specify(response = hours, explanatory = status)  %>% 
  calculate(stat = "F")

???

get_p_value from the simulated null distribution and the observed statistic

null_distribution_anova  %>% 
  ???(obs_stat = ??? , direction = "greater")

shade_p_value on the simulated null distribution

null_distribution_anova  %>% 
  visualize() +
  ???(obs_stat = ???, direction = "greater")

Is the p-value < 0.05? ??? (yes/no)

Does your analysis support the null hypothesis that the true means of the number of hours worked for those that were “fired”, “ok” and “promoted” were the same? ??? (yes/no)
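
The ANOVA steps follow the same pattern. This sketch assumes the gss data bundled with infer and tests whether mean age differs across its partyid groups, so the variables differ from the quiz. The object names are made up for the illustration.

```r
library(infer)
library(dplyr)

set.seed(123)  # arbitrary seed, not the quiz seed

# null distribution of F statistics under independence of age and partyid
null_dist_f <- gss %>%
  specify(response = age, explanatory = partyid) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "F")

# observed F statistic from the original sample
obs_f <- gss %>%
  specify(response = age, explanatory = partyid) %>%
  calculate(stat = "F")

# F cannot be negative and grows as group means spread apart,
# so only the right tail counts as evidence against the null
p_val_f <- null_dist_f %>%
  get_p_value(obs_stat = obs_f, direction = "greater")

p_val_f
```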