Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
#data channel on Slack.
.csv file into your data folder.
Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us on Slack or by email. Please do one of the two options early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())
The code chunks below are guides; please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
The salary survey comes from Ask a Manager. Readers self-selected to respond to the survey in May 2021. It has 26,232 observations and 18 variables. I want to focus on the observations from the United States. We generally know that working women are paid less than working men. My question is: how does the gender pay gap vary with age, race, years of experience in the field, and education?
Given your question, what is your expectation about the data?
The data may show a positive, a negative, or no association between the gender pay gap and age, race, years of experience in the field, and education. Intuitively, I expect the gender pay gap to increase with age, decrease with years of experience in the field and with education, and not change much by race.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
survey <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-18/survey.csv')
## Rows: 26232 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): timestamp, how_old_are_you, industry, job_title, additional_contex...
## dbl (2): annual_salary, other_monetary_comp
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(survey)
## Rows: 26,232
## Columns: 18
## $ timestamp <chr> "4/27/2021 11:02:10", "4/27/2…
## $ how_old_are_you <chr> "25-34", "25-34", "25-34", "2…
## $ industry <chr> "Education (Higher Education)…
## $ job_title <chr> "Research and Instruction Lib…
## $ additional_context_on_job_title <chr> NA, NA, NA, NA, NA, NA, NA, "…
## $ annual_salary <dbl> 55000, 54600, 34000, 62000, 6…
## $ other_monetary_comp <dbl> 0, 4000, NA, 3000, 7000, NA, …
## $ currency <chr> "USD", "GBP", "USD", "USD", "…
## $ currency_other <chr> NA, NA, NA, NA, NA, NA, NA, N…
## $ additional_context_on_income <chr> NA, NA, NA, NA, NA, NA, NA, N…
## $ country <chr> "United States", "United King…
## $ state <chr> "Massachusetts", NA, "Tenness…
## $ city <chr> "Boston", "Cambridge", "Chatt…
## $ overall_years_of_professional_experience <chr> "5-7 years", "8 - 10 years", …
## $ years_of_experience_in_field <chr> "5-7 years", "5-7 years", "2 …
## $ highest_level_of_education_completed <chr> "Master's degree", "College d…
## $ gender <chr> "Woman", "Non-binary", "Woman…
## $ race <chr> "White", "White", "White", "W…
If there are any quirks that you have to deal with (NA coded as something else, or multiple tables), please make some notes here about what you need to do before you start transforming the data in the next section.
First, I will keep only the United States responses, which are my focus. Second, I will rename columns and recode non-answers and ambiguous responses as NA so that each categorical variable has clean levels.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc.), do it here. Examples include transforming a continuous variable into a categorical one using case_when(), etc.
# Keep only responses from the United States (exact spellings matched in the country column)
survey1 <- survey %>%
  filter(country %in% c("United States", "america", "America", "US", "USA"))

# Rename columns, then recode non-answers and unclear values as NA
survey2 <- survey1 %>%
  rename(age = how_old_are_you,
         experyr = years_of_experience_in_field,
         edu = highest_level_of_education_completed) %>%
  # Split multi-select race answers at commas; race1 keeps the first piece
  separate(col = race,
           into = c("race1", "race2"),
           sep = ",") %>%
  mutate(id = row_number(),  # row index instead of a hard-coded 1:19434
         race1 = na_if(race1, "Another option not listed here or prefer not to answer"),
         # separate() cut this label at its first comma, so restore the full label
         race1 = str_replace_all(race1, "Hispanic", "Hispanic, Latino, or Spanish origin"),
         gender1 = na_if(gender, "Other or prefer not to answer"),
         gender1 = na_if(gender1, "Prefer not to answer"),
         gender1 = na_if(gender1, "Non-binary"),
         experyr1 = str_remove_all(experyr, "years"))
## Warning: Expected 2 pieces. Additional pieces discarded in 792 rows [9, 11, 40,
## 48, 87, 126, 127, 135, 136, 153, 170, 176, 189, 228, 286, 312, 328, 343, 378,
## 379, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 18017 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 10, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, ...].
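The filter above matches five exact spellings of the country name, so free-text variants such as "U.S." or "usa " would be left out. A minimal sketch of one way to normalize the text before filtering is shown below. The toy values and the `us_variants` list are assumptions for illustration, not the survey's actual contents, and applying this to the real data would change the row count and every number downstream.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the free-text `country` column (illustrative values only)
toy <- tibble(country = c("United States", "USA", "usa ", "U.S.", "america",
                          "United Kingdom", NA))

# Hypothetical list of normalized spellings that count as the United States
us_variants <- c("united states", "united states of america", "us", "usa", "america")

toy_us <- toy %>%
  mutate(country_clean = country %>%
           str_to_lower() %>%        # ignore capitalization
           str_remove_all("\\.") %>% # "u.s." becomes "us"
           str_squish()) %>%         # drop stray spaces
  filter(country_clean %in% us_variants)

nrow(toy_us)
```

Checking `count(survey, country, sort = TRUE)` first would show which spellings actually occur before deciding what belongs in the list.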
# Convert character columns to factors, then keep only the variables needed
survey3 <- as.data.frame(unclass(survey2),
                         stringsAsFactors = TRUE) %>%
  select(id, annual_salary, gender1, race1, age, experyr1, edu)

# List the levels of each categorical variable
survey3 %>%
  summarise(across(c(age, race1, gender1, experyr1, edu),
                   .fns = ~ str_flatten(sort(unique(.x)), collapse = "/"))) %>%
  t() %>%
  as_tibble(rownames = "var") %>%
  gt()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
var | V1 |
---|---|
age | 18-24/25-34/35-44/45-54/55-64/65 or over/under 18 |
race1 | Asian or Asian American/Black or African American/Hispanic, Latino, or Spanish origin/Middle Eastern or Northern African/Native American or Alaska Native/White |
gender1 | Man/Woman |
experyr1 | 1 year or less/11 - 20 /2 - 4 /21 - 30 /31 - 40 /41 or more/5-7 /8 - 10 |
edu | College degree/High School/Master's degree/PhD/Professional degree (MD, JD, etc.)/Some college |
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
I split survey2 into two tables, survey11 and survey12, that share the id column. survey11 holds general information about each respondent, and survey12 holds the career variables. I use left_join so that every respondent in survey11 is kept. Because both tables come from survey2 and contain the same ids, inner_join and right_join would return the same rows here. (The midterm project itself does not need a join, but this step helps me get familiar with the join functions and may be useful for further study.)
#join function
names(survey2)
## [1] "timestamp"
## [2] "age"
## [3] "industry"
## [4] "job_title"
## [5] "additional_context_on_job_title"
## [6] "annual_salary"
## [7] "other_monetary_comp"
## [8] "currency"
## [9] "currency_other"
## [10] "additional_context_on_income"
## [11] "country"
## [12] "state"
## [13] "city"
## [14] "overall_years_of_professional_experience"
## [15] "experyr"
## [16] "edu"
## [17] "gender"
## [18] "race1"
## [19] "race2"
## [20] "id"
## [21] "gender1"
## [22] "experyr1"
survey11 <- survey2 %>% select(id, age, gender, race1, edu, country, state)
survey12 <- survey2 %>% select(id, annual_salary, other_monetary_comp, industry, job_title, experyr)
survey_new <- left_join(survey11, survey12, by = "id")
glimpse(survey_new)
## Rows: 19,434
## Columns: 12
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ age <chr> "25-34", "25-34", "25-34", "25-34", "25-34", "25-3…
## $ gender <chr> "Woman", "Woman", "Woman", "Woman", "Man", "Woman"…
## $ race1 <chr> "White", "White", "White", "White", "White", "Whit…
## $ edu <chr> "Master's degree", "College degree", "College degr…
## $ country <chr> "United States", "US", "USA", "US", "USA", "USA", …
## $ state <chr> "Massachusetts", "Tennessee", "Wisconsin", "South …
## $ annual_salary <dbl> 55000, 34000, 62000, 60000, 62000, 33000, 50000, 1…
## $ other_monetary_comp <dbl> 0, NA, 3000, 7000, NA, 2000, NA, 10000, 0, 0, 0, 0…
## $ industry <chr> "Education (Higher Education)", "Accounting, Banki…
## $ job_title <chr> "Research and Instruction Librarian", "Marketing S…
## $ experyr <chr> "5-7 years", "2 - 4 years", "5-7 years", "5-7 year…
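The choice between join types only matters when the two tables do not contain the same keys. The sketch below uses two small hypothetical tables (`demo` and `career`, not part of the survey) whose ids only partly overlap, to show how the row counts differ.

```r
library(dplyr)

# Hypothetical tables: id 4 has no career record, id 5 has no demographic record
demo   <- tibble(id = c(1L, 2L, 3L, 4L),
                 gender = c("Woman", "Man", "Woman", "Man"))
career <- tibble(id = c(1L, 2L, 3L, 5L),
                 annual_salary = c(55000, 62000, 34000, 70000))

left  <- left_join(demo, career, by = "id")   # all rows of demo; salary is NA for id 4
inner <- inner_join(demo, career, by = "id")  # only ids present in both tables
right <- right_join(demo, career, by = "id")  # all rows of career; gender is NA for id 5

c(left = nrow(left), inner = nrow(inner), right = nrow(right))
```

With a shared, complete id column, as in survey11 and survey12, all three calls return the same rows, so left_join is a safe default that never drops a respondent.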
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
# overview of the data
glimpse(survey3)
## Rows: 19,434
## Columns: 7
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ annual_salary <dbl> 55000, 34000, 62000, 60000, 62000, 33000, 50000, 112000,…
## $ gender1 <fct> Woman, Woman, Woman, Woman, Man, Woman, Man, Woman, Woma…
## $ race1 <fct> "White", "White", "White", "White", "White", "White", "W…
## $ age <fct> 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 45-54, …
## $ experyr1 <fct> 5-7 , 2 - 4 , 5-7 , 5-7 , 2 - 4 , 2 - 4 , 5-7 , 21 - 30 …
## $ edu <fct> "Master's degree", "College degree", "College degree", "…
skim(survey3)
Name | survey3 |
Number of rows | 19434 |
Number of columns | 7 |
_______________________ | |
Column type frequency: | |
factor | 5 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
gender1 | 814 | 0.96 | FALSE | 2 | Wom: 15366, Man: 3254 |
race1 | 470 | 0.98 | FALSE | 6 | Whi: 16280, Asi: 1161, His: 726, Bla: 626 |
age | 0 | 1.00 | FALSE | 7 | 25-: 8585, 35-: 7032, 45-: 2279, 18-: 735 |
experyr1 | 0 | 1.00 | FALSE | 8 | 11 : 4674, 5-7: 4572, 2 -: 4094, 8 -: 3554 |
edu | 123 | 0.99 | FALSE | 6 | Col: 9386, Mas: 6370, Som: 1315, Pro: 977 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 9717.5 | 5610.26 | 1 | 4859.25 | 9717.5 | 14575.75 | 19434 | ▇▇▇▇▇ |
annual_salary | 0 | 1 | 92379.2 | 68780.11 | 0 | 57000.00 | 79000.0 | 112380.00 | 3600000 | ▇▁▁▁▁ |
head(survey3,10)
## id annual_salary gender1 race1 age experyr1
## 1 1 55000 Woman White 25-34 5-7
## 2 2 34000 Woman White 25-34 2 - 4
## 3 3 62000 Woman White 25-34 5-7
## 4 4 60000 Woman White 25-34 5-7
## 5 5 62000 Man White 25-34 2 - 4
## 6 6 33000 Woman White 25-34 2 - 4
## 7 7 50000 Man White 25-34 5-7
## 8 8 112000 Woman White 45-54 21 - 30
## 9 9 45000 Woman Hispanic, Latino, or Spanish origin 35-44 21 - 30
## 10 10 47500 Woman White 25-34 5-7
## edu
## 1 Master's degree
## 2 College degree
## 3 College degree
## 4 College degree
## 5 Master's degree
## 6 College degree
## 7 Master's degree
## 8 College degree
## 9 College degree
## 10 College degree
Are the values what you expected for the variables? Why or Why not?
Most of the values are what I expected. Besides the id column, there are six variables. Annual salary is numeric, and the others are categorical. I wish age and years of experience were numeric, since I could then use a linear model to analyze the relationship.
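One possible workaround for the categorical brackets is to map each bracket to an approximate numeric value with case_when(). The sketch below is only an illustration: the column name `exper_mid` is hypothetical, and the values chosen for the open-ended brackets (0.5 and 41) are assumptions rather than anything the survey provides.

```r
library(dplyr)
library(stringr)

# Bracket labels as they appear in experyr1 (with the leftover trailing spaces)
toy <- tibble(experyr1 = c("1 year or less", "2 - 4 ", "5-7 ", "8 - 10 ",
                           "11 - 20 ", "21 - 30 ", "31 - 40 ", "41 or more"))

toy_num <- toy %>%
  mutate(
    bracket = str_squish(experyr1),     # remove stray whitespace before matching
    exper_mid = case_when(
      bracket == "1 year or less" ~ 0.5,   # assumption: middle of 0 to 1
      bracket == "2 - 4"          ~ 3,
      bracket == "5-7"            ~ 6,
      bracket == "8 - 10"         ~ 9,
      bracket == "11 - 20"        ~ 15.5,
      bracket == "21 - 30"        ~ 25.5,
      bracket == "31 - 40"        ~ 35.5,
      bracket == "41 or more"     ~ 41     # assumption: lower bound of an open bracket
    )
  )

toy_num
```

If the column is stored as a factor, as in survey3, wrapping it in as.character() before str_squish() keeps the matching explicit. Midpoints are a coarse approximation, so any linear model fitted on them should be read as exploratory.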
Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question.
gender_all<-survey3%>%group_by(gender1)%>%
summarize(average_annual_salary = mean(annual_salary,na.rm=TRUE),
median_annual_salary=median(annual_salary,na.rm=TRUE))%>%
na.omit()
gender_all
## # A tibble: 2 × 3
## gender1 average_annual_salary median_annual_salary
## <fct> <dbl> <dbl>
## 1 Man 118701. 105000
## 2 Woman 87056. 75000
gender_all%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))
## # A tibble: 1 × 2
## diff_mean diff_median
## <dbl> <dbl>
## 1 -31645. -30000
gender_race<-survey3%>%group_by(race1,gender1)%>%
summarize(average_annual_salary = mean(annual_salary, na.rm = TRUE),
median_annual_salary=median(annual_salary, na.rm = TRUE))%>%
na.omit()
## `summarise()` has grouped output by 'race1'. You can override using the `.groups` argument.
gender_race
## # A tibble: 12 × 4
## # Groups: race1 [6]
## race1 gender1 average_annual_… median_annual_s…
## <fct> <fct> <dbl> <dbl>
## 1 Asian or Asian American Man 137141. 117851
## 2 Asian or Asian American Woman 101278. 90000
## 3 Black or African American Man 109217. 98214
## 4 Black or African American Woman 92291. 76000
## 5 Hispanic, Latino, or Spanish origin Man 102999. 90000
## 6 Hispanic, Latino, or Spanish origin Woman 83332. 74000
## 7 Middle Eastern or Northern African Man 129438. 136000
## 8 Middle Eastern or Northern African Woman 91504. 75000
## 9 Native American or Alaska Native Man 121721. 110800
## 10 Native American or Alaska Native Woman 79597. 73000
## 11 White Man 118203. 105000
## 12 White Woman 85781. 75000
gender_race%>%group_by(race1)%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))%>%arrange(diff_mean)
## # A tibble: 6 × 3
## race1 diff_mean diff_median
## <fct> <dbl> <dbl>
## 1 Native American or Alaska Native -42124. -37800
## 2 Middle Eastern or Northern African -37933. -61000
## 3 Asian or Asian American -35864. -27851
## 4 White -32422. -30000
## 5 Hispanic, Latino, or Spanish origin -19667. -16000
## 6 Black or African American -16926. -22214
gender_edu<-survey3%>%group_by(edu,gender1)%>%
summarize(average_annual_salary = mean(annual_salary, na.rm = TRUE),
median_annual_salary=median(annual_salary, na.rm = TRUE))%>%
na.omit()
## `summarise()` has grouped output by 'edu'. You can override using the `.groups` argument.
gender_edu
## # A tibble: 12 × 4
## # Groups: edu [6]
## edu gender1 average_annual_s… median_annual_s…
## <fct> <fct> <dbl> <dbl>
## 1 College degree Man 114860. 100000
## 2 College degree Woman 81569. 72000
## 3 High School Man 109547. 100003
## 4 High School Woman 57634. 50250
## 5 Master's degree Man 119138. 105000
## 6 Master's degree Woman 88192. 78000
## 7 PhD Man 143421. 136500
## 8 PhD Woman 103430. 91000
## 9 Professional degree (MD, JD, etc.) Man 165224. 133500
## 10 Professional degree (MD, JD, etc.) Woman 138389. 115000
## 11 Some college Man 109568. 96000
## 12 Some college Woman 69337. 60000
gender_edu%>%group_by(edu)%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))%>%arrange(diff_mean)
## # A tibble: 6 × 3
## edu diff_mean diff_median
## <fct> <dbl> <dbl>
## 1 High School -51913. -49753
## 2 Some college -40231. -36000
## 3 PhD -39992. -45500
## 4 College degree -33291. -28000
## 5 Master's degree -30946. -27000
## 6 Professional degree (MD, JD, etc.) -26835. -18500
gender_exper<-survey3%>%group_by(experyr1,gender1)%>%
summarize(average_annual_salary = mean(annual_salary, na.rm = TRUE),
median_annual_salary=median(annual_salary, na.rm = TRUE))%>%
na.omit()
## `summarise()` has grouped output by 'experyr1'. You can override using the `.groups` argument.
gender_exper
## # A tibble: 16 × 4
## # Groups: experyr1 [8]
## experyr1 gender1 average_annual_salary median_annual_salary
## <fct> <fct> <dbl> <dbl>
## 1 "1 year or less" Man 70673. 60138.
## 2 "1 year or less" Woman 65061. 56000
## 3 "11 - 20 " Man 135344. 129400
## 4 "11 - 20 " Woman 99284. 89000
## 5 "2 - 4 " Man 88835. 75000
## 6 "2 - 4 " Woman 71680. 62646.
## 7 "21 - 30 " Man 154034. 145500
## 8 "21 - 30 " Woman 109906. 94000
## 9 "31 - 40 " Man 140926. 131000
## 10 "31 - 40 " Woman 106731. 95663
## 11 "41 or more" Man 137273. 121000
## 12 "41 or more" Woman 89706. 95000
## 13 "5-7 " Man 107066. 91000
## 14 "5-7 " Woman 82403. 72000
## 15 "8 - 10 " Man 118211. 106871
## 16 "8 - 10 " Woman 91599. 80000
gender_exper%>%group_by(experyr1)%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))%>%arrange(diff_mean)
## # A tibble: 8 × 3
## experyr1 diff_mean diff_median
## <fct> <dbl> <dbl>
## 1 "41 or more" -47567. -26000
## 2 "21 - 30 " -44128. -51500
## 3 "11 - 20 " -36060. -40400
## 4 "31 - 40 " -34195. -35337
## 5 "8 - 10 " -26612. -26871
## 6 "5-7 " -24663. -19000
## 7 "2 - 4 " -17155. -12354.
## 8 "1 year or less" -5612. -4138.
gender_age<-survey3%>%group_by(age,gender1)%>%
summarize(average_annual_salary = mean(annual_salary, na.rm = TRUE),
median_annual_salary=median(annual_salary, na.rm = TRUE))%>%
na.omit()
## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.
gender_age
## # A tibble: 13 × 4
## # Groups: age [7]
## age gender1 average_annual_salary median_annual_salary
## <fct> <fct> <dbl> <dbl>
## 1 18-24 Man 66184. 60875
## 2 18-24 Woman 62114. 54000
## 3 25-34 Man 106680. 90000
## 4 25-34 Woman 81440. 70000
## 5 35-44 Man 126959. 117000
## 6 35-44 Woman 93526. 82000
## 7 45-54 Man 136893. 128000
## 8 45-54 Woman 94246. 84300
## 9 55-64 Man 132030. 123000
## 10 55-64 Woman 93755. 78000
## 11 65 or over Man 121452. 117000
## 12 65 or over Woman 98271. 90000
## 13 under 18 Woman 72453. 72600
gender_age%>%group_by(age)%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))%>%arrange(diff_mean)
## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.
## # A tibble: 6 × 3
## # Groups: age [6]
## age diff_mean diff_median
## <fct> <dbl> <dbl>
## 1 45-54 -42647. -43700
## 2 55-64 -38275. -45000
## 3 35-44 -33433. -35000
## 4 25-34 -25240. -20000
## 5 65 or over -23181. -27000
## 6 18-24 -4070. -6875
What are your findings about the summary? Are they what you expected?
From the summary, I get the mean and median annual salary of women and men by age, education, years of experience in the field, and race, and also the mean and median gender pay gap for each of those variables.
Yes, the summary gave me all the values I needed.
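The diff() calls above depend on the row order within each group (Man first, then Woman), and they silently drop a group such as "under 18" that has only one gender. A sketch of an alternative that names the two columns explicitly with pivot_wider() is shown below. It reuses the rounded overall values printed in the gender_all table, so the difference is rounded as well, and the `ratio_*` columns are additions for illustration.

```r
library(dplyr)
library(tidyr)

# Rounded overall values copied from the gender_all output
gender_all_toy <- tibble(
  gender1 = c("Man", "Woman"),
  average_annual_salary = c(118701, 87056),
  median_annual_salary = c(105000, 75000)
)

gap <- gender_all_toy %>%
  # One column per statistic and gender, e.g. average_annual_salary_Man
  pivot_wider(names_from = gender1,
              values_from = c(average_annual_salary, median_annual_salary)) %>%
  mutate(diff_mean    = average_annual_salary_Woman - average_annual_salary_Man,
         diff_median  = median_annual_salary_Woman - median_annual_salary_Man,
         ratio_mean   = average_annual_salary_Woman / average_annual_salary_Man,
         ratio_median = median_annual_salary_Woman / median_annual_salary_Man)

gap %>% select(diff_mean, diff_median, ratio_mean, ratio_median)
```

The ratio columns express the gap as a share of men's pay (roughly 0.73 for the mean and 0.71 for the median with these values). For a grouped summary such as gender_age, the same pivot keeps a group with a missing gender as a row with NA instead of removing it.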
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
ggplot(data = na.omit(survey3), aes(y = race1, x = annual_salary, fill = gender1)) +
  geom_boxplot() +
  xlim(0, 500000) +
  labs(title = "Gender pay gap by race",
       y = "Race",
       x = "Annual Salary")
## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
ggplot(data = na.omit(survey3), aes(y = edu, x = annual_salary, fill = gender1)) +
  geom_boxplot() +
  xlim(0, 500000) +
  labs(title = "Gender pay gap by education",
       y = "Education",
       x = "Annual Salary")
## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
ggplot(data = na.omit(survey3), aes(y = experyr1, x = annual_salary, fill = gender1)) +
  geom_boxplot() +
  xlim(0, 500000) +
  labs(title = "Gender pay gap by years of experience in field",
       y = "Years of experience in field",
       x = "Annual Salary")
## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
ggplot(data = na.omit(survey3), aes(y = age, x = annual_salary, fill = gender1)) +
  geom_boxplot() +
  xlim(0, 500000) +
  labs(title = "Gender pay gap by age",
       y = "Age",
       x = "Annual Salary")
## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
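The warnings above appear because xlim() removes the salaries outside the limits before the boxplot statistics are computed. A sketch of an alternative is to zoom with coord_cartesian(), which changes only the visible range and keeps every row in the calculation. The data below are simulated purely so the example runs on its own; the dollar labels via scales::label_dollar() are an optional addition.

```r
library(ggplot2)

# Simulated stand-in for survey3 (not real survey values)
set.seed(1)
toy <- data.frame(
  edu = rep(c("College degree", "PhD"), each = 50),
  gender1 = rep(c("Man", "Woman"), times = 50),
  annual_salary = c(rlnorm(50, log(80000), 0.5), rlnorm(50, log(110000), 0.5))
)
toy$annual_salary[1] <- 3600000  # one extreme value, like the maximum reported by skim()

p <- ggplot(toy, aes(x = annual_salary, y = edu, fill = gender1)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(0, 500000)) +               # zoom without dropping rows
  scale_x_continuous(labels = scales::label_dollar()) +
  labs(title = "Gender pay gap by education",
       x = "Annual Salary",
       y = "Education",
       fill = "Gender")

p
```

With only 30 rows above $500,000 out of roughly 19,000, the medians and quartiles barely move either way, but coord_cartesian() avoids the warning and keeps the plot consistent with the summary tables.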
Summarize your research question and findings below.
Accurate salary data is essential for people to know their situation in the salary market. As a graduate student in Biostatistics, I particularly care about women in the salary market.
Using the data from the Ask a Manager survey, I analyze annual salary by gender, age, race, education, and years of experience in the field.
I found that a gender pay gap between men and women exists: overall, women's mean salary is about 73% of men's. In detail:
There is a gender pay gap in annual salary. The mean gap between men and women is $31,645.31, and the median gap is $30,000.
The gender pay gap differs by race. The mean gap is widest among Native American or Alaska Native respondents ($42,123.94) and smallest among Black or African American respondents ($16,926). The median gap, however, is widest among Middle Eastern or Northern African respondents ($61,000) and smallest among Hispanic, Latino, or Spanish origin respondents ($16,000).
The gender pay gap differs by education. The mean gap is widest for High School ($51,913.40) and smallest for Professional degree (MD, JD, etc.) ($26,834.84). The median gap shows the same ordering at the extremes with different values: widest for High School ($49,753) and smallest for Professional degree ($18,500).
The mean gender pay gap appears to be positively associated with years of experience in the field. The mean gap widens with each experience bracket except 31-40 years, where it dips before reaching its maximum at 41 or more years. The median gap does not follow this pattern: it is widest at 21-30 years ($51,500), narrows after that, and is smallest at 1 year or less ($4,137.50).
The gender pay gap differs by age. The mean gap is widest at 45-54 ($42,647.36) and smallest at 18-24 ($4,070.11). The median gap, however, is widest at 55-64 ($45,000) and smallest at 18-24 ($6,875).
Are your findings what you expected? Why or Why not?
The results differ from my expectations. According to a report from the DOL, women earn 82% of what men earn, but in this survey the figure is 73%. I expected the gap to narrow with experience, yet the mean gap widens with years of experience in the field, and it also widens with age up to 45-54. By education, the gap is smallest for professional degrees, while a PhD is associated with a wider gap than a college or master's degree.
For the next exploration, first, I want to find the reasons behind these results, for example why a PhD does not narrow the gap.
Second, I only analyzed the gender pay gap by age, race, education, and years of experience in the field one variable at a time. Other variables, such as occupation, could be explored. Combining two or three variables may also give interesting findings, for example whether, among respondents with a master's degree, Asian Americans have a larger gender pay gap than White respondents.
Third, since the data come from a survey whose respondents self-selected, the sample is likely biased and may not be accurate. I could compare these results with data from the DOL or other studies.
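The second point, looking at two or three variables together, could start from the same group_by() and summarize() pattern with a count per cell, since cross-classified groups become small quickly. The sketch below runs on simulated data, and the minimum cell size `min_n` is an arbitrary assumption, not a recommended threshold.

```r
library(dplyr)

# Simulated stand-in for survey3 (not real survey values)
set.seed(42)
toy <- tibble(
  edu = sample(c("College degree", "Master's degree"), 400, replace = TRUE),
  race1 = sample(c("Asian or Asian American", "White"), 400,
                 replace = TRUE, prob = c(0.2, 0.8)),
  gender1 = sample(c("Man", "Woman"), 400, replace = TRUE),
  annual_salary = round(rlnorm(400, log(80000), 0.4))
)

# Median salary and number of respondents for every education/race/gender cell
by_cell <- toy %>%
  group_by(edu, race1, gender1) %>%
  summarize(median_salary = median(annual_salary),
            n = n(),
            .groups = "drop")

min_n <- 10  # hypothetical cutoff for cells considered large enough to compare
reliable <- by_cell %>% filter(n >= min_n)

reliable
```

Reporting n next to each median makes it visible when a seemingly large gap rests on only a handful of respondents.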