Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before You Get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
See other data sources in the #data channel on Slack.
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
You may use another dataset or your own data, but please make sure it is de-identified.

Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())

The code chunks below are guides; please add more code chunks as needed.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
The salary survey comes from Ask a Manager. Readers self-selected to respond to the survey in May 2021. The data have 26,232 observations and 18 variables. I want to focus on the observations from the United States. We generally know that working women are paid less than working men. My question is: how do age, race, years of experience in the field, and education affect the gender pay gap?

Given your question, what is your expectation about the data?
The data may show a positive, negative, or no association between the gender pay gap and age, race, years of experience in the field, and education. Intuitively, I expect the gender pay gap to increase with age, decrease with years of experience in the field and with education, and not change much by race.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

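library(tidyverse)  # readr, dplyr, tidyr, stringr, ggplot2
library(skimr)      # skim()
library(gt)         # gt()

survey <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-18/survey.csv')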

## Rows: 26232 Columns: 18

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (16): timestamp, how_old_are_you, industry, job_title, additional_contex...
## dbl  (2): annual_salary, other_monetary_comp

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(survey)

## Rows: 26,232
## Columns: 18
## $ timestamp                                <chr> "4/27/2021 11:02:10", "4/27/2…
## $ how_old_are_you                          <chr> "25-34", "25-34", "25-34", "2…
## $ industry                                 <chr> "Education (Higher Education)…
## $ job_title                                <chr> "Research and Instruction Lib…
## $ additional_context_on_job_title          <chr> NA, NA, NA, NA, NA, NA, NA, "…
## $ annual_salary                            <dbl> 55000, 54600, 34000, 62000, 6…
## $ other_monetary_comp                      <dbl> 0, 4000, NA, 3000, 7000, NA, …
## $ currency                                 <chr> "USD", "GBP", "USD", "USD", "…
## $ currency_other                           <chr> NA, NA, NA, NA, NA, NA, NA, N…
## $ additional_context_on_income             <chr> NA, NA, NA, NA, NA, NA, NA, N…
## $ country                                  <chr> "United States", "United King…
## $ state                                    <chr> "Massachusetts", NA, "Tenness…
## $ city                                     <chr> "Boston", "Cambridge", "Chatt…
## $ overall_years_of_professional_experience <chr> "5-7 years", "8 - 10 years", …
## $ years_of_experience_in_field             <chr> "5-7 years", "5-7 years", "2 …
## $ highest_level_of_education_completed     <chr> "Master's degree", "College d…
## $ gender                                   <chr> "Woman", "Non-binary", "Woman…
## $ race                                     <chr> "White", "White", "White", "W…

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

First, I will keep only the United States responses, since that is my focus. Second, I will rename columns and recode NA-like and unclear responses as NA so that each categorical variable has clean levels.
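
Because `country` is a free-text column, the same country can appear under many spellings. The sketch below shows one way to inspect and normalize them before filtering. It runs on a small made-up table, and the list `us_spellings` is an assumption for illustration, not the list used in this report; applying a broader list to the real data would change the row counts reported below.

```r
library(dplyr)
library(stringr)

# Toy stand-in for the free-text `country` column (values are illustrative).
toy <- tibble(country = c("United States", "USA", "usa ", "U.S.", "america",
                          "United Kingdom", "Canada"))

# Step 1: list the spellings that actually occur, most frequent first.
toy %>% count(country, sort = TRUE)

# Step 2: normalize case, periods, and whitespace, then match a short list.
# `us_spellings` is a hypothetical list; extend it after inspecting step 1.
us_spellings <- c("united states", "united states of america",
                  "usa", "us", "america")

toy_us <- toy %>%
  mutate(country_clean = country %>%
           str_to_lower() %>%
           str_remove_all("\\.") %>%
           str_squish()) %>%
  filter(country_clean %in% us_spellings)

nrow(toy_us)
```

Normalizing first keeps the matching list short, since "USA", "usa " and "U.S." all collapse to a handful of cleaned values.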

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# select data of United States (country is free text, so match several spellings)
survey1 <- survey %>%
  filter(country %in% c("United States", "america", "America", "US", "USA"))

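# rename columns, then recode NA-like and unclear responses as NA
survey2 <- survey1 %>%
  rename(age = how_old_are_you,
         experyr = years_of_experience_in_field,
         edu = highest_level_of_education_completed) %>%
  # keep the first two comma-separated pieces of race; later pieces are discarded (see warnings below)
  separate(col = race,
           into = c("race1", "race2"),
           sep = ",") %>%
  mutate(id = row_number(),  # row id; avoids hard-coding the row count (19434)
         race1 = na_if(race1, "Another option not listed here or prefer not to answer"),
         # the split above cuts "Hispanic, Latino, or Spanish origin" down to "Hispanic"; restore the full label
         race1 = str_replace_all(race1, "Hispanic", "Hispanic, Latino, or Spanish origin"),
         # the analysis compares Man vs Woman only, so all other responses become NA
         gender1 = na_if(gender, "Other or prefer not to answer"),
         gender1 = na_if(gender1, "Prefer not to answer"),
         gender1 = na_if(gender1, "Non-binary"),
         experyr1 = str_remove_all(experyr, "years"))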

## Warning: Expected 2 pieces. Additional pieces discarded in 792 rows [9, 11, 40,
## 48, 87, 126, 127, 135, 136, 153, 170, 176, 189, 228, 286, 312, 328, 343, 378,
## 379, ...].

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 18017 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 10, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, ...].

# Convert type of categorical variables to factor, select variables we need to see the levels 
survey3<-as.data.frame(unclass(survey2),                     
                       stringsAsFactors = TRUE)%>%
  select(id,annual_salary,gender1,race1,age,experyr1,edu)

survey3%>%summarise(across(c(age,race1,gender1,experyr1,edu), 
                   .fns = ~ str_flatten(sort(unique(.x)),collapse = "/"))) %>%
  t() %>% 
  as_tibble(rownames = "var") %>% 
  gt()

## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

var	V1
age	18-24/25-34/35-44/45-54/55-64/65 or over/under 18
race1	Asian or Asian American/Black or African American/Hispanic, Latino, or Spanish origin/Middle Eastern or Northern African/Native American or Alaska Native/White
gender1	Man/Woman
experyr1	1 year or less/11 - 20 /2 - 4 /21 - 30 /31 - 40 /41 or more/5-7 /8 - 10
edu	College degree/High School/Master's degree/PhD/Professional degree (MD, JD, etc.)/Some college

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

I select some variables from survey2 to create two tables, survey11 and survey12. I use left_join() because the left table holds general information about each respondent, every row of which I want to keep, and the right table holds the career variables. (The midterm project does not require a join here, but this step helps me get familiar with the join functions and may be useful for further study.)

#join function
names(survey2)

##  [1] "timestamp"                               
##  [2] "age"                                     
##  [3] "industry"                                
##  [4] "job_title"                               
##  [5] "additional_context_on_job_title"         
##  [6] "annual_salary"                           
##  [7] "other_monetary_comp"                     
##  [8] "currency"                                
##  [9] "currency_other"                          
## [10] "additional_context_on_income"            
## [11] "country"                                 
## [12] "state"                                   
## [13] "city"                                    
## [14] "overall_years_of_professional_experience"
## [15] "experyr"                                 
## [16] "edu"                                     
## [17] "gender"                                  
## [18] "race1"                                   
## [19] "race2"                                   
## [20] "id"                                      
## [21] "gender1"                                 
## [22] "experyr1"

survey11<-survey2%>%select(id,age,gender,race1,edu,country,state)
survey12<-survey2%>%select(id,annual_salary,other_monetary_comp,industry,job_title,experyr)
survey_new<-left_join(survey11,survey12,by="id")
glimpse(survey_new)

## Rows: 19,434
## Columns: 12
## $ id                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ age                 <chr> "25-34", "25-34", "25-34", "25-34", "25-34", "25-3…
## $ gender              <chr> "Woman", "Woman", "Woman", "Woman", "Man", "Woman"…
## $ race1               <chr> "White", "White", "White", "White", "White", "Whit…
## $ edu                 <chr> "Master's degree", "College degree", "College degr…
## $ country             <chr> "United States", "US", "USA", "US", "USA", "USA", …
## $ state               <chr> "Massachusetts", "Tennessee", "Wisconsin", "South …
## $ annual_salary       <dbl> 55000, 34000, 62000, 60000, 62000, 33000, 50000, 1…
## $ other_monetary_comp <dbl> 0, NA, 3000, 7000, NA, 2000, NA, 10000, 0, 0, 0, 0…
## $ industry            <chr> "Education (Higher Education)", "Accounting, Banki…
## $ job_title           <chr> "Research and Instruction Librarian", "Marketing S…
## $ experyr             <chr> "5-7 years", "2 - 4 years", "5-7 years", "5-7 year…
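
In the join above both tables come from survey2 and share every id, so left_join(), inner_join(), and right_join() would all return the same 19,434 rows. The choice only matters when the keys differ. A minimal sketch on made-up tables (`people` and `jobs` are hypothetical names and values) shows the difference:

```r
library(dplyr)

# Toy tables: `people` has ids 1-4; `jobs` lacks id 3 and has an extra id 5.
people <- tibble(id = 1:4,
                 gender = c("Woman", "Man", "Woman", "Man"))
jobs   <- tibble(id = c(1L, 2L, 4L, 5L),
                 annual_salary = c(55000, 62000, 50000, 70000))

# left_join keeps every row of `people`; id 3 gets NA for annual_salary.
left  <- left_join(people, jobs, by = "id")

# inner_join keeps only ids present in both tables (1, 2, 4).
inner <- inner_join(people, jobs, by = "id")

# right_join keeps every row of `jobs`; id 5 gets NA for gender.
right <- right_join(people, jobs, by = "id")

c(left = nrow(left), inner = nrow(inner), right = nrow(right))
```

A left join is the safer default when the left table defines the set of respondents to analyze, because unmatched rows show up as NA instead of disappearing silently.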

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

# overview of the transformed data
glimpse(survey3)

## Rows: 19,434
## Columns: 7
## $ id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ annual_salary <dbl> 55000, 34000, 62000, 60000, 62000, 33000, 50000, 112000,…
## $ gender1       <fct> Woman, Woman, Woman, Woman, Man, Woman, Man, Woman, Woma…
## $ race1         <fct> "White", "White", "White", "White", "White", "White", "W…
## $ age           <fct> 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 25-34, 45-54, …
## $ experyr1      <fct> 5-7 , 2 - 4 , 5-7 , 5-7 , 2 - 4 , 2 - 4 , 5-7 , 21 - 30 …
## $ edu           <fct> "Master's degree", "College degree", "College degree", "…

skim(survey3)

Data summary
Name	survey3
Number of rows	19434
Number of columns	7
_______________________
Column type frequency:
factor	5
numeric	2
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
gender1	814	0.96	FALSE	2	Wom: 15366, Man: 3254
race1	470	0.98	FALSE	6	Whi: 16280, Asi: 1161, His: 726, Bla: 626
age	0	1.00	FALSE	7	25-: 8585, 35-: 7032, 45-: 2279, 18-: 735
experyr1	0	1.00	FALSE	8	11 : 4674, 5-7: 4572, 2 -: 4094, 8 -: 3554
edu	123	0.99	FALSE	6	Col: 9386, Mas: 6370, Som: 1315, Pro: 977

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1	9717.5	5610.26	1	4859.25	9717.5	14575.75	19434	▇▇▇▇▇
annual_salary	0	1	92379.2	68780.11	0	57000.00	79000.0	112380.00	3600000	▇▁▁▁▁

head(survey3,10)

##    id annual_salary gender1                               race1   age experyr1
## 1   1         55000   Woman                               White 25-34     5-7 
## 2   2         34000   Woman                               White 25-34   2 - 4 
## 3   3         62000   Woman                               White 25-34     5-7 
## 4   4         60000   Woman                               White 25-34     5-7 
## 5   5         62000     Man                               White 25-34   2 - 4 
## 6   6         33000   Woman                               White 25-34   2 - 4 
## 7   7         50000     Man                               White 25-34     5-7 
## 8   8        112000   Woman                               White 45-54 21 - 30 
## 9   9         45000   Woman Hispanic, Latino, or Spanish origin 35-44 21 - 30 
## 10 10         47500   Woman                               White 25-34     5-7 
##                edu
## 1  Master's degree
## 2   College degree
## 3   College degree
## 4   College degree
## 5  Master's degree
## 6   College degree
## 7  Master's degree
## 8   College degree
## 9   College degree
## 10  College degree

Are the values what you expected for the variables? Why or Why not?

Most of the values are what I expected. There are six analysis variables in total, plus the id column. Annual salary is numeric, and the others are categorical. I wish age and years of experience were numeric, since I could then use a linear model to analyze the relationship.
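
One possible workaround for bracketed variables, sketched here on a toy column and not applied in this report, is to store the brackets as an ordered factor (so they sort by age, not alphabetically) and to map each bracket to an approximate numeric midpoint with case_when(). The midpoints, especially those for the open-ended brackets, are assumptions chosen for illustration.

```r
library(dplyr)

# Toy stand-in for the `age` column.
toy <- tibble(age = c("25-34", "under 18", "45-54", "18-24", "65 or over"))

# Bracket order from youngest to oldest (the labels match the survey levels).
age_levels <- c("under 18", "18-24", "25-34", "35-44",
                "45-54", "55-64", "65 or over")

toy_num <- toy %>%
  mutate(
    # Ordered factor: sorts and plots in age order.
    age_ord = factor(age, levels = age_levels, ordered = TRUE),
    # Approximate midpoints; 17 and 67 are assumed values for open brackets.
    age_mid = case_when(
      age == "under 18"   ~ 17,
      age == "18-24"      ~ 21,
      age == "25-34"      ~ 29.5,
      age == "35-44"      ~ 39.5,
      age == "45-54"      ~ 49.5,
      age == "55-64"      ~ 59.5,
      age == "65 or over" ~ 67
    )
  )

toy_num %>% arrange(age_ord)
```

The same pattern would apply to the experience brackets, whose default factor levels sort "11 - 20" before "2 - 4".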

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

gender_all<-survey3%>%group_by(gender1)%>%
  summarize(average_annual_salary = mean(annual_salary,na.rm=TRUE),
            median_annual_salary=median(annual_salary,na.rm=TRUE))%>%
  na.omit()
gender_all

## # A tibble: 2 × 3
##   gender1 average_annual_salary median_annual_salary
##   <fct>                   <dbl>                <dbl>
## 1 Man                   118701.               105000
## 2 Woman                  87056.                75000

gender_all%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))

## # A tibble: 1 × 2
##   diff_mean diff_median
##       <dbl>       <dbl>
## 1   -31645.      -30000

gender_race<-survey3%>%group_by(race1,gender1)%>%
  summarize(average_annual_salary = mean(annual_salary, na.rm = TRUE),
            median_annual_salary=median(annual_salary, na.rm = TRUE))%>%
  na.omit()

## `summarise()` has grouped output by 'race1'. You can override using the `.groups` argument.

gender_race

## # A tibble: 12 × 4
## # Groups:   race1 [6]
##    race1                               gender1 average_annual_… median_annual_s…
##    <fct>                               <fct>              <dbl>            <dbl>
##  1 Asian or Asian American             Man              137141.           117851
##  2 Asian or Asian American             Woman            101278.            90000
##  3 Black or African American           Man              109217.            98214
##  4 Black or African American           Woman             92291.            76000
##  5 Hispanic, Latino, or Spanish origin Man              102999.            90000
##  6 Hispanic, Latino, or Spanish origin Woman             83332.            74000
##  7 Middle Eastern or Northern African  Man              129438.           136000
##  8 Middle Eastern or Northern African  Woman             91504.            75000
##  9 Native American or Alaska Native    Man              121721.           110800
## 10 Native American or Alaska Native    Woman             79597.            73000
## 11 White                               Man              118203.           105000
## 12 White                               Woman             85781.            75000

gender_race%>%group_by(race1)%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))%>%arrange(diff_mean)

## # A tibble: 6 × 3
##   race1                               diff_mean diff_median
##   <fct>                                   <dbl>       <dbl>
## 1 Native American or Alaska Native      -42124.      -37800
## 2 Middle Eastern or Northern African    -37933.      -61000
## 3 Asian or Asian American               -35864.      -27851
## 4 White                                 -32422.      -30000
## 5 Hispanic, Latino, or Spanish origin   -19667.      -16000
## 6 Black or African American             -16926.      -22214

gender_edu<-survey3%>%group_by(edu,gender1)%>%
  summarize(average_annual_salary = mean(annual_salary, na.rm = TRUE),
            median_annual_salary=median(annual_salary, na.rm = TRUE))%>%
  na.omit()

## `summarise()` has grouped output by 'edu'. You can override using the `.groups` argument.

gender_edu

## # A tibble: 12 × 4
## # Groups:   edu [6]
##    edu                                gender1 average_annual_s… median_annual_s…
##    <fct>                              <fct>               <dbl>            <dbl>
##  1 College degree                     Man               114860.           100000
##  2 College degree                     Woman              81569.            72000
##  3 High School                        Man               109547.           100003
##  4 High School                        Woman              57634.            50250
##  5 Master's degree                    Man               119138.           105000
##  6 Master's degree                    Woman              88192.            78000
##  7 PhD                                Man               143421.           136500
##  8 PhD                                Woman             103430.            91000
##  9 Professional degree (MD, JD, etc.) Man               165224.           133500
## 10 Professional degree (MD, JD, etc.) Woman             138389.           115000
## 11 Some college                       Man               109568.            96000
## 12 Some college                       Woman              69337.            60000

gender_edu%>%group_by(edu)%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))%>%arrange(diff_mean)

## # A tibble: 6 × 3
##   edu                                diff_mean diff_median
##   <fct>                                  <dbl>       <dbl>
## 1 High School                          -51913.      -49753
## 2 Some college                         -40231.      -36000
## 3 PhD                                  -39992.      -45500
## 4 College degree                       -33291.      -28000
## 5 Master's degree                      -30946.      -27000
## 6 Professional degree (MD, JD, etc.)   -26835.      -18500

gender_exper<-survey3%>%group_by(experyr1,gender1)%>%
  summarize(average_annual_salary = mean(annual_salary, na.rm = TRUE),
            median_annual_salary=median(annual_salary, na.rm = TRUE))%>%
  na.omit()

## `summarise()` has grouped output by 'experyr1'. You can override using the `.groups` argument.

gender_exper

## # A tibble: 16 × 4
## # Groups:   experyr1 [8]
##    experyr1         gender1 average_annual_salary median_annual_salary
##    <fct>            <fct>                   <dbl>                <dbl>
##  1 "1 year or less" Man                    70673.               60138.
##  2 "1 year or less" Woman                  65061.               56000 
##  3 "11 - 20 "       Man                   135344.              129400 
##  4 "11 - 20 "       Woman                  99284.               89000 
##  5 "2 - 4 "         Man                    88835.               75000 
##  6 "2 - 4 "         Woman                  71680.               62646.
##  7 "21 - 30 "       Man                   154034.              145500 
##  8 "21 - 30 "       Woman                 109906.               94000 
##  9 "31 - 40 "       Man                   140926.              131000 
## 10 "31 - 40 "       Woman                 106731.               95663 
## 11 "41  or more"    Man                   137273.              121000 
## 12 "41  or more"    Woman                  89706.               95000 
## 13 "5-7 "           Man                   107066.               91000 
## 14 "5-7 "           Woman                  82403.               72000 
## 15 "8 - 10 "        Man                   118211.              106871 
## 16 "8 - 10 "        Woman                  91599.               80000

gender_exper%>%group_by(experyr1)%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))%>%arrange(diff_mean)

## # A tibble: 8 × 3
##   experyr1         diff_mean diff_median
##   <fct>                <dbl>       <dbl>
## 1 "41  or more"      -47567.     -26000 
## 2 "21 - 30 "         -44128.     -51500 
## 3 "11 - 20 "         -36060.     -40400 
## 4 "31 - 40 "         -34195.     -35337 
## 5 "8 - 10 "          -26612.     -26871 
## 6 "5-7 "             -24663.     -19000 
## 7 "2 - 4 "           -17155.     -12354.
## 8 "1 year or less"    -5612.      -4138.

gender_age<-survey3%>%group_by(age,gender1)%>%
  summarize(average_annual_salary = mean(annual_salary, na.rm = TRUE),
            median_annual_salary=median(annual_salary, na.rm = TRUE))%>%
  na.omit()

## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.

gender_age

## # A tibble: 13 × 4
## # Groups:   age [7]
##    age        gender1 average_annual_salary median_annual_salary
##    <fct>      <fct>                   <dbl>                <dbl>
##  1 18-24      Man                    66184.                60875
##  2 18-24      Woman                  62114.                54000
##  3 25-34      Man                   106680.                90000
##  4 25-34      Woman                  81440.                70000
##  5 35-44      Man                   126959.               117000
##  6 35-44      Woman                  93526.                82000
##  7 45-54      Man                   136893.               128000
##  8 45-54      Woman                  94246.                84300
##  9 55-64      Man                   132030.               123000
## 10 55-64      Woman                  93755.                78000
## 11 65 or over Man                   121452.               117000
## 12 65 or over Woman                  98271.                90000
## 13 under 18   Woman                  72453.                72600

gender_age%>%group_by(age)%>%summarize(diff_mean=diff(average_annual_salary),diff_median=diff(median_annual_salary))%>%arrange(diff_mean)

## `summarise()` has grouped output by 'age'. You can override using the `.groups` argument.

## # A tibble: 6 × 3
## # Groups:   age [6]
##   age        diff_mean diff_median
##   <fct>          <dbl>       <dbl>
## 1 45-54        -42647.      -43700
## 2 55-64        -38275.      -45000
## 3 35-44        -33433.      -35000
## 4 25-34        -25240.      -20000
## 5 65 or over   -23181.      -27000
## 6 18-24         -4070.       -6875

What are your findings about the summary? Are they what you expected?

From the summary, I get the mean and median annual salary for women and men by age, education, years of experience in the field, and race, as well as the mean and median gender pay gap for each of those variables.

Yes, the summary gave me all the values I needed.
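
The gaps above are computed with diff(), which relies on the rows being ordered Man then Woman within each group and returns nothing for a group with only one gender (this is why "under 18" is missing from the age gap table). A sketch of an alternative on made-up data uses pivot_wider() to put the two genders in named columns, so the subtraction is explicit and incomplete groups show up as NA. The column names `gap` and `ratio` are illustrative.

```r
library(dplyr)
library(tidyr)

# Toy data: "under 18" has no men, as in the survey.
toy <- tibble(
  age           = c("18-24", "18-24", "18-24", "25-34", "25-34", "under 18"),
  gender1       = c("Man", "Woman", "Woman", "Man", "Woman", "Woman"),
  annual_salary = c(60000, 50000, 54000, 90000, 70000, 72000)
)

gap_by_age <- toy %>%
  group_by(age, gender1) %>%
  summarize(median_salary = median(annual_salary), .groups = "drop") %>%
  # One column per gender, one row per age bracket.
  pivot_wider(names_from = gender1, values_from = median_salary) %>%
  mutate(gap   = Woman - Man,   # negative when women earn less
         ratio = Woman / Man)   # women's median as a share of men's

gap_by_age
```

Setting `.groups = "drop"` also silences the "grouped output" message seen in the summaries above.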

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ggplot(data=na.omit(survey3),aes(y=race1, x=annual_salary,fill=gender1))+
  geom_boxplot() +xlim(0,500000)+
  labs(title = "Gender pay gap by race",
       y = "Race",
       x = "Annual Salary")

## Warning: Removed 30 rows containing non-finite values (stat_boxplot).

ggplot(data=na.omit(survey3),aes(y=edu, x=annual_salary,fill=gender1))+
  geom_boxplot() +xlim(0,500000)+
  labs(title = "Gender pay gap by education",
       y = "Education",
       x = "Annual Salary")

## Warning: Removed 30 rows containing non-finite values (stat_boxplot).

ggplot(data=na.omit(survey3),aes(y=experyr1, x=annual_salary,fill=gender1))+
  geom_boxplot() +xlim(0,500000)+
  labs(title = "Gender pay gap by experience years in field",
       y = "Experience years in field",
       x = "Annual Salary")

## Warning: Removed 30 rows containing non-finite values (stat_boxplot).

ggplot(data=na.omit(survey3),aes(y=age, x=annual_salary,fill=gender1))+
  geom_boxplot() +xlim(0,500000)+
  labs(title = "Gender pay gap by age",
       y = "Age",
       x = "Annual Salary")

## Warning: Removed 30 rows containing non-finite values (stat_boxplot).
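
The "Removed 30 rows" warnings arise because xlim() discards salaries above 500,000 before the boxplot statistics are computed. A sketch of an alternative on made-up data uses coord_cartesian(), which zooms the view without dropping data, and scales::label_dollar() for readable axis labels. The data values here are invented for illustration.

```r
library(ggplot2)

# Toy data with one extreme salary per gender.
toy <- data.frame(
  gender1 = rep(c("Man", "Woman"), each = 5),
  annual_salary = c(60000, 90000, 105000, 130000, 900000,
                    50000, 70000, 75000, 95000, 600000)
)

p <- ggplot(toy, aes(x = annual_salary, y = gender1, fill = gender1)) +
  geom_boxplot() +
  # Zoom only: values above 500,000 still enter the boxplot statistics.
  coord_cartesian(xlim = c(0, 500000)) +
  # Format axis labels as dollar amounts.
  scale_x_continuous(labels = scales::label_dollar()) +
  labs(title = "Gender pay gap (toy data)",
       x = "Annual Salary",
       y = "Gender",
       fill = "Gender")

p
```

With only a few rows out of range the medians barely change either way, but coord_cartesian() avoids the warning and keeps the quartiles based on all observations.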

Final Summary (10 points)

Summarize your research question and findings below.

Accurate salary data is essential for people to understand their position in the job market. As a graduate student in Biostatistics, I particularly care about how women fare in it.

Using data from the Ask a Manager survey, I analyzed annual salary by gender, age, race, education, and years of experience in the field.

I found that a gender pay gap exists between men and women: on average, women in this sample earn about 73% of what men earn. In detail:

There is a gender pay gap in annual income. The mean gap between men and women is $31,645.31, and the median gap is $30,000.
The gender pay gap differs by race. The mean gap is widest for Native American or Alaska Native respondents ($42,123.94) and smallest for Black or African American respondents ($16,926). The median gap, however, is widest for Middle Eastern or Northern African respondents ($61,000) and smallest for Hispanic, Latino, or Spanish origin respondents ($16,000).
The gender pay gap differs by education. The mean gap is widest for High School ($51,913.40) and smallest for Professional degree (MD, JD, etc.) ($26,834.84). The median gap is widest and smallest in the same two groups, with different values ($49,753 and $18,500).
The mean gender pay gap is positively associated with years of experience in the field: it grows with each experience bracket, except for a dip at 31-40 years. The median gap does not follow this pattern throughout: it is smallest at 1 year or less ($4,137.50), widest at 21-30 years ($51,500), and narrower again in the two highest brackets.
The gender pay gap differs by age. The mean gap is widest at 45-54 ($42,647.36) and smallest at 18-24 ($4,070.11). The median gap, however, is widest at 55-64 ($45,000) and smallest at 18-24 ($6,875).
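
The 73% figure quoted above can be checked from the overall means in the gender summary table. The short calculation below uses the rounded values printed there; the median-based ratio is added for comparison and is not a figure quoted elsewhere in this report.

```r
# Overall values from the gender summary table (USD, rounded as printed).
mean_man     <- 118701
mean_woman   <- 87056
median_man   <- 105000
median_woman <- 75000

# Women's pay as a share of men's pay.
mean_ratio   <- mean_woman / mean_man
median_ratio <- median_woman / median_man

# Absolute gap based on the means.
mean_gap <- mean_man - mean_woman

round(c(mean_ratio = mean_ratio, median_ratio = median_ratio), 2)
mean_gap
```

Because the salary distribution is strongly right-skewed (the maximum is 3,600,000), the median-based ratio is arguably the more robust of the two.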

Are your findings what you expected? Why or Why not?

The results are not quite what I expected. According to the DOL report, women earn 82% of what men earn, but in this survey the figure is 73%. I expected the gap to narrow with experience, but the mean gap widens with both age and years of experience in the field. By education, the gap is smallest for professional degrees, while a PhD is associated with a wider gap than a college or master's degree.

For further exploration, I first want to understand the reasons behind these results, for example why a PhD does not narrow the gap.

Second, I only analyzed the gender pay gap by age, race, education, and years of experience in the field one at a time. Other variables, such as occupation, could be explored. Looking at two or three variables together could also be interesting: for example, among respondents with a master's degree, do Asian Americans have a larger gender pay gap than White respondents?
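
A two-variable breakdown of this kind could be sketched as follows. The data are made up, so the numbers say nothing about the real survey. The per-cell counts `n_man` and `n_woman` are included because cells become small quickly once two variables are crossed, and a median from a handful of respondents is unreliable.

```r
library(dplyr)

# Toy data: master's degree holders in two race groups (invented salaries).
toy <- tibble(
  edu           = rep("Master's degree", 8),
  race1         = rep(c("Asian or Asian American", "White"), each = 4),
  gender1       = rep(c("Man", "Man", "Woman", "Woman"), times = 2),
  annual_salary = c(120000, 130000, 90000, 100000,
                    110000, 120000, 85000, 95000)
)

gap_two_way <- toy %>%
  filter(!is.na(gender1)) %>%
  group_by(edu, race1) %>%
  summarize(
    n_man      = sum(gender1 == "Man"),     # cell sizes, to judge reliability
    n_woman    = sum(gender1 == "Woman"),
    median_gap = median(annual_salary[gender1 == "Woman"]) -
                 median(annual_salary[gender1 == "Man"]),
    .groups = "drop"
  )

gap_two_way
```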

Third, because the data come from a survey whose respondents self-selected, the sample is likely biased and may not be representative. I could compare it with data from the DOL or other studies.

Midterm

Yan Liu

2022-02-13