Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

  1. Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.
  • Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
  • See other data sources in the #data channel on Slack.
  • Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
  • You may use another dataset or your own data, but please make sure it is de-identified.
  1. Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.

  2. You must use each of the following functions at least once:

  • mutate()
  • group_by()
  • summarize()
  • ggplot()

and at least one of the following:

  • case_when()
  • across()
  • *_join() (i.e. left_join())
  • pivot_*() (i.e. pivot_longer())
  1. The code chunks below are guides, please add more code chunks to do what you need.

  2. If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

With the 2022 Winter Olympics underway, I’m curious to know the most common age range for athletes, and which age ranges most often win Olympic medals. This dataset ranges from the beginning of the modern Olympic Games, in 1896, through 2016.

Given your question, what is your expectation about the data?

My expectation is that most Olympic athletes are in the 21-25 age range, and most Olympic medals are won by this same age group. This seems to be an age when someone would be in their peak athletic form, and have had fewer injuries that could impact their performance.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

#tuesdata <- tt_load('2021-07-27')
#tuesdata <- tt_load(2021, week = 31)

#olympics <- tuesdata$olympics
#write_csv(olympics, 
#          file = ("data/olympics.csv"))

olympics <- read_csv("data/olympics.csv")
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, team, noc, games, season, city, sport, event, medal
## dbl  (5): id, age, height, weight, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(olympics)
Data summary
Name olympics
Number of rows 271116
Number of columns 15
_______________________
Column type frequency:
character 10
numeric 5
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 2 108 0 134731 0
sex 0 1.00 1 1 0 2 0
team 0 1.00 2 47 0 1184 0
noc 0 1.00 3 3 0 230 0
games 0 1.00 11 11 0 51 0
season 0 1.00 6 6 0 2 0
city 0 1.00 4 22 0 42 0
sport 0 1.00 4 25 0 66 0
event 0 1.00 15 85 0 765 0
medal 231333 0.15 4 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 68248.95 39022.29 1 34643 68205 102097.2 135571 ▇▇▇▇▇
age 9474 0.97 25.56 6.39 10 21 24 28.0 97 ▇▃▁▁▁
height 60171 0.78 175.34 10.52 127 168 175 183.0 226 ▁▂▇▂▁
weight 62875 0.77 70.70 14.35 25 60 70 79.0 214 ▃▇▁▁▁
year 0 1.00 1978.38 29.88 1896 1960 1988 2002.0 2016 ▁▂▃▆▇
glimpse(olympics)
## Rows: 271,116
## Columns: 15
## $ id     <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
## $ name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
## $ age    <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
## $ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
## $ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
## $ team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
## $ noc    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
## $ games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year   <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
## $ sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
## $ event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
## $ medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

This appears to be a very clean dataset. Characters and numbers are coded correctly, missing data is all coded as “NA”. I have removed individuals with “NA” listed for their age as these entries are not informative for the purposes of our question.

There are some unexpectedly high ages, with a maximum of 97 years. Upon further investigation, this is due to the types of Olympic events held in the early years of the modern Games. The event which had the oldest participants, was the Art Competition. This event was part of the Olympics from 1912 until 1948, with categories in architecture, literature, music, painting, and sculpture.

There were also several gaps in the data, which coincided with World War I and World War II. Therefore, I have chosen to only analyze data from 1952 through 2016, effectively removing both gaps in data and extreme outliers for non-athletic events. This analysis will also only focus on data from the Winter Olympics.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# remove unwanted variables
olympics2 <- olympics %>% drop_na(age)

olympics2 <- olympics2 %>% 
  select(sex, age, year, season, team, noc, sport, event, medal)

olympics2 <- olympics2 %>% filter(year >= "1952" & season == "Winter")


olympics2 <- olympics2 %>% 
  mutate(
  age_range = case_when(
    age <16 ~ "11-15", 
    age >= 16 & age <21 ~ "16-20", 
    age >= 21 & age <26 ~ "21-25", 
    age >= 26 & age <31 ~ "26-30", 
    age >= 31 & age <36 ~ "31-35", 
    age >= 36 & age <41 ~ "36-40", 
    age >= 41 & age <46 ~ "41-45", 
    age >= 46 & age <51 ~ "46-50", 
    age > 50  ~ "51+"
    
  )
)


olympics2 <- olympics2 %>% 
  relocate(age_range, .after = age)

Creating age ranges allows us to more easily view the data when graphing.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

olympics2 %>% skim
Data summary
Name Piped data
Number of rows 45100
Number of columns 10
_______________________
Column type frequency:
character 8
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sex 0 1.00 1 1 0 2 0
age_range 0 1.00 3 5 0 9 0
season 0 1.00 6 6 0 1 0
team 0 1.00 4 30 0 219 0
noc 0 1.00 3 3 0 119 0
sport 0 1.00 4 25 0 15 0
event 0 1.00 18 60 0 115 0
medal 39959 0.11 4 6 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 24.97 4.69 11 22 24 28 55 ▁▇▂▁▁
year 0 1 1991.70 17.47 1952 1980 1994 2006 2014 ▂▂▃▅▇

The transformed data shows a reduction in the number of variables, along with the addition of ‘age_range’. Restricting our analysis to just the Winter Olympics reduces the total number of rows from 271,116 to 45,100. The mean and maximum values for age have also been reduced by restricting analysis to the years 1952 through 2016. Additionally, ‘age’ no longer has values of “NA”.

(age_common <- olympics2 %>% tabyl(year, age_range) %>% 
  adorn_totals()) %>% gt()
year 11-15 16-20 21-25 26-30 31-35 36-40 41-45 46-50 51+
1952 5 130 518 264 121 31 13 6 0
1956 13 139 586 365 119 43 11 6 0
1960 7 241 453 313 83 11 0 0 0
1964 15 344 749 480 152 14 7 3 1
1968 15 334 847 541 111 23 0 0 1
1972 26 336 702 449 114 21 4 0 0
1976 21 442 806 426 129 18 5 3 0
1980 18 419 783 414 91 18 0 2 0
1984 12 481 1050 479 81 18 0 0 2
1988 10 510 1342 620 116 23 11 2 1
1992 23 635 1635 914 190 31 6 1 0
1994 18 485 1461 934 231 25 3 1 0
1998 11 503 1508 1156 364 56 4 1 0
2002 9 474 1569 1401 533 101 19 3 0
2006 10 538 1655 1436 603 115 19 5 1
2010 6 529 1689 1308 678 169 17 4 2
2014 10 560 1852 1647 634 157 27 3 1
Total 229 7100 19205 13147 4350 874 146 40 9
medals <- olympics2 %>% filter(medal != "NA")

(age_medal <- olympics2 %>% 
   filter(medal != "NA") %>% 
   tabyl(medal, age_range) %>% 
  adorn_totals()) %>% gt()
medal 11-15 16-20 21-25 26-30 31-35 36-40 41-45 46-50
Bronze 7 196 640 602 206 44 13 1
Gold 5 159 692 594 224 39 4 2
Silver 4 175 686 578 225 34 9 2
Total 16 530 2018 1774 655 117 26 5
(age_medal <- olympics2 %>% 
   filter(medal != "NA") %>% 
   tabyl(medal, age_range) %>% 
  adorn_totals()) %>% gt()
medal 11-15 16-20 21-25 26-30 31-35 36-40 41-45 46-50
Bronze 7 196 640 602 206 44 13 1
Gold 5 159 692 594 224 39 4 2
Silver 4 175 686 578 225 34 9 2
Total 16 530 2018 1774 655 117 26 5
(age2125 <- 2018/19205)
## [1] 0.1050768
(age2630 <- 1774/13147)
## [1] 0.1349357
(n2125 <- 19205/45100)
## [1] 0.4258315

Are the values what you expected for the variables? Why or Why not?

Yes, the results are as I expected, with the 21-25 year age range being the most common age range for Olympic athletes, at 19,205. This is slightly less than the number of athletes in the two neighboring age ranges combined.

The 21-25 age range is also the group that has won the most Olympic medals, at 2018. The 26-30 age range was very close with 1774 medalists. Interestingly, as a proportion, the 26-30 age range won medals more often, with 10.51% of athletes in the 26-30 age range receiving medals, while only 13.49% of athletes in the 21-25 age range received medals.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

olympics2 %>% 
  summarize(mean_age = mean(age, na.rm = TRUE),
            median_age = median(age, na.rm = TRUE), 
            minimum_age = min(age, na.rm = TRUE), 
            maximum_age = max(age, na.rm = TRUE))
## # A tibble: 1 × 4
##   mean_age median_age minimum_age maximum_age
##      <dbl>      <dbl>       <dbl>       <dbl>
## 1     25.0         24          11          55
medals %>% 
  group_by(medal) %>% 
  summarize(mean_age = mean(age, na.rm = T), 
            median_age = median(age, na.rm = T), 
            minimum_age = min(age, na.rm = TRUE), 
            maximum_age = max(age, na.rm = TRUE))
## # A tibble: 3 × 5
##   medal  mean_age median_age minimum_age maximum_age
##   <chr>     <dbl>      <dbl>       <dbl>       <dbl>
## 1 Bronze     26.0         26          14          46
## 2 Gold       26.0         26          13          49
## 3 Silver     25.9         25          15          50
olympics2 %>% 
  group_by(sex) %>% 
  summarize(mean_age = mean(age, na.rm = T), 
            median_age = median(age, na.rm = T), 
            minimum_age = min(age, na.rm = TRUE), 
            maximum_age = max(age, na.rm = TRUE))
## # A tibble: 2 × 5
##   sex   mean_age median_age minimum_age maximum_age
##   <chr>    <dbl>      <dbl>       <dbl>       <dbl>
## 1 F         24.0         24          11          48
## 2 M         25.4         25          12          55
medals %>% 
  group_by(sex, medal) %>% 
  summarize(mean_age = mean(age, na.rm = T), 
            median_age = median(age, na.rm = T), 
            minimum_age = min(age, na.rm = TRUE), 
            maximum_age = max(age, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 6
## # Groups:   sex [2]
##   sex   medal  mean_age median_age minimum_age maximum_age
##   <chr> <chr>     <dbl>      <dbl>       <dbl>       <dbl>
## 1 F     Bronze     25.2         25          15          45
## 2 F     Gold       25.3         25          13          43
## 3 F     Silver     25.3         25          15          46
## 4 M     Bronze     26.4         26          14          46
## 5 M     Gold       26.4         26          16          49
## 6 M     Silver     26.2         26          15          50

What are your findings about the summary? Are they what you expected?

TO obtain more precise results I chose to use the actual ages of the athletes, rather than the age range, for the summary analysis. The summary values obtained are as expected. Regarding the age of athletes, the mean is 24.97 years old, while the median is 24 years old. The median is a much better representation of the data, as it is more robust and not as susceptible to outliers as is the median. As we can see by the maximum age of 55, there are still some high ages present which would affect the mean.

The median ages for Gold and Bronze medals was 26 years, while the median age for Silver medal was 25 years old. All medals had very similar means ranging from 25.90 to 25.99. These values are much closer to each other, indicating that the higher age outliers have less of an effect for the medalists.

As an additional bit of information, I included sub-analysis and separated the sexes to see if there were any additional differences between male and female athletes. The median age for female athletes is 24 years, while the median age for male athletes is 25 years. The ages for medalists are also slightly different, with a mean age for female athletes of 25 years, and maximum age of 46 For male athletes the mean age is 26 years, with a maximum age for medalists of 50 years.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ggplot(data = olympics2, 
       aes(x = age_range,
           fill = age_range)) + 
  geom_bar(position = "stack") +
  labs(title = "Olympic Athletes by Age Range", 
       x = "Age Range", 
       y = "Count") + 
  theme(legend.position = "none")

ggplot(data = medals, 
       aes(x = age_range, 
           fill = medal)) + 
  geom_bar() +
  labs(title = "Olympic Athletes by Age Range", 
       x = "Age Range", 
       y = "Medal Count", 
       fill = "Medal") 

ggplot(data = medals, 
       aes(x = age, y = medal)) + 
  geom_boxplot(color="darkgrey") + 
  labs(x = "Age of Medalist",
       y = "Medal")

ggplot(data = medals, 
       aes(x = age, y = sex)) + 
  geom_boxplot(color="darkgrey") + 
  labs(x = "Age of Medalist",
       y = "Sex")

Final Summary (10 points)

Summarize your research question and findings below.

My aim was to determine the most common age of Olympic athletes in the Winter Games. Information was restricted to the years 1952 through 2016 to remove previous non-athletic Olympic events (such as Art Competition) which contributed greatly to the range of ages, the the outliers subsequently affecting non-robust measures such as the mean. The date restriction was also necessary to handle the gaps in data when no Olympic events were held due to World War I and World War II.

Of the 45,100 Olympic athletes participating in the Winter Games from 1952 through 2016 whose age was recorded, 19,205 of them were in the 21-25 year age range. This constitutes 42.58% of all athletes participating in the Winter Olympics. The mean age for athletes was 24.97 years, with a median of 24 years. Ages ranged from 11 to 55 years.

Of the 5141 Olympic medals, 2018 were won by athletes in the 21-25 year age range, constituting 39.25% of medals won during the Winter Olympics. Athletes who won Gold medals had a mean age of 25.97 years, with a median age of 26 years, and ranged from 14 to 46 years of age. Athletes who won Silver medals had a mean age of 25.99 years, with a mean age of 26years, and ranged from 13 to 49 years of age. Athletes who won Bronze medals had a mean age of 25.90 years, with a median age of 25 years, and ranged from 15 to 50 years of age.

Are your findings what you expected? Why or Why not?

Overall, these findings are what I expected. The mean age and age range for athletes and medalists were all around 25 years of age, with a majority of athletes in the 21-25 year age range. One of the most surprising things about the data was the range of values, with minimum age of 11 years for both male and female athletes, and maximum age of 48 years for female athletes and 55 years for male athletes (with medalists of 46 years and 50 years, respectively), even after limiting the dates of Olympic Games to those held in 1952 or later. Another unexpected finding was that the 21-25 year age range won more medals as an absolute value, but a greater proportion of the athletes in the 26-30 year age range won medals.