Midterm (Due Sunday 2/13/2022 at 11:55 pm)

Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.

60 / 60 points total

The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.

Instructions: Before you get Started

Pick a dataset. Ideally, the dataset should be around 2000 rows, and should have both categorical and numeric covariates. Choose a dataset with more than 4 columns/variables.

Potential Sources for data: Tidy Tuesday: https://github.com/rfordatascience/tidytuesday.
See other data sources in the #data channel on Slack.
Note that most of the Tidy Tuesday data are .csv files. There is code to load the files from csv for each of the datasets and a short description of the variables, or you can upload the .csv file into your data folder.
You may use another dataset or your own data, but please make sure it is de-identified.

Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:

mutate()
group_by()
summarize()
ggplot()

and at least one of the following:

case_when()
across()
*_join() (i.e. left_join())
pivot_*() (i.e. pivot_longer())

The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.

You may remove these instructions from your final Rmd if you like

Working Together

If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.

No acknowledgements of contributions = -10 points overall.

Please Note

I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.

Define Your Research Question (10 points)

Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?

With the 2022 Winter Olympics underway, I’m curious to know the most common age range for athletes, and which age ranges most often win Olympic medals. This dataset ranges from the beginning of the modern Olympic Games, in 1896, through 2016.

Given your question, what is your expectation about the data?

My expectation is that most Olympic athletes are in the 21-25 age range, and most Olympic medals are won by this same age group. This seems to be an age when someone would be in their peak athletic form, and have had fewer injuries that could impact their performance.

Loading the Data (10 points)

Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.

#tuesdata <- tt_load('2021-07-27')
#tuesdata <- tt_load(2021, week = 31)

#olympics <- tuesdata$olympics
#write_csv(olympics, 
#          file = ("data/olympics.csv"))

olympics <- read_csv("data/olympics.csv")

## Rows: 271116 Columns: 15

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, team, noc, games, season, city, sport, event, medal
## dbl  (5): id, age, height, weight, year

## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

skim(olympics)

Data summary
Name	olympics
Number of rows	271116
Number of columns	15
_______________________
Column type frequency:
character	10
numeric	5
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	2	108	134731
sex	0	1.00	1	1	2
team	0	1.00	2	47	1184
noc	0	1.00	3	3	230
games	0	1.00	11	11	51
season	0	1.00	6	6	2
city	0	1.00	4	22	42
sport	0	1.00	4	25	66
event	0	1.00	15	85	765
medal	231333	0.15	4	6	3

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
id	0	1.00	68248.95	39022.29	1	34643	68205	102097.2	135571	▇▇▇▇▇
age	9474	0.97	25.56	6.39	10	21	24	28.0	97	▇▃▁▁▁
height	60171	0.78	175.34	10.52	127	168	175	183.0	226	▁▂▇▂▁
weight	62875	0.77	70.70	14.35	25	60	70	79.0	214	▃▇▁▁▁
year	0	1.00	1978.38	29.88	1896	1960	1988	2002.0	2016	▁▂▃▆▇

glimpse(olympics)

## Rows: 271,116
## Columns: 15
## $ id     <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
## $ name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
## $ age    <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
## $ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
## $ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
## $ team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
## $ noc    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
## $ games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year   <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
## $ sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
## $ event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
## $ medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

If there are any quirks that you have to deal with NA coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.

This appears to be a very clean dataset. Characters and numbers are coded correctly, missing data is all coded as “NA”. I have removed individuals with “NA” listed for their age as these entries are not informative for the purposes of our question.

There are some unexpectedly high ages, with a maximum of 97 years. Upon further investigation, this is due to the types of Olympic events held in the early years of the modern Games. The event which had the oldest participants, was the Art Competition. This event was part of the Olympics from 1912 until 1948, with categories in architecture, literature, music, painting, and sculpture.

There were also several gaps in the data, which coincided with World War I and World War II. Therefore, I have chosen to only analyze data from 1952 through 2016, effectively removing both gaps in data and extreme outliers for non-athletic events. This analysis will also only focus on data from the Winter Olympics.

Make sure your data types are correct!

Transforming the data (15 points)

If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.

# remove unwanted variables
olympics2 <- olympics %>% drop_na(age)

olympics2 <- olympics2 %>% 
  select(sex, age, year, season, team, noc, sport, event, medal)

olympics2 <- olympics2 %>% filter(year >= "1952" & season == "Winter")


olympics2 <- olympics2 %>% 
  mutate(
  age_range = case_when(
    age <16 ~ "11-15", 
    age >= 16 & age <21 ~ "16-20", 
    age >= 21 & age <26 ~ "21-25", 
    age >= 26 & age <31 ~ "26-30", 
    age >= 31 & age <36 ~ "31-35", 
    age >= 36 & age <41 ~ "36-40", 
    age >= 41 & age <46 ~ "41-45", 
    age >= 46 & age <51 ~ "46-50", 
    age > 50  ~ "51+"
    
  )
)


olympics2 <- olympics2 %>% 
  relocate(age_range, .after = age)

Creating age ranges allows us to more easily view the data when graphing.

Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.

Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.

olympics2 %>% skim

Data summary
Name	Piped data
Number of rows	45100
Number of columns	10
_______________________
Column type frequency:
character	8
numeric	2
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
sex	0	1.00	1	1	2
age_range	0	1.00	3	5	9
season	0	1.00	6	6	1
team	0	1.00	4	30	219
noc	0	1.00	3	3	119
sport	0	1.00	4	25	15
event	0	1.00	18	60	115
medal	39959	0.11	4	6	3

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	0	1	24.97	4.69	11	22	24	28	55	▁▇▂▁▁
year	0	1	1991.70	17.47	1952	1980	1994	2006	2014	▂▂▃▅▇

The transformed data shows a reduction in the number of variables, along with the addition of ‘age_range’. Restricting our analysis to just the Winter Olympics reduces the total number of rows from 271,116 to 45,100. The mean and maximum values for age have also been reduced by restricting analysis to the years 1952 through 2016. Additionally, ‘age’ no longer has values of “NA”.

(age_common <- olympics2 %>% tabyl(year, age_range) %>% 
  adorn_totals()) %>% gt()

year	11-15	16-20	21-25	26-30	31-35	36-40	41-45	46-50	51+
1952	5	130	518	264	121	31	13	6	0
1956	13	139	586	365	119	43	11	6	0
1960	7	241	453	313	83	11	0	0	0
1964	15	344	749	480	152	14	7	3	1
1968	15	334	847	541	111	23	0	0	1
1972	26	336	702	449	114	21	4	0	0
1976	21	442	806	426	129	18	5	3	0
1980	18	419	783	414	91	18	0	2	0
1984	12	481	1050	479	81	18	0	0	2
1988	10	510	1342	620	116	23	11	2	1
1992	23	635	1635	914	190	31	6	1	0
1994	18	485	1461	934	231	25	3	1	0
1998	11	503	1508	1156	364	56	4	1	0
2002	9	474	1569	1401	533	101	19	3	0
2006	10	538	1655	1436	603	115	19	5	1
2010	6	529	1689	1308	678	169	17	4	2
2014	10	560	1852	1647	634	157	27	3	1
Total	229	7100	19205	13147	4350	874	146	40	9

medals <- olympics2 %>% filter(medal != "NA")

(age_medal <- olympics2 %>% 
   filter(medal != "NA") %>% 
   tabyl(medal, age_range) %>% 
  adorn_totals()) %>% gt()

medal	11-15	16-20	21-25	26-30	31-35	36-40	41-45	46-50
Bronze	7	196	640	602	206	44	13	1
Gold	5	159	692	594	224	39	4	2
Silver	4	175	686	578	225	34	9	2
Total	16	530	2018	1774	655	117	26	5

(age_medal <- olympics2 %>% 
   filter(medal != "NA") %>% 
   tabyl(medal, age_range) %>% 
  adorn_totals()) %>% gt()

medal	11-15	16-20	21-25	26-30	31-35	36-40	41-45	46-50
Bronze	7	196	640	602	206	44	13	1
Gold	5	159	692	594	224	39	4	2
Silver	4	175	686	578	225	34	9	2
Total	16	530	2018	1774	655	117	26	5

(age2125 <- 2018/19205)

## [1] 0.1050768

(age2630 <- 1774/13147)

## [1] 0.1349357

(n2125 <- 19205/45100)

## [1] 0.4258315

Are the values what you expected for the variables? Why or Why not?

Yes, the results are as I expected, with the 21-25 year age range being the most common age range for Olympic athletes, at 19,205. This is slightly less than the number of athletes in the two neighboring age ranges combined.

The 21-25 age range is also the group that has won the most Olympic medals, at 2018. The 26-30 age range was very close with 1774 medalists. Interestingly, as a proportion, the 26-30 age range won medals more often, with 10.51% of athletes in the 26-30 age range receiving medals, while only 13.49% of athletes in the 21-25 age range received medals.

Visualizing and Summarizing the Data (15 points)

Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question

olympics2 %>% 
  summarize(mean_age = mean(age, na.rm = TRUE),
            median_age = median(age, na.rm = TRUE), 
            minimum_age = min(age, na.rm = TRUE), 
            maximum_age = max(age, na.rm = TRUE))

## # A tibble: 1 × 4
##   mean_age median_age minimum_age maximum_age
##      <dbl>      <dbl>       <dbl>       <dbl>
## 1     25.0         24          11          55

medals %>% 
  group_by(medal) %>% 
  summarize(mean_age = mean(age, na.rm = T), 
            median_age = median(age, na.rm = T), 
            minimum_age = min(age, na.rm = TRUE), 
            maximum_age = max(age, na.rm = TRUE))

## # A tibble: 3 × 5
##   medal  mean_age median_age minimum_age maximum_age
##   <chr>     <dbl>      <dbl>       <dbl>       <dbl>
## 1 Bronze     26.0         26          14          46
## 2 Gold       26.0         26          13          49
## 3 Silver     25.9         25          15          50

olympics2 %>% 
  group_by(sex) %>% 
  summarize(mean_age = mean(age, na.rm = T), 
            median_age = median(age, na.rm = T), 
            minimum_age = min(age, na.rm = TRUE), 
            maximum_age = max(age, na.rm = TRUE))

## # A tibble: 2 × 5
##   sex   mean_age median_age minimum_age maximum_age
##   <chr>    <dbl>      <dbl>       <dbl>       <dbl>
## 1 F         24.0         24          11          48
## 2 M         25.4         25          12          55

medals %>% 
  group_by(sex, medal) %>% 
  summarize(mean_age = mean(age, na.rm = T), 
            median_age = median(age, na.rm = T), 
            minimum_age = min(age, na.rm = TRUE), 
            maximum_age = max(age, na.rm = TRUE))

## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.

## # A tibble: 6 × 6
## # Groups:   sex [2]
##   sex   medal  mean_age median_age minimum_age maximum_age
##   <chr> <chr>     <dbl>      <dbl>       <dbl>       <dbl>
## 1 F     Bronze     25.2         25          15          45
## 2 F     Gold       25.3         25          13          43
## 3 F     Silver     25.3         25          15          46
## 4 M     Bronze     26.4         26          14          46
## 5 M     Gold       26.4         26          16          49
## 6 M     Silver     26.2         26          15          50

What are your findings about the summary? Are they what you expected?

TO obtain more precise results I chose to use the actual ages of the athletes, rather than the age range, for the summary analysis. The summary values obtained are as expected. Regarding the age of athletes, the mean is 24.97 years old, while the median is 24 years old. The median is a much better representation of the data, as it is more robust and not as susceptible to outliers as is the median. As we can see by the maximum age of 55, there are still some high ages present which would affect the mean.

The median ages for Gold and Bronze medals was 26 years, while the median age for Silver medal was 25 years old. All medals had very similar means ranging from 25.90 to 25.99. These values are much closer to each other, indicating that the higher age outliers have less of an effect for the medalists.

As an additional bit of information, I included sub-analysis and separated the sexes to see if there were any additional differences between male and female athletes. The median age for female athletes is 24 years, while the median age for male athletes is 25 years. The ages for medalists are also slightly different, with a mean age for female athletes of 25 years, and maximum age of 46 For male athletes the mean age is 26 years, with a maximum age for medalists of 50 years.

Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.

ggplot(data = olympics2, 
       aes(x = age_range,
           fill = age_range)) + 
  geom_bar(position = "stack") +
  labs(title = "Olympic Athletes by Age Range", 
       x = "Age Range", 
       y = "Count") + 
  theme(legend.position = "none")

ggplot(data = medals, 
       aes(x = age_range, 
           fill = medal)) + 
  geom_bar() +
  labs(title = "Olympic Athletes by Age Range", 
       x = "Age Range", 
       y = "Medal Count", 
       fill = "Medal")

ggplot(data = medals, 
       aes(x = age, y = medal)) + 
  geom_boxplot(color="darkgrey") + 
  labs(x = "Age of Medalist",
       y = "Medal")

ggplot(data = medals, 
       aes(x = age, y = sex)) + 
  geom_boxplot(color="darkgrey") + 
  labs(x = "Age of Medalist",
       y = "Sex")

Final Summary (10 points)

Summarize your research question and findings below.

My aim was to determine the most common age of Olympic athletes in the Winter Games. Information was restricted to the years 1952 through 2016 to remove previous non-athletic Olympic events (such as Art Competition) which contributed greatly to the range of ages, the the outliers subsequently affecting non-robust measures such as the mean. The date restriction was also necessary to handle the gaps in data when no Olympic events were held due to World War I and World War II.

Of the 45,100 Olympic athletes participating in the Winter Games from 1952 through 2016 whose age was recorded, 19,205 of them were in the 21-25 year age range. This constitutes 42.58% of all athletes participating in the Winter Olympics. The mean age for athletes was 24.97 years, with a median of 24 years. Ages ranged from 11 to 55 years.

Of the 5141 Olympic medals, 2018 were won by athletes in the 21-25 year age range, constituting 39.25% of medals won during the Winter Olympics. Athletes who won Gold medals had a mean age of 25.97 years, with a median age of 26 years, and ranged from 14 to 46 years of age. Athletes who won Silver medals had a mean age of 25.99 years, with a mean age of 26years, and ranged from 13 to 49 years of age. Athletes who won Bronze medals had a mean age of 25.90 years, with a median age of 25 years, and ranged from 15 to 50 years of age.

Are your findings what you expected? Why or Why not?

Overall, these findings are what I expected. The mean age and age range for athletes and medalists were all around 25 years of age, with a majority of athletes in the 21-25 year age range. One of the most surprising things about the data was the range of values, with minimum age of 11 years for both male and female athletes, and maximum age of 48 years for female athletes and 55 years for male athletes (with medalists of 46 years and 50 years, respectively), even after limiting the dates of Olympic Games to those held in 1952 or later. Another unexpected finding was that the 21-25 year age range won more medals as an absolute value, but a greater proportion of the athletes in the 26-30 year age range won medals.

Olympic Athlete and Medalist Ages

Trisha Marsh

2022-02-13