Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
#data
channel on Slack..csv
file into your data
folder.Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join()
(i.e. left_join()
)pivot_*()
(i.e. pivot_longer()
)The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
With the 2022 Winter Olympics underway, I’m curious to know the most common age range for athletes, and which age ranges most often win Olympic medals. This dataset ranges from the beginning of the modern Olympic Games, in 1896, through 2016.
Given your question, what is your expectation about the data?
My expectation is that most Olympic athletes are in the 21-25 age range, and most Olympic medals are won by this same age group. This seems to be an age when someone would be in their peak athletic form, and have had fewer injuries that could impact their performance.
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
#tuesdata <- tt_load('2021-07-27')
#tuesdata <- tt_load(2021, week = 31)
#olympics <- tuesdata$olympics
#write_csv(olympics,
# file = ("data/olympics.csv"))
olympics <- read_csv("data/olympics.csv")
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): name, sex, team, noc, games, season, city, sport, event, medal
## dbl (5): id, age, height, weight, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
skim(olympics)
Name | olympics |
Number of rows | 271116 |
Number of columns | 15 |
_______________________ | |
Column type frequency: | |
character | 10 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 2 | 108 | 0 | 134731 | 0 |
sex | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
team | 0 | 1.00 | 2 | 47 | 0 | 1184 | 0 |
noc | 0 | 1.00 | 3 | 3 | 0 | 230 | 0 |
games | 0 | 1.00 | 11 | 11 | 0 | 51 | 0 |
season | 0 | 1.00 | 6 | 6 | 0 | 2 | 0 |
city | 0 | 1.00 | 4 | 22 | 0 | 42 | 0 |
sport | 0 | 1.00 | 4 | 25 | 0 | 66 | 0 |
event | 0 | 1.00 | 15 | 85 | 0 | 765 | 0 |
medal | 231333 | 0.15 | 4 | 6 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1.00 | 68248.95 | 39022.29 | 1 | 34643 | 68205 | 102097.2 | 135571 | ▇▇▇▇▇ |
age | 9474 | 0.97 | 25.56 | 6.39 | 10 | 21 | 24 | 28.0 | 97 | ▇▃▁▁▁ |
height | 60171 | 0.78 | 175.34 | 10.52 | 127 | 168 | 175 | 183.0 | 226 | ▁▂▇▂▁ |
weight | 62875 | 0.77 | 70.70 | 14.35 | 25 | 60 | 70 | 79.0 | 214 | ▃▇▁▁▁ |
year | 0 | 1.00 | 1978.38 | 29.88 | 1896 | 1960 | 1988 | 2002.0 | 2016 | ▁▂▃▆▇ |
glimpse(olympics)
## Rows: 271,116
## Columns: 15
## $ id <dbl> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, …
## $ name <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau…
## $ sex <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M"…
## $ age <dbl> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33,…
## $ height <dbl> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, …
## $ weight <dbl> 80, 60, NA, NA, 82, 82, 82, 82, 82, 82, 75, 75, 75, 75, 75, 75,…
## $ team <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "…
## $ noc <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", …
## $ games <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "19…
## $ year <dbl> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 199…
## $ season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Wi…
## $ city <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgar…
## $ sport <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating"…
## $ event <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", …
## $ medal <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
This appears to be a very clean dataset. Characters and numbers are coded correctly, missing data is all coded as “NA”. I have removed individuals with “NA” listed for their age as these entries are not informative for the purposes of our question.
There are some unexpectedly high ages, with a maximum of 97 years. Upon further investigation, this is due to the types of Olympic events held in the early years of the modern Games. The event which had the oldest participants, was the Art Competition. This event was part of the Olympics from 1912 until 1948, with categories in architecture, literature, music, painting, and sculpture.
There were also several gaps in the data, which coincided with World War I and World War II. Therefore, I have chosen to only analyze data from 1952 through 2016, effectively removing both gaps in data and extreme outliers for non-athletic events. This analysis will also only focus on data from the Winter Olympics.
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
# remove unwanted variables
olympics2 <- olympics %>% drop_na(age)
olympics2 <- olympics2 %>%
select(sex, age, year, season, team, noc, sport, event, medal)
olympics2 <- olympics2 %>% filter(year >= "1952" & season == "Winter")
olympics2 <- olympics2 %>%
mutate(
age_range = case_when(
age <16 ~ "11-15",
age >= 16 & age <21 ~ "16-20",
age >= 21 & age <26 ~ "21-25",
age >= 26 & age <31 ~ "26-30",
age >= 31 & age <36 ~ "31-35",
age >= 36 & age <41 ~ "36-40",
age >= 41 & age <46 ~ "41-45",
age >= 46 & age <51 ~ "46-50",
age > 50 ~ "51+"
)
)
olympics2 <- olympics2 %>%
relocate(age_range, .after = age)
Creating age ranges allows us to more easily view the data when graphing.
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
olympics2 %>% skim
Name | Piped data |
Number of rows | 45100 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
sex | 0 | 1.00 | 1 | 1 | 0 | 2 | 0 |
age_range | 0 | 1.00 | 3 | 5 | 0 | 9 | 0 |
season | 0 | 1.00 | 6 | 6 | 0 | 1 | 0 |
team | 0 | 1.00 | 4 | 30 | 0 | 219 | 0 |
noc | 0 | 1.00 | 3 | 3 | 0 | 119 | 0 |
sport | 0 | 1.00 | 4 | 25 | 0 | 15 | 0 |
event | 0 | 1.00 | 18 | 60 | 0 | 115 | 0 |
medal | 39959 | 0.11 | 4 | 6 | 0 | 3 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1 | 24.97 | 4.69 | 11 | 22 | 24 | 28 | 55 | ▁▇▂▁▁ |
year | 0 | 1 | 1991.70 | 17.47 | 1952 | 1980 | 1994 | 2006 | 2014 | ▂▂▃▅▇ |
The transformed data shows a reduction in the number of variables, along with the addition of ‘age_range’. Restricting our analysis to just the Winter Olympics reduces the total number of rows from 271,116 to 45,100. The mean and maximum values for age have also been reduced by restricting analysis to the years 1952 through 2016. Additionally, ‘age’ no longer has values of “NA”.
(age_common <- olympics2 %>% tabyl(year, age_range) %>%
adorn_totals()) %>% gt()
year | 11-15 | 16-20 | 21-25 | 26-30 | 31-35 | 36-40 | 41-45 | 46-50 | 51+ |
---|---|---|---|---|---|---|---|---|---|
1952 | 5 | 130 | 518 | 264 | 121 | 31 | 13 | 6 | 0 |
1956 | 13 | 139 | 586 | 365 | 119 | 43 | 11 | 6 | 0 |
1960 | 7 | 241 | 453 | 313 | 83 | 11 | 0 | 0 | 0 |
1964 | 15 | 344 | 749 | 480 | 152 | 14 | 7 | 3 | 1 |
1968 | 15 | 334 | 847 | 541 | 111 | 23 | 0 | 0 | 1 |
1972 | 26 | 336 | 702 | 449 | 114 | 21 | 4 | 0 | 0 |
1976 | 21 | 442 | 806 | 426 | 129 | 18 | 5 | 3 | 0 |
1980 | 18 | 419 | 783 | 414 | 91 | 18 | 0 | 2 | 0 |
1984 | 12 | 481 | 1050 | 479 | 81 | 18 | 0 | 0 | 2 |
1988 | 10 | 510 | 1342 | 620 | 116 | 23 | 11 | 2 | 1 |
1992 | 23 | 635 | 1635 | 914 | 190 | 31 | 6 | 1 | 0 |
1994 | 18 | 485 | 1461 | 934 | 231 | 25 | 3 | 1 | 0 |
1998 | 11 | 503 | 1508 | 1156 | 364 | 56 | 4 | 1 | 0 |
2002 | 9 | 474 | 1569 | 1401 | 533 | 101 | 19 | 3 | 0 |
2006 | 10 | 538 | 1655 | 1436 | 603 | 115 | 19 | 5 | 1 |
2010 | 6 | 529 | 1689 | 1308 | 678 | 169 | 17 | 4 | 2 |
2014 | 10 | 560 | 1852 | 1647 | 634 | 157 | 27 | 3 | 1 |
Total | 229 | 7100 | 19205 | 13147 | 4350 | 874 | 146 | 40 | 9 |
medals <- olympics2 %>% filter(medal != "NA")
(age_medal <- olympics2 %>%
filter(medal != "NA") %>%
tabyl(medal, age_range) %>%
adorn_totals()) %>% gt()
medal | 11-15 | 16-20 | 21-25 | 26-30 | 31-35 | 36-40 | 41-45 | 46-50 |
---|---|---|---|---|---|---|---|---|
Bronze | 7 | 196 | 640 | 602 | 206 | 44 | 13 | 1 |
Gold | 5 | 159 | 692 | 594 | 224 | 39 | 4 | 2 |
Silver | 4 | 175 | 686 | 578 | 225 | 34 | 9 | 2 |
Total | 16 | 530 | 2018 | 1774 | 655 | 117 | 26 | 5 |
(age_medal <- olympics2 %>%
filter(medal != "NA") %>%
tabyl(medal, age_range) %>%
adorn_totals()) %>% gt()
medal | 11-15 | 16-20 | 21-25 | 26-30 | 31-35 | 36-40 | 41-45 | 46-50 |
---|---|---|---|---|---|---|---|---|
Bronze | 7 | 196 | 640 | 602 | 206 | 44 | 13 | 1 |
Gold | 5 | 159 | 692 | 594 | 224 | 39 | 4 | 2 |
Silver | 4 | 175 | 686 | 578 | 225 | 34 | 9 | 2 |
Total | 16 | 530 | 2018 | 1774 | 655 | 117 | 26 | 5 |
(age2125 <- 2018/19205)
## [1] 0.1050768
(age2630 <- 1774/13147)
## [1] 0.1349357
(n2125 <- 19205/45100)
## [1] 0.4258315
Are the values what you expected for the variables? Why or Why not?
Yes, the results are as I expected, with the 21-25 year age range being the most common age range for Olympic athletes, at 19,205. This is slightly less than the number of athletes in the two neighboring age ranges combined.
The 21-25 age range is also the group that has won the most Olympic medals, at 2018. The 26-30 age range was very close with 1774 medalists. Interestingly, as a proportion, the 26-30 age range won medals more often, with 10.51% of athletes in the 26-30 age range receiving medals, while only 13.49% of athletes in the 21-25 age range received medals.
Use
group_by()
andsummarize()
to make a summary of the data here. The summary should be relevant to your research question
olympics2 %>%
summarize(mean_age = mean(age, na.rm = TRUE),
median_age = median(age, na.rm = TRUE),
minimum_age = min(age, na.rm = TRUE),
maximum_age = max(age, na.rm = TRUE))
## # A tibble: 1 × 4
## mean_age median_age minimum_age maximum_age
## <dbl> <dbl> <dbl> <dbl>
## 1 25.0 24 11 55
medals %>%
group_by(medal) %>%
summarize(mean_age = mean(age, na.rm = T),
median_age = median(age, na.rm = T),
minimum_age = min(age, na.rm = TRUE),
maximum_age = max(age, na.rm = TRUE))
## # A tibble: 3 × 5
## medal mean_age median_age minimum_age maximum_age
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Bronze 26.0 26 14 46
## 2 Gold 26.0 26 13 49
## 3 Silver 25.9 25 15 50
olympics2 %>%
group_by(sex) %>%
summarize(mean_age = mean(age, na.rm = T),
median_age = median(age, na.rm = T),
minimum_age = min(age, na.rm = TRUE),
maximum_age = max(age, na.rm = TRUE))
## # A tibble: 2 × 5
## sex mean_age median_age minimum_age maximum_age
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 F 24.0 24 11 48
## 2 M 25.4 25 12 55
medals %>%
group_by(sex, medal) %>%
summarize(mean_age = mean(age, na.rm = T),
median_age = median(age, na.rm = T),
minimum_age = min(age, na.rm = TRUE),
maximum_age = max(age, na.rm = TRUE))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups` argument.
## # A tibble: 6 × 6
## # Groups: sex [2]
## sex medal mean_age median_age minimum_age maximum_age
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 F Bronze 25.2 25 15 45
## 2 F Gold 25.3 25 13 43
## 3 F Silver 25.3 25 15 46
## 4 M Bronze 26.4 26 14 46
## 5 M Gold 26.4 26 16 49
## 6 M Silver 26.2 26 15 50
What are your findings about the summary? Are they what you expected?
TO obtain more precise results I chose to use the actual ages of the athletes, rather than the age range, for the summary analysis. The summary values obtained are as expected. Regarding the age of athletes, the mean is 24.97 years old, while the median is 24 years old. The median is a much better representation of the data, as it is more robust and not as susceptible to outliers as is the median. As we can see by the maximum age of 55, there are still some high ages present which would affect the mean.
The median ages for Gold and Bronze medals was 26 years, while the median age for Silver medal was 25 years old. All medals had very similar means ranging from 25.90 to 25.99. These values are much closer to each other, indicating that the higher age outliers have less of an effect for the medalists.
As an additional bit of information, I included sub-analysis and separated the sexes to see if there were any additional differences between male and female athletes. The median age for female athletes is 24 years, while the median age for male athletes is 25 years. The ages for medalists are also slightly different, with a mean age for female athletes of 25 years, and maximum age of 46 For male athletes the mean age is 26 years, with a maximum age for medalists of 50 years.
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
ggplot(data = olympics2,
aes(x = age_range,
fill = age_range)) +
geom_bar(position = "stack") +
labs(title = "Olympic Athletes by Age Range",
x = "Age Range",
y = "Count") +
theme(legend.position = "none")
ggplot(data = medals,
aes(x = age_range,
fill = medal)) +
geom_bar() +
labs(title = "Olympic Athletes by Age Range",
x = "Age Range",
y = "Medal Count",
fill = "Medal")
ggplot(data = medals,
aes(x = age, y = medal)) +
geom_boxplot(color="darkgrey") +
labs(x = "Age of Medalist",
y = "Medal")
ggplot(data = medals,
aes(x = age, y = sex)) +
geom_boxplot(color="darkgrey") +
labs(x = "Age of Medalist",
y = "Sex")
Summarize your research question and findings below.
My aim was to determine the most common age of Olympic athletes in the Winter Games. Information was restricted to the years 1952 through 2016 to remove previous non-athletic Olympic events (such as Art Competition) which contributed greatly to the range of ages, the the outliers subsequently affecting non-robust measures such as the mean. The date restriction was also necessary to handle the gaps in data when no Olympic events were held due to World War I and World War II.
Of the 45,100 Olympic athletes participating in the Winter Games from 1952 through 2016 whose age was recorded, 19,205 of them were in the 21-25 year age range. This constitutes 42.58% of all athletes participating in the Winter Olympics. The mean age for athletes was 24.97 years, with a median of 24 years. Ages ranged from 11 to 55 years.
Of the 5141 Olympic medals, 2018 were won by athletes in the 21-25 year age range, constituting 39.25% of medals won during the Winter Olympics. Athletes who won Gold medals had a mean age of 25.97 years, with a median age of 26 years, and ranged from 14 to 46 years of age. Athletes who won Silver medals had a mean age of 25.99 years, with a mean age of 26years, and ranged from 13 to 49 years of age. Athletes who won Bronze medals had a mean age of 25.90 years, with a median age of 25 years, and ranged from 15 to 50 years of age.
Are your findings what you expected? Why or Why not?
Overall, these findings are what I expected. The mean age and age range for athletes and medalists were all around 25 years of age, with a majority of athletes in the 21-25 year age range. One of the most surprising things about the data was the range of values, with minimum age of 11 years for both male and female athletes, and maximum age of 48 years for female athletes and 55 years for male athletes (with medalists of 46 years and 50 years, respectively), even after limiting the dates of Olympic Games to those held in 1952 or later. Another unexpected finding was that the 21-25 year age range won more medals as an absolute value, but a greater proportion of the athletes in the 26-30 year age range won medals.