Please submit your .Rmd and .html files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
#data channel on Slack. .csv file into your data folder.
Define a research question involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us on Slack or by email. Please do one of the two options early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join() (e.g. left_join())
pivot_*() (e.g. pivot_longer())
The code chunks below are guides; please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
My research question is which factors (company location, review date, bean origin, cocoa percentage, or number of ingredients) are related to chocolate bar rating. The data interests me because it contains over 2500 chocolate bar reviews collected over the years and documents different characteristics of these bars. Specifically, I want to find out whether there is any observable trend between the chocolate ratings and these characteristics.
Given your question, what is your expectation about the data?
I expect the data to include both the chocolate ratings (the outcome variable) and the chocolate bars’ characteristics.
Load the data below and use dplyr::glimpse() or skimr::skim() on the data. You should upload the data file into the data directory.
library(tidyverse)
library(skimr)
chocolate <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
head(chocolate)
## ref company_manufacturer company_location review_date country_of_bean_origin
## 1 2454 5150 U.S.A. 2019 Tanzania
## 2 2458 5150 U.S.A. 2019 Dominican Republic
## 3 2454 5150 U.S.A. 2019 Madagascar
## 4 2542 5150 U.S.A. 2021 Fiji
## 5 2546 5150 U.S.A. 2021 Venezuela
## 6 2546 5150 U.S.A. 2021 Uganda
## specific_bean_origin_or_bar_name cocoa_percent ingredients
## 1 Kokoa Kamili, batch 1 76% 3- B,S,C
## 2 Zorzal, batch 1 76% 3- B,S,C
## 3 Bejofo Estate, batch 1 76% 3- B,S,C
## 4 Matasawalevu, batch 1 68% 3- B,S,C
## 5 Sur del Lago, batch 1 72% 3- B,S,C
## 6 Semuliki Forest, batch 1 80% 3- B,S,C
## most_memorable_characteristics rating
## 1 rich cocoa, fatty, bready 3.25
## 2 cocoa, vegetal, savory 3.50
## 3 cocoa, blackberry, full body 3.75
## 4 chewy, off, rubbery 3.00
## 5 fatty, earthy, moss, nutty,chalky 3.00
## 6 mildly bitter, basic cocoa, fatty 3.25
# use skimr to look at the data structure
skim(chocolate)
Name | chocolate |
Number of rows | 2530 |
Number of columns | 10 |
_______________________ | |
Column type frequency: | |
character | 7 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company_manufacturer | 0 | 1 | 2 | 39 | 0 | 580 | 0 |
company_location | 0 | 1 | 4 | 21 | 0 | 67 | 0 |
country_of_bean_origin | 0 | 1 | 4 | 21 | 0 | 62 | 0 |
specific_bean_origin_or_bar_name | 0 | 1 | 3 | 51 | 0 | 1605 | 0 |
cocoa_percent | 0 | 1 | 3 | 6 | 0 | 46 | 0 |
ingredients | 0 | 1 | 0 | 14 | 87 | 22 | 0 |
most_memorable_characteristics | 0 | 1 | 3 | 37 | 0 | 2487 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ref | 0 | 1 | 1429.80 | 757.65 | 5 | 802 | 1454.00 | 2079.0 | 2712 | ▆▇▇▇▇ |
review_date | 0 | 1 | 2014.37 | 3.97 | 2006 | 2012 | 2015.00 | 2018.0 | 2021 | ▃▅▇▆▅ |
rating | 0 | 1 | 3.20 | 0.45 | 1 | 3 | 3.25 | 3.5 | 4 | ▁▁▅▇▇ |
The data set I chose is read directly from the Tidy Tuesday repository. It is a collection of chocolate reviews gathered over the years by Brady Brelinski of the Manhattan Chocolate Society. I didn’t upload the data into the data folder since it is more convenient for me to read it in directly from the URL.
If there are any quirks that you have to deal with, such as NA coded as something else or data spread across multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
I notice that the cocoa_percent column is character, so I’m going to strip the “%” sign and convert it to numeric.
I also notice that there are 87 empty values in the ingredients column. Since I’m planning to use this column later, I’m going to filter out observations with empty values. However, this could potentially influence the trends we observe later on.
# strip the "%" sign from cocoa_percent, then convert it to numeric
chocolate$cocoa_percent <- gsub("%", "", as.character(chocolate$cocoa_percent))
chocolate$cocoa_percent <- as.numeric(chocolate$cocoa_percent)
# filter out empty observations in the ingredients column
chocolate <- chocolate %>%
  filter(ingredients != "")
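One way to check the two cleaning steps above is to try them on a small made-up data frame first. The sketch below uses hypothetical toy values (not the real chocolate data) and base R only. It confirms that stripping the “%” sign leaves no NA after conversion, and it counts how many rows the empty-ingredient filter drops.

```r
# Toy data standing in for the chocolate table (values are made up)
toy <- data.frame(
  cocoa_percent = c("76%", "68%", "72.5%", "100%"),
  ingredients   = c("3- B,S,C", "", "2- B,S", ""),
  stringsAsFactors = FALSE
)

# Strip the percent sign and convert to numeric
toy$cocoa_percent <- as.numeric(sub("%", "", toy$cocoa_percent, fixed = TRUE))

# A failed conversion would show up as NA, so check for that explicitly
stopifnot(!anyNA(toy$cocoa_percent))

# Count the empty ingredient strings before dropping them
n_empty   <- sum(toy$ingredients == "")
toy_clean <- toy[toy$ingredients != "", ]

n_empty          # number of rows that will be removed
nrow(toy_clean)  # number of rows kept
```

Recording the count of dropped rows at this point makes it easier to report the exact number later in the write-up.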
Make sure your data types are correct!
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using case_when(), etc.
I want to observe the relation between chocolate ratings and the number of ingredients in a scatter plot. Since the ingredients column includes both the number of ingredients and the ingredients themselves, I’m going to create another variable (number_ing) that only includes the number of ingredients, using mutate() and case_when().
# check to see the maximum number of ingredients
unique(chocolate$ingredients) # maximum of 6 ingredients
## [1] "3- B,S,C" "4- B,S,C,L" "2- B,S" "4- B,S,C,V"
## [5] "5- B,S,C,V,L" "6-B,S,C,V,L,Sa" "5-B,S,C,V,Sa" "4- B,S,V,L"
## [9] "2- B,S*" "1- B" "3- B,S*,C" "3- B,S,L"
## [13] "3- B,S,V" "4- B,S*,C,L" "4- B,S*,C,Sa" "3- B,S*,Sa"
## [17] "4- B,S,C,Sa" "4- B,S*,V,L" "2- B,C" "4- B,S*,C,V"
## [21] "5- B,S,C,L,Sa"
# create a variable that holds only the number of ingredients;
# "^" anchors each pattern to the first character of the string
chocolate <- chocolate %>%
  mutate(number_ing = case_when(
    grepl("^1", ingredients) ~ 1,
    grepl("^2", ingredients) ~ 2,
    grepl("^3", ingredients) ~ 3,
    grepl("^4", ingredients) ~ 4,
    grepl("^5", ingredients) ~ 5,
    TRUE ~ 6
  ))
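As a cross-check on the case_when() recoding, the leading digit can also be pulled out directly with a regular expression. This is only a sketch in base R: the strings are a few values copied from the unique() output above, and the names num_from_regex and num_from_codes are made up for illustration.

```r
# A few ingredient strings taken from the unique() output above
ingredients <- c("3- B,S,C", "6-B,S,C,V,L,Sa", "1- B", "2- B,S*", "5- B,S,C,L,Sa")

# Keep only the digits at the start of each string and convert them to integers
num_from_regex <- as.integer(sub("^([0-9]+).*$", "\\1", ingredients))

# Independent check: count the comma-separated codes after the dash
codes <- strsplit(sub("^[0-9]+- *", "", ingredients), ",", fixed = TRUE)
num_from_codes <- lengths(codes)

num_from_regex

# The two ways of counting should agree for every string
stopifnot(all(num_from_regex == num_from_codes))
```

If the two counts ever disagreed on the full data, that would point to a mislabeled ingredients string rather than a problem with the recoding.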
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use left_join, inner_join, or right_join on these tables. No credit will be provided if you don’t.
Show your transformed table here. Use tools such as glimpse(), skim() or head() to illustrate your point.
skim(chocolate)
Name | chocolate |
Number of rows | 2443 |
Number of columns | 11 |
_______________________ | |
Column type frequency: | |
character | 6 |
numeric | 5 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company_manufacturer | 0 | 1 | 2 | 39 | 0 | 542 | 0 |
company_location | 0 | 1 | 4 | 21 | 0 | 67 | 0 |
country_of_bean_origin | 0 | 1 | 4 | 21 | 0 | 62 | 0 |
specific_bean_origin_or_bar_name | 0 | 1 | 3 | 51 | 0 | 1567 | 0 |
ingredients | 0 | 1 | 4 | 14 | 0 | 21 | 0 |
most_memorable_characteristics | 0 | 1 | 3 | 37 | 0 | 2403 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
ref | 0 | 1 | 1451.52 | 755.52 | 5 | 833 | 1474.00 | 2100.0 | 2712 | ▆▆▇▇▇ |
review_date | 0 | 1 | 2014.49 | 3.96 | 2006 | 2012 | 2015.00 | 2018.0 | 2021 | ▃▅▇▆▅ |
cocoa_percent | 0 | 1 | 71.50 | 5.16 | 42 | 70 | 70.00 | 74.0 | 100 | ▁▁▇▁▁ |
rating | 0 | 1 | 3.21 | 0.43 | 1 | 3 | 3.25 | 3.5 | 4 | ▁▁▅▇▇ |
number_ing | 0 | 1 | 3.04 | 0.91 | 1 | 2 | 3.00 | 4.0 | 6 | ▆▇▃▂▁ |
Are the values what you expected for the variables? Why or Why not?
Yes, the values are what I expected for the variables. I successfully converted the variable cocoa_percent from character to numeric, and the newly created variable number_ing only includes the number of ingredients.
Use group_by() and summarize() to make a summary of the data here. The summary should be relevant to your research question.
# summarize the rating by company location
chocolate %>%
  group_by(company_location) %>%
  summarize(mean_rating = mean(rating),
            min = quantile(rating, probs = 0),
            p25 = quantile(rating, probs = 0.25),
            median = quantile(rating, probs = 0.5),
            p75 = quantile(rating, probs = 0.75),
            max = quantile(rating, probs = 1))
## # A tibble: 67 x 7
## company_location mean_rating min p25 median p75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Amsterdam 3.31 3 3.19 3.25 3.5 3.75
## 2 Argentina 3.5 3.5 3.5 3.5 3.5 3.5
## 3 Australia 3.37 2.5 3.12 3.5 3.75 4
## 4 Austria 3.26 2.75 3 3.25 3.5 3.75
## 5 Belgium 3.18 1 3 3.12 3.5 4
## 6 Bolivia 3.25 2.75 3 3.25 3.5 3.75
## 7 Brazil 3.27 2.5 3 3.25 3.5 4
## 8 Canada 3.31 2 3 3.25 3.5 4
## 9 Chile 3.75 3.75 3.75 3.75 3.75 3.75
## 10 Colombia 3.22 2 3 3.25 3.5 3.75
## # ... with 57 more rows
# summarize the rating by bean origin
chocolate %>%
  group_by(country_of_bean_origin) %>%
  summarize(mean_rating = mean(rating),
            min = quantile(rating, probs = 0),
            p25 = quantile(rating, probs = 0.25),
            median = quantile(rating, probs = 0.5),
            p75 = quantile(rating, probs = 0.75),
            max = quantile(rating, probs = 1))
## # A tibble: 62 x 7
## country_of_bean_origin mean_rating min p25 median p75 max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Australia 3.25 2.75 3 3.25 3.5 3.75
## 2 Belize 3.24 2.5 3 3.25 3.5 4
## 3 Blend 3.09 1 2.75 3 3.5 4
## 4 Bolivia 3.18 2 3 3.25 3.5 4
## 5 Brazil 3.26 1.75 3 3.25 3.5 4
## 6 Burma 3 3 3 3 3 3
## 7 Cameroon 3.08 3 3 3 3.12 3.25
## 8 China 3.5 3.5 3.5 3.5 3.5 3.5
## 9 Colombia 3.21 2 3 3.25 3.5 4
## 10 Congo 3.32 2.75 3.12 3.25 3.5 3.75
## # ... with 52 more rows
# summarize the rating by number of ingredients
chocolate %>%
  group_by(number_ing) %>%
  summarize(mean_rating = mean(rating),
            min = quantile(rating, probs = 0),
            p25 = quantile(rating, probs = 0.25),
            median = quantile(rating, probs = 0.5),
            p75 = quantile(rating, probs = 0.75),
            max = quantile(rating, probs = 1))
## # A tibble: 6 x 7
## number_ing mean_rating min p25 median p75 max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2.96 2.5 2.81 3 3 3.5
## 2 2 3.22 2 3 3.25 3.5 4
## 3 3 3.27 1.5 3 3.25 3.5 4
## 4 4 3.13 1.5 2.75 3 3.5 4
## 5 5 3.08 1 2.75 3 3.5 4
## 6 6 2.94 2.75 2.75 2.75 2.94 3.5
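The three summaries above repeat the same summarize() call with a different grouping column each time. One possible way to avoid the repetition is a small helper function. The sketch below is an illustration only: summarize_rating is a hypothetical name, the toy data frame holds made-up values, and the helper reports fewer statistics than the summaries above but adds a group size column n.

```r
library(dplyr)

# Hypothetical helper: the same rating summary for any grouping column
summarize_rating <- function(data, group_col) {
  data %>%
    group_by(across(all_of(group_col))) %>%
    summarize(
      n           = n(),
      mean_rating = mean(rating),
      median      = median(rating),
      min         = min(rating),
      max         = max(rating),
      .groups     = "drop"
    )
}

# Toy data (made-up values) with the column names used in this report
toy <- data.frame(
  company_location = c("U.S.A.", "U.S.A.", "France", "France", "Chile"),
  number_ing       = c(3, 3, 2, 4, 2),
  rating           = c(3.25, 3.50, 3.00, 3.75, 3.75)
)

by_location <- summarize_rating(toy, "company_location")
by_ing      <- summarize_rating(toy, "number_ing")
by_location
```

Including n in the summary shows which groups rest on only a handful of bars, which matters when comparing their means.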
What are your findings about the summary? Are they what you expected?
After grouping the data by company location, the means and ranges of the ratings are pretty similar across groups. None of the means or medians is surprisingly low (1) or high (4). The same goes for the means and ranges of the ratings after grouping by cocoa bean origin. Overall, I don’t see any trend or relation between ratings and company location or bean origin. This is not different from what I expected.
After grouping the data by the number of ingredients, it is surprising to see that the mean ratings of these groups do not differ much from each other. The highest mean rating belongs to the three-ingredient group (its median of 3.25 ties with the two-ingredient group), while the lowest mean and median ratings belong to the six-ingredient group, so ratings do not rise or fall steadily as the number of ingredients increases. This is different from what I expected, since I thought that increasing the number of ingredients could improve the flavor of chocolate and therefore increase the rating.
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
# rating vs number of ingredients
p1 <- ggplot(chocolate,
             aes(y = rating,
                 x = number_ing)) +
  geom_point() +
  labs(title = "Rating vs Number of Ingredients",
       x = "Number of Ingredients",
       y = "Rating")
# rating vs review date
p2 <- ggplot(chocolate,
             aes(y = rating,
                 x = review_date,
                 color = number_ing)) +
  geom_point() +
  labs(title = "Rating vs Review Date",
       x = "Review Year",
       y = "Rating")
# rating vs cocoa percentage
p3 <- ggplot(chocolate,
             aes(y = rating,
                 x = cocoa_percent,
                 color = number_ing)) +
  geom_point() +
  labs(title = "Rating vs Cocoa Percentage",
       x = "Cocoa Percentage",
       y = "Rating")
p1
p2
p3
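Because ratings come in quarter-point steps and the number of ingredients is a whole number, many points in a plain scatter plot land on top of each other. A possible variation, shown here on made-up toy data rather than the real table, is to treat the number of ingredients as a category and combine a box plot with jittered points. The name p1_alt is hypothetical.

```r
library(ggplot2)

# Toy data (made-up values) with the same two columns used in p1
toy <- data.frame(
  number_ing = c(2, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  rating     = c(3.25, 3.25, 3.50, 3.00, 3.25, 3.25, 3.75, 2.75, 3.00, 3.00)
)

# Box plot per ingredient count, with horizontally jittered points on top
# so that repeated ratings remain visible
p1_alt <- ggplot(toy, aes(x = factor(number_ing), y = rating)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  labs(title = "Rating vs Number of Ingredients",
       x = "Number of Ingredients",
       y = "Rating")

p1_alt
```

Setting height = 0 keeps each point at its true rating, so only the horizontal position is randomized.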
# standard deviations of ratings over time
table1 <- chocolate %>%
  group_by(review_date) %>%
  summarize(standard_deviations = sd(rating))
table1
## # A tibble: 16 x 2
## review_date standard_deviations
## <int> <dbl>
## 1 2006 0.660
## 2 2007 0.598
## 3 2008 0.495
## 4 2009 0.439
## 5 2010 0.432
## 6 2011 0.476
## 7 2012 0.464
## 8 2013 0.432
## 9 2014 0.409
## 10 2015 0.382
## 11 2016 0.417
## 12 2017 0.346
## 13 2018 0.389
## 14 2019 0.377
## 15 2020 0.321
## 16 2021 0.344
# standard deviation of rating vs review year
p4 <- ggplot(table1,
             aes(y = standard_deviations,
                 x = review_date)) +
  geom_line() +
  labs(title = "Standard Deviation of Rating over Time",
       x = "Review Year",
       y = "Standard Deviation")
p4
From the plots above, there don’t appear to be any negative or positive trends between chocolate bar rating (the outcome variable) and the number of ingredients, review date, or cocoa percentage. However, I do notice that the variation in rating seems to decrease over time (see the standard deviation vs review year plot).
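The visual impression that the spread of ratings shrinks over time can be given a rough numeric check. The sketch below re-enters the yearly standard deviations from the table1 output above and computes a correlation and a straight-line slope against review year. It is a quick sanity check on 16 yearly values, not a formal test, and the names sd_cor, sd_fit, and sd_slope are made up for illustration.

```r
# Yearly standard deviations copied from the table1 output above
table1 <- data.frame(
  review_date = 2006:2021,
  standard_deviations = c(0.660, 0.598, 0.495, 0.439, 0.432, 0.476, 0.464, 0.432,
                          0.409, 0.382, 0.417, 0.346, 0.389, 0.377, 0.321, 0.344)
)

# Linear correlation between review year and the spread of ratings
sd_cor <- cor(table1$review_date, table1$standard_deviations)

# Straight-line fit; the slope is the average change in SD per year
sd_fit   <- lm(standard_deviations ~ review_date, data = table1)
sd_slope <- unname(coef(sd_fit)["review_date"])

sd_cor
sd_slope
```

A negative correlation and a negative slope would be consistent with the decreasing variation seen in the line plot.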
Summarize your research question and findings below.
To answer my research question of whether any of the mentioned factors (company location, bean origin, number of ingredients, review date, cocoa percentage) is related to chocolate ratings, I looked at summaries and plots of these factors in relation to chocolate rating.
After grouping the observations by company location, I found that the averages and ranges of the ratings are very similar across groups. No group has a mean or median rating that is out of the ordinary (too high or too low). Grouping the observations by bean origin and by number of ingredients also showed no observable trend between the ratings and these factors. Therefore, I conclude that there aren’t any observable relations between chocolate ratings and company location, bean origin, or number of ingredients.
From the plot of rating vs number of ingredients, I found no observable positive or negative trend between the two variables. However, I notice that there are fewer observations for the one- and six-ingredient groups compared to the other groups. This could explain why these groups’ average ratings are lower than the others’, as shown in the summary.
From the rating vs review date scatter plot, although I found no observable negative or positive trend between these variables, I found that the variation in rating decreases over time. This is confirmed by the standard deviation over time plot. This is interesting, since the result could be due either to the quality of different chocolate brands getting closer to each other over time or to the chocolate taster becoming more consistent at reviewing chocolate.
From the rating vs cocoa percentage plot, I found no observable positive or negative trend between these two variables.
From the summaries and plots above, I found no observable relations between chocolate rating and the mentioned factors. However, I did find that the variation in chocolate ratings decreases over time.
Keep in mind that I removed 87 observations from the data set (2530 rows before filtering, 2443 after) because they had empty values for ingredients, and this may have influenced these observations. Furthermore, I only looked at summaries and scatter plots to reach these findings. To be more precise about the significance of these factors in relation to chocolate rating, we would need to fit a statistical model with these factors as predictors and examine their estimated influence on ratings.
Are your findings what you expected? Why or Why not?
My findings are different from my original expectation. I expected some relationship or trend between rating and cocoa percentage, number of ingredients, and bean origin, since these are the characteristics that consumers often look for when buying chocolate. Finding that these factors have no observable relation to chocolate ratings makes me question whether there are other factors that influence chocolate ratings.