Please submit your .Rmd
and .html
files in Sakai. If you are working together, both people should submit the files.
The goal of the midterm project is to showcase skills that you have learned in class so far. The midterm is open note, but if you use someone else’s code, you must attribute them.
#data
channel on Slack..csv
file into your data
folder.Define a research question, involving at least one categorical variable. You may schedule a time with Jessica or Colin to discuss your dataset and research question, or you may message it to one of us in slack or email. Please do one of the two options pretty early on. We just want to look at the data and make sure that it is appropriate for your question.
You must use each of the following functions at least once:
mutate()
group_by()
summarize()
ggplot()
and at least one of the following:
case_when()
across()
*_join()
(i.e. left_join()
)pivot_*()
(i.e. pivot_longer()
)The code chunks below are guides, please add more code chunks to do what you need.
If you do not want your final project posted on the public website, please let Jessica know. We can also keep it anonymous if you’d like to remove your name from the Rmd and html, or use a pseudonym.
You may remove these instructions from your final Rmd if you like
If you’d like to work together, that is encouraged, but you must divide the work equitably and you must note who worked on what. This is probably easiest as notes in the text. Please let Colin or Jessica know that you’ll be working together.
No acknowledgements of contributions = -10 points overall.
I will take off points (-5 points for each section) if you don’t add observations and notes in your RMarkdown document. I want you to think and reason through your analysis, even if they are preliminary thoughts.
Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I am a chocolate lover and I enjoy all types of chocolate. Luckily I found the chocolatebar dataset from TidyTuesday on GitHub website so I could dedicate this project to chocolate. I’ve used this dataset for my function of the week, and I’d like to keep exploring this dataset in the midterm project.
The specific questions that I’m interested in is to see if there is a correlation between percentage of cocoa and the rating of the chocolate, and if adding different numbers of ingredients would correlate with the rating score as well. I also want to know where the most popular beans are among the countries of origin and who the company manufacturers are with relatively high rating.
Given your question, what is your expectation about the data?
Intuitively, customers do not like the chocolate with high cocoa percentage. More ingredients can maximize chocolate’s flavor, therefore the chocolate bars with more ingredients probably have better reviews scores. I’m expecting the most popular beans come from the countries where located in the southern hemisphere. I’ve never heard of the chocolate manufacturers in this datasets, so I’d like to know which of them have the best reviews.
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
#tuesdata <- tt_load('2022-01-18')
#ChocolateBar <- tuesdata$chocolate
<- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-18/chocolate.csv')
ChocolateBar
glimpse(ChocolateBar)
## Rows: 2,530
## Columns: 10
## $ ref <dbl> 2454, 2458, 2454, 2542, 2546, 2546, 2~
## $ company_manufacturer <chr> "5150", "5150", "5150", "5150", "5150~
## $ company_location <chr> "U.S.A.", "U.S.A.", "U.S.A.", "U.S.A.~
## $ review_date <dbl> 2019, 2019, 2019, 2021, 2021, 2021, 2~
## $ country_of_bean_origin <chr> "Tanzania", "Dominican Republic", "Ma~
## $ specific_bean_origin_or_bar_name <chr> "Kokoa Kamili, batch 1", "Zorzal, bat~
## $ cocoa_percent <chr> "76%", "76%", "76%", "68%", "72%", "8~
## $ ingredients <chr> "3- B,S,C", "3- B,S,C", "3- B,S,C", "~
## $ most_memorable_characteristics <chr> "rich cocoa, fatty, bready", "cocoa, ~
## $ rating <dbl> 3.25, 3.50, 3.75, 3.00, 3.00, 3.25, 3~
# Check all data frame columns and get a sum of NA values in them
cbind(lapply(lapply(ChocolateBar, is.na), sum))
## [,1]
## ref 0
## company_manufacturer 0
## company_location 0
## review_date 0
## country_of_bean_origin 0
## specific_bean_origin_or_bar_name 0
## cocoa_percent 0
## ingredients 87
## most_memorable_characteristics 0
## rating 0
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
Make sure your data types are correct!
After I had an general idea of the dataset, I made a list about what I wanted to do in the next step, transforming the data:
Remove the variables which are not relative to my research questions: ref, company_location, review_date, specific_bean_origin_or_bar_name, and most_memorable_characteristics.
Remove % sign for cocoa_percent variable, and then convert it into numeric variable.
Separate ingredients variable and only keep the number of ingredients in the chocolate, delete the information of the types of ingredients. Then convert it into factor variable(for model building purpose).
For 87 missing values “NA” in character variable ingredients, I will remove them from my dataset in the next section.
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
# Remove the variables that not needed
<- ChocolateBar %>%
ChocolateBar select(-ref, -company_location, -review_date, -specific_bean_origin_or_bar_name, -most_memorable_characteristics)
# Remove % sign
<- ChocolateBar %>%
ChocolateBar mutate(cocoa_percent = as.numeric(gsub("%", "", cocoa_percent)))
# Separate ingredients variable
<- ChocolateBar %>%
ChocolateBar separate(
col = ingredients,
into = c("ingredients", "type"),
sep = "-") %>%
mutate(ingredients = na_if(ingredients, "NA"),
type = na_if(type, "NA")) %>%
select(-type) %>%
mutate(ingredients = as.factor(ingredients)) %>%
drop_na()
Bonus points (5 points) for datasets that require merging of tables, but only if you reason through whether you should use
left_join
,inner_join
, orright_join
on these tables. No credit will be provided if you don’t.
Show your transformed table here. Use tools such as
glimpse()
,skim()
orhead()
to illustrate your point.
skim(ChocolateBar)
Name | ChocolateBar |
Number of rows | 2443 |
Number of columns | 5 |
_______________________ | |
Column type frequency: | |
character | 2 |
factor | 1 |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
company_manufacturer | 0 | 1 | 2 | 39 | 0 | 542 | 0 |
country_of_bean_origin | 0 | 1 | 4 | 21 | 0 | 62 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
ingredients | 0 | 1 | FALSE | 6 | 3: 1023, 2: 750, 4: 469, 5: 191 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
cocoa_percent | 0 | 1 | 71.50 | 5.16 | 42 | 70 | 70.00 | 74.0 | 100 | ▁▁▇▁▁ |
rating | 0 | 1 | 3.21 | 0.43 | 1 | 3 | 3.25 | 3.5 | 4 | ▁▁▅▇▇ |
Are the values what you expected for the variables? Why or Why not?
After transformed my variables as I planed, I got two character variables(company_manufacturer and country_of_bean_origin), two numeric variables (cocoa_percent and rating) and one factor variable(ingredients). They are the variables with right formats as I expected, and I’m going to use them for answering my research questions.
Use
group_by()
andsummarize()
to make a summary of the data here. The summary should be relevant to your research question
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
%>%
ChocolateBar summarise(across(where(is.character), n_distinct))
## # A tibble: 1 x 2
## company_manufacturer country_of_bean_origin
## <int> <int>
## 1 542 62
# country_of_bean_origin
%>%
ChocolateBar tabyl(country_of_bean_origin) %>%
arrange(desc(n)) %>%
top_frac(.25) %>%
gt()
country_of_bean_origin | n | percent |
---|---|---|
Venezuela | 246 | 0.10069587 |
Peru | 231 | 0.09455587 |
Dominican Republic | 220 | 0.09005321 |
Ecuador | 201 | 0.08227589 |
Madagascar | 171 | 0.06999591 |
Blend | 144 | 0.05894392 |
Nicaragua | 100 | 0.04093328 |
Bolivia | 79 | 0.03233729 |
Colombia | 78 | 0.03192796 |
Tanzania | 78 | 0.03192796 |
Brazil | 77 | 0.03151862 |
Belize | 74 | 0.03029063 |
Vietnam | 73 | 0.02988129 |
Guatemala | 62 | 0.02537863 |
Mexico | 54 | 0.02210397 |
%>%
ChocolateBar group_by(country_of_bean_origin) %>%
filter(n() >= 54) %>%
mutate(count = n()) %>%
ggplot(aes(x = reorder(country_of_bean_origin, count))) +
geom_bar(fill="lightblue") +
coord_flip() +
geom_text(stat = "count",aes(label = ..count..)) +
labs(x = 'Country of bean origin', y = 'Count', title = 'Top 25% most frequently used country of bean origins') +
theme_minimal()
# company_manufacturer
%>%
ChocolateBar tabyl(company_manufacturer) %>%
arrange(desc(n)) %>%
top_frac(.05) %>%
gt()
company_manufacturer | n | percent |
---|---|---|
Soma | 56 | 0.022922636 |
Fresco | 39 | 0.015963979 |
Arete | 32 | 0.013098649 |
Bonnat | 29 | 0.011870651 |
A. Morin | 26 | 0.010642652 |
Dandelion | 25 | 0.010233320 |
Pralus | 25 | 0.010233320 |
Domori | 22 | 0.009005321 |
Guittard | 22 | 0.009005321 |
Valrhona | 22 | 0.009005321 |
Zotter | 21 | 0.008595989 |
Castronovo | 19 | 0.007777323 |
Dick Taylor | 19 | 0.007777323 |
Coppeneur | 18 | 0.007367990 |
Mast Brothers | 18 | 0.007367990 |
Duffy's | 17 | 0.006958657 |
Pierre Marcolini | 17 | 0.006958657 |
Scharffen Berger | 17 | 0.006958657 |
Smooth Chocolator, The | 17 | 0.006958657 |
Rogue | 16 | 0.006549325 |
Artisan du Chocolat | 15 | 0.006139992 |
Letterpress | 15 | 0.006139992 |
Palette de Bine | 15 | 0.006139992 |
Szanto Tibor | 15 | 0.006139992 |
Bittersweet Origins | 14 | 0.005730659 |
Fruition | 14 | 0.005730659 |
Hotel Chocolat (Coppeneur) | 14 | 0.005730659 |
Map Chocolate | 14 | 0.005730659 |
Sirene | 14 | 0.005730659 |
Tejas | 14 | 0.005730659 |
%>%
ChocolateBar group_by(company_manufacturer) %>%
filter(n() >= 14) %>%
summarize(mean_rating = mean(rating)) %>%
arrange(desc(mean_rating)) %>%
gt()
company_manufacturer | mean_rating |
---|---|
Soma | 3.589286 |
Bonnat | 3.534483 |
Arete | 3.531250 |
Domori | 3.522727 |
Smooth Chocolator, The | 3.514706 |
Dick Taylor | 3.486842 |
Duffy's | 3.441176 |
Letterpress | 3.433333 |
Fruition | 3.428571 |
A. Morin | 3.423077 |
Rogue | 3.421875 |
Fresco | 3.416667 |
Szanto Tibor | 3.400000 |
Pierre Marcolini | 3.397059 |
Sirene | 3.392857 |
Castronovo | 3.381579 |
Zotter | 3.321429 |
Valrhona | 3.318182 |
Dandelion | 3.300000 |
Bittersweet Origins | 3.267857 |
Coppeneur | 3.250000 |
Map Chocolate | 3.250000 |
Palette de Bine | 3.233333 |
Scharffen Berger | 3.220588 |
Pralus | 3.180000 |
Guittard | 3.170455 |
Artisan du Chocolat | 3.166667 |
Mast Brothers | 3.125000 |
Hotel Chocolat (Coppeneur) | 3.053571 |
Tejas | 3.017857 |
%>%
ChocolateBar group_by(company_manufacturer) %>%
filter(n() >= 14) %>%
mutate(mean_rating = mean(rating)) %>%
ggplot(aes(reorder(company_manufacturer, mean_rating), rating, fill = mean_rating)) +
geom_boxplot() +
coord_flip() +
labs(x = 'Company manufacturer', y = 'Rating', title = 'Top 5% most frequently rated company manufacturers') +
scale_fill_continuous(name = "Mean rating") +
theme_minimal()
What are your findings about the summary? Are they what you expected?
I have 62 different countries of beans origin and 542 chocolate company manufacturers. I choose the top 25% country of origins with most frequently used chocolate beans and top 5% chocolate company manufacturers with most frequently rated times. Based on above plot, it appears that Venezula is the most popular country of chocolate bean’s origin and Soma chocolate maker has the higest rating. The results are as expected. I’m going to explain the details in final summary section.
%>%
ChocolateBar tabyl(ingredients) %>%
gt()
ingredients | n | percent |
---|---|---|
1 | 6 | 0.002455997 |
2 | 750 | 0.306999591 |
3 | 1023 | 0.418747442 |
4 | 469 | 0.191977077 |
5 | 191 | 0.078182562 |
6 | 4 | 0.001637331 |
# numbers of ingredients vs. the rating of the chocolate
%>%
ChocolateBar group_by(ingredients) %>%
ggplot(aes(x = ingredients, y = rating)) +
geom_boxplot() +
labs(title = "Rating vs. Groups with 1 - 6 ingredients", x = 'Numbers of ingredients', y = 'Rating') +
theme_minimal()
# percentage of cocoa vs. the rating of the chocolate
%>%
ChocolateBar ggplot(aes(x = cocoa_percent, y = rating)) +
geom_point() +
geom_smooth(method = 'lm', se = FALSE, col = 'blue') +
labs(title = "Correlation of Cocoa percentage vs. Rating", x = 'Cocoa percentage', y = 'Rating') +
theme_minimal()
What are your findings about the summary?
Three ingredients in chocolate appears the highest frequency. The above boxplot also shows that the distribution of rating values is different among the six levels of ingredients. Two and three ingredients have the highest median value while six ingredients has a lowest median value. In the second scatter plot, with increasing of cocoa percentage, the rating of chocolate bar slightly decreases.
# linear regression model
<- lm(rating ~ cocoa_percent, data = ChocolateBar)
model summary(model)
##
## Call:
## lm(formula = rating ~ cocoa_percent, data = ChocolateBar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.20045 -0.21968 0.03032 0.28673 0.89570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.668419 0.120159 30.530 < 2e-16 ***
## cocoa_percent -0.006410 0.001676 -3.824 0.000134 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4272 on 2441 degrees of freedom
## Multiple R-squared: 0.005956, Adjusted R-squared: 0.005548
## F-statistic: 14.63 on 1 and 2441 DF, p-value: 0.0001344
<- lm(rating ~ cocoa_percent + ingredients, data = ChocolateBar)
model summary(model)
##
## Call:
## lm(formula = rating ~ cocoa_percent + ingredients, data = ChocolateBar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04543 -0.28104 0.00846 0.27519 0.94620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.795007 0.245475 15.460 < 2e-16 ***
## cocoa_percent -0.008367 0.001749 -4.782 1.84e-06 ***
## ingredients2 0.026235 0.179641 0.146 0.884
## ingredients3 0.074040 0.179637 0.412 0.680
## ingredients4 -0.075624 0.180720 -0.418 0.676
## ingredients5 -0.138801 0.183154 -0.758 0.449
## ingredients6 -0.250919 0.276482 -0.908 0.364
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4218 on 2436 degrees of freedom
## Multiple R-squared: 0.03287, Adjusted R-squared: 0.03048
## F-statistic: 13.8 on 6 and 2436 DF, p-value: 1.772e-15
What are your findings about the summary?
According to the model summary, cocoa_percent is significantly related with rating. However, the \(R^2\) of 0.6% is a fairly small value. It represents only 0.6% of data fit this model. After adding variabale ingredients into the model, the \(R^2\) greatly increased to 3% but non of the levels of ingredients is significant.
Summarize your research question and findings below.
In summary, regardless of the R-squared value, there is a significant relationship between cocoa percentage and the rating of the chocolate bar. Although ingredients could help to explain some of the variation in the rating variable around its mean, a statistically significant relationship is not found between ingredients and the rating scores. To further explore, to my way of thinking, we might need to include more variable in the model for a better prediction. The top three most popular beans are from Venezuela, Peru and Dominican Republic. The highest frequently rated with the highest rating score company manufacturers are Soma, Bonnat and Arete.
Are your findings what you expected? Why or Why not?
The findings generally meet my expectation. The linear regression models shows the evidence that cocoa percentage is significantly correlated to the rating score. But non of the level of ingredients was significantly correlated with the rating, which is a bit unexpected for me. Cacao trees must grow in a limited geographical zone, with hotter tropical climates and lower elevations. Hence, the top three of most frequently used country of bean origins: Venezuela, Peru and Dominican Republic, are all the habitat which meet the growing requirement. For the highest rating score company manufacturers, I googled their names and found their information on the website of Academy of Chocolate. I also tried to find the individual manufacturers information, for instance, Soma, it is a small chocolate factory which start their business in 2003, and Bonnat is a French chocolate maker. It makes sense to me that why I am not familiar with their company manufacturers names. The reasons could be they are foreign brands, small factory or establishing their brand very late.