library(skimr) # get overview of data
library(tidyverse) # data management + ggplot2 graphics
library(readxl) # import excel data
library(ggplot2) # plots
library(gtsummary) # summary statistics and tests
library(here) # helps with file management
library(knitr)
library(janitor) # for data cleaning, making tables
library(wesanderson) # ggplot2 palettes
library(paletteer) # extra ggplot2 palettes
library(dplyr)
library(broom)
library(stringr)
library(viridisLite)
library(glue) # this was added so gt() would show
library(gt)
This dataset, titled “Metal Bands by Nation,” was published on Kaggle by the user mrpantherson.
This data analysis project was completed solely by Saffron Evergreen, using the dataset obtained from Kaggle. No classmates or other peers contributed.
(10 points)
Research question:
Between 1980-1989 and 1990-1999, which decade had the more significant growth in death/black metal band formation in the United States?
Reasoning:
My hypothesis is that death/black metal band formation and fan counts increased significantly over the course of those two decades because of the Satanic Panic, which was happening primarily in the U.S., and the effect it had on various types of pop culture. I chose to look only at the death and black subtypes of metal because those two seem to be the most closely connected to modern-day Satanism and use shock value as part of their stage presence.
(10 points)
metal_bands <- read_excel("Metal_Bands_BSTA.xlsx", col_types = c("numeric", "text",
"numeric", "numeric", "text", "numeric", "text")) # setting col_types here reads the numeric columns as numeric while keeping NA values as NA
gt(head(metal_bands)) # previewed the data to get a peek of how it's formatted
...1 | band_name | fans | formed | origin | split | style |
---|---|---|---|---|---|---|
1 | Iron Maiden | 4195 | 1975 | United Kingdom | NA | New wave of british heavy,Heavy |
2 | Opeth | 4147 | 1990 | Sweden | 1990 | Extreme progressive,Progressive rock,Progressive |
3 | Metallica | 3712 | 1981 | USA | NA | Heavy,Bay area thrash |
4 | Megadeth | 3105 | 1983 | USA | 1983 | Thrash,Heavy,Hard rock |
5 | Amon Amarth | 3054 | 1988 | Sweden | NA | Melodic death |
6 | Slayer | 2955 | 1981 | USA | 1981 | Thrash |
skim_metal <- skim(metal_bands) %>%
as_tibble() %>%
print() # saved skim() to keep as a reference point
## # A tibble: 7 x 17
## skim_type skim_variable n_missing complete_rate character.min character.max
## <chr> <chr> <int> <dbl> <int> <int>
## 1 character band_name 0 1 1 48
## 2 character origin 8 0.998 3 31
## 3 character style 0 1 2 82
## 4 numeric ...1 0 1 NA NA
## 5 numeric fans 0 1 NA NA
## 6 numeric formed 4 0.999 NA NA
## 7 numeric split 2215 0.557 NA NA
## # ... with 11 more variables: character.empty <int>, character.n_unique <int>,
## # character.whitespace <int>, numeric.mean <dbl>, numeric.sd <dbl>,
## # numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## # numeric.p100 <dbl>, numeric.hist <chr>
I used the gt() and head() functions to get a clean, quick look at how this data is structured and organized. The table shows the six most popular metal bands (all styles included), how many fans each has, the year it formed, its country of origin, the year it split, and descriptors of the style of metal it plays.
I saved a skim() table as a reference point to make sure my data types were correct and to be able to look back at the completion rates and missing values of the variables.
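As a quick sanity check on the skim() output, complete_rate is the share of non-missing values, that is, 1 - n_missing / n. The sketch below recomputes it in base R from the n_missing values printed above. It assumes the table has 5,000 rows (the total mentioned in a later code comment); `n_rows`, `n_missing`, and `toy_split` are illustrative names, not objects from the analysis.

```r
# Number of rows: an assumption taken from the "5,000 bands" code comment
n_rows <- 5000

# n_missing values copied from the skim() output above
n_missing <- c(origin = 8, formed = 4, split = 2215)

# complete_rate is the share of non-missing values in each column
complete_rate <- 1 - n_missing / n_rows
print(round(complete_rate, 3))

# The same quantity computed directly from a column (toy vector, not real data)
toy_split <- c(1990, NA, 1983, NA)
toy_rate <- mean(!is.na(toy_split))
print(toy_rate)
```

On the real data, `mean(!is.na(metal_bands$split))` should give the same figure that skim() reports for split.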
What needs to be done for transformation:
(15 points)
# filter, make data table showing only U.S. bands
metal_filtered <- metal_bands %>%
filter(origin %in% ("USA"))
# remove the column '...1' (ranked in popularity) and 'split' since it is not
# important in this analysis and takes up space
metal_filtered <- subset(metal_filtered, select = -c(1, 6))
# this allows me to minimize unnecessary bulk so I can transform and analyze
# easier
# mutate and create new table with a factored column labeling which bands
# formed between the years 1980-1989 and 1990-1999
metal_new <- metal_filtered %>%
mutate(formed_category = case_when(formed >= 1980 & formed < 1990 ~ "1980-1989",
formed >= 1990 & formed < 2000 ~ "1990-1999")) %>%
mutate(formed_category = factor(formed_category))
metal_new <- na.omit(metal_new) # this gets rid of all rows containing N/A
# na.omit shows only bands (all styles) that formed within the two decades
skim_metal_new <- skim(metal_new) %>%
as_tibble() %>%
print()
## # A tibble: 6 x 20
## skim_type skim_variable n_missing complete_rate character.min character.max
## <chr> <chr> <int> <dbl> <int> <int>
## 1 character band_name 0 1 2 27
## 2 character origin 0 1 3 3
## 3 character style 0 1 2 65
## 4 factor formed_category 0 1 NA NA
## 5 numeric fans 0 1 NA NA
## 6 numeric formed 0 1 NA NA
## # ... with 14 more variables: character.empty <int>, character.n_unique <int>,
## # character.whitespace <int>, factor.ordered <lgl>, factor.n_unique <int>,
## # factor.top_counts <chr>, numeric.mean <dbl>, numeric.sd <dbl>,
## # numeric.p0 <dbl>, numeric.p25 <dbl>, numeric.p50 <dbl>, numeric.p75 <dbl>,
## # numeric.p100 <dbl>, numeric.hist <chr>
# saved this skim table as another reference point, making sure there are no
# missing values and the completion rate is 1 for all columns
# 454 out of 5,000 bands (all styles) that formed in the USA between 1980-1999
# Reminder: objective is to find out the growth/loss rate of bands that are death and/or black metal in the USA, between the two decades
metal_complete <- dplyr::filter( # dplyr::filter because this wouldn't work without specifically calling the package
metal_new, grepl("death|black", style, ignore.case = TRUE)) %>% #grepl(keywords, column)
filter(!duplicated(band_name)) # use !duplicated to make extra sure there is no overlap in band_names
This dataset needed reduction and organizing; no merging was needed, at least at this point in the project.
Beyond this point, the data shown in the various tables meet these criteria: the band formed between 1980 and 1999, it originated in the U.S., and its metal style is labeled with some variation of “death” or “black”.
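The style filter can be illustrated on a few made-up rows. This is a minimal base R sketch of the same idea (a case-insensitive match on "death" or "black", followed by dropping repeated band names); the `toy` data frame and its values are hypothetical. Because grepl() matches substrings, a label such as "Deathcore" would also be kept by this pattern, which may or may not be what is wanted.

```r
# Toy rows (hypothetical, not taken from the Kaggle dataset)
toy <- data.frame(
  band_name = c("Band A", "Band B", "Band C", "Band A"),
  style = c("Melodic death", "Thrash", "Atmospheric Black", "Melodic death"),
  stringsAsFactors = FALSE
)

# Keep rows whose style mentions "death" or "black", ignoring case
is_death_black <- grepl("death|black", toy$style, ignore.case = TRUE)
toy_kept <- toy[is_death_black, ]

# Drop repeated band names, keeping the first occurrence
toy_kept <- toy_kept[!duplicated(toy_kept$band_name), ]
print(toy_kept)
```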
(head(metal_complete)) %>%
print()
## # A tibble: 6 x 6
## band_name fans formed origin style formed_category
## <chr> <dbl> <dbl> <chr> <chr> <fct>
## 1 Death 2690 1983 USA Progressive death,Death,P~ 1980-1989
## 2 Agalloch 1881 1995 USA Atmospheric black,Neofolk 1990-1999
## 3 Nile 1189 1993 USA Brutal death,Technical de~ 1990-1999
## 4 Cannibal Corpse 1162 1988 USA Death 1980-1989
## 5 Morbid Angel 975 1984 USA Death 1980-1989
## 6 Deicide 628 1987 USA Death 1980-1989
skim_metal_complete <- skim(metal_complete) %>%
as_tibble() %>%
view() # saved for another reference point, looking at this table I'm able to see the band count for either black or death metal style has dropped to 122 over the 2 decades
Are the values what you expected for the variables? Why or Why not?
After reviewing my transformed data frame, I am able to see that there are 122 death/black metal bands out of 422 metal bands formed in the U.S. between 1980-1999. I looked at the two histograms provided in the skim table and am surprised that the fan count is negatively skewed and the band formation count is mostly unimodal, maybe bimodal, with a faint positive skew. I wasn’t expecting this because I thought there would have been a takeoff in death/black metal band formations and fan counts, with both pretty much leveling out over time.
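The small histograms in skim() can be hard to read, so a rough numeric check of skew direction may help. The sketch below uses the rule of thumb that a long right tail pulls the mean above the median and a long left tail pulls it below. `skew_direction()` is a hypothetical helper written for this illustration, and `toy_fans` is made-up data; the rule is a heuristic, not a formal skewness statistic.

```r
# Toy fan counts (hypothetical): several small bands and one very popular band
toy_fans <- c(12, 25, 40, 58, 90, 2690)

# Rule of thumb: mean above median suggests positive (right) skew,
# mean below median suggests negative (left) skew
skew_direction <- function(x) {
  m <- mean(x, na.rm = TRUE)
  med <- median(x, na.rm = TRUE)
  if (m > med) "positive" else if (m < med) "negative" else "symmetric"
}

print(skew_direction(toy_fans))
```

Running `skew_direction(metal_complete$fans)` would apply the same check to the real fan counts.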
(15 points)
# Reminder: objective is to find out the growth/loss rate of bands that are death and/or black metal in the USA, between the two decades
# shows/organizes how many bands were formed, between the two decades
decade_count <- metal_complete %>%
group_by(formed_category) %>%
summarize(count = n())
gt(decade_count)
formed_category | count |
---|---|
1980-1989 | 43 |
1990-1999 | 79 |
# saved this to compare for analysis and visualization, n = 122, 80s = 43, 90s = 79
# shows/organizes total fans between the two decades
fan_count <- metal_complete %>% # shows decade sum of fans
group_by(formed_category) %>%
summarize(sum_fans = sum(fans)) %>% print()
## # A tibble: 2 x 2
## formed_category sum_fans
## <fct> <dbl>
## 1 1980-1989 10408
## 2 1990-1999 7709
# n = 18,117, 80s = 10,408, 90s = 7709
N_fan_count <- sum(fan_count$sum_fans) # used this to pull for later if needed
# shows/organizes the average # of fans per band between the two decades
mean_fan_group <- metal_complete %>%
group_by(formed_category) %>%
summarize(mean_fans = mean(fans)) %>% print()
## # A tibble: 2 x 2
## formed_category mean_fans
## <fct> <dbl>
## 1 1980-1989 242.
## 2 1990-1999 97.6
# avg 242 fans/band in 80s, avg 97.6 fans/band in 90s
count_and_mean_fans <- bind_cols(fan_count, mean_fan_group[2])
gt(count_and_mean_fans)
formed_category | sum_fans | mean_fans |
---|---|---|
1980-1989 | 10408 | 242.04651 |
1990-1999 | 7709 | 97.58228 |
# noted: 80s n = 43, 90s n = 79. Although fewer death/black metal bands formed in the 80s, the bands formed in that decade had far more fans on average than those formed in the 90s
# breaking these transformations down into separate tables below makes it
# easier for me to understand what I'm looking at; after looking at them
# individually, I will bind them into one data frame and then look at
# the whole picture
# shows/organizes band formation count per year 1980-1999
i_fan_count <- metal_complete %>%
group_by(formed) %>%
summarize(count = n()) %>%
print()
## # A tibble: 17 x 2
## formed count
## <dbl> <int>
## 1 1983 5
## 2 1984 6
## 3 1985 4
## 4 1986 5
## 5 1987 9
## 6 1988 7
## 7 1989 7
## 8 1990 7
## 9 1991 4
## 10 1992 7
## 11 1993 11
## 12 1994 3
## 13 1995 11
## 14 1996 10
## 15 1997 10
## 16 1998 7
## 17 1999 9
# shows/organizes total amount of fans per year 1980-1999
i_fan_sum <- metal_complete %>%
group_by(formed) %>%
summarize(sum_fans = sum(fans)) %>%
print()
## # A tibble: 17 x 2
## formed sum_fans
## <dbl> <dbl>
## 1 1983 3005
## 2 1984 2187
## 3 1985 107
## 4 1986 604
## 5 1987 1688
## 6 1988 1271
## 7 1989 1546
## 8 1990 663
## 9 1991 620
## 10 1992 394
## 11 1993 1628
## 12 1994 80
## 13 1995 2329
## 14 1996 934
## 15 1997 512
## 16 1998 221
## 17 1999 328
# shows/organizes average fans per year 1980-1999
mean_i_fan_count <- metal_complete %>%
group_by(formed) %>%
summarize(mean_fans = mean(fans)) %>%
print()
## # A tibble: 17 x 2
## formed mean_fans
## <dbl> <dbl>
## 1 1983 601
## 2 1984 364.
## 3 1985 26.8
## 4 1986 121.
## 5 1987 188.
## 6 1988 182.
## 7 1989 221.
## 8 1990 94.7
## 9 1991 155
## 10 1992 56.3
## 11 1993 148
## 12 1994 26.7
## 13 1995 212.
## 14 1996 93.4
## 15 1997 51.2
## 16 1998 31.6
## 17 1999 36.4
# combined 3 data frames from above, [] extracts the column wanted, avoiding
# duplicate columns
summary_i_fans <- bind_cols(i_fan_count, i_fan_sum[2], mean_i_fan_count[2]) %>%
print()
## # A tibble: 17 x 4
## formed count sum_fans mean_fans
## <dbl> <int> <dbl> <dbl>
## 1 1983 5 3005 601
## 2 1984 6 2187 364.
## 3 1985 4 107 26.8
## 4 1986 5 604 121.
## 5 1987 9 1688 188.
## 6 1988 7 1271 182.
## 7 1989 7 1546 221.
## 8 1990 7 663 94.7
## 9 1991 4 620 155
## 10 1992 7 394 56.3
## 11 1993 11 1628 148
## 12 1994 3 80 26.7
## 13 1995 11 2329 212.
## 14 1996 10 934 93.4
## 15 1997 10 512 51.2
## 16 1998 7 221 31.6
## 17 1999 9 328 36.4
# same process as last time I used case_when() but I'm just adding this to the new data frame for making it easier on me when I start making data visualizations (this is the only character variable besides USA, which is all the same)
use_summary_i_fans <- summary_i_fans %>%
mutate(
formed_category = case_when(
formed >= 1980 &
formed < 1990 ~ "1980-1989",
formed >= 1990 &
formed < 2000 ~ "1990-1999")) %>%
mutate(formed_category = factor(formed_category))
# rearranged columns so the years could be together for easier interpretations
use_summary_i_fans <- use_summary_i_fans[, c(1,5,2,3,4)] %>% print()
## # A tibble: 17 x 5
## formed formed_category count sum_fans mean_fans
## <dbl> <fct> <int> <dbl> <dbl>
## 1 1983 1980-1989 5 3005 601
## 2 1984 1980-1989 6 2187 364.
## 3 1985 1980-1989 4 107 26.8
## 4 1986 1980-1989 5 604 121.
## 5 1987 1980-1989 9 1688 188.
## 6 1988 1980-1989 7 1271 182.
## 7 1989 1980-1989 7 1546 221.
## 8 1990 1990-1999 7 663 94.7
## 9 1991 1990-1999 4 620 155
## 10 1992 1990-1999 7 394 56.3
## 11 1993 1990-1999 11 1628 148
## 12 1994 1990-1999 3 80 26.7
## 13 1995 1990-1999 11 2329 212.
## 14 1996 1990-1999 10 934 93.4
## 15 1997 1990-1999 10 512 51.2
## 16 1998 1990-1999 7 221 31.6
## 17 1999 1990-1999 9 328 36.4
# this was a function I spent way too much time on trying to figure out the differences and changes (rate) in fan counts per year, shout out to stack exchange
formation_rate <- use_summary_i_fans %>%
# first sort by year; the data is most likely already sorted, but this makes sure it's all uniform
arrange(formed) %>%
mutate(Diff_year = formed - lag(formed), # Difference in time (just in case there are gaps)
Diff_fans = sum_fans - lag(sum_fans), # Difference in total fans between years
fan_rate_percent = (Diff_fans / Diff_year)/sum_fans * 100,
Diff_count = count - lag(count), # Difference in band count between years
count_rate_percent = (Diff_count / Diff_year)/count * 100) %>% print()
## # A tibble: 17 x 10
## formed formed_category count sum_fans mean_fans Diff_year Diff_fans
## <dbl> <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1983 1980-1989 5 3005 601 NA NA
## 2 1984 1980-1989 6 2187 364. 1 -818
## 3 1985 1980-1989 4 107 26.8 1 -2080
## 4 1986 1980-1989 5 604 121. 1 497
## 5 1987 1980-1989 9 1688 188. 1 1084
## 6 1988 1980-1989 7 1271 182. 1 -417
## 7 1989 1980-1989 7 1546 221. 1 275
## 8 1990 1990-1999 7 663 94.7 1 -883
## 9 1991 1990-1999 4 620 155 1 -43
## 10 1992 1990-1999 7 394 56.3 1 -226
## 11 1993 1990-1999 11 1628 148 1 1234
## 12 1994 1990-1999 3 80 26.7 1 -1548
## 13 1995 1990-1999 11 2329 212. 1 2249
## 14 1996 1990-1999 10 934 93.4 1 -1395
## 15 1997 1990-1999 10 512 51.2 1 -422
## 16 1998 1990-1999 7 221 31.6 1 -291
## 17 1999 1990-1999 9 328 36.4 1 107
## # ... with 3 more variables: fan_rate_percent <dbl>, Diff_count <int>,
## # count_rate_percent <dbl>
Assumptions made so far
Looking through the data frame formation_rate, I was able to quickly see that there are large year-to-year fluctuations in the yearly rates of both fans and band formations.
The two years with the most death/black metal band formations were 1993 and 1995 (11 bands each), and both show positive year-over-year changes in fans and band count. Though these two years were not the most successful compared to other years with higher positive rates in fan or band formation count, from the data I analysed prior to making graphs, I would conclude that the 90’s were more successful overall despite the 80’s larger fan base. However, I will check these assumptions against graphs.
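When reading the rate columns, it helps to know which denominator is used. The formula in this analysis divides the year-over-year change by the current year's value, while a common alternative convention divides by the previous year's value. The base R sketch below compares the two on the 1992-1995 band counts copied from the formation_rate table; `rate_vs_current` and `rate_vs_previous` are names introduced here for illustration, and the year gap is 1 throughout, so Diff_year is omitted.

```r
# Band counts for 1992-1995, copied from the formation_rate table above
formed <- c(1992, 1993, 1994, 1995)
count <- c(7, 11, 3, 11)

# Previous year's count (base R stand-in for dplyr::lag)
prev_count <- c(NA, head(count, -1))
diff_count <- count - prev_count

# Formula used in this analysis: change divided by the current year's count
rate_vs_current <- diff_count / count * 100

# Common alternative: change divided by the previous year's count
rate_vs_previous <- diff_count / prev_count * 100

print(data.frame(formed, count, diff_count,
                 rate_vs_current = round(rate_vs_current, 1),
                 rate_vs_previous = round(rate_vs_previous, 1)))
```

The two conventions agree on the sign of every change but not on its size, so the steepness of any plotted rate depends on which one is chosen.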
# Reminder: objective is to find out the growth/loss rate of bands that are
# death and/or black metal in the USA, between the two decades
density_sum_fans <- ggplot(data = formation_rate, aes(x = sum_fans, fill = formed_category)) + # fill must always be categorical
geom_density(alpha = 0.3) + # transparency
scale_fill_discrete(name = "Decade") + # renames the legend
labs(x = "Fans per Year", y = "Density", title = "Density Plot of Total Fans Accumulated per Year (1980-1999)")
density_sum_fans
library(ggridges)
ridgeline_mean_fans <- ggplot(data = use_summary_i_fans, aes(x = mean_fans, y = formed_category)) +
geom_density_ridges_gradient() + aes(fill = formed_category) + scale_fill_discrete(name = "Decade") +
labs(x = "Average Fans Accumulated per Year", y = "Decade", title = "Ridgeline Density Plot of Average Fans Accumulated per Year (1980-1999)") +
theme(axis.text.x = element_text(angle = -30, hjust = 0))
ridgeline_mean_fans
I decided to make extra graphs because, despite meeting the minimum of two, I realized that my graphs did not represent my research question well, especially with regard to analyzing the rate of change in bands formed and fans gained per year.
count_rate_diff <- ggplot(data = formation_rate, aes(x = formed, y = count_rate_percent)) +
geom_point() + geom_line() + geom_hline(yintercept = 0) + theme_bw() + labs(x = "Year",
y = "Band Formation Rate (%) Per Year (1980-1999)", title = "Change in Rate of Bands Formed per Year",
subtitle = "to detect patterns of band formation growth or reduction")
count_rate_diff
Plot 3 shows that the 1980’s group had fluctuations in the number of death/black metal bands that were formed; however, the rate changes were far more drastic in the 1990’s group. I included a horizontal line at y = 0 for readability, as I think it helps separate the positive and negative rates.
count_diff <-
ggplot(data = formation_rate,
aes(x = formed,
y = fan_rate_percent)) +
geom_line()+ # left out hline and geom_point since it seemed unnecessary
theme_bw() +
labs(
x = "Year",
y = "Fan Rate (%) Per Year (1980-1999)",
title = "Change in Rate of Fans per Year",
subtitle = "to detect patterns of fan growth or reduction")
count_diff
Plot 4 shows a fairly similar trend in fan rates between the 80’s and 90’s groups. Looking at this, I would infer that the negative and positive rate changes in fans per year are of similar magnitude in the two decades and that fan growth rate (+/-) should not be the sole indicator of death/black metal band formation.
# colors are set outside aes() so the lines are drawn in the literal colors named
combined_graph_count <- ggplot(formation_rate) + geom_line(mapping = aes(x = formed,
y = count), color = "purple") + geom_line(mapping = aes(x = formed, y = Diff_count),
color = "orange") + geom_hline(yintercept = 0, size = 0.5) + geom_vline(xintercept = 1990) +
theme(legend.position = "none") + labs(x = "Year Bands Formed", y = "Band Formation Count +/-",
title = "Line Graph of Bands Formed + Net Losses/Gains per Year")
combined_graph_count
(10 points)
My goal for this data analysis project was to determine which time bracket, 1980-1989 or 1990-1999, had the more significant growth in death/black metal band formation in the United States. The broadest conclusion I have is that there are several variables not accounted for in this analysis, such as which bands held the most frequent concerts, albums created vs. albums sold, age ranges of fans, and ways to adjust for fans in rural areas or marginalized communities who may not have been counted, and I’m sure there are more.
Given the data I have been working with, and looking strictly at the variables within my dataset, I did notice some prominent differences in the number and growth of death/black metal bands formed between the two decades, 1980-1989 and 1990-1999. One of them is that 43 death/black metal bands were formed in 1980-1989, while 79 were formed in 1990-1999. However, the fan counts were higher for bands formed in 1980-1989 (total = 10,408, mean = 242 per band) than for bands formed in 1990-1999 (total = 7,709, mean = 98 per band).
My conclusion is that the overall rates of band formation and fan counts per year appear mostly non-linear and unpredictable, and that there is no significant growth in death/black metal bands around the era of the Satanic Panic. Black/death metal did start to gain more popularity beginning in the 1980’s, which would be a research project for another time; however, over the two decades spanning the height of the Satanic Panic, there is no significant growth.
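The decade-level figures quoted in this conclusion can be reproduced with a few lines of arithmetic. This is a minimal base R sketch using the totals from the summary tables; `band_growth_pct` and `fans_per_band` are names introduced here for illustration, and the percent change is a descriptive figure only, not a test of statistical significance.

```r
# Decade totals copied from the summary tables in this analysis
bands <- c("1980-1989" = 43, "1990-1999" = 79)
fans <- c("1980-1989" = 10408, "1990-1999" = 7709)

# Percent change in bands formed from the 1980s to the 1990s
band_growth_pct <- unname(
  (bands["1990-1999"] - bands["1980-1989"]) / bands["1980-1989"] * 100
)

# Average fans per band within each decade
fans_per_band <- fans / bands

print(round(band_growth_pct, 1))
print(round(fans_per_band, 1))
```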