Define your research question below. What about the data interests you? What is a specific question you want to find out about the data?
I am interested in policy surrounding obesity management and the data we collect to understand it. I want to know from this data if being older means you are more likely to be obese, and will do less vigorous exercise. This then may cause a higher BP. My dad has issues with BP and weight gain, and I see his health has gone down hill as his weight has gone up. I’m wondering if this also happens in younger people as well, or if their BP stays low despite higher BMI’s.
Given your question, what is your expectation about the data?
Most likely, older people will account for the higher percentage of obesity, and they will do less exercise. According to the e-magazine Harvard Health, systolic BP can be a good measure of overall health. This will be used in conjunction with BMI to ask, is health declining as people age due to weight gain?
Load the data below and use
dplyr::glimpse()
orskimr::skim()
on the data. You should upload the data file into thedata
directory.
library(aplore3)
summary(nhanes)
## id gender age marstat
## Min. : 1 Male :3164 Min. :16.00 Married :3019
## 1st Qu.:1621 Female:3318 1st Qu.:30.00 Widowed : 505
## Median :3242 Median :46.00 Divorced : 643
## Mean :3242 Mean :46.34 Separated : 195
## 3rd Qu.:4862 3rd Qu.:62.00 Never Married :1034
## Max. :6482 Max. :80.00 Living Together: 457
## NA's : 629
## samplewt psu strata tchol
## Min. : 4084 Min. :1.00 Min. : 1.000 Min. : 90.0
## 1st Qu.: 16460 1st Qu.:1.00 1st Qu.: 4.000 1st Qu.:162.0
## Median : 24217 Median :2.00 Median : 7.000 Median :188.0
## Mean : 34546 Mean :1.51 Mean : 7.225 Mean :192.1
## 3rd Qu.: 50435 3rd Qu.:2.00 3rd Qu.:11.000 3rd Qu.:218.0
## Max. :153810 Max. :2.00 Max. :15.000 Max. :383.0
## NA's :395
## hdl sysbp dbp wt
## Min. : 11.00 Min. : 90 Min. : 40.00 Min. : 33.20
## 1st Qu.: 41.00 1st Qu.:110 1st Qu.: 62.00 1st Qu.: 65.90
## Median : 50.00 Median :120 Median : 70.00 Median : 77.60
## Mean : 52.41 Mean :123 Mean : 69.59 Mean : 80.38
## 3rd Qu.: 61.00 3rd Qu.:132 3rd Qu.: 78.00 3rd Qu.: 91.80
## Max. :144.00 Max. :220 Max. :134.00 Max. :159.10
## NA's :395 NA's :553 NA's :594 NA's :37
## ht bmi vigwrk modwrk wlkbik
## Min. :123.3 Min. :13.18 Yes :1131 Yes :2190 Yes :1807
## 1st Qu.:160.0 1st Qu.:23.99 No :5350 No :4291 No :4674
## Median :166.9 Median :27.68 NA's: 1 NA's: 1 NA's: 1
## Mean :167.4 Mean :28.62
## 3rd Qu.:175.0 3rd Qu.:32.15
## Max. :202.7 Max. :65.19
## NA's :37 NA's :37
## vigrecexr modrecexr sedmin obese
## Yes :1402 Yes :2493 Min. : 0.0 No :5459
## No :5079 No :3987 1st Qu.:180.0 Yes : 986
## NA's: 1 NA's: 2 Median :300.0 NA's: 37
## Mean :321.1
## 3rd Qu.:480.0
## Max. :840.0
## NA's :79
write.csv(x = nhanes, file = "nhanes_new.csv", row.names = FALSE)
nhanes_new <- read_csv("~/Desktop/BSTA 504/nhanes_new.csv")
## Rows: 6482 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): gender, marstat, vigwrk, modwrk, wlkbik, vigrecexr, modrecexr, obese
## dbl (13): id, age, samplewt, psu, strata, tchol, hdl, sysbp, dbp, wt, ht, bm...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nhanes_new<-na.omit(nhanes_new)
skimr::skim(nhanes_new)
Name | nhanes_new |
Number of rows | 4915 |
Number of columns | 21 |
_______________________ | |
Column type frequency: | |
character | 8 |
numeric | 13 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
marstat | 0 | 1 | 7 | 15 | 0 | 6 | 0 |
vigwrk | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
modwrk | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
wlkbik | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
vigrecexr | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
modrecexr | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
obese | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 3252.09 | 1869.78 | 1.00 | 1641.50 | 3262.00 | 4871.50 | 6482.00 | ▇▇▇▇▇ |
age | 0 | 1 | 49.27 | 17.73 | 20.00 | 34.00 | 48.00 | 63.00 | 80.00 | ▇▇▇▆▆ |
samplewt | 0 | 1 | 36257.60 | 26599.99 | 4084.48 | 17990.12 | 25259.46 | 55242.77 | 153810.26 | ▇▂▂▁▁ |
psu | 0 | 1 | 1.52 | 0.50 | 1.00 | 1.00 | 2.00 | 2.00 | 2.00 | ▇▁▁▁▇ |
strata | 0 | 1 | 7.10 | 4.13 | 1.00 | 3.00 | 7.00 | 11.00 | 15.00 | ▇▇▆▆▅ |
tchol | 0 | 1 | 195.74 | 40.94 | 92.00 | 166.00 | 192.00 | 221.00 | 383.00 | ▂▇▅▁▁ |
hdl | 0 | 1 | 52.75 | 16.34 | 15.00 | 41.00 | 50.00 | 62.00 | 144.00 | ▃▇▂▁▁ |
sysbp | 0 | 1 | 123.98 | 18.45 | 90.00 | 112.00 | 122.00 | 134.00 | 220.00 | ▇▇▂▁▁ |
dbp | 0 | 1 | 70.44 | 11.80 | 40.00 | 62.00 | 70.00 | 78.00 | 120.00 | ▂▇▆▁▁ |
wt | 0 | 1 | 81.19 | 19.83 | 33.20 | 67.00 | 78.50 | 92.60 | 159.10 | ▂▇▅▁▁ |
ht | 0 | 1 | 167.46 | 10.23 | 135.40 | 160.10 | 167.20 | 175.00 | 202.70 | ▁▅▇▃▁ |
bmi | 0 | 1 | 28.88 | 6.30 | 13.18 | 24.38 | 28.04 | 32.34 | 64.97 | ▂▇▂▁▁ |
sedmin | 0 | 1 | 311.31 | 184.24 | 0.00 | 180.00 | 300.00 | 420.00 | 840.00 | ▅▇▆▂▁ |
If there are any quirks that you have to deal with
NA
coded as something else, or it is multiple tables, please make some notes here about what you need to do before you start transforming the data in the next section.
I took out all the NA rows, since there are so many data points and so few of the are NA.
If the data needs to be transformed in any way (values recoded, pivoted, etc), do it here. Examples include transforming a continuous variable into a categorical using
case_when()
, etc.
Age
I turned age into a categorical variable. Based on the hitogram below, it looks like good bins would be 10 year increments, and starting at 30 and ending at 70. It was noted that this variable is fairly normally distirbuted, but does look a bit strange since there are pretty equal amounts of subjects in each age group.
histogram(nhanes_new$age)
nhanes_new<-
nhanes_new %>%
mutate(
age_category = case_when(
age <30 ~ "<30",
(age >= 30 & age <40) ~ "[30,40)",
(age >= 40 & age<50) ~ "[40,50)",
(age >= 50 & age<60) ~ "[50,60)",
(age >= 60 & age<70) ~ "[60,70)",
age >= 70 ~ ">=70"
)
)
nha <- nhanes_new %>%
tabyl(age_category)
nha
## age_category n percent
## [30,40) 813 0.1654120
## [40,50) 898 0.1827060
## [50,60) 761 0.1548321
## [60,70) 767 0.1560529
## <30 843 0.1715158
## >=70 833 0.1694812
nhanes_new$age_category <- factor(nhanes_new$age_category)
levels(nhanes_new$age_category)
## [1] "[30,40)" "[40,50)" "[50,60)" "[60,70)" "<30" ">=70"
nhanes_new$age_category <- factor(nhanes_new$age_category,
levels = c("<30", "[30,40)", "[40,50)","[50,60)","[60,70)",">=70" ))
table(nhanes_new$age_category)
##
## <30 [30,40) [40,50) [50,60) [60,70) >=70
## 843 813 898 761 767 833
BMI
According to the CDC, the following BMI categories are good measurement categories:
Underweight: Under 18.5
Healthy: 18.5 - 25
Overweight: 25 - 30
Obese: Over 30
These categories were used here as well. This was done in order to group subjects into BMI bins, so that I could look at other variables at the same time. It was noted that this variable is normally distributed.
histogram(nhanes_new$bmi)
nhanes_new<-
nhanes_new %>%
mutate(
bmi_category = case_when(
bmi <18.5 ~ "Underweight",
( bmi>= 18.5 & bmi<25) ~ "Healthy",
( bmi >= 25 & bmi<30) ~ "Overweight",
bmi >= 30 ~ "Obese"
)
)
nhab <- nhanes_new %>%
tabyl(bmi_category)
nhab
## bmi_category n percent
## Healthy 1327 0.26998983
## Obese 1818 0.36988810
## Overweight 1697 0.34526958
## Underweight 73 0.01485249
nhanes_new$bmi_category <- factor(nhanes_new$bmi_category)
levels(nhanes_new$bmi_category)
## [1] "Healthy" "Obese" "Overweight" "Underweight"
nhanes_new$bmi_category <- factor(nhanes_new$bmi_category,
levels = c("Underweight", "Healthy", "Overweight","Obese" ))
table(nhanes_new$bmi_category)
##
## Underweight Healthy Overweight Obese
## 73 1327 1697 1818
Are the values what you expected for the variables? Why or Why not?
I was surprised to see so many people in the obese category. About 80% of the subjects are either overweight or obese. I didn’t expect this, because I thought BMI would follow a normal distribution in which a healthy BMI was in the center.
Use
group_by()
andsummarize()
to make a summary of the data here. The summary should be relevant to your research question
by_marstat <- nhanes_new %>% group_by(marstat)
by_marstat
## # A tibble: 4,915 × 23
## # Groups: marstat [6]
## id gender age marstat samplewt psu strata tchol hdl sysbp dbp
## <dbl> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 Male 34 Married 80101. 1 9 135 50 114 88
## 2 3 Female 60 Widowed 20090. 2 1 202 45 154 70
## 3 4 Male 26 Married 22538. 1 14 160 45 102 50
## 4 5 Female 49 Living Toge… 74212. 2 11 259 45 118 82
## 5 6 Male 80 Married 11998. 1 3 182 75 142 62
## 6 7 Male 80 Widowed 21807. 1 5 148 49 126 62
## 7 11 Female 45 Married 86636. 2 4 225 82 106 62
## 8 12 Male 28 Never Marri… 23137. 1 2 299 51 108 62
## 9 14 Male 44 Divorced 90640. 2 5 197 37 118 90
## 10 15 Male 66 Married 56053. 2 12 193 26 124 64
## # … with 4,905 more rows, and 12 more variables: wt <dbl>, ht <dbl>, bmi <dbl>,
## # vigwrk <chr>, modwrk <chr>, wlkbik <chr>, vigrecexr <chr>, modrecexr <chr>,
## # sedmin <dbl>, obese <chr>, age_category <fct>, bmi_category <fct>
by_tab <- nhanes_new %>% group_by(bmi_category,vigwrk)
summarize(by_tab,avg= mean(bmi,na.rm=TRUE))
## `summarise()` has grouped output by 'bmi_category'. You can override using the `.groups` argument.
## # A tibble: 8 × 3
## # Groups: bmi_category [4]
## bmi_category vigwrk avg
## <fct> <chr> <dbl>
## 1 Underweight No 17.4
## 2 Underweight Yes 17.6
## 3 Healthy No 22.4
## 4 Healthy Yes 22.5
## 5 Overweight No 27.4
## 6 Overweight Yes 27.5
## 7 Obese No 35.4
## 8 Obese Yes 35.3
What are your findings about the summary? Are they what you expected?
These are what I expected yes. It looks like on average, the age of underweight people is under 18, healthy is about 23 years old and subjects overweight are over 27. In terms of age, the vigorous work routine doesn’t seem to change much however between the groups within each bmi category. I thought the age variation would be more, thinking perhaps younger people would have more vigorous activity.
Make at least two plots that help you answer your question on the transformed or summarized data. Use scales and/or labels to make each plot informative.
This first plot shows the difference between vigorous activity and no vigorous activity in subjects grouped by obesity scoring. Age is on the y axis. Observations that standout here are that across the age groups, more people are likely to be not engaging in vigorous activity than engaging in it. Less people who are older than 50 are engaging in vigorous exercise. Similar proportions of people engaging in vigorous exercise are obese and underweight. It’s possible exercise is not a good marker for whether or not someone is obese.
p<-ggplot(nhanes_new) +
aes(x = bmi_category,
y = age,
fill = vigrecexr) +
geom_boxplot()+
labs(title = "Boxplot of Age and Obesity",
x = "Obesity",
y = "Age (Years)")
p +
theme_minimal() +
scale_fill_viridis_d()
This plot shows that BMI and systolic BP are positively correlated, as expected. The positive trend is more striking with people in younger age groups. In subjects 60 and over, this trend flattens out somewhat, and overall BP is higher. Because of different health conditions, it is possible that systolic BP increasing as we age is something to be expected.
ggplot(nhanes_new, aes(x = sysbp,
y = bmi,
color = age_category)) + geom_point(size = 0.5) + labs(title="Plot of BMI vs Systolic BP",
x ="Systolic BP", y = "BMI")+
facet_wrap(~ age_category)
In the last plot, again we can see that vigorous activity is difficult to note trends in. It appears that activity is fairly equally distributed in all groups below, regardless of age, BMI, or BP. We can perhaps see that the last graph, of people over 70, has less subject’s attesting to vigorous activity that perhaps the other age groups. This is perhaps giving us a sense that it is less important what people do, and more important what they eat, in terms of who become obese.
ggplot(nhanes_new, aes(x = sysbp,
y = bmi,
color = vigwrk)) + geom_point(size = 0.5) + labs(title="Plot of BMI vs Systolic BP",
x ="Systolic BP", y = "BMI")+
facet_wrap(~ age_category)
Summarize your research question and findings below.
It looks like as predicted, people get higher blood pressures as their BMI goes up, and both go up with age. This is what was predicted. Also, it looks like by eye people in all age groups do varying amount of “vigorous” activity. However, subjects who reported No seem to be more ubiquitous and seem to be the outliers in the graphs. Overall, it looks like health goes down as people age and gain more weight, and do less activity. According to Harvard Health, an e-magazine, blood pressure is a good measure of overall health. With this in mind, it looks like overall health declines as people gain weight, and people gain weight as they age.
Are your findings what you expected? Why or Why not?
It is not surprising that people lose their health as they age, caused in major part by gaining weight. I was surprised that young people didn’t have a higher rate of obesity, but this data is older. Perhaps newer data would shine more of a light on younger people and their health issues. It was very surprising that vigorous exercise didn’t seem to play a part in people having lower weights. But perhaps this is just because the variable I chose is “vigorous” exercise. Probably, this is going to be a smaller group of people than people who do less exercise, but perhaps excersise every day. It would be interesting to look at levels of exercise, and see if people that engage in less vigorous exercise actually have healthier weights than those who do vigorous exercise.