TakeHome Exercise 2

The Original DataViz

In the first task, we are supposed to create Age-sex Pyramid to cover 9 planning areas of Singapore in term of Male/ Female and the age group.

Below is one of the DataViz that was created by a classmate.

https://public.tableau.com/app/profile/ruipeng.wang/viz/take_home_01/Dashboard1?fbclid=IwAR1xjLcdtI9B6IQsXvE8DeuHGT0Xe1sL20M6RPAZ2gvhEV_hLizpZ9QdADg


Original DataViz

Critiques:

  1. Clarity: The viz was created for 9 planning areas with 9 different populations. However, they use to same absolute scales in terms of number of people. Therefore, the areas with small number of populations ( Downtown Core and Outram) when putting in the same x-axis scale as other areas, does not have high enough number of people, to show case the distributions as Age-sex pyramid.

  2. Clarity Some age groups have data labels ( number of people in that age group), however many age groups do not have any data labels. It looks like the data label are randomly assigned.

  3. Aesthetics Age-sex pyramid of 9 area, should not occupy such big area of spaces (taking up entire horizontal dataviz). There are other designs that can save spaces for other graphs / charts that will portray more information from the data.

Propose Designs

  1. To address the difference in term of absolute numbers of populations in different areas, we should be converting them into percentage of each age groups versus the total population of that areas, instead keeping them on the same scale of population numbers.

  2. Should avoid individual bar labels, as there are quite a big numbers of populations. Avoid randomly labeling and it can be quite confusing

  3. To address the aesthetics, we can use the upper half of the dashboard to showcase all 9 age-sex pyramid of the areas, and using the other spaces for other graphs that can showcase more information from the same data.

Creating Age-sex pyramid using ggplot2

Loading the necessary package

pacman::p_load(tidyverse)
pacman::p_load(readxl)
overall_data <- read_excel("Singapore Residents by Planning Area, Subzone, Age Group, Sex and Type of Dwelling, June 2022.xlsx", sheet = "TakeHome 2 data table")
tail(overall_data)
# A tibble: 6 × 6
  `Row Labels` `AGE GROUP` `Sum of MALE` `Sum of FEMALE` Percentage o…¹ Percen…²
  <chr>        <chr>               <dbl>           <dbl>          <dbl>    <dbl>
1 Woodlands    70 - 74             23230           23830        0.0314   0.0324 
2 Woodlands    75 - 79             10990           12980        0.0148   0.0177 
3 Woodlands    80 - 84              6740            9550        0.00910  0.0130 
4 Woodlands    85 - 89              2960            4550        0.00400  0.00619
5 Woodlands    90 & Over            1230            2490        0.00166  0.00339
6 Grand Total  <NA>              3303750         3463030       NA       NA      
# … with abbreviated variable names ¹​`Percentage of Male`,
#   ²​`Percentage Of Female`

We want to use draw the percentage of the populations versus the Area population, so we just need to get the percentage column

overall_data = overall_data[c(1,2,5,6)]
tail(overall_data)
# A tibble: 6 × 4
  `Row Labels` `AGE GROUP` `Percentage of Male` `Percentage Of Female`
  <chr>        <chr>                      <dbl>                  <dbl>
1 Woodlands    70 - 74                  0.0314                 0.0324 
2 Woodlands    75 - 79                  0.0148                 0.0177 
3 Woodlands    80 - 84                  0.00910                0.0130 
4 Woodlands    85 - 89                  0.00400                0.00619
5 Woodlands    90 & Over                0.00166                0.00339
6 Grand Total  <NA>                    NA                     NA      

Let’s try with Area Bedok first.

Create dataframe where Rows Labels = Bedok

bedok_data = overall_data[overall_data$`Row Labels` == 'Bedok',]
tail(bedok_data)
# A tibble: 6 × 4
  `Row Labels` `AGE GROUP` `Percentage of Male` `Percentage Of Female`
  <chr>        <chr>                      <dbl>                  <dbl>
1 Bedok        65 - 69                  0.0719                  0.0741
2 Bedok        70 - 74                  0.0602                  0.0626
3 Bedok        75 - 79                  0.0336                  0.0370
4 Bedok        80 - 84                  0.0204                  0.0268
5 Bedok        85 - 89                  0.00978                 0.0151
6 Bedok        90 & Over                0.00475                 0.0101

We isolate the data for male population in Bedok first. We also rename the column to Population percentage. Now in order to create a Pyramid graph, one of the column has to be negative. Hence, we will apply mathematics x (-1) to the Population Percentage for Male.

We added one last column as ID to identify these rows as Male

bedok_data_male = bedok_data[c(1,2,3)]
names(bedok_data_male)[names(bedok_data_male) == 'Percentage of Male'] <- 'Population Perc'
bedok_data_male[3] = bedok_data_male[3]*(-1) 
bedok_data_male$ID <- "Male"

tail(bedok_data_male)
# A tibble: 6 × 4
  `Row Labels` `AGE GROUP` `Population Perc` ID   
  <chr>        <chr>                   <dbl> <chr>
1 Bedok        65 - 69              -0.0719  Male 
2 Bedok        70 - 74              -0.0602  Male 
3 Bedok        75 - 79              -0.0336  Male 
4 Bedok        80 - 84              -0.0204  Male 
5 Bedok        85 - 89              -0.00978 Male 
6 Bedok        90 & Over            -0.00475 Male 

We repeat simiar steps for female population Except we DO NOT have to convert the female population percentage to negative.

bedok_data_female = bedok_data[c(1,2,4)]
names(bedok_data_female)[names(bedok_data_female) == 'Percentage Of Female'] <- 'Population Perc'
bedok_data_female$ID <- "Female"
tail(bedok_data_female)
# A tibble: 6 × 4
  `Row Labels` `AGE GROUP` `Population Perc` ID    
  <chr>        <chr>                   <dbl> <chr> 
1 Bedok        65 - 69                0.0741 Female
2 Bedok        70 - 74                0.0626 Female
3 Bedok        75 - 79                0.0370 Female
4 Bedok        80 - 84                0.0268 Female
5 Bedok        85 - 89                0.0151 Female
6 Bedok        90 & Over              0.0101 Female

Next we use rbind to combine the dataframe for male and female

bedok_df = rbind.data.frame(bedok_data_male ,bedok_data_female )

head(bedok_df)
# A tibble: 6 × 4
  `Row Labels` `AGE GROUP` `Population Perc` ID   
  <chr>        <chr>                   <dbl> <chr>
1 Bedok        0 - 4                 -0.0373 Male 
2 Bedok        5 - 9                 -0.0424 Male 
3 Bedok        10 - 14               -0.0461 Male 
4 Bedok        15 - 19               -0.0491 Male 
5 Bedok        20 - 24               -0.0585 Male 
6 Bedok        25 - 29               -0.0679 Male 

Now with the combined dataframe bedok_df , we are ready to plot the graph with ggplot2.

AGE-SEX PYRAMID GRAPH FOR BEDOK

We use fill = bedok_df$ID (the new column added) is to separate the 2 genders.

ggplot(bedok_df,
       aes( x = bedok_df$`Population Perc` , y = bedok_df$`AGE GROUP` , fill = bedok_df$ID ))+
  geom_bar(stat = "identity") +
  ggtitle("Distribution of Bedok")+
  ylab("AGE GROUP")+
  xlab("Percentage of Area Population") +
  labs(fill = "Gender")+
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
Warning: Use of `` bedok_df$`Population Perc` `` is discouraged.
ℹ Use `Population Perc` instead.
Warning: Use of `` bedok_df$`AGE GROUP` `` is discouraged.
ℹ Use `AGE GROUP` instead.
Warning: Use of `bedok_df$ID` is discouraged.
ℹ Use `ID` instead.

Finally we have created a Age-Sex pyramid, using percentage of the Population of the area, rather than their absolute numbers.

As such, even with the area of low population (like Outram or Downtown Core) , each age groups will still be better visualized using percentage of its own population.

We can repeat this same steps for any planning areas in our original data.

Creating Gender Pyramid for total populations of 9 planning areas using ggplot2

Let reload the data from the excel.

total_data <- read_excel("Singapore Residents by Planning Area, Subzone, Age Group, Sex and Type of Dwelling, June 2022.xlsx", sheet = "TakeHome 2 data table")

tail(total_data)
# A tibble: 6 × 6
  `Row Labels` `AGE GROUP` `Sum of MALE` `Sum of FEMALE` Percentage o…¹ Percen…²
  <chr>        <chr>               <dbl>           <dbl>          <dbl>    <dbl>
1 Woodlands    70 - 74             23230           23830        0.0314   0.0324 
2 Woodlands    75 - 79             10990           12980        0.0148   0.0177 
3 Woodlands    80 - 84              6740            9550        0.00910  0.0130 
4 Woodlands    85 - 89              2960            4550        0.00400  0.00619
5 Woodlands    90 & Over            1230            2490        0.00166  0.00339
6 Grand Total  <NA>              3303750         3463030       NA       NA      
# … with abbreviated variable names ¹​`Percentage of Male`,
#   ²​`Percentage Of Female`

We can rename the Row Labels to become “Area”

names(total_data)[names(total_data) == 'Row Labels'] <- 'Area'
tail(total_data)
# A tibble: 6 × 6
  Area        `AGE GROUP` `Sum of MALE` `Sum of FEMALE` Percentage of…¹ Percen…²
  <chr>       <chr>               <dbl>           <dbl>           <dbl>    <dbl>
1 Woodlands   70 - 74             23230           23830         0.0314   0.0324 
2 Woodlands   75 - 79             10990           12980         0.0148   0.0177 
3 Woodlands   80 - 84              6740            9550         0.00910  0.0130 
4 Woodlands   85 - 89              2960            4550         0.00400  0.00619
5 Woodlands   90 & Over            1230            2490         0.00166  0.00339
6 Grand Total <NA>              3303750         3463030        NA       NA      
# … with abbreviated variable names ¹​`Percentage of Male`,
#   ²​`Percentage Of Female`

Next, again, before we create a Pyramid graph, we need to have 2 dataframes, one is for Male and the other is for Female

We use a groupby and a summarize to group total number of male population for each Area ( AMK, Bedok, Tampines, etc….).

Then we will need to convert the Population number to negative.

We added one last column as Gender to identify these rows as Male.

And it is better to drop the rows of Grand Total, so that we can clearly chart the graph to compare between the 9 areas only.

total_data_male = total_data %>% group_by(Area) %>% summarise(`Population` = sum(`Sum of MALE`))
# total_data_male = total_data_male[order(total_data_male$Population,decreasing = TRUE),]

total_data_male[2] = total_data_male[2] * (-1)
total_data_male$Gender = "Male"
total_data_male <- total_data_male[!(total_data_male$Area) %in% c("Grand Total"),]

tail(total_data_male)
# A tibble: 6 × 3
  Area          Population Gender
  <chr>              <dbl> <chr> 
1 Clementi         -240040 Male  
2 Downtown Core      -8340 Male  
3 Geylang          -293120 Male  
4 Hougang          -609480 Male  
5 Outram            -51290 Male  
6 Woodlands        -740260 Male  

We repeat the same steps for female population for each Area

total_data_female = total_data %>% group_by(Area) %>% summarise(`Population` = sum(`Sum of FEMALE`))
# total_data_female = total_data_female[order(total_data_female$Population,decreasing = TRUE),]
total_data_female$Gender = "Female"
total_data_female <- total_data_female[!(total_data_female$Area) %in% c("Grand Total"),]

tail(total_data_female)
# A tibble: 6 × 3
  Area          Population Gender
  <chr>              <dbl> <chr> 
1 Clementi          267580 Female
2 Downtown Core       8790 Female
3 Geylang           303490 Female
4 Hougang           637040 Female
5 Outram             51070 Female
6 Woodlands         734600 Female

We now use rbind to combine the 2 sets of dataframes together.

sg_pop = rbind(total_data_male , total_data_female)
sg_pop <- sg_pop[!(row.names(sg_pop) %in% c("Grand Total")),]
tail(sg_pop)
# A tibble: 6 × 3
  Area          Population Gender
  <chr>              <dbl> <chr> 
1 Clementi          267580 Female
2 Downtown Core       8790 Female
3 Geylang           303490 Female
4 Hougang           637040 Female
5 Outram             51070 Female
6 Woodlands         734600 Female

PYRAMID GRAPH FOR 9 PLANNING AREAS POPULATION, BY GENDER

With the combined dataframe, we can draw the pyramid graph using the ggplot2 now

ggplot(sg_pop,
       aes( x = Population , y = Area , fill = Gender))+
  geom_bar(stat = "identity") +
  ggtitle("SINGAPORE POPULATION DISTRIBUTION")+
  ylab("AREA")+
  xlab("Population") +
  labs(fill = "Gender")

  scale_x_continuous(labels = scales::comma_format(big.mark = ',' , decimal.mark = '.') , breaks = scales::pretty_breaks(n = 5))
<ScaleContinuousPosition>
 Range:  
 Limits:    0 --    1

As you can see, Outram and Downtown Core have the lowest number of population among the 9 chosen area. While Woodlands have the largest population.