The purpose of this post is not to “reinvent the wheel” but to offer a playground for a somewhat difficult Data Science problem, one where the data are either highly correlated or not correlated at all and new, useful conclusions are hard to reach. I did not put much effort into programming the following analyses, so I hope my code can at least serve as a starting point for something more productive.
Description of datasets
Life expectancy dataset
For each dataset, we removed every entry (country) containing an NA (NULL) value. For pairwise analyses (say, comparing fertility against population and against life expectancy), we included only the countries and years that are represented in all three datasets, so that the comparisons are fair. The fertility, life expectancy and population matrices each have size |country| × |year|.
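The filtering described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the post: the nested-dictionary layout (`{country: {year: value}}`), the function names and the toy values are all assumptions, and `None` stands in for NA.

```python
def drop_incomplete(table):
    """Remove every country whose row contains a missing value (None)."""
    return {
        country: by_year
        for country, by_year in table.items()
        if all(value is not None for value in by_year.values())
    }


def restrict_to_shared(*tables):
    """Clean each table, then keep only countries and years found in all of them."""
    cleaned = [drop_incomplete(t) for t in tables]
    countries = set.intersection(*(set(t) for t in cleaned))
    year_sets = [set(t[c]) for t in cleaned for c in countries]
    years = set.intersection(*year_sets) if year_sets else set()
    return [
        {c: {y: t[c][y] for y in sorted(years)} for c in sorted(countries)}
        for t in cleaned
    ]


# Toy values, invented purely for illustration.
fertility = {
    "A": {1960: 6.0, 1961: 5.9},
    "B": {1960: 2.1, 1961: None},  # dropped: contains a missing value
    "C": {1960: 3.0, 1961: 2.9},
}
life_expectancy = {
    "A": {1960: 50.0, 1961: 50.5},
    "B": {1960: 70.0, 1961: 70.2},
    "C": {1961: 65.0, 1962: 65.3},  # only 1961 overlaps with the others
}
population = {
    "A": {1960: 1.0e6, 1961: 1.1e6},
    "C": {1960: 2.0e6, 1961: 2.1e6},
    "D": {1960: 3.0e6, 1961: 3.1e6},  # dropped: absent from the other tables
}

fert, life, pop = restrict_to_shared(fertility, life_expectancy, population)
print(fert)
```

After this step the three tables have identical row and column labels, which is what makes the later pairwise comparisons line up.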
Boxplots of datasets over time and by world region
This section includes boxplots of each dataset over time (from the 1960s to the early 2010s). It also shows the distribution of values by world region.
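For readers who want the numbers behind a boxplot rather than the picture, here is a minimal sketch of the five-number summary (minimum, first quartile, median, third quartile, maximum) computed per group. The grouping labels, the function names and the toy values are assumptions made for illustration; the quartile convention (`method="inclusive"`) is also a choice, and plotting libraries may use a slightly different one.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); needs at least two values."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def summarize_by_group(records):
    """Group (label, value) pairs by label and summarize each group."""
    groups = {}
    for label, value in records:
        groups.setdefault(label, []).append(value)
    return {label: five_number_summary(vals) for label, vals in groups.items()}


# Toy records, invented for illustration: the label could be a year or a region.
records = [("X", 1), ("X", 2), ("X", 3), ("X", 4), ("X", 5),
           ("Y", 10), ("Y", 20), ("Y", 30)]

for label, summary in summarize_by_group(records).items():
    print(label, summary)
```

Using the year as the label gives the "over time" view, and using the world region as the label gives the "by region" view.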
We show that fertility is decreasing over time:
Below we show the distribution per region across all years. A few remarks stand out. For example, countries of Sub-Saharan Africa (SSA) have a median of ~6 children per family (the largest of all regions), with some outliers having fewer than ~4 children per family. On the other hand, regions such as Europe and Central Asia (ECA) and North America (NAM) have, in general, the lowest number of children per family. This behaviour might be caused by different cultural and economic factors.
In the following figure we show that life expectancy is increasing over time. This result might suggest that, in general, improving living conditions (education, medical care) raise life expectancy in all countries over time.
Below we show, for example, that Sub-Saharan Africa (SSA) and South Asia (SAS) countries have lower life expectancies in all years, while North America (NAM) and Europe and Central Asia (ECA) countries have higher life expectancies in all years.
Pairwise comparisons of values by time and by region
In this section, we include pairwise comparisons of all datasets (comparing fertility against population and against life expectancy). We performed year-by-year comparisons (using the mean value of each column of each dataset) and region-by-region comparisons (using the mean value of each row of each dataset). The former explains differences between years and the latter explains differences between world regions.
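As a rough sketch of the comparison just described, the snippet below takes column means (one value per year) and row means (one value per row) of two tables and computes the Pearson correlation between them. The table layout, the function names and the toy numbers are assumptions for illustration only, and the choice of the Pearson coefficient is itself an assumption, since the post does not say which correlation measure was used.

```python
from math import sqrt


def column_means(table):
    """Mean over rows for each year: {year: mean}."""
    years = sorted(next(iter(table.values())))
    return {y: sum(row[y] for row in table.values()) / len(table) for y in years}


def row_means(table):
    """Mean over years for each row label: {label: mean}."""
    return {label: sum(row.values()) / len(row) for label, row in table.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Toy tables with matching labels, invented for illustration.
fertility = {"A": {1960: 6.0, 1961: 5.0, 1962: 4.5},
             "B": {1960: 4.0, 1961: 3.0, 1962: 2.5}}
life_expectancy = {"A": {1960: 50.0, 1961: 52.0, 1962: 53.0},
                   "B": {1960: 60.0, 1961: 62.0, 1962: 63.0}}

fert_by_year = column_means(fertility)
life_by_year = column_means(life_expectancy)
years = sorted(fert_by_year)
r_years = pearson([fert_by_year[y] for y in years],
                  [life_by_year[y] for y in years])
print("year-by-year correlation:", round(r_years, 3))
```

The same `pearson` call applied to the outputs of `row_means` gives the row-wise comparison; both tables must share the same labels for the pairing to be meaningful.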
This figure shows that fertility and life expectancy have a correlation of -0.88 across all regions, indicating that the two factors change together regardless of the country of origin. In other words, this correlation value suggests that shrinking family size and rising life expectancy are a global trend that affects all countries. Population shows no correlation at all with either fertility or life expectancy.
In addition, looking at the distribution of the fertility data, we found two peaks, corresponding to an abundance of families with ~2 and ~6 children.
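One simple way to check for such peaks is to bin the values and look for bins that are taller than both neighbours. The sketch below does this with a bin width of one child; the bin width, the strict local-maximum rule, the function name and the toy values are all assumptions for illustration, and real data would likely need some smoothing first.

```python
from collections import Counter


def histogram_peaks(values):
    """Return the integer bins whose count exceeds both neighbouring bins."""
    counts = Counter(round(v) for v in values)  # bin width of 1
    return sorted(
        bin_ for bin_, n in counts.items()
        if n > counts.get(bin_ - 1, 0) and n > counts.get(bin_ + 1, 0)
    )


# Toy fertility values, invented for illustration.
toy_values = [1.2, 0.9, 2.1, 1.9, 2.0, 2.2, 1.8, 3.1, 2.9,
              4.0, 5.1, 4.9, 6.0, 6.1, 5.9, 6.2, 7.0]

print(histogram_peaks(toy_values))
```

A distribution with two separated peaks, as in the toy data, is a hint that pooling all countries hides two distinct groups.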
You can download the materials (code and datasets) by clicking here