CS4132 Data Analytics
Prannaya Gupta (M21404)
Done as part of CS4132: Data Analytics
A side-effect of technological advancement has been the amount of light pollution in the world today. Ever since Thomas Edison’s revolutionary invention of the light bulb, the world has been thrust into a landscape of light-afflicted skies. According to various studies, around 80% of people live under light-polluted skies every day, and whilst this may not affect the day-to-day life of an individual, astronomers are very much affected by the illumination of the night sky. Even the Singaporean sky is heavily damaged by light pollution, with 99.5% of all stars being completely invisible without optical aid[1].
In this project, I aim to analyse the implications of the changes in light pollution levels over the past few years, utilising global and local data to find possible relationships. As shown below, I aim to answer a list of light pollution-related questions that utilise alternative data to analyse and interpret patterns, including simple statistical and machine-learning models.
A. Which locations of minimal light pollution intensity are optimal for astronomical observation?
While one might think that the best locations are in the middle of wilderness or large water bodies (i.e. the ocean), as with the sites of observatories such as Arecibo and FAST, some of these locations need to be filtered based on accessibility: places like the middle of the ocean are not feasible for people to assemble at, and will thus not be great locations for astrophotography. Additionally, we must also be able to identify locations where observatories could be sited.
In order to locate these, a plethora of factors are considered, like the amount of light pollution in the vicinity, the education level around that area (on the basis of the rank of the top university situated in the country), the area available, and the population density in the area. Some computations indicate that the Auckland Islands are the best location for astronomical observation, especially in the context of placing an observatory in the area.
B. How has the light pollution data around the world changed? Which countries are most susceptible to high light pollution in the future? Which countries are reducing their light pollution?
There are many countries that have had rampant increases in Light Pollution over the past few years, while others have made an effort to reduce their already increased Light Pollution. We need to map the data in order to find out which countries are susceptible to the level of problematic light pollution that can be found in cities like Singapore.
We use the Time Series Data available and model possible time series progressions, worldwide and in individual countries. This can be used to make predictions and compare against other light pollution levels.
In the end, we find that there was a sudden increase in Light Pollution back in 2014, that Palestine, Qatar and Gibraltar are the most susceptible to light pollution, and that some areas like Bermuda show decreasing levels of Light Pollution.
C. What is the relation between the average energy consumption and general demographics in each region/country and the Light Pollution?
A reason for considering this is the lack of mathematical analysis of how energy consumption in specific districts affects the light pollution there. Such analysis of Singapore’s own data hasn’t been explored much, hence this is a good way of exploring something new. We can model the types of houses, the general types of people living there, and the energy consumption based on that, and from there we can test how it affects nearby light pollution.
I intend to use the economic datasets that give the Singapore energy consumption statistics[20] and the Singapore resident demographics[21], together with the absolute positional brightness datasets[4-7], filtering to Singapore similarly to the previous research question. These can therefore be used in conjunction to find possible patterns.
In this project, we are using the following datasets:
- `countries.csv` Dataset

Firstly, we need to perform a simple set-up. This involves the following steps:

- loading the `autoreload` extension, which allows us to avoid leaving Jupyter at a stagnant file state
- `np`, `imagecodecs`, `lxml` and `gdal`
- `random`, `re` and `glob`
- `numpy` and `scipy`
- `pandas` and `lxml`
- `matplotlib`, `seaborn` and `plotly`
- `bs4` and `requests`
- `cv2`, `skimage` and `PIL`
- `sklearn`
In this project, I am using the following external libraries:
- `scikit-image`: download using conda if possible
- `pillow`: predownloaded in conda; install using pip if you are using a default Python environment
- `opencv-python`: install using pip in any case
- `lxml`: install using pip if possible
- `geopy`: install using pip to avoid installation bugs
- `geopandas`: install using conda if possible
- `rasterio` and `rasterstats`: install using conda if possible
- `keplergl`: install using pip if possible
- `leafmap`: install using conda if possible
- `geojson-area`: install using pip if possible

A sample installation script is shown below:
. '/c/ProgramData/Anaconda3/etc/profile.d'/conda.sh
conda activate
conda create --name data-analytics python==3.8
conda activate data-analytics
pip install opencv-python lxml geopy
conda install scikit-image geopandas rasterio rasterstats --yes
conda install leafmap -c conda-forge
pip install keplergl area
Please make the `data` directory as follows:
data
├───country
├───gan
├───ibol
├───nightLight
└───stats
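If preferred, the same folders can also be created programmatically; the snippet below is a small convenience sketch (the sub-folder names are exactly those listed above):

import os

for sub in ["country", "gan", "ibol", "nightLight", "stats"]:
    os.makedirs(os.path.join("data", sub), exist_ok=True)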
The `autoreload` extension

The `autoreload` extension is an IPython tool that reloads imported modules automatically when their source files change, without restarting the kernel. This specifically helps for modules that change significantly while being used, and its necessity was mostly found experimentally. Hence, this is effectively just a debugging step.
%load_ext autoreload
%autoreload 2
Some crucial modules used in this project include:

- `numpy` and `scipy` for numerical analysis
- `pandas` and `geopandas` for storing geospatial and non-geospatial data
- `matplotlib` and `seaborn` for plotting static plots
- `bs4` for web scraping for data acquisition
- `scikit-learn` for data modelling
- `folium` and `plotly` for plotting interactive geospatial data
- `rasterio` and `rasterstats` for analyses of raster files

import random, time, math, datetime, os, re
from glob import glob
import sys, tarfile, gzip
from pprint import pprint
from functools import reduce
from IPython.display import clear_output, HTML, Markdown
from IPython import display
import warnings
warnings.filterwarnings("ignore")
import np, numpy
import pandas as pd
from lxml import objectify, etree, html
from io import StringIO, BytesIO
import scipy as sp
from scipy import stats, special
import matplotlib as mpl
import matplotlib.patches as mpl_patches
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
plt.rcParams["figure.figsize"] = (16.0, 8.0)
from bs4 import BeautifulSoup
import requests, bs4
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.linear_model import LinearRegression
import folium
from folium import plugins
import plotly.express as px
import geopandas as gp
import rasterio as rs
import rasterstats as rstats
from area import area as geojson_area
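Since several of these libraries have fast-moving APIs, it can also help to record the versions in use; the loop below is a small illustrative snippet, not part of the original analysis:

# Record the versions of the core libraries for reproducibility.
for lib in (numpy, pd, gp, rs, sns):
    print(lib.__name__, lib.__version__)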
This section pertains to tools created with the express purpose of mathematical and image-based analyses.
Here, you can find the following: a self-developed correlation toolkit (gamma, beta and regularised incomplete beta functions, plus Pearson and Spearman correlation with p-values), Maclaurin and Taylor series fitting utilities, and `pandas` extension functions for reading remote files.

Just to ensure that the correlation computation is substantially accurate, I am also using a self-developed system to compute correlation.
The gamma function is defined as $$\Gamma (x) = \int_{0}^{\infty} t^{x - 1}e^{-t}\,dt$$ Based on some simple manipulation (integration by parts), we get the following equations: $$\Gamma(x) = \frac{1}{x} t^x e^{-t}\Big|_{0}^{\infty} - \dfrac{1}{x}\int_{0}^{\infty} t^{x} (- e^{-t})\,dt$$ $$= 0 + \dfrac{1}{x}\int_{0}^{\infty} t^{x} e^{-t}\,dt = \dfrac{1}{x}\Gamma (x + 1)$$ $$\implies \Gamma (x + 1) = x\Gamma (x)$$ In particular, $$\Gamma (1) = \int_{0}^{\infty} t^{1 - 1}e^{-t}\,dt = \int_{0}^{\infty} e^{-t}\,dt = (-e^{-t})\Big|_{0}^{\infty} = 1$$
Therefore, we can see the following pattern:
$$\Gamma (1 + 1) = (1) \times \Gamma (1) = 1 \implies \Gamma (2) = 1!$$ $$\Gamma (1 + 2) = (2) \times \Gamma(2) = 2 \implies \Gamma (3) = 2!$$ $$\Gamma (1 + 3) = (3) \times \Gamma(3) = 6 \implies \Gamma (4) = 3!$$ $$\vdots$$ $$\Gamma (n+1) = n! \implies \Gamma (n) = (n-1)!$$

Since the ordinary factorial only takes integer arguments, we must also evaluate the gamma function at half-integer values, starting with $\Gamma(\frac{1}{2})$.
Hence, we can compute some more to get $$\Gamma(\frac{1}{2}) = \sqrt{\pi}$$
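One standard way to see this is a short computation using the substitution $t = u^2$ and the Gaussian integral: $$\Gamma\left(\tfrac{1}{2}\right) = \int_{0}^{\infty} t^{-\frac{1}{2}} e^{-t}\,dt = \int_{0}^{\infty} \frac{1}{u}\, e^{-u^2}\, 2u\,du = 2\int_{0}^{\infty} e^{-u^2}\,du = \sqrt{\pi}$$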
The incomplete beta function, a generalization of the beta function, is defined as $$B(x;\,a,b) = \int_0^x t^{a-1}\,(1-t)^{b-1}\,dt$$
From this, we also introduce the concept of the regularised incomplete beta function, as shown below $$I_x(a,b) = \frac{B(x;\,a,b)}{\beta(a,b)}$$
Some actual values and identities are shown below: $$I_0(a,b) = 0$$ $$I_1(a,b) = 1$$ $$I_x(a,1) = x^a$$ $$I_x(1,b) = 1 - (1-x)^b$$ $$I_x(a,b) = 1 - I_{1-x}(b,a)$$ $$I_x(a,b) = I_x(a-1,b)-\frac{x^{a-1} (1-x)^b}{(a-1) \beta(a-1,b)}$$ $$I_x(a,b) = I_x(a,b-1)+\frac{x^a(1-x)^{b-1}}{(b-1) \beta(a,b-1)}$$
From here, you can define a simplified recursive function to compute the regularised incomplete beta function.
Covariance is used to determine how much two variables vary together, relative to their means.
First off, let us take a dataset which has been normalised, i.e. $$\bar x = \bar y = 0$$ This can be achieved by subtracting the mean from each value, i.e. $$x_i := x_i - \bar x$$ $$y_i := y_i - \bar y$$
We can then compute this Covariance using: $$cov(x, y) = \frac{1}{n} \sum_{i=1}^n x_i y_i $$
Again assuming the data has been normalised so that $$\bar x = \bar y = 0$$ computing the Pearson correlation coefficient is straightforward: $$r(x, y) = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2}} $$
From here, we can also compute the P Value using the previous two functions. It is simply given by: $$p(r) = 2 I_{\frac{1 - |r|}{2}}(\frac{n}{2} - 1, \frac{n}{2} - 1)$$ where n is the number of samples.
The Spearman rank correlation coefficient is used to find the correlation between the ranks of the data. Below, we use the tied-rank method, which simply finds the ranks of the actual data and then computes the Pearson correlation coefficient of these ranks.
Thus, the formula is as follows: $$\rho(x, y) = r(rank(x), rank(y))$$
def gamma(n):
"""
A Simplified Gamma Function for "powers" of 1/2
Gamma Function for Integers are given by: Γ(n) = (n-1)!
Gamma Function for Values ending with 1/2: Γ(n) = 0.5 * 1.5 * ... * (n-1) * √(π)
Gamma Function is generally given by: Γ(x) = ∫₀^∞ t^(x-1) e^(-t) dt
"""
return np.arange(1 - (n%1), n).prod() * np.pi ** (n%1)
def beta(x, y):
"""
Actual Beta Function based on the Simplified Gamma Function above.
"""
return gamma(x)*gamma(y)/gamma(x+y)
# Regularised incomplete beta function I_x(a, b), computed via the recurrences listed above.
def I(x, a, b):
if gamma(a) == np.inf or gamma(b) == np.inf or gamma(a+b) == 0: return 0
if x == 0 or x == 1: return x
if b == 1: return x ** a
if a == 1: return 1 - (1-x)**b
if b > 1: return I(x, a, b-1) + (x**a * (1-x)**(b-1))/((b-1)*beta(a, b-1))
if a > 1: return I(x, a-1, b) - (x**(a-1) * (1-x)**b)/((a-1)*beta(a-1, b))
return ((-1)**a) * I(x/(x-1), a, 1-a-b) * beta(a, 1-a-b)
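# As a quick sanity check (illustrative only, not part of the analysis), the simplified
# gamma function can be compared against the factorial and sqrt(pi) values derived above,
# and I against scipy.special.betainc, which computes the same regularised incomplete beta.
print(gamma(5), math.factorial(4))               # 24.0 vs 24
print(gamma(0.5), np.sqrt(np.pi))                # both ~ 1.7725
print(I(0.3, 4, 6), special.betainc(4, 6, 0.3))  # both ~ 0.27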
def corr(data):
norm = data - data.mean(axis=0)
return norm.prod(axis=1).sum() / np.sqrt((norm**2).sum(axis=0).prod())
def p(r, bound):
return 2*I((1-abs(r))/2, bound, bound)
def pearson(data):
r = corr(data)
return r, p(r, data.shape[0]/2-1)
def pearsonr(x, y):
x, y = x - x.mean(axis=0), y - y.mean(axis=0)
r = (x.T @ y).sum(axis=0) / np.sqrt(((x**2).sum(axis=0) * (y**2).sum(axis=0)))
p_val = np.vectorize(lambda r_: p(r_, x.shape[0]/2-1))(r)  # p() already includes the factor of 2
return np.stack((r, p_val), axis=len(r.shape))
def rank(x):
return pd.Series(x).rank().to_numpy()
def spearman(x, y):
return pearsonr(rank(x), rank(y))
#return 1 - (6*((rank(x) - rank(y))**2).sum() / ((lambda x: x**3 - x)(len(x))))
def rankedCorrelation(df):
    # Pairwise correlations, with each pair key sorted alphabetically so that
    # (A, B) and (B, A) collapse onto the same "Categories" label.
    pairs = df.corr().stack()
    pairs = pd.Series(list(pairs), index=pairs.index.map(lambda tup: " ".join(sorted(tup))))
    corr = (pairs.sort_index()
                 .reset_index()
                 .rename(columns={"index": "Categories", 0: "Correlation"})
                 .groupby("Categories").Correlation.mean())

    def distinct_words(categories):
        # True when the two column names in the pair share no common word
        # (words are obtained by splitting the CamelCase names).
        words = [re.sub(r"([A-Z])", r" \1", text).strip().split() for text in categories.split(" ")]
        return all(i != j for j in words[0] for i in words[1])

    # Keep only pairs involving a "Light" column, excluding counts,
    # self-correlations and weak correlations (|r| <= 0.5).
    mask = (corr.index.str.contains("Light")
            & ~corr.index.str.contains("count")
            & (np.abs(corr) > 0.5)
            & (np.abs(corr) != 1)
            & pd.Series(corr.index, index=corr.index).apply(distinct_words))

    ranked = pd.DataFrame(corr[mask]).rename(index=lambda categories: categories.replace(" ", " and "))
    return ranked.loc[np.abs(ranked.Correlation).sort_values(ascending=False).index]
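As a quick sanity check on the hand-rolled estimators above, we can compare `pearson` against `scipy.stats.pearsonr` on a small synthetic sample (an illustrative snippet; `x_chk` and `y_chk` are placeholder names, not project data):

# Compare the self-developed Pearson estimator with scipy's reference implementation.
rng = np.random.RandomState(0)
x_chk = rng.normal(size=30)
y_chk = 0.6 * x_chk + rng.normal(size=30)
r_own, p_own = pearson(np.column_stack((x_chk, y_chk)))
r_ref, p_ref = stats.pearsonr(x_chk, y_chk)
print(f"own  : r = {r_own:.4f}, p = {p_own:.4g}")
print(f"scipy: r = {r_ref:.4f}, p = {p_ref:.4g}")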
In this section, we adopt some alternative practices to incorporate Maclaurin and Taylor Series fitting methods into this project.
We know that the Maclaurin Series operates based on the following approximation: $$ f(x) \approx \sum_{r = 0}^{N} \frac{f^{(r)}(0)}{r!} x^r $$ where $N$ is a sufficiently large number.
We can also simplify this expression by placing limit as below: $$ f(x) = \lim_{N\to\infty} \sum_{r = 0}^{N} \frac{f^{(r)}(0)}{r!} x^r = \sum_{r = 0}^{\infty} \frac{f^{(r)}(0)}{r!} x^r $$
Let's thus consider that we apply this operation for the function $$ g(x) = \frac{1}{1+x} $$
From here, we can get some very simple values as shown below: \begin{align*} g^{(0)}(x) &= \left ( 1 + x \right )^{-1} \\ g^{(1)}(x) &= g'(x) \\ &= \frac{d}{dx} \left [ \left ( 1 + x \right )^{-1} \right ] \\ &= (-1) \cdot \left ( 1 + x \right )^{-2} \\ &= - \left ( 1 + x \right )^{-2} \\ g^{(1)}(x) &= (-1)^{1} \cdot 1! \cdot \left ( 1 + x \right )^{-2} \\\\ \hline \\ g^{(2)}(x) &= g''(x) \\ &= \frac{d^2}{dx^2} \left [ \left ( 1 + x \right )^{-1} \right ] \\ &= \frac{d}{dx} \left [ - \left ( 1 + x \right )^{-2} \right ] \\ &= (-1) \cdot (-2) \cdot \left ( 1 + x \right )^{-3} \\ &= 2 \left ( 1 + x \right )^{-3} \\ g^{(2)}(x) &= (-1)^{2} \cdot 2! \cdot \left ( 1 + x \right )^{-3} \\\\ \hline \\ g^{(3)}(x) &= g'''(x) \\ &= \frac{d^3}{dx^3} \left [ \left ( 1 + x \right )^{-1} \right ] \\ &= \frac{d}{dx} \left [ 2 \left ( 1 + x \right )^{-3} \right ] \\ &= (2) \cdot (-3) \cdot \left ( 1 + x \right )^{-4} \\ &= - 6 \left ( 1 + x \right )^{-4} \\ g^{(3)}(x) &= (-1)^{3} \cdot 3! \cdot \left ( 1 + x \right )^{-4} \\ \end{align*}
Hence, we derive the following expression: $$ g^{(k)}(x) = (-1)^k \cdot k! \cdot \left ( 1 + x \right )^{-k-1} $$
Hence, for any k, we get the following: $$ g^{(k)}(0) = (-1)^k \cdot k! $$
Thus, we get the following: \begin{align*} g(x) &= \sum_{r = 0}^{\infty} \frac{g^{(r)}(0)}{r!} x^r \\ &= \sum_{r = 0}^{\infty} \frac{(-1)^r \cdot r!}{r!} x^r \\ \frac{1}{1+x} &= \sum_{r = 0}^{\infty} \left (-x \right)^r \\ \frac{1}{1+x} &= \sum_{r = 0}^{\infty} \left[1 - \left(1+x \right) \right]^r \end{align*}
From here, we define a value $y = 1 + x$. Thus, we can get the following: $$ \frac{1}{y} = \sum_{r = 0}^{\infty} \left(1 - y \right)^r $$
Thus, we get a series expansion of $ f(x) = \frac{1}{x} $ as follows: $$ f(x) = \frac{1}{x} = \sum_{r = 0}^{\infty} \left(1 - x \right)^r = \sum_{r = 0}^{\infty} (-1)^r \left(x - 1 \right)^r $$
Note that this is not strictly a Maclaurin series, since $f(0)$ and its subsequent derivatives do not exist; the introduction of the alternate variable is what lets us expand such a function anyway. Hence, we can introduce a term $a$ such that the following is valid:
$$ f(x) = \sum_{r = 0}^{\infty} \frac{f^{(r)}(a)}{r!} (x-a)^r $$

This is the definition of the Taylor Series; the Maclaurin Series is simply the special case with $a = 0$, and when $a \neq 0$ the more general Taylor Series is used.
For example, applying $a = 1$ on the function $f$ mentioned above, we can derive the Taylor Series of $f$. Firstly, we notice the following: $$ f^{(k)}(1) = (-1)^k \cdot k! $$
This piece of information has not changed from function $g$. Hence, we can now apply the Taylor Series: \begin{align*} f(x) &= \sum_{r = 0}^{\infty} \frac{f^{(r)}(1)}{r!} (x - 1)^r \\ &= \sum_{r = 0}^{\infty} \frac{(-1)^r \cdot r!}{r!} (x - 1)^r \\ \frac{1}{x} &= \sum_{r = 0}^{\infty} \left (1 - x \right)^r \\ f(x) &= \sum_{r = 0}^{\infty} (-1)^r \left(x - 1 \right)^r \end{align*}
Hence, for a function $f(x)$, to be expanded about a point $x = a$, let us define a new function $g(x)$ such that $g(x - a) = f(x)$. Then,
$$g(x) = g(0) + g'(0)x + \frac{1}{2}g''(0)x^2 + \dots$$ $$g(x) = f(a) + f'(a)x + \frac{1}{2}f''(a)x^2 + \dots$$ $$f(x) = f(a) + f'(a)(x - a) + \frac{1}{2}f''(a)(x - a)^2 + \dots$$ Hence, $$ f(x) = \sum_{r = 0}^{\infty} \frac{f^{(r)}(a)}{r!} (x-a)^r $$ for some $a$ such that $f(a)$ exists.
Using `numpy`'s `polyfit` function, we can compute the coefficients of a function represented simply as a Taylor Series or Maclaurin Series expansion, up to a degree of 100.
def maclaurin(x, y, N=100):
    # Fit a degree-N polynomial to (x, y) and evaluate it back at the sample points x.
    return (np.polyfit(x, y, N)[:, np.newaxis] * (x ** np.arange(N, -1, -1)[:, np.newaxis])).sum(axis=0)
def taylor(x, y, a, N=100):
return maclaurin((x-a), y, N)
def taylorplot(x, y, **kwargs):
sns.lineplot(x, taylor(x, y, x.min()), ax=plt.gca())
To prove that these methods work, we investigate two curves: the sigmoid curve used in logistic regression and in deep-learning activation functions, and the sine curve, which lends itself easily to a Taylor Series.
x = np.linspace(-10, 10, 100)
z = 1 / (1 + np.exp(-x))
plt.scatter(x, z)
plt.plot(x, maclaurin(x, z))
plt.title("Graph of Sigmoid fitted using Maclaurin")
Text(0.5, 1.0, 'Graph of Sigmoid fitted using Maclaurin')
x = np.linspace(0,10*np.pi, 100)
y = np.sin(x)
plt.scatter(x, y)
plt.plot(x, taylor(x, y, np.pi))
plt.title("Graph of Sine Curve fitted using Taylor")
Text(0.5, 1.0, 'Graph of Sine Curve fitted using Taylor')
`pandas` Extension Functions

This snippet contains functions to retrieve large online sources ourselves instead of simply relying on `pandas`' own URL handling, which runs into some errors.
def read_csv(loc, *args, **kwargs):
    # Remote source: fetch the text ourselves and re-enter with the downloaded content.
    if re.match(r"https?://", loc):
        return read_csv(requests.get(loc, allow_redirects=True).text, *args, **kwargs)
    loc = loc.strip("\n")
    # Raw CSV text (e.g. a downloaded body) contains newlines; a file path does not.
    if "\n" in loc:
        return pd.read_csv(StringIO(loc), *args, **kwargs)
    else:
        return pd.read_csv(loc, *args, **kwargs)
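The helper can then be used exactly like `pd.read_csv`, on either a local path or a remote URL; the calls below are illustrative and commented out so that no network request is made at this point:

# Example usage of the helper above (illustrative):
# read_csv("data/country/geocountries.csv")   # local file
# read_csv("https://data.un.org/_Docs/SYB/CSV/SYB63_1_202105_Population,%20Surface%20Area%20and%20Density.csv")   # remote file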
I will be acquiring the following datasets in this project:
The GeoNames geographical database covers all countries and contains over eleven million placenames, available for download free of charge. The dataset contains some key information like the Continent, Area in km² and Population. I have renamed some columns for clarity and fixed some `pandas` reading errors, such as NA, which denotes the continent North America, being read as a null value.
geocountries = pd.read_html(requests.get("https://www.geonames.org/countries/").content)[1].rename(columns={"ISO-3166alpha2":"CountryCode", "ISO-3166alpha3":"CountryCode3", "ISO-3166numeric":"NumericCountryCode"}).replace('São Tomé and PrÃ\xadncipe', "São Tomé and Príncipe").replace('Saint Barthélemy', "Saint Barthélemy").replace("Czechia", "Czech Republic").replace('Réunion', "Réunion").replace('The Gambia', "Gambia").replace('South Georgia and South Sandwich Islands', 'South Georgia and the South Sandwich Islands').replace('St Kitts and Nevis', 'Saint Kitts and Nevis').replace('Cabo Verde', 'Cape Verde').replace("Eswatini", "Swaziland").replace("Curaçao", "Curaçao").replace('Ã\x85land', "Aland").replace('Timor-Leste', "East Timor")
geocountries.to_csv("data/country/geocountries.csv", index=False)
geocountries = pd.read_csv("data/country/geocountries.csv")
geocountries.Continent.replace(np.nan, "NA", inplace=True)
geocountries
CountryCode | CountryCode3 | NumericCountryCode | fips | Country | Capital | Area in km² | Population | Continent | |
---|---|---|---|---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 77006 | EU |
1 | AE | ARE | 784 | AE | United Arab Emirates | Abu Dhabi | 82880.0 | 9630959 | AS |
2 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 37172386 | AS |
3 | AG | ATG | 28 | AC | Antigua and Barbuda | St. John's | 443.0 | 96286 | NA |
4 | AI | AIA | 660 | AV | Anguilla | The Valley | 102.0 | 13254 | NA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
245 | YE | YEM | 887 | YM | Yemen | Sanaa | 527970.0 | 28498687 | AS |
246 | YT | MYT | 175 | MF | Mayotte | Mamoudzou | 374.0 | 279471 | AF |
247 | ZA | ZAF | 710 | SF | South Africa | Pretoria | 1219912.0 | 57779622 | AF |
248 | ZM | ZMB | 894 | ZA | Zambia | Lusaka | 752614.0 | 17351822 | AF |
249 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 14439018 | AF |
250 rows × 9 columns
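For reference, the underlying cause of the Continent fix above is that `pandas` treats the literal string "NA" as a missing-value marker by default. An alternative (not used in this project) is to disable that behaviour at parse time; the `geocountries_alt` name below is just for illustration:

# Alternative approach (not used here): keep "NA" as a literal string while parsing.
geocountries_alt = pd.read_csv("data/country/geocountries.csv", keep_default_na=False)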
Google Developers' `countries.csv`, which contains latitude-longitude data for each country. This identifies a plausible centre for each country.

I have cleaned the data by keeping the latitude, longitude and country name, and adapted the country names to help with the merge later.
latlong = pd.read_html(requests.get("https://developers.google.com/public-data/docs/canonical/countries_csv").content)[0][["name", "latitude", "longitude"]].rename(columns={"name":"Country"}).replace('Myanmar [Burma]', "Myanmar").replace("Congo [Republic]", "Congo Republic").replace('Congo [DRC]', 'DR Congo').replace('Saint Vincent and the Grenadines', "St Vincent and Grenadines").replace("Falkland Islands [Islas Malvinas]", "Falkland Islands").replace('Cocos [Keeling] Islands', 'Cocos (Keeling) Islands').replace('Palestinian Territories', "Palestine").replace('U.S. Minor Outlying Islands', 'U.S. Outlying Islands').replace('Macedonia [FYROM]', 'North Macedonia').replace('Heard Island and McDonald Islands', 'Heard and McDonald Islands').replace("Côte d'Ivoire", "Ivory Coast").replace("Myanmar [Burma]", "Myanmar")
latlong.to_csv("data/country/latlong.csv", index=False)
latlong = pd.read_csv("data/country/latlong.csv")
latlong
Country | latitude | longitude | |
---|---|---|---|
0 | Andorra | 42.546245 | 1.601554 |
1 | United Arab Emirates | 23.424076 | 53.847818 |
2 | Afghanistan | 33.939110 | 67.709953 |
3 | Antigua and Barbuda | 17.060816 | -61.796428 |
4 | Anguilla | 18.220554 | -63.068615 |
... | ... | ... | ... |
240 | Yemen | 15.552727 | 48.516388 |
241 | Mayotte | -12.827500 | 45.166244 |
242 | South Africa | -30.559482 | 22.937506 |
243 | Zambia | -13.133897 | 27.849332 |
244 | Zimbabwe | -19.015438 | 29.154857 |
245 rows × 3 columns
Wikipedia is kind enough to have compiled a list of countries and dependencies by population density, which lists the number of people per square kilometer in some specific areas.
popden = pd.read_html(requests.get("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population_density").content)[0]
# Flatten the two-level header produced by read_html into single strings.
popden.columns = popden.columns.map("".join)
# Drop the imperial-unit and source columns, and shorten the remaining names.
popden = popden.drop(columns=["Areasqmi", "Density/sqmi", "Sources& DatesSources& Dates"]).rename(columns={"Country (or territory)Country (or territory)": "name", "PopulationPopulation": "Population", "Areakm2": "Area", "Density/km2": "Density"})
# Strip footnote markers such as "*" or "(...)" from the country names.
popden.name = popden.name.apply(lambda name: name[:name.index("*")-1] if "*" in name else name[:name.index("(")-1] if "(" in name else name)
popden
name | Population | Area | Density | |
---|---|---|---|---|
0 | Macau | 631636 | 30.0 | 21055.0 |
1 | Monaco | 38682 | 2.0 | 19341.0 |
2 | Singapore | 5757499 | 716.0 | 8041.0 |
3 | Hong Kong | 7371730 | 1104.0 | 6677.0 |
4 | Gibraltar | 33718 | 6.0 | 5620.0 |
... | ... | ... | ... | ... |
247 | Mongolia | 3170216 | 1564116.0 | 2.0 |
248 | Pitcairn Islands | 56 | 5.0 | 11.0 |
249 | Falkland Islands | 3234 | 12173.0 | 0.0 |
250 | Svalbard and Jan Mayen | 2655 | 62422.0 | 0.0 |
251 | Greenland | 56564 | 2166086.0 | 0.0 |
252 rows × 4 columns
DataHub's Countries GeoJSON Dataset is a geodata data package providing geojson polygons for all the world's countries. The data comes from Natural Earth, a community effort to make visually pleasing, well-crafted maps with cartography or GIS software at small scale.
I have opened this dataset as a `geopandas.GeoDataFrame` object, as shown below. Following this, I have saved it in GeoJSON format as `world_countries.json` for later usage by `folium` and `plotly`.
countries_geojson = gp.read_file("https://datahub.io/core/geo-countries/r/countries.geojson").rename(columns={"ADMIN":"Country"}).replace("United States of America", "United States").replace('United Republic of Tanzania', "Tanzania").replace('Democratic Republic of the Congo', "DR Congo").replace('Republic of Serbia', "Serbia").replace('The Bahamas', "Bahamas").replace('Hong Kong S.A.R.', "Hong Kong").replace('Guinea Bissau', 'Guinea-Bissau').replace("United States Minor Outlying Islands", "U.S. Outlying Islands").replace('South Georgia and South Sandwich Islands', 'South Georgia and the South Sandwich Islands').replace('Macao S.A.R', "Macao").replace('Republic of Congo', "Congo Republic").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace('Saint Vincent and the Grenadines', 'St Vincent and Grenadines').replace('Vatican', 'Vatican City').replace('Macedonia', 'North Macedonia').replace('Federated States of Micronesia', "Micronesia").replace('United States Virgin Islands', 'U.S. Virgin Islands').replace('Saint Barthelemy', 'Saint Barthélemy').replace("Heard Island and McDonald Islands", "Heard and McDonald Islands").replace('French Southern and Antarctic Lands',"French Southern Territories")
countries_geojson.to_file("data/country/world_countries.json", driver='GeoJSON')
countries_geojson = gp.read_file("data/country/world_countries.json")
countries_geojson
Country | ISO_A3 | geometry | |
---|---|---|---|
0 | Aruba | ABW | POLYGON ((-69.99694 12.57758, -69.93639 12.531... |
1 | Afghanistan | AFG | POLYGON ((71.04980 38.40866, 71.05714 38.40903... |
2 | Angola | AGO | MULTIPOLYGON (((11.73752 -16.69258, 11.73851 -... |
3 | Anguilla | AIA | MULTIPOLYGON (((-63.03767 18.21296, -63.09952 ... |
4 | Albania | ALB | POLYGON ((19.74777 42.57890, 19.74601 42.57993... |
... | ... | ... | ... |
250 | Samoa | WSM | MULTIPOLYGON (((-171.57002 -13.93816, -171.564... |
251 | Yemen | YEM | MULTIPOLYGON (((53.30824 12.11839, 53.31027 12... |
252 | South Africa | ZAF | MULTIPOLYGON (((37.86378 -46.94085, 37.83644 -... |
253 | Zambia | ZMB | POLYGON ((31.11984 -8.61663, 31.14102 -8.60619... |
254 | Zimbabwe | ZWE | POLYGON ((30.01065 -15.64623, 30.05024 -15.640... |
255 rows × 3 columns
countries_geojson.plot(figsize=(20, 10)).set_title("Plot of Countries Data")
Text(0.5, 1.0, 'Plot of Countries Data')
To verify how accurate the polygons are, we zoom in on Singapore, shown below. We plot a `leafmap` map using its `keplergl` backend to analyse how accurately the polygons fit the countries.

As we can see, the data does not fill the country border fully, but it is sufficiently accurate at a large scale to be considered valid. Hence, we continue with using this dataset.
latlong[latlong.Country == "Singapore"]
Country | latitude | longitude | |
---|---|---|---|
194 | Singapore | 1.352083 | 103.819836 |
import leafmap.kepler as leafmap
m = leafmap.Map(center=[1.352083, 103.819836], zoom=10)
m.add_gdf(countries_geojson, layer_name="Countries")
m.to_html("maps/countries.html")
m
DataHub's Natural Earth Polygons GeoJSON Dataset is a geodata package providing geojson polygons for the largest administrative subdivisions in every country. The data comes from Natural Earth, a community effort to make visually pleasing, well-crafted maps with cartography or GIS software at small scale.

I have opened this dataset as a `geopandas.GeoDataFrame` object, as shown below. Following this, I have saved it in a local file, `world_locations.json`, for later usage by `folium` and `plotly`.
states_geojson = gp.read_file("https://datahub.io/core/geo-ne-admin1/r/admin1.geojson").replace("United States of America", "United States").replace('United Republic of Tanzania', "Tanzania").replace('Democratic Republic of the Congo', "DR Congo").replace('Republic of Serbia', "Serbia").replace('The Bahamas', "Bahamas").replace('Hong Kong S.A.R.', "Hong Kong").replace('Guinea Bissau', 'Guinea-Bissau').replace("United States Minor Outlying Islands", "U.S. Outlying Islands").replace('South Georgia and South Sandwich Islands', 'South Georgia and the South Sandwich Islands').replace('Macao S.A.R', "Macao").replace('Republic of Congo', "Congo Republic").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace('Saint Vincent and the Grenadines', 'St Vincent and Grenadines').replace('Vatican', 'Vatican City').replace('Macedonia', 'North Macedonia').replace('Federated States of Micronesia', "Micronesia").replace('United States Virgin Islands', 'U.S. Virgin Islands').replace('Saint Barthelemy', 'Saint Barthélemy').replace("Heard Island and McDonald Islands", "Heard and McDonald Islands").replace('French Southern and Antarctic Lands',"French Southern Territories")
states_geojson.to_file("data/country/world_locations.json", driver='GeoJSON')
del states_geojson
states_geojson = gp.read_file("data/country/world_locations.json").replace("United States of America", "United States").replace('United Republic of Tanzania', "Tanzania").replace('Democratic Republic of the Congo', "DR Congo").replace('Republic of Serbia', "Serbia").replace('The Bahamas', "Bahamas").replace('Hong Kong S.A.R.', "Hong Kong").replace('Guinea Bissau', 'Guinea-Bissau').replace("United States Minor Outlying Islands", "U.S. Outlying Islands").replace('South Georgia and South Sandwich Islands', 'South Georgia and the South Sandwich Islands').replace('Macao S.A.R', "Macao").replace('Republic of Congo', "Congo Republic").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace('Saint Vincent and the Grenadines', 'St Vincent and Grenadines').replace('Vatican', 'Vatican City').replace('Macedonia', 'North Macedonia').replace('Federated States of Micronesia', "Micronesia").replace('United States Virgin Islands', 'U.S. Virgin Islands').replace('Saint Barthelemy', 'Saint Barthélemy').replace("Heard Island and McDonald Islands", "Heard and McDonald Islands").replace('French Southern and Antarctic Lands',"French Southern Territories")
states_geojson
ISO3166-1-Alpha-3 | country | id | name | geometry | |
---|---|---|---|---|---|
0 | ABW | Aruba | 5150 | Aruba | POLYGON ((-69.99694 12.57758, -69.93639 12.531... |
1 | AFG | Afghanistan | 1741 | Badghis | POLYGON ((64.30624 35.39722, 64.32468 35.40177... |
2 | AFG | Afghanistan | 1742 | Hirat | POLYGON ((61.36393 35.59824, 61.36548 35.59850... |
3 | AFG | Afghanistan | 1743 | Bamyan | POLYGON ((67.74391 35.44342, 67.75476 35.44412... |
4 | AFG | Afghanistan | 1744 | Balkh | POLYGON ((67.25913 37.18515, 67.28145 37.18866... |
... | ... | ... | ... | ... | ... |
4642 | ZWE | Zimbabwe | 529 | Manicaland | POLYGON ((33.01165 -17.38399, 32.99745 -17.404... |
4643 | ZWE | Zimbabwe | 530 | Matabeleland South | POLYGON ((29.43994 -19.87930, 29.45699 -19.874... |
4644 | ZWE | Zimbabwe | 531 | Bulawayo | POLYGON ((28.49757 -20.06270, 28.50532 -20.062... |
4645 | ZWE | Zimbabwe | 532 | Masvingo | POLYGON ((31.04181 -19.25226, 31.19870 -19.248... |
4646 | ZWE | Zimbabwe | 533 | Mashonaland West | POLYGON ((30.01065 -15.64623, 30.05024 -15.640... |
4647 rows × 5 columns
states_geojson.plot(figsize=(20, 10)).set_title("Plot of States Data")
Text(0.5, 1.0, 'Plot of States Data')
These are mostly datasets found on the UN Database. The data is quite dirty, so it needs to be cleaned. Some additional datasets do not have a CSV link, hence they are attached with this project notebook. The datasets covered are described in the subsections below.
Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding Population, Surface Area and Population Density, although the data needs to be cleaned.
popsaden = pd.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB63_1_202105_Population,%20Surface%20Area%20and%20Density.csv")
popsaden.to_csv("data/country/popsaden.csv", index=False)
popsaden = pd.read_csv("data/country/popsaden.csv")
popsaden = popsaden.iloc[1:, 1:5].rename(columns={"Population, density and surface area":"Country", "Unnamed: 2":"Year", "Unnamed: 3":"Quantity", "Unnamed: 4":"Value"})
popsaden = popsaden[popsaden.Country.isin(popsaden["Country"].unique()[31:])]
popsaden.Year = popsaden.Year.astype(int)
popsaden = popsaden.groupby(["Country", "Year", "Quantity"]).Value.sum().unstack().rename(columns={"Population mid-year estimates (millions)": "Population", "Population mid-year estimates for males (millions)": "MalePopulation", "Population mid-year estimates for females (millions)": "FemalePopulation", "Sex ratio (males per 100 females)": "SexRatio", "Population aged 0 to 14 years old (percentage)": "ChildrenPopulation", "Population aged 60+ years old (percentage)": "ElderlyPopulation", "Population density":"PopulationDensity", "Surface area (thousand km2)": "SurfaceArea"})[["Population", "MalePopulation", "FemalePopulation", "SexRatio", "ChildrenPopulation", "ElderlyPopulation", "SurfaceArea", "PopulationDensity"]].astype(float)
popsaden.loc[:, ["ChildrenPercent", "ElderlyPercent"]]= popsaden[["ChildrenPopulation", "ElderlyPopulation"]].to_numpy()
popsaden[["Population", "MalePopulation", "FemalePopulation"]] = (popsaden[["Population", "MalePopulation", "FemalePopulation"]] * (10**6)).astype(int, errors="ignore")
popsaden.SexRatio = popsaden.SexRatio / 100
popsaden[["ChildrenPopulation", "ElderlyPopulation"]] = (popsaden[["ChildrenPopulation", "ElderlyPopulation"]] * popsaden[["Population"]].to_numpy() / 100).astype(int, errors="ignore")
popsaden.SurfaceArea = popsaden.SurfaceArea * 1000
popsaden.reset_index(inplace=True)
isna = pd.DataFrame(popsaden.isna().any(), columns=["isna"])
popsaden[['MalePopulation', 'FemalePopulation', 'SexRatio', 'ChildrenPopulation', 'ElderlyPopulation', 'SurfaceArea']] = isna[isna["isna"]].apply(axis=1, func=lambda row: popsaden[row.name].fillna(popsaden.groupby("Country")[row.name].transform('mean'))).T[['MalePopulation', 'FemalePopulation', 'SexRatio', 'ChildrenPopulation', 'ElderlyPopulation', 'SurfaceArea']]
popsaden = popsaden.replace("United States of America", "United States").replace("Brunei Darussalam", "Brunei").replace("Republic of Korea", "South Korea").replace("Dem. People's Rep. Korea", "North Korea").replace("Iran (Islamic Republic of)", "Iran").replace("Lao People's Dem. Rep.", "Laos").replace("Venezuela (Boliv. Rep. of)", 'Venezuela').replace("Bolivia (Plurin. State of)", "Bolivia").replace('Viet Nam', "Vietnam").replace('Russian Federation', "Russia").replace('Syrian Arab Republic', "Syria").replace("Wallis and Futuna Islands", "Wallis and Futuna").replace("Saint Martin (French part)", "Saint Martin").replace("Falkland Islands (Malvinas)", "Falkland Islands").replace('Republic of Moldova', 'Moldova').replace('Sint Maarten (Dutch part)', 'Sint Maarten').replace('Côte d’Ivoire', "Ivory Coast").replace('United Rep. of Tanzania', "Tanzania").replace('Timor-Leste', "East Timor").replace('State of Palestine', "Palestine").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace("Congo", "Congo Republic").replace("Dem. Rep. of the Congo", "DR Congo").replace("Cabo Verde", "Cape Verde").replace('China, Hong Kong SAR', "Hong Kong").replace('Saint Vincent & Grenadines', 'St Vincent and Grenadines').replace("Eswatini", "Swaziland").replace("Czechia", "Czech Republic").replace('Micronesia (Fed. States of)', "Micronesia").replace('United States Virgin Islands', "U.S. Virgin Islands").replace('China, Macao SAR', "Macao")
popsaden
Quantity | Country | Year | Population | MalePopulation | FemalePopulation | SexRatio | ChildrenPopulation | ElderlyPopulation | SurfaceArea | PopulationDensity | ChildrenPercent | ElderlyPercent |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2005 | 25654300.0 | 13239700.0 | 12414600.0 | 1.066462 | 1.227805e+07 | 9.384086e+05 | 652864.0 | 39.2952 | 47.8596 | 3.6579 |
1 | Afghanistan | 2010 | 29185500.0 | 14935800.0 | 14249800.0 | 1.048141 | 1.406102e+07 | 1.130646e+06 | 652864.0 | 44.7041 | 48.1781 | 3.8740 |
2 | Afghanistan | 2017 | 36296100.0 | 18652000.0 | 17644100.0 | 1.057122 | 1.585504e+07 | 1.475836e+06 | 652864.0 | 55.5956 | 43.6825 | 4.0661 |
3 | Afghanistan | 2019 | 38041800.0 | 19529700.0 | 18512000.0 | 1.054975 | 1.615723e+07 | 1.584631e+06 | 652864.0 | 58.2694 | 42.4723 | 4.1655 |
4 | Albania | 2005 | 3086800.0 | 1552000.0 | 1534800.0 | 1.011167 | 8.180298e+05 | 3.792597e+05 | 28748.0 | 112.6573 | 26.5009 | 12.2865 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
986 | Zambia | 2019 | 17861000.0 | 8843200.0 | 9017800.0 | 0.980638 | 7.941447e+06 | 6.044520e+05 | 752612.0 | 24.0265 | 44.4625 | 3.3842 |
987 | Zimbabwe | 2005 | 12076700.0 | 5783700.0 | 6293000.0 | 0.919065 | 4.993341e+06 | 5.694768e+05 | 390757.0 | 31.2180 | 41.3469 | 4.7155 |
988 | Zimbabwe | 2010 | 12697700.0 | 6049000.0 | 6648700.0 | 0.909805 | 5.283488e+06 | 5.587369e+05 | 390757.0 | 32.8234 | 41.6098 | 4.4003 |
989 | Zimbabwe | 2017 | 14236600.0 | 6777100.0 | 7459500.0 | 0.908508 | 6.064493e+06 | 6.367035e+05 | 390757.0 | 36.8013 | 42.5979 | 4.4723 |
990 | Zimbabwe | 2019 | 14645500.0 | 6983400.0 | 7662100.0 | 0.911413 | 6.174265e+06 | 6.705296e+05 | 390757.0 | 37.8583 | 42.1581 | 4.5784 |
991 rows × 12 columns
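The same missing-data strategy recurs for every UN table in this section: any remaining gap in a numeric column is filled with that country's own mean for that column, via `groupby(...).transform('mean')`. In isolation, the pattern looks like this (a minimal sketch on a toy frame, not project data):

# Toy illustration of the per-country mean imputation used for the UN datasets.
toy = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Value":   [1.0, np.nan, 10.0, 30.0],
})
toy["Value"] = toy["Value"].fillna(toy.groupby("Country")["Value"].transform("mean"))
print(toy)  # the missing value for country A becomes 1.0; country B is untouched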
Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding Population Increase, Life Expectancy, Infant and Maternal Mortality and Total Fertility Rate, although the data needs to be cleaned. The column names represent the following quantities:
- `InfantMortality`: Infant mortality for both sexes (for each live birth)
- `LifeExpectancy`: Life expectancy at birth for both sexes (years)
- `FemaleLifeExpectancy`: Life expectancy at birth for females (years)
- `MaleLifeExpectancy`: Life expectancy at birth for males (years)
- `MaternalMortality`: Maternal mortality ratio (deaths per person)
- `PopulationIncrease`: Population annual rate of increase (percent)
- `FertilityRate`: Total fertility rate (children per woman)

humanity = pd.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB62_246_201907_Population%20Growth,%20Fertility%20and%20Mortality%20Indicators.csv", error_bad_lines=False, engine='python', encoding = "ISO-8859-1")
humanity.to_csv("data/country/humanity.csv", index=False)
del humanity
humanity = pd.read_csv("data/country/humanity.csv")
humanity = humanity.iloc[1:, 1:5].rename(columns={"Population growth and indicators of fertility and mortality":"Country", "Unnamed: 2":"Year", "Unnamed: 3":"Quantity", "Unnamed: 4":"Value"})
humanity = humanity[humanity.Country.isin(humanity.Country.unique()[31:])]
humanity.Year = humanity.Year.astype(int)
humanity = humanity.groupby(["Country", "Year", "Quantity"]).Value.sum().unstack().rename(columns={'Infant mortality for both sexes (per 1,000 live births)': "InfantMortality", 'Life expectancy at birth for both sexes (years)': "LifeExpectancy", 'Life expectancy at birth for females (years)': "FemaleLifeExpectancy", 'Life expectancy at birth for males (years)': "MaleLifeExpectancy", 'Maternal mortality ratio (deaths per 100,000 population)': "MaternalMortality", 'Population annual rate of increase (percent)': "PopulationIncrease", "Total fertility rate (children per women)": "FertilityRate"})[["PopulationIncrease", "LifeExpectancy", "MaleLifeExpectancy", "FemaleLifeExpectancy", "FertilityRate", "InfantMortality", "MaternalMortality"]].astype(float).reset_index()
humanity.InfantMortality = humanity.InfantMortality / 1000
humanity.MaternalMortality = humanity.MaternalMortality / 100000
isna = pd.DataFrame(humanity.isna().any(), columns=["isna"])
humanity[["PopulationIncrease", "LifeExpectancy", "MaleLifeExpectancy", "FemaleLifeExpectancy", "FertilityRate", "InfantMortality", "MaternalMortality"]] = isna[isna["isna"]].apply(axis=1, func=lambda row: humanity[row.name].fillna(humanity.groupby("Country")[row.name].transform('mean'))).T[["PopulationIncrease", "LifeExpectancy", "MaleLifeExpectancy", "FemaleLifeExpectancy", "FertilityRate", "InfantMortality", "MaternalMortality"]]
humanity = humanity.replace("United States of America", "United States").replace("Brunei Darussalam", "Brunei").replace("Republic of Korea", "South Korea").replace("Dem. People's Rep. Korea", "North Korea").replace("Iran (Islamic Republic of)", "Iran").replace("Lao People's Dem. Rep.", "Laos").replace("Venezuela (Boliv. Rep. of)", 'Venezuela').replace("Bolivia (Plurin. State of)", "Bolivia").replace('Viet Nam', "Vietnam").replace('Russian Federation', "Russia").replace('Syrian Arab Republic', "Syria").replace("Wallis and Futuna Islands", "Wallis and Futuna").replace("Saint Martin (French part)", "Saint Martin").replace("Falkland Islands (Malvinas)", "Falkland Islands").replace('Republic of Moldova', 'Moldova').replace('Sint Maarten (Dutch part)', 'Sint Maarten').replace('Côte d’Ivoire', "Ivory Coast").replace('United Rep. of Tanzania', "Tanzania").replace('Timor-Leste', "East Timor").replace('State of Palestine', "Palestine").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace("Congo", "Congo Republic").replace("Dem. Rep. of the Congo", "DR Congo").replace("Cabo Verde", "Cape Verde").replace('China, Hong Kong SAR', "Hong Kong").replace('Saint Vincent & Grenadines', 'St Vincent and Grenadines').replace("Eswatini", "Swaziland").replace("Czechia", "Czech Republic").replace('Micronesia (Fed. States of)', "Micronesia").replace('United States Virgin Islands', "U.S. Virgin Islands").replace('China, Macao SAR', "Macao")
humanity
Quantity | Country | Year | PopulationIncrease | LifeExpectancy | MaleLifeExpectancy | FemaleLifeExpectancy | FertilityRate | InfantMortality | MaternalMortality |
---|---|---|---|---|---|---|---|---|---|
0 | Albania | 2005 | -0.273 | 74.822 | 72.25 | 77.75 | 1.9465 | 0.021153 | 0.000300 |
1 | Albania | 2010 | -0.920 | 75.639 | 73.16 | 78.54 | 1.6400 | 0.016781 | 0.000301 |
2 | Albania | 2015 | -0.394 | 77.487 | 75.21 | 79.96 | 1.7135 | 0.009212 | 0.000286 |
3 | Algeria | 2005 | 1.314 | 71.827 | 70.47 | 73.24 | 2.3843 | 0.033620 | 0.001481 |
4 | Algeria | 2010 | 1.637 | 74.169 | 72.95 | 75.43 | 2.7240 | 0.029398 | 0.001473 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
746 | Zambia | 2010 | 2.753 | 51.750 | 49.57 | 53.95 | 5.6000 | 0.064721 | 0.002624 |
747 | Zambia | 2015 | 3.090 | 59.291 | 57.03 | 61.46 | 5.2000 | 0.053306 | 0.002242 |
748 | Zimbabwe | 2005 | 0.326 | 43.735 | 41.96 | 45.62 | 3.7200 | 0.065008 | 0.006290 |
749 | Zimbabwe | 2010 | 1.003 | 45.017 | 43.34 | 46.68 | 3.8854 | 0.062686 | 0.004458 |
750 | Zimbabwe | 2015 | 1.686 | 56.710 | 54.88 | 58.26 | 4.0897 | 0.051221 | 0.004433 |
751 rows × 9 columns
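The long chain of `.replace(...)` calls that harmonises UN country names with the GeoNames spellings is repeated for every UN table. An equivalent, more maintainable formulation passes a single mapping dictionary; the `UN_NAME_FIXES` name below is just for illustration and lists only a few of the entries:

# Sketch of an equivalent spelling of the country-name fixes (illustrative only).
UN_NAME_FIXES = {
    "United States of America": "United States",
    "Russian Federation": "Russia",
    "Viet Nam": "Vietnam",
    "Republic of Korea": "South Korea",
}
humanity = humanity.replace(UN_NAME_FIXES)  # a no-op here, since the chain above already applied these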
Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding enrollment in primary, secondary and tertiary education levels, although the data needs to be cleaned. The column names represent the following quantities:
- `PrimaryStudents`: Students enrolled in primary education
- `PrimaryMaleGrossEnrolment`: Gross enrollment ratio - Primary (male)
- `PrimaryFemaleGrossEnrolment`: Gross enrollment ratio - Primary (female)
- `SecondaryStudents`: Students enrolled in secondary education
- `SecondaryMaleGrossEnrolment`: Gross enrollment ratio - Secondary (male)
- `SecondaryFemaleGrossEnrolment`: Gross enrollment ratio - Secondary (female)
- `TertiaryStudents`: Students enrolled in tertiary education
- `TertiaryMaleGrossEnrolment`: Gross enrollment ratio - Tertiary (male)
- `TertiaryFemaleGrossEnrolment`: Gross enrollment ratio - Tertiary (female)

education = pd.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB63_309_202009_Education.csv")
education.to_csv("data/country/education.csv", index=False)
del education
education = pd.read_csv("data/country/education.csv").iloc[1:, 1:5].rename(columns={"Enrolment in primary, secondary and tertiary education levels":"Country", "Unnamed: 2":"Year", "Unnamed: 3":"Quantity", "Unnamed: 4":"Value"})
education = education[education.Country.isin(education.Country.unique()[11:])]
education.Year = education.Year.astype(int)
education = education.groupby(["Country", "Year", "Quantity"]).Value.sum().unstack().rename(columns={
'Students enrolled in primary education (thousands)': "PrimaryStudents",
'Gross enrollment ratio - Primary (male)': "PrimaryMaleGrossEnrolment",
'Gross enrollment ratio - Primary (female)': "PrimaryFemaleGrossEnrolment",
'Students enrolled in secondary education (thousands)': "SecondaryStudents",
'Gross enrollment ratio - Secondary (male)': "SecondaryMaleGrossEnrolment",
'Gross enrollment ratio - Secondary (female)': "SecondaryFemaleGrossEnrolment",
'Students enrolled in tertiary education (thousands)': "TertiaryStudents", 'Gross enrollment ratio - Tertiary (male)': "TertiaryMaleGrossEnrolment", 'Gross enrollment ratio - Tertiary (female)': "TertiaryFemaleGrossEnrolment"})[['PrimaryStudents', 'PrimaryMaleGrossEnrolment', 'PrimaryFemaleGrossEnrolment', 'SecondaryStudents', 'SecondaryMaleGrossEnrolment', 'SecondaryFemaleGrossEnrolment', 'TertiaryStudents', 'TertiaryMaleGrossEnrolment', 'TertiaryFemaleGrossEnrolment']].astype(float).reset_index()
education[["PrimaryStudents", "SecondaryStudents", "TertiaryStudents"]] = 1000 * education[["PrimaryStudents", "SecondaryStudents", "TertiaryStudents"]]
isna = pd.DataFrame(education.isna().any(), columns=["isna"])
education[['PrimaryStudents', 'PrimaryMaleGrossEnrolment', 'PrimaryFemaleGrossEnrolment', 'SecondaryStudents', 'SecondaryMaleGrossEnrolment', 'SecondaryFemaleGrossEnrolment', 'TertiaryStudents', 'TertiaryMaleGrossEnrolment', 'TertiaryFemaleGrossEnrolment']] = isna[isna["isna"]].apply(axis=1, func=lambda row: education[row.name].fillna(education.groupby("Country")[row.name].transform('mean'))).T[['PrimaryStudents', 'PrimaryMaleGrossEnrolment', 'PrimaryFemaleGrossEnrolment', 'SecondaryStudents', 'SecondaryMaleGrossEnrolment', 'SecondaryFemaleGrossEnrolment', 'TertiaryStudents', 'TertiaryMaleGrossEnrolment', 'TertiaryFemaleGrossEnrolment']]
education = education.replace("United States of America", "United States").replace("Brunei Darussalam", "Brunei").replace("Republic of Korea", "South Korea").replace("Dem. People's Rep. Korea", "North Korea").replace("Iran (Islamic Republic of)", "Iran").replace("Lao People's Dem. Rep.", "Laos").replace("Venezuela (Boliv. Rep. of)", 'Venezuela').replace("Bolivia (Plurin. State of)", "Bolivia").replace('Viet Nam', "Vietnam").replace('Russian Federation', "Russia").replace('Syrian Arab Republic', "Syria").replace("Wallis and Futuna Islands", "Wallis and Futuna").replace("Saint Martin (French part)", "Saint Martin").replace("Falkland Islands (Malvinas)", "Falkland Islands").replace('Republic of Moldova', 'Moldova').replace('Sint Maarten (Dutch part)', 'Sint Maarten').replace('Côte d’Ivoire', "Ivory Coast").replace('United Rep. of Tanzania', "Tanzania").replace('Timor-Leste', "East Timor").replace('State of Palestine', "Palestine").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace("Congo", "Congo Republic").replace("Dem. Rep. of the Congo", "DR Congo").replace("Cabo Verde", "Cape Verde").replace('China, Hong Kong SAR', "Hong Kong").replace('Saint Vincent & Grenadines', 'St Vincent and Grenadines').replace("Eswatini", "Swaziland").replace("Czechia", "Czech Republic").replace('Micronesia (Fed. States of)', "Micronesia").replace('United States Virgin Islands', "U.S. Virgin Islands").replace('China, Macao SAR', "Macao")
education
Quantity | Country | Year | PrimaryStudents | PrimaryMaleGrossEnrolment | PrimaryFemaleGrossEnrolment | SecondaryStudents | SecondaryMaleGrossEnrolment | SecondaryFemaleGrossEnrolment | TertiaryStudents | TertiaryMaleGrossEnrolment | TertiaryFemaleGrossEnrolment |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2004 | 4430142.0 | 144.61190 | 62.5910 | 594306.0 | 28.35130 | 5.94270 | 27648.000000 | 1.821500 | 0.514000 |
1 | Afghanistan | 2005 | 4318819.0 | 123.10600 | 71.8121 | 651453.0 | 27.59330 | 9.00690 | 189079.250000 | 8.659350 | 2.579550 |
2 | Afghanistan | 2009 | 4945632.0 | 116.36470 | 76.5104 | 1716190.0 | 59.38230 | 28.58370 | 95185.000000 | 5.973800 | 1.412100 |
3 | Afghanistan | 2010 | 5279326.0 | 118.61450 | 80.6355 | 2044157.0 | 66.89290 | 33.30270 | 189079.250000 | 8.659350 | 2.579550 |
4 | Afghanistan | 2014 | 6217756.0 | 124.20940 | 86.7296 | 2602734.0 | 67.09690 | 37.33300 | 262874.000000 | 12.634000 | 3.487000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1172 | Zambia | 2017 | 3284841.0 | 97.54330 | 99.9042 | NaN | NaN | NaN | 56680.000000 | 4.750300 | 3.501700 |
1173 | Zimbabwe | 2003 | 2361588.0 | 105.41620 | 103.2450 | 758229.0 | 42.60140 | 37.43540 | 108100.333333 | 8.860033 | 6.995267 |
1174 | Zimbabwe | 2010 | 2512387.5 | 108.24595 | 105.9784 | 857845.0 | 48.04635 | 44.39055 | 94611.000000 | 7.795800 | 5.714700 |
1175 | Zimbabwe | 2013 | 2663187.0 | 111.07570 | 108.7118 | 957461.0 | 53.49130 | 51.34570 | 94115.000000 | 7.864600 | 6.099000 |
1176 | Zimbabwe | 2015 | 2512387.5 | 108.24595 | 105.9784 | 857845.0 | 48.04635 | 44.39055 | 135575.000000 | 10.919700 | 9.172100 |
1177 rows × 11 columns
Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding Labour Force Participation and Unemployment Rate, although the data needs to be cleaned. The column names represent the following quantities:

- `LabourForce`: Labour force participation - Total
- `LabourForceMale`: Labour force participation - Male
- `LabourForceFemale`: Labour force participation - Female
- `UnemploymentRate`: Unemployment rate - Total
- `UnemploymentRateMale`: Unemployment rate - Male
- `UnemploymentRateFemale`: Unemployment rate - Female

labourForce = pd.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB63_329_202009_Labour%20Force%20and%20Unemployment.csv")
labourForce.to_csv("data/country/labourForce.csv", index=False)
del labourForce
labourForce = pd.read_csv("data/country/labourForce.csv").iloc[1:, 1:5].rename(columns={"Labour force participation and unemployment":"Country", "Unnamed: 2":"Year", "Unnamed: 3":"Quantity", "Unnamed: 4":"Value"})
labourForce = labourForce[labourForce.Country.isin(labourForce.Country.unique()[25:])]
labourForce.Year = labourForce.Year.astype(int)
labourForce = labourForce.groupby(["Country", "Year", "Quantity"]).Value.sum().unstack().rename(columns={'Labour force participation - Total': "LabourForce", 'Labour force participation - Male': "LabourForceMale", 'Labour force participation - Female': "LabourForceFemale", 'Unemployment rate - Total': "UnemploymentRate", 'Unemployment rate - Male': "UnemploymentRateMale", 'Unemployment rate - Female': "UnemploymentRateFemale"})[['LabourForce', 'LabourForceMale', 'LabourForceFemale', 'UnemploymentRate', 'UnemploymentRateMale', 'UnemploymentRateFemale']].astype(float).reset_index()
isna = pd.DataFrame(labourForce.isna().any(), columns=["isna"])
labourForce[['LabourForce', 'LabourForceMale', 'LabourForceFemale', 'UnemploymentRate', 'UnemploymentRateMale', 'UnemploymentRateFemale']] = isna[isna["isna"]].apply(axis=1, func=lambda row: labourForce[row.name].fillna(labourForce.groupby("Country")[row.name].transform('mean'))).T[['LabourForce', 'LabourForceMale', 'LabourForceFemale', 'UnemploymentRate', 'UnemploymentRateMale', 'UnemploymentRateFemale']]
labourForce = labourForce.replace("United States of America", "United States").replace("Brunei Darussalam", "Brunei").replace("Republic of Korea", "South Korea").replace("Dem. People's Rep. Korea", "North Korea").replace("Iran (Islamic Republic of)", "Iran").replace("Lao People's Dem. Rep.", "Laos").replace("Venezuela (Boliv. Rep. of)", 'Venezuela').replace("Bolivia (Plurin. State of)", "Bolivia").replace('Viet Nam', "Vietnam").replace('Russian Federation', "Russia").replace('Syrian Arab Republic', "Syria").replace("Wallis and Futuna Islands", "Wallis and Futuna").replace("Saint Martin (French part)", "Saint Martin").replace("Falkland Islands (Malvinas)", "Falkland Islands").replace('Republic of Moldova', 'Moldova').replace('Sint Maarten (Dutch part)', 'Sint Maarten').replace('Côte d’Ivoire', "Ivory Coast").replace('United Rep. of Tanzania', "Tanzania").replace('Timor-Leste', "East Timor").replace('State of Palestine', "Palestine").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace("Congo", "Congo Republic").replace("Dem. Rep. of the Congo", "DR Congo").replace("Cabo Verde", "Cape Verde").replace('China, Hong Kong SAR', "Hong Kong").replace('Saint Vincent & Grenadines', 'St Vincent and Grenadines').replace("Eswatini", "Swaziland").replace("Czechia", "Czech Republic").replace('Micronesia (Fed. States of)', "Micronesia").replace('United States Virgin Islands', "U.S. Virgin Islands").replace('China, Macao SAR', "Macao")
labourForce
Quantity | Country | Year | LabourForce | LabourForceMale | LabourForceFemale | UnemploymentRate | UnemploymentRateMale | UnemploymentRateFemale |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2005 | 48.49 | 78.82 | 15.80 | 11.43 | 10.78 | 14.91 |
1 | Afghanistan | 2010 | 47.43 | 78.40 | 14.94 | 11.48 | 10.88 | 14.82 |
2 | Afghanistan | 2015 | 48.37 | 76.20 | 18.76 | 11.39 | 10.68 | 14.43 |
3 | Afghanistan | 2020 | 48.94 | 74.61 | 21.77 | 11.16 | 10.37 | 14.06 |
4 | Albania | 2005 | 57.46 | 67.11 | 47.94 | 16.46 | 16.13 | 16.92 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
836 | Zambia | 2020 | 74.57 | 78.97 | 70.39 | 11.41 | 10.88 | 11.98 |
837 | Zimbabwe | 2005 | 82.20 | 88.15 | 77.06 | 4.47 | 4.31 | 4.64 |
838 | Zimbabwe | 2010 | 82.41 | 88.45 | 77.29 | 5.22 | 4.62 | 5.79 |
839 | Zimbabwe | 2015 | 83.01 | 89.17 | 77.84 | 5.30 | 4.69 | 5.90 |
840 | Zimbabwe | 2020 | 83.12 | 88.95 | 78.18 | 4.99 | 4.44 | 5.52 |
841 rows × 8 columns
Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding Employment by different Industry (eg Agriculture and Services), although the data needs to be cleaned.
Post-cleaning, the following describes each of the column names:
- `AgricultureEmployment`: Employment by industry: Agriculture (%) Male and Female
- `AgricultureEmploymentMale`: Employment by industry: Agriculture (%) Male
- `AgricultureEmploymentFemale`: Employment by industry: Agriculture (%) Female
- `IndustryEmployment`: Employment by industry: Industry (%) Male and Female
- `IndustryEmploymentMale`: Employment by industry: Industry (%) Male
- `IndustryEmploymentFemale`: Employment by industry: Industry (%) Female
- `ServicesEmployment`: Employment by industry: Services (%) Male and Female
- `ServicesEmploymentMale`: Employment by industry: Services (%) Male
- `ServicesEmploymentFemale`: Employment by industry: Services (%) Female

employment = pd.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB63_200_202009_Employment.csv")
employment.to_csv("data/country/employment.csv", index=False)
del employment
employment = pd.read_csv("data/country/employment.csv").iloc[1:, 1:5].rename(columns={"Employment by economic activity":"Country", "Unnamed: 2":"Year", "Unnamed: 3":"Quantity", "Unnamed: 4":"Value"})
employment = employment[employment.Country.isin(employment.Country.unique()[23:-1])]
employment.Year = employment.Year.astype(int)
employment = employment.groupby(["Country", "Year", "Quantity"]).Value.sum().unstack().rename(columns={'Employment by industry: Agriculture (%) Male and Female': "AgricultureEmployment", 'Employment by industry: Agriculture (%) Male': "AgricultureEmploymentMale", 'Employment by industry: Agriculture (%) Female': "AgricultureEmploymentFemale", 'Employment by industry: Industry (%) Male and Female': "IndustryEmployment", 'Employment by industry: Industry (%) Male': "IndustryEmploymentMale", 'Employment by industry: Industry (%) Female': "IndustryEmploymentFemale", 'Employment by industry: Services (%) Male and Female': "ServicesEmployment", 'Employment by industry: Services (%) Male': "ServicesEmploymentMale", 'Employment by industry: Services (%) Female': "ServicesEmploymentFemale"})[['AgricultureEmployment', 'AgricultureEmploymentMale', 'AgricultureEmploymentFemale', 'IndustryEmployment', 'IndustryEmploymentMale', 'IndustryEmploymentFemale', 'ServicesEmployment', 'ServicesEmploymentMale', 'ServicesEmploymentFemale']].astype(float).reset_index().replace("United States of America", "United States").replace("Brunei Darussalam", "Brunei").replace("Republic of Korea", "South Korea").replace("Dem. People's Rep. Korea", "North Korea").replace("Iran (Islamic Republic of)", "Iran").replace("Lao People's Dem. Rep.", "Laos").replace("Venezuela (Boliv. Rep. of)", 'Venezuela').replace("Bolivia (Plurin. State of)", "Bolivia").replace('Viet Nam', "Vietnam").replace('Russian Federation', "Russia").replace('Syrian Arab Republic', "Syria").replace("Wallis and Futuna Islands", "Wallis and Futuna").replace("Saint Martin (French part)", "Saint Martin").replace("Falkland Islands (Malvinas)", "Falkland Islands").replace('Republic of Moldova', 'Moldova').replace('Sint Maarten (Dutch part)', 'Sint Maarten').replace('Côte d’Ivoire', "Ivory Coast").replace('United Rep. of Tanzania', "Tanzania").replace('Timor-Leste', "East Timor").replace('State of Palestine', "Palestine").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace("Congo", "Congo Republic").replace("Dem. Rep. of the Congo", "DR Congo").replace("Cabo Verde", "Cape Verde").replace('China, Hong Kong SAR', "Hong Kong").replace('Saint Vincent & Grenadines', 'St Vincent and Grenadines').replace("Eswatini", "Swaziland").replace("Czechia", "Czech Republic").replace('Micronesia (Fed. States of)', "Micronesia").replace('United States Virgin Islands', "U.S. Virgin Islands").replace('China, Macao SAR', "Macao")
# For every column that contains NaNs, fill them with the mean of that column within the same country
isna = pd.DataFrame(employment.isna().any(), columns=["isna"])
employment[isna[isna["isna"]].index] = isna[isna["isna"]].apply(axis=1, func=lambda row: employment[row.name].fillna(employment.groupby("Country")[row.name].transform('mean'))).T
employment
Quantity | Country | Year | AgricultureEmployment | AgricultureEmploymentMale | AgricultureEmploymentFemale | IndustryEmployment | IndustryEmploymentMale | IndustryEmploymentFemale | ServicesEmployment | ServicesEmploymentMale | ServicesEmploymentFemale |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 2005 | 62.245 | 59.362 | 78.491 | 11.528 | 11.535 | 11.488 | 26.227 | 29.103 | 10.021 |
1 | Afghanistan | 2010 | 54.685 | 51.512 | 72.966 | 14.391 | 14.064 | 16.280 | 30.924 | 34.425 | 10.755 |
2 | Afghanistan | 2015 | 47.114 | 42.381 | 68.470 | 17.043 | 15.959 | 21.933 | 35.844 | 41.660 | 9.598 |
3 | Afghanistan | 2020 | 42.352 | 36.408 | 64.832 | 18.252 | 16.509 | 24.845 | 39.396 | 47.083 | 10.323 |
4 | Albania | 2005 | 47.214 | 38.604 | 59.199 | 15.770 | 21.402 | 7.931 | 37.016 | 39.994 | 32.870 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
797 | Zambia | 2020 | 48.499 | 43.744 | 53.642 | 10.783 | 16.658 | 4.427 | 40.718 | 39.598 | 41.931 |
798 | Zimbabwe | 2005 | 64.549 | 57.144 | 71.890 | 9.555 | 15.708 | 3.454 | 25.897 | 27.147 | 24.657 |
799 | Zimbabwe | 2010 | 65.529 | 59.341 | 71.615 | 9.243 | 15.466 | 3.121 | 25.229 | 25.193 | 25.264 |
800 | Zimbabwe | 2015 | 67.169 | 62.857 | 71.371 | 7.141 | 12.269 | 2.145 | 25.690 | 24.874 | 26.484 |
801 | Zimbabwe | 2020 | 66.270 | 62.885 | 69.578 | 6.545 | 11.306 | 1.891 | 27.186 | 25.810 | 28.531 |
802 rows × 11 columns
Also sourced from the UN database, this dataset provides multiple records per country per year, hence providing a large array of values for use. It covers energy production, trade and supply, although the data needs to be cleaned before use. The column names represent the following quantities:
EnergyProduction
: Primary energy production in petajoules
EnergySupply
: Total energy supply in petajoules
EnergySupplyPerCapita
: Energy supply per capita in gigajoules
EnergyTrade
: Net energy imports, exports and bunkers in petajoules
EnergyStockChange
: Change in energy stocks in petajoules
energyConsumption = pd.read_csv("https://data.un.org/_Docs/SYB/CSV/SYB63_263_202009_Production,%20Trade%20and%20Supply%20of%20Energy.csv")
energyConsumption.to_csv("data/country/energyConsumption.csv", index=False)
del energyConsumption
energyConsumption = pd.read_csv("data/country/energyConsumption.csv").iloc[1:, 1:5].rename(columns={"Production, trade and supply of energy":"Country", "Unnamed: 2":"Year", "Unnamed: 3":"Quantity", "Unnamed: 4":"Value"})
energyConsumption = energyConsumption[energyConsumption.Country.isin(energyConsumption.Country.unique()[7:])]
energyConsumption.Year = energyConsumption.Year.astype(int)
energyConsumption = energyConsumption.groupby(["Country", "Year", "Quantity"]).Value.sum().unstack().rename(columns={
'Primary energy production (petajoules)': "EnergyProduction",
'Total supply (petajoules)': "EnergySupply",
'Supply per capita (gigajoules)': "EnergySupplyPerCapita",
'Net imports [Imports - Exports - Bunkers] (petajoules)': "EnergyTrade",
'Changes in stocks (petajoules)': "EnergyStockChange",
})[['EnergyProduction', 'EnergySupply', 'EnergySupplyPerCapita', 'EnergyTrade', 'EnergyStockChange']].astype(float).reset_index()
energyConsumption = energyConsumption.replace("United States of America", "United States").replace("Brunei Darussalam", "Brunei").replace("Republic of Korea", "South Korea").replace("Dem. People's Rep. Korea", "North Korea").replace("Iran (Islamic Republic of)", "Iran").replace("Lao People's Dem. Rep.", "Laos").replace("Venezuela (Boliv. Rep. of)", 'Venezuela').replace("Bolivia (Plurin. State of)", "Bolivia").replace('Viet Nam', "Vietnam").replace('Russian Federation', "Russia").replace('Syrian Arab Republic', "Syria").replace("Wallis and Futuna Islands", "Wallis and Futuna").replace("Saint Martin (French part)", "Saint Martin").replace("Falkland Islands (Malvinas)", "Falkland Islands").replace('Republic of Moldova', 'Moldova').replace('Sint Maarten (Dutch part)', 'Sint Maarten').replace('Côte d’Ivoire', "Ivory Coast").replace('United Rep. of Tanzania', "Tanzania").replace('Timor-Leste', "East Timor").replace('State of Palestine', "Palestine").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace("Congo", "Congo Republic").replace("Dem. Rep. of the Congo", "DR Congo").replace("Cabo Verde", "Cape Verde").replace('China, Hong Kong SAR', "Hong Kong").replace('Saint Vincent & Grenadines', 'St Vincent and Grenadines').replace("Eswatini", "Swaziland").replace("Czechia", "Czech Republic").replace('Micronesia (Fed. States of)', "Micronesia").replace('United States Virgin Islands', "U.S. Virgin Islands").replace('China, Macao SAR', "Macao")
isna = pd.DataFrame(energyConsumption.isna().any(), columns=["isna"])
energyConsumption[isna[isna["isna"]].index] = isna[isna["isna"]].apply(axis=1, func=lambda row: energyConsumption[row.name].fillna(energyConsumption.groupby("Country")[row.name].transform('mean'))).T
energyConsumption
Quantity | Country | Year | EnergyProduction | EnergySupply | EnergySupplyPerCapita | EnergyTrade | EnergyStockChange |
---|---|---|---|---|---|---|---|
0 | Afghanistan | 1990 | 19.0 | 46.0 | 4.0 | 28.0 | 0.000000 |
1 | Afghanistan | 1995 | 16.0 | 29.0 | 1.0 | 13.0 | 0.000000 |
2 | Afghanistan | 2000 | 18.0 | 25.0 | 1.0 | 8.0 | 0.000000 |
3 | Afghanistan | 2005 | 23.0 | 36.0 | 1.0 | 14.0 | 0.000000 |
4 | Afghanistan | 2010 | 41.0 | 136.0 | 5.0 | 95.0 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... |
1731 | Zimbabwe | 2005 | 379.0 | 414.0 | 33.0 | 36.0 | 1.000000 |
1732 | Zimbabwe | 2010 | 369.0 | 395.0 | 31.0 | 26.0 | -1.428571 |
1733 | Zimbabwe | 2015 | 450.0 | 473.0 | 34.0 | 47.0 | 24.000000 |
1734 | Zimbabwe | 2016 | 381.0 | 467.0 | 33.0 | 48.0 | -39.000000 |
1735 | Zimbabwe | 2017 | 427.0 | 474.0 | 33.0 | 50.0 | 4.000000 |
1736 rows × 7 columns
In the end, we merge all of this data into a single compound DataFrame object, undata. I have used the OUTER JOIN operation so that no country-year combination is dropped, and then substituted the resulting missing values with the corresponding country's mean.
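As a minimal sketch of this imputation idea (on a hypothetical toy frame, not the project data), the per-country mean fill used after each merge boils down to a plain groupby/transform fill:
import pandas as pd
# Hypothetical toy frame: country A is missing its 2015 value
toy = pd.DataFrame({"Country": ["A", "A", "B", "B"],
                    "Year": [2010, 2015, 2010, 2015],
                    "Value": [1.0, None, 3.0, 5.0]})
# Each NaN is replaced by the mean of that column within the same country,
# so A's missing 2015 value becomes 1.0 (its only observed value)
toy["Value"] = toy["Value"].fillna(toy.groupby("Country")["Value"].transform("mean"))
print(toy)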
undata = reduce(lambda a,b: pd.merge(a, b, how="outer"), [popsaden, humanity, education, labourForce, employment, energyConsumption]).sort_values(["Country", "Year"], ignore_index=True)
isna = pd.DataFrame(undata.isna().any(), columns=["isna"])
undata[isna[isna["isna"]].index] = isna[isna["isna"]].apply(axis=1, func=lambda row: undata[row.name].fillna(undata.groupby("Country")[row.name].transform('mean'))).T
undata = undata.set_index(["Country", "Year"])
undata
Quantity | Population | MalePopulation | FemalePopulation | SexRatio | ChildrenPopulation | ElderlyPopulation | SurfaceArea | PopulationDensity | ChildrenPercent | ElderlyPercent | ... | IndustryEmploymentMale | IndustryEmploymentFemale | ServicesEmployment | ServicesEmploymentMale | ServicesEmploymentFemale | EnergyProduction | EnergySupply | EnergySupplyPerCapita | EnergyTrade | EnergyStockChange | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | Year | |||||||||||||||||||||
Afghanistan | 1990 | 32294425.0 | 16589300.0 | 15705125.0 | 1.056675 | 1.458783e+07 | 1.282380e+06 | 652864.0 | 49.466075 | 45.548125 | 3.940875 | ... | 14.51675 | 18.63650 | 33.09775 | 38.06775 | 10.17425 | 19.000 | 46.000 | 4.00 | 28.00 | 0.000000 |
1995 | 32294425.0 | 16589300.0 | 15705125.0 | 1.056675 | 1.458783e+07 | 1.282380e+06 | 652864.0 | 49.466075 | 45.548125 | 3.940875 | ... | 14.51675 | 18.63650 | 33.09775 | 38.06775 | 10.17425 | 16.000 | 29.000 | 1.00 | 13.00 | 0.000000 | |
2000 | 32294425.0 | 16589300.0 | 15705125.0 | 1.056675 | 1.458783e+07 | 1.282380e+06 | 652864.0 | 49.466075 | 45.548125 | 3.940875 | ... | 14.51675 | 18.63650 | 33.09775 | 38.06775 | 10.17425 | 18.000 | 25.000 | 1.00 | 8.00 | 0.000000 | |
2004 | 32294425.0 | 16589300.0 | 15705125.0 | 1.056675 | 1.458783e+07 | 1.282380e+06 | 652864.0 | 49.466075 | 45.548125 | 3.940875 | ... | 14.51675 | 18.63650 | 33.09775 | 38.06775 | 10.17425 | 41.125 | 81.375 | 2.75 | 40.75 | 0.000000 | |
2005 | 25654300.0 | 13239700.0 | 12414600.0 | 1.066462 | 1.227805e+07 | 9.384086e+05 | 652864.0 | 39.295200 | 47.859600 | 3.657900 | ... | 11.53500 | 11.48800 | 26.22700 | 29.10300 | 10.02100 | 23.000 | 36.000 | 1.00 | 14.00 | 0.000000 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Zimbabwe | 2015 | 13414125.0 | 6398300.0 | 7015825.0 | 0.912198 | 5.628896e+06 | 6.088617e+05 | 390757.0 | 34.675250 | 41.928175 | 4.541625 | ... | 12.26900 | 2.14500 | 25.69000 | 24.87400 | 26.48400 | 450.000 | 473.000 | 34.00 | 47.00 | 24.000000 |
2016 | 13414125.0 | 6398300.0 | 7015825.0 | 0.912198 | 5.628896e+06 | 6.088617e+05 | 390757.0 | 34.675250 | 41.928175 | 4.541625 | ... | 13.68725 | 2.65275 | 26.00050 | 25.75600 | 26.23400 | 381.000 | 467.000 | 33.00 | 48.00 | -39.000000 | |
2017 | 14236600.0 | 6777100.0 | 7459500.0 | 0.908508 | 6.064493e+06 | 6.367035e+05 | 390757.0 | 36.801300 | 42.597900 | 4.472300 | ... | 13.68725 | 2.65275 | 26.00050 | 25.75600 | 26.23400 | 427.000 | 474.000 | 33.00 | 50.00 | 4.000000 | |
2019 | 14645500.0 | 6983400.0 | 7662100.0 | 0.911413 | 6.174265e+06 | 6.705296e+05 | 390757.0 | 37.858300 | 42.158100 | 4.578400 | ... | 13.68725 | 2.65275 | 26.00050 | 25.75600 | 26.23400 | 390.250 | 436.000 | 34.25 | 44.25 | -1.428571 | |
2020 | 13414125.0 | 6398300.0 | 7015825.0 | 0.912198 | 5.628896e+06 | 6.088617e+05 | 390757.0 | 34.675250 | 41.928175 | 4.541625 | ... | 11.30600 | 1.89100 | 27.18600 | 25.81000 | 28.53100 | 390.250 | 436.000 | 34.25 | 44.25 | -1.428571 |
2828 rows × 46 columns
Here, the country metadata and the latitude/longitude coordinates are merged with an INNER JOIN, so that only countries present in both tables are kept. The result is stored in a Pandas DataFrame geocountries_latlong, which is then converted into a point-based geopandas.GeoDataFrame named countries. Following this, I have plotted the points as a scatterplot coloured by Population.
geocountries_latlong = pd.merge(geocountries, latlong)
geocountries_latlong
CountryCode | CountryCode3 | NumericCountryCode | fips | Country | Capital | Area in km² | Population | Continent | latitude | longitude | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 77006 | EU | 42.546245 | 1.601554 |
1 | AE | ARE | 784 | AE | United Arab Emirates | Abu Dhabi | 82880.0 | 9630959 | AS | 23.424076 | 53.847818 |
2 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 37172386 | AS | 33.939110 | 67.709953 |
3 | AG | ATG | 28 | AC | Antigua and Barbuda | St. John's | 443.0 | 96286 | NA | 17.060816 | -61.796428 |
4 | AI | AIA | 660 | AV | Anguilla | The Valley | 102.0 | 13254 | NA | 18.220554 | -63.068615 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
236 | YE | YEM | 887 | YM | Yemen | Sanaa | 527970.0 | 28498687 | AS | 15.552727 | 48.516388 |
237 | YT | MYT | 175 | MF | Mayotte | Mamoudzou | 374.0 | 279471 | AF | -12.827500 | 45.166244 |
238 | ZA | ZAF | 710 | SF | South Africa | Pretoria | 1219912.0 | 57779622 | AF | -30.559482 | 22.937506 |
239 | ZM | ZMB | 894 | ZA | Zambia | Lusaka | 752614.0 | 17351822 | AF | -13.133897 | 27.849332 |
240 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 14439018 | AF | -19.015438 | 29.154857 |
241 rows × 11 columns
countries = gp.GeoDataFrame(geocountries_latlong, geometry=gp.points_from_xy(geocountries_latlong.longitude, geocountries_latlong.latitude))
countries
CountryCode | CountryCode3 | NumericCountryCode | fips | Country | Capital | Area in km² | Population | Continent | latitude | longitude | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 77006 | EU | 42.546245 | 1.601554 | POINT (1.60155 42.54624) |
1 | AE | ARE | 784 | AE | United Arab Emirates | Abu Dhabi | 82880.0 | 9630959 | AS | 23.424076 | 53.847818 | POINT (53.84782 23.42408) |
2 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 37172386 | AS | 33.939110 | 67.709953 | POINT (67.70995 33.93911) |
3 | AG | ATG | 28 | AC | Antigua and Barbuda | St. John's | 443.0 | 96286 | NA | 17.060816 | -61.796428 | POINT (-61.79643 17.06082) |
4 | AI | AIA | 660 | AV | Anguilla | The Valley | 102.0 | 13254 | NA | 18.220554 | -63.068615 | POINT (-63.06862 18.22055) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
236 | YE | YEM | 887 | YM | Yemen | Sanaa | 527970.0 | 28498687 | AS | 15.552727 | 48.516388 | POINT (48.51639 15.55273) |
237 | YT | MYT | 175 | MF | Mayotte | Mamoudzou | 374.0 | 279471 | AF | -12.827500 | 45.166244 | POINT (45.16624 -12.82750) |
238 | ZA | ZAF | 710 | SF | South Africa | Pretoria | 1219912.0 | 57779622 | AF | -30.559482 | 22.937506 | POINT (22.93751 -30.55948) |
239 | ZM | ZMB | 894 | ZA | Zambia | Lusaka | 752614.0 | 17351822 | AF | -13.133897 | 27.849332 | POINT (27.84933 -13.13390) |
240 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 14439018 | AF | -19.015438 | 29.154857 | POINT (29.15486 -19.01544) |
241 rows × 12 columns
countries.plot(figsize=(20, 10), column="Population", legend=True).set_title("Plot of Countries Data by Population and Center")
Text(0.5, 1.0, 'Plot of Countries Data by Population and Center')
world = gp.GeoDataFrame(pd.merge(geocountries, countries_geojson))
world
CountryCode | CountryCode3 | NumericCountryCode | fips | Country | Capital | Area in km² | Population | Continent | ISO_A3 | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | AD | AND | 20 | AN | Andorra | Andorra la Vella | 468.0 | 77006 | EU | AND | POLYGON ((1.70701 42.50278, 1.69750 42.49446, ... |
1 | AE | ARE | 784 | AE | United Arab Emirates | Abu Dhabi | 82880.0 | 9630959 | AS | ARE | MULTIPOLYGON (((53.86305 24.23469, 53.88860 24... |
2 | AF | AFG | 4 | AF | Afghanistan | Kabul | 647500.0 | 37172386 | AS | AFG | POLYGON ((71.04980 38.40866, 71.05714 38.40903... |
3 | AG | ATG | 28 | AC | Antigua and Barbuda | St. John's | 443.0 | 96286 | NA | ATG | MULTIPOLYGON (((-61.77302 17.12653, -61.75642 ... |
4 | AI | AIA | 660 | AV | Anguilla | The Valley | 102.0 | 13254 | NA | AIA | MULTIPOLYGON (((-63.03767 18.21296, -63.09952 ... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
234 | XK | XKX | 0 | KV | Kosovo | Pristina | 10908.0 | 1845300 | EU | -99 | POLYGON ((20.86470 43.21734, 20.86160 43.21752... |
235 | YE | YEM | 887 | YM | Yemen | Sanaa | 527970.0 | 28498687 | AS | YEM | MULTIPOLYGON (((53.30824 12.11839, 53.31027 12... |
236 | ZA | ZAF | 710 | SF | South Africa | Pretoria | 1219912.0 | 57779622 | AF | ZAF | MULTIPOLYGON (((37.86378 -46.94085, 37.83644 -... |
237 | ZM | ZMB | 894 | ZA | Zambia | Lusaka | 752614.0 | 17351822 | AF | ZMB | POLYGON ((31.11984 -8.61663, 31.14102 -8.60619... |
238 | ZW | ZWE | 716 | ZI | Zimbabwe | Harare | 390580.0 | 14439018 | AF | ZWE | POLYGON ((30.01065 -15.64623, 30.05024 -15.640... |
239 rows × 11 columns
world.plot(column="Continent", figsize=(20, 10), legend=True, cmap="viridis").set_title("Grouped Map of Countries by Continent")
Text(0.5, 1.0, 'Grouped Map of Countries by Continent')
world.plot(column="Population", figsize=(20, 10), legend=True, cmap="viridis").set_title("Chloropleth Map of Countries by Population")
Text(0.5, 1.0, 'Chloropleth Map of Countries by Population')
_ = world.plot(column="Area in km²", figsize=(20, 10), legend=True, cmap="viridis").set_title("Chloropleth Map of Countries by Area")
undata_world = gp.GeoDataFrame(pd.merge(undata.reset_index(), world[list(world.columns[:-5])+list(world.columns[-3:])], on="Country", how="outer").set_index(["Country", "Year"]))
undata_world
Population | MalePopulation | FemalePopulation | SexRatio | ChildrenPopulation | ElderlyPopulation | SurfaceArea | PopulationDensity | ChildrenPercent | ElderlyPercent | ... | EnergyTrade | EnergyStockChange | CountryCode | CountryCode3 | NumericCountryCode | fips | Capital | Continent | ISO_A3 | geometry | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Country | Year | |||||||||||||||||||||
Afghanistan | 1990.0 | 32294425.0 | 16589300.0 | 15705125.0 | 1.056675 | 1.458783e+07 | 1.282380e+06 | 652864.0 | 49.466075 | 45.548125 | 3.940875 | ... | 28.00 | 0.0 | AF | AFG | 4.0 | AF | Kabul | AS | AFG | POLYGON ((71.04980 38.40866, 71.05714 38.40903... |
1995.0 | 32294425.0 | 16589300.0 | 15705125.0 | 1.056675 | 1.458783e+07 | 1.282380e+06 | 652864.0 | 49.466075 | 45.548125 | 3.940875 | ... | 13.00 | 0.0 | AF | AFG | 4.0 | AF | Kabul | AS | AFG | POLYGON ((71.04980 38.40866, 71.05714 38.40903... | |
2000.0 | 32294425.0 | 16589300.0 | 15705125.0 | 1.056675 | 1.458783e+07 | 1.282380e+06 | 652864.0 | 49.466075 | 45.548125 | 3.940875 | ... | 8.00 | 0.0 | AF | AFG | 4.0 | AF | Kabul | AS | AFG | POLYGON ((71.04980 38.40866, 71.05714 38.40903... | |
2004.0 | 32294425.0 | 16589300.0 | 15705125.0 | 1.056675 | 1.458783e+07 | 1.282380e+06 | 652864.0 | 49.466075 | 45.548125 | 3.940875 | ... | 40.75 | 0.0 | AF | AFG | 4.0 | AF | Kabul | AS | AFG | POLYGON ((71.04980 38.40866, 71.05714 38.40903... | |
2005.0 | 25654300.0 | 13239700.0 | 12414600.0 | 1.066462 | 1.227805e+07 | 9.384086e+05 | 652864.0 | 39.295200 | 47.859600 | 3.657900 | ... | 14.00 | 0.0 | AF | AFG | 4.0 | AF | Kabul | AS | AFG | POLYGON ((71.04980 38.40866, 71.05714 38.40903... | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Pitcairn Islands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | PN | PCN | 612.0 | PC | Adamstown | OC | PCN | MULTIPOLYGON (((-130.08625 -25.07513, -130.095... |
French Southern Territories | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | TF | ATF | 260.0 | FS | Port-aux-Francais | AN | ATF | MULTIPOLYGON (((70.01612 -49.56422, 70.00709 -... |
Taiwan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | TW | TWN | 158.0 | TW | Taipei | AS | TWN | MULTIPOLYGON (((121.58766 22.01386, 121.56691 ... |
U.S. Outlying Islands | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | UM | UMI | 581.0 | NaN | NaN | OC | UMI | MULTIPOLYGON (((-160.01244 -0.38209, -160.0247... |
Vatican City | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | VA | VAT | 336.0 | VT | Vatican City | EU | VAT | POLYGON ((12.45314 41.90275, 12.45271 41.90302... |
2839 rows × 54 columns
_ = undata_world.plot(column="SurfaceArea", figsize=(20, 10), legend=True, cmap="viridis").set_title("Chloropleth Map of Countries by Surface Area")
_ = undata_world.plot(column="SexRatio", figsize=(20, 10), legend=True, cmap="viridis").set_title("Chloropleth Map of Countries by Sex Ratio")
This is by far the largest dataset, containing close to 20 billion data points (specifically 20322960028), hence the data has to be acquired and processed in chunks. All the data can be downloaded from a zip file as shown below. Here are the instructions:
1. Download the zip archive from Figshare.
2. Create the data/nightLight directory.
3. Extract the archive into the nightLight directory as created in step 2, then delete the zip.
Installation can take place as follows:
curl https://figshare.com/ndownloader/articles/9828827/versions/2 > nightLight.zip
unzip nightLight.zip -d data/nightLight/
rm nightLight.zip
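Equivalently, the download can be scripted from Python (a sketch assuming the same Figshare URL; the archive is large, so it is streamed to disk rather than held in memory):
import os
import zipfile
import requests
# Stream the Figshare archive to disk, extract it into data/nightLight/, then clean up
os.makedirs("data/nightLight", exist_ok=True)
with requests.get("https://figshare.com/ndownloader/articles/9828827/versions/2", stream=True) as response:
    response.raise_for_status()
    with open("nightLight.zip", "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)
with zipfile.ZipFile("nightLight.zip") as archive:
    archive.extractall("data/nightLight/")
os.remove("nightLight.zip")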
Due to the large amount of data, this data has been separated into country/state-wise and year-wise statistical summaries, containing the mean, median, mode, min, max, count and std. GeoTIFF files are simply converted into DataFrames as shown below.
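As an illustration of the zonal-statistics reduction (using a hypothetical 4x4 raster and polygon rather than the project data), rasterstats collapses all raster cells falling inside each region to the requested summaries:
import numpy as np
import rasterstats as rstats
from affine import Affine
# Hypothetical 4x4 "night light" grid with 1-degree pixels, top-left corner at (0, 4)
toy_raster = np.arange(16, dtype=float).reshape(4, 4)
transform = Affine(1.0, 0.0, 0.0, 0.0, -1.0, 4.0)
# A polygon covering the left half of the grid (8 cells)
toy_zone = {"type": "Polygon", "coordinates": [[(0, 0), (2, 0), (2, 4), (0, 4), (0, 0)]]}
stats = rstats.zonal_stats([toy_zone], toy_raster, affine=transform, stats=["count", "mean", "min", "max"])
print(stats)  # one dictionary of summary statistics per zone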
fig, (axim, axhist) = plt.subplots(1, 2, figsize=(40, 10), gridspec_kw={'width_ratios': [3, 1]})
rf = rs.open("data/nightLight/DN_NTL_2013_simVIIRS.tif", "r")
show(rf, ax=axim, cmap="inferno")
show_hist(rf, ax=axhist)
axim.set(xlabel="Longitude", ylabel="Latitude", title="Image of 2013 VIIRS Data")
axhist.set_title("Color Historgram of 2013 VIIRS Data")
del rf
def retrieveGeoStats(raster_file, geojson=states_geojson, nodata=64):
    # Extract the year from the raster filename for progress reporting
    year = re.search(r"\d+", raster_file).group()
    print(f"Processing {year}...")
    # Compute zonal statistics of the night-light raster over every state/province polygon
    rf = rs.open(raster_file,"r")
    values = pd.DataFrame(rstats.zonal_stats(geojson, rf.read(1), affine=rf.transform, stats=["count", "mean", "min", "max", 'median', 'std', "majority"], geojson_out=True, nodata=nodata))
    del rf
    # Flatten the GeoJSON properties into columns, drop geometry metadata and rename "majority" to "mode"
    return pd.concat([values, values.properties.apply(pd.Series)], axis=1).drop(columns=["id", "type", "properties", "bbox", "ISO3166-1-Alpha-3", "geometry"]).rename(columns={"majority": "mode"}).fillna(0)
nightLight2013 = retrieveGeoStats(raster_file = "data/nightLight/DN_NTL_2013_simVIIRS.tif")
nightLight2013
Processing 2013...
country | name | min | max | mean | count | std | median | mode | |
---|---|---|---|---|---|---|---|---|---|
0 | Aruba | Aruba | 7.0 | 57.0 | 28.666667 | 201 | 16.178831 | 28.0 | 7.0 |
1 | Afghanistan | Badghis | 0.0 | 32.0 | 0.236040 | 29334 | 1.336606 | 0.0 | 0.0 |
2 | Afghanistan | Hirat | 0.0 | 56.0 | 0.741159 | 78071 | 2.932197 | 0.0 | 0.0 |
3 | Afghanistan | Bamyan | 0.0 | 8.0 | 0.180535 | 25153 | 1.039803 | 0.0 | 0.0 |
4 | Afghanistan | Balkh | 0.0 | 61.0 | 2.011429 | 23275 | 5.675208 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4642 | Zimbabwe | Manicaland | 0.0 | 46.0 | 1.915220 | 42982 | 3.263857 | 0.0 | 0.0 |
4643 | Zimbabwe | Matabeleland South | 0.0 | 52.0 | 0.677659 | 68170 | 2.175918 | 0.0 | 0.0 |
4644 | Zimbabwe | Bulawayo | 0.0 | 55.0 | 21.061511 | 569 | 15.322821 | 14.0 | 7.0 |
4645 | Zimbabwe | Masvingo | 0.0 | 26.0 | 0.701861 | 69850 | 2.066331 | 0.0 | 0.0 |
4646 | Zimbabwe | Mashonaland West | 0.0 | 49.0 | 4.252489 | 70118 | 3.467990 | 6.0 | 6.0 |
4647 rows × 9 columns
nightLight2013.to_csv("data/nightLight/nightLight2013.csv", index=False)
del nightLight2013
for raster_file in glob("data/nightLight/Harmonized*.tif"):
year = re.search(r"\d+", raster_file).group()
nightLightTemp = retrieveGeoStats(raster_file)
nightLightTemp.to_csv(f"data/nightLight/harmonizedNightLight{year}.csv", index=False)
del nightLightTemp
Processing 1992... Processing 1993... Processing 1994... Processing 1995... Processing 1996... Processing 1997... Processing 1998... Processing 1999... Processing 2000... Processing 2001... Processing 2002... Processing 2003... Processing 2004... Processing 2005... Processing 2006... Processing 2007... Processing 2008... Processing 2009... Processing 2010... Processing 2011... Processing 2012... Processing 2013... Processing 2014... Processing 2015... Processing 2016... Processing 2017... Processing 2018...
for raster_file in glob("data/nightLight/Harmonized*.tif"):
year = re.search(r"\d+", raster_file).group()
nightLightTemp = retrieveGeoStats(raster_file, nodata=0)
nightLightTemp.to_csv(f"data/nightLight/harmonizedNightLightNoData{year}.csv", index=False)
del nightLightTemp
print("Processing Completed!")
Processing 1992... Processing 1993... Processing 1994... Processing 1995... Processing 1996... Processing 1997... Processing 1998... Processing 1999... Processing 2000... Processing 2001... Processing 2002... Processing 2003... Processing 2004... Processing 2005... Processing 2006... Processing 2007... Processing 2008... Processing 2009... Processing 2010... Processing 2011... Processing 2012... Processing 2013... Processing 2014... Processing 2015... Processing 2016... Processing 2017... Processing 2018... Processing Completed!
# Redefine retrieveGeoStats at country level: aggregate over country polygons and suffix each statistic with the year
def retrieveGeoStats(raster_file, geojson=countries_geojson, nodata=64):
    year = re.search(r"\d+", raster_file).group()
    print(f"Processing {year}...")
    rf = rs.open(raster_file,"r")
    values = pd.DataFrame(rstats.zonal_stats(geojson, rf.read(1), affine=rf.transform, stats=["count", "mean", "min", "max", 'median', 'std', "majority"], geojson_out=True, nodata=nodata))
    del rf
    return pd.concat([values, values.properties.apply(pd.Series)], axis=1).drop(columns=["id", "type", "properties", "bbox", "ISO_A3", "geometry"]).rename(columns={"majority": "mode"}).fillna(0).set_index("Country").rename(columns=lambda x: x+" "+year)
nightLightIndiv = [retrieveGeoStats(raster_file, countries_geojson) for raster_file in glob("data/nightLight/Harmonized*.tif")]
nightLight = functools.reduce(pd.DataFrame.join, nightLightIndiv).stack().reset_index()
nightLight
Processing 1992... Processing 1993... Processing 1994... Processing 1995... Processing 1996... Processing 1997... Processing 1998... Processing 1999... Processing 2000... Processing 2001... Processing 2002... Processing 2003... Processing 2004... Processing 2005... Processing 2006... Processing 2007... Processing 2008... Processing 2009... Processing 2010... Processing 2011... Processing 2012... Processing 2013... Processing 2014... Processing 2015... Processing 2016... Processing 2017... Processing 2018...
Country | level_1 | 0 | |
---|---|---|---|
0 | Aruba | min 1992 | 5.000000 |
1 | Aruba | max 1992 | 63.000000 |
2 | Aruba | mean 1992 | 30.731343 |
3 | Aruba | count 1992 | 201.000000 |
4 | Aruba | std 1992 | 18.759739 |
... | ... | ... | ... |
48190 | Zimbabwe | mean 2018 | 2.009717 |
48191 | Zimbabwe | count 2018 | 481200.000000 |
48192 | Zimbabwe | std 2018 | 3.633314 |
48193 | Zimbabwe | median 2018 | 0.000000 |
48194 | Zimbabwe | mode 2018 | 0.000000 |
48195 rows × 3 columns
nightLight = functools.reduce(pd.DataFrame.join, nightLightIndiv).stack().reset_index()
nightLight.loc[:, ["Quantity", "Year"]] = nightLight.level_1.str.split().apply(pd.Series).rename(columns={0:"Quantity", 1:"Year"})
nightLightCSV = nightLight.rename(columns={0: "Value"}).drop(columns="level_1").set_index(["Year", "Quantity", "Country"]).unstack()["Value"]
nightLightCSV
Country | Afghanistan | Akrotiri Sovereign Base Area | Aland | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antarctica | ... | Uzbekistan | Vanuatu | Vatican City | Venezuela | Vietnam | Wallis and Futuna | Western Sahara | Yemen | Zambia | Zimbabwe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | Quantity | |||||||||||||||||||||
1992 | count | 901332.000000 | 137.000000 | 2190.000000 | 43766.000000 | 3.062687e+06 | 216.000000 | 711.000000 | 1.491515e+06 | 100.000000 | 73912.000000 | ... | 698379.000000 | 14990.000000 | 0.0 | 1.076792e+06 | 402417.000000 | 166.000000 | 116159.000000 | 550751.000000 | 904610.000000 | 481200.000000 |
max | 59.000000 | 39.000000 | 53.000000 | 39.000000 | 6.300000e+01 | 32.000000 | 63.000000 | 6.300000e+01 | 21.000000 | 8.000000 | ... | 63.000000 | 33.000000 | 0.0 | 6.300000e+01 | 63.000000 | 5.000000 | 19.000000 | 63.000000 | 63.000000 | 63.000000 | |
mean | 0.027943 | 14.912409 | 2.225571 | 0.264269 | 5.061918e-01 | 9.685185 | 15.614627 | 3.853196e-02 | 9.650000 | 0.004127 | ... | 1.726949 | 0.148032 | 0.0 | 1.132978e+00 | 0.428158 | 0.614458 | 0.002961 | 0.290173 | 0.086135 | 0.237517 | |
median | 0.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 8.000000 | 9.000000 | 0.000000e+00 | 9.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018 | mean | 2.047282 | 34.408759 | 4.557534 | 8.770347 | 2.811383e+00 | 18.037037 | 24.846695 | 1.631923e+00 | 15.900000 | 0.621469 | ... | 3.258340 | 1.618412 | 0.0 | 4.400938e+00 | 8.588959 | 5.855422 | 0.053453 | 1.301654 | 3.077475 | 2.009717 |
median | 0.000000 | 34.000000 | 0.000000 | 7.000000 | 0.000000e+00 | 13.500000 | 19.000000 | 0.000000e+00 | 14.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 7.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
min | 0.000000 | 9.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 7.000000 | 0.000000e+00 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
mode | 0.000000 | 51.000000 | 0.000000 | 7.000000 | 0.000000e+00 | 7.000000 | 8.000000 | 0.000000e+00 | 11.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 7.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | |
std | 3.555359 | 14.801142 | 10.894895 | 7.885158 | 6.959879e+00 | 11.895320 | 17.029646 | 3.834231e+00 | 7.139328 | 1.978841 | ... | 5.914810 | 6.371589 | 0.0 | 7.879986e+00 | 10.506223 | 1.767392 | 0.608731 | 4.136651 | 4.123423 | 3.633314 |
189 rows × 255 columns
nightLightCSV.to_csv("data/nightLight/nightLight.csv")
del nightLightCSV
del nightLight
del nightLightIndiv
nightLightNoDarkIndiv = [retrieveGeoStats(raster_file, countries_geojson, 0) for raster_file in glob("data/nightLight/Harmonized*.tif")]
nightLightNoDark = functools.reduce(pd.DataFrame.join, nightLightNoDarkIndiv).stack().reset_index()
nightLightNoDark
Processing 1992... Processing 1993... Processing 1994... Processing 1995... Processing 1996... Processing 1997... Processing 1998... Processing 1999... Processing 2000... Processing 2001... Processing 2002... Processing 2003... Processing 2004... Processing 2005... Processing 2006... Processing 2007... Processing 2008... Processing 2009... Processing 2010... Processing 2011... Processing 2012... Processing 2013... Processing 2014... Processing 2015... Processing 2016... Processing 2017... Processing 2018...
Country | level_1 | 0 | |
---|---|---|---|
0 | Aruba | min 1992 | 5.000000 |
1 | Aruba | max 1992 | 63.000000 |
2 | Aruba | mean 1992 | 30.731343 |
3 | Aruba | count 1992 | 201.000000 |
4 | Aruba | std 1992 | 18.759739 |
... | ... | ... | ... |
48190 | Zimbabwe | mean 2018 | 6.837552 |
48191 | Zimbabwe | count 2018 | 141436.000000 |
48192 | Zimbabwe | std 2018 | 3.449983 |
48193 | Zimbabwe | median 2018 | 6.000000 |
48194 | Zimbabwe | mode 2018 | 6.000000 |
48195 rows × 3 columns
nightLightNoDark = functools.reduce(pd.DataFrame.join, nightLightNoDarkIndiv).stack().reset_index()
nightLightNoDark.loc[:, ["Quantity", "Year"]] = nightLightNoDark.level_1.str.split().apply(pd.Series).rename(columns={0:"Quantity", 1:"Year"})
nightLightNoDarkCSV = nightLightNoDark.rename(columns={0: "Value"}).drop(columns="level_1").set_index(["Year", "Quantity", "Country"]).unstack()["Value"]
nightLightNoDarkCSV
Country | Afghanistan | Akrotiri Sovereign Base Area | Aland | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | Antarctica | ... | Uzbekistan | Vanuatu | Vatican City | Venezuela | Vietnam | Wallis and Futuna | Western Sahara | Yemen | Zambia | Zimbabwe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | Quantity | |||||||||||||||||||||
1992 | count | 2088.000000 | 135.000000 | 266.000000 | 1481.000000 | 143436.000000 | 156.000000 | 626.000000 | 5237.000000 | 94.000000 | 60.000000 | ... | 94961.000000 | 240.000000 | 0.0 | 86842.000000 | 26758.000000 | 24.000000 | 61.000000 | 19512.000000 | 6528.000000 | 10854.000000 |
max | 59.000000 | 39.000000 | 53.000000 | 39.000000 | 63.000000 | 32.000000 | 63.000000 | 63.000000 | 21.000000 | 8.000000 | ... | 63.000000 | 33.000000 | 0.0 | 63.000000 | 63.000000 | 5.000000 | 19.000000 | 63.000000 | 63.000000 | 63.000000 | |
mean | 12.062261 | 15.133333 | 18.323308 | 7.809588 | 10.808354 | 13.410256 | 17.734824 | 10.974031 | 10.265957 | 5.083333 | ... | 12.700635 | 9.245833 | 0.0 | 14.048295 | 6.439121 | 4.250000 | 5.639344 | 8.190498 | 11.936121 | 10.530035 | |
median | 7.000000 | 14.000000 | 13.000000 | 6.000000 | 6.000000 | 13.000000 | 11.000000 | 7.000000 | 9.000000 | 5.000000 | ... | 9.000000 | 6.000000 | 0.0 | 8.000000 | 5.000000 | 4.000000 | 5.000000 | 5.000000 | 7.000000 | 5.000000 | |
min | 3.000000 | 4.000000 | 7.000000 | 4.000000 | 3.000000 | 3.000000 | 4.000000 | 3.000000 | 4.000000 | 4.000000 | ... | 3.000000 | 4.000000 | 0.0 | 3.000000 | 2.000000 | 3.000000 | 4.000000 | 3.000000 | 3.000000 | 3.000000 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018 | mean | 6.765070 | 34.408759 | 12.460674 | 9.207076 | 10.239452 | 18.377358 | 24.846695 | 7.011606 | 16.060606 | 6.910486 | ... | 8.777289 | 11.546882 | 0.0 | 9.268471 | 10.850339 | 6.352941 | 6.612354 | 8.261827 | 6.739344 | 6.837552 |
median | 6.000000 | 34.000000 | 7.000000 | 7.000000 | 7.000000 | 14.000000 | 19.000000 | 6.000000 | 14.000000 | 7.000000 | ... | 7.000000 | 7.000000 | 0.0 | 7.000000 | 7.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | |
min | 6.000000 | 9.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | 7.000000 | 6.000000 | 7.000000 | 6.000000 | ... | 6.000000 | 6.000000 | 0.0 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | |
mode | 6.000000 | 51.000000 | 6.000000 | 7.000000 | 7.000000 | 7.000000 | 8.000000 | 6.000000 | 11.000000 | 7.000000 | ... | 7.000000 | 6.000000 | 0.0 | 6.000000 | 7.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | 6.000000 | |
std | 3.139037 | 14.801142 | 15.035082 | 7.826291 | 10.018264 | 11.743688 | 17.029646 | 5.044252 | 6.993241 | 0.286536 | ... | 6.767632 | 13.228966 | 0.0 | 9.255113 | 10.719424 | 0.477885 | 1.571477 | 7.149014 | 3.543350 | 3.449983 |
189 rows × 255 columns
nightLightNoDarkCSV.to_csv("data/nightLight/nightLightNoDark.csv")
del nightLightNoDarkCSV
del nightLightNoDark
del nightLightNoDarkIndiv
nightLight = pd.read_csv("data/nightLight/nightLight.csv").set_index("Year")
nightLight
Quantity | Afghanistan | Akrotiri Sovereign Base Area | Aland | Albania | Algeria | American Samoa | Andorra | Angola | Anguilla | ... | Uzbekistan | Vanuatu | Vatican City | Venezuela | Vietnam | Wallis and Futuna | Western Sahara | Yemen | Zambia | Zimbabwe | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Year | |||||||||||||||||||||
1992 | count | 901332.000000 | 137.000000 | 2190.000000 | 43766.000000 | 3.062687e+06 | 216.000000 | 711.000000 | 1.491515e+06 | 100.000000 | ... | 698379.000000 | 14990.000000 | 0.0 | 1.076792e+06 | 402417.000000 | 166.000000 | 116159.000000 | 550751.000000 | 904610.000000 | 481200.000000 |
1992 | max | 59.000000 | 39.000000 | 53.000000 | 39.000000 | 6.300000e+01 | 32.000000 | 63.000000 | 6.300000e+01 | 21.000000 | ... | 63.000000 | 33.000000 | 0.0 | 6.300000e+01 | 63.000000 | 5.000000 | 19.000000 | 63.000000 | 63.000000 | 63.000000 |
1992 | mean | 0.027943 | 14.912409 | 2.225571 | 0.264269 | 5.061918e-01 | 9.685185 | 15.614627 | 3.853196e-02 | 9.650000 | ... | 1.726949 | 0.148032 | 0.0 | 1.132978e+00 | 0.428158 | 0.614458 | 0.002961 | 0.290173 | 0.086135 | 0.237517 |
1992 | median | 0.000000 | 14.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 8.000000 | 9.000000 | 0.000000e+00 | 9.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
1992 | min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2018 | mean | 2.047282 | 34.408759 | 4.557534 | 8.770347 | 2.811383e+00 | 18.037037 | 24.846695 | 1.631923e+00 | 15.900000 | ... | 3.258340 | 1.618412 | 0.0 | 4.400938e+00 | 8.588959 | 5.855422 | 0.053453 | 1.301654 | 3.077475 | 2.009717 |
2018 | median | 0.000000 | 34.000000 | 0.000000 | 7.000000 | 0.000000e+00 | 13.500000 | 19.000000 | 0.000000e+00 | 14.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 7.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
2018 | min | 0.000000 | 9.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 0.000000 | 7.000000 | 0.000000e+00 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
2018 | mode | 0.000000 | 51.000000 | 0.000000 | 7.000000 | 0.000000e+00 | 7.000000 | 8.000000 | 0.000000e+00 | 11.000000 | ... | 0.000000 | 0.000000 | 0.0 | 0.000000e+00 | 7.000000 | 6.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
2018 | std | 3.555359 | 14.801142 | 10.894895 | 7.885158 | 6.959879e+00 | 11.895320 | 17.029646 | 3.834231e+00 | 7.139328 | ... | 5.914810 | 6.371589 | 0.0 | 7.879986e+00 | 10.506223 | 1.767392 | 0.608731 | 4.136651 | 4.123423 | 3.633314 |
189 rows × 256 columns
nightLight2013 = states_geojson.merge(pd.read_csv("data/nightLight/nightLight2013.csv").replace("United States of America", "United States").replace('United Republic of Tanzania', "Tanzania").replace('Democratic Republic of the Congo', "DR Congo").replace('Republic of Serbia', "Serbia").replace('The Bahamas', "Bahamas").replace('Hong Kong S.A.R.', "Hong Kong").replace('Guinea Bissau', 'Guinea-Bissau').replace("United States Minor Outlying Islands", "U.S. Outlying Islands").replace('South Georgia and South Sandwich Islands', 'South Georgia and the South Sandwich Islands').replace('Macao S.A.R', "Macao").replace('Republic of Congo', "Congo Republic").replace('Sao Tome and Principe', 'São Tomé and Príncipe').replace('Saint Vincent and the Grenadines', 'St Vincent and Grenadines').replace('Vatican', 'Vatican City').replace('Macedonia', 'North Macedonia').replace('Federated States of Micronesia', "Micronesia").replace('United States Virgin Islands', 'U.S. Virgin Islands').replace('Saint Barthelemy', 'Saint Barthélemy').replace("Heard Island and McDonald Islands", "Heard and McDonald Islands").replace('French Southern and Antarctic Lands',"French Southern Territories"))
nightLight2013
ISO3166-1-Alpha-3 | country | id | name | geometry | min | max | mean | count | std | median | mode | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ABW | Aruba | 5150 | Aruba | POLYGON ((-69.99694 12.57758, -69.93639 12.531... | 7.0 | 57.0 | 28.666667 | 201 | 16.178831 | 28.0 | 7.0 |
1 | AFG | Afghanistan | 1741 | Badghis | POLYGON ((64.30624 35.39722, 64.32468 35.40177... | 0.0 | 32.0 | 0.236040 | 29334 | 1.336606 | 0.0 | 0.0 |
2 | AFG | Afghanistan | 1742 | Hirat | POLYGON ((61.36393 35.59824, 61.36548 35.59850... | 0.0 | 56.0 | 0.741159 | 78071 | 2.932197 | 0.0 | 0.0 |
3 | AFG | Afghanistan | 1743 | Bamyan | POLYGON ((67.74391 35.44342, 67.75476 35.44412... | 0.0 | 8.0 | 0.180535 | 25153 | 1.039803 | 0.0 | 0.0 |
4 | AFG | Afghanistan | 1744 | Balkh | POLYGON ((67.25913 37.18515, 67.28145 37.18866... | 0.0 | 61.0 | 2.011429 | 23275 | 5.675208 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4663 | ZWE | Zimbabwe | 529 | Manicaland | POLYGON ((33.01165 -17.38399, 32.99745 -17.404... | 0.0 | 46.0 | 1.915220 | 42982 | 3.263857 | 0.0 | 0.0 |
4664 | ZWE | Zimbabwe | 530 | Matabeleland South | POLYGON ((29.43994 -19.87930, 29.45699 -19.874... | 0.0 | 52.0 | 0.677659 | 68170 | 2.175918 | 0.0 | 0.0 |
4665 | ZWE | Zimbabwe | 531 | Bulawayo | POLYGON ((28.49757 -20.06270, 28.50532 -20.062... | 0.0 | 55.0 | 21.061511 | 569 | 15.322821 | 14.0 | 7.0 |
4666 | ZWE | Zimbabwe | 532 | Masvingo | POLYGON ((31.04181 -19.25226, 31.19870 -19.248... | 0.0 | 26.0 | 0.701861 | 69850 | 2.066331 | 0.0 | 0.0 |
4667 | ZWE | Zimbabwe | 533 | Mashonaland West | POLYGON ((30.01065 -15.64623, 30.05024 -15.640... | 0.0 | 49.0 | 4.252489 | 70118 | 3.467990 | 6.0 | 6.0 |
4668 rows × 12 columns
Link: https://visibleearth.nasa.gov/images/144898/earth-at-night-black-marble-2016-color-maps
Satellite images of Earth at night—often referred to as “night lights”—have been a curiosity for the public and a tool of fundamental research for at least 25 years. They have provided a broad, beautiful picture, showing how humans have shaped the planet and lit up the darkness. Produced every decade or so, such maps have spawned hundreds of pop-culture uses and dozens of economic, social science, and environmental research projects.
These images show Earth’s night lights as observed in 2016. The data were reprocessed with new compositing techniques that select the best cloud-free nights in each month over each land mass.
The images are available as JPEG and GeoTIFF, in three different resolutions: 0.1 degrees (3600x1800), 3km (13500x6750), and 500m (86400x43200). The 500m global map is divided into tiles (21600x21600) according to a gridding scheme.
# Black Marble
# Build the eight Black Marble tile URLs (tiles A1 to D2)
blackMarble2016 = [f"https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_{tile}{col}_geo.tif" for tile in "ABCD" for col in "12"]
blackMarble2016
['https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_A1_geo.tif', 'https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_A2_geo.tif', 'https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_B1_geo.tif', 'https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_B2_geo.tif', 'https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_C1_geo.tif', 'https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_C2_geo.tif', 'https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_D1_geo.tif', 'https://eoimages.gsfc.nasa.gov/images/imagerecords/144000/144898/BlackMarble_2016_D2_geo.tif']
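As a sketch, assuming one of the tiles above has already been downloaded to a local path such as data/blackMarble/BlackMarble_2016_A1_geo.tif (a hypothetical filename), each tile can be opened and rendered with the same rasterio helpers used earlier:
import rasterio as rs
from rasterio.plot import show
# Hypothetical local copy of the first tile above; each 500m tile is very large
with rs.open("data/blackMarble/BlackMarble_2016_A1_geo.tif") as tile:
    print(tile.width, tile.height, tile.count, tile.crs)  # tile dimensions, band count and CRS
    show(tile, cmap="inferno", title="Black Marble 2016, Tile A1")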
Globe at Night collects observations at specific locations and, in this case, contains a column named LimitingMag, which can be related to the level of light pollution in the region. The following commands show how to download the dataset programmatically, keeping only the required columns and discarding invalid records.
gan_url = "https://www.globeatnight.org/"
files = [gan_url + i["href"] for i in BeautifulSoup(requests.get(gan_url+"maps.php").content, "lxml").findAll(href=re.compile("\.csv$"))]
!mkdir "data/gan"
gan = []
for file in files:
filename = "data/gan/"+file.split("/")[-1]
print(file, "==>", filename)
file = BytesIO(requests.get(file, allow_redirects=True).content)
data = pd.read_csv(file, error_bad_lines=False)[["Latitude", "Longitude", "LocalDate", "LocalTime", "UTDate", "UTTime", "LimitingMag", "Country"]]
data = data[data.LimitingMag > 0]
data.LocalTime = pd.to_datetime(data.apply(lambda row: row["LocalDate"] + " " + row["LocalTime"], axis=1), format='%Y-%m-%d %H:%M')
data.UTTime = pd.to_datetime(data.apply(lambda row: row["UTDate"] + " " + row["UTTime"], axis=1), format='%Y-%m-%d %H:%M')
data.loc[:, "Year"] = int(filename[-8:-4])
data = data[["Latitude", "Longitude", "LocalTime", "UTTime", "LimitingMag", "Country", "Year"]]
data.to_csv(filename)
gan.append(data)
gan = pd.concat(gan, ignore_index=True)
gan.to_csv("data/gan/GaN.csv", index=False)
A subdirectory or file data/gan already exists.
https://www.globeatnight.org/2020data/GaN2020.csv ==> data/gan/GaN2020.csv
b'Skipping line 2574: expected 18 fields, saw 19\n'
https://www.globeatnight.org/2019data/GaN2019.csv ==> data/gan/GaN2019.csv
b'Skipping line 5547: expected 18 fields, saw 19\n'
https://www.globeatnight.org/2018data/GaN2018.csv ==> data/gan/GaN2018.csv https://www.globeatnight.org/2017data/GaN2017.csv ==> data/gan/GaN2017.csv https://www.globeatnight.org/2016data/GaN2016.csv ==> data/gan/GaN2016.csv https://www.globeatnight.org/2015data/GaN2015.csv ==> data/gan/GaN2015.csv https://www.globeatnight.org/2014data/GaN2014.csv ==> data/gan/GaN2014.csv https://www.globeatnight.org/2013data/GaN2013.csv ==> data/gan/GaN2013.csv https://www.globeatnight.org/2012data/GaN2012.csv ==> data/gan/GaN2012.csv
b'Skipping line 10111: expected 18 fields, saw 19\n'
https://www.globeatnight.org/2011data/GaN2011.csv ==> data/gan/GaN2011.csv https://www.globeatnight.org/2010data/GaN2010.csv ==> data/gan/GaN2010.csv https://www.globeatnight.org/2009data/GaN2009.csv ==> data/gan/GaN2009.csv https://www.globeatnight.org/2008data/GaN2008.csv ==> data/gan/GaN2008.csv https://www.globeatnight.org/2007data/GaN2007.csv ==> data/gan/GaN2007.csv https://www.globeatnight.org/2006data/GaN2006.csv ==> data/gan/GaN2006.csv
gan = pd.read_csv("data/gan/GaN.csv").sort_values(["Year", "Country"], ignore_index=True)
gan.Country = gan.Country.str.replace("United States.*", "United States").str.replace("Republic of the Union of Myanmar", "Myanmar").replace("Republic of the Congo", "Congo Republic").replace('Myanmar (Burma)', "Myanmar").replace("Czechia", "Czech Republic").replace("Republic of Kosovo", "Kosovo").replace("Brunei Darussalam", "Brunei").replace("Democratic Republic of the Congo", "DR Congo").replace("The Bahamas", "Bahamas").replace('Macedonia (FYROM)', "North Macedonia").replace("Reunion", "Réunion").replace('Virgin Islands', 'U.S. Virgin Islands').replace('St Vincent and the Grenadines', 'St Vincent and Grenadines').replace('Kingdom of Norway', "Norway").replace('The Netherlands', 'Netherlands')
gan.UTTime = pd.to_datetime(gan.UTTime)
gan.LocalTime = pd.to_datetime(gan.LocalTime)
gan = pd.merge(gan[gan.Latitude < 10**6], geocountries_latlong.rename(columns={"latitude":"countryLatitude", "longitude":"countryLongitude"}))
gan = gp.GeoDataFrame(gan, geometry=gp.points_from_xy(gan.Longitude, gan.Latitude))
gan
Latitude | Longitude | LocalTime | UTTime | LimitingMag | Country | Year | CountryCode | CountryCode3 | NumericCountryCode | fips | Capital | Area in km² | Population | Continent | countryLatitude | countryLongitude | geometry | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34.56000 | 65.4300 | 2006-03-23 21:00:00 | 2006-03-23 16:30:00 | 7.0 | Afghanistan | 2006 | AF | AFG | 4 | AF | Kabul | 647500.0 | 37172386 | AS | 33.939110 | 67.709953 | POINT (65.43000 34.56000) |
1 | 31.86000 | 64.2100 | 2010-03-06 12:30:00 | 2010-03-06 08:00:00 | 6.0 | Afghanistan | 2010 | AF | AFG | 4 | AF | Kabul | 647500.0 | 37172386 | AS | 33.939110 | 67.709953 | POINT (64.21000 31.86000) |
2 | 31.51000 | 64.1300 | 2010-03-06 20:30:00 | 2010-03-06 16:00:00 | 6.0 | Afghanistan | 2010 | AF | AFG | 4 | AF | Kabul | 647500.0 | 37172386 | AS | 33.939110 | 67.709953 | POINT (64.13000 31.51000) |
3 | 31.85650 | 64.2108 | 2010-03-11 20:16:00 | 2010-03-11 15:46:00 | 6.0 | Afghanistan | 2010 | AF | AFG | 4 | AF | Kabul | 647500.0 | 37172386 | AS | 33.939110 | 67.709953 | POINT (64.21080 31.85650) |
4 | 31.85600 | 64.2100 | 2010-03-12 20:10:00 | 2010-03-12 15:40:00 | 6.0 | Afghanistan | 2010 | AF | AFG | 4 | AF | Kabul | 647500.0 | 37172386 | AS | 33.939110 | 67.709953 | POINT (64.21000 31.85600) |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
189325 | -29.04100 | 167.9610 | 2020-06-21 20:50:00 | 2020-06-21 09:50:00 | 6.0 | Norfolk Island | 2020 | NF | NFK | 574 | NF | Kingston | 34.6 | 1828 | OC | -29.040835 | 167.954712 | POINT (167.96100 -29.04100) |
189326 | -29.04090 | 167.9850 | 2020-06-21 21:13:00 | 2020-06-21 10:13:00 | 6.0 | Norfolk Island | 2020 | NF | NFK | 574 | NF | Kingston | 34.6 | 1828 | OC | -29.040835 | 167.954712 | POINT (167.98500 -29.04090) |
189327 | -29.03800 | 167.9630 | 2020-06-21 21:49:00 | 2020-06-21 10:49:00 | 5.0 | Norfolk Island | 2020 | NF | NFK | 574 | NF | Kingston | 34.6 | 1828 | OC | -29.040835 | 167.954712 | POINT (167.96300 -29.03800) |
189328 | 7.34623 | 134.4530 | 2020-06-21 20:28:00 | 2020-06-21 11:28:00 | 3.0 | Palau | 2020 | PW | PLW | 585 | PS | Melekeok | 458.0 | 17907 | OC | 7.514980 | 134.582520 | POINT (134.45300 7.34623) |
189329 | 52.22880 | 21.3435 | 2020-04-16 20:00:00 | 2020-04-16 18:00:00 | 3.0 | Pitcairn Islands | 2020 | PN | PCN | 612 | PC | Adamstown | 47.0 | 46 | OC | -24.703615 | -127.439308 | POINT (21.34350 52.22880) |
189330 rows × 18 columns
_ = gan.plot.scatter(x="Longitude", y="Latitude")
base = world.plot(color='white', edgecolor='black')
gan[["geometry"]].plot(ax=base, marker='o', color='red', markersize=2)
_ = (base.set_xlabel("Longitude"), base.set_ylabel("Latitude"), base.set_title("Plot of GaN Data Points Around the World"))
heatmap, xedges, yedges = np.histogram2d(gan.Latitude, gan.Longitude, bins=250)
logheatmap = np.log(heatmap)
logheatmap[np.isneginf(logheatmap)] = 0
logheatmap = sp.ndimage.filters.gaussian_filter(logheatmap, 2, mode='nearest')
plt.figure(figsize=(20, 10))
plt.imshow(logheatmap, cmap="jet", extent=[yedges[0], yedges[-1], xedges[-1], xedges[0]])
plt.colorbar()
ax = plt.gca()
ax.invert_yaxis()
ax.set_xlim(-175,180)
world.boundary.plot(edgecolor='white', ax=ax)
_ = ax.set_title("Heat Map of GaN Data")
gan.loc[:, "Rating"] = gan.LimitingMag.apply(lambda p: "Extremely High" if p >= 7 else "Very High" if p >= 6 else "High" if p >= 5 else "Moderate" if p >= 4 else "Good" if p >= 3 else "Low" if p >= 2.0 else "Very Low" if p >= 1.0 else "Extremely Low")
gan.Rating.value_counts()
Good 49844 Moderate 42924 Low 29529 High 27221 Very Low 19610 Very High 16488 Extremely High 3714 Name: Rating, dtype: int64
_ = sns.countplot(y=gan.Rating, order=gan.Rating.value_counts().index).set_title("Number of Data Records by Light Pollution Rating")
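The same binning can also be expressed declaratively with pd.cut (a sketch using the same thresholds as the conditional above; rating_cut is a hypothetical name):
import numpy as np
import pandas as pd
# Left-closed one-magnitude bins from 1 to 7, open-ended below 1 and above 7
bins = [-np.inf, 1, 2, 3, 4, 5, 6, 7, np.inf]
labels = ["Extremely Low", "Very Low", "Low", "Good", "Moderate", "High", "Very High", "Extremely High"]
rating_cut = pd.cut(gan.LimitingMag, bins=bins, labels=labels, right=False)
print((rating_cut.astype(str) == gan.Rating).all())  # expected to agree with the Rating column above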
m = folium.Map(location=[0, 0], zoom_start=1)
plugins.FastMarkerCluster(gan[["Latitude", "Longitude", "LimitingMag"]].to_numpy()).add_to(m)
plugins.Fullscreen().add_to(m)
m.save("maps/ganMarkerClusters.html")
m
latlonglim = gan.set_index("UTTime").sort_index()[["Latitude", "Longitude", "LimitingMag"]]
latlonglim.LimitingMag /= 7.0
m = folium.Map(location = [15,30], tiles='Cartodb dark_matter', zoom_start = 2)
plugins.HeatMap(latlonglim.to_numpy()).add_to(m)
plugins.Fullscreen().add_to(m)
m.save("maps/ganHeatMap.html")
m