CS4132 Data Analytics

An Analysis of Global Light Pollution Standards

Prannaya Gupta (M21404)

Done as part of CS4132: Data Analytics


Introduction

A side-effect of technological advancement has been the rise of light pollution across the world. Ever since Thomas Edison's revolutionary invention of the light bulb, the world has been thrust into a landscape of light-afflicted skies. According to various studies, around 80% of people live under light pollution-afflicted skies every day, and while this may not affect an individual's day-to-day life, astronomers are greatly affected by the illumination of the night sky. Even the Singaporean sky is heavily damaged by light pollution, with 99.5% of all stars completely invisible without optical aid [1].

Credits: Bing Hui Yau, Unsplash

In this project, I aim to analyse the implications of changes in light pollution levels over the past few years, utilising global and local data to find possible relationships. As shown below, I aim to answer a list of light pollution-related questions that utilise alternative data to analyse and interpret patterns, using both simple statistical and machine learning models.

Research Questions

A. What are locations of minimal light pollution intensity which are optimum for astronomical observation?

While one might think that the best locations are in the middle of wilderness or large water bodies (i.e. the ocean), which is where observatories such as Arecibo and FAST are situated, these locations need to be filtered for accessibility: the middle of the ocean, for instance, is not a feasible place for people to assemble to watch the sky, and would thus be a poor location for astrophotography. Additionally, we must be able to identify suitable sites for observatories.

In order to locate these, a plethora of factors is considered: the amount of light pollution in the vicinity, the education level around the area (on the basis of the rank of the top university situated in the country), the area available, and the population density in the area. Some computations indicate that the Auckland Islands are the best location for astronomical observation, especially in the context of placing an observatory in the area.

B. How has the light pollution data around the world changed? Which countries are most susceptible to high light pollution in the future? Which countries are lessening in terms of light pollution?

There are many countries that have seen rampant increases in light pollution over the past few years, while others have made an effort to reduce theirs. We need to map the data in order to find out which countries are susceptible to the problematic levels of light pollution found in cities like Singapore.

We use the available time series data and model possible time series progressions, worldwide and for individual countries. These models can be used to make predictions and comparisons across light pollution levels.

In the end, we find that there was a sudden increase in light pollution back in 2014, that Palestine, Qatar and Gibraltar are the most susceptible to light pollution, and that some areas, like Bermuda, have declining levels of light pollution.

C. What is the relation between the average energy consumption and general demographics in each region/country and the Light Pollution?

A reason for considering this is the lack of mathematical analysis of how energy consumption in specific districts affects the light pollution there. The corresponding data for Singapore itself has not been explored, hence this is a good way of exploring something new. We can model the types of housing, the general demographics of residents, and the resulting energy consumption, and from there test how these affect nearby light pollution.

I intend to use the economic datasets that cover Singapore's energy consumption statistics [20] and Singapore's resident demographics [21], together with the absolute positional brightness datasets [4-7], filtered to Singapore as in the previous research question. These can then be used in conjunction to find possible patterns.

Dataset


In this project, we are using the following datasets:

  1. The GeoNames Geographical Dataset
  2. The Google Developers countries.csv Dataset
  3. DataHub's Countries GeoJSON Dataset
  4. DataHub's Natural Earth Polygons GeoJSON Dataset
  5. UN's Population, Surface Area and Density Dataset
  6. UN's Dataset of Population Growth and Indicators of Fertility and Mortality
  7. UN's Dataset for Literacy Amongst Students
  8. UN's Dataset for Labour Force and Unemployment
  9. UN's Dataset for Employment by Industry
  10. UN's Dataset for Energy Consumption
  11. UN's Dataset for Educational Attainment
  12. UN's Dataset for Population by Age, Sex, Educational Attainment
  13. The Harmonized Global Nighttime Light (1992 - 2018) Dataset
  14. The Globe at Night - Sky Brightness Monitoring Network (GaN-MN)

Set-Up and Imports

Firstly, we need to perform a simple set-up. This involves the following steps:

  1. Import the autoreload extension, which ensures the notebook does not keep working with stale versions of files that change on disk.
  2. Install some libraries, like numpy, imagecodecs, lxml and gdal
    • Note: For gdal, which is the Python API Wrapper for the C library GDAL, the installation is a lot more complex than just pip install, hence there is a script. Additional info is provided in the corresponding section
  3. Import libraries, which include:
    • Crucial Ones like random, re and glob
    • Mathematical Libraries like numpy and scipy
    • Data Wrangling Libraries like pandas and lxml
    • Plotting Libraries like matplotlib, seaborn and plotly
    • Web Scraping Libraries like bs4 and requests
    • Image Processing Libraries like cv2, skimage and PIL
    • Machine Learning Libraries like sklearn

Installation Instructions

In this project, I am using several external libraries; a sample installation script is shown below:

. '/c/ProgramData/Anaconda3/etc/profile.d'/conda.sh
conda activate
conda create --name data-analytics python==3.8
conda activate data-analytics
pip install opencv-python lxml geopy
conda install scikit-image geopandas rasterio rasterstats --yes
conda install leafmap -c conda-forge
pip install keplergl area

Directory Structure

Please make the data directory as follows:

data
├───country
├───gan
├───ibol
├───nightLight
└───stats

Activating the autoreload extension

The autoreload extension is an IPython tool that reloads modules when their source files change on disk, so the kernel does not keep working with stale code. This particularly helps for modules that change significantly while being used, a need that was mostly found experimentally. It is effectively a debugging step.
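The standard way to activate it in a notebook cell is:

%load_ext autoreload
%autoreload 2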

Imports

Some crucial modules used in this project include:
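A representative subset of the imports, reconstructed from the set-up list above (the exact cell may differ slightly):

import random
import re
import glob

import numpy as np
import scipy
import pandas as pd
import lxml.html

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import requests
from bs4 import BeautifulSoup

import cv2
import skimage
from PIL import Image

import sklearn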

Definitions

This section pertains to tools created with the express purpose of mathematical and image-based analyses.

Here, you can find the following:

  1. Correlation Computational Methods
  2. Custom Regression Lines
  3. Pandas-based customized data reading functions

Correlation Computational Functions

To ensure that the correlation computation is substantially accurate, I am also using a self-developed system to identify correlation, built from the special functions below.

Gamma Function
$$\Gamma(x) = \int_{0}^{\infty} t^{x-1} e^{-t} dt$$

Based on some simple manipulation, we get the following equations: $$\Gamma(x) = \frac{1}{x} t^x e^{-t}|_{0}^{\infty} - \dfrac{1}{x}\int_{0}^{\infty} t^{x} (- e^{-t})dt$$ $$= 0 + \dfrac{1}{x}\int_{0}^{\infty} t^{x} ( e^{-t})dt$$ $$= \dfrac{1}{x}\displaystyle\int_{0}^{\infty} t^{x} ( e^{-t})\,dt =\dfrac{1}{x}\Gamma (x + 1)$$ $$\implies \Gamma (x + 1) = x\Gamma (x)$$ $$\Gamma (x) = \int_{0}^{\infty} t^{x - 1}e^{-t}dt$$ $$\implies \Gamma (1) = \int_{0}^{\infty} t^{1 - 1}e^{-t}dt = \int_{0}^{\infty} e^{-t}\,dt =(-e^{-t})|_{0}^{\infty} = 1$$

Therefore, we can see the following pattern:

$$\Gamma (1 + 1) = (1) \times \Gamma (1) = 1 \implies \Gamma (2) = 1!$$$$\Gamma (1 + 2) = (2) \times \Gamma(2) = 2 \implies \Gamma (3) = 2!$$$$\Gamma (1 + 3) = (3) \times \Gamma(3) = 6 \implies \Gamma (4) = 3!$$$$\vdots$$$$\Gamma (n+1) = n! \implies \Gamma (n) = (n-1)!$$

Since the actual factorial is defined only for integer arguments, the gamma function extends it to non-integers. In particular, further computation gives $$\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$$
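As a quick sanity check of these identities (using scipy rather than the derivation itself):

import math
from scipy.special import gamma

print(gamma(5), math.factorial(4))     # gamma(5) = 4! = 24
print(gamma(0.5), math.sqrt(math.pi))  # gamma(1/2) = sqrt(pi) ~ 1.7725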

Beta Function
$$\beta (x, y) = \int_0^1 t^{x-1} (1-t)^{y-1} dt = \frac{\Gamma(x) \Gamma(y)}{\Gamma(x+y)} $$
Incomplete Beta Function

The incomplete beta function, a generalization of the beta function, is defined as $$B(x;\,a,b) = \int_0^x t^{a-1}\,(1-t)^{b-1}\,dt$$

Regularised Incomplete Beta Function

From this, we also introduce the concept of the regularised incomplete beta function, as shown below $$I_x(a,b) = \frac{B(x;\,a,b)}{\beta(a,b)}$$

Some actual values are shown below: $$I_0(a,b) = 0$$ $$I_1(a,b) = 1$$ $$I_x(a,1) = x^a$$ $$I_x(1,b) = 1 - (1-x)^b$$ $$I_x(a,b) = 1 - I_{1-x}(b,a)$$ $$I_x(a,b) = I_x(a-1,b)-\frac{x^{a-1} (1-x)^b}{(a-1) \beta(a-1,b)}$$ $$I_x(a,b) = I_x(a,b-1)+\frac{x^a(1-x)^{b-1}}{(b-1) \beta(a,b-1)}$$

From here, you can define a simplified recursive function to compute the regularised incomplete beta function.
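A minimal sketch of such a recursion, assuming positive integer parameters (the identities above reduce b until a base case is reached):

from math import gamma

def beta_fn(a, b):
    # complete beta function via the gamma identity above
    return gamma(a) * gamma(b) / gamma(a + b)

def reg_inc_beta(x, a, b):
    # regularised incomplete beta I_x(a, b) for positive integers a, b
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    if b == 1:
        return x ** a
    if a == 1:
        return 1 - (1 - x) ** b
    # I_x(a, b) = I_x(a, b-1) + x^a (1-x)^(b-1) / ((b-1) beta(a, b-1))
    return reg_inc_beta(x, a, b - 1) + x ** a * (1 - x) ** (b - 1) / ((b - 1) * beta_fn(a, b - 1))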

The Measure of Covariance

Covariance is used to determine how much the variables differ from their means.

First off, let us take a dataset that has been normalised, i.e. $$\bar x = \bar y = 0$$ This can be achieved by subtracting the mean from each value, i.e. $$x_i := x_i - \bar x$$ $$y_i := y_i - \bar y$$

We can then compute this Covariance using: $$cov(x, y) = \frac{1}{n} \sum_{i=1}^n x_i y_i $$
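A direct translation of this formula (a sketch, not necessarily the exact project code):

import numpy as np

def covariance(x, y):
    # centre both variables, then average the elementwise products
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return np.sum(x * y) / len(x)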

Pearson R Correlation Coefficient

As before, let us take a dataset that has been normalised, i.e. $$\bar x = \bar y = 0$$ achieved by subtracting the mean from each value: $$x_i := x_i - \bar x$$ $$y_i := y_i - \bar y$$

From here, computing the Pearson r correlation coefficient is not very intensive. $$r(x, y) = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2}} $$
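A sketch of the same computation in numpy:

import numpy as np

def pearson(x, y):
    # centre both variables, as in the normalisation step above
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    return np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))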

The p Value

From here, we can also compute the P Value using the previous two functions. It is simply given by: $$p(r) = 2 I_{\frac{1 - |r|}{2}}(\frac{n}{2} - 1, \frac{n}{2} - 1)$$ where n is the number of samples.
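Equivalently, using scipy's regularised incomplete beta rather than the integer recursion above (since n/2 - 1 need not be an integer):

from scipy.special import betainc

def p_value(r, n):
    # two-sided p-value for a Pearson correlation r over n samples
    ab = n / 2 - 1
    return 2 * betainc(ab, ab, (1 - abs(r)) / 2)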

The Spearman Rank Correlation Coefficient

The Spearman Rank Correlation Coefficient is used to find correlation between the ranks of data. Below, we use the tied-rank method, which finds the ranks of the actual data (averaging ranks for ties) and then computes the Pearson correlation coefficient of these ranks.

Thus, the formula is as follows: $$\rho(x, y) = r(rank(x), rank(y))$$
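Using scipy's rankdata for the tied (average) ranks, together with the pearson sketch above:

from scipy.stats import rankdata

def spearman(x, y):
    # tied-rank method: average ranks for ties, then Pearson on the ranks
    return pearson(rankdata(x), rankdata(y))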

Regression Lines

In this section, we adopt some alternative practices to incorporate Maclaurin and Taylor Series fitting methods into this project.

An Introduction to Taylor Series

We know that the Maclaurin Series operates based on the following approximation: $$ f(x) \approx \sum_{r = 0}^{N} \frac{f^{(r)}(0)}{r!} x^r $$ where $N$ is a sufficiently large number.

We can make this exact by taking a limit: $$ f(x) = \lim_{N\to\infty} \sum_{r = 0}^{N} \frac{f^{(r)}(0)}{r!} x^r = \sum_{r = 0}^{\infty} \frac{f^{(r)}(0)}{r!} x^r $$

Let us now apply this expansion to the function $$ g(x) = \frac{1}{1+x} $$

From here, we can get some very simple values as shown below: \begin{align*} g^{(0)}(x) &= \left ( 1 + x \right )^{-1} \\ g^{(1)}(x) &= g'(x) \\ &= \frac{d}{dx} \left [ \left ( 1 + x \right )^{-1} \right ] \\ &= (-1) \cdot \left ( 1 + x \right )^{-2} \\ &= - \left ( 1 + x \right )^{-2} \\ g^{(1)}(x) &= (-1)^{1} \cdot 1! \cdot \left ( 1 + x \right )^{-2} \\\\ \hline \\ g^{(2)}(x) &= g''(x) \\ &= \frac{d^2}{dx^2} \left [ \left ( 1 + x \right )^{-1} \right ] \\ &= \frac{d}{dx} \left [ - \left ( 1 + x \right )^{-2} \right ] \\ &= (-1) \cdot (-2) \cdot \left ( 1 + x \right )^{-3} \\ &= 2 \left ( 1 + x \right )^{-3} \\ g^{(2)}(x) &= (-1)^{2} \cdot 2! \cdot \left ( 1 + x \right )^{-3} \\\\ \hline \\ g^{(3)}(x) &= g'''(x) \\ &= \frac{d^3}{dx^3} \left [ \left ( 1 + x \right )^{-1} \right ] \\ &= \frac{d}{dx} \left [ 2 \left ( 1 + x \right )^{-3} \right ] \\ &= (2) \cdot (-3) \cdot \left ( 1 + x \right )^{-4} \\ &= - 6 \left ( 1 + x \right )^{-4} \\ g^{(3)}(x) &= (-1)^{3} \cdot 3! \cdot \left ( 1 + x \right )^{-4} \\ \end{align*}

Hence, we derive the following expression: $$ g^{(k)}(x) = (-1)^k \cdot k! \cdot \left ( 1 + x \right )^{-k-1} $$

Hence, for any k, we get the following: $$ g^{(k)}(0) = (-1)^k \cdot k! $$

Thus, we get the following: \begin{align*} g(x) &= \sum_{r = 0}^{\infty} \frac{g^{(r)}(0)}{r!} x^r \\ &= \sum_{r = 0}^{\infty} \frac{(-1)^r \cdot r!}{r!} x^r \\ \frac{1}{1+x} &= \sum_{r = 0}^{\infty} \left (-x \right)^r \\ \frac{1}{1+x} &= \sum_{r = 0}^{\infty} \left[1 - \left(1+x \right) \right]^r \end{align*}

From here, we define a value $y = 1 + x$. Thus, we can get the following: $$ \frac{1}{y} = \sum_{r = 0}^{\infty} \left(1 - y \right)^r $$

Thus, we get the Maclaurin Series of $ f(x) = \frac{1}{x} $ to be as follows: $$ f(x) = \frac{1}{x} = \sum_{r = 0}^{\infty} \left(1 - x \right)^r = \sum_{r = 0}^{\infty} (-1)^r \left(x - 1 \right)^r $$

However, the introduction of the alternate variable offers a solution for cases where $f(0)$ and subsequent derivatives do not exist. Hence, we can introduce a term $a$ such that the following holds:

$$ f(x) = \sum_{r = 0}^{\infty} \frac{f^{(r)}(a)}{r!} (x-a)^r $$

This is the definition of the Taylor Series; the Maclaurin Series is the special case of the Taylor Series with $a = 0$. When $a \neq 0$, the general Taylor Series is used.

For example, applying $a = 1$ to the function $f$ mentioned above, we can derive the Taylor Series of $f$. Firstly, we notice the following: $$ f^{(k)}(1) = (-1)^k \cdot k! $$

This piece of information has not changed from function $g$. Hence, we can now apply the Taylor Series: \begin{align*} f(x) &= \sum_{r = 0}^{\infty} \frac{f^{(r)}(1)}{r!} (x - 1)^r \\ &= \sum_{r = 0}^{\infty} \frac{(-1)^r \cdot r!}{r!} (x - 1)^r \\ \frac{1}{x} &= \sum_{r = 0}^{\infty} \left (1 - x \right)^r \\ f(x) &= \sum_{r = 0}^{\infty} (-1)^r \left(x - 1 \right)^r \end{align*}

Hence, for a function $f(x)$, to be expanded about a point $x = a$, let us define a new function $g(x)$ such that $g(x - a) = f(x)$. Then,

$$g(x) = g(0) + g'(0)x + \frac{1}{2}g''(0)x^2 + ...$$$$g(x) = f(a) + f'(a)x + \frac{1}{2}f''(a)x^2 + ...$$$$f(x) = f(a) + f'(a)(x - a) + \frac{1}{2}f''(a)(x - a)^2 + ...$$

Hence, $$ f(x) = \sum_{r = 0}^{\infty} \frac{f^{(r)}(a)}{r!} (x-a)^r $$ for some a, such that f(a) exists.

Functions to compute the Maclaurin and Taylor derivations

Using numpy's polyfit function, we can compute the coefficients of a function's truncated Taylor or Maclaurin series expansion, up to a degree of 100.
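A sketch of how such a fit might look (the function and parameter names here are my own):

import numpy as np

def series_fit(f, a=0.0, degree=20, width=1.0, n_samples=2001):
    # sample f around the expansion point a and fit a polynomial in (x - a);
    # the coefficients approximate the truncated Taylor expansion
    # (a = 0 recovers the Maclaurin case)
    x = np.linspace(a - width, a + width, n_samples)
    coeffs = np.polyfit(x - a, f(x), degree)
    return np.poly1d(coeffs)

sin_approx = series_fit(np.sin, degree=7)
print(sin_approx(0.5), np.sin(0.5))  # the two values should be close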

Validation of these Methods

To show that these methods work, we test them on two curves: the sigmoid curve, used in logistic regression and as an activation function in deep learning, and the sine curve, which lends itself easily to a Taylor series.

pandas Extension Functions

This snippet contains functions to retrieve large online sources directly, instead of relying on pandas' own scraping via pd.read_html, which runs into errors on some sites.
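A sketch of one such helper, fetching the page manually before handing it to pandas (the helper name is my own):

import pandas as pd
import requests

def read_html_robust(url, **kwargs):
    # fetch with a browser-like User-Agent, since pd.read_html's own
    # request is rejected by some servers (e.g. HTTP 403)
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    resp.raise_for_status()
    return pd.read_html(resp.text, **kwargs)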

Methodology


Data Acquisition and Cleaning

I will be acquiring the following datasets in this project:

  1. Country-Based Datasets
  2. UN Datasets
  3. Light Pollution Datasets

Country-Based Datasets

The GeoNames Geographical Dataset

The GeoNames geographical database covers all countries and contains over eleven million placenames, available for download free of charge. The dataset contains key information such as the continent, area in km² and population. I have renamed some columns for clarity and fixed some pandas reading errors; for example, the continent code NA, which denotes North America, is read as a null value by default.
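A sketch of the fix (the file path and parsing details are assumptions; the key point is keep_default_na):

import pandas as pd

# "NA" is the continent code for North America, so tell pandas to treat
# only empty strings as missing; GeoNames files are tab-separated with
# "#"-prefixed comment lines
geonames = pd.read_csv(
    "data/country/countryInfo.txt",
    sep="\t",
    comment="#",
    header=None,
    keep_default_na=False,
    na_values=[""],
)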

Google's Lat-Long Dataset For Countries Worldwide

Google Developers' countries.csv contains latitude-longitude data for each country, identifying a plausible centre for the country.

I have cleaned some of the data by extracting the latitude, longitude and country name and adapting it to help with the merge later.

Wikipedia's List of Countries and Dependencies by Population Density

Wikipedia is kind enough to have compiled a list of countries and dependencies by population density, which lists the number of people per square kilometer in some specific areas.

DataHub's GeoJSON Datasets

DataHub's Countries GeoJSON Dataset

DataHub's Countries GeoJSON Dataset is a geodata data package providing geojson polygons for all the world's countries. The data comes from Natural Earth, a community effort to make visually pleasing, well-crafted maps with cartography or GIS software at small scale.

I have opened this dataset as a geopandas.GeoDataFrame object, as shown below. Following this, I have saved it to a local file, world_countries.json, for later usage by folium and plotly.

Verifying the Validity of the Countries GeoJSON Dataset

To verify how accurate the polygons are, we zoom in on Singapore, as shown below. We plot an interactive leafmap map using the keplergl backend to analyse how accurately the polygons fit the countries.

As we can see, the data does not fill the country border fully, but it is still sufficiently accurate at large scales to be considered valid. Hence, we continue with this dataset.

DataHub's Natural Earth Polygons GeoJSON Dataset

DataHub's Natural Earth Polygons GeoJSON Dataset is a geodata package providing geojson polygons for the largest administrative subdivisions in every country. The data comes from Natural Earth, a community effort to make visually pleasing, well-crafted maps with cartography or GIS software at small scale.

I have opened this dataset as a geopandas.GeoDataFrame object, as shown below. Following this, I have saved it in a local file, world_locations.json for later usage by folium and plotly.

UN Datasets

These are mostly datasets found on the UN Database. The data is quite dirty, so it needs to be cleaned. Some additional datasets do not have a CSV Link, hence they are attached with this project notebook. The datasets covered include:

UN's Dataset of Population, Surface Area and Density

Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding Population, Surface Area and Population Density, although the data needs to be cleaned.

UN's Dataset of Population Growth and Indicators of Fertility and Mortality

Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding Population Increase, Life Expectancy, Infant and Maternal Mortality and Total Fertility Rate, although the data needs to be cleaned. The column names represent the following quantities:

UN's Dataset for Literacy Amongst Students

Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding enrollment in primary, secondary and tertiary education levels, although the data needs to be cleaned. The column names represent the following quantities:

UN's Dataset for Labour Force and Unemployment

Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding Labour Force Participation and Unemployment Rate, although the data needs to be cleaned. The column names are as follows:

UN's Dataset for Employment by Industry

Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding employment by industry (e.g. agriculture and services), although the data needs to be cleaned.

Post-cleaning, the following describes each of the column names:

UN's Dataset for Energy Consumption

Found on the UN Database, this dataset consists of multiple data samples per country per year, hence providing a large array of values for use. The database contains data regarding Energy Production, Trade and Consumption, although the data needs to be cleaned. The column names represent the following quantities:

Merging all the UN Data Together

In the end, we merge all of this data into a compound DataFrame object, undata. I have used OUTER JOIN operations, then imputed missing values with the column means.
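A sketch of the merge, assuming each cleaned UN table is keyed by country and year (the DataFrame and column names are assumptions, apart from popsaden and undata):

import pandas as pd
from functools import reduce

un_frames = [popsaden, popgrowth, literacy, labour, employment, energy]

# chain OUTER JOINs over the shared keys
undata = reduce(
    lambda left, right: pd.merge(left, right, on=["Country", "Year"], how="outer"),
    un_frames,
)

# impute remaining gaps with each column's mean
undata = undata.fillna(undata.mean(numeric_only=True))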

Merging the Data

We now merge all the country-level data together. I have performed INNER JOIN operations so that every row has all the fields necessary. This data is stored in a pandas DataFrame geocountries_latlong, which has then been converted into a point-based geopandas.GeoDataFrame named countries. Following this, I have plotted this data as a scatterplot by population.

Light Pollution Datasets

The Harmonized Global Nighttime Light (1992 - 2018) Dataset

This dataset is the largest by far, containing close to 20 billion data points (specifically 20,322,960,028), hence the data has to be acquired in chunks. All the data can be downloaded as a zip file as shown below. Here are the instructions:

  1. Download the zip file from https://figshare.com/ndownloader/articles/9828827/versions/2
  2. Make a directory relative to this notebook called data/nightLight
  3. Unzip all the contents in the zip file into the nightLight directory as created in step 2.
  4. Remove the zip file, so as to conserve disk space.

Installation can take place as follows:

curl -L https://figshare.com/ndownloader/articles/9828827/versions/2 > nightLight.zip
unzip nightLight.zip -d data/nightLight/
rm nightLight.zip

Due to the large amount of data, it has been reduced to country/state-wise and year-wise statistical summaries containing the mean, median, mode, min, max, count and std. The GeoTIFF files are converted into DataFrames as shown below.
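A sketch of the per-country summary for one year, using rasterstats zonal statistics (the GeoTIFF name follows the dataset's naming convention, which may differ):

import pandas as pd
from rasterstats import zonal_stats

# statistics of the night-light raster within each country polygon;
# "majority" plays the role of the mode
stats = zonal_stats(
    "world_countries.json",
    "data/nightLight/Harmonized_DN_NTL_2013_calDMSP.tif",
    stats=["mean", "median", "majority", "min", "max", "count", "std"],
)
nightLight2013_stats = pd.DataFrame(stats)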

NASA's Earth at Night (Black Marble) 2016 Dataset

Link: https://visibleearth.nasa.gov/images/144898/earth-at-night-black-marble-2016-color-maps

Satellite images of Earth at night—often referred to as “night lights”—have been a curiosity for the public and a tool of fundamental research for at least 25 years. They have provided a broad, beautiful picture, showing how humans have shaped the planet and lit up the darkness. Produced every decade or so, such maps have spawned hundreds of pop-culture uses and dozens of economic, social science, and environmental research projects.

These images show Earth’s night lights as observed in 2016. The data were reprocessed with new compositing techniques that select the best cloud-free nights in each month over each land mass.

The images are available as JPEG and GeoTIFF, in three different resolutions: 0.1 degrees (3600x1800), 3km (13500x6750), and 500m (86400x43200). The 500m global map is divided into tiles (21600x21600) according to a gridding scheme.

The Globe at Night (GaN) Dataset

Globe at Night collects data at specific locations and, in this case, contains a column named LimitingMag, which can be related to light pollution standards in the region. The following commands showcase a way to download the dataset programmatically, while also removing unnecessary files.
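After downloading, the yearly exports can be combined along these lines (the file layout and column names are assumptions, apart from LimitingMag):

import glob
import pandas as pd

# concatenate the yearly Globe at Night CSVs downloaded into data/gan/
gan = pd.concat(
    (pd.read_csv(f) for f in sorted(glob.glob("data/gan/*.csv"))),
    ignore_index=True,
)

# keep only the columns we use
gan = gan[["Latitude", "Longitude", "LimitingMag"]]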

Exploratory Data Analysis

Research Question A: What are locations of minimal light pollution intensity which are optimum for astronomical observation?

While one might think that the best locations are in the middle of wilderness or large water bodies (i.e. the ocean), which is where observatories such as Arecibo and FAST are situated, these locations need to be filtered for accessibility: the middle of the ocean, for instance, is not a feasible place for people to assemble to watch the sky, and would thus be a poor location for astrophotography. Additionally, we must be able to identify suitable sites for observatories.

For this question, I will be using the nightLight2013 Dataset as retrieved above. I have also placed the retrieval instructions here for new users.

nightLight2013 = states_geojson.merge(pd.read_csv("data/nightLight/nightLight2013.csv"))
Based on States

Firstly, we analyse the data with respect to states to find the best specific states for astronomical observations. For this, we consider states that have an average light pollution of 0, representing optimal (minimal) light pollution.

Astronomical observation often entails the construction of massive observatories and telescopes to observe that part of the sky. Hence, whether an observatory can be built and maintained on site is an important consideration.

In this section, we use the Arecibo Observatory as the base of our investigation. A former observatory whose instrument platform collapsed in December 2020, Arecibo was built as a radio reflector dish with a diameter of 305 metres, making it one of the largest observatories and easily one of the most recognisable telescopes on Earth. The observatory covers an area of about twenty acres, which roughly translates to about 0.0809371 square kilometres. To account for additional space taken up by labs and rooms in the observatory, we round this up to about 0.1 square kilometres.

Below is the Arecibo Observatory, as shown in the form of an ArcGIS Folium Map.

Another telescope to consider is the Five-hundred-meter Aperture Spherical Radio Telescope (FAST) in southern China, which has recently begun astronomical observations. As the name suggests, this telescope has a diameter of roughly 500 metres, hence to derive its approximate area we can apply a rough area-based scaling formula as shown below:

\begin{align*} Area_{FAST} &= \frac{Area_{FAST\ dish}}{Area_{Arecibo\ dish}} \cdot Area_{Arecibo} \\ &= \frac{\pi r_{FAST}^2}{\pi r_{Arecibo}^2} \cdot Area_{Arecibo} \\ &= \frac{r_{FAST}^2}{r_{Arecibo}^2} \cdot Area_{Arecibo} \\ &= \frac{d_{FAST}^2}{d_{Arecibo}^2} \cdot Area_{Arecibo} \\ &= \left ( \frac{500}{305} \right)^2 \cdot (0.0809371) \\ &\approx 0.2175\ km^2 \end{align*}

Hence, we retrieve that FAST has a rough area coverage of about 0.2175 square kilometres, more than twice that of Arecibo, so it is also worth considering.

Below is the FAST Observatory in a similar ArcGIS Folium Map, but unfortunately covered by clouds.

Keeping in mind possible human habitation in the area, we require an area of at least 1 square kilometre. This is to ensure that the telescope personnel also have the ability to live around it. Another relevant measure is the population density, which is available in the popsaden DataFrame.

To check whether an area is large enough, we use the area package (a Python port of geojson-area), which can be installed simply with the command:

pip install area
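The package exposes a single area function that returns square metres for a GeoJSON geometry; a sketch of the check, using the states_geojson frame from above:

from area import area

# area() takes a GeoJSON geometry and returns square metres
polygon = states_geojson.geometry.iloc[0].__geo_interface__
area_km2 = area(polygon) / 1e6
print(area_km2 >= 1)  # our 1 km² threshold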

As was established previously, this data is slightly off from the real result, since the polygons are not fully in line with the real shape of the country or city. However, it still gives a reasonable estimate of how large an area is, and hence whether it can hold an observatory the size of Arecibo or even FAST.

Another location in consideration is the Tibesti Mountain Range in Chad, a setting similar to FAST's in Guizhou and Arecibo's in Puerto Rico, but in drylands, which may make it less desirable than one would think. Below is Tibesti, as shown on the map.

Next, we note the population density of each area. Since the population density of the individual territories is not available, we instead use each country's average population density, which gives a reasonable estimate for that location. This population density data was previously saved in the popsaden DataFrame, and we simply filter it to the areas saved in the datasets above. Following this, we merge the data into idealAreaNightLight to get denseNightLight2013.

From here, we can determine that Norfolk Island is one of the better areas for an observatory. Below is Norfolk Island in its full glory. Norfolk Island is situated between Northland, New Zealand and Brisbane, Australia, but is remote enough that light pollution is quite small. However, it may not be a location that working astronomers at universities would like to continually commute to, since it is only accessible by plane.

Another consideration is proximity to a top-ranking university, since astronomers would prefer working near a recognised university with good equipment. To determine this, we look at the Webometrics Ranking Web of Universities, from which we locate the best university near each area and its rank.

From here, we can determine that the Isles of Scilly are also among the better areas for an observatory. Below is a map of the Isles of Scilly. While an interesting location, the Isles of Scilly are incredibly small, and may hence not be the best site for astronomical observatories.

To weight each factor equally, we apply the following index: $$ \lambda = rank \left(\lambda_{university} \right) + rank \left(\frac{1}{\lambda_{area}} \right) + rank \left(\lambda_{density} \right) $$
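In pandas, this index can be computed directly with rank(); the column names below are assumptions:

# lower scores are better: good university rank, large area, low density
denseNightLight2013["score"] = (
    denseNightLight2013["UniversityRank"].rank()
    + (1 / denseNightLight2013["Area"]).rank()
    + denseNightLight2013["Density"].rank()
)
best_locations = denseNightLight2013.nsmallest(5, "score")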

The next location given is the Auckland Islands, shown below. The Auckland Islands are quite close to Queenstown, New Zealand, and surprisingly far away from Auckland itself. They are much more accessible than Norfolk Island, hence this may be the best location.

Hence, we conclude that the following are the best locations for astronomical observation (in order of suitability):

  1. The Auckland Islands, New Zealand
  2. Norfolk Island, Australia
  3. Tibesti Mountain Region, Chad
  4. The Isles of Scilly, UK
Based on Countries

In some cases, astronomers would prefer to be situated in one specific country rather than move around a lot.

According to the 2010 US National Academies Decadal Survey of Astronomy, frequent moves are a career necessity for astronomers and astrophysicists, with many, as postdocs, relocating every two or three years until they secure a permanent position; this may be unattractive to people looking to start a family, and especially impacts women.

In fact, according to a paper released in January 2021 describing ASTROMOVES, a project that studies the diversity and mobility of astrophysicists today, most postdocs have a fixed salary despite the constant location changes, which may not cover the cost of relocating with a family. Hence, here we investigate the best countries for astronomical observation, specifically for astronomers and astrophysicists.

Somalia and Mauritania appear to be the least polluted countries outside of the French Southern Territories, although there are some outliers in both, especially in Mauritania. However, the left-skewed nature of the data is indicative of the number of problematic outliers.

Research Question B: How has the light pollution data around the world changed? Which countries are most susceptible to high light pollution in the future? Which countries are lessening in terms of light pollution?

There are many countries that have seen rampant increases in light pollution over the past few years, while others have made an effort to reduce theirs. We need to map the data in order to find out which countries are susceptible to the problematic levels of light pollution found in cities like Singapore.

Since we possess time series data of night-time light pollution, we can easily map locations based on relative changes in absolute light magnitude. I intend to use the geolocation datasets against the time-series night-time light datasets to find changes, and to fit a regression model to predict the curve of increase or decrease. This can be used to make predictions and comparisons across light pollution levels.

Firstly, we find the average light pollution of each country in each year, and then plot a time series-based choropleth map using the Plotly Express library. From here, we can get a quick grasp of how each country's average light pollution changes.

We use this on both the GaN and NightLight Datasets.
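A sketch of the animated choropleth for the NightLight data, assuming a tidy frame of per-country yearly means (yearly_means and its column names are my own):

import plotly.express as px

fig = px.choropleth(
    yearly_means,                     # columns: iso_alpha, Year, MeanLight
    locations="iso_alpha",
    color="MeanLight",
    animation_frame="Year",
    color_continuous_scale="Viridis",
)
fig.show()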

How has the light pollution data around the world changed?

Here, we examine light pollution in general, using the two averaging algorithms below:

\begin{align*} \mu_1 &= \frac{1}{27} \sum_{y = 1992}^{2018} \mu_y \\ \mu_2 &= \frac{\sum_{y = 1992}^{2018} n_y\mu_y}{\sum_{y = 1992}^{2018} n_y} \end{align*}

We plot graphs based on both of these averages.

It is clear that the original prediction was skewed by the presence of outliers from 2014 onwards, and the second prediction, which has a much higher $R^2$ value and much lower $MSPE$, is a more accurate view of the still-increasing light pollution over the years. However, after 2014 the value suddenly nearly doubles, and continues on a rather uneven trajectory.

Which countries have been most and least susceptible to Light Pollution?

To identify susceptibility, we consider the following scenarios:

  1. The largest range of values in average light pollution intensity
  2. The greatest difference in values in 1992 and 2018 (positive and negative separately)
  3. Countries with the highest and lowest average light pollution over the years
Countries with the largest range of values in Average Light Pollution Intensity

To calculate this, we simply compute the max and min over the years and find the largest differences across countries. The countries found are Guernsey, Saint Barthelemy, Palestine, Qatar and Jersey. Based on the results below, Palestine and Qatar have seen a clear increase in values over the past few years, and their predictions are also the most accurate, according to the residual plot shown below. Guernsey, on the other hand, has seen a huge decrease in light pollution over the past few years.

Countries with the greatest difference between values in 1992 and 2018.

To calculate this, we simply took the difference, then took the top 5 positive and top 5 negative results. The 5 most positive countries are Palestine, Lebanon, Qatar, Gibraltar and Akrotiri, which are all increasing rapidly. The 5 most negative are Saint Barthelemy, the US Virgin Islands, Bermuda, Jersey and Guernsey. Based on the results, Bermuda, Jersey and Guernsey have largely decreasing trends, while the US Virgin Islands do as well, but to a limited extent.

Countries with the largest amount of Light Pollution

The countries in consideration are Macao, Singapore, Monaco, Bahrain and Gibraltar. Based on our results, Macao's, Gibraltar's and Singapore's minimum values are increasing, while Monaco's have decreased. As for the mean, all but Singapore have had increasing light pollution values, whereas Singapore's have stagnated.

We now investigate the general distribution of light pollution for the countries with the highest averages, in the form of box plots and violin plots that give insight into the distribution of values in 2013 and overall. Singapore is clearly skewed towards the highest possible values, while Bahrain and Gibraltar are bimodal distributions. The values for Hong Kong, Malta, Sint Maarten, San Marino, Kuwait and Singapore are skewed towards the right, while other countries have data skewed to the left. Hong Kong, Malta and Bahrain possess a few outliers less than the median. Israel, Trinidad and Tobago, and Kuwait have quite large ranges of values as well as large IQRs. The central tendency of many countries lies on the lower end, besides Singapore, Hong Kong and Malta.

This shows that Bermuda is very negatively correlated with Qatar and Bahrain, indicating that these quantities are diverging rapidly.

This shows that all the values are very well correlated with one another, indicating that values have not changed much for many countries over the past 27 years, with a few exceptions.

This data shows that human observation has its limits, with many countries seeing a rapid decrease in readings between 2006 and 2020, with the exception of Thailand. Covid-19 has also apparently had a similar decreasing impact on apparent light pollution. Additionally, it seems that India's limiting magnitude has been decreasing overall, as has Japan's, although the latter has been generally increasing of late.

Research Question C: What is the relation between the general demographics in each region/country and the Light Pollution?

To investigate this, we first merge the data amassed from the UN archives with the night light dataset. Then, using an elaborate algorithm, we find all categories that seem to show a strong correlation with the nightLight data columns.

Now to investigate the correlations and their general $R^2$ values and $MSPE$ values, we run through a for loop as shown below:
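A sketch of such a loop (the column lists un_columns and light_columns are assumptions):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

for factor in un_columns:              # UN indicator columns
    for target in light_columns:       # night-light summary columns
        sub = undata[[factor, target]].dropna()
        X, y = sub[[factor]], sub[target]
        pred = LinearRegression().fit(X, y).predict(X)
        print(factor, target, r2_score(y, pred), mean_squared_error(y, pred))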

We now investigate the best correlation found, which is the Population Density against the Minimum Light Pollution.

According to the following results, $$R^2 = 0.67$$ $$MSPE = 677251$$

We now train a multiple linear regression model (MLRM) to predict the average light pollution, using all the factors we have, normalised using the following algorithm: $$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
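A sketch of the model, using scikit-learn's MinMaxScaler, which implements exactly the normalisation above (feature_columns and MeanLight are assumptions):

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

X = undata[feature_columns]            # all candidate factors
y = undata["MeanLight"]                # average light pollution

X_scaled = MinMaxScaler().fit_transform(X)   # (x - min) / (max - min)
mlrm = LinearRegression().fit(X_scaled, y)

# with features on a common scale, coefficient magnitudes indicate influence
print(sorted(zip(feature_columns, mlrm.coef_), key=lambda t: -abs(t[1])))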

From here, we can conclude that Population, Employment, Energy Production and Life Expectancies have a significant impact on the average light pollution.

Results Discussion


What are locations of minimal light pollution intensity which are optimum for astronomical observation?

We find that the Auckland Islands are the best for astronomical observation, and especially for placing an observatory there. This is based on the fact that the location ranks 7th in both the university ranking and the population density ranking, and 17th in the area ranking, for an overall score of 30, the lowest on the list. This is not surprising, since locations in Oceania do in fact have lower light pollution than most humanly habitable locations.

How has Light Pollution Changed Over the Years?

It seems that light pollution has generally increased over the past few years, with a rapid jump between 2013 and 2014, likely due to the underlying dataset changing altogether.

It also seems that the data has remained broadly consistent between 1992 and 2018, despite rapid advancements in digital technology. This could be because digital technology was already in its prime by 1992.

Which countries are most susceptible to light pollution and which have improved over the years?

It appears that Palestine, Qatar, Macau and Gibraltar are the most susceptible to light pollution, while Guernsey, Bermuda, the US Virgin Islands and Jersey have improved considerably.

Which factors have a significant impact on the Light Pollution Standards?

We find that population density has a significant positive correlation with the minimum light pollution, while population, employment, energy production and life expectancy have a significant impact on the average light pollution.

Conclusion and recommendations

Based on the results, countries like Singapore, Qatar and Palestine have to adopt a different approach to combat the problem of light pollution, and countries with a high population density need to beware of possible increases in light pollution over the next few years.

References



Readings

[1] Drake, N. (2019, April 3). Light pollution is getting worse, and Earth is paying the price. National Geographic. https://www.nationalgeographic.com/science/article/nights-are-getting-brighter-earth-paying-the-price-light-pollution-dark-skies

[2] Are the stars out tonight? (n.d.). Notre Dame Senior School ArcGIS Maps. Retrieved October 10, 2021, from https://notredamecobham.maps.arcgis.com/apps/Cascade/index.html?appid=0d5f0c8b80ca4fff9ff5aa6834b68a63

[3] Lazar, M. (2010). Shedding Light on the Global Distribution of Economic Activity. The Open Geography Journal, 3(1), 147–160. https://doi.org/10.2174/1874923201003010147