Analyzing CIA factbook data using sqlite and python¶

In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like:

population - The population as of 2015.
population_growth - The annual population growth rate, as a percentage.
area - The total land and water area.

The github link to dowload the sqlite database factbook.db is here.

First,we will see which tables are there currently in the database using query SELECT * FROM sqlite_master WHERE type='table' which will return all the objects in the database which are of type tables.

In [20]:

import sqlite3
import pandas as pd


conn = sqlite3.connect("factbook.db")
q0 = "SELECT * FROM sqlite_master WHERE type='table';"
pd.read_sql(q0, conn)

Out[20]:

	type	name	tbl_name	rootpage	sql
0	table	sqlite_sequence	sqlite_sequence	3	CREATE TABLE sqlite_sequence(name,seq)
1	table	facts	facts	47	CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY...

Okay, looks like the info is stored in facts table.Lets check fetch first five rows of the table.

In [21]:

q1 = "Select * from facts LIMIT 5;"
pd.read_sql(q1,conn)

Out[21]:

	id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate
0	1	af	Afghanistan	652230	652230	0	32564342	2.32	38.57	13.89	1.51
1	2	al	Albania	28748	27398	1350	3029278	0.30	12.92	6.58	3.30
2	3	ag	Algeria	2381741	2381741	0	39542166	1.84	23.67	4.31	0.92
3	4	an	Andorra	468	468	0	85580	0.12	8.13	6.96	0.00
4	5	ao	Angola	1246700	1246700	0	19625353	2.78	38.78	11.49	0.46

Now, let's run some couple of queries.

Summary Statistics¶

Calculate summary statistics to find the minimum population, maximum population , minimum population growth and maximum population growth

In [22]:

q2 = "SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth) FROM facts LIMIT 5"
pd.read_sql(q2, conn)

Out[22]:

	MIN(population)	MAX(population)	MIN(population_growth)	MAX(population_growth)
0	0	7256490011	0.0	4.02

Look's like there is a country with a population of 0 and one with population of 7.2 billion.Let's take a look at this countries.

Query to return countrie(s) with a population of `0`.¶

In [23]:

q3 = "SELECT * FROM facts where population = (SELECT MIN(population) from facts)"
pd.read_sql(q3, conn)

Out[23]:

	id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate
0	250	ay	Antarctica	None	280000	None	0	None	None	None	None

Seems reasonable that the population is zero since Anatartica is the southernmost continent in the Artic circle. This also match with the CIA Factbook page for Antartica.

Now, let's explore the countries with 7.2 billion population

Query to return countrie(s) with a populaiton of `7.2` billion¶

In [24]:

q4 = "SELECT * FROM facts where population = (SELECT MAX(population) from facts)"
pd.read_sql(q4, conn)

Out[24]:

	id	code	name	area	area_land	area_water	population	population_growth	birth_rate	death_rate	migration_rate
0	261	xx	World	None	None	None	7256490011	1.08	18.6	7.8	None

Seems that this record is not a legit one. Possibly, the name = World in the dataset sums all the value across all the countries in the world and assigns it to the World row.

Histograms¶

Now let's generate histograms for some columns:

population
population_growth
birth_rate
death_rate

In [25]:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

q5 = "SELECT population, population_growth, birth_rate, death_rate from facts WHERE population != (SELECT MIN(population) from facts) and population != (SELECT MAX(population) FROM facts)"

hist_df = pd.read_sql(q5,conn)

hist_df.hist()

Out[25]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8efbe10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8ec5cf8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8e12b38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8dd6160>]],
      dtype=object)

Query to find countrie(s) that have the highest population density¶

Population density is nothing but the ratio of population to land area.

In [30]:

q5 = "SELECT name, CAST(population as Float)/CAST(area_land as Float) as population_density FROM facts ORDER BY population_density desc"
pop_den_df= pd.read_sql(q5, conn)
pop_den_df

Out[30]:

	name	population_density
0	Macau	21168.964286
1	Monaco	15267.500000
2	Singapore	8259.784571
3	Hong Kong	6655.271202
4	Gaza Strip	5191.819444
5	Gibraltar	4876.333333
6	Bahrain	1771.859211
7	Maldives	1319.640940
8	Malta	1310.015823
9	Bermuda	1299.925926
10	Bangladesh	1297.977606
11	Sint Maarten	1167.323529
12	Guernsey	847.179487
13	Jersey	838.741379
14	Taiwan	725.825356
15	Barbados	675.823256
16	Mauritius	660.013300
17	Aruba	623.122222
18	Lebanon	604.565103
19	Saint Martin	588.037037
20	San Marino	541.311475
21	Rwanda	513.285755
22	Korea, South	506.760173
23	Netherlands	500.041424
24	West Bank	493.859220
25	Nauru	454.285714
26	India	420.993721
27	Burundi	418.312928
28	Tuvalu	418.038462
29	Puerto Rico	405.677227
...	...	...
231	Greenland	0.026653
232	Antarctica	0.000000
233	Ethiopia	NaN
234	South Sudan	NaN
235	Sudan	NaN
236	Holy See (Vatican City)	NaN
237	European Union	NaN
238	Ashmore and Cartier Islands	NaN
239	Coral Sea Islands	NaN
240	Heard Island and McDonald Islands	NaN
241	Clipperton Island	NaN
242	French Southern and Antarctic Lands	NaN
243	Saint Barthelemy	NaN
244	Bouvet Island	NaN
245	Jan Mayen	NaN
246	Akrotiri	NaN
247	British Indian Ocean Territory	NaN
248	Dhekelia	NaN
249	South Georgia and South Sandwich Islands	NaN
250	Navassa Island	NaN
251	Wake Island	NaN
252	United States Pacific Island Wildlife Refuges	NaN
253	Paracel Islands	NaN
254	Spratly Islands	NaN
255	Arctic Ocean	NaN
256	Atlantic Ocean	NaN
257	Indian Ocean	NaN
258	Pacific Ocean	NaN
259	Southern Ocean	NaN
260	World	NaN

261 rows × 2 columns

The wikipedia page also aligns with results . Also, this MACAU article vicariously suggests the reason behind the high dense population of the city.

In [31]:

pop_den_df.hist(column = 'population_density')

Out[31]:

array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8b500f0>]],
      dtype=object)

Query to return countrie(s) with highest ratios of water to land¶

In [32]:

q6 = "SELECT name, CAST(area_water as Float)/CAST(area_land as Float) as water_to_land from facts ORDER BY water_to_land desc"
pd.read_sql(q6,conn)

Out[32]:

	name	water_to_land
0	British Indian Ocean Territory	905.666667
1	Virgin Islands	4.520231
2	Puerto Rico	0.554791
3	Bahamas, The	0.386613
4	Guinea-Bissau	0.284673
5	Malawi	0.259396
6	Netherlands	0.225710
7	Uganda	0.222922
8	Eritrea	0.164356
9	Liberia	0.156240
10	Bangladesh	0.140509
11	Gambia, The	0.116601
12	Taiwan	0.115313
13	Finland	0.112996
14	India	0.105634
15	Canada	0.098000
16	Sweden	0.097384
17	Colombia	0.096476
18	Brunei	0.094967
19	Guyana	0.092050
20	French Polynesia	0.088842
21	Nicaragua	0.086507
22	Burundi	0.083723
23	Iran	0.076130
24	United States	0.072551
25	Tanzania	0.069429
26	Vietnam	0.068178
27	Rwanda	0.067699
28	Estonia	0.067000
29	Norway	0.064151
...	...	...
231	Turks and Caicos Islands	0.000000
232	American Samoa	0.000000
233	Guam	0.000000
234	Navassa Island	0.000000
235	Northern Mariana Islands	0.000000
236	Wake Island	0.000000
237	Gaza Strip	0.000000
238	Paracel Islands	0.000000
239	Spratly Islands	0.000000
240	Western Sahara	0.000000
241	Ethiopia	NaN
242	New Zealand	NaN
243	South Sudan	NaN
244	Sudan	NaN
245	Holy See (Vatican City)	NaN
246	European Union	NaN
247	Greenland	NaN
248	French Southern and Antarctic Lands	NaN
249	Saint Barthelemy	NaN
250	Saint Martin	NaN
251	Akrotiri	NaN
252	Dhekelia	NaN
253	United States Pacific Island Wildlife Refuges	NaN
254	Antarctica	NaN
255	Arctic Ocean	NaN
256	Atlantic Ocean	NaN
257	Indian Ocean	NaN
258	Pacific Ocean	NaN
259	Southern Ocean	NaN
260	World	NaN