Analyzing CIA factbook data using sqlite and python¶

In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like:

  • population - The population as of 2015.
  • population_growth - The annual population growth rate, as a percentage.
  • area - The total land and water area.

The github link to dowload the sqlite database factbook.db is here.

First,we will see which tables are there currently in the database using query SELECT * FROM sqlite_master WHERE type='table' which will return all the objects in the database which are of type tables.

In [20]:
import sqlite3
import pandas as pd


conn = sqlite3.connect("factbook.db")
q0 = "SELECT * FROM sqlite_master WHERE type='table';"
pd.read_sql(q0, conn)
Out[20]:
type name tbl_name rootpage sql
0 table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
1 table facts facts 47 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY...

Okay, looks like the info is stored in facts table.Lets check fetch first five rows of the table.

In [21]:
q1 = "Select * from facts LIMIT 5;"
pd.read_sql(q1,conn)
Out[21]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
0 1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51
1 2 al Albania 28748 27398 1350 3029278 0.30 12.92 6.58 3.30
2 3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92
3 4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.00
4 5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46

Now, let's run some couple of queries.

Summary Statistics¶

Calculate summary statistics to find the minimum population, maximum population , minimum population growth and maximum population growth

In [22]:
q2 = "SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth) FROM facts LIMIT 5"
pd.read_sql(q2, conn)
Out[22]:
MIN(population) MAX(population) MIN(population_growth) MAX(population_growth)
0 0 7256490011 0.0 4.02

Look's like there is a country with a population of 0 and one with population of 7.2 billion.Let's take a look at this countries.

Query to return countrie(s) with a population of 0.¶

In [23]:
q3 = "SELECT * FROM facts where population = (SELECT MIN(population) from facts)"
pd.read_sql(q3, conn)
Out[23]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
0 250 ay Antarctica None 280000 None 0 None None None None

Seems reasonable that the population is zero since Anatartica is the southernmost continent in the Artic circle. This also match with the CIA Factbook page for Antartica.

Now, let's explore the countries with 7.2 billion population

Query to return countrie(s) with a populaiton of 7.2 billion¶

In [24]:
q4 = "SELECT * FROM facts where population = (SELECT MAX(population) from facts)"
pd.read_sql(q4, conn)
Out[24]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate
0 261 xx World None None None 7256490011 1.08 18.6 7.8 None

Seems that this record is not a legit one. Possibly, the name = World in the dataset sums all the value across all the countries in the world and assigns it to the World row.

Histograms¶

Now let's generate histograms for some columns:

  • population
  • population_growth
  • birth_rate
  • death_rate
In [25]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

q5 = "SELECT population, population_growth, birth_rate, death_rate from facts WHERE population != (SELECT MIN(population) from facts) and population != (SELECT MAX(population) FROM facts)"



hist_df = pd.read_sql(q5,conn)

hist_df.hist()
Out[25]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8efbe10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8ec5cf8>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8e12b38>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8dd6160>]],
      dtype=object)

Query to find countrie(s) that have the highest population density¶

Population density is nothing but the ratio of population to land area.

In [30]:
q5 = "SELECT name, CAST(population as Float)/CAST(area_land as Float) as population_density FROM facts ORDER BY population_density desc"
pop_den_df= pd.read_sql(q5, conn)
pop_den_df
Out[30]:
name population_density
0 Macau 21168.964286
1 Monaco 15267.500000
2 Singapore 8259.784571
3 Hong Kong 6655.271202
4 Gaza Strip 5191.819444
5 Gibraltar 4876.333333
6 Bahrain 1771.859211
7 Maldives 1319.640940
8 Malta 1310.015823
9 Bermuda 1299.925926
10 Bangladesh 1297.977606
11 Sint Maarten 1167.323529
12 Guernsey 847.179487
13 Jersey 838.741379
14 Taiwan 725.825356
15 Barbados 675.823256
16 Mauritius 660.013300
17 Aruba 623.122222
18 Lebanon 604.565103
19 Saint Martin 588.037037
20 San Marino 541.311475
21 Rwanda 513.285755
22 Korea, South 506.760173
23 Netherlands 500.041424
24 West Bank 493.859220
25 Nauru 454.285714
26 India 420.993721
27 Burundi 418.312928
28 Tuvalu 418.038462
29 Puerto Rico 405.677227
... ... ...
231 Greenland 0.026653
232 Antarctica 0.000000
233 Ethiopia NaN
234 South Sudan NaN
235 Sudan NaN
236 Holy See (Vatican City) NaN
237 European Union NaN
238 Ashmore and Cartier Islands NaN
239 Coral Sea Islands NaN
240 Heard Island and McDonald Islands NaN
241 Clipperton Island NaN
242 French Southern and Antarctic Lands NaN
243 Saint Barthelemy NaN
244 Bouvet Island NaN
245 Jan Mayen NaN
246 Akrotiri NaN
247 British Indian Ocean Territory NaN
248 Dhekelia NaN
249 South Georgia and South Sandwich Islands NaN
250 Navassa Island NaN
251 Wake Island NaN
252 United States Pacific Island Wildlife Refuges NaN
253 Paracel Islands NaN
254 Spratly Islands NaN
255 Arctic Ocean NaN
256 Atlantic Ocean NaN
257 Indian Ocean NaN
258 Pacific Ocean NaN
259 Southern Ocean NaN
260 World NaN

261 rows × 2 columns

The wikipedia page also aligns with results . Also, this MACAU article vicariously suggests the reason behind the high dense population of the city.

In [31]:
pop_den_df.hist(column = 'population_density')
Out[31]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8b500f0>]],
      dtype=object)

Query to return countrie(s) with highest ratios of water to land¶

In [32]:
q6 = "SELECT name, CAST(area_water as Float)/CAST(area_land as Float) as water_to_land from facts ORDER BY water_to_land desc"
pd.read_sql(q6,conn)
Out[32]:
name water_to_land
0 British Indian Ocean Territory 905.666667
1 Virgin Islands 4.520231
2 Puerto Rico 0.554791
3 Bahamas, The 0.386613
4 Guinea-Bissau 0.284673
5 Malawi 0.259396
6 Netherlands 0.225710
7 Uganda 0.222922
8 Eritrea 0.164356
9 Liberia 0.156240
10 Bangladesh 0.140509
11 Gambia, The 0.116601
12 Taiwan 0.115313
13 Finland 0.112996
14 India 0.105634
15 Canada 0.098000
16 Sweden 0.097384
17 Colombia 0.096476
18 Brunei 0.094967
19 Guyana 0.092050
20 French Polynesia 0.088842
21 Nicaragua 0.086507
22 Burundi 0.083723
23 Iran 0.076130
24 United States 0.072551
25 Tanzania 0.069429
26 Vietnam 0.068178
27 Rwanda 0.067699
28 Estonia 0.067000
29 Norway 0.064151
... ... ...
231 Turks and Caicos Islands 0.000000
232 American Samoa 0.000000
233 Guam 0.000000
234 Navassa Island 0.000000
235 Northern Mariana Islands 0.000000
236 Wake Island 0.000000
237 Gaza Strip 0.000000
238 Paracel Islands 0.000000
239 Spratly Islands 0.000000
240 Western Sahara 0.000000
241 Ethiopia NaN
242 New Zealand NaN
243 South Sudan NaN
244 Sudan NaN
245 Holy See (Vatican City) NaN
246 European Union NaN
247 Greenland NaN
248 French Southern and Antarctic Lands NaN
249 Saint Barthelemy NaN
250 Saint Martin NaN
251 Akrotiri NaN
252 Dhekelia NaN
253 United States Pacific Island Wildlife Refuges NaN
254 Antarctica NaN
255 Arctic Ocean NaN
256 Atlantic Ocean NaN
257 Indian Ocean NaN
258 Pacific Ocean NaN
259 Southern Ocean NaN
260 World NaN

261 rows × 2 columns