In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like:
population
- The population as of 2015
.population_growth
- The annual population growth rate, as a percentage.area
- The total land and water area.The github link to dowload the sqlite database factbook.db
is here.
First,we will see which tables are there currently in the database using query SELECT * FROM sqlite_master WHERE type='table'
which will return all the objects in the database which are of type tables
.
import sqlite3
import pandas as pd
conn = sqlite3.connect("factbook.db")
q0 = "SELECT * FROM sqlite_master WHERE type='table';"
pd.read_sql(q0, conn)
type | name | tbl_name | rootpage | sql | |
---|---|---|---|---|---|
0 | table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
1 | table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY... |
Okay, looks like the info is stored in facts
table.Lets check fetch first five rows of the table.
q1 = "Select * from facts LIMIT 5;"
pd.read_sql(q1,conn)
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
1 | 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.30 | 12.92 | 6.58 | 3.30 |
2 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
3 | 4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.00 |
4 | 5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
Now, let's run some couple of queries.
Calculate summary statistics to find the minimum population
, maximum population
, minimum population growth
and maximum population growth
q2 = "SELECT MIN(population), MAX(population), MIN(population_growth), MAX(population_growth) FROM facts LIMIT 5"
pd.read_sql(q2, conn)
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) | |
---|---|---|---|---|
0 | 0 | 7256490011 | 0.0 | 4.02 |
Look's like there is a country with a population of 0 and one with population of 7.2 billion.Let's take a look at this countries.
0
.¶q3 = "SELECT * FROM facts where population = (SELECT MIN(population) from facts)"
pd.read_sql(q3, conn)
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 250 | ay | Antarctica | None | 280000 | None | 0 | None | None | None | None |
Seems reasonable that the population is zero since Anatartica is the southernmost continent in the Artic circle. This also match with the CIA Factbook page for Antartica.
Now, let's explore the countries with 7.2 billion population
7.2
billion¶q4 = "SELECT * FROM facts where population = (SELECT MAX(population) from facts)"
pd.read_sql(q4, conn)
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 261 | xx | World | None | None | None | 7256490011 | 1.08 | 18.6 | 7.8 | None |
Seems that this record is not a legit one. Possibly, the name = World
in the dataset sums all the value across all the countries in the world and assigns it to the World
row.
Now let's generate histograms for some columns:
population
population_growth
birth_rate
death_rate
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
q5 = "SELECT population, population_growth, birth_rate, death_rate from facts WHERE population != (SELECT MIN(population) from facts) and population != (SELECT MAX(population) FROM facts)"
hist_df = pd.read_sql(q5,conn)
hist_df.hist()
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8efbe10>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8ec5cf8>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8e12b38>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8dd6160>]], dtype=object)
Population density is nothing but the ratio of population to land area.
q5 = "SELECT name, CAST(population as Float)/CAST(area_land as Float) as population_density FROM facts ORDER BY population_density desc"
pop_den_df= pd.read_sql(q5, conn)
pop_den_df
name | population_density | |
---|---|---|
0 | Macau | 21168.964286 |
1 | Monaco | 15267.500000 |
2 | Singapore | 8259.784571 |
3 | Hong Kong | 6655.271202 |
4 | Gaza Strip | 5191.819444 |
5 | Gibraltar | 4876.333333 |
6 | Bahrain | 1771.859211 |
7 | Maldives | 1319.640940 |
8 | Malta | 1310.015823 |
9 | Bermuda | 1299.925926 |
10 | Bangladesh | 1297.977606 |
11 | Sint Maarten | 1167.323529 |
12 | Guernsey | 847.179487 |
13 | Jersey | 838.741379 |
14 | Taiwan | 725.825356 |
15 | Barbados | 675.823256 |
16 | Mauritius | 660.013300 |
17 | Aruba | 623.122222 |
18 | Lebanon | 604.565103 |
19 | Saint Martin | 588.037037 |
20 | San Marino | 541.311475 |
21 | Rwanda | 513.285755 |
22 | Korea, South | 506.760173 |
23 | Netherlands | 500.041424 |
24 | West Bank | 493.859220 |
25 | Nauru | 454.285714 |
26 | India | 420.993721 |
27 | Burundi | 418.312928 |
28 | Tuvalu | 418.038462 |
29 | Puerto Rico | 405.677227 |
... | ... | ... |
231 | Greenland | 0.026653 |
232 | Antarctica | 0.000000 |
233 | Ethiopia | NaN |
234 | South Sudan | NaN |
235 | Sudan | NaN |
236 | Holy See (Vatican City) | NaN |
237 | European Union | NaN |
238 | Ashmore and Cartier Islands | NaN |
239 | Coral Sea Islands | NaN |
240 | Heard Island and McDonald Islands | NaN |
241 | Clipperton Island | NaN |
242 | French Southern and Antarctic Lands | NaN |
243 | Saint Barthelemy | NaN |
244 | Bouvet Island | NaN |
245 | Jan Mayen | NaN |
246 | Akrotiri | NaN |
247 | British Indian Ocean Territory | NaN |
248 | Dhekelia | NaN |
249 | South Georgia and South Sandwich Islands | NaN |
250 | Navassa Island | NaN |
251 | Wake Island | NaN |
252 | United States Pacific Island Wildlife Refuges | NaN |
253 | Paracel Islands | NaN |
254 | Spratly Islands | NaN |
255 | Arctic Ocean | NaN |
256 | Atlantic Ocean | NaN |
257 | Indian Ocean | NaN |
258 | Pacific Ocean | NaN |
259 | Southern Ocean | NaN |
260 | World | NaN |
261 rows × 2 columns
The wikipedia page also aligns with results . Also, this MACAU article vicariously suggests the reason behind the high dense population of the city.
pop_den_df.hist(column = 'population_density')
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f37b8b500f0>]], dtype=object)
q6 = "SELECT name, CAST(area_water as Float)/CAST(area_land as Float) as water_to_land from facts ORDER BY water_to_land desc"
pd.read_sql(q6,conn)
name | water_to_land | |
---|---|---|
0 | British Indian Ocean Territory | 905.666667 |
1 | Virgin Islands | 4.520231 |
2 | Puerto Rico | 0.554791 |
3 | Bahamas, The | 0.386613 |
4 | Guinea-Bissau | 0.284673 |
5 | Malawi | 0.259396 |
6 | Netherlands | 0.225710 |
7 | Uganda | 0.222922 |
8 | Eritrea | 0.164356 |
9 | Liberia | 0.156240 |
10 | Bangladesh | 0.140509 |
11 | Gambia, The | 0.116601 |
12 | Taiwan | 0.115313 |
13 | Finland | 0.112996 |
14 | India | 0.105634 |
15 | Canada | 0.098000 |
16 | Sweden | 0.097384 |
17 | Colombia | 0.096476 |
18 | Brunei | 0.094967 |
19 | Guyana | 0.092050 |
20 | French Polynesia | 0.088842 |
21 | Nicaragua | 0.086507 |
22 | Burundi | 0.083723 |
23 | Iran | 0.076130 |
24 | United States | 0.072551 |
25 | Tanzania | 0.069429 |
26 | Vietnam | 0.068178 |
27 | Rwanda | 0.067699 |
28 | Estonia | 0.067000 |
29 | Norway | 0.064151 |
... | ... | ... |
231 | Turks and Caicos Islands | 0.000000 |
232 | American Samoa | 0.000000 |
233 | Guam | 0.000000 |
234 | Navassa Island | 0.000000 |
235 | Northern Mariana Islands | 0.000000 |
236 | Wake Island | 0.000000 |
237 | Gaza Strip | 0.000000 |
238 | Paracel Islands | 0.000000 |
239 | Spratly Islands | 0.000000 |
240 | Western Sahara | 0.000000 |
241 | Ethiopia | NaN |
242 | New Zealand | NaN |
243 | South Sudan | NaN |
244 | Sudan | NaN |
245 | Holy See (Vatican City) | NaN |
246 | European Union | NaN |
247 | Greenland | NaN |
248 | French Southern and Antarctic Lands | NaN |
249 | Saint Barthelemy | NaN |
250 | Saint Martin | NaN |
251 | Akrotiri | NaN |
252 | Dhekelia | NaN |
253 | United States Pacific Island Wildlife Refuges | NaN |
254 | Antarctica | NaN |
255 | Arctic Ocean | NaN |
256 | Atlantic Ocean | NaN |
257 | Indian Ocean | NaN |
258 | Pacific Ocean | NaN |
259 | Southern Ocean | NaN |
260 | World | NaN |
261 rows × 2 columns