Visualizing earnings based on majors¶

In this project, we use visualizations to analyze the job outcomes of students who graduated from college between 2010 and 2012. The original data on job outcomes was released by the American Community Survey, which conducts surveys and aggregates the data. FiveThirtyEight cleaned the dataset and released it on their GitHub repo.

Using visualizations, we can start to explore questions from the dataset like:

  • Do students in more popular majors make more money?
  • How many majors are predominantly male? Predominantly female?
  • Which category of majors has the most students?

Let's import the required libraries and start exploring.

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

recent_grads = pd.read_csv('recent-grads.csv')
print(recent_grads.iloc[0,:])
Rank                                        1
Major_code                               2419
Major                   PETROLEUM ENGINEERING
Total                                    2339
Men                                      2057
Women                                     282
Major_category                    Engineering
ShareWomen                           0.120564
Sample_size                                36
Employed                                 1976
Full_time                                1849
Part_time                                 270
Full_time_year_round                     1207
Unemployed                                 37
Unemployment_rate                   0.0183805
Median                                 110000
P25th                                   95000
P75th                                  125000
College_jobs                             1534
Non_college_jobs                          364
Low_wage_jobs                             193
Name: 0, dtype: object
In [7]:
print(recent_grads.head())
   Rank  Major_code                                      Major    Total  \
0     1        2419                      PETROLEUM ENGINEERING   2339.0   
1     2        2416             MINING AND MINERAL ENGINEERING    756.0   
2     3        2415                  METALLURGICAL ENGINEERING    856.0   
3     4        2417  NAVAL ARCHITECTURE AND MARINE ENGINEERING   1258.0   
4     5        2405                       CHEMICAL ENGINEERING  32260.0   

       Men    Women Major_category  ShareWomen  Sample_size  Employed  \
0   2057.0    282.0    Engineering    0.120564           36      1976   
1    679.0     77.0    Engineering    0.101852            7       640   
2    725.0    131.0    Engineering    0.153037            3       648   
3   1123.0    135.0    Engineering    0.107313           16       758   
4  21239.0  11021.0    Engineering    0.341631          289     25694   

       ...        Part_time  Full_time_year_round  Unemployed  \
0      ...              270                  1207          37   
1      ...              170                   388          85   
2      ...              133                   340          16   
3      ...              150                   692          40   
4      ...             5180                 16697        1672   

   Unemployment_rate  Median  P25th   P75th  College_jobs  Non_college_jobs  \
0           0.018381  110000  95000  125000          1534               364   
1           0.117241   75000  55000   90000           350               257   
2           0.024096   73000  50000  105000           456               176   
3           0.050125   70000  43000   80000           529               102   
4           0.061098   65000  50000   75000         18314              4440   

   Low_wage_jobs  
0            193  
1             50  
2              0  
3              0  
4            972  

[5 rows x 21 columns]
In [8]:
print(recent_grads.describe())
             Rank   Major_code          Total            Men          Women  \
count  173.000000   173.000000     172.000000     172.000000     172.000000   
mean    87.000000  3879.815029   39370.081395   16723.406977   22646.674419   
std     50.084928  1687.753140   63483.491009   28122.433474   41057.330740   
min      1.000000  1100.000000     124.000000     119.000000       0.000000   
25%     44.000000  2403.000000    4549.750000    2177.500000    1778.250000   
50%     87.000000  3608.000000   15104.000000    5434.000000    8386.500000   
75%    130.000000  5503.000000   38909.750000   14631.000000   22553.750000   
max    173.000000  6403.000000  393735.000000  173809.000000  307087.000000   

       ShareWomen  Sample_size       Employed      Full_time      Part_time  \
count  172.000000   173.000000     173.000000     173.000000     173.000000   
mean     0.522223   356.080925   31192.763006   26029.306358    8832.398844   
std      0.231205   618.361022   50675.002241   42869.655092   14648.179473   
min      0.000000     2.000000       0.000000     111.000000       0.000000   
25%      0.336026    39.000000    3608.000000    3154.000000    1030.000000   
50%      0.534024   130.000000   11797.000000   10048.000000    3299.000000   
75%      0.703299   338.000000   31433.000000   25147.000000    9948.000000   
max      0.968954  4212.000000  307933.000000  251540.000000  115172.000000   

       Full_time_year_round    Unemployed  Unemployment_rate         Median  \
count            173.000000    173.000000         173.000000     173.000000   
mean           19694.427746   2416.329480           0.068191   40151.445087   
std            33160.941514   4112.803148           0.030331   11470.181802   
min              111.000000      0.000000           0.000000   22000.000000   
25%             2453.000000    304.000000           0.050306   33000.000000   
50%             7413.000000    893.000000           0.067961   36000.000000   
75%            16891.000000   2393.000000           0.087557   45000.000000   
max           199897.000000  28169.000000           0.177226  110000.000000   

              P25th          P75th   College_jobs  Non_college_jobs  \
count    173.000000     173.000000     173.000000        173.000000   
mean   29501.445087   51494.219653   12322.635838      13284.497110   
std     9166.005235   14906.279740   21299.868863      23789.655363   
min    18500.000000   22000.000000       0.000000          0.000000   
25%    24000.000000   42000.000000    1675.000000       1591.000000   
50%    27000.000000   47000.000000    4390.000000       4595.000000   
75%    33000.000000   60000.000000   14444.000000      11783.000000   
max    95000.000000  125000.000000  151643.000000     148395.000000   

       Low_wage_jobs  
count     173.000000  
mean     3859.017341  
std      6944.998579  
min         0.000000  
25%       340.000000  
50%      1231.000000  
75%      3466.000000  
max     48207.000000  
In [9]:
raw_data_count = recent_grads.shape[0]
recent_grads = recent_grads.dropna()
print(raw_data_count)
print(recent_grads.shape[0])
173
172
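
Before dropping rows, it can help to see which ones contain missing values. The sketch below uses a small made-up frame (the `toy` values are illustrative and not taken from the dataset) to show one way to count missing values per column and list the rows that `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Toy data standing in for recent_grads; the values are made up for illustration.
toy = pd.DataFrame({
    "Major": ["A", "B", "C"],
    "Total": [100.0, np.nan, 300.0],
    "Median": [50000, 40000, 30000],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Rows that dropna() would remove (at least one missing value).
incomplete_rows = toy[toy.isna().any(axis=1)]

cleaned = toy.dropna()
print(missing_per_column)
print(incomplete_rows)
print(len(toy), len(cleaned))
```

Applied to `recent_grads` before the `dropna()` call, the same two expressions would show which major accounts for the drop from 173 to 172 rows.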

Let's explore the distributions of some columns using histograms.

In [10]:
recent_grads['Sample_size'].hist(bins = 30, range = (0,2500))
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc602e0e278>
In [11]:
recent_grads['Median'].hist(range= (15000,80000))
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc602d3c780>
In [12]:
recent_grads.loc[recent_grads['ShareWomen']<0.5,'Men'].hist()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc602db1518>

What percent of majors are predominantly male? Predominantly female?¶

In [13]:
recent_grads['ShareWomen'].hist()
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc600b23710>

The bars sum to the total number of majors, which is 172 after dropping the row with missing values. Majors are predominantly female where ShareWomen is above 0.5; adding up the bars from ShareWomen = 0.5 upward gives approximately 89. That leaves roughly 172 - 89 = 83 predominantly male majors. So, reading from the histogram, about 48% of majors are predominantly male and about 52% are predominantly female.
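
Reading counts off histogram bars is approximate. As a minimal sketch, the same percentages can be computed directly from the ShareWomen column with a boolean mask. The six values below are made up for illustration; only the column name comes from the dataset.

```python
import pandas as pd

# Made-up ShareWomen values for six hypothetical majors (not the real data).
toy = pd.DataFrame({"ShareWomen": [0.12, 0.34, 0.48, 0.55, 0.70, 0.97]})

# The mean of a boolean mask is the fraction of rows where it is True.
pct_female_majority = (toy["ShareWomen"] > 0.5).mean() * 100
pct_male_majority = (toy["ShareWomen"] < 0.5).mean() * 100

print(round(pct_male_majority, 1), round(pct_female_majority, 1))
```

Running the two mask expressions on `recent_grads` would give exact figures to compare against the estimate read from the histogram.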

Do students in more popular majors make more money?¶

We will look at a scatter plot of Sample_size vs Median. Sample_size is used here as a proxy for popularity: assuming that more popular majors have larger sample sizes, we check whether the median salary rises as the sample size grows.

In [14]:
recent_grads.plot(x='Sample_size',y='Median', kind = 'scatter')
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc600c0d898>

Looking at the scatter plot above, the median salary does not rise with sample size; if anything, it falls as majors become more popular. Only a few majors have a large sample size, but visually, once Sample_size exceeds 1000, the median salaries start to drop.
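
A rank correlation is one way to put a number on this visual impression without assuming a straight-line relationship. The sketch below computes a Spearman correlation as the Pearson correlation of the ranks. The five rows are made up for illustration; only the column names match the dataset.

```python
import pandas as pd

# Made-up values for five hypothetical majors (not the real data).
toy = pd.DataFrame({
    "Sample_size": [10, 50, 200, 1000, 4000],
    "Median": [70000, 45000, 50000, 38000, 33000],
})

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
# A value near -1 means the median salary tends to fall as sample size grows.
rho = toy["Sample_size"].rank().corr(toy["Median"].rank())
print(round(rho, 2))
```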

Is there any link between the number of full-time employees and median salary?¶

We will look at the scatter plot of Full_time vs Median to find out.

In [15]:
recent_grads.plot(x='Full_time',y='Median',kind = 'scatter')
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc600ace9b0>

Majors with fewer than 1,000 full-time employees show a wide range of salaries. These are also the majors with the highest salaries: the dots line up near the 0 tick, with the highest median salary around 110k dollars. However, once the number of full-time employees exceeds 50,000, the median salary tends to drop below 60k dollars.

Additionally, we could run a regression to quantify the strength of this relationship, although it is clearly not linear. We would also have to restrict the range of the full-time count before performing statistical tests.
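
As a rough sketch of what such a regression could look like, the example below restricts the range of Full_time and fits a straight line to Median against the base-10 logarithm of Full_time using `numpy.polyfit`. The log transform, the range bounds, and the four data points are all assumptions chosen for illustration, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for four hypothetical majors (not the real data).
toy = pd.DataFrame({
    "Full_time": [100, 1000, 10000, 100000],
    "Median": [60000, 50000, 40000, 30000],
})

# Restrict the range of the full-time count (the bounds here are arbitrary).
subset = toy[toy["Full_time"].between(100, 100000)]

# Least-squares fit of Median = slope * log10(Full_time) + intercept.
x = np.log10(subset["Full_time"].to_numpy(dtype=float))
y = subset["Median"].to_numpy(dtype=float)
slope, intercept = np.polyfit(x, y, deg=1)

# R^2: the share of variance in Median explained by the fitted line.
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()
print(slope, intercept, r_squared)
```

On real data the fit would be far noisier than in this toy example, so the R^2 value is the number to look at when judging how strong the relationship is.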

Do students that majored in subjects that were majority female make more money?¶

We will look at scatter plots of Men vs Median and Women vs Median to find out.

In [16]:
# recent_grads.plot(x='ShareWomen',y='Unemployment_rate',kind = 'scatter')
recent_grads.plot(x='Men',y='Median',kind = 'scatter')
recent_grads.plot(x='Women',y='Median',kind = 'scatter')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc600ac2ba8>

Compare the scatter plots of Men vs Median and Women vs Median. Majors with 50,000 or more women tend to have median salaries below 60k, while majors with 50,000 or more men tend to have slightly higher salaries. Majors where the counts of men and women are in the thousands show the same salary range, roughly 20k to 80k. The outlier looks the same in both scatter plots, which is expected: each point is a major, and the 110k outlier is the same major in both plots.
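
The counts of men and women mostly reflect how large a major is, so a more direct check is to split majors by ShareWomen and compare typical salaries in each group. The following is a minimal sketch on made-up rows; the `Majority` column is a hypothetical helper introduced only for this example.

```python
import pandas as pd

# Made-up values for six hypothetical majors (not the real data).
toy = pd.DataFrame({
    "ShareWomen": [0.10, 0.30, 0.45, 0.60, 0.75, 0.90],
    "Median": [70000, 60000, 50000, 40000, 35000, 30000],
})

# Label each major by which group is in the majority (hypothetical helper column).
toy["Majority"] = toy["ShareWomen"].gt(0.5).map({True: "female", False: "male"})

# Median of the per-major median salaries within each group.
salary_by_majority = toy.groupby("Majority")["Median"].median()
print(salary_by_majority)
```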

What's the most common median salary range?¶

Above we saw that Median has an outlier at 110k dollars, so we exclude it by limiting the histogram range to 0-80k. We also increase the number of bins to 60, a value chosen by experimenting.

In [17]:
med_hist = recent_grads['Median'].hist(bins=60, range=(0,80000))
med_hist.set_title('Median salary distribution')
Out[17]:
<matplotlib.text.Text at 0x7fc6008ffac8>

The histogram has three peaks in the salary range between 30k and 40k dollars, so the most common median salary range is roughly 32k to 40k dollars.
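
The most populated bin can also be found numerically rather than by eye. The sketch below bins salaries with `pd.cut` and counts the majors per bin. The salary values and the 5k bin width are made up for illustration.

```python
import pandas as pd

# Made-up median salaries for ten hypothetical majors (not the real data).
medians = pd.Series([28000, 31000, 33000, 36000, 36500,
                     38000, 39000, 45000, 52000, 110000])

# 5k-wide bins from 20k to 80k; values outside this range (the 110k outlier)
# are assigned NaN and therefore left out of the counts.
bins = list(range(20000, 85000, 5000))
binned = pd.cut(medians, bins=bins)

# Count majors per bin and pick the most populated one.
counts = binned.value_counts()
most_common_bin = counts.idxmax()
print(most_common_bin, counts.max())
```

The answer depends on the bin width and bin edges, just as the histogram's appearance depends on the `bins` argument.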

Using scatter matrix plots to explore previous questions¶

In [18]:
from pandas.plotting import scatter_matrix


scatter_matrix(recent_grads[['Sample_size','Median']], hist_kwds = {'bins':20} ,figsize = (5,5))
Out[18]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fc6007eee80>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc6007d7160>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fc60079f9b0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc60075a828>]],
      dtype=object)

In the Median histogram, the most common median salaries lie between 32k and 42k. In the scatter plot of Sample_size vs Median, the salary range tends to decrease as the sample size increases, i.e. as the major becomes more popular. The scatter plot of Median vs Sample_size shows the same relationship with the axes swapped. Thus, we explored two previous questions:

  • Do students in more popular majors make more money?
  • What's the most common median salary range?

Let's create another scatter matrix plot to explore another question.

In [19]:
scatter_matrix(recent_grads[['Men','Women','Median']])
Out[19]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fc6006684e0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc600653e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc6006228d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fc6005e04a8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc6005276d8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc6004e3fd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7fc6004b1cc0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc600467e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7fc60043b198>]],
      dtype=object)

Comparing the scatter plots of Men vs Median and Women vs Median, we can see that as the number of men in a major increases, the maximum median salary settles at about 60k, whereas as the number of women increases, it drops below 60k. This suggests that students in majority-female majors earn less than students in majority-male majors.

Another conclusion is that as the number of students (both men and women) increases, the salary range narrows and the maximum median salary drops. Thus, students in more popular majors do not make more money.

Thus, we answered two previous questions.

  • Do students in more popular majors make more money?
  • Do students that majored in subjects that were majority female make more money?

Scatter matrix plots can save a lot of time.
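
A correlation matrix is a compact numerical companion to a scatter matrix: one number per pair of columns instead of one plot. The sketch below uses `DataFrame.corr()` on made-up rows; only the column names match the dataset.

```python
import pandas as pd

# Made-up values for five hypothetical majors (not the real data).
toy = pd.DataFrame({
    "Men": [2000, 700, 21000, 5000, 90000],
    "Women": [300, 80, 11000, 9000, 150000],
    "Median": [110000, 75000, 65000, 40000, 35000],
})

# Pairwise Pearson correlations for the same columns passed to scatter_matrix.
corr = toy[["Men", "Women", "Median"]].corr()
print(corr.round(2))
```

Pearson correlation only measures linear association, so it complements the scatter plots and does not replace them.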

Use bar plots to compare the share of women (ShareWomen) in the first ten rows and last ten rows¶

In [20]:
import numpy as np
# Use the Major column for the bar labels so each bar is identifiable.
ax1 = recent_grads[:10].plot.bar(x='Major', y='ShareWomen')
ax1.set_title('Share of women in the first ten rows')

ax2 = recent_grads[-10:].plot.bar(x='Major', y='ShareWomen')
ax2.set_title('Share of women in the last ten rows')
Out[20]:
<matplotlib.text.Text at 0x7fc6002a04a8>

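The first ten rows of the dataset have, on average, a share of women below 0.2, whereas the last ten rows have, on average, a share of women above 0.6. Because the rows are ordered by Rank, with the highest median salaries first, this means the best-paid majors are predominantly male and the lowest-paid majors are predominantly female.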

Use bar plots to compare the unemployment_rate from the first ten rows and last ten rows.¶

In [21]:
# Use the Major column for the bar labels so each bar is identifiable.
ax3 = recent_grads[:10].plot.bar(x='Major', y='Unemployment_rate')
ax3.set_title('Unemployment rate in the first ten rows')
ax3.set_xlabel('First 10 rows from the dataset')

ax4 = recent_grads[-10:].plot.bar(x='Major', y='Unemployment_rate')
ax4.set_title('Unemployment rate in the last ten rows')
ax4.set_xlabel('Last 10 rows from the dataset')
Out[21]:
<matplotlib.text.Text at 0x7fc600147c88>

Use a grouped bar plot to compare the number of men with the number of women in each category of majors¶

In [22]:
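# Sum the numeric columns per category. Only count columns such as Men and Women
# are meaningful as sums; summed ranks, rates, and medians are not.
final_df = recent_grads.groupby('Major_category').sum(numeric_only=True).reset_index()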
final_df.head()
Out[22]:
                    Major_category  Rank  Major_code      Total       Men     Women  \
0  Agriculture & Natural Resources   993       10421    75620.0   40357.0   35263.0   
1                             Arts  1049       48121   357130.0  134390.0  222740.0   
2           Biology & Life Science  1335       48662   453862.0  184919.0  268943.0   
3                         Business   726       80769  1302376.0  667852.0  634524.0   
4      Communications & Journalism   416        7610   392601.0  131921.0  260680.0   

   ShareWomen  Sample_size  Employed  Full_time  Part_time  Full_time_year_round  Unemployed  \
0    3.647407         1068     63794      55585      15470                 41891        3486   
1    4.829264         3260    288114     207773     114791                153111       28228   
2    8.220700         2317    302797     240377     116736                165802       22854   
3    6.281573        15505   1088742     988870     196936                790425       79877   
4    2.633536         4508    330660     273330      89817                214228       26852   

   Unemployment_rate  Median   P25th   P75th  College_jobs  Non_college_jobs  Low_wage_jobs  
0           0.466352  316000  222000  410100         18677             33217           7414  
1           0.721382  264500  175700  349300         94785            163720          60116  
2           0.852849  509900  372600  645200        151233            127182          42742  
3           0.923826  566000  435000  713000        148538            496570         126788  
4           0.302151  138000  105000  179900         86556            172992          49595  
In [23]:
final_df.plot(x='Major_category', y =['Men','Women'], kind='bar')
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/plotting/_core.py:1716: UserWarning:

Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.pydata.org/pandas-docs/stable/indexing.html#attribute-access

Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc6001cd320>

The Engineering, Computers & Mathematics, and Business categories have fewer women than men, and the gap is largest in Engineering. In categories such as Education, Psychology and Social Work, and Health, women far outnumber men. This suggests that men enroll more often in categories associated with higher salaries, while women enroll more often in categories geared towards helping the community.
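
To rank categories by gender balance instead of comparing bar heights, the share of women can be recomputed from the summed counts. This is a minimal sketch on made-up rows; the category names are used only as labels and the counts are not the real totals.

```python
import pandas as pd

# Made-up counts for four hypothetical majors in two categories (not the real data).
toy = pd.DataFrame({
    "Major_category": ["Engineering", "Engineering", "Education", "Education"],
    "Men": [2000, 21000, 1000, 3000],
    "Women": [300, 11000, 4000, 12000],
})

# Sum only the count columns for each category.
by_category = toy.groupby("Major_category")[["Men", "Women"]].sum()

# Share of women within each category, derived from the summed counts.
by_category["ShareWomen"] = by_category["Women"] / (by_category["Men"] + by_category["Women"])
print(by_category.sort_values("ShareWomen"))
```

Deriving the share from summed counts avoids using the summed ShareWomen column, which adds up per-major fractions and is not itself a share.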

Use a box plot to explore the distributions of median salaries and unemployment rate.¶

In [24]:
recent_grads.boxplot(column=['Unemployment_rate'])
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc600003080>

From the above box plot, the median unemployment rate is about 0.068. There are also potential outliers above 0.14.

In [25]:
recent_grads.boxplot(column=['Median'])
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5fff9a390>

From the above box plot, the median of the median salaries across all majors is about 36k, and the middle half of the majors lies between roughly 33k and 45k. There appear to be outliers above 60k, with the highest at 110k.
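
The outlier thresholds read off the two box plots above can also be computed. With matplotlib's default whisker setting, points farther than 1.5 times the interquartile range (IQR) from the quartiles are drawn as outliers. The sketch below applies that rule to made-up salaries; `iqr_fences` is a hypothetical helper name introduced for this example.

```python
import pandas as pd

def iqr_fences(values: pd.Series) -> tuple:
    """Return (lower, upper) fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up median salaries for six hypothetical majors (not the real data).
medians = pd.Series([30000, 33000, 36000, 45000, 48000, 110000])

lower, upper = iqr_fences(medians)

# Values outside the fences are the points a box plot would draw individually.
outliers = medians[(medians < lower) | (medians > upper)]
print(lower, upper, list(outliers))
```

The same helper could be applied to `recent_grads['Unemployment_rate']` and `recent_grads['Median']` to check the thresholds estimated from the plots.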

Use a hexagonal bin plot to visualize the columns that had dense scatter plots from earlier in the project.¶

Let's revisit the dense scatter plots of Sample_size vs Median and Full_time vs Median.

In [26]:
recent_grads.plot.hexbin(x='Sample_size',y='Median',gridsize=10)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5fffcada0>
In [27]:
recent_grads.plot.hexbin(x='Full_time',y='Median',gridsize=10)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fc5ffe4dc88>

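The two hexagonal bin plots show the same pattern: in both, most majors are concentrated in the bins near the left edge, at small values of Sample_size or Full_time.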

Conclusion¶

In this project, we visualized earnings based on college majors. We analyzed the data using scatter plots, histograms, bar plots, scatter matrix plots, and box plots, as well as more specialized plots such as grouped bar plots and hexagonal bin plots.