In this project, we will be analyzing the popular technology site Hacker News. It is a site started by the startup incubator Y Combinator, where user-submitted stories (known as "posts") are voted and commented upon, similar to Reddit.
We will be using a dataset that contains all the posts that received comments. Below is the data dictionary for the dataset:
id: The unique identifier from Hacker News for the post
title: The title of the post
url: The URL that the post links to, if the post has a URL
num_points: The number of points the post acquired, calculated as the total number of upvotes minus the total number of downvotes
num_comments: The number of comments that were made on the post
author: The username of the person who submitted the post
created_at: The date and time at which the post was submitted
We can use this dataset to find some interesting insights. For instance, users submit Ask HN posts to ask the Hacker News community a specific question, or Show HN posts to show the community a project, a product, or just generally something interesting.
We can also compare these two types of posts to determine the following:
Do Ask HN or Show HN posts receive more comments on average?
First, let's load the dataset and take a glimpse of it.
from csv import reader

# Load the dataset into a list of lists
with open("hacker_news.csv") as f:
    hn = list(reader(f))

print('The first five rows of HN dataset')
print(hn[:5])

# Separate the header row from the data rows
headers = hn[0]
hn = hn[1:]
print('The header for HN dataset', headers)
print('First five rows of HN dataset after cleaning', hn[:5])
The first five rows of HN dataset [['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'], ['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01']] The header for HN dataset ['id', 'title', 'url', 'num_points', 'num_comments', 'author', 'created_at'] First five rows of HN dataset after cleaning [['12224879', 'Interactive Dynamic Video', 'http://www.interactivedynamicvideo.com/', '386', '52', 'ne0phyte', '8/4/2016 11:52'], ['10975351', 'How to Use Open Source and Shut the Fuck Up at the Same Time', 'http://hueniverse.com/2016/01/26/how-to-use-open-source-and-shut-the-fuck-up-at-the-same-time/', '39', '10', 'josep2', '1/26/2016 19:30'], ['11964716', "Florida DJs May Face Felony for April Fools' Water Joke", 'http://www.thewire.com/entertainment/2013/04/florida-djs-april-fools-water-joke/63798/', '2', '1', 'vezycash', '6/23/2016 22:20'], ['11919867', 'Technology ventures: From Idea to Enterprise', 'https://www.amazon.com/Technology-Ventures-Enterprise-Thomas-Byers/dp/0073523429', '3', '1', 'hswarna', '6/17/2016 0:01'], ['10301696', 'Note by Note: The Making of Steinway L1037 (2007)', 'http://www.nytimes.com/2007/11/07/movies/07stein.html?_r=0', '8', '2', 'walterbell', '9/30/2015 4:12']]
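As a side note, the same file could also be read with csv.DictReader so that columns are accessed by name rather than by index. This is only a sketch of an alternative, assuming the same hacker_news.csv file; the rest of the project keeps the list-of-lists approach above.
import csv

# Hypothetical alternative: read each row as a dictionary keyed by the header names
with open("hacker_news.csv") as f:
    hn_dicts = list(csv.DictReader(f))

# Columns can then be referenced by name instead of position
print(hn_dicts[0]["title"], hn_dicts[0]["num_comments"])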
Now that we've removed the header row, let's explore the first question. For this, we will create three separate lists: one for ask_posts, one for show_posts, and one for other_posts. Then we will calculate the total number of posts in each of these lists.
ask_posts = []
show_posts = []
other_posts = []

# Classify each post by the prefix of its title
for row in hn:
    title = row[1]
    if title.lower().startswith('ask hn'):
        ask_posts.append(row)
    elif title.lower().startswith('show hn'):
        show_posts.append(row)
    else:
        other_posts.append(row)

print('The number of ask posts: ', len(ask_posts))
print('The number of show posts: ', len(show_posts))
print('The number of other posts: ', len(other_posts))
The number of ask posts: 1744
The number of show posts: 1162
The number of other posts: 17194
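As a quick sanity check (a small sketch added here, not part of the original analysis), the three lists should together account for every row in hn:
# The three categories should cover the whole dataset
assert len(ask_posts) + len(show_posts) + len(other_posts) == len(hn)
print('Total posts accounted for:', len(ask_posts) + len(show_posts) + len(other_posts))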
Next, let's determine whether ask posts or show posts receive more comments on average.
######### Avg number of comments in ask HN posts ##########
total_ask_comments = 0
for row in ask_posts:
    total_ask_comments += int(row[4])

avg_ask_comments = total_ask_comments / len(ask_posts)
print('The average no. of comments for Ask HN posts:', avg_ask_comments)

######### Avg number of comments in show HN posts ##########
total_show_comments = 0
for row in show_posts:
    total_show_comments += int(row[4])

avg_show_comments = total_show_comments / len(show_posts)
print('The average no. of comments for Show HN posts:', avg_show_comments)
The average no. of comments for Ask HN posts: 14.038417431192661
The average no. of comments for Show HN posts: 10.31669535283993
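The two loops above follow the same pattern, so they could be folded into one helper. The sketch below (avg_comments is a name introduced here, not part of the original code) would reproduce the same averages, assuming num_comments stays at index 4.
def avg_comments(posts):
    # Sum the num_comments column (index 4) and divide by the number of posts
    total = sum(int(row[4]) for row in posts)
    return total / len(posts)

print('The average no. of comments for Ask HN posts:', avg_comments(ask_posts))
print('The average no. of comments for Show HN posts:', avg_comments(show_posts))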
Since ask posts are more likely to receive comments, we'll focus our remaining analysis just on these posts. Next, we'll determine if ask posts created at a certain time are more likely to attract comments.
We'll use the following steps to perform this analysis:
Step 1: Calculate the number of ask posts created during each hour of the day, along with the number of comments those posts received.
Step 2: Calculate the average number of comments ask posts receive for each hour they were created.
Let's proceed with Step 1. To do this, we will first create a list of lists, with each inner list containing two elements: the date and time the post was created, and the number of comments it received. For this list we keep the created_at column as it is; the hour will be extracted later.
Using this list of lists, we can then create two dictionaries. The first dictionary will keep track of the number of ask posts created during each hour, and the second will keep track of the number of comments received for that corresponding hour.
Finally, we will use these dictionaries to calculate the average number of comments per post for each hour of creation.
import datetime as dt

# Build a list of [created_at, num_comments] pairs for the ask posts
result_list = []
for row in ask_posts:
    created_at = row[6]
    num_comments = int(row[4])
    result_list.append([created_at, num_comments])

# Count posts and comments per hour of creation
counts_by_hour = {}
comments_by_hour = {}
for row in result_list:
    created_dt = dt.datetime.strptime(row[0], "%m/%d/%Y %H:%M")
    hr = created_dt.strftime("%H")
    if hr not in counts_by_hour:
        counts_by_hour[hr] = 1
        comments_by_hour[hr] = row[1]
    else:
        counts_by_hour[hr] += 1
        comments_by_hour[hr] += row[1]

print(comments_by_hour)
{'21': 1745, '09': 251, '16': 1814, '04': 337, '07': 267, '10': 793, '06': 397, '08': 492, '14': 1416, '12': 687, '17': 1146, '23': 543, '03': 421, '05': 464, '20': 1722, '01': 683, '22': 479, '02': 1381, '18': 1439, '00': 447, '15': 4477, '13': 1253, '11': 641, '19': 1188}
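To make the parsing step clearer, here is a small illustration of what strptime and strftime do with one timestamp taken from the rows printed earlier; the format string "%m/%d/%Y %H:%M" matches the created_at values in this dataset.
import datetime as dt

# Parse one created_at string into a datetime object, then pull out just the hour
example = dt.datetime.strptime("8/4/2016 11:52", "%m/%d/%Y %H:%M")
print(example)                 # 2016-08-04 11:52:00
print(example.strftime("%H"))  # '11'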
Now, let's proceed with Step 2.
# Average comments per post for each hour of creation
avg_by_hour = []
for h in comments_by_hour:
    avg_by_hour.append([h, comments_by_hour[h] / counts_by_hour[h]])

print(avg_by_hour)
[['21', 16.009174311926607], ['09', 5.5777777777777775], ['16', 16.796296296296298], ['04', 7.170212765957447], ['07', 7.852941176470588], ['10', 13.440677966101696], ['06', 9.022727272727273], ['08', 10.25], ['14', 13.233644859813085], ['12', 9.41095890410959], ['17', 11.46], ['23', 7.985294117647059], ['03', 7.796296296296297], ['05', 10.08695652173913], ['20', 21.525], ['01', 11.383333333333333], ['22', 6.746478873239437], ['02', 23.810344827586206], ['18', 13.20183486238532], ['00', 8.127272727272727], ['15', 38.5948275862069], ['13', 14.741176470588234], ['11', 11.051724137931034], ['19', 10.8]]
Although we now have the results we need, this format makes it hard to identify the hours with the highest values. Let's finish by sorting the list of lists and printing the five highest values in a format that's easier to read.
# Swap the columns so the average comes first, which lets us sort by it
swap_avg_by_hour = []
for row in avg_by_hour:
    swap_avg_by_hour.append([row[1], row[0]])

print(swap_avg_by_hour)
sorted_swap = sorted(swap_avg_by_hour, reverse=True)
[[16.009174311926607, '21'], [5.5777777777777775, '09'], [16.796296296296298, '16'], [7.170212765957447, '04'], [7.852941176470588, '07'], [13.440677966101696, '10'], [9.022727272727273, '06'], [10.25, '08'], [13.233644859813085, '14'], [9.41095890410959, '12'], [11.46, '17'], [7.985294117647059, '23'], [7.796296296296297, '03'], [10.08695652173913, '05'], [21.525, '20'], [11.383333333333333, '01'], [6.746478873239437, '22'], [23.810344827586206, '02'], [13.20183486238532, '18'], [8.127272727272727, '00'], [38.5948275862069, '15'], [14.741176470588234, '13'], [11.051724137931034, '11'], [10.8, '19']]
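Swapping the columns works, but the same ranking could also be produced directly with a key function. This is just an alternative sketch; the project continues with sorted_swap below.
# Alternative: sort avg_by_hour by the average (second element) without swapping
top_hours = sorted(avg_by_hour, key=lambda row: row[1], reverse=True)
print(top_hours[:5])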
Now, let's see the Top 5 Hours for Ask Posts Comments.
# Print the five hours with the highest average comments per post
for row in sorted_swap[:5]:
    print("{}:00: {:.2f} average comments per post".format(row[1], row[0]))
15:00: 38.59 average comments per post
02:00: 23.81 average comments per post
20:00: 21.52 average comments per post
16:00: 16.80 average comments per post
21:00: 16.01 average comments per post
Thus, posts created at 15:00 receive approximately 39 comments per post on average. The hours 15:00, 02:00, 20:00, 16:00, and 21:00 are therefore the hours during which one should create a post to have a higher chance of receiving comments on it.
In this project, we analyzed the Hacker News dataset. Our findings indicate that Ask HN posts receive more comments per post, on average, than Show HN posts. Thus, if you want a first opinion, a second opinion, or simply to market your ideas, Hacker News is the place. We also found that the top three hours of the day for comments are 15:00, 02:00, and 20:00; posting at these times is more likely to attract comments, and you can expect at least 20 comments on average for posts made during those hours.