Someone recently asked a question in the comments about internally generated vs submitted posts (http://hackaday.com/2016/06/07/hacking-a-fluke-multimeter-hertz-so-good/#comments). This (d)evolved into a discussion about changes over time and then Brian gave permission to scrape the HAD website. I figured this was a good excuse to learn how to scrape a website and exercise my data crunching-fu.
Files
HackadayFixed.txt.gz
HAD stats scraped on June 8th, 2016. Fixed two lines that were missing tag/post data.
I'll continue to put up interesting things as I think of them. Here are a few interesting tidbits.
Most often used post tags:
Tag                         Posts
misc hacks                   2327
Arduino Hacks                1792
news                         1492
classic hacks                1291
robots hacks                 1248
tool hacks                   1200
home hacks                   1024
led hacks                    1024
Microcontrollers              893
Hackaday Columns              813
peripherals hacks             778
Featured                      750
transportation hacks          742
slider                        711
3d Printer hacks              696
hardware                      661
security hacks                657
Raspberry Pi                  634
digital cameras hacks         589
home entertainment hacks      587
Perhaps unsurprisingly, arduino hacks are near the top of the list.
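A ranking like this could be reproduced from the tab-delimited scrape file with a few lines of Python. This is only a minimal sketch, not the script that produced the numbers above: it assumes the column order written by the scraping script (ID, date, title, author, comment count, "posted in", tags) and that the ranking is built from the "posted in" column (index 5). `count_labels` is a hypothetical helper name.

```python
from collections import Counter


def count_labels(lines, column=5):
    """Count comma-separated labels found in one column of tab-delimited rows.

    column=5 is the "posted in" field and column=6 the tags field,
    assuming the column order used by the scraping script.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= column:
            continue  # row is missing this field entirely
        for label in fields[column].split(","):
            label = label.strip()
            if label:
                counts[label] += 1
    return counts
```

With the scraped file on disk, `count_labels(open("Hackaday.txt")).most_common(20)` would list the twenty most frequent labels.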
If you look at the most prolific authors you get:
Author              Posts
Mike Szczys          5716
Brian Benchoff       3834
Caleb Kraft          1567
Eliot                1332
James Hobson         1063
Mike Nathan          1039
Will O'Brien          805
Adam Fabio            530
Elliot Williams       405
Al Williams           401
Kristina Panos        322
Rich Bremer           290
Jakob Griffith        269
Eric Evenchick        265
Rick Osgood           225
Gerrit Coetzee        215
Marsh                 213
Jeremy Cook           199
Dan Maloney           198
Bryan Cockfield       187
Kevin Dady            187
Mathieu Stephan       180
Anool Mahidharia      160
Juan Aguilar          160
Vine Veneziani        137
Plotting the number of articles per week, segregated by the top ten authors, over time gives the following picture:
You can clearly see when submitters became active and when they stopped. Brian had an early submission somewhere in 2006, before he joined HAD. Mike Szczys was active early and then started tailing off around 2013, due to other behind-the-scenes activities, I imagine.
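The binning behind a plot like this can be sketched as follows. This is a hedged illustration rather than the original analysis code: it assumes dates in the scraped file look like "June 8, 2016", that the author is the fourth tab-separated field, and that weeks start on Monday. `weekly_counts_by_author` is a hypothetical name, and the plotting step itself is left out.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def weekly_counts_by_author(lines, authors):
    """Return {author: {week_start_date: number_of_posts}} for the given authors."""
    counts = {author: defaultdict(int) for author in authors}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[3] not in counts:
            continue  # malformed row, or an author we are not tracking
        day = datetime.strptime(fields[1], "%B %d, %Y").date()
        # Snap the date back to the Monday of its week (assumed convention).
        week_start = day - timedelta(days=day.weekday())
        counts[fields[3]][week_start] += 1
    return counts
```

Each inner dictionary can then be sorted by week and handed to whatever plotting tool is at hand, one series per author.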
Here is the data requested: featured per week and %featured.
The above was for articles with the "Featured" post marker. If you include "Featured", "Retrotechtacular", "Hackaday Columns", "The Hackaday Prize", "Ask Hackaday", "Hackaday Store", and "Interviews", that roughly triples the number of articles, but the overall shape looks the same.
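A featured-per-week count and percentage could be computed along these lines. This is a sketch under assumptions: the markers are taken to live in the "posted in" column (index 5) of the tab-delimited file, weeks are taken to start on Monday, and `featured_by_week` is a hypothetical helper, not the code used for the plots.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def featured_by_week(lines, markers=("Featured",)):
    """Return {week_start_date: (featured_count, percent_featured)}."""
    wanted = set(markers)
    total = defaultdict(int)
    featured = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # row lacks the "posted in" field
        day = datetime.strptime(fields[1], "%B %d, %Y").date()
        week = day - timedelta(days=day.weekday())
        total[week] += 1
        # A post counts once, even if it carries several of the markers.
        if wanted & set(fields[5].split(",")):
            featured[week] += 1
    return {w: (featured[w], 100.0 * featured[w] / total[w]) for w in total}
```

Passing the longer marker list, for example `markers=("Featured", "Retrotechtacular", "Hackaday Columns")`, gives the broader definition without changing the rest of the code.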
OK, first plot of the data before I go to bed. I munged the data and plotted posts per day as a function of time. Not surprisingly, the number of posts per day has been going up since the early days. Somewhat surprisingly, the maximum was way back on Feb 28, 2011, when there were no fewer than 16 posts! Here you go:
Staying true to the site's name, most days early on had a single article. Now the mode appears to be 8 per day.
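The per-day numbers (busiest day and the mode) can be derived from the scraped file roughly like this. It is a minimal sketch assuming the date is the second tab-separated field in "Month D, YYYY" form; the function names are hypothetical. Days with no posts never appear in the file, so they are not counted toward the mode here.

```python
from collections import Counter
from datetime import datetime


def posts_per_day(lines):
    """Return a Counter mapping each publication date to its number of posts."""
    per_day = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip rows without a date field
        per_day[datetime.strptime(fields[1], "%B %d, %Y").date()] += 1
    return per_day


def busiest_day(per_day):
    """Return (date, count) for the day with the most posts."""
    return per_day.most_common(1)[0]


def modal_posts_per_day(per_day):
    """Return the most common posts-per-day value among days that had posts."""
    return Counter(per_day.values()).most_common(1)[0][0]
```

To compare early and recent years, the same functions can be run on a subset of rows filtered by year.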
Found a few websites that explain the XPath syntax and I was off to the races. So, a few baby steps first.
from lxml import html
import requests
page = requests.get('http://hackaday.com/blog/page/3000/')
tree = html.fromstring(page.content)
# get post titles
tree.xpath('//article/header/h1/a/text()')
# get post IDs
tree.xpath('//article/@id')
# get Date of publication
tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
Eventually I wrote a script to scrape the entire HAD archives. On Wednesday, June 8th at 11 PM Pacific time, the archive had 3223 pages. I decided to include article ID, date of publication, title, author, number of comments, "posted in" categories, and tags. Here is a quick and dirty Python script that outputs all the data to a tab-delimited file:
from lxml import html
import requests

# Scrape every archive page and write one tab-delimited row per article.
fh = open("Hackaday.txt", 'w')
for pageNum in range(1, 3224):  # pages 1..3223; range works on Python 2 and 3
    page = requests.get('http://hackaday.com/blog/page/%d/' % pageNum)
    tree = html.fromstring(page.content)
    # Each list below has one entry per article on the page, in page order.
    titles = tree.xpath('//article/header/h1/a/text()')
    postIDs = tree.xpath('//article/@id')
    dates = tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
    authors = tree.xpath('//article/header/div/a[@rel="author"]/text()')
    commentCounts = tree.xpath('//article/header/div/a[@class="comments-counts comments-counts-top"]/text()')
    commentCounts = [i.strip() for i in commentCounts]
    posts = []  # "posted in" categories, one list per article
    tags = []   # tags, one list per article
    for i in range(len(titles)):
        # XPath positions are 1-based, hence i+1.
        posts.append(tree.xpath('//article[%d]/footer/span/a[@rel="category tag"]/text()' % (i+1)))
        tags.append(tree.xpath('//article[%d]/footer/span/a[@rel="tag"]/text()' % (i+1)))
    for i in range(len(titles)):
        fh.write(postIDs[i] + '\t' + dates[i] + '\t' + titles[i] + '\t' + authors[i] + '\t' + commentCounts[i] + '\t' + ",".join(posts[i]) + '\t' + ",".join(tags[i]) + '\n')
fh.close()
I felt a bit guilty about scraping the entire website but Brian said it was OK. The HTML for each page is ~60 KB, which times 3223 pages comes to about 193 MB of data. This was distilled down to 3.5 MB, and the whole scrape took about 25 minutes.
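The back-of-the-envelope numbers can be checked with a few lines of arithmetic. All inputs are the approximate figures quoted in this log (including the total of 22556 articles); decimal megabytes (1 MB = 1000 KB) are an assumption.

```python
pages = 3223
kb_per_page = 60      # approximate size of one archive page
distilled_mb = 3.5    # size of the resulting tab-delimited file
minutes = 25          # approximate duration of the scrape
articles = 22556      # total number of articles scraped

raw_mb = pages * kb_per_page / 1000.0           # total HTML downloaded
reduction = raw_mb / distilled_mb               # how much smaller the distilled file is
seconds_per_page = minutes * 60.0 / pages       # average time per request
articles_per_page = articles / float(pages)     # average articles per archive page

print("raw download:      %.1f MB" % raw_mb)
print("size reduction:    about %.0fx" % reduction)
print("time per page:     %.2f s" % seconds_per_page)
print("articles per page: %.2f" % articles_per_page)
```

So the scrape averaged a bit under half a second per page, and the archive holds roughly seven articles per page.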
The latest post is #207753 and the earliest is post #7. The post numbers are not sequential, and there are a total of 22556 articles. The file looks like this:
post-207753	June 8, 2016	Hackaday Prize Entry: The Green Machine	Anool Mahidharia	1 Comment	The Hackaday Prize	2016 Hackaday Prize,arduino,Coating machine,grbl,Hackaday Prize,linear motion,motor,raspberry pi,Spraying machine,stepper driver,the hackaday prize
post-208524	June 8, 2016	Rainbow Cats Announce Engagement	Kristina Panos	1 Comment	ATtiny Hacks	attiny,because cats,blinkenlights,RGB LED,smd soldering,wedding announcements
post-208544	June 8, 2016	Talking Star Trek	Al Williams	8 Comments	linux hacks,software hacks	computer speech,natural language,speech recognition,star trek,text to speech,voice command,voice recognition
.....
post-11	September 9, 2004	hack the dakota disposable camera	Phillip Torrone	1 Comment	digital cameras hacks
post-10	September 8, 2004	mod the cuecat, and scan barcodes…	Phillip Torrone	1 Comment	misc hacks
post-9	September 7, 2004	make a nintendo controller into a usb joystick	Phillip Torrone	22 Comments	computer hacks,macs hacks
post-8	September 6, 2004	change the voice of an aibo ers-7	Phillip Torrone	10 Comments	robots hacks
post-7	September 5, 2004	radioshack phone dialer – red box	Phillip Torrone	38 Comments	misc hacks
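For anyone who wants to slice the file themselves, each row can be split back into its seven tab-separated fields. The sketch below is one possible reader, not part of the original scripts: the `Post` record and `parse_row` are hypothetical names, comment counts that do not start with a number are treated as 0 (an assumption), and any tag that itself contains a comma would be split incorrectly.

```python
from collections import namedtuple
from datetime import datetime

Post = namedtuple("Post", "post_id date title author comments categories tags")


def parse_row(line):
    """Turn one tab-delimited line of the scrape file into a Post record."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (7 - len(fields))  # pad rows whose trailing fields are missing
    post_id, date, title, author, comments, categories, tags = fields[:7]
    first_word = comments.split(" ")[0]  # "22 Comments" -> "22"
    return Post(
        post_id=int(post_id.replace("post-", "")),
        date=datetime.strptime(date, "%B %d, %Y").date(),
        title=title,
        author=author,
        comments=int(first_word) if first_word.isdigit() else 0,
        categories=[c for c in categories.split(",") if c],
        tags=[t for t in tags.split(",") if t],
    )
```

A list of records for the whole archive would then be `[parse_row(line) for line in open("Hackaday.txt")]`.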
I'll upload a zipped version. Hopefully this will save HAD from being scraped over and over again. I'll start slicing and dicing the data soon.
Addendum: for whatever reason, two articles were missing the posts/tags fields. I fixed them manually and uploaded the corrected file.