I started off knowing nothing about web scraping. I found a good guide that shows how to scrape using Python:
http://docs.python-guide.org/en/latest/scenarios/scrape/
I then found a few websites that explain the XPath syntax, and I was off to the races. So, a few baby steps first:
from lxml import html
import requests
page = requests.get('http://hackaday.com/blog/page/3000/')
tree = html.fromstring(page.content)
# get post titles
tree.xpath('//article/header/h1/a/text()')
# get post IDs
tree.xpath('//article/@id')
# get Date of publication
tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
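To try path expressions like these without hitting the site, here is a minimal offline sketch. It assumes a made-up fragment shaped like the markup the paths above imply (the two posts and their values are invented), and it swaps in the standard library's xml.etree.ElementTree instead of lxml. ElementTree only supports a subset of XPath, so text and attributes are read in Python rather than with text() and @id.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that mimics the structure implied by
# the XPath expressions above. The posts and their values are invented.
SAMPLE = """
<html><body>
  <article id="post-2">
    <header>
      <h1><a href="#">second example post</a></h1>
      <div><span class="entry-date"><a href="#">January 2, 2000</a></span></div>
    </header>
  </article>
  <article id="post-1">
    <header>
      <h1><a href="#">first example post</a></h1>
      <div><span class="entry-date"><a href="#">January 1, 2000</a></span></div>
    </header>
  </article>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Post titles: the text of every <a> under article/header/h1
titles = [a.text for a in root.findall(".//article/header/h1/a")]

# Post IDs: the id attribute of every <article>
post_ids = [article.get("id") for article in root.findall(".//article")]

# Publication dates: the link text inside the span with class "entry-date"
dates = [a.text for a in
         root.findall(".//article/header/div/span[@class='entry-date']/a")]

print(titles)
print(post_ids)
print(dates)
```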
I eventually wrote a script to scrape the entire HAD archive. On Wednesday, June 8th at 11 PM Pacific time, it had 3223 pages. I decided to include the article ID, date of publication, title, author, number of comments, "posted in" categories, and tags. Here is a quick-and-dirty Python script that writes all the data to a tab-delimited file:
from lxml import html
import requests

# Python 3. Walk every archive page (3223 at the time) and write one
# tab-delimited line per article.
with open("Hackaday.txt", "w", encoding="utf-8") as fh:
    for pageNum in range(1, 3224):
        page = requests.get('http://hackaday.com/blog/page/%d/' % pageNum)
        tree = html.fromstring(page.content)
        # Page-wide queries: one list entry per article on the page
        titles = tree.xpath('//article/header/h1/a/text()')
        postIDs = tree.xpath('//article/@id')
        dates = tree.xpath('//article/header/div/span[@class="entry-date"]/a/text()')
        authors = tree.xpath('//article/header/div/a[@rel="author"]/text()')
        commentCounts = tree.xpath('//article/header/div/a[@class="comments-counts comments-counts-top"]/text()')
        commentCounts = [i.strip() for i in commentCounts]
        # "Posted in" categories and tags are lists, so query them per article
        posts = []
        tags = []
        for i in range(len(titles)):
            posts.append(tree.xpath('//article[%d]/footer/span/a[@rel="category tag"]/text()' % (i + 1)))
            tags.append(tree.xpath('//article[%d]/footer/span/a[@rel="tag"]/text()' % (i + 1)))
        # One line per article: scalar fields, then comma-joined categories and tags
        for i in range(len(titles)):
            fields = [postIDs[i], dates[i], titles[i], authors[i], commentCounts[i],
                      ",".join(posts[i]), ",".join(tags[i])]
            fh.write('\t'.join(fields) + '\n')
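One fragility in a script like this is that page-wide queries return parallel lists, so a single article without, say, an author link would shift every later entry on that page. A possible alternative is to loop over the articles and query each one relative to itself. The sketch below shows that idea on an invented fragment, again using the standard library's ElementTree so it runs without network access. The helper names text_or_blank and rows_from_page are made up for illustration. With lxml the same pattern would be a loop over tree.xpath('//article') with relative queries such as article.xpath('./header/h1/a/text()').

```python
import xml.etree.ElementTree as ET

# Invented fragment: the second article deliberately has no author link.
SAMPLE = """<body>
<article id="post-2"><header><h1><a>second example</a></h1>
<div><a rel="author">Author B</a></div></header></article>
<article id="post-1"><header><h1><a>first example</a></h1>
<div></div></header></article>
</body>"""


def text_or_blank(node, path):
    """Return the stripped text of the first match under node, or ''."""
    found = node.find(path)
    if found is None:
        return ""
    return (found.text or "").strip()


def rows_from_page(root):
    """Build one row per <article>, querying relative to that article."""
    rows = []
    for article in root.iter("article"):
        rows.append([
            article.get("id", ""),
            text_or_blank(article, "header/h1/a"),
            text_or_blank(article, "header/div/a[@rel='author']"),
        ])
    return rows


rows = rows_from_page(ET.fromstring(SAMPLE))
for row in rows:
    # A missing field becomes an empty column instead of shifting the others
    print("\t".join(row))
```

Because every row is built from a single article, a missing element leaves a blank column for that post only.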
I felt a bit guilty about scraping the entire website, but Brian said it was OK. The HTML for each page is ~60 KB, which times 3223 pages comes to about 193 MB of data. This was distilled down to 3.5 MB and took about 25 minutes.
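As a quick sanity check on those numbers, here is the arithmetic spelled out. It assumes decimal units (1 MB = 1000 KB) and treats the ~60 KB page size and ~25 minute run time as the rough figures they are.

```python
pages = 3223
kb_per_page = 60      # approximate size of one HTML page
output_mb = 3.5       # size of the tab-delimited result
minutes = 25          # approximate total run time

# Total download, assuming 1 MB = 1000 KB
downloaded_mb = pages * kb_per_page / 1000.0

# How much smaller the distilled file is than the raw HTML
reduction = downloaded_mb / output_mb

# Average time spent on each page request plus parsing
seconds_per_page = minutes * 60.0 / pages

print("downloaded: about %.0f MB" % downloaded_mb)
print("reduction: about %.0fx" % reduction)
print("time per page: about %.2f s" % seconds_per_page)
```

So the output is roughly 55 times smaller than the raw HTML, and the scrape averaged a little under half a second per page.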
The latest post is #207753 and the earliest is post #7. The numbers are not sequential, and there are a total of 22556 articles. The file looks like this:
post-207753 June 8, 2016 Hackaday Prize Entry: The Green Machine Anool Mahidharia 1 Comment The Hackaday Prize 2016 Hackaday Prize,arduino,Coating machine,grbl,Hackaday Prize,linear motion,motor,raspberry pi,Spraying machine,stepper driver,the hackaday prize
post-208524 June 8, 2016 Rainbow Cats Announce Engagement Kristina Panos 1 Comment ATtiny Hacks attiny,because cats,blinkenlights,RGB LED,smd soldering,wedding announcements
post-208544 June 8, 2016 Talking Star Trek Al Williams 8 Comments linux hacks,software hacks computer speech,natural language,speech recognition,star trek,text to speech,voice command,voice recognition
.....
post-11 September 9, 2004 hack the dakota disposable camera Phillip Torrone 1 Comment digital cameras hacks
post-10 September 8, 2004 mod the cuecat, and scan barcodes… Phillip Torrone 1 Comment misc hacks
post-9 September 7, 2004 make a nintendo controller in to a usb joystick Phillip Torrone 22 Comments computer hacks,macs hacks
post-8 September 6, 2004 change the voice of an aibo ers-7 Phillip Torrone 10 Comments robots hacks
post-7 September 5, 2004 radioshack phone dialer – red box Phillip Torrone 38 Comments misc hacks
I'll upload a zipped version. Hopefully this will save HAD from being scraped over and over again. I'll start slicing and dicing the data soon.
Addendum: for whatever reason, two articles were missing the posts/tags fields. I fixed them manually and uploaded the corrected file.
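As a possible starting point for the slicing and dicing, here is a minimal sketch that reads the tab-delimited format with Python's csv module. The column names in FIELDS and the helpers load_rows and comment_count are my own labels, not part of the file. The three sample rows are taken from the listing above, with tabs placed where the script would write them, which is an assumption about the field boundaries. For the real data, open Hackaday.txt and pass the file handle in place of the in-memory sample.

```python
import csv
import io
from collections import Counter

# Hypothetical column names, in the order the script writes the fields
FIELDS = ["post_id", "date", "title", "author", "comments", "categories", "tags"]

# Rows from the listing above; the tab positions are assumed.
SAMPLE = (
    "post-208524\tJune 8, 2016\tRainbow Cats Announce Engagement\t"
    "Kristina Panos\t1 Comment\tATtiny Hacks\t"
    "attiny,because cats,blinkenlights,RGB LED,smd soldering,wedding announcements\n"
    "post-9\tSeptember 7, 2004\tmake a nintendo controller in to a usb joystick\t"
    "Phillip Torrone\t22 Comments\tcomputer hacks,macs hacks\t\n"
    "post-8\tSeptember 6, 2004\tchange the voice of an aibo ers-7\t"
    "Phillip Torrone\t10 Comments\trobots hacks\t\n"
)


def load_rows(handle):
    """Parse tab-delimited lines into dicts keyed by FIELDS."""
    # QUOTE_NONE keeps quote characters in titles from being treated specially
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(FIELDS, row)) for row in reader]


def comment_count(label):
    """Turn '22 Comments' into 22; anything without a leading number gives 0."""
    parts = label.split()
    return int(parts[0]) if parts and parts[0].isdigit() else 0


rows = load_rows(io.StringIO(SAMPLE))

# Posts per author
posts_per_author = Counter(row["author"] for row in rows)

# Total comments across the loaded rows
total_comments = sum(comment_count(row["comments"]) for row in rows)

# Rows with no tags at all, one way to spot entries worth a manual look
untagged = [row["post_id"] for row in rows if not row["tags"]]

print(posts_per_author.most_common())
print(total_comments)
print(untagged)
```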