0x0e.org | pentesting perspective

braindump on pentesting, QA, metasploit, constant learning

scrape scrape scrape

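totally half-finished thought, but maybe it’ll spawn an idea for you. there are a zillion+1 ways to scrape information from the web these days. here’s the easiest i’ve found: open-uri to fetch the page, nokogiri to parse and query it, and tidy_ffi to clean up the html: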

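require 'nokogiri'
require 'open-uri'
require 'tidy_ffi'
require 'uri'

class CrappyScraper

  attr_accessor :doc

  # fetch a google results page and print the result titles
  def search(keyword)
    # escape the keyword so spaces, '&', etc. don't break the query string
    query = URI.encode_www_form_component(keyword)
    # URI.open, because plain open() no longer fetches urls on ruby 3.0+
    @doc = Nokogiri::HTML(URI.open("http://www.google.com/search?q=" + query))

    @doc.xpath('//h3/a').each do |node|
      puts node.text
    end
  end

  # fetch any page and print the text of links sitting directly inside a <span>
  def scrape(url)
    @doc = Nokogiri::HTML(URI.open(url))

    @doc.xpath('//span/a').each do |node|
      puts node.text
    end
  end

  # write the tidied html of the last fetched page to a file
  def write_clean(filename)
    File.open(filename, 'w') { |f| f.write(to_s) }
  end

  # the last fetched page, cleaned up by tidy
  def to_s
    TidyFFI::Tidy.new(@doc.to_s).clean
  end

  # write the html as nokogiri parsed it, untidied
  def write(filename)
    File.open(filename, 'w') { |f| f.write(@doc.to_s) }
  end
end


x = CrappyScraper.new
x.search('cowabunga')
puts x.to_s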

Written by jcran

July 12, 2010 at 2:36 PM

Posted in Uncategorized

2 Responses

  1. I was actually able to follow along on this one.

    gSaenz

    July 12, 2010 at 10:12 PM

  2. hehe, glad to hear it. admittedly, some of my posting has been pretty cryptic (this one isn’t much better), but thankfully ruby’s super-readable.

    jcran

    July 13, 2010 at 12:48 AM

