Textmode

Archive for Home

Picidae - Breaking Internet Censorship

As you may know, internet access is censored by the government in some countries (e.g. the Golden Shield in China).

Picidae is a project of servers that let you enter a URL and return an image of the requested page. The image is clickable, which makes it possible to get past those content firewalls and surf almost as if you were on the original page (forms work too).

So let’s help spread the word, and maybe donate a server.

In the land of the free..

Although it is off-topic, I want to post this:

LA Homicide Map

A Web 2.0 map of death. This is pretty depressing: the numbers, and especially the stories behind those incidents.

Railsbased.org - reusable applications built with Ruby on Rails

Since I still haven’t found the right one for me, I thought I’d put up a small list of Ruby on Rails based Content Management Systems, to give at least a basic starting point to people searching for such a publishing system.

My favorites so far are Mephisto and Radiant, although Radiant lacks i18n support.

The list is not only for publishing systems. Feel free to suggest any kind of reusable (open source) application that’s based on Rails, whether or not there is a category for it.

Web Scraping 1: Hpricot

Web scraping is the process of automatically retrieving content from web pages for further use, like aggregating content or generating different output formats.

Although this technique is often used by spammers to fake content, there are times you wish to automate the retrieval of important data, for example if no feed is available.

One straightforward way to do this is to simply download the desired web page with wget and then write a parser to extract the parts of interest, which is pretty time consuming.
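To get a feel for that hand-written route, here is a minimal Ruby sketch that pulls link targets out of an already downloaded page with a regular expression. The sample markup and the `extract_links` helper are made up for illustration; a regex like this falls apart quickly on real-world HTML, which is a big part of why this approach eats so much time.

```ruby
# Hypothetical helper: pull href values out of raw HTML with a regex.
# This is the brittle "write your own parser" route, shown for contrast.
def extract_links(html)
  # Only matches double-quoted href attributes inside <a> tags;
  # single quotes or unquoted attributes are silently missed.
  html.scan(/<a\s[^>]*href="([^"]*)"/i).flatten
end

# Sample page content (made up), as wget might have saved it to disk.
page = <<~HTML
  <html><body>
    <h2><a href="http://example.com/one">One</a></h2>
    <h2><a class="l" href="http://example.com/two">Two</a></h2>
    <p>No link here</p>
  </body></html>
HTML

links = extract_links(page)
puts links
```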

So we need tools.

Since many web pages today use XHTML, which is basically XML, we can use XPath to access the XML nodes we are interested in.
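As a rough illustration of the idea, the sketch below runs an XPath query against a small XHTML fragment using REXML, the XML library that ships with Ruby (as a bundled gem on recent versions). The fragment is made up and assumed to be well-formed; a strict XML parser like this generally rejects sloppy markup, so treat it as a demonstration of XPath rather than a scraping tool.

```ruby
require 'rexml/document'

# Made-up, well-formed XHTML fragment standing in for a downloaded page.
xhtml = <<~XHTML
  <html>
    <body>
      <table>
        <tr>
          <td align="left">Web</td>
          <td align="right">Results <b>1</b> - <b>10</b></td>
        </tr>
      </table>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# Select every <b> directly below a right-aligned table cell and read its text.
numbers = REXML::XPath.match(doc, '//td[@align="right"]/b').map(&:text)
puts numbers.inspect
```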

Luckily, there is a nice parser written in Ruby called Hpricot, which is based on HTree and jQuery (but rewritten in C). Check out jQuery if you don’t know it already.

I’ll show you a quick example that retrieves some Google data, so you can see how it works.

To install Hpricot, simply get the gem:

sudo gem install hpricot

To make it available, we have to require it in our script. We also require open-uri, which is needed to fetch a URL:


require 'rubygems'
require 'hpricot'
require 'open-uri'

The next step is to build a query that selects the content we want. I’m going to use XPath, but Hpricot also supports CSS-like queries. If you don’t know XPath, check out W3Schools for a quick introduction. It is really powerful.

To check your XPath query, you can simply look at the DOM of the page you are after, or use an XPath tool like the XPather Firefox extension.

As I want to retrieve the search result count for a specific word (ruby in this case), I constructed the following query:


//td[@align="right"]//b[3]/

which is not very accurate, but works :)

Alright, that’s basically all we need, so let’s try it in our script. The steps are pretty easy, so I’ll just show you everything at once:

  • Load the document
  • Search the document with our query
  • Display the result

which looks like:


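# Fetch the Google results page for "ruby" and parse it
doc = Hpricot(open("http://www.google.com/search?hl=en&q=ruby"))
# Select the element holding the result count
count = doc.search('//td[@align="right"]//b[3]/')
puts count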

And you have the count. This really can’t get any easier.

Now we want to extract all the links (on the first page) of our search result. Again we use a simplified query, '//h2/a', to get the links.

Now we can easily loop through the resulting elements and extract the link (the href attribute):


# Each match is an <a> element; print its href attribute
items = doc.search('//h2/a')
items.each do |item|
  puts item.attributes['href']
end

This is really just a very basic example of the possibilities Hpricot gives you, so if you think that’s sexy, check out the basics.

No-www!

There are different opinions about the ‘www’ prefix for domain names.

If you need some pros and cons, visit no-www and yes-www (seems to be down right now).

Well, I decided to remove the ‘www’ prefix from all of my subdomains today, since it’s completely redundant to serve the same domain under two different URLs.

If you’re using nginx, here is a small snippet to remove the ‘www’ prefix. Just drop it into your server block:


# Match hosts starting with "www." (case-insensitive); the dot is escaped
if ($host ~* ^www\.(.+)$) {
  set $newhost $1;
  rewrite ^/(.*)$ http://$newhost/$1 permanent;
}
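
If you want to check the matching logic without reloading nginx, here is a rough Ruby sketch of the same rule. The `strip_www` helper is made up for illustration and only mimics the host match and the redirect target; it assumes the intent is to match a literal `www.` prefix, so the dot is escaped.

```ruby
# Hypothetical helper mirroring the nginx rule: return the redirect target
# for a host/path pair, or nil when the host has no "www." prefix.
def strip_www(host, path)
  # Case-insensitive like nginx's ~* operator; the dot is escaped so that
  # only a literal "www." prefix matches.
  match = /\Awww\.(.+)\z/i.match(host)
  return nil unless match

  "http://#{match[1]}#{path}"
end

puts strip_www('www.example.com', '/blog/post')
puts strip_www('example.com', '/blog/post').inspect
puts strip_www('wwwx.example.com', '/').inspect
```

With an unescaped dot, a host like wwwx.example.com would also match, because the dot then stands for any character.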