Archive for April 2007

 
 

Mac OS X, Ruby, CGI without Rails

In the midst of my Rails-based Web development I was tasked with creating a simple, one-page Web form. I knew it would be a snap to write given my background in PHP and the broad support for PHP in OS X, but I decided to “teach myself” how to do the same thing in Ruby without the crutch of Rails.

The first step was to create the Web form in static HTML, which took about 15 minutes start to finish. I pointed the form action at a Ruby script that would simply echo “Thanks!”.
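A minimal sketch of what such a handler might look like, assuming only that Apache runs the file as a CGI program. The method name thanks_response is made up for illustration; the one hard requirement of CGI is that a header block and a blank line come before the body.

```ruby
#!/usr/bin/env ruby
# Hypothetical form_handler.rb: ignores the submitted fields and answers "Thanks!".

# Builds the complete CGI response: headers, a blank line, then the HTML body.
def thanks_response
  body = "<html><body><p>Thanks!</p></body></html>"
  "Content-Type: text/html\r\n\r\n" + body
end

print thanks_response
```

Running the file from a terminal prints the same text Apache would relay to the browser, which is a quick way to confirm the script itself is sound before touching the server configuration.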

Clicking the “Submit” button on my form produced the “Opening form_handler.rb” window (Windows readers, your mileage may vary from this point forward), meaning that Apache did not know what to do with files with a “.rb” extension. A quick visit to /etc/httpd/httpd.conf to add the following line beneath the other AddHandler directives did the trick.

AddHandler cgi-script .rb

After restarting Apache and clicking the “Submit” button again, I was greeted with a 500 error. A peek at /var/log/httpd/error_log revealed the following:

Options ExecCGI is off in this directory: /Users/sjobs/Sites/cgi-bin/form_handler.rb

That’s easy enough to fix. Placing a .htaccess file containing the following line in that directory resolved the issue:

Options ExecCGI

Finally, with no modification to the Apache that ships with OS X other than the changes noted above, my Ruby CGI script was performing as expected. Total time invested: 30 minutes. Not bad. The basic script, which just dumps the form values, cookies, and environment variables, appears below. An excellent reference to Ruby’s CGI library is, of course, its RDoc.

#!/usr/bin/env ruby
require 'cgi'

cgi = CGI.new("html4")
params = cgi.params

cgi.out() do
  cgi.html() do
    cgi.head{ cgi.title{"TITLE"} } +
    cgi.body() do
      cgi.pre() do
        CGI::escapeHTML(
          "params: " + cgi.params.inspect + "\n" +
          "cookies: " + cgi.cookies.inspect + "\n" +
          ENV.collect() do |key, value|
            key + " --> " + value + "\n"
          end.join("")
        )
      end
    end
  end
end

Just Missed Me – Chapter Two

The “Rails” components of the website were completed in about 20 minutes and consist of nothing more than a form which provides the user a means to supply a date and returns an appropriate (some would argue) list of the deceased.

After quite a bit of experimentation and refactoring I have refined the script which parses Wikipedia’s Persondata from the XML dump to the following:

#!/usr/bin/env ruby

require 'parsedate' # Ruby 1.8 standard library (removed in 1.9); the file name is lowercase

fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"
page_start = /\<page\>/
page_end = /\<\/page\>/
has_persondata = /\{\{Persondata/
regex_name = /\|NAME=(.*)/
regex_page_title = /\<title\>(.*)\<\/title\>/
regex_page_id = /\<id\>(.*)\<\/id\>/
regex_date_death = /\|DATE OF DEATH=(.*)/

def cleanse data
  wiki_marks = /\[\[|\]\]|\{\{|\}\}/
  clean = data.strip.gsub(wiki_marks, '')
  clean.gsub(/'/, "\\\\'")
end

def process_date date
  if !date.nil?
    date = ParseDate.parsedate(date)
    if !date[0].nil? and !date[1].nil? and !date[2].nil?
      sprintf("%04d-%02d-%02d", date[0], date[1], date[2])
    else
      nil
    end
  end
end

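# Stop early with a clear message instead of silently discarding the check.
abort "Cannot read #{fp}" unless File.exist?(fp) && File.readable?(fp)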
File.open(fp, "r") do |file|
  i = 0
  concat = false
  page_text = ''
  while line = file.gets
    if !page_start.match(line).nil? or concat
      concat = true
      page_text = page_text + line
      if !page_end.match(line).nil?
        if !has_persondata.match(page_text).nil?
          i = i + 1
          # String#[] with a regex returns the capture, or nil when nothing
          # matches, so a value from an earlier page cannot leak into this one.
          date_death = page_text[regex_date_death, 1]
          name = page_text[regex_name, 1]
          page_title = page_text[regex_page_title, 1]
          page_id = page_text[regex_page_id, 1]
          date_of_death = date_death.nil? ? nil : process_date(cleanse(date_death))
          if !date_of_death.nil?
            puts "- !ruby/object:Person"
            puts "  attributes:"
            puts "    date_of_death: " + date_of_death unless date_of_death.nil?
            puts "    name: " + cleanse(name) unless name.nil?
            puts "    page_title: " + cleanse(page_title) unless page_title.nil?
            puts "    page_id: " + cleanse(page_id) unless page_id.nil?
            puts "    id: " + i.to_s
          end
        end
        concat = false
        page_text = ''
      end
    end
  end
end

I am no longer collecting much of the data I was parsing. I am re-evaluating the schema, because much of the data within the Wikipedia database lacks any consistent form (as expected).

The keen observer will note that this script generates YAML, which I import painlessly via Geoffrey Grosenbach’s ar_fixtures plugin for Rails.
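Hand-written YAML is fragile when a value contains a colon or a quote, which Wikipedia names and titles often do. The sketch below is not the layout ar_fixtures expects (it has no !ruby/object:Person tag) and the record is invented; it only illustrates letting Ruby’s bundled YAML library handle the quoting so that the data survives a round trip.

```ruby
require 'yaml'

# Hypothetical record; the values are made up to include awkward characters.
person = {
  'name' => "O'Example, Pat: the Elder",
  'date_of_death' => '1900-01-01',
  'page_id' => '1'
}

# YAML.dump chooses whatever quoting each value needs.
yaml_text = YAML.dump([person])
puts yaml_text

# Loading the text back yields the same strings that went in.
round_trip = YAML.load(yaml_text)
```

If the output looks right, the same call could replace the individual puts lines, provided the surrounding structure is adjusted to whatever the importer requires.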

The next problem to solve is parsing the free-form date entered by the user. I had this working using Ruby’s ParseDate, but I’m unable to reliably produce either a date that can be fed to MySQL or a nil value. I am passing the output from ParseDate’s sole method, parsedate, to the new method of Ruby’s Date class, but after working for a while the method is now returning an “invalid date” error. More later…
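One possible way to get “a date or nil” is sketched below. It swaps in Date.parse from Ruby’s standard date library rather than ParseDate followed by Date.new, so it is an alternative to the approach described above, not a fix for it. The method name parse_user_date is made up, and Date.parse is lenient about what it accepts, so real input would need more validation than this.

```ruby
require 'date'

# Returns the date as a 'YYYY-MM-DD' string suitable for a MySQL DATE column,
# or nil when the input is empty or cannot be parsed as a valid date.
def parse_user_date(text)
  return nil if text.nil? || text.strip.empty?
  Date.parse(text.strip).strftime('%Y-%m-%d')
rescue ArgumentError
  # Date.parse raises ArgumentError (or a subclass) for unparseable or
  # impossible dates such as February 30.
  nil
end

puts parse_user_date('April 12, 2007').inspect
puts parse_user_date('2007-02-30').inspect
```

Rescuing the exception in one place means the caller only ever sees a string or nil, which is the either/or behavior the form handler needs.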

Just Missed Me – Chapter One

OK, need proof that I’m a nuby? How’s this for proof?

#!/usr/bin/env ruby

fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"
page_start = /\<page\>/
page_end = /\<\/page\>/
has_persondata = /\{\{Persondata/
regex_name = /\|NAME=(.*)/
regex_alter_names = /\|ALTERNATIVE NAMES=(.*)/
regex_description = /\|SHORT DESCRIPTION=(.*)/
regex_page_title = /\<title\>(.*)\<\/title\>/
regex_page_id = /\<id\>(.*)\<\/id\>/
regex_date_birth = /\|DATE OF BIRTH=(.*)/
regex_place_birth = /\|PLACE OF BIRTH=(.*)/
regex_date_death = /\|DATE OF DEATH=(.*)/
regex_place_death = /\|PLACE OF DEATH=(.*)/
regex_revision_timestamp =
   /\<revision\>.*\<timestamp\>(.*)\<\/timestamp\>.*\<\/revision\>/m

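# Stop early with a clear message instead of silently discarding the check.
abort "Cannot read #{fp}" unless File.exist?(fp) && File.readable?(fp)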
File.open(fp, "r") do |file|
  concat = false
  page_text = ''
  while line = file.gets
    if !page_start.match(line).nil? or concat
      concat = true
      page_text = page_text + line
      if !page_end.match(line).nil?
        if !has_persondata.match(page_text).nil?
          # String#[] with a regex returns the capture, or nil when nothing
          # matches, so a value from an earlier page cannot leak into this one.
          name = page_text[regex_name, 1]
          alternative_names = page_text[regex_alter_names, 1]
          description = page_text[regex_description, 1]
          page_title = page_text[regex_page_title, 1]
          page_id = page_text[regex_page_id, 1]
          date_birth = page_text[regex_date_birth, 1]
          place_birth = page_text[regex_place_birth, 1]
          date_death = page_text[regex_date_death, 1]
          place_death = page_text[regex_place_death, 1]
          revision_timestamp = page_text[regex_revision_timestamp, 1]
        end
        concat = false
        page_text = ''
      end
    end
  end
end

I decided that STX would require too significant an investment of time, and while I may integrate hpricot at some point, the solution above seems sufficient for the moment.
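The line-by-line scan above can be tried without the multi-gigabyte dump. The sketch below is a restructuring of the same idea, not the script as written: each_page is a made-up helper name, and the sample XML and the person in it are invented. Because only one page is buffered at a time, memory use stays flat however large the input is.

```ruby
require 'stringio'

# Yields the text of each <page>...</page> element, reading one line at a time.
def each_page(io)
  buffer = nil
  io.each_line do |line|
    buffer = String.new if line.include?('<page>')
    next if buffer.nil?          # still outside any page
    buffer << line
    next unless line.include?('</page>')
    yield buffer
    buffer = nil
  end
end

# Invented sample in the general shape of the dump; not real Wikipedia data.
SAMPLE_DUMP = <<~XML
  <mediawiki>
    <page>
      <title>Example Person</title>
      <id>1</id>
      <text>{{Persondata
      |NAME=Person, Example
      |DATE OF DEATH=[[1 January]] [[1900]]
      }}</text>
    </page>
    <page>
      <title>Not A Person</title>
      <id>2</id>
      <text>No template here.</text>
    </page>
  </mediawiki>
XML

people = []
each_page(StringIO.new(SAMPLE_DUMP)) do |page|
  next unless page.include?('{{Persondata')
  people << {
    title: page[/<title>(.*)<\/title>/, 1],
    died: page[/\|DATE OF DEATH=(.*)/, 1]
  }
end

people.each { |person| puts person.inspect }
```

Swapping the StringIO for File.open on the real dump should exercise the same loop, since both respond to each_line.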

I have yet to begin importing any of the data into the database. I’ll want to format the data consistently and remove the wiki-specific mark-up. Feel free to rip the code apart and please do make suggestions. I am, after all, a nuby.
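As a rough idea of what removing the wiki mark-up could involve, here is a sketch that handles three common cases: links, template braces, and the quote runs used for bold and italics. The method name strip_wiki_markup is made up, and real Persondata values contain plenty of constructs this does not cover.

```ruby
# Strips a few common wiki constructs from a single field value.
def strip_wiki_markup(text)
  text
    .gsub(/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/, '\1') # [[target|label]] -> label, [[target]] -> target
    .gsub(/\{\{|\}\}/, '')                        # drop template braces, keep their contents
    .gsub(/'{2,}/, '')                            # drop ''italic'' and '''bold''' quote runs
    .strip
end

puts strip_wiki_markup("[[27 November]] [[1852]]")
puts strip_wiki_markup("''[[London]]'', [[England|UK]]")
```

Keeping the label rather than the target of a piped link is a judgment call; the target is usually the more canonical name, so the opposite choice may suit a database better.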

Just Missed Me – Chapter Zero

I became interested in hpricot while experimenting during the creation of a prospecting system for my employer. Driving home from work one day I hatched an idea for a website that would employ hpricot to pull its data from Wikipedia’s Persondata.

Due to the extremely large size of the associated data file, I soon determined that an hpricot-based solution would not scale. Digging a bit on Wikipedia, I discovered STX, which appears to be a more promising means of proceeding. I’ll keep you posted.