Trying to remain agile in a BIG design world.

The “Rails” components of the website were completed in about 20 minutes and consist of nothing more than a form which provides the user a means to supply a date and returns an appropriate (some would argue) list of the deceased.

After quite a bit of experimentation and refactoring I have refined the script which parses Wikipedia’s Persondata from the XML dump to the following:

#!/usr/bin/env ruby

require 'parsedate' # Ruby 1.8 standard library; removed in Ruby 1.9

fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"
page_start = /\<page\>/
page_end = /\<\/page\>/
has_persondata = /\{\{Persondata/
regex_name = /\|NAME=(.*)/
regex_page_title = /\<title\>(.*)\<\/title\>/
regex_page_id = /\<id\>(.*)\<\/id\>/
regex_date_death = /\|DATE OF DEATH=(.*)/

def cleanse data
  wiki_marks = /\[\[|\]\]|\{\{|\}\}/
  clean = data.strip.gsub(wiki_marks, '')
  clean.gsub(/'/, "\\\\'")
end

def process_date date
  if !date.nil?
    date = ParseDate.parsedate(date)
    if !date[0].nil? and !date[1].nil? and !date[2].nil?
      sprintf("%04d-%02d-%02d", date[0], date[1], date[2])
    else
      nil
    end
  end
end

# Stop early if the dump is missing or unreadable.
abort "Cannot read #{fp}" unless File.exist?(fp) && File.readable?(fp)
File.open(fp, "r") do |file|
  i = 0
  concat = false
  page_text = ''
  while line = file.gets
    if !page_start.match(line).nil? or concat
      concat = true
      page_text = page_text + line
      if !page_end.match(line).nil?
        if !has_persondata.match(page_text).nil?
          i = i + 1
          # String#[] with a capture index returns nil when nothing matches,
          # so no value is carried over from the previous page.
          date_death = page_text[regex_date_death, 1]
          name = page_text[regex_name, 1]
          page_title = page_text[regex_page_title, 1]
          page_id = page_text[regex_page_id, 1]
          # Skip cleansing when the page has no DATE OF DEATH field.
          date_of_death = date_death && process_date(cleanse(date_death))
          if !date_of_death.nil?
            puts "- !ruby/object:Person"
            puts "  attributes:"
            puts "    date_of_death: " + date_of_death unless date_of_death.nil?
            puts "    name: " + cleanse(name) unless name.nil?
            puts "    page_title: " + cleanse(page_title) unless page_title.nil?
            puts "    page_id: " + cleanse(page_id) unless page_id.nil?
            puts "    id: " + i.to_s
          end
        end
        concat = false
        page_text = ''
      end
    end
  end
end
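
To illustrate the extraction step in isolation, here is a minimal sketch that applies the same kind of patterns to a single page. The sample page, the FIELD_PATTERNS and WIKI_MARKS constants, and the extract_fields helper are all made up for this example, and it assumes a newer Ruby (2.3 or later) than the script above was written for.

```ruby
# Hypothetical sample page; the names and numbers are made up for illustration.
SAMPLE_PAGE = <<~XML
  <page>
    <title>Jane Example</title>
    <id>12345</id>
    <text>{{Persondata
    |NAME=Example, Jane
    |DATE OF DEATH=[[27 November]] [[1852]]
    }}</text>
  </page>
XML

# One pattern per field, each with a single capture group.
FIELD_PATTERNS = {
  :page_title    => /<title>(.*)<\/title>/,
  :page_id       => /<id>(.*)<\/id>/,
  :name          => /\|NAME=(.*)/,
  :date_of_death => /\|DATE OF DEATH=(.*)/
}

# Wiki link and template delimiters to strip from captured values.
WIKI_MARKS = /\[\[|\]\]|\{\{|\}\}/

# Returns a hash of field => cleaned value, with nil for any missing field.
def extract_fields(page_text)
  FIELD_PATTERNS.each_with_object({}) do |(field, pattern), fields|
    raw = page_text[pattern, 1] # nil when the pattern does not match
    fields[field] = raw && raw.strip.gsub(WIKI_MARKS, '')
  end
end

fields = extract_fields(SAMPLE_PAGE)
puts fields.inspect
```

Because every field is looked up fresh for each page, a page without a given field yields nil rather than a leftover value.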

I am no longer collecting much of the data I was originally parsing. I am re-evaluating the schema, since much of the data within the Wikipedia database is devoid of any consistent form (as expected).

The keen observer will note that this script generates YAML, which I import painlessly via Geoffrey Grosenbach’s ar_fixtures plugin for Rails.
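
One thing to watch with hand-built YAML is that a value containing a colon followed by a space, or a leading quote, can confuse the parser. As a rough sketch of an alternative (not what the script above does), Ruby's own YAML library can serialize the attributes and take care of the quoting. The record below is hypothetical, and the snippet only covers the attributes themselves, not the "!ruby/object:Person" wrapper.

```ruby
require 'yaml'

# Hypothetical record; in the script these values come from the parsed page.
attributes = {
  'date_of_death' => '1852-11-27',
  'name'          => "O'Example: Jane",
  'page_title'    => 'Jane Example',
  'page_id'       => '12345',
  'id'            => 1
}

# to_yaml quotes any value that would otherwise be ambiguous.
yaml_text = attributes.to_yaml
puts yaml_text

# Reading the text back should reproduce the original values.
round_trip = YAML.load(yaml_text)
```

The round trip is a cheap way to check that awkward names survive serialization before importing anything.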

The next problem to solve is parsing the free-form date entered by the user. I had this working using Ruby’s ParseDate, but I’m unable to reliably produce either a date that can be fed to MySQL or a nil value. I am passing the output of ParseDate’s sole method, parsedate, to the new method of Ruby’s Date class; after working for a while, that method now returns an “invalid date” error. More later…
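
One plausible source of an “invalid date” error is handing Date.new components that are out of range, such as February 30. The sketch below shows one way to end up with either a Date or nil. It assumes Ruby 1.9 or later, where ParseDate no longer exists, so it swaps in Date._parse and Date.valid_date? from the standard date library; parse_user_date is a hypothetical helper name, not something from the site's code.

```ruby
require 'date'

# Hypothetical helper: returns a Date for a complete, valid calendar date,
# and nil for anything else.
def parse_user_date(text)
  # Date._parse returns a hash of whatever components it could recognize.
  parts = Date._parse(text.to_s)
  year, month, day = parts.values_at(:year, :mon, :mday)

  # Reject partial input such as "April 2007".
  return nil unless year && month && day

  # Reject impossible dates such as February 30 before calling Date.new.
  return nil unless Date.valid_date?(year, month, day)

  Date.new(year, month, day)
rescue ArgumentError
  # Newer Rubies raise when the input exceeds Date._parse's length limit.
  nil
end

date = parse_user_date("April 20, 2007")
puts date.nil? ? "no usable date" : date.strftime("%Y-%m-%d")
```

Formatting the result with strftime("%Y-%m-%d") gives the same layout the parsing script prints for date_of_death.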

§5 · April 20, 2007 · JustMissedMe, Rails, Ruby