The “Rails” components of the website were completed in about 20 minutes and consist of nothing more than a form which provides the user a means to supply a date and returns an appropriate (some would argue) list of the deceased.
After quite a bit of experimentation and refactoring I have refined the script which parses Wikipedia’s Persondata from the XML dump to the following:
#!/usr/bin/env ruby
require 'parsedate' # Ruby 1.8 standard library

fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"

page_start       = /\<page\>/
page_end         = /\<\/page\>/
has_persondata   = /\{\{Persondata/
regex_name       = /\|NAME=(.*)/
regex_page_title = /\<title\>(.*)\<\/title\>/
regex_page_id    = /\<id\>(.*)\<\/id\>/
regex_date_death = /\|DATE OF DEATH=(.*)/

# Strip wiki markup ([[ ]] and {{ }}) and escape single quotes.
def cleanse(data)
  wiki_marks = /\[\[|\]\]|\{\{|\}\}/
  clean = data.strip.gsub(wiki_marks, '')
  clean.gsub(/'/, "\\\\'")
end

# Return the date as YYYY-MM-DD, or nil unless year, month and day all parse.
def process_date(date)
  return nil if date.nil?
  parts = ParseDate.parsedate(date)
  return nil if parts[0].nil? || parts[1].nil? || parts[2].nil?
  sprintf("%04d-%02d-%02d", parts[0], parts[1], parts[2])
end

abort "Cannot read #{fp}" unless File.exist?(fp) && File.readable?(fp)

File.open(fp, "r") do |file|
  i = 0
  concat = false
  page_text = ''
  while line = file.gets
    # Accumulate lines from <page> through </page>.
    if page_start.match(line) || concat
      concat = true
      page_text << line
      if page_end.match(line)
        if has_persondata.match(page_text)
          i = i + 1
          # String#[] with a regex yields nil when nothing matches, so a
          # value from the previous page can never leak into this one.
          date_death = page_text[regex_date_death, 1]
          name       = page_text[regex_name, 1]
          page_title = page_text[regex_page_title, 1]
          page_id    = page_text[regex_page_id, 1]
          date_of_death = date_death.nil? ? nil : process_date(cleanse(date_death))
          if !date_of_death.nil?
            puts "- !ruby/object:Person"
            puts "  attributes:"
            puts "    date_of_death: " + date_of_death
            puts "    name: " + cleanse(name) unless name.nil?
            puts "    page_title: " + cleanse(page_title) unless page_title.nil?
            puts "    page_id: " + cleanse(page_id) unless page_id.nil?
            puts "    id: " + i.to_s
          end
        end
        concat = false
        page_text = ''
      end
    end
  end
end
I have stopped collecting much of the data I was originally parsing while I re-evaluate the schema, since most of the data within the Wikipedia database lacks any consistent form (as expected).
The keen observer will note that this script generates YAML, which I import painlessly via Geoffrey Grosenbach’s ar_fixtures plugin for Rails.
The next problem to solve is parsing the free-form date entered by the user. I had this working using Ruby’s ParseDate, but I’m unable to reliably produce either a date that can be fed to MySQL or a nil value. I am passing the output from ParseDate’s sole method, parsedate, to the new method of Ruby’s Date class, but after working for a while the method is now returning an “invalid date” error. More later…
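One way to get the "date or nil" behaviour described above is to check the parsed components before building a Date. What follows is only a sketch under assumptions, not the code from the site. It uses Date._parse from the standard date library in place of ParseDate (which was removed from Ruby as of 1.9), and the helper names parse_user_date and mysql_date are hypothetical. A plausible cause of an "invalid date" error is a missing or out-of-range component reaching Date.new, so the sketch returns nil in those cases instead.

```ruby
require 'date'

# Hypothetical helper: turn free-form user input into a Date, or nil
# when the input does not contain a complete, valid year/month/day.
def parse_user_date(input)
  return nil if input.nil? || input.strip.empty?

  # Date._parse returns a hash of whatever components it recognised;
  # it does not validate them and does not raise on partial input.
  parts = Date._parse(input)
  year, month, day = parts[:year], parts[:mon], parts[:mday]
  return nil if year.nil? || month.nil? || day.nil?

  # Reject impossible dates (e.g. 30 February) before calling Date.new.
  return nil unless Date.valid_date?(year, month, day)

  Date.new(year, month, day)
end

# Hypothetical helper: format the date as YYYY-MM-DD for a MySQL DATE
# column, or return nil so the caller can handle unusable input.
def mysql_date(input)
  date = parse_user_date(input)
  date && date.strftime("%Y-%m-%d")
end
```

With this shape the form handler only has to check for nil: mysql_date("30 November 2006") yields a string suitable for a DATE column, while empty input or an impossible date yields nil rather than an exception.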