The “Rails” components of the website were completed in about 20 minutes and consist of nothing more than a form which provides the user a means to supply a date and returns an appropriate (some would argue) list of the deceased.
After quite a bit of experimentation and refactoring I have refined the script which parses Wikipedia’s Persondata from the XML dump to the following:
#!/usr/bin/env ruby
require 'parsedate'
fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"
page_start = /\<page\>/
page_end = /\<\/page\>/
has_persondata = /\{\{Persondata/
regex_name = /\|NAME=(.*)/
regex_page_title = /\<title\>(.*)\<\/title\>/
regex_page_id = /\<id\>(.*)\<\/id\>/
regex_date_death = /\|DATE OF DEATH=(.*)/
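# Strip surrounding whitespace and wiki link/template markers,
# then backslash-escape apostrophes.
def cleanse data
  wiki_marks = /\[\[|\]\]|\{\{|\}\}/
  clean = data.strip.gsub(wiki_marks, '')
  clean.gsub(/'/, "\\\\'")
end
# Convert a free-form date string to "YYYY-MM-DD".
# Returns nil unless year, month and day were all parsed.
def process_date date
  if !date.nil?
    date = ParseDate.parsedate(date)
    if !date[0].nil? and !date[1].nil? and !date[2].nil?
      sprintf("%04d-%02d-%02d", date[0], date[1], date[2])
    else
      nil
    end
  end
end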
abort "Cannot read #{fp}" unless File.exists?(fp) and File.readable?(fp)
File.open(fp, "r") do |file|
  i = 0
  concat = false
  page_text = ''
  while line = file.gets
    if !page_start.match(line).nil? or concat
      # Accumulate lines until the closing page tag is seen.
      concat = true
      page_text = page_text + line
      if !page_end.match(line).nil?
        if !has_persondata.match(page_text).nil?
          i = i + 1
          # String#[] returns nil when the pattern does not match, so no
          # value is carried over from the previous page.
          date_death = page_text[regex_date_death, 1]
          name = page_text[regex_name, 1]
          page_title = page_text[regex_page_title, 1]
          page_id = page_text[regex_page_id, 1]
          date_of_death = date_death.nil? ? nil : process_date(cleanse(date_death))
          if !date_of_death.nil?
            puts "- !ruby/object:Person"
            puts "  attributes:"
            puts "    date_of_death: " + date_of_death
            puts "    name: " + cleanse(name) unless name.nil?
            puts "    page_title: " + cleanse(page_title) unless page_title.nil?
            puts "    page_id: " + cleanse(page_id) unless page_id.nil?
            puts "    id: " + i.to_s
          end
        end
        concat = false
        page_text = ''
      end
    end
  end
end
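To make it easier to see what those regular expressions actually capture, here is a minimal sketch that runs the same patterns against a single page held in memory. The sample page, its contents, and the extract_fields helper are made up for illustration; real dump entries carry many more elements, including further id tags for revisions and contributors, so matching the first one is what yields the page id.

```ruby
# Hypothetical, abbreviated page in the shape of a dump entry.
SAMPLE_PAGE = <<~XML
  <page>
  <title>John Doe</title>
  <id>12345</id>
  <text>{{Persondata
  |NAME=Doe, John
  |DATE OF DEATH=[[30 November]] [[2006]]
  }}</text>
  </page>
XML

# Hypothetical helper: returns the raw captured fields for a page,
# or nil when the page has no Persondata template.
def extract_fields(page_text)
  return nil unless page_text =~ /\{\{Persondata/
  {
    :page_title => page_text[/<title>(.*)<\/title>/, 1],
    :page_id    => page_text[/<id>(.*)<\/id>/, 1],
    :name       => page_text[/\|NAME=(.*)/, 1],
    :date_death => page_text[/\|DATE OF DEATH=(.*)/, 1]
  }
end

fields = extract_fields(SAMPLE_PAGE)
fields.each { |key, value| puts "#{key}: #{value}" }
```

The date still carries its wiki link brackets at this point, which is why the script passes it through cleanse before trying to parse it.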
I have stopped collecting much of the data I was originally parsing while I re-evaluate the schema, since (as expected) much of the data in the Wikipedia database lacks any consistent form.
The keen observer will note that this script generates YAML, which I import painlessly via Geoffrey Grosenbach’s ar_fixtures plugin for Rails.
The next problem to solve is parsing the free-form date entered by the user. I had this working using Ruby’s ParseDate, but I’m unable to reliably produce either a date that can be fed to MySQL or a nil value. I am passing the output of ParseDate’s sole method, parsedate, to the new method of Ruby’s Date class, but after working for a while the method is now returning an “invalid date” error. More later…
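One possible way around the “invalid date” error, offered as a sketch rather than a settled fix: it assumes the error comes from handing Date.new a missing (nil) or out-of-range field. The sketch swaps ParseDate for Date._parse from the standard date library, which returns a hash of whatever fields it could recognise, and checks those fields before building the date. The helper name parse_user_date is hypothetical.

```ruby
require 'date'

# Hypothetical helper: returns a Date for free-form input, or nil when the
# input does not yield a complete, valid year/month/day.
def parse_user_date(text)
  return nil if text.nil? || text.strip.empty?

  # Date._parse returns only the fields it recognised, e.g. :year, :mon, :mday.
  parts = Date._parse(text.strip)
  year, month, day = parts[:year], parts[:mon], parts[:mday]

  # Reject partial dates instead of letting Date.new raise.
  return nil if year.nil? || month.nil? || day.nil?
  # Reject impossible dates such as 30 February.
  return nil unless Date.valid_date?(year, month, day)

  Date.new(year, month, day)
end

date = parse_user_date("November 30, 2006")
puts date.nil? ? "no usable date" : date.to_s
```

Date#to_s produces the YYYY-MM-DD form, which is the format MySQL accepts for a DATE column, and a nil result can be stored as NULL.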