Trying to remain agile in a BIG design world.

OK, need proof that I’m a nuby? How’s this for proof?

#!/usr/bin/env ruby

fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"
page_start = /\<page\>/
page_end = /\<\/page\>/
has_persondata = /\{\{Persondata/
regex_name = /\|NAME=(.*)/
regex_alter_names = /\|ALTERNATIVE NAMES=(.*)/
regex_description = /\|SHORT DESCRIPTION=(.*)/
regex_page_title = /\<title\>(.*)\<\/title\>/
regex_page_id = /\<id\>(.*)\<\/id\>/
regex_date_birth = /\|DATE OF BIRTH=(.*)/
regex_place_birth = /\|PLACE OF BIRTH=(.*)/
regex_date_death = /\|DATE OF DEATH=(.*)/
regex_place_death = /\|PLACE OF DEATH=(.*)/
regex_revision_timestamp =
   /\<revision\>.*\<timestamp\>(.*)\<\/timestamp\>.*\<\/revision\>/m

# Bail out early if the dump is missing or unreadable.
abort "Cannot read #{fp}" unless File.exist?(fp) && File.readable?(fp)
File.open(fp, "r") do |file|
  concat = false
  page_text = ''
  while line = file.gets
    if !page_start.match(line).nil? or concat
      concat = true
      page_text = page_text + line
      if !page_end.match(line).nil?
        if !has_persondata.match(page_text).nil?
          name = regex_name.match(page_text)[1] unless
            regex_name.match(page_text).nil?
          alternative_names = regex_alter_names.match(page_text)[1] unless
            regex_alter_names.match(page_text).nil?
          description = regex_description.match(page_text)[1] unless
            regex_description.match(page_text).nil?
          page_title = regex_page_title.match(page_text)[1] unless
            regex_page_title.match(page_text).nil?
          page_id = regex_page_id.match(page_text)[1] unless
            regex_page_id.match(page_text).nil?
          date_birth = regex_date_birth.match(page_text)[1] unless
            regex_date_birth.match(page_text).nil?
          place_birth = regex_place_birth.match(page_text)[1] unless
            regex_place_birth.match(page_text).nil?
          date_death = regex_date_death.match(page_text)[1] unless
            regex_date_death.match(page_text).nil?
          place_death = regex_place_death.match(page_text)[1] unless
            regex_place_death.match(page_text).nil?
          revision_timestamp = regex_revision_timestamp.match(page_text)[1] unless
            regex_revision_timestamp.match(page_text).nil?
        end
        concat = false
        page_text = ''
      end
    end
  end
end
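
One tidy-up I can already see: every field is matched twice, once to test for nil and once to grab the capture. Here is a minimal sketch of a shorter version, assuming the fields live in a lookup table. The names PERSONDATA_FIELDS and extract_persondata are made up for this example, the sample page is invented, and only a few of the fields are shown. It leans on String#[] with a regex and a capture index, which returns nil when nothing matches.

```ruby
# Hypothetical lookup table: field name => pattern with one capture group.
# Only a subset of the fields from the script is listed here.
PERSONDATA_FIELDS = {
  name:        /\|NAME=(.*)/,
  description: /\|SHORT DESCRIPTION=(.*)/,
  page_title:  /<title>(.*)<\/title>/,
  page_id:     /<id>(.*)<\/id>/,
  date_birth:  /\|DATE OF BIRTH=(.*)/,
  place_birth: /\|PLACE OF BIRTH=(.*)/
}.freeze

# Builds a hash of field => captured text for one <page> block.
# A field whose pattern does not match is stored as nil.
def extract_persondata(page_text)
  PERSONDATA_FIELDS.each_with_object({}) do |(field, pattern), record|
    record[field] = page_text[pattern, 1]
  end
end

# Invented sample page, just to exercise the method.
sample_page = <<~PAGE
  <page>
  <title>Ada Example</title>
  <id>42</id>
  <text>{{Persondata
  |NAME=Example, Ada
  |DATE OF BIRTH=1 January 1900
  }}</text>
  </page>
PAGE

record = extract_persondata(sample_page)
puts record[:name]
puts record[:place_birth].inspect
```

The loop in the script would then call extract_persondata(page_text) once per page instead of assigning ten separate variables.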

I decided that STX would require too significant an investment of time, and while I may integrate Hpricot at some point, the solution above seems sufficient for the moment.

I have yet to begin importing any of the data into the database. First I’ll want to format the data consistently and remove the wiki-specific markup. Feel free to rip the code apart, and please do make suggestions. I am, after all, a nuby.
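
For the markup clean-up, here is a rough first pass at what I have in mind. It is a sketch under assumptions, not a full wikitext parser: it only handles [[target|label]] links and the runs of quote marks used for bold and italics, and the method name strip_wiki_markup is my own invention. Templates, HTML entities and nested markup are left alone.

```ruby
# Hypothetical helper: removes the most common wiki markup from one field value.
# Returns nil unchanged so that missing fields pass straight through.
def strip_wiki_markup(value)
  return nil if value.nil?

  value
    .gsub(/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/, '\1') # [[target|label]] -> label, [[target]] -> target
    .gsub(/'{2,}/, '')                            # drop ''italic'' and '''bold''' quote runs
    .strip
end

puts strip_wiki_markup("[[1 January]] [[1900]]")
puts strip_wiki_markup("[[Paris|Paris, France]]")
```

Running each captured value through a helper like this before it reaches the database should at least keep the link brackets out of the date and place columns.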

§4 · April 17, 2007 · JustMissedMe, Rails, Ruby