OK, need proof that I’m a nuby? How’s this for proof?
    #!/usr/bin/env ruby

    fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"

    # Markers for the start and end of a <page> element in the dump.
    page_start = /\<page\>/
    page_end = /\<\/page\>/

    # Fields of the {{Persondata}} template plus a few page-level values.
    has_persondata = /\{\{Persondata/
    regex_name = /\|NAME=(.*)/
    regex_alter_names = /\|ALTERNATIVE NAMES=(.*)/
    regex_description = /\|SHORT DESCRIPTION=(.*)/
    regex_page_title = /\<title\>(.*)\<\/title\>/
    regex_page_id = /\<id\>(.*)\<\/id\>/
    regex_date_birth = /\|DATE OF BIRTH=(.*)/
    regex_place_birth = /\|PLACE OF BIRTH=(.*)/
    regex_date_death = /\|DATE OF DEATH=(.*)/
    regex_place_death = /\|PLACE OF DEATH=(.*)/
    regex_revision_timestamp =
      /\<revision\>.*\<timestamp\>(.*)\<\/timestamp\>.*\<\/revision\>/m

    # Bail out early if the dump is missing or unreadable.
    abort "Cannot read #{fp}" unless File.exist?(fp) && File.readable?(fp)

    File.open(fp, "r") do |file|
      concat = false
      page_text = ''
      while line = file.gets
        # Start buffering at <page> and keep going until </page>.
        if line =~ page_start || concat
          concat = true
          page_text << line
          if line =~ page_end
            if page_text =~ has_persondata
              # String#[] with a regex and a group index returns nil on no match.
              name = page_text[regex_name, 1]
              alternative_names = page_text[regex_alter_names, 1]
              description = page_text[regex_description, 1]
              page_title = page_text[regex_page_title, 1]
              page_id = page_text[regex_page_id, 1]
              date_birth = page_text[regex_date_birth, 1]
              place_birth = page_text[regex_place_birth, 1]
              date_death = page_text[regex_date_death, 1]
              place_death = page_text[regex_place_death, 1]
              revision_timestamp = page_text[regex_revision_timestamp, 1]
            end
            # Reset the buffer for the next page.
            concat = false
            page_text = ''
          end
        end
      end
    end
I decided that STX would require too significant an investment in time, and while I may integrate hpricot at some point, the solution above seems sufficient for the moment.
I have yet to begin importing any of the data into the database. First I’ll want to format the data consistently and remove the wiki-specific markup. Feel free to rip the code apart, and please do make suggestions. I am, after all, a nuby.
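As a rough idea of what the markup clean-up might look like, here is a minimal sketch. It assumes the Persondata fields mostly contain wiki links like `[[Target|Label]]` and quote-style emphasis. The helper name `strip_wiki_markup` is hypothetical and not part of the script above, and real dump data will almost certainly need more rules than these two.

```ruby
# Hypothetical helper (not part of the script above): strips a couple of
# common wiki constructs from a single Persondata field value.
def strip_wiki_markup(value)
  return nil if value.nil?

  text = value.dup
  # [[Target|Label]] becomes Label; [[Target]] becomes Target.
  text.gsub!(/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/, '\1')
  # Drop runs of two or more apostrophes (''italic'' and '''bold''').
  text.gsub!(/'{2,}/, '')
  text.strip
end

puts strip_wiki_markup("[[Cupertino, California|Cupertino]], [[California]]")
```

Because it passes `nil` straight through, it could be wrapped around any of the extracted fields, for example `place_birth = strip_wiki_markup(place_birth)`, without extra checks for fields that were missing from the template.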