<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Agilo.us &#187; JustMissedMe</title>
	<atom:link href="http://agilo.us/category/justmissedme/feed/" rel="self" type="application/rss+xml" />
	<link>http://agilo.us</link>
	<description>Trying to remain agile in a BIG design world.</description>
	<lastBuildDate>Fri, 16 Apr 2010 06:03:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Just Missed Me &#8211; Chapter Two</title>
		<link>http://agilo.us/2007/04/20/just-missed-me-chapter-two/</link>
		<comments>http://agilo.us/2007/04/20/just-missed-me-chapter-two/#comments</comments>
		<pubDate>Sat, 21 Apr 2007 02:25:05 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[JustMissedMe]]></category>
		<category><![CDATA[Rails]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://agilo.us/?p=5</guid>
		<description><![CDATA[The &#8220;Rails&#8221; components of the website were completed in about 20 minutes and consist of nothing more than a form which provides the user a means to supply a date and returns an appropriate (some would argue) list of the deceased. After quite a bit of experimentation and refactoring I have refined the script which [...]]]></description>
			<content:encoded><![CDATA[<p>The &#8220;Rails&#8221; components of the website were completed in about 20 minutes and consist of nothing more than a form which provides the user a means to supply a date and returns an appropriate (some would argue) list of the deceased.</p>
<p>After quite a bit of experimentation and refactoring I have refined the script which parses Wikipedia&#8217;s Persondata from the <a href="http://download.wikimedia.org/enwikisource/latest/">XML dump</a> to the following:</p>
<pre>
#!/usr/bin/env ruby

require 'parsedate'

fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"
page_start = /\&lt;page\&gt;/
page_end = /\&lt;\/page\&gt;/
has_persondata = /\{\{Persondata/
regex_name = /\|NAME=(.*)/
regex_page_title = /\&lt;title\&gt;(.*)\&lt;\/title\&gt;/
regex_page_id = /\&lt;id\&gt;(.*)\&lt;\/id\&gt;/
regex_date_death = /\|DATE OF DEATH=(.*)/

# Strip wiki link and template marks, then escape single quotes.
def cleanse data
  wiki_marks = /\[\[|\]\]|\{\{|\}\}/
  clean = data.strip.gsub(wiki_marks, '')
  clean.gsub(/'/, "\\\\'")
end

# Return the date as YYYY-MM-DD, or nil unless year, month and day all parse.
def process_date date
  if !date.nil?
    date = ParseDate.parsedate(date)
    if !date[0].nil? and !date[1].nil? and !date[2].nil?
      sprintf("%04d-%02d-%02d", date[0], date[1], date[2])
    else
      nil
    end
  end
end

abort "Cannot read #{fp}" unless File.exist?(fp) and File.readable?(fp)
File.open(fp, "r") do |file|
  i = 0
  concat = false
  page_text = ''
  while line = file.gets
    if !page_start.match(line).nil? or concat
      concat = true
      page_text = page_text + line
      if !page_end.match(line).nil?
        if !has_persondata.match(page_text).nil?
          i = i + 1
          # String#[] with a regex returns nil when nothing matches,
          # so no value carries over from the previous page.
          date_death = page_text[regex_date_death, 1]
          name = page_text[regex_name, 1]
          page_title = page_text[regex_page_title, 1]
          page_id = page_text[regex_page_id, 1]
          date_of_death = date_death.nil? ? nil : process_date(cleanse(date_death))
          if !date_of_death.nil?
            puts "- !ruby/object:Person"
            puts "  attributes:"
            puts "    date_of_death: " + date_of_death
            puts "    name: " + cleanse(name) unless name.nil?
            puts "    page_title: " + cleanse(page_title) unless page_title.nil?
            puts "    page_id: " + cleanse(page_id) unless page_id.nil?
            puts "    id: " + i.to_s
          end
        end
        concat = false
        page_text = ''
      end
    end
  end
end
</pre>
<p>I have stopped collecting much of the data I was parsing earlier while I re-evaluate the schema, since much of the data within the Wikipedia database lacks any consistent form (as expected).</p>
<p>The keen observer will note that this script generates YAML, which I import painlessly via Geoffrey Grosenbach&#8217;s <a href="http://nubyonrails.com/articles/read/207">ar_fixtures plugin</a> for Rails.</p>
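<p>As a hedged aside, the hand-built puts lines could instead be produced by Ruby&#8217;s standard YAML library, which takes care of quoting values that contain colons or quotes. The sketch below is an assumption, not the script used for the site: the person_record helper and the sample values are hypothetical, and it emits plain hashes rather than the !ruby/object:Person entries shown above, so it would need adapting before any import.</p>

```ruby
require 'yaml'

# Hypothetical helper: collect one person's fields into a plain Hash,
# skipping any field the page did not supply (nil values).
def person_record(id, fields)
  record = {}
  record['id'] = id
  fields.each do |key, value|
    record[key] = value unless value.nil?
  end
  record
end

# Sample values are made up for illustration.
people = [
  person_record(1, [
    ['name', "O'Hara, Example"],
    ['date_of_death', '1900-01-02'],
    ['page_id', nil]
  ])
]

# to_yaml quotes and escapes values wherever YAML requires it.
yaml_text = people.to_yaml
puts yaml_text
```

<p>Letting the library serialize each record avoids hand-escaping quotes, at the cost of giving up exact control over the layout of the output.</p>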
<p>The next problem to solve is parsing the free-form date entered by the user. I had this working using Ruby&#8217;s <a href="http://www.ruby-doc.org/stdlib/libdoc/parsedate/rdoc/index.html">ParseDate</a>, but I&#8217;m unable to reliably produce either a date that can be fed to MySQL or a nil value. I am passing the output from ParseDate&#8217;s sole method, parsedate, to the new method of Ruby&#8217;s Date class, but after working for a while the method is now returning an &#8220;invalid date&#8221; error. More later&#8230;</p>
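<p>A minimal sketch of one way to end up with either a usable date or nil, assuming the standard library&#8217;s Date.parse is an acceptable stand-in for ParseDate. The parse_user_date name is hypothetical, and this is not necessarily how the site ended up handling it. Date.parse raises ArgumentError (or a subclass of it) for input it cannot turn into a valid date, so the sketch rescues that and returns nil.</p>

```ruby
require 'date'

# Hypothetical helper: turn free-form user input into a 'YYYY-MM-DD'
# string suitable for a MySQL DATE column, or nil when it cannot be parsed.
def parse_user_date(text)
  return nil if text.nil? or text.strip.empty?
  Date.parse(text).strftime('%Y-%m-%d')
rescue ArgumentError
  # Unparseable or out-of-range input ends up here.
  nil
end

puts parse_user_date('April 20, 2007').inspect
puts parse_user_date('garbage').inspect
```

<p>Returning nil instead of raising keeps the decision about what to show the user in the calling code.</p>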
]]></content:encoded>
			<wfw:commentRss>http://agilo.us/2007/04/20/just-missed-me-chapter-two/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Just Missed Me &#8211; Chapter One</title>
		<link>http://agilo.us/2007/04/17/just-missed-me-chapter-one/</link>
		<comments>http://agilo.us/2007/04/17/just-missed-me-chapter-one/#comments</comments>
		<pubDate>Wed, 18 Apr 2007 04:04:02 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[JustMissedMe]]></category>
		<category><![CDATA[Rails]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://agilo.us/?p=4</guid>
		<description><![CDATA[OK, need proof that I&#8217;m a nuby? How&#8217;s this for proof? #!/usr/bin/env ruby fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml" page_start = /\&#60;page\&#62;/ page_end = /\&#60;\/page\&#62;/ has_persondata = /\{\{Persondata/ regex_name = /\&#124;NAME=(.*)/ regex_alter_names = /\&#124;ALTERNATIVE NAMES=(.*)/ regex_description = /\&#124;SHORT DESCRIPTION=(.*)/ regex_page_title = /\&#60;title\&#62;(.*)\&#60;\/title\&#62;/ regex_page_id = /\&#60;id\&#62;(.*)\&#60;\/id\&#62;/ regex_date_birth = /\&#124;DATE OF BIRTH=(.*)/ regex_place_birth = /\&#124;PLACE OF BIRTH=(.*)/ regex_date_death = [...]]]></description>
			<content:encoded><![CDATA[<p>OK, need proof that I&#8217;m a nuby? How&#8217;s this for proof?</p>
<pre>
#!/usr/bin/env ruby

fp = "/Users/sjobs/data/enwiki-20061130-pages-articles.xml"
page_start = /\&lt;page\&gt;/
page_end = /\&lt;\/page\&gt;/
has_persondata = /\{\{Persondata/
regex_name = /\|NAME=(.*)/
regex_alter_names = /\|ALTERNATIVE NAMES=(.*)/
regex_description = /\|SHORT DESCRIPTION=(.*)/
regex_page_title = /\&lt;title\&gt;(.*)\&lt;\/title\&gt;/
regex_page_id = /\&lt;id\&gt;(.*)\&lt;\/id\&gt;/
regex_date_birth = /\|DATE OF BIRTH=(.*)/
regex_place_birth = /\|PLACE OF BIRTH=(.*)/
regex_date_death = /\|DATE OF DEATH=(.*)/
regex_place_death = /\|PLACE OF DEATH=(.*)/
regex_revision_timestamp =
   /\&lt;revision\&gt;.*\&lt;timestamp\&gt;(.*)\&lt;\/timestamp\&gt;.*\&lt;\/revision\&gt;/m

abort "Cannot read #{fp}" unless File.exist?(fp) and File.readable?(fp)
File.open(fp, "r") do |file|
  concat = false
  page_text = ''
  while line = file.gets
    if !page_start.match(line).nil? or concat
      concat = true
      page_text = page_text + line
      if !page_end.match(line).nil?
        if !has_persondata.match(page_text).nil?
          name = regex_name.match(page_text)[1] unless
            regex_name.match(page_text).nil?
          alternative_names = regex_alter_names.match(page_text)[1] unless
            regex_alter_names.match(page_text).nil?
          description = regex_description.match(page_text)[1] unless
            regex_description.match(page_text).nil?
          page_title = regex_page_title.match(page_text)[1] unless
            regex_page_title.match(page_text).nil?
          page_id = regex_page_id.match(page_text)[1] unless
            regex_page_id.match(page_text).nil?
          date_birth = regex_date_birth.match(page_text)[1] unless
            regex_date_birth.match(page_text).nil?
          place_birth = regex_place_birth.match(page_text)[1] unless
            regex_place_birth.match(page_text).nil?
          date_death = regex_date_death.match(page_text)[1] unless
            regex_date_death.match(page_text).nil?
          place_death = regex_place_death.match(page_text)[1] unless
            regex_place_death.match(page_text).nil?
          revision_timestamp = regex_revision_timestamp.match(page_text)[1] unless
            regex_revision_timestamp.match(page_text).nil?
        end
        concat = false
        page_text = ''
      end
    end
  end
end
</pre>
<p>I decided that <a href="http://en.wikipedia.org/wiki/Streaming_Transformations_for_XML">STX</a> would require too significant an investment in time, and while I may integrate <a href="http://code.whytheluckystiff.net/hpricot/">hpricot</a> at some point, the solution above seems sufficient for the moment.</p>
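<p>One possible tidy-up, offered as a sketch rather than as the code behind the site: the near-identical match lines in the script above could be collapsed into a single scan that collects every |FIELD=value line of a Persondata block into a Hash. The persondata_fields name and the sample text are hypothetical, and the sketch assumes each field sits on its own line with an upper-case name.</p>

```ruby
# Hypothetical helper: pull every |FIELD=value line of a Persondata
# template into a Hash keyed by field name.
def persondata_fields(page_text)
  fields = {}
  page_text.scan(/^\|([A-Z ]+)=(.*)$/) do |key, value|
    fields[key] = value.strip
  end
  fields
end

# Made-up sample in the shape of a Persondata block.
sample = [
  '{{Persondata',
  '|NAME=Example, Person',
  '|ALTERNATIVE NAMES=',
  '|SHORT DESCRIPTION=Fictional example',
  '|DATE OF BIRTH=[[January 1]], [[1800]]',
  '|DATE OF DEATH=[[January 2]], [[1900]]',
  '}}'
].join("\n")

fields = persondata_fields(sample)
puts fields['NAME']
puts fields['DATE OF DEATH']
```

<p>A Hash also makes it easy to add or drop fields while the schema is still in flux, since no per-field variable or regex has to change.</p>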
<p>I have yet to begin importing any of the data into the database. I&#8217;ll want to format the data consistently and remove the wiki-specific markup. Feel free to rip the code apart and please <a href="mailto:billATagiloDOTus?Subject=A%20suggestion%21">do make suggestions</a>. I am, after all, a nuby.</p>
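<p>For the markup removal, a minimal sketch under stated assumptions: the strip_wiki_markup name is hypothetical, and the patterns cover only simple links, piped links, stray template braces and bold/italic quote runs, not the full wiki syntax.</p>

```ruby
# Hypothetical helper: reduce a few common wiki constructs to plain text.
def strip_wiki_markup(text)
  # [[target|label]] becomes label, [[label]] becomes label
  plain = text.gsub(/\[\[(?:[^\]|]*\|)?([^\]]*)\]\]/, '\1')
  # drop stray template braces
  plain = plain.gsub(/\{\{|\}\}/, '')
  # drop runs of two or more single quotes (''italic'', '''bold''')
  plain = plain.gsub(/'{2,}/, '')
  plain.strip
end

puts strip_wiki_markup('[[January 2]], [[1900]]')
puts strip_wiki_markup('[[Paris, France|Paris]]')
```

<p>Keeping the label of a piped link, rather than its target, preserves the text as a reader of the article would have seen it.</p>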
]]></content:encoded>
			<wfw:commentRss>http://agilo.us/2007/04/17/just-missed-me-chapter-one/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Just Missed Me &#8211; Chapter Zero</title>
		<link>http://agilo.us/2007/04/12/just-missed-me-chapter-zero/</link>
		<comments>http://agilo.us/2007/04/12/just-missed-me-chapter-zero/#comments</comments>
		<pubDate>Thu, 12 Apr 2007 06:01:34 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[JustMissedMe]]></category>
		<category><![CDATA[Rails]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://agilo.us/?p=3</guid>
		<description><![CDATA[I became interested in hpricot while experimenting during the creation of a prospecting system for my employer. Driving home from work one day I hatched an idea for a website that would employ hpricot to pull its data from Wikipedia&#8217;s Persondata. Due to the extremely large size of the associated data file I soon determined [...]]]></description>
			<content:encoded><![CDATA[<p>I became interested in <a href="http://code.whytheluckystiff.net/hpricot/">hpricot</a> while experimenting during the creation of a prospecting system for my employer. Driving home from work one day I hatched an idea for a <a href="http://justmissedme.com">website</a> that would employ hpricot to pull its data from Wikipedia&#8217;s <a href="http://en.wikipedia.org/wiki/Wikipedia:Persondata">Persondata</a>.</p>
<p>Due to the extremely large size of the associated data file I soon determined that an hpricot-based solution would not scale. Digging a bit on Wikipedia I discovered <a href="http://en.wikipedia.org/wiki/Streaming_Transformations_for_XML">STX</a> which appears to be a more promising means of proceeding. I&#8217;ll keep you posted.</p>
]]></content:encoded>
			<wfw:commentRss>http://agilo.us/2007/04/12/just-missed-me-chapter-zero/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

