.:aaron.helton:.

Google hasn’t mapped my thoughts just yet, so don’t get lost

PMF RSS Redux


When I wrote the original RSS feed for the Projected Position System a few days ago, I was not satisfied with it. It had a number of limitations that I suggested would only be fixed if I did the parsing myself. Well, after a good deal of time figuring out how to do just that, I have created a new RSS feed that works better. I still use Yahoo! Pipes to serve it up, but the data now comes from my home server (incidentally, I would love to have a mirror for this if anyone is interested).

Enjoy: http://pipes.yahoo.com/aaronhelton/pmfrss

Oh, and for anyone who is interested in my parser, read on.

Parsing the PMF’s PPS Site

For things like this, I use Ruby more often than not. It is powerful yet concise, and it has a ton of third-party libraries (gems) that make a number of tasks easier.

I had tried screen-scraping the PPS before, but had met with errors due to some badly written HTML. In the header HTML, for example, there was a closing STYLE tag with no opening STYLE tag. Further down I found an illegal SPAN tag inside a TABLE element. These two issues caused me a great deal of grief. In the first case, it meant that very strict parsers were unable to process the document; in the second, it made finding the desired elements very difficult (as you will see below).

After some effort getting the right set of parsing tools for the job, I settled on Hpricot, since I really didn’t need to interact with the page in any meaningful way.  My basic search looks like this (including the beginning of the file):

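require 'rubygems'
require 'open-uri'
require 'hpricot'

url = "https://www.pmf.opm.gov/JobSearch/results.aspx"

# Empty array, filled in later in the full script
jobList = Array.new

# Fetch the page and parse it with Hpricot
doc = Hpricot(open(url))
# Remove the SPAN with id lblJobsList before searching
doc.search("//span[@id='lblJobsList']").remove

# Start from the small-font elements, climb three levels up, and collect the rows there
list = doc.search("//font[@SIZE='-2']../../../tr")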

That snippet is sufficient to open the PPS page, sort through the HTML, and return the section of the document that includes the job listings. From there it was a matter of limiting the output to 20 rows, cleaning up the data, and grabbing the elements that would appear in the RSS feed. I have the script set to run every hour from my home machine. With any luck it will keep running there for a while, but with even MORE luck, PMF will obviate it with a feed of their own.
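For anyone who wants a feel for those last steps without reading the full script, here is a rough sketch of capping the rows at 20, tidying the text, and emitting RSS 2.0. It is not the code from pmf.rb: the helper names (clean_cell, xml_escape, build_rss), the two-column row layout (title, agency), the channel title, and the sample data are all assumptions made for illustration, and it works on plain strings rather than Hpricot elements so it runs with no gems installed.

```ruby
MAX_ITEMS = 20 # the feed is limited to 20 rows

XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;", '"' => "&quot;" }.freeze

# Escape characters that are not allowed in XML text.
def xml_escape(text)
  text.gsub(/[&<>"]/, XML_ESCAPES)
end

# Collapse runs of whitespace left over from the HTML and trim the ends.
def clean_cell(text)
  text.gsub(/\s+/, " ").strip
end

# rows: array of [title, agency] string pairs (hypothetical columns).
# link: URL used for the channel and for every item (an assumption).
def build_rss(rows, link)
  lines = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<rss version="2.0">',
    "<channel>",
    "<title>PMF Projected Positions</title>",
    "<link>#{xml_escape(link)}</link>",
    "<description>Listings scraped from the PPS</description>"
  ]
  rows.first(MAX_ITEMS).each do |row|
    title, agency = row.map { |cell| xml_escape(clean_cell(cell)) }
    lines << "<item>"
    lines << "<title>#{title}</title>"
    lines << "<description>#{agency}</description>"
    lines << "<link>#{xml_escape(link)}</link>"
    lines << "</item>"
  end
  lines << "</channel>" << "</rss>"
  lines.join("\n")
end

# Made-up sample data: 25 identical rows with messy whitespace.
sample_rows = [["  Budget   Analyst\n", "Dept. of R&D "]] * 25
feed = build_rss(sample_rows, "https://www.pmf.opm.gov/JobSearch/results.aspx")
puts feed.scan("<item>").length # prints 20
```

In the real script the cell text would come from the rows in list rather than from a hard-coded array, and each item would presumably carry whatever fields the PPS table actually exposes.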

If you want the code for yourself: http://heltons.mooo.com/pmf/pmf.rb.txt

Written by aaronhelton

May 4, 2009 at 7:35 pm

Posted in development, pmf, ruby

One Response


  1. Aaron, this is excellent – great idea and a very needed service. Maybe when you start working at OPM you can set up a feed for rotational opportunities!

    tb

    May 5, 2009 at 1:45 am

