PMF RSS Redux
I was not satisfied with the original RSS feed I wrote for the Projected Position System a few days ago. It had a number of limitations that, as I suggested at the time, would only be fixed if I did the parsing myself. After a good deal of time figuring out how to do just that, I have created a new RSS feed that works better. I still use Yahoo! Pipes to serve it up, but the data now comes from a parser running on my home server (incidentally, I would love to have a mirror for this if anyone is interested).
Enjoy: http://pipes.yahoo.com/aaronhelton/pmfrss
Oh, and for anyone who is interested in my parser, read on.
Parsing the PMF’s PPS Site
For things like this, I use Ruby more often than not. It is powerful yet concise, and it has a ton of third-party libraries (gems) that make a number of tasks easier.
I had tried screen-scraping the PPS before, but had run into errors due to some badly written HTML. In the header, for example, there was a closing STYLE tag with no opening STYLE tag. Further down I found an illegal SPAN tag placed directly inside a TABLE element. These two issues caused me a great deal of grief. The first meant that very strict parsers were unable to process the document at all; the second made finding the desired elements very difficult (as you will see below).
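To make those two problems concrete, here is a minimal sketch in plain Ruby. The sample markup is made up to imitate the issues described above (it is not copied from the PPS page), and the `tidy_markup` helper is a hypothetical name. It is also not what my parser does; it only shows one blunt way to neutralize both problems before handing the page to a stricter parser.

```ruby
# Hypothetical markup imitating the two problems described above;
# it is NOT copied from the real PPS page.
SAMPLE_HTML = <<~HTML
  <html><head><title>Jobs</title></style></head>
  <body><table><span id="lblJobsList">
  <tr><td>Analyst</td></tr>
  </span></table></body></html>
HTML

# Hypothetical helper: drop closing STYLE tags only when the document has
# no opening STYLE tag, then unwrap every SPAN (the tags are removed, their
# contents are kept) so the TR rows sit directly inside the TABLE again.
def tidy_markup(html)
  cleaned = html.dup
  cleaned.gsub!(%r{</style\s*>}i, '') unless cleaned =~ /<style[\s>]/i
  cleaned.gsub(%r{</?span\b[^>]*>}i, '')
end

puts tidy_markup(SAMPLE_HTML)
```

Stripping every SPAN with a regular expression is crude and would be a poor fit for a page that uses SPAN legitimately; it is shown here only because it keeps the example free of third-party gems.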
After some effort getting the right set of parsing tools for the job, I settled on Hpricot, since I really didn’t need to interact with the page in any meaningful way. My basic search looks like this (including the beginning of the file):
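require 'rubygems'
require 'open-uri'
require 'hpricot'
# The PPS search results page
url = "https://www.pmf.opm.gov/JobSearch/results.aspx"
# Collects the job listings later in the script
jobList = Array.new
# Fetch the page and parse it
doc = Hpricot(open(url))
# Remove the illegal SPAN described above
doc.search("//span[@id='lblJobsList']").remove
# Climb three levels up from the FONT element with SIZE -2 and take the TR rows found there
list = doc.search("//font[@SIZE='-2']../../../tr")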
That snippet is sufficient to open the PPS page, sort through the HTML, and return the section of the document that includes the job listings. From there it was a matter of limiting the output to 20 rows, cleaning up the data, and grabbing the elements that would appear in the RSS feed. I have it set to run every hour from my home machine. With any luck, it will run there for a while, but with even MORE luck, PMF will make this unnecessary by publishing a feed of their own.
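The remaining steps (limit to 20 rows, clean up the text, emit feed items) could be sketched roughly as follows. This is an illustration under assumptions, not the code in pmf.rb: the `:title` and `:link` keys, the helper names, the channel title, and the example.com links are all hypothetical, and the rows are assumed to have already been pulled out of the parsed table as plain hashes.

```ruby
require 'time' # provides Time#rfc2822

MAX_ITEMS = 20 # the feed is limited to 20 rows

XML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
  '"' => '&quot;', "'" => '&apos;'
}.freeze

# Escape the five characters that are special in XML.
def xml_escape(text)
  text.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Collapse runs of whitespace into single spaces and trim the ends.
def clean(text)
  text.to_s.gsub(/[[:space:]]+/, ' ').strip
end

# rows: array of hashes with :title and :link (hypothetical field names).
# Returns an RSS 2.0 document as a string.
def build_feed(rows, source_url, now = Time.now)
  items = rows.first(MAX_ITEMS).map do |row|
    "<item><title>#{xml_escape(clean(row[:title]))}</title>" \
      "<link>#{xml_escape(row[:link])}</link></item>"
  end

  [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<rss version="2.0"><channel>',
    '<title>PMF Projected Positions (unofficial)</title>', # hypothetical title
    "<link>#{xml_escape(source_url)}</link>",
    '<description>Most recent listings scraped from the PPS results page</description>',
    "<lastBuildDate>#{now.rfc2822}</lastBuildDate>",
    *items,
    '</channel></rss>'
  ].join("\n")
end

# Made-up sample rows standing in for the scraped table data.
sample_rows = [
  { title: "  Program   Analyst ", link: 'https://example.com/jobs/1?a=1&b=2' },
  { title: 'Budget Analyst', link: 'https://example.com/jobs/2' }
]
puts build_feed(sample_rows, 'https://www.pmf.opm.gov/JobSearch/results.aspx')
```

Building the XML by hand keeps the sketch dependency-free; a real script might prefer a proper RSS or XML library, which would also take care of the escaping.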
If you want the code for yourself: http://heltons.mooo.com/pmf/pmf.rb.txt
Aaron, this is excellent – great idea and a very needed service. Maybe when you start working at OPM you can set up a feed for rotational opportunities!
tb
May 5, 2009 at 1:45 am