Adding RSS to the PMF PPS
[Update: I had to revise the feed links below since I could not rename the improved Pipes feed.]
I decided to take matters into my own hands and drag some portion of the PMF program kicking and screaming into the 21st Century: I made an RSS feed for the Projected Positions System. OK, so that’s a link to the Yahoo! Pipes application, but it does have its own feed.
What follows is my reasoning and methodology, the feed’s limitations (due to technical constraints), and a how-to on RSS for those who aren’t familiar.
Reasoning
I puzzled over (well, ranted about) what I considered missing functionality in my recent PMF Thoughts posting: the lack of RSS for the Projected Positions System. It seems that the federal government is only now getting the idea when it comes to providing data that can be consumed in a variety of ways, and the PPS is far behind any modern technology for job listings. It is cumbersome to have to visit the PMF web site and go through its search options just to get a list of newly posted jobs. This is even more tedious when you are looking at specific agencies, though my tool doesn’t really address that directly.
What I set out to create was a simple RSS feed of the jobs that had been posted in the PPS to date. The nature of RSS is that new postings show up at the top of the list in whatever program you have that reads the items (see the quick RSS primer at the end of this post if you haven’t used feeds before), and it updates the list by periodically polling for changes. That means you no longer have to visit the site just to see if something new is there; you just point a feed reader at it and let it do the checking for you.
Methodology
My first impulse back in mid-March was to build a screen scraper that would load the positions page and parse out all the listings. Of course, there is nothing technically limiting me from taking this approach (time notwithstanding), but my initial attempts did meet with some technology hurdles that I never really felt motivated to overcome. So I let the project sit around a while, until someone mentioned Yahoo! Pipes.
I had come across Yahoo! Pipes shortly after its initial launch in 2007, but I just regarded it as a plaything that didn’t look very useful. When I looked at it again yesterday, however, I immediately thought of what I could try with it. Pipes has a function that lets you parse regular web pages for content, then use that to build RSS (or other output type) feeds. I tried to use it, but it does not include a way to manipulate page elements (like clicking buttons and submitting forms).
Then I tried something: by going to the plain old job results page directly, bypassing the search filters, I got the ENTIRE listing. But I still ran into a problem. The Pipes module that fetches an HTML page is limited to 200KB, and the results page is over 400KB. Nothing I tried could knock that down in a way that Pipes would accept, and so for a bit I believed the whole thing would be impossible.
I don’t give up easily though. Some searching around turned up another screen-to-data service that ended up working: Dapper. Dapper allowed me to load the entire page, then select the elements I wanted to include in my feed, name each one, and output the result in a data format of my choice. Right up front, though, I noticed some limitations (all of these are detailed below), such as feed items that consisted entirely of part of an agency name, some jumbled and munged links, and the like.
I wasn’t entirely satisfied with the output from Dapper, but I knew I could use Dapper’s output as input to Pipes, so I went back to work. Initially, I chose CSV for Dapper’s output, thinking that CSV would be easy for Pipes to work with. When that proved fruitless, I went home and thought about it some more, hopped on this morning, and decided to try RSS for the Dapper output. Bingo!
With RSS out from Dapper, I could use Pipes to parse the feed and Filter certain items, such as those with no post dates (that is, I wanted to get rid of the orphaned items that included only the pre-linefeed portions of the agency names). I did this by checking whether each item’s posted date included either AM or PM, since all of the posted positions included a time. That seemed to work, and what we have is the output you see if you follow the above link.
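For anyone who would rather see that filter as code than as a Pipes module, here is a rough Python sketch of the same idea. It is not what Pipes actually runs. The sample feed, the item titles, the date format, and the assumption that the posted date lives in each item's pubDate element are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the Dapper RSS output.
# The real feed's element names and date format may differ.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>PPS jobs (sample)</title>
    <item>
      <title>Program Analyst</title>
      <pubDate>4/30/2009 10:15 AM</pubDate>
    </item>
    <item>
      <title>Department of the Interior</title>
      <pubDate></pubDate>
    </item>
    <item>
      <title>Budget Analyst</title>
      <pubDate>5/1/2009 2:40 PM</pubDate>
    </item>
  </channel>
</rss>"""

# A posted date counts as real only if it carries an AM or PM marker.
HAS_TIME = re.compile(r"\b(AM|PM)\b")


def filter_dated_items(rss_text):
    """Return the RSS text with items lacking an AM/PM posted date removed."""
    root = ET.fromstring(rss_text)
    channel = root.find("channel")
    # Copy the list first so removing items does not disturb iteration.
    for item in list(channel.findall("item")):
        posted = item.findtext("pubDate") or ""
        if not HAS_TIME.search(posted):
            channel.remove(item)
    return ET.tostring(root, encoding="unicode")


def item_titles(rss_text):
    """List the titles of all items in the feed, in document order."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title") for item in root.iter("item")]


if __name__ == "__main__":
    print(item_titles(filter_dated_items(SAMPLE_RSS)))
```

The orphaned "Department of the Interior" item has no time in its posted date, so it is the one that gets dropped.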
Limitations
The feed is not without limitations. I worked around as many as I could, but without writing a custom script to parse this, there’s really no way I can fix the feed.
- Missing Agency Names: Some of the agency names were cut off due to the presence of newline or linefeed characters in between the agency and sub-agency names. So, for instance, “Department of the Interior [line break] Bureau of Land Management” shows up in the feed as “Bureau of Land Management.” In most cases, I think this is a minor issue.
- Bad Job Title Links: In cases where the feed title contains multiple job title listings, the feed item link does not work correctly. This is a result of the merged job postings described in the next item, where some positions were merged together in the parsing (it could be missing or improperly terminated HTML elements for all I know). The links in the body of these feed items do work, however.
- Merged Job Postings: Again, this could have a number of causes, but without writing my own parsing script, I can’t fix it. One of the side effects of this, incidentally, is that a few of the job postings have been lumped together under the wrong agency. I would consider this the most severe of the three issues, but any confusion can be cleared up by following that job title’s link back to the PPS.
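As an illustration of what a custom parsing script might do about the first limitation, here is a small Python sketch that joins an agency name split across lines instead of dropping half of it. This is not part of the feed as built. The function name and the choice of a comma as the joiner are my own assumptions.

```python
import re


def join_agency_lines(raw):
    """Join an agency name that was split across lines into one string.

    Hypothetical helper: splits on any run of CR/LF characters, collapses
    leftover whitespace in each piece, and joins the pieces with ", ".
    """
    parts = [re.sub(r"\s+", " ", piece).strip()
             for piece in re.split(r"[\r\n]+", raw)]
    return ", ".join(piece for piece in parts if piece)


if __name__ == "__main__":
    raw_name = "Department of the Interior\r\n   Bureau of Land Management"
    print(join_agency_lines(raw_name))
```

This only addresses the cut-off names. The merged postings would need the parser to understand the structure of the results table, which this sketch does not attempt.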
RSS Primer
In case you aren’t familiar with RSS, let me provide a bit of info. RSS is a syndication format that pipes machine-readable data around the Web. What it’s really good for (so far) is providing lists of updates to frequently updated sites, such as news sites, blogs, and the like. RSS is an XML format that is readable by programs known as feed readers. There are many readers available, but my favorite is Netvibes (technically Netvibes is not a reader all by itself; it just provides a way to gather feeds into separate feed-reading widgets). You can also use iGoogle if you are so inclined. Anyway, the beauty of RSS is that the feed reader tells you when you have something new, usually by making unread items appear in boldface. It knows the items are new because every item has a posted date that should correspond to the actual publication date of the item in question.
Each feed reader has its own method of adding feeds to check, so you would have to try them out to understand the steps for that particular reader. I haven’t found RSS very difficult once you have a feed address.
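If you are curious what a feed reader does under the hood, the following Python sketch shows the basic idea: parse the feed's XML, read each item's posted date, and report anything newer than the last time you checked. The sample feed and the function name are invented for illustration, and real readers often track item identifiers as well as dates, so treat this as a simplified picture.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical feed content; a real reader would download this from
# the feed address on a schedule.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Older posting</title>
      <pubDate>Fri, 01 May 2009 14:00:00 +0000</pubDate>
    </item>
    <item>
      <title>Newer posting</title>
      <pubDate>Mon, 04 May 2009 09:30:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""


def new_items(feed_text, last_checked):
    """Return titles of items posted after last_checked, newest first."""
    root = ET.fromstring(feed_text)
    found = []
    for item in root.iter("item"):
        # RSS 2.0 dates use the same format as email headers.
        posted = parsedate_to_datetime(item.findtext("pubDate"))
        if posted > last_checked:
            found.append((posted, item.findtext("title")))
    found.sort(reverse=True)
    return [title for _, title in found]


if __name__ == "__main__":
    last_visit = datetime(2009, 5, 2, tzinfo=timezone.utc)
    for title in new_items(SAMPLE_FEED, last_visit):
        print("NEW:", title)
```

A reader repeats this check every so often and remembers the time of its last check, which is why you never have to visit the site yourself.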