When I walk my kids to school in the morning we navigate a crossroads junction busy with commuters going to work, shoppers going to Asda and several buses normally arriving at once. There aren’t any controls for pedestrians, so we – and countless others throughout the day – try to make it across in the split-second between the lights changing.
I wrote to and spoke with the council, but got the message that the budget was all spent and that the junction wasn’t deemed a priority. The unsaid logic was that there hadn’t been enough serious accidents there… *shakes fist*
The next step could have been a petition and wider campaign to highlight the perils of the junction, but for now I’ve turned to the data. I’m interested to see if the council are “right”, or at least to try to have a discussion based upon some facts. I knew that datasets of road accidents since 2005 had been published via DataGM, so I started to look through them. It’s resulted in a map of accidents in Hulme – but in this post I wanted to share some wider concerns and observations about using open data.
1 – Data use means making editorial decisions
The published datasets are quite large in terms of coverage and values. I had to think about how to segment and group them according to the analysis I wanted to undertake – settling upon the groupings below (there’s a rough code sketch of this after the list):
- When (dates, times, etc.)
- Where (coordinates and administrative geography)
- What (number of vehicles, casualties, etc.)
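As promised, a minimal pandas sketch of that kind of segmentation. The filename and column names are illustrative – the published DataGM extract may well label these fields differently.

```python
import pandas as pd

# Load the published accident extract (filename is hypothetical)
accidents = pd.read_csv("datagm_road_accidents.csv")

# "When": derive a year from the published date stamp
accidents["Date"] = pd.to_datetime(accidents["Date"], dayfirst=True)
accidents["Year"] = accidents["Date"].dt.year

# "What": casualties per year -- just one of many possible cuts,
# and already an editorial choice about what matters
print(accidents.groupby("Year")["Number of Casualties"].sum())
```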
The wider point here is that already I was making editorial decisions about a raw dataset.
2 – Transforming and cleaning open data takes time
I spent a lot of time in Google Refine:
- Transforming the published Eastings and Northings values to lat/long, postcodes and then administrative geography elements (ward, local authority) – using Refine URL lookups and the uk-postcodes API
- Building upon the published datetime stamp to isolate values for year, month and day – and to group times into arbitrary slots through the day
- Converting the numeric codes for each question into their text/English equivalents, so that the end user could navigate the data easily (a rough Python equivalent of these steps is sketched after this list)
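Something like the following would reproduce those Refine steps in Python. It’s a sketch under assumptions: pyproj stands in for the Refine/uk-postcodes lookups on the coordinate step, the time slots are my own arbitrary groupings, and the severity codes are my reading of the Stats19 coding – check the published forms before trusting them.

```python
from pyproj import Transformer

# OSGB36 eastings/northings (EPSG:27700) -> WGS84 lat/long (EPSG:4326);
# pyproj stands in here for the Refine URL lookups used originally
to_wgs84 = Transformer.from_crs("EPSG:27700", "EPSG:4326")
lat, lon = to_wgs84.transform(384000, 396500)  # an example point near Hulme

# Group an "HH:MM" time stamp into coarse, arbitrary slots through the day
def time_slot(hhmm: str) -> str:
    hour = int(hhmm.split(":")[0])
    if 7 <= hour < 10:
        return "morning peak"
    if 16 <= hour < 19:
        return "evening peak"
    return "off peak"

# Swap numeric codes for English labels so the end user can read them
# (1/2/3 = Fatal/Serious/Slight is my reading of the Stats19 coding)
SEVERITY = {1: "Fatal", 2: "Serious", 3: "Slight"}

print(round(lat, 4), round(lon, 4), time_slot("08:45"), SEVERITY[3])
```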
So – I’ve made a lot of adjustments to the original data. Ideally, I’d like my derivation to be openly published (currently it lives in Google Fusion Tables), but more important is sharing and attributing the steps I’ve gone through. Again, in my usage of the open data I’m moving beyond the raw data via subjective decisions I make. What happens to the “added value” I’m creating?
3 – Sub-datasets only tell part of the story
I’ve created a map of Hulme using the Exhibit software and scripts. That’s all very well, but I’m aware that this area is pretty meaningless outside of local politics. Ideally, I’d like to lift the whole dataset onto such a facet-browsing platform (if someone can help with Exhibit 3.0 then please shout) – but I’m aware that people may want to split and view the data by other factors – bus routes, for example?
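For anyone curious about the plumbing: Exhibit is fed a JSON file, so republishing the cleaned data for it is only a few lines. A minimal sketch, assuming an “items” array of records – the property names and the record below are my own choice, not a fixed schema:

```python
import json

# Exhibit reads a JSON file containing an "items" array; this record
# and its property names are purely illustrative
items = [{
    "label": "Accident 0001 (Hulme)",
    "severity": "Slight",
    "ward": "Hulme",
    "lat": 53.462,
    "lng": -2.249,
}]
with open("accidents-exhibit.json", "w") as f:
    json.dump({"items": items}, f, indent=2)
```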
This data is derived from the Stats19 dataset, which requires each accident to be recorded in a standard way. So far, roughly one third of the scope of this dataset has been published via my source, DataGM. There are potentially tons of other insights to be gleaned from the details recorded for each accident, according to the forms in use.
4 – Did I choose the right dataset?
It makes sense to look at accident data when looking at road safety – or does it? What about data on traffic flows, bus routes, cycle and pedestrian throughput? Or wider data around local services and demographics? At this point I start to get into the “overwhelmed by open data” state – and retreat to my initial little map. But how do we take this further and engage people? I posted the map to the local email news forum – not a single response so far… people probably have far more interesting things to do.
This has been a great process for getting to grips with a few things personally. In the meantime, I’ll keep on jumping the lights…