
Draft Technical Quality Framework

We are looking for ways to dramatically improve the experience of finding and using government data, and to strengthen the motivation for agencies to improve the quality of their data publishing over time.

As such, we are considering implementing a basic technical quality framework on and would like your feedback on whether this approach would be useful, whether you are a data user, a data publisher, or both.

We recognise that simply enforcing a new standard across the system is difficult, but meeting the basic technical requirements for finding and using data is a great start.

The Problem

For government open data to meet its full economic and social potential outside the public service, there needs to be a substantial increase in meaningful reuse of data by the private and community sectors. Promotion of government data programs can help encourage data users to explore and experiment with available data, but user research and feedback have repeatedly revealed data users' frustration in trying to find data they can actually use. We need to make it much easier for users to find usable data, and at the same time create additional motivation for government agencies to publish data in ways that can be readily consumed.

It is worth noting that data users are not the general public. Data users are the people, companies, universities and analysts with the technical skills, motivation and capacity to innovate with government data: new applications, analysis, data visualisations and as-yet-undreamt-of uses. Front-end visual tools for playing with data are of less interest to serious data users, but investment in improving discovery within the current data supply chain, and in improving that supply chain itself, is critical for such reuse to be sustainable. To improve data users' experience with government data, we must first make it easy for them to find data they can actually use.

The Basic Technical Quality Measures are a series of technical attributes that are necessary for data users to innovate with government data. Where agencies or specific data domains have existing quality frameworks, these could form a second-tier quality measure for identified datasets of that agency or type, but the measures here focus on the very real needs of data users. For example, a beautifully crafted dataset published as a PDF is of almost no use to anybody, whereas a less beautiful but regularly updated dataset published with a machine-readable API (Application Programming Interface) is pure gold for a developer or data visualisation expert.

This series of basic technical quality measures is kept as simple as possible and is based on the most basic needs of data users regarding format, metadata, APIs and public scoring. Each of the four measures is a 5 star score, making it quick and easy for data users to search and visually assess the suitability of data for their purposes. Almost all measures are purposefully automated, because quality measures that require human intervention are subject to mistakes, inconsistency and subjectivity. The framework will be implemented on for public consultation and testing with data owners and data users; it is only through active testing that such a framework can be properly validated.

Ideally the data search interface would be updated with filter options for users based on these measures. See the mockup below.

[Mockup: technical quality framework proof of concept diagram]

Technical Quality Measure #1 - Data Format

The format data is published in is important: it can be the first barrier or enabler to a data user doing anything with the data. For example, if the data is not machine readable, the user has to transform it into a machine-readable version before they can do anything clever with it. If there is no API, they need to download the data and then either embed it in their application or set up their own API. These barriers can put a data user off even trying to use the data, let alone discovering some novel or innovative use.

Tim Berners-Lee's (TBL) 5 star CKAN plugin (used by the UK) is a useful start; however, it prioritises openness over usefulness, and some of its attributes are difficult to automate meaningfully. The implementation below is a slight tweak of the plugin to test with data users. Stars are associated with format type as follows:

  • No resources = 0 stars
  • Anything posted = 1 star (any format, links to web pages, etc)
  • Structured but proprietary formats = 2 stars (XLS, XLSX, SPSS, etc)
  • Structured but open formats = 3 stars (generic databases, KML, SHP, CSV, TXT)
  • API available = 4 stars (any type of working API, including tabular, spatial, statistical, secure, real-time, unstructured and aggregate APIs)
  • Linked data available = 5 stars (manually curated)

Note: very few datasets are linked data, and linked data is difficult to detect automatically, so these datasets require manual curation.
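The format tiers above can be sketched as a small scoring function. This is an illustrative assumption, not the CKAN plugin's actual code; the format sets, flags and function names are hypothetical.

```python
# Sketch of Measure #1: score a dataset at its best available resource format.
# Format sets and flag names are illustrative assumptions.

PROPRIETARY = {"XLS", "XLSX", "SPSS"}           # structured but proprietary: 2 stars
OPEN_STRUCTURED = {"KML", "SHP", "CSV", "TXT"}  # structured and open: 3 stars

def format_stars(resources, has_api=False, is_linked_data=False):
    """Return a 0-5 star score for the best resource format available."""
    if is_linked_data:   # manually curated flag, per the note above
        return 5
    if has_api:          # any type of working API
        return 4
    if not resources:
        return 0
    formats = {fmt.upper() for fmt in resources}
    if formats & OPEN_STRUCTURED:
        return 3
    if formats & PROPRIETARY:
        return 2
    return 1             # anything else posted (PDF, links to web pages, ...)
```

Under this sketch a dataset with both a CSV and a PDF resource scores 3 stars, since it is scored at its best format.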

Technical Quality Measure #2 - Metadata Attributes

Although there is an extended metadata schema on and myriad metadata schemas for data specialisations, the metadata attributes below are what data users need in order to identify data they want to use in the first instance. A star is awarded for meeting each of the criteria below, giving a cumulative score:

  • Whether the last update aligns with the update frequency in the metadata = 1 star.
    • Reason: data users like to find data that is being kept appropriately up to date.
  • Whether a valid spatial context is indicated = 1 star.
    • Reason: a valid spatial context makes it easier to associate data by location, although a lot of national data will have the valid spatial context of “New Zealand”.
  • Whether any data models, vocabularies, ontologies or other documentation are provided anywhere in the dataset = 1 star.
    • Reason: if there is accompanying documentation for the data, it is more compelling for data users to use it, particularly if it is a complex or highly specialised dataset. Even basic documentation explaining the background or language of the dataset can be tremendously helpful.
  • Whether the licence is one recognised by the Open Definition as an open licence AND there is zero cost for any type of reuse = ½ star each.
    • Reason: a lot of data users only want to work with openly licenced, free data, as they don't want to deal with usage restrictions or complex reuse permissions.
  • Whether there is a valid contact email address = 1 star.
    • Reason: Users want to know there is someone available they can ask about the data.
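The cumulative scoring above can be sketched as follows. The boolean field names are hypothetical assumptions about how each check might be recorded; the two licence checks are scored at half a star each, as described.

```python
# Sketch of Measure #2: cumulative metadata score (maximum 5 stars).
# The boolean keys are illustrative assumptions about the metadata record.

def metadata_stars(meta):
    """Sum the stars for each metadata attribute that is present and valid."""
    score = 0.0
    if meta.get("update_matches_frequency"):  # last update aligns with stated frequency
        score += 1
    if meta.get("valid_spatial_context"):
        score += 1
    if meta.get("has_documentation"):         # data models, vocabularies, ontologies, etc
        score += 1
    if meta.get("open_licence"):              # recognised by the Open Definition
        score += 0.5
    if meta.get("zero_cost_reuse"):
        score += 0.5
    if meta.get("valid_contact_email"):
        score += 1
    return score
```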

Technical Quality Measure #3 - API Quality

If the data is available via an API, it goes a long way to supporting innovative reuse. APIs are critical for serious data users to build persistent analysis, applications or visualisations on government data, so to raise public confidence in using government data we need to make it easier for data users to identify API quality. Below is a 5 star quality ranking based on latency and uptime. Where an API achieves only one of a tier's two conditions, it drops 1 star from the better tier: for example, 99% uptime but 10-second latency would get 4 stars.

  • No API, or API down = 0 stars
  • Latency more than 8 seconds AND uptime less than 70% = 1 star
  • Latency 6-8 seconds AND uptime 70%-80% = 2 stars
  • Latency 3-5 seconds AND uptime 80%-90% = 3 stars
  • Latency 2-3 seconds AND uptime 90%-95% = 4 stars
  • Latency less than 1 second AND uptime over 95% = 5 stars
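One way to sketch this measure, assuming the stated bands and interpreting the drop rule as "score the better band minus one star when the two conditions disagree" (which matches the 99%-uptime, 10-second-latency example). The table leaves small gaps between bands (e.g. 1-2 seconds); the sketch closes them as an assumption.

```python
# Sketch of Measure #3: API quality from latency and uptime bands.
# Boundaries between the published ranges are closed as assumptions.

def latency_stars(seconds):
    if seconds < 1:
        return 5
    if seconds <= 3:
        return 4
    if seconds <= 5:
        return 3
    if seconds <= 8:
        return 2
    return 1

def uptime_stars(percent):
    if percent > 95:
        return 5
    if percent > 90:
        return 4
    if percent > 80:
        return 3
    if percent > 70:
        return 2
    return 1

def api_stars(latency_s=None, uptime_pct=None):
    """0 stars with no API (or API down); otherwise the band score,
    dropping 1 star from the better band when the two conditions disagree."""
    if latency_s is None or uptime_pct is None:
        return 0
    lat, up = latency_stars(latency_s), uptime_stars(uptime_pct)
    return lat if lat == up else max(lat, up) - 1
```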

Technical Quality Measure #4 - Public Rating

Public feedback on the quality of a dataset provides not only a form of peer review that data users can use to identify data other data users rate highly, but also a check and balance on the quality measures above. If public feedback on a dataset is consistently and markedly different from what the quality measures suggest, the model will obviously need to be reassessed. In some cases the basic technical quality measures might be high, but the public may not enjoy using the data for other reasons worth identifying.

  • A public rating out of 5.
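A public rating out of 5 could simply be the mean of individual ratings. The sketch below is an assumption (including the half-star rounding); it also shows how all four measures might be bundled into one record that a search interface could filter on, per the mockup idea above.

```python
# Sketch of Measure #4 plus a combined quality record.
# Half-star rounding and the record structure are illustrative assumptions.

from statistics import mean

def public_rating(votes):
    """Average of individual public ratings (each 0-5), rounded to half a star."""
    if not votes:
        return None  # no ratings yet
    return round(mean(votes) * 2) / 2

def quality_record(format_s, metadata_s, api_s, votes):
    """Bundle the four measures so a search interface can filter on them."""
    return {
        "format": format_s,
        "metadata": metadata_s,
        "api": api_s,
        "public": public_rating(votes),
    }
```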