This section provides guidance on the management and curation of data to ensure that they meet Open Data best practices and the established standards of professional data communities. It is intended for individuals and organizations involved in the production of data, such as government ministries and statistical agencies, but data consumers who want to understand how public data are produced may also find this section useful.
General Standards of Quality
While “quality” can be an ambiguous concept, what quality means in the context of data has been well defined for some time. Eurostat’s definition of quality in statistics provides a set of six quality dimensions that were originally defined for statistical data but can also be applied to many other types of data:
| Dimension | Definition |
|---|---|
| Relevance | The degree to which statistics meet current and potential users’ needs |
| Accuracy and Reliability | The degree to which data are free of errors arising from various factors; in the context of statistics, accuracy means the closeness of the estimated value to that of the true (unknown) value in the population |
| Timeliness and Punctuality | How soon the data are published relative to what they measure, and how closely data updates adhere to the intended publication schedule |
| Accessibility and Clarity | The ease with which users can access the data and the degree to which they are explained through metadata |
| Comparability | The degree to which data can be compared across time, regions or other domains |
| Coherence | The degree to which data conform to recognized definitions and methodologies |
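Timeliness and punctuality are the two dimensions above that reduce most directly to arithmetic on dates. The sketch below is only an illustration: the function names and the three dates are hypothetical, and an agency may define its own reference points.

```python
from datetime import date

def timeliness_days(reference_period_end, published):
    """Days between the end of the period measured and publication."""
    return (published - reference_period_end).days

def punctuality_days(scheduled, published):
    """Days between the announced release date and actual publication.

    Zero means on time; a positive value means the release was late.
    """
    return (published - scheduled).days

# Hypothetical release: annual data for 2014, announced for 31 March 2015.
reference_period_end = date(2014, 12, 31)
scheduled = date(2015, 3, 31)
published = date(2015, 4, 10)

lag = timeliness_days(reference_period_end, published)
delay = punctuality_days(scheduled, published)
print(f"timeliness: {lag} days after the reference period")
print(f"punctuality: {delay} days after the scheduled date")
```

Tracking both numbers separately is useful because a release can be punctual (published on the announced date) and still not timely (the announced date is long after the period measured).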
A similar set of dimensions, published in the Project Open Data document, can be used to understand quality, specifically in the context of Open Data:
| Dimension | Definition |
|---|---|
| Public | The degree to which government data are treated with a presumption that favors openness to the extent permitted by law and subject to privacy, confidentiality, security or other valid restrictions |
| Accessible | The degree to which Open Data are made available in convenient, modifiable and open formats that can be retrieved, downloaded, indexed and searched |
| Described | How fully Open Data are described so that consumers of the data have sufficient information to understand their strengths, weaknesses, analytical limitations and security requirements and how to process them |
| Reusable | Whether Open Data are made available under an open license that places no restrictions on their use |
| Complete | Whether Open Data are published in primary forms (i.e., as collected at the source), with the finest possible level of granularity that is practicable and permitted by law and other requirements |
| Timely | How soon Open Data are made available so as to preserve their value |
| Managed Post-Release | Whether a point of contact is designated to assist with data use and respond to complaints about adherence to these Open Data requirements |
While the principles of quality previously described generally apply to all data types, the standards and details by which data are produced and evaluated vary according to each data type. The following sections summarize the standards relevant to each data type.
Statistics and National Accounts
National Accounts define income, output and expenditure categories across an economy for various entities such as households, businesses and government. These statistics are generally produced and/or coordinated by National Statistics Offices (NSOs) in each country in accordance with detailed standards and methodologies.
The field of statistics has many standards that govern how to classify and organize data and how to assess quality. The following resources are generally most useful to NSOs:
United Nations List of Statistical Standards. This is a catalog of statistical classifications, definitions, concepts, methodologies and procedures that offers guidance for the use of statistical products.
UK Guidelines for Measuring Statistical Quality. These guidelines for measuring statistical quality use the same general dimensions as the guidance from Eurostat (noted above in “General Standards of Quality”).
Quality of Statistical Output Checklist. This is a framework for assessing statistical quality along 19 characteristics, with the goal of creating a statistical quality checklist.
The International Monetary Fund has two related frameworks that provide guidance for national statistics: the General Data Dissemination System (GDDS) and the Special Data Dissemination Standard (SDDS). Both the GDDS and SDDS are designed to enhance the availability of timely and comprehensive statistics and therefore contribute to the pursuit of sound macroeconomic policies. The SDDS provides targeted guidance to members seeking access to international capital markets, and is expected to contribute to the improved functioning of financial markets as well.
Data Quality Assessment Framework (DQAF). Another tool from the IMF, the DQAF is used for comprehensive assessments of countries’ data quality. It addresses institutional environments, statistical processes and characteristics of the statistical products. The DQAF consists of a generic framework, plus additional modules for national account statistics, consumer and producer price indices, government finance statistics and public sector debt, monetary statistics, balance of payments and external debt.
World Bank Statistical Capacity Indicator. This tool provides an overview of the statistical capacity of developing countries, based on a diagnostic framework that assesses the capacity of statistical systems. The Statistical Capacity Indicator website allows users to visualize changes in a country’s statistical capacity over time.
Reports on Observance of Standards & Codes (ROSC). This collection of country-level reports summarizes the extent to which countries observe certain internationally recognized standards and codes. The reports are organized into 10 topics, one of which is data dissemination.
Contracting
“Contracting” refers to the processes by which public institutions procure goods and services. Contracting data include information on issued tenders or requests for proposals (RFPs), contracts awarded, performance evaluations, contract completion and more.
To encourage best practices in disclosure in the public contracting sector, the Open Contracting Partnership developed a set of principles that can be adapted to specific sector and local contexts. These principles are designed to make contracting more competitive and fair, support global transparency and open government movements, and guide governments and stakeholders in data disclosure to enable understanding, effective monitoring, efficient performance and accountability for outcomes.
The Open Contracting Data Standard (OCDS) has been created to enable governments to publish details on all stages of the Contracting Process, including Planning, Tender, Award, Contract and Implementation. The standard provides detailed schema for representing contracting data in a range of formats, as well as guidance on implementation options. The OCDS Help Desk is on hand to offer advice on implementing OCDS.
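As a rough illustration of how the stages listed above can appear in published data, the sketch below inspects a single OCDS-style record and reports which stages it documents. The section names used here (`planning`, `tender`, `awards`, `contracts`, with `implementation` nested inside a contract) reflect our reading of the OCDS release schema and should be checked against the current standard; the record itself is invented.

```python
# Hypothetical, heavily trimmed release; identifiers and values are invented.
release = {
    "ocid": "ocds-xxxxxx-0001",
    "tag": ["award"],
    "planning": {"budget": {"description": "Road maintenance programme"}},
    "tender": {"id": "T-1", "title": "Road resurfacing works"},
    "awards": [{"id": "A-1", "status": "active"}],
    "contracts": [],
}

def stages_present(release):
    """Return the contracting stages that a release contains data for."""
    stages = []
    if release.get("planning"):
        stages.append("Planning")
    if release.get("tender"):
        stages.append("Tender")
    if release.get("awards"):
        stages.append("Award")
    contracts = release.get("contracts") or []
    if contracts:
        stages.append("Contract")
    # Assumption: implementation details are nested inside each contract.
    if any(contract.get("implementation") for contract in contracts):
        stages.append("Implementation")
    return stages

print(stages_present(release))
```

A check like this can help a publisher see at a glance which stages of the process are still undocumented for a given procurement.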
Budget
Budget data refer to public sector spending, disaggregated by level of government, functional or programmatic category, fiscal year and source of finance. BOOST and OpenSpending are two initiatives that provide good examples of how to produce budget data consistent with Open Data best practices:
BOOST: The BOOST initiative is a World Bank-wide collaborative effort to facilitate access to budget data and to improve decision-making and transparency. BOOST public spending databases have been established in 57 countries; they use government data and a 26-digit template to make highly granular fiscal data understandable and accessible to key users such as legislatures and civil society. Expenditure data are organized along core fiscal dimensions, such as functional classification, economic classification and funding source, and can be linked with additional datasets to support broader efficiency and equity analyses. BOOST is also used for building open budgets and enhancing accountability by making budget data accessible to users in a consistent and readily understood framework.
OpenSpending: OpenSpending combines a central, high-quality, open database of public financial information, including budgets, spending and balance sheets; a community of users and contributors; and a set of open resources providing the technical, fiscal and political understanding needed to work with financial data. OpenSpending tracks and analyzes public financial information globally, and its database is a resource for journalists, academics, campaigners and others to discuss and investigate public financial information.
Transport
Transport data can provide information on high-level infrastructure, usage and capacity (e.g., extent of roads, number of vehicles in a population, fuel consumption). In this context, however, transport data refer to public or mass transit, such as the availability and schedules of trains, buses and taxis.
Transport data producers should be familiar with the General Transit Feed Specification (GTFS). GTFS is a machine-readable data standard for transportation schedules, data and associated geographic information that encourages re-use. The TransitApp is an example of an application that takes advantage of GTFS-formatted data published by several cities.
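A GTFS feed is a set of comma-separated text files, which is what makes it easy to reuse. As a minimal sketch, the example below reads a small excerpt of the feed's `stops.txt` file with Python's standard library. The column names follow the GTFS specification; the two stops and the `read_stops` helper are invented for illustration.

```python
import csv
import io

# Hypothetical two-row excerpt of a GTFS stops.txt file (values are invented).
STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,-1.2833,36.8167
S2,Market Square,-1.2900,36.8200
"""

def read_stops(text):
    """Parse stops.txt content into dicts with numeric coordinates."""
    stops = []
    for row in csv.DictReader(io.StringIO(text)):
        stops.append({
            "stop_id": row["stop_id"],
            "stop_name": row["stop_name"],
            "lat": float(row["stop_lat"]),
            "lon": float(row["stop_lon"]),
        })
    return stops

stops = read_stops(STOPS_TXT)
for stop in stops:
    print(stop["stop_id"], stop["stop_name"], stop["lat"], stop["lon"])
```

In a real feed the same approach would be applied to a file opened from the published archive rather than to an inline string.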
Geospatial
Geospatial data describe the geographic aspects of a wide variety of features, such as the locations of buildings or polling centers, the boundaries of neighborhoods and cities, or the locations of forest concessions, to name only a few.
In Open Data initiatives, geospatial data are usually distributed in at least one of the following formats:
- TopoJSON is an extension of GeoJSON that encodes topology: boundaries shared by neighboring features are stored once rather than repeated for each feature. As a result, TopoJSON files are typically 80% smaller than their GeoJSON equivalents.
- Keyhole Markup Language (KML) is an XML-based data format introduced by Google in Google Maps and Google Earth.
- Shapefiles are the native format for the ArcGIS software suite from ESRI, but they are so ubiquitous that most major GIS systems can read them.
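To give a sense of what these formats contain, the sketch below builds a small GeoJSON document, the format that TopoJSON extends, using only Python's standard library. The place names and coordinates are invented and `point_feature` is a hypothetical helper; note that GeoJSON lists longitude before latitude.

```python
import json

def point_feature(name, lon, lat):
    """Build one GeoJSON Feature for a point location."""
    return {
        "type": "Feature",
        # GeoJSON coordinate order is [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"name": name},
    }

# Hypothetical polling centers; coordinates are illustrative only.
collection = {
    "type": "FeatureCollection",
    "features": [
        point_feature("Polling center A", 36.8167, -1.2833),
        point_feature("Polling center B", 36.8200, -1.2900),
    ],
}

geojson_text = json.dumps(collection, indent=2)
print(geojson_text)
```

Because the result is ordinary JSON, it can be published as a plain text file and read by most GIS tools and web mapping libraries.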
The Open Geospatial Consortium (OGC) is a voluntary, international organization composed of almost 500 companies, government agencies and universities that collaborate to develop standards for geospatial data. To date, the OGC has developed more than 30 standards for a variety of geospatial data types, including the KML format developed by Google and submitted to the OGC.
Microdata
Microdata consist of survey responses from individuals, households or businesses and have numerous applications, one of which is to produce aggregate statistics. Because respondents have an implicit and often legal expectation of confidentiality, microdata are subject to especially high standards concerning their distribution and are almost always anonymized before release.
These references provide guidance to organizations that manage microdata:
- World Bank Microdata Practices & Tools. This resource documents the principles and practices employed in the World Bank’s microdata catalog, including acquisition, disclosure, metadata, cataloging and preservation.
- International Household Survey Network Guidelines. IHSN provides extensive guidance on data archiving and dissemination, including metadata and cataloging. The World Bank’s microdata guidance is based significantly on this source. Note, however, that the IHSN guidelines emphasize best practices for dissemination of microdata, and not necessarily microdata as Open Data. For instance, the IHSN guidelines discuss a range of options for licensing and data access – including the options to register users and charge fees – which are not at all consistent with Open Data best practices.
Aid
Aid data refer to the resources and activities through which institutions finance international development. The International Aid Transparency Initiative (IATI) is the primary initiative in this field. The IATI Standard is a publishing standard that allows aid data from different donors concerning various recipients to be compared. To date, more than 280 organizations have published data in the IATI Registry.
More Guidance on Quality and Techniques
Standards for governance and anonymization help clarify data management and security processes, and metadata offers valuable details about data composition and sources.
Data Governance
Data governance addresses how Open Data assets are managed both during their initial launch and on an ongoing basis. Governance policies clarify lines of authority within the government and ministries for managing data, describe the process and requirements for releasing or updating data, and provide a means for users to engage providers over any issues or requests that arise.
Data governance is often addressed within the context of Open Data policies. Other resources include:
Data Portal Quick-Setup Guides by data.gov.uk. These resources provide an overview of governance arrangements both at a high (inter-agency) level as well as the local level (i.e., a single data catalog), and describe the different roles in managing datasets.
Project Open Data Implementation Guide. This document is part of the U.S. Government’s Project Open Data and provides guidance to agencies implementing the Executive Order on Open Data. Among other topics, it provides guidance for:
- Creating and maintaining an enterprise data inventory of all datasets in an agency’s possession
- Creating a public data listing (a subset of the inventory)
- Engaging users to facilitate and prioritize the release of data
- Documenting data that cannot be released
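The relationship between the inventory, the public listing and the record of withheld data can be sketched as a simple filter. The field names below (`access_level`, `reason_withheld`) and the sample entries are hypothetical and are not taken from the Project Open Data schema; an agency would substitute the fields defined in its own metadata model.

```python
# Hypothetical inventory entries; field names are illustrative only.
inventory = [
    {"title": "School locations", "access_level": "public",
     "reason_withheld": None},
    {"title": "Road maintenance budget", "access_level": "public",
     "reason_withheld": None},
    {"title": "Tax audit case files", "access_level": "non-public",
     "reason_withheld": "Contains taxpayer identities"},
]

def public_listing(entries):
    """Return the subset of the inventory that can be published."""
    return [e for e in entries if e["access_level"] == "public"]

def withheld_register(entries):
    """Map each unreleased dataset to the documented reason for withholding it."""
    return {
        e["title"]: e["reason_withheld"]
        for e in entries
        if e["access_level"] != "public"
    }

listing = public_listing(inventory)
withheld = withheld_register(inventory)
print([e["title"] for e in listing])
print(withheld)
```

Deriving the public listing from the inventory, rather than maintaining it separately, keeps the two from drifting apart as datasets are added or reclassified.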
Anonymization
Anonymization is the process of obscuring or removing information from a dataset that could be used to identify individuals, households or businesses, so that their anonymity is preserved and protected. Anonymization and the imperative to protect confidentiality are especially important for governments releasing public data. Equally important is the need for organizations to clearly articulate their privacy policies concerning data management, both to individuals who provide data and to individuals who use those data. That said, many types of government data do not entail confidential information, and thus have little or no need for anonymization techniques.
Proper anonymization is highly specific to the type of data and the individual dataset. A few resources are listed here:
Handbook on Statistical Disclosure Control. This resource covers issues related to anonymization, including regulatory issues, microdata, tabular data, frequency tables and issues raised by remote access.
Anonymization Guide from the UK’s Information Commissioner’s Office. This resource provides guidance on anonymization techniques and privacy protection for a range of data types within the context of the United Kingdom’s Data Protection Act.
Rethinking Personal Data: Strengthening Trust (World Economic Forum). This report fosters dialogue around some of the key questions that must be resolved to ensure long-term and sustainable value creation. Several follow-up reports were released in 2013 and 2014.
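As a simplified sketch of what obscuring or removing identifying information can look like in practice, the example below drops a direct identifier, coarsens exact ages into ten-year bands, and then reports the size of the smallest group of records that share the same remaining identifying attributes (a check often described as k-anonymity). The records, field names and band width are hypothetical, and real disclosure control requires the dataset-specific analysis described in the resources above.

```python
from collections import Counter

# Hypothetical survey records; all values are invented.
records = [
    {"name": "A. Person", "age": 34, "district": "North", "income": 1200},
    {"name": "B. Person", "age": 37, "district": "North", "income": 950},
    {"name": "C. Person", "age": 52, "district": "South", "income": 1500},
    {"name": "D. Person", "age": 58, "district": "South", "income": 1100},
    {"name": "E. Person", "age": 31, "district": "North", "income": 700},
]

DIRECT_IDENTIFIERS = {"name"}
QUASI_IDENTIFIERS = ("age_band", "district")

def age_band(age, width=10):
    """Replace an exact age with a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def anonymize(record):
    """Drop direct identifiers and generalize the exact age."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["age_band"] = age_band(out.pop("age"))
    return out

def smallest_group(rows, keys):
    """Size of the smallest group of rows sharing the same key values."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())

released = [anonymize(r) for r in records]
k = smallest_group(released, QUASI_IDENTIFIERS)
print(released[0])
print("smallest group size:", k)
```

A small group size signals that some respondents could still be singled out by combining the remaining attributes, in which case further generalization or suppression would be needed before release.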
Metadata
Metadata is often simply defined as “data about data.” Metadata provides the information necessary to use a particular source of data effectively, and may include information about its source, structure, underlying methodology, topical, geographic and/or temporal coverage, license, when it was last updated and how it is maintained. Specific types of data often include additional metadata as appropriate; for instance, digital photographs may include a time stamp, information about the equipment used, aperture settings and possibly the GPS location.
The Dublin Core Metadata Initiative (DCMI, http://dublincore.org) provides a framework and core vocabulary of metadata terms that can be applied to most electronic resources. Dublin Core is used heavily in DCAT, a standard designed to facilitate interoperability between web-based data catalogs. Governments may develop their own metadata models (preferably based on established standards such as DCAT) to provide further uniformity to government-wide Open Data initiatives. One example is the metadata schema propagated by data.gov.
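To make this concrete, the sketch below describes one dataset using Dublin Core terms (prefixed `dct:`) and DCAT terms (prefixed `dcat:`), then checks the record against a short list of required fields. The dataset, publisher and URL are invented, and the `REQUIRED` list is a hypothetical local rule rather than something mandated by either standard.

```python
import json

# Hypothetical dataset description; all values are invented.
dataset = {
    "@type": "dcat:Dataset",
    "dct:title": "Primary school enrollment by district",
    "dct:description": "Annual enrollment counts for public primary schools.",
    "dct:publisher": "Ministry of Education (hypothetical)",
    "dct:issued": "2015-01-15",
    "dct:modified": "2015-06-30",
    "dcat:keyword": ["education", "enrollment"],
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": "https://example.org/enrollment.csv",
            "dcat:mediaType": "text/csv",
            "dct:license": "https://creativecommons.org/licenses/by/4.0/",
        }
    ],
}

# Assumption: a local publishing rule, not a requirement of DCAT itself.
REQUIRED = ("dct:title", "dct:description", "dct:publisher", "dct:modified")

def missing_fields(record, required=REQUIRED):
    """List required metadata fields that are absent or empty."""
    return [field for field in required if not record.get(field)]

print(missing_fields(dataset))
print(json.dumps(dataset, indent=2))
```

Running a check like this before publication is one way to enforce a government-wide metadata model consistently across agencies.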
Other metadata standards are in use for a wide variety of data types. For government data, some of the most relevant include:
| Standard | Typical use |
|---|---|
| Data Documentation Initiative (DDI) | Used heavily in social science data, but applicable more broadly as well |
| ISO 19115-1:2014 | Geospatial data |
| Text Encoding Initiative | Texts in digital form, chiefly in the humanities, social sciences and linguistics |
| Directory Interchange Format (DIF) | Scientific datasets |