Supply and Quality of Data
This section provides guidance on the management and curation of data to ensure that they meet Open Data best practices and the established standards of professional data communities. It is intended for individuals and organizations involved in the production of data, such as government ministries and statistical agencies, but data consumers who want to understand how public data are produced may also find this section useful.
General Standards
While “quality” can be an ambiguous concept, what quality means in the context of data has been well defined for some time. Eurostat’s definition of quality in statistics provides six quality dimensions that were originally developed for statistical data but can be applied to many other types of data as well:
| Dimension | Definition |
| --- | --- |
| Relevance | The degree to which statistics meet current and potential users’ needs |
| Accuracy and Reliability | The degree to which data are free of errors arising from various factors; in the context of statistics, accuracy means the closeness of the estimated value to the true (unknown) value in the population |
| Timeliness and Punctuality | How soon the data are published relative to the period they measure, and how closely data updates adhere to the intended publication schedule |
| Accessibility and Clarity | The ease with which users can access the data and the degree to which they are explained through metadata |
| Comparability | The degree to which data can be compared across time, regions or other domains |
| Coherence | The degree to which data conform to recognized definitions and methodologies |
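Two of these dimensions, timeliness and punctuality, lend themselves to direct measurement. The sketch below illustrates one simple way to compute them; the dates and the interpretation are illustrative, not drawn from any official standard:

```python
from datetime import date

# Hedged sketch: "timeliness" as the lag between the end of the reference
# period and publication; "punctuality" as whether the release met its
# announced schedule. All dates below are invented for illustration.

def timeliness_days(period_end: date, published: date) -> int:
    """Days between the end of the reference period and publication."""
    return (published - period_end).days

def is_punctual(scheduled: date, published: date) -> bool:
    """A release is punctual if it appears on or before its scheduled date."""
    return published <= scheduled

lag = timeliness_days(date(2023, 12, 31), date(2024, 3, 15))
print(lag)  # 75 days after the reference period ended
print(is_punctual(date(2024, 3, 31), date(2024, 3, 15)))  # released early: True
```

Tracking these two numbers per release, against a published calendar, is a low-cost way for a producer to report on this dimension.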
A similar set of dimensions, published in the Project Open Data document, can be used to understand quality, specifically in the context of Open Data:
| Dimension | Definition |
| --- | --- |
| Public | The degree to which government data are treated with a presumption that favors openness to the extent permitted by law and subject to privacy, confidentiality, security or other valid restrictions |
| Accessible | The degree to which Open Data are made available in convenient, modifiable and open formats that can be retrieved, downloaded, indexed and searched |
| Described | How fully Open Data are described so that consumers of the data have sufficient information to understand their strengths, weaknesses, analytical limitations and security requirements, and how to process them |
| Reusable | Whether Open Data are made available under an open license that places no restrictions on their use |
| Complete | Whether Open Data are published in primary forms (i.e., as collected at the source), with the finest level of granularity that is practicable and permitted by law and other requirements |
| Timely | How soon Open Data are made available so as to preserve their value |
| Managed Post-Release | Whether a point of contact is designated to assist with data use and respond to complaints about adherence to these Open Data requirements |
While the principles of quality previously described generally apply to all data types, the standards and details by which data are produced and evaluated vary according to each data type. The following sections summarize the standards relevant to each data type.
Statistics and National Accounts
National Accounts define income, output and expenditure categories across an economy for various entities such as households, businesses and government. These statistics are generally produced and/or coordinated by National Statistics Offices (NSOs) in each country in accordance with detailed standards and methodologies.
The statistical field is replete with standards that determine how statistics are classified and organized and how their quality is assessed. These are generally most useful to NSOs.
- United Nations List of Statistical Standards. This is a catalog of statistical classifications, definitions, concepts, methodologies and procedures that offers guidance for the use of statistical products.
- World Bank Statistical Performance Indicators (SPI). The World Bank launched the SPI in 2021 to measure the capacity and maturity of national statistical systems around the world by assessing five pillars of statistical performance: (i) data use; (ii) data services; (iii) coverage of data topics and data products; (iv) data sources; and (v) data infrastructure. Underpinning these five pillars are 22 dimensions and 51 indicators. The SPI covers 174 economies, with data on selected indicators available for 217 economies. The SPI builds on and replaces its predecessor, the World Bank’s Statistical Capacity Indicators (SCI), which had been in place since 2004.
- UK Guidelines for Measuring Statistical Quality. These guidelines use the same general dimensions as the Eurostat guidance (noted above in “General Standards”).
- Quality of Statistical Output Checklist. This is a framework for assessing statistical quality along 19 characteristics, with the goal of creating a statistical quality checklist.
- The International Monetary Fund maintains two related frameworks that provide guidance for national statistics: the General Data Dissemination System (GDDS) and the Special Data Dissemination Standard (SDDS). Both are designed to enhance the availability of timely and comprehensive statistics and thereby contribute to the pursuit of sound macroeconomic policies.
- Data Quality Assessment Framework (DQAF). Another tool from the IMF, the DQAF is used for comprehensive assessments of countries’ data quality. It addresses institutional environments, statistical processes and characteristics of the statistical products.
- Reports on Observance of Standards & Codes (ROSC). This collection of country-level reports summarizes the extent to which countries observe certain internationally recognized standards and codes. The reports are organized into 10 topics, one of which is data dissemination.
Contracting Data
“Contracting” refers to the processes by which public institutions procure goods and services. Contracting data include information on issued tenders or requests for proposals (RFPs), contracts awarded, performance evaluation, completion and more.
To encourage best practices in disclosure in the public contracting sector, the Open Contracting Partnership developed a set of principles that can be adapted to specific sector and local contexts. These principles are designed to make contracting more competitive and fair, support global transparency and open government movements, and guide governments and stakeholders in data disclosure to enable understanding, effective monitoring, efficient performance and accountability for outcomes.
The Open Contracting Data Standard (OCDS) was created to enable governments to publish details on all stages of the contracting process: planning, tender, award, contract and implementation. The standard provides a detailed schema for representing contracting data in a range of formats, as well as guidance on implementation options. The OCDS Help Desk offers advice on implementing the OCDS, and an OCDS validator is available to check whether data comply with the standard.
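The OCDS represents each stage of a contracting process as a JSON “release.” A minimal sketch of a tender-stage release follows; the field names are drawn from the OCDS release schema, but the identifiers, prefix and values are invented for illustration (real publishers use a registered `ocid` prefix):

```python
import json

# Minimal sketch of a single OCDS release. Field names follow the OCDS
# release schema; the "ocds-xxxxxx" prefix and all values are illustrative.
release = {
    "ocid": "ocds-xxxxxx-000-00001",  # globally unique Open Contracting ID
    "id": "000-00001-tender-2024",    # release id, unique within the ocid
    "date": "2024-05-01T09:00:00Z",
    "tag": ["tender"],                # stage: planning/tender/award/contract/implementation
    "initiationType": "tender",
    "tender": {
        "id": "000-00001",
        "title": "Road maintenance services",
        "value": {"amount": 150000, "currency": "USD"},
        "procurementMethod": "open",
    },
}
print(json.dumps(release, indent=2))
```

Output like this can then be run through the OCDS validator to confirm conformance before publication.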
Budget Data
Budget data refer to public sector spending, disaggregated by level of government, functional or programmatic category, fiscal year and source of finance. BOOST and OpenSpending are two initiatives that provide good examples of how to produce budget data consistent with Open Data best practices:
- BOOST: The BOOST initiative is a Bank-wide collaborative effort to facilitate access to countries’ line-item fiscal data and improve decision-making processes and transparency. BOOST is a comprehensive public spending database established in dozens of countries that uses government data and a 26-digit template to make highly granular fiscal data understandable and accessible to key users such as legislatures and civil society.
- OpenSpending: This is a central, high-quality, open database comprising public financial information, including budgets, spending and balance sheets; a community of users and contributors; and a set of open resources providing the technical, fiscal and political understanding needed to work with financial data.
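At its simplest, usable budget data means machine-readable line items that can be aggregated along the classifications above. The sketch below illustrates the idea with an invented, deliberately simplified CSV; the column names are an assumption and far coarser than the actual BOOST template:

```python
import csv
import io
from collections import defaultdict

# Illustrative line-item budget data. The columns are an assumption for
# this sketch, not the BOOST classification itself.
raw = """fiscal_year,function,item,amount
2024,Education,Teacher salaries,1200000
2024,Education,Textbooks,85000
2024,Health,Clinic supplies,430000
"""

# Aggregate line items up to the functional category.
totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["function"]] += float(row["amount"])

for function, amount in sorted(totals.items()):
    print(f"{function}: {amount:,.0f}")
```

Publishing at the line-item level (the "Complete" dimension above) is what makes this kind of downstream aggregation possible for any user.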
Transport Data
Transport data can describe high-level infrastructure, usage and capacity (e.g., the extent of roads, the number of vehicles in a population, fuel consumption). In this context, however, transport data refers to public or mass transit: the schedules and availability of trains, buses and taxis.
Transport data producers should be familiar with the General Transit Feed Specification (GTFS). GTFS is a machine-readable data standard for transportation schedules, data and associated geographic information that encourages re-use. The TransitApp is an example of an application that takes advantage of GTFS-formatted data published by several cities.
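A GTFS feed is a zip archive of plain CSV files, so consuming one requires nothing beyond a CSV reader. The sketch below parses an invented fragment of `stops.txt`, one of the required files in a feed:

```python
import csv
import io

# A made-up fragment of a GTFS stops.txt file. stop_id, stop_name,
# stop_lat and stop_lon are standard GTFS column names; the rows are
# invented for illustration.
stops_txt = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,40.7128,-74.0060
S2,Market Square,40.7180,-74.0100
"""

stops = list(csv.DictReader(io.StringIO(stops_txt)))
for stop in stops:
    print(stop["stop_id"], stop["stop_name"], stop["stop_lat"], stop["stop_lon"])
```

Because every GTFS file is this kind of simple CSV, a city that publishes a conformant feed immediately becomes usable by any GTFS-aware application.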
Geospatial Data
Geospatial data identify the geographic aspects of a wide variety of things, such as the locations of buildings or polling centers, the boundaries of neighborhoods and cities, or the locations of forest concessions, to name only a few.
In Open Data initiatives, geospatial data is usually distributed in at least one of the following formats:
- GeoJSON is an open, JSON-based format for encoding geographic features and their attributes.
- TopoJSON is an extension of GeoJSON that takes a different approach to describing geographic features, encoding shared boundaries only once. As a result, TopoJSON files are typically 80% smaller than their GeoJSON equivalents.
- Keyhole Markup Language (KML) is an XML-based data format introduced by Google in Google Maps and Google Earth.
- Shapefile is the native format for the ArcGIS software suite from ESRI, but the format is so ubiquitous that it is compatible with most major GIS systems.
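These formats are all text- or file-based, and GeoJSON in particular can be produced with nothing more than a JSON serializer. A minimal sketch (the structure follows the GeoJSON specification, RFC 7946, where coordinates are longitude first; the location and name are invented):

```python
import json

# A minimal GeoJSON Feature wrapped in a FeatureCollection.
# Per RFC 7946, coordinates are [longitude, latitude].
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-74.0060, 40.7128]},
    "properties": {"name": "Polling center 12"},  # invented attribute
}
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection))
```

Because the output is ordinary JSON, it can be loaded directly by web mapping libraries and converted to TopoJSON or Shapefile with standard tooling.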
The Open Geospatial Consortium (OGC) is a voluntary international organization comprising almost 500 companies, government agencies and universities that collaborate to develop standards for geospatial data. To date, the OGC has developed more than 30 standards for a variety of geospatial data types, including the KML format, which Google developed and submitted to the OGC.
Microdata
Comprised of survey responses from individuals, households or businesses, microdata have numerous applications, one of which is the production of aggregate statistics. Because there is an implicit and often legal expectation of confidentiality, microdata are subject to especially high standards concerning their distribution, and almost always undergo anonymization before release.
These references provide guidance to organizations that manage microdata:
- World Bank Microdata Practices & Tools. This resource documents the principles and practices employed in the World Bank’s microdata catalog, including acquisition, disclosure, metadata, cataloging and preservation.
- International Household Survey Network Guidelines. IHSN provides extensive guidance on data archiving and dissemination, including metadata and cataloging. The World Bank’s microdata guidance is based significantly on this source. Note, however, that the IHSN guidelines emphasize best practices for dissemination of microdata, and not necessarily microdata as Open Data.
Aid Data
Aid data refer to the resources and activities through which institutions finance international development. The International Aid Transparency Initiative (IATI) is the primary initiative in this field. The IATI Standard is a publishing standard that allows aid data from different donors concerning various recipients to be compared. To date, more than 280 organizations have published data in the IATI Registry.
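IATI data are published as XML. The sketch below parses an IATI-style activity with the standard library; the element names follow the IATI activity standard, but the identifier and title are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A tiny IATI-style activity document. The element structure follows the
# IATI 2.x activity standard; the identifier and title are invented.
xml_data = """<iati-activities version="2.03">
  <iati-activity>
    <iati-identifier>XM-EXAMPLE-12345</iati-identifier>
    <title><narrative>Rural water supply project</narrative></title>
  </iati-activity>
</iati-activities>"""

root = ET.fromstring(xml_data)
for activity in root.findall("iati-activity"):
    ident = activity.findtext("iati-identifier")
    title = activity.findtext("title/narrative")
    print(ident, "-", title)
```

Because every publisher uses the same element structure, activities from hundreds of organizations can be merged and compared this way.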
Standards for governance and anonymization help clarify data management and security processes, and metadata offers valuable details about data composition and sources.
Data Governance
Data governance addresses how Open Data assets are managed, both during their initial launch and on an ongoing basis. Governance policies clarify lines of authority within the government and its ministries for managing data, describe the process and requirements for releasing or updating data, and provide a means for users to engage providers over any issues or requests that arise.
Data governance is often addressed within the context of Open Data policies. Other resources include:
- Project Open Data Implementation Guide. This document is part of the U.S. Government’s Project Open Data and provides guidance to agencies implementing the Executive Order on Open Data. Among other topics, it provides guidance for:
- Creating and maintaining an enterprise data inventory of all datasets in an agency’s possession
- Creating a public data listing (a subset of the inventory)
- Engaging users to facilitate and prioritize the release of data
- Documenting data that cannot be released
- Open Data: Unleashing the Potential. This policy paper and accompanying framework set out the U.K. Government’s open data strategy and data release schedule.
- Open Data Challenges and Opportunities for National Statistical Offices. This paper sets out the key opportunities and challenges that Open Data presents to National Statistics Offices and identifies steps and solutions to help NSOs play a central leadership role in national or subnational Open Data initiatives.
- Technical Assessment of Open Data Platforms for National Statistical Organizations. This report provides an overview of requirements and components for open data publication systems and assesses several data platform options which National Statistics Organizations could potentially use to manage and disseminate open datasets.
- Creating an Integrated National Data System. This interactive summary identifies steps and recommendations for country governments to share data between national participants safely while maximizing the benefit equitably, and includes several country case studies.
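As a concrete illustration of the first two items in the Project Open Data guidance above, the public data listing is conventionally published as a `data.json` file of dataset entries. A sketch of one entry follows; the field names come from the POD metadata schema, while the dataset, agency and URL are invented:

```python
import json

# Sketch of one dataset entry in a data.json public data listing.
# Field names follow the Project Open Data metadata schema; the dataset,
# publisher and URL are invented for illustration.
entry = {
    "title": "Building Permits",
    "description": "Permits issued by the city, updated monthly.",
    "keyword": ["permits", "construction"],
    "modified": "2024-04-01",
    "publisher": {"name": "Example City Buildings Department"},
    "identifier": "example-city-permits-001",
    "accessLevel": "public",  # public / restricted public / non-public
    "distribution": [
        {"downloadURL": "https://data.example.gov/permits.csv",
         "mediaType": "text/csv"}
    ],
}
print(json.dumps(entry, indent=2))
```

The `accessLevel` field is also how datasets that cannot be released (the fourth item above) are documented without publishing the data themselves.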
Anonymization
Anonymization is the process of obscuring or removing information from a dataset that could be used to identify individuals, households or businesses, so that their anonymity is preserved and protected. Anonymization and the imperative to protect confidentiality are especially important for governments releasing public data. Equally important is the need for organizations to clearly articulate their privacy policies concerning data management, both to the individuals who provide data and to those who use it. That said, many types of government data do not contain confidential information and thus have little or no need for anonymization techniques.
Proper anonymization is highly specific to the type of data and the individual dataset. A few resources are listed here:
- Handbook on Statistical Disclosure Control. This resource covers issues related to anonymization, including regulatory issues, microdata, tabular data, frequency tables and issues raised by remote access.
- Anonymization Guide from the UK’s Information Commissioner’s Office. This resource provides guidance on anonymization techniques and privacy protection for a range of data types within the context of the United Kingdom’s Data Protection Act.
- Rethinking Personal Data: Strengthening Trust (World Economic Forum). This report fosters dialogue around some of the key questions that must be resolved to ensure long-term and sustainable value creation. Several follow-up reports have since been published.
- Statistical Disclosure Control for Microdata Practice Guide. This guide provides practical steps for any agency to provide safe data access, while ensuring the microdata remain fit for purpose.
- Microdata Anonymization. This resource presents the main principles of microdata anonymization, techniques to measure and reduce risk, and best practices provided by the International Household Survey Network (IHSN).
- Managing Statistical Confidentiality & Microdata Access. This set of principles and guidelines was prepared and adopted by the Conference of European Statisticians (CES). The annex contains over twenty case studies from various countries.
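As a toy illustration of the disclosure-risk ideas in the resources above, the sketch below checks k-anonymity over a set of quasi-identifiers: a record is risky when the combination of its quasi-identifying attributes is rare. Real statistical disclosure control is far more sophisticated; the data and attribute names are invented:

```python
from collections import Counter

# Invented microdata records. age_band and district act as
# quasi-identifiers; income is the sensitive attribute.
records = [
    {"age_band": "30-39", "district": "North", "income": 41000},
    {"age_band": "30-39", "district": "North", "income": 38000},
    {"age_band": "40-49", "district": "South", "income": 52000},
]

def min_group_size(rows, quasi_identifiers):
    """Smallest number of records sharing one quasi-identifier combination.

    A dataset is k-anonymous (for these attributes) when this value >= k.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(groups.values())

k = min_group_size(records, ["age_band", "district"])
print(k)  # 1: the 40-49/South record is unique, hence re-identifiable
```

Techniques such as coarsening categories, suppressing rare records or adding noise aim to raise this minimum group size before release.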
Metadata
Metadata is often simply defined as “data about data.” Metadata provides the information necessary to use a particular source of data effectively, and may include information about its source, structure, underlying methodology, topical, geographic and/or temporal coverage, license, when it was last updated and how it is maintained. Specific types of data often include additional metadata as appropriate; for instance, digital photographs may include a time stamp, information about the equipment used, aperture settings and possibly the GPS location.
The Dublin Core Metadata Initiative (DCMI) provides a framework and core vocabulary of metadata terms that can be applied to most electronic resources. Dublin Core is used heavily in DCAT, a standard designed to facilitate interoperability between web-based data catalogs. Governments may develop their own metadata models (preferably based on established standards such as DCAT) to provide further uniformity to government-wide Open Data initiatives. One example is the metadata schema published by data.gov.
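As an illustration of how DCAT builds on Dublin Core, a dataset description can be serialized as JSON-LD using both vocabularies. The `dcat:` and `dct:` terms below are from the real vocabularies; the dataset itself and its values are invented:

```python
import json

# A minimal DCAT dataset description as JSON-LD. The dcat:/dct: terms are
# from the DCAT and Dublin Core vocabularies; the dataset is invented.
dataset = {
    "@context": {"dcat": "http://www.w3.org/ns/dcat#",
                 "dct": "http://purl.org/dc/terms/"},
    "@type": "dcat:Dataset",
    "dct:title": "National School Locations",
    "dct:description": "Geocoded locations of public schools.",
    "dct:license": "https://creativecommons.org/licenses/by/4.0/",
    "dct:modified": "2024-06-30",
    "dcat:keyword": ["education", "schools", "geospatial"],
}
print(json.dumps(dataset, indent=2))
```

Descriptions in this shape are what allow one data catalog to harvest and index datasets from another.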
Other metadata standards are in use for a wide variety of data types. For government data, some of the most relevant include:
- Data Documentation Initiative (DDI): used heavily in social science data, but applicable more broadly as well
- Text Encoding Initiative (TEI): texts in digital form, chiefly in the humanities, social sciences and linguistics