Technology Options

This section provides guidance on the selection and implementation of various technologies used to develop Open Data platforms, with a particular focus on Open Data catalogs, which are the web-based systems used to make data available to end users. It is intended to support IT specialists who play a lead or coordinating role in managing the technical infrastructure of an Open Data initiative.

The terms “catalog,” “platform” and “portal” are often used ambiguously, which can cause confusion. This Toolkit defines these terms as follows:

  • A data catalog is a list of datasets available in an Open Data initiative. Essential elements of a data catalog include searching, metadata, clear license information and access to the datasets themselves. Typically, a data catalog is the online centerpiece of an Open Data initiative.
  • A platform provides an online “front door” for users to access all resources available under an Open Data initiative. A platform includes the data catalog along with other information and services that are part of the Open Data ecosystem. These typically include an online forum for questions, technical support and feedback; a knowledge base of background and training materials; and a blog for communications and outreach. The services within a platform are often implemented with a suite of technologies, not a single one.
  • A portal can mean many different things; for that reason, this Toolkit avoids the term.

What does an Open Data Catalog Look Like?

As described in the following paragraphs, data catalogs can be relatively simple and “stand alone,” or very sophisticated and integrated with other systems. Most Open Data catalogs, however, share a few common characteristics (more extensive lists are also available):

  • Easy access. Open Data catalogs make it very easy for users to access data quickly, freely and intuitively. Access to Open Data catalogs requires no registration or login, since such requirements would discourage exploration and use.

  • Search. Open Data catalogs make data easy to find. Most data catalogs sort data by subject, organization or type, and support full-text search of catalog contents. Many Open Data catalogs implement search engine optimization to expose data to conventional search engines.

  • Machine-readable data access. Data are available for download in machine-readable, non-proprietary electronic formats. To the extent possible, the preference is to have all data in a dataset available as a single download file.

  • Metadata. Key metadata, such as publication date and attribution, are prominently displayed for each dataset. Many Open Data catalogs implement the Dublin Core metadata standard and make the metadata available in machine-readable formats.

  • Clear data licenses. Data licenses are clearly and prominently displayed for each dataset. If data are licensed under Creative Commons, the Open Data License or other standards, transparent links to these licenses are included.

  • Data preview/visualization. Many Open Data catalogs include some facility to preview the data prior to download or visualize the data using built-in graphing or mapping tools.

  • Standards compliance. Most Open Data catalogs have built-in support for various standards, such as data formats (e.g., CSV, XML, JSON) and metadata (e.g., Dublin Core). Open Data catalogs typically make each dataset available at a unique and permanent URL, which makes it possible to cite and link to the data directly.

  • Application Programming Interface (API). APIs allow software developers to access the Open Data catalog – and often the data itself – through software. APIs facilitate data discovery, analysis, catalog integration, harvesting of metadata from external sites and a host of applications.

  • Security. Open Data catalogs implement security measures to protect data and metadata from being changed by unauthorized users.
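
As a rough illustration of the metadata and machine-readable access characteristics listed above, the sketch below builds one catalog entry with a handful of Dublin Core style fields and serializes it to JSON. The `DatasetRecord` class, the field selection, the URL and all values are hypothetical; a real catalog would define its own schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class DatasetRecord:
    """One catalog entry; field names loosely follow Dublin Core terms."""
    identifier: str   # unique, permanent URL for the dataset (hypothetical)
    title: str
    publisher: str    # attribution
    issued: str       # publication date in ISO 8601 form
    license: str      # link to the full license text
    formats: list = field(default_factory=list)


def to_json(record: DatasetRecord) -> str:
    """Serialize the record as machine-readable JSON with stable key order."""
    return json.dumps(asdict(record), sort_keys=True, indent=2)


record = DatasetRecord(
    identifier="https://data.example.org/dataset/school-enrollment",
    title="School Enrollment by District",
    publisher="Ministry of Education (example)",
    issued="2015-06-01",
    license="https://creativecommons.org/licenses/by/4.0/",
    formats=["CSV", "JSON"],
)
print(to_json(record))
```

Keeping the license and the permanent identifier in the same record as the descriptive fields means that anything harvesting the metadata also receives the terms of use and a citable link.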

Open Data catalogs generally follow one of two service delivery models. Open-source catalogs are nominally “free,” in that they may be downloaded at no cost and modified or customized without restriction or licensing fees. These products can be hosted on the operator’s own dedicated servers or on cloud-based infrastructure, but both approaches require the catalog operator to manage IT logistics. Some vendors provide cloud hosting of open-source products as a service. In contrast, Software as a Service (SaaS) products are available from various vendors for a monthly or annual fee, and the vendors assume responsibility for IT management, security and software updates. SaaS vendors may also provide some measure of customization.

Three Models of an Open Data Catalog

The three models below present one way of thinking about an Open Data catalog system. The intent here is to show how various elements and services relate to each other, and how the system changes at different scales.

Model 1: Single Platform

This model demonstrates a simple IT infrastructure where the data catalog and data files are hosted within a single server environment. The server could be managed internally by the lead agency or it could be cloud-hosted. API-driven datasets, if any, may be managed separately according to the requirements for the underlying technology.

Blogging, user support, and feedback are essential elements of user engagement in an Open Data initiative, and can often be provided by the same or similar infrastructure as that used by the catalog itself. Conceptually, though, they are separate systems that are only loosely connected to the data catalog.

This model is suitable where the data catalog holds a small number of datasets (fewer than 200), the datasets themselves are small (less than 100 MB each), and a single agency plays a strong role in coordinating the data catalog and managing the IT infrastructure.
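
The suitability criteria above can be read as a simple rule of thumb. The sketch below encodes them as a check; the function name and constants are hypothetical, and treating the size limit as 100 decimal megabytes per dataset is an assumption about what the text intends.

```python
MAX_DATASETS = 200                  # "fewer than 200" datasets, from the text
MAX_DATASET_BYTES = 100 * 10**6     # assumed reading: 100 megabytes per dataset


def fits_single_platform(dataset_sizes_bytes, single_lead_agency):
    """Return True when a catalog matches the Model 1 rule of thumb."""
    few_datasets = len(dataset_sizes_bytes) < MAX_DATASETS
    all_small = all(size < MAX_DATASET_BYTES for size in dataset_sizes_bytes)
    return few_datasets and all_small and single_lead_agency


# Two small datasets coordinated by one agency: Model 1 is a reasonable fit.
print(fits_single_platform([5_000_000, 20_000_000], single_lead_agency=True))
```

The thresholds are guidance, not hard limits, so a catalog slightly over either figure may still be served well by a single platform.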

Commonly Used Open Data Platforms

CKAN

CKAN is an open-source data catalog formally supported by the Open Knowledge Foundation, and can be installed on any Linux server, including cloud-hosted configurations. The Open Knowledge Foundation also offers hosting services for a monthly fee. CKAN is written in the Python programming language and designed for publishing and managing data either through a user interface or an API.

CKAN has a modular architecture through which additional or custom features may be added. For example, the DDI Importer extension (sponsored by the World Bank) provides support for the DDI metadata standard, including harvesting of metadata from microdata catalogs.
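
To give a feel for the API access mentioned above, the sketch below builds a request URL for CKAN's `package_search` action and reads dataset titles from a decoded response. The host name and the sample response are made up for illustration, and no network call is made; the helper function names are hypothetical.

```python
from urllib.parse import urlencode


def package_search_url(base_url, query, rows=10):
    """Build a URL for CKAN's package_search action (Action API v3)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response):
    """Pull dataset titles out of a decoded package_search response."""
    if not response.get("success"):
        return []
    return [pkg.get("title", "") for pkg in response["result"]["results"]]


# Hand-written sample shaped like a CKAN response; not real data.
sample = {
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "water-points", "title": "Water Points"}],
    },
}

print(package_search_url("https://catalog.example.org/", "water"))
print(dataset_titles(sample))
```

In practice the URL would be fetched with an HTTP client and the JSON body decoded before being passed to `dataset_titles`.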

Examples

DKAN

DKAN is designed to be “feature compatible” with CKAN. This means that its underlying API is identical, so systems designed to be compatible with CKAN’s API should work equally well with DKAN. DKAN is also open source, but it is based on Drupal, a popular content management system written in PHP instead of Python. This may be more appealing to organizations that have already invested in Drupal-based websites. Drupal has its own modular architecture with thousands of modules available for download. It also has an option to customize modules and a large developer community.

Examples

Junar

Junar is a cloud-based SaaS Open Data platform, so data is typically managed within Junar’s infrastructure (the “all-in-one” model). Junar can provide either a complete data catalog or data via an API to a separate user catalog.

Examples

OpenDataSoft

OpenDataSoft is a cloud-based SaaS platform that offers a comprehensive suite of Open Data and visualization tools. The front end is fully open source. The platform supports common Open Data formats such as CSV, JSON and XML, along with geospatial formats such as KML, OSM and SHP. Search functionality is easy to use and the platform is available in multiple languages.

World Bank partners can freely access a version of OpenDataSoft here.

Examples

Semantic MediaWiki

Semantic MediaWiki is an extension of MediaWiki – the wiki application best known for powering Wikipedia. While traditional wikis contain only text, Semantic MediaWiki adds semantic annotations that allow a wiki to function as a collaborative database and data catalog. Semantic MediaWiki is an RDF implementation, meaning that both data and metadata are stored as linked data and are accessible via linked data interfaces such as SPARQL.

Examples

Socrata

Socrata is a cloud-based SaaS Open Data catalog platform that provides API, catalog and data manipulation tools. One distinguishing feature of Socrata is that it allows users to create views and visualizations based on published data and save them for others to use. Additionally, Socrata offers an open-source version of its API, intended to facilitate transitions for customers that decide to migrate away from the SaaS model.

Examples

Swirrl

Swirrl was acquired in 2023 by TPXimpact, which continues to offer the PublishMyData platform. PublishMyData is a cloud-based SaaS Open Data platform built on linked data technologies (such as RDF and SPARQL) and designed to achieve 100% compliance with the 5-star Open Data model.
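
For readers unfamiliar with the 5-star Open Data model referenced above, the sketch below scores a dataset against its five cumulative criteria: published on the web under an open license, machine-readable, in a non-proprietary format, identified with URIs, and linked to other data. The function and parameter names are hypothetical and the scoring is a simplified reading of the model.

```python
def five_star_rating(on_web_open_license, machine_readable, non_proprietary,
                     uses_uris, linked_to_other_data):
    """Return 0-5 stars; each star requires all of the previous ones."""
    criteria = [on_web_open_license, machine_readable, non_proprietary,
                uses_uris, linked_to_other_data]
    stars = 0
    for met in criteria:
        if not met:
            break       # a missed criterion caps the rating at this level
        stars += 1
    return stars


# An openly licensed CSV file with no URIs or links to other data.
print(five_star_rating(True, True, True, False, False))
```

A platform built on RDF and SPARQL targets the last two criteria, which file-based catalogs usually leave unmet.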

Examples

Geospatial Data Platforms

ArcGIS Open Data

ArcGIS Open Data is a cloud-based SaaS platform where users can explore both spatial and non-spatial data in a consistent interface, extract specific features, and access the data through multiple open formats and APIs. It is included for free with ArcGIS Online, leverages ArcGIS services, and integrates with hundreds of open-source applications for mobile, web and desktop. ArcGIS Open Data uses Koop, an open-source ETL engine that automatically transforms web services into accessible formats.

Examples

GeoNode

GeoNode is an open-source platform for developing geospatial information systems (GIS) and for deploying spatial data infrastructures. It is designed to be extended and modified, and can be integrated into existing platforms.

Examples

Additional Reading

These links provide more information and background on technology options.