Three-star formats. Three-star data should preferably be published in one of the following formats, whichever is more convenient for the data publisher. From the user's standpoint there is little difference between these formats, though .json is likely the most convenient.
• csv files. The documentation must specify the character encoding, whether a comma or semicolon is used as the delimiter, and whether a period or comma is used as the decimal separator. Files should preferably include a header row listing the names of the fields. The official csv format should certainly be used as the basis, including its nuances regarding quotation marks etc. (see http://en.wikipedia.org/wiki/Comma-separated_values).
• json files. The same character-encoding requirements apply.
• xml files.
Four-star formats. The principles are the same as for three-star data; the primary difference is that globally unique identifiers – URIs, or uniform resource identifiers – are used to identify objects. Uniform identifiers make it much easier to use data across different systems.
To adopt URIs, a dataset prefix is added to each object identifier during data export, for instance http://institution.ee/nameofdataset/objects/, so that the full URI would be http://institution.ee/nameofdataset/objects/45321, where 45321 is the object's original ID in the dataset. If the IDs are unique only within their own table rather than across the whole dataset (the most common situation), the easiest approach is to form the URI during export with the name of the relevant table in place of "objects", for instance http://institution.ee/nameofdataset/naturalpersons/.
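The prefixing step above can be sketched as follows; the base URI and the table/ID values are the illustrative examples from the text, not a fixed convention.

```python
# Sketch of URI generation during export. The prefix
# http://institution.ee/nameofdataset/ is the example from the text.
BASE = "http://institution.ee/nameofdataset/"

def object_uri(table: str, local_id: int) -> str:
    """Build a globally unique URI by prefixing the table name,
    since raw IDs are usually unique only within one table."""
    return f"{BASE}{table}/{local_id}"

print(object_uri("objects", 45321))
# -> http://institution.ee/nameofdataset/objects/45321
print(object_uri("naturalpersons", 45321))
```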
Once objects are presented as URIs, it becomes appropriate, besides the use of csv/json/xml, to express the data in the form of RDF – as entity-attribute-value triples.
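An entity-attribute-value triple can be pictured as a plain three-element tuple: subject URI, attribute URI, value. The URIs and values below are hypothetical illustrations built on the naming scheme from the text.

```python
# A minimal illustration of entity-attribute-value (RDF) triples as
# Python tuples; the URIs and values are hypothetical examples.
person = "http://institution.ee/nameofdataset/naturalpersons/45321"

triples = [
    (person, "http://institution.ee/nameofdataset/naturalpersons/dob", "1980-05-17"),
    (person, "http://institution.ee/nameofdataset/naturalpersons/name", "Mari Maasikas"),
]

for subject, attribute, value in triples:
    print(subject, attribute, value)
```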
As data can appear in various syntaxes in RDF, we advise using one of the following two:
• Microdata, which encodes data into html: the information can simultaneously be easily read by humans and is also readily machine-readable.
• RDFa, which is analogous to Microdata and has the same objectives, but is slightly more complex.
Data in Microdata format can always be converted into RDFa with little trouble. The next question after choosing between Microdata and RDFa is the selection of field names – the names of object properties. There are two main approaches.
• The simplest way is to express the table/field name pairs of the original dataset in the form of URIs, for instance http://www.institution.ee/nameofdataset/nameoftable/nameoffield, an example being http://www.institution.ee/permitrecipients/naturalpersons/dob.
• A slightly more complicated but potentially more useful alternative is to use, instead of the table/field name of the original system, a more general and common property name where one exists. Schema.org is a suitable collection to search for such names. Note that if no suitable name is found, users will later find it easy to convert the exported names into a form suitable for their purposes, provided they can understand the meaning of the exported field name.
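The second approach amounts to a lookup table from local field-name URIs to common property names, falling back to the original URI when no match exists. The mapping entry below is an assumption for illustration, pairing the text's dob example with schema.org's birthDate.

```python
# Hypothetical mapping from local field-name URIs to more general
# schema.org property names, where such a name exists.
FIELD_MAP = {
    "http://www.institution.ee/permitrecipients/naturalpersons/dob":
        "http://schema.org/birthDate",
}

def export_name(field_uri: str) -> str:
    # If no suitable common name is found, keep the original URI;
    # users can convert it later if they understand its meaning.
    return FIELD_MAP.get(field_uri, field_uri)

print(export_name("http://www.institution.ee/permitrecipients/naturalpersons/dob"))
```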
The topic of ontologies should also be addressed when discussing how field names are expressed. Ontologies may be viewed as rules for converting and classifying field names – for instance, stating that our field name in the form of the URI http://www.institution.ee/permitrecipients/naturalpersons/dob means precisely the same thing as schema.org's Thing>Person>birthDate.
From this viewpoint, ontologies are not a directly relevant or complicated topic for publishing data; rather, they are a useful tool for application developers who mash up data from different sources. When exporting data for publication, it would be expedient to write an ontology oneself, either to document one's field names or to map a dataset's existing field names to schema.org names.
Five-star formats. The idea of linked data is to use, instead of the identifiers used in internal databases, a de facto universal global identifier – the URI. Let us suppose that the state agrees on (or that the Population Register and Business Register adopt) URI formats for personal identification codes and Business Register codes of the form http://prefix1.ee/prefix2/personalIDcode and http://prefix3.ee/prefix4/companycode. In that case, the five-star representation of personal and company codes would use URIs whose prefixes and formats are not the publisher's own but the more broadly agreed ones. The same goes for the names of database fields – the names of object properties.
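Under the supposed agreement above, exporting a five-star identifier is a matter of substituting the agreed prefix; prefix1.ee/prefix2 and prefix3.ee/prefix4 are the placeholders from the text, and the code values are invented examples.

```python
# Sketch assuming nationally agreed URI formats exist for personal
# ID codes and company codes; the prefixes are placeholders from
# the text, not real agreed formats.
def person_uri(personal_id_code: str) -> str:
    return f"http://prefix1.ee/prefix2/{personal_id_code}"

def company_uri(company_code: str) -> str:
    return f"http://prefix3.ee/prefix4/{company_code}"

print(person_uri("38001010000"))
print(company_uri("12345678"))
```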
How should a dataset be published in practice? There are three main technological ways of publishing data.
• For human-readable files, the directories containing the files are packaged, a short content description is added, and the packaged directory (or directories) is uploaded in freely downloadable form, preferably on the institution's website at http://<institution domain name>.ee/avaandmed (http://<institution domain name>.ee/opendata) or in the opendata.riik.ee repository.
• For data held in databases, export the database content into a structured text format – xml, csv, json, etc. – and then apply the same simple package-and-upload-to-web-server method. If the database contains personal data not subject to disclosure, those fields are simply not exported.
• Alternatively, the data in the databases may be published as a free web service that can be used to find and download the entire content of the dataset or a filtered subset. The web service may be a SOAP service, but simpler ones are preferred, for instance json-based REST services, or services that return csv and accept get or post input parameters.
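The export-and-publish approach can be sketched as below: dump database rows to csv (with an explicit delimiter and header row, as the three-star requirements ask) and to json, while omitting fields not subject to disclosure. The table contents and field names are hypothetical.

```python
import csv
import io
import json

# Hypothetical rows as exported from a database table.
ROWS = [
    {"id": 1, "name": "Mari Maasikas", "personal_id_code": "48001010000"},
    {"id": 2, "name": "Jaan Tamm", "personal_id_code": "38001010000"},
]
# Personal data not subject to disclosure is simply not exported.
EXCLUDED = {"personal_id_code"}

def public_rows(rows):
    return [{k: v for k, v in r.items() if k not in EXCLUDED} for r in rows]

# csv export: semicolon delimiter and a header row, both of which
# must be stated in the documentation.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"], delimiter=";")
writer.writeheader()
writer.writerows(public_rows(ROWS))
print(buf.getvalue())

# json export of the same filtered content.
print(json.dumps(public_rows(ROWS), ensure_ascii=False))
```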
The data must be described in a reasonable manner, i.e. a person with no previous experience with the dataset, but who understands the field and the technology, must with reasonable effort be able to understand the purpose, structure and content of the dataset.
The dataset must include a description of the principles for updating it and the planned frequency of updates. The publisher of the dataset has no direct obligation to update it regularly – what matters is that the update plan (or its absence) is recorded in writing in a comprehensible fashion.
The datasets published by an institution must be easy to find. To that end, at least two means of publishing the existence of, descriptions of, and download links to the data must be used:
• A special directory on the institution’s own website, /avaandmed (/opendata), such as http://www.institution.ee/opendata.
• National consolidated open data site/repository http://opendata.riik.ee.
Open Spatial Data
Kristian Teiter
Estonian Land Board
What are spatial data?
Simply put, spatial data are data with a geographic location and form. Such data are also called geodata, geoinformation and location data. As a rule, spatial data are presented and used in the form of a map that can be considered the spatial data output of a database. For instance, one of the outputs of the topographic data administered in the Topography Database of Estonia is topographical maps, but the data themselves can be used and made available online in xml format.
Fields that account for the principal use of spatial data include environmental protection, planning, construction, logistics, transport, the military and statistics, to name a few. More and more potential is seen in location, and the use of spatial data in different walks of life is growing explosively.
Address