Lean Open Data / Open Dataset Guidelines
The following guidelines on implementing open datasets are recommendations and lessons learned based on Open Water Foundation (OWF) experience. OWF adheres to these guidelines when developing open datasets.
Introduction
Datasets may consist of original data or new derivations that are needed when exiting datasets do not facilitate use. For example, OWF creates datasets that join important foundational data, such as cross-referencing identifiers from different datasets. Most organizations tend to focus on their own data rather than cross-referencing to other datasets.
Depending on content and maturity of a dataset, a dataset may be accessed directly on the web or may need to be saved locally for further processing and to protect from changes to the online version. The more mature a dataset is, the more likely that it can be used directly via file access, by web service, or data portal. Versioning of the dataset is important to understand contents of the dataset.
Data Format
Many datasets can be represented in multiple forms including tables, time series, and map layers. The goal is to maintain the data in simple form such as an Excel workbook or common spatial data format while allowing conversion to other formats that may be more suitable for data processing and visualization. More complex data management solutions such as databases can be implemented when the initial approach does not scale. This approach should also support versioning data through timestamped snapshots of datasets as well as incorporating time within data to recognize the evolution of data in time (version and effective time).
This guide focuses significant attention on representing datasets in Excel workbooks, recognizing that many entities that maintain datasets have used Excel and will continue to use Excel for data management. Excel provides many convenient features and the data can generally be extracted and converted to other formats. The following formats are intended to be generally supported, although specific processing needs to be implemented in each case. See also the "Learn” websites that OWF has published for CSV, GeoJSON, etc.
- Tabular data formats:
- Microsoft Excel
- suitable for small datasets
- can be versioned through file naming
- can represent normalized tabular data using multiple worksheets in a workbook
- Comma-separated-value (CSV)
– the most general data form
- can represent tabular data
- has minimal metadata capability, although standards are emerging
- JSON
– useful for web applications
- has limitations dealing with null, NaN, and date/times that must be overcome by convention and software
- Microsoft Excel
- Spatial data formats:
- GeoJSON
– a more specialized form of JSON
- open standard for spatial data layer that contains geometry data, coordinate system, bounding box, and properties
- has limitations of JSON and currently there is no metadata standard
- standards are emerging for specifying cartography symbolization defaults
- Esri shapefile
– commonly used with Geographic Information Systems (GIS)
- comprised of multiple files (must zip to reference as a single URL) so difficult to use directly with a URL reference
- attribute names are restricted to 10 characters so longer names are truncated or full solution must be designed to this limitation
- cartographic symbolization is typically added using an Esri layer file (
*.lyr
)
- Esri file geodatabase
- requires GIS software to understand, best supported by Esri ArcGIS software
- Other map layer formats include KML, Well Known Text (KML), etc.
- can have benefits but try GeoJSON first
- GeoJSON
– a more specialized form of JSON
- Time series data formats:
- The tabular formats mentioned above can be used for time series data where the table contains columns for date/time, value, flag, etc.
- Representing additional time series properties and metadata requires more specific formats such as those handled by the TSTool software (DateValue, various model formats, WaterML, etc.)
Conversion between different formats is accomplished by using software such as TSTool and GIS.
It is OWF’s goal to overcome limitations in various formats through software features and automated process workflow logic.
This will allow maintenance of datasets in suitable formats while allowing for conversion to other formats as needed.
Open datasets created by OWF often have an analysis
or similar folder, with workflow to convert from original source to
alternate published formats.
Excel Tips
Excel workbooks used for data management need to be properly formatted to allow export of data in CSV format. Otherwise, issues can occur such as misformatting of identifiers that use leading zeros.
Data Flags
Datasets often have gaps due to various reasons. For example, a data object may be present in one dataset but is missing from another, or spatial data may be missing. It is desirable to clearly indicate data quality and missing data, rather than letting data users assume the reasons for missing (blank) values in dataset files. The following standard is proposed to clearly indicate data quality and missing values. The flag values and meaning can be included as a worksheet in an Excel workbook to document the meaning of data. The order shown in the table is from most certain to least certain value and reflects the following guideline:
- Color-coding in Excel is intended to allow a reviewer of the Excel format to quickly identify data that may be lower quality or need attention. Because color-coding is not available in simple electronic data formats such as CSV or JSON, the meaning of the color-coding must have an alternate representation in those formats, which is one reason to include a data flag column.
- Single-character flags are used for brevity. This also provides for the option of following the letter with a number to indicate a method, for example the method of estimating a value.
- Uppercase letters indicate values that are more certain.
Standard Data Flags
Flag Column Value in Excel and CSV | Excel Data Cell Value | Excel Data Cell Formatting | Description |
---|---|---|---|
G |
Known (good) value. | No color | Non-blank or blank value is good based on known source or high certainty cross-reference. |
g |
Estimated (good) value. | Yellow fill, or maybe light gray? | Non-blank or blank value is good but is based on an estimate technique. A number following the character can be used to indicate the method of estimation, for example g1. The method should be documented. |
N |
Blank | No color | Not applicable. No value (blank) is expected. |
M |
Blank | No color | Value is known to be missing in original source and therefore blank indicates that a value cannot be provided. |
m |
Blank | Light gray fill | Value is estimated to be missing. |
z |
Blank | Orange fill | Unable to confirm value. Value is possible but can’t confirm one way or the other. |
x |
Blank | Black fill | No attempt has been made to populate the data value, for example because time has not been allocated to review the data. |
N-to-many Relationships
It is useful to normalize data objects into “flat” table representations.
However, this approach has issues when a data field (column) may contain more than one relationship.
Consider a municipal growth management area that may overlap multiple counties.
The following are options for representing this N-to-many relationship situation.
Multiple options can be used at the same time to improve data access and usability.
For this example, assume that the Excel workbook contains a table Municipality
for municipalities and County
for associated counties.
-
Don't do this except in special cases: Include columns named
County1
,County2
, etc., in theMunicipality
table. Data rows will include each county name that is associated with the municipality. The counties may be sorted alphabetically or by some other attribute such as largest population. Blank values are used if the maximum number of counties is not appropriate for a row. This approach is undesirable because it breaks normal database normalization rules and can increase software complexity. Only use this aproach if the number of relationships is small and finite. The following approach is preferred. -
Include a column called “County” (or similar) with single value:
- Blank if no county is related.
- Name of county if only one county is related (e.g.,
Adams
) - Use
Multiple
if more than one county. This indicates that 1+ county names must be retrieved from a “relationship” table (see next item). This has the advantage of providing a simple value for reporting but complex data relationships require the relationship table.
- Use a separate relationship table (Excel worksheet) called
Municipality_County_Relate
(or similar), which contains columns that represent the relationship between municipality and county data, with 1+ rows for each municipality, each representing a municipality/county relationship. This design also supports 0+ relationships. This table can also include additional data that are appropriate for the relationship, for example percent of the municipality’s area that is in the county. If population of the relationship requires an external process such as spatial data processing, then the workflow for managing the data is more complex and a process needs to be put in place. - Include an extra column called
County_CSV
(or similar) that contains a comma-separated-value list of counties, for exampleAdams, Washington
. This list should agree with the relate table list and should be considered a derived column that depends on other column for input. This column is helpful to visually inspect data and for data displays. - It may be appropriate to indicate the population method in a flag column.
For example, add a column
Column_Flag
and useg1
value to indicate that a relationship was populated in a way that resulted in good quality/match using method1
. Method1
(and other methods) should be documented. - The direction of the relationship should be considered. For example, if relationships between municipalities and water providers are described, the relationship table could exist in either or both datasets. However, to avoid double data entry and potential errors that result, the main relationship table should be included in only one dataset and should be added to other datasets only when processing derived data products.
Spatial Data
Spatial data products such as basic locate maps and other visualizations such as choropleths (thematic maps) are very useful to provide context for datasets. However, spatial data formats can be cumbersome to work with and limit access to data if data users do not have access to GIS software. Therefore, one approach that can be taken for many core datasets is to use a tabular data format as the main data representation and join spatial data as appropriate to create derived spatial data products. This allows the dataset to be maintained without GIS software. The following discusses other aspects of spatial data products.
Point Data
Datasets that contain point data are relatively simple to maintain in tabular form by including
longitude (Longitude
) and latitude (Latitude
) decimal degree values as columns in the table. Considerations include:
- Longitude and latitude decimal degrees are generally satisfactory for simple visualizations and analysis; however, projection to a coordinate system may be needed for analysis such as those involving distance and area. Such projection could be done on the fly as needed, or be included in the original data.
- Coordinates for points are simple. However, coordinates for lines, areas, and three-dimensional forms require more thought.
For example, the following are options to mark municipalities on a map,
and the option that is selected should be handled consistently and documented in metadata:
- Location of main office corresponding to a postal address
- Location of centroid of service area
- Manually-selected location
- Other?
- Using the coordinate to create a URL for web map, for example, should result in a reasonable location.
Line and Area (Region) Data
Datasets can also contain line and region data, for example to indicate the county in which an entity exists. The determination of the values may be straightforward such as by inspecting maps or other datasets, in which case the value (e.g., county name) can simply be added to the dataset. In other cases, it may be appropriate or necessary to use a GIS process to join spatial data in order to accurately determine region intersections, and this process may also contribute additional data such as calculated values. A challenge is that the results of the GIS relate process may need to be absorbed back into the core data management format, such as Excel workbook. Ideally this can be accomplished by using appropriate unique identifiers so that the Excel relationship table is a swap and replace or can be otherwise automated. Additional considerations include:
- Is the spatial data layer used as input a single file/database or does it involve multiple files? For example, the growth management area boundaries for municipalities may exist in multiple files from multiple sources. Aggregating and normalizing such data in a way that is reproducible and can be updated over time needs to be handled.
- Is the join a simple one-to-one relationship or N-to-many. If the latter, see the guidelines for relationships.
- Is the join a linear reference where a geometry feature and position along the feature is needed. Multiple attributes complicates data management.
- How does spatial data vary over time? For example, are the spatial data static (e.g., Hydrologic Unit Code boundaries) or do they change over time (e.g., growing city boundary). These issues need to be considered when handling spatial data.
- What is the format for output products in order to publish the dataset in spatial form, for example:
- GeoJSON
- Esri Shapefile