It’s Wednesday morning in Copenhagen, and we’re looking at how to populate data into our database. This article covers the fourth of our “Six Steps” — you can read the overview here.
This article is aimed at engineers, researchers and execs in the Wind Industry, to help them understand the process of digitalisation. If you’re qualified in Systems Architecture or Data/Software Engineering, you’re way ahead of this; just get stuck in already!
About the ‘Populate’ step
There’s not much general advice we can give about populating your data store, because this is the step that differs most from project to project. In some cases it might not even be necessary (for example, if you’re tying into an existing store), and you can safely skip it.
Just show me the code!
Populating Elevations Data
We should always be delivering data with clear sourcing. I can’t stand it when you get data from Google or somewhere, and it seems legitimate but isn’t sourced. You can’t use that for science!
We’d forgotten this in our brainstorming, so we added an extension to the database graph we developed earlier. This allows us to specify the source of the dataset with a proper reference:
The underlying dataset we used to provide the elevations is the Copernicus DEM — Global and European Digital Elevation Model (COP-DEM) GLO-30 dataset:
- DOI: https://doi.org/10.5270/ESA-c5d3d65
- Direct link: https://spacedata.copernicus.eu/collections/copernicus-digital-elevation-model
We accessed it via the AWS S3 mirror, which provides easy access to the dataset’s GeoTIFF files:
- Information: https://copernicus-dem-30m.s3.amazonaws.com/readme.html
- URL: https://copernicus-dem-30m.s3.amazonaws.com
- S3 URI:
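As an illustration of how tiles on the mirror can be addressed, here’s a small sketch that builds a tile URL from a coordinate. The naming pattern is inferred from the bucket’s readme — treat it as an assumption and check it against the bucket listing before relying on it:

```python
import math

def glo30_tile_url(lat, lon):
    """Build the public HTTPS URL of the GLO-30 tile containing (lat, lon).

    Assumption: tiles are named after the 1-degree cell's south-west corner,
    e.g. Copernicus_DSM_COG_10_N50_00_E010_00_DEM. Verify this pattern
    against the bucket's readme before using it in anger.
    """
    tile_lat = math.floor(lat)
    tile_lon = math.floor(lon)
    ns = "N" if tile_lat >= 0 else "S"
    ew = "E" if tile_lon >= 0 else "W"
    name = (f"Copernicus_DSM_COG_10_{ns}{abs(tile_lat):02d}_00_"
            f"{ew}{abs(tile_lon):03d}_00_DEM")
    return f"https://copernicus-dem-30m.s3.amazonaws.com/{name}/{name}.tif"


print(glo30_tile_url(50.3, 10.7))
```

Using `math.floor` (rather than `int`) keeps southern/western coordinates pointing at the correct south-west corner.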
While developing the populator, for cross-validation purposes, we developed a short script to plot data over a map on a Plotly chart — if we get time, we’ll refactor that into an Observable for people to use.
Elevations in GLO-30 are provided at 1 arcsecond spatial resolution, which varies in ground distance depending on where you are on the globe but is broadly about 30m.
Looking up the H3 cell statistics, we see that Level 12 hexagons have an edge length of ~10m (so a cell is roughly 20m across), making Level 12 the first level that’s finer than the ~30m spatial resolution of the data itself. There’s no point going finer than this, so we don’t populate any cells finer than L12.
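As a sanity check, here’s the back-of-envelope arithmetic behind that choice, assuming a spherical Earth and the published average H3 edge lengths:

```python
import math

# Assumption: spherical Earth with mean radius 6,371 km.
EARTH_RADIUS_M = 6_371_000

# Ground distance of one arcsecond of latitude (the GLO-30 pixel size).
metres_per_arcsecond = 2 * math.pi * EARTH_RADIUS_M / (360 * 3600)  # ~30.9 m

# Average H3 hexagon edge lengths in metres, from the published cell statistics.
h3_edge_m = {11: 24.9, 12: 9.4}

# A hexagon is roughly two edge lengths across, so Level 12 is the first
# level whose cells are narrower than the ~30 m DEM grid spacing.
print(f"pixel ~ {metres_per_arcsecond:.1f} m")
print(f"L11 width ~ {2 * h3_edge_m[11]:.1f} m, L12 width ~ {2 * h3_edge_m[12]:.1f} m")
```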
How we populate higher levels
The L12 cells are populated by nearest-neighbour sampling of the original TIFF files at the centre of each hexagon. That works well, because L12 cells are smaller than the raster grid spacing.
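A minimal sketch of that sampling step, in pure Python. A nested list stands in for the GeoTIFF raster, and the cell-centre coordinate is a made-up example (in the real populator the centres come from the H3 library and the raster from the GLO-30 tiles):

```python
def sample_nearest(grid, lat_max, lon_min, pixel_deg, lat, lon):
    """Nearest-neighbour lookup in a north-up raster.

    grid      -- 2D list of elevations; row 0 is the northern edge
    lat_max   -- latitude of the raster's top edge
    lon_min   -- longitude of the raster's left edge
    pixel_deg -- pixel size in degrees (~1 arcsecond for GLO-30)
    """
    row = int((lat_max - lat) / pixel_deg)  # pixel containing the point
    col = int((lon - lon_min) / pixel_deg)
    return grid[row][col]


# Toy 2x2 raster covering lat 0..2, lon 0..2 at 1-degree pixels.
raster = [[110.0, 120.0],   # northern row
          [130.0, 140.0]]   # southern row

# Sample at a (hypothetical) hexagon centre.
elevation = sample_nearest(raster, lat_max=2.0, lon_min=0.0,
                           pixel_deg=1.0, lat=0.5, lon=1.5)
print(elevation)  # 140.0
```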
But Level 12 is a lot of data to render if you want to cover a whole country! What if we need something coarser? Rather than attempting to analyse the raw data, we were able to take advantage of the graph structure to aggregate values up to coarser resolutions:
- Populate all L12 hexagons in an area.
- To populate a parent L11 cell, take the seven L12 hexagons inside it and average them.
- Keep going until reaching Level 8 (simply because we didn’t think coarser cells would be very useful).
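The steps above can be sketched in plain Python. The parent lookup here is a stand-in dict, where the real code would use the H3 library’s parent/child functions, and the cell ids are made up for illustration:

```python
from collections import defaultdict
from statistics import mean

def aggregate_up(child_values, parent_of):
    """Average child-cell values into their parent cells.

    child_values -- {cell_id: elevation} at the finer level
    parent_of    -- {cell_id: parent_cell_id}; a stand-in for the H3
                    library's cell-to-parent lookup
    """
    grouped = defaultdict(list)
    for cell, value in child_values.items():
        grouped[parent_of[cell]].append(value)
    return {parent: mean(values) for parent, values in grouped.items()}


# Hypothetical L12 cells "a".."g" all sitting inside L11 parent "P".
l12 = dict(zip("abcdefg", [10, 20, 30, 40, 50, 60, 70]))
parents = {cell: "P" for cell in l12}

l11 = aggregate_up(l12, parents)
print(l11)  # {'P': 40}
```

Repeating the same call on each successive output walks the values up from L12 to L8.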
There are some very powerful ways of doing aggregation inside databases themselves, but we stayed simple and just wrote some Python code inside the populator service. Easy wins the day!
Engineering the Populator service
We used our own SDK to create and deploy a service to Google Cloud Run. The point of our SDK is to help wrap scientific code and provide it as a data service, so we’re constantly working on features to give the extra helping hand.
It works on a “question-answer” model. Some features we’re proud of:
- Under the hood it uses an event stream to initiate a “question”, and a second (question-specific) event stream to manage communication (for logs, monitoring metrics, progress updates, and optionally a final “answer”).
- It handles errors by capturing all the inputs of a question, so an error can be reproduced for investigation with a single line of code.
- Services using the framework can ask each other questions.
- Extra tools for querying and managing files in cloud object stores.
The populator service only uses the most basic aspects of the framework: it sits there waiting, and the API can ask it to add a list of hexagons to the database.
We added a “Secret” to the service (you can see the Secret Manager resources in our Terraform config) to hold the credentials for accessing the database. Never commit credentials to source code!
We started without using the database at all, just plotting directly, to make sure we were pulling out regions correctly.
The populator is set up to load just a region around a selected area. This allows us to lazily-load data on demand by sending a question to the populator (from the API service).
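In outline, the lazy-loading logic looks something like this. It’s an illustrative sketch: the store is an in-memory dict standing in for the database, and `fetch_elevation` is a hypothetical callable standing in for the DEM sampling step:

```python
def populate_region(cell_ids, store, fetch_elevation):
    """Populate only the cells in a region that aren't already stored.

    cell_ids        -- H3 cell ids covering the requested region
    store           -- dict standing in for the database (hypothetical)
    fetch_elevation -- callable sampling the DEM for one cell (hypothetical)

    Returns the ids that were newly populated, so the caller (e.g. the API
    service asking its "question") knows what work was done.
    """
    missing = [cell for cell in cell_ids if cell not in store]
    for cell in missing:
        store[cell] = fetch_elevation(cell)
    return missing


# The first request populates the region; a repeat request does no work.
db = {}
newly = populate_region(["hex1", "hex2"], db, lambda cell: 42.0)
print(newly)  # ['hex1', 'hex2']
print(populate_region(["hex1", "hex2"], db, lambda cell: 42.0))  # []
```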