72 hours at WindEurope: Populating data

Tom Clark
5 min read · Apr 26, 2023


It’s Wednesday morning in Copenhagen, and we’re looking at how to populate data into our database. This article covers the fourth of our “Six Steps”; you can read the overview here.

This article is aimed at engineers, researchers and execs in the wind industry, to help them understand the process of digitalisation. If you’re qualified in Systems Architecture or Data/Software Engineering, you’re way ahead of this; just get stuck in already!

About the ‘Populate’ step

There’s really not a lot we can say about populating your data store, because this is the part that’s different every time. In some cases it might not even be necessary; if you’re tying into an existing store, for example, you can safely skip this step.

Just show me the code!

With pleasure :)

Populating Elevations Data

Ensuring provenance

We should always be delivering data with clear sourcing. I can’t stand it when you get data from Google or somewhere, and it seems legitimate but isn’t sourced. You can’t use that for science!

We’d forgotten this in our brainstorming, so we added an extension to the database graph we developed earlier. This allows us to specify the source of the dataset with a proper reference:

Adding a graph node to preserve scientific provenance of the data
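Here’s a rough sketch of what that looks like in practice, assuming a Neo4j-style graph accessed with the official neo4j Python driver. The connection details, labels and property names (DataSource, Cell, HAS_SOURCE) are illustrative, not our exact schema:

```python
# A minimal sketch: attach a data-source node to the cells it describes.
# All names and values here are illustrative placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MERGE (s:DataSource {name: $name})
SET s.citation = $citation, s.url = $url
WITH s
MATCH (c:Cell {index: $cell_index})
MERGE (c)-[:HAS_SOURCE]->(s)
"""

with driver.session() as session:
    session.run(
        CYPHER,
        name="COP-DEM GLO-30",
        citation="Copernicus DEM GLO-30, European Space Agency",
        url="https://registry.opendata.aws/copernicus-dem/",
        cell_index="8c1f59da49a747f",  # example H3 cell index (placeholder)
    )
driver.close()
```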

Data sources

The underlying dataset we used to provide the elevations is the Copernicus DEM — Global and European Digital Elevation Model (COP-DEM) GLO-30 dataset.

We accessed it via the AWS S3 mirror, which provides easy access to the dataset’s GeoTIFF files.
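As a rough sketch, here’s how one GLO-30 tile can be read straight from the public mirror with rasterio. The bucket name and key layout below are my reading of the AWS Open Data registry entry, so check the registry for the exact naming convention:

```python
# Sketch: read a single GLO-30 tile from the public AWS mirror.
# Bucket and key are assumed from the registry entry; adjust as needed.
import rasterio

TILE = (
    "s3://copernicus-dem-30m/"
    "Copernicus_DSM_COG_10_N55_00_E012_00_DEM/"
    "Copernicus_DSM_COG_10_N55_00_E012_00_DEM.tif"
)

# The mirror is public, so requests don't need to be signed.
with rasterio.Env(AWS_NO_SIGN_REQUEST="YES"):
    with rasterio.open(TILE) as src:
        print(src.crs, src.res, src.bounds)
        elevations = src.read(1)  # 2D array of elevations for the tile
```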

While developing the populator, we also wrote a short script to plot the data over a map on a Plotly chart for cross-validation purposes. If we get time, we’ll refactor that into an Observable for people to use.

About Resolution

Elevations in GLO-30 are provided at 1 arcsecond spatial resolution, which in ground distance varies depending on where you are on the globe, but is broadly about 30m.

Looking up the H3 cell statistics, we see that Level 12 hexagons have an edge length of ~10m, making Level 12 the first level that’s finer than the spatial resolution of the data itself. There’s no point going finer than this, so we don’t populate any cells finer than L12.

These Level 12 hexagons have a 10m edge length. You can see the underlying resolution of the data in this view over some rough terrain. The slight oversampling of the dataset shows up in the pattern here, though not perfectly, because of the nearest-neighbour sampling and the mismatch between the two grids.
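If you want to check the numbers yourself, the h3-py package exposes the cell statistics directly (v4-style function names shown here; older v3 releases call this h3.edge_length):

```python
# Print average H3 edge lengths for a few resolutions (h3-py v4-style API).
import h3

for res in range(8, 13):
    edge_m = h3.average_hexagon_edge_length(res, unit="m")
    print(f"Resolution {res}: average edge length ~{edge_m:.1f} m")
```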

How we populate higher levels

The L12 cells are populated by nearest-neighbour sampling of the original TIFF files at the centre of each hexagon cell. That’s ideal, because L12 cells are smaller than the grid size.
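A minimal sketch of that sampling step, using h3-py (v4-style names) and rasterio, with a placeholder tile path:

```python
import h3
import rasterio

def sample_elevations(cells, tile_path):
    """Return {cell: elevation} by sampling the raster at each L12 cell centre."""
    elevations = {}
    with rasterio.open(tile_path) as src:
        band = src.read(1)
        for cell in cells:
            lat, lng = h3.cell_to_latlng(cell)  # cell centre (h3.h3_to_geo in v3)
            row, col = src.index(lng, lat)      # pixel containing that point
            elevations[cell] = float(band[row, col])
    return elevations
```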

But Level 12 is a lot of data to render if you want to cover a whole country! What if we need something coarser? Rather than attempting to analyse the raw data, we were able to take advantage of the graph structure to aggregate values up to coarser resolutions:

  • Populate all L12 hexagons in an area.
  • To populate a parent L11 cell, take the seven L12 hexagons inside it and average them.
  • Keep going until reaching Level 8 (simply because we didn’t think coarser cells would be very useful).
Aggregate up to coarser levels by averaging seven values per parent hexagon

There are some very powerful ways of doing aggregation in databases, but we stayed simple and just wrote some Python code inside the populator service (sketched below). Easy wins the day!
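For what it’s worth, the aggregation itself fits in a few lines. Here’s a sketch assuming the L12 values are already in a dict keyed by H3 index (v4-style h3-py names; v3 uses h3_to_parent / h3_to_children):

```python
import h3

def aggregate_up(l12_values, coarsest=8):
    """Average child values upwards from L12 to the given coarsest resolution."""
    values = dict(l12_values)
    for child_res in range(12, coarsest, -1):
        parent_res = child_res - 1
        parents = {
            h3.cell_to_parent(cell, parent_res)
            for cell in values
            if h3.get_resolution(cell) == child_res
        }
        for parent in parents:
            children = h3.cell_to_children(parent, child_res)
            child_values = [values[c] for c in children if c in values]
            if child_values:
                values[parent] = sum(child_values) / len(child_values)
    return values
```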

Levels 9 through 12 of refinement in the same region used for cross-checking above

Engineering the Populator service

We used our own SDK to create and deploy a service to Google Cloud Run. The point of our SDK is to help wrap scientific code and provide it as a data service, so we’re constantly working on features to give an extra helping hand.

It works on a “question-answer” model. Some features we’re proud of:

  • Under the hood it uses an event stream to initiate a “question”, and a second (question-specific) event stream to manage communication (for logs, monitor metrics, progress updates, and optionally a final “answer”).
  • It handles errors by capturing all the inputs of a question, so an error can be reproduced for investigation with a single line of code.
  • Services using the framework can ask each other questions.
  • Extra tools for querying and managing files in cloud object stores.

The populator service only uses the most basic aspects of the framework: it sits there, and the API can ask it to add a list of hexagons to the database.
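To give a flavour of the “question” shape, a request to the populator boils down to something like this (the structure and field names here are illustrative only, not the real schema):

```python
# A hypothetical question asking the populator to fill in some H3 cells.
question = {
    "input_values": {
        "h3_cells": [
            "8c1f59da49a747f",  # placeholder cell indices
            "8c1f59da49a75bf",
        ],
    },
}

# Conceptually, the API service pushes this onto the question event stream and
# can either wait for the final "answer" or fire and forget.
```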

Security

We added a “Secret” to the service (you can see the Secret Manager resources in our terraform config); its purpose is to hold the credentials for accessing the database. Never commit credentials to source code!
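At runtime the secret is simply exposed to the service, so the code only ever reads it from its environment (the variable name here is illustrative):

```python
import os

# Injected by Cloud Run from Secret Manager; never hard-coded or committed.
DATABASE_CREDENTIALS = os.environ["DATABASE_CREDENTIALS"]
```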

Cross checking

We started without using the database at all, just plotting directly, to make sure we were pulling out regions correctly.

We used a variety of different locations and maps, plotting hexagons with non-zero opacity to check that we were correctly correlating values with features in the landscape.
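A rough sketch of that kind of plot, using plotly.express and h3-py (v4-style names); the centre coordinates and styling are placeholders:

```python
import h3
import plotly.express as px

def plot_cells(values, center=(55.7, 12.6), zoom=11):
    """Plot {h3_cell: elevation} as a semi-transparent overlay on a basemap."""
    features = []
    for cell in values:
        ring = [[lng, lat] for lat, lng in h3.cell_to_boundary(cell)]
        ring.append(ring[0])  # close the polygon ring
        features.append(
            {
                "type": "Feature",
                "id": cell,
                "geometry": {"type": "Polygon", "coordinates": [ring]},
            }
        )
    geojson = {"type": "FeatureCollection", "features": features}

    fig = px.choropleth_mapbox(
        {"cell": list(values), "elevation": list(values.values())},
        geojson=geojson,
        locations="cell",
        color="elevation",
        mapbox_style="open-street-map",
        opacity=0.5,  # non-zero opacity so terrain features show through
        center={"lat": center[0], "lon": center[1]},
        zoom=zoom,
    )
    fig.show()
```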

Lazy Loading

The populator is set up to load just a region around a selected area. This allows us to lazily load data on demand by sending a question to the populator (from the API service).
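Sketching that idea: the region can be built as a disk of L12 cells around whatever the user selected (grid_disk in h3-py v4, k_ring in v3), and the resulting list becomes the question payload. The selected cell and the ask_populator call below are placeholders:

```python
import h3

def cells_around(selected_cell, rings=50):
    """Return the cells in a disk of `rings` rings around the selected cell."""
    return sorted(h3.grid_disk(selected_cell, rings))

cells = cells_around("8c1f59da49a747f")  # placeholder L12 cell
# ask_populator(cells)  # hypothetical call made by the API service
```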


Tom Clark

Fluid Dynamicist at the core, lover of chaos theory. Experienced scientist, developer and team lead working in wind energy — from startups to heavy industry.