It’s Tuesday morning in Copenhagen, and we’re talking about architecture with people in the hall. This article covers the second of our “Six Steps” — you can read the overview here.
This article is aimed at engineers, researchers and execs in the wind industry, to help them understand the process of digitalisation. If you’re qualified in systems architecture or data/software engineering, you’re way ahead of this; just get stuck in already!
About the ‘Architect’ step
Product and/or data architecture sounds very cool. It brings to mind extremely highly paid people at Google, finessing the perfect solution in their minds whilst having a conversation over a perfectly brewed latte. The reality is different: it’s rough and ready, and subject to rapid change.
The objective is to write down roughly what the key components of your system will be, and roughly how they connect.
The golden rule for this step is that it’ll never be right the first time. Accept that, and you won’t get roadblocked. This is why the “Six Steps” are a cycle, not a straight line. Each time around the cycle, you improve and iterate on the architecture.
Who should do this?
Pretty much anyone with a general background in data/science/tech can approach it, although whether you should is really up to you and how you want to spend your time. If you tackle it, pass your ideas by someone experienced — even a 30-minute call will give you peace of mind.
What tools are available?
No kidding: at Octue, we always — and I mean always — use paper and pens. This is somewhat about creativity, but mostly because it’s the fastest way of getting diagrams down: using colours allows you to quickly differentiate components / sidenotes / complete tangents where you teach your team about an obscure aspect of computational geometry.
If you have to do it remotely, we’ve tried a lot of whiteboarding solutions (Jamboard and so on), but the best one by far is Google Slides. It’s not marketed for this purpose, but it probably has the least glitchy interface of all the collaborative drawing tools out there, plus you can export slides to SVG, meaning your diagrams stay useful and modifiable elsewhere.
DON’T DO: Systems-level diagrams
Controversial opinion: we think it’s not worth bothering with systems-level architecture diagrams (at this stage):
- Remember the previous step: keep the scope tiny, and isolated from your other efforts.
- As soon as you iterate even slightly, they’re out of date.
- In a subsequent step (‘Engineering’) we’ll use a tool called ‘Terraform’, from which you can quickly generate system diagrams (see the sketch after this list).
- You’ll quickly develop experience — when you need to get serious about these things, you’ll know.
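As a taste of how cheap those generated diagrams are (a minimal sketch, assuming the Terraform and Graphviz CLIs are installed and there’s a Terraform configuration in the working directory):

```python
import subprocess

# Ask Terraform for its resource dependency graph, in DOT format...
dot_graph = subprocess.run(
    ["terraform", "graph"], capture_output=True, text=True, check=True
).stdout

# ...then render it to SVG with Graphviz.
subprocess.run(
    ["dot", "-Tsvg", "-o", "system_diagram.svg"],
    input=dot_graph, text=True, check=True,
)
```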
That said, if it helps to clarify what’s going on in the first instance, then get stuck in.
Getting started
How are you supposed to write down the entire system for a new data service?! To get started I always think about three things, in this order:
- Data structure
- Usability
- Storage
With these three thought through, you’ll be able to draw a really basic diagram of your system.
Always, always, ALWAYS start with data structure
Even if you think you won’t use it later, this process will clarify in your mind the data that goes into and out of the system.
Create or collect some examples of the raw data that you’re processing. Make sure they’re not commercially sensitive (so you can share them later). Then write down a clear schema for that data — use JSONSchema (for general data exchange) or Avro (for extremely high-throughput pipelines).
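As a minimal sketch of what that looks like in practice, here’s a hand-written JSONSchema being used to validate an example record in Python (the field names are illustrative, not from a published industry schema):

```python
from jsonschema import validate

# An illustrative schema for a single anemometer reading.
SCHEMA = {
    "type": "object",
    "properties": {
        "timestamp": {"type": "string", "format": "date-time"},
        "wind_speed_ms": {"type": "number", "minimum": 0},
        "turbine_id": {"type": "string"},
    },
    "required": ["timestamp", "wind_speed_ms", "turbine_id"],
}

# An example raw record, as it might arrive from an instrument.
record = {
    "timestamp": "2023-04-25T09:00:00Z",
    "wind_speed_ms": 8.3,
    "turbine_id": "T042",
}

validate(instance=record, schema=SCHEMA)  # Raises ValidationError on bad data.
```

Even this tiny exercise forces you to decide on names, units and required fields, which is exactly the clarity you’re after at this stage.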
Pro tip / Shameless plug: Octue are building a repository for JSONSchema to help you do this. At the time of writing it’s in very early alpha, but you’ll soon be able to find examples from across the industry — or publish your own so your whole team are on the same page.
Then think about usability
We’ll come back to this in much more detail later in the “Six Steps”, so I won’t duplicate material here. Come and meet us in the hall (or, if you’re catching up after the event, read ahead) to see what we have in mind.
Finally, think about storage
Sorry for the length of this one! Skip to the next section if you don’t care about data stores.
I’ve also excluded the lengthy topic of data lakes, data warehouses, data meshes and data <trendy term here>. These are the purview of data and knowledge engineers. For the grassroots digitalisation work we’re talking about here, don’t worry about them. If you’re doing the six steps, and the result gets some traction with your colleagues, any solution you’ve chosen can be evolved to be part of such things.
You may not have to store data, but if you do, here are a few considerations, with tips on tools and examples:
Object stores. For big binary files like audio, video or specialised instrument data, the chances are that dumping them into a cloud object store is the way to go.
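For example, here’s a minimal sketch that dumps one file into a Google Cloud Storage bucket (the bucket and paths are placeholders; S3 and Azure Blob Storage have close equivalents):

```python
from google.cloud import storage

# Upload one binary instrument file to a cloud bucket.
client = storage.Client()
bucket = client.bucket("my-raw-data-bucket")  # Placeholder bucket name.
blob = bucket.blob("lidar/2023-04-25/scan-001.bin")
blob.upload_from_filename("scan-001.bin")
```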
Pro tip / Shameless plug: We built a really powerful way of creating a data lake from a mass of legacy files on a hard drive… Octue’s ‘django-twined’ and ‘octue-sdk-python’ libraries are designed to upload/download files and metadata in cloud storage, synchronising entries between an object store, a SQL database and your laptop. That means you can filter and query for cloud files straightforwardly, then download just the ones you need!
Timeseries / event databases. For data which arrives in high-volume streams of small “events”, consider solutions like BigQuery or InfluxDB. They’re at their best where data slowly gets less useful over time (e.g. you do some daily analysis or aggregation, after which the raw data rarely gets touched).
Pro tip: Beware! If you frequently need to query across these whole datasets to select subsets, querying can get very expensive. In that case, be aware of the need to cluster tables, or switch to PostgreSQL with a JSON column to contain the event data.
Pro tip: This talk, “From Blade to BigQuery”, shows you, complete with full open-source code, how to get events from a wind turbine to the cloud.
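As a flavour of the event-stream pattern (a minimal sketch using the official BigQuery Python client; the table ID and fields are illustrative, and the talk above covers a full production pipeline):

```python
from google.cloud import bigquery

client = bigquery.Client()

# A small batch of illustrative turbine events.
rows = [
    {"timestamp": "2023-04-25T09:00:00Z", "turbine_id": "T042", "power_kw": 1480.2},
    {"timestamp": "2023-04-25T09:00:10Z", "turbine_id": "T042", "power_kw": 1502.7},
]

# Streaming insert into an existing table (placeholder table ID).
errors = client.insert_rows_json("my-project.telemetry.turbine_events", rows)
if errors:
    raise RuntimeError(f"Failed to insert rows: {errors}")
```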
Graph databases like neo4j. These are great where you have highly relational data (although you trade off data integrity) and need to fetch many things at once through the relations. There are two stellar use cases:
- Federation. You can straightforwardly connect graphs across databases, so when you get to that stage, you’ll be able to join up your own private data with public (or other private) data securely.
- Scalability. You can scale these things to trillions of nodes, so if your dataset is going to get mega quickly, you can avoid all sorts of troubles maintaining a conventional SQL instance.
Pro tip: A comment from Octue’s Senior Software Engineer, Marcus Lugg on using neo4j for the first time last week: “I thought it was weird at first but this query language is really beautiful; [this query] would be a nightmare of joins in SQL”
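To give a flavour of that (a hypothetical sketch, not the query Marcus was referring to; the labels and properties are made up), here’s how you’d fetch a node and everything reachable through a chain of relations in one pattern match, where SQL would need recursive joins:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Fetch a turbine and every component connected to it, at any depth,
    # in a single variable-length pattern match (no joins required).
    result = session.run(
        "MATCH (t:Turbine {id: $id})-[:HAS_COMPONENT*1..]->(c) RETURN c",
        id="T042",
    )
    components = [record["c"] for record in result]
```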
NoSQL databases like MongoDB. These are probably not the answer. For our industry sector, other than the event/timeseries streams and graphs mentioned above, I’ve never seen a use case that couldn’t be covered with PostgreSQL (see below). If you disagree and have a good use case, please comment below; I’d love to hear it!
Last but not least, SQL databases are a universal starter’s workbench. If you’re choosing a SQL database these days, it’s PostgreSQL or nothing. Postgres has powerful NoSQL capability built in, and is just a really versatile workhorse.
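For example, here’s a minimal sketch of that NoSQL capability, storing and querying event payloads in a JSONB column with `psycopg2` (the connection string, table and fields are placeholders):

```python
import psycopg2

connection = psycopg2.connect("dbname=mydb user=me")  # Placeholder connection string.

with connection, connection.cursor() as cursor:
    # A plain table with a schemaless JSONB column for the event payload.
    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS events (
            id SERIAL PRIMARY KEY,
            payload JSONB NOT NULL
        )
        """
    )
    cursor.execute(
        "INSERT INTO events (payload) VALUES (%s)",
        ('{"turbine_id": "T042", "wind_speed_ms": 8.3}',),
    )
    # Query into the JSON, just like a document store.
    cursor.execute(
        "SELECT payload FROM events WHERE payload->>'turbine_id' = %s", ("T042",)
    )
    print(cursor.fetchall())
```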
Here’s the rub: if you need a database and don’t KNOW that one of the above DB types is right for you, starting with PostgreSQL is a safe bet. Remember, this isn’t about getting it right; it’s about learning quickly:
If you start with PostgreSQL, you’ll either know within one iteration of the Six Steps that your choice was super wrong, or have something that’ll tide you over for a while.
Architecting a solution for 72 hours at WindEurope
And now for the good bit! We’ve long wanted to try out a geospatial indexing system called ‘h3’, which uses a hexagonal mesh covering the globe.
The system is incredibly elegant. It was invented by engineers at Uber to help manage the widely varying spatial density of their data points (outside a rail station you’ll have many data points per square metre; in the countryside you’ll have few or none, so you need to manage different spatial resolutions).
They open-sourced it (thanks Uber!); read their beautiful blog post here:
The mesh has successive refinement levels, meaning that with a single integer, you can represent not only location on the earth but also spatial resolution inherent to the data.
So, we’ll use that as a really compact way of expressing data.
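For example, here’s a minimal sketch using the h3 Python bindings (the v4 API; function names differ in v3), indexing a point in central Copenhagen:

```python
import h3

# Index a lat/lng point at resolution 9 (cells of roughly 0.1 km²).
cell = h3.latlng_to_cell(55.6761, 12.5683, 9)
print(cell)  # A single index string encoding both location AND resolution.

# Walk up to the coarser parent cell at resolution 5.
parent = h3.cell_to_parent(cell, 5)
print(h3.get_resolution(parent))  # 5
```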
Brainstorming
We sat down for about three hours with some pens and some paper. Sorry, we can’t really capture that here, but for those of you following along live, come and talk: we’ll work with you in the hall to go through this stage, or help you through your own.
Thinking about data structure
We thought about the ways people might need to fetch data. We concluded that to get started, we’d want to fetch elevations for a single point, a collection of points, or for a region (if we’re displaying on a map).
From our brainstorming, we already knew we’d need an API, which is where our service meets the outside world. We talked roughly about this at first, but later published definitions of what the data looks like at the boundary:
- https://jsonschema.registry.octue.com/octue/h3-elevations-input/0.1.0.json
- https://jsonschema.registry.octue.com/octue/h3-elevations-output/0.1.0.json
Thinking about usability
Most of the uses we can think of involve either putting data onto a map or loading it into Python. JavaScript developers are pretty familiar with fetching, and the fetch pattern is tied intimately to the use case, so JavaScript is covered.
On the Python side, we wanted to make it as easy as possible for non-developers, so we figured that a very lightweight Python library would allow you to get elevations with a single line of code. More on this later.
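To show the sort of thing we mean in the meantime, here’s a hypothetical sketch that calls the API directly with `requests` (the endpoint URL, payload shape and cell index are all placeholders, not a published interface):

```python
import requests

# Hypothetical: POST some h3 cells to the elevations API and get one
# elevation back per cell. URL and field names are illustrative only.
response = requests.post(
    "https://api.octue.com/h3-elevations",  # Placeholder endpoint.
    json={"h3_cells": ["8a1f05835a37fff"]},  # Placeholder cell index.
)
response.raise_for_status()
print(response.json())
```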
Database selection and graph structure
Luckily, our statement above about PostgreSQL holds true (or we’d be quite embarrassed!). We looked into it and… yes, you can totally store h3 data efficiently in Postgres! Postgres also has PostGIS, a highly advanced geospatial extension, which really complements the use case.
But we work with Postgres all the time at Octue, and this should be a learning experience for us too. So, because the hexagonal data structure is a heptree graph (each node divides into seven child nodes), we’ve decided to try out a technology that’s new to us — the neo4j graph database. The idea is to:
- Efficiently traverse up or down the tree, to aggregate data up from the fine resolutions or zoom in from coarser data (see the sketch after this list).
- Bind any number of data sources later (starting with elevations, but thinking big!)
- Federate databases, so if a customer has confidential, high-resolution measurements on site, we can easily join them in.
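Here’s a minimal sketch of the first of those, aggregating elevations one level up the heptree (the node label, relationship type and property names are our illustration, not a published schema):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Aggregate up the tree: average the seven resolution-9 children
    # of each resolution-8 cell to get its coarser elevation.
    result = session.run(
        """
        MATCH (parent:Cell {resolution: 8})-[:PARENT_OF]->(child:Cell {resolution: 9})
        RETURN parent.h3_index AS h3_index, avg(child.elevation) AS elevation
        """
    )
    for record in result:
        print(record["h3_index"], record["elevation"])
```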
Because our data is so simple, the graph is both straightforward and beautiful.
Wrapping Up
Here’s where we’re at, with a little explanation:
This is all you need for the first iteration. If the architecture diagram has more than just a few clear elements, you may struggle to deliver.
Remember, you can always grow and adapt, but to get something in place, start simple. See you next time!