Powering OpenStreetMap's Future: A year of improvements from OpenStreetMap Foundation’s Site Reliability Engineer

Just over one year ago, I joined the OSM Foundation to improve the reliability and security of the technology and infrastructure that powers OpenStreetMap. I've worked closely with the volunteer-driven Operations Working Group this past year, and I think the work done to improve our processes and documentation has helped make us all more effective. I'm grateful for the support to work with this group, and I'm happy to see how much progress we've made in building what's needed for the future of OpenStreetMap.

I'll go into a little detail below about what's transpired. At a high level, I made it easier to manage deployment of the software running on our servers; hardened our network infrastructure through better redundancy, monitoring, access, and documentation; grew our use of cloud services for tile rendering, leveraging a generous AWS sponsorship; improved our security practices; refreshed our developer environments; and last but definitely not least, finalized migration of 16 years of content from our old forums to our new forums.

If you want to hear more from me about the work over the last year, check out my talk at State of the Map 2022 and my interview on the GeoMob podcast. I'd also love to hear from you: email me at osmfuture@firefishy.com.

2022-2023 Site Reliability Details

Manage software on our servers

Containerised small infrastructure components (GitHub Actions for building)

I have containerised many of our small sites, which were previously built using bespoke methods in our chef codebase as part of the "Configuration as code" setup, and moved the builds to GitHub Actions. This sets up a base for any future container (aka docker) based projects. These are our first container / docker based projects hosted on OSMF infrastructure.

Our chef based code is now simpler, more secure and deploys quicker.

Improved chef testing (ops onboarding documentation)

We use chef.io for infrastructure (configuration) management of all our servers and the software used on them. Over the last year the chef test kitchen tests have been extended and now also work on modern Apple Silicon machines. The tests now run reliably as part of our CI / PR processes, adding quality control and assurance to the changes we make. Adding ARM support was easier because we could use test kitchen before moving to ARM server hardware.

Having reliable tests should help onboard new chef contributors.

Hardened our network infrastructure

Network Upgrades @ AMS (New Switches, Dual Redundant Links, Dublin soon)

Our network setup in Amsterdam was not as redundant as it should have been. We had outgrown the Cisco Small Business equipment, we had unexpected power outages due to hardware issues, and the equipment was limiting future upgrades. The ops group decided to replace it with Juniper equipment, which we had standardised on at the Dublin data centre. I replaced the equipment in a live environment with minimal downtime (<15 mins).

Both the Dublin and Amsterdam data centres now use a standardised configuration with improved security config. Each server now has fully bonded links for improved redundancy and performance. The switches have improved power and network redundancy. We are awaiting the install of the fully resilient uplinks (order submitted) in the next month.

Out of Band access to both data centres (4G based)

I built and installed an out-of-band access device at each site. The devices are hard-wired to networking and power management equipment using serial consoles. Each device has a resilient 4G link to a private 4G network (1NCE). The devices are custom-built Raspberry Pis with redundant power supplies and 4x serial connectors.

Documentation of Infrastructure to ease maintenance (Racks / Power)

I documented each rack unit, power port (Power Distribution Unit), network connection and cable at the data centres. This makes it easier to manage the servers, reduces errors and allows us to properly instruct remote hands (external support provider) to make any changes on our behalf.

Oxidized (Visibility of Network Equipment)

Our network and power distribution configuration is now stored in git and changes are reported. This improves visibility of any changes within the team, and in turn improves security.

Config is continuously monitored and any config drift between our sites is now much easier to resolve.
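To illustrate what detecting config drift between two sites can look like, here is a minimal sketch in Python. It is not how Oxidized works internally; the function name `config_drift`, the site labels, and the sample config lines are all made up for illustration, and it assumes each site's config is available as plain text.

```python
import difflib


def config_drift(site_a, site_b, name_a="site-a", name_b="site-b"):
    """Return unified-diff lines describing how two config texts differ.

    An empty list means the two configs are identical.
    """
    a_lines = site_a.splitlines()
    b_lines = site_b.splitlines()
    return list(
        difflib.unified_diff(
            a_lines, b_lines, fromfile=name_a, tofile=name_b, lineterm=""
        )
    )


# Hypothetical sample configs; real switch configs would be read from git.
ams = "set system ntp server 192.0.2.1\nset vlans servers vlan-id 100\n"
dub = "set system ntp server 192.0.2.1\nset vlans servers vlan-id 101\n"

drift = config_drift(ams, dub, "amsterdam", "dublin")
for line in drift:
    print(line)
```

Because the configs live in git, a comparison like this could run on every change and report only the lines that differ between sites.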

Terraform Infrastructure as Code (improve management / repeatability)

Terraform is an infrastructure-as-code tool. We now use it for managing our remote monitoring service (statuscake), and I am in the process of implementing it to manage our AWS and Fastly infrastructure.

Previously these components were all managed manually using the respective web UIs. Infrastructure-as-code allows the Ops team to work on changes collaboratively, and enhances the visibility, repeatability and rollback of any changes.

Likewise we manage all our domains and DNS using dnscontrol code. Incremental improvements have been made over the last year, including adding CI tests to improve outside collaboration.

Grew our use of cloud services

AWS in use for rendering infrastructure. Optimised AWS costs. Improved security. Improved Backup. More in pipeline

The Ops team have slowly been increasing our usage of AWS over a few years. I have built out multiple usage-specific AWS accounts using an AWS organisation model to improve billing and security as per AWS best practice guidelines. We generously received AWS sponsorship for expanding our rendering infrastructure. We built the experimental new rendering infrastructure on ARM architecture using AWS Graviton2 EC2 instances.
We hadn't previously used ARM based servers. As part of improvements to our chef (configuration as code) we had already added local testing support for Apple Silicon (ARM), so only small additions to chef were needed for compatibility with ARM servers.

We were impressed by the performance of AWS Graviton2 EC2 instances for running the OSM tile rendering stack. We also tested on-demand scaling and instance snapshotting for potential further rendering stack improvements on AWS.
We have increased our usage of AWS for data backup.

Improved our security

Over the last year a number of general security improvements have been made. For example, server access is now via ssh key (password access is now disabled). We've also moved from a bespoke gpg based password manager for the ops team to gopass (a feature-rich version of https://www.passwordstore.org/ ), which improves key management and sharing of the password store.

We have also enhanced the lockdown of our wordpress instances by reducing installed components, disabling inline updates and disabling XMLRPC access. We are also working to reduce the number of users with access and to remove unused access permissions.

Documented key areas of vulnerability requiring improvement (Redundancy, Security, etc)

Documentation on technical vulnerability: I am producing a report on key areas of vulnerability requiring improvement (Redundancy, Security, etc). The document can be used to focus future investment to further reduce our exposure to risks.

Refreshed our developer environments

New Dev Server

We migrated all dev users to a new dev server in the last year. The old server was end of life (~10 years old) and was reaching capacity limits (CPU and storage). The new server was delivered directly to the Amsterdam data centre and physically installed by remote hands. I communicated the migration, and then migrated all users and projects across.

Retired subversion

I retired our old svn.openstreetmap.org code repository in the last year. The repository had been in use since the inception of the project and contains a rich history of code development over the years. I converted the svn repository to git using a custom reposurgeon config, taking care to maintain the full contribution history and to correctly link previous contributors (350+) to their respective github and related accounts. The old svn links have been maintained and now link to the archive on github: https://github.com/openstreetmap/svn-archive

Forum Migration

For the old forum migration, we migrated 1 million posts spanning 16 years to discourse. All posts were converted from fluxbb markup (BBCode) to discourse's flavour of markdown. All accounts were merged and auth was converted to OpenStreetMap.org "single sign-on" based auth.
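As a rough illustration of the markup conversion step, here is a minimal Python sketch. It assumes the old posts used BBCode-style tags such as `[b]...[/b]`; the function name `bbcode_to_markdown` and the small rule set are made up for this example, and it is not the importer that was actually used. A real conversion would need to cover many more tags (quotes, lists, code blocks, nesting).

```python
import re

# Ordered (pattern, replacement) pairs for a handful of common BBCode tags.
# This list is illustrative only and far from complete.
_RULES = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.I | re.S), r"**\1**"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.I | re.S), r"*\1*"),
    (re.compile(r"\[url=(.*?)\](.*?)\[/url\]", re.I | re.S), r"[\2](\1)"),
    (re.compile(r"\[url\](.*?)\[/url\]", re.I | re.S), r"<\1>"),
    (re.compile(r"\[img\](.*?)\[/img\]", re.I | re.S), r"![](\1)"),
]


def bbcode_to_markdown(text):
    """Apply each rewrite rule in turn and return the converted text."""
    for pattern, replacement in _RULES:
        text = pattern.sub(replacement, text)
    return text


sample = "[b]Hello[/b] mappers, see [url=https://osm.org]the map[/url]."
print(bbcode_to_markdown(sample))
```

Regex rules like these are easy to read and test one tag at a time, but they do not handle nested or malformed tags well; a proper parser is the safer choice for a million real-world posts.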

All the old forum links redirect to the correct imported content: users, categories (countries etc), thread topics, and individual posts.
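The redirect idea can be sketched as a lookup from old IDs to new paths. The Python below is a minimal illustration under stated assumptions: the old URL shapes (`viewtopic.php?id=` and `viewtopic.php?pid=`), the lookup tables `TOPIC_MAP` and `POST_MAP`, the example host, and the target paths are all hypothetical, and the real redirects may be implemented quite differently (for example, in the web server or in discourse itself).

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical lookup tables that an import could produce:
# old forum ID -> path of the imported content on the new forum.
TOPIC_MAP = {1234: "/t/example-topic/5678"}
POST_MAP = {99: "/t/example-topic/5678/3"}


def _first_int(query, key):
    """Return the first value for key as an int, or None if absent/invalid."""
    values = query.get(key)
    if not values or not values[0].isdigit():
        return None
    return int(values[0])


def redirect_target(old_url):
    """Map an old forum URL to a new path, or None if there is no mapping."""
    parsed = urlparse(old_url)
    query = parse_qs(parsed.query)
    if parsed.path.endswith("/viewtopic.php"):
        post_id = _first_int(query, "pid")
        if post_id is not None:
            return POST_MAP.get(post_id)
        topic_id = _first_int(query, "id")
        if topic_id is not None:
            return TOPIC_MAP.get(topic_id)
    return None


print(redirect_target("https://forum.example.org/viewtopic.php?id=1234"))
print(redirect_target("https://forum.example.org/viewtopic.php?pid=99#p99"))
```

Users and categories could be handled the same way with their own tables; unknown or malformed links return `None` so the caller can fall back to a generic landing page.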
