Background
Previously, our Grafana setup hinged on an EC2 instance with a local SQLite db as persistent storage. To improve the reliability and availability of Grafana, we needed to change this.
Process
We’ve broken the work out into a few migration phases:
- Migrate off the SQLite db to Postgres for a shareable storage backend
- Move Grafana to a scalable workload orchestrator (ECS)
- Lower the TTL on the associated DNS records so traffic switches over faster when a Grafana task has to be cycled out (ECS Service Discovery defaults to 10s, which we can match)
- Configure high availability (along with unified alerting)
- To de-duplicate alert propagation (Grafana evaluates alert rules on each instance), we will need to configure peering between the instances dynamically.
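As a rough sketch of the peering piece, the snippet below resolves a service discovery name to the current task IPs and turns them into Grafana's unified alerting HA settings. The discovery name, the helper names, and the choice to resolve peers ourselves are assumptions for illustration, not what we actually deployed; 9094 is Grafana's default gossip port for ha_listen_address.

```python
import socket

GOSSIP_PORT = 9094  # Grafana's default port for unified alerting HA gossip


def resolve_peer_ips(discovery_name):
    """Return the sorted, de-duplicated IPv4 addresses behind a DNS name.

    discovery_name is a hypothetical ECS Service Discovery record,
    e.g. "grafana.internal.example".
    """
    infos = socket.getaddrinfo(
        discovery_name, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def unified_alerting_env(peer_ips, port=GOSSIP_PORT):
    """Build the GF_UNIFIED_ALERTING_* environment variables for HA peering."""
    peers = ",".join(f"{ip}:{port}" for ip in peer_ips)
    return {
        "GF_UNIFIED_ALERTING_HA_LISTEN_ADDRESS": f"0.0.0.0:{port}",
        "GF_UNIFIED_ALERTING_HA_PEERS": peers,
    }
```

In practice this would be called as unified_alerting_env(resolve_peer_ips(name)) when a task starts, so each new instance picks up whichever peers are registered at that moment.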
Migrating backend to Postgres
Grafana was backed by a local SQLite db, which restricted our ability to scale the application. For a more reliable experience for users, we moved the storage backend to Postgres, which multiple Grafana instances can share.
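For reference, pointing Grafana at Postgres comes down to the [database] settings, which can also be supplied as GF_DATABASE_* environment variables. Here is a minimal sketch that builds that environment; the function name and all of the example values are placeholders, not our real configuration.

```python
def grafana_database_env(host, port, name, user, password, ssl_mode="require"):
    """Build the GF_DATABASE_* environment variables for a Postgres backend.

    All arguments are placeholders; in a real deployment the password
    would come from a secret store rather than being passed in plain text.
    """
    return {
        "GF_DATABASE_TYPE": "postgres",
        "GF_DATABASE_HOST": f"{host}:{port}",  # Grafana expects host:port in one value
        "GF_DATABASE_NAME": name,
        "GF_DATABASE_USER": user,
        "GF_DATABASE_PASSWORD": password,
        "GF_DATABASE_SSL_MODE": ssl_mode,
    }


# Hypothetical example values
env = grafana_database_env("grafana-db.example.internal", 5432, "grafana", "grafana", "change-me")
```

Keeping these as environment variables means the same container image can be reused across environments, with only the task definition changing.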
Migrating Grafana to ECS
Upgrading to Grafana v12
While Grafana does support rolling updates, this was a major version with known breaking changes and we haven't enabled HA yet, so I figured it would be smoother to upgrade at this point and then enable HA.
Some notable changes for us in the new version:
- Git Sync support (Experimental)
- Dashboard schema changes (Experimental)
- Drilldown
See the full breakdown of changes here.