Managing PostgreSQL-as-a-Service at Large Scale in SAP Multi-Cloud platform
SAP Cloud Platform (SCP) is an open platform-as-a-service (PaaS) product that provides core services for building and extending cloud applications on multiple IaaS providers. SCP supports AWS, OpenStack, Azure and GCP.
One of the core services provided by SCP is PostgreSQL-as-a-Service. Each PostgreSQL-as-a-Service instance consists of 5 VMs: a Postgres-Master, a Postgres-Standby and 3 PGPOOL VMs. Data is replicated asynchronously from Postgres-Master to Postgres-Standby.
SCP manages more than 10,000 PostgreSQL-as-a-Service instances across multiple IaaS providers.
PostgreSQL-as-a-Service - Robust & Highly Available
The pgpool VMs continuously monitor the health of the Postgres VMs. If a failure is detected, pgpool triggers the promotion of Postgres-Standby to Postgres-Master. The failover process includes a STONITH operation for auto-correction.
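The failover decision described above can be sketched as follows. This is a minimal illustration, not the platform's actual logic: the `Node` class, its fields and the `fail_over` function are hypothetical names, and real fencing would power off or isolate the failed VM rather than set a flag.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str            # "master", "standby" or "fenced" (assumed labels)
    healthy: bool = True
    fenced: bool = False


def fail_over(master: Node, standby: Node) -> Node:
    """Return the node that should act as master after a health check."""
    if master.healthy:
        return master
    if not standby.healthy:
        raise RuntimeError("no healthy node available for promotion")
    # STONITH step: fence the failed master before promoting, so it can
    # never accept writes alongside the new master (avoids split brain).
    master.fenced = True
    master.role = "fenced"
    standby.role = "master"
    return standby
```

Fencing before promotion is the point of the sketch: with asynchronous replication, an old master that comes back unfenced could diverge from the promoted standby.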
A PostgreSQL-as-a-Service instance provides point-in-time recovery (PITR) using WAL archiving. A base backup and the WAL files are archived on cloud storage. On AWS, Azure and GCP, the base backup is snapshot based. On OpenStack, the base backup is taken by copying and compressing the data directory.
PostgreSQL-as-a-Service remains available during a base backup. The recovery process restores the data directory from a base backup and replays the WAL files up to the desired recovery target time.
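To illustrate the recovery flow, the sketch below picks the base backup to restore for a given target time. The function names and the plan dictionary are assumptions made for illustration; only `recovery_target_time` is an actual PostgreSQL recovery setting.

```python
from datetime import datetime
from typing import Dict, List, Optional


def pick_base_backup(backup_times: List[datetime],
                     target: datetime) -> Optional[datetime]:
    """Return the newest base backup taken at or before the target time."""
    candidates = [t for t in backup_times if t <= target]
    return max(candidates) if candidates else None


def recovery_plan(backup_times: List[datetime],
                  target: datetime) -> Dict[str, str]:
    """Describe which backup to restore and where WAL replay should stop."""
    base = pick_base_backup(backup_times, target)
    if base is None:
        raise ValueError("no base backup exists before the target time")
    fmt = "%Y-%m-%d %H:%M:%S"
    return {
        "base_backup": base.strftime(fmt),           # restore this first
        "recovery_target_time": target.strftime(fmt),  # replay WAL up to here
    }
```

A backup taken after the target time is useless for that recovery, because WAL can only be replayed forward from the restored state.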
Multiple plans of PostgreSQL-as-a-Service are available, based on the number of CPU cores, memory and disk size of the instance VMs. A major-version-upgrade feature upgrades an instance to the next higher version.
SCP uses an internal toolchain component for deployment, life-cycle and release management of large-scale distributed services in an IaaS-agnostic manner. All PostgreSQL-as-a-Service instances are deployed using this toolchain. It ensures that every service instance has the necessary VMs running 24x7 and performs corrections if necessary.
A broker component mediates between applications and PostgreSQL-as-a-Service instances. All service operations, such as creating, deleting, updating or upgrading a cluster, are triggered by the applications that use PostgreSQL-as-a-Service instances and are routed via the broker. The broker also routes plan change requests, scheduled backups and scheduled updates of instances.
SAP Multi-Cloud platform performs fully automated rolling updates of PostgreSQL-as-a-Service instances. Every instance is updated bi-weekly to introduce new features, binaries and bug fixes. OS updates and security patches are also applied regularly, which protects the instances against security vulnerabilities.
When the primary is updated as part of a rolling update, the standby is promoted to primary within seconds, providing almost zero downtime for PostgreSQL-as-a-Service instances.
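One plausible ordering of such a rolling update is sketched below. It assumes the standby is updated first and the old primary rejoins as standby afterwards; the abstract does not state this order, and the `rolling_update` function and step strings are hypothetical.

```python
from typing import Dict, List


def rolling_update(roles: Dict[str, str], new_version: str) -> List[str]:
    """Return ordered update steps for a two-node cluster.

    `roles` maps a VM name to "primary" or "standby" and is updated in
    place to reflect the topology after the update.
    """
    primary = next(vm for vm, role in roles.items() if role == "primary")
    standby = next(vm for vm, role in roles.items() if role == "standby")
    steps = [
        f"update {standby} to {new_version}",   # assumed: standby goes first
        f"promote {standby} to primary",        # short switchover window
        f"update {primary} to {new_version}",
        f"rejoin {primary} as standby",
    ]
    # Swap roles so the caller sees the post-update topology.
    roles[standby], roles[primary] = "primary", "standby"
    return steps
```

With this ordering, the only interruption clients see is the promotion step, which matches the near-zero downtime described above.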
A monitoring agent runs in every PostgreSQL VM to report health metrics such as CPU, memory and disk usage, and database information such as service availability, replication status and the number of active connections.
The monitoring agent collects this information and reports it to a centralized monitoring server, which stores it in a time-series database.
A monitoring web application shows the metrics in various charts so that DevOps can identify the instance status for any given date-time range.
An alerting module raises alerts when an undesired state is reported, such as "primary-server-not-available", "replication-down", "disk-size-threshold-crossed" or "backup-failed", among others.
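A rule check of this kind might look like the sketch below. The alert names come from the list above; the metric keys, the 85% disk threshold and the `evaluate_alerts` function are assumptions for illustration only.

```python
from typing import Dict, List, Union

Metric = Union[bool, float]

# Assumed threshold; the real value is not given in the abstract.
DISK_USAGE_THRESHOLD = 0.85


def evaluate_alerts(metrics: Dict[str, Metric]) -> List[str]:
    """Map one reported metrics sample to the alerts it should raise."""
    alerts: List[str] = []
    # A missing availability or replication value is treated as a failure.
    if not metrics.get("primary_available", False):
        alerts.append("primary-server-not-available")
    if not metrics.get("replication_running", False):
        alerts.append("replication-down")
    if metrics.get("disk_usage_ratio", 0.0) > DISK_USAGE_THRESHOLD:
        alerts.append("disk-size-threshold-crossed")
    # A missing backup status is treated as "no failure reported".
    if not metrics.get("last_backup_ok", True):
        alerts.append("backup-failed")
    return alerts
```

For example, a sample with `replication_running` set to `False` and `disk_usage_ratio` at `0.9` would raise both "replication-down" and "disk-size-threshold-crossed".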
All important system logs and custom logs generated by a service instance are pushed to a central system so that Ops can access them to trace conditions and debug problems.
Troubleshooter lets users debug any issue irrespective of service-instance availability.
- 2018 October 16 14:00 PDT
- 50 min
- Silicon Valley