Presented by:

Dinesh Kumar

SAP Labs

Vinayak Jadhav

SAP Labs India

Abhijit Gharami

SAP Labs

SAP Cloud Platform (SCP) is an open platform-as-a-service (PaaS) product that provides core services for building and extending cloud applications on multiple IaaS providers. SCP supports AWS, OpenStack, Azure, and GCP.

One of the core services provided by SCP is PostgreSQL-as-a-Service. Each PostgreSQL-as-a-Service instance consists of five VMs: a Postgres-Master, a Postgres-Standby, and three pgpool VMs. Data is replicated asynchronously from the Postgres-Master to the Postgres-Standby.

SCP manages more than 10,000 PostgreSQL-as-a-Service instances across multiple IaaS providers.

PostgreSQL-as-a-Service - Robust & Highly Available

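The pgpool VMs continuously monitor the health of the Postgres VMs. When a failure is detected, pgpool triggers the promotion of the Postgres-Standby to Postgres-Master. The failover process includes a STONITH ("Shoot The Other Node In The Head") operation for auto-correction, which fences the failed master so that two masters can never run at the same time.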
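
The failover sequence above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not pgpool's actual implementation. The class names, the node names, and the rule "declare the master dead after 3 consecutive failed health checks" are all hypothetical. The sketch shows the ordering that STONITH implies: the old master is fenced *before* the standby is promoted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    role: str             # "master", "standby", or "fenced" (hypothetical labels)
    fenced: bool = False


@dataclass
class FailoverMonitor:
    """Hypothetical health monitor modelled on the pgpool behaviour described above."""
    master: Node
    standby: Node
    max_failures: int = 3   # assumed threshold before the master is declared dead
    failures: int = 0
    events: List[str] = field(default_factory=list)

    def record_health_check(self, master_healthy: bool) -> None:
        if master_healthy:
            self.failures = 0    # a single healthy check resets the counter
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.failover()

    def failover(self) -> None:
        old_master, new_master = self.master, self.standby
        if new_master.fenced:
            self.events.append("no healthy standby to promote")
            return
        # STONITH first: fence the failed master so it cannot come back
        # and accept writes alongside the promoted standby.
        old_master.fenced = True
        old_master.role = "fenced"
        self.events.append(f"fenced {old_master.name}")
        # Then promote the standby.
        new_master.role = "master"
        self.events.append(f"promoted {new_master.name}")
        # The fenced node occupies the standby slot until it is rebuilt (not modelled).
        self.master, self.standby = new_master, old_master
        self.failures = 0
```

Requiring several consecutive failures is a common way to avoid failing over on a single transient network blip. The threshold that SCP actually uses is not stated in the talk.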

Disaster Recovery / Data Protection using Backup and Restore

Each PostgreSQL-as-a-Service instance provides point-in-time recovery (PITR) using WAL archiving. A base backup, along with the WAL files, is archived on cloud storage. On AWS, Azure, and GCP, the base backup is taken as a disk snapshot. On OpenStack, it is taken by copying and compressing the data directory.

PostgreSQL-as-a-Service remains available while the base backup is taken. Recovery involves restoring the data directory from the base backup and replaying the archived WAL files up to the desired recovery target time.
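
The recovery steps above begin by choosing a base backup and then configuring WAL replay. The following is a minimal sketch of that selection, assuming each base backup is recorded with its completion time. `BaseBackup`, `pick_base_backup`, and the `fetch-wal` command are hypothetical names. The rendered settings use real PostgreSQL recovery parameters (`restore_command` with its `%f`/`%p` placeholders, `recovery_target_time`, `recovery_target_action`), written in the `recovery.conf` style used up to PostgreSQL 11.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class BaseBackup:
    backup_id: str
    finished_at: datetime   # a backup is only usable once it has completed


def pick_base_backup(backups: List[BaseBackup], target: datetime) -> BaseBackup:
    """Return the most recent base backup that completed at or before the recovery target."""
    candidates = [b for b in backups if b.finished_at <= target]
    if not candidates:
        raise ValueError("no base backup precedes the recovery target")
    # The newest eligible backup minimises the amount of WAL that must be replayed.
    return max(candidates, key=lambda b: b.finished_at)


def recovery_settings(target: datetime, fetch_cmd: str) -> str:
    """Render recovery parameters; fetch_cmd is a hypothetical tool that copies one
    archived WAL file (%f) from cloud storage to the path PostgreSQL asks for (%p)."""
    return "\n".join([
        f"restore_command = '{fetch_cmd} %f %p'",
        f"recovery_target_time = '{target.isoformat(sep=' ')}'",
        "recovery_target_action = 'promote'",
    ])
```

For example, with weekly backups finished on 1, 8, and 15 October, a target of 10 October selects the 8 October backup. WAL is then replayed from that point up to the target time.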

Multiple Plans to Support All Use Cases

Multiple PostgreSQL-as-a-Service plans are available. They differ in the number of CPU cores, the memory, and the disk size of the instance VMs. A major-version upgrade feature upgrades an instance to the next higher PostgreSQL major version.
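
A plan catalog of this kind can be modelled as a list of VM sizes, plus an ordered list of major versions for stepwise upgrades. In the minimal sketch below, the plan names, the sizes, and the supported version list are hypothetical, because the talk does not give them. The sketch shows how a request could be matched to the smallest sufficient plan, and how "upgrade to the next higher version" resolves.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Plan:
    name: str
    cpu_cores: int
    memory_gb: int
    disk_gb: int


# Hypothetical catalog: real SCP plan names and sizes are not given in the talk.
PLANS: List[Plan] = [
    Plan("small", 2, 4, 50),
    Plan("medium", 4, 16, 200),
    Plan("large", 8, 32, 500),
]

# Hypothetical ordered list of supported PostgreSQL major versions.
SUPPORTED_MAJOR_VERSIONS: List[str] = ["9.6", "10", "11"]


def smallest_plan(cores: int, memory_gb: int, disk_gb: int) -> Plan:
    """Pick the smallest plan that satisfies all three resource requirements."""
    fitting = [p for p in PLANS
               if p.cpu_cores >= cores and p.memory_gb >= memory_gb and p.disk_gb >= disk_gb]
    if not fitting:
        raise ValueError("no plan is large enough")
    return min(fitting, key=lambda p: (p.cpu_cores, p.memory_gb, p.disk_gb))


def next_major_version(current: str) -> str:
    """Major-version upgrades move one step up the supported list."""
    idx = SUPPORTED_MAJOR_VERSIONS.index(current)   # ValueError if unsupported
    if idx + 1 >= len(SUPPORTED_MAJOR_VERSIONS):
        raise ValueError(f"{current} is already the newest supported version")
    return SUPPORTED_MAJOR_VERSIONS[idx + 1]
```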

Horizontal Scalability and IaaS-Agnostic Nature

SCP uses an internal toolchain component for the deployment, life-cycle, and release management of large-scale distributed services in an IaaS-agnostic manner. All PostgreSQL-as-a-Service instances are deployed using this toolchain. It ensures that every service instance has its required VMs running 24/7, and it performs corrections when necessary.
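
"Ensuring the required VMs are running and correcting if necessary" is a reconciliation loop: it compares desired state with observed state and emits corrective actions. The sketch below shows one pass of such a loop, under stated assumptions. The function name, the VM names, and the action strings are hypothetical. The talk does not describe the internal toolchain's actual interface, and deleting unexpected VMs is an added assumption.

```python
from typing import Dict, List, Set


def reconcile(desired: Dict[str, Set[str]], running: Dict[str, Set[str]]) -> List[str]:
    """One reconciliation pass: compare the VMs each service instance should have
    with the VMs actually running, and return the corrective actions to take."""
    actions: List[str] = []
    for instance in sorted(desired):
        want = desired[instance]
        have = running.get(instance, set())
        # Missing VMs (crashed or lost on the IaaS) are recreated.
        for vm in sorted(want - have):
            actions.append(f"recreate {instance}/{vm}")
        # VMs that should not exist are removed (assumed behaviour).
        for vm in sorted(have - want):
            actions.append(f"delete {instance}/{vm}")
    return actions
```

Running this pass periodically, rather than only reacting to events, means a correction also happens when a failure notification is lost.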

A broker component mediates between applications and PostgreSQL-as-a-Service instances. All service operations, such as creating, deleting, updating, or upgrading a cluster, are triggered by the applications that use PostgreSQL-as-a-Service instances and are routed via the broker. The broker also routes plan change requests, scheduled backups of instances, and scheduled updates of instances.
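
The broker's routing role can be sketched as a dispatcher that validates the operation name and hands it to a registered handler. This is a minimal illustration, not SCP's broker API. `Broker`, `Operation`, and the handler signature are hypothetical, and real handlers would call the deployment toolchain instead of returning strings.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Operation(Enum):
    CREATE = "create"
    DELETE = "delete"
    UPDATE = "update"
    UPGRADE = "upgrade"
    CHANGE_PLAN = "change_plan"
    BACKUP = "backup"


class Broker:
    """Hypothetical broker: every service operation passes through route()."""

    def __init__(self) -> None:
        self._handlers: Dict[Operation, Callable[[str], str]] = {}
        self.audit_log: List[Tuple[str, str]] = []

    def register(self, op: Operation, handler: Callable[[str], str]) -> None:
        self._handlers[op] = handler

    def route(self, op_name: str, instance_id: str) -> str:
        op = Operation(op_name)   # raises ValueError for unknown operations
        handler = self._handlers.get(op)
        if handler is None:
            raise NotImplementedError(f"no handler registered for {op.value}")
        # A single entry point gives one place to audit every request.
        self.audit_log.append((op.value, instance_id))
        return handler(instance_id)
```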

Bi-weekly Rolling Updates with Near-Zero Downtime

The SAP MultiCloud platform performs fully automated rolling updates of PostgreSQL-as-a-Service instances. Every instance is updated bi-weekly to introduce new features, binaries, and bug fixes. OS updates and security patches are also applied regularly, which protects the instances against security vulnerabilities.

When the primary is updated as part of a rolling update, the standby is promoted to primary within seconds, which keeps downtime for PostgreSQL-as-a-Service instances near zero.
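
The key point of the rolling update above is ordering: the node that is serving writes is never restarted while it is still the primary. The following minimal sketch produces one plausible step order for a five-VM instance. Two details are assumptions, not stated in the talk: the pgpool VMs are updated one at a time first, and the standby is updated before the switchover. The VM names are hypothetical.

```python
from typing import List


def rolling_update_plan(primary: str, standby: str, pgpools: List[str]) -> List[str]:
    """Return an update order in which only one VM is down at a time and the
    primary is only updated after the standby has taken over."""
    # Assumed: pgpool VMs go one by one while the others keep routing traffic.
    steps = [f"update {vm}" for vm in pgpools]
    # The standby can restart without affecting clients.
    steps.append(f"update {standby}")
    # Switchover: the brief pause of a few seconds happens here.
    steps.append(f"promote {standby}")
    # The old primary is updated last and rejoins as the new standby.
    steps.append(f"update {primary}")
    return steps
```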

Centralized Monitoring System

A monitoring agent runs on every PostgreSQL VM. It reports health metrics such as CPU, memory, and disk usage, along with database information such as service availability, replication status, and the number of active connections.

The monitoring agent collects this information and reports it to a centralized monitoring server, which stores it in a time-series database.

A monitoring web application displays the metrics in various charts, so that DevOps engineers can identify the status of an instance over any given date-time range.
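
To show what the central store has to support, the sketch below uses an in-memory stand-in for the time-series database. It stores one sorted series per (instance, metric) pair, and it answers the date-time range queries that the web application's charts rely on. This is an illustration only: `Sample`, `TimeSeriesStore`, and the metric names are hypothetical, and the talk does not name the actual database.

```python
import bisect
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    instance: str
    metric: str       # e.g. "cpu_percent", "active_connections" (hypothetical names)
    at: datetime
    value: float


class TimeSeriesStore:
    """In-memory stand-in for the central time-series database."""

    def __init__(self) -> None:
        self._series: Dict[Tuple[str, str], List[Tuple[datetime, float]]] = defaultdict(list)

    def ingest(self, s: Sample) -> None:
        # Keep each series sorted by timestamp so range queries can use binary search.
        bisect.insort(self._series[(s.instance, s.metric)], (s.at, s.value))

    def query(self, instance: str, metric: str,
              start: datetime, end: datetime) -> List[Tuple[datetime, float]]:
        """Return all points with start <= timestamp <= end."""
        series = self._series.get((instance, metric), [])
        lo = bisect.bisect_left(series, (start,))
        hi = bisect.bisect_right(series, (end, float("inf")))
        return series[lo:hi]
```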

Multi-channel Alerting System

The alerting module raises alerts when an undesired state is reported, such as "primary server not available", "replication down", "disk-size threshold crossed", and "backup failed", among others.
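
One way to picture the alerting module is as a set of rules evaluated against each reported metrics snapshot, with every firing alert fanned out to all configured channels. The rule names below mirror the alerts listed above. The metric keys, the 85% disk threshold, and the channel names are hypothetical, since the talk does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Metrics = Dict[str, float]


@dataclass(frozen=True)
class AlertRule:
    name: str
    condition: Callable[[Metrics], bool]


# Hypothetical rules; metric keys and thresholds are assumptions.
RULES: List[AlertRule] = [
    AlertRule("primary-server-not-available", lambda m: m.get("primary_up", 1.0) == 0.0),
    AlertRule("replication-down", lambda m: m.get("replication_up", 1.0) == 0.0),
    AlertRule("disk-size-threshold-crossed", lambda m: m.get("disk_used_percent", 0.0) > 85.0),
    AlertRule("backup-failed", lambda m: m.get("last_backup_ok", 1.0) == 0.0),
]


def evaluate(metrics: Metrics, rules: List[AlertRule] = RULES) -> List[str]:
    """Return the names of all rules whose condition holds for this snapshot."""
    return [r.name for r in rules if r.condition(metrics)]


def fan_out(alerts: Iterable[str], channels: Dict[str, Callable[[str], None]]) -> None:
    """Multi-channel delivery: send every alert to every configured channel."""
    for alert in alerts:
        for send in channels.values():
            send(alert)
```

Missing metrics default to "healthy" here so that a partial report does not trigger false alarms. A production system might prefer the opposite default and alert on missing data.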

Centralized Troubleshooting System for All PostgreSQL-as-a-Service Instances

All important system logs and custom logs generated by a service instance are pushed to a central system, so that Ops can access them to trace conditions and debug problems.

The troubleshooter lets users debug issues regardless of whether the service instance itself is available.
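
The property that matters here is that logs live centrally once they are pushed, so they can still be searched when the instance that produced them is down. The minimal sketch below models that with an in-memory store. `LogRecord`, `CentralLogStore`, and the source labels are hypothetical, and the actual log pipeline used by SCP is not described in the talk.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class LogRecord:
    instance: str
    source: str      # e.g. "postgresql", "pgpool", "syslog" (hypothetical labels)
    at: datetime
    message: str


class CentralLogStore:
    """In-memory stand-in for the central log system."""

    def __init__(self) -> None:
        self._records: List[LogRecord] = []

    def push(self, batch: List[LogRecord]) -> None:
        # Agents ship batches; once stored here, the records no longer
        # depend on the originating instance being reachable.
        self._records.extend(batch)

    def search(self, instance: str, contains: str = "",
               source: Optional[str] = None) -> List[LogRecord]:
        """Return matching records for one instance, oldest first."""
        return sorted(
            (r for r in self._records
             if r.instance == instance
             and contains in r.message
             and (source is None or r.source == source)),
            key=lambda r: r.at,
        )
```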

Date:
2018 October 16 14:00 PDT
Duration:
50 min
Room:
Market
Conference:
Silicon Valley
Language:
Track:
Dev
Difficulty:
Medium