The Parallel Retrieve Cursor
I live in Beijing and work at VMware as a software engineer on the Greenplum database team, focusing on data migration between clusters, Greenplum's FDW, and related areas. Before joining VMware, I worked at Oracle and Pivotal R&D on database development and testing. I am also a committer on the Apache HAWQ project.
My name is Junfeng Yang. I hold a master's degree in computer science from the Chinese University of Hong Kong (CUHK). After graduating, I joined the Knowledge and Education Exchange Platform (KEEP) project at CUHK as a research assistant, responsible for information retrieval and data mining.
Since joining VMware (Pivotal), I have been involved in several projects, including: GPText, the Greenplum In-Database Text Analytics Extension, covering text data indexing, retrieval, rich document index support, and natural language processing (NLP); GPCopy, which copies object data from one Greenplum cluster to another for migration purposes; and GP2GP, which provides the ability to run queries across multiple Greenplum clusters.
Greenplum Database is a massively parallel processing (MPP) database based on PostgreSQL. Each cluster consists of a master node, a standby master node, and a number of segment nodes. When a query arrives, a query plan is generated on the master node and dispatched to the segments. The plan is executed on the segments, the results are sent to the master node, and the master gathers them and returns them to the database user.
In this session we first introduce the architecture of Greenplum Database and its internal components, then present a new feature: the parallel retrieve cursor. This feature is designed to reduce the heavy load on the master node. With a normal cursor in Greenplum, the cursor is created and data is fetched on the master node. A parallel retrieve cursor, by contrast, redirects the results to the segments instead of gathering them on the master node. We define a new term, "endpoint", to stand for the results on one segment.
Once a parallel retrieve cursor has been declared on the Query Dispatcher (QD), a corresponding endpoint holding the query result is created on each Query Executor (QE). These endpoints can then be used as data sources, and results can be retrieved from them in parallel on each QE. Each endpoint is associated with a token, which is used to authenticate the retrieve connection.
The client application sets up retrieve-mode connections to these segments and retrieves result data from the endpoints in parallel. This significantly reduces the load on the master node and also provides a foundation for further work on Greenplum's Foreign Data Wrapper.
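The flow described above (declare a cursor, get one endpoint per segment, then pull from every endpoint at once with a token) can be illustrated with a small in-memory model. This is only a conceptual sketch in Python and not Greenplum's actual interface: the names `Endpoint`, `declare_parallel_cursor`, and `retrieve_in_parallel` are hypothetical, segments are modelled as plain lists of rows, and threads stand in for separate retrieve connections.

```python
import secrets
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Endpoint:
    """Hypothetical stand-in for the results one segment holds for a cursor."""
    segment_id: int
    token: str
    rows: list

    def retrieve(self, token: str) -> list:
        # The token authenticates the retrieve connection to this endpoint.
        if not secrets.compare_digest(token, self.token):
            raise PermissionError(f"bad token for segment {self.segment_id}")
        return list(self.rows)


def declare_parallel_cursor(segments: dict) -> list:
    """Create one endpoint per segment instead of gathering rows on the master."""
    return [
        Endpoint(seg_id, secrets.token_hex(16), rows)
        for seg_id, rows in sorted(segments.items())
    ]


def retrieve_in_parallel(endpoints: list) -> dict:
    """Open one simulated retrieve connection per endpoint and fetch concurrently."""
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = {
            ep.segment_id: pool.submit(ep.retrieve, ep.token) for ep in endpoints
        }
        return {seg_id: fut.result() for seg_id, fut in futures.items()}


# Toy data: three segments, each holding its own slice of the query result.
segments = {0: [1, 4, 7], 1: [2, 5, 8], 2: [3, 6, 9]}
endpoints = declare_parallel_cursor(segments)
results = retrieve_in_parallel(endpoints)
```

In this model the master only hands out endpoint descriptors and tokens; the rows themselves never pass through it, which is the load reduction the abstract describes.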
The session includes a demo showing how to use a parallel retrieve cursor in psql.
- 2020 March 23 10:00
- 50 min
- Postgres Conference 2020
- Distributed SQL