Petabyte Scale Data Warehousing with Open Source Greenplum Database
Presented by:
Marshall Presser
Marshall Presser is a Data Engineer in Pivotal's Data Labs where he helps customers solve complex analytic problems with the Greenplum Database.
Prior to coming to Pivotal (formerly Greenplum), he spent 12 years at Oracle, specializing in High Availability, Business Continuity, Clustering, Parallel Database Technology, Disaster Recovery and Large Scale Database Systems. Marshall has also worked for a number of hardware vendors implementing clusters and other parallel architectures. His background includes parallel computation, operating system and compiler development as well as private consulting for organizations in health care, financial services, and federal and state governments.
Marshall holds a B.A. in Mathematics and an M.A. in Economics and Statistics from the University of Pennsylvania and an M.Sc. in Computing from Imperial College, London.
Andreas Scherbaum
Andreas Scherbaum has been working with PostgreSQL since 1997. He is involved in several PostgreSQL-related community projects, is a member of the Board of Directors of the European PostgreSQL User Group, and wrote a PostgreSQL book (in German). Since 2011 he has been working for EMC/Greenplum/Pivotal, where he tackles very large databases.
Data warehousing is more than just storing and retrieving data. Equally important are loading high-volume data in parallel and running analytics in the database. This hands-on session will lead you through the entire process of creating, loading, and analyzing data in the Greenplum MPP database. It's PostgreSQL, but bigger and DWH-focused.
By the end of this workshop, attendees will have learned modern DWH techniques on a PostgreSQL-based Massively Parallel Processing platform. This includes the basic architecture of the Greenplum Database, the parallel techniques for loading, querying, and analyzing structured and semi-structured data, as well as the tools Greenplum provides for doing analytics in the database.
Workshop Agenda:
Introduction to MPP and Greenplum
Distribution -- a key to good performance in Greenplum
Parallel loading -- loading multiple terabytes per hour
Loading from S3 and external connectivity
Polymorphic storage and external partitions
Comparing external tables to Foreign Data Wrappers
Partitioning vs. Distribution -- how they interact
Differences between PostgreSQL (PG) and Greenplum (GP) partitions
Query response time exercises
Running Analytics in Greenplum: MADlib exercise
Analyzing Free-Form Text with Solr and GPText
Monitoring and Managing Greenplum with Command Center
Managing Concurrency with Resource Groups and Workload Manager
Running PL/Python and PL/R as Trusted Languages with PL/Container
Prerequisites: Laptop with a modern browser and SSH client; instruction on using SSH on Windows; basic knowledge of SQL
Users will connect to a cloud-based Greenplum cluster.
There will be a maximum of 25 attendees.
Suggested pre-work:
Videos on the YouTube channel
GP Database basics: https://www.youtube.com/watch?v=cCuGX_fLNl8&list=PL4duir3J-8GUodk1uS9ONPU_eWvfCeVjT
GP & analytics: https://www.youtube.com/watch?v=3K1PRZNYHZE&list=PL4duir3J-8GXgVNvHVE8Y86W79Gzu5oEk
GP & MADlib: https://www.youtube.com/watch?v=Nza2F2dU-Q0&list=PL4duir3J-8GUcubGGpudx6KCCxp8onTI8
- Date:
- Duration: 7 h
- Room:
- Conference: PostgresConf US 2018
- Language: English
- Track: Greenplum Summit
- Difficulty: Medium
- Requires Registration: Yes (Registered: 11)