A Modern Interface for Data Science on Postgres and Greenplum
Presented by:
Scott Hajek
Scott Hajek is a Senior Data Scientist for Pivotal. He has applied his expertise in machine learning and natural language processing to diverse industries and use cases, including communications surveillance, information extraction from unstructured text, linking information across disparate data sources, network security, price optimization, and logistics optimization.
Scott holds an M.A. in Linguistics from the University of Illinois at Urbana-Champaign and a B.A. in Psychology and Linguistics from UNC Chapel Hill. At those institutions he investigated how humans understand and produce language, both written and spoken, with a focus on understudied languages such as Thai.
Data scientists today expect to work with tools that have good abstractions and interfaces. Pure SQL is not the best interface for data science, but the power and scale of SQL-based systems can be beneficial. This talk introduces a modern interface for Postgres and Greenplum that appeals to data scientists.
The importance of good abstractions and interfaces can be seen in the dominance of R, Python, and PySpark in the data science field and in the similarity of their dataframe abstractions. Data scientists (DS) do not relish writing SQL strings by hand. Neither do application developers, which is why they prefer object-relational mappers such as ActiveRecord and the Django ORM. In addition to the cognitive benefits of abstraction, such frameworks cut out error-prone manual steps, avoid dangerous string formatting, and enable more robust testing.
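To make the "dangerous string formatting" point concrete, here is a minimal sketch that contrasts a hand-formatted SQL string with parameter binding. It assumes an in-memory SQLite database as a stand-in for Postgres or Greenplum so that it runs without a server. The `users` table and the `user_input` value are hypothetical and are not taken from the talk.

```python
import sqlite3

# In-memory SQLite stands in for Postgres/Greenplum so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", 1), ("bob", 0)],
)

# Hypothetical untrusted input that smuggles extra SQL into the query.
user_input = "nobody' OR '1'='1"

# Hand-formatted SQL string: the input rewrites the WHERE clause,
# so the query matches every row instead of none.
unsafe_sql = f"SELECT name FROM users WHERE name = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Parameter binding: the driver treats the input purely as a value,
# so no user is matched.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows))  # 2
print(len(safe_rows))    # 0
```

Frameworks that build queries from expressions rather than strings would typically take the second route on the user's behalf, which is one reason they are safer than ad hoc string templates.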
So why wouldn’t a data scientist just avoid SQL-based platforms altogether? Relational databases such as Postgres offer rich analytical abilities and stability, and their massively parallel processing (MPP) variants, such as Greenplum, offer massive scale in storage and distributed processing. Data scientists would value the ability to harness the scale of such systems while having nice abstractions to work with.
Ibis offers DS Pythonistas the best of both worlds. It is a framework for specifying queries and transformations with deferred execution on big data platforms. It looks and feels similar to DataFrame-based tools like pandas and PySpark. Lazy execution with client-side error checking makes certain mistakes fail fast, and it encourages delegating all the processing to the database. Ibis already supports Postgres and thus already works with much of the functionality of Greenplum. Minor extensions can add the functionality that is specific to Greenplum (GPDB). Ibis supports several other pluggable backends, so code written for Postgres/Greenplum could easily be run against other systems like BigQuery, HDFS, and Impala.
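The idea of deferred execution with client-side checking can be illustrated with a toy sketch. This is not Ibis's API: the `Query` class, its methods, and the `sales` table are hypothetical names invented for illustration, and SQLite again stands in for Postgres or Greenplum. The sketch only shows the general pattern of building an expression first, validating it locally, and sending SQL to the database when results are requested.

```python
import sqlite3


class Query:
    """Toy deferred query: records operations and builds SQL only on demand."""

    def __init__(self, table, columns, filters=(), params=()):
        self.table = table
        self.columns = tuple(columns)  # known schema, used for client-side checks
        self.filters = tuple(filters)
        self.params = tuple(params)

    def filter(self, column, op, value):
        # Client-side validation: fail before anything reaches the database.
        if column not in self.columns:
            raise KeyError(f"unknown column: {column!r}")
        if op not in {"=", "<", ">", "<=", ">="}:
            raise ValueError(f"unsupported operator: {op!r}")
        # Return a new Query so expressions can be chained and reused.
        return Query(
            self.table,
            self.columns,
            self.filters + (f"{column} {op} ?",),
            self.params + (value,),
        )

    def compile(self):
        # Translate the recorded operations into SQL plus bound parameters.
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.filters:
            sql += " WHERE " + " AND ".join(self.filters)
        return sql, self.params

    def execute(self, conn):
        # Only here does any work happen in the database.
        sql, params = self.compile()
        return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("west", 25.0), ("east", 40.0)],
)

sales = Query("sales", ["region", "amount"])
expr = sales.filter("region", "=", "east").filter("amount", ">", 20)  # nothing runs yet
sql, params = expr.compile()
rows = expr.execute(conn)

print(sql)   # SELECT region, amount FROM sales WHERE region = ? AND amount > ?
print(rows)  # [('east', 40.0)]
```

A misspelled column such as `sales.filter("regoin", "=", "east")` raises `KeyError` immediately on the client, without a round trip to the database. That is the fail-fast behaviour the abstract describes, shown here in miniature.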
- Date: 2019 March 19 11:00 EDT
- Duration: 20 min
- Room: Grammercy
- Conference: Postgres Conference
- Language:
- Track: Greenplum Summit
- Difficulty: Medium