TimescaleDB: Re-engineering PostgreSQL as a time-series database
Michael J. Freedman is the co-founder and CTO of TimescaleDB, an open-source database that scales SQL for time-series data, and a Professor of Computer Science at Princeton University. His research focuses on distributed systems, networking, and security.
Previously, Freedman developed CoralCDN (a decentralized CDN serving millions of daily users) and Ethane (the basis for OpenFlow / software-defined networking). He co-founded Illuminics Systems (acquired by Quova, now part of Neustar) and is a technical advisor to Blockstack.
Honors include: the Presidential Early Career Award for Scientists and Engineers (PECASE, presented by President Obama), the SIGCOMM Test of Time Award, the Caspar Bowden Award for Privacy Enhancing Technologies, a Sloan Fellowship, an NSF CAREER Award, an Office of Naval Research Young Investigator Award, DARPA Computer Science Study Group membership, and multiple award-winning publications. Prior to joining Princeton in 2007, he received his Ph.D. in computer science from NYU's Courant Institute, and his bachelor's and master's degrees from MIT.
No video of the event yet, sorry!
Time-series data is now everywhere -- IoT, user event streams, system monitoring, finance, adtech, industrial control, transportation, and logistics -- and increasingly used to power core applications. It also creates a number of technical challenges: ingesting high volumes of structured data; answering complex queries efficiently over both recent and historical time intervals; and performing specialized time-centric analysis and data management. And this data doesn't exist in isolation: entries must often be joined against other relational data to answer key business questions (e.g., tracking a shipping container is much more powerful once combined with information about its goods).
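For instance, the shipping-container scenario above could be expressed as an ordinary SQL join between a time-series table and a relational metadata table (all table and column names here are hypothetical, for illustration only):

```sql
-- Hypothetical schema: time-series sensor readings per container,
-- plus a relational table describing the goods each container holds.
-- Average temperature per container over the last day, joined with
-- the description of its goods:
SELECT r.container_id,
       g.description,
       avg(r.temperature) AS avg_temp
FROM   container_readings r
JOIN   container_goods    g ON g.container_id = r.container_id
WHERE  r.time > NOW() - INTERVAL '1 day'
GROUP BY r.container_id, g.description;
```

In a polyglot NoSQL-plus-relational setup, the same question requires application-side joins across two systems; with a single SQL database it is one query.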
Many developers working with time-series data today turn to polyglot solutions: a NoSQL database to store their time-series data (for scale), and a relational database for associated metadata and key business data. This leads to engineering complexity, operational challenges, and even referential integrity concerns. As this type of data proliferates, many have concluded that they need a purpose-built time-series database. Yet the current state of time-series databases is lacking, still forcing users to choose between complex polyglot stacks and immature solutions.
In this talk, I describe why these operational headaches are unnecessary and how we re-engineered PostgreSQL as a time-series database in order to simplify time-series application development. In particular, the nature of time-series workloads—appending data about recent events—presents different demands from transactional (OLTP) workloads. By taking advantage of these differences, we can improve insert rates by 20x over vanilla Postgres and achieve much faster queries, even while offering full SQL (including JOINs). This simplifies one's product and stack down to a single database, while enabling users to ask much more complex and ad-hoc questions about their data.
TimescaleDB achieves this by storing data on an individual server in a manner more common to distributed systems: heavily partitioning (sharding) data into chunks to ensure that hot chunks corresponding to recent time records are maintained in memory. This right-sized chunking is performed automatically, and the database can even adapt its chunk sizes based on observed resource demands. Yet it hides this behind a "hypertable" that can be inserted into or queried like a single table: even at 100B+ rows over 10K+ chunks. While this adds a few additional milliseconds for query planning, it enables TimescaleDB to avoid the performance cliff that Postgres experiences at larger table sizes (10s of millions of rows).
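The hypertable abstraction is exposed through standard SQL. A minimal sketch, using TimescaleDB's `create_hypertable` entry point (the table name, columns, and chunk interval below are illustrative, not from the talk):

```sql
-- TimescaleDB is loaded as a standard Postgres extension.
CREATE EXTENSION IF NOT EXISTS timescaledb;

-- An ordinary Postgres table definition...
CREATE TABLE conditions (
  time        TIMESTAMPTZ NOT NULL,
  device_id   TEXT,
  temperature DOUBLE PRECISION
);

-- ...converted into a hypertable that is automatically partitioned
-- into chunks by time. The interval here is illustrative; the
-- database can also adapt chunk sizes automatically.
SELECT create_hypertable('conditions', 'time',
                         chunk_time_interval => INTERVAL '1 day');

-- Inserts and queries then target the hypertable as if it were a
-- single table; chunk routing and pruning happen behind the scenes.
INSERT INTO conditions VALUES (NOW(), 'dev-1', 21.5);

SELECT avg(temperature)
FROM   conditions
WHERE  time > NOW() - INTERVAL '1 hour';
```

The query planner touches only the chunks whose time ranges overlap the `WHERE` clause, which is how recent ("hot") chunks stay in memory while the table as a whole grows without hitting Postgres's usual performance cliff.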
TimescaleDB is packaged as a Postgres extension, released under the Apache 2 license.
- 2018 April 18 11:30 EDT
- 50 min
- PostgresConf US 2018