Using the PostgreSQL recursive CTE to compute Bacon numbers for actors listed in the IMDb
Bryn Llewellyn is a Technical Product Manager at Yugabyte, Inc. YugabyteDB is an open source, cloud native distributed SQL database that looks like PostgreSQL to the developer. Bryn’s speciality is SQL and stored procedures in the context of distributed SQL.
Bryn has worked in the software field for more than forty years. He started working with SQL when he joined Oracle UK in 1990. He relocated to Oracle HQ (Redwood Shores, CA) in 1996 and his last role, before leaving, was as the Product Manager for PL/SQL. He left Oracle in April 2019 to join YugaByte, Inc.
Bryn started off doing image analysis and pattern recognition at Oxford University (programming in FORTRAN) and then worked in Oslo, first at the Norwegian Computing Center and then in a startup. In Norway, Bryn programmed in Simula—recognized as the first object-oriented programming language and as the inspiration for C++.
Students of Computer Science are often given this famous assignment:
- Compute the Bacon numbers for actors listed in the IMDb up to, for example, five hops
The requirement is easy to understand and is readily communicated by the example "What is Al Pacino’s Bacon Number?" The answer is given by this path, where each step goes via a movie or TV show that starred some pair of actors:
- Kevin Bacon > Bill Murray > Billy Crystal > James Earl Jones > Al Pacino
This shows that Al Pacino’s Bacon Number is four. Usually, the solution is required to annotate each hop with the names of the productions in which the pair of actors both co-starred. For example, the first hop, from Kevin Bacon to Bill Murray exists because they both co-starred in "She's Having a Baby" and "Wild Things".
Typically, the assignment specifies the use of a specific "if-then-else" language like Java. But the problem can be neatly solved with SQL by using a recursive common table expression (hereinafter recursive CTE). However, no amount of Internet searching took me to even a single example of a self-contained working SQL solution that you could run on actual IMDb data. But I got lots of irrelevant hits for solutions that use NoSQL or Graph databases.
I'll show you in this session how to compute Bacon numbers using good old PostgreSQL. The statement is so brief that it fits in a Tweet!
I’ll skip explaining who I am and how I come to be working with PostgreSQL so that I can spend all my time on my technical content. Instead, you can read about me in my recent PostgreSQL Person of the Week interview.
While a solution must be correct, this is not enough to make it necessarily viable. I'll show you that the terse SQL gets the result very quickly on small amounts of data. But the real IMDb is large and very highly interconnected—and here the terse and correct SQL never finishes!
The problem is that, while the final result set is manageably small, intermediate results are vast. The solution is well known in general: you need to implement early pruning. However, the native CTE cannot support this. I discovered that if you implement the recursive CTE algorithm that the PostgreSQL documentation describes in words, using PL/pgSQL and SQL together, then this gives you a splendid opening in which you can intervene with the essential early pruning to deliver an acceptably quick solution. I'll explain this and demonstrate working code.
— — — — — — — — — —
If you'd like to read about this, either before of after my talk, then look at my blog post: Computing Bacon Numbers for actors listed in the IMDb.
To see my slides and code used in the presentation, please access this .zip file.
- 2021 September 1 13:00 EDT
- 1 h
- 2021 Postgres Conference Webinars
- Requires Registration:
- Yes (Registered: 69)