Optimising full-text queries in the amaGama translation memory server
I am a Python developer with a keen interest in natural language processing. My PhD related to the automatic cleaning of translation memories to improve their quality.
I have used PostgreSQL through the Django web framework, and also in a translation memory system built specifically for it.
I have contributed to a number of FOSS projects (also as part of my day job). Many FOSS packages have Afrikaans translations thanks to my involvement.
The amaGama project implements a FOSS translation memory web service built with Python on top of PostgreSQL. I recently worked on improving its performance, and would like to report on what I did and how I did it. The presentation will cover how an understanding of the problem domain, usage patterns and algorithms involved allowed for a big performance improvement despite some (arguable) shortcomings of PostgreSQL.
A translation memory contains texts and their translations. The amaGama service hosts such a database, holding translations of many FOSS packages (GNOME, KDE, PostgreSQL, Mozilla, LibreOffice, etc.) in many languages. The web service is typically queried by a computer-assisted translation tool, and responds with similar translations done before in the language pair of interest. The suggestions are meant to help translators work faster and with higher quality.
Response time matters if the service is to help users, yet amaGama would sometimes take several seconds to respond to certain queries, far too long to be helpful. The database schema features two simple tables with a full-text index (GIN) used to perform fuzzy matching against the stored texts. An analysis of query plans indicated bad row estimates and, arguably, bad query plans. Running VACUUM and collecting more statistics did not improve things.
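To illustrate what a bad row estimate looks like in practice, the following minimal sketch scans the text output of EXPLAIN ANALYZE and flags plan nodes whose estimated row count is far from the measured one. The helper name `row_misestimates`, the factor of 10 and the sample plan are all made up for illustration. They are not taken from amaGama, and the regular expression assumes the usual text layout of a plan node line.

```python
import re

# Matches the planner estimate and the measured row count on one plan node
# line of PostgreSQL's text-format EXPLAIN ANALYZE output.
_NODE = re.compile(
    r"\(cost=[\d.]+\.\.[\d.]+ rows=(?P<est>\d+) width=\d+\) "
    r"\(actual time=[\d.]+\.\.[\d.]+ rows=(?P<act>[\d.]+) loops=\d+\)"
)


def row_misestimates(plan_text, factor=10.0):
    """Return (line, estimated, actual) for nodes off by at least `factor`.

    Hypothetical helper; `factor` is an arbitrary illustration threshold.
    """
    flagged = []
    for line in plan_text.splitlines():
        match = _NODE.search(line)
        if match is None:
            continue  # not a plan node line (e.g. "Recheck Cond: ...")
        est = int(match.group("est"))
        act = float(match.group("act"))
        # Clamp to 1 so that a zero row count cannot cause a division by zero.
        ratio = max(est, 1) / max(act, 1)
        if ratio >= factor or ratio <= 1 / factor:
            flagged.append((line.strip(), est, act))
    return flagged


# Invented sample plan, only to exercise the helper; not real amaGama output.
SAMPLE_PLAN = """\
Bitmap Heap Scan on sources  (cost=40.00..44.01 rows=1 width=64) (actual time=310.120..2150.440 rows=18250 loops=1)
  Recheck Cond: (vector @@ '''file'' | ''open'''::tsquery)
  ->  Bitmap Index Scan on sources_text_idx  (cost=0.00..40.00 rows=1 width=0) (actual time=305.900..305.900 rows=18250 loops=1)
"""

for line, est, act in row_misestimates(SAMPLE_PLAN):
    print(f"estimated {est}, actual {act:.0f}: {line[:40]}")
```

A node that expects one row but produces thousands is the kind of mismatch that can lead the planner to an unsuitable plan.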
Part of my solution combined CLUSTER with partially overlapping partial indexes. The partial indexes helped to address some shortcomings of full-text indexing with GIN, and combining them with CLUSTER reduced disk I/O for many queries. The median, average and worst-case query times were reduced to as little as 40% of their previous values.
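As a rough sketch of how such a setup could be expressed, the Python below generates DDL for GIN indexes whose WHERE clauses cover overlapping ranges of text length, followed by a CLUSTER on a B-tree index. Everything specific here is an assumption made for illustration: the table name `sources`, the columns `vector` and `length`, the choice of text length as the partitioning key, and the range sizes. The abstract does not say how the actual indexes were defined.

```python
def overlapping_ranges(max_length, width, overlap):
    """Yield (low, high) length ranges; neighbours share `overlap` values."""
    if not 0 <= overlap < width:
        raise ValueError("overlap must be non-negative and smaller than width")
    low = 1
    while True:
        high = low + width - 1
        yield low, min(high, max_length)
        if high >= max_length:
            break
        # Step forward by less than `width` so consecutive ranges overlap.
        low = high - overlap + 1


def build_ddl(table="sources", max_length=200, width=80, overlap=40):
    """Return DDL strings; all names and sizes are hypothetical."""
    statements = []
    for low, high in overlapping_ranges(max_length, width, overlap):
        statements.append(
            f"CREATE INDEX {table}_fts_{low}_{high}_idx ON {table} "
            f"USING gin (vector) WHERE length BETWEEN {low} AND {high};"
        )
    # CLUSTER cannot use a GIN index, so the physical row order is taken
    # from a separate B-tree index on the (assumed) length column.
    statements.append(f"CREATE INDEX {table}_length_idx ON {table} (length);")
    statements.append(f"CLUSTER {table} USING {table}_length_idx;")
    return statements


for statement in build_ddl():
    print(statement)
```

Under these assumptions, the overlap means that a query restricted to a narrow window of lengths can fall entirely inside one index's range, while clustering on length keeps rows of similar length close together on disk. Texts longer than `max_length` are not covered by any of the generated partial indexes, so a real setup would need a further index for them.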
- 2019 October 8 13:00 SAST
- 40 min
- South Africa 2019
- Case Studies