
Quick Postgres performance wins for beginners: EXPLAIN

Category: tech

Updated: Sat at 07:54 on 05/10/2024 - published

Description: Some quick wins that should improve performance with Postgres... at least a little


Improve your Postgres skills:

Are you familiar with some SQL and Postgres but not sure how to improve your queries?

We're going to break down the EXPLAIN query plan, the first step to understanding how your queries actually run.

Here is the query I'm currently working on:
SELECT "milestones".* FROM "milestones" WHERE (status = 'finished') ORDER BY "milestones"."created_at" DESC LIMIT 25 OFFSET 1175;

Here's what I did:
1) Insert enough data into the milestones table. As you probably guessed, there's no point optimising a query if you have no data; if you have no data, your queries are already optimised.
I've inserted about 100 000 records into the milestones table (you could go for much more data if you are expecting more). 100 000 is not 'a lot', but Postgres will deal with this amount of records differently than it would with, say, 100 records, and that's fine for this example. If you need a quick way to generate the data, see the sketch below.
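
For reference, here's a minimal sketch of how you could seed that kind of data with generate_series. It assumes the table has at least a status and a created_at column; adjust the column names and values to your own schema:

-- hypothetical seed data: mostly 'finished', a few other statuses, spread-out timestamps
INSERT INTO milestones (status, created_at)
SELECT CASE WHEN i % 100 = 0 THEN 'pending' ELSE 'finished' END,
       now() - (i || ' minutes')::interval
FROM generate_series(1, 100000) AS i;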

2) Run EXPLAIN on your query, or EXPLAIN (ANALYZE), like:
 EXPLAIN (ANALYZE) SELECT "milestones".* FROM "milestones" WHERE (status = 'finished') ORDER BY "milestones"."created_at" DESC LIMIT 25 OFFSET 1175; 
All you need to do is prefix your SQL statement with EXPLAIN or EXPLAIN (ANALYZE). Note that EXPLAIN on its own only shows the planner's estimates, whereas EXPLAIN (ANALYZE) actually executes the query and reports real timings.

3) Read the output... pretty confusing, right?

 Limit  (cost=6539.44..6542.31 rows=25 width=72) (actual time=24.233..27.975 rows=25 loops=1)
   ->  Gather Merge  (cost=6404.31..13164.36 rows=58783 width=72) (actual time=23.971..27.933 rows=1200 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Sort  (cost=5404.30..5551.26 rows=58783 width=72) (actual time=21.700..21.740 rows=795 loops=2)
               Sort Key: created_at DESC
               Sort Method: top-N heapsort  Memory: 434kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 434kB
               ->  Parallel Seq Scan on milestones  (cost=0.00..2103.99 rows=58783 width=72) (actual time=0.013..8.862 rows=50000 loops=2)
                     Filter: ((status)::text = 'finished'::text)
                     Rows Removed by Filter: 808
 Planning Time: 0.091 ms
 Execution Time: 28.004 ms
(13 rows)

So here are the basics:
 
1) This is a step-by-step description of how Postgres plans to execute the query, given how much data is in the milestones table. In other words, Postgres changes the way it does the work depending on how much work there is to do. What you see here is the plan for roughly 100 000 records; with 100 or 10 000 000 records, there would be a different plan.
2) Read the query plan from the bottom up: the bottom arrow is the first step, so Parallel Seq Scan on milestones is where execution starts.

Each step runs a process, uses memory, and takes a certain amount of time.
Parallel Seq Scan is the process; cost is an estimated numerical value of how expensive that process is - bear in mind, it's an estimation. rows is the estimated number of rows the step returns, and width is the estimated average size of those rows in bytes. The other fields mostly speak for themselves, like actual time and loops. You can see there are 2 loops during the first and second steps, because two processes (the leader and one worker) each ran those steps once.
You can probably guess what parallel means: the table's rows are divided into segments that are scanned at the same time by separate processes, and a sequential scan reads those rows one by one to find the ones requested (here, rows where status = 'finished'). Reading the whole table like that sounds a little inefficient, although with 100 000 records it's still quick, so it might be fine not to do any optimisation. But I'm starting to wonder if there could be an improvement, especially if the data were larger than 100 000 records. If I wanted to investigate more, I would try the same query with a million or more records and see how that affects things.
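
If you do want to push it further, the same seeding approach from step 1 scales up; something like this (same assumptions about the columns as before), then re-run the EXPLAIN (ANALYZE) from step 2 and compare the plans:

-- hypothetical: bulk the table up to a million rows and see how the plan changes
INSERT INTO milestones (status, created_at)
SELECT 'finished', now() - (i || ' seconds')::interval
FROM generate_series(1, 1000000) AS i;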

The other steps (called nodes), Sort and Gather Merge, also provide information that can help us. Sort is self-explanatory and makes sense, seeing as we asked for an ORDER BY.
Gather Merge is literally gluing back together the rows that the parallel seq scan divided up. This step also says a worker was planned. Great! And if the word worker reminds you of background jobs, it's quite similar, and it works well to spread the workload. But there's a catch... workers only help if they are available. A query that plans for a worker might not actually get one, and in that case the response can end up slower than the plan suggests.
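
If you're curious how many workers Postgres is actually allowed to hand out on your setup, these standard settings control it (shown here just to inspect them; the values depend on your configuration):

SHOW max_parallel_workers_per_gather;
SHOW max_parallel_workers;
SHOW max_worker_processes;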

So how could we improve this?
You've decided that this is a really important query and it needs to be optimised. There are a few options available, but the first thing to consider is: is it worth adding an index? If the answer is yes, then on which columns? I would say a quick win is an index on the status and created_at columns, in that order, because step one is to get all the rows with status equal to 'finished' and step two is to sort them by created_at.

CREATE INDEX index_milestones_on_status_and_created_at ON milestones (status, created_at);
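
A small aside: if the table is already taking writes in production, a plain CREATE INDEX blocks writes while it builds. CREATE INDEX CONCURRENTLY avoids that lock at the cost of a slower build (and it can't run inside a transaction block), so on a busy table you'd more likely run something like:

CREATE INDEX CONCURRENTLY index_milestones_on_status_and_created_at ON milestones (status, created_at);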

Here is the query plan after adding an index on the status and created_at columns:

 Limit  (cost=86.07..87.90 rows=25 width=72) (actual time=0.401..0.409 rows=25 loops=1)
   ->  Index Scan Backward using index_milestones_on_status_and_created_at on milestones  (cost=0.42..7287.19 rows=99957 width=72) (actual time=0.027..0.350 rows=1200 loops=1)
         Index Cond: ((status)::text = 'finished'::text)
 Planning Time: 0.099 ms
 Execution Time: 0.432 ms
(5 rows)
One of the key things I look out for is: did Postgres even use the index? If it shows up in the query plan, that's good news, as your index is actually helping Postgres. Indexes that Postgres doesn't use won't speed anything up; they just add write overhead and take up space. Furthermore, we can see that the number of nodes under the Limit has gone down from three to one, and all the other stats show much smaller numbers for cost, actual time, loops and so on. So I'm probably going to keep the index, especially if I know there's going to be a lot of data queried on the status and created_at columns.
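
Beyond reading individual query plans, you can also ask Postgres how often each index on a table has been scanned since statistics were last reset, using the built-in pg_stat_user_indexes view; an idx_scan that stays at zero is a hint the index isn't earning its keep:

SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'milestones';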

Thanks for reading.