hydra/src
Graham Christensen bc4b96d053
BuildOutputs: index path with HASH
Looking at AWS' Performance Insights for a Hydra instance, I found
the hydra-queue-runner's query:

    select id, buildStatus, releaseName, closureSize, size
    from Builds b
    join BuildOutputs o on b.id = o.build
    where
      finished = ?
      and (buildStatus = ? or buildStatus = ?)
      and path = $1

was the slowest query by at least 10x. Running an explain on this
showed why:

hydra=> explain select id, buildStatus, releaseName, closureSize, size
    from Builds b join BuildOutputs o on b.id = o.build where
    finished = 1 and (buildStatus = 0 or buildStatus = 6) and
    path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1';

                                                     QUERY PLAN
    ------------------------------------------------------------------------
     Gather  (cost=1000.43..33718.98 rows=2 width=56)
       Workers Planned: 2
       ->  Nested Loop  (cost=0.43..32718.78 rows=1 width=56)
             ->  Parallel Seq Scan on buildoutputs o  (cost=0.00..32710.32
                                                       rows=1
                                                        width=4)
                   Filter: (path = '/nix/store/s93kh...snip...'::text)
             ->  Index Scan using indexbuildsonjobsetidfinishedid on builds b
                                            (cost=0.43..8.45 rows=1 width=56)
                   Index Cond: ((id = o.build) AND (finished = 1))
                   Filter: ((buildstatus = 0) OR (buildstatus = 6))
    (8 rows)

A paralell sequential scan is definitely better than a sequential scan, but the
cost ranging from 0 to 32710 is not great. Looking at the table, I saw the `path`
column is completely unindex:

    hydra=> \d buildoutputs
                Table "public.buildoutputs"
    Column |  Type   | Collation | Nullable | Default
    --------+---------+-----------+----------+---------
    build  | integer |           | not null |
    name   | text    |           | not null |
    path   | text    |           | not null |
    Indexes:
        "buildoutputs_pkey" PRIMARY KEY, btree (build, name)
    Foreign-key constraints:
        "buildoutputs_build_fkey" FOREIGN KEY (build) REFERENCES builds(id)
            ON DELETE CASCADE

Since we always do exact matches on the path and don't care about ordering,
and since the path column is very high cardinality a `hash` index is a
good candidate. Note that I did test a btree index and it performed
similarly well, but slightly worse.

After creating the index (this took about 10 seconds) on a test database:

    create index IndexBuildOutputsPath on BuildOutputs using hash(path);

We get a *significantly* reduced cost:

    hydra=> explain select id, buildStatus, releaseName, closureSize, size
    hydra->     from Builds b join BuildOutputs o on b.id = o.build where
    hydra->     finished = 1 and (buildStatus = 0 or buildStatus = 6) and
    hydra->     path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1';
                                                QUERY PLAN
    -------------------------------------------------------------------------------------------------------
    Nested Loop  (cost=0.43..41.41 rows=2 width=56)
    ->  Index Scan using buildoutputs_path_hash on buildoutputs o  (cost=0.00..16.05 rows=3 width=4)
            Index Cond: (path = '/nix/store/s93khs2dncf2cy273mbyr4fb4ns3db20-MIDIVisualizer-5.1'::text)
    ->  Index Scan using indexbuildsonjobsetidfinishedid on builds b  (cost=0.43..8.45 rows=1 width=56)
            Index Cond: ((id = o.build) AND (finished = 1))
            Filter: ((buildstatus = 0) OR (buildstatus = 6))
    (6 rows)

For direct comparison, the overall query plan was changed:

    From: Gather      (cost=1000.43..33718.98 rows=2 width=56)
    To:   Nested Loop (cost=   0.43.....41.41 rows=2 width=56)

and the query plan for buildoutputs changed from a maximum cost of
32,710 down to 16.

In practical terms, the query's planning and execution time was reduced:

Before (ms) | Try 1   | Try 2   | Try 3
------------+---------+---------+--------
Planning    |   0.898 |   0.416 |   0.383
Execution   | 138.644 | 172.331 | 375.585

After (ms)  | Try 1   | Try 2   | Try 3
------------+---------+---------+--------
Planning    |   0.298 |   0.290 |   0.296
Execution   | 219.625 |   0.035 |   0.034
2021-01-18 11:28:05 -05:00
..
hydra-eval-jobs Update for Nix Value::type API changes 2020-12-29 17:25:51 -06:00
hydra-evaluator Fix build 2020-07-08 12:50:02 +02:00
hydra-queue-runner Update for receiveContents taking string_view 2020-12-26 11:23:26 +00:00
lib Merge pull request #841 from pingiun/github-login 2021-01-05 14:51:51 +01:00
libhydra Fix build 2020-07-08 12:50:02 +02:00
root Don't use enable_github_login option after all 2021-01-04 18:09:49 +01:00
script Implement GitHub logins 2020-12-28 14:37:03 +01:00
sql BuildOutputs: index path with HASH 2021-01-18 11:28:05 -05:00
ttf Add font for the captcha 2013-03-04 12:16:13 +01:00
Makefile.am Revert "hydra-eval-jobs -> nix eval-hydra-jobs" 2020-02-19 20:36:52 +01:00
Makefile.PL * Move everything up one directory. 2009-03-05 13:41:57 +00:00