I recently added support for batch inserts to koyo and thought I'd make a quick post about it. Here's what it looks like to use this feature:
(define ib
  (make-insert-batcher
   #:on-conflict '(do-nothing (ticker))
   'tickers
   '([isin "TEXT"]
     [ticker "TEXT"]
     [added_at "TIMESTAMPTZ"])))
(with-database-connection [conn db]
  ;; datasource: any sequence that yields three values per element
  (for ([(isin ticker added-at) datasource])
    (ib-push! ib conn isin ticker added-at))
  (ib-flush! ib conn))
You create a batcher, tell it what table to insert the data into and
which columns to insert into, then push data into it and flush it at
the end. A push may itself trigger a flush once too many rows have
accumulated, as controlled by an optional #:batch-size argument.
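For example, to make the batcher flush itself every 10,000 pushed rows
(the threshold here is arbitrary; #:batch-size is the only assumption,
per the paragraph above):

(define ib
  (make-insert-batcher
   #:batch-size 10000
   #:on-conflict '(do-nothing (ticker))
   'tickers
   '([isin "TEXT"]
     [ticker "TEXT"]
     [added_at "TIMESTAMPTZ"])))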
In the past, when I built a batcher like this, I did it by accumulating
the row data into an array and, on flush, generating an INSERT
statement with a row-wise set of placeholder parameters. Like this:
INSERT INTO tickers(
  isin, ticker, added_at
) VALUES
  ($1, $2, $3),
  ($4, $5, $6),
  ...
  ($(n*3+1), $(n*3+2), $(n*3+3))
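In Racket, generating such a statement might look something like the
following sketch. This is an illustration, not koyo's internals; the
helper name and signature are made up:

#lang racket/base
(require racket/string)

;; Hypothetical helper: builds a row-wise INSERT for n buffered rows.
;; Placeholders are numbered row by row, so the statement text grows
;; (and changes) with n.
(define (make-rowwise-insert table columns n)
  (define ncols (length columns))
  (define rows
    (for/list ([row (in-range n)])
      (define placeholders
        (for/list ([col (in-range ncols)])
          (format "$~a" (+ (* row ncols) col 1))))
      (string-append "(" (string-join placeholders ", ") ")")))
  (format "INSERT INTO ~a(~a) VALUES ~a"
          table
          (string-join columns ", ")
          (string-join rows ", ")))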
This works fine for the most part, but it has a couple of problems.
First, a Postgres statement can take at most 65535 parameters, so at
most 65535/n rows (for n columns) may be batched in memory before a
flush is required; for the three-column tickers table, that's 21845
rows per flush. Second, every flush requires sending a new, long query
to the database, so this approach can't easily leverage prepared
statements. The latter doesn't seem to have a huge impact, but it's
still some unnecessary inefficiency.
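To make the prepared-statement point concrete, two flushes of
different sizes yield two different statement texts under the sketch
above, so neither can reuse the other's prepared plan:

(make-rowwise-insert "tickers" '("isin" "ticker" "added_at") 2)
;; => "INSERT INTO tickers(isin, ticker, added_at) VALUES ($1, $2, $3), ($4, $5, $6)"
(make-rowwise-insert "tickers" '("isin" "ticker" "added_at") 3)
;; => "INSERT INTO tickers(isin, ticker, added_at) VALUES ($1, $2, $3), ($4, $5, $6), ($7, $8, $9)"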
This time around, I decided to buffer the values in column-wise arrays.
On flush, those arrays are passed to the insert statement directly and
I use UNNEST
to turn them into a virtual table to insert from. That
looks like this:
INSERT INTO tickers(
  isin, ticker, added_at
) SELECT * FROM UNNEST(
  $1::TEXT[],
  $2::TEXT[],
  $3::TIMESTAMPTZ[]
) AS t(isin, ticker, added_at)
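On the Racket side, a flush then amounts to a single query-exec call
with the buffered columns passed as array parameters. Here's a minimal
sketch assuming Racket's db library, which accepts lists for
PostgreSQL array-typed parameters; flush-tickers! and its arguments
are made up for illustration:

(require db)

;; The statement text is fixed, so it can be prepared once per
;; connection and reused across flushes.
(define insert-tickers-stmt
  (virtual-statement
   (string-append
    "INSERT INTO tickers(isin, ticker, added_at) "
    "SELECT * FROM UNNEST($1::TEXT[], $2::TEXT[], $3::TIMESTAMPTZ[]) "
    "AS t(isin, ticker, added_at)")))

;; Hypothetical flush: isins, tickers and added-ats are the buffered
;; column lists, sent as three array parameters.
(define (flush-tickers! conn isins tickers added-ats)
  (query-exec conn insert-tickers-stmt isins tickers added-ats))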
So, the end result is a much shorter query that can be prepared ahead
of time and reused between flushes. It also means we can buffer more
rows in memory before flushing. The only disadvantage is that the
batcher needs to know the individual column types ahead of time, hence
the last positional argument to make-insert-batcher.
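As an illustration of why the types are needed, here's one way the
UNNEST statement could be derived from that column spec. Again, this
is a sketch, not koyo's actual implementation:

;; Hypothetical: build the UNNEST-based INSERT from the column spec.
;; The declared types become the ::TYPE[] casts on the parameters.
(define (make-unnest-insert table columns)
  (define names
    (for/list ([col (in-list columns)])
      (symbol->string (car col))))
  (define params
    (for/list ([col (in-list columns)]
               [idx (in-naturals 1)])
      (format "$~a::~a[]" idx (cadr col))))
  (format "INSERT INTO ~a(~a) SELECT * FROM UNNEST(~a) AS t(~a)"
          table
          (string-join names ", ")
          (string-join params ", ")
          (string-join names ", ")))

;; (make-unnest-insert 'tickers '([isin "TEXT"]
;;                                [ticker "TEXT"]
;;                                [added_at "TIMESTAMPTZ"]))
;; => "INSERT INTO tickers(isin, ticker, added_at) SELECT * FROM UNNEST($1::TEXT[], $2::TEXT[], $3::TIMESTAMPTZ[]) AS t(isin, ticker, added_at)"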