Optimizing and shrinking the reported table for CCX Notification Service
Reasons
- Access to the reported table is very slow on the production system
  - even a simple select count(*) from reported can take minutes!
- Table grows over time (new features), with ~150 GB as the high limit today
- Will be a limiting factor for the integration with ServiceLog
Table structure
- Seems simple enough
  - and it is
  - but the devil is in the details, as usual
postgres=# \d reported ;
Table "public.reported"
Column | Type | Collation | Nullable | Default
-------------------+-----------------------------+-----------+----------+--------
org_id | integer | | not null |
account_number | integer | | not null |
cluster | character(36) | | not null |
notification_type | integer | | not null |
state | integer | | not null |
report | character varying | | not null |
updated_at | timestamp without time zone | | not null |
notified_at | timestamp without time zone | | not null |
error_log | character varying | | |
event_type_id | integer | | not null |
Indexes:
"reported_pkey" PRIMARY KEY, btree (org_id, cluster, notified_at)
"notified_at_desc_idx" btree (notified_at DESC)
"updated_at_desc_idx" btree (updated_at)
Foreign-key constraints:
"fk_notification_type" FOREIGN KEY (notification_type) REFERENCES notification_types(id)
"fk_state" FOREIGN KEY (state) REFERENCES states(id)
"reported_event_type_id_fkey" FOREIGN KEY (event_type_id) REFERENCES event_targets(id)
Table size
- CCX Notification Service is started with a 15-minute frequency
  - i.e. 4 times per hour
- On each run, approximately 65000 reports are written into the table
- Reports are stored for 8 days
  - this limit is required for the ‘cooldown’ feature
- Simple math
  - number of records = 8 days × 24 hours × 4 (runs per hour) × 65000
  - 8 × 24 × 4 × 65000 = 49920000 ≈ 50 million records!
- ⇒ “each byte in reports counts!”
reported table details
- Let’s look at the reported table used in production for several months
- We’ll measure the overall table size
  - real size, not just #records
- We’ll measure how fast select count(*) from reported is
  - it needs to go through one selected index to compute that value
  - a good measure of overall DB shape
Overall table size
postgres=# SELECT pg_size_pretty(pg_total_relation_size('reported'));
pg_size_pretty
----------------
28 GB
(1 row)
Accessing table indexes
postgres=# explain(analyze, buffers, format text) select count(*) from reported;
Finalize Aggregate (cost=640314.62..640314.63 rows=1 width=8) (actual time=1750.633..1756.451 rows=1 loops=1)
Buffers: shared hit=192 read=39748
-> Gather (cost=640314.41..640314.62 rows=2 width=8) (actual time=1750.483..1756.441 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=192 read=39748
-> Partial Aggregate (cost=639314.41..639314.42 rows=1 width=8) (actual time=1738.180..1738.180 rows=1 loops=3)
Buffers: shared hit=192 read=39748
-> Parallel Index Only Scan using updated_at_desc_idx on reported (cost=0.44..590342.78 rows=19588652 width=0) (actual time=1.125..1015.623 rows=15677300 loops=3)
Heap Fetches: 0
Buffers: shared hit=192 read=39748
Planning Time: 0.135 ms
Execution Time: 1756.511 ms
^^^^^^^^^^^
Conclusion
- Not super fast
- OTOH for approximately 50M records it is still OK
- Can be made faster by introducing a simpler index over an ID column (see the sketch below)
  - that will increase the table size by 50 000 000 × sizeof(ID) :)
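A hedged sketch of what such an index could look like (the column and index names are illustrative, not part of the actual schema):

-- Narrow surrogate key; count(*) can then scan a small index,
-- at the cost of roughly 50 000 000 × 8 bytes of extra storage.
ALTER TABLE reported ADD COLUMN id bigint GENERATED ALWAYS AS IDENTITY;
CREATE INDEX reported_id_idx ON reported (id);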
Table as live entity in the system
- Content of the reported table is changing quite rapidly
  - well, it depends how we look at it
- Each hour, approximately 65000 × 4 = 260000 reports are created/deleted
  - a lot in terms of changed MB/GB
  - OTOH it is just 0.52% of the whole table
After one batch run of CCX Notification Service
- Table size grows a lot
postgres=# SELECT pg_size_pretty(pg_total_relation_size('reported'));
pg_size_pretty
----------------
56 GB
(1 row)
- Auto vacuum is started in the background
- It affects all access to the reported table significantly
  - this is the little devil we talked about!
  - (Aristotle was right: “horror vacui”)
postgres=# select * from pg_stat_progress_vacuum;
  pid  | datid | datname  | relid |       phase       | heap_blks_total | heap_blks_scanned | heap_blks_vacuumed | index_vacuum_count | max_dead_tuples | num_dead_tuples
-------+-------+----------+-------+-------------------+-----------------+-------------------+--------------------+--------------------+-----------------+-----------------
 14583 | 14450 | postgres | 49851 | vacuuming indexes |         6267728 |           1491630 |             745994 |                  1 |        11184809 |        11184525
- Access to the reported table during vacuuming
postgres=# explain(analyze, buffers, format text) select count(*) from reported;
Finalize Aggregate (cost=4402283.71..4402283.72 rows=1 width=8) (actual time=62663.754..62679.376 rows=1 loops=1)
Buffers: shared hit=60496 read=3202143 written=3
-> Gather (cost=4402283.50..4402283.71 rows=2 width=8) (actual time=62663.735..62679.361 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=60496 read=3202143 written=3
-> Partial Aggregate (cost=4401283.50..4401283.51 rows=1 width=8) (actual time=62605.754..62605.756 rows=1 loops=3)
Buffers: shared hit=60496 read=3202143 written=3
-> Parallel Index Only Scan using updated_at_desc_idx on reported (cost=0.57..4303340.24 rows=39177303 width=0) (actual time=4.406..59423.974 rows=15677300 loops=3)
Heap Fetches: 47032112
Buffers: shared hit=60496 read=3202143 written=3
Planning Time: 0.176 ms
Execution Time: 62679.443 ms
^^^^^^^^^^^^
- one minute just to retrieve the number of records!?
After vacuuming
- Let’s look at # records again
postgres=# explain(analyze, buffers, format text) select count(*) from reported;
Finalize Aggregate (cost=793066.38..793066.39 rows=1 width=8) (actual time=1719.988..1728.055 rows=1 loops=1)
Buffers: shared hit=315 read=38090
-> Gather (cost=793066.17..793066.38 rows=2 width=8) (actual time=1719.940..1728.046 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=315 read=38090
-> Partial Aggregate (cost=792066.17..792066.18 rows=1 width=8) (actual time=1712.611..1712.612 rows=1 loops=3)
Buffers: shared hit=315 read=38090
-> Parallel Index Only Scan using updated_at_desc_idx on reported (cost=0.56..742827.94 rows=19695290 width=0) (actual time=2.758..1002.403 rows=15677300 loops=3)
Heap Fetches: 0
Buffers: shared hit=315 read=38090
Planning:
Buffers: shared hit=36
Planning Time: 1.282 ms
Execution Time: 1728.156 ms
(15 rows)
- we are back to good numbers
After some CCX Notification Service runs
- Table size is ~the same in terms of #records
  - but many pages are in the ‘deleted’ state
  - needs defragmentation
- But vacuuming is usually not finished in time
  - this can be proved easily:
postgres=# SELECT pg_size_pretty(pg_total_relation_size('reported'));
pg_size_pretty
----------------
141 GB
Possible solutions for this problem
- Database partitioning
  - vertical
  - horizontal
- Reduce the number of records
  - change in process logic
- Reduce the size of records
  - fewer IOPS overall
  - fewer network transfers
  - faster vacuuming
  - more ops done in “burst mode”
Vertical partitioning
- Done by column
  - in our case by the report column
- No specific syntax in PostgreSQL
  - a query then usually consists of several JOINs (see the sketch below)
- Won’t be super useful in our case
  - INSERT needs to change the report column + other columns as well
  - SELECT needs to query the report column + other columns as well
- Conclusion
  - not planned to be added in the near future
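For illustration only, a minimal sketch of what vertical partitioning would mean here (the report_payload table is hypothetical):

-- Move the wide report column into its own table, linked 1:1
-- by the reported table's primary key:
CREATE TABLE report_payload (
    org_id      integer                     NOT NULL,
    cluster     character(36)               NOT NULL,
    notified_at timestamp without time zone NOT NULL,
    report      character varying           NOT NULL,
    PRIMARY KEY (org_id, cluster, notified_at)
);

-- Every read of the payload now pays for a JOIN:
SELECT r.org_id, r.cluster, p.report
  FROM reported r
  JOIN report_payload p USING (org_id, cluster, notified_at);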
Horizontal partitioning
- Creating tables with fewer rows
  - additional tables store the remaining rows
  - specific syntax in PostgreSQL since 10.x (see the sketch below)
- When
  - old data (not used much) vs new data
  - partitioned “naturally” by country, date range etc.
- And this is our case!
  - head: new data added recently
  - body: data with age > 25 minutes and < 8 days
  - tail: data with age >= 8 days
- Conclusion
  - possible future spike
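A hedged sketch of PostgreSQL declarative range partitioning (partition names and bounds are illustrative only):

-- Range-partition the table by timestamp:
CREATE TABLE reported_part (
    org_id     integer                     NOT NULL,
    cluster    character(36)               NOT NULL,
    updated_at timestamp without time zone NOT NULL
    -- ... remaining columns as in reported
) PARTITION BY RANGE (updated_at);

CREATE TABLE reported_head PARTITION OF reported_part
    FOR VALUES FROM ('2022-08-18') TO ('2022-08-19');

-- Expiring the tail then becomes a cheap metadata operation:
-- DROP TABLE reported_tail;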
Number of records reduction
- We should do an INSERT + ON CONFLICT DO UPDATE (see the sketch below)
  - do not add a new row to the DB for each report that is not re-notified on each run
- Calculation
  - we are processing roughly 65k reports per run, and notifying something in the range of hundreds
  - that’s an estimated 64k “not sent, same state” rows added per run (every 15 minutes)
- Conclusion
  - possible future spike
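A minimal sketch of the upsert, assuming the conflict target is the current primary key (the exact target would need to match the process logic):

INSERT INTO reported (org_id, account_number, cluster, notification_type,
                      state, report, updated_at, notified_at, event_type_id)
VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
ON CONFLICT (org_id, cluster, notified_at)
DO UPDATE SET state      = EXCLUDED.state,
              updated_at = EXCLUDED.updated_at;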
Size of records reduction
- What’s stored in the reported.report column after all?
  - well, some JSON of “any” size
- Some stats
postgres=# select min(length(report)), max(length(report)), avg(length(report)) from reported;
 min | max  |         avg
-----+------+----------------------
 421 | 8341 | 426.3675676374453857
(1 row)
- Seems like most reports are of a similar size: ~421 chars
- Empty report:
{
  "analysis_metadata": {
    "start": "2022-08-18T14:14:02.947224+00:00",
    "finish": "2022-08-18T14:14:03.826569+00:00",
    "execution_context": "ccx_ocp_core.context.InsightsOperatorContext",
    "plugin_sets": {
      "insights-core": {
        "version": "insights-core-3.0.290-1",
        "commit": "placeholder"
      },
      "ccx_rules_ocp": {
        "version": "ccx_rules_ocp-2022.8.17-1",
        "commit": null
      },
      "ccx_ocp_core": {
        "version": "ccx_ocp_core-2022.8.17-1",
        "commit": null
      }
    }
  },
  "reports": []
}
- Non-empty report:
  - at most 8341 characters long
- What’s really needed for CCX Notification Service to work?
// issuesEqual checks whether two issues are equal: the service only
// ever compares the type, module, error key, and details of an item.
func issuesEqual(issue1, issue2 types.ReportItem) bool {
	return issue1.Type == issue2.Type &&
		issue1.Module == issue2.Module &&
		issue1.ErrorKey == issue2.ErrorKey &&
		bytes.Equal(issue1.Details, issue2.Details)
}
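In other words, only four fields of each report item matter to the service; everything else in the stored JSON is dead weight. A hypothetical minimal shape (the struct name and JSON tags are illustrative, not the actual schema):

package main

import "encoding/json"

// ReportItemMinimal keeps only the fields that issuesEqual compares.
type ReportItemMinimal struct {
	Type     string          `json:"type"`
	Module   string          `json:"component"`
	ErrorKey string          `json:"key"`
	Details  json.RawMessage `json:"details"`
}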
Conclusion
- 421 chars/bytes are stored for all reports, even the empty ones
  - 21 GB of “garbage” for 50M reports
  - out of circa 26 GB -> ~80% of the overall size!
  - can be dropped before the DB write (PR already reviewed)
- Most of the report attributes are not used and can be dropped as well
  - tags
  - links
  - ~320 chars/bytes per non-empty report
  - ~3 GB of “garbage”
Further shrinking
- TOAST
- gzip the reported.report column programmatically
TOAST
- Compression on the DB side
- Can be configured per column
- Citing the official doc: “the best thing since sliced bread”
- Fast, but not too efficient
  - we need to lower the threshold from 2048 chars/bytes down to the minimum of 128
postgres=# alter table reported set(toast_tuple_target=128);
ALTER TABLE
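Per-column configuration is possible too; a hedged example (EXTENDED is already the default storage for varchar columns, shown only for illustration):

-- EXTENDED = allow both compression and out-of-line storage
ALTER TABLE reported ALTER COLUMN report SET STORAGE EXTENDED;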
Gzipping reported.report
- Very unpopular in academia
  - it’s the DB’s problem how/if to compress data
- In the real world
  - very easy to scale our service (thanks, OpenShift)
  - hard/costly/lack of knowledge to scale the DB properly
- Questions to answer
  - will it be hard to implement?
  - will it be efficient?
  - will it be fast or slow?
- Will it be hard to implement?
  - basically 10 lines of code
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"os"
)

// gzipBytes compresses a byte slice in memory and returns the gzip output.
func gzipBytes(src []byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	_, err := zw.Write(src)
	if err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// gzipString compresses a string the same way.
func gzipString(src string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	_, err := zw.Write([]byte(src))
	if err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// tiny demo driver so the snippet compiles standalone
func main() {
	compressed, err := gzipString(`{"reports": []}`)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("compressed to %d bytes\n", len(compressed))
}
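Reading the column back needs the inverse operation; a minimal sketch (gunzipBytes is a hypothetical helper, not taken from the service):

import (
	"bytes"
	"compress/gzip"
	"io"
)

// gunzipBytes is a hypothetical inverse of gzipBytes: it decompresses
// a gzip blob read back from the report column.
func gunzipBytes(src []byte) ([]byte, error) {
	zr, err := gzip.NewReader(bytes.NewReader(src))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}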
- Will it be efficient?
  - let’s measure for the “empty” and the largest report
Filename: data/empty_report.json
String: 422 Zipped out: 252 Stripped: 40%
Filename: data/large_report.json
String: 8395 Zipped out: 2454 Stripped: 71%
- Will it be fast or slow?
  - forget academia, benchmarks are there to prove it
package main_test

import (
	"bytes"
	"compress/gzip"
	"os"
	"testing"
)

var (
	text1 string = ""
	text2 string
	text3 string
)

// init loads the two sample reports used by the benchmarks.
func init() {
	content, err := os.ReadFile("../data/empty_report.json")
	if err != nil {
		panic(err)
	}
	text2 = string(content)

	content, err = os.ReadFile("../data/large_report.json")
	if err != nil {
		panic(err)
	}
	text3 = string(content)
}

func gzipBytes(src *[]byte) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	_, err := zw.Write(*src)
	if err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func gzipString(src *string) ([]byte, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	_, err := zw.Write([]byte(*src))
	if err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func BenchmarkGzipEmptyString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		gzipString(&text1)
	}
}

func BenchmarkGzipSimpleJSONAsString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		gzipString(&text2)
	}
}

func BenchmarkGzipRuleReportAsString(b *testing.B) {
	for i := 0; i < b.N; i++ {
		gzipString(&text3)
	}
}

func BenchmarkGzipZeroBytes(b *testing.B) {
	bytes1 := []byte(text1)
	for i := 0; i < b.N; i++ {
		gzipBytes(&bytes1)
	}
}

func BenchmarkGzipSimpleJSONAsBytes(b *testing.B) {
	bytes2 := []byte(text2)
	for i := 0; i < b.N; i++ {
		gzipBytes(&bytes2)
	}
}

func BenchmarkGzipRuleReportAsBytes(b *testing.B) {
	bytes3 := []byte(text3)
	for i := 0; i < b.N; i++ {
		gzipBytes(&bytes3)
	}
}
- Results
goos: linux
goarch: amd64
pkg: main/benchmark
cpu: Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
BenchmarkGzipEmptyString-8 153459 111102 ns/op
BenchmarkGzipSimpleJSONAsString-8 89157 143929 ns/op
BenchmarkGzipRuleReportAsString-8 35571 324360 ns/op
BenchmarkGzipZeroBytes-8 103700 123996 ns/op
BenchmarkGzipSimpleJSONAsBytes-8 94760 157709 ns/op
BenchmarkGzipRuleReportAsBytes-8 39266 322264 ns/op
PASS
ok main/benchmark 92.980s
Summary
- Performance problems with the reported table are caused by
  - the high number of records stored (50M)
  - large records, especially in the reported.report column
- Possible solutions discussed here
  - vertical partitioning
    - N/A, not useful
  - horizontal partitioning
    - spike for future research
  - reduction of the number of records
    - viable solution
    - spike for future research
  - reduction of the size of records
    - low-hanging fruit: able to reduce size by 80% easily
    - well understood
    - set of tasks to be prepared
    - to be decided: whether to do gzipping on the client side