this post was submitted on 18 Aug 2023

cross-posted from: https://lemmy.daqfx.com/post/24701

I'm hosting my own Lemmy instance and trying to figure out how to tune PostgreSQL to reduce disk I/O at the expense of memory.

I accept the increased risk this introduces, but I need to find parameters that let a server with plenty of RAM and reliable power run without constantly sitting at 20% iowait.

Current settings:

# DB Version: 15
# OS Type: linux
# DB Type: web
# Total Memory (RAM): 32 GB
# CPUs num: 8
# Data Storage: hdd

max_connections = 200
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
default_statistics_target = 100
random_page_cost = 4
effective_io_concurrency = 2
work_mem = 10485kB
min_wal_size = 1GB
max_wal_size = 4GB
max_worker_processes = 8
max_parallel_workers_per_gather = 4
max_parallel_workers = 8
max_parallel_maintenance_workers = 4
fsync = off
synchronous_commit = off
wal_writer_delay = 800ms
wal_buffers = 64MB
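A quick sanity check, assuming the server is reachable via psql, is to confirm which config file PostgreSQL actually loaded and where each setting came from (this is exactly the failure mode in the solution further down):

```sql
-- Which configuration file did the server read at startup?
SHOW config_file;

-- Where did individual settings come from? source/sourcefile reveal
-- whether a value came from your custom file or a compiled-in default.
SELECT name, setting, unit, source, sourcefile
FROM pg_settings
WHERE name IN ('shared_buffers', 'fsync', 'wal_buffers', 'work_mem');
```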

Most load comes from the LCS script seeding content, not from actual users.

Solution: My issue turned out to be really banal: Lemmy's PostgreSQL container was reading its config from the default location (/var/lib/postgresql/data/postgresql.conf) rather than from the location where I had actually mounted my custom config file (/etc/postgresql.conf). After I updated docker-compose.yaml to point PostgreSQL at the correct config file, everything worked as expected. Thanks @bahmanm@lemmy.ml for pointing me in the right direction!
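For anyone hitting the same thing with the official postgres image, the fix is typically a `command` override plus a bind mount. This is a sketch, not the poster's actual file: the service name, image tag, and volume layout are assumptions; only the two config paths come from this thread.

```yaml
services:
  postgres:                     # service name is an assumption
    image: postgres:15
    volumes:
      # mount the custom config where the command below expects it
      - ./postgresql.conf:/etc/postgresql.conf:ro
      - pgdata:/var/lib/postgresql/data
    # make the server read the mounted file instead of the default
    # /var/lib/postgresql/data/postgresql.conf
    command: ["postgres", "-c", "config_file=/etc/postgresql.conf"]

volumes:
  pgdata:
```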

[–] bahmanm@lemmy.ml 1 points 1 year ago (1 children)

> could not resize shared memory

That suggests too many chunky parallel maintenance workers are grabbing shared memory at the same time (max_parallel_maintenance_workers × maintenance_work_mem.)

VACUUMing is a very important part of how PG works; can you try setting max_parallel_maintenance_workers to 1, or even 0 (disabling parallelism altogether), and retry the experiment?
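A minimal way to try that, assuming superuser access:

```sql
-- Persist the change in postgresql.auto.conf ...
ALTER SYSTEM SET max_parallel_maintenance_workers = 0;
-- ... and make it take effect without a restart
SELECT pg_reload_conf();
-- Confirm the new value is active
SHOW max_parallel_maintenance_workers;
```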

> I did increase shared_buffers and effective_cache_size with no effect.

That probably rules out the theory of thrashed indices.

> https://ctxt.io/2/AABQciw3FA
> https://ctxt.io/2/AABQTprTEg
> https://ctxt.io/2/AABQKqOaEg

Since those stats are cumulative, it's hard to tell anything w/o knowing when the SELECT was run. It'd be very helpful if you could run those queries a few times w/ a 1-minute interval and share the output.
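Something like this works for the sampling; pg_stat_bgwriter is just one example of a cumulative statistics view (the exact queries behind the pastes above aren't shown in this excerpt):

```sql
-- Run once a minute and diff the counters between runs;
-- cumulative views only make sense as deltas over time.
SELECT now() AS sampled_at, * FROM pg_stat_bgwriter;
```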

> I did install Prometheus with PG exporter and Grafana... Anything specific you can suggest that I should focus on?

I'd start w/ the 3 tables I mentioned in the previous point and try to find anomalies, esp. under different workloads. The rest, I'm afraid, is going to be a bit of an investigation and detective work.

If you like, you can give me access to the Grafana dashboard so I can take a look and we can take it from there. It's going to be totally free of charge of course as I am quite interested in your problem: it's both a challenge for me and helping a fellow Lemmy user. The only thing I ask is that we report back the results and solution here so that others can benefit from the work.

[–] daq@lemmy.daqfx.com 1 points 1 year ago

> If you like, you can give me access to the Grafana dashboard so I can take a look and we can take it from there. It's going to be totally free of charge of course as I am quite interested in your problem: it's both a challenge for me and helping a fellow Lemmy user. The only thing I ask is that we report back the results and solution here so that others can benefit from the work.

No problem. PM me an IP (v4 or v6) or an email address (disposable is fine) and I'll reply with a link to access Grafana with the above in the allow list.