I’ve been working with Salt Stack for a few months now, and although it has its pros and cons like every configuration management system, I quite like it.
I did notice, though, that some things started to take a long time, especially once we were managing 500+ servers and had quite a lot of states.
Some optimisations I applied to improve the performance…
In the master config.
keep_jobs: 24
timeout: 180
ping_on_rotate: True
Keeps results of run jobs in files on the master for 24 hours.
Waits 180 seconds for all minions in the job to return before timing out.
Pings each minion after rotating the keys, which happens nightly by default. This prevents a slow response from a minion when it is contacted for the first time after the key has been rotated. With a large population of minions, however, setting this to True can create a thundering herd problem, which needs to be solved with some configs on the minions.
In the minion configs.
When a minion tries to re-auth to the master after the nightly key rotation and, under a thundering herd scenario, the master can’t respond in time, the minion waits a random time between 0 and 60 seconds before attempting another re-auth.
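The setting behind this re-auth delay is not shown here. As far as I know, the minion option that matches the description is random_reauth_delay; a minimal sketch, assuming the 60 second upper bound mentioned above:

```yaml
# Minion config (assumed setting name and value, matching the
# "between 0 and 60 seconds" behaviour described above).
random_reauth_delay: 60
```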
recon_default: 1000
recon_max: 59000
recon_randomize: True
When a minion tries to reconnect to the master for a state run, a ping or a key exchange, and a similar thundering herd situation occurs, it waits a randomly chosen time before attempting to reconnect. With recon_randomize enabled, that wait falls between recon_default and recon_default + recon_max, so between 1000 and 60000 milliseconds here.
keysize: 2048
master_tries: -1
acceptance_wait_time: 10
acceptance_wait_time_max: 60
return_retry_timer: 5
return_retry_timer_max: 15
Use a key size that is secure (2048 bits) but not excessively large, so that the minion’s CPU does not spend too many cycles on encrypting and decrypting the state and pillar data.
Unlimited reconnect retries.
If the Master is too heavily loaded to handle an auth request, the request times out. The Minion then waits 10 seconds before retrying.
The Minion will increase its wait time by 10 seconds each subsequent retry until reaching 60 seconds.
When a Minion returns state results and the Master is too heavily loaded to accept them, the Minion waits before its next return attempt. With return_retry_timer_max set higher than return_retry_timer, that wait is a random number of seconds between the two values, so between 5 and 15 seconds here.
These configs mitigate the thundering herd, but what about other optimisations?
Other non-config optimisations.
I noticed that on Debian systems, when trying to install packages, Salt Stack would force the Minion to run apt-get update before every single package install. It would even do this when the package was already installed! At first, when there were 3 or 4 packages to install, this wasn’t so bad, but our states would regularly try to install 10-25 different packages. Each apt-get update would take 10-15 seconds. Multiply that by 25 packages and 50 servers, and all of a sudden, the state could take anywhere between 300 and 2000 seconds. To combat this, simply include these options in your pkg.installed states:
refresh: False
cache_valid_time: 300
refresh controls whether or not the package repo database is updated prior to installing the requested package(s).
cache_valid_time sets the value in seconds after which the cache is marked as invalid, and a cache update is necessary. This overwrites the refresh parameter’s default behavior of True. This parameter is available only on Debian based distributions and has no effect on the rest.
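As a rough sketch of where these two options go, here is what a single package state might look like. The state ID and package name are made up for illustration; only refresh and cache_valid_time come from the settings above.

```yaml
# Hypothetical state ID and package name, for illustration only.
install_curl:
  pkg.installed:
    - name: curl
    # Do not force a package database refresh before this install.
    - refresh: False
    # Debian-based distributions only: consider the apt cache valid
    # for 300 seconds since the last apt-get update.
    - cache_valid_time: 300
```

With the same two options on every package state, a run that touches 25 packages should trigger at most one apt-get update every 300 seconds instead of one per package.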
Some other potentially interesting settings to set.
(on the master for these ones)
pillar_cache: True
pillar_cache_ttl: 1800
pillar_cache_backend: disk
job_cache: False
The three pillar_cache settings cache pillar data in a file on the master for 1800 seconds, which reduces the time the master spends compiling pillars and therefore the time to complete the state run.
However, this will cause problems if you are changing pillar data regularly as you may have trouble clearing these from the master’s cache. It may also present a security issue if you have keys and passwords stored in pillars.
If you aren’t interested in the results of job runs (for example because you run them from nightly crons), or have stopped the minions from returning them, the Master doesn’t have to write job returns to disk, so you could disable the job_cache.
(and on the minions)
pub_ret: False
cache_jobs: True
With these settings, the minion writes the job results to its local disk rather than returning them to the master.