Scope
This article applies to PowerShell Universal (PSU) environments running versions earlier than 4.3.0 that are configured with multiple PSU servers sharing a SQL database.
Problem
The PowerShell Universal server may fail to start because of SQL timeout errors. During server startup, background jobs such as the heartbeat and app token sync jobs may run in rapid succession and concurrently. The system logs will show SQL-related errors and multiple heartbeats running at once.
Root Cause
The root cause is how the Hangfire scheduler behaves when no job queue is available for a scheduled job. This can happen if a PowerShell Universal service is stopped before it can remove its scheduled background jobs. The schedules remain in the Hangfire scheduler and continue to create new jobs in the queue for the computer that is no longer running. This only occurs while at least one other PowerShell Universal server is still active. If the stopped server is down for some time, its queue grows large, and the server attempts to run all of the queued jobs at once when it starts. If it cannot do so, it may crash, and the server never starts successfully.
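To get a feel for how quickly such a queue can grow, the following is a minimal sketch that models the behavior described above. It is not PowerShell Universal or Hangfire code. The node name PSU02 and the job intervals are hypothetical values chosen for illustration, and actual intervals depend on your configuration.

```python
from collections import Counter


def simulate_backlog(downtime_minutes, schedules):
    """Count jobs queued for a stopped node whose schedules were never removed.

    `schedules` maps a recurring job name to its interval in minutes.
    The intervals are hypothetical, not values taken from the product.
    """
    backlog = Counter()
    for minute in range(1, downtime_minutes + 1):
        for name, interval in schedules.items():
            # Another active server keeps the scheduler firing, but no
            # worker is draining the stopped node's queue.
            if minute % interval == 0:
                backlog[name] += 1
    return backlog


# Hypothetical node "PSU02" that has been down for one day.
schedules = {"Heartbeat.PSU02": 1, "AppTokenRefresh.PSU02": 5}
backlog = simulate_backlog(24 * 60, schedules)

print(backlog["Heartbeat.PSU02"])        # 1440
print(backlog["AppTokenRefresh.PSU02"])  # 288
print(sum(backlog.values()))             # 1728
```

Under these assumed intervals, two schedules alone leave 1,728 jobs waiting after a single day of downtime, all of which the server would try to run at startup.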
Solution
PowerShell Universal 4.3.0 introduced logic that skips these jobs during startup and processes only the most recent copy of each job. To clean up the jobs table, truncate the Hangfire.Jobs table in the SQL database and remove the recurring jobs for the service that is no longer running. These can be safely removed because restarting a PowerShell Universal service recreates the schedules. Always back up your database before performing any SQL operation.
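The idea of processing only the most recent copy of each job can be illustrated with a short sketch. This is an illustrative model only and not the actual 4.3.0 implementation. The QueuedJob type, its fields, and the node name PSU02 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedJob:
    name: str        # recurring job name, e.g. "Heartbeat.PSU02" (hypothetical)
    created_at: int  # enqueue time as an integer tick, to keep the sketch simple


def latest_per_job(queued):
    """Return one job per name, keeping the most recently created copy."""
    latest = {}
    for job in queued:
        current = latest.get(job.name)
        if current is None or job.created_at > current.created_at:
            latest[job.name] = job
    # Sort by creation time so the result order is deterministic.
    return sorted(latest.values(), key=lambda j: j.created_at)


# Five stale heartbeats and two stale token refreshes waiting in the queue.
queue = [QueuedJob("Heartbeat.PSU02", t) for t in range(1, 6)]
queue += [QueuedJob("AppTokenRefresh.PSU02", t) for t in (2, 4)]

to_run = latest_per_job(queue)
print([(j.name, j.created_at) for j in to_run])
# [('AppTokenRefresh.PSU02', 4), ('Heartbeat.PSU02', 5)]
```

In this model, seven queued jobs collapse to two, so the startup workload no longer scales with how long the server was down.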
1. Truncate Hangfire.Jobs
To remove all queued Hangfire jobs, run the following SQL command.
2. Remove Unused Schedules
You can remove unused schedules from the Hangfire dashboard. Visit http://<servername>:<port>/hangfire to view the dashboard. Select the six recurring jobs for the server that is no longer active and remove them. The schedules to select are listed below.
- <NodeName>.ProcessMonitor
- AppTokenRefresh.<NodeName>
- GitSync.<NodeName>
- Heartbeat.<NodeName>
- ModuleRefresh.<NodeName>
- HealthCheck.<NodeName>
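When several nodes appear in the dashboard, it can help to work out in advance exactly which names belong to the inactive server. The sketch below expands the six name patterns listed above for a given node and filters a list of names against them. The node names PSU01 and PSU02 and the sample list are hypothetical, and the script does not talk to Hangfire.

```python
# Name patterns taken from the list above; {node} stands for <NodeName>.
TEMPLATES = [
    "{node}.ProcessMonitor",
    "AppTokenRefresh.{node}",
    "GitSync.{node}",
    "Heartbeat.{node}",
    "ModuleRefresh.{node}",
    "HealthCheck.{node}",
]


def recurring_job_names(node):
    """Return the six recurring job names expected for one node."""
    return [template.format(node=node) for template in TEMPLATES]


def jobs_for_node(all_names, node):
    """Keep only the names from `all_names` that belong to `node`."""
    expected = set(recurring_job_names(node))
    return [name for name in all_names if name in expected]


# Hypothetical names copied by hand from the dashboard's recurring jobs page.
dashboard = recurring_job_names("PSU01") + recurring_job_names("PSU02")

for name in jobs_for_node(dashboard, "PSU02"):
    print(name)
```

Matching on the exact expected names, rather than on a substring of the node name, avoids selecting schedules for a node such as PSU020 by mistake.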
If you remove a recurring job in error, you can restart the affected service to force it to recreate the schedule.