Chapter 14. Performance tuning for automation controller

Tune your automation controller to optimize performance and scalability. When planning your workload, ensure that you identify your performance and scaling needs, adjust for any limitations, and monitor your deployment.

Automation controller is a distributed system with multiple components that you can tune, including the following:

  • Task system in charge of scheduling jobs
  • Control Plane in charge of controlling jobs and processing output
  • Execution plane where jobs run
  • Web server in charge of serving the API
  • Websocket system that serve and broadcast websocket connections and data
  • Database used by multiple components

14.1. Capacity Planning for deploying automation controller

Capacity planning for automation controller is planning the scale and characteristics of your deployment so that it has the capacity to run the planned workload. Capacity planning includes the following phases:

  1. Characterizing your workload
  2. Reviewing the capabilities of different node types
  3. Planning the deployment based on the requirements of your workload

14.1.1. Characteristics of your workload

Before planning your deployment, establish the workload that you want to support. Consider the following factors to characterize an automation controller workload:

  • Managed hosts
  • Tasks per hour per host
  • Maximum number of concurrent jobs that you want to support
  • Maximum number of forks set on jobs. Forks determine the number of hosts that a job acts on concurrently.
  • Maximum API requests per second
  • Node size that you prefer to deploy (CPU/Memory/Disk)

14.1.2. Types of nodes in automation controller

You can configure four types of nodes in an automation controller deployment:

  • Control nodes
  • Hybrid nodes
  • Execution nodes
  • Hop nodes

14.1.2.1. Benefits of scaling control nodes

Control and hybrid nodes provide control capacity. They provide the ability to start jobs and process their output into the database. Every job is assigned a control node. In the default configuration, each job requires one capacity unit to control. For example, a control node with 100 capacity units can control a maximum of 100 jobs.

Vertically scaling a control node by deploying a larger virtual machine with more resources increases the following capabilities of the control plane:

  • The number of jobs that a control node can perform control tasks for, which requires both more CPU and memory.
  • The number of job events a control node can process concurrently.

Scaling CPU and memory in the same proportion is recommended, for example, 1 CPU: 4 GB RAM. Even when memory consumption is high, increasing the CPU of an instance can often relieve pressure. The majority of the memory that control nodes consume is from unprocessed events that are stored in a memory-based queue.

Note

Vertically scaling a control node does not automatically increase the number of workers that handle web requests.

An alternative to vertically scaling is horizontally scaling by deploying more control nodes. This allows spreading control tasks across more nodes as well as allowing web traffic to be spread over more nodes, given that you provision a load balancer to spread requests across nodes. Horizontally scaling by deploying more control nodes in many ways can be preferable as it additionally provides for more redundancy and workload isolation in the event that a control node goes down or experiences higher than normal load.

14.1.2.2. Benefits of scaling execution nodes

Execution and hybrid nodes provide execution capacity. The capacity consumed by a job is equal to the number of forks set on the job template or the number of hosts in the inventory, whichever is less, plus one additional capacity unit to account for the main ansible process. For example, a job template with the default forks value of 5 acting on an inventory with 50 hosts consumes 6 capacity units from the execution node it is assigned to.

Vertically scaling an execution node by deploying a larger virtual machine with more resources provides more forks for job execution. This increases the number of concurrent jobs that an instance can run.

In general, scaling CPU alongside memory in the same proportion is recommended. Like control and hybrid nodes, there is a capacity adjustment on each execution node that you can use to align actual use with the estimation of capacity consumption that the automation controller makes. By default, all nodes are set to the top of that range. If actual monitoring data reveals the node to be over-used, decreasing the capacity adjustment can help bring this in line with actual usage.

An alternative to vertically scaling execution nodes is horizontally scaling the execution plane by deploying more virtual machines to be execution nodes. Because horizontally scaling can provide additional isolation of workloads, you can assign different instances to different instance groups. You can then assign these instance groups to organizations, inventories, or job templates. For example, you can configure an instance group that can only be used for running jobs against a certain Inventory. In this scenario, by horizontally scaling the execution plane, you can ensure that lower-priority jobs do not block higher-priority jobs

14.1.2.3. Benefits of scaling hop nodes

Because hop nodes use very low memory and CPU, vertically scaling these nodes does not impact capacity. Monitor the network bandwidth of any hop node that serves as the sole connection between many execution nodes and the control plane. If bandwidth use is saturated, consider changing the network.

Horizontally scaling by adding more hop nodes could provide redundancy in the event that one hop node goes down, which can allow traffic to continue to flow between the control plane and the execution nodes.

14.1.2.4. Ratio of control to execution capacity

Assuming default configuration, the maximum recommended ratio of control capacity to execution capacity is 1:5 in traditional VM deployments. This ensures that there is enough control capacity to run jobs on all the execution capacity available and process the output. Any less control capacity in relation to the execution capacity, and it would not be able to launch enough jobs to use the execution capacity.

There are cases in which you might want to modify this ratio closer to 1:1. For example, in cases where a job produces a high level of job events, reducing the amount of execution capacity in relation to the control capacity helps relieve pressure on the control nodes to process that output.

14.2. Example capacity planning exercise

After you have determined the workload capacity that you want to support, you must plan your deployment based on the requirements of the workload. To help you with your deployment, review the following planning exercise.

For this example, the cluster must support the following capacity:

  • 300 managed hosts
  • 1,000 tasks per hour per host or 16 tasks per minute per host
  • 10 concurrent jobs
  • Forks set to 5 on playbooks. This is the default.
  • Average event size is 1 Mb

The virtual machines have 4 CPU and 16 GB RAM, and disks that have 3000 IOPs.

14.2.1. Example workload requirements

For this example capacity planning exercise, use the following workload requirements:

Execution capacity

  • To run the 10 concurrent jobs requires at least 60 units of execution capacity.

    • You calculate this by using the following equation: (10 jobs * 5 forks) + (10 jobs * 1 base task impact of a job) = 60 execution capacity

Control capacity

  • To control 10 concurrent jobs requires at least 10 units of control capacity.
  • To calculate the number of events per hour that you need to support 300 managed hosts and 1,000 tasks per hour per host, use the following equation:

    • 1000 tasks * 300 managed hosts per hour = 300,000 events per hour at minimum.
    • You must run the job to see exactly how many events it produces, because this is dependent on the specific task and verbosity. For example, a debug task printing “Hello World” produces 6 job events with the verbosity of 1 on one host. With a verbosity of 3, it produces 34 job events on one host. Therefore, you must estimate that the task produces at least 6 events. This would produce closer to 3,000,000 events per hour, or approximately 833 events per second.

Determining quantity of execution and control nodes needed

To determine how many execution and control nodes you need, reference the experimental results in the following table that shows the observed event processing rate of a single control node with 5 execution nodes of equal size (API Capacity column). The default “forks” setting of job templates is 5, so using this default, the maximum number of jobs a control node can dispatch to execution nodes makes 5 execution nodes of equal CPU/RAM use 100% of their capacity, arriving to the previously mentioned 1:5 ratio of control to execution capacity.

NodeAPI capacityDefault execution capacityDefault control capacityMean event processing rate at 100% capacity usageMean events processing rate at 50% capacity usageMean event processing rate at 40% capacity usage

4 CPU at 2.5Ghz, 16 GB RAM control node, a maximum of 3000 IOPs disk

approximately 10 requests per second

n/a

137 jobs

1100 per second

1400 per second

1630 per second

4 CPU at 2.5Ghz, 16 GB RAM execution node, a maximum of 3000 IOPs disk

n/a

137

n/a

n/a

n/a

n/a

4 CPU at 2.5Ghz, 16 GB RAM database node, a maximum of 3000 IOPs disk

n/a

n/a

n/a

n/a

n/a

n/a

Because controlling jobs competes with job event processing on the control node, over-provisioning control capacity can reduce processing times. When processing times are high, you can experience a delay between when the job runs and when you can view the output in the API or UI.

For this example, for a workload on 300 managed hosts, executing 1000 tasks per hour per host, 10 concurrent jobs with forks set to 5 on playbooks, and an average event size 1 Mb, use the following procedure:

  • Deploy 1 execution node, 1 control node, 1 database node of 4 CPU at 2.5Ghz, 16 GB RAM, and disks that have approximately 3000 IOPs.
  • Keep the default fork setting of 5 on job templates.
  • Use the capacity adjustment feature in the instance view of the UI on the control node to reduce the capacity down to 16, the lowest value, to reserve more of the control node’s capacity for processing events.

Additional Resources

14.3. Performance troubleshooting for automation controller

Users experience many request timeouts (504 or 503 errors), or in general high API latency. In the UI, clients face slow login and long wait times for pages to load. What system is the likely culprit?

  • If these issues occur only on login, and you use external authentication, the problem is likely with the integration of your external authentication provider. See Setting up enterprise authentication or seek Red Hat Support.
  • For other issues with timeouts or high API latency, see Web server tuning.

Long wait times for job output to load.

What can I do to increase the number of jobs that automation controller can run concurrently?

  • Factors that cause jobs to remain in “pending” state are:

    • Waiting for “dependencies” to finish: this includes project updates and inventory updates when “update on launch” behavior is enabled.
    • The “allow_simultaneous” setting of the job template: if multiple jobs of the same job template are in “pending” status, check the “allow_simultaneous” setting of the job template (“Concurrent Jobs” checkbox in the UI). If this is not enabled, only one job from a job template can run at a time.
    • The “forks” value of your job template: the default value is 5. The amount of capacity required to run the job is roughly the forks value (some small overhead is accounted for). If the forks value is set to a very large number, this will limit what nodes will be able to run it.
    • Lack of either control or execution capacity: see “awx_instance_remaining_capacity” metric from the application metrics available on /api/v2/metrics. See Metrics for monitoring automation controller application for more information about how to monitor metrics. See Capacity planning for deploying automation controller for information on how to plan your deployment to handle the number of jobs you are interested in.

Jobs run more slowly on automation controller than on a local machine.

  • Some additional overhead is expected, because automation controller might be dispatching your job to a separate node. In this case, automation controller is starting a container and running ansible-playbook there, serializing all output and writing it to a database.
  • Project update on launch and inventory update on launch behavior can cause additional delays at job start time.
  • Size of projects can impact how long it takes to start the job, as the project is updated on the control node and transferred to the execution node. Internal cluster routing can impact network performance. For more information, see Internal cluster routing.
  • Container pull settings can impact job start time. The execution environment is a container that is used to run jobs within it. Container pull settings can be set to “Always”, “Never” or “If not present”. If the container is always pulled, this can cause delays.
  • Ensure that all cluster nodes, including execution, control, and the database, have been deployed in instances with storage rated to the minimum required IOPS, because the manner in which automation controller runs ansible and caches event data implicates significant disk I/O. For more information, see Red Hat Ansible Automation Platform system requirements.

Database storage does not stop growing.

  • Automation controller has a management job titled “Cleanup Job Details”. By default, it is set to keep 120 days of data and to run once a week. To reduce the amount of data in the database, you can shorten the retention time. For more information, see Removing Old Activity Stream Data.
  • Running the cleanup job deletes the data in the database. However, the database must at some point perform its vacuuming operation which reclaims storage. See PostgreSQL database configuration and maintenance for automation controller for more information about database vacuuming.

14.4. Metrics to monitor automation controller

Monitor your automation controller hosts at the system and application levels.

System level monitoring includes the following information:

  • Disk I/O
  • RAM use
  • CPU use
  • Network traffic

Application level metrics provide data that the application knows about the system. This data includes the following information:

  • How many jobs are running in a given instance
  • Capacity information about instances in the cluster
  • How many inventories are present
  • How many hosts are in those inventories

Using system and application metrics can help you identify what was happening in the application when a service degradation occurred. Information about automation controller’s performance over time helps when diagnosing problems or doing capacity planning for future growth.

14.4.1. Metrics for monitoring automation controller application

For application level monitoring, automation controller provides Prometheus-style metrics on an API endpoint /api/v2/metrics. Use these metrics to monitor aggregate data about job status and subsystem performance, such as for job output processing or job scheduling.

The metrics endpoint includes descriptions of each metric. Metrics of particular interest for performance include:

  • awx_status_total

    • Current total of jobs in each status. Helps correlate other events to activity in system.
    • Can monitor upticks in errored or failed jobs.
  • awx_instance_remaining_capacity

    • Amount of capacity remaining for running additional jobs.
  • callback_receiver_event_processing_avg_seconds

    • colloquially called “job events lag”.
    • Running average of the lag time between when a task occurred in ansible and when the user is able to see it. This indicates how far behind the callback receiver is in processing events. When this number is very high, users can consider scaling up the control plane or using the capacity adjustment feature to reduce the number of jobs a control node controls.
  • callback_receiver_events_insert_db

    • Counter of events that have been inserted by a node. Can be used to calculate the job event insertion rate over a given time period.
  • callback_receiver_events_queue_size_redis

    • Indicator of how far behind callback receiver is in processing events. If too high, Redis can cause the control node to run out of memory (OOM).

14.4.2. System level monitoring

Monitoring the CPU and memory use of your cluster hosts is important because capacity management for instances does not introspect into the actual resource usage of hosts. The resource impact of automation jobs depends on what the playbooks are doing. For example, many cloud or networking modules do most of the processing on the execution node, which runs the Ansible Playbook. The impact on the automation controller is very different than if you were running a native module like “yum” where the work is performed on the target hosts where the execution node spends much of the time during this task waiting on results.

If CPU or memory usage is very high, consider lowering the capacity adjustment (available on the instance detail page) on affected instances in the automation controller. This limits how many jobs are run on or controlled by this instance.

Monitor the disk I/O and use of your system. The manner in which an automation controller node runs Ansible and caches output on the file system, and eventually saves it in the database, creates high levels of disk reads and writes. Identifying poor disk performance early can help prevent poor user experience and system degradation.

Additional resources

14.5. PostgreSQL database configuration and maintenance for automation controller

To improve the performance of automation controller, you can configure the following configuration parameters in the database:

Maintenance

The VACUUM and ANALYZE tasks are important maintenance activities that can impact performance. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it’s necessary to do VACUUM periodically, especially on frequently-updated tables. ANALYZE collects statistics about the contents of tables in the database, and stores the results in the pg_statistic system catalog. Subsequently, the query planner uses these statistics to help determine the most efficient execution plans for queries. The autovacuuming PostgreSQL configuration parameter automates the execution of VACUUM and ANALYZE commands. Setting autovacuuming to true is a good practice. However, autovacuuming will not occur if there is never any idle time on the database. If it is observed that autovacuuming is not sufficiently cleaning up space on the database disk, then scheduling specific vacuum tasks during specific maintenance windows can be a solution.

Configuration parameters

To improve the performance of the PostgreSQL server, configure the following Grand Unified Configuration (GUC) parameters that manage database memory. You can find these parameters inside the $PDATA directory in the postgresql.conf file, which manages the configurations of the database server.

  • shared_buffers: determines how much memory is dedicated to the server for caching data. The default value for this parameter is 128 MB. When you modify this value, you must set it between 15% and 25% of the machine’s total RAM.
Note

You must restart the database server after changing the value for shared_buffers.

  • work_mem: provides the amount of memory to be used by internal sort operations and hash tables before disk-swapping. Sort operations are used for order by, distinct, and merge join operations. Hash tables are used in hash joins and hash-based aggregation. The default value for this parameter is 4 MB. Setting the correct value of the work_mem parameter improves the speed of a search by reducing disk-swapping.

    • Use the following formula to calculate the optimal value of the work_mem parameter for the database server:
Total RAM * 0.25 / max_connections
Note

Setting a large work_mem can cause the PostgreSQL server to go out of memory (OOM) if there are too many open connections to the database.

  • max_connections: specifies the maximum number of concurrent connections to the database server.
  • maintenance_work_mem: provides the maximum amount of memory to be used by maintenance operations, such as vacuum, create index, and alter table add foreign key operations. The default value for this parameter is 64 MB. Use the following equation to calculate a value for this parameter:
Total RAM * 0.05
Note

Set maintenance_work_mem higher than work_mem to improve performance for vacuuming.

Additional resources

For more information on autovacuuming settings, see Automatic Vacuuming.

14.6. Automation controller tuning

You can configure many automation controller settings by using the automation controller UI, API, and file based settings including:

  • Live events in the automation controller UI
  • Job event processing
  • Control and execution node capacity
  • Instance group and container group capacity
  • Task management (job scheduling)
  • Internal cluster routing
  • Web server tuning

14.6.1. Managing live events in the automation controller UI

Events are sent to any node where there is a UI client subscribed to a job. This task is expensive, and becomes more expensive as the number of events that the cluster is producing increases and the number of control nodes increases, because all events are broadcast to all nodes regardless of how many clients are subscribed to particular jobs.

To reduce the overhead of displaying live events in the UI, administrators can choose to either:

  • Disable live streaming events.
  • Reduce the number of events shown per second or before truncating or hiding events in the UI.

When you disable live streaming of events, they are only loaded on hard refresh to a job’s output detail page. When you reduce the number of events shown per second, this limits the overhead of showing live events, but still provides live updates in the UI without a hard refresh.

14.6.1.1. Disabling live streaming events

Procedure

  1. Disable live streaming events by using one of the following methods:

    1. In the API, set UI_LIVE_UPDATES_ENABLED to False.
    2. Navigate to your automation controller. Open the Miscellaneous System Settings window. Set the Enable Activity Stream toggle to Off.

14.6.1.2. Settings to modify rate and size of events

If you cannot disable live streaming of events because of their size, reduce the number of events that are displayed in the UI. You can use the following settings to manage how many events are displayed:

Settings available for editing in the UI or API:

  • EVENT_STDOUT_MAX_BYTES_DISPLAY: Maximum amount of stdout to display (as measured in bytes). This truncates the size displayed in the UI.
  • MAX_WEBSOCKET_EVENT_RATE: Number of events to send to clients per second.

Settings available by using file based settings:

  • MAX_UI_JOB_EVENTS: Number of events to display. This setting hides the rest of the events in the list.
  • MAX_EVENT_RES_DATA: The maximum size of the ansible callback event’s "res" data structure. The "res" is the full "result" of the module. When the maximum size of ansible callback events is reached, then the remaining output will be truncated. Default value is 700000 bytes.
  • LOCAL_STDOUT_EXPIRE_TIME: The amount of time before a stdout file is expired and removed locally.

Additional resources

For more information on file based settings, see Additional settings for automation controller.

14.6.2. Settings for managing job event processing

The callback receiver processes all the output of jobs and writes this output as job events to the automation controller database. The callback receiver has a pool of workers that processes events in batches. The number of workers automatically increases with the number of CPU available on an instance.

Administrators can override the number of callback receiver workers with the setting JOB_EVENT_WORKERS. Do not set more than 1 worker per CPU, and there must be at least 1 worker. Greater values have more workers available to clear the Redis queue as events stream to the automation controller, but can compete with other processes such as the web server for CPU seconds, uses more database connections (1 per worker), and can reduce the batch size of events each worker commits.

Each worker builds up a buffer of events to write in a batch. The default amount of time to wait before writing a batch is 1 second. This is controlled by the JOB_EVENT_BUFFER_SECONDS setting. Increasing the amount of time the worker waits between batches can result in larger batch sizes.

14.6.3. Capacity settings for control and execution nodes

The following settings impact capacity calculations on the cluster. Set them to the same value on all control nodes by using the following file-based settings.

  • AWX_CONTROL_NODE_TASK_IMPACT: Sets the impact of controlling jobs. You can use it when your control plane exceeds desired CPU or memory usage to control the number of jobs that your control plane can run at the same time.
  • SYSTEM_TASK_FORKS_CPU and SYSTEM_TASK_FORKS_MEM: Influence how many resources are estimated to be consumed by each fork of Ansible. By default, 1 fork of Ansible is estimated to use 0.25 of a CPU and 100 Mb of memory.

Additional resources

For information about file-based settings, see Additional settings for automation controller.

14.6.4. Capacity settings for instance group and container group

Use the max_concurrent_jobs and max_forks settings available on instance groups to limit how many jobs and forks can be consumed across an instance group or container group.

  • To calculate the max_concurrent_jobs you need on a container group consider the pod_spec setting for that container group. In the pod_spec, you can see the resource requests and limits for the automation job pod. Use the following equation to calculate the maximum concurrent jobs that you need:
((number of worker nodes in kubernetes cluster) * (CPU available on each worker)) / (CPU request on pod_spec) = maximum number of concurrent jobs
  • For example, if your pod_spec indicates that a pod will request 250 mcpu Kubernetes cluster has 1 worker node with 2 CPU, the maximum number of jobs that you need to start with is 8.

    • You can also consider the memory consumption of the forks in the jobs. Calculate the appropriate setting of max_forks with the following equation:
((number of worker nodes in kubernetes cluster) * (memory available on each worker)) / (memory request on pod_spec) = maximum number of forks
  • For example, given a single worker node with 8 Gb of Memory, we determine that the max forks we want to run is 81. This way, either 39 jobs with 1 fork can run (task impact is always forks + 1), or 2 jobs with forks set to 39 can run.

    • You might have other business requirements that motivate using max_forks or max_concurrent_jobs to limit the number of jobs launched in a container group.

14.6.5. Settings for scheduling jobs

The task manager periodically collects tasks that need to be scheduled and determines what instances have capacity and are eligible for running them. The task manager has the following workflow:

  1. Find and assign the control and execution instances.
  2. Update the job’s status to waiting.
  3. Message the control node through pg_notify for the dispatcher to pick up the task and start running it.

If the scheduling task is not completed within TASK_MANAGER_TIMEOUT seconds (default 300 seconds), the task is terminated early. Timeout issues generally arise when there are thousands of pending jobs.

One way the task manager limits how much work it can do in a single run is the START_TASK_LIMIT setting. This limits how many jobs it can start in a single run. The default is 100 jobs. If more jobs are pending, a new scheduler task is scheduled to run immediately after. Users who are willing to have potentially longer latency between when a job is launched and when it starts, to have greater overall throughput, can consider increasing the START_TASK_LIMIT. To see how long individual runs of the task manager take, use the Prometheus metric task_manager__schedule_seconds, available in /api/v2/metrics.

Jobs elected to begin running by the task manager do not do so until the task manager process exits and commits its changes. The TASK_MANAGER_TIMEOUT setting determines how long a single run of the task manager will run for before committing its changes. When the task manager reaches its timeout, it attempts to commit any progress it made. The task is not actually forced to exit until after a grace period (determined by TASK_MANAGER_TIMEOUT_GRACE_PERIOD) has passed.

14.6.6. Internal Cluster Routing

Automation controller cluster hosts communicate across the network within the cluster. In the inventory file for the traditional VM installer, you can indicate multiple routes to the cluster nodes that are used in different ways:

Example:

[automationcontroller]
controller1 ansible_user=ec2-user ansible_host=10.10.12.11 node_type=hybrid routable_hostname=somehost.somecompany.org
  • controller1 is the inventory hostname for the automation controller host. The inventory hostname is what is shown as the instance hostname in the application. This can be useful when preparing for disaster recovery scenarios where you want to use the backup/restore method to restore the cluster to a new set of hosts that have different IP addresses. In this case you can have entries in /etc/hosts that map these inventory hostnames to IP addresses, and you can use internal IP addresses to mitigate any DNS issues when it comes to resolving public DNS names.
  • ansible_host=10.10.12.11 indicates how the installer reaches the host, which in this case is an internal IP address. This is not used outside of the installer.
  • routable_hostname=somehost.somecompany.org indicates the hostname that is resolvable for the peers that connect to this node on the receptor mesh. Since it may cross multiple networks, we are using a hostname that will map to an IP address resolvable for the receptor peers.

14.6.7. Web server tuning

Control and Hybrid nodes each serve the UI and API of automation controller. WSGI traffic is served by the uwsgi web server on a local socket. ASGI traffic is served by Daphne. NGINX listens on port 443 and proxies traffic as needed.

To scale automation controller’s web service, follow these best practices:

  • Deploy multiple control nodes and use a load balancer to spread web requests over multiple servers.
  • Set max connections per automation controller to 100.

To optimize automation controller’s web service on the client side, follow these guidelines:

  • Direct user to use dynamic inventory sources instead of individually creating inventory hosts by using the API.
  • Use webhook notifications instead of polling for job status.
  • Use the bulk APIs for host creation and job launching to batch requests.
  • Use token authentication. For automation clients that must make many requests very quickly, using tokens is a best practice, because depending on the type of user, there may be additional overhead when using basic authentication.

Additional resources