
Ansible module to restart ec2 instances #6905

Open
AmitPhulera wants to merge 24 commits into master from ap/restart-ans-mod

Contributor

@AmitPhulera AmitPhulera commented Jun 2, 2026

https://dimagi.atlassian.net/browse/SAAS-19382

A redo of #6858. The changes were so extensive that it made no sense to continue on the existing branch.

This is part of the groundwork required to automate the machine restarts.

I tested each command locally against the staging web13 machine and the results were as expected. The output for each command follows:

  • describe
$ cat > /tmp/args.json <<'EOF'
{"ANSIBLE_MODULE_ARGS": {"instance_ids": ["i-0fb86252d8c3490e9"], "command": "describe"}}
EOF
python src/commcare_cloud/ansible/library/ec2_instance_state.py /tmp/args.json

{"changed": false, "command": "describe", "instances": [{"instance_id": "i-0fb86252d8c3490e9", "previous_state": "running", "current_state": "running", "name": "web13-staging", "instance_type": "t3a.xlarge", "availability_zone": "us-east-1a", "private_ip": "10.201.10.244", "public_ip": null, "tags": {"Name": "web13-staging", "Environment": "staging", "Group": "webworkers"}, "launch_time": "2025-08-30T17:34:24+00:00"}], "unchanged_instance_ids": [], "diff": {"before": {"states": {"i-0fb86252d8c3490e9": "running"}}, "after": {"states": {"i-0fb86252d8c3490e9": "running"}}}, "invocation": {"module_args": {"instance_ids": ["i-0fb86252d8c3490e9"], "command": "describe", "wait": true, "region": null}}}
  • start
$ cat > /tmp/args.json <<'EOF'
{"ANSIBLE_MODULE_ARGS": {"instance_ids": ["i-0fb86252d8c3490e9"], "command": "start"}}
EOF
python src/commcare_cloud/ansible/library/ec2_instance_state.py /tmp/args.json

{"changed": true, "command": "start", "instances": [{"instance_id": "i-0fb86252d8c3490e9", "previous_state": "stopped", "current_state": "running", "name": "web13-staging", "instance_type": "t3a.xlarge", "availability_zone": "us-east-1a", "private_ip": "10.201.10.244", "public_ip": null, "tags": {"Name": "web13-staging", "Environment": "staging", "Group": "webworkers"}, "launch_time": "2026-06-05T10:14:44+00:00"}], "unchanged_instance_ids": [], "diff": {"before": {"states": {"i-0fb86252d8c3490e9": "stopped"}}, "after": {"states": {"i-0fb86252d8c3490e9": "running"}}}, "invocation": {"module_args": {"instance_ids": ["i-0fb86252d8c3490e9"], "command": "start", "wait": true, "region": null}}}
  • stop
$ cat > /tmp/args.json <<'EOF'
{"ANSIBLE_MODULE_ARGS": {"instance_ids": ["i-0fb86252d8c3490e9"], "command": "stop"}}
EOF
python src/commcare_cloud/ansible/library/ec2_instance_state.py /tmp/args.json

{"changed": true, "command": "stop", "instances": [{"instance_id": "i-0fb86252d8c3490e9", "previous_state": "running", "current_state": "stopped", "name": "web13-staging", "instance_type": "t3a.xlarge", "availability_zone": "us-east-1a", "private_ip": "10.201.10.244", "public_ip": null, "tags": {"Name": "web13-staging", "Environment": "staging", "Group": "webworkers"}, "launch_time": "2026-06-05T10:12:22+00:00"}], "unchanged_instance_ids": [], "diff": {"before": {"states": {"i-0fb86252d8c3490e9": "running"}}, "after": {"states": {"i-0fb86252d8c3490e9": "stopped"}}}, "invocation": {"module_args": {"instance_ids": ["i-0fb86252d8c3490e9"], "command": "stop", "wait": true, "region": null}}}
  • stop_and_start
$ cat > /tmp/args.json <<'EOF'
{"ANSIBLE_MODULE_ARGS": {"instance_ids": ["i-0fb86252d8c3490e9"], "command": "stop_and_start"}}
EOF
python src/commcare_cloud/ansible/library/ec2_instance_state.py /tmp/args.json

{"changed": true, "command": "stop_and_start", "instances": [{"instance_id": "i-0fb86252d8c3490e9", "previous_state": "running", "current_state": "running", "name": "web13-staging", "instance_type": "t3a.xlarge", "availability_zone": "us-east-1a", "private_ip": "10.201.10.244", "public_ip": null, "tags": {"Name": "web13-staging", "Environment": "staging", "Group": "webworkers"}, "launch_time": "2026-06-05T10:12:22+00:00"}], "unchanged_instance_ids": [], "diff": {"before": {"states": {"i-0fb86252d8c3490e9": "running"}}, "after": {"states": {"i-0fb86252d8c3490e9": "running"}}}, "invocation": {"module_args": {"instance_ids": ["i-0fb86252d8c3490e9"], "command": "stop_and_start", "wait": true, "region": null}}}
Environments Affected

All
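
The manual invocations in the description all follow the same pattern: write a JSON file with an ANSIBLE_MODULE_ARGS key and pass its path to the module. As a rough sketch, the args file could be generated like this. The helper name `write_module_args` and the `COMMANDS` tuple are illustrative and not part of this PR.

```python
import json

# Commands listed in this PR description.
COMMANDS = ("describe", "start", "stop", "stop_and_start")


def write_module_args(path, instance_ids, command):
    """Write an Ansible module args file like the ones used in the manual tests."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command!r}")
    payload = {
        "ANSIBLE_MODULE_ARGS": {
            "instance_ids": list(instance_ids),
            "command": command,
        }
    }
    with open(path, "w") as f:
        json.dump(payload, f)
    return payload
```

The resulting file can then be passed to the module the same way as in the description: python src/commcare_cloud/ansible/library/ec2_instance_state.py /tmp/args.json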

- Manages the running state of EC2 instances given an explicit list of
instance IDs. Supports four commands - describe, start, stop, stop_and_start,
and is idempotent (no API call is made if the instance is already in the requested state).
- Designed to run with delegate_to localhost. AWS credentials and the target region are picked up from the standard boto3 credential chain; in the commcare-cloud workflow the AWS_PROFILE and AWS_REGION environment variables are exported automatically before ansible runs.
Contributor

I'm just not too familiar with this. Do you mind elaborating on the "delete_to localhost" workflow?

Contributor Author

It is delegate_to localhost. It makes the task run on the host the command was launched from instead of on the target machine.
In this case the AWS credentials are on our local system, so we want the module to run locally.

Comment on lines +134 to +135
bad = [i for i in instance_ids if not INSTANCE_ID_RE.match(i)]
if bad:
Contributor

nit

Suggested change
- bad = [i for i in instance_ids if not INSTANCE_ID_RE.match(i)]
- if bad:
+ bad_ids = [i for i in instance_ids if not INSTANCE_ID_RE.match(i)]
+ if bad_ids:
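
For context, here is a minimal sketch of the kind of validation this rename touches. The regex is an assumption, since this excerpt does not show the real INSTANCE_ID_RE; the pattern below accepts the 8- and 17-hex-digit forms of EC2 instance IDs. The function name `find_bad_ids` is also hypothetical.

```python
import re

# Assumed pattern: the real INSTANCE_ID_RE is not shown in this excerpt.
INSTANCE_ID_RE = re.compile(r"^i-(?:[0-9a-f]{8}|[0-9a-f]{17})$")


def find_bad_ids(instance_ids):
    """Return the entries that do not look like EC2 instance IDs."""
    bad_ids = [i for i in instance_ids if not INSTANCE_ID_RE.match(i)]
    return bad_ids
```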

Comment on lines +137 to +140
except ImportError:
    raise RuntimeError(
        "boto3 is required by ec2_instance_state but is not installed."
    )
Contributor

Why not call module.fail_json directly here, similar to what _get_region does? That way main doesn't need to wrap the call to _get_ec2_client in a try/except.

Contributor Author

That's a good point. Will update it.
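
A minimal sketch of the suggestion in this thread, relying on the fact that AnsibleModule.fail_json exits the process. `FakeModule` and the `package` parameter are hypothetical and exist only so the example runs without Ansible or boto3 installed; the real helper in the PR may look different.

```python
import importlib


class FakeModule:
    """Stand-in for AnsibleModule (hypothetical); fail_json exits like the real one."""

    def fail_json(self, msg, **kwargs):
        raise SystemExit(msg)


def get_ec2_client(module, region, package="boto3"):
    """Return an EC2 client, failing the module directly if boto3 is missing."""
    try:
        boto3 = importlib.import_module(package)
    except ImportError:
        # fail_json does not return, so the caller needs no try/except.
        module.fail_json(
            msg=f"{package} is required by ec2_instance_state but is not installed."
        )
    return boto3.client("ec2", region_name=region)
```

With this shape, main can call the helper without wrapping it in its own error handling.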

Comment on lines +144 to +148
"""Per-run context shared by the flow helpers.

Bundles the EC2 client, the AnsibleModule, so these don't have to be
passed as arguments to every helper.
"""
Contributor

Suggested change
- """Per-run context shared by the flow helpers.
- Bundles the EC2 client, the AnsibleModule, so these don't have to be
- passed as arguments to every helper.
- """
+ """
+ Bundles the EC2 client and Ansible module for convenience
+ """

)
return boto3.client('ec2', region_name=region)

class _Ctx:
Contributor

Not my favorite name, but need to continue reviewing to offer suggestions.

Contributor

All of the functions that accept this type (like _do_start(ctx, ...)) look like instance methods. What do you think of renaming this to something like StopStarter and moving those functions to methods?

class StopStarter:

    def __init__(self, client, module):
        self.client = client
        self.module = module

    def describe(self, instance_ids):
        ...

    def start(self, instance_ids, wait):
        ...

    def stop(self, instance_ids, wait):
        ...

    ...

Feel free to pick a different name, that was just the first thing that came to mind.

Contributor Author

Thanks @millerdev. I used the class name EC2InstanceManager in 616ad06.
This looks much better.
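
To illustrate the method-based shape discussed in this thread together with the idempotency described in the module docs, here is a rough sketch. The method bodies are assumptions, since the real ones are not shown in this excerpt, and `FakeEC2Client` is a hypothetical stand-in that mimics only the two boto3 EC2 client calls used here (describe_instances and start_instances).

```python
class FakeEC2Client:
    """Hypothetical stand-in for the boto3 EC2 client, for illustration only."""

    def __init__(self, states):
        self.states = dict(states)
        self.calls = []

    def describe_instances(self, InstanceIds):
        instances = [
            {"InstanceId": i, "State": {"Name": self.states[i]}} for i in InstanceIds
        ]
        return {"Reservations": [{"Instances": instances}]}

    def start_instances(self, InstanceIds):
        self.calls.append(("start_instances", list(InstanceIds)))
        for i in InstanceIds:
            self.states[i] = "pending"


class EC2InstanceManager:
    def __init__(self, client, module=None):
        self.client = client
        self.module = module

    def _states(self, instance_ids):
        response = self.client.describe_instances(InstanceIds=instance_ids)
        return {
            inst["InstanceId"]: inst["State"]["Name"]
            for reservation in response["Reservations"]
            for inst in reservation["Instances"]
        }

    def start(self, instance_ids):
        before = self._states(instance_ids)
        # Only stopped instances need an API call; everything else is a no-op here.
        targets = [i for i in instance_ids if before[i] == "stopped"]
        if targets:
            self.client.start_instances(InstanceIds=targets)
        unchanged = [i for i in instance_ids if i not in targets]
        return {"changed": bool(targets), "unchanged_instance_ids": unchanged}
```

The sketch leaves out waiting, check mode, and terminated-instance handling, all of which the real module covers.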

Comment on lines +311 to +320
def _wait_for(ctx, waiter_name, wait_instances):
    if not wait_instances or ctx.module.check_mode:
        return
    waiter = ctx.client.get_waiter(waiter_name)
    try:
        waiter.wait(InstanceIds=[i.instance_id for i in wait_instances])
    except Exception as e:  # noqa: BLE001 - surface any waiter failure as module failure
        ctx.module.fail_json(
            msg=f"Waiter {waiter_name!r} failed for {_labels(wait_instances)}: {e}")
        return
Contributor

nit: wait_instances -> instances

return ', '.join(i.label for i in instances)


def _check_not_terminated(ctx, instances, action):
Contributor

nit: my preference would be to not pass action into this method. When I first saw _check_not_terminated(ctx, instances, InstanceCommand.START), my first thought was "why would the action impact a check for terminated instances?". I guess the alternatives are to generalize the error message or to leave it up to the caller to call module.fail_json.

Contributor Author

This was addressed while moving things to EC2InstanceManager in 616ad06.

ctx.module.fail_json(msg=f"StopInstances failed for {labels}: {e}")
return

wait_for_stopped = list(targets) + list(already_stopping)
Contributor

targets and already_stopping are added together on line 336 as well. Could you move this variable up to where targets and already_stopping are defined? I also don't think these need to be wrapped in list(...), right?

Contributor Author

Good call. 02e4d10
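
A small sketch of the point made in this thread: when targets and already_stopping are already lists, plain concatenation is enough and the combined list can be defined once, next to them. The function name `plan_stop` and the choice of which states count as targets are assumptions, not taken from the PR.

```python
def plan_stop(states):
    """Split instance IDs by what a stop command has to do with them.

    `states` maps instance ID -> current EC2 state name.
    """
    targets = [i for i, s in states.items() if s == "running"]
    already_stopping = [i for i, s in states.items() if s == "stopping"]
    # Both are lists already, so no list(...) wrappers are needed.
    wait_for_stopped = targets + already_stopping
    return targets, wait_for_stopped
```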

# Honor the user's `wait` choice for the final running wait.
start_payload = _do_start(ctx, instance_ids, wait=wait)

# Combine: previous_state = state before the stop; current_state = after start.
Contributor

nit: remove

Comment on lines +400 to +401
# unchanged = ids that were no-ops in BOTH phases.
# Highly unlikely to happen in practice, but we sort to make the result deterministic.
Contributor

How would this be possible?

Contributor Author

This could only happen if the instance were terminated, but since we already check for that, it should never happen. I kept it so that the output stays the same.
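
For reference, the combination described in the code comment under discussion could be sketched as below. The function and argument names are hypothetical. Sorting matters because set iteration order is not something to rely on, so the result would otherwise not be deterministic.

```python
def combine_unchanged(stop_unchanged, start_unchanged):
    """Return the IDs that were no-ops in both the stop and the start phase, sorted."""
    return sorted(set(stop_unchanged) & set(start_unchanged))
```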
