Improved AWS ECS Queue-Based Scaling Automation

I’ve always faced issues with my current ECS setup, where the default scaling mechanism is reactive rather than proactive. Whether it is CPU utilization or load balancer hits, it is not efficient at dealing with unforeseen, volatile loads, and I end up wasting tons of money after the load is gone.

I’ve hosted an application in AWS ECS backed by an EC2 Auto Scaling Group (ASG), which scales on CPU utilization, but this scaling metric is inefficient and expensive because my loads are volatile: by the time ECS manages to scale up, the load is already gone. I also face issues getting ECS to scale “faster”, so that it reacts to incoming job load rather than resource utilization. On top of that, ECS has a limit of 300 tasks that can be in the pending state, so I’m determined to perform some “magic” to fix this problem!

Proposed AWS ECS design

All Hail! AWS Swiss Army Knife!

Reasons why I picked Lambda

  • Performs custom calculations
  • Integrates quickly with almost every aspect of AWS, without changing the current design (future-proofing)
  • Cheap (hopefully free)
  • Can perform API calls to the internal load balancer

The following flow illustrates the whole scaling process:

  1. Lambda performs API calls to the internal load balancer to retrieve “pending jobs” and “current worker count” (see the sketch after this list)
    • Calculate the ECS cluster’s “velocity”: pending jobs / current worker count
    • Calculate the EC2 cluster’s size “velocity”: (pending jobs / current worker count) / 300
      • 300 seconds equates to 5 minutes, the average time each job takes
    • Note that, by configuration, each EC2 instance can only host 8 worker tasks
  2. Lambda posts the custom metrics to CloudWatch
  3. The ECS task scaling policy is configured as a “target tracking policy” pointing at the custom metric
    • The target value is 300, since 300 seconds equates to the 5 minutes we need on average to process a job
  4. An EC2 scaling policy via capacity provider was configured, but to improve the scaling mechanism, I reverse-engineered how the capacity provider works and manipulated the metric data so that it pre-empts the ASG EC2 count, ready for ECS tasks to be deployed
    • Post EC2 cluster velocity * 100, since the capacity provider metric works in percentages
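To make steps 1 and 2 concrete, here is a minimal Python sketch of what such a Lambda handler could look like. This is my illustration rather than the exact production code: the /stats endpoint on the internal load balancer, the metric names and the namespace are all placeholder assumptions.

import json
import urllib.request

import boto3

AVG_JOB_SECONDS = 300  # 300 seconds = 5 minutes, the average time per job

def handler(event, context):
    # Pull queue state from the internal load balancer (hypothetical endpoint)
    with urllib.request.urlopen("http://internal-lb.local/stats") as resp:
        stats = json.loads(resp.read())
    pending = stats["pending_jobs"]
    workers = max(stats["current_worker_count"], 1)  # avoid divide-by-zero

    # The two "velocity" values described in the flow above
    ecs_velocity = pending / workers
    ec2_velocity = ecs_velocity / AVG_JOB_SECONDS

    # Publish both as custom CloudWatch metrics; the capacity-provider-style
    # value is posted as a percentage (velocity * 100)
    boto3.client("cloudwatch").put_metric_data(
        Namespace="Custom/EcsScaling",
        MetricData=[
            {"MetricName": "EcsVelocity", "Value": ecs_velocity},
            {"MetricName": "Ec2CapacityPercent", "Value": ec2_velocity * 100},
        ],
    )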

With this solution, the design was able to scale from 100 to 5,000 ECS tasks in 3-5 minutes based on incoming load, and it scales back down when the load is gone. Without the Lambda, it took nearly 40 minutes to do so!

Lambda utilization was not huge, and it is only scheduled by EventBridge to run every minute, so the cost involved is essentially FREE!!!!
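And for step 3, here is a sketch of how the target tracking policy could be pointed at that custom metric with boto3. Again, this is under assumptions: the cluster and service names are placeholders, the service is assumed to already be registered as a scalable target, and the cooldowns are examples.

import boto3

autoscaling = boto3.client("application-autoscaling")

# Scale the ECS service's DesiredCount on the custom velocity metric,
# targeting 300 (roughly 5 minutes of backlog per worker)
autoscaling.put_scaling_policy(
    PolicyName="queue-velocity-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-worker-service",  # placeholder names
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 300.0,
        "CustomizedMetricSpecification": {
            "MetricName": "EcsVelocity",
            "Namespace": "Custom/EcsScaling",
            "Statistic": "Average",
        },
        "ScaleInCooldown": 60,   # example values only
        "ScaleOutCooldown": 60,
    },
)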

Automation Stack Design Idea – How do we automate the automation SAFELY?

In this post, I would like to share my idea of an Enterprise Automation Stack from a tooling perspective. I will also explain a little of the thinking behind why I feel these tools or concepts are needed in every automation build.

In every automation stack, we start our journey at the lowest level: the worker. Here we design and create our scripts or playbooks to perform a single set of tasks that achieves one desired outcome. The tool or technology adopted must be capable of interacting directly with the endpoint, performing tasks on the device or machine, and visually providing feedback on the outcome.

But as more playbooks or scripts are created, you’ll realize that even though most of the tasks are automated, a human is still required to run those automations, eyeballing the output to ensure each task performs as it should and the output is as intended – too much work for me.

Puppet, Chef, Ansible, Ansible Tower/AWX and Windows DSC are some of the worker tools I know that are really powerful in terms of execution but very limited when logic or decision making is required – they are designed this way to keep the lower-tier tasks simple.

In my upcoming project, I am using AWX as my lower-tier automation tool to perform all the heavy lifting in my automation stack.

My use case is a network OS upgrade, which has three main tasks to perform:

  • Image download and upload to the device, with the following safety checks:
    • Pre-check of available disk space
    • Download the image from the vendor
    • Upload it to the device
    • Post-check to ensure the image was uploaded successfully
  • Upgrading the device:
    • Pre-check to ensure the device disk contains the desired image
    • Execute the upgrade
    • Ensure the device is up after the reboot
    • Post-check to make sure the device is at the desired version
  • Post capture:
    • Capture logs, config and other required operational outputs
Three different AWX workflows
A set of tasks in an AWX workflow to achieve one outcome

See, with Ansible doing the heavy lifting, you surely won’t want it to also perform heavy decision making. Ansible does have workflows, but by nature Ansible is supposed to follow the “KISS” approach.

This is where a second layer of automation is required – Automate the automation! Yay!

This second layer of automation, which I call the “Orchestrator”, is the worker’s manager: it decides and informs the worker what to run, where to run, when to run and how it should be run.

  • What to run – which job is it supposed to do now?
  • Where to run – on which devices should it perform that job?
  • When to run – at what time should it run this job?
  • How to run – must it run a certain job first before running this job?

Tools like StackStorm, Node-RED and IFTTT are event-triggering or workload-orchestration tools which enable you to stitch different use cases together to achieve one or more business use cases.

For me, I’m using Node-RED, an IFTTT-style IoT tool which has the capability to deal with such logic in the simplest manner! – if the tool is hard to use, throw it away

Snapshot of a Node-RED flow stitching multiple workflows together

What happens is that Node-RED periodically polls tickets from a centralized API server, determines each ticket’s state and, based on that state, performs API calls to AWX to run the corresponding task for that ticket, then monitors the job that AWX is running. This allows me to control the state of the ticket from start to end while monitoring what AWX is doing and the job status. The sketch below shows the equivalent poll-and-dispatch logic.
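Node-RED implements this in flow nodes, but here is a minimal Python sketch of the same logic. The ticket server URL, the credentials and the state-to-template mapping are placeholder assumptions of mine; the launch call uses AWX’s standard /api/v2/job_templates/<id>/launch/ endpoint.

import time

import requests

TICKET_API = "http://ticket-server.local/api/tickets"  # hypothetical ticket server
AWX_API = "http://awx.local/api/v2"
AWX_AUTH = ("admin", "password")  # use a token/credential store in practice

# Map a ticket's state to the AWX job template that handles it (example IDs)
STATE_TO_TEMPLATE = {"download": 10, "upgrade": 11, "post_capture": 12}

def poll_once():
    for ticket in requests.get(TICKET_API).json():
        template_id = STATE_TO_TEMPLATE.get(ticket["state"])
        if template_id is None:
            continue  # no automation step for this state
        # Launch the AWX job template for this ticket's current state
        job = requests.post(
            f"{AWX_API}/job_templates/{template_id}/launch/",
            auth=AWX_AUTH,
            json={"extra_vars": {"ticket_id": ticket["id"]}},
        ).json()
        print(f"ticket {ticket['id']}: launched AWX job {job['id']}")

while True:
    poll_once()
    time.sleep(60)  # Node-RED does the same with a timed inject node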

Now, with all these tools performing actions and tasks at different levels, I need a way to monitor and visualize the whole progress! Node-RED and AWX do provide web UIs, but they do not let me know what the state of the deployment is, what each automation tool is doing, and what the current automation state is.

This is where a visualization tool is needed. Tools like Grafana and Kibana have powerful visualization capabilities, but they do not fit my requirements for custom APIs, storage and custom UI development.

For my case, I’m using my company’s platform, which gives me the flexibility to host an Angular application and create custom APIs which connect back to a MongoDB. These three components form a custom application that serves as my visualization layer.

I can view all the tickets created at one glance

I know, at a detailed level, what Ansible or Node-RED is doing and the state of the automation

With this design, I have the flexibility to do basically anything when it comes to automation:

  1. Perform a set of low-level tasks to achieve an outcome with AWX
  2. Stitch multiple tasks together to achieve a business outcome with Node-RED working with AWX
  3. Visualize at a high level what the system is doing, where it is right now, what the next step is, when the next step happens and how it is going to be performed

Along the way of setting this automation stack up from ground zero, I learned a lot, explored so many design approaches and tools, and now I get to see every level working together – major brain orgasm!

Coming up, I will show you how I build my automation stack and how I incorporate IaC (Infrastructure as Code) to build it up and dispose of or replicate it as I please.

Accelerate ITOps – Why we should adopt not just one methodology to achieve NoOps

In today’s world of IT, it’s all about speed, innovation and automation – and this applies to both the products and the companies that make them. Traditionally, ITOps has been about ensuring that systems and applications run well for the business, but with the introduction of DevOps and cloud technology, ITOps people can find it hard to relate, or may think it totally impractical to trust methodologies which promote continuous change, and seemingly unpredictable stability, in their environment.

As automation has always been my passion in my journey of being the laziest person alive and letting the robot do my job, understanding these methodologies helps me expand my knowledge horizon, improve the way I do things and strengthen my beliefs. I always scratch my head and ask myself, “How can I make ITOps and DevOps work together so that I can achieve NoOps?” and “What kind of tools do I need?”

I personally like to reference the diagram below to explain what automation is:

https://i.stack.imgur.com/m0tMY.jpg

DevOps practices focus on continuous change, continuous updates, continuous delivery, team communication and feature orientation, while automation is about how scripts and tools can help us minimize repetitive human work, minimize the time taken to perform a job and decrease human error – the idea of letting the human control the robot.

So to recap…..

  • ITOps is about ensuring systems and applications are in good shape so that business operates as usual and the lights are always on
  • DevOps is about practices which focus on continuous changes and updates, or the idea of pushing new changes to an application as fast and as often as possible
  • Automation is about letting the machine help the human do the job, so as to lessen human mistakes, reduce the time taken and increase productivity

If I look at things from another angle, isn’t it the case that ITOps’ repetitive work generates automation ideas, which we can then roll out using DevOps practices as quickly as possible to help ITOps? – did I just reach NoOps?

Well, I would like to keep this post short in explaining my view on why these methodologies work together. I will leave the rest to my upcoming post, which will go fully in depth on the technical concepts, explaining my point of view on how I achieve NoOps and which tools I’ve used, so stay tuned!

Ansible – net_put error

An unknown and weird error faced while using Ansible’s network module “net_put” to transfer an image file from the local machine to a remote network device:

The full traceback is:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ansible/executor/task_executor.py", line 147, in run
    res = self._execute()
  File "/usr/lib/python3.6/site-packages/ansible/executor/task_executor.py", line 660, in _execute
    result = self._handler.run(task_vars=variables)
  File "/usr/lib/python3.6/site-packages/ansible/plugins/action/net_put.py", line 131, in run
    result['changed'] = changed
UnboundLocalError: local variable 'changed' referenced before assignment
fatal: [ios]: FAILED! => {
  "msg": "Unexpected failure during module execution.",
  "stdout": ""
}

Seriously….. who understands this error?!

Solution

As the error message is not meaningful, I wouldn’t want to assume my scenario fixes every case. For my case, I scraped through Ansible core’s Python source and found the code that raises the error. Apparently the Python scp library was not installed, and to fix it a simple install command will do:

pip3 install scp

For those using Ansible AWX, the interim solution is to perform “docker exec -it < awx-task id > /bin/bash” and execute the above command, but my recommended long-term solution is to rebuild the Docker image with scp installed and store it in your local Docker registry.

This will prevent the same issue from recurring after the Docker VM reboots.
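If you want to confirm the fix before rerunning the playbook, a quick sanity check of my own (not part of any official procedure) is to verify from Python inside the awx_task container that the library net_put was missing now imports:

# run inside the awx_task container after "pip3 install scp"
import scp
print("scp module OK")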

How to install AWX Isolated Node

Yes, you’re right! Finally, documentation on how to deploy an AWX isolated node!

I could not find proper documentation on AWX isolated node deployment, as the AWX open-source community does not support this particular feature (I don’t really know why), so I decided to spend some time cracking my head on how to get it done, since AWX is the upstream of Ansible Tower.

Background

AWX is the upstream, open-source version of Ansible Tower. Ansible Engine on its own cannot store passwords securely, its RBAC relies on the OS level, and there is no centralized way of managing inventory, nor any API to bring automation to the next level.

AWX and Ansible Tower comprise many different features; here are some of the main ones:

  • Store sensitive information securely
  • Role-based access control (RBAC)
  • Grab playbooks or scripts from git
  • Grab inventory from existing inventory software

AWX runs playbooks on the host server it is installed on, but this can be a challenge in an actual production environment where there are different domains or zones, and opening firewall ports from one server to all managed nodes (whether network devices or servers) can be difficult.

Ansible Tower has a feature called isolated nodes, where one centralized Tower orchestrates the playbook but execution is performed on another server. This feature provides multi-tenancy capabilities to any Tower deployment.

Quoted from the original blog post on Red Hat’s website:

“A Tower Isolated Node is a headless Ansible Tower node that can be used for local execution capacity, either in a constrained networking environment such as a DMZ or VPC, or in a remote data center for local execution capacity. The only prerequisite is that there is SSH connectivity from the Tower Cluster to the Isolated Node. The Tower Cluster will send all jobs for the relevant inventory to the Isolated Node, run them there, and then pull the job details back into Ansible Tower for viewing and reporting.”

https://developers.redhat.com/blog/2017/12/20/understanding-ansible-tower-isolated-nodes/

Prerequisites…..

  • An existing AWX deployment
    • You can follow the official AWX installation guide and install on your preferred platform. I’ll be using the docker-compose method
  • A RHEL/CentOS 7 server (not tested on other Linux distributions)
  • Internet connection (or your own way to deliver the package dependencies)
  • SSH connectivity from AWX to the isolated node
  • Passwordless SSH key login configured from AWX to the isolated node

Overview

The following diagram illustrates the whole process and data flow, to give you an overview of how an isolated node works with AWX.

Setup – Isolated Node

Perform initial update

sudo yum update -y

Install the other dependency packages
(P.S. I’m using Python 2 for this POC; you could change to Python 3 if required)

sudo yum install epel-release python-pip python-devel -y

Install Ansible and rsync

yum install ansible rsync -y

Install ansible-runner

pip install ansible-runner pywinrm --user

Set up file system folders

mkdir /var/lib/awx
chown awx:awx /var/lib/awx

Note: the default user login is “awx” – change it to your own if required

Configure – AWX

You’ll need to access the container running AWX’s task image in order to use the awx-manage command.

I’m using the docker-compose method to deploy AWX

List the running containers

docker ps

Locate the container ID which is running the awx_task image

Access the awx_task container

docker exec -it 07694949d898 /bin/bash

Create an instance in AWX

awx-manage provision_instance --hostname <hostname/ip of isolated node>

Add the newly created instance into an isolated group

awx-manage register_queue --queuename <your queue name> --hostname <hostname/ip of isolated node> --controller <controller tower name e.g. tower>

Congratulations! You’ve successfully created an isolated node with AWX. Do note: you can go to the AWX web UI -> Instance Groups to view your newly created isolated instance group!!
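If you prefer to double-check from the CLI first, awx-manage can also list the registered instances from inside the same awx_task container (output format may vary by AWX version):

awx-manage list_instances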

Locate a file which contains a string

I face a ton of situations where I hit an error while performing an installation on a Linux system. To solve the error, I often need to locate a file (it can be an installation file or a system file) which contains a particular string (the error message or a particular configuration line) in order to determine what I need to do next.

Problem: I’m supposed to install an IBM product on a RHEL system. I know I successfully installed the product in my dev lab, but when performing the installation in the production environment, the installer shell script keeps prompting the error “unknown command: installation.sh”. After troubleshooting, I realized the customer’s RHEL system is hardened! This means I need to change every affected call to “sh <script_name>” for each occurrence of this error.

Solution: As I do not have visibility of all the scripts which this installation file may potentially call, I need a command which allows me to search for the offending script whenever I face the error.

grep -iHrn "<string to locate>" .

This command searches the path “.” (the current working directory) recursively and returns the path of each file containing the string, together with the matching line number.
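For example, with the error from my case above, the search would look like this:

grep -iHrn "unknown command" .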

After which, you can use the following command to open the file directly at that line:

vi +<line_number> <path to file>

For my case, I go to that line and add “sh” before the shell script name.

What Is Ansible and How to get started

We’ve all heard the phrases “Ansible automation”, “Ansible can do this” or “Ansible could be used to automate our operations”, so we start by asking our ‘Best Friend’ aka ‘The place where all questions are answered’, which is GOOGLE.

So we start googling “Ansible automation”, “Ansible tutorial” or maybe “How to automate using Ansible”, then proceed to watch some YouTube videos hoping to learn what Ansible is all about, but before we know it, it all ends there, and we carry on our work life not knowing how Ansible could help our day-to-day operational work.

What is Ansible?


I’m pretty sure you’ve scraped through the whole of Google, and the answer to this question is always the same: “Ansible is a configuration tool which is agentless and works via SSH/API/WinRM, blah blah blah”.

I’m not going to bore you with that, so let me give you my definition of Ansible: it is simply an “automation tool”. Now I’m pretty sure you’ll be like, “This guy is a joke! You think I’m so dumb I don’t know that!? Oh, c’mon”.

Hold your horses, my friends, let me explain. The key takeaway I want you to understand is that Ansible is a tool where people from different corners of the IT field collaborate to create different functions, written in Python, that perform actions on a device, and these functions are then exposed in a human-readable language: YAML.

Ansible is an open-source tool whose contributors from different parts of the IT field have gone through the hardship of developing these “functions” – which in Ansible we call “modules” – using Python, and have contributed them back to the community for others to use.

Don’t Reinvent the Wheel, Unless You Plan on Learning About Wheels

Getting Started with Ansible


Ok…. Now we know that someone has created these “functions” for us; now what? The fun thing is, rather than getting an answer here, why not ask yourself that question instead?

Which are the most boring, repetitive and brainless workflows you’re tasked to do, where your first reaction is “Shit! I have to do it again!? Oh, c’mon, it’s so boring! My life sucks”?

What is the workflow which is so static that you could run it in another environment with the same steps and get the same output?

Lastly, what are the workflows which make you feel like you’re turning into a 1980s robot?

These are the workflows you could recite step by step from memory, without a document to follow. Instead of reciting them, why not transfer all these workflows into individual files, store them somewhere, and when we need them, let Ansible do the work for us? In Ansible, we call this “file” a playbook.

An example of a playbook which clearly captures a boring workflow is “Back up Cisco device configuration”. This is the most static, boring and time-consuming workflow, but sorry to say, it is needed to ensure we have the most up-to-date configuration copy when restoring a device:

  1. Ensure that the local directory is created
  2. Connect to the device and issue the “show run” command
  3. Save the output to the local directory

cisco_ios_backup.yml

---
- hosts: cisco
  gather_facts: no
  connection: local

  tasks:
    - name: Create local backup directory
      file:
        path: "/opt/cisco"
        state: directory
        mode: u=rwx,g=r,o=r

    - name: Issue "show run" command
      ios_command:
        commands: show run
      register: run

    - name: Save config to local directory
      copy:
        content: "{{ run.stdout[0] }}"
        dest: "/opt/cisco/{{ inventory_hostname }}"

inventory file

[cisco]
cisco1 ansible_host=10.10.10.1 ansible_network_os=ios
cisco2 ansible_host=10.10.10.2 ansible_network_os=ios
cisco3 ansible_host=10.10.10.3 ansible_network_os=ios
How Ansible works is that, based on the specified inventory file, it runs this playbook in sequence for each device listed in it. Tadah, job’s done! Boring stuff, bye-bye!
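To kick it off, point ansible-playbook at the inventory (assuming the inventory file is simply named “inventory”):

ansible-playbook -i inventory cisco_ios_backup.yml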

Just imagine the pain you’d go through using the traditional method every morning – and the most sickening part is that I have 500 devices to back up!

It never ends here….

There are many modules and commands available to execute on remote hosts with Ansible. Instead of asking Google what Ansible can deliver, why not start googling “Using Ansible to back up Cisco devices” or “Use Ansible to mass-provision ESXi VMs”?

Start googling the use case rather than asking what the automation tool can deliver, and start documenting your workflow rather than expecting Ansible to know it. Tell Ansible what you want to do rather than expecting Ansible to know what to do.
