Cleaning Apps from Mesos Using Marathon Recipe

Problem

It is common to see services deployed to Mesos that are no longer in use, or that were over-provisioned to support testing such as QA validation or performance runs. Leaving those services running results in unnecessary cost.

Cleanup Process

Marathon offers a convenient REST API that supports scaling and resizing apps running on Mesos, which allows the cleanup process to be automated (see the Marathon REST API). The cleanup policy can be specified as an SLA per environment, for example DEV (all instances are resized to 0 nightly) and TEST and STAGE (all instances are resized to 1).
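The per-environment SLA described above can be sketched as a small lookup table that yields the JSON body for a resize request. This is a minimal illustration, not part of the original script: the names SLA_INSTANCES and resize_payload are hypothetical, and the targets are simply the example values from the text.

```python
import json

# Hypothetical SLA table built from the examples in the text:
# DEV is resized to 0 nightly, TEST and STAGE are resized to 1.
SLA_INSTANCES = {"dev": 0, "test": 1, "stage": 1}


def resize_payload(env):
    """Return the JSON body that resizes an app to the SLA target of `env`."""
    if env not in SLA_INSTANCES:
        raise ValueError("unknown environment: {}".format(env))
    return json.dumps({"instances": SLA_INSTANCES[env]})


if __name__ == "__main__":
    for env in ("dev", "test", "stage"):
        print(env, resize_payload(env))
```

Keeping the targets in one table means a policy change touches a single line rather than every scheduled job.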

Steps to implement the cleanup process with Python and Jenkins:

  • install python and pip (version used for this guide is 3.7+):

    brew install python

  • install the requests library:

    pip3 install requests
    
  • create the cleanup.py script with the following content:

    #!/usr/bin/env python3
    import sys
    import requests
    
    
    def delete_failed_tasks(app_id):
        """Delete every task of the app that is in the TASK_FAILED state."""
        app_tasks = requests.get(address + "/apps/" + app_id + "/tasks")
        if app_tasks.status_code != 200:
            print('error GET /v2/apps tasks {}'.format(app_tasks.status_code))
            sys.exit(1)
        for task in app_tasks.json()['tasks']:
            print("task={}".format(task))
            print('state={}\n'.format(task['state']))
            if task['state'] == 'TASK_FAILED':
                # Cancel deployments that affect this app before deleting the task.
                delete_deployment(app_id)
                delete_task = requests.delete(
                    address + "/apps/" + app_id + "/tasks/" + task['id'] + "?force=true")
                if delete_task.status_code != 200:
                    print('error DELETE /v2/apps task {}'.format(
                        delete_task.status_code))
                    sys.exit(1)
                print('DELETE /v2/apps task success {}'.format(delete_task))
      
      
    def delete_deployment(app_id):
        """Force-delete every running deployment that affects the app."""
        deployments = requests.get(address + "/deployments/")
        if deployments.status_code != 200:
            print('error GET /v2/deployments {}'.format(deployments.status_code))
            sys.exit(1)
        for deployment in deployments.json():
            if app_id in deployment['affectedApps']:
                print("id={}".format(deployment['id']))
                delete_request = requests.delete(
                    address + "/deployments/" + deployment['id'] + "?force=true")
                # A forced deployment delete is expected to answer 202.
                if delete_request.status_code != 202:
                    print('error DELETE /v2/deployments {}'.format(
                        delete_request.status_code))
                    sys.exit(1)
                print('DELETE /v2/deployment success {}'.format(delete_request))
      
      
    if len(sys.argv) < 3:
        print("usage: cleanup.py <environment> <instances> [filter]")
        sys.exit(1)

    env = sys.argv[1]
    instances = int(sys.argv[2])
    # The filter is optional. Assumption: when it is omitted, the empty
    # string matches every app id, so all apps in the environment are resized.
    path_to_app = sys.argv[3] if len(sys.argv) > 3 else ""
    print('environment: ', env)
    environment_address_d = {"dev": "http://dev.mesos:8080/v2",
                             "test": "http://test.mesos:8080/v2",
                             "stage": "http://stage.mesos:8080/v2"}
    # .get() returns None for an unknown environment instead of raising KeyError.
    address = environment_address_d.get(env)
    print('address: ', address)

    if address is None:
        print("invalid environment")
        sys.exit(1)

    resp = requests.get(address + "/apps")
    if resp.status_code != 200:
        # This means something went wrong, so stop before parsing the body.
        print('error GET /v2/apps {}'.format(resp.status_code))
        sys.exit(1)
    for service in resp.json()['apps']:
        if path_to_app in service['id']:
            print('current id={} instances={}\n'.format(
                service['id'], service['instances']))

            delete_failed_tasks(service['id'])

            # Send the new instance count as a JSON body.
            response = requests.put(address + "/apps/" + service['id'],
                                    json={"instances": instances})

            if response.status_code != 200:
                print('error PUT /v2/apps {}'.format(response.status_code))
            else:
                print('updated id={} set to instances={}\n'.format(
                    service['id'], instances))
    

    NOTE: the script above defines two functions that remove the failed tasks and the deployments associated with an app. Alternatively, add force=true to the PUT request that resizes the app, at the risk of cancelling healthy deployments as a side effect.

  • script execution using arguments: python3.7 cleanup.py [environment. Ex: dev|test|stage] [desired number of instances. Ex: 0] [filter matched as a substring of the app id. Ex: /middleware/employee/crew|crew|employee]

  • all services under employee path: python3.7 cleanup.py dev 0 /middleware/employee

  • a single service: python3.7 cleanup.py dev 1 /middleware/employee/crew-app-landing-v1

  • create a Jenkins project using the following Jenkinsfile:

    pipeline {
        agent {
            docker {
                image 'python:3.7-slim'
            }
        }
        triggers {
            // Run once a night, at a hashed minute within the midnight hour.
            cron('H 0 * * *')
        }
        stages {
            stage('Build') {
                steps {
                    sh 'pip3 install requests'
                    sh 'python3.7 cleanup.py dev 1 path/to/api/'
                }
            }
        }
    }
    

    NOTE: running the command python3.7 cleanup.py dev 1 path/to/api/ resizes every app whose id contains path/to/api/ to 1 instance.
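Because the filter is a plain substring match, it can help to preview which apps a run would touch before any request is sent. The following is a minimal dry-run sketch under stated assumptions: plan_resizes is a hypothetical helper that is not part of cleanup.py, the second app id in the usage example is made up for illustration, and the URL is built with the same string concatenation the script uses. The optional force flag appends the force=true query parameter mentioned in the earlier note.

```python
from urllib.parse import urlencode


def plan_resizes(address, app_ids, path_filter, instances, force=False):
    """Return (url, payload) pairs for every app id that contains path_filter.

    Nothing is sent; the result only describes the PUT requests a run would make.
    """
    query = "?" + urlencode({"force": "true"}) if force else ""
    plan = []
    for app_id in app_ids:
        # Same substring test as cleanup.py: `path_to_app in service['id']`.
        if path_filter in app_id:
            url = address + "/apps/" + app_id + query
            plan.append((url, {"instances": instances}))
    return plan


if __name__ == "__main__":
    ids = ["/middleware/employee/crew-app-landing-v1",
           "/middleware/payroll/api"]  # second id is a made-up example
    for url, payload in plan_resizes("http://dev.mesos:8080/v2", ids,
                                     "/middleware/employee", 0):
        print("PUT", url, payload)
```

A short filter such as crew also matches any unrelated app whose id happens to contain that string, so reviewing the plan first is a cheap safeguard.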