
Packer and Ansible testing with Hyper-V (on Windows 10)

Why?

With almost all of our clients now preferring AWS and Azure for hosting VMs / Docker containers, we have to manage a lot of AMIs / VM images. Ensuring that these AMIs are correctly configured, hardened and patched is a critical requirement. To do this time- and cost-effectively, we use packer and ansible. There are solutions such as Amazon’s ECS that extend the boundary of the opaque cloud all the way to the containers, which has a number of benefits but does not currently meet a number of non-functional requirements for most of our clients. If those non-functional requirements were gone, or were met by something like AWS ECS, it would be hard to argue against simply using terraform and ECS – removing our responsibility for managing the docker host VMs.

Anyway, we are making some updates to our IaaS code base, which includes a number of new requirements and code changes to our packer and ansible code. To make these changes correctly and quickly I need a build/test cycle that is as short as possible (shorter than spinning up a new EC2 instance). Fortunately, one of the benefits of packer is its ‘cloud agnosticism’… so theoretically I should be able to test 99% of my new packer and ansible code on my Windows 10 laptop using packer’s Hyper-V builder.

Setting up

I am running Windows 10 Pro on a Dell XPS 15 9560. VirtualBox is the most common go-to option for local VM testing, but that’s not an option if you are already running Hyper-V (which I am). So to get things started we need to:

  1. Have a git solution for Windows – I am using Microsoft’s VS Code (which really is a great open-source tool from M$)
  2. Install packer for windows, ensuring the executable is in the Windows PATH
  3. Create a VM in Hyper-V to act as a base template (I am using CentOS 7 minimal, as we use CentOS AMIs on AWS – https://www.centos.org/download/)
  4. Install Hyper-V Linux Integration Services on the CentOS 7 base VM (this is required for Packer to be able to determine new VMs’ IP addresses) – if you are stuck with packer failing to connect with SSH to the VM and you are using a Hyper-V switch, this will most likely be the issue (a rough check is shown after the builder snippet below)
  5. Add a Hyper-V builder to our packer.json (as below)
...
  {
    "clone_from_vm_name": "sonet-ami-template",
    "shutdown_command": "echo 'packer' | sudo -S shutdown -P now",
    "headless": true,
    "ssh_password": "{{user `ssh_password`}}",
    "ssh_username": "{{user `ssh_username`}}",
    "switch_name": "Default Switch",
    "type": "hyperv-vmcx"
  }
...
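
If packer hangs waiting for SSH (step 4 above), a rough check I’d run on the CentOS 7 template VM is below – on CentOS 7 most of the integration services are built into the kernel, and the hyperv-daemons package provides the KVP daemon that reports the guest IP back to Hyper-V:

# install the Hyper-V guest daemons and confirm the hv_* modules are loaded
sudo yum install -y hyperv-daemons
sudo systemctl status hypervkvpd
lsmod | grep hv_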

Now, assuming the packer and ansible code is in a functional state, I can build a new VM and run packer + ansible via PowerShell (run with administrative privileges) with:

packer build --only hyperv-vmcx packer.json
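
Before kicking off a full build it is worth validating the template – packer validate catches JSON and variable errors in a couple of seconds, which keeps the cycle even shorter:

packer validate packer.json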

AWS RDS (Oracle 12c) Offsite Backups

A lot of people need to do offsite backups for AWS RDS, which can be done trivially within AWS. But if you need offsite backups to protect you against things like an AWS account breach or AWS-specific issues, offsite backups must include diversification of suppliers.

I am going to use Amazon’s Data Migration service to replicate AWS RDS data to a VM running in Azure and set up snapshots/backups of the Azure hosts.

The new (2018) AWS Data Migration Service solves the offsite RDS backup problem

The steps I used to do this are:

  1. Set up an Azure Windows 2016 VM
  2. Create an IPSec tunnel between the Azure Windows 2016 VM and my AWS Native VPN
  3. Install matching version of Oracle on the Windows 2016 VM
  4. Configure Data Migration service
  5. Create a data migration and continuous replication task
  6. Snapshots/Backups and Monitoring
  7. Debug and Gotchas

1,2 – Set up Azure Windows 2016 VM and IPSec tunnel

Create a network in Azure and place a VM in the network with 2 interfaces. One interface must have a public IP – call this one ‘external’ – and the other interface will be called ‘internal’. Once you have the public IP address of your Windows 2016 VM, create a ‘Customer Gateway’ in your AWS VPC pointing to that IP. You will also need a ‘Virtual Private Gateway’ configured for that VPC. Then create a ‘Site-to-Site VPN connection’ in your VPC (it won’t connect for now but create it anyway). Configure your Azure Win 2016 VM to make an IPSec tunnel by following these instructions (the instructions are for 2012 R2 but the only tiny difference is some menu items):
https://docs.aws.amazon.com/vpc/latest/adminguide/customer-gateway-windows-2012.html#cgw-win2012-download-config. Once this is completed, both your AWS site-to-site connection and your Azure VM are trying to connect to each other. Ensure that the Azure VM has its security groups configured to allow your AWS site-to-site VPN to reach the Azure VM (I am not sure which ports and protocols specifically – I just white-listed all traffic from the two AWS tunnel end points). Once this was done it took around 5 mins for the tunnel to come up (I was checking the status via the AWS Console). I also found that it requires traffic to be flowing over the link, so I was running a ping -t <aws_internal_ip> from my Azure VM. Also note that you will need to add routes to your applicable AWS route tables and update AWS security groups for the Azure subnet as required.
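
To keep traffic flowing over the tunnel without leaving a console ping running, one low-tech option is a scheduled task on the Azure VM (the IP below is a placeholder – use the AWS-side private address you are replicating to):

# ping the AWS-side private IP for ~30 seconds every 5 minutes to keep the tunnel up
schtasks /Create /TN "AwsVpnKeepAlive" /SC MINUTE /MO 5 /TR "ping -n 30 10.0.0.10"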

3 – Install matching version of Oracle on the Windows 2016 VM

4,5 – Configure Data Migration service and migration/replication

Log into your AWS console, go to ‘Data Migration Service’ / ‘DMS’ and hit get started. You will need to set up a replication VM (well, at least pick a size, security group, type etc). Note that the security group that you add the replication host to must have access to both your RDS and your Azure DBs – I could not pick which subnet the host went into, so I had to add routes for a couple more subnets than expected. Next you will need to add your source and target databases. When you add in the details and hit test, the wizard will confirm connectivity to both databases. I ran into issues on both of these points because of not adding the correct security groups, the Windows firewall on the Azure VM, and my VPN link dropping due to no traffic (I am still investigating a fix better than ping -t for this). Next you will be creating a migration/replication task; if you are going to be doing ongoing replication you need to run the following on your Oracle RDS db:

  • exec rdsadmin.rdsadmin_util.set_configuration('archivelog retention hours', 24);
  • exec rdsadmin.rdsadmin_util.alter_supplemental_logging('ADD','ALL');
  • exec rdsadmin.rdsadmin_util.alter_supplemental_logging('DROP','PRIMARY KEY');

You can filter by schema, which should provide you with a drop-down box to select which schema/s. Ensure that you enable logging on the migration/replication task (if you get errors, which I did the first couple of attempts, you won’t be fixing anything without the logs).

6 – Snapshots and Monitoring

For my requirements, daily snapshots/backups of the Azure VM will provide sufficient coverage. The Backup vault must be upgraded to v2 if you are using a Standard SSD disk on the Azure VM, see:
https://docs.microsoft.com/en-us/azure/backup/backup-upgrade-to-vm-backup-stack-v2#upgrade. To enable email notifications for Azure backups, go to the Azure portal, select the applicable vault, click on ‘view alerts’ -> ‘Configure notifications’ -> enter an email address and check ‘critical’ (or whichever types of email notification you want). Other recommended monitoring checks include: ping for VPN connectivity, a status check of the DMS task (using the aws cli), and an SQL query on the destination database confirming the latest timestamp of a table that should have regular updates.
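
For the DMS task check, something like the following aws cli call (run from any host with DMS read permissions) can be wired into icinga/cron:

# list replication task identifiers and their current status
aws dms describe-replication-tasks --query 'ReplicationTasks[*].[ReplicationTaskIdentifier,Status]' --output table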

7 – Debug and Gotchas

  • Azure security group allowing AWS vpn tunnel endpoint to Azure VM
  • Windows firewall rule on VM allowing Oracle traffic (default port 1521) from AWS RDS private subnet
  • Route tables on AWS subnets to route traffic to your Azure subnet via the Virtual Private Network
  • Security groups on AWS to allow traffic from Azure subnet
  • Stability of the AWS <–> Azure VM site-to-site tunnel requires constant traffic
  • The DMS replication host seems to go into an arbitrary subnet of your VPC (there is probably some default setting I didn’t see) – check this and ensure it has routes for the Azure site-to-site
  • Ensure the RDS Oracle database has the archive log retention and supplemental logs settings as per steps 4,5.
  • Azure backup job fails with ‘Currently Azure Backup does not support Standard SSD disks’. – upgrade backup vault: https://docs.microsoft.com/en-us/azure/backup/backup-upgrade-to-vm-backup-stack-v2#upgrade

Windows Remote Desktop Services 2016 Review

It has been a long while since I looked at RDS – with Azure, Office 365 and Server 2016 there seems to be a lot of new (or better) options. To get across some of the options I have decided to do a review of Microsoft’s documentation with the aim of deciding on a solution for a client. The specific scenario I am looking at is a client with low spec workstations, using Office 365 Business Premium (including OneDrive), Windows 10 and have a single Windows 2016 Virtual Private Server.

Some desired features:

  • Users should be able to use their workstations or the remote desktop server interchangeably
  • Everything done on workstations should be replicated to the RDS server and vice versa
  • Contention on editing documents should be dealt with reasonably
  • The credential for signing into workstations, email and remote services should be the same (ideally with a 2FA option for RDS)

Issues faced:

  • The Office 365 users were created several months before the RDS server was deployed
    • The Azure AD Connect service, which synchronizes users in an Active Directory deployment with Office 365 (Azure AD) users, is a one-way street: it assumes the ‘on-prem’ Active Directory objects already exist and only need to be created in Azure AD (Office 365) – see the workaround for this here
  • Office 365 licensing for ‘shared’ computers means that Office 365 Business Premium users can’t use a VPS – so Enterprise plans (which include shared computer activation) must be used.

How to configure Remote Desktop Services? After getting Active Directory installed and configured to sync with Azure AD, I now need to choose and implement the RDS configuration.

Starting with the Microsoft Doc we have the following options:

  • Session-based virtualization – Many users per host
  • VDI – A virtual machine for each user — note that if your server is already a VM this isn’t really an option (nested VMs are not ideal)

Based on our client’s situation, session-based virtualization makes much more sense for now. Next up – what are we going to publish to the users logging into the Remote Desktop Service?

  • Desktops – Providing users with the full desktop experience
  • RemoteApps – Users run apps that seem to be running locally but are in fact being served via RDS

Desktops makes sense for now. So – how do we set up a Session-based desktop for remote access by multiple users?

  1. Add the required roles to the RDS servers (see: https://docs.microsoft.com/en-us/windows-server/remote/remote-desktop-services/rds-deploy-infrastructure) – a minimal PowerShell sketch follows this list
  2. Create an AD service and link it to Office 365 with Azure AD Connect
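
For step 1, a minimal PowerShell sketch of a single-server session deployment (rds01.example.local is a placeholder for your server’s FQDN, and this assumes the RemoteDesktop module/RSAT tools are available):

# add the RDS roles by creating a session deployment, then a collection, on one server
Import-Module RemoteDesktop
New-RDSessionDeployment -ConnectionBroker "rds01.example.local" -WebAccessServer "rds01.example.local" -SessionHost "rds01.example.local"
New-RDSessionCollection -CollectionName "Desktops" -SessionHost "rds01.example.local" -ConnectionBroker "rds01.example.local"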

As Microsoft says:

  • You still must have an internet-facing server to utilize RD Web Access and RD Gateway for external users
  • You still must have an Active Directory and–for highly available environments–a SQL database to house user and Remote Desktop properties
  • You still must have communication access between the RD infrastructure roles (RD Connection Broker, RD Gateway, RD Licensing, and RD Web Access) and the end RDSH or RDVH hosts to be able to connect end-users to their desktops or applications.

After setting all of this up I am very happy with the results. The single source of truth for users must be the ‘on-prem’ AD. Syncing an on-prem AD to Office 365 is almost seamless, with some minor tweaks required that are fairly easy to find with some googling.


Sync users from Office 365 for a new Active Directory Install

Importing users from Office 365 into an on-prem AD can be required in cases where an organisation that has been using Office 365 wants to start using a Remote Desktop Service or similar. To reduce the number of passwords and provide single sign-on (or at least same sign-on), the Windows Server may have Azure AD Connect installed and be syncing with the business’s Office 365 account. The problem is that out of the box Azure AD Connect is a one-way street. It only creates objects on the Azure side – it does not import Office 365 users into the server’s Active Directory.

To get users from Office 365 created in a new Windows Active Directory Service:

## Connect to Office 365 with PowerShell
Set-ExecutionPolicy Unrestricted -Force # can be reverted at the end
$O365CREDS = Get-Credential
$SESSION = New-PSSession -ConfigurationName Microsoft.Exchange -ConnectionUri https://ps.outlook.com/powershell -Credential $O365CREDS -Authentication Basic -AllowRedirection
Import-PSSession $SESSION
Connect-MsolService -Credential $O365CREDS 
# For more details see: https://oddytee.wordpress.com/2013/03/21/connect-to-office-365-with-powershell/

## Get a list of Office 365 User Objects 
Get-User | Export-Csv "C:\O365Export.csv" -NoTypeInformation 

## Use that list to create new AD users - slightly modified from source material
Import-Csv C:\O365Export.csv -Encoding UTF8 | ForEach-Object {
    New-ADUser -Name ($_.Firstname + "." + $_.Lastname) `
        -SamAccountName ($_.Firstname + "." + $_.Lastname) `
        -GivenName $_.FirstName -Surname $_.LastName `
        -City $_.City -Department $_.Department -DisplayName $_.DisplayName `
        -Fax $_.Fax -MobilePhone $_.MobilePhone -Office $_.Office `
        -PasswordNeverExpires ($_.PasswordNeverExpires -eq "True") `
        -OfficePhone $_.PhoneNumber -PostalCode $_.PostalCode `
        -EmailAddress $_.SignInName -State $_.State -StreetAddress $_.StreetAddress `
        -Title $_.Title -UserPrincipalName $_.UserPrincipalName `
        -Enabled $True `
        -AccountPassword (ConvertTo-SecureString -String "Password" -AsPlainText -Force)
}
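
## Optional follow-up (my suggestion, not from the source material): revert the execution
## policy and force a password change at first logon, since every account above was
## created with the same placeholder password. The OU below is hypothetical.
Set-ExecutionPolicy Restricted -Force
Get-ADUser -Filter * -SearchBase "OU=ImportedUsers,DC=example,DC=local" | Set-ADUser -ChangePasswordAtLogon $true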



Changing OpenStack endpoints from HTTP to HTTPS

After deploying OpenStack Keystone, Swift and Horizon I have a need to change the public endpoints for these services from HTTP to HTTPS.

Horizon endpoint

This deployment is a single server for Horizon. The TLS/SSL termination point is on the server (no loadbalancers or such).

To get Horizon using TLS/SSL, all that needs to be done is adding the key, cert and CA and updating the vhost. My vhost now looks like this:

WSGISocketPrefix run/wsgi
<VirtualHost *:80>
	ServerAdmin sys@test.mwclearning.com
	ServerName horizon-os.mwclearning.com.au
	ServerAlias api-cbr1-os.mwclearning.com.au
	ServerAlias api.cbr1.os.mwclearning.com.au
	Redirect permanent / https://horizon-os.mwclearning.com.au/dashboard
</VirtualHost>

<VirtualHost *:443>
	ServerAdmin sys@test.mwclearning.com
	ServerName horizon-os.mwclearning.com.au
	ServerAlias api-cbr1-os.mwclearning.com.au
	ServerAlias api.cbr1.os.mwclearning.com.au

	WSGIDaemonProcess dashboard
	WSGIProcessGroup dashboard
	WSGIScriptAlias /dashboard /usr/share/openstack-dashboard/openstack_dashboard/wsgi/django.wsgi
	Alias /dashboard/static /usr/share/openstack-dashboard/static

	<Directory /usr/share/openstack-dashboard/openstack_dashboard/wsgi>
		Options All
		AllowOverride All
		Require all granted
	</Directory>

	<Directory /usr/share/openstack-dashboard/static>
		Options All
		AllowOverride All
		Require all granted
	</Directory>
	SSLEngine on
	SSLCipherSuite ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS:!RC4
	SSLCertificateKeyFile /etc/pki/apache/wildcard.mwclearning.com.au-201710.key.pem
	SSLCertificateFile /etc/pki/apache/wildcard.mwclearning.com.au-201710.cert.pem
	SSLCACertificateFile /etc/pki/apache/wildcard.mwclearning.com.au-201710.ca.pem
</VirtualHost>

With a systemctl restart httpd this was working….

Logging into Horizon and checking the endpoints under Project -> Compute -> API Access I can see some more public HTTP endpoints:

Identity	http://api.cbr1.os.mwclearning.com:5000/v3/
Object Store	http://swift.cbr1.os.mwclearning.com:8080/v1/AUTH_---

These endpoints are defined in Keystone; to see and edit them I can ssh to the keystone server and run some mysql queries. Before I do this I need to make sure that the swift and keystone endpoints are configured to use TLS/SSL.

Keystone endpoint

Again the TLS/SSL termination point is apache… so some modification to /etc/httpd/conf.d/wsgi-keystone.conf is all that is required:

Listen 5000
Listen 35357

<VirtualHost *:5000>
    WSGIDaemonProcess keystone-public processes=5 threads=1 user=keystone group=keystone display-name=%{GROUP}
    WSGIProcessGroup keystone-public
    WSGIScriptAlias / /usr/bin/keystone-wsgi-public
    WSGIApplicationGroup %{GLOBAL}
    WSGIPassAuthorization On
    LimitRequestBody 114688
    <IfVersion >= 2.4>
      ErrorLogFormat "%{cu}t %M"
    </IfVersion>
    ErrorLog /var/log/httpd/keystone.log
    CustomLog /var/log/httpd/keystone_access.log combined

    <Directory /usr/bin>
        <IfVersion >= 2.4>
            Require all granted
        </IfVersion>
        <IfVersion < 2.4>
            Order allow,deny
            Allow from all
        </IfVersion>
    </Directory>
Alias /identity /usr/bin/keystone-wsgi-public
<Location /identity>
    SetHandler wsgi-script
    Options +ExecCGI

    WSGIProcessGroup keystone-public
    WSGIApplicationGroup %{GLOBAL}
    WSGIPassAuthorization On
</Location>
	SSLEngine on
	SSLCipherSuite ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS:!RC4
	SSLCertificateKeyFile /etc/pki/apache/wildcard.mwclearning.com.au-201710.key.pem
	SSLCertificateFile /etc/pki/apache/wildcard.mwclearning.com.au-201710.cert.pem
	SSLCACertificateFile /etc/pki/apache/wildcard.mwclearning.com.au-201710.ca.pem
</VirtualHost>

<VirtualHost *:35357>
    WSGIDaemonProcess keystone-admin processes=5 threads=1 user=keystone group=keystone display-name=%{GROUP}
    WSGIProcessGroup keystone-admin
    WSGIScriptAlias / /usr/bin/keystone-wsgi-admin
    WSGIApplicationGroup %{GLOBAL}
    WSGIPassAuthorization On
    LimitRequestBody 114688
    <IfVersion >= 2.4>
      ErrorLogFormat "%{cu}t %M"
    </IfVersion>
    ErrorLog /var/log/httpd/keystone.log
    CustomLog /var/log/httpd/keystone_access.log combined

    <Directory /usr/bin>
        <IfVersion >= 2.4>
            Require all granted
        </IfVersion>
        <IfVersion < 2.4>
            Order allow,deny
            Allow from all
        </IfVersion>
    </Directory>
Alias /identity_admin /usr/bin/keystone-wsgi-admin
<Location /identity_admin>
    SetHandler wsgi-script
    Options +ExecCGI

    WSGIProcessGroup keystone-admin
    WSGIApplicationGroup %{GLOBAL}
    WSGIPassAuthorization On
</Location>
</VirtualHost>

I left the internal interface as HTTP for now…

Swift endpoint

OK, so the swift one is a bit different… it’s actually recommended to have an SSL termination service in front of the swift proxy, see: https://docs.openstack.org/security-guide/secure-communication/tls-proxies-and-http-services.html

With that recommendation from OpenStack and ease of creating an apache reverse proxy – I will do that.

# install packages
sudo yum install httpd mod_ssl

After installing, create a vhost /etc/httpd/conf.d/swift-endpoint.conf with the following contents:

<VirtualHost *:443>
  ProxyPreserveHost On
  ProxyPass / http://127.0.0.1:8080/
  ProxyPassReverse / http://127.0.0.1:8080/

  ErrorLog /var/log/httpd/swift-endpoint_ssl_error.log
  LogLevel warn
  CustomLog /var/log/httpd/swift-endpoint_ssl_access.log combined

  SSLEngine on
  SSLCipherSuite ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+3DES:!aNULL:!MD5:!DSS:!RC4
  SSLCertificateKeyFile /etc/httpd/tls/wildcard.mwclearning.com.201709.key.pem
  SSLCertificateFile /etc/httpd/tls/wildcard.mwclearning.com.201709.cert.pem
  SSLCertificateChainFile /etc/httpd/tls/wildcard.mwclearning.com.201709.ca.pem
</VirtualHost>
#restart apache
systemctl restart httpd

So now we should have an endpoint that will decrypt and forward HTTPS requests from port 443 to the swift listener on port 8080.
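
A quick sanity check from a client machine (this assumes the swift proxy pipeline includes the standard healthcheck middleware and that swift-os.mwclearning.com resolves to this host):

curl -I https://swift-os.mwclearning.com/healthcheck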

Updating internal auth

As keystone’s auth listener is the same for internal and external (one vhost), I also updated the internal address to match the FQDN, allowing for valid TLS.

Keystone service definitions

mysql -u keystone -h services01 -p
use keystone;
select * from endpoint;
# Update these endpoints, e.g.:
update endpoint set url='https://swift-os.mwclearning.com:8080/v1/AUTH_%(tenant_id)s' where id='579569...';
update endpoint set url='https://api-cbr1-os.mwclearning.com:5000/v3/' where id='637e843b...';
update endpoint set url='http://controller01-int.mwclearning.com:5000/v3/' where id='ec1ad2e...';
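
Depending on the release, the same change can usually be made without touching the database by using the openstack client (a sketch only – endpoint IDs truncated as above):

openstack endpoint list
openstack endpoint set --url 'https://swift-os.mwclearning.com:8080/v1/AUTH_%(tenant_id)s' 579569...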

Now after restarting the services all is well with TLS!


Testing Kubernetes and CoreOS

In the previous post I described some of the general direction and ‘wants’ for the next step of our IT Ops, summarised as:

  • Continuous Deployment – We need more automation and resiliency in our deployments, without adding our own code that needs to be changed when architecture and service dependencies change.
  • Automation of deployments – Deployments, rollbacks, service discovery, easy local deployments for devs.
  • Less time on updates – Automation of updates.
  • Reduced dependence on config management (puppet) – Reduce the number of puppet policies applied to hosts.
  • Image management – Image management (immutable post deployment).
  • Reduce baseline work for IT staff – IT staff have low baseline work, more room for initiatives.
  • Reduce hardware footprint – There can be no increase in hardware resource requirements (cost).

Start with the basics

Let’s start with the simple demo deployment supplied by the CoreOS team.

https://coreos.com/kubernetes/docs/latest/kubernetes-on-vagrant.html
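
From memory the setup is roughly the following (check the guide above for the current steps):

git clone https://github.com/coreos/coreos-kubernetes.git
cd coreos-kubernetes/multi-node/vagrant
cp config.rb.sample config.rb   # set etcd/controller/worker counts here
vagrant up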

That setup was pretty straightforward (as supplied demos usually are). Simple verification that the k8s components are up and running:

vagrant global-status 
#expected output assuming 1 etcd, 1 k8s controller and 2 k8s worker as defined in config.rb
id name provider state directory
----------------------------------------------------------------------------------------------------------
2146bec e1 virtualbox running VirtualBox VMs/coreos-kubernetes/multi-node/vagrant
87d498b c1 virtualbox running VirtualBox VMs/coreos-kubernetes/multi-node/vagrant
46bac62 w1 virtualbox running VirtualBox VMs/coreos-kubernetes/multi-node/vagrant
f05e369 w2 virtualbox running VirtualBox VMs/coreos-kubernetes/multi-node/vagrant

#set kubectl config and context
export KUBECONFIG="${KUBECONFIG}:$(pwd)/kubeconfig"
kubectl config use-context vagrant-multi
kubectl get nodes
#expected output
NAME STATUS AGE
172.17.4.101 Ready,SchedulingDisabled 4m
172.17.4.201 Ready 4m
172.17.4.202 Ready 4m

kubectl cluster-info
#expected output
Kubernetes master is running at https://172.17.4.101:443
Heapster is running at https://172.17.4.101:443/api/v1/proxy/namespaces/kube-system/services/heapster
KubeDNS is running at https://172.17.4.101:443/api/v1/proxy/namespaces/kube-system/services/kube-dns
kubernetes-dashboard is running at ...

*Note: It can take some time (5 mins or longer if CoreOS is updating) for the kubernetes cluster to become available. To see the status, vagrant ssh c1 (or w1/w2/e1) and run journalctl -f (to follow the service logs).

Accessing the kubernetes dashboard requires tunnelling, which if using the vagrant setup can be accomplished with: https://gist.github.com/iamsortiz/9b802caf7d37f678e1be18a232c3cc08 (note that this is for a single node; if using multi-node then change line 21 to):

vagrant ssh c1 -c "if [ ! -d /home/$USERNAME ]; then sudo useradd $USERNAME -m -s /bin/bash && echo '$USERNAME:$PASSWORD' | sudo chpasswd; fi"

Now the dashboard can be accessed at http://localhost:9090/.

Now let’s do some simple k8s examples:

Create a load balanced nginx deployment:

# create 2 containers from nginx image (docker hub)
kubectl run my-nginx --image=nginx --replicas=2 --port=80
# expose the service to the internet
kubectl expose deployment my-nginx --target-port=80 --type=LoadBalancer
# list pods
kubectl get po
# show service info
kubectl get service my-nginx
kubectl describe service/my-nginx

First interesting point… with the simple deployment above, I have already gone awry. Though I have 2 nginx containers (presumably for redundancy and load balancing), they have both been deployed on the same worker node (host). Let’s not get bogged down now — I will keep working through the examples, which probably cover how to ensure redundancy across hosts.
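
A quick way to see where the pods actually landed is the wide output, which adds a NODE column:

kubectl get pods -o wide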


# Delete the service, removes pods and containers
kubectl delete deployment,service my-nginx

Reviewed config file (pod) options: http://kubernetes.io/docs/user-guide/configuring-containers/

Deploy demo application

https://github.com/kubernetes/kubernetes/blob/release-1.3/examples/guestbook/README.md

  1. create a service for the redis master, redis slaves and frontend
  2. create a deployment for redis master, redis slaves and frontend

Pretty easy… now how do we get external traffic to the service? Either NodePorts, LoadBalancers or an Ingress resource(?).
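
As a sketch, switching the existing guestbook frontend service to a NodePort (service name from the example above) would look something like:

# change the service type and find the allocated node port
kubectl patch service frontend -p '{"spec": {"type": "NodePort"}}'
kubectl describe service frontend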

Next let’s look at how to extend Kubernetes to…


Why look at Kubernetes and CoreOS

We are currently operating a service-oriented architecture that is ‘dockerized’, with both host and containers running CentOS 7, deployed straight on top of EC2 instances. We also have a deployment pipeline with Beanstalk + homegrown scripts. I imagine our position/maturity is similar to a lot of SMEs: we have aspirations of being on top of new technologies/practices but are somewhere in between old school and new school:

Old school → New school:

  • IT and Dev separate → DevOps (Ops and Devs have the same goals and responsibilities)
  • Monolithic/large services → Microservices
  • Big releases → Continuous deployment
  • Some automation → Almost total automation with self-service
  • Static scaling → Dynamic scaling
  • Config management → Image management (with immutable deployments)
  • IT staff have a high baseline of work → IT staff have a low baseline of work, more room for initiatives

This is not about which end of this incomplete spectrum is better… we have decided that, for our place in the world, moving further toward the new-school side is desirable. I know there are a lot of experienced IT operators that take this view.

Why CoreOS for Docker Hosts?

CoreOS: A lightweight Linux operating system designed for clustered deployments providing automation, security, and scalability for your most critical applications – https://coreos.com/why/

Our application and supporting services run in docker, there should not be any dependencies on the host operating system (apart from the docker engine and storage mounts).

Some questions I ask myself now:

  • Why do I need to monitor for and stage deployments of updates?
  • Why am I managing packages on a host OS that could be immutable (like CoreOS is, kind of)?
  • Why am I managing what should be homogeneous machines with puppet?
  • Why am I nursing host machines back to health when things go wrong (instead of blowing them away and redeploying)?
  • Why do I need to monitor SE Linux events?

I want a Docker Host OS that is/has:

  • Smaller, Stricter, Homogeneous and Disposable
  • Built in hosts and service clustering
  • As little management as possible post deployment

CoreOS looks good for removing the first set of questions and satisfying the wants.

Why Kubernetes?

Kubernetes: “A platform for automating deployment, scaling, and operations of application containers across clusters of hosts” – http://kubernetes.io/docs/whatisk8s/

Some questions I ask myself now:

  • Should my deployment, monitoring and scaling be completely separate, or be a platform?
  • Why do I (IT ops) still need to be around for prod deployments (no automatic success criteria for staged deploys and no automatic rollback)?
  • Why are our deployment scripts so complex and non-portable?
  • Do I want a scaling solution outside of AWS Auto Scaling groups?

I want a tool/platform to:

  • Streamline and rationalise our complex deployment process
  • Make monitoring, scaling and deployment more manageable without our lines of homebaked scripts
  • Generally make our monitoring, scaling and deployment more able to meet changing requirements

Kubernetes looks good for removing the first set of questions and satisfying the wants.

Next steps

  • Create a CoreOS cluster
  • Install Kubernetes on the cluster
  • Deploy an application via Kubernetes
  • Assess if CoreOS and Kubernetes take us in a direction we want to go

Monitoring client side performance and javascript errors

The rise of single page apps (e.g. AngularJS) presents some interesting problems for Ops. Specifically, the increased dependence on browser-executed code means that real user experience monitoring is a must.


To that end I have reviewed some javascript agent monitoring solutions: New Relic Browser, Google Analytics (with custom event push), AppDynamics Browser and Sentry.io.

The solution/s must meet the following requirements:

  • Must have:
    • Detailed javascript error reporting
    • Negligible performance impact
    • Real user performance monitoring
    • Effective single page app (AngularJS) support
    • Real time alerting
  • Nice to have:
    • Low cost
    • Easy to deploy and maintain integration
    • Easy integration with tools we use for notifications (icinga2, Slack)

As our application is a single page Angular app, New Relic Browser requires that we pay US$130 for any single page app capability. The JavaScript error detection was not very impressive, as uncaught exceptions outside of the angular app were not reported without angular integration.

Google Analytics with custom event push does not have any real time alerting which disqualifies it as an Ops solution.

AppDynamics Browser was easy to integrate and getting javascript error details in the console was straightforward, but getting those errors to communication tools like Slack was surprisingly difficult. Alerts are based on health checks, which are breaches of metric thresholds – so I can send an alert saying there were more than 0 javascript errors in the last minute, but with no details about the error and no direct link to it.

Sentry.io was simple to add monitoring with, and simple to get alerting from, with click-through to all the javascript error info. No performance monitoring though.

Conclusion: sticking to the Unix philosophy, I am using sentry.io for javascript error alerting and AppDynamics Browser Lite for performance alerting. Both have free tiers to get started (ongoing, not just a 30-day trial).


Getting started with Gatling – Part 2

With the basics of Simulations, Scenarios, Virtual Users, Sessions, Feeders, Checks, Assertions and Reports down –  it’s time to think about what to load test and how.

I will start with a test that tries to mimic the end user experience. That means that all the 3rd-party javascript, css, images etc. should be loaded. It does not seem reasonable to say our load test performance was great when none of our users will get a responsive app because of all those things we depend on (though, yes, most of it will likely already be cached by the user). This increases the complexity of the simulation scripts, as there will be lots of additional resource requests cluttering things up. It is very important for maintainability to avoid code duplication and to use the singleton object functionality available.

Using the recorder

As I want to include CDN calls, I tried the recorder’s ‘Generate CA’ functionality. This is supposed to generate certs on the fly for each CN, which would be convenient as I could just trust a locally generated CA and not have to track down and trust all sources. Unfortunately I could not get the recorder to generate its own CA, and when using a local CA generated with openssl I could not feed the CA password to the recorder. I only spent 15 mins trying this before reverting to the default self-signed cert. Reviewing Firefox’s network panel (Firefox menu -> Developer -> Network) shows any blocked sources, which can then be visited directly and trusted with our fake cert (there are some fairly serious security implications of doing this – I personally only use my testing browser (Firefox) with these types of proxy tools and never for normal browsing).

The recorder is very handy for getting the raw code you need into the test script, though it is not a complete test. Next up is:

  1. Dealing with authentication headers – the recorded simulation does not set the header based on the response from the login attempt
  2. Requests dependent on the previous response – the recorder does not capture this dependency (it only sees the raw outbound requests), so the responses will need to be parsed
  3. Validating responses

Dealing with authentication headers

The Check API is used for verifying that the response to a request matches expectations and capturing some elements in it.

After half an hour or so of playing around with the Check API it is behaving as I want, thanks to good, concise docs.

.exec(http("login-with-creds")
   .post("/cm/login")
   .headers(headers_14)
   .body(RawFileBody("test_user_creds.txt"))
   .check(headerRegex("Set-Cookie", "access_token=(.*);Version=*").saveAs("auth_token")))

The “.check” is looking for the header name “Set-Cookie” then extracting the auth token using a regex and finally saving the token as a key called auth_token.

In subsequent requests I need to include a header containing this value, and some other headers. So instead of listing them out each time, a function makes things much neater:

def authHeader (auth_token:String):Map[String, String] = {
   Map("Authorization" -> "Bearer ".concat(auth_token),
       "Origin" -> baseURL)
} 
//...
http("list_irs")
   .get(uri1 + "/information-requests")
   .headers(authHeader("${auth_token}")) // providing the saved key value as a string arg

It’s also worth noting that to ensure all of this was working as expected I modified /conf/logback.xml to output all HTTP request/response data to stdout (from memory, in Gatling 2.x this means setting the io.gatling.http.* loggers to TRACE in that file).

Requests dependent on the previous response

With many modern applications, the behaviour of the GUI is dictated by responses from an API. For example, when a user logs in, the GUI requests a json file with all (max 50) of the user’s open requests. When the GUI receives this, the requests are rendered. In many cases this rendering process involves many more HTTP requests which, depending on the time and state of the user, may vary significantly. So… if we are trying to imitate the end user experience, instead of requesting the render info for the same open requests all of the time we should parse the json response and adjust subsequent requests accordingly. Thankfully gatling allows for the use of JsonPath. I got stuck trying to get all of the id vals out of a json return and then create requests for each of them. I had incorrectly assumed that the Gatling EL ‘random’ function could be called on a vector, which meant I thought the vector was ‘undefined’ as per the error message. The vector was in fact as expected, which was clear by printing it.

//grabs all id values from the response body and puts them in a vector accessible via "${answer_ids}" or sessions.get("answer_ids")
http("list_irs")
.get(uri1 + "/information-requests")
.headers(authHeader("${auth_token}")).check(status.is(200), jsonPath("$..id").findAll.saveAs("answer_ids")) 
//....
//prints all vaules in the answer_ids vector
.exec(session => {
    val maybeId = session.get("answer_ids").asOption[String]
    println(maybeId.getOrElse("no ids found"))
    session
})

To run queries with all of the values pulled out of the json response we can use the foreach component. Again I got stuck for a little while here: I was putting the foreach component within an exec function, where (as below) it should be outside of an exec and reference a chain that contains an exec.

val answer_chain = exec(http("an_answer")
    .get(uri1 + "/information-requests/${item}/stores/answers")
    .headers(authHeader("${auth_token}")).check(status.is(200)))
//...
val scn = scenario("BasicLogin")
//...
.exec(http("list_irs")
    .get(uri1 + "/information-requests")
    .headers(authHeader("${auth_token}")).check(status.is(200), jsonPath("$..id").findAll.saveAs("answer_ids")))
.foreach("${answer_ids}","item") { answer_chain }

Validating responses

What do we care about in responses?

  1. HTTP response headers (generally expecting 200 OK)
  2. HTTP response body contents – we can define expectations based on understanding of app behaviour
  3. Response time – we may want to define responses taking more than 2000ms as failures (queue application performance sales pitch)

Checking response headers is quite simple and can be seen explicitly above in .check(status.is(200)). In fact, there is no need for 200 checks to be explicit, as “A status check is automatically added to a request when you don’t specify one. It checks that the HTTP response has a 2XX or 304 status code.”

HTTP response body content checks are valuable for ensuring the app behaves as expected. They also require a lot of maintenance, so it is important to implement tests using code reuse where possible. Gatling is great for this as we can use Scala and all the power that comes with it (ie: reusable objects and functions across all tests).

Next up is response time checks. Note that these response times are specific to the HTTP layer and do not infer a good end user experience. Javascript and other rendering, along with blocking requests mean that performance testing at the HTTP layer is incomplete performance testing (though it is the meat and potatoes).
Gatling provides the Assertions API to conduct checks globally (on all requests). There are numerous scopes, statistics and conditions to choose from there. For specific operations, responseTimeInMillis and latencyInMillis are provided by Gatling – responseTimeInMillis includes the time it takes to fully send the request and fully receive the response (from the test host). As a default I use responseTimeInMillis as it has slightly higher coverage as a test.

These three verifications/tests can be seen here:

package mwc_gatling
import scala.concurrent.duration._
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import io.gatling.jdbc.Predef._

class BasicLogin extends Simulation {
    val baseURL="https://blah.mwclearning.com"
    val httpProtocol = http
        .baseURL(baseURL)
        .acceptHeader("application/json, text/plain, */*")
        .acceptEncodingHeader("gzip, deflate")
        .acceptLanguageHeader("en-US,en;q=0.5")
        .userAgentHeader("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:43.0) Gecko/20100101 Firefox/43.0")
    def authHeader (auth_token:String):Map[String, String] = {
        Map("Authorization" -> "Bearer ".concat(auth_token), "Origin" -> baseURL)
    } 

    val answer_chain = exec(http("an_answer")
        .get(uri1 + "/information-requests/${item}/stores/answers")
        .headers(authHeader("${auth_token}")).check(status.is(200), jsonPath("$..status")))
    
    val scn = scenario("BasicLogin")
    .exec(http("get_web_app_deps")
     //... bunch of get requests for JS CSS etc
    .exec(http("login-with-creds")
        .post("/cm/login")
        .body(RawFileBody("test_user_creds.txt"))
           .check(headerRegex("Set-Cookie", "access_token=(.*);Version=*").saveAs("auth_token"))
    //... another bunch of get for post auth deps
        http("list_irs")
            .get(uri1 + "/information-requests")
            .headers(authHeader("${auth_token}")).check(status.is(200), jsonPath("$..id").findAll.saveAs("answer_ids"))
    //... now that we have a vector full of ids we can request those resources
    .foreach("${answer_ids}","item") { answer_chain }
    
  //... finally set the simulation params and assertions
    setUp(scn.inject(atOnceUsers(10))).protocols(httpProtocol).assertions(
        global.responseTime.max.lessThan(2000),
        global.successfulRequests.percent.greaterThan(99))
}

That’s about all I need to get started with Gatling! The next steps are:

  1. extending coverage (more tests!)
  2. putting processes in place to notify and act on identified issues
  3. refining tests to provide more information about the likely problem domain
  4. making a modular and maintainable test library that can be updated in one place to deal with changes to app
  5. aggregating results for trending and correlation with changes
  6. spin up and spin down environments specifically for load testing
  7. jenkins integration

Getting started with Gatling – Part 1

With the need to do some more effective load testing I am getting started with Gatling. Why Gatling and not JMeter? I have not used either so I don’t have a valid opinion. I made my choice based on:

Working through the Gatling Quickstart

Next step is working through the basic doc: http://gatling.io/docs/2.2.1/quickstart.html#quickstart. Pretty simple and straightforward.

Moving on to the more advanced tutorial: http://gatling.io/docs/2.2.1/advanced_tutorial.html#advanced-tutorial. This included:

  • creating objects for process isolation
  • virtual users
  • dynamic data with Feeders and Checks
  • First usage of Gatling’s Expression Language (not rly a language o_O)

The most interesting function:

object Search {
    val feeder = csv("search.csv").random
    val search = exec(http("Home")
                .get("/"))
                .pause(1)
                .feed(feeder)
                .exec(http("Search")
                .get("/computers?f=${searchCriterion}")
                .check(css("a:contains('${searchComputerName}')", "href").saveAs("computerURL")))
                .pause(2)
                .exec(http("Select")
                .get("${computerURL}"))
                .pause(3)
}

…Simulations are plain Scala classes, so we can use all the power of the language if needed.

Next covered off the key concepts in Gatling:

  • Virtual User -> logical grouping of behaviours ie: Administrator(login, update user, add user, logout)
  • Scenario -> define Virtual Users behaviours ie: (login, update user, add user, logout)
  • Simulation -> is a description of the load test (group of scenarios, users – how many and what rampup)
  • Session -> Each virtual user is backed by a Session; this can allow for sharing of data between operations (see above)
  • Feeders -> Method for getting input data for tests ie: login values, search and response values
  • Checks -> Can verify HTTP response codes and capture elements of the response body
  • Assertions -> Define acceptance criteria (slower than x means failure)
  • Reports -> Aggregated output

The last review for today was of a presentation by Stephane Landelle and Romain Sertelon, the authors of Gatling.

Next step is to implement some test and figure out a good way to separate simulations/scenarios and reports.