Creating EC2 keypairs with AWS CLI

It is easy to create EC2 keypairs with the AWS CLI:

$ aws ec2 create-key-pair --key-name mynewkeypair > keystuff.json

After creating the keypair it should appear in your EC2 key pairs listing. The keystuff.json file will contain the RSA private key you'll need to connect to any instances you launch with the keypair, along with the key's name and fingerprint. Note that AWS does not keep a copy of the private key, so this output is your only chance to save it.

{
    "KeyMaterial": "-----BEGIN RSA PRIVATE KEY-----\n<your private key>==\n-----END RSA PRIVATE KEY-----",
    "KeyName": "mynewkeypair",
    "KeyFingerprint": "53:47:ee:01:3a:35:9b:52:1c:4f:99:6f:87:b0:0f:8b:ed:83:55:3b"
}

To extract the private key into a separate file, use the jq JSON filter with its --raw-output (-r) flag, which prints the bare string with the \n escapes expanded into real newlines. ssh will also refuse a key file that's readable by other users, so tighten its permissions:

$ jq -r '.KeyMaterial' keystuff.json > mynewkey.pem
$ chmod 600 mynewkey.pem
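If jq isn't installed, the same extraction is a few lines of Python. This is a minimal sketch: extract_private_key is a name I've made up, and the file names just match the example above.

```python
import json
import os
import stat

def extract_private_key(json_path, pem_path):
    """Write the KeyMaterial field from create-key-pair output to a .pem file."""
    with open(json_path) as f:
        key_material = json.load(f)["KeyMaterial"]
    with open(pem_path, "w") as f:
        f.write(key_material + "\n")
    # ssh refuses private keys that are readable by other users
    os.chmod(pem_path, stat.S_IRUSR | stat.S_IWUSR)  # 0600
```

Calling extract_private_key("keystuff.json", "mynewkey.pem") does the same job as the jq command above, permissions included.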

GitLab Weirdness

If you're using GitLab.com for hosting your repositories, you may have encountered a strange problem wherein your newly-created repository's dashboard doesn't update.

/images/gitlab-weirdness.thumbnail.png

That is, when you git push your changes to the repository, the interface still looks like a newly-created repository, and neither your files nor your commits are visible in the web UI. This is weird because the remote repository works in all other respects. You can push code up to it, clone it, etc. You just can't see it on the GitLab website.

I've seen this happen a couple of times, and so far I've found that the quick fix is to run Housekeeping on the repository from the Edit Project page.

/images/gitlab-housekeeping.thumbnail.png

Housekeeping can take a couple of minutes but most of the time it works and you can see your repository's files and commit history after running it. If it doesn't work, you have to delete the repository in GitLab and re-create it, pushing your code up again.

Installing Python 2.7.11 on CentOS 7

CentOS 7 ships with python 2.7.5 by default. We have some software that requires 2.7.11. It's generally a bad idea to clobber your system python, since other system-supplied software may rely on it being a particular version.

Our strategy for running 2.7.11 alongside the system python is to build it from source, then create virtualenvs that will run our software.

Step 1. Update CentOS and install development tools

# as root
yum upgrade -y
yum groupinstall 'Development Tools' -y
yum install -y zlib-devel openssl-devel

Step 2. Download the Python source tarball

# As a regular user (avoid doing mundane things as root)
$ cd /tmp
$ wget https://www.python.org/ftp/python/2.7.11/Python-2.7.11.tgz
$ tar -zxf Python-2.7.11.tgz
$ cd Python-2.7.11

Step 3. Configure, build and install into /opt (replace with /usr/local/ if you prefer)

$ ./configure --prefix=/opt/
$ make
# make install  (run this step as root, since /opt isn't writable by regular users)

Step 4. Install pip and virtualenv for the system Python

You have to be root for this.

# curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
# python get-pip.py
# pip install virtualenv

Step 5. Use the system virtualenv to create a venv for your updated Python

You can now create virtualenvs; just point --python at the 2.7.11 interpreter:

$ mkdir env
$ virtualenv --python=/opt/bin/python2.7 env/pyenv
$ source env/pyenv/bin/activate
$ python --version
Python 2.7.11

namedtuple Comes in Handy

I've been writing a lot of Python code recently. Oftentimes I struggle with what a method should return when I have to relay more than one value back to the caller. For example:

class PaymentGateway:
    def do_transaction(self, target, amount, bill_code, **kwargs):
        """
        Perform some transaction against the API.

        :return: whether the transaction was successful or not
        :rtype: bool
        """
        # stuff happens here
        try:
            result = self.amount_transaction(tx_details)
            logger.info("Success: CODE=%s Details=%s" % (result.code, result.detail))
            return True
        except GatewayException as ex:
            logger.error("Transaction failed: ERROR=%s reason=%s" % (ex.err_code, ex.message))
            return False

The code that calls do_transaction might look like this:

if payment_gw.do_transaction(subid, amount, bill_code, service_id=service_id, ref_code=ref_code) is True:
    # Hooray! Succe$$!
    report_success("Transaction for %s was successful. Check logs for status code." % subid)
else:
    # Boo
    report_failure("Transaction failed. I don't know why...")

Many times this is fine, but what if the caller needs the details from the amount_transaction result or the GatewayException? A quick solution is to return a dict:

class PaymentGateway:
    def do_transaction(self, target, amount, bill_code, **kwargs):
        """
        Perform some transaction against the API.

        :return: a dict that contains keys 'success', 'code', and 'detail'
        :rtype: dict
        """
        # stuff happens here
        try:
            result = self.amount_transaction(tx_details)
            logger.info("Success: CODE=%s Details=%s" % (result.code, result.detail))
            success_dict = {
                'success': True,
                'code': result.code,
                'detail': result.detail,
            }
            return success_dict
        except GatewayException as ex:
            logger.error("Transaction failed: ERROR=%s reason=%s" % (ex.err_code, ex.message))
            error_dict = {
                'success': False,
                'code': ex.err_code,
                'detail': ex.message,
            }
            return error_dict

It works, but it's pretty ad hoc: the structure of whatever do_transaction returns won't be obvious unless you dig into the code. The caller ends up looking like this:

payment_status = payment_gw.do_transaction(subid, amount, bill_code, service_id=service_id, ref_code=ref_code)
if payment_status['success'] is True:
    # Hooray! Succe$$!
    report_success("Transaction for %s was successful, status code %s" % (subid, payment_status['code']))
else:
    # Boo
    report_failure("Transaction failed, because: %s" % payment_status['detail'])

Now the caller is polluted with string literals like 'success', 'code' and 'detail'. These can be hell to debug, especially if you happen to misspell one of them in your code, even if you're using an awesome IDE like PyCharm.
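For example, a typo in one of those keys only surfaces at runtime as a KeyError, with no help from the IDE beforehand. A contrived sketch, with made-up values:

```python
payment_status = {'success': True, 'code': '00', 'detail': 'Approved'}

# One missing letter, and this only blows up when the line actually runs:
try:
    print(payment_status['succes'])
except KeyError as ex:
    print("KeyError: %s" % ex)
```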

An alternative to defining these ad hoc dict structures is to use namedtuple from the collections module.

from collections import namedtuple

PaymentStatus = namedtuple('PaymentStatus', ['success', 'code', 'detail'])

class PaymentGateway:
    def do_transaction(self, target, amount, bill_code, **kwargs):
        """
        Perform some transaction against the API.

:return: the transaction status (success flag, status code, and detail)
        :rtype: PaymentStatus
        """
        # stuff happens here
        try:
            result = self.amount_transaction(tx_details)
            logger.info("Success: CODE=%s Details=%s" % (result.code, result.detail))
            return PaymentStatus(True, result.code, result.detail)
        except GatewayException as ex:
            logger.error("Transaction failed: ERROR=%s reason=%s" % (ex.err_code, ex.message))
            return PaymentStatus(False, ex.err_code, ex.message)

namedtuple forces us to be explicit about what do_transaction returns. And explicit is better than implicit. For the caller, this looks like:

payment_status = payment_gw.do_transaction(subid, amount, bill_code, service_id=service_id, ref_code=ref_code)
if payment_status.success is True:
    # Hooray! Succe$$!
    report_success("Transaction for %s was successful, status code %s" % (subid, payment_status.code))
else:
    # Boo
    report_failure("Transaction failed, because: %s" % payment_status.detail)

This is almost as simple as our first example, and it's free of string literals. And if you're using PyCharm, you can take advantage of code completion, which knows about the attributes of your new namedtuple class:

/images/pycharm_namedtuple.png

So if your code is littered with string literals as keys for return values from methods that return dict, consider having them return a namedtuple instead.
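As a bonus, namedtuple gives you a couple of conveniences that a plain dict doesn't. PaymentStatus here is the same class defined above; the field values are made up for illustration:

```python
from collections import namedtuple

PaymentStatus = namedtuple('PaymentStatus', ['success', 'code', 'detail'])

status = PaymentStatus(True, '00', 'Approved')

# It's still a real tuple underneath, so unpacking works as usual
success, code, detail = status

# _asdict() hands you a mapping when you genuinely need key/value pairs,
# e.g. for JSON serialization or logging
as_mapping = status._asdict()

# _replace() returns a modified copy; the original stays immutable
retried = status._replace(detail='Approved (retry)')
```

Immutability is a nice side effect: nothing downstream can quietly overwrite 'success' on a status object that's already been returned.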

The Art of Data Science

/images/art-of-data-science-book.thumbnail.png

I will admit, I was pretty stoked yesterday when Roger Peng retweeted my announcement that his new book was available.

In the book, Peng and co-author Elizabeth Matsui walk us through the different activities of data analysis: from formulating questions and basic exploratory data analysis to get a rough feel for the data, to modelling the data with familiar distributions, through to basic inference and prediction.

Using R and the datasets that come bundled with it, Peng and Matsui demonstrate how each activity is actually an iterative process itself. At each stage, it's important to evaluate what you already know (or think you know) and revise your expectations based on the data.