2: Face Detection and Age Estimation

DRAFT

Announcements

SRTIs!

Images, images everywhere

Suppose you’ve imaged a drive and started your analysis. You find that it’s a 2TB drive, and it contains (among other things) hundreds of thousands of images of various types. How do you go about analyzing, summarizing, or even just surveying it?

You could of course look for structure on the disk itself – recently edited or created files vs older ones; or maybe a directory structure or naming scheme imposed by someone who controlled the disk. But today we’re going to talk a bit about some automated techniques you might use to accomplish particular tasks (or adapt to accomplish others). In particular, we’re going to talk about computer-aided techniques for face detection and for age estimation.

We’ll do the first, face detection, using a technique called “Haar-based feature cascades” proposed by Viola and Jones in 2001. This and techniques like it were state of the art for a while; arguably neural nets have surpassed them now, though whether you can use an NN depends upon the hardware resources you have. And then we’ll talk about how to use an NN to do age estimation of the discovered faces.

Let’s get started.

Face detection

Like all CV tasks, face detection seems simple until you sit down and try to code it. Our brains are fantastic at all manner of visual processing, and we don’t have to consciously think about it at all. Computers, though, generally do only and exactly what we tell them in a language they understand – “find faces” is a little too high-level for them. So we have to break this task down into manageable chunks for the computer to handle.

On board: Haar-like features as blocks of black/white rectangles for edge detection (and how to compute them: sum of pixels under black minus sum of pixels under white). Within a 24x24 detection window, varying the pattern, position, and scale gives 160K or so varieties (though we use AdaBoost to select a couple thousand to actually use).

An aside: the integral image (for speed) – compute all the pixel sums only once; don’t re-sum all pixels in each feature for each window, that’s nonsense! Instead compute each entry as the pixel’s value plus the entries of its neighbors above and to the left, minus the entry diagonally above-left (which would otherwise be counted twice). Then you can do some simple geometry tricks to find the sum of any rectangle in only a few ops (value at lower right + value at upper left - value at upper right - value at lower left). This is also known as a summed area table.
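
To make the aside concrete, here’s a minimal numpy sketch of a summed area table and a two-rectangle Haar-like feature computed from it. (Function names and the feature layout are mine for illustration; this isn’t OpenCV’s implementation.)

import numpy as np

def integral_image(gray):
    # summed area table: ii[y, x] = sum of all pixels at or above row y
    # and at or to the left of column x
    return gray.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    # sum over the (inclusive) rectangle in at most four lookups:
    # lower right + upper left - upper right - lower left
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

def two_rect_feature(ii, top, left, h, w):
    # simplest Haar-like feature: dark half minus light half; the value is
    # large in magnitude when a vertical edge runs down the middle
    mid = left + w // 2
    dark = rect_sum(ii, top, left, top + h - 1, mid - 1)
    light = rect_sum(ii, top, mid, top + h - 1, left + w - 1)
    return dark - light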

We train the model via AdaBoost – find the most relevant features. Given some labeled training data, each round AdaBoost considers every feature and keeps the one that best classifies the (currently re-weighted) examples; any feature with accuracy > 0.5 – better than chance – can serve as a “weak classifier” and is added to a linear combination of weak classifiers, with the misclassified examples up-weighted for the next round. The linear combination (it can be shown, but that’s a different class) is a strong classifier. This is similar to how many ensemble classifiers (or random forests) work.
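
Here’s a heavily simplified sketch of that boosting loop. Real Viola–Jones training also fits a threshold and polarity for each feature rather than just taking its sign, but the control flow is the same; the array layout and names are mine.

import numpy as np

def adaboost_select(feature_values, labels, n_rounds=10):
    # feature_values: (n_features, n_examples) array of Haar feature scores
    # labels: (n_examples,) array of +1 (face) / -1 (non-face)
    n_features, n_examples = feature_values.shape
    weights = np.full(n_examples, 1.0 / n_examples)
    strong = []  # the strong classifier: a list of (feature index, alpha) pairs
    for _ in range(n_rounds):
        preds = np.where(feature_values > 0, 1, -1)         # each feature as a crude stump
        errors = (weights * (preds != labels)).sum(axis=1)  # weighted error of each stump
        j = int(errors.argmin())                            # best weak classifier this round
        err = max(errors[j], 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)             # its weight in the combination
        weights *= np.exp(-alpha * labels * preds[j])       # up-weight the examples it missed
        weights /= weights.sum()
        strong.append((j, alpha))
    return strong  # classify a window as sign(sum of alpha_j * stump_j(window))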

OK, now we’ve selected features; how do we apply them? Naively, we’d have to slide our window across the entire image pixel by pixel; at each position, we’d have to evaluate ~2000 features, then put them into our classifier. Too much work! Solution: a cascade classifier.

The idea is to break the classifier into a hierarchy, where we coarsely filter out the “definitely-not-face” areas first, then run the more expensive parts of the classifier only on the “maybe-face” sections. How? Break the classifier into cascaded stages. Take the first, say, 10 features; evaluate them on a window and decide “possibly a face” or “definitely not a face.” A “definitely not” window is discarded immediately; a “possibly” window moves on to the next, larger group of features. Since the vast majority of windows contain no face, the cheap early stages reject almost everything, and the expensive later stages run on only a tiny fraction of the image.
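
The control flow looks something like the sketch below. The stage structure and names here are stand-ins; a trained cascade picks stage sizes and thresholds so that early stages are cheap and reject most windows.

def cascade_predict(window, stages):
    # stages: list of (weak_classifiers, threshold) pairs, cheapest stage first;
    # each weak classifier is an (alpha, classify) pair, classify(window) -> +1/-1
    for weak_classifiers, threshold in stages:
        score = sum(alpha * classify(window) for alpha, classify in weak_classifiers)
        if score < threshold:
            return False  # "definitely not a face": reject now, skip the later stages
    return True  # survived every stage: report as a face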

Doing it in Python

I made you use Python in this course for a variety of reasons, one of which was so you could leverage it in future projects. Many (most?) useful third-party libraries have Python bindings, which means you can use them in Python to do what you want. I’m going to use OpenCV, an open-source computer vision library, to demonstrate how you might use an existing implementation of Haar cascades to do face detection.

I installed OpenCV and its Python 3 bindings. Here’s the script we wrote in class to do Haar cascade detection of faces. It also does the scaling and padding needed to feed the found faces into the next thing we’re going to do. Note you’ll need to adjust the location of the OpenCV haarcascades to point to your local install to make this work.

import argparse
import os
import sys

import cv2

HAARCASCADE_FRONTALFACE_PATH = '/opt/local/share/OpenCV/haarcascades/haarcascade_frontalface_alt.xml'


def extract_faces(image, face_cascade, padding_fraction=0.4, resize=True, new_size=(224, 224)):
    """
    Return as a list the faces found in the image.
    Parameters
    ----------
    image : numpy uint8 (BGR-pixel order) format
    face_cascade : an initialized OpenCV2 CascadeClassifier
    padding_fraction: how much padding (as a fraction of found face size)
    resize : whether to resize the face once found
    new_size : the size of the resized face (with padding)
    Returns
    -------
    a list of OpenCV2 images (as numpy arrays), one per face
    """
    (height, width, _) = image.shape

    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 4)  # scaleFactor=1.1, minNeighbors=4
    result = []
    for (i, (x,y,w,h)) in enumerate(faces):
        # margins, and padding (by replication) at border if necessary
        xpad = int(w * padding_fraction)
        ypad = int(h * padding_fraction)

        ymin = y - ypad
        if ymin < 0:
            yminpad = abs(ymin)
            ymin = 0
        else:
            yminpad = 0
        ymax = y + h + ypad
        if ymax >= height:
            ymaxpad = ymax - height + 1  # compute the overflow before clipping
            ymax = height - 1
        else:
            ymaxpad = 0
        xmin = x - xpad
        if xmin < 0:
            xminpad = abs(xmin)
            xmin = 0
        else:
            xminpad = 0
        xmax = x + w + xpad
        if xmax >= width:
            xmaxpad = xmax - width + 1  # compute the overflow before clipping
            xmax = width - 1
        else:
            xmaxpad = 0
        face = image[ymin:ymax, xmin:xmax]
        face_padded = cv2.copyMakeBorder(face, yminpad, ymaxpad, xminpad, xmaxpad,
                                         cv2.BORDER_REPLICATE)  # order: top, bottom, left, right
        if resize:
            face_padded = cv2.resize(face_padded, new_size, interpolation=cv2.INTER_CUBIC)
        result.append(face_padded)
    return result

def make_classifier(cascade_path=HAARCASCADE_FRONTALFACE_PATH):
    return cv2.CascadeClassifier(cascade_path)

def main():
    parser = argparse.ArgumentParser(description='Extract face(s) from images; write results to FILE-face-0.png, FILE-face-1.png, etc.')
    parser.add_argument('files', metavar='FILE', nargs='+',
                        help='a file to perform image recognition on')
    args = parser.parse_args()

    if not os.access(HAARCASCADE_FRONTALFACE_PATH, os.R_OK):
        sys.exit("error: unable to open HAARCASCADE_FRONTALFACE_PATH; please check and set correctly")
    face_cascade = make_classifier()
    for f in args.files:
        image = cv2.imread(f)
        if image is None:
            sys.stderr.write('warning: could not read {}; skipping\n'.format(f))
            continue
        faces = extract_faces(image, face_cascade)
        for i, face in enumerate(faces):
            cv2.imwrite('{}-face-{}.png'.format(f, i), face)

if __name__ == '__main__':
    main()

Now for something completely different

Brief overview of NNs: in short, they are (trainable) functions from input to output. How do they work? Underpants gnomes! (I kid, but the science of NNs is very much in its infancy right now – we don’t have a strong predictive theory about how or why changes in network structure affect output accuracy, etc.)

For this demonstration, we’re going to do image analysis. Our inputs will be pixel values, and our outputs will depend upon the model. We want a relatively small input size to cut down on the size of our model – one standard is to downsample input images to something like 224x224 (which we already did). We’ll use a pre-trained model that does age estimation: it has 101 outputs, each of which gives the probability that the input face is of the corresponding age in the range [0, 100]. In other words, this NN produces a probability distribution over ages for the input face.
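
There are two natural ways to collapse that distribution into a single age: take the single most likely age, or take the expected value over all 101 ages – the latter is what gives the model we’re using its name (DEX, “Deep EXpectation”). A quick sketch, where probs is one of the length-101 probability lists the script below produces:

import numpy as np

def point_estimates(probs):
    probs = np.asarray(probs)
    ages = np.arange(len(probs))            # 0, 1, ..., 100
    mode = int(probs.argmax())              # single most likely age
    expected = float((ages * probs).sum())  # E[age], the estimate reported in the DEX paper
    return mode, expected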

It’s a pre-trained age estimator (from https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/) – note in the paper how long it took to train (5 days on high-end hardware!). The model is in Caffe format and I didn’t have time to translate it into something newer, so we’ll have to use Python 2 (there are no Python 3 bindings for Caffe that I could get to work). There’s a little bit of other weird bookkeeping to do, including “normalizing” the image against the “mean image,” which is the mean pixel value over all training images (weird, I know). Here’s the script, which assumes that Caffe and its Python 2 bindings are installed (and also assumes you’ve downloaded the various models and support files). If you have trouble tracking down one of these files and want to get it working, send me a Piazza message and I’ll post them.

import sys
if sys.version_info[0] != 2 or sys.version_info[1] != 7:
    sys.exit("This script written for Python 2.7; disable this check at your own risk.")
import argparse
import os

import caffe
import numpy as np

DATA_DIR = os.path.join('/Users/liberato/Research/2016-research-automated-image-detection', 'data')

# model from Rothe et al.; appears to be the VGG-16 architecture
MODEL_STRUCTURE = os.path.join(DATA_DIR, 'age.prototxt')

# two sets of model weights from Rothe, et al.
# first was trained on the IMDB-WIKI data set
# second was then fine-tuned on data from the ICCV competition
#MODEL_WEIGHTS = os.path.join(DATA_DIR, 'dex_imdb_wiki.caffemodel')
MODEL_WEIGHTS = os.path.join(DATA_DIR, 'dex_chalearn_iccv2015.caffemodel')

# CNNs often subtract the mean pixel value (over the set of training image pixels)
# to "center" pixel values; this is an established technique to improve accuracy
MEAN_IMAGE = os.path.join(DATA_DIR, 'imagenet_mean.binaryproto')

def load_net():
    net = caffe.Net(MODEL_STRUCTURE,
                    MODEL_WEIGHTS,
                    caffe.TEST) # test (as opposed to TRAIN) mode
    return net

def load_mean_pixels():
    # load mean image, determine mean pixel values
    blob = caffe.proto.caffe_pb2.BlobProto()
    data = open(MEAN_IMAGE, 'rb').read()
    blob.ParseFromString(data)
    mu = np.array(caffe.io.blobproto_to_array(blob)).reshape(3,256,256)
    mu = mu.mean(1).mean(1)  # average over pixels to obtain the mean (BGR) pixel values
    return mu

def construct_transformer(net, mu, rb_swap=True):
    # construct a transformer for input data
    transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
    transformer.set_transpose('data', (2,0,1))  # move image channels to outermost dimension
    transformer.set_mean('data', mu)            # subtract the dataset-mean value in each channel
    transformer.set_raw_scale('data', 255)      # rescale from [0, 1] to [0, 255]
    if rb_swap:
        transformer.set_channel_swap('data', (2,1,0))  # swap channels from RGB to BGR
    return transformer

def transform_image(image, transformer):
    """Transform and crop an input image in memory."""
    if image.shape != (224,224,3):
        sys.stderr.write('warning: resizing input to 224x224\n')
        sys.stderr.flush()
        image = caffe.io.resize_image(image, (224,224))
    transformed_image = transformer.preprocess('data', image)
    return transformed_image

def transform_file(path, transformer):
    """Transform and crop an input image stored in a file."""
    image = caffe.io.load_image(path)
    return transform_image(image, transformer)

def estimate_probs(transformed_images, net, progress_tracker=None):
    """
    Estimate the age of the face in each transformed_image. Return a list with one entry
    per image; each entry is a length-101 list of probabilities: entry[0] is the probability
    of age 0 (years), entry[1] the probability of age 1, ..., entry[100] of age 100.
    """

    # prepare data (input) buffer in the network
    net.blobs['data'].reshape(1,       # batch size
                              3,       # three-channel (BGR) images
                              224,224) # image size

    result = []

    for image in transformed_images:
        # copy image into the network
        net.blobs['data'].data[...] = image
        if progress_tracker: progress_tracker.update()

        # run the classifier
        output = net.forward()
        if progress_tracker: progress_tracker.update()

        output_prob = output['prob']

        result.append(output_prob[0].tolist())

    return result

def estimate_ages(transformed_images, net):
    """
    Estimate the age of the face in each transformed_image. Return one list of ages per image; each list is ordered from most to least likely age.
    """

    probabilities_list = estimate_probs(transformed_images, net)

    result = []

    for probs in probabilities_list:
        probs_ages = [(prob, age) for age, prob in enumerate(probs)]
        probs_ages.sort(reverse=True)  # most probable first
        sorted_ages = [age for _, age in probs_ages]
        result.append(sorted_ages)

    return result

def main():
    parser = argparse.ArgumentParser(description='Estimate ages of faces in FILEs; write results to STDOUT.')
    parser.add_argument('files', metavar='FILE', nargs='+',
                        help='a file to perform age estimation on')
    args = parser.parse_args()

    net = load_net()
    mean_px = load_mean_pixels()
    transformer = construct_transformer(net, mean_px)

    faces = [transform_file(f, transformer) for f in args.files]
    results = estimate_ages(faces, net)

    for (face, result) in zip(args.files, results):
        print face, result[0:5]

if __name__ == '__main__':
    main()