Erik G. Learned-Miller
Professor of Computer Science
University of Massachusetts, Amherst
140 Governors Drive, Office 248
Phone: (413) 545-2993
| End-to-end face detection and cast grouping in movies using Erdos-Renyi clustering (ICCV 2017)
We present an end-to-end system for detecting and clustering faces by identity in full-length movies. Unlike works that start with a predefined set of detected faces, we consider the end-to-end problem of detection and clustering together. We make three separate contributions. First, we combine a state-of-the-art face detector with a generic tracker to extract high quality face tracklets. We then introduce a novel clustering method, motivated by classic results in graph theory. It is based on the observation that large clusters can be fully connected by joining just a small fraction of their point pairs, while just a single connection between two different people can lead to poor clustering results. This suggests clustering using a verification system with very few false positives but perhaps moderate recall. We introduce such a verification procedure with good recall in the low false-positive regime, based on features from the analysis of differences (FAD). Finally, we define a novel end-to-end detection and clustering evaluation metric allowing us to assess the accuracy of the entire end-to-end system. We present state-of-the-art results on multiple video data sets and also on standard face databases.
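The clustering idea above can be illustrated with a small sketch. This is not the paper's implementation: it simply links pairs whose verification score clears a strict threshold and then takes connected components. The function name, the toy scores, and the threshold of 0.9 are all hypothetical.

```python
from itertools import combinations


def cluster_by_verified_links(n_items, score, threshold):
    """Group items into connected components over high-confidence pairs."""
    parent = list(range(n_items))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(range(n_items), 2):
        if score(a, b) > threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(n_items):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy data: items 0-3 are one person, items 4-6 another. Only a chain of
# pairs is verified confidently, which is a small fraction of all pairs.
confident_pairs = {(0, 1), (1, 2), (2, 3), (4, 5), (5, 6)}


def toy_score(a, b):
    return 0.99 if (a, b) in confident_pairs else 0.10


clusters = cluster_by_verified_links(7, toy_score, threshold=0.9)

# A single false-positive link between the two people merges everything.
bad_pairs = confident_pairs | {(3, 4)}
merged = cluster_by_verified_links(
    7, lambda a, b: 0.99 if (a, b) in bad_pairs else 0.10, threshold=0.9
)
print(clusters)
print(merged)
```

The toy run shows both halves of the observation: a sparse chain of confident links is enough to recover each cluster, while one wrong link joins two identities.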
| Causal Motion Segmentation in Moving Camera Videos (ECCV 2016)
[arXiv] [code] [Project page]
The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer, and even camouflage. In addition to all of this, the ability to detect motion is nearly instantaneous. While there has been much recent progress in motion segmentation, it still appears we are far from human capabilities. In this work, we derive from first principles a new likelihood function for assessing the probability of an optical flow vector given the 3D motion direction of an object. This likelihood uses a novel combination of the angle and magnitude of the optical flow to maximize the information about the true motions of objects. Using this new likelihood and several innovations in initialization, we develop a motion segmentation algorithm that beats current state-of-the-art methods by a large margin. We compare to five state-of-the-art methods on two established benchmarks, and a third new data set of camouflaged animals, which we introduce to push motion segmentation to the next level.
| Labeled Faces in the Wild: A Survey
[Draft pdf] [Springer Page] [LFW Database Page]
In 2007, Labeled Faces in the Wild was released in an effort to spur research in face recognition, specifically for the problem of face verification with unconstrained images. Since that time, more than 50 papers have been published that improve upon this benchmark in some respect. A remarkably wide variety of innovative methods have been developed to overcome the challenges presented in this database. As performance on some aspects of the benchmark approaches 100% accuracy, it seems appropriate to review this progress, derive what general principles we can from these works, and identify key future challenges in face recognition. In this survey, we review the contributions to LFW for which the authors have provided results to the curators (results found on the LFW results web page). We also review the cross-cutting topic of alignment and how it is used in various methods. We end with a brief discussion of recent databases designed to challenge the next generation of face recognition algorithms.
| Multi-view Convolutional Neural Networks for 3D Shape Recognition (ICCV 2015)
[pdf] [Project page]
A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grid or polygon mesh, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes’ rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single and compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives.
| Coherent Motion Segmentation in Moving Camera Videos using Optical Flow Orientations
Link to Paper Link to Project Page
In moving camera videos, motion segmentation is commonly performed using the image plane motion of pixels, or optical flow. However, objects that are at different depths from the camera can exhibit different optical flows even if they share the same real-world motion. This can cause a depth-dependent segmentation of the scene. Our goal is to develop a segmentation algorithm that clusters pixels that have similar real-world motion irrespective of their depth in the scene. Our solution uses optical flow orientations instead of the complete vectors and exploits the well-known property that under camera translation, optical flow orientations are independent of object depth. We introduce a probabilistic model that automatically estimates the number of observed independent motions and results in a labeling that is consistent with real-world motion in the scene. The result of our system is that static objects are correctly identified as one segment, even if they are at different depths. Color features and information from previous frames in the video sequence are used to correct occasional errors due to the orientation-based segmentation. We present results on more than thirty videos from different benchmarks. The system is particularly robust on complex background scenes containing objects at significantly different depths.
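The depth-independence property mentioned above can be checked numerically. The sketch below assumes the standard pinhole model for the image motion of a static point under pure camera translation (sign conventions vary between texts); the function names, focal length, and sample values are illustrative only.

```python
import math


def translational_flow(x, y, depth, translation, focal=1.0):
    """Image-plane flow of a static point under pure camera translation.

    (x, y) are image coordinates relative to the principal point.
    """
    tx, ty, tz = translation
    u = (x * tz - focal * tx) / depth
    v = (y * tz - focal * ty) / depth
    return u, v


def orientation(flow):
    """Angle of the flow vector in radians."""
    return math.atan2(flow[1], flow[0])


camera_translation = (0.1, 0.0, 0.5)

# Same image location, same camera motion, very different depths.
near = translational_flow(0.3, -0.2, depth=2.0, translation=camera_translation)
far = translational_flow(0.3, -0.2, depth=20.0, translation=camera_translation)

print("magnitudes:", math.hypot(*near), math.hypot(*far))
print("orientations:", orientation(near), orientation(far))
```

Depth only scales the flow vector, so the magnitudes differ by the ratio of the depths while the orientations agree. This is why segmenting on orientation avoids splitting the static background by depth.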
| Augmenting CRFs with Boltzmann Machine
Link to Paper Link to Project Page
The conditional random field (CRF) is a powerful tool for building models to label segments in images. CRFs are particularly appropriate for modeling local interactions among labels for regions (e.g., superpixels). Complementary to this, the restricted Boltzmann machine (RBM) has been used to model global shapes produced by segmentation models. In this work, we present a new model that uses the combined power of these two types of networks to build a state-of-the-art labeler, and demonstrate its labeling performance for the parts of complex face images. Specifically, we address the problem of labeling the Labeled Faces in the Wild data set into hair, skin and background regions. The CRF is a good baseline labeler, but we show how an RBM can be added to the architecture to provide a global shape bias that complements the local modeling provided by the CRF. This hybrid model produces results that are both quantitatively and qualitatively better than the CRF alone. In addition to high quality segmentation results, we demonstrate that the hidden units in the RBM portion of our model can be interpreted as face attributes which have been learned without any attribute-specific training data.
| Improving Open-Vocabulary Scene Text
Link to Paper
This paper presents a system for open-vocabulary text recognition in images of natural scenes. First, we describe a novel technique for text segmentation that models smooth color changes across images. We combine this with a recognition component based on a conditional random field with histogram of oriented gradients descriptors and incorporate language information from a lexicon to improve recognition performance. Many existing techniques for this problem use language information from a standard lexicon, but these may not include many of the words found in images of the environment, such as storefront signs and street signs. We avoid this limitation by incorporating language information from a large web-based lexicon of around 13.5 million words. This lexicon contains words encountered during a crawl of the web, so it is likely to contain proper nouns, like business names and street names. We show that our text segmentation method allows for better recognition performance than the current state-of-the-art text segmentation method. We also evaluate this full system on two standard data sets, ICDAR 2003 and ICDAR 2011, and show an increase in word recognition performance compared to the current state-of-the-art methods.
| Scene Text Segmentation via Inverse Rendering|
Link to Paper
Recognizing text in natural photographs that contain specular highlights and focal blur is a challenging problem. In this paper we describe a new text segmentation method based on inverse rendering, i.e. decomposing an input image into basic rendering elements. Our technique uses iterative optimization to solve the rendering parameters, including light source, material properties (e.g. diffuse/specular reflectance and shininess) as well as blur kernel size. We combine our segmentation method with a recognition component and show that by accounting for the rendering parameters, our approach achieves higher text recognition accuracy than previous work, particularly in the presence of color changes and image blur. In addition, the derived rendering parameters can be used to synthesize new text images that imitate the appearance of an existing image.
| Distribution Fields with Adaptive Kernels for Large Displacement Image Alignment|
Link to Paper
While region-based image alignment algorithms that use gradient descent can achieve sub-pixel accuracy when they converge, their convergence depends on the smoothness of the image intensity values. Image smoothness is often enforced through the use of multi-scale approaches in which images are smoothed and downsampled. Yet, these approaches typically use fixed smoothing parameters which may be appropriate for some images but not for others. Even for a particular image, the optimal smoothing parameters may depend on the magnitude of the transformation. When the transformation is large, the image should be smoothed more than when the transformation is small. Further, with gradient-based approaches, the optimal smoothing parameters may change with each iteration as the algorithm proceeds towards convergence. We address convergence issues related to the choice of smoothing parameters by deriving a Gauss-Newton gradient descent algorithm based on distribution fields (DFs) and proposing a method to dynamically select smoothing parameters at each iteration. DF and DF-like representations have previously been used in the context of tracking. In this work we incorporate DFs into a full affine model for region-based alignment and simultaneously search over parameterized sets of geometric and photometric transforms. We use a probabilistic interpretation of DFs to select smoothing parameters at each step in the optimization and show that this results in improved convergence rates.
| Improvements in Joint Domain-Range Modeling for Background Subtraction|
Link to Paper
In many algorithms for background modeling, a distribution over feature values is modeled at each pixel. These models, however, do not account for the dependencies that may exist among nearby pixels. The joint domain-range kernel density estimate (KDE) model by Sheikh and Shah, which is not a pixel-wise model, represents the background and foreground processes by combining the three color dimensions and two spatial dimensions into a five-dimensional joint space. The Sheikh and Shah model, as we will show, has a peculiar dependence on the size of the image. In contrast, we build three-dimensional color distributions at each pixel and allow neighboring pixels to influence each other's distributions. Our model is easy to interpret, does not exhibit the dependency on image size, and results in higher accuracy. Also, unlike Sheikh and Shah, we build an explicit model of the prior probability of the background and the foreground at each pixel. Finally, we use our adaptive kernel variance method to adapt the KDE covariance at each pixel. With a simpler and more intuitive model, we can better interpret and visualize the effects of the adaptive kernel variance method, while achieving accuracy comparable to state-of-the-art on a standard backgrounding benchmark.
| Tracking with Distribution Fields|
Link to Paper Link to Project Page
In this work, we exhibit our first major application of distribution fields (see below). We show that simply by "exploding" the representation of an image into a distribution field, and then using more-or-less standard blurring techniques, we can achieve state-of-the-art tracking results.
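As a rough illustration of "exploding" an image and then blurring it, the sketch below turns a small grayscale image into one indicator layer per intensity bin and smooths each layer spatially. It is a simplified stand-in, not the tracker's code: the bin count, the box blur (used here in place of whatever smoothing kernel the paper uses), and the function names are assumptions.

```python
def explode(image, n_bins, max_value=256):
    """Return field[b][r][c] = 1.0 if pixel (r, c) falls in bin b, else 0.0."""
    bin_width = max_value / n_bins
    field = [[[0.0 for _ in row] for row in image] for _ in range(n_bins)]
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            b = min(int(value / bin_width), n_bins - 1)
            field[b][r][c] = 1.0
    return field


def box_blur(layer, radius=1):
    """Average each cell with its neighbours; edge cells use the ones that exist."""
    rows, cols = len(layer), len(layer[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                layer[rr][cc]
                for rr in range(max(0, r - radius), min(rows, r + radius + 1))
                for cc in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out[r][c] = sum(window) / len(window)
    return out


image = [[0, 255, 128],
         [64, 64, 200]]
field = explode(image, n_bins=4)
blurred = [box_blur(layer) for layer in field]

# After blurring, the values across bins at each pixel still sum to one,
# so each pixel holds a distribution over intensity bins.
column_sums = [
    [sum(blurred[b][r][c] for b in range(4)) for c in range(3)]
    for r in range(2)
]
print(column_sums)
```

Because every layer is averaged over the same window, the per-pixel values remain a valid probability distribution, which is what makes the representation probabilistic rather than just a smoothed image.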
| Joint Alignment and Clustering|
Link to Paper
Joint alignment of a collection of functions is the process of independently transforming the functions so that they appear more similar to each other. Typically, such unsupervised alignment algorithms fail when presented with complex data sets arising from multiple modalities or make restrictive assumptions about the form of the functions or transformations, limiting their generality. We present a transformed Bayesian infinite mixture model that can simultaneously align and cluster a data set. Our model and associated learning scheme offer two key advantages: the optimal number of clusters is determined in a data-driven fashion through the use of a Dirichlet process prior, and it can accommodate any transformation function parameterized by a continuous parameter vector. As a result, it is applicable to a wide range of data types, and transformation functions. We present positive results on synthetic two-dimensional data, on a set of one-dimensional curves, and on various image data sets, showing large improvements over previous work. We discuss several variations of the model and conclude with directions for future work.
| Distribution Fields: A Representation for Low-Level Vision Problems|
Link to Paper
We are developing a new representation, called distribution fields, and an associated set of algorithms, to address certain issues in low-level vision problems. One of our goals is to come up with a single representation that can be used to achieve state-of-the-art results on many different low-level problems such as tracking, optical flow, image registration, affine covariant matching, image stitching, and background subtraction. Another goal is to combine the best properties of successful representations such as SIFT, HOG, geometric blur, mean shift descriptors, shape contexts, image pyramids, and other successful techniques. Finally, we want a method of comparing images that is probabilistic and easily interpretable. We think that distribution fields, and our alignment method, which we call the sharpening match, are a good start towards achieving these goals.
| Learning Hierarchical Representations for Face Verification|
Link to Paper
Most modern face recognition systems rely on a feature representation given by a hand-crafted image descriptor, such as Local Binary Patterns (LBP), and achieve improved performance by combining several such representations. In this paper, we propose deep learning as a natural source for obtaining additional, complementary representations. To learn features in high-resolution images, we make use of convolutional deep belief networks. Moreover, to take advantage of global structure in an object class, we develop local convolutional restricted Boltzmann machines, a novel convolutional learning model that exploits the global structure by not assuming stationarity of features across the image, while maintaining scalability and robustness to small misalignments. We also present a novel application of deep learning to descriptors other than pixel intensity values, such as LBP. In addition, we compare performance of networks trained using unsupervised learning against networks with random filters, and empirically show that learning weights not only is necessary for obtaining good multi-layer representations, but also provides robustness to the choice of the network architecture parameters. Finally, we show that a recognition system using only representations obtained from deep learning can achieve comparable accuracy with a system using a combination of hand-crafted image descriptors. Moreover, by combining these representations, we achieve state-of-the-art results on a real-world face verification database.
| Online Domain Adaptation of a Pre-Trained Cascade of Classifiers
Link to Paper
Many classifiers are trained with massive training sets only to be applied at test time on data from a different distribution. How can we rapidly and simply adapt a classifier to a new test distribution, even when we do not have access to the original training data? We present an on-line approach for rapidly adapting a "black box" classifier to a new test data set without retraining the classifier or examining the original optimization criterion. Assuming the original classifier outputs a continuous number for which a threshold gives the class, we reclassify points near the original boundary using a Gaussian process regression scheme. We show how this general procedure can be used in the context of a classifier cascade, demonstrating performance that far exceeds state-of-the-art results in face detection on a standard data set. We also draw connections to work in semi-supervised learning, domain adaptation, and information regularization.
|Congealing of Complex Images (try moving your mouse over the image at left) |
Link to Project Page
Many recognition algorithms depend on careful positioning of an object into a canonical pose, so the position of features relative to a fixed coordinate system can be examined. Currently, this positioning is done either manually or by training a class-specialized learning algorithm with samples of the class that have been hand-labeled with parts or poses. In this paper, we describe a novel method to achieve this positioning using poorly aligned examples of a class with no additional labeling. Given a set of unaligned exemplars of a class, such as faces, we automatically build an alignment mechanism, without any additional labeling of parts or poses in the data set. Using this alignment mechanism, new members of the class, such as faces resulting from a face detector, can be precisely aligned for the recognition process. Our alignment method improves performance on a face recognition task, both over unaligned images and over images aligned with a face alignment algorithm specifically developed for and trained on hand-labeled face images. We also demonstrate its use on an entirely different class of objects (cars), again without providing any information about parts or pose to the learning algorithm.
| Scene Text Recognition
Link to Paper
Scene text recognition (STR) is the recognition of text anywhere in the environment, such as signs and store fronts. Relative to document recognition, it is challenging because of font variability, minimal language context, and uncontrolled conditions. Much information available to solve this problem is frequently ignored or used sequentially. Similarity between character images is often overlooked as useful information. Because of language priors, a recognizer may assign different labels to identical characters. Directly comparing characters to each other, rather than only a model, helps ensure that similar instances receive the same label. Lexicons improve recognition accuracy but are used post hoc. We introduce a probabilistic model for STR that integrates similarity, language properties, and lexical decision. Inference is accelerated with sparse belief propagation, a bottom-up method for shortening messages by reducing the dependency between weakly supported hypotheses. By fusing information sources in one model, we eliminate unrecoverable errors that result from sequential processing, improving accuracy. In experimental results recognizing text from images of signs in outdoor scenes, incorporating similarity reduces character recognition error by 19%, the lexicon reduces word recognition error by 35%, and sparse belief propagation reduces the lexicon words considered by 99.9% with a 12X speedup and no loss in accuracy.
| Recognition from One Example using Hyper-Features |
Link to Project Page
In this project, we attempt to solve the problem of object identification, which is specialized recognition where the category is known (for example cars or faces) and the algorithm recognizes an object's exact identity (such as Bob's BMW). For example, we might be given images of cars like those on the left side of the figure and be asked to find which of the four cars on the right are the same as either of the two on the left. See Andras Ferencz's website for more about this project here. This work is a continuation of Andras Ferencz's thesis work at Berkeley.
|Congealing for Automatic
Link to Project Page
I recently developed a process I call congealing, which is a way of aligning a group of objects simultaneously, using an entropy minimization procedure. This can be used to perform traditional "preprocessing" tasks such as deskewing, centering, etc. Try moving your mouse over the images of handwritten zeroes at left. As you do, the results of the congealing are shown. Notice that the zeroes have been "normalized" to be much more similar to each other. In my Ph.D. thesis, I extended congealing to gray-scale images and other multi-valued images, and to one-dimensional, three-dimensional, and four-dimensional data sets, including 3-D brain volumes. Currently, our goal is to extend congealing to more complex features (than single pixel features) such as Lowe's SIFT descriptors. Then this method can be applied to aligning complex images such as faces on arbitrary backgrounds.
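To make the entropy minimization idea concrete, here is a toy sketch on one-dimensional binary signals. It is only an illustration under simplifying assumptions: the transformations are restricted to circular shifts, the search is a plain greedy sweep, and the function names are hypothetical rather than taken from any congealing code.

```python
import math


def pixelwise_entropy(stack):
    """Sum over positions of the binary entropy (in bits) across the stack."""
    n = len(stack)
    total = 0.0
    for column in zip(*stack):
        p = sum(column) / n
        if 0.0 < p < 1.0:
            total -= p * math.log2(p) + (1 - p) * math.log2(1 - p)
    return total


def shift(signal, k):
    """Circularly shift a list to the right by k positions."""
    k %= len(signal)
    return list(signal[-k:] + signal[:-k]) if k else list(signal)


def congeal(stack, shifts=(-1, 0, 1), sweeps=5):
    """Greedy sweeps: move each signal by the shift that lowers total entropy."""
    stack = [list(s) for s in stack]
    for _ in range(sweeps):
        for i in range(len(stack)):
            candidates = [shift(stack[i], k) for k in shifts]
            stack[i] = min(
                candidates,
                key=lambda s: pixelwise_entropy(stack[:i] + [s] + stack[i + 1:]),
            )
    return stack


base = [0, 0, 1, 1, 1, 0, 0, 0]
misaligned = [shift(base, k) for k in (0, 1, -1, 2)]

before = pixelwise_entropy(misaligned)
after = pixelwise_entropy(congeal(misaligned))
print(before, after)
```

Since a shift of zero is always among the candidates, the summed entropy never increases from one step to the next. A greedy search like this can still stop at a local minimum, so it should be read as a demonstration of the objective rather than a robust alignment method.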
Link to Project Page
One of the most basic capabilities for an agent with a vision system is to recognize its own surroundings. Yet surprisingly, despite the ease of doing so, many robots store little or no record of their own visual surroundings. This paper explores the utility of keeping the simplest possible persistent record of the environment of a stationary torso robot, in the form of a collection of images captured from various pan-tilt angles around the robot. We demonstrate that this particularly simple process of storing background images can be useful for a variety of tasks, and can relieve the system designer of certain requirements as well. We explore three uses for such a record: auto-calibration, novel object detection with a moving camera, and developing attentional saliency maps.
| Text Recognition
Link to Project Page
The goal of this project is to design and build a wearable system for the visually impaired that will detect signs in an image and recognize them. We hypothesize that at a low level signs fall into a particular class of textures that are distinguishable from many others that may be found in natural scenes. Therefore, discriminating textures will be the first step toward extracting and eventually identifying signs. Other work has focused exclusively on detecting and tracking text in images and video. Even those signs that consist purely of text are often in unusual fonts and/or arrangements that pose challenges to traditional text detectors. More importantly, many signs consist of recognizable logos that contain no text at all. We investigate whether all of these regions can be identified at a low level in an integrated model.
| RADICAL, a New ICA Algorithm
Link to Project Page
There has been a great deal of new work recently on the problem of Independent Components Analysis (ICA). A variety of new and interesting methods have emerged including Kernel ICA (Bach and Jordan), a method by Hastie and Tibshirani at Stanford, and other methods. I have my own new ICA algorithm, called RADICAL, which I developed with John Fisher at MIT. On synthetic experiments across a wide range of source densities, RADICAL is more accurate than every other algorithm we tested, including Kernel ICA, Fast ICA, extended Infomax, and JADE. It also appears to be very robust to outliers. You can find out more from our recent paper in the Journal of Machine Learning Research here. Or check out the web page for the algorithm here, from which a MATLAB version of RADICAL can be downloaded.
| Learned Color Constancy
One tricky thing about objects, especially white objects, is that they can "appear" as almost any color. A white object in the early morning looks rather bluish but in the mid-day sun may look yellower and during a sunset might appear pink. How then can we use color to help us, rather than hinder us, in object recognition? Traditional approaches to this difficulty have often involved estimating the illuminant, after which one can "correct" the image to appear as if under a standard illuminant. Collaborating with Kinh Tieu at MIT, we sidestepped the problem of illuminant estimation by modeling only how colors commonly change together under natural lighting changes. Two images can then be inferred to represent the same object if there is a statistically common mapping between the colors of the two images. We report on this method and its applications in an ICCV paper and in this NIPS paper. The image at left shows how plausible novel images of an object can be synthesized from joint color changes learned from a totally different object. The synthesized images were generated with only a single example image, and with no a priori or built in knowledge of lighting statistics, optics, or the physics of illumination.
The panel on the left shows three photographs of a woman who developed a condition known as acromegaly. The first picture shows her when she was young and symptom-free. The second and third photos show the progress of the disease over time. This condition results from an excess of growth hormone, and causes disfiguring growth of the bones of the skull and swelling of the face, hands, and feet. Our goal in this project is to detect acromegaly automatically from generic photographs so that it can be diagnosed earlier, leading to better clinical outcomes. In collaboration with Volker Blanz and others, we have developed a classification system which prescreens patients for acromegaly.
This project was originally conceived by my father, Dr. Ralph E. Miller, who practices endocrinology in Lexington, Kentucky. Qifeng (Luke) Lu at UMass has been a major contributor as well.
| Mathematical Expression
Suppose you wanted to scan in a mathematical expression from a book and have it automatically converted to LaTeX, or write an expression on a pen-based computer and have it automatically read and evaluated. Paul Viola and I wrote a paper describing our early system for recognition. Nick Matsakis continued the work and got a great system working. For a demo of Nick's new and improved system, click on the picture at right.
|MR Bias Correction
The goal of magnetic resonance (MR) imaging is to form images of patient anatomy for diagnosis and other analyses. Often these images exhibit brightness distortions due to imperfections in the measurement apparatus. The goal of this work is to eliminate these imperfections from MR images. Previous approaches have been model-based (Wells) or have operated on a single image to reduce brightness entropies (Viola). Our method reduces entropies ACROSS images, using information about the distribution of brightness values at a particular location. On the left are two sets of MR images of infant brains. The top set of images shows the brightness biases due to the scanner imperfections. The other set shows the images after correction by our algorithm. I have a NIPS paper with Parvez Ahammad that describes this work in detail here.
Near uniform partitions are a technique for dividing a probability space, using only a set of random samples from that space, into chunks of approximately equal probability measure, or into chunks whose probability measure is approximately linear in the number of constituent subregions. The figure at left shows how samples from a two-dimensional Gaussian distribution can be used to split the Gaussian up into chunks whose probability masses are approximately linear in the number of subregions. Notice that regions in areas of high density are smaller, and regions in areas of low density are larger, resulting in a near uniform mass for each region. Near uniform partitions can be used in estimation of information theoretic quantities such as entropy, mutual information, and Kullback-Leibler divergence. They can also be used in hypothesis testing. For a discussion of their use in entropy estimation, see this short ICASSP paper.
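A one-dimensional analogue gives a feel for the idea. The sketch below is not the construction from the paper: it simply cuts sorted samples into groups of equal count, so each bin carries roughly the same probability mass. The function name, sample size, and bin count are arbitrary choices for illustration.

```python
import random


def equal_count_edges(samples, n_bins):
    """Bin edges chosen so each bin holds about the same number of samples."""
    ordered = sorted(samples)
    n = len(ordered)
    edges = [ordered[0]]
    for k in range(1, n_bins):
        edges.append(ordered[(k * n) // n_bins])
    edges.append(ordered[-1])
    return edges


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

edges = equal_count_edges(samples, n_bins=10)
widths = [right - left for left, right in zip(edges, edges[1:])]

# Bins near the mode are narrow; bins in the tails are wide.
for left, width in zip(edges, widths):
    print(f"start={left:+.3f}  width={width:.3f}")
```

The narrow central bins and wide tail bins mirror the behavior described for the two-dimensional figure: region size adapts to density so that mass per region stays nearly constant.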
|Probability Distributions on
Christophe Chefd'hotel and I developed kernels for these curved spaces, allowing us to obtain better probability density estimates for these "shape" spaces. This work is described in a CVPR paper here. The figure at right shows conceptually that a direct "Euclidean distance" between points is not always appropriate in a curved
|Master's Thesis: Improved Surface Area Estimates Using Alternative Voxel Shapes |
Check out my Master's Thesis if you're a fan of stochastic geometry. It addresses the advantages and disadvantages of using various voxel shapes (other than the standard rectangular prisms) to tessellate 3-D space.