Justin Domke's Publications
Click to hide abstracts 

Divide and Couple: Using Monte Carlo Variational Objectives for Posterior Approximation
J. Domke and D. Sheldon NeurIPS 2019 arxiv slides poster  
Recent work in variational inference (VI) uses ideas from Monte Carlo estimation to tighten the lower bounds on the loglikelihood that are used as objectives. However, there is no systematic understanding of how optimizing different objectives relates to approximating the posterior distribution. Developing such a connection is important if the ideas are to be applied to inferencei.e., applications that require an approximate posterior and not just an approximation of the loglikelihood. Given a VI objective defined by a Monte Carlo estimator of the likelihood, we use a "divide and couple" procedure to identify augmented proposal and target distributions. The divergence between these is equal to the gap between the VI objective and the loglikelihood. Thus, after maximizing the VI objective, the augmented variational distribution may be used to approximate the posterior distribution.  
Provable Gradient Variance Guarantees for BlackBox Variational Inference
J. Domke NeurIPS 2019 arxiv poster  
Recent variational inference methods use stochastic gradient estimators whose variance is not well understood. Theoretical guarantees for these estimators are important to understand when these methods will or will not work. This paper gives bounds for the common "reparameterization" estimators when the target is smooth and the variational family is a locationscale distribution. These bounds are unimprovable and thus provide the best possible guarantees under the stated assumptions.  
Thompson Sampling and Approximate Inference
M. Phan, Y. AbbasiYadkori and J. Domke NeurIPS 2019 arxiv  
We study the effects of approximate inference on the performance of Thompson sampling in the karmed bandit problems. Thompson sampling is a successful algorithm for online decisionmaking but requires posterior inference, which often must be approximated in practice. We show that even small constant inference error (in αdivergence) can lead to poor performance (linear regret) due to underexploration (for α<1) or overexploration (for α>0) by the approximation. While for α>0 this is unavoidable, for α≤0 the regret can be improved by adding a small amount of forced exploration even when the inference error is a large constant.  
Provable Smoothness Guarantees for BlackBox Variational Inference
J. Domke Arxiv 2019 arxiv  
Blackbox variational inference tries to approximate a complex target distribution through a gradientbased optimization of the parameters of a simpler distribution. Provable convergence guarantees require structural properties of the objective. This paper shows that for locationscale family approximations, if the target is M Lipschitz smooth, then so is the “energy” part of the variational objective. The key proof idea is to describe gradients in a certain innerproduct space, thus permitting the use of Bessel’s inequality. This result gives bounds on the location of the optimal parameters, and is a key ingredient for convergence guarantees.  
Using Large Ensembles of Control Variates forVariational Inference
T. Geffner and J. Domke NeurIPS 2018 arxiv  
Variational inference is increasingly being addressed with stochastic optimization. In this setting, the gradient's variance plays a crucial role in the optimization procedure, since high variance gradients lead to poor convergence. A popular approach used to reduce gradient's variance involves the use of control variates. Despite the good results obtained, control variates developed for variational inference are typically looked at in isolation. In this paper we clarify the large number of control variates that are available by giving a systematic view of how they are derived. We also present a Bayesian risk minimization framework in which the quality of a procedure for combining control variates is quantified by its effect on optimization convergence rates, which leads to a very simple combination rule. Results show that combining a large number of control variates this way significantly improves the convergence of inference over using the typical gradient estimators or a reduced number of control variates.  
Importance Weighting and Variational Inference
J. Domke and D. Sheldon NeurIPS 2018 arxiv poster  
Recent work used importance sampling ideas for better variational bounds on likelihoods. We clarify the applicability of these ideas to pure probabilistic inference, byshowing the resulting Importance Weighted Variational Inference (IWVI) techniqueis an instance of augmented variational inference, thus identifying the looseness inprevious work. Experiments confirm IWVI’s practicality for probabilistic inference.As a second contribution, we investigate inference with elliptical distributions,which improves accuracy in low dimensions, and convergence in high dimensions.  
A Divergence Bound for Hybrids of MCMC and Variational Inference and an Application to Langevin Dynamics and SGVI
J. Domke ICML 2017 pdf slides poster blog  
Two popular classes of methods for approximate inference are Markov chain Monte Carlo (MCMC) and variational inference. MCMC tends to be accurate if run for a long enough time, while variational inference tends to give better approximations at shorter time horizons. However, the amount of time needed for MCMC to exceed the performance of variational methods can be quite high, motivating more finegrained tradeoffs. This paper derives a distribution over variational parameters, designed to minimize a bound on the divergence between the resulting marginal distribution and the target, and gives an example of how to sample from this distribution in a way that interpolates between the behavior of existing methods based on Langevin dynamics and stochastic gradient variational inference (SGVI).  
Distributed Solar Prediction with Wind Velocity
J. Domke, N. Engerer, A. Menon and C. Webers IEEE Photovoltaic Specialists Conference (PVSC) 2016  
The growing uptake of residential PV (photovoltaic) systems creates challenges for electricity grid management, owing to the fundamentally intermittent nature of PV production. This creates the need for PV forecasting based on a distributed network of sites, which has been an area of active research in the past few years. This paper describes a new statistical approach to PV forecasting, with two key contributions. First, we describe a "local regularisation" scheme, wherein the PV energy at a given site is only attempted to be predicted based on measurements on geographically nearby sites. Second, we describe a means of incorporating wind velocities into our prediction, which we term "wind expansion", and show that this scheme is robust to errors in specification of the velocities. Both these extensions are shown to significantly improve the accuracy of PV prediction.  
Clamping Improves TRW and Mean Field Approximations
A. Weller and J. Domke AISTATS 2016 arxiv  
We examine the effect of clamping variables for approximate inference in undirected graphical models with pairwise relationships and discrete variables. For any number of variable labels, we demonstrate that clamping and summing approximate subpartition functions can lead only to a decrease in the partition function estimate for TRW, and an increase for the naive mean field method, in each case guaranteeing an improvement in the approximation and bound. We next focus on binary variables, add the Bethe approximation to consideration and examine ways to choose good variables to clamp, introducing new methods. We show the importance of identifying highly frustrated cycles, and of checking the singleton entropy of a variable. We explore the value of our methods by empirical analysis and draw lessons to guide practitioners.  
Reflection, Refraction, and Hamiltonian Monte Carlo
H. Afshar and J. Domke NIPS 2015 pdf  
Hamiltonian Monte Carlo (HMC) is a successful approach for sampling from continuous densities. However, it has difficulty simulating Hamiltonian dynamics with nonsmooth functions, leading to poor performance. This paper is motivated by the behavior of Hamiltonian dynamics in physical systems like optics. We introduce a modification of the Leapfrog discretization of Hamiltonian dynamics on piecewise continuous energies, where intersections of the trajectory with discontinuities are detected, and the momentum is reflected or refracted to compensate for the change in energy. We prove that this method preserves the correct stationary distribution when boundaries are affine. Experiments show that by reducing the number of rejected samples, this method improves on traditional HMC.  
Maximum Likelihood Learning With Arbitrary Treewidth via FastMixing Parameter Sets
J. Domke NIPS 2015 pdf  
Inference is typically intractable in hightreewidth undirected graphical models, making maximum likelihood learning a challenge. One way to overcome this is to restrict parameters to a tractable set, most typically the set of treestructured parameters. This paper explores an alternative notion of a tractable set, namely a set of "fastmixing parameters" where Markov chain Monte Carlo (MCMC) inference can be guaranteed to quickly converge to the stationary distribution. While it is common in practice to approximate the likelihood gradient using samples obtained from MCMC, such procedures lack theoretical guarantees. This paper proves that for any exponential family with bounded sufficient statistics, (not just graphical models) when parameters are constrained to a fastmixing set, gradient descent with gradients approximated by sampling will approximate the maximum likelihood solution inside the set with highprobability. When unregularized, to find a solution εaccurate in loglikelihood requires a total amount of effort cubic in 1/ε, disregarding logarithmic factors. When ridgeregularized, strong convexity allows a solution εaccurate in parameter distance with an effort quadratic in 1/ε. Both of these provide of a fullypolynomial time randomized approximation scheme.  
LossCalibrated Monte Carlo Action Selection
E. Abbasnejad, J. Domke and S. Sanner AAAI 2015 pdf  
Bayesian decisiontheory underpins robust decision making in applications ranging from plant control to robotics where hedging action selection against state uncertainty is critical for minimizing low probability but potentially catastrophic outcomes (e.g, uncontrollable plant conditions or robots falling into stairwells). Unfortunately, belief state distributions in such settings are often complex and/or high dimensional, thus prohibiting the efficient application of analytical techniques for expected utility computation when realtime control is required. This leaves Monte Carlo evaluation as one of the few viable (and hence frequently used) techniques for online action selection. However, lossinsensitive Monte Carlo methods may require large numbers of samples to identify optimal actions with high certainty since they may sample from high probability regions that do not disambiguate action utilities. In this paper we remedy this problem by deriving an optimal proposal distribution for a losscalibrated Monte Carlo importance sampler that maximizes a lower bound on the expected utility of the best action. Empirically, we show that using our losscalibrated Monte Carlo method yields highaccuracy optimal action selections in a fraction of the number of samples required by conventional lossinsensitive samplers.  
Projecting Markov Random Fields for Fast Mixing
X. Liu and J. Domke NIPS 2014 pdf  
Markov chain Monte Carlo (MCMC) algorithms are simple and extremely powerful techniques to sample from almost arbitrary distributions. The flaw in practice is that it can take a large and/or unknown amount of time to converge to the stationary distribution. This paper gives sufficient conditions to guarantee that univariate Gibbs sampling on Markov Random Fields (MRFs) will be fast mixing, in a precise sense. Further, an algorithm is given to project onto this set of fastmixing parameters in the Euclidean norm. Following recent work, we give an example use of this to project in various divergence measures, comparing univariate marginals obtained by sampling after projection to common variational methods and Gibbs sampling on the original parameters.  
Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems
A. Defazio, T. Caetano and J. Domke ICML 2014 pdf talk  
Recent advances in optimization theory have shown that smooth strongly convex finite sums can be minimized faster than by treating them as a black box "batch" problem. In this work we introduce a new method in this class with a theoretical convergence rate four times faster than existing methods, for sums with sufficiently many terms. This method is also amendable to a sampling without replacement scheme that in practice gives further speedups. We give empirical results showing state of the art performance.  
Training Structured Predictors Through Iterated Logistic Regression
J. Domke Advanced Structured Prediction, MIT Press, 2014 pdf talk  
In a setting where approximate inference is necessary, structured learning can be formulated as a joint optimization of inference "messages" and local potentials. This chapter observes that, for fixed messages, the optimization problem with respect to potentials takes the form of a logistic regression problem, biased by the current set of messages. This observation leads to an algorithm that alternates between messagepassing inference updates, and learning updates. It is possible to employ any set of potential functions where an algorithm exists to optimize a logistic loss, including linear functions, boosted decision trees, and multilayer perceptrons.  
Structured Learning via Logistic Regression
J. Domke NIPS 2013 pdf  
A successful approach to structured learning is to write the learning objective as a joint function of linear parameters and inference messages, and iterate between updates to each. This paper observes that if the inference problem is "smoothed" through the addition of entropy terms, for fixed messages, the learning objective reduces to a traditional (nonstructured) logistic regression problem with respect to parameters. In these logistic regression problems, each training example has a bias term determined by the current set of messages. Based on this insight, the structured energy function can be extended from linear factors to any function class where an "oracle" exists to minimize a logistic loss.  
Projecting Ising Model Parameters for Fast Mixing
J. Domke and X. Liu NIPS 2013 pdf code  
Inference in general Ising models is difficult, due to high treewidth making treebased algorithms intractable. Moreover, when interactions are strong, Gibbs sampling may take exponential time to converge to the stationary distribution. We present an algorithm to project Ising model parameters onto a parameter set that is guaranteed to be fast mixing, under several divergences. We find that Gibbs sampling using the projected parameters is more accurate than with the original parameters when interaction strengths are strong and when limited time is available for sampling.  
Learning Graphical Model Parameters with Approximate Marginals Inference
J. Domke PAMI 2013 pdf talk  
Likelihood basedlearning of graphical models faces challenges of computationalcomplexity and robustness to model misspecification. This paper studies methods that fit parameters directly to maximize a measure of the accuracy of predicted marginals, taking into account both model and inference approximations at training time. Experiments on imaging problems suggest marginalizationbased learning performs better than likelihoodbased approximations on difficult problems where the model being fit is approximate in nature.  
Generic Methods for OptimizationBased Modeling
J. Domke AISTATS 2012 pdf code  
"Energy" models for continuous domains can be applied to many problems, but often suffer from high computational expense in training, due to the need to repeatedly minimize the energy function to high accuracy. This paper considers a modified setting, where the model is trained in terms of results after optimization is truncated to a fixed number of iterations. We derive "backpropagating" versions of gradient descent, heavyball and LBFGS. These are simple to use, as they require as input only routines to compute the gradient of the energy with respect to the domain and parameters. Experimental results on denoising and image labeling problems show that learning with truncated optimization greatly reduces computational expense compared to "full" fitting.  
Dual Decomposition for Marginal Inference
J. Domke AAAI 2011 pdf slides  
We present a dual decomposition approach to the treereweighted belief propagation objective. Each tree in the treereweighted bound yields one subproblem, which can be solved with the sumproduct algorithm. The master problem is a simple differentiable optimization, to which a standard optimization method can be applied. Experimental results on 10x10 Ising models show the dual decomposition approach using LBFGS is similar in settings where messagepassing converges quickly, and one to two orders of magnitude faster in settings where messagepassing requires many iterations, specifically high accuracy convergence, and strong interactions.  
Parameter Learning with Truncated MessagePassing
J. Domke CVPR 2011 pdf appendix  
Training of conditional random fields often takes the form of a doubleloop procedure with messagepassing inference in the inner loop. This can be very expensive, as the need to solve the inner loop to high accuracy can require many messagepassing iterations. This paper seeks to reduce the expense of such training, by redefining the training objective in terms of the approximate marginals obtained after messagepassing is "truncated" to a fixed number of iterations. An algorithm is derived to efficiently compute the exact gradient of this objective. On a common pixel labeling benchmark, this procedure improves training speeds by an order of magnitude, and slightly improves inference accuracy if a very small number of messagepassing iterations are used at test time.  
Implicit Differentiation by Perturbation
J. Domke NIPS 2010 pdf  
This paper proposes a simple and efficient finite difference method for implicit differentiation of marginal inference results in discrete graphical models. Given an arbitrary loss function, defined on marginals, we show that the derivatives of this loss with respect to model parameters can be obtained by running the inference procedure twice, on slightly perturbed model parameters. This method can be used with approximate inference, with a loss function over approximate marginals. Convenient choices of loss functions make it practical to fit graphical models with hidden variables, high treewidth and/or model misspecification.  
Image Transformations and Blurring
J. Domke and Y. Aloimonos PAMI 2009 pdf  
Since cameras blur the incoming light during measurement, different images of the same surface do not contain the same information about that surface. Thus, in general, corresponding points in multiple views of a scene have different image intensities. While multiple view geometry constrains the locations of corresponding points, it does not give relationships between the signals at corresponding locations. This paper offers an elementary treatment of these relationships. We first develop the notion of "ideal" and "real" images, corresponding to, respectively, the raw incoming light and the measured signal. This framework separates the filtering and geometric aspects of imaging. We then consider how to synthesize one view of a surface from another; if the transformation between the two views is affine, it emerges that this is possible if and only if the singular values of the affine matrix are positive. Next, we consider how to combine the information in several views of a surface into a single output image. By developing a new tool called "frequency segmentation" we show how this can be done despite not knowing the blurring kernel.  
Who Killed the Directed Model?
J. Domke, A. Karapurkar and Y. Aloimonos CVPR 2008 pdf  
Prior distributions are useful for robust lowlevel vision, and undirected models (e.g. Markov Random Fields) have become a central tool for this purpose. Though sometimes these priors can be specified by hand, this becomes difficult in large models, which has motivated learning these models from data. However, maximum likelihood learning of undirected models is extremely difficult essentially all known methods require approximations and/or high computational cost.
Conversely, directed models are essentially trivial to learn from data, but have not received much attention for lowlevel vision. We compare the two formalisms of directed and undirected models, and conclude that there is no a priori reason to believe one better represents lowlevel vision quantities. We formulate two simple directed priors, for natural images and stereo disparity, to empirically test if the undirected formalism is superior. We find in both cases that a simple directed model can achieve results similar to the best learnt undirected models with significant speedups in training time, suggesting that directed models are an attractive choice for tractable learning. 

Learning Convex Inference of Marginals
J. Domke UAI 2008 pdf talk  
Graphical models trained using maximum likelihood are a common tool for probabilistic inference of marginal distributions. However, this approach suffers difficulties when either the inference process or the model is approximate. In this paper, the inference process is first defined to be the minimization of a covex fnuction, inspired by free energy approximations. Learning is then done directly in terms of the performance of the inference process at univariate marginal prediction. The main novelty is that this is a direct minimization of empirical risk, where the risk measures the accuracy of predicted marginals.  
Signals on Pencils of Lines
J. Domke and Y. Aloimonos ICCV 2007 pdf  
This paper proposes the "epipolar pencil transformation" (EPT). This is a tool for comparing the signals in different images, with no use of feature detection, yet taking advantage of the constraints given by epipolar geometry. The idea is to develop a descriptor for each point, summarizing the signals on the pencil of lines intersecting at that point. To compute the EPT, first find compact descriptors for each line, then combine these appropriately for each pencil. Given the EPT for two images, computing the epipolar geometry reduces to a closest pairs problem select one pencil from each set such that the L1 distance (in descriptor space) is minimized. By this reduction to a high dimensional closest pairs problem, recent advances in computational geometry can be used to efficiently identify the best global solution. This technique is robust, as each potential solution is evaluated by comparing the signals for all the lines passing through the two hypothesized epipoles. At the same time, as the closest pairs algorithm performs a global search, the solution is not distracted by local minima. The EPT is used here both for the problem of twoview rigid motion, and manyview place recognition.  
Multiple View Image Reconstruction: A Harmonic Approach
J. Domke and Y. Aloimonos CVPR 2007 pdf  
This paper presents a new constraint connecting the signals in multiple views of a surface. The constraint arises from a harmonic analysis of the geometry of the imaging process and it gives rise to a new technique for multiple view image reconstruction. Given several views of a surface from different positions, fundamentally different information is present in each image, owing to the fact that cameras measure the incoming light only after the application of a lowpass filter. Our analysis shows how the geometry of the imaging is connected to this filtering. This leads to a technique for constructing a single output image containing all the information present in the input images.  
A Probabilistic Notion of Camera Geometry: Calibrated vs. Uncalibrated
J. Domke and Y. Aloimonos PCV 2006 pdf  
We suggest altering the fundamental strategy in Fundamental or Essential Matrix estimation. The traditional approach first estimates correspondences, and then estimates the camera geometry on the basis of those correspondences. Though the second half of this approach is very well developed, such algorithms often fail in practice at the correspondence step. Here, we suggest altering the strategy. First, estimate probability distributions of correspondence, and then estimate camera geometry directly from these distributions. This strategy has the effect of making the correspondence step far easier, and the camera geometry step somewhat harder. The success of our approach hinges on if this tradeoff is wise. We will present an algorithm based on this strategy. Fairly extensive experiments suggest that this tradeoff might be profitable.  
Deformation and Viewpoint Invariant Color Histograms
J. Domke and Y. Aloimonos BMVC 2006 pdf  
We develop a theoretical basis for creating color histograms that are invariant under deformation or changes in viewpoint. The gradients in different color channels weight the influence of a pixel on the histogram so as to cancel out the changes induced by deformations. Experiments show these histograms to be invariant under a variety of distortions and changes in viewpoint.  
A Probabilistic Notion of Correspondence and the Epipolar Constraint
J. Domke and Y. Aloimonos 3DPVT 2006 pdf  
We present a probabilistic framework for correspondence and egomotion. First, we suggest computing probability distributions of correspondence. This has the advantage of being robust to points subject to the aperture effect and repetitive structure, while giving up no information at feature points. Additionally, correspondence probability distributions can be computed for every point in the scene. Next, we generate a probability distribution over the motions, from these correspondence probability distributions, through a probabilistic notion of the epipolar constraint. Finding the maximum in this distribution is shown to be a generalization of leastsquared epipolar minimization. We will show that because our technique allows so much correspondence information to be extracted, more accurate egomotion estimation is possible.  
Integration of Visual and Inertial Information for Egomotion: a Stochastic Approach
J. Domke and Y. Aloimonos ICRA 2006 pdf  
We present a probabilistic framework for visual correspondence, inertial measurements and Egomotion. First, we describe a simple method based on Gabor filters to produce correspondence probability distributions. Next, we generate a noise model for inertial measurements. Probability distributions over the motions are then computed directly from the correspondence distributions and the inertial measurements. We investigate combining the inertial and visual information for a single distribution over the motions. We find that with smaller amounts of correspondence information, fusion of the visual data with the inertial sensor results in much better Egomotion estimation. This is essentially because inertial measurements decrease the "translationrotation" ambiguity. However, when more correspondence information is used, this ambiguity is reduced to such a degree that the inertial measurements provide negligible improvement in accuracy. This suggests that inertial and visual information are more closely integrated in a compositional sense.  
A Probabilistic Framework for Correspondence and Egomotion
J. Domke and Y. Aloimonos ICCV WDV 2005 pdf  
This paper is an argument for two assertions: First, that by representing correspondence probabilistically, drastically more correspondence information can be extracted from images. Second, that by increasing the amount of correspondence information used, more accurate egomotion estimation is possible. We present a novel approach illustrating these principles.
We first present a framework for using Gabor filters to generate such correspondence probability distributions. Essentially, different filters 'vote' on the correct correspondence in a way giving their relative likelihoods. Next, we use the epipolar constraint to generate a probability distribution over the possible motions. As the amount of correspondence information is increased, the set of motions yielding significant probabilities is shown to "shrink" to the correct motion. 