Peter J. Haas
College of Information and Computer Sciences
140 Governors Drive
University of Massachusetts
Amherst, MA 01003-9264
Room 204
+1 413/545-3140
+1 413/545-1789 (fax)



While at IBM, Peter Haas was associated with the Accelerated Discovery Lab and collaborated with the Cognitive Solutions and Foundations group and with the IBM Watson Division. The research topics that he pursued are described below.

Modeling and Simulation

Peter played a leading role on the Splash project, whose goal was to build a novel computational framework for integrating existing data, models, simulations, and analytics in order to facilitate collaborative, cross-disciplinary modeling, simulation, and analytics. Peter also conducted research on techniques for modeling, simulation, and control of complex discrete-event stochastic systems, with applications to manufacturing, computer, telecommunication, work-flow, and transportation systems. He made fundamental contributions to the theory of stochastic Petri nets (SPNs) and generalized semi-Markov processes (GSMPs), models that can be used to formally represent a broad class of complex discrete-event systems. His results include (1) the first formal definitions of non-Markovian SPNs and colored SPNs, (2) development of novel methods for specifying and simulating delays in SPNs and GSMPs, and (3) development of a simulation theory that provides conditions on the building blocks of an SPN or GSMP under which the associated system-state and delay processes are stable and amenable to output-analysis techniques such as the regenerative method, the batch-means method, or spectral methods. In later work he established some surprising and counterintuitive results for systems in which clocks for event occurrences are set according to heavy-tailed distributions.
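As a concrete illustration of the batch-means output-analysis technique mentioned above, here is a minimal Python sketch; the AR(1)-style toy process, batch count, and t-multiplier are illustrative choices, not taken from Peter's work:

```python
import random
import statistics

def batch_means(samples, num_batches=10):
    """Batch-means method: split an autocorrelated simulation output
    sequence into batches, treat the batch means as approximately
    i.i.d., and form a t-based confidence interval for the mean."""
    batch_size = len(samples) // num_batches
    means = [
        statistics.fmean(samples[i * batch_size:(i + 1) * batch_size])
        for i in range(num_batches)
    ]
    grand_mean = statistics.fmean(means)
    # 2.262 is the 97.5% quantile of the t-distribution with 9 d.f.
    half_width = 2.262 * statistics.stdev(means) / num_batches ** 0.5
    return grand_mean, half_width

# Toy autocorrelated output stream (e.g., successive waiting times).
random.seed(1)
x, out = 0.0, []
for _ in range(10_000):
    x = 0.8 * x + random.expovariate(1.0)
    out.append(x)

mean, hw = batch_means(out)  # steady-state mean of this process is 5
```

Batching matters here precisely because successive outputs are correlated: a naive standard error over the raw samples would be badly optimistic, while the batch means are nearly independent.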

Managing Uncertain Data / Stochastic Analytics / Machine Learning

In a collaboration with the IBM Watson Division, Peter developed novel methods for principled computation of confidence values for machine-generated hypotheses. With Chris Jermaine (Rice University) and his students, Peter developed the Monte Carlo database system (MCDB) for managing uncertain data; this system supports complex stochastic analytics close to the data. With colleagues at IBM, he developed techniques for porting MCDB functionality to Hadoop, a massively parallel MapReduce computing environment. Peter and Prof. Jermaine then extended MCDB to support risk analysis by enabling estimation of extreme quantiles of a query-result probability distribution and efficient random sampling from the corresponding distribution tails. Peter has also helped develop a system for providing upper and lower bounds on the results of OLAP queries over unresolved integrated data, and has developed techniques, based on maximum-entropy methods, for assigning correctness probabilities to structured data that have been automatically extracted from text. MCDB was extended into a new system called SimSQL, which allows distributed simulation of database-valued Markov chains, using Hadoop for scalability. This functionality in turn enables Bayesian machine learning over massive data, as well as large-scale agent-based simulations. SimSQL has been released as open-source code.
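The core MCDB idea, answering a query over uncertain data by Monte Carlo sampling of "possible worlds," can be sketched in a few lines of Python. This is a toy illustration under assumed Gaussian row distributions, not the actual system:

```python
import random
import statistics

def mc_query(sample_world, query, trials=1000, seed=42):
    """Repeatedly instantiate a 'possible world' of the uncertain
    database and run the query on it, yielding a sample from the
    query-result probability distribution."""
    rng = random.Random(seed)
    return [query(sample_world(rng)) for _ in range(trials)]

# Toy uncertain table: each row's revenue is Normal(mu, sigma).
params = [(100.0, 10.0), (250.0, 25.0), (80.0, 5.0)]

def sample_world(rng):
    return [rng.gauss(mu, sigma) for mu, sigma in params]

def total_revenue(rows):
    return sum(rows)

results = mc_query(sample_world, total_revenue)
est = statistics.fmean(results)  # close to 430, the sum of the row means
# Upper-tail quantile of the result distribution, as used in risk analysis:
q95 = sorted(results)[int(0.95 * len(results))]
```

The extreme-quantile work mentioned above addresses exactly the tail statistic computed in the last line, where naive Monte Carlo needs many trials to resolve the tail accurately.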

Analytics over Massive Data

With IBM and university colleagues, Peter developed best-of-breed parallel and distributed Big Data algorithms for a variety of analytics tasks.

Peter also worked on Ricardo, a system that leverages MapReduce processing to extend the R statistical platform to massive data. The above work has been in collaboration with the BigInsights project. As mentioned above, Peter has also collaborated with Chris Jermaine on the SimSQL system for stochastic analytics and machine learning on massive data.

Sampling for Information Management

Peter has contributed in a number of ways to the development of methods for, and applications of, database sampling. Peter helped create the proposed ISO standard for random sampling in SQL queries and provided statistical expertise during the implementation of sampling in IBM's DB2 product; he has developed novel techniques for supporting and enhancing this type of sampling. As part of his query-optimization and data-integration work, Peter helped develop a sampling-based "bump-hunting" method for discovering fuzzy algebraic constraints in relational data, as well as the CORDS algorithm for automatic discovery of correlations and soft functional dependencies. Use of these methods has resulted in order-of-magnitude speedups in query processing. He has also developed state-of-the-art sampling-based methods for estimating the number of distinct values of a database attribute, as well as sampling-based methods for accelerating association-rule mining. Peter has also developed sampling-based methods for quickly estimating the answer to "aggregation" queries that compute statistics, such as selectivities, sums, averages, and distinct-value counts, over relational expressions. His work with the CONTROL group at UC Berkeley and with Jeffrey Naughton at the University of Wisconsin focused on extending these methods even further to permit online, interactive processing of aggregation queries. Peter also developed sampling-based algorithms for creating and maintaining multiple sample synopses in a "synopsis warehouse" in order to support information discovery for the enterprise, as well as a sampling-based method for optimizing scan sharing in main-memory databases on multi-core CPU machines.
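In the spirit of the sampling-based aggregation work described above, the following Python sketch estimates a SUM query from a simple random sample and attaches a rough normal-theory confidence half-width; the table, sampling fraction, and helper names are invented for illustration:

```python
import random
import statistics

def estimate_sum(table, sample_frac, rng):
    """Scale the sum over a simple random sample up to the full table
    and attach an approximate 95% confidence half-width."""
    n = len(table)
    k = max(2, int(sample_frac * n))
    sample = rng.sample(table, k)
    est = (n / k) * sum(sample)
    # standard error of the scaled-up sum under simple random sampling
    se = n * statistics.stdev(sample) / k ** 0.5
    return est, 1.96 * se

rng = random.Random(0)
table = [rng.uniform(0, 100) for _ in range(100_000)]
est, half_width = estimate_sum(table, 0.01, rng)
```

An online-aggregation system refines such an (estimate, half-width) pair continuously as more tuples are sampled, letting the user stop the query once the interval is tight enough.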

Other Research and Product-Related Activities

Peter helped develop a variety of techniques for query optimizers in database systems to improve their performance over time by learning from query feedback, as part of the LEO project. This includes use of maximum entropy techniques for consistent selectivity estimation and feedback-based histogram maintenance, as well as feedback-based methods for detecting statistical dependencies and for automatically configuring query-optimizer statistics collection. A number of his algorithms were incorporated into the DB2 and IDS products. Peter also studied the application of probabilistic methods to problems in query optimization for XML and relational databases. He has also worked on enhancing the statistical-processing capabilities of the DB2 and Visual Warehouse products, helping both to implement this functionality and to develop the ISO SQL standard for specifying linear regression queries over relational databases. He also developed statistical and data-mining techniques for detection and prediction of anomalies in complex software systems. In addition, he helped develop novel hash-based methods for accurate distinct-value estimation in the presence of multiset operations.
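The hash-based distinct-value estimation mentioned above can be illustrated with a toy K-Minimum-Values (KMV) synopsis in Python; the constants and helper names here are illustrative, and the published methods use more refined estimators and support multiset operations directly on the synopses:

```python
import hashlib

def kmv_synopsis(values, k=256):
    """Hash each value to a pseudo-uniform point in [0, 1) and keep
    the k smallest distinct hash values seen."""
    hashes = set()
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        hashes.add(h / 2.0 ** 64)
    return sorted(hashes)[:k]

def estimate_distinct(synopsis, k=256):
    """Basic KMV estimator: if the k-th smallest hash value is U_(k),
    the number of distinct values is roughly (k - 1) / U_(k)."""
    if len(synopsis) < k:
        return len(synopsis)          # fewer than k distinct values: exact
    return (k - 1) / synopsis[k - 1]

data = [i % 5000 for i in range(100_000)]   # 5,000 distinct values
est = estimate_distinct(kmv_synopsis(data))
```

Because the synopsis depends only on the set of hash values, duplicates have no effect, and synopses of two datasets can be combined to estimate the distinct count of their union.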