Reverse Data Management

Forward and Reverse Data Transformations

Database research mainly focuses on forward-moving data flows: source data is transformed through queries, aggregations, and view definitions to form a new target instance, possibly with a different schema. This forward paradigm underpins most data management tasks today, such as querying, data integration, and data mining. We contrast this forward processing with Reverse Data Management (RDM), where an action must be performed on the input data in order to achieve a desired outcome in the output data. Some data management tasks already fall under this paradigm, for example updates through views, data generation, and data cleaning and repair. RDM is, by necessity, conceptually more difficult to define and computationally harder to achieve. Today, however, more and more of the available data is derived from other data, so there is a growing need to modify the input in order to achieve a desired effect on the output. This motivates a systematic study of RDM.

Reverse transformations

In general, RDM problems are harder to formulate and implement, because the inverse of a function is not always a function: given a desired output (or change to the output), there may be multiple inputs that produce it, or none at all. This difficulty is common to all RDM problems. This project aims to develop a unified framework for Reverse Data Management problems, which will bring together several subfields of database research.

RDM problems can be classified along two dimensions, as shown in the table below. On the "target" dimension, problems are divided into those that have explicit specifications and those that have implicit ones. An explicit specification means that the desired target effect is given as a tuple-level data instance; this is the case in causality and view updates. An implicit specification means that the target effect is described indirectly, through statistics and constraints; examples include how-to queries and data generation. On the "source" dimension, problems are divided into those that use a reference source and those that do not. For example, view updates and how-to queries fall under the former category, while data generation falls under the latter.

RDM classification
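
The non-invertibility noted above can be seen on a toy case. The sketch below is only an illustration, not part of the RDM framework: it assumes a forward transformation that is a plain SUM aggregate, and the names `forward` and `preimages`, the value domain, and the instance size are all hypothetical.

```python
from itertools import product


def forward(source):
    """Forward transformation (assumed for illustration): SUM over the source rows."""
    return sum(source)


def preimages(target, domain, size):
    """Enumerate every source instance with `size` rows drawn from `domain`
    whose forward image equals `target`."""
    return [s for s in product(domain, repeat=size) if forward(s) == target]


# A reachable target has several candidate sources ...
many = preimages(3, domain=range(4), size=2)
# ... while an unreachable target has none.
none = preimages(10, domain=range(4), size=2)

print(len(many), "sources produce 3;", len(none), "sources produce 10")
```

A reverse problem therefore needs extra information to be well defined, for example a reference source to stay close to, or constraints that narrow the candidates.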

How-To Queries

How-to queries are motivated by a related research problem within the forward processing paradigm: what-if, or hypothetical, queries. What-if queries ask questions of the form "How would the output change for a given change in the source?". They are motivated by a variety of business applications that require strategy evaluation and decision making. However, it is often more meaningful for such applications to be treated under the reverse framework of how-to queries, i.e. "How should the input change in order to achieve the desired output?"

Example (Portfolio Analysis) An analyst at a brokerage company wants to investigate strategies that could improve the returns and volatility of customer portfolios, based on the company's recommendations during the last three years. He would like to receive a list of possible modifications to the company's stock recommendations that would achieve the desired output in the customers' portfolios (e.g. 10% return). Out of all the possible scenarios, the analyst wants to give preference to those that are closest to the company's current strategy, as they would require fewer trades.
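
The portfolio example can be sketched as a small search problem. This is a minimal illustration under stated assumptions, not the project's method: the stock names, returns, and current weights are made up, volatility is ignored, candidate strategies are restricted to a coarse grid of weights, and "closest to the current strategy" is taken to mean the smallest total change in weights. The functions `portfolio_return`, `distance`, and `how_to` are hypothetical names.

```python
from itertools import product

# Hypothetical data: annual return per stock and the current recommended weights.
RETURNS = {"A": 0.04, "B": 0.12, "C": 0.20}
CURRENT = {"A": 0.6, "B": 0.3, "C": 0.1}


def portfolio_return(weights):
    """Forward (what-if) direction: the return produced by a given set of weights."""
    return sum(weights[s] * RETURNS[s] for s in RETURNS)


def distance(weights, reference):
    """Total absolute change in weights, used here as a proxy for the number of trades."""
    return sum(abs(weights[s] - reference[s]) for s in reference)


def how_to(target_return, step=10):
    """Reverse (how-to) direction: among all weightings on a grid of 1/step,
    keep those that reach the target and return the one closest to CURRENT."""
    stocks = list(RETURNS)
    feasible = []
    for parts in product(range(step + 1), repeat=len(stocks)):
        if sum(parts) != step:  # weights must sum to 1
            continue
        weights = {s: p / step for s, p in zip(stocks, parts)}
        if portfolio_return(weights) >= target_return - 1e-9:
            feasible.append(weights)
    if not feasible:
        return None  # no input on the grid achieves the desired output
    return min(feasible, key=lambda w: distance(w, CURRENT))


print("what-if:", portfolio_return(CURRENT))
print("how-to :", how_to(0.10))
```

The what-if call evaluates one given input, while the how-to call searches over inputs and uses the current strategy as a reference source to choose among the many that qualify.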

Example (Shipment Consolidation) A product supplier receives orders for various products from different clients. He would like to minimize costs by consolidating shipments to the same client. However, the supplier needs to make sure that all orders arrive within the agreed delivery window, and no shipment exceeds the maximum order size.

Example (Resource Utilization) A system administrator has access to system logs of a server cluster with information on resource allocation and utilization, and job arrival, execution, and waiting times. During peak times within the day the system can become overloaded, and job wait times reach undesirable highs. The administrator wants to determine which machines in the cluster should be kept in operation so that job wait times are bounded by a small constant, and throughput remains high, while minimizing the operational cost of the cluster (there is a cost associated with keeping a machine in operation).
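
The resource utilization example has the same shape: a constraint on the output and a cost to minimize on the input. The sketch below is a deliberately simplified illustration. It assumes that wait times stay bounded whenever the total capacity of the running machines covers the peak load, which is a stand-in for a real queueing model, and the machine names, capacities, and costs are hypothetical.

```python
from itertools import combinations

# Hypothetical cluster: machine -> (capacity in jobs/hour, operating cost per hour).
MACHINES = {"m1": (40, 5), "m2": (30, 3), "m3": (30, 4), "m4": (20, 2)}


def cheapest_cluster(peak_load):
    """Return (cost, machines) for the cheapest subset whose total capacity
    covers `peak_load`, or None if even the full cluster cannot cover it.

    Assumption: covering the peak load is used as a proxy for bounded wait times.
    """
    best = None
    names = sorted(MACHINES)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            capacity = sum(MACHINES[m][0] for m in subset)
            cost = sum(MACHINES[m][1] for m in subset)
            if capacity >= peak_load and (best is None or cost < best[0]):
                best = (cost, subset)
    return best


print(cheapest_cluster(60))
```

Exhaustive enumeration is only workable for a handful of machines; it is used here to keep the reverse formulation visible rather than to suggest an efficient algorithm.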

Publications

  1. Reverse Data Management
    Alexandra Meliou, Wolfgang Gatterbauer, and Dan Suciu.
    VLDB 2011.
    [Paper (pdf)], [Slides (pptx)]


People

Faculty (University of Massachusetts, Amherst)
Faculty (University of Washington)