Supporting Undo and Redo in Scientific Data Analysis
by Xiang Zhao, Emery R. Boose, Yuriy Brun, Barbara Staudt Lerner, Leon J. Osterweil
Abstract:
This paper presents a provenance-based technique to support undoing and redoing data analysis tasks. Our technique targets scientists who experiment with combinations of approaches to processing raw data into presentable datasets. Raw data may be noisy and in need of cleaning, it may suffer from sensor drift that requires retrospective calibration and data correction, or it may need gap-filling due to sensor malfunction or environmental conditions. Different raw datasets may have different issues requiring different kinds of adjustments, and each issue may potentially be handled by different approaches. Thus, scientists must often experiment with different sequences of approaches. In our work, we show how provenance information can be used to facilitate this kind of experimentation with scientific datasets. We describe an approach that supports the ability to (1) undo a set of tasks while setting aside the artifacts and consequences of performing those tasks, (2) replace, remove, or add a data-processing technique, and (3) automatically redo those set-aside tasks that are consistent with the changed technique. We have implemented our technique and demonstrate its utility with a case study of a common sensor-network data-processing scenario, showing how our approach can reduce the cost of changing intermediate data-processing techniques in a complex, data-intensive process.
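The three capabilities listed in the abstract (undo with set-aside, replace a technique, redo) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Record` and `Pipeline` classes, the linear pipeline shape, the toy processing steps, and the rule that a set-aside task is reused only when its recorded input matches the new input are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

Series = List[float]
Step = Callable[[Series], Series]


@dataclass
class Record:
    """One provenance entry: which step ran, on what input, producing what."""
    name: str
    step: Step
    inp: Series
    out: Series


class Pipeline:
    """Hypothetical linear pipeline that keeps provenance for undo and redo."""

    def __init__(self, raw: Series):
        self.raw = raw
        self.history: List[Record] = []    # provenance of executed steps
        self.set_aside: List[Record] = []  # undone steps, kept for redo

    def current(self) -> Series:
        return self.history[-1].out if self.history else self.raw

    def run(self, name: str, step: Step) -> None:
        inp = self.current()
        self.history.append(Record(name, step, inp, step(inp)))

    def undo_to(self, name: str) -> None:
        """Undo the named step and all later steps, setting their records aside."""
        idx = next(i for i, r in enumerate(self.history) if r.name == name)
        self.set_aside = self.history[idx:]
        self.history = self.history[:idx]

    def replace_and_redo(self, name: str, new_step: Step) -> List[str]:
        """Run new_step in place of the first set-aside step, then redo the rest.

        Returns the names of steps whose recorded output was reused
        because their input did not change (an assumed consistency rule).
        """
        reused: List[str] = []
        self.run(name, new_step)
        for old in self.set_aside[1:]:
            if self.current() == old.inp:
                self.history.append(old)      # same input: reuse old result
                reused.append(old.name)
            else:
                self.run(old.name, old.step)  # input changed: re-execute
        self.set_aside = []
        return reused


# Toy processing techniques (illustrative only).
def drop_spikes(xs: Series) -> Series:
    return [x for x in xs if x < 100.0]


def clip_spikes(xs: Series) -> Series:
    return [min(x, 100.0) for x in xs]


def calibrate(xs: Series) -> Series:
    return [x - 0.5 for x in xs]


p = Pipeline([10.0, 250.0, 12.0])
p.run("clean", drop_spikes)
p.run("calibrate", calibrate)
p.undo_to("clean")                               # (1) undo, setting tasks aside
reused = p.replace_and_redo("clean", clip_spikes)  # (2) replace, (3) redo
print(p.current(), reused)
```

In this sketch the calibration step is re-executed because the replaced cleaning technique changes its input; had the new technique produced the same cleaned series, the recorded calibration output would have been reused instead.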
Citation:
Xiang Zhao, Emery R. Boose, Yuriy Brun, Barbara Staudt Lerner, and Leon J. Osterweil, Supporting Undo and Redo in Scientific Data Analysis, in Proceedings of the 5th USENIX Workshop on the Theory and Practice of Provenance (TaPP), 2013.
Related:
A previous version appeared as University of Massachusetts, Computer Science technical report UM-CS-2013-015.
Bibtex:
@inproceedings{Zhao13TaPP,
  author = {Xiang Zhao and Emery R. Boose and Yuriy Brun and Barbara Staudt
  Lerner and Leon J. Osterweil},  
  title =
  {\href{http://people.cs.umass.edu/brun/pubs/pubs/Zhao13TaPP.pdf}{Supporting Undo
  and Redo in Scientific Data Analysis}},
  booktitle = {Proceedings of the 5th USENIX Workshop on the Theory and
  Practice of Provenance (TaPP)},
  venue = {TaPP},
  address = {Lombard, IL, USA},
  month = {April},
  date = {2--3},
  year = {2013},
  accept = {$\frac{12}{19} \approx 63\%$},

  note = {A previous version appeared as University of Massachusetts,
  Computer Science technical report UM-CS-2013-015},
  previous = {A previous version appeared as University of
  Massachusetts, Computer Science technical report UM-CS-2013-015},

  abstract = {This paper presents a provenance-based technique to support
  undoing and redoing data analysis tasks. Our technique targets scientists who
  experiment with combinations of approaches to processing raw data into
  presentable datasets. Raw data may be noisy and in need of cleaning, it may
  suffer from sensor drift that requires retrospective calibration and data
  correction, or it may need gap-filling due to sensor malfunction or
  environmental conditions. Different raw datasets may have different issues
  requiring different kinds of adjustments, and each issue may potentially be
  handled by different approaches. Thus, scientists must often experiment with
  different sequences of approaches. In our work, we show how provenance
  information can be used to facilitate this kind of experimentation with
  scientific datasets. We describe an approach that supports the ability to
  (1) undo a set of tasks while setting aside the artifacts and consequences of
  performing those tasks, (2) replace, remove, or add a data-processing
  technique, and (3) automatically redo those set-aside tasks that are
  consistent with the changed technique. We have implemented our technique and
  demonstrate its utility with a case study of a common sensor-network
  data-processing scenario, showing how our approach can reduce the cost of
  changing intermediate data-processing techniques in a complex, data-intensive
  process.},
}