Writing and maintaining UI tests for mobile apps is a time-consuming and tedious task. While decades of research have produced automated approaches for UI test generation, these approaches typically focus on testing for crashes or maximizing code coverage. By contrast, recent research has shown that developers prefer usage-based tests, which center around specific uses of app features, to help support activities such as regression testing. Very few existing techniques support the generation of such tests, as doing so requires automating the difficult task of understanding the semantics of UI screens and user inputs. In this paper, we introduce Avgust, which automates key steps of generating usage-based tests. Avgust uses neural models for image understanding to process video recordings of app uses to synthesize an app-agnostic state-machine encoding of those uses. Then, Avgust uses this encoding to synthesize test cases for a new target app. We evaluate Avgust on 374 videos of common uses of 18 popular apps and show that 69% of the tests Avgust generates successfully execute the desired usage, and that Avgust's classifiers outperform the state of the art.
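As a rough illustration of the core idea only (this is not Avgust's actual code, and every name below is invented for the sketch), the following Python fragment shows what an app-agnostic state-machine encoding of a usage might look like, and how a test for a new target app could be walked out of it given some way to map abstract UI elements onto the target app's concrete widgets:

from dataclasses import dataclass, field

@dataclass(frozen=True)
class ScreenState:
    """App-agnostic label for a UI screen, e.g., as output by an image classifier."""
    label: str  # e.g., "sign_in_screen"

@dataclass(frozen=True)
class Interaction:
    """App-agnostic user action on a semantically labeled UI element."""
    element: str  # e.g., "username_field"
    action: str   # e.g., "type", "tap"

@dataclass
class UsageModel:
    """State machine distilled from video traces of one usage (e.g., sign-in).
    Hypothetical sketch; Avgust's real encoding and synthesis differ."""
    transitions: dict = field(default_factory=dict)  # (ScreenState, Interaction) -> ScreenState

    def add_trace(self, trace):
        """Fold one video's classified (state, interaction, next-state) steps into the machine."""
        for src, interaction, dst in trace:
            self.transitions[(src, interaction)] = dst

    def synthesize_test(self, start, match_element, max_steps=50):
        """Walk the machine from `start`, mapping each abstract interaction onto a
        concrete widget of the target app via the caller-supplied `match_element`
        lookup; returns a list of (action, widget) test steps."""
        steps, state = [], start
        for _ in range(max_steps):  # bound the walk so cyclic usages terminate
            candidates = [(i, dst) for (s, i), dst in self.transitions.items() if s == state]
            if not candidates:
                break
            interaction, state = candidates[0]
            steps.append((interaction.action, match_element(interaction.element)))
        return steps

In this sketch, a caller would populate the model with add_trace once per classified video, then call synthesize_test with a matcher that searches the target app's current screen for the widget best matching each abstract element label; that matching step is where the hard semantic-understanding work described in the abstract lives.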
@inproceedings{Zhao22fse,
  author    = {Yixue Zhao and Saghar Talebipour and Kesina Baral and Hyojae Park and Leon Yee and Safwat Ali Khan and Yuriy Brun and Nenad Medvidovic and Kevin Moran},
  title     = {\href{http://people.cs.umass.edu/brun/pubs/pubs/Zhao22fse.pdf}{{\textsc{Avgust}}: {Automating} Usage-Based Test Generation from Videos of App Executions}},
  booktitle = {Proceedings of the 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE)},
  venue     = {ESEC/FSE},
  month     = {November},
  year      = {2022},
  date      = {14--18},
  pages     = {421--433},
  address   = {Singapore},
  doi       = {10.1145/3540250.3549134},
  note      = {ACM artifact badges granted: \href{https://www.acm.org/publications/policies/artifact-review-and-badging-current}{\raisebox{-.75ex}{\includegraphics[height=2.5ex]{ACMArtifactAvailable}}~Artifact Available, \raisebox{-.75ex}{\includegraphics[height=2.5ex]{ACMArtifactFunctional}}~Artifact Functional}. \href{https://doi.org/10.1145/3540250.3549134}{DOI: 10.1145/3540250.3549134}, arXiv: \href{https://arxiv.org/abs/2209.02577}{abs/2209.02577}},
  accept    = {$\frac{99}{449} \approx 22\%$},
  abstract  = {<p>Writing and maintaining UI tests for mobile apps is a time-consuming and tedious task. While decades of research have produced automated approaches for UI test generation, these approaches typically focus on testing for crashes or maximizing code coverage. By contrast, recent research has shown that developers prefer usage-based tests, which center around specific uses of app features, to help support activities such as regression testing. Very few existing techniques support the generation of such tests, as doing so requires automating the difficult task of understanding the semantics of UI screens and user inputs. In this paper, we introduce Avgust, which automates key steps of generating usage-based tests. Avgust uses neural models for image understanding to process video recordings of app uses to synthesize an app-agnostic state-machine encoding of those uses. Then, Avgust uses this encoding to synthesize test cases for a new target app. We evaluate Avgust on 374 videos of common uses of 18 popular apps and show that 69% of the tests Avgust generates successfully execute the desired usage, and that Avgust's classifiers outperform the state of the art.</p>},
  fundedBy  = {NSF CCF-1717963, NSF CCF-1763423, NSF CNS-1823354, NSF CCF-1955853, NSF CCF-2030859 (to the CRA for CIFellows), U.S. Office of Naval Research N00014-17-1-2896},
}