Alexandra Meliou


Assistant Professor

College of Information and Computer Sciences

140 Governors Drive

University of Massachusetts

Amherst, MA 01003-9264 USA


Email:
Office: 330
Phone: +1-413-545-3788
Fax: +1-413-545-1249


Curriculum Vitae

Research

Data is critical in almost every aspect of society, including education, technology, healthcare, economy, and science. Poor understanding and handling of data, data biases, poor data quality, and errors in data-driven processes are detrimental in all domains that rely on data. My research augments data management with user-facing functionality that helps people make sense of their data and use it effectively, at a time when data is becoming increasingly unpredictable, unwieldy, and unmanageable. I focus on issues of provenance, causality, explanations, data quality, usability, and data and algorithmic bias.

Prospective students

I am actively recruiting strong graduate and junior / senior undergraduate students to work on research projects. If you are a UMass student interested in doing research with me, you should email me to set up an appointment, giving me a brief summary of your background and interests. My interests are not restricted to my existing projects, so do not hesitate to come to me with ideas of your own.

If you are not yet a UMass student: You should apply (undergraduate | graduate) to become one before contacting me. Emailing me directly will not affect your chances of admission, and I will not be able to respond to these requests. However, if you have done research, you can have your mentor email me with a recommendation.



Research highlights

Usability and analysis

As data is now a staple in so many aspects of human activity, the audience for data technologies has expanded to include a varied range of users: from non-experts wishing to peruse datasets, to domain experts with specialized data processing needs. Data systems have not adapted to address these demands effectively: databases' specialized query languages and structure create barriers for non-experts, while the lack of native support for important computing needs leaves experts to develop application-specific solutions themselves. Our work removes data-use barriers by simplifying access for non-experts to data and by augmenting database functionality with advanced problem-solving capabilities, thus simplifying analytics workflows by moving them closer to the data.

PackageBuilder: supporting queries for packages [Project page]
Traditional database queries follow a simple model: they define constraints that each tuple in the result must satisfy. This model is computationally efficient, as the database system can evaluate the query conditions on each tuple individually. However, many practical, real-world problems require a collection of result tuples—which we call a package—to satisfy constraints collectively, rather than individually. We developed an end-to-end system that supports package queries, allowing the declarative specification and efficient evaluation of a significant class of constrained optimization problems within a database.
Publications: [PVLDB 2016] [SIGMOD Record 2017] [VLDBJ 2017] [CACM 2018 (in production)]
Awards: ACM SIGMOD Research Highlight, CACM Research Highlight, Best Papers of VLDB 2016
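The distinction between per-tuple and collective constraints can be made concrete in a few lines. The toy Python sketch below uses hypothetical recipe data and brute-force enumeration (the actual system evaluates package queries declaratively and with scalable optimization techniques): it searches for a package of rows whose total calories stay under a budget while maximizing total protein.

```python
from itertools import combinations

# Hypothetical table of recipes: (name, calories, protein)
recipes = [
    ("oatmeal", 300, 10),
    ("salad", 150, 5),
    ("steak", 700, 50),
    ("yogurt", 120, 12),
]

def best_package(rows, max_kcal, size):
    """Brute-force a package of `size` rows whose total calories stay
    under `max_kcal`, maximizing total protein. Returns (protein, rows)."""
    best = None
    for pkg in combinations(rows, size):
        if sum(r[1] for r in pkg) <= max_kcal:  # collective constraint
            protein = sum(r[2] for r in pkg)    # collective objective
            if best is None or protein > best[0]:
                best = (protein, pkg)
    return best

# Best 2-item package under a 900 kcal budget
print(best_package(recipes, 900, 2))
# → (62, (('steak', 700, 50), ('yogurt', 120, 12)))
```

A traditional query can only filter each recipe individually; the calorie budget here is a property of the package as a whole, which is what makes such queries hard to express and evaluate in a standard database.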

SQuID: Semantic-similarity-aware Query Intent Discovery [Project page]
Non-experts cannot easily peruse relational data, as traditional query interfaces only allow data retrieval through well-structured queries. To write such queries, one needs expertise in the query language (typically SQL) and knowledge of the potentially complex database schema. Unfortunately, non-expert users typically lack both. SQuID infers query intent effectively by leveraging the data in the database to understand the context of the provided examples. SQuID's abduction-aware probabilistic model captures esoteric and complex semantic contexts, outperforming the state of the art.
Publications: [SIGMOD 2018 demo]
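As a rough illustration of inferring intent from examples (not SQuID's actual model, which uses abduction-aware probabilistic reasoning over the database), a toy approach is to propose equality filters on the attributes where all example tuples agree; the table and attribute names below are hypothetical:

```python
def infer_filters(table, examples):
    """Propose candidate equality filters: attributes on which every
    example row shares the same value. A toy stand-in for richer,
    abduction-based query-intent inference."""
    filters = {}
    for attr in table[0].keys():
        values = {row[attr] for row in examples}
        if len(values) == 1:           # all examples agree on this attribute
            filters[attr] = values.pop()
    return filters

db = [
    {"name": "ann", "dept": "CS", "year": 2},
    {"name": "bob", "dept": "CS", "year": 3},
    {"name": "cat", "dept": "EE", "year": 2},
]
print(infer_filters(db, [db[0], db[1]]))  # → {'dept': 'CS'}
```

Even this crude version shows the key idea: the examples alone are ambiguous, and it is the surrounding data that disambiguates which query the user likely intended.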

Fairness and diversity

Data-driven software has the ability to shape human behavior: it affects the products we view and purchase, the news articles we read, the social interactions we engage in, and, ultimately, the opinions we form. Yet, data is an imperfect medium, tainted by errors, omissions, and biases. As a result, discrimination shows up in many data-driven applications, such as advertisements, hotel bookings, image search, and vendor services. Biases in data and software risk forming, propagating, and perpetuating biases in society. Data management research should develop tools to detect, inform, and mitigate the effects of bias, skew, and misuse in data-driven processes.

Fairness testing [Project page]
Our work studied software fairness and discrimination and produced a testing-based method for measuring if and how much software discriminates, focusing on causality in discriminatory behavior. Our approach, Themis, is the first framework of its kind that automatically generates efficient test suites to measure discrimination. Our techniques rely on reasoning about causal relationships between inputs and outputs of a system. Understanding how inputs affect software behavior can empower developers to control for bias in data and ensure more fair use of software systems.
Publications: [ESEC/FSE 2017] [ESEC/FSE 2018 demo] [ESEC/FSE 2018 vision]
Awards: ACM SIGSOFT Distinguished Paper Award
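The causal idea behind this kind of testing can be sketched simply: hold all inputs fixed, vary only a sensitive attribute, and check whether the decision changes. The sketch below (a hypothetical model and population, not the Themis implementation) measures the fraction of individuals whose outcome flips under such an intervention:

```python
def causal_discrimination_rate(model, population, sensitive_attr, values):
    """Fraction of individuals whose decision changes when only the
    sensitive attribute is varied, all other inputs held fixed."""
    flipped = 0
    for person in population:
        outcomes = set()
        for v in values:
            variant = dict(person, **{sensitive_attr: v})  # intervene on one input
            outcomes.add(model(variant))
        if len(outcomes) > 1:          # outcome depends causally on the attribute
            flipped += 1
    return flipped / len(population)

# Toy biased model: approves only group "A" applicants above a score cutoff
def toy_model(x):
    return x["score"] > 600 and x["group"] == "A"

people = [{"score": s, "group": "A"} for s in (500, 650, 700, 550)]
print(causal_discrimination_rate(toy_model, people, "group", ["A", "B"]))  # → 0.5
```

The counterfactual comparison is what distinguishes this from merely observing different outcome rates across groups, which can arise without any causal dependence on the sensitive attribute.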

RC-Index: Fast diverse data retrieval
Data skew is often a cause of algorithmic bias, and the ability to retrieve balanced, diverse datasets can mitigate the underlying problem. Diversification is one common way to present representative results to users, and it is employed by many real-world systems. However, providing diverse results for general range queries (i.e., queries that return a subset of the data based on filtering conditions) efficiently and scalably remains challenging. Our work introduces a general, index-based algorithm for diversifying the results of multi-dimensional range queries over a single relation. At a high level, our algorithm transforms each range query into a set of subordinate searches and performs these searches efficiently using a novel index structure, the RC-Index.
Publications: [PVLDB 2018]
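One common formulation of result diversification is greedy max-min selection: repeatedly pick the item farthest from those already chosen. The sketch below illustrates that baseline on toy 2-D points; it ignores indexing entirely, whereas the RC-Index work is precisely about making this kind of retrieval efficient for arbitrary range queries over large data:

```python
def diverse_subset(points, k, dist):
    """Greedy max-min diversification: start from the first point and
    repeatedly add the point farthest from everything chosen so far."""
    chosen = [points[0]]
    while len(chosen) < k:
        nxt = max((p for p in points if p not in chosen),
                  key=lambda p: min(dist(p, c) for c in chosen))
        chosen.append(nxt)
    return chosen

pts = [(0, 0), (1, 0), (10, 0), (10, 10)]
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(diverse_subset(pts, 2, manhattan))  # → [(0, 0), (10, 10)]
```

Note that the near-duplicate point (1, 0) is skipped in favor of a far-apart one, which is exactly the behavior that counteracts skewed result sets.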

Data quality

Data quality has long been a focus of data management research, but our data quality challenges have only grown. Data is produced at unprecedented rates, from sources that are broad, varied, and unreliable, and through large-scale processes that introduce their own inaccuracies (e.g., structured data extraction from unstructured text). Traditional data cleaning techniques identify discrepancies and purge datasets of errors, but they treat the manifestation of a problem, not its root cause. They disregard the fact that errors are often systemic, inherent to the process that produces the data, and thus will keep occurring unless the problems are corrected at their source. Our work offers crucial insights into data quality issues: instead of repairing the errors themselves, our research focuses on diagnosing the reasons for the errors and identifying repairs in the processes that produce the data.

Data X-Ray: Diagnosing errors in data systems
Data X-Ray is a diagnostic framework for profiling errors in data and determining systemic reasons for them in internet-scale knowledge extraction pipelines. This setting is challenging due to the large scale of the data, the prevalence of errors, and the complexity of the system.
Publications: [SIGMOD 2015] [PVLDB 2015 demo]

QFix: Diagnosing errors in relational logs
Relational databases are often dynamic, and even when data is cleaned, new errors can be introduced by applications and users who interact with the data. Subsequent valid updates can obscure these errors and propagate them through the dataset, causing more discrepancies. Any discovered errors tend to be corrected superficially, on a case-by-case basis, further obscuring the true underlying cause and making detection of the remaining errors harder. QFix derives explanations and repairs for discrepancies in relational data by analyzing the effects of queries that operated on the data and identifying potential mistakes in those queries.
Publications: [SIGMOD 2017] [SIGMOD 2016 demo]
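A drastically simplified version of the underlying idea: replay the update log against an integrity check and locate the first update after which the data becomes inconsistent. QFix itself reasons about relational queries and proposes repairs; the record, log, and check below are hypothetical:

```python
def first_faulty_update(initial, updates, is_valid):
    """Replay a log of update functions over a record and return the
    index of the first update after which `is_valid` fails, or None."""
    state = dict(initial)
    for i, update in enumerate(updates):
        state = update(state)
        if not is_valid(state):
            return i
    return None

# Hypothetical log: the second update mistakenly zeroes the balance
log = [
    lambda r: {**r, "balance": r["balance"] + 100},
    lambda r: {**r, "balance": 0},                   # buggy update
    lambda r: {**r, "balance": r["balance"] + 50},
]
print(first_faulty_update({"balance": 200}, log, lambda r: r["balance"] > 0))  # → 1
```

Pointing at the faulty query, rather than patching the bad value it produced, is what lets this style of diagnosis also correct the other tuples the same mistake touched.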

Causality and explanations

Today's data is vast and often unreliable, and the systems that process it are increasingly complex. Even simple transformations through database queries obscure the origins of data and the derivation of results. The goal of my research is to promote users' trust in data and systems through support for understanding and explanations. Explanations provide opportunities for systems to interact with humans and obtain feedback, improving their operation. Explanations also allow domain experts and system developers to understand system decisions and improve system function.

Causal analysis and explanations in data management [Project page (causality)]
Our research investigates techniques that help users understand the results of their queries by analyzing the history of data transformations (provenance). Unfortunately, using the provenance to explain query results is often impractical, as provenance information can grow very large even for simple transformations and modest-size datasets. Our work refines provenance information by analyzing the causal contributions of data to a result, and develops explanation frameworks for a variety of data-driven settings.
Publications (sample): [DE Bulletin 2018] [EDBT 2017] [PVLDB 2015] [PVLDB 2014 tutorial]
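A core notion in this line of work is the counterfactual cause: an input tuple whose removal changes the query result. The minimal sketch below ignores degrees of responsibility and efficiency, which the actual research addresses, and uses a toy list and aggregate in place of a real database and query:

```python
def counterfactual_causes(rows, query):
    """Return the rows whose removal changes the query answer,
    i.e., the counterfactual causes of the result."""
    full = query(rows)
    return [r for i, r in enumerate(rows)
            if query(rows[:i] + rows[i + 1:]) != full]

data = [1, 5, 9]
print(counterfactual_causes(data, max))  # → [9]
```

Enumerating causes this way is exponentially cheaper than materializing full provenance for every result, which is the practical motivation for refining provenance through causal analysis.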

Funding sponsors