Yanlei Diao

Adjunct Professor
Department of Computer Science
University of Massachusetts Amherst

Address: Department of Computer Science Room 232
140 Governors Drive
Amherst, MA 01003-9264
Assistant: Rachel Lavery {last-name}@cs.umass.edu


Research Interests

My research interests lie in big data analytics and scalable intelligent information systems, with a focus on optimization in big data analytics, interactive data exploration, explainable anomaly detection, data streams, and uncertain data management.

Data systems Research for Exploration, Analytics, and Modeling (DREAM) Lab, co-directed with Profs. Miklau, Meliou, and Haas

Some of our recent project websites:
Cloud data analytics and data stream analytics
Interactive data exploration
GESALL: Genomic scalable analysis with low latency
SCALLA: Scalable low-latency analytics
CLARO: Uncertain data stream processing
SASE: Complex event processing over streams
STONES: Flash-based data management system
SPIRE: RFID data stream processing


Recent News

GESALL, our project on genomic data processing, obtained new NSF funding, joint with the New York Genome Center; a new publication at CIDR 2015; and a new research video, joint with the Harvard Medical School and Boston Children's Hospital.

Doctoral student Yunmeng Ban received the 2014 Microsoft Research Graduate Women's Scholarship, one of ten scholars chosen from 101 applicants!

Yanlei Diao received the 2013 CRA-W Borg Early Career Award for significant contributions to research and outreach!

Complete list of honors and awards: here

Curriculum vitae: here


Research Projects

GESALL: GEnomic Scalable Analysis with Low Latency. Next-generation sequencing has transformed genomics into a new paradigm of data-intensive computing, raising several salient challenges. First, the deluge of genomic data needs to undergo deep analysis to mine biological information, which requires a full pipeline that integrates many data processing and analysis tools. Second, deep analysis pipelines often take a long time to run, which entails a long cycle for algorithm and method development. This project aims to bring the latest big data and database technology to the genomics domain to revolutionize its data crunching power. The proposed research includes: development of a deep pipeline for genomic data analysis by assembling state-of-the-art methods; automatic parallelization of the workflow using big data technology; a principled approach to optimizing the genomic pipeline; and integration of streaming technology to reduce the latency of important results. The prototype system will be deployed in both private and public cloud environments, and fully evaluated using existing long-running pipelines and in a variety of real use cases.

SCALLA: Scalable Low-Latency Analytics. An integral part of many data-intensive applications is the need to collect and analyze enormous data sets, such as social network data, server log data, scientific data, and big bio data. Concurrently, new programming models and architectures have been developed for large-scale cluster computing, exemplified by recent MapReduce systems. In these big data systems, however, data needs to be loaded into the cluster before any queries can be run, resulting in a high delay to start query processing. Moreover, answers to a long-running query are returned only when the entire job completes, causing a long delay in returning query answers. In this project, we design, develop, and evaluate a scalable, low-latency analytics platform, called Scalla, that fundamentally transforms the existing cluster computing paradigm into an incremental parallel processing paradigm, and further extends it to near real-time analytics. We also develop a few applications in the domains of social network data analysis and big bio data analysis on the Scalla platform.

CLARO: Uncertain Data Management. The goal of this project is to design and develop a data management system that captures data uncertainty from data collection to query processing to final result generation. Such uncertain data stream processing is crucial to many real-world applications such as hazardous weather monitoring and traffic monitoring. To achieve this goal, our project takes a principled approach grounded in probability and statistical theory to support uncertainty as a first-class citizen, and efficiently integrates this approach into high-volume stream processing. In particular, we aim to capture the uncertainty of raw data streams as they are produced as well as changes in uncertainty as data propagates through various query processing operators.

SASE: Complex Event Processing over Streams. We study stream processing in the context of large-scale event-based systems that are gaining adoption in applications such as supply chain management, financial services, and network and application monitoring. These systems create high volumes of events. End applications require these events to be filtered and correlated for complex pattern detection, aggregated on different temporal and geographic scales, and transformed to new events that reach a semantic level appropriate for the applications. We address issues involved in stream-based event processing ranging from the query language to computational complexity to fast implementation. We further consider complex pattern evaluation with imprecise timestamps of events, which commonly arise in event processing in distributed systems.

STONES: Flash-based Data Management Systems. Recent advances in flash technology have enabled embedded devices, personal computers, and high-end servers to be equipped with high-capacity flash memory and its packaged devices such as solid state drives (SSDs). Flash memory and SSDs provide faster random access and more energy-efficient operations than traditional hard disks. In this project, we are designing new storage systems and query processing algorithms for large-scale data analysis and high-performance databases that employ hybrid storage of flash memory and hard disks.

SPIRE: RFID Data Stream Processing. Radio Frequency Identification (RFID) technology is gaining acceptance in an increasing number of applications for tracking and monitoring purposes. Despite its promise to provide unprecedented visibility in various domains, RFID technology presents numerous challenges, including incomplete and noisy data, lack of information about inter-object relationships, and high data volumes. In this project, we develop an RFID stream processing system that employs probabilistic inference to derive locations of unobserved objects and inter-object relationships such as containment, and further supports probabilistic query processing to derive high-level information.


Past Projects

Fast and Memory-Efficient Packet Content Scanning. Packet content scanning compares the packet payload against a set of patterns specified as regular expressions. Memory requirements using traditional methods for fast packet scanning are prohibitively high. We develop regular expression rewrite techniques to reduce memory usage, and grouping schemes to increase the regular expression matching speed without increasing memory usage. Our implementation can achieve orders-of-magnitude performance improvements over the implementations used in the Linux L7-filter and Snort system. Such efficient packet content scanning enables new technologies such as real-time worm detection, content lookup in overlay networks, fine-grained load balancing, etc.

ONYX: Internet-Scale XML Data Dissemination. We study Internet-scale data dissemination that delivers XML-encoded documents from multiple publishing sites to millions of subscribers based on the subscribers' data interests. We explore the idea of content-based routing of documents in distributed dissemination systems. We seek to enhance such data dissemination with advanced services such as stateful publish/subscribe and QoS. We investigate implementations that are able to meet demanding efficiency and scalability requirements.

YFilter: High-Volume XML Message Brokering. We design a message brokering system that provides fast, on-the-fly filtering of incoming XML messages for large numbers of simultaneous queries, and transforms the matching messages according to recipient-specific requirements. We explore the key issues including shared processing of queries for efficient and scalable filtering and leveraging the filtering solutions for customized result generation. We released YFilter 1.0, a freely available software system containing the filtering engine and the query workload generator of YFilter.

Stream-based XQuery Processing. We develop a memoization-based approach to shared processing for the full XQuery language in a stream-based environment. We implement the approach by extending the streaming XQuery processor that BEA Systems incorporates as part of their BEA WebLogic Integration 8.1 product. We demonstrate the effectiveness of the approach in typical use cases of XQuery.



PhD Students
    Boduo Li
    Liping Peng
    Haopeng Zhang
    Abhishek Roy
    Yunmeng Ban
    Wenzhao Liu

MS Students

    Thanh Tran (Twitter)
    Edward Mazur (Google, New York)
    Richard Cocci (Harvard Law School)
    Ravishankar Guruswamy Rajamony (Goldman Sachs)
    Daniel Gyllstrom (UMass)

Visiting Students
    Zhao Cao (IBM Research China)
    Yanming Nie