As the Big Data era unfolds, there is a pressing demand for unified toolkits to handle scalable data analytics. My current works present one such generic framework via formal language
specifications that brings in many disparate problems under one umbrella and promises to provide massive speed-up
gain for big data analytics.
Formal languages such as context-free grammars are used to model data formats/semantics in wide-array of applications, e.g., human behavior analysis, network data modeling and event detection, structural phenomena found in biological sequences, speech processing, program syntax analysis, identifying topics in unstructured and semi-structured text etc. While, the expressive power of formal language modeling is well-understood, the grammar based computations may have very high time and space requirements. This severely limits the applicability of this method on large-scale data. Our goal is to overcome this barrier, and make grammar based computations accessible to its large number of applications. Grammar based computations are closely related to fundamental graph problems and statistical inference algorithms. See the following publications for more details. |

The problems that fall under the category of the "first" project can mostly be solved optimally by dynamic programming albeit with a very high running time. Dynamic programming is possibly the most systematic way of developing polynomial time exact algorithms. However, high degree of polynomial running time, often quadratic, cubic or higher, makes dynamic programmings unsuitable as an algorithmic tool for handling today's massive amount of data. We need highly scalable algorithms that run in near-linear time (or sub-linear if possible), often by making a single sequential scan or in a distributed fashion. This gives rise to the very timely and important question: how can we approximate dynamic programming fast? Can we systematically forget some states of the dynamic programming table to reduce space and running time and yet get a close to optimal solution (amnesic approximation). |

Scheduling plays a significant role to save energy and thus reduce cost in modern
data centers and cloud architecture. There are many distinct features that set apart data center scheduling
from traditional scheduling literature. For example, here location of servers/machines play an important role for both scheduling
and data placements. Data centers exhibit a hierarchical structure whereby processors are arranged in racks which are arranged in stacks and so on.
Data communication cost between two processors are now highly dependent on the placement of these processors in the hierarchy. Other
important quantifiers include controling replication factor, and placement of data replicas both to achieve parallelism and fault-tolerance.
Some of my publications in this area. |

A field of applied mathematics that has greatly influenced algorithm design is probabilistic
methods. Most of my works (pick a random paper from my publication list) rely on careful use of randomization to solve
nontrivial optimizations, or design fast algorithms. My main research interest here lies on developing dependent rounding techniques, and
constructive aspects of local lemma to handle limited correlations among random variables. Local lemma is intimately connected to
Concurrency theory, and one of my current project is to understand and explore this connetion in depth.
Some more publications in this area. |

Combinatorial graph theory has wide applicability in distributed computing, in network design and optimization problems.
My research has focused on several aspects of graph algorithms, from finding patterns/answering queries over streaming graphs,
dynamic graph clustering, computing vertex covers, detecting dense subgraphs etc. Currently, I am working on several graph problems where
the underlying data (presence of edges, weights on edges) is noisy. The other focus is on developing memory-constrained graph algorithms.
Map-Reduce framework is not suitable for most graph problems. We are investigatings alternate models of multi-core algorithms for graph
problems.
Some related publications in graph algorithms. |

Today, as data driven decision making sweeps through all aspects of society, the importance of data quality is realized more than ever
. Poor quality data can have huge impact in business outcomes. It may lead to wrong decision making, misrepresentation, process
inefficiency and failure to comply with industry and government regulations. No doubt, data quality management is gaining
significant traction and veracity is noted as a major characterization of Big Data.
There is a large body of work in database research community on structured relational data quality management by extending integrity constraints to explore dependencies on data. One of our main focus here is on semi-structured data, such as XML (extensible mark-up language) where attribute values as well as structures of data provide important clues for data cleaning. With industrial collaborators, we are developing novel algorithms and tools for semi-structured data quality mining. Our other focus is on crowd-sourcing--bringing human in loop to process uncertain data. Crowd sourcing is inherently an optimization problem where we want to ask minimum number of well-designed questions to uncover the underlying truth. Human answers themselves may be noisy. We are developing framework for robust crowdsourcing that requires designing new algorithms for stochastic graph problems. Some related publications in data quality and uncertain data management |

I am involved in various interdisciplinary research efforts, with collaborations with biologists, we are investigating the effect of repeat sequence concetration and chromatin structure on gene expression variability, identifying evolving patterns with stress, aging, and as effect of drugs. Computational aspects of these problems involve developing new graph algorithms, pattern matching techniques, and statistical modeling. To know more about this research initiative send me an email firstname @ cs.umass.edu |