Active Projects

Here is a short description of some of my current research projects.
  • How Can Grammars Cure Your Big Data Woes? Connections to Fundamental Graph Problems and Scalable Inference Algorithms

    As the Big Data era unfolds, there is a pressing demand for unified toolkits for scalable data analytics. My current work presents one such generic framework, based on formal language specifications, that brings many disparate problems under one umbrella and promises massive speed-ups for big data analytics.

    Formal languages such as context-free grammars are used to model data formats and semantics in a wide array of applications, e.g., human behavior analysis, network data modeling and event detection, structural phenomena found in biological sequences, speech processing, program syntax analysis, and identifying topics in unstructured and semi-structured text. While the expressive power of formal language modeling is well understood, grammar-based computations can have very high time and space requirements. This severely limits the applicability of the method to large-scale data. Our goal is to overcome this barrier and make grammar-based computations accessible to their many applications.
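
    To make the cost concrete, here is a minimal sketch of the textbook CYK algorithm, which decides membership of a string in a context-free grammar (in Chomsky normal form) in cubic time. The function and rule names are mine, and the balanced-parentheses grammar is only an illustrative example:

```python
def cyk_parse(word, terminal_rules, binary_rules, start="S"):
    """Return True iff `word` is generated by the grammar.

    terminal_rules: dict mapping a terminal to the nonterminals A with A -> a.
    binary_rules: list of (A, B, C) triples for productions A -> B C.
    Runs in O(n^3 * |grammar|) time and O(n^2 * |grammar|) space.
    """
    n = len(word)
    if n == 0:
        return False  # ignore the empty string for simplicity
    # table[i][j] = nonterminals deriving the substring of length j+1 starting at i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = set(terminal_rules.get(ch, ()))
    for span in range(2, n + 1):            # substring length
        for i in range(n - span + 1):       # start position
            for split in range(1, span):    # split point
                for a, b, c in binary_rules:
                    if b in table[i][split - 1] and c in table[i + split][span - split - 1]:
                        table[i][span - 1].add(a)
    return start in table[0][n - 1]

# Balanced parentheses (the Dyck language) in Chomsky normal form:
# S -> L T | L R | S S,  T -> S R,  L -> "(",  R -> ")"
terminal_rules = {"(": {"L"}, ")": {"R"}}
binary_rules = [("S", "L", "T"), ("S", "L", "R"), ("S", "S", "S"), ("T", "S", "R")]

print(cyk_parse("(())()", terminal_rules, binary_rules))  # True: balanced
print(cyk_parse("(()", terminal_rules, binary_rules))     # False: unbalanced
```

    The three nested loops over span length, start position, and split point are exactly where the cubic running time comes from, and are the target of the faster algorithms discussed here.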

    Grammar-based computations are closely related to fundamental graph problems and statistical inference algorithms. See the following publications for more details.

  • Language Edit Distance & Maximum Likelihood Parsing of Stochastic Grammars: Faster Algorithms & Connection to Fundamental Graph Problems
        Barna Saha, FOCS 2015.
  • The Dyck Language Edit Distance Problem in Near-linear Time, Barna Saha, FOCS 2014.
  • On Repairing Structural Problems in Semi-structured Data, with Flip Korn, Divesh Srivastava and Shanshan Ying, VLDB, 2013.
  • Scaling Up Dynamic Programming

    The problems that fall under the first project can mostly be solved optimally by dynamic programming, albeit with a very high running time. Dynamic programming is possibly the most systematic way of developing polynomial-time exact algorithms. However, the high-degree polynomial running time, often quadratic, cubic, or higher, makes dynamic programming unsuitable as an algorithmic tool for handling today's massive amounts of data. We need highly scalable algorithms that run in near-linear time (or sub-linear if possible), often by making a single sequential scan or working in a distributed fashion. This gives rise to a very timely and important question: how can we approximate dynamic programming fast? Can we systematically forget some states of the dynamic programming table to reduce space and running time and yet obtain a close-to-optimal solution (amnesic approximation)?
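
    A minimal sketch of this kind of state-forgetting, on the toy problem of subset-sum (this is the standard trimming idea behind FPTAS constructions; the function names and numbers are illustrative, not taken from the papers here): after each DP step, any state within a (1 + δ) factor of a kept state is dropped, shrinking the table while provably staying near-optimal.

```python
def trim(states, delta):
    """Forget states: keep a sparse subset of the sorted list in which no
    kept value is within a (1 + delta) factor of the previously kept one."""
    kept = [states[0]]
    for s in states[1:]:
        if s > kept[-1] * (1 + delta):
            kept.append(s)
    return kept

def approx_subset_sum(items, target, eps):
    """Largest achievable subset sum <= target, within a (1 - eps) factor
    of the optimum, using a trimmed (polynomial-size) state list instead
    of the up-to-2^n exact one."""
    states = [0]
    delta = eps / (2 * len(items))
    for x in items:
        merged = sorted(set(states + [s + x for s in states]))
        states = [s for s in trim(merged, delta) if s <= target]
    return max(states)

# Exact optimum here is 307 (104 + 102 + 101); trimming may return a
# slightly smaller, but provably near-optimal, value.
print(approx_subset_sum([104, 102, 201, 101], 308, 0.2))
```

    The same trimming lens, applied to richer dynamic programs than subset-sum, is what the amnesic-approximation question asks about.
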
  • Energy-Efficient Resource Allocation

    Scheduling plays a significant role in saving energy, and thus reducing cost, in modern data centers and cloud architectures. Many distinct features set data center scheduling apart from the traditional scheduling literature. For example, here the location of servers/machines plays an important role for both scheduling and data placement. Data centers exhibit a hierarchical structure whereby processors are arranged in racks, which are arranged in stacks, and so on. The data communication cost between two processors is highly dependent on the placement of these processors in the hierarchy. Other important considerations include controlling the replication factor and the placement of data replicas, both to achieve parallelism and fault tolerance.
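
    As a toy illustration of this hierarchy (the three-level layout and the cost values are made up for this sketch), the communication cost between two processors can be modeled by the highest level of the tree that their traffic must cross:

```python
# Cost charged by the first level at which two processors' positions diverge:
# same rack < same stack < across stacks. Values are illustrative.
LEVEL_COST = [1, 5, 25]  # rack-local, stack-local, data-center-wide

def comm_cost(proc_a, proc_b):
    """proc_a, proc_b: tuples like (stack, rack, slot), top of the
    hierarchy first. Cost depends on where the two paths diverge."""
    if proc_a == proc_b:
        return 0
    for level, (a, b) in enumerate(zip(proc_a, proc_b)):
        if a != b:
            # diverging at level 0 (the stack) is the most expensive
            return LEVEL_COST[len(LEVEL_COST) - 1 - level]
    return LEVEL_COST[0]

print(comm_cost(("s0", "r0", 1), ("s0", "r0", 2)))  # same rack -> 1
print(comm_cost(("s0", "r0", 1), ("s0", "r1", 0)))  # same stack -> 5
print(comm_cost(("s0", "r0", 1), ("s1", "r0", 1)))  # across stacks -> 25
```

    Hierarchical graph partitioning, as in the first paper below, asks how to place data and computation so that most traffic pays only the cheap, low-level costs.
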

    Some of my publications in this area.

  • Hierarchical Graph Partitioning, with Mohammadtaghi Hajiaghayi, Theodore Johnson and Reza Khani. SPAA 2014.
  • A Constant Factor Approximation Algorithm for Fault Tolerant k-Median Problem, with Mohammadtaghi Hajiaghayi, Wei Hu, Jian Li and Shi Li. SODA, 2014.
  • Distributed Data Placement to Minimize Communication Costs via Graph Partitioning, with Lukasz Golab, Marios Hadjieleftheriou, and Howard Karloff. SSDBM, 2014.
  • New Approximation Results for Resource Replication Problems, with Samir Khuller and Kanthi Sarpatwar. APPROX, 2012.
  • The Matroid Median Problem, with Ravishankar Krishnaswamy, Vishwanath Nagarajan, Amit Kumar and Yogish Sabharwal. SODA, 2011.
  • Energy Efficient Scheduling via Partial Shutdown, with Samir Khuller and Jian Li. SODA, 2010.
  • Randomization in Computation

    A field of applied mathematics that has greatly influenced algorithm design is the probabilistic method. Most of my works (pick a random paper from my publication list) rely on careful use of randomization to solve nontrivial optimization problems or to design fast algorithms. My main research interest here lies in developing dependent rounding techniques and constructive aspects of the local lemma to handle limited correlations among random variables. The local lemma is intimately connected to concurrency theory, and one of my current projects is to understand and explore this connection in depth.
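
    The constructive side of the local lemma can be sketched in a few lines: the Moser-Tardos algorithm starts from a uniformly random assignment and keeps resampling the variables of a violated constraint until none remains. The tiny SAT instance below is made up for illustration; when every variable appears in few clauses, the local lemma condition holds and the walk terminates quickly in expectation:

```python
import random

def moser_tardos_sat(clauses, n_vars, seed=0):
    """clauses: list of clauses; a clause is a list of (var_index, wanted_bool)
    literals. Returns a satisfying assignment by repeatedly resampling all
    variables of some violated clause -- the LLL resampling walk."""
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars)]

    def violated(clause):
        return all(assign[v] != want for v, want in clause)

    while True:
        bad = [c for c in clauses if violated(c)]
        if not bad:
            return assign
        # resample every variable of one violated clause uniformly at random
        for v, _ in bad[0]:
            assign[v] = rng.random() < 0.5

# (x0 or x1), (not x1 or x2), (not x0 or not x2) -- an illustrative instance
clauses = [[(0, True), (1, True)], [(1, False), (2, True)], [(0, False), (2, False)]]
print(moser_tardos_sat(clauses, 3))  # a satisfying assignment
```
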

    Some more publications in this area.

  • New Constructive Aspects of the Lovasz Local Lemma, with Bernhard Haeupler and Aravind Srinivasan. FOCS 2010.
  • A New Approximation Technique for Resource-Allocation Problems, with Aravind Srinivasan. ITCS 2010.
  • Set Cover revisited: Hypergraph Cover with Hard Capacities, with Samir Khuller. ICALP (Track A), 2012.
  • Graph Algorithms

    Combinatorial graph theory has wide applicability in distributed computing, network design, and optimization problems. My research has focused on several aspects of graph algorithms, from finding patterns and answering queries over streaming graphs to dynamic graph clustering, computing vertex covers, and detecting dense subgraphs. Currently, I am working on several graph problems where the underlying data (the presence of edges, the weights on edges) is noisy. The other focus is on developing memory-constrained graph algorithms. The Map-Reduce framework is not suitable for most graph problems, so we are investigating alternative models of multi-core algorithms for graph problems.
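
    As a concrete example, the classic greedy peeling algorithm, a standard baseline for dense-subgraph detection (a well-known technique, not an algorithm from the papers below), repeatedly removes a minimum-degree vertex and remembers the densest intermediate subgraph, giving a 2-approximation for average-degree density:

```python
from collections import defaultdict

def densest_subgraph(edges):
    """Greedy peeling: return (density, vertex set) maximizing
    |E(S)| / |S| over the peeling order -- a 2-approximation."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive = set(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_density, best_set = m / len(alive), set(alive)
    while len(alive) > 1:
        v = min(alive, key=lambda x: len(adj[x]))  # peel a min-degree vertex
        m -= len(adj[v])
        for u in adj[v]:
            adj[u].discard(v)
        adj.pop(v)
        alive.discard(v)
        density = m / len(alive)
        if density > best_density:
            best_density, best_set = density, set(alive)
    return best_density, best_set

# A 4-clique {a, b, c, d} with a pendant vertex e: peeling e first exposes
# the clique, whose density 6/4 = 1.5 beats the full graph's 7/5.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("a", "e")]
print(densest_subgraph(edges))  # density 1.5 on the clique
```
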

    Some related publications in graph algorithms.

  • Set Cover revisited: Hypergraph Cover with Hard Capacities, with Samir Khuller. ICALP, 2012.
  • Dense Subgraphs with Restrictions and Applications to Gene Annotation Graphs, with Allison Hoch, Samir Khuller, Louiqa Raschid and Xiao-Ning Zhang. RECOMB, 2010.
  • On Finding Dense Subgraphs, with Samir Khuller. ICALP, 2009.
  • On Maximum Coverage in the Streaming Model & Application to Multi-topic Blog-Watch, with Lise Getoor. SDM, 2009.
  • On Estimating Path-Aggregates over Streaming Graphs, with Sumit Ganguly. ISAAC, 2006.
  • Data Quality

    Today, as data-driven decision making sweeps through all aspects of society, the importance of data quality is realized more than ever. Poor-quality data can have a huge impact on business outcomes. It may lead to wrong decisions, misrepresentation, process inefficiency, and failure to comply with industry and government regulations. No doubt, data quality management is gaining significant traction, and veracity is noted as a major characteristic of Big Data.

    There is a large body of work in the database research community on structured relational data quality management, extending integrity constraints to explore dependencies in data. One of our main focuses here is semi-structured data, such as XML (extensible markup language), where attribute values as well as the structure of the data provide important clues for data cleaning. With industrial collaborators, we are developing novel algorithms and tools for semi-structured data quality mining.

    Our other focus is on crowdsourcing: bringing humans into the loop to process uncertain data. Crowdsourcing is inherently an optimization problem where we want to ask the minimum number of well-designed questions to uncover the underlying truth. The human answers themselves may be noisy. We are developing a framework for robust crowdsourcing, which requires designing new algorithms for stochastic graph problems.
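
    A simplified sketch of the oracle model for entity resolution (the code and its perfect-oracle assumption are illustrative only; the actual algorithms in the publications below handle noisy answers and query ordering): positive "same entity?" answers are propagated by transitivity through a union-find structure, so many pairs never need to be asked at all.

```python
class UnionFind:
    """Standard union-find with path halving, used to propagate matches."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def resolve(n, oracle):
    """Cluster records 0..n-1 by pairwise oracle queries, skipping any
    pair whose answer already follows by transitivity."""
    uf = UnionFind(n)
    queries = 0
    for i in range(n):
        for j in range(i + 1, n):
            if uf.find(i) == uf.find(j):
                continue  # transitively known match; no question asked
            queries += 1
            if oracle(i, j):
                uf.union(i, j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(i)
    return list(clusters.values()), queries

# Toy ground truth: records 0, 1, 2 are one entity; 3, 4 another.
truth = [0, 0, 0, 1, 1]
clusters, q = resolve(5, lambda i, j: truth[i] == truth[j])
print(clusters, q)  # pair (1, 2) follows by transitivity, saving a query
```
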

    Some related publications in data quality and uncertain data management.

  • Tutorial: Data Quality: The Other Face of Big Data, with Divesh Srivastava. ICDE, 2014.
  • Online Entity Resolution with an Oracle, with Donatella Firmani and Divesh Srivastava. VLDB, 2016.
  • TreeScope: Finding Structural Anomalies in Semi-Structured Data (demonstration paper), with Flip Korn, Divesh Srivastava, Shanshan Ying. VLDB, 2015.
  • The Dyck Language Edit Distance Problem in Near-linear Time, Barna Saha, FOCS, 2014.
  • On Repairing Structural Problems in Semi-structured Data, with Flip Korn, Divesh Srivastava and Shanshan Ying. VLDB, 2013.
  • Less is More, Selecting Sources Wisely for Integration, with Xin Luna Dong and Divesh Srivastava. VLDB, 2013.
  • Discovering Conservation Rules, with Lukasz Golab, Howard Karloff, Flip Korn and Divesh Srivastava. ICDE, 2012.
    Among the Best Papers of ICDE'12
  • Schema Covering: A Step Towards Enabling Reusability in Information Integration, with Ioana Stanoi and Ken Clarkson. ICDE, 2010.
  • A Unified Approach to Ranking in Probabilistic Databases, with Jian Li and Amol Deshpande. VLDB, 2009. BEST PAPER AWARD
  • Interdisciplinary Research

    I am involved in various interdisciplinary research efforts. In collaboration with biologists, we are investigating the effect of repeat-sequence concentration and chromatin structure on gene expression variability, and identifying patterns that evolve with stress, aging, and the effects of drugs. The computational aspects of these problems involve developing new graph algorithms, pattern matching techniques, and statistical modeling. To know more about this research initiative, send me an email: firstname @