Arya Mazumdar arya@cs.umass.edu

W 12:20 - 13:10 CS 150

The theory seminar is a weekly meeting in which topics of interest in the theory of computation — broadly construed — are presented. This is sometimes new research by visitors or by local people. It is sometimes work in progress, and it is sometimes recent or classic material of others that some of us present in order to learn and share.

The goal of all talks in this seminar is to encourage understanding and participation. We would like as many attendees as possible to get a sense of the relevant ideas being discussed, including their context and significance.

**Please email me if you would like to give a talk, suggest or invite a speaker, or propose a paper or topic that you would like to see covered.**

This is a one-credit seminar which may be taken repeatedly for credit.

| Meeting | Date | Topics | Speaker |
|---|---|---|---|
| 1 | Sep 7 | Organizational Meeting | NA |
| 2 | Sep 14 | Fibonacci Series, Pell's equation and polynomials | Soumyabrata Pal (UMass) |
| 3 | Sep 21 | Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors | Ilya Razenshteyn (MIT) |
| 4 | Sep 28 | Clustering with an Oracle | Arya Mazumdar (UMass) |
| 4a | Sep 30 | On the Robust Hardness of Grobner Basis Computation | Gwen Spencer (Smith College) |
| 5 | Oct 5 | Language Edit Distance, (min,+)-Matrix Multiplication & Beyond | Barna Saha (UMass) |
| 6 | Oct 12 | Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift | Ted Leone (UMass) |
| 7a | Oct 17 | Distinguished Speaker: Progress in Error Correction | Venkat Guruswami (CMU) |
| 7 | Oct 19 | Learning from Big Ranking Data | Lirong Xia (RPI) |
| 8 | Oct 26 | A Descriptive Approach to Graph Isomorphism | Sophie Koffler (UMass) |
| 9 | Nov 2 | Dusting for Fingerprints in Private Data | Jonathan Ullman (Northeastern U.) |
| 9a | Nov 4 | Distinguished Speaker: Randomness | Avi Wigderson (IAS) |
| 10 | Nov 9 | Some Simple Group Testing Algorithms | Pardis Malekzadeh (UMass) |
| 11 | Nov 16 | Storage Capacity in Special Families of Graphs | Sofya Vorotnikova (UMass) |
| 11a | Nov 16 | Distinguished Speaker: Codes for Data Storage | Robert Calderbank (Duke) |
| 12 | Nov 30 | Time series prediction using support vector machine | Qianxin Xu (UMass) |
| 13 | Dec 7 | Rank aggregation and Minimum Feedback Arc set | Raj Kumar Maity (UMass) |
| 14 | Dec 14 | PIR Codes | Anil Saini (UMass) |

Sep 14 **Soumyabrata Pal** Fibonacci Series, Pell's equation and polynomials

**Abstract:** One of the most interesting results of the last century was the proof completed by Matijasevich that computably enumerable sets are precisely the diophantine sets (MRDP Theorem), thus settling, based on previously developed machinery, Hilbert's question whether there exists a general algorithm for checking the solvability in integers of any diophantine equation. Diophantine representations of the set of prime numbers were also exhibited, involving polynomials of high degree and many variables. While it is easy to show that there does not exist a univariate polynomial that evaluates to only prime numbers (let alone all the prime numbers; the two-variable case for primes is a well-known open problem), so far there are no results on techniques that help prove the non-existence of multivariate polynomials of certain low degrees for interesting sets (such as the primes), or for showing lower bounds on the degree-variable trade-off for such polynomials. In this paper we describe techniques to prove the non-existence of polynomials in two variables for some simple generalizations of the Fibonacci sequence, and we believe similar techniques exist for the primes.
In this paper we mainly show the following results: (1) using one of the many techniques known for solving Pell's equation, namely the solution in an extended number system, we prove the existence of, and explicitly find, the polynomials for recurrences of the form e(n)=ae(n-1)+e(n-2), with starting values of 0 and 1 in particular and for arbitrary starting values in general, in the process defining a concept of fundamental starting numbers; (2) we prove a few identities that seem to be quite interesting and useful; (3) we use these identities in a novel way to generate systems of equations of certain rank deficiency, using which we disprove for the first time the existence of any polynomial in 2 variables for the generalized recurrence of the form e(n)=ae(n-1)+be(n-2) (even though these are obviously computably enumerable and hence diophantine); (4) using a known Cassini modification, we prove a similar non-existence for three variables. Our work raises questions about what techniques are good for establishing non-existence or for proving lower bounds on the degree and on the number of variables for the diophantine representation of these as well as other interesting sets such as primes.
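To give a feel for the kind of identity involved, here is a minimal Python sketch of a standard Cassini-type fact: consecutive terms x = e(n), y = e(n+1) of the recurrence e(n)=ae(n-1)+e(n-2) with starting values 0 and 1 satisfy y^2 - a·x·y - x^2 = (-1)^n. This is a well-known identity offered as illustration, not necessarily the construction used in the paper; the function names are hypothetical.

```python
def recurrence_terms(a, count):
    """Return the first `count` terms of e(n) = a*e(n-1) + e(n-2), e(0)=0, e(1)=1."""
    terms = [0, 1]
    while len(terms) < count:
        terms.append(a * terms[-1] + terms[-2])
    return terms[:count]


def pell_like_form(a, x, y):
    """Evaluate Q(x, y) = y^2 - a*x*y - x^2 on a pair of integers."""
    return y * y - a * x * y - x * x


def check_invariant(a, count=20):
    """Check that Q(e(n), e(n+1)) equals (-1)^n for the first `count` terms."""
    terms = recurrence_terms(a, count)
    return all(
        pell_like_form(a, terms[n], terms[n + 1]) == (-1) ** n
        for n in range(count - 1)
    )
```

With a = 1 this is the classical Fibonacci case; squaring the form gives a two-variable polynomial equation satisfied by every pair of consecutive terms.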

**Bio:** Soumyabrata Pal graduated from IIT Kharagpur, India, with a B.Tech. and is now a PhD advisee of Arya Mazumdar in the College of Information and Computer Sciences (CICS), UMass.

Sep 21 **Ilya Razenshteyn** Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors

**Abstract:** I will show a new data structure for the Approximate Nearest Neighbor problem for Euclidean and Hamming distances, which has the following benefits:

- It achieves a smooth time-space trade-off, with two extremes being **near-linear** space and **sub-polynomial** query time.
- It unifies, simplifies, and improves upon all previous data structures for the problem.
- It is optimal in an appropriate restricted model.

The data structure can be seen as a combination of Spherical Locality-Sensitive Filtering and data-dependent Locality-Sensitive Hashing.

Joint work with Alexandr Andoni (Columbia), Thijs Laarhoven (IBM Research Zurich) and Erik Waingarten (Columbia). The preprint is available here.

**Bio:** Ilya Razenshteyn is a 5th year graduate student at MIT CSAIL advised by Piotr Indyk. He holds a master's degree in mathematics from Moscow State University. Ilya's research interests include similarity search, sketching, metric embeddings, high-dimensional geometry, streaming algorithms, compressed sensing, and combinatorial optimization.

Sep 28 **Arya Mazumdar** Clustering with an Oracle

**Abstract** Given a set V of n elements, consider the simple task of clustering them into k clusters, where k is unknown. We are allowed to make pairwise queries. Given elements u and v in V, a query asks whether u and v belong to the same cluster and returns a binary answer assuming a true underlying clustering. The goal is to minimize the number of such queries needed to correctly reconstruct the clusters. When the answer to each query is correct, a simple lower and upper bound of Theta(nk) on query complexity is easy to derive. Our major contribution is to show how even mild side information in the form of a similarity matrix leads to a great reduction in query complexity, to O(k^2). This remains true even when the answer to each query can be erroneous with a certain probability and ‘resampling’ is not allowed. Note that this bound can be significantly sublinear in n depending on the value of k. We also develop parallel versions of our algorithms which give near-optimal bounds on the number of adaptive rounds required to match the query complexity.

To show our lower bounds we introduce new general information-theoretic methods; we also use information-theoretic inequalities, in a completely novel way, to design efficient algorithms for clustering with near-optimal complexity. We believe our techniques for both the lower and upper bounds are of general interest, and will find many applications in theoretical computer science and machine learning.

This talk is based on joint work with Barna Saha.
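The simple O(nk) upper bound for the error-free case mentioned in the abstract can be sketched as follows: compare each new element against one representative of every cluster found so far. This is a minimal Python illustration under the assumption of a perfect oracle; it does not use side information and is not the O(k^2) algorithm from the talk. The names `cluster_with_oracle` and `same_cluster` are hypothetical.

```python
def cluster_with_oracle(elements, same_cluster):
    """Cluster `elements` using a perfect pairwise oracle.

    `same_cluster(u, v)` must return True exactly when u and v belong to
    the same underlying cluster. Returns (clusters, number_of_queries).
    """
    clusters = []
    queries = 0
    for u in elements:
        for cluster in clusters:
            queries += 1
            # One query against the cluster's first member is enough,
            # because the oracle is assumed to be error-free.
            if same_cluster(cluster[0], u):
                cluster.append(u)
                break
        else:
            # No existing cluster matched, so u starts a new cluster.
            clusters.append([u])
    return clusters, queries
```

Each element triggers at most k queries, so the total is at most nk.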

**Bio** Arya Mazumdar is an assistant professor in CICS.

Oct 5 **Barna Saha** Language Edit Distance, (min,+)-Matrix Multiplication & Beyond

**Abstract** The language edit distance is a significant generalization of two basic problems in computer science: parsing and string edit distance computation. Given any context free grammar, it computes the minimum number of insertions, deletions and substitutions required to convert a given input string into a valid member of the language. In 1972, Aho and Peterson gave a dynamic programming algorithm that solves this problem in time cubic in the string length. Despite its vast number of applications, in forty years there has been no improvement over this running time.

Computing the (min,+)-product of two n by n matrices in truly subcubic time is an outstanding open question, as it is equivalent to the famous All-Pairs-Shortest-Paths problem. Even when matrices have entries bounded in [1, n], obtaining a truly subcubic (min,+)-product algorithm would be a major breakthrough in computer science.
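For reference, the (min,+)-product C of A and B has entries C[i][j] = min over k of A[i][k] + B[k][j]. Below is a minimal Python sketch of the straightforward cubic-time computation, which is the baseline that a truly subcubic algorithm would have to beat and not an algorithm from the talk. The function name is illustrative.

```python
def min_plus_product(A, B):
    """Return the (min,+)-product of two n-by-n matrices given as lists of lists."""
    n = len(A)
    C = [[float("inf")] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Replace the usual sum of products with a minimum of sums.
            C[i][j] = min(A[i][k] + B[k][j] for k in range(n))
    return C
```

When A is a weighted adjacency matrix with zeros on the diagonal, `min_plus_product(A, A)` gives shortest-path distances using at most two edges, which is the link to All-Pairs-Shortest-Paths.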

In this presentation, I will explore the connection between these two problems, which led us to develop the first truly subcubic algorithms for the following problems: (1) language edit distance, (2) RNA folding, a basic computational biology problem and a special case of language edit distance computation, (3) stochastic grammar parsing, fundamental to natural language processing, and (4) (min,+)-product of integer matrices with entries bounded in n^(3-ω-c), where c > 0 is any constant and ω is the exponent of fast matrix multiplication, widely believed to be 2. Time permitting, we will also discuss developing highly efficient linear time approximation algorithms for language edit distance for important subclasses of context free grammars.

**Bio** Barna Saha is an assistant professor in CICS.

Oct 19 **Lirong Xia** Learning from Big Ranking Data

**Abstract** Decision-making with ranking data is ubiquitous in our life: voters rank candidates in elections, search engines rank websites based on keywords, e-commerce websites recommend items based on users’ information and behavior. The fundamental challenge is: How can we make better decisions by learning from big ranking data?

My research tackles this multi-disciplinary challenge by taking a unified approach of statistics, machine learning, and economics. In this talk I will focus on learning aspects. I will report our recent theoretical and algorithmic progress in efficient learning of random utility models and their mixtures, which are among the most widely applied statistical models for ranking data.

**Bio** Lirong Xia is an assistant professor in the Department of Computer Science at Rensselaer Polytechnic Institute (RPI). Prior to joining RPI in 2013, he was a CRCS fellow and NSF CI Fellow at the Center for Research on Computation and Society at Harvard University. He received his PhD in Computer Science and MA in Economics from Duke University. His research focuses on the intersection of computer science and microeconomics. He is an associate editor of Mathematical Social Sciences and is on the editorial board of Journal of Artificial Intelligence Research. He is the recipient of an NSF CAREER award, a Simons-Berkeley Research Fellowship, and was named as one of “AI's 10 to watch” by IEEE Intelligent Systems.

Oct 26 **Sophie Koffler** A Descriptive Approach to Graph Isomorphism

**Abstract** The complexity of Graph Isomorphism remains unknown. It is in NP, but not known to be in P and unlikely to be NP-complete. Over the course of two talks I will present the main result from “An Optimal Lower Bound on the Number of Variables for Graph Identification” by Cai, Fürer and Immerman, in which they show that Omega(n) variables are needed to identify graphs with n vertices, disproving a conjecture that First Order Logic with Fixed Point and Counting (FPC) captures the PTIME properties of (unordered) graphs. C^k is FO with counting and at most k variables. In this talk I will show that testing C^k equivalence is equivalent to the (k-1)-dimensional Weisfeiler-Lehman method. I will demonstrate this using my implementation of the 1-dim and 2-dim variants of the W-L method.
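As a rough illustration of the 1-dim W-L method (color refinement), here is a minimal Python sketch. It is not the speaker's implementation. Graphs are assumed to be given as dictionaries mapping each vertex to a list of its neighbors, and the function names are hypothetical. The two graphs are refined together as a disjoint union so that their colors are directly comparable.

```python
from collections import Counter


def refine_colors(adjacency):
    """Run color refinement until the partition of vertices stops splitting."""
    colors = {v: 0 for v in adjacency}
    for _ in range(len(adjacency)):
        # A vertex's signature is its own color plus the multiset of
        # its neighbors' colors.
        signatures = {
            v: (colors[v], tuple(sorted(colors[u] for u in adjacency[v])))
            for v in adjacency
        }
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        if len(palette) == len(set(colors.values())):
            break  # no color class was split, so the coloring is stable
        colors = {v: palette[signatures[v]] for v in adjacency}
    return colors


def wl_distinguishes(g1, g2):
    """Return True if 1-dim W-L assigns the two graphs different color histograms."""
    union = {(0, v): [(0, u) for u in nbrs] for v, nbrs in g1.items()}
    union.update({(1, v): [(1, u) for u in nbrs] for v, nbrs in g2.items()})
    colors = refine_colors(union)
    hist1 = Counter(c for (side, _), c in colors.items() if side == 0)
    hist2 = Counter(c for (side, _), c in colors.items() if side == 1)
    return hist1 != hist2
```

A `True` result certifies that the graphs are non-isomorphic; a `False` result is inconclusive. For example, a 6-cycle and two disjoint triangles are both 2-regular and are not separated by the 1-dim method.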

Nov 2 **Jonathan Ullman** Dusting for Fingerprints in Private Data

**Abstract** We describe a powerful new family of attacks that recover sensitive information about individuals using only simple summary statistics computed on a dataset. Notably our attacks succeed under minimal assumptions on the distribution of the data, even if the attacker has very limited information about this distribution, and even if the summary statistics are significantly distorted. Our attacks build on and generalize the method of fingerprinting codes for proving lower bounds in differential privacy, and also extend the practical attacks on genomic datasets demonstrated by Homer et al. Surprisingly, the amount of noise that our attacks can tolerate is nearly matched by the amount of noise required to achieve differential privacy, meaning that the robust privacy guarantees of differential privacy come at almost no cost in our model.

Based on joint work with Cynthia Dwork, Adam Smith, Thomas Steinke, and Salil Vadhan.

**Bio** Jon Ullman is an assistant professor in the College of Computer and Information Sciences at Northeastern University. His research addresses questions like “when and how can we analyze sensitive datasets without compromising privacy” and “how can we prevent false discovery in the empirical sciences” using tools from cryptography, machine learning, algorithms, and game theory. Prior to joining Northeastern, he completed his PhD at Harvard University, and was in the inaugural class of junior fellows in the Simons Society of Fellows.

Nov 9 **Pardis Malekzadeh** Some Simple Group Testing Algorithms

**Abstract** The group testing problem consists of identifying a sparse defective subset of a set of items by performing a number of tests. In each test a group of items is examined to determine whether there are any defective items among them. The goal is to detect the defective items with the minimum number of tests and as fast as possible.

In this talk, I will present the paper “SAFFRON: A Fast, Efficient, and Robust Framework for Group Testing based on Sparse-Graph Codes” by Lee et al. I will describe SAFFRON, a group testing paradigm that recovers a close-to-one fraction of defective items with high probability in optimal computational complexity.
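The test model described above can be sketched in a few lines of Python. The decoder shown is the simple elimination rule (often called COMP), in which every item that appears in a negative test is cleared and the remaining items are declared defective. It is a baseline for illustration only and is not the SAFFRON scheme; the function names and the pool design in the usage note are assumptions.

```python
def run_tests(pools, defectives):
    """A test is positive exactly when its pool contains a defective item."""
    return [any(item in defectives for item in pool) for pool in pools]


def comp_decode(num_items, pools, outcomes):
    """Clear every item seen in a negative test; return the remaining candidates."""
    candidates = set(range(num_items))
    for pool, positive in zip(pools, outcomes):
        if not positive:
            candidates -= set(pool)
    return candidates
```

For instance, with 8 items and one pool per bit value of the item index (6 pools in total), a single defective item is identified exactly. The elimination rule never misses a defective item when tests are noiseless, but it may report false positives.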

Nov 16 **Sofya Vorotnikova** Storage Capacity in Special Families of Graphs

**Abstract** Motivated by applications in distributed storage, the storage capacity of a graph was recently defined to be the maximum amount of information that can be stored across the vertices of a graph such that the information at any vertex can be recovered from the information stored at the neighboring vertices. While it was known that storage capacity is upper bounded by minimum vertex cover and lower bounded by the size of maximum matching, we show that better bounds exist for some special families of graphs. In particular, we show an algorithm computing a 3/2-approximation of storage capacity in planar graphs and a 4/3-approximation in triangle-free planar graphs. We then develop a general method of “gadget covering” to upper bound the storage capacity in terms of the average of a set of vertex covers. We first illustrate this approach by finding the exact storage capacity of some simple graphs and then use it to show a bound on any graph that admits a specific type of vertex partition. With this we prove exact bounds on a family of Cartesian product graphs.

Nov 30 **Qianxin Xu** Time series prediction using support vector machine

**Abstract** An introduction to time series prediction and application of support vector machines in this domain will be provided.

Dec 7 **Raj Kumar Maity** Rank aggregation and Minimum Feedback Arc set

**Abstract** A presentation based on the paper “Aggregating Inconsistent Information: Ranking and Clustering” by Ailon et al.

Dec 14 **Anil Saini** PIR with Low Storage Overhead: Coding instead of Replication

**Abstract** Private information retrieval (PIR) protocols allow a user to retrieve a data item from a database without revealing any information about the identity of the item being retrieved. This talk will cover a general introduction to PIR protocols and how they can be coded so as to reduce their storage overhead. We will focus on information-theoretic PIR only. The asymptotic behavior of these codes and recent work on lower bounds on the redundancy of PIR codes will also be discussed.
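As background, the classic replication-based two-server scheme for a database of bits can be sketched as follows. It assumes two non-colluding servers that each hold a full copy of the database, which is the storage overhead that coded schemes aim to reduce. This is a minimal Python illustration of that baseline and not a PIR code from the talk; the function names are hypothetical.

```python
import secrets


def make_queries(db_size, index):
    """Return two index sets that differ only in `index`.

    Each set on its own is a uniformly random subset of the positions,
    so neither server learns anything about `index` from its query.
    """
    first = {j for j in range(db_size) if secrets.randbits(1)}
    second = first ^ {index}  # symmetric difference toggles membership of `index`
    return first, second


def answer(database, query):
    """Server side: XOR together the database bits at the queried positions."""
    bit = 0
    for j in query:
        bit ^= database[j]
    return bit


def retrieve(database, index):
    """User side: the XOR of the two answers equals database[index]."""
    q1, q2 = make_queries(len(database), index)
    return answer(database, q1) ^ answer(database, q2)
```

Here both calls to `answer` read the same list for simplicity; in a real deployment each query would go to a different server.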