Cornell Census-NSF Research Node (NCRN)

Project Overview

The Cornell node of the NSF-Census Research Network (NCRN) was funded by the National Science Foundation to develop infrastructure for using administrative data in social science research. The project focused on:

Privacy-preserving methods for administrative data
Training programs for researchers using confidential data
Documentation standards for statistical products
Synthetic data methods for broader access

Funding

National Science Foundation
- Award Number: 1131848
- Period: September 19, 2011 - September 12, 2016
- Amount: $3,560,887
- Role: Principal Investigator (with John M Abowd, William C Block, Ping Li)
National Science Foundation (partial)
- Award Number: 1012593
- Period: July 14, 2010 - June 27, 2016
- Amount: $1,326,660.00
- Role: Co-Principal Investigator (with Johannes E Gehrke, John M Abowd)

Team

Lars Vilhuber - Principal Investigator (2014-2018)
John M. Abowd - Former Principal Investigator (2011-2014)
William Block - Co-Principal Investigator
Ping Li - Co-Principal Investigator

Repositories

The project produced multiple open-source repositories and tools. See

NCRN Cornell GitHub Organization

Publications

Publications by grant

All publications funded by grant SES-1131848:

An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices

John M. Abowd and Ian M. Schmutte

American Economic Review, forthcoming
Sorting Between and Within Industries: A Testable Model of Assortative Matching

John M. Abowd, Francis Kramarz, Sebastien Perez-Duarte, and 1 more author

Annals of Economics and Statistics, 2018

Abs DOI URL

We test Shimer’s (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting-more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated.
Earnings Inequality and Mobility Trends in the United States: Nationally Representative Estimates from Longitudinally Linked Employer-Employee Data

John M. Abowd, Kevin L. Mckinney, and Nellie Zhao

Journal of Labor Economics, 2018

Abs DOI URL

Using earnings data from the U.S. Census Bureau, this paper analyzes the role of the employer in explaining the rise in earnings inequality in the United States. We first establish a consistent frame of analysis appropriate for administrative data used to study earnings inequality. We show that the trends in earnings inequality in the administrative data from the Longitudinal Employer-Household Dynamics Program are inconsistent with other data sources when we do not correct for the presence of misused SSNs. After this correction to the worker frame, we analyze how the earnings distribution has changed in the last decade. We present a decomposition of the year-to-year changes in the earnings distribution from 2004-2013. Even when simplifying these flows to movements between the bottom 20%, the middle 60% and the top 20% of the earnings distribution, about 20.5 million workers undergo a transition each year. Another 19.9 million move between employment and nonemployment. To understand the role of the firm in these transitions, we estimate a model for log earnings with additive fixed worker and firm effects using all jobs held by eligible workers from 2004-2013. We construct a composite log earnings firm component across all jobs for a worker in a given year and a non-firm component. We also construct a skill-type index. We show that, while the difference between working at a low- or middle-paying firm are relatively small, the gains from working at a top-paying firm are large. Specifically, the benefits of working for a high-paying firm are not only realized today, through higher earnings paid to the worker, but also persist through an increase in the probability of upward mobility. High-paying firms facilitate moving workers to the top of the earnings distribution and keeping them there.
An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices

John M. Abowd and Ian M. Schmutte

Aug 2018

Abs DOI URL

Statistical agencies face a dual mandate to publish accurate statistics while protecting respondent privacy. Increasing privacy protection requires decreased accuracy. Recognizing this as a resource allocation problem, we propose an economic solution: operate where the marginal cost of increasing privacy equals the marginal benefit. Our model of production, from computer science, assumes data are published using an efficient differentially private algorithm. Optimal choice weighs the demand for accurate statistics against the demand for privacy. Examples from U.S. statistical programs show how our framework can guide decision-making. Further progress requires a better understanding of willingness-to-pay for privacy and statistical accuracy.
An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices

John M. Abowd and Ian M. Schmutte

2018

URL
Codebook for the SIPP Synthetic Beta v7 [Online]

Lori B. Reeder, Jordan C. Stanley, and Lars Vilhuber

2018

URL
Codebook for the SIPP Synthetic Beta 7.0 (PDF version)

Lori B. Reeder, Jordan C. Stanley, and Lars Vilhuber

Nov 2018

DOI URL
Codebook for the SIPP Synthetic Beta 7.0 (DDI-C and PDF)

Lori B. Reeder, Jordan C. Stanley, and Lars Vilhuber

Nov 2018

DOI URL
How Will Statistical Agencies Operate When All Data Are Private?

John M. Abowd

Journal of Privacy and Confidentiality, 2017

Abs DOI URL

The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency?s firewall than inside it-compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations-blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.
Sorting Between and Within Industries: A Testable Model of Assortative Matching

John M. Abowd, Francis Kramarz, Sebastien Perez-Duarte, and 1 more author

2017

Abs URL

We test Shimer’s (2005) theory of the sorting of workers between and within industrial sectors based on directed search with coordination frictions, deliberately maintaining its static general equilibrium framework. We fit the model to sector-specific wage, vacancy and output data, including publicly-available statistics that characterize the distribution of worker and employer wage heterogeneity across sectors. Our empirical method is general and can be applied to a broad class of assignment models. The results indicate that industries are the loci of sorting–more productive workers are employed in more productive industries. The evidence confirms that strong assortative matching can be present even when worker and employer components of wage heterogeneity are weakly correlated.
Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods

John M. Abowd and Ian M. Schmutte

04/2017 2017

Abs URL

We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently under-supplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.
Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods

John M. Abowd and Ian M. Schmutte

Jan 2017

Abs DOI URL

We consider the problem of determining the optimal accuracy of public statistics when increased accuracy requires a loss of privacy. To formalize this allocation problem, we use tools from statistics and computer science to model the publication technology used by a public statistical agency. We derive the demand for accurate statistics from first principles to generate interdependent preferences that account for the public-good nature of both data accuracy and privacy loss. We first show data accuracy is inefficiently undersupplied by a private provider. Solving the appropriate social planner’s problem produces an implementable publication strategy. We implement the socially optimal publication plan for statistics on income and health status using data from the American Community Survey, National Health Interview Survey, Federal Statistical System Public Opinion Survey and Cornell National Social Survey. Our analysis indicates that welfare losses from providing too much privacy protection and, therefore, too little accuracy can be substantial.
Noise infusion as a confidentiality protection measure for graph-based statistics

John M. Abowd and Kevin L. McKinney

Statistical Journal of the IAOS, Feb 2016

Abs DOI URL

We use the bipartite graph representation of longitudinally linked employer-employee data, and the associated projections onto the employer and employee nodes, respectively, to characterize the set of potential statistical summaries that the trusted custodian might produce. We consider noise infusion as the primary confidentiality protection method. We show that a relatively straightforward extension of the dynamic noise-infusion method used in the U.S. Census Bureau’s Quarterly Workforce Indicators can be adapted to provide the same confidentiality guarantees for the graph-based statistics: all inputs have been modified by a minimum percentage deviation (i.e., no actual respondent data are used) and, as the number of entities contributing to a particular statistic increases, the accuracy of that statistic approaches the unprotected value. Our method also ensures that the protected statistics will be identical in all releases based on the same inputs.
Modeling Endogenous Mobility in Wage Determination

John M. Abowd, Kevin L. McKinney, and Ian M. Schmutte

May 2016

Abs URL

We evaluate the bias from endogenous job mobility in fixed-effects estimates of worker- and firm-specific earnings heterogeneity using longitudinally linked employer-employee data from the LEHD infrastructure file system of the U.S. Census Bureau. First, we propose two new residual diagnostic tests of the assumption that mobility is exogenous to unmodeled determinants of earnings. Both tests reject exogenous mobility. We relax the exogenous mobility assumptions by modeling the evolution of the matched data as an evolving bipartite graph using a Bayesian latent class framework. Our results suggest that endogenous mobility biases estimated firm effects toward zero. To assess validity, we match our estimates of the wage components to out-of-sample estimates of revenue per worker. The corrected estimates attribute much more of the variation in revenue per worker to variation in match quality and worker quality than the uncorrected estimates.
How Will Statistical Agencies Operate When All Data Are Private?

John M. Abowd

2016

Abs

The dual problems of respecting citizen privacy and protecting the confidentiality of their data have become hopelessly conflated in the “Big Data” era. There are orders of magnitude more data outside an agency?s firewall than inside it-compromising the integrity of traditional statistical disclosure limitation methods. And increasingly the information processed by the agency was “asked” in a context wholly outside the agency’s operations-blurring the distinction between what was asked and what is published. Already, private businesses like Microsoft, Google and Apple recognize that cybersecurity (safeguarding the integrity and access controls for internal data) and privacy protection (ensuring that what is published does not reveal too much about any person or business) are two sides of the same coin. This is a paradigm-shifting moment for statistical agencies.
Why Statistical Agencies Need to Take Privacy-loss Budgets Seriously, and What It Means When They Do

John M. Abowd

2016

Abs

To appear on fcsm.sites.usa.gov, as presented to the 2016 FCSM Statistical Policy Seminar.
Revisiting the Economics of Privacy: Population Statistics and Confidentiality Protection as Public Goods

John M. Abowd and Ian Schmutte

Jan 2015

Abs URL

We consider the problem of the public release of statistical information about a population?explicitly accounting for the public-good properties of both data accuracy and privacy loss. We first consider the implications of adding the public-good component to recently published models of private data publication under differential privacy guarantees using a Vickery-Clark-Groves mechanism and a Lindahl mechanism. We show that data quality will be inefficiently under-supplied. Next, we develop a standard social planner?s problem using the technology set implied by (?, ?)-differential privacy with (?, ?)-accuracy for the Private Multiplicative Weights query release mechanism to study the properties of optimal provision of data accuracy and privacy loss when both are public goods. Using the production possibilities frontier implied by this technology, explicitly parameterized interdependent preferences, and the social welfare function, we display properties of the solution to the social planner?s problem. Our results directly quantify the optimal choice of data accuracy and privacy loss as functions of the technology and preference parameters. Some of these properties can be quantified using population statistics on marginal preferences and correlations between income, data accuracy preferences, and privacy loss preferences that are available from survey data. Our results show that government data custodians should publish more accurate statistics with weaker privacy guarantees than would occur with purely private data publishing. Our statistical results using the General Social Survey and the Cornell National Social Survey indicate that the welfare losses from under-providing data accuracy while over-providing privacy protection can be substantial.
Codebook for the SIPP Synthetic Beta v6.0.2 [Online]

Lori B. Reeder, Martha Stinson, Kelly E. Trageser, and 1 more author

2015

URL
Graph Kernels via Functional Embedding

Anshumali Shrivastava and Ping Li

CoRR, 2014

Abs URL

We propose a representation of graph as a functional object derived from the power iteration of the underlying adjacency matrix. The proposed functional representation is a graph invariant, i.e., the functional remains unchanged under any reordering of the vertices. This property eliminates the difficulty of handling exponentially many isomorphic forms. Bhattacharyya kernel constructed between these functionals significantly outperforms the state-of-the-art graph kernels on 3 out of the 4 standard benchmark graph classification datasets, demonstrating the superiority of our approach. The proposed methodology is simple and runs in time linear in the number of edges, which makes our kernel more efficient and scalable compared to many widely adopted graph kernels with running time cubic in the number of vertices.
In Defense of MinHash Over SimHash

Anshumali Shrivastava and Ping Li

In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), 2014

Abs URL

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as common in practice such as search. The collision probability of MinHash is a function of resemblance similarity (R), while the collision probability of SimHash is a function of cosine similarity (S). To provide a common basis for comparison, we evaluate retrieval results in terms of S for both MinHash and SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH with respect to S, by using a general inequality S2≤R≤S2−S. Our worst case analysis can show that MinHash significantly outperforms SimHash in high similarity region. Interestingly, our intensive experiments reveal that MinHash is also substantially better than SimHash even in datasets where most of the data points are not too similar to each other. This is partly because, in practical data, often R≥Sz−S holds where z is only slightly larger than 2 (e.g., z≤2.1). Our restricted worst case analysis by assuming Sz−S≤R≤S2−S shows that MinHash indeed significantly outperforms SimHash even in low similarity region. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
b-Bit Minwise Hashing in Practice

Ping Li, Anshumali Shrivastava, and Arnd Christian König

In Internetware 2013, Oct 2013

Abs URL

Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [26, 32] demonstrated a potential use of b-bit minwise hashing [23, 24] for efficient search and learning on massive, high-dimensional, binary data (which are typical for many applications in Web search and text mining). In this paper, we focus on a number of critical issues which must be addressed before one can apply b-bit minwise hashing to the volumes of data often used industrial applications. Minwise hashing requires an expensive preprocessing step that computes k (e.g., 500) minimal values after applying the corresponding permutations for each data vector. We developed a parallelization scheme using GPUs and observed that the preprocessing time can be reduced by a factor of 20 80 and becomes substantially smaller than the data loading time. Reducing the preprocessing time is highly beneficial in practice, e.g., for duplicate Web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers. Another critical issue is that for very large data sets it becomes impossible to store a (fully) random permutation matrix, due to its space requirements. Our paper is the first study to demonstrate that b-bit minwise hashing implemented using simple hash functions, e.g., the 2-universal (2U) and 4-universal (4U) hash families, can produce very similar learning results as using fully random permutations. Experiments on datasets of up to 200GB are presented.
Exact Sparse Recovery with L0 Projections

Ping Li and Cun-Hui Zhang

In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, Illinois, USA, 2013

Abs DOI URL

Many applications (e.g., anomaly detection) concern sparse signals. This paper focuses on the problem of recovering a K-sparse signal x ∈ R/1×N, i.e., K << N and ∑N/i=1 1xi ≠ 0 = K. In the mainstream framework of compressed sensing (CS), × is recovered from M linear measurements y = xS ∈ R/1×M, where S ∈ RN×M is often a Gaussian (or Gaussian-like) design matrix. In our proposed method, the design matrix S is generated from an α-stable distribution with α ≈ 0. Our decoding algorithm mainly requires one linear scan of the coordinates, followed by a few iterations on a small number of coordinates which are "undetermined" in the previous iteration. Our practical algorithm consists of two estimators. In the first iteration, the (absolute) minimum estimator is able to filter out a majority of the zero coordinates. The gap estimator, which is applied in each iteration, can accurately recover the magnitudes of the nonzero coordinates. Comparisons with linear programming (LP) and orthogonal matching pursuit (OMP) demonstrate that our algorithm can be significantly faster in decoding speed and more accurate in recovery quality, for the task of exact spare recovery. Our procedure is robust against measurement noise. Even when there are no sufficient measurements, our algorithm can still reliably recover a significant portion of the nonzero coordinates.
Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search

Anshumali Shrivastava and Ping Li

In Advances in Neural Information Processing Systems 26, 2013

Abs URL

We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R3way = |S1∩S2∩S3| |S1∪S2∪S3| , S1, S2, S3 ∈ C, where C is a size n collection of sets (or binary vectors). We show that approximate R3way similarity search problems admit fast algorithms with provable guarantees, analogous to the pairwise case. Our analysis and speedup guarantees naturally extend to k-way resemblance. In the process, we extend traditional framework of locality sensitive hashing (LSH) to handle higher-order similarities, which could be of independent theoretical interest. The applicability of R3way search is shown on the “Google Sets” application. In addition, we demonstrate the advantage of R3way resemblance over the pairwise case in improving retrieval quality.
One Permutation Hashing

Ping Li, Art Owen, and Cun-Hui Zhang

In Advances in Neural Information Processing Systems 25, 2012

Abs URL

While minwise hashing is promising for large-scale learning in massive binary data, the preprocessing cost is prohibitive as it requires applying (e.g.,) k=500 permutations on the data. The testing time is also expensive if a new data point (e.g., a new document or a new image) has not been processed. In this paper, we develop a simple \textbfone permutation hashing scheme to address this important issue. While it is true that the preprocessing step can be parallelized, it comes at the cost of additional hardware and implementation. Also, reducing k permutations to just one would be much more \textbfenergy-efficient, which might be an important perspective as minwise hashing is commonly deployed in the search industry. While the theoretical probability analysis is interesting, our experiments on similarity estimation and SVM & logistic regression also confirm the theoretical results.
GPU-based minwise hashing: GPU-based minwise hashing

Ping Li, Anshumali Shrivastava, and Arnd Christian König

In Proceedings of the 21st World Wide Web Conference (WWW 2012) (Companion Volume), 2012

Abs DOI URL

Minwise hashing is a standard technique for efficient set similarity estimation in the context of search. The recent work of b-bit minwise hashing provided a substantial improvement by storing only the lowest b bits of each hashed value. Both minwise hashing and b-bit minwise hashing require an expensive preprocessing step for applying k (e.g., k=500) permutations on the entire data in order to compute k minimal values as the hashed data. In this paper, we developed a parallelization scheme using GPUs, which reduced the processing time by a factor of 20-80. Reducing the preprocessing time is highly beneficial in practice, for example, for duplicate web page detection (where minwise hashing is a major step in the crawling pipeline) or for increasing the testing speed of online classifiers (when the test data are not preprocessed).
Entropy Estimations Using Correlated Symmetric Stable Random Projections

Ping Li and Cun-Hui Zhang

In Advances in Neural Information Processing Systems 25, 2012

Abs URL

Methods for efficiently estimating the Shannon entropy of data streams have important applications in learning, data mining, and network anomaly detections (e.g., the DDoS attacks). For nonnegative data streams, the method of Compressed Counting (CC) based on maximally-skewed stable random projections can provide accurate estimates of the Shannon entropy using small storage. However, CC is no longer applicable when entries of data streams can be below zero, which is a common scenario when comparing two streams. In this paper, we propose an algorithm for entropy estimation in general data streams which allow negative entries. In our method, the Shannon entropy is approximated by the finite difference of two correlated frequency moments estimated from correlated samples of symmetric stable random variables. Our experiments confirm that this method is able to substantially better approximate the Shannon entropy compared to the prior state-of-the-art.
Fast Near Neighbor Search in High-Dimensional Binary Data

Anshumali Shrivastava and Ping Li

In The European Conference on Machine Learning (ECML 2012), 2012

Abs URL

Abstract. Numerous applications in search, databases, machine learning, and computer vision, can benefit from efficient algorithms for near neighbor search. This paper proposes a simple framework for fast near neighbor search in high-dimensional binary data, which are common in practice (e.g., text). We develop a very simple and effective strategy for sub-linear time near neighbor search, by creating hash tables directly using the bits generated by b-bit minwise hashing. The advantages of our method are demonstrated through thorough comparisons with two strong baselines: spectral hashing and sign (1-bit) random projections.
Testing for Membership to the IFRA and the NBU Classes of Distributions

Radhendushka Srivastava, Ping Li, and Debasis Sengupta

Journal of Machine Learning Research - Proceedings Track for the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2012), 2012

Abs URL

This paper provides test procedures to determine whether the probability distribution underlying a set of non-negative valued samples belongs to the Increasing Failure Rate Average (IFRA) class or the New Better than Used (NBU) class. Membership of a distribution to one of these classes is known to have implications which are important in reliability, queuing theory, game theory and other disciplines. Our proposed test is based on the Kolmogorov-Smirnov distance between an empirical cumulative hazard function and its best approximation from the class of distributions constituting the null hypothesis. It turns out that the least favorable distribution, which produces the largest probability of Type I error of each of the tests, is the exponential distribution. This fact is used to produce an appropriate cut-off or p-value. Monte Carlo simulations are conducted to check small sample size (i.e., significance) and power of the test. Usefulness of the test is illustrated through the analysis of a set of monthly family expenditure data collected by the National Sample Survey Organization of the Government of India.
Fast Multi-task Learning for Query Spelling Correction

Xu Sun, Anshumali Shrivastava, and Ping Li

In The 21^st ACM International Conference on Information and Knowledge Management (CIKM 2012) , 2012

Abs DOI URL

In this paper, we explore the use of a novel online multi-task learning framework for the task of search query spelling correction. In our procedure, correction candidates are initially generated by a ranker-based system and then re-ranked by our multi-task learning algorithm. With the proposed multi-task learning method, we are able to effectively transfer information from different and highly biased training datasets, for improving spelling correction on all datasets. Our experiments are conducted on three query spelling correction datasets including the well-known TREC benchmark dataset. The experimental results demonstrate that our proposed method considerably outperforms the existing baseline systems in terms of accuracy. Importantly, the proposed method is about one order of magnitude faster than baseline systems in terms of training speed. Compared to the commonly used online learning methods which typically require more than (e.g.,) 60 training passes, our proposed method is able to closely reach the empirical optimum in about 5 passes.
Query spelling correction using multi-task learning

Xu Sun, Anshumali Shrivastava, and Ping Li

In Proceedings of the 21st World Wide Web Conference (WWW 2012)(Companion Volume), 2012

Abs DOI URL

This paper explores the use of online multi-task learning for search query spelling correction, by effectively transferring information from different and biased training datasets for improving spelling correction across datasets. Experiments were conducted on three query spelling correction datasets, including the well-known TREC benchmark data. Our experimental results demonstrate that the proposed method considerably outperforms existing baseline systems in terms of accuracy. Importantly, the proposed method is about one-order of magnitude faster than baseline systems in terms of training speed. In contrast to existing methods which typically require more than (e.g.,) 50 training passes, our algorithm can very closely approach the empirical optimum in around five passes.

External Links

Project Website (archived)
Outcomes Report

Impact

The NCRN project contributed to:

Development of synthetic data methods used by the U.S. Census Bureau
Training of hundreds of researchers in confidential data access
Creation of open-source tools for reproducible research
Advancement of privacy-preserving techniques in economics research