Publications
Work in Progress (2)
Submitted Articles
- Assessing Utility of Differential Privacy for RCTs. Kaitlyn R. Webb, Soumya Mukherjee, Aratrika Mustafi, and 2 more authors. Submitted (R&R), 2026.
Randomized controlled trials (RCTs) have become powerful tools for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for causal inference in the biomedical fields and many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of their inference. These studies typically include the response data that has been collected, de-identified, and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of privacy-preserving synthetic data generation methodologies on published RCT analyses by leveraging available replication packages (research compendia) in economics and policy analysis. We implement three privacy-preserving algorithms that build on one of the basic differentially private (DP) algorithms, the perturbed histogram, while supporting the quality of statistical inference. We highlight challenges with the direct use of this algorithm and of the stability-based histogram in our setting, and describe the adjustments needed. We provide simulation studies and demonstrate that we can replicate the analysis in a published economics article on privacy-protected data under various parameterizations. We find that relatively straightforward (at a high level) privacy-preserving methods influenced by DP techniques allow for inference-valid protection of published data. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.
@article{webb2026hdsr,
  author = {Webb, Kaitlyn R. and Mukherjee, Soumya and Mustafi, Aratrika and Slavković, Aleksandra and Vilhuber, Lars},
  journal = {submitted, R\&R},
  title = {Assessing {Utility} of {Differential} {Privacy} for {RCTs}},
  urldate = {2026-02-10},
  year = {2026},
}
- Improving Results Reporting in RCT Registries. Jack Cavanagh, Sarah Kopper, and Lars Vilhuber. Submitted, 2026.
The complete registration of trials is a critical component of transparent science. Completeness requires not only initial registration, but also post-study updates. We show this practice is lacking in the largest social science registry – the AEA RCT Registry – and conduct a mixed-methods study to first determine drivers and then test interventions to improve updating rates. In a qualitative study of Registry users we find salience and lack of incentives to be the largest barriers. From an RCT we find limited evidence of differences between email nudges, but substantial reporting increases relative to baseline from the overall supernormal number of reminders.
@article{zotero-item-10957,
  author = {Cavanagh, Jack and Kopper, Sarah and Vilhuber, Lars},
  journal = {submitted},
  title = {Improving {Results} {Reporting} in {RCT} {Registries}},
  year = {2026},
}
Published Articles (45)
- Reproducibility and Robustness of Economics and Political Science Research. Abel Brodeur, Derek Mikola, Nikolai Cook, and 346 more authors. Nature, Apr 2026.
This study pushes our understanding of research reliability by reproducing and replicating claims from 110 papers in leading economic and political science journals. The analysis involves computational reproducibility checks and robustness assessments. It reveals several patterns. First, we uncover a high rate of fully computationally reproducible results (over 85%). Second, excluding minor issues like missing packages or broken pathways, we uncover coding errors for about 25% of studies, with some studies containing multiple errors. Third, we test the robustness of the results to 5,511 re-analyses. We find a robustness reproducibility of about 70%. Robustness reproducibility rates are relatively higher for re-analyses that introduce new data and lower for re-analyses that change the sample or the definition of the dependent variable. Fourth, 52% of re-analysis effect size estimates are smaller than the original published estimates and the average statistical significance of a re-analysis is 77% of the original. Lastly, we rely on six teams of researchers working independently to answer eight additional research questions on the determinants of robustness reproducibility. Most teams find a negative relationship between replicators’ experience and reproducibility, while finding no relationship between reproducibility and the provision of intermediate or even raw data combined with the necessary cleaning codes.
@article{brodeur2026,
  author = {Brodeur, Abel and Mikola, Derek and Cook, Nikolai and Brailey, Thomas and Briggs, Ryan and de Gendre, Alexandra and Dupraz, Yannick and Fiala, Lenka and Gabani, Jacopo and Gauriot, Romain and Haddad, Joanne and Lima, Goncalo and Ankel-Peters, Jörg and Dreber, Anna and Campbell, Douglas and Kattan, Lamis and Marino Fages, Diego and Mierisch, Fabian and Sun, Pu and Wright, Taylor and Connolly, Marie and Hoces de la Guardia, Fernando and Johannesson, Magnus and Miguel, Edward and Vilhuber, Lars and Abarca, Alejandro and Acharya, Mahesh and Adjisse, Sossou Simplice and Akhtar, Ahwaz and Ramirez Lizardi, Eduardo Alberto and Albrecht, Sabina and Andersen, Synøve Nygaard and Andlib, Zubaria and Arrora, Falak and Ash, Thomas and Bacher, Etienne and Bachler, Sebastian and Bacon, Félix and Bagues, Manuel and Balogh, Timea and Batmanov, Alisher and Barschkett, Mara and Basdil, B. Kaan and Baxa, Jaromír and Becker, Sascha and Beeder, Monica and Beland, Louis-Philippe and Bello, Abdel Hamid and Markovits, Daniel Benenson and Benjamin, Grant and Bergeron, Thomas and Blimpo, Moussa P.
and Binetti, Marco and Bonander, Carl and Bonneau, Joseph and Borbáth, Endre and Topstad Borgen, Nicolai and Topstad Borgen, Solveig and Borowsky, Jonathan and Brini, Elisa and Brown, Myriam and Brun, Martín and Bruns, Stephan and Buliskeria, Nino and Calef, Andrea and Cameron, Alistair and Campa, Pamela and Campos-Rodríguez, Santiago and Cantone, Giulio Giacomo and Carpena, Fenella and Carter, Perry and Castañeda Dower, Paul and Castek, Ondrej and Caviglia-Harris, Jill and Strand, Gabriella Chauca and Chen, Shi and Chzhen, Asya and Chung, Jong and Collins, Jason and Coppock, Alexander and Cordeau, Hugo and Couillard, Ben and Crechet, Jonathan and Crippa, Lorenzo and Cui, Jeanne and Czymara, Christian and Daarstad, Haley and Dao, Danh Chi and Dao, Dong and Schmandt, Marco David and de Linde, Astrid and De Melo, Lucas and Deer, Lachlan and De Vera, Micole and Dimitrova, Velichka and Dollbaum, Jan Fabian and Dollbaum, Jan Matti and Donnelly, Michael and Huynh, Luu Duc Toan and Dumbalska, Tsvetomira and Duncan, Jamie and Duong, Kiet Tuan and Duprey, Thibaut and Dworschak, Christoph and Ellingsrud, Sigmund and Elminejad, Ali and Eissa, Yasmine and Erhart, Andrea and Etingin-Frati, Giulian and Fatemi-Pour, Elaheh and Federice, Alexa and Feld, Jan and Fenig, Guidon and Firouzjaeiangalougah, Mojtaba and Fleisje, Erlend and Fortier-Chouinard, Alexandre and Engel, Julia Francesca and Fries, Tilman and Fortier, Reid and Fréchet, Nadjim and Galipeau, Thomas and Gallegos, Sebastián and Gangji, Areez and Gao, Xiaoying and Garnache, Cloé and Gáspár, Attila and Gavrilova, Evelina and Ghosh, Arijit and Gibney, Garreth and Gibson, Grant and Godager, Geir and Goff, Leonard and Gong, Da and González, Javier and Gretton, Jeremy and Griffa, Cristina and Grigoryeva, Idaliya and Grøtting, Maja and Guntermann, Eric and Guo, Jiaqi and Gugushvili, Alexi and Habibnia, Hooman and Häffner, Sonja and Hall, Jonathan D. 
and Hammar, Olle and Kordt, Amund Hanson and Hashimoto, Barry and Hartley, Jonathan S. and Hausladen, Carina I. and Havránek, Tomáš and Hazen, Jacob and He, Harry and Hepplewhite, Matthew and Herrera-Rodriguez, Mario and Heuer, Felix and Heyes, Anthony and Ho, Anson T. Y. and Holmes, Jonathan and Holzknecht, Armando and Hsu, Yu-Hsiang Dexter and Hu, Shiang-Hung and Huang, Yu-Shiuan and Huebener, Mathias and Huber, Christoph and Huynh, Kim P. and Irsova, Zuzana and Isler, Ozan and Jakobsson, Niklas and Frith, Michael James and Jananji, Raphaël and Jayalath, Tharaka A. and Jetter, Michael and John, Jenny and Forshaw, Rachel Joy and Juan, Felipe and Kadriu, Valon and Karim, Sunny and Kelly, Edmund and Dang, Duy Khanh Hoang and Khushboo, Tazia and Kim, Jin and Kjellsson, Gustav and Kjelsrud, Anders and Kotsadam, Andreas and Korpershoek, Jori and Krashinsky, Lewis and Kundu, Suranjana and Kustov, Alexander and Lalayev, Nurlan and Langlois, Audrée and Laufer, Jill and Lee-Whiting, Blake and Leibing, Andreas and Lenz, Gabriel and Levin, Joel and Li, Peng and Li, Tongzhe and Lin, Yuchen and Listo, Ariel and Liu, Dan and Lu, Xuewen and Lukmanova, Elvina and Luscombe, Alex and Lusher, Lester R. and Lyu, Ke and Ma, Hai and Mäder, Nicolas and Makate, Clifton and Malmberg, Alice and Maitra, Adit and Mandas, Marco and Marcus, Jan and Margaryan, Shushanik and Márk, Lili and Martignano, Andres and Marsh, Abigail and Masetto, Isabella and McCanny, Anthony and McManus, Emma and McWay, Ryan and Metson, Lennard and Kinge, Jonas Minet and Mishra, Sumit and Mohnen, Myra and Möller, Jakob and Montambeault, Rosalie and Montpetit, Sébastien and Morin, Louis-Philippe and Morris, Todd and Moser, Scott and Motoki, Fabio and Muehlenbachs, Lucija and Musulan, Andreea and Musumeci, Marco and Nabin, Munirul and Nchare, Karim and Neubauer, Florian and Nguyen, Quan M. P. 
and Nguyen, Tuan and Nguyen-Tien, Viet and Niazi, Ali and Nikolaishvili, Giorgi and Nordstrom, Ardyn and Nüß, Patrick and Odermatt, Angela and Olson, Matt and Øien, Henning and Ölkers, Tim and Oliver i Vert, Miquel and Oral, Emre and Oswald, Christian and Ousman, Ali and Özak, Ömer and Pandey, Shubham and Pavlov, Alexandre and Pelli, Martino and Penheiro, Romeo and Park, RyuGyung and Pérez Martel, Eva and Petrovičová, Tereza and Phan, Linh and Prettyman, Alexa and Procházka, Jakub and Putri, Aqila and Quandt, Julian and Qiu, Kangyu and Nguyen, Loan Quynh Thi and Rahman, Andaleeb and Rea, Carson H. and Reiremo, Adam and Renée, Laëtitia and Richardson, Joseph and Rivers, Nicholas and Rodrigues, Bruno and Roelofs, William and Roemer, Tobias and Rogeberg, Ole and Rose, Julian and Roskos-Ewoldsen, Andrew and Rosmer, Paul and Sabada, Barbara and Saberian, Soodeh and Salamanca, Nicolas and Sator, Georg and Sawyer, Antoine and Scates, Daniel and Schlüter, Elmar and Sells, Cameron and Sen, Sharmi and Sethi, Ritika and Shcherbiak, Anna and Sogaolu, Moyosore and Soosalu, Matt and Sørensen, Erik Ø and Sovani, Manali and Spencer, Noah and Staubli, Stefan and Stans, Renske and Stewart, Anya and Stips, Felix and Stockley, Kieran and Strobel, Stephenson and Struby, Ethan and Tang, John and Tanrisever, Idil and Yang, Thomas Tao and Tastan, Ipek and Tatić, Dejan and Tatlow, Benjamin and Seuyong, Féraud Tchuisseu and Thériault, Rémi and Thivierge, Vincent and Tian, Wenjie and Toma, Filip-Mihai and Totarelli, Maddalena and Tran, Van-Anh and Truong, Hung and Tsoy, Nikita and Tuzcuoglu, Kerem and Ubfal, Diego and Villalobos, Laura and Walterskirchen, Julian and Wang, Joseph Taoyi and Wattal, Vasudha and Webb, Matthew D. 
and Weber, Bryan and Weisser, Reinhard and Weng, Wei-Chien and Westheide, Christian and White, Kimberly and Winter, Jacob and Wochner, Timo and Woerman, Matt and Wong, Jared and Woodard, Ritchie and Wroński, Marcin and Yazbeck, Myra and Yang, Gustav Chung and Yap, Luther and Yassin, Kareman and Ye, Hao and Yoon, Jin Young and Yurris, Chris and Zahra, Tahreen and Zaneva, Mirela and Zayat, Aline and Zhang, Jonathan and Zhao, Ziwei and Yaolang, Zhong},
  copyright = {All rights reserved},
  doi = {10.1038/s41586-026-10251-x},
  journal = {Nature},
  language = {eng},
  month = apr,
  pages = {151--156},
  shorttitle = {Mass {Reproducibility} and {Replicability}},
  title = {Reproducibility and {Robustness} of {Economics} and {Political} {Science} {Research}},
  urldate = {2024-04-08},
  volume = {652},
  year = {2026},
}
- Open Science in den Wirtschaftswissenschaften: Transparenz, Reproduzierbarkeit und Zugang. Klaus M. Schmidt, Levent Neyse, Marianne Saam, and 3 more authors. Perspektiven der Wirtschaftspolitik, 2026.
This article discusses open science in economics as a bundle of practices for improving the transparency, reproducibility, and accessibility of scientific research. It shows that preregistration and registered reports, open data and open code, and open access can strengthen the credibility of empirical research, but must at the same time take discipline-specific limits and trade-offs into account. For economics and for research funding, this implies the need for reliable infrastructures, clear standards, and sustained institutional support, in particular for non-commercial open-access models such as Diamond Open Access.
@article{schmidtetal2026,
  author = {Schmidt, Klaus M. and Neyse, Levent and Saam, Marianne and Siegfried, Doreen and Vilhuber, Lars and Winter, Joachim},
  journal = {Perspektiven der Wirtschaftspolitik},
  title = {Open {Science} in den {Wirtschaftswissenschaften}: {Transparenz}, {Reproduzierbarkeit} und {Zugang}},
  volume = {forthcoming},
  year = {2026},
}
- A Simulated Reconstruction and Reidentification Attack on the 2010 U.S. Census. John M. Abowd, Tamara Adams, Robert Ashmead, and 12 more authors. Harvard Data Science Review, Jul 2025.
For the last half-century, it has been a common and accepted practice for statistical agencies, including the United States Census Bureau, to adopt different strategies to protect the confidentiality of aggregate tabular data products from those used to protect the individual records contained in publicly released microdata products. This strategy was premised on the assumption that the aggregation used to generate tabular data products made the resulting statistics inherently less disclosive than the microdata from which they were tabulated. Consistent with this common assumption, the 2010 Census of Population and Housing in the United States used different disclosure limitation rules for its tabular and microdata publications. This article demonstrates that, in the context of disclosure limitation for the 2010 Census, the assumption that tabular data are inherently less disclosive than their underlying microdata is fundamentally flawed. The 2010 Census published more than 150 billion aggregate statistics in 180 table sets. Most of these tables were published at the most detailed geographic level—individual census blocks, which can have populations as small as one person. Using only 34 of the published table sets, we reconstructed microdata records including five variables (census block, sex, age, race, and ethnicity) from the confidential 2010 Census person records. Using only published data, an attacker using our methods can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. We further confirm, through reidentification studies, that an attacker can, within census blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with race and ethnicity different from the modal person on the census block) with 95% accuracy. 
Having shown the vulnerabilities inherent to the disclosure limitation methods used for the 2010 Census, we proceed to demonstrate that the more robust disclosure limitation framework used for the 2020 Census publications defends against attacks that are based on reconstruction. Finally, we show that available alternatives to the 2020 Census Disclosure Avoidance System would either fail to protect confidentiality, or would overly degrade the statistics’ utility for the primary statutory use case: redrawing the boundaries of all of the nation’s legislative and voting districts in compliance with the 1965 Voting Rights Act.
@article{abowd2025,
  author = {Abowd, John M. and Adams, Tamara and Ashmead, Robert and Darais, David and Dey, Sourya and Garfinkel, Simson and Goldschlag, Nathan and Hawes, Michael B. and Kifer, Daniel and Leclerc, Philip and Lew, Ethan and Moore, Scott and Rodríguez, Rolando A. and Tadros, Ramy N. and Vilhuber, Lars},
  copyright = {CC BY Attribution 4.0 International},
  doi = {10.1162/99608f92.4a1ebf70},
  journal = {Harvard Data Science Review},
  language = {en},
  month = jul,
  number = {3},
  publisher = {MIT Press},
  title = {A {Simulated} {Reconstruction} and {Reidentification} {Attack} on the 2010 {U}.{S}. {Census}},
  url = {https://hdsr.mitpress.mit.edu/pub/ntchx9im},
  urldate = {2025-07-21},
  volume = {7},
  year = {2025},
}
- TROV - A Model and Vocabulary for Describing Transparent Research Objects. Meng Li, Timothy McPhillips, Craig Willis, and 6 more authors. International Journal of Digital Curation, Feb 2025.
The Transparent Research Object Vocabulary (TROV) is a key element of the Transparency Certified (TRACE) approach to ensuring research trustworthiness. In contrast with methods that entail repeating computations in part or in full to verify that the descriptions of methods included in a publication are sufficient to reproduce reported results, the TRACE approach depends on a controlled computing environment termed a Transparent Research System (TRS) to guarantee that accurate, sufficiently complete, and otherwise trustworthy records are captured when results are obtained in the first place. Records identifying (1) the digital artifacts and computations that yielded a research result, (2) the TRS that witnessed the artifacts and supervised the computations, and (3) the specific conditions enforced by the TRS that warrant trust in these records, together constitute a Transparent Research Object (TRO). Digital signatures provided by the TRS and by a trusted third-party timestamp authority (TSA) guarantee the integrity and authenticity of the TRO. The controlled vocabulary TROV provides means to declare and query the properties of a TRO, to enumerate the dimensions of trustworthiness the TRS asserts for a TRO, and to verify that each such assertion is warranted by the documented capabilities of the TRS. Our approach for describing, publishing, and working with TROs imposes no restrictions on how computational artifacts are packaged or otherwise shared, and aims to be interoperable with, rather than to replace, current and future Research Object standards, archival formats, and repository layouts.
@article{li2025,
  author = {Li, Meng and McPhillips, Timothy and Willis, Craig and Parulian, Nikolaus and Ludäscher, Bertram and Kowalik, Kacper and Vilhuber, Lars and Lewis, Thu-Mai and Gooch, Mandy},
  copyright = {CC BY Attribution 4.0 International},
  doi = {10.2218/ijdc.v19i1.1019},
  issn = {1746-8256},
  journal = {International Journal of Digital Curation},
  language = {en},
  month = feb,
  number = {1},
  pages = {7--7},
  title = {{TROV} - {A} {Model} and {Vocabulary} for {Describing} {Transparent} {Research} {Objects}},
  url = {https://ijdc.net/index.php/ijdc/article/view/1019},
  urldate = {2025-02-12},
  volume = {19},
  year = {2025},
}
- Using Containers to Validate Research on Confidential Data at Scale. Lars Vilhuber. Harvard Data Science Review, Jun 2025.
I describe past experience with the validation server process over 10 years and several hundred users, as a means to provide proxy access to confidential data. As a modern replacement, I propose the use of containers—simulated computers that encapsulate entire software and file structures. The use of containers ensures reproducibility, reliable portability, and enables scalability. Infrastructure can be outsourced to commercial providers or users, at little to no cost to data providers. The only likely limitation to full automation is the absence of automated output vetting algorithms at statistical agencies.
@article{vilhuber2025e,
  author = {Vilhuber, Lars},
  copyright = {CC BY Attribution 4.0 International},
  doi = {10.1162/99608f92.4d1853ce},
  journal = {Harvard Data Science Review},
  language = {en},
  month = jun,
  publisher = {The MIT Press},
  title = {Using {Containers} to {Validate} {Research} on {Confidential} {Data} at {Scale}},
  url = {https://hdsr.mitpress.mit.edu/pub/8r9mdkpl/release/1},
  urldate = {2025-06-23},
  year = {2025},
}
- Reproducibility and Open Science in Economics. Lars Vilhuber. Revue économique, 2025.
Drawing on my experience as the American Economic Association’s data editor, I examine the current state of open science in economics as facilitated by and related to reproducibility. I touch on the tension between accessibility, sharing, and preservation. The guiding theme is the accessibility of the key ingredients for scholarship: manuscripts, data, software, and the necessary technology to combine the latter two in order to produce knowledge. I analyze how economic research balances openness with necessary restrictions, particularly regarding administrative and confidential data. I argue that a large degree of openness is nevertheless present, with extensive networks that include thousands of researchers supporting collaborative science. I argue that resource constraints, such as software licensing costs and computational resource requirements, pose similar challenges. I illustrate concrete benefits of open science in the economics literature, using recent articles. I wrap up by discussing the state of access to scientific articles in economics.
@article{vilhuberlarsReproducibilityOpenScience2025,
  author = {Vilhuber, Lars},
  copyright = {All rights reserved},
  doi = {10.3917/reco.765.0697},
  journal = {Revue économique},
  number = {5},
  title = {Reproducibility and {Open} {Science} in {Economics}},
  url = {https://shs.cairn.info/journal-revue-economique-2025-5-page-697?lang=en},
  volume = {76},
  year = {2025},
}
- Reproduce to validate: A comprehensive study on the reproducibility of economics research. Sylvérie Herbert, Hautahi Kingi, Flavio Stanchi, and 1 more author. Canadian Journal of Economics/Revue canadienne d’économique, Aug 2024.
Journals have pushed for transparency of research through data availability policies. Such data policies improve availability of data and code, but what is the impact on reproducibility? We present results from a large reproduction exercise for articles published in the American Economic Journal: Applied Economics, which has had a data availability policy since its inception in 2009. Out of 363 published articles, we assessed 274 articles. All articles provided some materials. We excluded 122 articles that required confidential or proprietary data or that required the replicator to otherwise obtain the data (44.5% of assessed articles). We attempted to reproduce 152 articles and were able to fully reproduce the results of 68 (44.7% of attempted reproductions). A further 66 (43.4% of attempted reproductions) were partially reproduced. Many articles required complex code changes even when at least partially reproduced. We collect bibliometric characteristics of authors, but find no evidence for author characteristics as determinants of reproducibility. There does not appear to be a citation bonus for reproducibility. The data availability policy of this journal was effective to ensure availability of materials, but is insufficient to ensure reproduction without additional work by replicators.
@article{herbert2024b, abstract = {Abstract Journals have pushed for transparency of research through data availability policies. Such data policies improve availability of data and code, but what is the impact on reproducibility? We present results from a large reproduction exercise for articles published in the American Economic Journal: Applied Economics, which has had a data availability policy since its inception in 2009. Out of 363 published articles, we assessed 274 articles. All articles provided some materials. We excluded 122 articles that required confidential or proprietary data or that required the replicator to otherwise obtain the data (44.5\% of assessed articles). We attempted to reproduce 152 articles and were able to fully reproduce the results of 68 (44.7\% of attempted reproductions). A further 66 (43.4\% of attempted reproductions) were partially reproduced. Many articles required complex code changes even when at least partially reproduced. We collect bibliometric characteristics of authors, but find no evidence for author characteristics as determinants of reproducibility. There does not appear to be a citation bonus for reproducibility. The data availability policy of this journal was effective to ensure availability of materials, but is insufficient to ensure reproduction without additional work by replicators. , Résumé Les journaux militent pour la transparence de la recherche par le biais de politiques sur la disponibilité des données . De telles politiques sur les données améliorent la disponibilité des données et du code, mais quelle en est l'incidence sur la reproductibilité? Nous présentons les résultats d'un grand exercice de reproduction pour des articles publiés dans le American Economic Journal: Applied Economics , qui a une politique de disponibilité des données depuis sa création en 2009. Parmi les 363 articles publiés, nous en évaluons 274. Tous les articles avaient fourni certains documents. 
Nous avons exclu 122 articles qui nécessitaient des données confidentielles ou exclusives, ou qui exigeaient que la personne chargée de la réplication obtienne autrement les données (44,5\% des articles évalués). Nous avons tenté de reproduire 152 articles et avons réussi à entièrement reproduire les résultats de 68 articles (44,7\% des reproductions tentées). Nous avons aussi partiellement reproduit 66 autres articles (43,4\% des reproductions tentées). De nombreux articles nécessitaient de complexes changements de code, même lorsqu'ils étaient au moins partiellement reproduits. Nous avons recueilli les caractéristiques bibliométriques des auteurs, mais n'avons pas constaté qu'elles étaient déterminantes pour la reproductibilité. La reproductibilité ne semble offrir aucun avantage en matière de citations. La politique de disponibilité des données de ce journal a été efficace pour assurer la disponibilité des documents, mais insuffisante pour assurer la reproduction sans travail supplémentaire des personnes chargées de la réplication.}, author = {Herbert, Sylvérie and Kingi, Hautahi and Stanchi, Flavio and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1111/caje.12728}, issn = {0008-4085, 1540-5982}, journal = {Canadian Journal of Economics/Revue canadienne d'économique}, language = {en}, month = aug, pages = {caje.12728}, shorttitle = {Reproduce to validate}, title = {Reproduce to validate: {A} comprehensive study on the reproducibility of economics research}, url = {https://onlinelibrary.wiley.com/doi/10.1111/caje.12728}, urldate = {2024-08-05}, year = {2024}, month_numeric = {8} } - A Guide for Social Science Journal Editors on Easing into Open SciencePriya Silverstein, Colin Elman, Amanda Montoya, and 36 more authorsResearch Integrity and Peer Review, Feb 2024
Journal editors have a large amount of power to advance open science in their respective fields by incentivizing and mandating open policies and practices at their journals. The Data PASS Journal Editors Discussion Interface (JEDI, an online community for social science journal editors: www.dpjedi.org) has collated several resources on embedding open science in journal editing (www.dpjedi.org/resources). However, it can be overwhelming for an editor new to open science practices to know where to start. For this reason, we have created a guide for journal editors on how to get started with open science. The guide outlines steps that editors can take to implement open policies and practices within their journal, and goes through the what, why, how, and worries of each policy and practice. This manuscript introduces and summarizes the guide (full guide: https://doi.org/10.31219/osf.io/hstcx).
@article{silverstein2024, abstract = {Journal editors have a large amount of power to advance open science in their respective fields by incentivizing and mandating open policies and practices at their journals. The Data PASS Journal Editors Discussion Interface (JEDI, an online community for social science journal editors: www.dpjedi.org) has collated several resources on embedding open science in journal editing (www.dpjedi.org/resources). However, it can be overwhelming for an editor new to open science practices to know where to start. For this reason, we have created a guide for journal editors on how to get started with open science. The guide outlines steps that editors can take to implement open policies and practices within their journal, and goes through the what, why, how, and worries of each policy and practice. This manuscript introduces and summarizes the guide (full guide: https://doi.org/10.31219/osf.io/hstcx).}, author = {Silverstein, Priya and Elman, Colin and Montoya, Amanda and McGillivray, Barbara and Pennington, Charlotte Rebecca and Harrison, Chase H. and Steltenpohl, Crystal N. and Röer, Jan Philipp and Corker, Katherine S. and Charron, Lisa M. and Elsherif, Mahmoud and Na, Ana and Hayes-Harb, Rachel and Grinschgl, Sandra and Neal, Tess M. S. and Evans, Thomas Rhys and Karhulahti, Veli-Matti and Krenzer, William Leo Donald and Belaus, Anabel and Moreau, David and Burin, D. I. and Chin, Elizabeth and Plomp, Esther and Mayo-Wilson, Evan and Lyle, Jared and Adler, Jonathan M. and Bottesini, Julia G. and Lawson, Katherine M. and Schmidt, Kathleen and Reneau, Kyrani and Vilhuber, Lars and Waltman, Ludo and Gernsbacher, Morton Ann and Plonski, Paul E. and Ghai, Sakshi and Grant, Sean and Christian, Thu-Mai Lewis and Ngiam, William X. Q. 
and Syed, Moin}, copyright = {CC BY Attribution 4.0 International}, doi = {10.1186/s41073-023-00141-5}, journal = {Research Integrity and Peer Review}, language = {en-us}, month = feb, number = {2}, title = {A {Guide} for {Social} {Science} {Journal} {Editors} on {Easing} into {Open} {Science}}, urldate = {2024-01-18}, volume = {9}, year = {2024}, month_numeric = {2} } - Introduction to the special issue: Models of linked employer–employee data: Twenty years after “High wage workers and high wage firms”David Card, Ian Schmutte, and Lars VilhuberJournal of Econometrics, 2023
Thirteen papers in this special issue build directly on the legacy of AKM’s model and assumptions. One paper addresses the question of whether prior employers matter for wages. Four papers examine how the firm effects in Eq. (1) evolve over time or with specific external shocks. Three papers study the sorting of workers to firms — i.e., the matching process between workers and firms — and how this process varies over the business cycle. Four papers use models based on Eq. (1) to decompose gaps in mean outcomes between subgroups of workers and quantify the mediating role of firms in these gaps. And one examines the role of firm-wide factors in the take-up rate of social insurance by employees. Finally, two other papers, while not directly based on AKM, address important questions that arise in longitudinal panel data when there is endogenous sorting.
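For context, the "Eq. (1)" referenced in this abstract is the wage equation of Abowd, Kramarz, and Margolis (AKM), which in its standard form from the literature (a sketch with conventional notation, not reproduced from the special issue itself) reads:

```latex
% Canonical AKM log-wage decomposition (standard form; notation assumed)
\[
  y_{it} = \alpha_i + \psi_{J(i,t)} + x_{it}'\beta + \varepsilon_{it}
\]
```

where \(y_{it}\) is the log wage of worker \(i\) in period \(t\), \(\alpha_i\) is a worker fixed effect, \(\psi_{J(i,t)}\) is the fixed effect of the firm \(J(i,t)\) employing worker \(i\) at \(t\), \(x_{it}\) are time-varying covariates, and \(\varepsilon_{it}\) is an error term.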
@article{CARD2023333, abstract = {Thirteen papers in this special issue build directly on the legacy of AKM’s model and assumptions. One paper addresses the question of whether prior employers matter for wages. Four papers examine how the firm effects in Eq. (1) evolve over time or with specific external shocks. Three papers study the sorting of workers to firms — i.e., the matching process between workers and firms — and how this process varies over the business cycle. Four papers use models based on Eq. (1) to decompose gaps in mean outcomes between subgroups of workers and quantify the mediating role of firms in these gaps. And one examines the role of firm-wide factors in the take-up rate of social insurance by employees. Finally, two other papers, while not directly based on AKM, address important questions that arise in longitudinal panel data when there is endogenous sorting.}, author = {Card, David and Schmutte, Ian and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1016/j.jeconom.2023.01.012}, issn = {0304-4076}, journal = {Journal of Econometrics}, number = {2}, pages = {333--339}, title = {Introduction to the special issue: {Models} of linked employer–employee data: {Twenty} years after “{High} wage workers and high wage firms”}, url = {https://www.sciencedirect.com/science/article/pii/S0304407623000337}, volume = {233}, year = {2023}, grant = {G-2019-12486}, } - Reproducibility and transparency versus privacy and confidentiality: Reflections from a data editorLars VilhuberJournal of Econometrics, Jun 2023Published online
Transparency and reproducibility are often seen in opposition to privacy and confidentiality. Data that need to be kept confidential are seen as an impediment to reproducibility, and privacy would seem to inhibit transparency. I bring a more nuanced view to the discussion, and show, using examples from over 1,000 reproducibility assessments, that confidential data can very well be used in reproducible and transparent research. The key insight is that access to most confidential data, while tedious, is open to hundreds if not thousands of researchers. In cases where few researchers can consider accessing such data in the future, reproducibility services, such as those provided by some journals, can provide some evidence for effective reproducibility even when the same data may not be available for future research.
@article{vilhuber2023a, abstract = {Transparency and reproducibility are often seen in opposition to privacy and confidentiality. Data that need to be kept confidential are seen as an impediment to reproducibility, and privacy would seem to inhibit transparency. I bring a more nuanced view to the discussion, and show, using examples from over 1,000 reproducibility assessments, that confidential data can very well be used in reproducible and transparent research. The key insight is that access to most confidential data, while tedious, is open to hundreds if not thousands of researchers. In cases where few researchers can consider accessing such data in the future, reproducibility services, such as those provided by some journals, can provide some evidence for effective reproducibility even when the same data may not be available for future research.}, author = {Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1016/j.jeconom.2023.05.001}, issn = {03044076}, journal = {Journal of Econometrics}, language = {en}, month = jun, note = {Published online}, number = {2}, pages = {2285--2294}, shorttitle = {Reproducibility and transparency versus privacy and confidentiality}, title = {Reproducibility and transparency versus privacy and confidentiality: {Reflections} from a data editor}, url = {https://linkinghub.elsevier.com/retrieve/pii/S0304407623001471}, urldate = {2023-06-06}, volume = {235}, year = {2023}, month_numeric = {6} } - Reinforcing Reproducibility and Replicability: An IntroductionLars Vilhuber, Ian Schmutte, Aleksandr Michuda, and 1 more authorHarvard Data Science Review, Jul 2023
The purpose of scientific publishing is the dissemination of robust research findings, exposing them to the scrutiny of peers. The key to this endeavor is documenting the provenance of those findings. Scientific practices during the course of research and subsequent publication, peer review, and dissemination practices and tools, all interact to (hopefully) enable a meaningful discourse about the veracity of scientific claims. However, while all practices and tools contribute to the final output, some are less often discussed than others, and perceptions, usage, and acceptance differ in myriad ways across disciplines. In this special theme, and in a subsequent column called “Reinforcing Reproducibility and Replicability,” we will explore these topics, with expert providers and expert users providing their input. While we will start within the economics discipline in this special theme, the column will not be as narrowly focused, providing context and voice from other disciplines over time.
@article{vilhuber2023i, abstract = {The purpose of scientific publishing is the dissemination of robust research findings, exposing them to the scrutiny of peers. The key to this endeavor is documenting the provenance of those findings. Scientific practices during the course of research and subsequent publication, peer review, and dissemination practices and tools, all interact to (hopefully) enable a meaningful discourse about the veracity of scientific claims. However, while all practices and tools contribute to the final output, some are less often discussed than others, and perceptions, usage, and acceptance differ in myriad ways across disciplines. In this special theme, and in a subsequent column called “Reinforcing Reproducibility and Replicability,” we will explore these topics, with expert providers and expert users providing their input. While we will start within the economics discipline in this special theme, the column will not be as narrowly focused, providing context and voice from other disciplines over time.}, author = {Vilhuber, Lars and Schmutte, Ian and Michuda, Aleksandr and Connolly, Marie}, copyright = {CC BY Attribution 4.0 International}, doi = {10.1162/99608f92.9ba2bd43}, issn = {2644-2353, 2688-8513}, journal = {Harvard Data Science Review}, language = {en}, month = jul, number = {3}, publisher = {The MIT Press}, shorttitle = {Reinforcing {Reproducibility} and {Replicability}}, title = {Reinforcing {Reproducibility} and {Replicability}: {An} {Introduction}}, url = {https://hdsr.mitpress.mit.edu/pub/l8dmf3cm/release/2}, urldate = {2026-02-20}, volume = {5}, year = {2023}, grant = {SES-2217493}, month_numeric = {7} } - An Interview with John M. AbowdIan Schmutte and Lars VilhuberInternational Statistical Review, Feb 2022
John M. Abowd is the Chief Scientist and Associate Director for Research and Methodology, U.S. Census Bureau. He completed his A.B. in Economics at Notre Dame in 1973 and his Ph.D. in Economics at the University of Chicago in 1977 under Arnold Zellner. During his academic career, John has held faculty positions at Princeton, the University of Chicago, and, since 1987, at Cornell University, where he is the Edmund Ezra Day Professor Emeritus of Economics, Statistics and Data Science. John was trained as a statistician and labor economist, and his economic research has focused on the rigorous empirical evaluation of labor market institutions. In the late 1990s, he began working with the Census Bureau on projects that would end up leveraging administrative and survey records into official statistical products. Through that work, he has developed a research agenda focused on issues necessary to generate those products, including data privacy, synthetic data, total error analysis, data linkage, and missing data problems, among others.
@article{schmutte2022, abstract = {John M. Abowd is the Chief Scientist and Associate Director for Research and Methodology, U.S. Census Bureau. He completed his A.B. in Economics at Notre Dame in 1973 and his Ph.D. in Economics at the University of Chicago in 1977 under Arnold Zellner. During his academic career, John has held faculty positions at Princeton, the University of Chicago, and, since 1987, at Cornell University, where he is the Edmund Ezra Day Professor Emeritus of Economics, Statistics and Data Science. John was trained as a statistician and labor economist, and his economic research has focused on the rigorous empirical evaluation of labor market institutions. In the late 1990s, he began working with the Census Bureau on projects that would end up leveraging administrative and survey records into official statistical products. Through that work, he has developed a research agenda focused on issues necessary to generate those products, including data privacy, synthetic data, total error analysis, data linkage, and missing data problems, among others.}, author = {Schmutte, Ian and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1111/insr.12489}, issn = {0306-7734, 1751-5823}, journal = {International Statistical Review}, language = {en}, month = feb, pages = {insr.12489}, title = {An {Interview} with {John} {M}. {Abowd}}, url = {https://onlinelibrary.wiley.com/doi/10.1111/insr.12489}, urldate = {2022-02-21}, year = {2022}, month_numeric = {2} } - Teaching for large-scale Reproducibility VerificationLars Vilhuber, Hyuk Harry Son, Meredith Welch, and 2 more authorsJournal of Statistics and Data Science Education, Sep 2022
We describe a unique environment in which undergraduate students from various STEM and social science disciplines are trained in data provenance and reproducible methods, and then apply that knowledge to real, conditionally accepted manuscripts and associated replication packages. We describe in detail the recruitment, training, and regular activities. While the activity is not part of a regular curriculum, the skills and knowledge taught through explicit training of reproducible methods and principles, and reinforced through repeated application in a real-life workflow, contribute to the education of these undergraduate students, and prepare them for post-graduation jobs and further studies.
@article{vilhuber2022a, abstract = {We describe a unique environment in which undergraduate students from various STEM and social science disciplines are trained in data provenance and reproducible methods, and then apply that knowledge to real, conditionally accepted manuscripts and associated replication packages. We describe in detail the recruitment, training, and regular activities. While the activity is not part of a regular curriculum, the skills and knowledge taught through explicit training of reproducible methods and principles, and reinforced through repeated application in a real-life workflow, contribute to the education of these undergraduate students, and prepare them for post-graduation jobs and further studies.}, author = {Vilhuber, Lars and Son, Hyuk Harry and Welch, Meredith and Wasser, David N. and Darisse, Michael}, copyright = {CC BY Attribution 4.0 International}, doi = {10.1080/26939169.2022.2074582}, journal = {Journal of Statistics and Data Science Education}, language = {en}, month = sep, number = {3}, pages = {274--281}, title = {Teaching for large-scale {Reproducibility} {Verification}}, url = {https://arxiv.org/abs/2204.01540v1}, volume = {30}, year = {2022}, month_numeric = {9} } - On privacy in the age of COVID-19Cynthia Dwork, Alan Karr, Kobbi Nissim, and 1 more authorJournal of Privacy and Confidentiality, Feb 2021Not peer-reviewed.
As a third of the world population has been in lockdown [1], nearly half a million people have died of COVID-19 [2], and the world’s economies have nosedived, policy makers and the public clamor for good news, or even just less uncertainty. Questions such as “Might I be infected if I go to work?” or “Does wearing a mask help prevent the spread of the disease?” are being asked. Answering these questions requires data! We need data on the infectiousness of the disease, as well as the efficacy of interventions such as lockdowns, distancing, and protective measures. Because the disease is novel, we do not know whether scientifically collected data from previous pandemics are relevant. Both repurposing relevant existing data collections, and the quick and effective design of new data collections are top priorities for informed, high quality decision-making.
@article{Dwork_Karr_Nissim_Vilhuber_2021, abstract = {As a third of the world population has been in lockdown [1], nearly half a million people have died of COVID-19 [2], and the world’s economies have nosedived, policy makers and the public clamor for good news, or even just less uncertainty. Questions such as “Might I be infected if I go to work?” or “Does wearing a mask help prevent the spread of the disease?” are being asked. Answering these questions requires data! We need data on the infectiousness of the disease, as well as the efficacy of interventions such as lockdowns, distancing, and protective measures. Because the disease is novel, we do not know whether scientifically collected data from previous pandemics are relevant. Both repurposing relevant existing data collections, and the quick and effective design of new data collections are top priorities for informed, high quality decision-making.}, author = {Dwork, Cynthia and Karr, Alan and Nissim, Kobbi and Vilhuber, Lars}, copyright = {CC BY-NC-ND Attribution-NonCommercial-NoDerivatives 4.0 International}, doi = {10.29012/jpc.749}, journal = {Journal of Privacy and Confidentiality}, month = feb, note = {Not peer-reviewed.}, number = {2}, title = {On privacy in the age of {COVID}-19}, url = {https://journalprivacyconfidentiality.org/index.php/jpc/article/view/749}, volume = {10}, year = {2021}, month_numeric = {2} } - Recalculating ... : How Uncertainty in Local Labour Market Definitions Affects Empirical FindingsAndrew Foote, Mark J. Kutzbach, and Lars VilhuberApplied Economics, Jan 2021
This paper evaluates the use of commuting zones as a local labor market definition. We revisit Tolbert and Sizer (1996) and demonstrate the sensitivity of definitions to two features of the methodology. We show how these features impact empirical estimates using a well-known application of commuting zones. We conclude with advice to researchers using commuting zones on how to demonstrate the robustness of empirical findings to uncertainty in definitions. The analysis, conclusions, and opinions expressed herein are those of the author(s) alone and do not necessarily represent the views of the U.S. Census Bureau or the Federal Deposit Insurance Corporation. All results have been reviewed to ensure that no confidential information is disclosed, and no confidential data was used in this paper. This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Much of the work developing this paper occurred while Mark Kutzbach was an employee of the U.S. Census Bureau.
@article{foote2021, abstract = {This paper evaluates the use of commuting zones as a local labor market definition. We revisit Tolbert and Sizer (1996) and demonstrate the sensitivity of definitions to two features of the methodology. We show how these features impact empirical estimates using a well-known application of commuting zones. We conclude with advice to researchers using commuting zones on how to demonstrate the robustness of empirical findings to uncertainty in definitions. The analysis, conclusions, and opinions expressed herein are those of the author(s) alone and do not necessarily represent the views of the U.S. Census Bureau or the Federal Deposit Insurance Corporation. All results have been reviewed to ensure that no confidential information is disclosed, and no confidential data was used in this paper. This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Much of the work developing this paper occurred while Mark Kutzbach was an employee of the U.S. Census Bureau.}, author = {Foote, Andrew and Kutzbach, Mark J. and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1080/00036846.2020.1841083}, issn = {0003-6846, 1466-4283}, journal = {Applied Economics}, language = {en}, month = jan, pages = {1--15}, shorttitle = {Recalculating ...}, title = {Recalculating ... : {How} {Uncertainty} in {Local} {Labour} {Market} {Definitions} {Affects} {Empirical} {Findings}}, url = {https://www.tandfonline.com/doi/full/10.1080/00036846.2020.1841083}, urldate = {2021-02-06}, year = {2021}, month_numeric = {1} } - metajelo: A metadata package for journals to support external linked objectsCarl Lagoze and Lars VilhuberInternational Journal of Digital Curation, 2021
We propose a metadata package that is intended to provide academic journals with a lightweight means of registering, at the time of publication, the existence and disposition of supplementary materials. Information about the supplementary materials is, in most cases, critical for the reproducibility and replicability of scholarly results. In many instances, these materials are curated by a third party, which may or may not follow developing standards for the identification and description of those materials. As such, the vocabulary described here complements existing initiatives that specify vocabularies to describe the supplementary materials or the repositories and archives in which they have been deposited. Where possible, it reuses elements of relevant other vocabularies, facilitating coexistence with them. Furthermore, it provides an “at publication” record of reproducibility characteristics of a particular article that has been selected for publication. The proposed metadata package documents the key characteristics that journals care about in the case of supplementary materials that are held by third parties: existence, accessibility, and permanence. It does so in a robust, time-invariant fashion at the time of publication, when the editorial decisions are made. It also allows for better documentation of less accessible (non-public) data, by treating it symmetrically from the point of view of the journal, therefore increasing the transparency of what up until now has been very opaque.
@article{lagoze2021, abstract = {We propose a metadata package that is intended to provide academic journals with a lightweight means of registering, at the time of publication, the existence and disposition of supplementary materials. Information about the supplementary materials is, in most cases, critical for the reproducibility and replicability of scholarly results. In many instances, these materials are curated by a third party, which may or may not follow developing standards for the identification and description of those materials. As such, the vocabulary described here complements existing initiatives that specify vocabularies to describe the supplementary materials or the repositories and archives in which they have been deposited. Where possible, it reuses elements of relevant other vocabularies, facilitating coexistence with them. Furthermore, it provides an “at publication” record of reproducibility characteristics of a particular article that has been selected for publication. The proposed metadata package documents the key characteristics that journals care about in the case of supplementary materials that are held by third parties: existence, accessibility, and permanence. It does so in a robust, time-invariant fashion at the time of publication, when the editorial decisions are made. It also allows for better documentation of less accessible (non-public) data, by treating it symmetrically from the point of view of the journal, therefore increasing the transparency of what up until now has been very opaque.}, author = {Lagoze, Carl and Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, doi = {10.2218/ijdc.v16i1.600}, journal = {International Journal of Digital Curation}, number = {1}, title = {metajelo: {A} metadata package for journals to support external linked objects}, volume = {16}, year = {2021}, grant = {G-2018-11377}, } - Applying data synthesis for longitudinal business data across three countriesM. Jahangir Alam, Benoit Dostie, Jörg Drechsler, and 1 more authorStatistics in Transition New Series, 2020
Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information than for any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to provide broader access to such data sets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (LEAP) and Germany (BHP). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.
@article{alam2020, abstract = {Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information than for any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to provide broader access to such data sets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (LEAP) and Germany (BHP). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.}, author = {Alam, M. Jahangir and Dostie, Benoit and Drechsler, Jörg and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.21307/stattrans-2020-039}, issn = {1234-7655, 2450-0291}, journal = {Statistics in Transition New Series}, language = {en}, number = {4}, pages = {212--236}, title = {Applying data synthesis for longitudinal business data across three countries}, url = {https://www.exeley.com/statistics_in_transition/doi/10.21307/stattrans-2020-039}, urldate = {2020-10-26}, volume = {21}, year = {2020}, } - Total Error and Variability Measures for the Quarterly Workforce Indicators and LEHD Origin-Destination Employment Statistics in OnTheMapKevin L McKinney, Andrew S Green, Lars Vilhuber, and 1 more authorJournal of Survey Statistics and Methodology, Nov 2020:cen:wpaper:17-71
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM), including OTM for Emergency Management. We account for errors due to coverage; record-level non-response; edit and imputation of item missing data; and statistical disclosure limitation. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs are a transition zone, where cells may be fit for use with caution. Tabulations involving one or two jobs, which are generally suppressed on fitness-for-use criteria in the QWI and synthesized in LODES, have substantial total variability but can still be used to estimate statistics for untabulated aggregates as long as the job count in the aggregate is more than 10.
@article{mckinney2020, abstract = {We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total flow-employment, beginning-of-quarter employment, full-quarter employment, average monthly earnings of full-quarter employees, and total quarterly payroll. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM), including OTM for Emergency Management. We account for errors due to coverage; record-level non-response; edit and imputation of item missing data; and statistical disclosure limitation. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs are a transition zone, where cells may be fit for use with caution. Tabulations involving one or two jobs, which are generally suppressed on fitness-for-use criteria in the QWI and synthesized in LODES, have substantial total variability but can still be used to estimate statistics for untabulated aggregates as long as the job count in the aggregate is more than 10.}, author = {McKinney, Kevin L and Green, Andrew S and Vilhuber, Lars and Abowd, John M}, copyright = {All rights reserved}, doi = {10.1093/jssam/smaa029}, issn = {2325-0984, 2325-0992}, journal = {Journal of Survey Statistics and Methodology}, language = {en}, month = nov, note = {:cen:wpaper:17-71}, pages = {smaa029}, title = {Total {Error} and {Variability} {Measures} for the {Quarterly} {Workforce} {Indicators} and {LEHD} {Origin}-{Destination} {Employment} {Statistics} in {OnTheMap}}, url = {https://academic.oup.com/jssam/advance-article/doi/10.1093/jssam/smaa029/5955529}, urldate = {2021-02-06}, year = {2020}, month_numeric = {11} } - Reproducibility and Replicability in EconomicsLars VilhuberHarvard Data Science Review, Dec 2020
@article{vilhuber2020, author = {Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, doi = {10.1162/99608f92.4f6b9e67}, journal = {Harvard Data Science Review}, month = dec, number = {4}, title = {Reproducibility and {Replicability} in {Economics}}, url = {https://hdsr.mitpress.mit.edu/pub/fgpmpj1l}, volume = {2}, year = {2020}, month_numeric = {12} } - Effects of a Government-Academic Partnership: Has the NSF-Census Bureau Research Network Helped Improve the US Statistical System?Daniel H. Weinberg, John M. Abowd, Robert F. Belli, and 13 more authorsJournal of Survey Statistics and Methodology, 2019First published December 2018 :cen:wpaper:17-59r
Abstract. The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodolo
@article{weinberg2018, abstract = {Abstract. The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodolo}, author = {Weinberg, Daniel H. and Abowd, John M. and Belli, Robert F. and Cressie, Noel and Folch, David C. and Holan, Scott H. and Levenstein, Margaret C. and Olson, Kristen M. and Reiter, Jerome P. and Shapiro, Matthew D. and Smyth, Jolene D. and Soh, Leen-Kiat and Spencer, Bruce D. and Spielman, Seth E. and Vilhuber, Lars and Wikle, Christopher K.}, copyright = {All rights reserved}, doi = {10.1093/jssam/smy023}, journal = {Journal of Survey Statistics and Methodology}, language = {en}, note = {First published December 2018 :cen:wpaper:17-59r}, number = {4}, pages = {589--619}, shorttitle = {Effects of a {Government}-{Academic} {Partnership}}, title = {Effects of a {Government}-{Academic} {Partnership}: {Has} the {NSF}-{Census} {Bureau} {Research} {Network} {Helped} {Improve} the {US} {Statistical} {System}?}, url = {https://doi.org/10.1093/jssam/smy023}, urldate = {2018-12-29}, volume = {7}, year = {2019}, grant = {SES-1131848}, } - Remembering Stephen FienbergAleksandra Slavković and Lars VilhuberJournal of Privacy and Confidentiality, Dec 2018
@article{slavkovic2018, author = {Slavković, Aleksandra and Vilhuber, Lars}, copyright = {CC BY-NC-ND Creative Commons License – Attribution-NonCommercial-NoDerivatives 4.0 International}, doi = {10.29012/jpc.685}, journal = {Journal of Privacy and Confidentiality}, month = dec, number = {1}, title = {Remembering {Stephen} {Fienberg}}, volume = {8}, year = {2018}, month_numeric = {12} } - Relaunching the Journal of Privacy and ConfidentialityLars VilhuberJournal of Privacy and Confidentiality, Dec 2018
@article{vilhuber2018a, author = {Vilhuber, Lars}, copyright = {CC BY-NC-ND Creative Commons License – Attribution-NonCommercial-NoDerivatives 4.0 International}, doi = {10.29012/jpc.706}, journal = {Journal of Privacy and Confidentiality}, month = dec, number = {1}, title = {Relaunching the {Journal} of {Privacy} and {Confidentiality}}, volume = {8}, year = {2018}, month_numeric = {12} } - Understanding the effect of procedural justice on psychological distressJulie Cloutier, Lars Vilhuber, Denis Harrisson, and 1 more authorInternational Journal of Stress Management, 2017
Studies on the effect of procedural justice on psychological distress present conflicting results. Drawing on instrumental and relational perspectives of justice, we test the hypothesis that the perception of procedural justice influences the level of workers’ psychological distress. Using a number of validated instruments to collect data from 659 workers in three call centers, we use OLS regressions and Hayes’ PROCESS tool to show that the perception of procedural justice has a direct, unique, and independent effect on psychological distress. The perception of procedural justice has no instrumental role, the key mechanism being the relational role, suggesting that perceived injustice influences psychological distress because it threatens self-esteem. Distributive justice perceptions (recognition, promotions, job security) are not associated with psychological distress, calling into question Siegrist’s model. Our findings suggest that perceived procedural justice provides workers better evidence of the extent to which they are valued and appreciated members of their organizations than do perceptions of distributive justice. The results highlight the greater need for workers to be valued and appreciated for who they are (consideration and esteem), rather than for what they do for their organization (distributive justice of rewards).
@article{CloutierVilhuber2017, abstract = {Studies on the effect of procedural justice on psychological distress present conflicting results. Drawing on instrumental and relational perspectives of justice, we test the hypothesis that the perception of procedural justice influences the level of workers' psychological distress. Using a number of validated instruments to collect data from 659 workers in three call centers, we use OLS regressions and Hayes' PROCESS tool to show that the perception of procedural justice has a direct, unique, and independent effect on psychological distress. The perception of procedural justice has no instrumental role, the key mechanism being the relational role, suggesting that perceived injustice influences psychological distress because it threatens self-esteem. Distributive justice perceptions (recognition, promotions, job security) are not associated with psychological distress, calling into question Siegrist's model. Our findings suggest that perceived procedural justice provides workers better evidence of the extent to which they are valued and appreciated members of their organizations than do perceptions of distributive justice. The results highlight the greater need for workers to be valued and appreciated for who they are (consideration and esteem), rather than for what they do for their organization (distributive justice of rewards).}, author = {Cloutier, Julie and Vilhuber, Lars and Harrisson, Denis and Béland-Ouellette, Vanessa}, copyright = {All rights reserved}, doi = {10.1037/str0000065}, journal = {International Journal of Stress Management}, number = {3}, pages = {283--300}, title = {Understanding the effect of procedural justice on psychological distress}, volume = {25}, year = {2017}, } - Making confidential data part of reproducible researchC Lagoze and L VilhuberChance, 2017
The rise of data-centric research practices has uncovered shortcomings in the traditional scholarly communication system. The foundation of that system, the peer-reviewed publication, “[the] selective distribution of ink on paper, or … electronic facsimiles of the same” (Bourne, et al., 2011), does not adequately support what has become an essential element of scholarship: the reproducibility of research results. That is, duplicating a ...
@article{Lagoze2017-qv, abstract = {The rise of data-centric research practices has uncovered shortcomings in the traditional scholarly communication system. The foundation of that system, the peer-reviewed publication, “[the] selective distribution of ink on paper, or … electronic facsimiles of the same” (Bourne, et al., 2011), does not adequately support what has become an essential element of scholarship: the reproducibility of research results. That is, duplicating a ...}, author = {Lagoze, C and Vilhuber, L}, copyright = {All rights reserved}, doi = {10.1080/09332480.2017.1383118}, journal = {Chance}, title = {Making confidential data part of reproducible research}, year = {2017}, } - Using partially synthetic microdata to protect sensitive cells in business statisticsJavier Miranda and Lars VilhuberStatistical Journal of the International Association for Official Statistics, 2016:cen:wpaper:16-10
We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau’s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).
@article{miranda2016, abstract = {We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).}, author = {Miranda, Javier and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.3233/SJI-160963}, journal = {Statistical Journal of the International Association for Official Statistics}, note = {:cen:wpaper:16-10}, number = {1}, pages = {69--80}, title = {Using partially synthetic microdata to protect sensitive cells in business statistics}, url = {https://content.iospress.com/articles/statistical-journal-of-the-iaos/sji963}, volume = {32}, year = {2016}, grant = {SES-1131848}, } - Synthetic establishment microdata around the worldLars Vilhuber, John M. Abowd, and Jerome P. ReiterStatistical Journal of the International Association for Official Statistics, 2016
In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.
@article{vilhuber2016, abstract = {In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.}, author = {Vilhuber, Lars and Abowd, John M. and Reiter, Jerome P.}, copyright = {All rights reserved}, doi = {10.3233/SJI-160964}, journal = {Statistical Journal of the International Association for Official Statistics}, number = {1}, pages = {65--68}, title = {Synthetic establishment microdata around the world}, volume = {32}, year = {2016}, grant = {SES-1131848}, } - Looking back on three years of using the Synthetic LBD betaJavier Miranda and Lars VilhuberStatistical Journal of the IAOS: Journal of the International Association for Official Statistics, 2014
Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau’s Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States, including data on establishments’ employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.
@article{SJIAOS-2014a, abstract = {Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States, including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.}, author = {Miranda, Javier and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.3233/SJI-140811}, journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics}, title = {Looking back on three years of using the {Synthetic} {LBD} beta}, url = {http://iospress.metapress.com/content/X415V18331Q33150}, volume = {30}, year = {2014}, } - A First Step Towards A German SynLBD: Constructing A German Longitudinal Business DatabaseJörg Drechsler and Lars VilhuberStatistical Journal of the IAOS: Journal of the International Association for Official Statistics, 2014:cen:wpaper:14-13
One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.
@article{SJIAOS-2014b, abstract = {One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.}, author = {Drechsler, Jörg and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.3233/SJI-140812}, journal = {Statistical Journal of the IAOS: Journal of the International Association for Official Statistics}, note = {:cen:wpaper:14-13}, title = {A {First} {Step} {Towards} {A} {German} {SynLBD}: {Constructing} {A} {German} {Longitudinal} {Business} {Database}}, url = {http://iospress.metapress.com/content/X415V18331Q33150}, volume = {30}, year = {2014}, } - Differential privacy applications to Bayesian and linear mixed model estimationJohn M. Abowd, Matthew J. Schneider, and Lars VilhuberJournal of Privacy and Confidentiality, 2013
We consider a particular maximum likelihood estimator (MLE) and a computationally intensive Bayesian method for differentially private estimation of the linear mixed-effects model (LMM) with normal random errors. The LMM is important because it is used in small-area estimation and detailed industry tabulations that present significant challenges for confidentiality protection of the underlying data. The differentially private MLE performs well compared to the regular MLE, and deteriorates as the protection increases for a problem in which the small-area variation is at the county level. More dimensions of random effects are needed to adequately represent the time dimension of the data, and for these cases the differentially private MLE cannot be computed. The direct Bayesian approach for the same model uses an informative, reasonably diffuse prior to compute the posterior predictive distribution for the random effects. The empirical differential privacy of this approach is estimated by direct computation of the relevant odds ratios after deleting influential observations according to various criteria.
@article{AbowdSchneiderVilhuber2013, abstract = {We consider a particular maximum likelihood estimator (MLE) and a computationally intensive Bayesian method for differentially private estimation of the linear mixed-effects model (LMM) with normal random errors. The LMM is important because it is used in small-area estimation and detailed industry tabulations that present significant challenges for confidentiality protection of the underlying data. The differentially private MLE performs well compared to the regular MLE, and deteriorates as the protection increases for a problem in which the small-area variation is at the county level. More dimensions of random effects are needed to adequately represent the time dimension of the data, and for these cases the differentially private MLE cannot be computed. The direct Bayesian approach for the same model uses an informative, reasonably diffuse prior to compute the posterior predictive distribution for the random effects. The empirical differential privacy of this approach is estimated by direct computation of the relevant odds ratios after deleting influential observations according to various criteria.}, author = {Abowd, John M. and Schneider, Matthew J. and Vilhuber, Lars}, copyright = {CC BY-NC-ND Attribution-NonCommercial-NoDerivatives 4.0 International}, doi = {10.29012/jpc.v5i1.627}, journal = {Journal of Privacy and Confidentiality}, number = {1}, title = {Differential privacy applications to {Bayesian} and linear mixed model estimation}, url = {https://doi.org/10.29012/jpc.v5i1.627}, volume = {5}, year = {2013}, } - Data Management of Confidential DataCarl Lagoze, William C. Block, Jeremy Williams, and 2 more authorsInternational Journal of Digital Curation, 2013
Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data, such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfil US federal data management mandates and interfering with basic scholarly practices, such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access and cite such data.
@article{lagoze2013a, abstract = {Social science researchers increasingly make use of data that is confidential because it contains linkages to the identities of people, corporations, etc. The value of this data lies in the ability to join the identifiable entities with external data, such as genome data, geospatial information, and the like. However, the confidentiality of this data is a barrier to its utility and curation, making it difficult to fulfil US federal data management mandates and interfering with basic scholarly practices, such as validation and reuse of existing results. We describe the complexity of the relationships among data that span a public and private divide. We then describe our work on the CED2AR prototype, a first step in providing researchers with a tool that spans this divide and makes it possible for them to search, access and cite such data.}, author = {Lagoze, Carl and Block, William C. and Williams, Jeremy and Abowd, John and Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, doi = {10.2218/ijdc.v8i1.259}, journal = {International Journal of Digital Curation}, language = {en}, number = {1}, pages = {265--278}, title = {Data {Management} of {Confidential} {Data}}, volume = {8}, year = {2013}, grant = {SES-1131848}, } - Did the Housing Price Bubble Clobber Local Labor Market Job and Worker Flows When It Burst?John M Abowd and Lars VilhuberThe American Economic Review, May 2012
We use the Census Bureau’s Quarterly Workforce Indicators and the Federal Housing Finance Agency’s House Price Indices to study the effects of the housing price bubble on local labor markets. We show that the 35 MSAs in the top decile of the house price boom were most severely impacted. Their stable job employment fell much more than the national average. Their real wage rates did not fall as fast as the national average. Accessions fell much faster than average while separations were constant. Job creations fell substantially while destructions rose slightly.
@article{abowd2012a, abstract = {We use the Census Bureau's Quarterly Workforce Indicators and the Federal Housing Finance Agency's House Price Indices to study the effects of the housing price bubble on local labor markets. We show that the 35 MSAs in the top decile of the house price boom were most severely impacted. Their stable job employment fell much more than the national average. Their real wage rates did not fall as fast as the national average. Accessions fell much faster than average while separations were constant. Job creations fell substantially while destructions rose slightly.}, author = {Abowd, John M and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1257/aer.102.3.589}, issn = {0002-8282}, journal = {The American Economic Review}, month = may, number = {3}, pages = {589--593}, title = {Did the {Housing} {Price} {Bubble} {Clobber} {Local} {Labor} {Market} {Job} and {Worker} {Flows} {When} {It} {Burst}?}, volume = {102}, year = {2012}, month_numeric = {5} } - National estimates of gross employment and job flows from the quarterly workforce indicators with demographic and industry detailJohn M. Abowd and Lars VilhuberJournal of Econometrics, 2011
The Quarterly Workforce Indicators (QWI) are local labor market data produced and released every quarter by the United States Census Bureau. Unlike any other local labor market series produced in the US or the rest of the world, QWI measure employment flows for workers (accession and separations), jobs (creations and destructions) and earnings for demographic subgroups (age and gender), economic industry (NAICS industry groups), detailed geography (block (experimental), county, Core-Based Statistical Area, and Workforce Investment Area), and ownership (private, all) with fully interacted publication tables. The current QWI data cover 47 states, about 98% of the private workforce in those states, and about 92% of all private employment in the entire economy. State participation is sufficiently extensive to permit us to present the first national estimates constructed from these data. We focus on worker, job, and excess (churning) reallocation rates, rather than on levels of the basic variables. This permits a comparison to existing series from the Job Openings and Labor Turnover Survey and the Business Employment Dynamics Series from the Bureau of Labor Statistics (BLS). The national estimates from the QWI are an important enhancement to existing series because they include demographic and industry detail for both worker and job flow data compiled from underlying micro-data that have been integrated at the job and establishment levels by the Longitudinal Employer-Household Dynamics Program at the Census Bureau. The estimates presented herein were compiled exclusively from public-use data series and are available for download.
@article{AbowdVilhuber2010, abstract = {The Quarterly Workforce Indicators (QWI) are local labor market data produced and released every quarter by the United States Census Bureau. Unlike any other local labor market series produced in the US or the rest of the world, QWI measure employment flows for workers (accession and separations), jobs (creations and destructions) and earnings for demographic subgroups (age and gender), economic industry (NAICS industry groups), detailed geography (block (experimental), county, Core-Based Statistical Area, and Workforce Investment Area), and ownership (private, all) with fully interacted publication tables. The current QWI data cover 47 states, about 98\% of the private workforce in those states, and about 92\% of all private employment in the entire economy. State participation is sufficiently extensive to permit us to present the first national estimates constructed from these data. We focus on worker, job, and excess (churning) reallocation rates, rather than on levels of the basic variables. This permits a comparison to existing series from the Job Openings and Labor Turnover Survey and the Business Employment Dynamics Series from the Bureau of Labor Statistics (BLS). The national estimates from the QWI are an important enhancement to existing series because they include demographic and industry detail for both worker and job flow data compiled from underlying micro-data that have been integrated at the job and establishment levels by the Longitudinal Employer-Household Dynamics Program at the Census Bureau. The estimates presented herein were compiled exclusively from public-use data series and are available for download.}, author = {Abowd, John M. 
and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1016/j.jeconom.2010.09.008}, journal = {Journal of Econometrics}, pages = {82--99}, title = {National estimates of gross employment and job flows from the quarterly workforce indicators with demographic and industry detail}, volume = {161}, year = {2011}, }
- Science, confidentiality, and the public interest
John M. Abowd and Lars Vilhuber
Chance, 2011
In this month’s column, we will continue down that path, describing in more detail the benefits of providing data to public agencies and how public agencies navigate the narrow path between too much information disclosure on one hand and the release of useful information on the other.
@article{Chance2011, abstract = {In this month’s column, we will continue down that path, describing in more detail the benefits of providing data to public agencies and how public agencies navigate the narrow path between too much information disclosure on one hand and the release of useful information on the other.}, author = {Abowd, John M. and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1080/09332480.2011.10739876}, journal = {Chance}, number = {3}, pages = {58--62}, title = {Science, confidentiality, and the public interest}, url = {http://dx.doi.org/10.1080/09332480.2011.10739876}, volume = {24}, year = {2011}, }
- Procedural justice criteria in salary determination
Julie Cloutier and Lars Vilhuber
Journal of Managerial Psychology, 2008
Purpose – The purpose of this research is to identify the dimensionality of the procedural justice construct and the criteria used by employees to assess procedural justice, in the context of salary determination.
Design/methodology/approach – Based on a survey of 297 Canadian workers, the paper uses confirmatory factor analysis (CFA) to test the dimensionality and the discriminant and convergent validity of our procedural justice construct. Convergent and predictive validity are also tested using hierarchical linear regressions.
Findings – The paper shows the multidimensionality of the procedural justice construct: justice of the salary determination process is assessed through the perceived characteristics of allocation procedures, the perceived characteristics of decision‐makers, and system transparency.
Research limitations/implications – Results could be biased towards acceptance; this is discussed. The results also suggest possible extensions to the study.
Practical implications – Knowledge of the justice standards improves the ability of organizations to effectively manage the salary determination process and promote its acceptance among employees. Emphasizes the need to adequately manage the selection, training, and perception of decision makers.
Originality/value – The paper identifies the standards of procedural justice for salary determination processes. It contributes to the theoretical literature by providing a new multidimensional conceptualization, which helps to better understand the psychological process underlying the perception of procedural justice. The presence of a dimension associated with decision makers is novel and critical for compensation studies.
@article{CloutierVilhuber2008, abstract = {Purpose – The purpose of this research is to identify the dimensionality of the procedural justice construct and the criteria used by employees to assess procedural justice, in the context of salary determination.Design/methodology/approach – Based on a survey of 297 Canadian workers, the paper uses confirmatory factor analysis (CFA) to test the dimensionality and the discriminant and convergent validity of our procedural justice construct. Convergent and predictive validity are also tested using hierarchical linear regressions.Findings – The paper shows the multidimensionality of the procedural justice construct: justice of the salary determination process is assessed through the perceived characteristics of allocation procedures, the perceived characteristics of decision‐makers, and system transparency.Research limitations/implications – Results could be biased towards acceptance; this is discussed. The results also suggest possible extensions to the study.Practical implications – Knowledge of the justice standards improves the ability of organizations to effectively manage the salary determination process and promote its acceptance among employees. Emphasizes the need to adequately manage the selection, training, and perception of decision makers.Originality/value – The paper identifies the standards of procedural justice for salary determination processes. It contributes to the theoretical literature by providing a new multidimensional conceptualization, which helps to better understand the psychological process underlying the perception of procedural justice. 
The presence of a dimension associated with decision makers is novel and critical for compensation studies.}, author = {Cloutier, Julie and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1108/02683940810894765}, journal = {Journal of Managerial Psychology}, number = {6}, pages = {713--740}, title = {Procedural justice criteria in salary determination}, url = {https://doi.org/10.1108/02683940810894765}, volume = {23}, year = {2008}, }
- The sensitivity of economic statistics to coding errors in personal identifiers
John M. Abowd and Lars Vilhuber
Journal of Business & Economic Statistics, Apr 2005
In this article we describe the sensitivity of small-cell flow statistics to coding errors in the identity of the underlying entities. Specifically, we present results based on a comparison of the U.S. Census Bureau’s Quarterly Workforce Indicators before and after correcting for such errors in Social Security Number-based identifiers in the underlying individual wage records. The correction used involves a novel application of existing statistical matching techniques. It is found that even a very conservative correction procedure has a sizable impact on the statistics. The average bias ranges from .25% up to 15% for flow statistics, and up to 5% for payroll aggregates.
@article{AbowdVilhuber2005, abstract = {In this article we describe the sensitivity of small-cell flow statistics to coding errors in the identity of the underlying entities. Specifically, we present results based on a comparison of the U.S. Census Bureau's Quarterly Workforce Indicators before and after correcting for such errors in Social Security Number-based identifiers in the underlying individual wage records. The correction used involves a novel application of existing statistical matching techniques. It is found that even a very conservative correction procedure has a sizable impact on the statistics. The average bias ranges from .25\% up to 15\% for flow statistics, and up to 5\% for payroll aggregates.}, author = {Abowd, John M. and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, journal = {Journal of Business \& Economic Statistics}, month = apr, number = {2}, pages = {133--152}, title = {The sensitivity of economic statistics to coding errors in personal identifiers}, url = {http://www.jstor.org/stable/27638803}, volume = {23}, year = {2005}, month_numeric = {4} }
- Escaping poverty for low-wage workers: The role of employer characteristics and changes
Harry Holzer, Julia Lane, and Lars Vilhuber
Industrial and Labor Relations Review, Jul 2004. Alternate URL: http://www.jstor.org/stable/4126683
Using a unique dataset based on individual Unemployment Insurance wage records for Illinois in the 1990s that are matched to other Census data, the authors analyze the extent to which escape from or entry into low earnings among adult workers was associated with changes in their employers and firm characteristics. The results show considerable mobility into and out of low earnings status, even for adults. They indicate that job changes were an important part of the process by which workers escaped or entered low-wage status, and that changes in employer characteristics help to account for these job changes. Matches between personal and firm characteristics also contributed to observed earnings outcomes.
@article{HolzerLaneVilhuber2004, abstract = {Using a unique dataset based on individual Unemployment Insurance wage records for Illinois in the 1990s that are matched to other Census data, the authors analyze the extent to which escape from or entry into low earnings among adult workers was associated with changes in their employers and firm characteristics. The results show considerable mobility into and out of low earnings status, even for adults. They indicate that job changes were an important part of the process by which workers escaped or entered low-wage status, and that changes in employer characteristics help to account for these job changes. Matches between personal and firm characteristics also contributed to observed earnings outcomes.}, author = {Holzer, Harry and Lane, Julia and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.1177/001979390405700405}, journal = {Industrial and Labor Relations Review}, month = jul, note = {tex.alturl: http://www.jstor.org/stable/4126683}, number = {4}, title = {Escaping poverty for low-wage workers: {The} role of employer characteristics and changes}, url = {http://journals.sagepub.com/doi/pdf/10.1177/001979390405700405}, volume = {57}, year = {2004}, month_numeric = {7} }
- Early career experiences and later career outcomes: Comparing the United States, France and Germany
David N. Margolis, Véronique Simonnet, and Lars Vilhuber
Vierteljahrshefte zur Wirtschaftsforschung, 2001
This paper explores the links between individuals’ early career experiences and their labor market outcomes 5 to 20 years later using data from France, (western) Germany, and the United States. Relative to most of the literature, we consider a large set of measures of men’s early career experiences and later career outcomes. Our results differ significantly across countries. Labor market outcomes in Germany are consistent with a dual labor market model. In the case of American workers, either the market learns about unobservable worker characteristics over time or the implicit contracts established at the start of the career are increasingly renegotiated over time. Unobserved heterogeneity in individuals’ networks of labor market contacts is consistent with our results for France. These results reflect optimal firm responses to the different institutional environments in each country in the presence of ex ante imperfect information concerning young workers.
@article{MargolisEtAl2001, abstract = {This paper explores the links between individuals' early career experiences and their labor market outcomes 5 to 20 years later using data from France, (western) Germany, and the United States. Relative to most of the literature, we consider a large set of measures of men's early career experiences and later career outcomes. Our results differ significantly across countries. Labor market outcomes in Germany are consistent with a dual labor market model. In the case of American workers, either the market learns about unobservable worker characteristics over time or the implicit contracts established at the start of the career are increasingly renegotiated over time. Unobserved heterogeneity in individuals' networks of labor market contacts is consistent with our results for France. These results reflect optimal firm responses to the different institutional environments in each country in the presence of ex ante imperfect information concerning young workers.}, author = {Margolis, David N. and Simonnet, Véronique and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.3790/vjh.70.1.31}, issn = {1861-1559}, journal = {Vierteljahrshefte zur Wirtschaftsforschung}, number = {1}, pages = {31--38}, title = {Early career experiences and later career outcomes: {Comparing} the {United} {States}, {France} and {Germany}}, url = {http://hdl.handle.net/10419/99179}, volume = {70}, year = {2001}, }
- La spécificité de la formation en milieu de travail : un survol des contributions théoriques et empiriques récentes,
Lars Vilhuber
L’Actualité économique, Revue d’analyse économique, Mar 2001. In French
This literature review examines the specificity of formal training, within the precise framework of human capital. We focus in particular on formal training that is paid for and generally provided by firms. A further distinction is drawn between formal training, which takes place in a classroom or seminar setting, and on-the-job training. We begin with a brief review of the relevant models, presenting the classical theory first before turning to more recent models. This is followed by an overview of the data available in various countries, which have been used in numerous empirical studies. Still from the perspective of the specificity of human capital created by formal training, the surveyed studies are then examined critically. In doing so, we pay particular attention to the consistency between the theoretical models and the available measures of human capital. We conclude the review with suggestions on how to fill the gaps identified.
@article{Vilhuber2001, abstract = {Cette recension des écrits a pour objet la spécificité de la formation formelle, dans le cadre précis du capital humain. Nous nous concentrerons tout particulièrement sur la formation formelle payée et généralement dispensée par les entreprises. Une distinction supplémentaire sera également apportée entre la formation formelle, se tenant en classe ou en séminaire, et la formation sur le tas. Nous débuterons par un bref examen des modèles pertinents, présentant d’abord la théorie classique pour ensuite passer à des modèles plus récents. Suivra un aperçu des données disponibles dans divers pays, et qui ont été utilisées dans le cadre de nombreuses études empiriques. Toujours dans l’optique de la spécificité du capital humain créée par la formation formelle, les études répertoriées feront ensuite l’objet d’un examen critique. Ce faisant, nous porterons une attention particulière à la cohérence entre les modèles théoriques et les mesures du capital humain disponibles. Nous conclurons notre recension en formulant des suggestions sur les façons de combler les lacunes relevées.}, author = {Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.7202/602347ar}, journal = {L'Actualité économique, Revue d'analyse économique}, month = mar, note = {In French}, number = {1}, title = {La spécificité de la formation en milieu de travail : un survol des contributions théoriques et empiriques récentes,}, volume = {77}, year = {2001}, month_numeric = {3} }
- Continuous Training and sectoral mobility in Germany: Evidence from the 90s
Lars Vilhuber
Vierteljahresheft für Wirtschaftsforschung, 1999
see Vilhuber99b
@article{Vilhuber99a, abstract = {see Vilhuber99b}, author = {Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, issn = {0340-1707}, journal = {Vierteljahresheft für Wirtschaftsforschung}, number = {2}, pages = {209--214}, title = {Continuous {Training} and sectoral mobility in {Germany}: {Evidence} from the 90s}, url = {http://hdl.handle.net/10419/141240}, volume = {68}, year = {1999}, }
Working Papers (80)
- Assessing Utility of Differential Privacy for RCTs
Kaitlyn R. Webb, Soumya Mukherjee, Aratrika Mustafi, and 2 more authors
arXiv arxiv:2309.14581v2, 2026. Version Number: 2
Randomized controlled trials (RCTs) have become powerful tools for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for causal inference in the biomedical fields and many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of their inference. These studies typically include the response data that has been collected, de-identified, and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of privacy-preserving synthetic data generation methodologies on published RCT analyses by leveraging available replication packages (research compendia) in economics and policy analysis. We implement three privacy-preserving algorithms that use as a base one of the basic differentially private (DP) algorithms, the perturbed histogram, to support the quality of statistical inference. We highlight challenges with the direct use of this algorithm and the stability-based histogram in our setting and describe the adjustments needed. We provide simulation studies and demonstrate that we can replicate the analysis in a published economics article on privacy-protected data under various parameterizations. We find that relatively straightforward (at a high level) privacy-preserving methods influenced by DP techniques allow for inference-valid protection of published data. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.
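The perturbed histogram named in this abstract is a textbook differentially private mechanism; the sketch below is only a minimal illustration of that generic mechanism (Laplace noise on bin counts, with clamping and renormalization as post-processing), not the authors' actual implementation. All function names, the adjacency assumption (add/remove one record, so per-bin sensitivity 1), and the within-bin uniform resampling step are assumptions for illustration.

```python
import numpy as np

def perturbed_histogram(data, bin_edges, epsilon, rng):
    """Release an epsilon-DP histogram of `data`.

    Under add/remove adjacency each record falls into exactly one bin,
    so adding Laplace(1/epsilon) noise to every bin satisfies epsilon-DP
    (parallel composition across bins). Clamping negatives to zero is
    post-processing and costs no additional privacy budget.
    """
    counts, _ = np.histogram(data, bins=bin_edges)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return np.clip(noisy, 0.0, None)

def synthesize(noisy_counts, bin_edges, n, rng):
    """Draw n synthetic values from the normalized noisy histogram,
    sampling uniformly within each selected bin."""
    probs = noisy_counts / noisy_counts.sum()
    bins = rng.choice(len(noisy_counts), size=n, p=probs)
    left, right = bin_edges[bins], bin_edges[bins + 1]
    return rng.uniform(left, right)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)          # stand-in for an RCT response variable
edges = np.linspace(-4, 4, 17)        # 16 bins
noisy = perturbed_histogram(data, edges, epsilon=1.0, rng=rng)
synth = synthesize(noisy, edges, n=1000, rng=rng)
```

Analyses re-run on `synth` instead of `data` then inherit the privacy guarantee; the paper's point is assessing how much such protection distorts the published inference.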
@techreport{webbetal2026-arxiv, abstract = {Randomized controlled trials (RCTs) have become powerful tools for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for causal inference in the biomedical fields and many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of their inference. These studies typically include the response data that has been collected, de-identified, and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of privacy-preserving synthetic data generation methodologies on published RCT analyses by leveraging available replication packages (research compendia) in economics and policy analysis. We implement three privacy-preserving algorithms, that use as a base one of the basic differentially private (DP) algorithms, the perturbed histogram, to support the quality of statistical inference. We highlight challenges with the straight use of this algorithm and the stability-based histogram in our setting and described the adjustments needed. We provide simulation studies and demonstrate that we can replicate the analysis in a published economics article on privacy-protected data under various parameterizations. We find that relatively straightforward (at a high-level) privacy-preserving methods influenced by DP techniques allow for inference-valid protection of published data. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.}, author = {Webb, Kaitlyn R. 
and Mukherjee, Soumya and Mustafi, Aratrika and Slavković, Aleksandra and Vilhuber, Lars}, copyright = {Creative Commons Attribution 4.0 International}, doi = {10.48550/ARXIV.2309.14581}, institution = {arXiv}, note = {Version Number: 2}, number = {arxiv:2309.14581v2}, title = {Assessing {Utility} of {Differential} {Privacy} for {RCTs}}, url = {https://arxiv.org/abs/2309.14581}, urldate = {2026-02-10}, year = {2026}, }
- Assessing Reproducibility in Economics Using Standardized Crowd-sourced Analysis
Abel Brodeur, Seung Yong Sung, Edward Miguel, and 2 more authors
National Bureau of Economic Research, Working Paper 33753, May 2025
This paper presents a framework to standardize crowd-sourced computational reproductions in economics through the Social Science Reproduction Platform (SSRP). The approach addresses four main challenges for computational reproductions: a lack of standardization, aggregation issues, existing incentives for “adversarial” interactions, and the loss of knowledge from analyses that are never published. We then summarize the first 487 reproductions uploaded on the SSRP. The results show substantial heterogeneity in the ability to successfully reproduce empirical results in economics research, with approximately 30% of recent studies meeting at least a basic definition of being computationally reproducible.
@techreport{brodeur2025, abstract = {This paper presents a framework to standardize crowd-sourced computational reproductions in economics through the Social Science Reproduction Platform (SSRP). The approach address four main challenges for computational reproductions: a lack of standardization, aggregation issues, existing incentives for “adversarial” interactions, and the loss of knowledge from analyses that are never published. We then summarize the first 487 reproductions uploaded on the SSRP. The results show substantial heterogeneity in the ability to successfully reproduce empirical results in economics research, with approximately 30\% of recent studies meeting at least a basic definition of being computationally reproducible.}, author = {Brodeur, Abel and Sung, Seung Yong and Miguel, Edward and Vilhuber, Lars and Hoces de la Guardia, Fernando}, copyright = {All rights reserved}, doi = {10.3386/w33753}, institution = {National Bureau of Economic Research}, month = may, number = {33753}, title = {Assessing {Reproducibility} in {Economics} {Using} {Standardized} {Crowd}-sourced {Analysis}}, type = {Working {Paper}}, url = {https://www.nber.org/papers/w33753}, urldate = {2025-05-16}, year = {2025}, month_numeric = {5} }
- Mass Reproducibility and Replicability: A New Hope
Abel Brodeur, Derek Mikola, Nikolai Cook, and 346 more authors
I4R Discussion Paper Series, Working Paper 107, 2024
This study pushes our understanding of research reliability by reproducing and replicating claims from 110 papers in leading economic and political science journals. The analysis involves computational reproducibility checks and robustness assessments. It reveals several patterns. First, we uncover a high rate of fully computationally reproducible results (over 85%). Second, excluding minor issues like missing packages or broken pathways, we uncover coding errors for about 25% of studies, with some studies containing multiple errors. Third, we test the robustness of the results to 5,511 re-analyses. We find a robustness reproducibility of about 70%. Robustness reproducibility rates are relatively higher for re-analyses that introduce new data and lower for re-analyses that change the sample or the definition of the dependent variable. Fourth, 52% of re-analysis effect size estimates are smaller than the original published estimates and the average statistical significance of a re-analysis is 77% of the original. Lastly, we rely on six teams of researchers working independently to answer eight additional research questions on the determinants of robustness reproducibility. Most teams find a negative relationship between replicators’ experience and reproducibility, while finding no relationship between reproducibility and the provision of intermediate or even raw data combined with the necessary cleaning codes.
@techreport{brodeur2024, abstract = {This study pushes our understanding of research reliability by reproducing and replicating claims from 110 papers in leading economic and political science journals. The analysis involves computational reproducibility checks and robustness assessments. It reveals several patterns. First, we uncover a high rate of fully computationally reproducible results (over 85\%). Second, excluding minor issues like missing packages or broken pathways, we uncover coding errors for about 25\% of studies, with some studies containing multiple errors. Third, we test the robustness of the results to 5,511 re-analyses. We find a robustness reproducibility of about 70\%. Robustness reproducibility rates are relatively higher for re-analyses that introduce new data and lower for re-analyses that change the sample or the definition of the dependent variable. Fourth, 52\% of re-analysis effect size estimates are smaller than the original published estimates and the average statistical significance of a re-analysis is 77\% of the original. Lastly, we rely on six teams of researchers working independently to answer eight additional research questions on the determinants of robustness reproducibility. 
Most teams find a negative relationship between replicators' experience and reproducibility, while finding no relationship between reproducibility and the provision of intermediate or even raw data combined with the necessary cleaning codes.}, author = {Brodeur, Abel and Mikola, Derek and Cook, Nikolai and Brailey, Thomas and Briggs, Ryan and de Gendre, Alexandra and Dupraz, Yannick and Fiala, Lenka and Gabani, Jacopo and Gauriot, Romain and Haddad, Joanne and Lima, Goncalo and Ankel-Peters, Jörg and Dreber, Anna and Campbell, Douglas and Kattan, Lamis and Marino Fages, Diego and Mierisch, Fabian and Sun, Pu and Wright, Taylor and Connolly, Marie and Hoces de la Guardia, Fernando and Johannesson, Magnus and Miguel, Edward and Vilhuber, Lars and Abarca, Alejandro and Acharya, Mahesh and Adjisse, Sossou Simplice and Akhtar, Ahwaz and Ramirez Lizardi, Eduardo Alberto and Albrecht, Sabina and Andersen, Synøve Nygaard and Andlib, Zubaria and Arrora, Falak and Ash, Thomas and Bacher, Etienne and Bachler, Sebastian and Bacon, Félix and Bagues, Manuel and Balogh, Timea and Batmanov, Alisher and Barschkett, Mara and Basdil, B. Kaan and Baxa, Jaromír and Becker, Sascha and Beeder, Monica and Beland, Louis-Philippe and Bello, Abdel Hamid and Markovits, Daniel Benenson and Benjamin, Grant and Bergeron, Thomas and Blimpo, Moussa P. 
and Binetti, Marco and Bonander, Carl and Bonneau, Joseph and Borbáth, Endre and Topstad Borgen, Nicolai and Topstad Borgen, Solveig and Borowsky, Jonathan and Brini, Elisa and Brown, Myriam and Brun, Martín and Bruns, Stephan and Buliskeria, Nino and Calef, Andrea and Cameron, Alistair and Campa, Pamela and Campos-Rodríguez, Santiago and Cantone, Giulio Giacomo and Carpena, Fenella and Carter, Perry and Castañeda Dower, Paul and Castek, Ondrej and Caviglia-Harris, Jill and Strand, Gabriella Chauca and Chen, Shi and Chzhen, Asya and Chung, Jong and Collins, Jason and Coppock, Alexander and Cordeau, Hugo and Couillard, Ben and Crechet, Jonathan and Crippa, Lorenzo and Cui, Jeanne and Czymara, Christian and Daarstad, Haley and Dao, Danh Chi and Dao, Dong and Schmandt, Marco David and de Linde, Astrid and De Melo, Lucas and Deer, Lachlan and De Vera, Micole and Dimitrova, Velichka and Dollbaum, Jan Fabian and Dollbaum, Jan Matti and Donnelly, Michael and Huynh, Luu Duc Toan and Dumbalska, Tsvetomira and Duncan, Jamie and Duong, Kiet Tuan and Duprey, Thibaut and Dworschak, Christoph and Ellingsrud, Sigmund and Elminejad, Ali and Eissa, Yasmine and Erhart, Andrea and Etingin-Frati, Giulian and Fatemi-Pour, Elaheh and Federice, Alexa and Feld, Jan and Fenig, Guidon and Firouzjaeiangalougah, Mojtaba and Fleisje, Erlend and Fortier-Chouinard, Alexandre and Engel, Julia Francesca and Fries, Tilman and Fortier, Reid and Fréchet, Nadjim and Galipeau, Thomas and Gallegos, Sebastián and Gangji, Areez and Gao, Xiaoying and Garnache, Cloé and Gáspár, Attila and Gavrilova, Evelina and Ghosh, Arijit and Gibney, Garreth and Gibson, Grant and Godager, Geir and Goff, Leonard and Gong, Da and González, Javier and Gretton, Jeremy and Griffa, Cristina and Grigoryeva, Idaliya and Grøtting, Maja and Guntermann, Eric and Guo, Jiaqi and Gugushvili, Alexi and Habibnia, Hooman and Häffner, Sonja and Hall, Jonathan D. 
and Hammar, Olle and Kordt, Amund Hanson and Hashimoto, Barry and Hartley, Jonathan S. and Hausladen, Carina I. and Havránek, Tomáš and Hazen, Jacob and He, Harry and Hepplewhite, Matthew and Herrera-Rodriguez, Mario and Heuer, Felix and Heyes, Anthony and Ho, Anson T. Y. and Holmes, Jonathan and Holzknecht, Armando and Hsu, Yu-Hsiang Dexter and Hu, Shiang-Hung and Huang, Yu-Shiuan and Huebener, Mathias and Huber, Christoph and Huynh, Kim P. and Irsova, Zuzana and Isler, Ozan and Jakobsson, Niklas and Frith, Michael James and Jananji, Raphaël and Jayalath, Tharaka A. and Jetter, Michael and John, Jenny and Forshaw, Rachel Joy and Juan, Felipe and Kadriu, Valon and Karim, Sunny and Kelly, Edmund and Dang, Duy Khanh Hoang and Khushboo, Tazia and Kim, Jin and Kjellsson, Gustav and Kjelsrud, Anders and Kotsadam, Andreas and Korpershoek, Jori and Krashinsky, Lewis and Kundu, Suranjana and Kustov, Alexander and Lalayev, Nurlan and Langlois, Audrée and Laufer, Jill and Lee-Whiting, Blake and Leibing, Andreas and Lenz, Gabriel and Levin, Joel and Li, Peng and Li, Tongzhe and Lin, Yuchen and Listo, Ariel and Liu, Dan and Lu, Xuewen and Lukmanova, Elvina and Luscombe, Alex and Lusher, Lester R. and Lyu, Ke and Ma, Hai and Mäder, Nicolas and Makate, Clifton and Malmberg, Alice and Maitra, Adit and Mandas, Marco and Marcus, Jan and Margaryan, Shushanik and Márk, Lili and Martignano, Andres and Marsh, Abigail and Masetto, Isabella and McCanny, Anthony and McManus, Emma and McWay, Ryan and Metson, Lennard and Kinge, Jonas Minet and Mishra, Sumit and Mohnen, Myra and Möller, Jakob and Montambeault, Rosalie and Montpetit, Sébastien and Morin, Louis-Philippe and Morris, Todd and Moser, Scott and Motoki, Fabio and Muehlenbachs, Lucija and Musulan, Andreea and Musumeci, Marco and Nabin, Munirul and Nchare, Karim and Neubauer, Florian and Nguyen, Quan M. P. 
and Nguyen, Tuan and Nguyen-Tien, Viet and Niazi, Ali and Nikolaishvili, Giorgi and Nordstrom, Ardyn and Nüß, Patrick and Odermatt, Angela and Olson, Matt and Øien, Henning and Ölkers, Tim and Oliver i Vert, Miquel and Oral, Emre and Oswald, Christian and Ousman, Ali and Özak, Ömer and Pandey, Shubham and Pavlov, Alexandre and Pelli, Martino and Penheiro, Romeo and Park, RyuGyung and Pérez Martel, Eva and Petrovičová, Tereza and Phan, Linh and Prettyman, Alexa and Procházka, Jakub and Putri, Aqila and Quandt, Julian and Qiu, Kangyu and Nguyen, Loan Quynh Thi and Rahman, Andaleeb and Rea, Carson H. and Reiremo, Adam and Renée, Laëtitia and Richardson, Joseph and Rivers, Nicholas and Rodrigues, Bruno and Roelofs, William and Roemer, Tobias and Rogeberg, Ole and Rose, Julian and Roskos-Ewoldsen, Andrew and Rosmer, Paul and Sabada, Barbara and Saberian, Soodeh and Salamanca, Nicolas and Sator, Georg and Sawyer, Antoine and Scates, Daniel and Schlüter, Elmar and Sells, Cameron and Sen, Sharmi and Sethi, Ritika and Shcherbiak, Anna and Sogaolu, Moyosore and Soosalu, Matt and Sørensen, Erik Ø and Sovani, Manali and Spencer, Noah and Staubli, Stefan and Stans, Renske and Stewart, Anya and Stips, Felix and Stockley, Kieran and Strobel, Stephenson and Struby, Ethan and Tang, John and Tanrisever, Idil and Yang, Thomas Tao and Tastan, Ipek and Tatić, Dejan and Tatlow, Benjamin and Seuyong, Féraud Tchuisseu and Thériault, Rémi and Thivierge, Vincent and Tian, Wenjie and Toma, Filip-Mihai and Totarelli, Maddalena and Tran, Van-Anh and Truong, Hung and Tsoy, Nikita and Tuzcuoglu, Kerem and Ubfal, Diego and Villalobos, Laura and Walterskirchen, Julian and Wang, Joseph Taoyi and Wattal, Vasudha and Webb, Matthew D. 
and Weber, Bryan and Weisser, Reinhard and Weng, Wei-Chien and Westheide, Christian and White, Kimberly and Winter, Jacob and Wochner, Timo and Woerman, Matt and Wong, Jared and Woodard, Ritchie and Wroński, Marcin and Yazbeck, Myra and Yang, Gustav Chung and Yap, Luther and Yassin, Kareman and Ye, Hao and Yoon, Jin Young and Yurris, Chris and Zahra, Tahreen and Zaneva, Mirela and Zayat, Aline and Zhang, Jonathan and Zhao, Ziwei and Yaolang, Zhong}, copyright = {All Rights Reserved (Free to Read)}, institution = {I4R Discussion Paper Series}, language = {eng}, number = {107}, shorttitle = {Mass {Reproducibility} and {Replicability}}, title = {Mass {Reproducibility} and {Replicability}: {A} {New} {Hope}}, type = {Working {Paper}}, url = {https://hdl.handle.net/10419/289437}, urldate = {2024-04-08}, year = {2024}, } - Crowdsourcing Digital Public Goods: A Field Experiment on Metadata ContributionsLinfeng Li, Yan Chen, Margaret C. Levenstein, and 1 more authorSocial Science Research Network, SSRN Scholarly Paper 5008203, Nov 2024
This study explores why people choose to contribute metadata, which is data about data. Using a field experiment conducted with more than 3,000 authors of AEA journal articles, our control message reduces the uncertainty about the future value of metadata, whereas those from the treatment conditions additionally make the private or social benefits of metadata salient. Surprisingly, we find that participants in the control condition provide significantly more metadata compared to those in the treatments. This suggests that simply knowing that metadata will have value is sufficient to motivate people to contribute. Our results also highlight the importance of interface design in online field experiments.
@techreport{li2024, abstract = {This study explores why people choose to contribute metadata, which is data about data. Using a field experiment conducted with more than 3,000 authors of AEA journal articles, our control message reduces the uncertainty about the future value of metadata, whereas those from the treatment conditions additionally make the private or social benefits of metadata salient. Surprisingly, we find that participants in the control condition provide significantly more metadata compared to those in the treatments. This suggests that simply knowing that metadata will have value is sufficient to motivate people to contribute. Our results also highlight the importance of interface design in online field experiments.}, address = {Rochester, NY}, author = {Li, Linfeng and Chen, Yan and Levenstein, Margaret C. and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.2139/ssrn.5008203}, institution = {Social Science Research Network}, language = {en}, month = nov, number = {5008203}, shorttitle = {Crowdsourcing {Digital} {Public} {Goods}}, title = {Crowdsourcing {Digital} {Public} {Goods}: {A} {Field} {Experiment} on {Metadata} {Contributions}}, type = {{SSRN} {Scholarly} {Paper}}, url = {https://doi.org/10.2139/ssrn.5008203}, urldate = {2025-01-11}, year = {2024}, month_numeric = {11} } - Protecting Confidential Data through Non-Statistical MethodsLars VilhuberCornell University, Document 116054, Oct 2024
This chapter will rely on and update previous overviews of how researchers, citizens, and administrators can reliably and securely access confidential data, i.e., data that cannot be simply published as “open data”. I will discuss various legal, technical, and practical ways of securing access to data that is needed for computations. This obviously depends on the type and complexity of the computations but also depends on the who, how, and where access is needed.
@techreport{vilhuber2024c, abstract = {This chapter will rely on and update previous overviews of how researchers, citizens, and administrators can reliably and securely access confidential data, i.e., data that cannot be simply published as “open data”. I will discuss various legal, technical, and practical ways of securing access to data that is needed for computations. This obviously depends on the type and complexity of the computations but also depends on the who, how, and where access is needed.}, author = {Vilhuber, Lars}, copyright = {CC BY-NC-ND Attribution-NonCommercial-NoDerivatives 4.0 International}, institution = {Cornell University}, language = {en\_US}, month = oct, number = {116054}, title = {Protecting {Confidential} {Data} through {Non}-{Statistical} {Methods}}, type = {Document}, url = {https://hdl.handle.net/1813/116054}, urldate = {2025-04-08}, year = {2024}, month_numeric = {10} } - The 2010 Census Confidentiality Protections Failed, Here’s How and WhyJohn M. Abowd, Tamara Adams, Robert Ashmead, and 11 more authorsarXiv arXiv:2312.11283, Dec 2023arXiv:2312.11283
Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can verify that all records in 70% of all census blocks (97 million people) are perfectly reconstructed. The tabular publications in Summary File 1 thus have prohibited disclosure risk similar to the unreleased confidential microdata. Reidentification studies confirm that an attacker can, within blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with nonmodal characteristics) with 95% accuracy, the same precision as the confidential data achieve and far greater than statistical baselines. The flaw in the 2010 Census framework was the assumption that aggregation prevented accurate microdata reconstruction, justifying weaker disclosure limitation methods than were applied to 2010 Census public microdata. The framework used for 2020 Census publications defends against attacks that are based on reconstruction, as we also demonstrate here. Finally, we show that alternatives to the 2020 Census Disclosure Avoidance System with similar accuracy (enhanced swapping) also fail to protect confidentiality, and those that partially defend against reconstruction attacks (incomplete suppression implementations) destroy the primary statutory use case: data for redistricting all legislatures in the country in compliance with the 1965 Voting Rights Act.
@techreport{abowd2023, abstract = {Using only 34 published tables, we reconstruct five variables (census block, sex, age, race, and ethnicity) in the confidential 2010 Census person records. Using the 38-bin age variable tabulated at the census block level, at most 20.1\% of reconstructed records can differ from their confidential source on even a single value for these five variables. Using only published data, an attacker can verify that all records in 70\% of all census blocks (97 million people) are perfectly reconstructed. The tabular publications in Summary File 1 thus have prohibited disclosure risk similar to the unreleased confidential microdata. Reidentification studies confirm that an attacker can, within blocks with perfect reconstruction accuracy, correctly infer the actual census response on race and ethnicity for 3.4 million vulnerable population uniques (persons with nonmodal characteristics) with 95\% accuracy, the same precision as the confidential data achieve and far greater than statistical baselines. The flaw in the 2010 Census framework was the assumption that aggregation prevented accurate microdata reconstruction, justifying weaker disclosure limitation methods than were applied to 2010 Census public microdata. The framework used for 2020 Census publications defends against attacks that are based on reconstruction, as we also demonstrate here. Finally, we show that alternatives to the 2020 Census Disclosure Avoidance System with similar accuracy (enhanced swapping) also fail to protect confidentiality, and those that partially defend against reconstruction attacks (incomplete suppression implementations) destroy the primary statutory use case: data for redistricting all legislatures in the country in compliance with the 1965 Voting Rights Act.}, author = {Abowd, John M. and Adams, Tamara and Ashmead, Robert and Darais, David and Dey, Sourya and Garfinkel, Simson L. 
and Goldschlag, Nathan and Kifer, Daniel and Leclerc, Philip and Lew, Ethan and Moore, Scott and Rodríguez, Rolando A. and Tadros, Ramy N. and Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, institution = {arXiv}, month = dec, note = {arXiv:2312.11283}, number = {arXiv:2312.11283}, title = {The 2010 {Census} {Confidentiality} {Protections} {Failed}, {Here}'s {How} and {Why}}, url = {http://arxiv.org/abs/2312.11283}, urldate = {2023-12-19}, year = {2023}, month_numeric = {12} } - Assessing Utility of Differential Privacy for RCTsSoumya Mukherjee, Aratrika Mustafi, Aleksandra Slavković, and 1 more authorarXiv arXiv:2309.14581v1, Sep 2023arXiv:2309.14581 [cs, econ, stat]
Randomized control trials (RCTs) have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the response data collected, de-identified and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of strong privacy-preservation methodology (with differential privacy (DP) guarantees) on published analyses from RCTs, leveraging the availability of replication packages (research compendia) in economics and policy analysis. We provide simulation studies and demonstrate how we can replicate the analysis in a published economics article on privacy-protected data under various parametrizations. We find that relatively straightforward DP-based methods allow for inference-valid protection of the published data, though computational issues may limit more complex analyses from using these methods. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.
@techreport{mukherjee2023, abstract = {Randomized control trials, RCTs, have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold-standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the response data collected, de-identified and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of strong privacy-preservation methodology (with differential privacy (DP) guarantees), on published analyses from RCTs, leveraging the availability of replication packages (research compendia) in economics and policy analysis. We provide simulation studies and demonstrate how we can replicate the analysis in a published economics article on privacy-protected data under various parametrizations. We find that relatively straightforward DP-based methods allow for inference-valid protection of the published data, though computational issues may limit more complex analyses from using these methods. 
The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.}, author = {Mukherjee, Soumya and Mustafi, Aratrika and Slavković, Aleksandra and Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, doi = {10.48550/arXiv.2309.14581}, institution = {arXiv}, month = sep, note = {arXiv:2309.14581 [cs, econ, stat]}, number = {arXiv:2309.14581v1}, title = {Assessing {Utility} of {Differential} {Privacy} for {RCTs}}, url = {http://arxiv.org/abs/2309.14581}, urldate = {2024-04-08}, year = {2023}, month_numeric = {9} } - Reproducibility and Transparency versus Privacy and Confidentiality: Reflections from a Data EditorLars VilhuberarXiv, submitted journal version 2305.14478, 2023Version Number: 1
Transparency and reproducibility are often seen in opposition to privacy and confidentiality. Data that need to be kept confidential are seen as an impediment to reproducibility, and privacy would seem to inhibit transparency. I bring a more nuanced view to the discussion, and show, using examples from over 1,000 reproducibility assessments, that confidential data can very well be used in reproducible and transparent research. The key insight is that access to most confidential data, while tedious, is open to hundreds if not thousands of researchers. In cases where few researchers can consider accessing such data in the future, reproducibility services, such as those provided by some journals, can provide some evidence for effective reproducibility even when the same data may not be available for future research.
@techreport{vilhuber2023-arxiv-reflections, abstract = {Transparency and reproducibility are often seen in opposition to privacy and confidentiality. Data that need to be kept confidential are seen as an impediment to reproducibility, and privacy would seem to inhibit transparency. I bring a more nuanced view to the discussion, and show, using examples from over 1,000 reproducibility assessments, that confidential data can very well be used in reproducible and transparent research. The key insight is that access to most confidential data, while tedious, is open to hundreds if not thousands of researchers. In cases where few researchers can consider accessing such data in the future, reproducibility services, such as those provided by some journals, can provide some evidence for effective reproducibility even when the same data may not be available for future research.}, author = {Vilhuber, Lars}, copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}, doi = {10.48550/ARXIV.2305.14478}, institution = {arXiv}, note = {Version Number: 1}, number = {2305.14478}, shorttitle = {Reproducibility and {Transparency} versus {Privacy} and {Confidentiality}}, title = {Reproducibility and {Transparency} versus {Privacy} and {Confidentiality}: {Reflections} from a {Data} {Editor}}, type = {submitted journal version}, url = {https://arxiv.org/abs/2305.14478}, urldate = {2023-06-06}, year = {2023}, } - Data and Code Availability StandardMiklós Koren, Marie Connolly, Joan Llull, and 1 more authorSocial Science Data Editors, Dec 2022Version Number: 1.0
DCAS is a standard for sharing research code and data, endorsed by leading journals in social sciences. It is maintained by the Social Science Data Editors.
@techreport{koren2022, abstract = {DCAS is a standard for sharing research code and data, endorsed by leading journals in social sciences. It is maintained by the Social Science Data Editors.}, author = {Koren, Miklós and Connolly, Marie and Llull, Joan and Vilhuber, Lars}, copyright = {Creative Commons Attribution 4.0 International, Open Access}, doi = {10.5281/ZENODO.7436134}, institution = {Social Science Data Editors}, language = {en}, month = dec, note = {Version Number: 1.0}, title = {Data and {Code} {Availability} {Standard}}, url = {https://zenodo.org/record/7436134}, urldate = {2025-02-08}, year = {2022}, month_numeric = {12} } - An Interview with John M. AbowdIan Schmutte and Lars VilhuberCornell University Labor Dynamics Institute Document, Feb 2022
John M. Abowd is the Chief Scientist and Associate Director for Research and Methodology, U.S. Census Bureau. He completed his A.B. in Economics at Notre Dame in 1973 and his Ph.D. in Economics at the University of Chicago in 1977 under Arnold Zellner. During his academic career, John has held faculty positions at Princeton, the University of Chicago, and, since 1987, at Cornell University, where he is the Edmund Ezra Day Professor Emeritus of Economics, Statistics and Data Science. John was trained as a statistician and labor economist, and his economic research has focused on the rigorous empirical evaluation of labor market institutions. In the late 1990s, he began working with the Census Bureau on projects that would end up leveraging administrative and survey records into official statistical products. Through that work, he has developed a research agenda focused on issues necessary to generate those products, including data privacy, synthetic data, total error analysis, data linkage, and missing data problems, among others.
@techreport{schmutte2022a, abstract = {John M. Abowd is the Chief Scientist and Associate Director for Research and Methodology, U.S. Census Bureau. He completed his A.B. in Economics at Notre Dame in 1973 and his Ph.D. in Economics at University of Chicago in 1977 under Arnold Zellner. During his academic career, John has held faculty positions at Princeton, the University of Chicago, and, since 1987 at Cornell University where he is the Edmund Ezra Day Professor Emeritus of Economics, Statistics and Data Science. John was trained as a statistician and labor economist, and his economic research has focused on the rigorous empirical evaluation of labor market institutions. In the late 1990s, he began working with the Census Bureau on projects that would end up leveraging administrative and survey records into official statistical products. Through that work, he has developed a research agenda focused on issues necessary to generate those products, including data privacy, synthetic data, total error analysis, data linkage, missing data problems, among others.}, author = {Schmutte, Ian and Vilhuber, Lars}, copyright = {Attribution-NonCommercial 4.0 International}, institution = {Cornell University Labor Dynamics Institute Document}, language = {en\_US}, month = feb, title = {An {Interview} with {John} {M}. {Abowd}}, url = {https://hdl.handle.net/1813/110981}, urldate = {2026-02-20}, year = {2022}, month_numeric = {2} } - A template README for social science replication packagesLars Vilhuber, Marie Connolly, Miklós Koren, and 2 more authorsZenodo v1.1.0, Nov 2022
The typical README in social science journals serves the purpose of guiding a reader through the available material and a route to replicating the results in the research paper, including the description of the origins of data and/or description of programs. As such, a good README file should first provide a brief overview of the available material and a brief guide as to how to proceed from beginning to end, before then diving into the specifics. These template files structure such a README in a way that is compliant with the typical data and code workflow in the social sciences.
@techreport{templateREADMEv1.1, abstract = {The typical README in social science journals serves the purpose of guiding a reader through the available material and a route to replicating the results in the research paper, including the description of the origins of data and/or description of programs. As such, a good README file should first provide a brief overview of the available material and a brief guide as to how to proceed from beginning to end, before then diving into the specifics. These template files structure such a README in a way that is compliant with the typical data and code workflow in the social sciences.}, author = {Vilhuber, Lars and Connolly, Marie and Koren, Miklós and Llull, Joan and Morrow, Peter}, copyright = {Creative Commons Attribution Non Commercial 4.0 International, Open Access}, institution = {Zenodo}, month = nov, number = {v1.1.0}, title = {A template {README} for social science replication packages}, url = {https://doi.org/10.5281/zenodo.7293838}, urldate = {2023-05-17}, year = {2022}, month_numeric = {11} } - Teaching for large-scale Reproducibility VerificationLars Vilhuber, Hyuk Harry Son, Meredith Welch, and 2 more authorsarXiv arxiv:2204.01540v1, Mar 2022
We describe a unique environment in which undergraduate students from various STEM and social science disciplines are trained in data provenance and reproducible methods, and then apply that knowledge to real, conditionally accepted manuscripts and associated replication packages. We describe in detail the recruitment, training, and regular activities. While the activity is not part of a regular curriculum, the skills and knowledge taught through explicit training of reproducible methods and principles, and reinforced through repeated application in a real-life workflow, contribute to the education of these undergraduate students, and prepare them for post-graduation jobs and further studies.
@techreport{vilhuber2022b, abstract = {We describe a unique environment in which undergraduate students from various STEM and social science disciplines are trained in data provenance and reproducible methods, and then apply that knowledge to real, conditionally accepted manuscripts and associated replication packages. We describe in detail the recruitment, training, and regular activities. While the activity is not part of a regular curriculum, the skills and knowledge taught through explicit training of reproducible methods and principles, and reinforced through repeated application in a real-life workflow, contribute to the education of these undergraduate students, and prepare them for post-graduation jobs and further studies.}, author = {Vilhuber, Lars and Son, Hyuk Harry and Welch, Meredith and Wasser, David N. and Darisse, Michael}, copyright = {CC BY Attribution 4.0 International}, institution = {arXiv}, language = {en}, month = mar, number = {arxiv:2204.01540v1}, title = {Teaching for large-scale {Reproducibility} {Verification}}, url = {https://arxiv.org/abs/2204.01540v1}, urldate = {2022-04-05}, year = {2022}, month_numeric = {3} } - Applying Data Synthesis for Longitudinal Business Data across Three CountriesM. Jahangir Alam, Benoit Dostie, Jörg Drechsler, and 1 more authorarXiv arxiv:2008.02246, Jul 2020
Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information than for any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to provide broader access to such data sets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (LEAP) and Germany (BHP). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.
@techreport{alam2020b, abstract = {Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information than for any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to provide broader access to such data sets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (LEAP) and Germany (BHP). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.}, author = {Alam, M. Jahangir and Dostie, Benoit and Drechsler, Jörg and Vilhuber, Lars}, copyright = {CC BY-NC-SA Attribution-NonCommercial-ShareAlike 4.0 International}, doi = {10.48550/arXiv.2008.02246}, institution = {arXiv}, month = jul, number = {arxiv:2008.02246}, title = {Applying {Data} {Synthesis} for {Longitudinal} {Business} {Data} across {Three} {Countries}}, url = {http://arxiv.org/abs/2008.02246}, urldate = {2026-02-10}, year = {2020}, month_numeric = {7} } - Consumer expectations around COVID-19: Evolution over timeFabian Lange and Lars VilhuberLabor Dynamics Institute, Online, 2020
@techreport{langevilhuber202005, author = {Lange, Fabian and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Labor Dynamics Institute}, title = {Consumer expectations around {COVID}-19: {Evolution} over time}, type = {Online}, url = {https://labordynamicsinstitute.github.io/gcs_covid19_expectations/text/analysis_week5/}, year = {2020}, } - A template README for social science replication packagesLars Vilhuber, Marie Connolly, Miklós Koren, and 2 more authorsZenodo, Dec 2020Version Number: v1.0.0
The typical README in social science journals serves the purpose of guiding a reader through the available material and a route to replicating the results in the research paper, including the description of the origins of data and/or description of programs. As such, a good README file should first provide a brief overview of the available material and a brief guide as to how to proceed from beginning to end, before then diving into the specifics. These template files structure such a README in a way that is compliant with the typical data and code workflow in the social sciences.
@techreport{templateREADMEv1.0, abstract = {The typical README in social science journals serves the purpose of guiding a reader through the available material and a route to replicating the results in the research paper, including the description of the origins of data and/or description of programs. As such, a good README file should first provide a brief overview of the available material and a brief guide as to how to proceed from beginning to end, before then diving into the specifics. These template files structure such a README in a way that is compliant with the typical data and code workflow in the social sciences.}, author = {Vilhuber, Lars and Connolly, Marie and Koren, Miklós and Llull, Joan and Morrow, Peter}, copyright = {Creative Commons Attribution Non Commercial 4.0 International, Open Access}, doi = {10.5281/ZENODO.4319999}, institution = {Zenodo}, language = {en}, month = dec, note = {Version Number: v1.0.0}, title = {A template {README} for social science replication packages}, url = {https://zenodo.org/record/4319999}, urldate = {2021-04-01}, year = {2020}, month_numeric = {12} } - Migrating historical AEA supplementsLars VilhuberCornell University, release v20200515, May 2020
@techreport{vilhuber2020a, author = {Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Cornell University}, month = may, number = {v20200515}, title = {Migrating historical {AEA} supplements}, type = {release}, url = {https://github.com/AEADataEditor/aea-supplement-migration/releases/tag/v20200515}, year = {2020}, month_numeric = {5} } - Criminal Record Inaccuracies and the Impact of a Record Education Intervention on Employment-Related OutcomesMartin Wells, Erin York Cornwell, Linda Barrington, and 3 more authorsCornell University, Final Report, Jan 2020
More than 70 million Americans have some form of criminal record, which can limit their access to employment opportunities, eligibility for occupational licensure, and public benefits. The use of criminal background checks in the hiring process has also dramatically increased over the past decade, and there is reason to think that many criminal records are inaccurate. Prior research has not determined the extent of errors on criminal records. We also do not know whether educating individuals about their records may promote efforts toward record correction and improve employment and other economic outcomes. The present study harnesses a unique opportunity to investigate the accuracy of criminal records and the impact of a record education intervention on job-seeking behaviors, employment opportunities, and economic outcomes for people with criminal records. We focus on class members of the Gonzalez, et al. v. Pritzker class action lawsuit. This group of individuals applied for a job with the 2010 Census, but they were denied employment because of a criminal background check. As part of the lawsuit settlement, class members were offered the choice of one of two remedies: a criminal records intervention that educates them about their criminal record and their related employment rights, or early notice of hiring for the 2020 Census. Individuals who chose the record education intervention are provided with a copy of their criminal record and a training session to review their record and provide information about their rights when applying for jobs or other employment-related opportunities. In addition, all class members in the two remedy groups were invited to participate in the first two waves of the Cornell Criminal Records Panel Survey (CCRPS). We combine data from the panel survey with administrative data from the records training (including actual criminal records) to address two main research questions. 
First, we ask: What is the prevalence of errors in criminal records of members of this class, and how are these errors distributed across racial/ethnic and sociodemographic groups? Using data from the record education intervention, we describe the errors discovered on participants’ records and how those errors vary across racial/ethnic and socioeconomic groups. Second, we ask: How does understanding one’s criminal record and relevant legal rights affect job-seeking behaviors, employment opportunities, economic attainment, and social engagement? To address this question, we leverage a quasi-experimental design, comparing class members who receive the criminal records intervention to those who opt into early notice of Census 2020 hiring, in order to examine how the criminal records intervention shapes job-seeking and other behaviors.
@techreport{wells2020a, abstract = {More than 70 million Americans have some form of criminal record, which can limit their access to employment opportunities, eligibility for occupational licensure, and public benefits. The use of criminal background checks in the hiring process has also dramatically increased over the past decade, and there is reason to think that many criminal records are inaccurate. Prior research has not determined the extent of errors on criminal records. We also do not know whether educating individuals about their records may promote efforts toward record correction and improve employment and other economic outcomes. The present study harnesses a unique opportunity to investigate the accuracy of criminal records and the impact of a record education intervention on job-seeking behaviors, employment opportunities, and economic outcomes for people with criminal records. We focus on class members of the Gonzalez, et al. v. Pritzker class action lawsuit. This group of individuals applied for a job with the 2010 Census, but they were denied employment because of a criminal background check. As part of the lawsuit settlement, class members were offered the choice of one of two remedies: a criminal records intervention that educates them about their criminal record and their related employment rights, or early notice of hiring for the 2020 Census. Individuals who chose the record education intervention are provided with a copy of their criminal record and a training session to review their record and provide information about their rights when applying for jobs or other employment-related opportunities. In addition, all class members in the two remedy groups were invited to participate in the first two waves of the Cornell Criminal Records Panel Survey (CCRPS). We combine data from the panel survey with administrative data from the records training (including actual criminal records) to address two main research questions. 
First, we ask: What is the prevalence of errors in criminal records of members of this class, and how are these errors distributed across racial/ethnic and sociodemographic groups? Using data from the record education intervention, we describe the errors discovered on participants’ records and how those errors vary across racial/ethnic and socioeconomic groups. Second, we ask: How does understanding one’s criminal record and relevant legal rights affect job-seeking behaviors, employment opportunities, economic attainment, and social engagement? To address this question, we leverage a quasi-experimental design, comparing class members who receive the criminal records intervention to those who opt into early notice of Census 2020 hiring, in order to examine how the criminal records intervention shapes job-seeking and other behaviors.}, author = {Wells, Martin and York Cornwell, Erin and Barrington, Linda and Bigler, Esta and Enayati, Hassan and Vilhuber, Lars}, copyright = {CC BY-NC-SA Attribution-NonCommercial-ShareAlike 4.0 International}, institution = {Cornell University}, language = {en\_US}, month = jan, title = {Criminal {Record} {Inaccuracies} and the {Impact} of a {Record} {Education} {Intervention} on {Employment}-{Related} {Outcomes}}, type = {Final {Report}}, url = {https://hdl.handle.net/1813/103780}, urldate = {2026-02-23}, year = {2020}, month_numeric = {1} } - Introductory Readings in Formal Privacy for EconomistsJohn Abowd, Ian Schmutte, William Sexton, and 1 more authorApr 2019Not peer-reviewed.
The purpose of this document is to provide scholars with a comprehensive list of readings relevant to the economic analysis of formal privacy, and particularly its application to public statistics. Statistical agencies and tech giants are rapidly adopting formal privacy models which make the tradeoff between privacy and data quality precise. The question then becomes, how much privacy loss should they allow? Abowd and Schmutte (2019) argue that this choice ultimately depends on how decision makers weigh the costs of privacy loss against the benefits of higher-quality data. Making progress on these questions requires familiarity with new tools from computer science and statistics, the objectives and policy environment within which statistical agencies operate, along with the economic analysis of information. We have organized these references into a reading course focused on 10-15 primary references in each of six different topics: Formal Privacy; Policy and Official Statistics; Statistical Disclosure Limitation; Economics of Privacy; Value of Privacy; and Data Accuracy.
@techreport{abowd2019, abstract = {The purpose of this document is to provide scholars with a comprehensive list of readings relevant to the economic analysis of formal privacy, and particularly its application to public statistics. Statistical agencies and tech giants are rapidly adopting formal privacy models which make the tradeoff between privacy and data quality precise. The question then becomes, how much privacy loss should they allow? Abowd and Schmutte (2019) argue that this choice ultimately depends on how decision makers weigh the costs of privacy loss against the benefits of higher-quality data. Making progress on these questions requires familiarity with new tools from computer science and statistics, the objectives and policy environment within which statistical agencies operate, along with the economic analysis of information. We have organized these references into a reading course focused on 10-15 primary references in each of six different topics: Formal Privacy; Policy and Official Statistics; Statistical Disclosure Limitation; Economics of Privacy; Value of Privacy; and Data Accuracy.}, author = {Abowd, John and Schmutte, Ian and Sexton, William and Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, doi = {10.5281/zenodo.2621344}, month = apr, note = {Not peer-reviewed.}, title = {Introductory {Readings} in {Formal} {Privacy} for {Economists}}, type = {Preprint}, url = {https://labordynamicsinstitute.github.io/privacy-bibliography/}, year = {2019}, month_numeric = {4} } - Why the Economics Profession Must Actively Participate in the Privacy Protection DebateJohn M. Abowd, Ian M. Schmutte, William N. Sexton, and 1 more authorLabor Dynamics Institute, Cornell University, Document 51, Jan 2019
When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.
@techreport{abowd2019a, abstract = {When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality, a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.}, author = {Abowd, John M. and Schmutte, Ian M. and Sexton, William N. and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Labor Dynamics Institute, Cornell University}, month = jan, number = {51}, title = {Why the {Economics} {Profession} {Must} {Actively} {Participate} in the {Privacy} {Protection} {Debate}}, type = {Document}, url = {https://hdl.handle.net/1813/89107}, year = {2019}, month_numeric = {1} } - Suboptimal Provision of Privacy and Statistical Accuracy When They are Public GoodsJohn M. Abowd, Ian M. Schmutte, William Sexton, and 1 more authorarXiv arxiv:1906.09353, Jun 2019
With vast databases at their disposal, private tech companies can compete with public statistical agencies to provide population statistics. However, private companies face different incentives to provide high-quality statistics and to protect the privacy of the people whose data are used. When both privacy protection and statistical accuracy are public goods, private providers tend to produce at least one suboptimally, but it is not clear which. We model a firm that publishes statistics under a guarantee of differential privacy. We prove that provision by the private firm results in inefficiently low data quality in this framework.
@techreport{abowd2019c, abstract = {With vast databases at their disposal, private tech companies can compete with public statistical agencies to provide population statistics. However, private companies face different incentives to provide high-quality statistics and to protect the privacy of the people whose data are used. When both privacy protection and statistical accuracy are public goods, private providers tend to produce at least one suboptimally, but it is not clear which. We model a firm that publishes statistics under a guarantee of differential privacy. We prove that provision by the private firm results in inefficiently low data quality in this framework.}, author = {Abowd, John M. and Schmutte, Ian M. and Sexton, William and Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, doi = {10.48550/arXiv.1906.09353}, institution = {arXiv}, language = {en}, month = jun, number = {arxiv:1906.09353}, title = {Suboptimal {Provision} of {Privacy} and {Statistical} {Accuracy} {When} {They} are {Public} {Goods}}, url = {https://arxiv.org/abs/1906.09353v1}, urldate = {2026-02-10}, year = {2019}, month_numeric = {6} } - Cornell Criminal Records Panel Study Questionnaire Wave 2Linda Barrington, Esta R. Bigler, Hassan Enayati, and 3 more authorsCornell Criminal Records Panel Study (CRPS), Document 69333, Mar 2019
Questionnaire used by the Cornell Criminal Records Panel Study. Subjects are members of the class in the lawsuit against the US Census Bureau. The purpose of this research is to learn more about the errors in background screening and understand how criminal records impact employment opportunities.
@techreport{barrington2019, abstract = {Questionnaire used by the Cornell Criminal Records Panel Study. Subjects are members of the class in the lawsuit against the US Census Bureau. The purpose of this research is to learn more about the errors in background screening and understand how criminal records impact employment opportunities.}, author = {Barrington, Linda and Bigler, Esta R. and Enayati, Hassan and Vilhuber, Lars and Wells, Martin T. and York Cornwell, Erin}, copyright = {Attribution-ShareAlike 4.0 International}, institution = {Cornell Criminal Records Panel Study (CRPS)}, language = {en\_US}, month = mar, number = {69333}, title = {Cornell {Criminal} {Records} {Panel} {Study} {Questionnaire} {Wave} 2}, type = {Document}, url = {https://hdl.handle.net/1813/69333}, urldate = {2019-10-16}, year = {2019}, month_numeric = {3} } - metajelo: A metadata package for journals to support external linked objectsCarl Lagoze and Lars VilhuberLabor Dynamics Institute, Document, 2019
We propose a metadata package that is intended to provide academic journals with a lightweight means of registering, at the time of publication, the existence and disposition of supplementary materials. Information about the supplementary materials is, in most cases, critical for the reproducibility and replicability of scholarly results. In many instances, these materials are curated by a third party, which may or may not follow developing standards for the identification and description of those materials. As such, the vocabulary described here complements existing initiatives that specify vocabularies to describe the supplementary materials or the repositories and archives in which they have been deposited. Where possible, it reuses elements of relevant other vocabularies, facilitating coexistence with them. Furthermore, it provides an “at publication” record of reproducibility characteristics of a particular article that has been selected for publication. The proposed metadata package documents the key characteristics that journals care about in the case of supplementary materials that are held by third parties: existence, accessibility, and permanence. It does so in a robust, time-invariant fashion at the time of publication, when the editorial decisions are made. It also allows for better documentation of less accessible (non-public data), by treating it symmetrically from the point of view of the journal, therefore increasing the transparency of what up until now has been very opaque.
@techreport{lagoze2022, abstract = {We propose a metadata package that is intended to provide academic journals with a lightweight means of registering, at the time of publication, the existence and disposition of supplementary materials. Information about the supplementary materials is, in most cases, critical for the reproducibility and replicability of scholarly results. In many instances, these materials are curated by a third party, which may or may not follow developing standards for the identification and description of those materials. As such, the vocabulary described here complements existing initiatives that specify vocabularies to describe the supplementary materials or the repositories and archives in which they have been deposited. Where possible, it reuses elements of relevant other vocabularies, facilitating coexistence with them. Furthermore, it provides an “at publication” record of reproducibility characteristics of a particular article that has been selected for publication. The proposed metadata package documents the key characteristics that journals care about in the case of supplementary materials that are held by third parties: existence, accessibility, and permanence. It does so in a robust, time-invariant fashion at the time of publication, when the editorial decisions are made. It also allows for better documentation of less accessible (non-public data), by treating it symmetrically from the point of view of the journal, therefore increasing the transparency of what up until now has been very opaque.}, author = {Lagoze, Carl and Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, institution = {Labor Dynamics Institute}, pages = {accepted}, title = {metajelo: {A} metadata package for journals to support external linked objects}, type = {Document}, url = {https://hdl.handle.net/1813/89106}, year = {2019}, } - Outcomes report of the Cornell Node of the NSF-Census Research NetworkLars Vilhuber and William BlockJan 2019
Description and List of Outcomes of the Cornell node of the NSF-Census Research Network.
@techreport{vilhuber2019b, abstract = {Description and List of Outcomes of the Cornell node of the NSF-Census Research Network.}, author = {Vilhuber, Lars and Block, William}, copyright = {Attribution-NonCommercial-ShareAlike 4.0 International}, language = {en\_US}, month = jan, title = {Outcomes report of the {Cornell} {Node} of the {NSF}-{Census} {Research} {Network}}, type = {report}, url = {https://ecommons.cornell.edu/handle/1813/65011}, urldate = {2019-04-10}, year = {2019}, month_numeric = {1} } - Disclosure Limitation and Confidentiality Protection in Linked DataJohn M. Abowd, Ian M. Schmutte, and Lars VilhuberLabor Dynamics Institute, Cornell University, Document 47, Jan 2018
Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For examples, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.
@techreport{abowd2018, abstract = {Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For examples, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.}, author = {Abowd, John M. and Schmutte, Ian M. and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Labor Dynamics Institute, Cornell University}, month = jan, number = {47}, title = {Disclosure {Limitation} and {Confidentiality} {Protection} in {Linked} {Data}}, type = {Document}, url = {https://hdl.handle.net/1813/89103}, year = {2018}, grant = {SES-1131848}, month_numeric = {1} } - The Reproducibility of Economics Research: A Case StudyHautahi Kingi, Flavio Stanchi, Lars Vilhuber, and 1 more author2018
@techreport{Kingi2018, address = {Berkeley, CA}, author = {Kingi, Hautahi and Stanchi, Flavio and Vilhuber, Lars and Herbert, Sylverie}, copyright = {All rights reserved}, title = {The {Reproducibility} of {Economics} {Research}: {A} {Case} {Study}}, type = {Presentation}, url = {https://osf.io/srg57/}, year = {2018}, } - LEHD infrastructure S2014 files in the FSRDCLars VilhuberCenter for Economic Studies, U.S. Census Bureau, Working Papers 18-27, May 2018
The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, maintains a set of infrastructure files using administrative data provided by state agencies, enhanced with information from other administrative data sources, demographic and economic (business) surveys and censuses. The LEHD Infrastructure Files provide a detailed and comprehensive picture of workers, employers, and their interaction in the U.S. economy. This document describes the structure and content of the 2014 Snapshot of the LEHD Infrastructure files as they are made available in the Census Bureau’s secure and restricted-access Research Data Center network. The document attempts to provide a comprehensive description of all researcher-accessible files, of their creation, and of any modifications made to the files to facilitate researcher access.
@techreport{RePEc:cen:wpaper:18-27, abstract = {The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, maintains a set of infrastructure files using administrative data provided by state agencies, enhanced with information from other administrative data sources, demographic and economic (business) surveys and censuses. The LEHD Infrastructure Files provide a detailed and comprehensive picture of workers, employers, and their interaction in the U.S. economy. This document describes the structure and content of the 2014 Snapshot of the LEHD Infrastructure files as they are made available in the Census Bureau’s secure and restricted-access Research Data Center network. The document attempts to provide a comprehensive description of all researcher-accessible files, of their creation, and of any modifications made to the files to facilitate researcher access.}, author = {Vilhuber, Lars}, copyright = {Public Domain}, institution = {Center for Economic Studies, U.S. Census Bureau}, month = may, number = {18-27}, title = {{LEHD} infrastructure {S2014} files in the {FSRDC}}, type = {Working {Papers}}, url = {https://ideas.repec.org/p/cen/wpaper/18-27.html}, year = {2018}, month_numeric = {5} } - Issues with Unpaywall in EconomicsLars VilhuberLabor Dynamics Institute, Oct 2018
@techreport{vilhuber2018g, author = {Vilhuber, Lars}, copyright = {CC BY Attribution 4.0 International}, institution = {Labor Dynamics Institute}, month = oct, title = {Issues with {Unpaywall} in {Economics}}, url = {https://labordynamicsinstitute.github.io/unpaywall-in-economics/README.html}, year = {2018}, month_numeric = {10} } - Reproducibility and replicability in economicsLars VilhuberNational Academies of Sciences, Engineering, and Medicine, Commissioned Paper, 2018
@techreport{vilhuber2018nap, author = {Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {National Academies of Sciences, Engineering, and Medicine}, title = {Reproducibility and replicability in economics}, type = {Commissioned {Paper}}, url = {https://www.nap.edu/resource/25303/Reproducibility%20in%20Economics.pdf}, year = {2018}, } - Cornell Project for Record Assistance Questionnaire - B-filersLinda Barrington, Esta R. Bigler, Hassan Enayati, and 3 more authorsMay 2017
Questionnaire used by the Cornell Project for Records Assistance, part of the Cornell University ILR School (Industrial and Labor Relations), to collect information about respondents, their work history, their involvement with the criminal justice system, if any, and their choice of remedy under the settlement of the Gonzalez et al v. Pritzker case. These respondents had chosen not to receive a remedy; information was collected to understand their choice.
@techreport{barrington2017, abstract = {Questionnaire used by the Cornell Project for Records Assistance, part of the Cornell University ILR School (Industrial and Labor Relations), to collect information about respondents, their work history, their involvement with the criminal justice system, if any, and their choice of remedy under the settlement of the Gonzalez et al v. Pritzker case. These respondents had chosen not to receive a remedy; information was collected to understand their choice.}, author = {Barrington, Linda and Bigler, Esta R. and Enayati, Hassan and Vilhuber, Lars and Wells, Martin T. and York Cornwell, Erin}, copyright = {Attribution-ShareAlike 4.0 International}, language = {en\_US}, month = may, title = {Cornell {Project} for {Record} {Assistance} {Questionnaire} - {B}-filers}, url = {https://doi.org/10.7298/8djc-nr53}, urldate = {2019-10-16}, year = {2017}, month_numeric = {5} } - Cornell Project for Record Assistance Questionnaire - A-filersLinda Barrington, Esta R. Bigler, Hassan Enayati, and 3 more authorsMay 2017
Questionnaire used by the Cornell Project for Records Assistance, part of the Cornell University ILR School (Industrial and Labor Relations), to collect information about respondents, their work history, their involvement with the criminal justice system, if any, and their choice of remedy under the settlement of the Gonzalez et al v. Pritzker case. The responses were used to inform support and remedy provided to respondents.
@techreport{barrington2017a, abstract = {Questionnaire used by the Cornell Project for Records Assistance, part of the Cornell University ILR School (Industrial and Labor Relations), to collect information about respondents, their work history, their involvement with the criminal justice system, if any, and their choice of remedy under the settlement of the Gonzalez et al v. Pritzker case. The responses were used to inform support and remedy provided to respondents.}, author = {Barrington, Linda and Bigler, Esta R. and Enayati, Hassan and Vilhuber, Lars and Wells, Martin T. and York Cornwell, Erin}, copyright = {Attribution-ShareAlike 4.0 International}, language = {en\_US}, month = may, title = {Cornell {Project} for {Record} {Assistance} {Questionnaire} - {A}-filers}, url = {https://doi.org/10.7298/k6yz-6394}, urldate = {2019-10-16}, year = {2017}, month_numeric = {5} } - Cornell Project for Records Assistance Questionnaire - with routingLinda Barrington, Esta R. Bigler, Hassan Enayati, and 3 more authorsMay 2017
Questionnaire used by the Cornell Project for Records Assistance, part of the Cornell University ILR School (Industrial and Labor Relations), to collect information about respondents, their work history, their involvement with the criminal justice system, if any, and their choice of remedy under the settlement of the Gonzalez et al v. Pritzker case. This version contains routing information.
@techreport{barrington2017b, abstract = {Questionnaire used by the Cornell Project for Records Assistance, part of the Cornell University ILR School (Industrial and Labor Relations), to collect information about respondents, their work history, their involvement with the criminal justice system, if any, and their choice of remedy under the settlement of the Gonzalez et al v. Pritzker case. This version contains routing information.}, author = {Barrington, Linda and Bigler, Esta R. and Enayati, Hassan and Vilhuber, Lars and Wells, Martin T. and York Cornwell, Erin}, doi = {10.7298/qdtv-wy74}, language = {en\_US}, month = may, title = {Cornell {Project} for {Records} {Assistance} {Questionnaire} - with routing}, url = {https://hdl.handle.net/1813/60391}, urldate = {2026-02-23}, year = {2017}, month_numeric = {5} } - Understanding the effect of procedural justice on psychological distressJulie Cloutier, Lars Vilhuber, Denis Harrisson, and 1 more authorLabor Dynamics Institute, Cornell University, Document 35, 2017
Studies on the effect of procedural justice on psychological distress present conflicting results. Drawing on instrumental and relational perspectives of justice, we test the hypothesis that the perception of procedural justice influences the level of workers’ psychological distress. Using a number of validated instruments to collect data from 659 workers in three call centers, we use OLS regressions and Hayes’ PROCESS tool to show that the perception of procedural justice has a direct, unique, and independent effect on psychological distress. The perception of procedural justice has no instrumental role, the key mechanism being the relational role, suggesting that perceived injustice influences psychological distress because it threatens self-esteem. Distributive justice perceptions (recognition, promotions, job security) are not associated with psychological distress, calling into question Siegrist’s model. Our findings suggest that perceived procedural justice provides workers better evidence of the extent to which they are valued and appreciated members of their organizations than do perceptions of distributive justice. The results highlight the greater need for workers to be valued and appreciated for who they are (consideration and esteem), rather than for what they do for their organization (distributive justice of rewards).
@techreport{CloutierVilhuberLDI2017, abstract = {Studies on the effect of procedural justice on psychological distress present conflicting results. Drawing on instrumental and relational perspectives of justice, we test the hypothesis that the perception of procedural justice influences the level of workers' psychological distress. Using a number of validated instruments to collect data from 659 workers in three call centers, we use OLS regressions and Hayes' PROCESS tool to show that the perception of procedural justice has a direct, unique, and independent effect on psychological distress. The perception of procedural justice has no instrumental role, the key mechanism being the relational role, suggesting that perceived injustice influences psychological distress because it threatens self-esteem. Distributive justice perceptions (recognition, promotions, job security) are not associated with psychological distress, calling into question Siegrist's model. Our findings suggest that perceived procedural justice provides workers better evidence of the extent to which they are valued and appreciated members of their organizations than do perceptions of distributive justice. The results highlight the greater need for workers to be valued and appreciated for who they are (consideration and esteem), rather than for what they do for their organization (distributive justice of rewards).}, author = {Cloutier, Julie and Vilhuber, Lars and Harrisson, Denis and Béland-Ouellette, Vanessa}, copyright = {All Rights Reserved (Free to Read)}, institution = {Labor Dynamics Institute, Cornell University}, number = {35}, title = {Understanding the effect of procedural justice on psychological distress}, type = {Document}, url = {http://digitalcommons.ilr.cornell.edu/ldi/35/}, year = {2017}, } - Two Perspectives on Commuting: A Comparison of Home to Work Flows Across Job-Linked Survey and Administrative Files.Andrew S. Green, Mark J. Kutzbach, and Lars VilhuberU.S. Census Bureau Center for Economic Studies, Discussion Paper, 2017
@techreport{green2017, author = {Green, Andrew S. and Kutzbach, Mark J. and Vilhuber, Lars}, copyright = {Public Domain}, institution = {U.S. Census Bureau Center for Economic Studies}, language = {en}, number = {17-34}, title = {Two {Perspectives} on {Commuting}: {A} {Comparison} of {Home} to {Work} {Flows} {Across} {Job}-{Linked} {Survey} and {Administrative} {Files}.}, type = {Discussion {Paper}}, url = {https://www.census.gov/library/working-papers/2017/adrm/ces-wp-17-34.html}, year = {2017}, grant = {SES-1131848}, } - Recalculating - how uncertainty in local labor market definitions affects empirical findingsAndrew Foote, Mark J. Kutzbach, and Lars VilhuberNSF Census Research Network - NCRN-Cornell, Preprint 1813:52649, 2017
This paper evaluates the use of commuting zones as a local labor market definition. We revisit Tolbert and Sizer (1996) and demonstrate the sensitivity of definitions to two features of the methodology. We show how these features impact empirical estimates using a well-known application of commuting zones. We conclude with advice to researchers using commuting zones on how to demonstrate the robustness of empirical findings to uncertainty in definitions. The analysis, conclusions, and opinions expressed herein are those of the author(s) alone and do not necessarily represent the views of the U.S. Census Bureau or the Federal Deposit Insurance Corporation. All results have been reviewed to ensure that no confidential information is disclosed, and no confidential data was used in this paper. This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Much of the work developing this paper occurred while Mark Kutzbach was an employee of the U.S. Census Bureau.
@techreport{handle:1813:52649, abstract = {This paper evaluates the use of commuting zones as a local labor market definition. We revisit Tolbert and Sizer (1996) and demonstrate the sensitivity of definitions to two features of the methodology. We show how these features impact empirical estimates using a well-known application of commuting zones. We conclude with advice to researchers using commuting zones on how to demonstrate the robustness of empirical findings to uncertainty in definitions. The analysis, conclusions, and opinions expressed herein are those of the author(s) alone and do not necessarily represent the views of the U.S. Census Bureau or the Federal Deposit Insurance Corporation. All results have been reviewed to ensure that no confidential information is disclosed, and no confidential data was used in this paper. This document is released to inform interested parties of ongoing research and to encourage discussion of work in progress. Much of the work developing this paper occurred while Mark Kutzbach was an employee of the U.S. Census Bureau.}, author = {Foote, Andrew and Kutzbach, Mark J. and Vilhuber, Lars}, copyright = {Public Domain}, institution = {NSF Census Research Network - NCRN-Cornell}, number = {1813:52649}, title = {Recalculating - how uncertainty in local labor market definitions affects empirical findings}, type = {Preprint}, url = {http://hdl.handle.net/1813/52649}, year = {2017}, } - Utility Cost of Formal Privacy for Releasing National Employer-Employee StatisticsSamuel Haney, Ashwin Machanavajjhala, John M. Abowd, and 3 more authorsLabor Dynamics Institute, Cornell University, Document 36, 2017
National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ε ≥ 1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.
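The privacy-loss parameter ε and the "additive error" discussed in this abstract can be made concrete with the textbook Laplace mechanism, where noise with scale 1/ε added to counts of sensitivity 1 yields ε-differential privacy. The sketch below is illustrative only and is not the paper's algorithm (which relies on the Pufferfish framework and customized privacy definitions); the function name and cell counts are hypothetical.

```python
import numpy as np

def laplace_histogram(counts, epsilon, seed=None):
    """Release histogram counts with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes each count of a disjoint-cell
    histogram by at most 1 (sensitivity 1), so Laplace(1/epsilon) noise
    satisfies epsilon-differential privacy for these counting queries.
    """
    rng = np.random.default_rng(seed)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(counts))
    return [c + n for c, n in zip(counts, noise)]

# Hypothetical tabulation cells: the expected absolute error per cell
# is 1/epsilon, negligible for large cells but large for small ones.
true_counts = [1200, 450, 80, 3]
noisy_counts = laplace_histogram(true_counts, epsilon=1.0, seed=0)
```

With ε = 1 the expected absolute error per cell is 1, so large cells are barely perturbed while small cells can be distorted substantially, mirroring the utility trade-off the abstract reports for simple versus complex queries.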
@techreport{haney2017a, abstract = {National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data is protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter $\epsilon \geq 1$, the additive error introduced by our provably private algorithms is comparable to, and in some cases better than, the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.}, author = {Haney, Samuel and Machanavajjhala, Ashwin and Abowd, John M.
and Graham, Matthew and Kutzbach, Mark and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Labor Dynamics Institute, Cornell University}, number = {36}, title = {Utility {Cost} of {Formal} {Privacy} for {Releasing} {National} {Employer}-{Employee} {Statistics}}, type = {Document}, url = {https://hdl.handle.net/1813/49652}, year = {2017}, grant = {SES-1131848}, } - Effects of a government-academic partnership: Has the NSF-census bureau research network helped improve the U.S. statistical system?Daniel H. Weinberg, John M. Abowd, Robert F. Belli, and 13 more authorsCenter for Economic Studies, U.S. Census Bureau, Working Papers 17-59r, Jan 2017
The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives.
@techreport{RePEc:cen:wpaper:17-59r, abstract = {The National Science Foundation-Census Bureau Research Network (NCRN) was established in 2011 to create interdisciplinary research nodes on methodological questions of interest and significance to the broader research community and to the Federal Statistical System (FSS), particularly the Census Bureau. The activities to date have covered both fundamental and applied statistical research and have focused at least in part on the training of current and future generations of researchers in skills of relevance to surveys and alternative measurement of economic units, households, and persons. This paper discusses some of the key research findings of the eight nodes, organized into six topics: (1) Improving census and survey data collection methods; (2) Using alternative sources of data; (3) Protecting privacy and confidentiality by improving disclosure avoidance; (4) Using spatial and spatio-temporal statistical modeling to improve estimates; (5) Assessing data cost and quality tradeoffs; and (6) Combining information from multiple sources. It also reports on collaborations across nodes and with federal agencies, new software developed, and educational activities and outcomes. The paper concludes with an evaluation of the ability of the FSS to apply the NCRN’s research outcomes and suggests some next steps, as well as the implications of this research-network model for future federal government renewal initiatives.}, author = {Weinberg, Daniel H. and Abowd, John M. and Belli, Robert F. and Cressie, Noel and Folch, David C. and Holan, Scott H. and Levenstein, Margaret C. and Olson, Kristen M. and Reiter, Jerome P. and Shapiro, Matthew D. and Smyth, Jolene and Soh, Leen-Kiat and Spencer, Bruce D. and Spielman, Seth E. and Vilhuber, Lars and Wikle, Christopher K.}, copyright = {Public Domain}, institution = {Center for Economic Studies, U.S. 
Census Bureau}, month = jan, number = {17-59r}, title = {Effects of a government-academic partnership: {Has} the {NSF}-census bureau research network helped improve the {U}.{S}. statistical system?}, type = {Working {Papers}}, url = {https://ideas.repec.org/p/cen/wpaper/17-59r.html}, year = {2017}, grant = {SES-1131848}, month_numeric = {1} } - Total error and variability measures with integrated disclosure limitation for quarterly workforce indicators and LEHD origin destination employment statistics in on the mapKevin L. McKinney, Andrew S. Green, Lars Vilhuber, and 1 more authorCenter for Economic Studies, U.S. Census Bureau, Working Papers 17-71, Jan 2017
We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau’s Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.
@techreport{RePEc:cen:wpaper:17-71, abstract = {We report results from the first comprehensive total quality evaluation of five major indicators in the U.S. Census Bureau's Longitudinal Employer-Household Dynamics (LEHD) Program Quarterly Workforce Indicators (QWI): total employment, beginning-of-quarter employment, full-quarter employment, total payroll, and average monthly earnings of full-quarter employees. Beginning-of-quarter employment is also the main tabulation variable in the LEHD Origin-Destination Employment Statistics (LODES) workplace reports as displayed in OnTheMap (OTM). The evaluation is conducted by generating multiple threads of the edit and imputation models used in the LEHD Infrastructure File System. These threads conform to the Rubin (1987) multiple imputation model, with each thread or implicate being the output of formal probability models that address coverage, edit, and imputation errors. Design-based sampling variability and finite population corrections are also included in the evaluation. We derive special formulas for the Rubin total variability and its components that are consistent with the disclosure avoidance system used for QWI and LODES/OTM workplace reports. These formulas allow us to publish the complete set of detailed total quality measures for QWI and LODES. The analysis reveals that the five publication variables under study are estimated very accurately for tabulations involving at least 10 jobs. Tabulations involving three to nine jobs have quality in the range generally deemed acceptable. Tabulations involving zero, one or two jobs, which are generally suppressed in the QWI and synthesized in LODES, have substantial total variability but their publication in LODES allows the formation of larger custom aggregations, which will in general have the accuracy estimated for tabulations in the QWI based on a similar number of workers.}, author = {McKinney, Kevin L. and Green, Andrew S.
and Vilhuber, Lars and Abowd, John M.}, copyright = {Public Domain}, institution = {Center for Economic Studies, U.S. Census Bureau}, month = jan, number = {17-71}, title = {Total error and variability measures with integrated disclosure limitation for quarterly workforce indicators and {LEHD} origin destination employment statistics in on the map}, type = {Working {Papers}}, url = {https://ideas.repec.org/p/cen/wpaper/17-71.html}, year = {2017}, grant = {SES-1131848}, month_numeric = {1} } - Making Confidential Data Part of Reproducible ResearchLars Vilhuber and Carl LagozeLabor Dynamics Institute, Cornell University, Document 41, 2017
@techreport{vilhuber2017a, author = {Vilhuber, Lars and Lagoze, Carl}, copyright = {Attribution-NonCommercial-ShareAlike 4.0 International}, institution = {Labor Dynamics Institute, Cornell University}, number = {41}, title = {Making {Confidential} {Data} {Part} of {Reproducible} {Research}}, type = {Document}, url = {https://hdl.handle.net/1813/52474}, year = {2017}, grant = {SES-1131848}, } - Proceedings from the 2017 Cornell-Census-NSF-Sloan Workshop on Practical PrivacyLars Vilhuber and Ian SchmutteLabor Dynamics Institute, Cornell University, Document 43, 2017
These proceedings report on a workshop hosted at the U.S. Census Bureau on May 8, 2017. Our purpose was to gather experts from various backgrounds together to continue discussing the development of formal privacy systems for Census Bureau data products. This workshop was a successor to a previous workshop held in October 2016 (Vilhuber & Schmutte 2017). At our prior workshop, we hosted computer scientists, survey statisticians, and economists, all of whom were experts in data privacy. At that time we discussed the practical implementation of cutting-edge methods for publishing data with formal, provable privacy guarantees, with a focus on applications to Census Bureau data products. The teams developing those applications were just starting out when our first workshop took place, and we spent our time brainstorming solutions to the various problems researchers were encountering, or anticipated encountering. For these cutting-edge formal privacy models, there had been very little effort in the academic literature to apply those methods in real-world settings with large, messy data. We therefore brought together an expanded group of specialists from academia and government who could shed light on technical challenges, subject matter challenges and address how data users might react to changes in data availability and publishing standards. In May 2017, we organized a follow-up workshop, which these proceedings report on. We reviewed progress made in four different areas. The four topics discussed as part of the workshop were 1. the 2020 Decennial Census; 2. the American Community Survey (ACS); 3. the 2017 Economic Census; 4. measuring the demand for privacy and for data quality. As in our earlier workshop, our goals were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. 
Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.
@techreport{vilhuber2017b, abstract = {These proceedings report on a workshop hosted at the U.S. Census Bureau on May 8, 2017. Our purpose was to gather experts from various backgrounds together to continue discussing the development of formal privacy systems for Census Bureau data products. This workshop was a successor to a previous workshop held in October 2016 (Vilhuber \& Schmutte 2017). At our prior workshop, we hosted computer scientists, survey statisticians, and economists, all of whom were experts in data privacy. At that time we discussed the practical implementation of cutting-edge methods for publishing data with formal, provable privacy guarantees, with a focus on applications to Census Bureau data products. The teams developing those applications were just starting out when our first workshop took place, and we spent our time brainstorming solutions to the various problems researchers were encountering, or anticipated encountering. For these cutting-edge formal privacy models, there had been very little effort in the academic literature to apply those methods in real-world settings with large, messy data. We therefore brought together an expanded group of specialists from academia and government who could shed light on technical challenges, subject matter challenges and address how data users might react to changes in data availability and publishing standards. In May 2017, we organized a follow-up workshop, which these proceedings report on. We reviewed progress made in four different areas. The four topics discussed as part of the workshop were 1. the 2020 Decennial Census; 2. the American Community Survey (ACS); 3. the 2017 Economic Census; 4. measuring the demand for privacy and for data quality. As in our earlier workshop, our goals were to 1. Discuss the specific challenges that have arisen in ongoing efforts to apply formal privacy models to Census data products by drawing together expertise of academic and governmental researchers; 2. 
Produce short written memos that summarize concrete suggestions for practical applications to specific Census Bureau priority areas.}, author = {Vilhuber, Lars and Schmutte, Ian}, copyright = {All Rights Reserved (Free to Read)}, institution = {Labor Dynamics Institute, Cornell University}, number = {43}, title = {Proceedings from the 2017 {Cornell}-{Census}-{NSF}-{Sloan} {Workshop} on {Practical} {Privacy}}, type = {Document}, url = {https://hdl.handle.net/1813/89101}, year = {2017}, grant = {SES-1131848}, } - Proceedings from the Synthetic LBD International SeminarLars Vilhuber, Saki Kinney, and Ian SchmutteLabor Dynamics Institute, Cornell University, Document 44, 2017
On May 9, 2017, we hosted a seminar to discuss the conditions necessary to implement the SynLBD approach with interested parties, with the goal of providing a straightforward toolkit to implement the same procedure on other data. The proceedings summarize the discussions during the workshop.
@techreport{vilhuber2017c, abstract = {On May 9, 2017, we hosted a seminar to discuss the conditions necessary to implement the SynLBD approach with interested parties, with the goal of providing a straightforward toolkit to implement the same procedure on other data. The proceedings summarize the discussions during the workshop.}, author = {Vilhuber, Lars and Kinney, Saki and Schmutte, Ian}, copyright = {Creative Commons Attribution Non Commercial Share Alike 4.0 International}, institution = {Labor Dynamics Institute, Cornell University}, number = {44}, title = {Proceedings from the {Synthetic} {LBD} {International} {Seminar}}, type = {Document}, url = {https://hdl.handle.net/1813/52472}, year = {2017}, grant = {SES-1131848}, } - Proceedings from the 2016 NSF-Sloan Workshop on Practical PrivacyLars Vilhuber and Ian SchmutteLabor Dynamics Institute, Cornell University, Document 33, 2017
@techreport{vilhuber2017e, author = {Vilhuber, Lars and Schmutte, Ian}, copyright = {All rights reserved}, institution = {Labor Dynamics Institute, Cornell University}, number = {33}, title = {Proceedings from the 2016 {NSF}-{Sloan} {Workshop} on {Practical} {Privacy}}, type = {Document}, url = {http://digitalcommons.ilr.cornell.edu/ldi/33/}, year = {2017}, grant = {SES-1131848}, } - Using partially synthetic microdata to protect sensitive cells in business statisticsJavier Miranda and Lars VilhuberNSF Census Research Network - NCRN-Cornell 1813:42339, 2016
We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau’s Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).
@techreport{miranda-vilhuber-2016-ecommons, abstract = {We describe and analyze a method that blends records from both observed and synthetic microdata into public-use tabulations on establishment statistics. The resulting tables use synthetic data only in potentially sensitive cells. We describe different algorithms, and present preliminary results when applied to the Census Bureau's Business Dynamics Statistics and Synthetic Longitudinal Business Database, highlighting accuracy and protection afforded by the method when compared to existing public-use tabulations (with suppressions).}, author = {Miranda, Javier and Vilhuber, Lars}, copyright = {Public Domain}, institution = {NSF Census Research Network - NCRN-Cornell}, number = {1813:42339}, title = {Using partially synthetic microdata to protect sensitive cells in business statistics}, url = {http://hdl.handle.net/1813/42339}, year = {2016}, } - Synthetic establishment microdata around the worldLars Vilhuber, John A. Abowd, and Jerome P. ReiterNSF Census Research Network - NCRN-Cornell 1813:42340, 2016
In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.
@techreport{vilhuber-abowd-reiter-2016-ecommons, abstract = {In contrast to the many public-use microdata samples available for individual and household data from many statistical agencies around the world, there are virtually no establishment or firm microdata available. In large part, this difficulty in providing access to business micro data is due to the skewed and sparse distributions that characterize business data. Synthetic data are simulated data generated from statistical models. We organized sessions at the 2015 World Statistical Congress and the 2015 Joint Statistical Meetings, highlighting work on synthetic establishment microdata. This overview situates those papers, published in this issue, within the broader literature.}, author = {Vilhuber, Lars and Abowd, John A. and Reiter, Jerome P.}, institution = {NSF Census Research Network - NCRN-Cornell}, number = {1813:42340}, title = {Synthetic establishment microdata around the world}, url = {http://hdl.handle.net/1813/42340}, year = {2016}, } - Looking back on three years of using the Synthetic LBD betaJavier Miranda and Lars VilhuberCenter for Economic Studies, U.S. Census Bureau, Working Papers 14-11, Feb 2014
Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States, including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.
@techreport{RePEc:cen:wpaper:14-11, abstract = {Distributions of business data are typically much more skewed than those for household or individual data and public knowledge of the underlying units is greater. As a result, national statistical offices (NSOs) rarely release establishment or firm-level business microdata due to the risk to respondent confidentiality. One potential approach for overcoming these risks is to release synthetic data where the establishment data are simulated from statistical models designed to mimic the distributions of the real underlying microdata. The US Census Bureau's Center for Economic Studies in collaboration with Duke University, the National Institute of Statistical Sciences, and Cornell University made available a synthetic public use file for the Longitudinal Business Database (LBD) comprising more than 20 million records for all business establishments with paid employees dating back to 1976. The resulting product, dubbed the SynLBD, was released in 2010 and is the first-ever comprehensive business microdata set publicly released in the United States, including data on establishments' employment and payroll, birth and death years, and industrial classification. This paper documents the scope of projects that have requested and used the SynLBD.}, author = {Miranda, Javier and Vilhuber, Lars}, institution = {Center for Economic Studies, U.S. Census Bureau}, month = feb, number = {14-11}, title = {Looking back on three years of using the {Synthetic} {LBD} beta}, type = {Working {Papers}}, url = {http://ideas.repec.org/p/cen/wpaper/14-11.html}, year = {2014}, month_numeric = {2} } - A First Step Towards A German SynLBD: Constructing A German Longitudinal Business DatabaseJörg Drechsler and Lars VilhuberCenter for Economic Studies, U.S. Census Bureau, Working Papers 14-13, Feb 2014
One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue that many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.
@techreport{RePEc:cen:wpaper:14-13, abstract = {One major criticism against the use of synthetic data has been that the efforts necessary to generate useful synthetic data are so intense that many statistical agencies cannot afford them. We argue that many lessons in this evolving field have been learned in the early years of synthetic data generation, and can be used in the development of new synthetic data products, considerably reducing the required investments. The final goal of the project described in this paper will be to evaluate whether synthetic data algorithms developed in the U.S. to generate a synthetic version of the Longitudinal Business Database (LBD) can easily be transferred to generate a similar data product for other countries. We construct a German data product with information comparable to the LBD - the German Longitudinal Business Database (GLBD) - that is generated from different administrative sources at the Institute for Employment Research, Germany. In a future step, the algorithms developed for the synthesis of the LBD will be applied to the GLBD. Extensive evaluations will illustrate whether the algorithms provide useful synthetic data without further adjustment. The ultimate goal of the project is to provide access to multiple synthetic datasets similar to the SynLBD at Cornell to enable comparative studies between countries. The Synthetic GLBD is a first step towards that goal.}, author = {Drechsler, Jörg and Vilhuber, Lars}, copyright = {Public Domain}, institution = {Center for Economic Studies, U.S. Census Bureau}, month = feb, number = {14-13}, title = {A {First} {Step} {Towards} {A} {German} {SynLBD}: {Constructing} {A} {German} {Longitudinal} {Business} {Database}}, type = {Working {Papers}}, url = {http://ideas.repec.org/p/cen/wpaper/14-13.html}, year = {2014}, month_numeric = {2} } - LEHD Infrastructure files in the Census RDC - OverviewLars Vilhuber and Kevin McKinneyCenter for Economic Studies, U.S.
Census Bureau 14-26, Jun 2014
The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, maintains a set of infrastructure files using administrative data provided by state agencies, enhanced with information from other administrative data sources, demographic and economic (business) surveys and censuses. The LEHD Infrastructure Files provide a detailed and comprehensive picture of workers, employers, and their interaction in the U.S. economy. This document describes the structure and content of the 2011 Snapshot of the LEHD Infrastructure files as they are made available in the Census Bureau's secure and restricted-access Research Data Center network. The document attempts to provide a comprehensive description of all researcher-accessible files, of their creation, and of any modifications made to the files to facilitate researcher access.
@techreport{vilhuber2014, abstract = {The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, maintains a set of infrastructure files using administrative data provided by state agencies, enhanced with information from other administrative data sources, demographic and economic (business) surveys and censuses. The LEHD Infrastructure Files provide a detailed and comprehensive picture of workers, employers, and their interaction in the U.S. economy. This document describes the structure and content of the 2011 Snapshot of the LEHD Infrastructure files as they are made available in the Census Bureau's secure and restricted-access Research Data Center network. The document attempts to provide a comprehensive description of all researcher-accessible files, of their creation, and of any modifications made to the files to facilitate researcher access.}, author = {Vilhuber, Lars and McKinney, Kevin}, copyright = {All rights reserved}, institution = {Center for Economic Studies, U.S. Census Bureau}, month = jun, number = {14-26}, title = {{LEHD} {Infrastructure} files in the {Census} {RDC} - {Overview}}, url = {http://ideas.repec.org/p/cen/wpaper/14-26.html}, year = {2014}, month_numeric = {6} } - Estimation de la contribution de la réallocation de la main-d’oeuvre à la croissance de la productivité au CanadaCharles Bérubé, Benoit Dostie, and Lars VilhuberCentre sur la productivité et la prospérité, HEC Montréal, 2013
In this report, we estimate the contribution of labour reallocation to productivity growth in the Canadian manufacturing sector. We find that most of productivity growth comes from within-firm improvements, leaving a limited role for labour reallocation. Still, we also find that the importance of labour reallocation increases over time. This is due both to increasing net-entry and inter-firm effects. These effects are much more important post-2000 than in the 1990s. We also find that lost production from exiting firms is now most likely replaced by production from existing firms, while previously, it was more likely to be replaced by production from new firms. (French only)
@techreport{BerubeDostieVilhuber2013, abstract = {In this report, we estimate the contribution of labour reallocation to productivity growth in the Canadian manufacturing sector. We find that most of productivity growth comes from within-firm improvements, leaving a limited role for labour reallocation. Still, we also find that the importance of labour reallocation increases over time. This is due both to increasing net-entry and inter-firm effects. These effects are much more important post-2000 than in the 1990s. We also find that lost production from exiting firms is now most likely replaced by production from existing firms, while previously, it was more likely to be replaced by production from new firms. (French only)}, author = {Bérubé, Charles and Dostie, Benoit and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Centre sur la productivité et la prospérité, HEC Montréal}, title = {Estimation de la contribution de la réallocation de la main-d'oeuvre à la croissance de la productivité au {Canada}}, url = {http://cpp.hec.ca/cms/assets/documents/recherches_publiees/CH_2012_01.pdf}, year = {2013}, } - Methods for Protecting the Confidentiality of Firm-Level Data: Issues and SolutionsLars VilhuberLabor Dynamics Institute 19, Mar 2013
This report provides an overview of methods used by statistical agencies to encourage, support, and enhance research access to data for the purpose of generating new knowledge. Quite a few reports and scientific articles have addressed the issue before, and we are highly indebted to that literature. To a summary of that literature, we add some recent developments and experiences derived from a decade of working with systems that increase access, as both researchers and data providers. The report focuses on data provided by statistical agencies, but it should be understood that government agencies other than a National Statistical Office (NSO) may assume that function. Setting aside the legal background limiting or permitting such data collection and provision, we highlight some alternate sources and methods before concluding.
@techreport{vilhuber2013a, abstract = {This report will provide an overview of methods used by statistical agencies to encourage, support, and enhance research access to data for the purpose of generating new knowledge. Quite a few reports and scientific articles have addressed the issue before, and we will be highly indebted to that literature. To a summary of that literature, we hope to provide some recent developments and experiences derived from a decade of working with systems that increase access as both researchers as well as data providers. The report will focus on the data provided by statistical agencies, but it should be understood that government agencies other than a National Statistical Office (NSO) may acquire that function. While excluding the legal background limiting or permitting such data collection and provision, we will highlight some alternate sources and methods, prior to concluding.}, author = {Vilhuber, Lars}, copyright = {All rights reserved}, institution = {Labor Dynamics Institute}, month = mar, number = {19}, title = {Methods for {Protecting} the {Confidentiality} of {Firm}-{Level} {Data}: {Issues} and {Solutions}}, url = {https://hdl.handle.net/1813/89089}, year = {2013}, month_numeric = {3} } - Dynamically Consistent Noise Infusion and Partially Synthetic Data as Confidentiality Protection Measures for Related Time SeriesJohn M Abowd, R Kaj Kaj Gittings, Kevin L McKinney, and 3 more authorsU.S. Census Bureau, Center for Economic Studies 12-13, Oct 2012
The Census Bureau’s Quarterly Workforce Indicators (QWI) provide detailed quarterly statistics on employment measures such as worker and job flows, tabulated by
@techreport{abowd2012b, abstract = {The Census Bureau's Quarterly Workforce Indicators (QWI) provide detailed quarterly statistics on employment measures such as worker and job flows, tabulated by}, author = {Abowd, John M and Gittings, R Kaj Kaj and McKinney, Kevin L and Stephens, Bryce and Vilhuber, Lars and Woodcock, Simon D}, copyright = {Public Domain}, doi = {10.2139/ssrn.2159800}, institution = {U.S. Census Bureau, Center for Economic Studies}, month = oct, number = {12-13}, title = {Dynamically {Consistent} {Noise} {Infusion} and {Partially} {Synthetic} {Data} as {Confidentiality} {Protection} {Measures} for {Related} {Time} {Series}}, url = {https://www.census.gov/library/working-papers/2012/adrm/ces-wp-12-13.html}, year = {2012}, month_numeric = {10} } - Did the Housing Price Bubble Clobber Local Labor Market Job and Worker Flows When It Burst?John M Abowd and Lars VilhuberLabor Dynamics Institute, Working Paper 89078, Jan 2012
We use the Census Bureau’s Quarterly Workforce Indicators and the Federal Housing Finance Agency’s House Price Indices to study the effects of the housing price bubble on local labor markets. We show that the 35 MSAs in the top decile of the house price boom were most severely impacted. Their stable job employment fell much more than the national average. Their real wage rates did not fall as fast as the national average. Accessions fell much faster than average while separations were constant. Job creations fell substantially while destructions rose slightly.
@techreport{abowdvilhuber2012-ldi, abstract = {We use the Census Bureau's Quarterly Workforce Indicators and the Federal Housing Finance Agency's House Price Indices to study the effects of the housing price bubble on local labor markets. We show that the 35 MSAs in the top decile of the house price boom were most severely impacted. Their stable job employment fell much more than the national average. Their real wage rates did not fall as fast as the national average. Accessions fell much faster than average while separations were constant. Job creations fell substantially while destructions rose slightly.}, author = {Abowd, John M and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Labor Dynamics Institute}, month = jan, number = {89078}, title = {Did the {Housing} {Price} {Bubble} {Clobber} {Local} {Labor} {Market} {Job} and {Worker} {Flows} {When} {It} {Burst}?}, type = {Working {Paper}}, url = {https://hdl.handle.net/1813/89078}, year = {2012}, month_numeric = {1} } - New York State Disability and Employment Status Report, 2011Sarah Von Schrader, William Erickson, Thomas Golden, and 1 more authorCornell University, Employment and Disability Institute, Report on behalf of New York Makes Work Pay Comprehensive Employment System Medicaid Infrastructure Grant, 2011
@techreport{Employment2011, author = {Von Schrader, Sarah and Erickson, William and Golden, Thomas and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Cornell University, Employment and Disability Institute}, title = {New {York} {State} {Disability} and {Employment} {Status} {Report}, 2011}, type = {Report on behalf of {New} {York} {Makes} {Work} {Pay} {Comprehensive} {Employment} {System} {Medicaid} {Infrastructure} {Grant}}, url = {http://ilr-edi-r1.ilr.cornell.edu/nymakesworkpay/docs/Report_Card_2011/NYS%20Report%20Card%202011.pdf}, urldate = {2014-04-10}, year = {2011}, } - LEHD Infrastructure files in the Census RDC-Overview S2008Kevin L. McKinney and Lars VilhuberU.S. Census Bureau, Working Papers 11-43, 2011
@techreport{mckinney2011, author = {McKinney, Kevin L. and Vilhuber, Lars}, copyright = {Public Domain}, institution = {U.S. Census Bureau}, number = {11-43}, title = {{LEHD} {Infrastructure} files in the {Census} {RDC}-{Overview} {S2008}}, type = {Working {Papers}}, url = {https://ideas.repec.org/p/cen/wpaper/11-43.html}, year = {2011}, } - LEHD Infrastructure Files in the Census RDC: Overview of S2004 SnapshotKevin L McKinney and Lars VilhuberU.S. Census Bureau, Working Papers 11-13, Apr 2011
The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, has built a set o
@techreport{mckinney2011a, abstract = {The Longitudinal Employer-Household Dynamics (LEHD) Program at the U.S. Census Bureau, with the support of several national research agencies, has built a set o}, author = {McKinney, Kevin L and Vilhuber, Lars}, copyright = {Public Domain}, institution = {U.S. Census Bureau}, month = apr, number = {11-13}, title = {{LEHD} {Infrastructure} {Files} in the {Census} {RDC}: {Overview} of {S2004} {Snapshot}}, type = {Working {Papers}}, url = {https://ideas.repec.org/p/cen/wpaper/11-13.html}, year = {2011}, month_numeric = {4} } - National estimates of gross employment and job flows from the Quarterly Workforce Indicators with demographic and industry detail (with color graphs)John M. Abowd and Lars VilhuberCenter for Economic Studies, U.S. Census Bureau, Working Papers 10-11, Jun 2010
The Quarterly Workforce Indicators (QWI) are local labor market data produced and released every quarter by the United States Census Bureau. Unlike any other local labor market series produced in the U.S. or the rest of the world, the QWI measure employment flows for workers (accessions and separations), jobs (creations and destructions) and earnings for demographic subgroups (age and gender), economic industry (NAICS industry groups), detailed geography (block (experimental), county, Core-Based Statistical Area, and Workforce Investment Area), and ownership (private, all) with fully interacted publication tables. The current QWI data cover 47 states, about 98% of the private workforce in those states, and about 92% of all private employment in the entire economy. State participation is sufficiently extensive to permit us to present the first national estimates constructed from these data. We focus on worker, job, and excess (churning) reallocation rates, rather than on levels of the basic variables. This permits comparison to existing series from the Job Openings and Labor Turnover Survey and the Business Employment Dynamics Series from the Bureau of Labor Statistics. The national estimates from the QWI are an important enhancement to existing series because they include demographic and industry detail for both worker and job flow data compiled from underlying micro-data that have been integrated at the job and establishment levels by the Longitudinal Employer-Household Dynamics Program at the Census Bureau. The estimates presented herein were compiled exclusively from public-use data series and are available for download.
@techreport{ces-wp-10-11, abstract = {The Quarterly Workforce Indicators (QWI) are local labor market data produced and released every quarter by the United States Census Bureau. Unlike any other local labor market series produced in the U.S. or the rest of the world, the QWI measure employment flows for workers (accessions and separations), jobs (creations and destructions) and earnings for demographic subgroups (age and gender), economic industry (NAICS industry groups), detailed geography (block (experimental), county, Core-Based Statistical Area, and Workforce Investment Area), and ownership (private, all) with fully interacted publication tables. The current QWI data cover 47 states, about 98\% of the private workforce in those states, and about 92\% of all private employment in the entire economy. State participation is sufficiently extensive to permit us to present the first national estimates constructed from these data. We focus on worker, job, and excess (churning) reallocation rates, rather than on levels of the basic variables. This permits comparison to existing series from the Job Openings and Labor Turnover Survey and the Business Employment Dynamics Series from the Bureau of Labor Statistics. The national estimates from the QWI are an important enhancement to existing series because they include demographic and industry detail for both worker and job flow data compiled from underlying micro-data that have been integrated at the job and establishment levels by the Longitudinal Employer-Household Dynamics Program at the Census Bureau. The estimates presented herein were compiled exclusively from public-use data series and are available for download.}, author = {Abowd, John M. and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {Center for Economic Studies, U.S. Census Bureau}, month = jun, number = {10-11}, title = {National estimates of gross employment and job flows from the {Quarterly} {Workforce} {Indicators} with demographic and industry detail (with color graphs)}, type = {Working {Papers}}, url = {http://ideas.repec.org/p/cen/wpaper/10-11.html}, year = {2010}, month_numeric = {6} } - New York State Disability and Employment Status Report, 2009Sarah Von Schrader, William Erickson, Lars Vilhuber, and 1 more authorCornell University, Employment and Disability Institute, Report on behalf of New York Makes Work Pay Comprehensive Employment System Medicaid Infrastructure Grant, 2010
@techreport{Employment2009, author = {Von Schrader, Sarah and Erickson, William and Vilhuber, Lars and Golden, Thomas}, copyright = {All Rights Reserved (Free to Read)}, institution = {Cornell University, Employment and Disability Institute}, title = {New {York} {State} {Disability} and {Employment} {Status} {Report}, 2009}, type = {Report on behalf of {New} {York} {Makes} {Work} {Pay} {Comprehensive} {Employment} {System} {Medicaid} {Infrastructure} {Grant}}, url = {http://digitalcommons.ilr.cornell.edu/edicollect/1282/}, urldate = {2014-04-10}, year = {2010}, } - Measuring firm-level displacement events with administrative dataLars VilhuberWorkshop on Measurement Error in Administrative Data, 2010
@techreport{Vilhuber2010, address = {Mannheim, Germany}, author = {Vilhuber, Lars}, institution = {Workshop on Measurement Error in Administrative Data}, title = {Measuring firm-level displacement events with administrative data}, year = {2010}, } - Adjusting imperfect data: Overview and case studiesLars VilhuberNBER, Working paper 12977, 2007
@techreport{Vilhuber2007, author = {Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.3386/w12977}, institution = {NBER}, number = {12977}, title = {Adjusting imperfect data: {Overview} and case studies}, type = {Working paper}, url = {http://www.nber.org/papers/w12977}, year = {2007}, } - Confidentiality Protection in the Census Bureau’s Quarterly Workforce IndicatorsJohn M. Abowd, Bryce E. Stephens, and Lars VilhuberU.S. Census Bureau, LEHD and Cornell University, presented at the Joint Statistical Meetings 2005, Minneapolis, MN. 2006-02, 2005
@techreport{AbowdEtAl2005b, author = {Abowd, John M. and Stephens, Bryce E. and Vilhuber, Lars}, copyright = {Public Domain}, institution = {U.S. Census Bureau, LEHD and Cornell University}, number = {2006-02}, title = {Confidentiality {Protection} in the {Census} {Bureau}'s {Quarterly} {Workforce} {Indicators}}, type = {presented at the {Joint} {Statistical} {Meetings} 2005, {Minneapolis}, {MN}.}, url = {https://econpapers.repec.org/paper/centpaper/2006-02.htm}, year = {2005}, } - Adjusting imperfect data: Overview and case studiesLars VilhuberLEHD, Technical paper TP-2004-05, 2004
@techreport{tp-2004-05, author = {Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, institution = {LEHD}, number = {TP-2004-05}, title = {Adjusting imperfect data: {Overview} and case studies}, type = {Technical paper}, url = {https://econpapers.repec.org/paper/centpaper/2004-05.htm}, year = {2004}, } - Abandoning the sinking ship: The composition of worker flows prior to displacementPaul A. Lengermann and Lars VilhuberLEHD, U.S. Census Bureau, Technical paper TP-2002-11, 2002
@techreport{tp-2002-11, author = {Lengermann, Paul A. and Vilhuber, Lars}, copyright = {Public Domain}, institution = {LEHD, U.S. Census Bureau}, number = {TP-2002-11}, title = {Abandoning the sinking ship: {The} composition of worker flows prior to displacement}, type = {Technical paper}, url = {https://econpapers.repec.org/paper/centpaper/2002-11.htm}, year = {2002}, } - The creation of the employment dynamics estimatesJohn M. Abowd, Paul A. Lengermann, and Lars VilhuberLEHD, U.S. Census Bureau, Technical paper TP-2002-13, 2002
@techreport{tp-2002-13, author = {Abowd, John M. and Lengermann, Paul A. and Vilhuber, Lars}, copyright = {Public Domain}, institution = {LEHD, U.S. Census Bureau}, number = {TP-2002-13}, title = {The creation of the employment dynamics estimates}, type = {Technical paper}, url = {https://ideas.repec.org/p/cen/tpaper/2002-13.html}, year = {2002}, } - The sensitivity of economic statistics to coding errors in personal identifiersJohn M. Abowd and Lars VilhuberLEHD, U.S. Census Bureau, Technical paper TP-2002-17, 2002
In this article we describe the sensitivity of small-cell flow statistics to coding errors in the identity of the underlying entities. Specifically, we present results based on a comparison of the U.S. Census Bureau’s Quarterly Workforce Indicators before and after correcting for such errors in Social Security Number-based identifiers in the underlying individual wage records. The correction used involves a novel application of existing statistical matching techniques. It is found that even a very conservative correction procedure has a sizable impact on the statistics. The average bias ranges from .25% up to 15% for flow statistics, and up to 5% for payroll aggregates.
@techreport{tp-2002-17, abstract = {In this article we describe the sensitivity of small-cell flow statistics to coding errors in the identity of the underlying entities. Specifically, we present results based on a comparison of the U.S. Census Bureau's Quarterly Workforce Indicators before and after correcting for such errors in Social Security Number-based identifiers in the underlying individual wage records. The correction used involves a novel application of existing statistical matching techniques. It is found that even a very conservative correction procedure has a sizable impact on the statistics. The average bias ranges from .25\% up to 15\% for flow statistics, and up to 5\% for payroll aggregates.}, author = {Abowd, John M. and Vilhuber, Lars}, copyright = {Public Domain}, institution = {LEHD, U.S. Census Bureau}, number = {TP-2002-17}, title = {The sensitivity of economic statistics to coding errors in personal identifiers}, type = {Technical paper}, url = {https://econpapers.repec.org/paper/centpaper/2002-17.htm}, year = {2002}, } - Displaced workers, early leavers, and re-employment wagesAudra Bowlus and Lars VilhuberLEHD, U.S. Census Bureau, Technical paper TP-2002-18, 2002
@techreport{tp-2002-18, author = {Bowlus, Audra and Vilhuber, Lars}, copyright = {Public Domain}, institution = {LEHD, U.S. Census Bureau}, number = {TP-2002-18}, title = {Displaced workers, early leavers, and re-employment wages}, type = {Technical paper}, url = {https://econpapers.repec.org/paper/centpaper/2002-18.htm}, year = {2002}, } - Escaping poverty for low-wage workers: The role of employer characteristics and changesHarry J. Holzer, Julia I. Lane, Lars Vilhuber, and 2 more authorsLEHD, U.S. Census Bureau, Technical paper TP-2001-02, 2001
@techreport{tp-2001-02, author = {Holzer, Harry J. and Lane, Julia I. and Vilhuber, Lars and Jackson, Henry and Putnam, George}, copyright = {Public Domain}, institution = {LEHD, U.S. Census Bureau}, number = {TP-2001-02}, title = {Escaping poverty for low-wage workers: {The} role of employer characteristics and changes}, type = {Technical paper}, url = {https://econpapers.repec.org/paper/centpaper/2001-02.htm}, year = {2001}, } - Longitudinal analysis of SSN response on SIPP 1990-1993 panelsLars Vilhuber and Robert PedaceLEHD, U.S. Census Bureau, Technical paper TP-2000-01, 2000
@techreport{tp-2000-01, author = {Vilhuber, Lars and Pedace, Robert}, copyright = {Public Domain}, institution = {LEHD, U.S. Census Bureau}, number = {TP-2000-01}, title = {Longitudinal analysis of {SSN} response on {SIPP} 1990-1993 panels}, type = {Technical paper}, url = {https://econpapers.repec.org/paper/centpaper/2000-01.htm}, year = {2000}, } - Continuous Training and sectoral mobility in GermanyLars VilhuberCIRANO, Scientific Series 99s-03, 1999
This article studies mobility patterns of German workers in light of a model of sector-specific human capital. Furthermore, I employ and describe little-used data on continuous on-the-job training occurring after apprenticeships. Results are presented describing the incidence and duration of continuous training. Continuous training is quite common, despite the high incidence of apprenticeships which precedes this part of a worker’s career. Most previous studies have only distinguished between firm-specific and general human capital, generally concluding that training was general. Inconsistent with those conclusions, I show that German men are more likely to find a job within the same sector if they have received continuous training in that sector. These results are similar to results obtained for young U.S. workers, and suggest that sector-specific capital is an important feature of very different labor markets. Furthermore, the results suggest that the observed effect of training on mobility is sensitive to the state of the business cycle, indicating a more complex interaction between supply and demand than most theoretical models allow for.
@techreport{Vilhuber99b, abstract = {This article studies mobility patterns of German workers in light of a model of sector-specific human capital. Furthermore, I employ and describe little-used data on continuous on-the-job training occurring after apprenticeships. Results are presented describing the incidence and duration of continuous training. Continuous training is quite common, despite the high incidence of apprenticeships which precedes this part of a worker's career. Most previous studies have only distinguished between firm-specific and general human capital, generally concluding that training was general. Inconsistent with those conclusions, I show that German men are more likely to find a job within the same sector if they have received continuous training in that sector. These results are similar to results obtained for young U.S. workers, and suggest that sector-specific capital is an important feature of very different labor markets. Furthermore, the results suggest that the observed effect of training on mobility is sensitive to the state of the business cycle, indicating a more complex interaction between supply and demand than most theoretical models allow for.}, author = {Vilhuber, Lars}, institution = {CIRANO}, number = {99s-03}, title = {Continuous {Training} and sectoral mobility in {Germany}}, type = {Scientific {Series}}, year = {1999}, } - Sector-specific on-the-job training: Evidence from U.S. dataLars VilhuberCIRANO, Scientific Series 97s-42, 1997
Using data from the National Longitudinal Survey of Youth (NLSY), we re-examine the effect of formal on-the-job training on mobility patterns of young American workers. By employing parametric duration models, we evaluate the economic impact of training on productive time with an employer. Confirming previous studies, we find a positive and statistically significant impact of formal on-the-job training on tenure with the employer providing the training. However, expected duration net of the time spent in the training program is generally not significantly increased. We proceed to document and analyze intra-sectoral and cross-sectoral mobility patterns in order to infer whether training provides firm-specific, industry-specific, or general human capital. The econometric analysis rejects a sequential model of job separation in favor of a competing risks specification. We find significant evidence for the industry-specificity of training. The probability of sectoral mobility upon job separation decreases with training received in the current industry, whether with the last employer or previous employers, and employment attachment increases with on-the-job training. These results are robust to a number of variations on the base model.
@techreport{Vilhuber97a, abstract = {Using data from the National Longitudinal Survey of Youth (NLSY), we re-examine the effect of formal on-the-job training on mobility patterns of young American workers. By employing parametric duration models, we evaluate the economic impact of training on productive time with an employer. Confirming previous studies, we find a positive and statistically significant impact of formal on-the-job training on tenure with the employer providing the training. However, expected duration net of the time spent in the training program is generally not significantly increased. We proceed to document and analyze intra-sectoral and cross-sectoral mobility patterns in order to infer whether training provides firm-specific, industry-specific, or general human capital. The econometric analysis rejects a sequential model of job separation in favor of a competing risks specification. We find significant evidence for the industry-specificity of training. The probability of sectoral mobility upon job separation decreases with training received in the current industry, whether with the last employer or previous employers, and employment attachment increases with on-the-job training. These results are robust to a number of variations on the base model.}, author = {Vilhuber, Lars}, institution = {CIRANO}, number = {97s-42}, title = {Sector-specific on-the-job training: {Evidence} from {U}.{S}. data}, type = {Scientific {Series}}, year = {1997}, } - Wage flexibility and contract structure in GermanyLars VilhuberCIRANO, Scientific Series 96s-28, 1996
In this paper, we look at how labor market conditions at different points during the tenure of individuals with firms are correlated with current earnings. Using data from the German Socioeconomic Panel on individuals for the period 1984 to 1994, we find that both the contemporaneous unemployment rate and prior values of the unemployment rate are significantly correlated with current earnings, contrary to results for the American labor market. We interpret this result as evidence that German unions do in fact bargain over both wages and employment, but that the models of individualistic contracts, such as the implicit contract model, may explain some of the observed wage drift and longer-term wage movements reasonably well. Furthermore, we explore the heterogeneity of contracts over a variety of worker and job characteristics. In particular, we find evidence that contracts differ across industries and across firm size. Workers of large firms are remarkably more insulated from the job market than workers for any other type of firm, indicating the importance of internal job markets.
@techreport{Vilhuber96, abstract = {In this paper, we look at how labor market conditions at different points during the tenure of individuals with firms are correlated with current earnings. Using data from the German Socioeconomic Panel on individuals for the period 1984 to 1994, we find that both the contemporaneous unemployment rate and prior values of the unemployment rate are significantly correlated with current earnings, contrary to results for the American labor market. We interpret this result as evidence that German unions do in fact bargain over both wages and employment, but that the models of individualistic contracts, such as the implicit contract model, may explain some of the observed wage drift and longer-term wage movements reasonably well. Furthermore, we explore the heterogeneity of contracts over a variety of worker and job characteristics. In particular, we find evidence that contracts differ across industries and across firm size. Workers of large firms are remarkably more insulated from the job market than workers for any other type of firm, indicating the importance of internal job markets.}, author = {Vilhuber, Lars}, institution = {CIRANO}, number = {96s-28}, title = {Wage flexibility and contract structure in {Germany}}, type = {Scientific {Series}}, year = {1996}, }
Conference Papers (30)
- Report of the AEA Data EditorLars Vilhuber and Jack CavanaghIn AEA Papers and Proceedings, May 2025
@inproceedings{vilhuber2025, author = {Vilhuber, Lars and Cavanagh, Jack}, booktitle = {{AEA} {Papers} and {Proceedings}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.1257/pandp.115.944}, issn = {2574-0768, 2574-0776}, language = {en}, month = may, pages = {944--957}, title = {Report of the {AEA} {Data} {Editor}}, url = {https://pubs.aeaweb.org/doi/10.1257/pandp.115.944}, urldate = {2025-12-01}, volume = {115}, year = {2025}, month_numeric = {5} } - Report of the AEA Data EditorLars VilhuberIn AEA Papers and Proceedings, May 2024
@inproceedings{10.1257/pandp.114.878, author = {Vilhuber, Lars}, booktitle = {{AEA} {Papers} and {Proceedings}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.1257/pandp.114.878}, issn = {2574-0768, 2574-0776}, language = {en}, month = may, pages = {878--890}, title = {Report of the {AEA} {Data} {Editor}}, url = {https://pubs.aeaweb.org/doi/10.1257/pandp.114.878}, urldate = {2024-08-22}, volume = {114}, year = {2024}, month_numeric = {5} } - Report of the AEA Data EditorLars VilhuberIn AEA Papers and Proceedings, May 2023
@inproceedings{10.1257/pandp.113.850, author = {Vilhuber, Lars}, booktitle = {{AEA} {Papers} and {Proceedings}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.1257/pandp.113.850}, issn = {2574-0768, 2574-0776}, language = {en}, month = may, pages = {850--863}, title = {Report of the {AEA} {Data} {Editor}}, url = {https://pubs.aeaweb.org/doi/10.1257/pandp.113.850}, urldate = {2023-11-26}, volume = {113}, year = {2023}, month_numeric = {5} } - Report by the AEA Data EditorLars VilhuberIn AEA Papers and Proceedings, May 2022
@inproceedings{10.1257/pandp.112.813, author = {Vilhuber, Lars}, booktitle = {{AEA} {Papers} and {Proceedings}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.1257/pandp.112.813}, issn = {2574-0768, 2574-0776}, language = {en}, month = may, pages = {813--23}, title = {Report by the {AEA} {Data} {Editor}}, volume = {112}, year = {2022}, month_numeric = {5} } - Report by the AEA Data EditorLars VilhuberIn AEA Papers and Proceedings, May 2021
@inproceedings{10.1257/pandp.111.808, author = {Vilhuber, Lars}, booktitle = {{AEA} {Papers} and {Proceedings}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.1257/pandp.111.808}, issn = {2574-0768, 2574-0776}, language = {en}, month = may, pages = {808--817}, title = {Report by the {AEA} {Data} {Editor}}, url = {https://pubs.aeaweb.org/doi/10.1257/pandp.111.808}, urldate = {2021-05-20}, volume = {111}, year = {2021}, month_numeric = {5} } - Report by the AEA Data EditorLars Vilhuber, James Turitto, and Keesler WelchIn AEA Papers and Proceedings, May 2020
@inproceedings{10.1257/pandp.110.764, author = {Vilhuber, Lars and Turitto, James and Welch, Keesler}, booktitle = {{AEA} {Papers} and {Proceedings}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.1257/pandp.110.764}, issn = {2574-0768, 2574-0776}, language = {en}, month = may, pages = {764--75}, title = {Report by the {AEA} {Data} {Editor}}, volume = {110}, year = {2020}, month_numeric = {5} } - Report by the AEA Data EditorLars VilhuberIn AEA Papers and Proceedings, May 2019
@inproceedings{10.1257/pandp.109.718, author = {Vilhuber, Lars}, booktitle = {{AEA} {Papers} and {Proceedings}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.1257/pandp.109.718}, issn = {2574-0768, 2574-0776}, language = {en}, month = may, pages = {718--29}, title = {Report by the {AEA} {Data} {Editor}}, url = {http://www.aeaweb.org/articles?id=10.1257/pandp.109.718}, urldate = {2019-09-21}, volume = {109}, year = {2019}, month_numeric = {5} } - Why the Economics Profession Must Actively Participate in the Privacy Protection DebateJohn M. Abowd, Ian M. Schmutte, William N. Sexton, and 1 more authorIn AEA Papers and Proceedings, May 2019
When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality; a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.
@inproceedings{abowd2019t, abstract = {When Google or the U.S. Census Bureau publish detailed statistics on browsing habits or neighborhood characteristics, some privacy is lost for everybody while supplying public information. To date, economists have not focused on the privacy loss inherent in data publication. In their stead, these issues have been advanced almost exclusively by computer scientists who are primarily interested in technical problems associated with protecting privacy. Economists should join the discussion, first, to determine where to balance privacy protection against data quality; a social choice problem. Furthermore, economists must ensure new privacy models preserve the validity of public data for economic research.}, author = {Abowd, John M. and Schmutte, Ian M. and Sexton, William N. and Vilhuber, Lars}, booktitle = {{AEA} {Papers} and {Proceedings}}, copyright = {All rights reserved}, doi = {10.1257/pandp.20191106}, month = may, pages = {397--402}, title = {Why the {Economics} {Profession} {Must} {Actively} {Participate} in the {Privacy} {Protection} {Debate}}, url = {https://www.aeaweb.org/articles?id=10.1257/pandp.20191106}, volume = {109}, year = {2019}, month_numeric = {5} } - Making Confidential Data Part of Reproducible ResearchLars VilhuberIn Methods to Foster Transparency and Reproducibility of Federal Statistics: Proceedings of a Workshop, 2019
In 2014 the National Science Foundation (NSF) provided support to the National Academies of Sciences, Engineering, and Medicine for a series of Forums on Open Science in response to a government-wide directive to support increased public access to the results of research funded by the federal government. However, the breadth of the work resulting from the series precluded a focus on any specific topic or discussion about how to improve public access. Thus, the main goal of the Workshop on Transparency and Reproducibility in Federal Statistics was to develop some understanding of what principles and practices are, or would be, supportive of making federal statistics more understandable and reviewable, both by agency staff and the public. This publication summarizes the presentations and discussions from the workshop.
@inproceedings{vilhuber2019a, abstract = {In 2014 the National Science Foundation (NSF) provided support to the National Academies of Sciences, Engineering, and Medicine for a series of Forums on Open Science in response to a government-wide directive to support increased public access to the results of research funded by the federal government. However, the breadth of the work resulting from the series precluded a focus on any specific topic or discussion about how to improve public access. Thus, the main goal of the Workshop on Transparency and Reproducibility in Federal Statistics was to develop some understanding of what principles and practices are, or would be, supportive of making federal statistics more understandable and reviewable, both by agency staff and the public. This publication summarizes the presentations and discussions from the workshop.}, address = {Washington, DC}, author = {Vilhuber, Lars}, booktitle = {Methods to {Foster} {Transparency} and {Reproducibility} of {Federal} {Statistics}: {Proceedings} of a {Workshop}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.17226/25305}, editor = {{National Academies of Sciences, Engineering, and Medicine}}, isbn = {978-0-309-48629-3}, pages = {63--66}, publisher = {The National Academies Press}, title = {Making {Confidential} {Data} {Part} of {Reproducible} {Research}}, url = {https://www.nap.edu/catalog/25305/methods-to-foster-transparency-and-reproducibility-of-federal-statistics-proceedings}, year = {2019}, } - Synthetic data via quantile regression for heavy-tailed and heteroskedastic dataMichelle Pistner, Aleksandra Slavković, and Lars VilhuberIn Privacy in statistical databases, 2018See https://github.com/labordynamicsinstitute/replication_qr_synthetic for replication code.
@inproceedings{PistnerSlavkovicVilhuber:PSD:2018, author = {Pistner, Michelle and Slavković, Aleksandra and Vilhuber, Lars}, booktitle = {Privacy in statistical databases}, copyright = {All rights reserved}, doi = {10.1007/978-3-319-99771-1_7}, editor = {Domingo-Ferrer, Josep and Montes, Francisco}, note = {See https://github.com/labordynamicsinstitute/replication\_qr\_synthetic for replication code.}, title = {Synthetic data via quantile regression for heavy-tailed and heteroskedastic data}, url = {https://doi.org/10.1007/978-3-319-99771-1_7}, year = {2018}, } - Utility Cost of Formal Privacy for Releasing National Employer-Employee StatisticsSamuel Haney, Ashwin Machanavajjhala, John M. Abowd, and 3 more authorsIn Proceedings of the 2017 International Conference on Management of Data, 2017
National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter ε ≥ 1, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.
@inproceedings{haney2017, abstract = {National statistical agencies around the world publish tabular summaries based on combined employer-employee (ER-EE) data. The privacy of both individuals and business establishments that feature in these data are protected by law in most countries. These data are currently released using a variety of statistical disclosure limitation (SDL) techniques that do not reveal the exact characteristics of particular employers and employees, but lack provable privacy guarantees limiting inferential disclosures. In this work, we present novel algorithms for releasing tabular summaries of linked ER-EE data with formal, provable guarantees of privacy. We show that state-of-the-art differentially private algorithms add too much noise for the output to be useful. Instead, we identify the privacy requirements mandated by current interpretations of the relevant laws, and formalize them using the Pufferfish framework. We then develop new privacy definitions that are customized to ER-EE data and satisfy the statutory privacy requirements. We implement the experiments in this paper on production data gathered by the U.S. Census Bureau. An empirical evaluation of utility for these data shows that for reasonable values of the privacy-loss parameter $\epsilon \geq 1$, the additive error introduced by our provably private algorithms is comparable, and in some cases better, than the error introduced by existing SDL techniques that have no provable privacy guarantees. For some complex queries currently published, however, our algorithms do not have utility comparable to the existing traditional SDL algorithms. Those queries are fodder for future research.}, address = {Chicago, Illinois, USA}, author = {Haney, Samuel and Machanavajjhala, Ashwin and Abowd, John M. 
and Graham, Matthew and Kutzbach, Mark and Vilhuber, Lars}, booktitle = {Proceedings of the 2017 {International} {Conference} on {Management} of {Data}}, copyright = {CC BY Attribution 4.0 International}, doi = {10.1145/3035918.3035940}, isbn = {978-1-4503-4197-4}, language = {en}, pages = {1339--1354}, publisher = {ACM}, series = {{SIGMOD} '17}, title = {Utility {Cost} of {Formal} {Privacy} for {Releasing} {National} {Employer}-{Employee} {Statistics}}, url = {http://doi.acm.org/10.1145/3035918.3035940}, year = {2017}, grant = {SES-1131848}, } - Proceedings from the 2016 NSF–Sloan Workshop on Practical PrivacyLars Vilhuber, Ian M. Schmutte, and John M. AbowdIn 2016 NSF–Sloan Workshop on Practical Privacy, 2017
@inproceedings{vilhuber2017f, author = {Vilhuber, Lars and Schmutte, Ian M. and Abowd, John M.}, booktitle = {2016 {NSF}–{Sloan} {Workshop} on {Practical} {Privacy}}, copyright = {All Rights Reserved (Free to Read)}, language = {en}, publisher = {Cornell University Labor Dynamics Institute Document}, title = {Proceedings from the 2016 {NSF}–{Sloan} {Workshop} on {Practical} {Privacy}}, url = {https://hdl.handle.net/1813/89100}, year = {2017}, } - CED2AR: The Comprehensive Extensible Data Documentation and Access Repository.Carl Lagoze, Lars Vilhuber, Jeremy Williams, and 2 more authorsIn ACM/IEEE Joint Conference on Digital Libraries (JCDL 2014), 2014
We describe the design, implementation, and deployment of the Comprehensive Extensible Data Documentation and Access Repository (CED2AR). This is a metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata through either a web-based user interface or programmatically through a search API, all the while re-using and linking to existing archive and provider generated metadata. CED2AR is distinguished from other metadata repository-based applications due to requirements that derive from its social science context. These include the need to cloak confidential data and metadata and manage complex provenance chains.
@inproceedings{lagoze2014, abstract = {We describe the design, implementation, and deployment of the Comprehensive Extensible Data Documentation and Access Repository (CED2AR). This is a metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata through either a web-based user interface or programmatically through a search API, all the while re-using and linking to existing archive and provider generated metadata. CED2AR is distinguished from other metadata repository-based applications due to requirements that derive from its social science context. These include the need to cloak confidential data and metadata and manage complex provenance chains.}, address = {London, United Kingdom}, author = {Lagoze, Carl and Vilhuber, Lars and Williams, Jeremy and Perry, Benjamin and Block, William C.}, booktitle = {{ACM}/{IEEE} {Joint} {Conference} on {Digital} {Libraries} ({JCDL} 2014)}, copyright = {All rights reserved}, doi = {10.1109/JCDL.2014.6970178}, language = {en}, pages = {267--276}, title = {{CED2AR}: {The} {Comprehensive} {Extensible} {Data} {Documentation} and {Access} {Repository}.}, url = {http://dx.doi.org/10.1109/JCDL.2014.6970178}, year = {2014}, } - Using partially synthetic data to replace suppression in the business dynamics statistics: Early resultsJavier Miranda and Lars VilhuberIn Privacy in statistical databases, 2014
The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.
@inproceedings{psd2014a, abstract = {The Business Dynamics Statistics is a product of the U.S. Census Bureau that provides measures of business openings and closings, and job creation and destruction, by a variety of cross-classifications (firm and establishment age and size, industrial sector, and geography). Sensitive data are currently protected through suppression. However, as additional tabulations are being developed, at ever more detailed geographic levels, the number of suppressions increases dramatically. This paper explores the option of providing public-use data that are analytically valid and without suppressions, by leveraging synthetic data to replace observations in sensitive cells.}, author = {Miranda, Javier and Vilhuber, Lars}, booktitle = {Privacy in statistical databases}, doi = {10.1007/978-3-319-11257-2_18}, editor = {Domingo-Ferrer, Josep}, isbn = {978-3-319-11256-5}, pages = {232--242}, publisher = {Springer International Publishing}, series = {Lecture notes in computer science}, title = {Using partially synthetic data to replace suppression in the business dynamics statistics: {Early} results}, url = {http://dx.doi.org/10.1007/978-3-319-11257-2_18}, volume = {8744}, year = {2014}, grant = {SES-1131848}, } - Synthetic longitudinal business databases for international comparisonsJörg Drechsler and Lars VilhuberIn Privacy in statistical databases, 2014
International comparison studies on economic activity are often hampered by the fact that access to business microdata is very limited on an international level. A recently launched project tries to overcome these limitations by improving access to Business Censuses from multiple countries based on synthetic data. Starting from the synthetic version of the longitudinally edited version of the U.S. Business Register (the Longitudinal Business Database, LBD), the idea is to create similar data products in other countries by applying the synthesis methodology developed for the LBD to generate synthetic replicates that could be distributed without confidentiality concerns. In this paper we present some first results of this project based on German business data collected at the Institute for Employment Research.
@inproceedings{psd2014b, abstract = {International comparison studies on economic activity are often hampered by the fact that access to business microdata is very limited on an international level. A recently launched project tries to overcome these limitations by improving access to Business Censuses from multiple countries based on synthetic data. Starting from the synthetic version of the longitudinally edited version of the U.S. Business Register (the Longitudinal Business Database, LBD), the idea is to create similar data products in other countries by applying the synthesis methodology developed for the LBD to generate synthetic replicates that could be distributed without confidentiality concerns. In this paper we present some first results of this project based on German business data collected at the Institute for Employment Research.}, author = {Drechsler, Jörg and Vilhuber, Lars}, booktitle = {Privacy in statistical databases}, copyright = {All rights reserved}, doi = {10.1007/978-3-319-11257-2_19}, editor = {Domingo-Ferrer, Josep}, isbn = {978-3-319-11256-5}, pages = {243--252}, publisher = {Springer International Publishing}, series = {Lecture notes in computer science}, title = {Synthetic longitudinal business databases for international comparisons}, url = {http://dx.doi.org/10.1007/978-3-319-11257-2_19}, volume = {8744}, year = {2014}, grant = {SES-1131848}, } - Replicating the Synthetic LBD with German establishment dataJörg Drechsler and Lars VilhuberIn Proceedings 59th ISI world statistics congress, 25-30 august 2013, hong kong (session STS062), 2013
@inproceedings{ISI2013-3, author = {Drechsler, Jörg and Vilhuber, Lars}, booktitle = {Proceedings 59th {ISI} world statistics congress, 25-30 {August} 2013, {Hong} {Kong} (session {STS062})}, isbn = {978-90-73592-34-6}, pages = {2291--2296}, title = {Replicating the {Synthetic} {LBD} with {German} establishment data}, url = {http://2013.isiproceedings.org}, urldate = {2014-03-24}, year = {2013}, } - Encoding Provenance Metadata for Social Science Datasets.Carl Lagoze, Jeremy Williams, and Lars VilhuberIn Metadata and Semantics Research. Communications in Computer and Information Science, 2013
@inproceedings{lagoze2013, author = {Lagoze, Carl and Williams, Jeremy and Vilhuber, Lars}, booktitle = {Metadata and {Semantics} {Research}. {Communications} in {Computer} and {Information} {Science}}, copyright = {All rights reserved}, doi = {10.1007/978-3-319-03437-9_13}, editor = {Garoufallou, Emmanouel and Greenberg, Jane}, isbn = {978-3-319-03436-2}, language = {en}, pages = {123--134}, series = {Communications in {Computer} and {Information} {Science}}, title = {Encoding {Provenance} {Metadata} for {Social} {Science} {Datasets}.}, volume = {390}, year = {2013}, grant = {SES-1131848}, } - Encoding provenance of social science data: Integrating PROV with DDICarl Lagoze, William C. Block, Jeremy Williams, and 1 more authorIn 5th annual european DDI user conference, 2013
Provenance is a key component of evaluating the integrity and reusability of data for scholarship. While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface.
@inproceedings{LagozeEtAl2013, abstract = {Provenance is a key component of evaluating the integrity and reusability of data for scholarship. While recording and providing access to provenance has always been important, it is even more critical in the web environment in which data from distributed sources and of varying integrity can be combined and derived. The PROV model, developed under the auspices of the W3C, is a foundation for semantically-rich, interoperable, and web-compatible provenance metadata. We report on the results of our experimentation with integrating the PROV model into the DDI metadata for a complex, but characteristic, example of social science data. We also present some preliminary thinking on how to visualize those graphs in the user interface.}, author = {Lagoze, Carl and Block, William C. and Williams, Jeremy and Vilhuber, Lars}, booktitle = {5th annual european {DDI} user conference}, title = {Encoding provenance of social science data: {Integrating} {PROV} with {DDI}}, url = {https://eddi-conferences.eu/assets/pdf/eddi-2013-program.pdf}, year = {2013}, } - A Proposed Solution to the Archiving and Curation of Confidential Scientific InputsJ.M. Abowd, Lars Vilhuber, and William C. BlockIn Privacy in Statistical Databases, 2012
We develop the core of a method for solving the data archive and curation problem that confronts the custodians of restricted-access research data and the scientific users of such data. Our solution recognizes the dual protections afforded by physical security and access limitation protocols. It is based on extensible tools and can be easily incorporated into existing instructional materials.
@inproceedings{abowd2012c, abstract = {We develop the core of a method for solving the data archive and curation problem that confronts the custodians of restricted-access research data and the scientific users of such data. Our solution recognizes the dual protections afforded by physical security and access limitation protocols. It is based on extensible tools and can be easily incorporated into existing instructional materials.}, author = {Abowd, J.M. and Vilhuber, Lars and Block, William C.}, booktitle = {Privacy in {Statistical} {Databases}}, copyright = {All rights reserved}, doi = {10.1007/978-3-642-33627-0_17}, number = {978-3-642-33626-3}, pages = {216--225}, publisher = {Springer Berlin Heidelberg}, series = {Lecture {Notes} in {Computer} {Science}}, title = {A {Proposed} {Solution} to the {Archiving} and {Curation} of {Confidential} {Scientific} {Inputs}}, url = {https://doi.org/10.1007/978-3-642-33627-0_17}, volume = {7556}, year = {2012}, grant = {SES-1131848}, } - Using linked employer-employee data to investigate the speed of adjustment in downsizing firms in Canada and the USBenoit Dostie, Kevin L. McKinney, and Lars VilhuberIn International census research data center conference, Oct 2009
When firms are faced with a demand shock, adjustment can take many forms. Firms can adjust physical capital, human capital, or both. The speed of adjustment may differ as well: costs of adjustment, the type of shock, the legal and economic environment all matter. In this paper, we focus on firms that downsized between 1992 and 1997, but ultimately survive, and investigate how the human capital distribution within a firm influences the speed of adjustment, ceteris paribus. In other words, when do firms use mass layoffs instead of attrition to adjust the level of employment. We combine worker-level wage records and measures of human capital with firm-level characteristics of the production function, and use levels and changes in these variables to characterize the choice of adjustment method and speed. Firms are described/compared up to 9 years prior to death. We also consider how workers fare after leaving downsizing firms, and analyze if observed differences in post-separation outcomes of workers provide clues to the choice of adjustment speed.
@inproceedings{DostieMcKinneyVilhuber2009, abstract = {When firms are faced with a demand shock, adjustment can take many forms. Firms can adjust physical capital, human capital, or both. The speed of adjustment may differ as well: costs of adjustment, the type of shock, the legal and economic environment all matter. In this paper, we focus on firms that downsized between 1992 and 1997, but ultimately survive, and investigate how the human capital distribution within a firm influences the speed of adjustment, \textit{ceteris paribus}. In other words, when do firms use mass layoffs instead of attrition to adjust the level of employment. We combine worker-level wage records and measures of human capital with firm-level characteristics of the production function, and use levels and changes in these variables to characterize the choice of adjustment method and speed. Firms are described/compared up to 9 years prior to death. We also consider how workers fare after leaving downsizing firms, and analyze if observed differences in post-separation outcomes of workers provide clues to the choice of adjustment speed.}, address = {Ithaca, NY}, author = {Dostie, Benoit and McKinney, Kevin L. and Vilhuber, Lars}, booktitle = {International census research data center conference}, month = oct, publisher = {U.S. Census Bureau, LEHD and Cornell University}, series = {Conference on {Research} in {Income} and {Wealth}}, title = {Using linked employer-employee data to investigate the speed of adjustment in downsizing firms in {Canada} and the {US}}, year = {2009}, month_numeric = {10} } - How protective are synthetic dataJohn Abowd and Lars VilhuberIn Privacy in statistical databases, Sep 2008
@inproceedings{AbowdVilhuber2008, author = {Abowd, John and Vilhuber, Lars}, booktitle = {Privacy in statistical databases}, copyright = {All rights reserved}, doi = {10.1007/978-3-540-87471-3_20}, editor = {Domingo-Ferrer, Josep and Saygın, Yücel}, month = sep, pages = {239--246}, publisher = {Springer Berlin Heidelberg}, series = {Lecture notes in computer science}, title = {How protective are synthetic data}, url = {https://doi.org/10.1007/978-3-540-87471-3_20}, volume = {5262}, year = {2008}, month_numeric = {9} } - Privacy: Theory meets practice on the mapAshwin Machanavajjhala, Daniel Kifer, John M Abowd, and 2 more authorsIn Proceedings of the International Conference on Data Engineering, 2008
@inproceedings{machanavajjhala2008, author = {Machanavajjhala, Ashwin and Kifer, Daniel and Abowd, John M and Gehrke, Johannes and Vilhuber, Lars}, booktitle = {Proceedings of the {International} {Conference} on {Data} {Engineering}}, copyright = {All rights reserved}, doi = {10.1109/ICDE.2008.4497436}, pages = {277--286}, title = {Privacy: {Theory} meets practice on the map}, url = {http://dx.doi.org/10.1109/ICDE.2008.4497436}, year = {2008}, } - Using linked employer-employee data to investigate the speed of adjustment in downsizing firmsKevin L. McKinney and Lars VilhuberIn Conference on the analysis of firms and employees (CAFE), Sep 2006
When firms are faced with a demand shock, adjustment can take many forms. Firms can adjust physical capital, human capital, or both. The speed of adjustment may differ as well: costs of adjustment, the type of shock, the legal and economic environment all matter. In this paper, we focus on firms that downsized between 1992 and 1997, but ultimately survive, and investigate how the human capital distribution within a firm influences the speed of adjustment, ceteris paribus. In other words, when do firms use mass layoffs instead of attrition to adjust the level of employment. We combine worker-level wage records and measures of human capital with firm-level characteristics of the production function, and use levels and changes in these variables to characterize the choice of adjustment method and speed. Firms are described/compared up to 9 years prior to death. We also consider how workers fare after leaving downsizing firms, and analyze if observed differences in post-separation outcomes of workers provide clues to the choice of adjustment speed.
@inproceedings{McKinneyVilhuber2006, abstract = {When firms are faced with a demand shock, adjustment can take many forms. Firms can adjust physical capital, human capital, or both. The speed of adjustment may differ as well: costs of adjustment, the type of shock, the legal and economic environment all matter. In this paper, we focus on firms that downsized between 1992 and 1997, but ultimately survive, and investigate how the human capital distribution within a firm influences the speed of adjustment, \textit{ceteris paribus}. In other words, when do firms use mass layoffs instead of attrition to adjust the level of employment. We combine worker-level wage records and measures of human capital with firm-level characteristics of the production function, and use levels and changes in these variables to characterize the choice of adjustment method and speed. Firms are described/compared up to 9 years prior to death. We also consider how workers fare after leaving downsizing firms, and analyze if observed differences in post-separation outcomes of workers provide clues to the choice of adjustment speed.}, address = {Nuremberg, Germany}, author = {McKinney, Kevin L. and Vilhuber, Lars}, booktitle = {Conference on the analysis of firms and employees ({CAFE})}, month = sep, publisher = {U.S. Census Bureau, LEHD and Cornell University}, series = {Conference on {Research} in {Income} and {Wealth}}, title = {Using linked employer-employee data to investigate the speed of adjustment in downsizing firms}, url = {https://econpapers.repec.org/paper/centpaper/2006-03.htm}, year = {2006}, month_numeric = {9} }
Book Chapters (15)
- Improving Privacy for Respondents in Randomized Controlled Trials: A Differential Privacy ApproachSoumya Mukherjee, Aratrika Mustafi, Aleksandra Slavković, and 1 more authorIn Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches and New Findings, 2026Longer version: https://doi.org/10.48550/arXiv.2309.14581 (v1). Corrected version: https://doi.org/10.48550/ARXIV.2309.14581 (v2).
Randomized control trials, RCTs, have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold-standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the response data collected, de-identified and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of strong privacy-preservation methodology (with differential privacy (DP) guarantees) on published analyses from RCTs, leveraging the availability of replication packages (research compendia) in economics and policy analysis. We provide simulation studies and demonstrate how we can replicate the analysis in a published economics article on privacy-protected data under various parametrizations. We find that relatively straightforward DP-based methods allow for inference-valid protection of the published data, though computational issues may limit more complex analyses from using these methods. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.
@incollection{mukherjee2026-nber, abstract = {Randomized control trials, RCTs, have become a powerful tool for assessing the impact of interventions and policies in many contexts. They are considered the gold-standard for inference in the biomedical fields and in many social sciences. Researchers have published an increasing number of studies that rely on RCTs for at least part of the inference, and these studies typically include the response data collected, de-identified and sometimes protected through traditional disclosure limitation methods. In this paper, we empirically assess the impact of strong privacy-preservation methodology (with differential privacy (DP) guarantees) on published analyses from RCTs, leveraging the availability of replication packages (research compendia) in economics and policy analysis. We provide simulation studies and demonstrate how we can replicate the analysis in a published economics article on privacy-protected data under various parametrizations. We find that relatively straightforward DP-based methods allow for inference-valid protection of the published data, though computational issues may limit more complex analyses from using these methods. The results have applicability to researchers wishing to share RCT data, especially in the context of low- and middle-income countries, with strong privacy protection.}, author = {Mukherjee, Soumya and Mustafi, Aratrika and Slavković, Aleksandra and Vilhuber, Lars}, booktitle = {Data {Privacy} {Protection} and the {Conduct} of {Applied} {Research}: {Methods}, {Approaches} and {New} {Findings}}, copyright = {All rights reserved}, editor = {Hotz, V. Joseph and Gong, Ruobin and Schmutte, Ian M}, note = {Longer version: https://doi.org/10.48550/arXiv.2309.14581 (v1). 
Corrected version: https://doi.org/10.48550/ARXIV.2309.14581 (v2).}, publisher = {University of Chicago Press / National Bureau of Economic Research}, title = {Improving {Privacy} for {Respondents} in {Randomized} {Controlled} {Trials}: {A} {Differential} {Privacy} {Approach}}, url = {http://arxiv.org/abs/2309.14581}, volume = {forthcoming}, year = {2026}, } - Using Containers for Analysis Validation at ScaleLars VilhuberIn Data Privacy Protection and the Conduct of Applied Research: Methods, Approaches and New Findings, 2026For expanded version, see https://doi.org/10.1162/99608f92.4d1853ce
@incollection{vilhuber-nber-2026, author = {Vilhuber, Lars}, booktitle = {Data {Privacy} {Protection} and the {Conduct} of {Applied} {Research}: {Methods}, {Approaches} and {New} {Findings}}, editor = {Hotz, V. Joseph and Gong, Ruobin and Schmutte, Ian M}, note = {For expanded version, see https://doi.org/10.1162/99608f92.4d1853ce}, publisher = {University of Chicago Press / National Bureau of Economic Research}, series = {{NBER}}, title = {Using {Containers} for {Analysis} {Validation} at {Scale}}, volume = {forthcoming}, year = {2026}, } - Protecting Confidential Data through Non-Statistical MethodsLars VilhuberIn Handbook of sharing confidential data: differential privacy, secure multiparty computation, and synthetic data, 2024
"Statistical agencies, research organizations, companies, and other data stewards that seek to share data with the public face a challenging dilemma. They need to protect the privacy and confidentiality of data subjects and their attributes while providing data products that are useful for their intended purposes. In an age when information on data subjects is available from a wide range of data sources, as are the computational resources to obtain that information, this challenge is increasingly difficult. The Handbook of Sharing Confidential Data helps data stewards understand how tools from the data confidentiality literature (specifically, synthetic data, formal privacy, and secure computation) can be used to manage trade-offs in disclosure risk and data usefulness."
@incollection{vilhuber2024a, abstract = {"Statistical agencies, research organizations, companies, and other data stewards that seek to share data with the public face a challenging dilemma. They need to protect the privacy and confidentiality of data subjects and their attributes while providing data products that are useful for their intended purposes. In an age when information on data subjects is available from a wide range of data sources, as are the computational resources to obtain that information, this challenge is increasingly difficult. The Handbook of Sharing Confidential Data helps data stewards understand how tools from the data confidentiality literature, specifically synthetic data, formal privacy, and secure computation, can be used to manage trade-offs in disclosure risk and data usefulness."}, address = {Boca Raton, FL}, author = {Vilhuber, Lars}, booktitle = {Handbook of sharing confidential data: differential privacy, secure multiparty computation, and synthetic data}, copyright = {All rights reserved}, edition = {First edition}, editor = {Drechsler, Jörg and Kifer, Daniel and Reiter, Jerome P and Slavković, Aleksandra}, isbn = {978-1-032-02803-3 978-1-032-02807-1}, publisher = {CRC Press}, series = {Chapman \& {Hall}/{CRC} handbooks of modern statistical methods}, title = {Protecting {Confidential} {Data} through {Non}-{Statistical} {Methods}}, year = {2024}, } - Disclosure Limitation and Confidentiality Protection in Linked Data. John M. Abowd, Ian M. Schmutte, and Lars Vilhuber. In Administrative Records for Survey Methodology, Apr 2021
Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For each example, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.
@incollection{abowd2021, abstract = {Confidentiality protection for linked administrative data is a combination of access modalities and statistical disclosure limitation. We review traditional statistical disclosure limitation methods and newer methods based on synthetic data, input noise infusion and formal privacy. We discuss how these methods are integrated with access modalities by providing three detailed examples. The first example is the linkages in the Health and Retirement Study to Social Security Administration data. The second example is the linkage of the Survey of Income and Program Participation to administrative data from the Internal Revenue Service and the Social Security Administration. The third example is the Longitudinal Employer-Household Dynamics data, which links state unemployment insurance records for workers and firms to a wide variety of censuses and surveys at the U.S. Census Bureau. For each example, we discuss access modalities, disclosure limitation methods, the effectiveness of those methods, and the resulting analytical validity. The final sections discuss recent advances in access modalities for linked administrative data.}, author = {Abowd, John M. and Schmutte, Ian M. and Vilhuber, Lars}, booktitle = {Administrative {Records} for {Survey} {Methodology}}, copyright = {All rights reserved}, editor = {Chun, Asaph Young and Larsen, Michael D.}, isbn = {978-1-119-27204-5}, month = apr, publisher = {Wiley}, series = {Survey {Research} {Methods} \& {Sampling}}, title = {Disclosure {Limitation} and {Confidentiality} {Protection} in {Linked} {Data}}, url = {https://www.wiley.com/en-ca/Administrative+Records+for+Survey+Methodology-p-9781119272045}, year = {2021}, month_numeric = {4} } - Using Administrative Data for Research and Evidence-Based Policy: An Introduction. Shawn Cole, Iqbal Dhaliwal, Anja Sautmann, and 1 more author. In Handbook on Using Administrative Data for Research and Evidence-based Policy, Jan 2021
@incollection{cole2021a, author = {Cole, Shawn and Dhaliwal, Iqbal and Sautmann, Anja and Vilhuber, Lars}, booktitle = {Handbook on {Using} {Administrative} {Data} for {Research} and {Evidence}-based {Policy}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.31485/admindatahandbook.1.0}, editor = {Cole, Shawn and Dhaliwal, Iqbal and Sautmann, Anja and Vilhuber, Lars}, isbn = {978-1-7360216-0-6}, language = {en}, month = jan, pages = {1--36}, publisher = {Abdul Latif Jameel Poverty Action Lab}, title = {Using {Administrative} {Data} for {Research} and {Evidence}-{Based} {Policy}: {An} {Introduction}}, url = {https://admindatahandbook.mit.edu/print/v1.0/handbook_ch1_Introduction.pdf}, urldate = {2021-04-08}, year = {2021}, grant = {G-2019-11391}, month_numeric = {1} } - Balancing Privacy and Data Usability: An Overview of Disclosure Avoidance Methods. Ian M. Schmutte and Lars Vilhuber. In Handbook on Using Administrative Data for Research and Evidence-based Policy, Jan 2021
The Five Safes framework (safe projects, safe people, safe settings, safe data, and safe outputs) is one way of thinking about security of different aspects of a project, and is used throughout the Handbook and in research with administrative data. Within the Five Safes framework, data providers need to create safe data that can be provided to trusted safe people for use within safe settings, as part of safe projects. Finally, any findings that are shared publicly must be safe outputs. The processes used to create safe data and safe outputs (manipulations that render data less sensitive and therefore more appropriate for public release) are generally referred to as statistical disclosure limitation (SDL). This chapter describes techniques traditionally used within the field of SDL, pointing at methods as well as metrics to assess the resultant statistical quality and sensitivity of the data, and offers technical guidance applicable to any data provider or researcher looking for practical tools to apply to their own data to reduce the risk to privacy.
@incollection{schmutte2021, abstract = {The Five Safes framework (safe projects, safe people, safe settings, safe data, and safe outputs) is one way of thinking about security of different aspects of a project, and is used throughout the Handbook and in research with administrative data. Within the Five Safes framework, data providers need to create safe data that can be provided to trusted safe people for use within safe settings, as part of safe projects. Finally, any findings that are shared publicly must be safe outputs. The processes used to create safe data and safe outputs (manipulations that render data less sensitive and therefore more appropriate for public release) are generally referred to as statistical disclosure limitation (SDL). This chapter describes techniques traditionally used within the field of SDL, pointing at methods as well as metrics to assess the resultant statistical quality and sensitivity of the data, and offers technical guidance applicable to any data provider or researcher looking for practical tools to apply to their own data to reduce the risk to privacy.}, author = {Schmutte, Ian M. and Vilhuber, Lars}, booktitle = {Handbook on {Using} {Administrative} {Data} for {Research} and {Evidence}-based {Policy}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.31485/admindatahandbook.1.0}, editor = {Cole, Shawn and Dhaliwal, Iqbal and Sautmann, Anja and Vilhuber, Lars}, isbn = {978-1-7360216-0-6}, language = {en}, month = jan, pages = {145--172}, publisher = {Abdul Latif Jameel Poverty Action Lab}, title = {Balancing {Privacy} and {Data} {Usability}: {An} {Overview} of {Disclosure} {Avoidance} {Methods}}, url = {https://admindatahandbook.mit.edu/print/v1.0/handbook_ch5_SDL.pdf}, urldate = {2021-04-08}, year = {2021}, grant = {G-2019-11391}, month_numeric = {1} } - Physically Protecting Sensitive Data. Jim Shen and Lars Vilhuber. In Handbook on Using Administrative Data for Research and Evidence-based Policy, Jan 2021
Keeping sensitive data safe relies heavily on the physical environments in which data are stored, processed, transmitted, and accessed, and from which researchers can access computers that store and process the data. However, it is also the setting that is most dependent on rapidly evolving technology. The chapter provides a snapshot of the technologies available and in use as of 2020, and characterizes the technologies along a multi-dimensional scale, allowing for some comparability across methods.
@incollection{shen2021, abstract = {Keeping sensitive data safe relies heavily on the physical environments in which data are stored, processed, transmitted, and accessed, and from which researchers can access computers that store and process the data. However, it is also the setting that is most dependent on rapidly evolving technology. The chapter provides a snapshot of the technologies available and in use as of 2020, and characterizes the technologies along a multi-dimensional scale, allowing for some comparability across methods.}, author = {Shen, Jim and Vilhuber, Lars}, booktitle = {Handbook on {Using} {Administrative} {Data} for {Research} and {Evidence}-based {Policy}}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.31485/admindatahandbook.1.0}, editor = {Cole, Shawn and Dhaliwal, Iqbal and Sautmann, Anja and Vilhuber, Lars}, isbn = {978-1-7360216-0-6}, language = {en}, month = jan, pages = {37--84}, publisher = {Abdul Latif Jameel Poverty Action Lab}, title = {Physically {Protecting} {Sensitive} {Data}}, url = {https://admindatahandbook.mit.edu/print/v1.0/handbook_ch2_Physical-protection.pdf}, urldate = {2021-04-08}, year = {2021}, grant = {G-2019-11391}, month_numeric = {1} } - The LEHD Infrastructure Files and the Creation of the Quarterly Workforce Indicators. J M Abowd, B E Stephens, L Vilhuber, and 4 more authors. In Producer Dynamics: New Evidence from Micro Data, 2009
@incollection{abowd2009a, author = {Abowd, J M and Stephens, B E and Vilhuber, L and Andersson, F and McKinney, K L and Roemer, M and Woodcock, S D}, booktitle = {Producer {Dynamics}: {New} {Evidence} from {Micro} {Data}}, copyright = {All Rights Reserved (Free to Read)}, editor = {Dunne, T. and Jensen, J. B. and Roberts, M. J.}, isbn = {978-0-226-17256-9}, publisher = {University of Chicago Press}, title = {The {LEHD} {Infrastructure} {Files} and the {Creation} of the {Quarterly} {Workforce} {Indicators}}, url = {http://www.nber.org/chapters/c0485}, year = {2009}, } - The link between human capital, mass layoffs, and firm deaths. John M. Abowd, Kevin L. McKinney, and Lars Vilhuber. In Producer dynamics: New evidence from micro data, 2009
@incollection{AbowdEtAl2009c, author = {Abowd, John M. and McKinney, Kevin L. and Vilhuber, Lars}, booktitle = {Producer dynamics: {New} evidence from micro data}, copyright = {All Rights Reserved (Free to Read)}, editor = {Dunne, Timothy and Jensen, J. Bradford and Roberts, Mark J.}, isbn = {978-0-226-17256-9}, publisher = {University of Chicago Press}, title = {The link between human capital, mass layoffs, and firm deaths}, url = {http://www.nber.org/chapters/c0497/}, year = {2009}, } - Adjusting imperfect data: Overview and case studies. Lars Vilhuber. In The structure of wages: An international comparison, Jan 2009
@incollection{NBERc2366, author = {Vilhuber, Lars}, booktitle = {The structure of wages: {An} international comparison}, copyright = {All Rights Reserved (Free to Read)}, editor = {Lazear, Edward P. and Shaw, Kathryn L.}, month = jan, pages = {59--80}, publisher = {University of Chicago Press / National Bureau of Economic Research}, title = {Adjusting imperfect data: {Overview} and case studies}, url = {http://www.nber.org/chapters/c2366}, year = {2009}, month_numeric = {1} } - How did universal primary education affect returns to education and labor market participation in Uganda? Lisa Dragoset and Lars Vilhuber. In Youth in Africa’s labor market, 2008
@incollection{DragosetVilhuber2008, address = {1818 H ST NW, WASHINGTON, DC 20433 USA}, author = {Dragoset, Lisa and Vilhuber, Lars}, booktitle = {Youth in {Africa}'s labor market}, copyright = {All rights reserved}, doi = {10.1596/978-0-8213-6884-8_ch11}, editor = {Garcia, M and Fares, J}, isbn = {978-0-8213-6885-5}, pages = {263--280}, publisher = {WORLD BANK INST}, title = {How did universal primary education affect returns to education and labor market participation in {Uganda}?}, year = {2008}, } - Early career experiences and later career outcomes: An International Comparison. David N. Margolis, Erik Plug, Véronique Simonnet, and 1 more author. In Human capital over the life cycle - A European perspective, 2004. Section: 5
@incollection{MargolisEtAl2004, author = {Margolis, David N. and Plug, Erik and Simonnet, Véronique and Vilhuber, Lars}, booktitle = {Human capital over the life cycle - {A} {European} perspective}, copyright = {All rights reserved}, editor = {Sofer, Catherine}, note = {Section: 5}, pages = {90--117}, publisher = {Edward Elgar}, title = {Early career experiences and later career outcomes: {An} {International} {Comparison}}, year = {2004}, }
Books (2)
- Handbook on Using Administrative Data for Research and Evidence-based Policy. Shawn Cole, Iqbal Dhaliwal, Anja Sautmann, and 1 more author. Jan 2021
The Handbook serves as a go-to reference for researchers seeking to use administrative data and for data providers looking to make their data accessible for research. The handbook is published online under an open licensing model and freely available to all. It provides information, best practices, and case studies on how to create privacy-protected access to, handle, and analyze administrative data, with the aim of pushing the research frontier as well as informing evidence-based policy innovations.
@book{cole2021, abstract = {The Handbook serves as a go-to reference for researchers seeking to use administrative data and for data providers looking to make their data accessible for research. The handbook is published online under an open licensing model and freely available to all. It provides information, best practices, and case studies on how to create privacy-protected access to, handle, and analyze administrative data, with the aim of pushing the research frontier as well as informing evidence-based policy innovations.}, author = {Cole, Shawn and Dhaliwal, Iqbal and Sautmann, Anja and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.31485/admindatahandbook.1.0}, isbn = {978-1-7360216-0-6}, language = {en}, month = jan, publisher = {Abdul Latif Jameel Poverty Action Lab}, title = {Handbook on {Using} {Administrative} {Data} for {Research} and {Evidence}-based {Policy}}, url = {https://admindatahandbook.mit.edu/}, urldate = {2021-04-08}, year = {2021}, grant = {G-2019-11391}, month_numeric = {1} }
Presentations (2)
- The Reproducibility of Economics Research: A Case Study. Hautahi Kingi, Flavio Stanchi, Lars Vilhuber, and 1 more author. 2018
@techreport{Kingi2018, address = {Berkeley, CA}, author = {Kingi, Hautahi and Stanchi, Flavio and Vilhuber, Lars and Herbert, Sylverie}, copyright = {All rights reserved}, title = {The {Reproducibility} of {Economics} {Research}: {A} {Case} {Study}}, type = {Presentation}, url = {https://osf.io/srg57/}, year = {2018}, } - Usage and outcomes of the Synthetic Data Server. Lars Vilhuber and John Abowd. 2016
@misc{vilhuber2016a, author = {Vilhuber, Lars and Abowd, John}, copyright = {All rights reserved}, language = {en}, title = {Usage and outcomes of the {Synthetic} {Data} {Server}}, type = {Presentation}, url = {http://hdl.handle.net/1813/43883}, urldate = {2018-07-22}, year = {2016}, }
Other Publications (25)
- Code and Data for: Reproducibility and Open Science in Economics. Lars Vilhuber. Dec 2025
@misc{vilhuber2025-rp, author = {Vilhuber, Lars}, copyright = {Creative Commons Attribution 4.0 International}, doi = {10.5281/ZENODO.17957200}, month = dec, publisher = {Zenodo}, shorttitle = {Code and {Data} for}, title = {Code and {Data} for: {Reproducibility} and {Open} {Science} in {Economics}}, url = {https://zenodo.org/doi/10.5281/zenodo.17957200}, urldate = {2026-03-26}, year = {2025}, month_numeric = {12} } - Replication Data and Code for: Reproduce to Validate: a Comprehensive Study on the Reproducibility of Economics Research. Sylvérie Herbert, Hautahi Kingi, Flavio Stanchi, and 1 more author. 2024
Journals have pushed for transparency of research through data availability policies. Such data policies improve availability of data and code, but what is the impact on reproducibility? We present results from a large reproduction exercise for articles published in the American Economic Journal: Applied Economics, which has had a data availability policy since its inception in 2009. This replication package contains all data and code to reproduce the tables in the paper. Please see the ReadMe file for additional details.
@misc{herbertetal-data-2024, abstract = {Journals have pushed for transparency of research through data availability policies. Such data policies improve availability of data and code, but what is the impact on reproducibility? We present results from a large reproduction exercise for articles published in the American Economic Journal: Applied Economics, which has had a data availability policy since its inception in 2009. This replication package contains all data and code to reproduce the tables in the paper. Please see the ReadMe file for additional details.}, author = {Herbert, Sylvérie and Kingi, Hautahi and Stanchi, Flavio and Vilhuber, Lars}, collaborator = {Vilhuber, Lars}, copyright = {BSD, CC0, CC-BY-NC-4.0}, doi = {10.5683/SP3/GJVVLI}, publisher = {Borealis}, shorttitle = {Replication {Data} and {Code} for}, title = {Replication {Data} and {Code} for: {Reproduce} to {Validate}: a {Comprehensive} {Study} on the {Reproducibility} of {Economics} {Research}}, urldate = {2026-03-26}, year = {2024}, } - The Reproducibility of Economics Research: A Case Study. Sylverie Herbert, Hautahi Kingi, Flavio Stanchi, and 1 more author. 2023
Given the importance of reproducibility for the scientific ethos, more and more journals have pushed for transparency of research through data availability policies. If the introduction and implementation of such data policies improve the availability of researchers’ code and data, what is the impact on reproducibility? We describe and present the results of a large reproduction exercise in which we assess the reproducibility of research articles published in the American Economic Journal: Applied Economics, which has implemented a data availability policy since 2005. Our replication success rate is relatively moderate, with 37.78% of replication attempts successful. 68 of 162 eligible replication attempts successfully replicated the article’s analysis (41.98%) conditional on non-confidential data. A further 69 (42.59%) were at least partially successful. A total of 98 out of 303 (32.34%) relied on confidential or proprietary data, and were thus not reproducible by this project. We also conduct several bibliometric analyses of reproducible vs. non-reproducible articles and show that replicable papers do not provide citation bonuses for authors.
@misc{herbert2023, abstract = {Given the importance of reproducibility for the scientific ethos, more and more journals have pushed for transparency of research through data availability policies. If the introduction and implementation of such data policies improve the availability of researchers' code and data, what is the impact on reproducibility? We describe and present the results of a large reproduction exercise in which we assess the reproducibility of research articles published in the American Economic Journal: Applied Economics, which has implemented a data availability policy since 2005. Our replication success rate is relatively moderate, with 37.78\% of replication attempts successful. 68 of 162 eligible replication attempts successfully replicated the article's analysis (41.98\%) conditional on non-confidential data. A further 69 (42.59\%) were at least partially successful. A total of 98 out of 303 (32.34\%) relied on confidential or proprietary data, and were thus not reproducible by this project. We also conduct several bibliometric analyses of reproducible vs. non-reproducible articles and show that replicable papers do not provide citation bonuses for authors.}, author = {Herbert, Sylverie and Kingi, Hautahi and Stanchi, Flavio and Vilhuber, Lars}, copyright = {All Rights Reserved (Free to Read)}, doi = {10.2139/ssrn.4325149}, language = {en}, shorttitle = {The {Reproducibility} of {Economics} {Research}}, title = {The {Reproducibility} of {Economics} {Research}: {A} {Case} {Study}}, type = {Banque de {France} {Working} {Paper}}, url = {https://papers.ssrn.com/abstract=4325149}, urldate = {2023-11-26}, year = {2023}, } - labordynamicsinstitute/metajelo-ui: v0.1.1. Brandon Elam Barker and Lars Vilhuber. Feb 2021
Targets the upcoming release of Metajelo v0.9. Updates from previous release: describe CI/build process change title of page to not be purescript-web-xpath move related identifiers to beginning in preview (metajelo-web) replace travis badge with github actions badge describe that CI builds the product using the github actions list ancillary repositories in metajelo-ui repositories with unversioned zenodo badges
@misc{barker2021, abstract = {Targets the upcoming release of Metajelo v0.9. Updates from previous release: describe CI/build process change title of page to not be purescript-web-xpath move related identifiers to beginning in preview (metajelo-web) replace travis badge with github actions badge describe that CI builds the product using the github actions list ancillary repositories in metajelo-ui repositories with unversioned zenodo badges}, author = {Barker, Brandon Elam and Vilhuber, Lars}, copyright = {BSD-3-Clause}, doi = {10.5281/zenodo.4509001}, month = feb, publisher = {Zenodo}, shorttitle = {labordynamicsinstitute/metajelo-ui}, title = {labordynamicsinstitute/metajelo-ui: v0.1.1}, url = {https://doi.org/10.5281/zenodo.4509001}, urldate = {2021-04-02}, year = {2021}, month_numeric = {2} } - labordynamicsinstitute/metajelo-web: v2.0.0. Brandon Elam Barker and Lars Vilhuber. Feb 2021
Targets the upcoming release of Metajelo v0.9.
@misc{barker2021b, abstract = {Targets the upcoming release of Metajelo v0.9.}, author = {Barker, Brandon Elam and Vilhuber, Lars}, copyright = {BSD-3-Clause}, doi = {10.5281/zenodo.4507862}, month = feb, publisher = {Zenodo}, shorttitle = {labordynamicsinstitute/metajelo-web}, title = {labordynamicsinstitute/metajelo-web: v2.0.0}, url = {https://doi.org/10.5281/zenodo.4507862}, urldate = {2021-04-02}, year = {2021}, month_numeric = {2} } - Applying Data Synthesis for Longitudinal Business Data across Three Countries [data and code]. M. Jahangir Alam, Benoit Dostie, Jörg Drechsler, and 1 more author. May 2020
Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information than for any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to provide broader access to such datasets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (Longitudinal Employment Analysis Program (LEAP)) and Germany (Establishment History Panel (BHP)). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.
@misc{alamApplyingDataSynthesis2020, abstract = {Data on businesses collected by statistical agencies are challenging to protect. Many businesses have unique characteristics, and distributions of employment, sales, and profits are highly skewed. Attackers wishing to conduct identification attacks often have access to much more information than for any individual. As a consequence, most disclosure avoidance mechanisms fail to strike an acceptable balance between usefulness and confidentiality protection. Detailed aggregate statistics by geography or detailed industry classes are rare, public-use microdata on businesses are virtually nonexistent, and access to confidential microdata can be burdensome. Synthetic microdata have been proposed as a secure mechanism to publish microdata, as part of a broader discussion of how to provide broader access to such datasets to researchers. In this article, we document an experiment to create analytically valid synthetic data, using the exact same model and methods previously employed for the United States, for data from two different countries: Canada (Longitudinal Employment Analysis Program (LEAP)) and Germany (Establishment History Panel (BHP)). We assess utility and protection, and provide an assessment of the feasibility of extending such an approach in a cost-effective way to other data.}, author = {Alam, M. Jahangir and Dostie, Benoit and Drechsler, Jörg and Vilhuber, Lars}, copyright = {Creative Commons Attribution Non Commercial 4.0 International, Open Access}, doi = {10.5281/ZENODO.3832173}, language = {en}, month = may, publisher = {Zenodo}, title = {Applying {Data} {Synthesis} for {Longitudinal} {Business} {Data} across {Three} {Countries} [data and code]}, url = {https://zenodo.org/record/3832173}, urldate = {2026-03-29}, year = {2020}, month_numeric = {5} } - Replication code and data for: Recalculating ... How Uncertainty in Local Labor Market Definitions Affects Empirical Findings. Andrew D. Foote, Mark J. Kutzbach, and Lars Vilhuber. Oct 2020
This repository contains the code and data to replicate all the analyses in our paper "Recalculating ... : How Uncertainty in Local Labor Market Definitions Affects Empirical Findings." Some of the data can also be used in other researchers’ analyses to investigate the robustness of their results when they use commuting zones to aggregate or collect data.
@misc{footeetal-rp, abstract = {This repository contains the code and data to replicate all the analyses in our paper "Recalculating ... : How Uncertainty in Local Labor Market Definitions Affects Empirical Findings." Some of the data can also be used in other researchers' analyses to investigate the robustness of their results when they use commuting zones to aggregate or collect data.}, author = {Foote, Andrew D. and Kutzbach, Mark J. and Vilhuber, Lars}, copyright = {Creative Commons Attribution 4.0 International, Open Access}, doi = {10.5281/ZENODO.4072428}, month = oct, publisher = {Zenodo}, shorttitle = {Replication code and data for}, title = {Replication code and data for: {Recalculating} ... {How} {Uncertainty} in {Local} {Labor} {Market} {Definitions} {Affects} {Empirical} {Findings}}, url = {https://zenodo.org/record/4072428}, urldate = {2026-03-26}, year = {2020}, month_numeric = {10} } - Data for: Requesting replication materials via email. Lars Vilhuber. Nov 2020
@misc{vilhuber2020b, author = {Vilhuber, Lars}, copyright = {Open Access}, doi = {10.5281/ZENODO.4267155}, month = nov, publisher = {Labor Dynamics Institute}, title = {Data for: {Requesting} replication materials via email}, url = {https://doi.org/10.5281/ZENODO.4267155}, urldate = {2020-11-11}, year = {2020}, month_numeric = {11} } - labordynamicsinstitute/readin_qcew_sas: A sequence of programs to read in QCEW data from the Bureau of Labor Statistics. Lars Vilhuber and Melissa Bjelland. Jun 2020
These programs download and read in bulk data from the U.S. Bureau of Labor Statistics for the Quarterly Census of Employment and Wages data in ENB format.
@misc{vilhuber2020c, abstract = {These programs download and read in bulk data from the U.S. Bureau of Labor Statistics for the Quarterly Census of Employment and Wages data in ENB format.}, author = {Vilhuber, Lars and Bjelland, Melissa}, copyright = {MIT}, doi = {10.5281/zenodo.3903458}, month = jun, publisher = {Labor Dynamics Institute, Cornell University}, shorttitle = {labordynamicsinstitute/readin\_qcew\_sas}, title = {labordynamicsinstitute/readin\_qcew\_sas: {A} sequence of programs to read in {QCEW} data from the {Bureau} of {Labor} {Statistics}}, url = {https://zenodo.org/record/3903458}, urldate = {2020-09-20}, year = {2020}, month_numeric = {6} } - Uncertainty in times of COVID-19: Raw survey data. Fabian Lange, Lars Vilhuber, and Nicholas Gordon. Labor Dynamics Institute, [data] v20200622-clean, Jul 2020
@misc{lange_fabian_2020_3966534, author = {Lange, Fabian and Vilhuber, Lars and Gordon, Nicholas}, title = {Uncertainty in times of COVID-19: Raw survey data}, month = jul, year = {2020}, institution = {Labor Dynamics Institute}, number = {v20200622-clean}, type = {[data]}, keywords = {dataset}, doi = {10.5281/zenodo.3966534}, url = {https://doi.org/10.5281/zenodo.3966534}, month_numeric = {7} } - Presentation: metajelo, a metadata package for journals to support external linked objects. Lars Vilhuber and Carl Lagoze. Feb 2019
We propose a metadata package (called metajelo) that is intended to provide academic journals with a lightweight means of registering, at the time of publication, the existence and disposition of supplementary materials. Information about the supplementary materials is, in most cases, critical for the reproducibility and replicability of scholarly results. In many instances, these materials are curated by a third party, which may or may not follow developing standards for the identification and description of those materials. Researchers struggle when attempting to fully comply with data documentation and provenance documentation standards.
@misc{vilhuber2019c, abstract = {We propose a metadata package (called metajelo) that is intended to provide academic journals with a lightweight means of registering, at the time of publication, the existence and disposition of supplementary materials. Information about the supplementary materials is, in most cases, critical for the reproducibility and replicability of scholarly results. In many instances, these materials are curated by a third party, which may or may not follow developing standards for the identification and description of those materials. Researchers struggle when attempting to fully comply with data documentation and provenance documentation standards.}, author = {Vilhuber, Lars and Lagoze, Carl}, copyright = {CC BY Attribution 4.0 International}, doi = {10.5281/zenodo.2577295}, month = feb, shorttitle = {Presentation}, title = {Presentation: metajelo, a metadata package for journals to support external linked objects}, url = {https://zenodo.org/record/2577295}, urldate = {2019-03-09}, year = {2019}, month_numeric = {2} } - ncrncornell/ced2ar: 2.9.0.0. Charles Simmer, Ben Perry, Brandon Elam Barker, and 2 more authors. Feb 2018
@misc{Simmer2018-pc, author = {Simmer, Charles and Perry, Ben and Barker, Brandon Elam and Vilhuber, Lars and Brumsted, Kyle}, copyright = {CC BY-NC-SA Attribution-NonCommercial-ShareAlike 4.0 International}, doi = {10.5281/zenodo.1186381}, month = feb, title = {ncrncornell/ced2ar: 2.9.0.0}, url = {https://zenodo.org/record/1186381}, year = {2018}, month_numeric = {2} } - larsvilhuber/clone-chetty-use-admin-data: Data behind the Chetty (2012) figure on Time Trends in the Use of Administrative Data. Lars Vilhuber, Oct 2018
@misc{Vilhuber2018-zq, author = {Vilhuber, Lars}, copyright = {CC BY-NC-SA Attribution-NonCommercial-ShareAlike 4.0 International}, doi = {10.5281/zenodo.1453345}, month = oct, title = {larsvilhuber/clone-chetty-use-admin-data: {Data} behind the {Chetty} (2012) figure on {Time} {Trends} in the {Use} of {Administrative} {Data}}, url = {https://zenodo.org/record/1453345}, year = {2018}, month_numeric = {10} } - Computational Tools for Social Scientists Workshop. Lars Vilhuber, 2018
@misc{vilhuber2018b, author = {Vilhuber, Lars}, copyright = {CC BY-NC Attribution-NonCommercial 4.0 International}, title = {Computational {Tools} for {Social} {Scientists} {Workshop}}, url = {https://labordynamicsinstitute.github.io/computing4economists/web/#/}, urldate = {2018-12-08}, year = {2018}, } - Ced²Ar: 2.8.2.0. Brandon Elam Barker, Charles Simmer, Lars Vilhuber, and 2 more authors, 2017
Installation Note: if you are upgrading an existing installation of CED2AR and you back up and restore your config files (like we do), then you need to add the contents of the patch files here to your existing config files. Instructions are at: Patches/v2.8.2.0. For server installation: ced2ar.war is the server binary; BaseX.war is the server database template. For desktop installation: ced2ar.jar is the desktop binary; ced2ar.sh will run ced2ar.jar.
New Features and Issues: currently, new features and issues come from two sites: GitHub, the public site where users of the system can post issues, and JIRA, the restricted site used by the development team to track work related to CED2AR development. We try to cross-reference GitHub issues against JIRA issues (CDR).
New Features: the following high-level features have been added in this release:
- Browse by Study: displays a list of study titles (stdyDscr/citation/titlStmt/titl). Clicking on a study title displays the codebook in a tabbed horizontal layout; the tabs are the DDI codebook complex types (Doc, Study, File, Data and Other Material). User Documentation. (#3, [CDR-157] docDscr/citation/titlStmt/titl vs. stdyDscr/citation/titlStmt/titl [GitHub issue])
- UI Navigation Customization: the properties used to display/hide navigation tabs and the names of those tabs can be set using the /ced2ar-web/config page. Administrator Documentation. ([CDR-157])
- Global Authentication Option: setting accessMode to AdminOnly allows only users with the ROLE_ADMIN role to access the application; all others are prevented from accessing the pages. This is primarily used for codebook-development servers where crowd-sourced edits are curated and edited by administrators. Administrator Documentation. ([CDR-189] Enable global authentication option in web properties)
Resolved Issues: the following issues were fixed in this release.
GitHub issues:
- #26 Possible to delete investigator?
- #16 Variable edit mode: access levels (question)
- #9 Uploads sometimes fail and produce bad error message: invalid XML (bug)
- #8 can't see crowdsourced edits; official version unviewable (bug, question)
- #6 sha1sum (bug)
- #3 docDscr/citation/titlStmt/titl vs. stdyDscr/citation/titlStmt/titl (enhancement)
JIRA issues:
- [CDR-168] Improve versioning automation
- [CDR-180] ERROR SimpleAsyncUncaughtExceptionHandler...ArrayIndexOutOfBoundsException in ced2ar.eapi.VersionControl
- [CDR-183] Configuration page is not updating
- [CDR-194] delete investigator [#26]
JIRA new features:
- [CDR-157] docDscr/citation/titlStmt/titl vs. stdyDscr/citation/titlStmt/titl [GitHub issue] [#3]
- [CDR-189] Enable global authentication option in web properties
JIRA QA fixes:
- [CDR-181] Version control is not commiting or pushing in v2
- [CDR-191] Enable generation of ced2ar.log
- [CDR-193] header image for wiki-census
@misc{Barker2017-cj, abstract = {Installation Note: IF you are upgrading an existing installation of CED2AR AND you back up and restore your config files (like we do) THEN you need to add the contents of the patch files here to your existing config files. Instructions are at: Patches/v2.8.2.0 For server installation: ced2ar.war is the server binary BaseX.war is the server database template For desktop installation: ced2ar.jar is the desktop binary ced2ar.sh will run ced2ar.jar The following high level features have been added in this release: New Features and Issues Currently, new features and issues come from two sites: GitHub - The public site where users of the system can post Issues. JIRA - The restricted site used by the development team to track work related to CED2AR development. We try to cross reference a GitHub issue against a jira issue (CDR). New Features The following high level features have been added in this release: Browse by Study - Displays a list of Study Titles (stdyDscr/citation/titlStmt/titl). Clicking on a study title displays the codebook in a tabbed horizontal layout. The tabs are the DDI codebook complex types (Doc, Study, File, Data and Other Material). User Documentation \#3 docDscr/citation/titlStmt/titl vs. stdyDscr/citation/titlStmt/titl enhancement [CDR-157] - docDscr/citation/titlStmt/titl vs. stdyDscr/citation/titlStmt/titl [GitHub issue] UI Navigation Customization - Can set the properties used to display/hide navigation tabs and the names of those tabs. (They are set using the /ced2ar-web/config page.) Administrator Documentation [CDR-157] - docDscr/citation/titlStmt/titl vs. stdyDscr/citation/titlStmt/titl [GitHub issue] Global Authentication Option Added - Setting accessMode to AdminOnly allows only users with the ROLE\_ADMIN role to access the application. All others are prevented from accessing the pages. 
This is primarily used for codebook-development servers where crowd-sourced edits are curated and edited by administrators. Administrator Documentation [CDR-189] - Enable global authentication option in web properties Resolved Issues The following issues were fixed in this release. They are listed below. github Issues \#26 Possible to delete investigator? \#16 Variable edit mode: access levels question \#9 Uploads sometimes fail and produce bad error message: invalid XML bug \#8 can't see crowdsourced edits; official version unviewable bug question \#6 sha1sum bug \#3 docDscr/citation/titlStmt/titl vs. stdyDscr/citation/titlStmt/titl enhancement jira Issues Issue [CDR-168] - Improve versioning automation [CDR-180] - ERROR SimpleAsyncUncaughtExceptionHandler...ArrayIndexOutOfBoundsException in ced2ar.eapi.VersionControl [CDR-183] - Configuration page is not updating [CDR-194] - delete investigator [\#26] New Feature [CDR-157] - docDscr/citation/titlStmt/titl vs. stdyDscr/citation/titlStmt/titl [GitHub issue] [\#3] [CDR-189] - Enable global authentication option in web properties QA Fix [CDR-181] - Version control is not commiting or pushing in v2 [CDR-191] - Enable generation of ced2ar.log [CDR-193] - header image for wiki-census}, author = {Barker, Brandon Elam and Simmer, Charles and Vilhuber, Lars and Brumsted, Kyle and Perry, Ben}, copyright = {CC BY-NC-SA Attribution-NonCommercial-ShareAlike 4.0 International}, doi = {10.5281/ZENODO.495191}, title = {Ced²{Ar}: 2.8.2.0}, year = {2017}, } - Synthetic population housing and person records for the United States. William Sexton, John M. Abowd, Ian M. Schmutte, and 1 more author, 2017
@misc{sexton2017a, author = {Sexton, William and Abowd, John M. and Schmutte, Ian M. and Vilhuber, Lars}, copyright = {All rights reserved}, doi = {10.3886/e100274v1}, publisher = {ICPSR - Interuniversity Consortium for Political and Social Research}, title = {Synthetic population housing and person records for the {United} {States}}, year = {2017}, } - Replication Materials for Disclosure Limitation and Confidentiality Protection in Linked Data. Lars Vilhuber, John M. Abowd, and Ian M. Schmutte, Dec 2017
@misc{vilhuber2017g, author = {Vilhuber, Lars and Abowd, John M. and Schmutte, Ian M.}, copyright = {CC BY-SA Attribution-ShareAlike 4.0 International}, doi = {10.5281/zenodo.1116995}, month = dec, title = {Replication {Materials} for {Disclosure} {Limitation} and {Confidentiality} {Protection} in {Linked} {Data}}, url = {https://doi.org/10.5281/zenodo.1116995}, year = {2017}, month_numeric = {12} } - labordynamicsinstitute/rampnoise: Code for Multiplicative Noise Infusion. Lars Vilhuber, Dec 2017
@misc{vilhuber2017h, author = {Vilhuber, Lars}, copyright = {CC BY-SA Attribution-ShareAlike 4.0 International}, doi = {10.5281/zenodo.1116352}, month = dec, title = {labordynamicsinstitute/rampnoise: {Code} for {Multiplicative} {Noise} {Infusion}}, url = {https://doi.org/10.5281/zenodo.1116352}, year = {2017}, month_numeric = {12} } - Larsvilhuber/Jobcreationblog: Replication For: How Much Do Startups Impact Employment Growth In The U.S.? Lars Vilhuber, 2016
@misc{larsvilhuber2016, author = {Vilhuber, Lars}, copyright = {The Unlicense}, doi = {10.5281/zenodo.192385}, shorttitle = {Larsvilhuber/{Jobcreationblog}}, title = {Larsvilhuber/{Jobcreationblog}: {Replication} {For}: {How} {Much} {Do} {Startups} {Impact} {Employment} {Growth} {In} {The} {U}.{S}.?}, url = {https://doi.org/10.5281/zenodo.192385}, urldate = {2016-12-05}, year = {2016}, } - CED²AR: Comprehensive Extensible Data Documentation and Access Repository. Benjamin Perry, Jeremy Williams, Lars Vilhuber, and 1 more author, 2013
@misc{ced2ar, author = {Perry, Benjamin and Williams, Jeremy and Vilhuber, Lars and Block, William}, publisher = {Cornell University, for NSF Grant SES-1131848}, title = {{CED}²{AR}: {Comprehensive} {Extensible} {Data} {Documentation} and {Access} {Repository}}, type = {online resource}, url = {http://www2.ncrn.cornell.edu/ced2ar-web/}, urldate = {2014-04-10}, year = {2013}, } - NSF-Census Research Network - Cornell node website. Lars Vilhuber, Benjamin Perry, William Block, and 1 more author, 2012
@misc{ncrn.cornell, author = {Vilhuber, Lars and Perry, Benjamin and Block, William and Williams, Jeremy}, publisher = {Cornell University, for NSF Grant SES-1131848}, title = {{NSF}-{Census} {Research} {Network} - {Cornell} node website}, url = {http://www.ncrn.cornell.edu}, urldate = {2014-04-10}, year = {2012}, } - NSF-Census Research Network. Alan Karr, Lars Vilhuber, Jamie Nunnelly, and 1 more author, 2012
@misc{ncrn.info, author = {Karr, Alan and Vilhuber, Lars and Nunnelly, Jamie and Kantner, Katherine}, publisher = {National Institute for the Statistical Sciences (NISS), Cornell University, and Duke University, for NSF Grant SES-1237602}, title = {{NSF}-{Census} {Research} {Network}}, url = {http://www.ncrn.info}, urldate = {2014-04-10}, year = {2012}, } - National Quarterly Workforce Indicators, r2254. John M. Abowd and Lars Vilhuber, 2012
@misc{NQWI, address = {Ithaca, NY, USA}, author = {Abowd, John M. and Vilhuber, Lars}, publisher = {Cornell University, Labor Dynamics Institute [distributor]}, title = {National {Quarterly} {Workforce} {Indicators}, r2254}, url = {http://www2.vrdc.cornell.edu/news/data/qwi-national-data/}, year = {2012}, } - VirtualRDC - Synthetic Data Server. John M. Abowd and Lars Vilhuber, 2010
@misc{AbowdVilhuber, author = {Abowd, John M. and Vilhuber, Lars}, publisher = {Cornell University, Labor Dynamics Institute}, title = {{VirtualRDC} - {Synthetic} {Data} {Server}}, type = {online resource}, url = {http://www.vrdc.cornell.edu/sds/}, year = {2010}, } - New York State Disability and Employment Status Report, 2010. Sarah Von Schrader, William Erickson, Thomas Golden, and 1 more author, 2010
@misc{Employment2010, author = {Von Schrader, Sarah and Erickson, William and Golden, Thomas and Vilhuber, Lars}, publisher = {Cornell University, Employment and Disability Institute on behalf of New York Makes Work Pay Comprehensive Employment System Medicaid Infrastructure Grant}, title = {New {York} {State} {Disability} and {Employment} {Status} {Report}, 2010}, url = {http://www.nymakesworkpay.org/status-reports/index.cfm}, urldate = {2014-04-10}, year = {2010}, } - County-level disability and employment status reports, 2007. Sarah Von Schrader, William Erickson, Lars Vilhuber, and 1 more author, 2009
@misc{Employment2007, author = {Von Schrader, Sarah and Erickson, William and Vilhuber, Lars and Golden, Thomas}, publisher = {Cornell University, Employment and Disability Institute on behalf of New York Makes Work Pay Comprehensive Employment System Medicaid Infrastructure Grant}, title = {County-level disability and employment status reports, 2007}, url = {http://www.ilr.cornell.edu/edi/nymakesworkpay/policy/stats_2009.cfm}, urldate = {2009-07-01}, year = {2009}, } - County-level disability and employment status reports, 2009. Sarah Von Schrader, William Erickson, Lars Vilhuber, and 1 more author, 2009
@misc{Employment2009online, author = {Von Schrader, Sarah and Erickson, William and Vilhuber, Lars and Golden, Thomas}, publisher = {Cornell University, Employment and Disability Institute on behalf of New York Makes Work Pay Comprehensive Employment System Medicaid Infrastructure Grant}, title = {County-level disability and employment status reports, 2009}, url = {http://www.ilr.cornell.edu/edi/nymakesworkpay/policy/index.cfm}, urldate = {2010-01-01}, year = {2009}, } - VirtualRDC. John M. Abowd and Lars Vilhuber, 2004
@misc{vrdc, author = {Abowd, John M. and Vilhuber, Lars}, publisher = {Cornell University, Labor Dynamics Institute}, title = {{VirtualRDC}}, type = {online resource}, url = {http://www.vrdc.cornell.edu/}, year = {2004}, }