Article code: 416894
Journal code: 681414
Publication year: 2011
English article: 12-page PDF
Full-text version: free download
English title of the ISI article
An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets
Related topics
Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics
Abstract

When intense redaction is needed to protect the confidentiality of data subjects’ identities and sensitive attributes, statistical agencies can use synthetic data approaches. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated from the confidential data. Many agencies are reluctant to implement this idea because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be a labor-intensive and difficult task. Recently, there have been suggestions that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can estimate non-linear relationships that might otherwise be missed and can be run with minimal tuning, thus considerably reducing burdens on the agency. Four synthesizers based on machine learning algorithms–classification and regression trees, bagging, random forests, and support vector machines–are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can result in synthetic datasets that provide reliable estimates and low disclosure risks, and that these synthesizers can be implemented easily by statistical agencies.


► Statistical agencies can release simulated data as public use files.
► Nonparametric regression can be adapted to simulate such datasets.
► Synthesizers using CART, random forests, and support vector machines were compared.
► CART was shown to give the highest data utility for an acceptable disclosure risk.
► Nonparametric methods are easy to employ and hence appealing options for agencies.
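
To make the idea of a tree-based synthesizer concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the tiny regression tree, the function names (`grow_tree`, `find_leaf`, `synthesize`), the `min_leaf` setting, and the toy age/income data are all assumptions made for illustration. The sketch fits a regression tree to the confidential values, then replaces each record's sensitive value with a random draw from the observed values in that record's leaf.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, targets, min_leaf=5):
    """Grow a small regression tree; each leaf stores its observed targets."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[j] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[j] > threshold]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    # Stop when no admissible split exists or the best split does not help.
    if best is None or best[0] >= sse(targets):
        return {"leaf": list(targets)}
    _, j, threshold = best
    left_idx = [i for i, r in enumerate(rows) if r[j] <= threshold]
    right_idx = [i for i, r in enumerate(rows) if r[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": grow_tree([rows[i] for i in left_idx],
                          [targets[i] for i in left_idx], min_leaf),
        "right": grow_tree([rows[i] for i in right_idx],
                           [targets[i] for i in right_idx], min_leaf),
    }


def find_leaf(tree, row):
    """Return the list of observed target values in the leaf that row falls into."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


def synthesize(rows, targets, min_leaf=5, seed=0):
    """Replace every target with a random draw from its leaf's observed values."""
    rng = random.Random(seed)
    tree = grow_tree(rows, targets, min_leaf)
    return [rng.choice(find_leaf(tree, row)) for row in rows]


# Toy confidential data (hypothetical): income depends non-linearly on age.
data_rng = random.Random(1)
ages = [data_rng.randint(18, 70) for _ in range(200)]
incomes = [(20000 if a < 40 else 35000) + data_rng.gauss(0, 2000) for a in ages]
rows = [(a,) for a in ages]

# Ages are released unchanged; incomes are replaced by synthetic draws.
synthetic = synthesize(rows, incomes, min_leaf=10, seed=0)
```

Because each synthetic value is drawn from records that share a leaf, relationships the tree captures (here, the jump in income around age 40) carry over to the released data, while individual records no longer keep their own values. A real synthesizer would need further choices that this page does not describe, such as how large the leaves should be and whether the draws are smoothed.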

Publisher
Database: Elsevier - ScienceDirect
Journal: Computational Statistics & Data Analysis - Volume 55, Issue 12, 1 December 2011, Pages 3232–3243