An empirical evaluation of easily implemented, nonparametric methods for generating synthetic datasets

Article code	Journal code	Publication year	English article	Full-text version
416894	681414	2011	12-page PDF	Free download

Keywords

Disclosure; Confidentiality; Census; Imputation; Synthetic; Microdata

Related topics

Engineering and Basic Sciences, Computer Engineering, Computational Theory and Mathematics

Abstract

When intense redaction is needed to protect the confidentiality of data subjects’ identities and sensitive attributes, statistical agencies can use synthetic data approaches. To create synthetic data, the agency replaces identifying or sensitive values with draws from statistical models estimated from the confidential data. Many agencies are reluctant to implement this idea because (i) the quality of the generated data depends strongly on the quality of the underlying models, and (ii) developing effective synthesis models can be a labor-intensive and difficult task. Recently, there have been suggestions that agencies use nonparametric methods from the machine learning literature to generate synthetic data. These methods can estimate non-linear relationships that might otherwise be missed and can be run with minimal tuning, thus considerably reducing burdens on the agency. Four synthesizers based on machine learning algorithms–classification and regression trees, bagging, random forests, and support vector machines–are evaluated in terms of their potential to preserve analytical validity while reducing disclosure risks. The evaluation is based on a repeated sampling simulation with a subset of the 2002 Uganda census public use sample data. The simulation suggests that synthesizers based on regression trees can result in synthetic datasets that provide reliable estimates and low disclosure risks, and that these synthesizers can be implemented easily by statistical agencies.

► Statistical agencies can release simulated data as public use files.
► Nonparametric regression can be adapted to simulate such datasets.
► Synthesizers using CART, random forests, support vector machines were compared.
► CART shown to give highest data utility for acceptable disclosure risk.
► Nonparametric methods are easy to employ and hence appealing options for agencies.
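
The abstract does not spell out how a tree-based synthesizer works, so the following is only a minimal sketch of the general idea, under stated assumptions: a regression tree is grown on the confidential data, and each sensitive value is replaced with a random draw from the observed values in the same leaf. The function names (`best_split`, `grow_tree`, `leaf_values`, `synthesize`), the `min_leaf` parameter, and the toy age and income values are all hypothetical, and the procedure evaluated in the paper may differ in its details.

```python
import random


def best_split(rows, y, min_leaf):
    """Return the (feature, threshold) pair that most reduces squared error, or None."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best, best_cost = None, sse(y)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [yi for r, yi in zip(rows, y) if r[j] <= t]
            right = [yi for r, yi in zip(rows, y) if r[j] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue  # keep enough records per leaf to sample from
            cost = sse(left) + sse(right)
            if cost < best_cost:
                best, best_cost = (j, t), cost
    return best


def grow_tree(rows, y, min_leaf=5):
    """Recursively grow a regression tree; leaves store the observed y values."""
    split = best_split(rows, y, min_leaf)
    if split is None:
        return {"leaf": list(y)}
    j, t = split
    left = [(r, yi) for r, yi in zip(rows, y) if r[j] <= t]
    right = [(r, yi) for r, yi in zip(rows, y) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow_tree([r for r, _ in left], [v for _, v in left], min_leaf),
        "right": grow_tree([r for r, _ in right], [v for _, v in right], min_leaf),
    }


def leaf_values(tree, row):
    """Follow the splits for one record and return the y values in its leaf."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]


def synthesize(rows, y, min_leaf=5, seed=0):
    """Replace each y with a random draw from the leaf its record falls into."""
    rng = random.Random(seed)
    tree = grow_tree(rows, y, min_leaf)
    return [rng.choice(leaf_values(tree, r)) for r in rows]


# Toy, made-up data: one predictor (age) and one sensitive value (income).
rows = [(a,) for a in (20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75)]
income = [10, 12, 11, 13, 12, 40, 42, 41, 43, 44, 20, 21]
synthetic = synthesize(rows, income, min_leaf=3, seed=1)
print(synthetic)
```

Because every synthetic value is drawn from records that share a leaf, the released values follow the relationship the tree found between predictors and the sensitive variable, while no record is guaranteed to keep its own value. In this sketch a larger `min_leaf` widens the pool each draw comes from.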

Publisher

Database: Elsevier - ScienceDirect
Journal: Computational Statistics & Data Analysis - Volume 55, Issue 12, 1 December 2011, Pages 3232–3243

Authors

Jörg Drechsler, Jerome P. Reiter
