Article code  Journal code  Publication year  English article  Full-text version
484163  703253  2016  9 pages  PDF
English title of the ISI article
Denormalize and Delimit: How Not to Make Data Extraction for Analysis More Complex Than Necessary
Related topics
Engineering and Basic Sciences; Computer Engineering; Computer Science (General)
Abstract

There are many legitimate reasons why standards for formatting biomedical research data are lengthy and complex (Souza, Kush, & Evans, 2007). However, the common scenario of a biostatistician simply needing to import a given dataset into their statistical software is at best under-served by these standards. Statisticians are forced to act as amateur database administrators, pivoting and joining their data into a usable form before they can even begin the work they specialize in. Worse, they may find their choice of statistical tools dictated not by their own experience and skills but by remote standards bodies or inertial administrative choices, which may limit academic freedom. If the formats in question require the use of one proprietary software package, this also raises concerns about vendor lock-in (DeLano, 2005) and stewardship of public resources.

The logistics and transparency of data sharing can be made more tractable by an appreciation of the differences between the structural, semantic, and syntactic levels of data interoperability. The semantic level is legitimately a complex problem. Here we make the case that, for the limited purpose of statistical analysis, a simplifying assumption can be made about the structural level: the needs of a large number of statistical models can often be met with a modified variant of the first normal form, or 1NF (Codd, 1979). Once data is merged into one such table, the syntactic level becomes a solved problem, with many text-based formats available and robustly supported by virtually all statistical software without the need for any custom or third-party client-side add-ons. We implemented our denormalization approach in DataFinisher, an open source server-side add-on for i2b2 (Murphy et al., 2009), which we use at our site to enable self-service pulls of de-identified data by researchers.
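To make the "denormalize, then delimit" idea concrete, here is a minimal Python sketch. It is not DataFinisher's code and does not reflect the actual i2b2 schema: the fact layout `(patient_id, visit_date, concept, value)`, the function names `denormalize` and `to_delimited`, the sample values, and the choice to join repeated observations with `;` are all assumptions made for illustration. It pivots long-format observations into one row per patient visit and writes the result as tab-delimited text.

```python
import csv
import io

# Hypothetical long-format observations (made-up names and values):
# (patient_id, visit_date, concept, value)
FACTS = [
    ("p1", "2015-01-10", "height_cm", "170"),
    ("p1", "2015-01-10", "weight_kg", "68"),
    ("p1", "2015-03-02", "weight_kg", "70"),
    ("p2", "2015-02-14", "height_cm", "158"),
    ("p2", "2015-02-14", "dx_code", "E11"),
    ("p2", "2015-02-14", "dx_code", "I10"),
]


def denormalize(facts, missing=""):
    """Pivot long-format facts into one row per (patient, visit)."""
    # One output column per distinct concept, in sorted order.
    concepts = sorted({concept for _, _, concept, _ in facts})
    cells = {}
    for patient, visit, concept, value in facts:
        row = cells.setdefault((patient, visit), {})
        row.setdefault(concept, []).append(value)
    header = ["patient_id", "visit_date"] + concepts
    table = []
    for patient, visit in sorted(cells):
        row = cells[(patient, visit)]
        # Assumption: repeated observations share one cell, joined with ';'.
        values = [";".join(row.get(c, [])) or missing for c in concepts]
        table.append([patient, visit] + values)
    return header, table


def to_delimited(header, table, delimiter="\t"):
    """Serialize the single table as delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(table)
    return buffer.getvalue()


HEADER, TABLE = denormalize(FACTS)
TEXT = to_delimited(HEADER, TABLE)
print(TEXT)
```

Once the data is in this shape, it can be read with the standard delimited-text importer of most statistical packages, which is the practical point of the simplifying assumption. How to represent repeated observations within a visit is a design decision; the `;`-joined cell used here is only one possibility.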

Publisher
Database: Elsevier - ScienceDirect
Journal: Procedia Computer Science - Volume 80, 2016, Pages 1033–1041