Information preserving regression-based tools for statistical disclosure control

Langsrud, Øyvind

dc.contributor.author	Langsrud, Øyvind
dc.date.accessioned	2020-03-23T11:10:55Z
dc.date.available	2020-03-23T11:10:55Z
dc.date.created	2019-07-03T14:16:24Z
dc.date.issued	2019-01
dc.identifier.citation	Statistics and computing. 2019, 29 (5), 965-976.	en_US
dc.identifier.issn	0960-3174
dc.identifier.uri	https://hdl.handle.net/11250/2648071
dc.description.abstract	This paper presents a unified framework for regression-based statistical disclosure control for microdata. A basic method, known as information preserving statistical obfuscation (IPSO), produces synthetic data that preserve variances, covariances and fitted values. The data are then generated conditionally according to the multivariate normal distribution. Generalizations of the IPSO method are described in the literature, and these methods aim to generate data more similar to the original data. This paper describes these methods in a concise and interpretable way that is close to an efficient implementation. Decomposing the residual data into orthogonal scores and corresponding loadings is an essential part of the framework. Both QR decomposition (Gram–Schmidt orthogonalization) and singular value decomposition (principal components) may be used. Within this framework, new and generalized methods are presented. In particular, a method is described by means of which the correlations to the original principal component scores can be controlled exactly. It is shown that a suggested method of random orthogonal matrix masking can be implemented without generating an orthogonal matrix. Generalized methodology for hierarchical categories is presented within the context of microaggregation. Some information can then be preserved at the lowest level and more information at higher levels. The presented methodology is also applicable to tabular data. One possibility is to replace the content of primary and secondary suppressed cells with generated values. It is proposed to replace suppressed cell frequencies with decimal numbers, and it is argued that this can be a useful method.	en_US
dc.language.iso	eng	en_US
dc.publisher	Springer	en_US
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/deed.no	*
dc.subject	Microdata	en_US
dc.subject	Anonymization	en_US
dc.subject	Microaggregation	en_US
dc.subject	Official statistics	en_US
dc.subject	Synthetic data	en_US
dc.subject	Statistics production	en_US
dc.title	Information preserving regression-based tools for statistical disclosure control	en_US
dc.type	Peer reviewed	en_US
dc.type	Journal article	en_US
dc.description.version	acceptedVersion	en_US
dc.rights.holder	Springer	en_US
dc.subject.nsi	VDP::Mathematics and natural science: 400::Mathematics: 410::Statistics: 412	en_US
dc.source.pagenumber	965-976	en_US
dc.source.volume	29	en_US
dc.source.journal	Statistics and computing	en_US
dc.source.issue	5	en_US
dc.identifier.doi	10.1007/s11222-018-9848-9
dc.identifier.cristin	1709826
cristin.ispublished	true
cristin.fulltext	postprint
cristin.qualitycode	2
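
The basic IPSO idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper or from any published implementation. The function name `ipso_sketch` and the argument names `y`, `x` and `rng` are hypothetical, and the sketch assumes that `x` has full column rank and that there are more rows than the columns of `x` and `y` combined. The residuals from the regression of `y` on `x` are decomposed by QR into scores and loadings, and the scores are replaced by simulated orthonormal scores that are orthogonal to `x`.

```python
import numpy as np


def ipso_sketch(y, x, rng=None):
    """Illustrative IPSO-style synthetic data (hypothetical helper, not the author's code).

    y : (n, q) array of confidential variables
    x : (n, p) array of regressors, including an intercept column if wanted
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)

    # Ordinary least squares: fitted values and residuals of y given x.
    coef, *_ = np.linalg.lstsq(x, y, rcond=None)
    fitted = x @ coef
    resid = y - fitted

    # QR decomposition of the residuals: resid = scores @ loadings.
    # Only the loadings (the triangular factor) are reused.
    _, loadings = np.linalg.qr(resid)

    # Simulate normal noise and remove the part explained by x,
    # so that the new scores are orthogonal to every column of x.
    noise = rng.standard_normal(y.shape)
    noise_coef, *_ = np.linalg.lstsq(x, noise, rcond=None)
    noise = noise - x @ noise_coef

    # Orthonormalize the simulated residuals to obtain new scores.
    new_scores, _ = np.linalg.qr(noise)

    # Same fitted values, same residual cross-product matrix, new scores.
    return fitted + new_scores @ loadings
```

Under these assumptions, regressing the synthetic data on `x` gives the same coefficients as the original data, and when `x` contains an intercept column the sample covariance matrix of `y` is reproduced up to rounding error.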

Associated file(s)

Filename: RegressionBasedSDC.pdf
Size: 272.1 KB
Format: PDF

This item appears in the following collection(s)

Articles / Journal articles [426]
Publications from Cristin [138]

Unless otherwise stated, this item is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International