Statistical disclosure control

SDC can also describe protection methods applied to the data: for example, removing names and addresses, limiting extreme values, or swapping problematic observations.
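The three kinds of protection named above can be illustrated with a minimal sketch covering the first two (removing direct identifiers and limiting extreme values). The field names, the income cap, and the `protect` helper are assumptions made for illustration; they do not come from any particular SDC tool.

```python
# Hypothetical field names treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "address"}

def protect(records, income_cap):
    """Return a protected copy: drop direct identifiers and top-code income."""
    protected = []
    for rec in records:
        # Keep every field except the direct identifiers.
        safe = {k: v for k, v in rec.items() if k not in DIRECT_IDENTIFIERS}
        # Top-coding: values above the cap are replaced by the cap.
        safe["income"] = min(safe["income"], income_cap)
        protected.append(safe)
    return protected

# Illustrative records, invented for this example.
records = [
    {"name": "A. Smith", "address": "1 High St", "age": 34, "income": 28_000},
    {"name": "B. Jones", "address": "2 Low Rd", "age": 51, "income": 950_000},
]
released = protect(records, income_cap=150_000)
```

The original records are left unchanged; only the released copy is altered. An appropriate cap would in practice depend on the data and the release context.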

The value and drawbacks of rules for frequency and magnitude tables have been discussed extensively since the late 20th century.

Some statistical outputs, such as frequency tables, carry a high level of inherent risk from differencing, low counts, and class disclosure.
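Differencing can be shown with a small sketch: two frequency tables that each look safe on their own reveal a count of one when subtracted. The tables, the category names, and the threshold of five are invented for illustration.

```python
def difference(table_a, table_b):
    """Cell-by-cell difference between two frequency tables held as dicts."""
    return {cell: table_a[cell] - table_b.get(cell, 0) for cell in table_a}

# Hypothetical table: residents aged 16 and over, by activity.
aged_16_plus = {"employed": 120, "unemployed": 14, "student": 40}
# Hypothetical table: residents aged 18 and over, same categories.
aged_18_plus = {"employed": 119, "unemployed": 14, "student": 28}

# Subtracting the tables yields counts for residents aged 16 to 17,
# a group that neither table published directly.
aged_16_17 = difference(aged_16_plus, aged_18_plus)

# Non-zero cells below an assumed threshold of five observations.
risky = {cell: n for cell, n in aged_16_17.items() if 0 < n < 5}
```

Here every published cell has at least five observations, yet the difference isolates a single employed person aged 16 to 17.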

[11] In principles-based systems, disclosure control attempts to uphold a specific set of fundamental principles—for example, "no person should be identifiable in released microdata".

[12] Rules-based systems, in contrast, are characterized by a specific set of rules that a person performing disclosure control follows (for example, "any frequency must be based on at least five observations"), after which the data are presumed to be safe to release.

[13] In rules-based SDC, a rigid set of rules is used to determine whether or not the results of data analysis can be released.
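A rigid rule of this kind reduces to a mechanical check. The sketch below applies the example rule quoted earlier ("any frequency must be based on at least five observations") to a frequency table; the function names and the sample table are assumptions, and the rule is applied literally, so a cell with zero observations also fails.

```python
THRESHOLD = 5  # minimum observations per cell, from the example rule

def failing_cells(table, threshold=THRESHOLD):
    """Return the cells whose counts fall below the threshold."""
    return [cell for cell, n in table.items() if n < threshold]

def can_release(table, threshold=THRESHOLD):
    """A table passes only if no cell fails the threshold rule."""
    return not failing_cells(table, threshold)

# Hypothetical frequency table.
table = {"A": 12, "B": 5, "C": 3}
blocked = failing_cells(table)
ok = can_release(table)
```

The check involves no judgement about context, which is what makes such rules consistent and also what makes them inflexible.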

Rules-based systems are good for ensuring consistency across time, across data sources, and across production teams, which makes them appealing for statistical agencies.

The principles-based approach requires training and an understanding of statistics and data analysis,[11] although it has been argued[13] that this training can make the process more efficient than a rules-based model.

In the UK, all major secure research environments in social science and public health, with the exception of Northern Ireland, are principles-based.

Many contemporary statistical disclosure control techniques, such as generalization and cell suppression, have been shown to be vulnerable to attack by a hypothetical data intruder.
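One simple way to see such a vulnerability, offered as an illustration and not as the specific attacks studied in the literature, is cell suppression applied without secondary suppression: if a marginal total is published alongside the table, a single suppressed cell can be recovered by subtraction. The function names and the data are hypothetical.

```python
def suppress_small(cells, threshold=5):
    """Primary suppression: replace counts below the threshold with None."""
    return {k: (v if v >= threshold else None) for k, v in cells.items()}

def recover_from_total(published, total):
    """Intruder's step: with exactly one suppressed cell, the published
    total minus the visible cells reveals the hidden value."""
    hidden = [k for k, v in published.items() if v is None]
    if len(hidden) != 1:
        return {}  # subtraction alone cannot separate several hidden cells
    known = sum(v for v in published.values() if v is not None)
    return {hidden[0]: total - known}

# Hypothetical regional counts.
cells = {"north": 40, "south": 3, "east": 27}
published = suppress_small(cells)
recovered = recover_from_total(published, total=sum(cells.values()))
```

Suppressing a second cell in the same row would defeat this particular subtraction, at the cost of releasing less information.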

These were extended and simplified as part of the SACRO project (see below), and more guidelines for data services staff were added.

They provide the output checkers with extensive information on potential problems, including secondary disclosure across tables.

However, in official statistics, where the same tables are being repeatedly generated and where secondary differencing is considered a significant problem, the investment in setting up the tools can be very cost-effective.

The software for both is open source, on GitHub (https://github.com/sdcTools/tauargus) and CRAN (https://cran.r-project.org/web/packages/sdcTable/).

SACRO (Semi-automated checking of research outputs) is a WPR tool, originally commissioned (as ACRO) by Eurostat in 2020 as a proof-of-concept to show that a general-purpose output checking tool could be developed.

[22] In 2023 the UK Medical Research Council commissioned a generalised version (SACRO) which would work with multiple languages (as of 2024: Stata, R and Python) and provide a more user-friendly interface.

[23] SACRO directly implements the statbarns model and is principles-based; hence it is 'semi-automated': users can request exceptions, and output checkers can override the automated recommendations.
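The division of labour described here (an automated recommendation, a researcher's exception request, and a human checker's final decision) can be sketched as follows. This is a hypothetical illustration only: the `Output` class, its fields, and both functions are invented for this example and are not the SACRO or ACRO API.

```python
from dataclasses import dataclass

@dataclass
class Output:
    """A research output awaiting clearance (hypothetical structure)."""
    name: str
    min_cell_count: int
    exception_request: str = ""  # researcher's justification, if any
    recommendation: str = ""     # set by the automated check
    decision: str = ""           # set by the human output checker

def auto_check(output, threshold=5):
    """Automated step: recommend pass or fail using an assumed threshold."""
    output.recommendation = (
        "pass" if output.min_cell_count >= threshold else "fail"
    )
    return output

def checker_decide(output, override=None):
    """Human step: accept the recommendation unless explicitly overridden."""
    output.decision = override if override is not None else output.recommendation
    return output

out = auto_check(Output("table1", min_cell_count=3,
                        exception_request="counts refer to public bodies"))
out = checker_decide(out, override="pass")
```

The automated recommendation is kept alongside the final decision, so an override remains visible in the record.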