Basel Biometric Society Seminar, May 16th 2024
Today we will explore data sharing scenarios involving data from individuals\(^\mathsf{a}\) and discuss methods for privacy-preserving data sharing and analysis\(^\mathsf{b}\).
Privacy basics for shared data.
Publishing a dataset can lead to disclosure of information on individuals who contributed to it.
Identity Disclosure: An individual can be uniquely identified from a dataset, either directly through specific identifiers (like name or social security number) or indirectly through a combination of attributes (like age, zip code, and profession).
Attribute Disclosure: Information about an individual is revealed even if the individual has not been uniquely identified.
Membership Disclosure: The ability to infer that an individual's data is included in a dataset.
Linkage Disclosure: The ability to link data relating to an individual across multiple databases, potentially leading to identity or attribute disclosure by combining information.
… and where to start.
Technology basics.
Keeping computer systems secure is a specialized, complex skill. Make sure you know whom to ask or work with to keep data secure, stay on top of training, and (cautiously) use common sense!
Applied governance for data access is relevant for data scientists, too:
While some of the above sound like “IT-department issues”, they have concrete implications for data science work - e.g. not copying data out of its proper location, ensuring derivations and analyses are reproducible & documented, and ensuring that the purpose of the analysis is compatible with the conditions of use!
Using analytical tools & software securely isn’t straightforward when they have complex dependencies.
E.g. PyPI, Hugging Face, and many others have been targeted by malware. See also (4), (5).
Know where the data goes.
As soon as a system has internet access, this can become difficult!
The right technology skills make work inside gated data platforms easier!
The MIT Missing Semester is a good starting point for that: (6)
Some simple examples
Example 1: The US Census Bureau uses differential privacy to protect its data releases (7) - group summaries for small groups (e.g. small towns and/or minorities) can lead to unintended disclosures.
Example 2: Allele frequency summaries in genetic data can allow membership inference (8).
Key idea:
Add noise to summaries that “hides” the contribution of individual data records.
Questions:
Differential privacy bounds how much the probability of any output statistic can change when a single subject’s data record is added or removed, given a mechanism that adds noise.
Definition (\(\varepsilon\)-Differential Privacy)
Let \(\varepsilon > 0\) and let \(n\) be the number of records in our dataset. A randomized algorithm \(T:\mathcal{D}^n \rightarrow \mathbb{R}^p\) is said to be \(\varepsilon\)-differentially private if for every subset \(A\) of \(\mathbb{R}^p\) and for all datasets \(d_1, d_2 \in \mathcal{D}^n\) which differ in only a single element, we have \(\mathbf{P}(T(d_1) \in A) \leq \exp(\varepsilon) \ \mathbf{P}(T(d_2) \in A)\).
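To make the definition concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query (L1 sensitivity 1); the dataset, query, and \(\varepsilon\) values are illustrative assumptions, not taken from the seminar.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release a counting query under epsilon-DP via the Laplace mechanism.

    Adding or removing one record changes a count by at most 1, so the L1
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative data: ages of 100 hypothetical subjects.
rng = np.random.default_rng(0)
ages = rng.integers(20, 80, size=100)

for eps in (0.1, 1.0, 10.0):
    released = laplace_count(ages, lambda a: a >= 65, epsilon=eps, rng=rng)
    print(f"epsilon={eps:5.1f}  noisy count of subjects aged 65+: {released:8.2f}")
```

Smaller \(\varepsilon\) means a tighter bound in the definition above and therefore more noise on the released count.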
Notes:
The bulk of the sample has mean 5, but a small subset of the sample has mean 10000.
Non-private mean ≈ 14.985
Calculating a bounded mean with diffprivlib (16): increasing \(\varepsilon\) / decreasing privacy moves the privacy-preserving mean estimate closer to the correct mean of the input population (see the sketch after these notes).
Every time we release data for one value of \(\varepsilon\), we release more information about the original data! Also, choosing \(\varepsilon\) s.t. we retain “useful” data utility is not always possible!
Using differential privacy in practice requires careful choice of noise and sensitivity parameters. Not all implementations have gotten this right at all times (17) - specifically for calculating a (bounded) mean.
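A minimal sketch of the bounded-mean example above, assuming a sample of roughly 10,000 values near 5 plus a handful of outliers at 10,000 (the exact counts and seed are assumptions chosen to roughly reproduce the quoted non-private mean); it uses diffprivlib’s `tools.mean` with explicit `epsilon` and `bounds`.

```python
import numpy as np
from diffprivlib.tools import mean as dp_mean

rng = np.random.default_rng(42)

# Assumed construction: 9,990 values around 5 plus 10 outliers at 10,000,
# giving a non-private mean close to the ~15 quoted above.
bulk = rng.normal(loc=5.0, scale=1.0, size=9_990)
outliers = np.full(10, 10_000.0)
sample = np.concatenate([bulk, outliers])

print(f"non-private mean: {sample.mean():.3f}")

# Bounds clamp each record's possible contribution; they should be chosen
# without looking at the data for the epsilon-DP guarantee to stay meaningful.
for eps in (0.01, 0.1, 1.0, 10.0):
    est = dp_mean(sample, epsilon=eps, bounds=(0.0, 10_000.0))
    print(f"epsilon={eps:5.2f}  DP bounded mean: {est:10.3f}")
```

For very small \(\varepsilon\) the released estimate can be far from both 5 and ~15, which is exactly the utility trade-off noted above.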
First Name | Last Name | Sex | DOB | Cat or dog person |
---|---|---|---|---|
Antony | Hill | Male | 09-03-1986 | 🐈⬛ |
Garry | Armstrong | Male | 03-01-1986 | 🐕 |
Fenton | Riley | Male | 01-02-1987 | 🐕 |
Vivian | Johnson | Female | 04-05-1987 | 🐕 |
Wilson | Tucker | Male | 09-10-1991 | 🐕 |
ID | First Name | Last Name | Sex | Age | Cat or dog person |
---|---|---|---|---|---|
A | XXX | XXX | Male | 38 | 🐈⬛ |
B | XXX | XXX | Male | 38 | 🐕 |
C | XXX | XXX | Male | 37 | 🐕 |
D | XXX | XXX | Female | 37 | 🐕 |
E | XXX | XXX | Male | 32 | 🐕 |
When publishing subject-level data, we can “de-identify” records by removing personal identifiers (e.g. Name, DOB).
In many cases, this is not sufficient to protect the privacy of individuals (18).
[1] http://www.randat.com/
ID | Sex | Age | Profession |
---|---|---|---|
A | Male | 38 | Nurse |
B | Male | 38 | Doctor |
C | Male | 37 | Nurse |
D | Female | 37 | Doctor |
E | Male | 32 | House Spouse |
Assume we know everyone lives on the same street:
Can we re-identify individuals?
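One way to answer this question is a simple linkage attack: the sketch below joins the published table onto a hypothetical external source listing the neighbours’ names, sex, and age (invented for illustration, reusing the example values from the tables above). Quasi-identifier combinations that are unique in both tables re-identify a person and reveal their profession.

```python
import pandas as pd

# Published, "de-identified" table (values from the example above).
published = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "Sex": ["Male", "Male", "Male", "Female", "Male"],
    "Age": [38, 38, 37, 37, 32],
    "Profession": ["Nurse", "Doctor", "Nurse", "Doctor", "House Spouse"],
})

# Hypothetical external source (e.g. a list of neighbours on the street)
# linking names to the same quasi-identifiers.
neighbours = pd.DataFrame({
    "First Name": ["Antony", "Garry", "Fenton", "Vivian", "Wilson"],
    "Last Name": ["Hill", "Armstrong", "Riley", "Johnson", "Tucker"],
    "Sex": ["Male", "Male", "Male", "Female", "Male"],
    "Age": [38, 38, 37, 37, 32],
})

# Join on the quasi-identifiers; combinations that are unique in both
# tables re-identify a person and reveal their profession.
linked = published.merge(neighbours, on=["Sex", "Age"])
unique_keys = linked.groupby(["Sex", "Age"]).filter(lambda g: len(g) == 1)
print(unique_keys[["First Name", "Last Name", "Profession"]])
```

Here sex and age alone single out three of the five records.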
ID | Sex | Age | Profession |
---|---|---|---|
A | Male | 30-40 | Nurse |
B | Male | 30-40 | Doctor |
C | Male | 30-40 | Nurse |
D | Female | 30-40 | Doctor |
E | Male | 30-40 | House Spouse |
Age or date of birth (19) can be bucketed to avoid re-identification.
ID | Sex | Age | Profession |
---|---|---|---|
A | Male | 30-40 | Healthcare |
B | Male | 30-40 | Healthcare |
C | Male | 30-40 | Healthcare |
D | Female | 30-40 | Healthcare |
E | Male | 30-40 | Unspecified |
Rare categories can be obscured, or individual rare records suppressed. Typically, this is done in a way that quantifies disclosure risk for variables individually or jointly.
Comprehensive methodology exists for this (20), but optimizing data utility isn’t always simple.
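The generalisation steps shown in the tables can be sketched with pandas: bucket ages, map professions to broader categories, and count how many records share each quasi-identifier combination (a k-anonymity-style check). The values and bucket boundaries come from the example above; the check itself is only an illustrative stand-in for the full methodology of (20).

```python
import pandas as pd

records = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "Sex": ["Male", "Male", "Male", "Female", "Male"],
    "Age": [38, 38, 37, 37, 32],
    "Profession": ["Nurse", "Doctor", "Nurse", "Doctor", "House Spouse"],
})

# Generalise quasi-identifiers: bucket age into decades and map rare or
# identifying professions to broader categories (as in the tables above).
records["Age"] = pd.cut(records["Age"], bins=[30, 40, 50], labels=["30-40", "40-50"])
profession_map = {"Nurse": "Healthcare", "Doctor": "Healthcare",
                  "House Spouse": "Unspecified"}
records["Profession"] = records["Profession"].map(profession_map)

# k-anonymity-style check: how many records share each quasi-identifier
# combination? Groups of size 1 remain at risk of re-identification.
group_sizes = records.groupby(["Sex", "Age", "Profession"], observed=True).size()
print(group_sizes)
print("smallest group size (k):", group_sizes.min())
```

Even after generalisation, the single female record and the “Unspecified” record sit in groups of size one, which is why suppression of rare records can still be needed.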
Working with synthetic data
Can improve data utility over anonymisation (1) and may enable work across different data modalities - e.g. tabular & imaging data (21).
Federated learning
Provides a framework to collaborate on data that cannot be shared directly - a common scenario particularly for healthcare data!
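To illustrate the idea, here is a toy federated-averaging sketch (sites, data, and hyperparameters are all invented): each site runs a few gradient steps on data that never leaves the site, and only the resulting model updates are averaged centrally.

```python
import numpy as np

rng = np.random.default_rng(7)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's contribution: a few gradient steps for a linear model on
    locally held data; only the updated weights are shared."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Invented example: three hospitals holding data from the same underlying
# linear model y = 2*x1 - x2 + noise, which is never pooled centrally.
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=200)
    sites.append((X, y))

# Federated averaging: in each round the server averages the site updates.
w = np.zeros(2)
for _ in range(20):
    w = np.mean([local_update(w, X, y) for X, y in sites], axis=0)

print("federated estimate:", np.round(w, 3))  # close to [2, -1]
```

Differential privacy (see below) can be layered on top by adding calibrated noise to the shared updates, at the usual cost in utility.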
Applying differential privacy
Can help add privacy guarantees to both methods above, but needs careful choice of noise parameters / privacy budget.
Privacy guarantees are typically traded off for data utility.
Disclosure risk and the data sharing scenario determine which technologies are useful, and what tradeoffs may be made.
Technology and methods alone don’t protect privacy:
… but careful use & choice of parameters do!