Navigating the jungle of data sharing tools

Basel Biometric Society Seminar, May 16th 2024

Peter Krusche

Novartis Pharma AG

Introduction

Today we will explore data sharing scenarios involving data from individuals\(^\mathsf{a}\) and discuss methods for privacy-preserving data sharing and analysis\(^\mathsf{b}\).

  1. Knowing the dangers: Privacy basics.
  2. Mapping the jungle: Where to start.
  3. Basic survival kit: Technology basics.
  4. Entering the jungle: Some simple examples.
  5. Exploring further: Outlook to other methods.
  a. …such as subjects who contributed data to biobanks or clinical studies.
  b. Disclaimer: The goal is to share an introduction and reading list - not a comprehensive summary!

Knowing the dangers

Privacy basics for shared data.

Some references for further reading: (1), (2)

What types of data do we consider?


  • Data summaries such as mean age, grouped summaries, model parameters, e.g. to document research findings.
  • Subject-level records, e.g. for further research, model training.
  • Synthetic data records preserving correlations & distributions in a subject-level dataset (e.g. for ML testing, data augmentation).


Types of data disclosure

Publishing a dataset can lead to disclosure of information on individuals who contributed to it.

Identity Disclosure: An individual can be uniquely identified from a dataset, either directly through specific identifiers (like name or social security number) or indirectly through a combination of attributes (like age, zip code, and profession).

Attribute Disclosure: Information about an individual is revealed even if the individual has not been uniquely identified.

Membership Disclosure: Ability to infer that an individual’s data is included in a dataset.

Linkage Disclosure: Ability to link data relating to an individual across multiple databases, potentially leading to identity or attribute disclosure by combining information.
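
To make linkage disclosure concrete, here is a toy sketch (all values invented, mirroring the example tables later in this deck): joining a “de-identified” table to a public directory on shared quasi-identifiers can re-identify records.

```python
# Toy illustration of linkage disclosure: a "de-identified" table is joined
# to a public directory on quasi-identifiers. All data are invented.
import pandas as pd

deidentified = pd.DataFrame({
    "sex": ["Female", "Male"],
    "age": [37, 38],
    "profession": ["Doctor", "Nurse"],
    "diagnosis": ["X", "Y"],  # sensitive attribute
})

public_directory = pd.DataFrame({
    "name": ["Vivian Johnson", "Antony Hill"],
    "sex": ["Female", "Male"],
    "age": [37, 38],
    "profession": ["Doctor", "Nurse"],
})

# If a combination of quasi-identifiers is unique, the join re-identifies the record.
linked = deidentified.merge(public_directory, on=["sex", "age", "profession"])
print(linked[["name", "diagnosis"]])
```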

Mapping the jungle

… and where to start.

Tools for privacy-enhanced data sharing

Basic survival kit

Technology basics.

Keeping computer systems secure is a specialized, complex skill. Ensure you know whom to ask / work with to keep data secure, keep on top of training, and (cautiously) use common sense!

Control access and copies of data


Applied governance for data access is relevant for data scientists, too:

  • Can you copy data to a specific storage location?
  • Not all copies of data are obvious: e.g. Jupyter/Markdown notebooks have cells that may contain data or summaries. Do you store these in a safe place? (A sketch for stripping notebook outputs follows below.)
  • Is your data encrypted?

While some of the above sound like “IT-department issues”, they have concrete implications for data science work - e.g. not copying data out of their proper location, ensuring derivations and analyses are reproducible & documented, and ensuring that the purpose of the analysis is compatible with the conditions of use!
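
As a concrete example for the notebook point above, a minimal sketch (the notebook path is a placeholder) that strips all code-cell outputs with nbformat before a notebook leaves its secure location:

```python
# Minimal sketch: clear all code-cell outputs so that data or summaries
# embedded in notebook cells do not travel with the analysis code.
import nbformat

path = "analysis.ipynb"  # placeholder
nb = nbformat.read(path, as_version=4)
for cell in nb.cells:
    if cell.cell_type == "code":
        cell["outputs"] = []
        cell["execution_count"] = None
nbformat.write(nb, path)
```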

Work in a secure computing environment


Using analytical tools & software securely isn’t straightforward with complex software dependencies.

E.g. PyPI, Hugging Face, and many others have been targeted by malware. See also (4), (5).
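
One practical mitigation is to pin dependencies and verify artifacts against published checksums before installing them (pip, for example, supports hash-pinned requirements). A hypothetical sketch; the file name and expected hash are placeholders:

```python
# Hypothetical sketch: compare a downloaded package artifact against a
# known-good SHA-256 checksum before installing it.
import hashlib

def sha256_of(path: str) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "<sha256 published by the package maintainers>"  # placeholder
actual = sha256_of("some_package-1.0.0-py3-none-any.whl")   # placeholder
print("OK" if actual == EXPECTED else "MISMATCH - do not install")
```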

Know where the data goes.

As soon as a system has internet access, this can become difficult!

The right technology skills make work inside gated data platforms easier!
The MIT Missing Semester is a good starting point for that: (6)


Entering the jungle

Some simple examples

Data summaries do not automatically protect privacy!


Example 1: The US Census Bureau uses differential privacy to protect its data releases (7) - group summaries for small groups (e.g. small towns and/or minorities) can lead to unintended disclosures.

Example 2: Allele frequency summaries in genetic data can allow membership inference (8).

What to do?

Different approaches exist - e.g. statistical disclosure control (9) and differential privacy (10) (11). A good summary of both approaches can be found in (12) and (13).

Differential privacy

Key idea:

Add noise to summaries that “hides” the contribution of individual data records.

Questions:

  • How much noise do we add?
  • How do we quantify “privacy”?

Differential privacy (technical)

Differential privacy bounds how much the probability of any output of a randomized, noise-adding mechanism can change when any single subject’s data record is added or removed.

Definition (\(\varepsilon\)-Differential Privacy)

Let \(\varepsilon > 0\) and let \(n\) be the number of records in our dataset. A randomized algorithm \(T:\mathcal{D}^n \rightarrow \mathbb{R}^p\) is said to be \(\varepsilon\)-differentially private if for every subset \(A\) of \(\mathbb{R}^p\) and for all datasets \(d_1, d_2 \in \mathcal{D}^n\) which differ in only a single element, we have \(\mathbf{P}(T(d_1) \in A) \leq \exp(\varepsilon) \, \mathbf{P}(T(d_2) \in A)\).

Notes:

  • Smaller \(\varepsilon\) \(\Rightarrow\) greater privacy since \(\mathbf{P}(T(d_1) \in A)\) and \(\mathbf{P}(T(d_2) \in A)\) will be closer.
  • Statistics like the sample mean or variance cannot be made \(\varepsilon\)-DP unless we can bound/clamp the value range of the data, since otherwise their sensitivity is unbounded.
  • Other definitions of differential privacy have been proposed that allow different ways of bounding the relationship between the noise added (amount and distribution) and the probability of obtaining the same summary result - e.g. Gaussian Differential Privacy (15).
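
To connect the definition to the “add noise” idea: a minimal sketch of the textbook Laplace mechanism (see e.g. (11), (14)) for a counting query, where adding or removing one record changes the count by at most 1 (sensitivity 1), so Laplace noise with scale sensitivity/\(\varepsilon\) yields an \(\varepsilon\)-DP release.

```python
# Minimal sketch of the Laplace mechanism for a counting query.
# The sensitivity of a count is 1, so noise ~ Laplace(scale = sensitivity / epsilon)
# gives an epsilon-differentially private release.
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon: float, sensitivity: float = 1.0) -> float:
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = [38, 38, 37, 37, 32]
print(dp_count(ages, lambda a: a >= 35, epsilon=0.5))  # noisy count of subjects aged 35+
```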

Differentially private mean calculation

The bulk of the sample has mean 5, but a small subset of the sample has mean 10000.

Non-private mean ≈ 14.985

Calculating a bounded mean with diffprivlib (16): increasing \(\varepsilon\) (i.e. decreasing privacy) moves the privacy-preserving mean estimate closer to the correct mean of the input data.

Every time we release data for one value of \(\varepsilon\), we release more information about the original data! Also, choosing \(\varepsilon\) s.t. we retain “useful” data utility is not always possible!

Using differential privacy in practice needs careful choice of noise and sensitivity parameters. Not all implementations have got this right at all times (17) - specifically for calculating a (bounded) mean.
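
A sketch of the bounded-mean example above, assuming diffprivlib’s tools.mean interface (epsilon and bounds arguments); the synthetic sample mirrors the scenario on this slide (999 values around 5 plus one value of 10000):

```python
# Sketch: differentially private bounded mean with diffprivlib (16).
# The bounds argument clamps each record to [0, 10000] and fixes the sensitivity.
import numpy as np
from diffprivlib.tools import mean as dp_mean

rng = np.random.default_rng(0)
sample = np.append(rng.normal(loc=5.0, scale=1.0, size=999), 10_000.0)
print("non-private mean:", sample.mean())  # ≈ 15, dominated by the outlier

for eps in (0.01, 0.1, 1.0, 10.0):
    est = dp_mean(sample, epsilon=eps, bounds=(0.0, 10_000.0))
    print(f"epsilon={eps}: DP mean estimate ≈ {est:.2f}")
```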

Releasing subject-level data

| First Name | Last Name | Sex | DOB | Cat or dog person |
|---|---|---|---|---|
| Antony | Hill | Male | 09-03-1986 | 🐈‍⬛ |
| Garry | Armstrong | Male | 03-01-1986 | 🐕 |
| Fenton | Riley | Male | 01-02-1987 | 🐕 |
| Vivian | Johnson | Female | 04-05-1987 | 🐕 |
| Wilson | Tucker | Male | 09-10-1991 | 🐕 |

| ID | First Name | Last Name | Sex | Age | Cat or dog person |
|---|---|---|---|---|---|
| A | XXX | XXX | Male | 38 | 🐈‍⬛ |
| B | XXX | XXX | Male | 38 | 🐕 |
| C | XXX | XXX | Male | 37 | 🐕 |
| D | XXX | XXX | Female | 37 | 🐕 |
| E | XXX | XXX | Male | 32 | 🐕 |

When publishing subject-level data, we can “de-identify” records by removing personal identifiers (e.g. Name, DOB).

In many cases, this is not sufficient to protect the privacy of individuals (18).

Anonymisation


| ID | Sex | Age | Profession |
|---|---|---|---|
| A | Male | 38 | Nurse |
| B | Male | 38 | Doctor |
| C | Male | 37 | Nurse |
| D | Female | 37 | Doctor |
| E | Male | 32 | House Spouse |


Assume we know everyone lives on the same street:
Can we re-identify individuals?

| ID | Sex | Age | Profession |
|---|---|---|---|
| A | Male | 30-40 | Nurse |
| B | Male | 30-40 | Doctor |
| C | Male | 30-40 | Nurse |
| D | Female | 30-40 | Doctor |
| E | Male | 30-40 | House Spouse |


Age or date of birth (19) can be bucketed to avoid re-identification.

| ID | Sex | Age | Profession |
|---|---|---|---|
| A | Male | 30-40 | Healthcare |
| B | Male | 30-40 | Healthcare |
| C | Male | 30-40 | Healthcare |
| D | Female | 30-40 | Healthcare |
| E | Male | 30-40 | Unspecified |


Rare categories can be obscured, or individual rare records suppressed. Typically, this is done in a way that quantifies disclosure risk for variables individually or jointly.

Comprehensive methodology exists for this (20), but optimizing data utility isn’t always simple.
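
A minimal pandas sketch of the generalisation steps shown in the tables above: bucketing age into a 10-year range and mapping professions to broader categories (the mapping itself is illustrative):

```python
# Sketch of simple generalisation: bucket age into ranges and map professions
# to broader categories; remaining rare categories could then be suppressed.
import pandas as pd

df = pd.DataFrame({
    "ID": list("ABCDE"),
    "Sex": ["Male", "Male", "Male", "Female", "Male"],
    "Age": [38, 38, 37, 37, 32],
    "Profession": ["Nurse", "Doctor", "Nurse", "Doctor", "House Spouse"],
})

# Generalise age into coarse buckets.
df["Age"] = pd.cut(df["Age"], bins=[30, 40, 50], labels=["30-40", "40-50"])

# Generalise profession; anything unmapped becomes "Unspecified".
category = {"Nurse": "Healthcare", "Doctor": "Healthcare"}
df["Profession"] = df["Profession"].map(category).fillna("Unspecified")

print(df)
```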

Exploring further

Working with synthetic data

Can improve data utility over anonymisation (1) and may enable work across different data modalities - e.g. tabular & imaging data (21).

Federated learning

Provides a framework to collaborate on data that cannot be shared directly - a common scenario particularly for healthcare data!

Applying differential privacy

Can help add privacy guarantees to both methods above, but needs careful choice of noise parameters / privacy budget.

Summary messages

  1. Privacy guarantees are typically traded off for data utility.

  2. Disclosure risk and data sharing scenario define which technologies are useful, and what tradeoffs may be made.

  3. Technology and methods alone don’t protect privacy:
    … but careful use & choice of parameters do!

Thank you!

Supplementary information

References

1.
El Emam, K., Mosquera, L. & Hoptroff, R. Practical synthetic data generation: Balancing privacy and the broad availability of data. (O’Reilly Media, Inc, 2020).
2.
3.
Kun, J. A high-level technical overview of fully homomorphic encryption. (2024). at <https://www.jeremykun.com/2024/05/04/fhe-overview/>
4.
Goodin, D. Latest attack on PyPI users shows crooks are only getting better. Ars Technica (2023). at <https://arstechnica.com/information-technology/2023/02/451-malicious-packages-available-in-pypi-contained-crypto-stealing-malware/>
5.
Goodin, D. Hugging Face, the GitHub of AI, hosted code that backdoored user devices. Ars Technica (2024). at <https://arstechnica.com/security/2024/03/hugging-face-the-github-of-ai-hosted-code-that-backdoored-user-devices/>
6.
The Missing Semester of Your CS Education. Missing Semester at <https://missing.csail.mit.edu/>
7.
census.gov. A history of census privacy protections. (2019). at <https://www.census.gov/library/visualizations/2019/comm/history-privacy-protection.html>
8.
9.
10.
11.
Dwork, C. & Roth, A. The Algorithmic Foundations of Differential Privacy. Foundations and Trends® in Theoretical Computer Science 9, 211–407 (2013).
12.
Differential Privacy: What Is It? Amstat News. (2019). at <https://magazine.amstat.org/blog/2019/03/01/differentialprivacy/>
13.
Garfinkel, S. Differential Privacy and the 2020 US Census. MIT Case Studies in Social and Ethical Responsibilities of Computing (2022). doi:10.21428/2c646de5.7ec6ab93
14.
Near, J. P. & Abuah, C. Programming Differential Privacy. at <https://programming-dp.com/cover.html>
15.
Dong, J., Roth, A. & Su, W. J. Gaussian Differential Privacy. Journal of the Royal Statistical Society Series B: Statistical Methodology 84, 3–37 (2022).
16.
Holohan, N., Braghin, S., Mac Aonghusa, P. & Levacher, K. Diffprivlib: The IBM differential privacy library. ArXiv e-prints 1907.02444 [cs.CR], (2019).
17.
Casacuberta, S., Shoemate, M., Vadhan, S. & Wagaman, C. Widespread underestimation of sensitivity in differentially private libraries and how to fix it. (2022). at <https://arxiv.org/abs/2207.10635>
18.
Narayanan, A. & Shmatikov, V. How To Break Anonymity of the Netflix Prize Dataset. (2007). doi:10.48550/arXiv.cs/0610105
19.
20.
El Emam, K. & Arbuckle, L. Anonymizing health data: Case studies and methods to get you started. (O’Reilly Media, Inc., 2013).
21.
Ziegler, J. D. et al. Multi-modal conditional GAN: Data synthesis in the medical domain. in NeurIPS 2022 workshop on synthetic data for empowering ML research (2022). at <https://openreview.net/forum?id=8PI7W3bCTl>