Basel Biometric Society Seminar, May 16th 2024
Today we will explore data sharing scenarios involving data from individuals\(^\mathsf{a}\) and discuss methods for privacy-preserving data sharing and analysis\(^\mathsf{b}\).
Privacy basics for shared data.
Publishing a dataset can lead to disclosure of information on individuals who contributed to it.
Identity Disclosure: An individual can be uniquely identified from a dataset, either directly through specific identifiers (like name or social security number) or indirectly through a combination of attributes (like age, zip code, and profession).
Attribute Disclosure: Information about an individual is revealed even if the individual has not been uniquely identified.
Membership Disclosure: The ability to infer that an individual's data is included in a dataset.
Linkage Disclosure: The ability to link data relating to an individual across multiple databases, potentially leading to identity or attribute disclosure by combining information.
… and where to start.
Technology basics.
Keeping computer systems secure is a specialized, complex skill. Make sure you know whom to ask or work with to keep data secure, stay on top of training, and (cautiously) use common sense!
Applied governance for data access is relevant for data scientists, too:
While some of the above sound like “IT-department issues”, they have concrete implications for data science work - e.g. not copying data out of its proper location, ensuring derivations and analyses are reproducible & documented, and ensuring that the purpose of the analysis is compatible with the conditions of use!
Using analytical tools & software securely isn’t straightforward when they have complex dependencies.
E.g. PyPI, Hugging Face, and many others have been targeted by malware. See also (4), (5).
Know where the data goes.
As soon as a system has internet access, this can become difficult!
The right technology skills make work inside gated data platforms easier!
The MIT Missing Semester is a good starting point for that: (6)
Some simple examples
Example 1: The US Census Bureau uses differential privacy to protect its data releases (7) - group summaries for small groups (e.g. small towns and/or minorities) can lead to unintended disclosures.
Example 2: Allele frequency summaries in genetic data can allow membership inference (8).
Key idea:
Add noise to summaries that “hides” the contribution of individual data records.
Questions:
Differential privacy bounds how much the probability of any output statistic can change when a single subject’s data record is added or removed, given a mechanism that adds noise.
Definition (\(\varepsilon\)-Differential Privacy)
Let \(\varepsilon > 0\) and let \(n\) be the number of records in our dataset. A randomized algorithm \(T:\mathcal{D}^n \rightarrow \mathbb{R}^p\) is said to be \(\varepsilon\)-differentially private if for every subset \(A\) of \(\mathbb{R}^p\) and for all datasets \(d_1, d_2 \in \mathcal{D}^n\) which differ in only a single element, we have \(\mathbf{P}(T(d_1) \in A) \leq \exp(\varepsilon) \ \mathbf{P}(T(d_2) \in A)\).
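To make the definition concrete, here is a minimal sketch of the classic Laplace mechanism applied to a counting query (L1 sensitivity 1); the dataset, query, and \(\varepsilon\) values are illustrative assumptions, not taken from the seminar.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng):
    """Release a counting query under epsilon-DP via the Laplace mechanism.

    Adding or removing one record changes a count by at most 1, so the L1
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(predicate(x) for x in data)
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative data: ages of 100 hypothetical subjects.
rng = np.random.default_rng(0)
ages = rng.integers(20, 80, size=100)

for eps in (0.1, 1.0, 10.0):
    released = laplace_count(ages, lambda a: a >= 65, epsilon=eps, rng=rng)
    print(f"epsilon={eps:5.1f}  noisy count of subjects aged 65+: {released:8.2f}")
```

Smaller \(\varepsilon\) means a tighter bound in the definition above and therefore more noise on the released count.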
Notes:
The bulk of the sample has mean 5, but a small subset of the sample has mean 10000.
Non-private mean ≈ 14.985
Calculating a bounded mean with diffprivlib (16): increasing \(\varepsilon\) / decreasing privacy moves the privacy-preserving mean estimate closer to the correct mean of the input population (see the sketch after these notes).
Every time we release data for one value of \(\varepsilon\), we release more information about the original data! Also, choosing \(\varepsilon\) s.t. we retain “useful” data utility is not always possible!
Using differential privacy in practice requires careful choice of noise and sensitivity parameters. Not all implementations have gotten this right at all times (17) - specifically for calculating a (bounded) mean.
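A minimal sketch of the bounded-mean example above, assuming a sample of roughly 10,000 values near 5 plus a handful of outliers at 10,000 (the exact counts and seed are assumptions chosen to roughly reproduce the quoted non-private mean); it uses diffprivlib’s `tools.mean` with explicit `epsilon` and `bounds`.

```python
import numpy as np
from diffprivlib.tools import mean as dp_mean

rng = np.random.default_rng(42)

# Assumed construction: 9,990 values around 5 plus 10 outliers at 10,000,
# giving a non-private mean close to the ~15 quoted above.
bulk = rng.normal(loc=5.0, scale=1.0, size=9_990)
outliers = np.full(10, 10_000.0)
sample = np.concatenate([bulk, outliers])

print(f"non-private mean: {sample.mean():.3f}")

# Bounds clamp each record's possible contribution; they should be chosen
# without looking at the data for the epsilon-DP guarantee to stay meaningful.
for eps in (0.01, 0.1, 1.0, 10.0):
    est = dp_mean(sample, epsilon=eps, bounds=(0.0, 10_000.0))
    print(f"epsilon={eps:5.2f}  DP bounded mean: {est:10.3f}")
```

For very small \(\varepsilon\) the released estimate can be far from both 5 and ~15, which is exactly the utility trade-off noted above.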
First Name | Last Name | Sex | DOB | Cat or dog person |
---|---|---|---|---|
Antony | Hill | Male | 09-03-1986 | 🐈⬛ |
Garry | Armstrong | Male | 03-01-1986 | 🐕 |
Fenton | Riley | Male | 01-02-1987 | 🐕 |
Vivian | Johnson | Female | 04-05-1987 | 🐕 |
Wilson | Tucker | Male | 09-10-1991 | 🐕 |
ID | First Name | Last Name | Sex | Age | Cat or dog person |
---|---|---|---|---|---|
A | XXX | XXX | Male | 38 | 🐈⬛ |
B | XXX | XXX | Male | 38 | 🐕 |
C | XXX | XXX | Male | 37 | 🐕 |
D | XXX | XXX | Female | 37 | 🐕 |
E | XXX | XXX | Male | 32 | 🐕 |
When publishing subject-level data, we can “de-identify” records by removing personal identifiers (e.g. Name, DOB).
In many cases, this is not sufficient to protect the privacy of individuals (18).
[1] http://www.randat.com/
ID | Sex | Age | Profession |
---|---|---|---|
A | Male | 38 | Nurse |
B | Male | 38 | Doctor |
C | Male | 37 | Nurse |
D | Female | 37 | Doctor |
E | Male | 32 | House Spouse |
Assume we know everyone lives on the same street:
Can we re-identify individuals?
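One way to answer this question is a simple linkage attack: the sketch below joins the published table onto a hypothetical external source listing the neighbours’ names, sex, and age (invented for illustration, reusing the example values from the tables above). Quasi-identifier combinations that are unique in both tables re-identify a person and reveal their profession.

```python
import pandas as pd

# Published, "de-identified" table (values from the example above).
published = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "Sex": ["Male", "Male", "Male", "Female", "Male"],
    "Age": [38, 38, 37, 37, 32],
    "Profession": ["Nurse", "Doctor", "Nurse", "Doctor", "House Spouse"],
})

# Hypothetical external source (e.g. a list of neighbours on the street)
# linking names to the same quasi-identifiers.
neighbours = pd.DataFrame({
    "First Name": ["Antony", "Garry", "Fenton", "Vivian", "Wilson"],
    "Last Name": ["Hill", "Armstrong", "Riley", "Johnson", "Tucker"],
    "Sex": ["Male", "Male", "Male", "Female", "Male"],
    "Age": [38, 38, 37, 37, 32],
})

# Join on the quasi-identifiers; combinations that are unique in both
# tables re-identify a person and reveal their profession.
linked = published.merge(neighbours, on=["Sex", "Age"])
unique_keys = linked.groupby(["Sex", "Age"]).filter(lambda g: len(g) == 1)
print(unique_keys[["First Name", "Last Name", "Profession"]])
```

Here sex and age alone single out three of the five records.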
ID | Sex | Age | Profession |
---|---|---|---|
A | Male | 30-40 | Nurse |
B | Male | 30-40 | Doctor |
C | Male | 30-40 | Nurse |
D | Female | 30-40 | Doctor |
E | Male | 30-40 | House Spouse |
Age or date of birth (19) can be bucketed to avoid re-identification.
ID | Sex | Age | Profession |
---|---|---|---|
A | Male | 30-40 | Healthcare |
B | Male | 30-40 | Healthcare |
C | Male | 30-40 | Healthcare |
D | Female | 30-40 | Healthcare |
E | Male | 30-40 | Unspecified |
Rare categories can be obscured, or individual rare records suppressed. Typically, this is done in a way that quantifies disclosure risk for variables individually or jointly.
Comprehensive methodology exists for this (20), but optimizing data utility isn’t always simple.
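The generalisation steps shown in the tables can be sketched with pandas: bucket ages, map professions to broader categories, and count how many records share each quasi-identifier combination (a k-anonymity-style check). The values and bucket boundaries come from the example above; the check itself is only an illustrative stand-in for the full methodology of (20).

```python
import pandas as pd

records = pd.DataFrame({
    "ID": ["A", "B", "C", "D", "E"],
    "Sex": ["Male", "Male", "Male", "Female", "Male"],
    "Age": [38, 38, 37, 37, 32],
    "Profession": ["Nurse", "Doctor", "Nurse", "Doctor", "House Spouse"],
})

# Generalise quasi-identifiers: bucket age into decades and map rare or
# identifying professions to broader categories (as in the tables above).
records["Age"] = pd.cut(records["Age"], bins=[30, 40, 50], labels=["30-40", "40-50"])
profession_map = {"Nurse": "Healthcare", "Doctor": "Healthcare",
                  "House Spouse": "Unspecified"}
records["Profession"] = records["Profession"].map(profession_map)

# k-anonymity-style check: how many records share each quasi-identifier
# combination? Groups of size 1 remain at risk of re-identification.
group_sizes = records.groupby(["Sex", "Age", "Profession"], observed=True).size()
print(group_sizes)
print("smallest group size (k):", group_sizes.min())
```

Even after generalisation, the single female record and the “Unspecified” record sit in groups of size one, which is why suppression of rare records can still be needed.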
Working with synthetic data
Can improve data utility over anonymisation (1) and may enable work across different data modalities - e.g. tabular & imaging data (21).
Federated learning
Provides a framework to collaborate on data that cannot be shared directly - a common scenario particularly for healthcare data!
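To illustrate the idea, here is a toy federated-averaging sketch (sites, data, and hyperparameters are all invented): each site runs a few gradient steps on data that never leaves the site, and only the resulting model updates are averaged centrally.

```python
import numpy as np

rng = np.random.default_rng(7)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's contribution: a few gradient steps for a linear model on
    locally held data; only the updated weights are shared."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Invented example: three hospitals holding data from the same underlying
# linear model y = 2*x1 - x2 + noise, which is never pooled centrally.
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=200)
    sites.append((X, y))

# Federated averaging: in each round the server averages the site updates.
w = np.zeros(2)
for _ in range(20):
    w = np.mean([local_update(w, X, y) for X, y in sites], axis=0)

print("federated estimate:", np.round(w, 3))  # close to [2, -1]
```

Differential privacy (see below) can be layered on top by adding calibrated noise to the shared updates, at the usual cost in utility.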
Applying differential privacy
Can help add privacy guarantees to both methods above, but needs careful choice of noise parameters / privacy budget.
Privacy guarantees are typically traded off for data utility.
Disclosure risk and the data sharing scenario determine which technologies are useful, and what tradeoffs may be made.
Technology and methods alone don’t protect privacy:
… but careful use & choice of parameters do!