Better Privacy for EHDS

mcnesium

European Health Data Space

giant health data storage infrastructure for every EU citizen
mandatory starting 2025

types of use cases

primary use: doctors around EU
secondary use: scientific third parties
- universities etc.
- commercial actors

Current state

pseudonyms
opt-out

Requirements

count occurrences of individual facts
rarely or never a need to trace results back to individuals

Proposal: cardinality estimation

probabilistic data structures
no pseudonyms, just statistics
no inference back to original data
“privacy by design”

My privacy research project

harvest social media data
derive information
do not store pseudonyms
preserve privacy

(Löchner et al. 2023)

The algorithm

HyperLogLog

(Flajolet et al. 2007)

apply a hash function to the pseudonyms for a uniform distribution
take the binary representation of the hash
count the leading zeros of each hash
- ½ of all hashes start with 0…, ¼ with 00…, ⅛ with 000…
the more leading zeros observed, the larger the set (presumably)
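
The hashing and zero-counting steps can be sketched in a few lines of Python. This is only an illustration: the choice of SHA-256 truncated to 64 bits and the helper names `hash64` and `leading_zeros` are assumptions, not the hash or the functions used in the talk's example.

```python
import hashlib

HASH_BITS = 64  # assumption: work with the first 64 bits of a SHA-256 digest


def hash64(pseudonym: str) -> int:
    """Map a pseudonym to a (practically) uniformly distributed 64-bit integer."""
    digest = hashlib.sha256(pseudonym.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def leading_zeros(value: int, bits: int = HASH_BITS) -> int:
    """Count the leading zero bits of `value` in a `bits`-wide binary word."""
    return bits - value.bit_length()


# Observe how often each number of leading zeros occurs (hypothetical user IDs).
counts = {}
for user_id in range(100_000):
    z = leading_zeros(hash64(str(user_id)))
    counts[z] = counts.get(z, 0) + 1

# Share of hashes with at least one / at least two leading zeros.
share_one = sum(c for z, c in counts.items() if z >= 1) / 100_000
share_two = sum(c for z, c in counts.items() if z >= 2) / 100_000
```

With a uniform hash, `share_one` should land near ½ and `share_two` near ¼, which is the intuition behind reading set size off the longest run of zeros.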

The algorithm 2

Cheng-Wei Hu, 2021. chengweihu.com/hyperloglog

The algorithm 3

keep the maximum number of leading zeros per bucket
combine the buckets using a harmonic mean
the result is the estimated cardinality of the set (~2% error rate)
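
A minimal HyperLogLog sketch in Python, to make the bucket step concrete. It follows the estimator from Flajolet et al. (harmonic mean over per-bucket maxima, plus the small-range correction), but the class name, the 2048 buckets (`p = 11`) and the SHA-256 based hash are assumptions for illustration, not the implementation behind the example on the next slide.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog with 2**p buckets (registers)."""

    def __init__(self, p: int = 11):
        self.p = p                      # number of hash bits used as bucket index
        self.m = 1 << p                 # number of buckets
        self.registers = [0] * self.m   # max "leading zeros + 1" seen per bucket

    @staticmethod
    def _hash64(item: str) -> int:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, item: str) -> None:
        h = self._hash64(item)
        width = 64 - self.p
        bucket = h >> width                      # first p bits pick the bucket
        rest = h & ((1 << width) - 1)            # remaining bits are inspected
        rank = width - rest.bit_length() + 1     # leading zeros of `rest`, plus one
        self.registers[bucket] = max(self.registers[bucket], rank)

    def count(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)         # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:             # small-range correction
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog()
for user_id in range(50_000):        # hypothetical pseudonyms
    hll.add(str(user_id))
    hll.add(str(user_id))            # duplicates leave the sketch unchanged
estimate = hll.count()
```

Only the 2048 small registers are stored, never the pseudonyms themselves. The standard error of the estimate is about 1.04/√m, so roughly 2.3% for this bucket count.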

Example

get_user_id() = 131242

hll_hash_integer( 131242 ) = 6629739536168032365

hll_empty() = \x118b7f

hll_add( \x118b7f, 6629739536168032365 ) = \x128b7f5c0190fb76a8986d

Special features

apply set operations: union and intersection

the key to advanced analyses
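
How the set operations work can be sketched as follows, under the same assumptions as before (SHA-256 based hash, 2048 buckets, illustrative function names). The union of two sketches is the bucket-wise maximum and is lossless. HyperLogLog has no native intersection, so the sketch below derives it by inclusion-exclusion, which is one common approach and not necessarily the one used in the talk. The two diagnosis sets are made-up data.

```python
import hashlib
import math

P = 11          # assumption: 11 index bits
M = 1 << P      # 2048 buckets


def sketch(items):
    """Build the HyperLogLog registers for an iterable of pseudonyms."""
    regs = [0] * M
    width = 64 - P
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h >> width
        rest = h & ((1 << width) - 1)
        regs[bucket] = max(regs[bucket], width - rest.bit_length() + 1)
    return regs


def estimate(regs):
    """Estimate the cardinality from the registers."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:      # small-range correction
        return M * math.log(M / zeros)
    return raw


def union(a, b):
    """Union of two sketches: bucket-wise maximum (lossless)."""
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(a, b):
    """|A ∩ B| ≈ |A| + |B| - |A ∪ B| (inclusion-exclusion)."""
    return estimate(a) + estimate(b) - estimate(union(a, b))


# Made-up example: pseudonyms 40,000..59,999 appear in both sets.
diagnosis_a = sketch(range(0, 60_000))
diagnosis_b = sketch(range(40_000, 100_000))
either = estimate(union(diagnosis_a, diagnosis_b))
both = intersection_estimate(diagnosis_a, diagnosis_b)
```

The union sketch is identical to the one built from all pseudonyms at once. The intersection estimate inherits the errors of three separate estimates, so it becomes unreliable when the overlap is small compared to the sets involved.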

Advantages

really fast, small storage
no way to recover the original pseudonym
no worries about data getting “stolen”
data sets can be published

Conclusion

use cardinality estimation for EHDS data storage
do massive research
preserve citizen privacy
prevent opt-outs

Link List

mcnesium.de/37c3