Better Privacy for EHDS

mcnesium

European Health Data Space

  • giant health data storage infrastructure for every EU citizen
  • mandatory starting 2025

types of use cases

  • primary use: doctors across the EU
  • secondary use: scientific third parties
    • universities etc.
    • commercial actors

Current state

  • pseudonyms
  • opt-out

Requirements

  • count occurrences of individual facts
  • rarely or never a need to trace facts back to individuals

Proposal: cardinality estimation

  • probabilistic data structures
  • no pseudonyms, just statistics
  • no inference back to original data
  • “privacy by design”

My privacy research project

  • harvest social media data
  • derive information
  • not store pseudonyms
  • preserve privacy

(Löchner et al. 2023)

The algorithm

HyperLogLog

(Flajolet et al. 2007)

  • apply a hash function to the pseudonyms to get uniformly distributed values
  • take the binary representation of each hash
  • count the leading zeros
    • ½ of all hashes start with 0…, ¼ with 00…, ⅛ with 000…
  • the more leading zeros observed, the larger the set (presumably)
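
A minimal sketch of the leading-zero idea, not the full algorithm. The SHA-256 hash truncated to 64 bits, the helper names `hash64` and `leading_zeros`, and the `user-…` pseudonyms are assumptions made for illustration only.

```python
import hashlib


def hash64(value: str) -> int:
    """Map a pseudonym to a uniformly distributed 64-bit integer (assumed hash)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def leading_zeros(h: int, bits: int = 64) -> int:
    """Count the leading zero bits of h in a fixed-width binary representation."""
    return bits - h.bit_length()


# Hypothetical pseudonyms; only the longest run of leading zeros is kept.
pseudonyms = [f"user-{i}" for i in range(10_000)]
max_zeros = max(leading_zeros(hash64(p)) for p in pseudonyms)

# Crude single-value guess of the set size. It is very noisy on its own,
# which is why HyperLogLog spreads the hashes over many buckets.
rough_estimate = 2 ** max_zeros
```

Only `max_zeros` is stored, so the pseudonyms themselves can be discarded after hashing.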

The algorithm 2

Cheng-Wei Hu, 2021. chengweihu.com/hyperloglog

The algorithm 3

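  • split the hashes into buckets and keep the maximum number of leading zeros per bucket
  • combine the bucket values with a harmonic mean
  • the result is the estimated cardinality of the set (~2% error, depending on the number of buckets)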
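
The bucket step could be sketched as follows. This is a simplified reading of Flajolet et al. 2007, not the implementation used in the project: the bucket count (2048), the SHA-256 based hash, and the names `sketch_add` and `sketch_estimate` are assumptions.

```python
import hashlib
import math

P = 11                              # index bits (assumption)
M = 1 << P                          # 2048 buckets
ALPHA = 0.7213 / (1 + 1.079 / M)    # bias-correction constant for M >= 128


def hash64(value: str) -> int:
    """Assumed hash: first 64 bits of SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch_add(registers: list, value: str) -> None:
    """Update the bucket chosen by the first P bits of the hash."""
    h = hash64(value)
    bucket = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)            # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1     # leading zeros + 1
    registers[bucket] = max(registers[bucket], rank)


def sketch_estimate(registers: list) -> float:
    """Harmonic-mean estimate with the small-range correction."""
    raw = ALPHA * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros > 0:
        return M * math.log(M / zeros)          # linear counting for small sets
    return raw


registers = [0] * M
for i in range(50_000):                         # hypothetical pseudonyms
    sketch_add(registers, f"user-{i}")
estimate = sketch_estimate(registers)           # expected to land near 50,000
```

With 2048 buckets the theoretical standard error is about 1.04/√2048 ≈ 2.3%. Adding the same pseudonym twice leaves the registers unchanged, so duplicates are not counted.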

Example

get_user_id() = 131242

hll_hash_integer( 131242 ) = 6629739536168032365

hll_empty() = \x118b7f

hll_add( \x118b7f, 6629739536168032365 ) = \x128b7f5c0190fb76a8986d

Special features

  • apply set operations: union and intersection

the key to advanced analyses
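
A hedged sketch of how the set operations can work on such structures: the union is the bucket-wise maximum of two sketches, and the intersection is derived by inclusion-exclusion. The bucket count (4096), the hash, and all function names are assumptions for illustration. Only the raw estimate is used, so very small sets would need an additional correction.

```python
import hashlib

P = 12                              # index bits (assumption)
M = 1 << P                          # 4096 buckets
ALPHA = 0.7213 / (1 + 1.079 / M)


def build_sketch(values) -> list:
    """Build the bucket array for an iterable of hypothetical user ids."""
    registers = [0] * M
    for v in values:
        digest = hashlib.sha256(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h >> (64 - P)
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers


def estimate(registers: list) -> float:
    """Raw harmonic-mean estimate (no small-range correction)."""
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)


def union(a: list, b: list) -> list:
    """Bucket-wise maximum: identical to the sketch of the combined set."""
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(a: list, b: list) -> float:
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return estimate(a) + estimate(b) - estimate(union(a, b))


# Two hypothetical groups sharing the ids 10,000 to 39,999.
group_a = build_sketch(range(0, 40_000))
group_b = build_sketch(range(10_000, 50_000))
either = estimate(union(group_a, group_b))
both = intersection_estimate(group_a, group_b)
```

The union loses no accuracy, while the intersection accumulates the error of three estimates and becomes unreliable when the overlap is small relative to the sets.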

Advantages

  • really fast, small storage
  • no way to recover the original pseudonym
  • no worries about data getting “stolen”
  • data sets can be published

Conclusion

  • use cardinality estimation for EHDS data storage
  • enable large-scale research
  • preserve citizen privacy
  • prevent opt-outs