Better Privacy for EHDS
mcnesium
European Health Data Space
- giant health data storage infrastructure for every EU citizen
- mandatory starting 2025
Types of use cases
- primary use: doctors around EU
- secondary use: scientific third parties
  - universities etc.
  - commercial actors
Requirements
- count occurrences of individual facts
- little or no need to trace results back to individuals
Proposal: cardinality estimation
- probabilistic data structures
- no pseudonyms, just statistics
- no inference back to original data
- “privacy by design”
My privacy research project
- harvest social media data
- derive information
- do not store pseudonyms
- preserve privacy
(Löchner et al. 2023)
The algorithm
HyperLogLog
(Flajolet et al. 2007)
- apply a hash function to the pseudonyms to get uniformly distributed values
- take binary representation of the hash
- observe how many leading zeros occur
- ½ of all hashes start with 0…, ¼ with 00…, ⅛ with 000…
- the more leading zeros observed, the larger the set (presumably)
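The leading-zero observation can be tried out with a few lines of Python. This is only a sketch under assumptions: SHA-256 truncated to 64 bits stands in for the hash function, and the names `hash64`, `leading_zeros`, `max_leading_zeros` and the `user-…` pseudonyms are made up for illustration.

```python
import hashlib


def hash64(pseudonym: str) -> int:
    """Map a pseudonym to a uniformly distributed 64-bit integer (assumed hash: SHA-256)."""
    digest = hashlib.sha256(pseudonym.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def leading_zeros(h: int, bits: int = 64) -> int:
    """Count the leading zero bits in the fixed-width binary representation of h."""
    return bits - h.bit_length()


def max_leading_zeros(n: int) -> int:
    """Longest run of leading zeros seen among the hashes of n made-up pseudonyms."""
    return max(leading_zeros(hash64(f"user-{i}")) for i in range(n))


small = max_leading_zeros(100)
large = max_leading_zeros(100_000)
print(small, large)  # the larger set can only show an equal or longer run
```

A single maximum like this is a very noisy estimate, which is why HyperLogLog splits the hashes into buckets and combines many such observations.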
The algorithm 2
(Cheng-Wei Hu, 2021. chengweihu.com/hyperloglog)
The algorithm 3
- record the maximum number of leading zeros per bucket
- average over all buckets (HyperLogLog uses a harmonic mean)
- the result yields the estimated cardinality of the set (~2% error rate)
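The bucket and averaging steps can be sketched as a minimal HyperLogLog in Python. This is an illustrative sketch, not the implementation used in the talk: the class name `HyperLogLog`, the SHA-256 based hash, the default of 2^12 buckets and the `user-…` pseudonyms are assumptions. The constant and the small-range correction follow Flajolet et al. 2007.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch with m = 2**p buckets (assumes p >= 7)."""

    def __init__(self, p: int = 12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, pseudonym: str) -> None:
        # Assumed hash function: SHA-256 truncated to 64 bits.
        digest = hashlib.sha256(pseudonym.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h >> (64 - self.p)              # first p bits select the bucket
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = leading zeros of the remaining bits, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[bucket]:
            self.registers[bucket] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        # Harmonic mean of 2**rank over all buckets, scaled by alpha * m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            # Small-range correction: linear counting on the empty buckets.
            return m * math.log(m / zeros)
        return raw


hll = HyperLogLog(p=12)
empty_estimate = hll.estimate()
for i in range(100_000):
    hll.add(f"user-{i}")
est = hll.estimate()
print(round(est))  # close to 100000; the pseudonyms themselves are not stored
```

Only the small array of per-bucket maxima is kept, so adding the same pseudonym again does not change the sketch.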
Example
get_user_id() = 131242
hll_hash_integer( 131242 ) = 6629739536168032365
hll_empty() = \x118b7f
hll_add( \x118b7f, 6629739536168032365 ) = \x128b7f5c0190fb76a8986d
Special features
- apply set operations: union and intersection
- the key to advanced analyses
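How the set operations work can be shown on plain bucket arrays. The following Python sketch rests on assumptions: the function names, the SHA-256 based hash and the 2^14 buckets are illustrative, and only the raw estimate is used (no small-range correction), so it is meant for sets much larger than the bucket count. The union is the bucket-wise maximum of two sketches; the intersection is derived via inclusion-exclusion, which is one common approach and less precise than the union.

```python
import hashlib

P = 14        # assumed precision: 2**14 buckets
M = 1 << P


def sketch(pseudonyms):
    """Build a HyperLogLog bucket array from an iterable of pseudonyms."""
    registers = [0] * M
    for pseudonym in pseudonyms:
        digest = hashlib.sha256(str(pseudonym).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket = h >> (64 - P)
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1
        registers[bucket] = max(registers[bucket], rank)
    return registers


def estimate(registers):
    """Raw HyperLogLog estimate (harmonic mean over buckets)."""
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)


def union(a, b):
    """Union of two sketches: bucket-wise maximum, no original data needed."""
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(a, b):
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|."""
    return estimate(a) + estimate(b) - estimate(union(a, b))


a = sketch(range(0, 120_000))        # made-up pseudonyms 0..119999
b = sketch(range(40_000, 160_000))   # overlaps with a in 80000 pseudonyms
union_est = estimate(union(a, b))
inter_est = intersection_estimate(a, b)
print(round(union_est), round(inter_est))  # roughly 160000 and 80000
```

The union of two sketches is identical to the sketch of the combined data, so separately collected data sets can be merged after the fact.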
Advantages
- really fast, small storage
- no way to recover the original pseudonym
- no worries about data getting “stolen”
- data sets can be published
Conclusion
- use cardinality estimation for EHDS data storage
- do massive research
- preserve citizen privacy
- prevent opt-outs
Link List
mcnesium.de/37c3