Project Leads: Arvind Narayanan and Jacob Metcalf
One of the most persistent challenges for pervasive computing researchers and ethicists alike is quantifying the risk posed by pervasive datasets to data subjects and communities. De-identification was once considered a rigorous enough protection that it de facto satisfied privacy requirements. However, mathematically sophisticated re-identification techniques have weakened the dichotomy between sensitive and nonsensitive data, and machine learning threatens to obliterate it.
This sub-project will draw on empirical and mathematical assessments to develop clearer metrics for inference risk. We will revisit (1) well-known inference demonstrations such as the Facebook “likes” study, (2) long-standing research problems such as inference of author gender from writing style, and (3) a representative set of Kaggle contests. In each case, we will re-run the original experiments under a variety of settings, quantify inference using entropy, and objectively compare how accuracy changes over time, with the volume of data, and with the choice of algorithm. Where accuracy and entropy remain stable as these conditions change, that stability points to an inherent limit on predictability. Further, we will use our metrics to test whether deep learning and other novel machine learning techniques hold surprises for privacy.
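One way to make the entropy-based quantification concrete is to measure how much an inference attack reduces an attacker's uncertainty about a sensitive attribute. The sketch below is illustrative only: the prior and posterior distributions are hypothetical, and the function names are our own, not from any published codebase.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def inference_gain(prior, posterior):
    """Entropy reduction (bits) achieved by an inference:
    how much uncertainty about the sensitive attribute shrinks
    once the attacker sees the model's prediction."""
    return shannon_entropy(prior) - shannon_entropy(posterior)

# Hypothetical binary attribute with a 50/50 population prior.
prior = [0.5, 0.5]        # baseline uncertainty: 1 bit
posterior = [0.9, 0.1]    # model's output for one data subject

print(round(shannon_entropy(prior), 3))           # 1.0
print(round(inference_gain(prior, posterior), 3)) # 0.531
```

If the gain stays flat as data volume grows or algorithms change, that is the stability signal described above: an apparent ceiling on predictability for that attribute.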
Developing standards for communicating and ameliorating risk:
Next, we will use the knowledge gained from the above exercises to develop a scientifically rigorous framework for estimating and communicating the inference risks in common research scenarios. This will allow researchers and regulators to characterize, and perhaps quantify, risk in terms of the sensitivity of the inferred attributes, the accuracy of inference, the time period over which the inference would become feasible, and other factors.
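A framework like this could be serialized as a simple structured record combining the factors listed above. The field names and example values below are purely illustrative, a sketch of what a risk-communication record might look like rather than a proposed standard.

```python
from dataclasses import dataclass

@dataclass
class InferenceRisk:
    """Hypothetical record for communicating an inference risk."""
    attribute: str        # what is inferred, e.g. "political affiliation"
    sensitivity: str      # qualitative rating: "low" / "medium" / "high"
    accuracy: float       # inference accuracy on held-out data, 0..1
    horizon_years: float  # estimated time until the inference is feasible

    def summary(self) -> str:
        """One-line, plain-language statement of the risk."""
        return (f"Inferring {self.attribute} ({self.sensitivity} sensitivity) "
                f"at {self.accuracy:.0%} accuracy, feasible within "
                f"{self.horizon_years:g} years")

risk = InferenceRisk("political affiliation", "high", 0.85, 2)
print(risk.summary())
```

A record of this shape could give an IRB or regulator a consistent vocabulary for comparing risks across studies, even where the underlying estimates remain qualitative.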
Understanding how stakeholders respond to inferential risks:
Drawing on qualitative evidence, we plan to uncover areas of divergence between the inference risks we identify and the perception of those risks by both user and researcher communities. We hypothesize that while inference risks are sometimes underappreciated, they are just as often drastically overstated and not rooted in scientific fact. By examining how stakeholders perceive and respond to inferential risk, we can develop a more coherent approach to aligning stakeholder expectations with genuine safeguards.