> If the institution is public, data should be public as long as individual PII is removed.
PII is much broader than most people understand, because reidentifying data that amateurs would see as deidentified is easy (often trivial). As a consequence, to remain useful for research, data is often not fully deidentified.
EDIT: As an example, the HIPAA safe harbor deidentification standard requires removing 18 kinds of identifiers, including, as one of them:
> All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:
> (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and
> (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000.
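For what it's worth, the ZIP rule quoted above is roughly the following in code. This is only a sketch: `generalize_zip` and the `zip3_population` lookup are names I made up, and the population figures are placeholders, not real Census numbers.

```python
from typing import Mapping

def generalize_zip(zip_code: str,
                   zip3_population: Mapping[str, int],
                   threshold: int = 20_000) -> str:
    """Return the 3-digit ZIP prefix, or "000" when the prefix area is too small.

    zip3_population is a hypothetical lookup from 3-digit prefix to the
    combined population of every ZIP code sharing that prefix.
    """
    prefix = zip_code.strip()[:3]
    if len(prefix) < 3:
        # Malformed input: fall back to the most conservative value.
        return "000"
    # Unknown prefixes are treated as population 0, so they also become "000".
    population = zip3_population.get(prefix, 0)
    return prefix if population > threshold else "000"

# Placeholder populations, for illustration only.
example_population = {"941": 800_000, "036": 15_000}
print(generalize_zip("94110", example_population))
print(generalize_zip("03601", example_population))
```

Note that this handles only one of the 18 identifier categories, and the threshold is strict: a prefix area with exactly 20,000 people still gets replaced with 000.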
To add to this, what counts as PII isn't always even clear. Different jurisdictions define PII differently; there isn't One Master Definition that you pass a unit of data through and get back an authoritative "THIS IS PII" or "THIS ISN'T PII".
We had a multi-month project to get a subset of our data considered 'clean', and it required a consultant, a stats PhD, and many dev hours. It was healthcare, so on the high end of paranoia (justifiably), but nowhere is it as simple as dropping the "name" column.
HIPAA also allows for “expert determination” [0] for deidentification that differs from safe harbor and can allow for all sorts of things since there’s no definition of what an expert is.
And reidentification risk can be as high as 1% and still be acceptable under HIPAA. In a dataset of a million people, that’s 10,000 people identified while the dataset still counts as “acceptable.”
But HIPAA doesn’t apply to this CA data; it’s just the clearest example of deidentification regulations I know of.
But it’s totally possible to deidentify data so that it is suitable for release to these researchers. The question is just what CA considers deidentified and whether the result is still useful enough to them. For the topic they are researching, it should be pretty straightforward to remove enough PII to protect individuals while removing only some really unique combinations (i.e., a single 20-year-old of a particular race and ethnicity).
But I’m guessing age groups by race, gender, and socioeconomic status are possible to preserve without tying back to an individual. I’d go so far as to say it would be non-trivial, but pretty easy, for CA to produce this for the researchers, if not for the general public.
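To make the "age groups by race and gender" idea concrete, here is a minimal sketch of banding ages and then dropping any combination shared by fewer than `k` people. The field names, the 10-year bands, and `k = 5` are my own assumptions for illustration, not anything CA or HIPAA specifies.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

def age_band(age: int, width: int = 10) -> str:
    """Map an exact age to a coarse band such as "20-29"."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def suppress_small_cells(records: List[Dict[str, str]],
                         quasi_identifiers: Sequence[str],
                         k: int = 5) -> Tuple[List[Dict[str, str]], int]:
    """Keep only records whose quasi-identifier combination occurs at least k times.

    Returns the kept records and the number of suppressed records.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    kept = [r for r, key in zip(records, keys) if counts[key] >= k]
    return kept, len(records) - len(kept)

# Made-up rows: five people share one combination, one person is unique.
rows = [{"age_band": age_band(a), "race": "A", "gender": "F"}
        for a in (21, 23, 25, 27, 29)]
rows.append({"age_band": age_band(20), "race": "B", "gender": "M"})

kept, suppressed = suppress_small_cells(rows, ["age_band", "race", "gender"], k=5)
print(len(kept), suppressed)
```

This only guards against someone being unique on the listed columns. It says nothing about linkage through columns left out of `quasi_identifiers`, which is where the multi-month projects mentioned upthread come from.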
The intent of the comment was not to say the process is trivial or that removing PII is sufficient. However, it is not as impossible as people are making it out to be. I’ve worked on datasets at social media companies where literally thousands of columns were considered PII but realistically removing/scrambling just a subset of columns would make it impossible to identify individuals.
> I’ve worked on datasets at social media companies where literally thousands of columns were considered PII but realistically removing/scrambling just a subset of columns would make it impossible to identify individuals.
Maybe, though I doubt it was that easy against any but the most trivial reidentification efforts. But since most privately held PII isn’t regulated (in the US at least), there's little consequence for a social media company getting it wrong other than PR.