We count on you! Balancing privacy and analytics

by David Barnett

Analytics at Khan Academy

Providing a free, world-class education to anyone, anywhere is a lofty goal and one that all of us at Khan Academy pursue with passion. Achieving it means trying to make the best decisions for individual learners across many countries, regions, districts, and schools. In order to do the right thing (and understand what that right thing is) across so many demographics, we must understand our successes and our failures.

The key to understanding our successes and failures is data. We have designed our systems to provide our analysts with the tools they need without compromising our commitment to privacy. In this post, I’ll go through some examples of how privacy-protecting analytics can be done.

Forget personal data while analyzing user actions

There are many reasons for an organization to store personal data. In Khan Academy’s case, you might want us to email you results, inform you of new features, or just let you know about a new assignment from your teacher! However, if you decide to sever your relationship with the site, we would want to protect your privacy by no longer keeping your personal information around. 

It can be tough to balance privacy protection with the desire to know what people have been doing on our site in a more general sense. Fortunately, this is not an unsolvable problem. Let’s take a look at an example! (Note: The data and schema below are fictional.)

Users
id  first_name  last_name  email                state  hours_used
1   Nadda       Realname   nadda@example.com    MI     100
2   Madus       Allup      madus@example.com    CA     38
3   Imagine     Aryname    imagine@example.com  MI     15
4   Justin      Mymind     justin@example.com   OH     68
5   Will        Ibereal    will@example.com     CA     103

Through this data, we might be interested in learning some basic things about who is using our website and how:

  • How many users do we have?
  • How many users do we have per state?
  • What is the total number of hours of usage by state?
  • How do states rank by average number of hours used?

Of course, there are lots of other things we could ask and answer. But one thing the questions above have in common is that, important and interesting as they are, none of them requires the analyst to know anything personal about any user. Without diving into SQL or other query methods, we can see that none of the columns containing personally identifiable information (PII) is needed to answer any of our proposed questions.

The best way to avoid misuse of personal data, whether intentional or unintentional, is not to give anyone (external or internal) unnecessary access to that data in the first place.
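
To make that concrete, here is a minimal Python sketch that answers the four questions listed above using only the id, state, and hours_used columns of the fictional table. The names USERS and summarize are made up for this illustration and are not taken from any real Khan Academy system.

```python
from collections import defaultdict

# Fictional rows from the Users table, restricted to the non-PII columns.
USERS = [
    {"id": 1, "state": "MI", "hours_used": 100},
    {"id": 2, "state": "CA", "hours_used": 38},
    {"id": 3, "state": "MI", "hours_used": 15},
    {"id": 4, "state": "OH", "hours_used": 68},
    {"id": 5, "state": "CA", "hours_used": 103},
]


def summarize(users):
    """Answer the four example questions without touching any PII column."""
    users_per_state = defaultdict(int)
    hours_per_state = defaultdict(int)
    for user in users:
        users_per_state[user["state"]] += 1
        hours_per_state[user["state"]] += user["hours_used"]

    average_hours = {
        state: hours_per_state[state] / users_per_state[state]
        for state in users_per_state
    }
    # States ordered from highest to lowest average usage.
    ranking = sorted(average_hours, key=average_hours.get, reverse=True)

    return {
        "total_users": len(users),
        "users_per_state": dict(users_per_state),
        "hours_per_state": dict(hours_per_state),
        "ranking_by_average_hours": ranking,
    }


summary = summarize(USERS)
```

With the sample rows, summary reports 5 users, 141 total hours for CA, and a ranking of CA, OH, MI by average hours. Names and email addresses never enter the computation.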

Mask PII with encryption

Our approach is to encrypt each user’s personal data with an encryption key unique to that user so that analysts can do their work without compromising personal information. These keys can be stored separately and locked down even further, so that they are accessible only to a few analysts and used only when we need to communicate directly with the user.

Now the tables may look something like this:

Keys
id  encryption_key
1   igwaordks
2   wiorjdfklv
3   fmnaasdnf
4   lkvjwekjsd
5   fhqwhgads

Users
id  first_name                          last_name                           email                                                   state  hours_used
1   Ipymv9XvfAWC6OAOZ6SBjwRkcrB1MN24=   yFaR17EO2luqxSP4CZEXjSOiUj1j4UeQ=   kprB9exzIqFtwqTTa0VIqfc7DlwCW1ssQG4/o2fNCLsu2iVp5C3Si   MI     100
2   8yaTJ1DSPglQwJPn7aKEz1rjjS2YbeUGo=  9vVhxLK4aQzhh+BxiiTgbrYbkKHDRo7RU=  6z1rn2zawbS5JomjVFPVvFi8iCn1hiTDuJmusLC4vc4ME+/3ddX88   CA     38
3   9LJfdVIjSoVITCYttPKoUB5GzQCet0n58=  6h+IJ6QXVIeSciFUPyvHoVsfjSUMEflk=   9XT2X2KnPWCNa7NStM73q3jtB2KJA3g1LzK3E4LZWP1V3nOdnVUHZf  MI     15
4   viSyRXduVhun8fSgWqm1q6BA5h4haDPQ=   FcMEsH1wRPRWfPJ4Y47tyQ4iUVq+4neE=   AHbTTPAUMR3zs2leuxhDqr3ixuSwCFXSx0W52bm5EuJsc69NkzmvZ   OH     68
5   iQ2KNmKoBFpCG2oJNfGiwSXMOMH1ke8U=   8rVZDpsKFDxyxrTKo41MXFMe48XIG30FU=  vG/Q90GW24WLQFR0mzZQdGjIlDKRjcNBse6t00ewy6IEDidhpn4yA   CA     103

Astute observers will notice that this data looks base64 encoded rather than encrypted. That’s because we’ve base64 encoded these values after encryption in order to make storage (not to mention display in this article) simpler.

Only people who have access to the keys table will have any idea how funny the names I made up are. That’s okay, because for everyone else it’s none of their business. Their business is answering questions about how people are using the site, and they can still answer all the questions we listed above.
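
As a rough sketch of the masking step, the Python below encrypts the three PII columns with a per-user key and base64 encodes the result, leaving state and hours_used readable. The post does not say which cipher is used in practice, so everything here is an assumption: the toy cipher (an HMAC-SHA256 keystream XORed with the data) only stands in for a real authenticated cipher such as AES-GCM and should not be used to protect real data. The names encrypt_field, decrypt_field, mask_user, and PII_COLUMNS are hypothetical.

```python
import base64
import hashlib
import hmac
import secrets

PII_COLUMNS = ("first_name", "last_name", "email")
NONCE_SIZE = 16


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` bytes from HMAC-SHA256 run in counter mode."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def encrypt_field(key: bytes, plaintext: str) -> str:
    """Toy encryption: XOR with a keystream, then base64 encode for storage."""
    nonce = secrets.token_bytes(NONCE_SIZE)  # fresh per field, stored with the data
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    ciphertext = bytes(a ^ b for a, b in zip(data, stream))
    return base64.b64encode(nonce + ciphertext).decode("ascii")


def decrypt_field(key: bytes, token: str) -> str:
    """Reverse encrypt_field; only possible for holders of the user's key."""
    raw = base64.b64decode(token)
    nonce, ciphertext = raw[:NONCE_SIZE], raw[NONCE_SIZE:]
    stream = _keystream(key, nonce, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream)).decode("utf-8")


def mask_user(user: dict, key: bytes) -> dict:
    """Encrypt the PII columns and pass every other column through unchanged."""
    return {
        column: encrypt_field(key, value) if column in PII_COLUMNS else value
        for column, value in user.items()
    }


# The keys table lives apart from the users table and is locked down separately.
keys = {1: secrets.token_bytes(32)}
user = {
    "id": 1,
    "first_name": "Nadda",
    "last_name": "Realname",
    "email": "nadda@example.com",
    "state": "MI",
    "hours_used": 100,
}
masked = mask_user(user, keys[1])
```

An analyst who sees only masked can still group by state and sum hours_used, while decrypt_field(keys[1], masked["email"]) works only for someone who can also read the keys table.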

Preserve analytics, but forget PII

It’s common for a provider to store user information in multiple places, such as distinct databases, backups, an object store, or a data warehouse. It is unlikely that one person or even one team knows everywhere user data is stored. This makes it very difficult to remove a user completely. 

However, if we always store user data encrypted and keep the decryption key in a single place, then we only have to worry about deleting the user’s decryption key. Once the user’s key is removed from the keys table, there is no way to decrypt that user’s personal data, wherever copies of it may live. And since we don’t require any personal information for analytics purposes, we don’t lose the ability to answer our general, aggregated questions about usage.
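
The effect of deleting a key can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of the real system: the cipher is a toy (an HMAC-SHA256 keystream XORed with the data, standing in for a real authenticated cipher), only one field is encrypted per key so no nonce is used, and the names primary_db, warehouse_copy, read_email, and forget_user are hypothetical.

```python
import hashlib
import hmac
import secrets


def xor_with_keystream(key: bytes, data: bytes) -> bytes:
    """Toy cipher: the same call encrypts and decrypts. Not for real data."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


# The single place where decryption keys are kept.
keys = {1: secrets.token_bytes(32), 2: secrets.token_bytes(32)}


def make_row(user_id: int, email: str, state: str, hours_used: int) -> dict:
    """Build a stored row whose email is encrypted with the user's own key."""
    encrypted = xor_with_keystream(keys[user_id], email.encode("utf-8"))
    return {"id": user_id, "email": encrypted, "state": state, "hours_used": hours_used}


# The same encrypted rows may end up copied into several systems.
primary_db = [
    make_row(1, "nadda@example.com", "MI", 100),
    make_row(2, "madus@example.com", "CA", 38),
]
warehouse_copy = [dict(row) for row in primary_db]


def read_email(row: dict):
    """Return the decrypted email, or None if the user's key has been deleted."""
    key = keys.get(row["id"])
    if key is None:
        return None
    return xor_with_keystream(key, row["email"]).decode("utf-8")


def forget_user(user_id: int) -> None:
    """Forgetting a user only requires removing one entry from the keys table."""
    keys.pop(user_id, None)


email_before = read_email(warehouse_copy[0])
forget_user(1)
email_after = read_email(warehouse_copy[0])

# Aggregate analytics are unaffected because they never needed the key.
total_hours = sum(row["hours_used"] for row in primary_db)
```

After forget_user(1), the email for user 1 can no longer be read from either copy, yet total_hours still includes that user's 100 hours.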

This approach has allowed us to respect our users’ right to privacy while still being able to provide essential information to our data analysts and business leaders.

We at Khan Academy love working with data! Are you interested in working with our data or any of our other tools/teams? Our team comes from a wide variety of backgrounds, and we actively foster a cross-disciplinary environment because we believe that’s where the magic happens. Khan Academy currently employs around 200 full-time staff, including the creators of our educational content, who come from teaching backgrounds. Learn more and explore open positions.
