A new tool to improve the privacy of student text.
As text analysis techniques become more common research tools in education, there are growing concerns about student privacy. Students may purposefully or inadvertently include personally identifiable information (PII) in written or spoken texts, which can make those texts difficult and/or illegal to share in open science repositories.
Specifically, in the United States, the Family Educational Rights and Privacy Act (FERPA) protects the privacy of student education records. However, these records can be important artifacts that help researchers better understand learning and the educational process.
Thus, the process of de-identifying student data is an important element of student text analysis, but in big data situations, de-identifying student information manually is time-consuming and expensive.
There are a number of natural language processing (NLP) tools that can help automatically de-identify PII. Much of this work has focused on removing names and identifying information from medical records (14). Early methods relied on template-matching approaches to find identifying information in medical records (20). While successful, this approach requires prior information about the patients in the datasets (or about the students, if the approach were extended to educational data).
Newer approaches generally rely on named-entity recognition (NER) programs, which locate and classify named entities found in unstructured text into pre-defined categories including names, organizations, locations, time expressions, and numbers. Once these entities are found, they can be extracted from the text or renamed. However, the algorithms that underlie NER programs are “greedy” and will extract all information related to entities.
Such an approach is problematic because many of the named entities in students’ texts provide important information about students’ knowledge base and may be important predictors of educational outcomes. Removing them may influence text analyses and predictive models of student success.
For example, the de-identification program Philter, developed at the University of California, San Francisco, was designed to remove PII from medical data to make them HIPAA compliant. Philter uses a combination of regular expressions, part-of-speech (POS) tagging, and named-entity recognition (NER) to achieve very high rates of text de-identification (~95%) for medical records.
However, for student data, Philter is an unsatisfactory solution because it does not take context into consideration: it will remove every entity it encounters and replace it with the placeholder “PHI.” For student essay data, Philter would strip popular culture references from student essays (“Harry Potter” and “Voldemort,” for example) along with other entities that students may be using to support their arguments (i.e., as evidence). A sample of a student essay run through Philter returns the following:
This “greedy” approach to removing PII causes serious problems in assessing student output because much of the meaning and much of the original writing is removed from the text. Additionally, Philter will still miss around 5% of named entities, making FERPA compliance doubtful.
One solution to this problem is a hybrid one: it relies on NER but keeps a human in the loop. Specifically, we have developed a new program that searches student text for potentially personally identifiable information using the NER in spaCy. (Download the new tool on GitHub here.) The program outputs the named entities for each text in rows for humans to examine. The raters can then flag, text by text, the named entities that seem to provide PII. This approach lets humans make quick decisions about which named entities may qualify as PII without having to read through the entire text. Once a text is flagged as potentially containing PII, the raters go back to that text and manually de-identify it if necessary.
The NER in spaCy can flag 17 different kinds of information, from the names of people to the names of works of art. The program works iteratively through texts so that thousands of them can be processed automatically in a few minutes. Our initial approach only automatically flags the names of people and locations; however, other entity types can be added as we learn more about how the software interprets the content of the essays.
-Scott Crossley, Professor of Applied Linguistics and English as a Second Language at Georgia State University.
The Learning Curve publishes articles about how people learn. Please reach out with any ideas for articles on the research on learning.