De-identifying Unstructured Medical Text and Attribute-based Utility Measurement

Gun Ro, Jonghoon Chun


De-identification removes personal information from a data set so that the remaining information cannot be linked to a specific individual. It thereby lowers the risk of exposing personal information during the collection, processing, storage, and distribution of data. Although de-identification algorithms and protection models have been studied extensively, most of this work is limited to structured data, and the de-identification of unstructured data has received relatively little attention. In the medical field in particular, where unstructured text is used frequently, practitioners often simply remove all personally identifiable information to lower the exposure risk, accepting that data utility is reduced as a result. This study proposes a new method that de-identifies unstructured medical text by applying the k-anonymity protection model; the medical field is targeted because de-identification is mandatory there and privacy protection is more critical than in other fields. The study also proposes a new utility metric that lets users intuitively understand the utility of a de-identified data set. We expect that applying these results to the various industrial fields that use unstructured text will increase the utility of text containing personal information.
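As a rough illustration of the k-anonymity model mentioned in the abstract, the sketch below checks whether every combination of quasi-identifier values is shared by at least k records, and shows how generalizing a value (here, age into 10-year bands) can make a small data set 2-anonymous. It is not the authors' method: the record fields (`age`, `sex`, `diagnosis`), the sample values, and the helper names are hypothetical, and the sketch assumes the quasi-identifiers have already been extracted from the free text into fields.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs in at least k records."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return all(n >= k for n in sizes.values())


def generalize_age(age, width=10):
    """Replace an exact age with a band such as '30-39' (hypothetical generalization rule)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Hypothetical values, assumed to be already extracted from clinical notes.
records = [
    {"age": 34, "sex": "F", "diagnosis": "asthma"},
    {"age": 37, "sex": "F", "diagnosis": "diabetes"},
    {"age": 52, "sex": "M", "diagnosis": "asthma"},
    {"age": 58, "sex": "M", "diagnosis": "hypertension"},
]

# Generalize the age quasi-identifier while leaving other attributes intact.
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]

print(is_k_anonymous(records, ["age", "sex"], k=2))
print(is_k_anonymous(generalized, ["age", "sex"], k=2))
```

With exact ages, every record is unique on (`age`, `sex`), so the check fails for k = 2. After generalization each band and sex pair covers two records, so the check passes while the diagnosis attribute remains usable.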




