Balancing Data Accessibility and Privacy: Machine Learning Approach to PII Detection in Electronic Health Records
Authors
Musah, Issah
Issue Date
2026-01
Type
Dissertation
Language
en
Keywords
Balancing Data Accessibility and Privacy, Business, Engineering, Science, & Technological Innovation
Abstract
This constructive research study examined the development of a scalable, context-aware machine learning (ML) framework for detecting personally identifiable information (PII) in unstructured electronic health records (EHRs). The research problem addressed the absence of reproducible, data-driven methods capable of balancing privacy preservation and data accessibility while maintaining compliance with legal frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). The study focused on healthcare organizations and researchers who face challenges in protecting sensitive health data while facilitating secure data sharing for clinical and analytical purposes. The study's purpose was to construct, implement, and evaluate a privacy-preserving artifact guided by the Cross-Industry Standard Process for Data Mining (CRISP-DM) framework. The research design integrated natural language processing (NLP) with unsupervised and hybrid ML algorithms, including term frequency–inverse document frequency (TF-IDF) vectorization, singular value decomposition (SVD), and density-based spatial clustering of applications with noise (DBSCAN). A transformer-based named entity recognition (NER) module built on Bidirectional Encoder Representations from Transformers (BERT) was used to validate the clustering outputs. The research data were obtained from the Medical Information Mart for Intensive Care (MIMIC-III) database, a publicly available and de-identified dataset licensed through PhysioNet (Johnson et al., 2016). The experimental code and replication scripts are available at https://github.com/NU-Academics/PII-Detection or in the Colab notebook "Bert & Regular_Expression PII Detection". The model was trained and evaluated in Google Colab using BigQuery integration to ensure compliance with PhysioNet's data-use requirements.
Empirical results showed that at a sample size of 5,000 records, the model achieved a precision of 0.955 and a recall of 0.466. When scaled to 10,000 records, precision declined to 0.854 but remained high, while recall improved to 0.580. Clustering validity indices confirmed coherent separation between PII-dense and non-PII clusters (silhouette coefficient ≈ 0.38–0.45; Davies–Bouldin Index ≈ 0.95–0.99). Approximately 61 percent of the records were labeled as noise, indicating that the model effectively isolated high-risk text regions while minimizing false positives. The study concluded that unsupervised NLP methods can reliably identify latent PII patterns within de-identified clinical narratives, achieving performance comparable to that of supervised models at lower computational cost. These findings demonstrate that scalable ML frameworks can reconcile the privacy–utility balance in EHR analytics. The research recommends incorporating hybrid explainable AI components, such as SHAP and LIME, to improve interpretability, and extending future validation to institutionally governed datasets containing unredacted identifiers under Institutional Review Board (IRB) oversight.
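The precision and recall figures reported above can be read against a minimal sketch of how the two metrics are computed. This is an illustration only, not the dissertation's evaluation code: the function name precision_recall, the record identifiers, and the toy counts are all hypothetical and were chosen simply to show how a detector can be highly precise while still missing many PII-bearing records.

```python
from __future__ import annotations


def precision_recall(flagged: set[str], truth: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for flagged record IDs against ground truth.

    precision = true positives / everything the detector flagged
    recall    = true positives / everything that truly contains PII
    """
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy data (hypothetical, NOT from the study): 40 records truly contain PII.
truth = {f"rec{i}" for i in range(40)}

# The detector flags 20 records: 19 are correct, 1 is a false positive.
flagged = {f"rec{i}" for i in range(19)} | {"rec999"}

precision, recall = precision_recall(flagged, truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
# prints: precision=0.950 recall=0.475
```

In this toy case almost every flagged record is a true hit, so precision is high, yet more than half of the PII-bearing records go unflagged, so recall is low. That is the same qualitative pattern as the 5,000-record result.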
