2021, Number 4
Revista Cubana de Información en Ciencias de la Salud (ACIMED) 2021; 32 (4)
Characterization of a corpus extracted from maternal electronic health records through natural language processing techniques
Durango BMC, Torres SEA, Florez-Arango JF, Orozco-Duque A
Language: Spanish
References: 17
Page: 1-22
ABSTRACT
The purpose of this article was to characterize the free text available in the
electronic health records of an institution that provides care to pregnant
patients. More than a data repository, the electronic health record (EHR)
has become a clinical decision support system (CDSS). However, because of the
high volume of information, and because some of the key information in the EHR
is in free-text form, exploiting the full potential of EHR information to
improve clinical decision-making requires the support of text mining and
natural language processing (NLP) methods. In gynecology and obstetrics in
particular, NLP methods could help speed up the identification of factors
associated with maternal risk. Despite this, no papers in the literature
integrate NLP techniques into EHRs associated with maternal follow-up in
Spanish. To address this knowledge gap, a corpus was generated and
characterized from the EHRs of a gynecology and obstetrics service that
specializes in treating high-risk maternal patients. NLP and text mining
methods were applied to the data, yielding 659,789 tokens and a vocabulary of
7,334 unique words. The data were characterized by identifying the most
frequent words and n-grams, and a 300-dimensional vector representation of the
words (word embeddings) was obtained using a CBOW (Continuous Bag of Words)
neural network architecture. Clustering algorithms applied to the word
embeddings showed that words assigned to the same group can represent
associations related to types of patients, or can group similar words,
including words written with spelling errors. The corpus generated and the
results found lay the foundations for future work on entity detection
(symptoms, signs, diagnoses, treatments), spelling correction, and semantic
relationships between words, with the aim of generating summaries of medical
records or assisting the follow-up of mothers through automated review of the
electronic health record.
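The corpus statistics reported in the abstract (total tokens, vocabulary size, most frequent words and n-grams) can be illustrated with a minimal sketch. The three short notes, the regular-expression tokenizer, and the helper names `tokenize` and `ngrams` are assumptions made for illustration only; the abstract does not describe the authors' actual preprocessing pipeline.

```python
import re
from collections import Counter

# Hypothetical toy notes standing in for de-identified clinical free text;
# they are NOT taken from the corpus described in the abstract.
notes = [
    "paciente con embarazo de alto riesgo",
    "paciente con preeclampsia severa",
    "embarazo de alto riesgo con preeclampsia",
]

def tokenize(text):
    """Lowercase the text and extract runs of letters, including accented Spanish letters."""
    return re.findall(r"[a-záéíóúüñ]+", text.lower())

def ngrams(tokens, n):
    """Return every contiguous n-gram (as a tuple) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokenize each note separately so that n-grams never span two notes.
tokenized = [tokenize(note) for note in notes]
all_tokens = [tok for doc in tokenized for tok in doc]

total_tokens = len(all_tokens)   # corpus size in tokens
vocabulary = set(all_tokens)     # unique words
word_freq = Counter(all_tokens)  # word frequencies
bigram_freq = Counter(bg for doc in tokenized for bg in ngrams(doc, 2))

print("tokens:", total_tokens, "vocabulary:", len(vocabulary))
print("top words:", word_freq.most_common(3))
print("top bigrams:", bigram_freq.most_common(3))
```

The per-note token lists in `tokenized` are also the usual input format for training word embeddings, so a CBOW model and a clustering step could follow from the same structure; which libraries and settings the authors used beyond the 300-dimensional CBOW architecture is not stated in the abstract.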