CASE STUDY

Case study 2: Text extraction and categorisation of doctor-patient mobile chat data in Spanish

We design innovative and impactful data-driven solutions for clients, which we deliver in the form of user-friendly applications and tools. We have extensive experience in implementing and assessing the impact of these solutions in partnership with our clients.

Services performed

  • Text summarisation and medical concept extraction using natural language processing techniques
  • Development of a user interface that overlays colour-coded highlights on the original chat text, marking the important medical phrases identified and giving an at-a-glance overview of key concepts in their original context

Client

Curatech is a Mexico-based health tech start-up that is building a mobile chat platform that aims to improve doctor-patient remote care interactions, giving doctors an at-a-glance summary of prior and current care advice provided and patients a better experience. 

Requirement

As part of the minimum viable product being built for launch in September 2019, our client required a prototype model that could read raw doctor-patient chats in Spanish and identify medical terms within the chat text.

Challenges

This was a time-sensitive project: we had only 14 days to build the prototype. Adding to the complexity, few raw doctor-patient chats were available for training and testing a model (given that Curatech is a start-up). Furthermore, we were building the model on and for the Spanish language, for which fewer public medical terminologies, drug databases and pre-trained word embeddings exist than for English.

Solution

As the team appointed to this project, we developed a natural language processing (NLP) model to detect Spanish text related to the following clinical categories: symptoms, body parts, diagnoses and treatments. In addition to the medical classification of the text, which included a colour-coded overlay according to the specific clinical category, we implemented a succinct summary of each chat interaction highlighting the identified medical phrases.
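
To illustrate the general idea of matching chat text against a categorised term database (the client feedback mentions fuzzy matching as one of the approaches proposed), here is a minimal sketch in Python. It is not the delivered model: the six-entry lexicon, the function names `normalise` and `find_terms`, and the 0.85 similarity cutoff are all assumptions made for illustration, and the matching relies only on Python's standard `difflib` module.

```python
import difflib
import re
import unicodedata

# Hypothetical mini-lexicon (term -> clinical category), for illustration only.
# A real term database would be far larger and built from medical sources.
LEXICON = {
    "dolor de cabeza": "symptom",
    "fiebre": "symptom",
    "garganta": "body part",
    "estomago": "body part",
    "gripe": "diagnosis",
    "paracetamol": "treatment",
}


def normalise(text):
    """Lowercase and strip accents so that 'estómago' matches 'estomago'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def find_terms(message, cutoff=0.85):
    """Return (phrase, matched term, category) tuples found in a chat message.

    Word n-grams are compared only with lexicon terms of the same word count,
    longest first, so multi-word terms win over their individual words.
    """
    tokens = re.findall(r"\w+", normalise(message))

    # Group lexicon terms by their number of words.
    by_len = {}
    for term in LEXICON:
        by_len.setdefault(len(term.split()), []).append(term)

    hits = []
    i = 0
    while i < len(tokens):
        for n in sorted(by_len, reverse=True):
            if i + n > len(tokens):
                continue
            phrase = " ".join(tokens[i:i + n])
            close = difflib.get_close_matches(phrase, by_len[n], n=1, cutoff=cutoff)
            if close:
                hits.append((phrase, close[0], LEXICON[close[0]]))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # no term starts at this token
    return hits


# The misspelling "paracetmol" is still mapped to "paracetamol".
for hit in find_terms("Tengo dolor de cabeza y fiebre, tomé paracetmol"):
    print(hit)
```

In a sketch like this, the category attached to each hit is what could drive the colour of the overlay, and the list of hits could serve as the raw material for a short per-chat summary. The cutoff trades tolerance for typos against false matches and would need tuning on real chats.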

Client feedback

“Different from other data scientist teams who saw the task as too difficult, Stefan and Cristina, from the very first meeting were proposing ways to make it work. They did vast research to find the best Spanish medical databases, were able to combine a few to build, in less than a week, a robust database to which we could match terms. Also, they proposed a few models (fuzzy matching, word embeddings). They quickly turned to writing all the code that prepared the raw chats to be run through the model, built the model and in only two weeks delivered a high quality model (vs testing with a handful of chats). Finally, they provided ideas on how we could work on subsequent phases to make the model better (and once we have more data). I was very impressed with their practicality to deliver a version 1, but also with their vast knowledge on data science to continue building and improving.”