Text Mining

Natural Language Processing (NLP) and Text Mining (also known as Text Analytics) are Artificial Intelligence (AI) technologies that empower users to rapidly transform the key content in text documents into quantitative, actionable insights.

NLP-based text mining can also feed machine learning projects, supporting further advances in drug discovery and clinical care.

What is Text Mining or Text Analytics?

Text Analytics, also known as text mining, is the process of examining large collections of written resources to generate new information and to transform the unstructured text into structured data for use in further analysis. Text mining identifies facts, relationships and assertions that would otherwise remain buried in the mass of textual big data. These facts are extracted and turned into structured data for analysis, visualization (e.g., via HTML tables, mind maps, charts), integration with structured data in databases or warehouses, and further refinement using machine learning (ML) systems.
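As a rough illustration of turning unstructured text into structured data, the sketch below pulls drug and dose mentions out of a sentence and returns them as rows. The function name `extract_doses`, the regular expression and the sample sentence are all assumptions made for this example. Production text mining software relies on far richer linguistic analysis than a single pattern.

```python
import re

# Hypothetical pattern: a word, then a number, then a unit (e.g. "metformin 500 mg").
# Illustration only; it will also match non-drug words that precede a dose.
DOSE_PATTERN = re.compile(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*(mg|g|mcg|ml)\b")

def extract_doses(text):
    """Return one structured row per drug/dose mention found in free text."""
    rows = []
    for match in DOSE_PATTERN.finditer(text):
        drug, amount, unit = match.groups()
        rows.append({"drug": drug.lower(), "dose": float(amount), "unit": unit})
    return rows

rows = extract_doses("Started metformin 500 mg daily and aspirin 81 mg.")
for row in rows:
    print(row)
```

Each row is a plain dictionary, so the results could be loaded into a table, a database or a machine learning pipeline without further parsing.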

Traditional keyword search retrieves all the documents that contain the keywords you’ve specified.

That’s great as far as it goes, but you still have to read all those documents to find out whether they actually contain any information that’s relevant to your search.
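To make that limitation concrete, here is a minimal sketch of keyword search over a handful of invented sentences. The function `keyword_search` and the sample documents are assumptions for illustration, not part of any particular product.

```python
def keyword_search(documents, keyword):
    """Return every document containing the keyword (case-insensitive substring match)."""
    needle = keyword.lower()
    return [doc for doc in documents if needle in doc.lower()]

# Invented sample documents for illustration.
docs = [
    "Patient reports smoking one pack per day.",
    "Patient denies smoking.",
    "Leaflet on smoking cessation handed to visitor.",
    "Patient has asthma.",
]

hits = keyword_search(docs, "smoking")
for doc in hits:
    print(doc)
```

The search returns three of the four documents, including the one in which the patient denies smoking. Someone still has to read each hit to work out what it says about the patient.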

Text mining software is very different, because it reads and analyzes the documents on your behalf.

It uses Natural Language Processing (NLP) algorithms to interpret meaning, which allows it to recognize similar concepts even when they have been expressed in very different ways or with different spellings.

A search using text mining will identify facts, relationships and assertions that would otherwise remain buried in a mass of free text or unstructured data.

So what is Natural Language Processing (NLP)?

Most advanced text mining, or text analytics, software uses sophisticated Natural Language Processing (NLP) algorithms. NLP allows the software to recognize similar concepts, even if they’ve been expressed in very different ways. For example, the same word may be spelt differently (hemophilia/haemophilia, tumor/tumour), the same word may be realized differently in different contexts (tumor/tumors, suffers/suffered), the same concept may be expressed by different words entirely (Tylenol/Acetaminophen, heart attack/myocardial infarction), and the same fact may be stated using different grammatical constructions (“he has not smoked for 5 years”/ “he stopped smoking 5 years ago”).
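The spelling, inflection and synonym cases can be sketched with a simple lookup that maps each surface form to one preferred concept label. The `CONCEPT_MAP` table and `find_concepts` function below are hypothetical and hand-built for this example. Real NLP software typically draws on large terminologies and linguistic rules rather than a short dictionary.

```python
import re

# Hypothetical, hand-built lookup: surface form -> preferred concept label.
CONCEPT_MAP = {
    "hemophilia": "hemophilia",
    "haemophilia": "hemophilia",
    "tumor": "tumor",
    "tumour": "tumor",
    "tumors": "tumor",
    "tumours": "tumor",
    "tylenol": "acetaminophen",
    "acetaminophen": "acetaminophen",
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

# Try longer forms first so multi-word phrases are matched as a whole.
_FORMS = sorted(CONCEPT_MAP, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(form) for form in _FORMS) + r")\b",
    re.IGNORECASE,
)

def find_concepts(text):
    """Return the normalized concepts mentioned in text, in order of appearance."""
    return [CONCEPT_MAP[m.group(1).lower()] for m in _PATTERN.finditer(text)]

print(find_concepts("Patient with haemophilia took Tylenol after a heart attack."))
print(find_concepts("History of myocardial infarction; two tumours removed."))
```

Because every variant resolves to the same label, a query for "myocardial infarction" would also find documents that only say "heart attack".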

NLP and Machine Learning – complementary siblings in AI

NLP and machine learning are both branches of artificial intelligence. You can use machine learning techniques to help in natural language processing tasks. You can also use natural language processing to enhance machine learning, primarily by using it to extract a much greater evidence base of structured data for the machine learning algorithms to learn from.

Machine learning generally requires well-curated input data to generate good results, and such data is typically not available from sources such as electronic health records (EHRs), where most of the data is unstructured text. However, NLP applied to EHRs, clinical trial records, or full-text literature (and more) can extract the clean, structured data needed to drive advanced predictive machine learning models.
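As a hedged sketch of that extraction step, the code below converts two differently worded smoking statements into the same structured feature row, the kind of input a machine learning model could learn from. The patterns, the function name `smoking_features` and the output field names are assumptions for illustration. A real clinical NLP system would need to handle far more phrasings, as well as negation and context.

```python
import re

# Hypothetical patterns covering two phrasings of "quit N years ago".
_QUIT_PATTERNS = [
    re.compile(r"\bnot smoked for (\d+) years?\b", re.IGNORECASE),
    re.compile(r"\bstopped smoking (\d+) years? ago\b", re.IGNORECASE),
]

def smoking_features(note):
    """Map a free-text note to a structured smoking-status record."""
    for pattern in _QUIT_PATTERNS:
        match = pattern.search(note)
        if match:
            return {
                "smoking_status": "former",
                "years_since_quit": int(match.group(1)),
            }
    # Nothing recognized: record that explicitly rather than guessing.
    return {"smoking_status": "unknown", "years_since_quit": None}

print(smoking_features("He has not smoked for 5 years."))
print(smoking_features("He stopped smoking 5 years ago."))
```

Both notes yield the same record, so a downstream model sees one consistent feature instead of two unrelated strings.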