R Language

Natural language processing

Introduction

Natural language processing (NLP) is the field of computer science concerned with retrieving information from textual input generated by human beings.

Create a term frequency matrix

The simplest approach to the problem (and the most commonly used so far) is to split sentences into tokens. While words carry abstract and subjective meanings for the people who use and receive them, tokens have an objective interpretation: an ordered sequence of characters (or bytes). Once the sentences are split, the order of the tokens is disregarded. This approach to the problem is known as the bag-of-words model.
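To make the idea concrete, here is a minimal base-R sketch (the sentence is purely illustrative): once a sentence is split into tokens, counting them discards the original word order.

# Split a sentence into tokens on whitespace
tokens <- strsplit("the dog chased the cat", "\\s+")[[1]]

# The bag of words: token counts, order disregarded
table(tokens)

tokens
   cat chased    dog    the 
     1      1      1      2 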

A term frequency matrix assigns a weight to each token in each document. In this first example, we construct a term frequency matrix from a corpus (a collection of documents) with the R package tm.

require(tm)

# Four toy documents and the corpus built from them
doc1 <- "drugs hospitals doctors"
doc2 <- "smog pollution environment"
doc3 <- "doctors hospitals healthcare"
doc4 <- "pollution environment water"
corpus <- c(doc1, doc2, doc3, doc4)
tm_corpus <- Corpus(VectorSource(corpus))

In this example, we created a corpus of class Corpus, defined by the package tm, using the two functions Corpus and VectorSource (the latter returns a VectorSource object from a character vector). The object tm_corpus is a list of our documents with additional (and optional) metadata describing each document.

str(tm_corpus)
List of 4
 $ 1:List of 2
  ..$ content: chr "drugs hospitals doctors"
  ..$ meta   :List of 7
  .. ..$ author       : chr(0) 
  .. ..$ datetimestamp: POSIXlt[1:1], format: "2017-06-03 00:31:34"
  .. ..$ description  : chr(0) 
  .. ..$ heading      : chr(0) 
  .. ..$ id           : chr "1"
  .. ..$ language     : chr "en"
  .. ..$ origin       : chr(0) 
  .. ..- attr(*, "class")= chr "TextDocumentMeta"
  ..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
[truncated]

Once we have a Corpus, we can preprocess the tokens it contains to improve the quality of the final output (the term frequency matrix). To do this we use the tm function tm_map which, similarly to the apply family of functions, transforms the documents in the corpus by applying a function to each of them.

# tolower is a base R function, not a tm transformation, so it must be
# wrapped in content_transformer() to preserve the corpus structure
tm_corpus <- tm_map(tm_corpus, content_transformer(tolower))
tm_corpus <- tm_map(tm_corpus, removeWords, stopwords("english"))
tm_corpus <- tm_map(tm_corpus, removeNumbers)
# stemDocument requires the SnowballC package
tm_corpus <- tm_map(tm_corpus, stemDocument, language = "english")
tm_corpus <- tm_map(tm_corpus, stripWhitespace)
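To check the effect of these transformations, we can look at the processed content of a document. Given the stemming visible in the term frequency matrix below, the first document should now read roughly as follows (the exact output can vary slightly across tm versions):

as.character(tm_corpus[[1]])
[1] "drug hospit doctor"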

Following these transformations, we finally create the term frequency matrix with

tdm <- TermDocumentMatrix(tm_corpus)

which gives a

<<TermDocumentMatrix (terms: 8, documents: 4)>>
Non-/sparse entries: 12/20
Sparsity           : 62%
Maximal term length: 9
Weighting          : term frequency (tf)

that we can view by converting it to a regular matrix (note that 20 of the 8 × 4 = 32 cells are zero, which accounts for the reported 62% sparsity)

as.matrix(tdm)

           Docs
Terms       1 2 3 4
  doctor    1 0 1 0
  drug      1 0 0 0
  environ   0 1 0 1
  healthcar 0 0 1 0
  hospit    1 0 1 0
  pollut    0 1 0 1
  smog      0 1 0 0
  water     0 0 0 1

Each row shows the frequency of a token (which, as you may have noticed, has been stemmed, e.g. environment to environ) in each document (4 documents, 4 columns).
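Since as.matrix(tdm) is an ordinary numeric matrix, corpus-wide counts follow directly from it; tm also provides findFreqTerms to list the terms whose total frequency reaches a given threshold. A short sketch based on the matrix above:

# Total frequency of each stemmed token across the whole corpus
rowSums(as.matrix(tdm))

   doctor      drug   environ healthcar    hospit    pollut      smog     water 
        2         1         2         1         2         2         1         1 

# Terms occurring at least twice in the corpus
findFreqTerms(tdm, lowfreq = 2)
[1] "doctor"  "environ" "hospit"  "pollut"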

In the previous lines, we weighted each token/document pair with its absolute frequency (i.e. the number of times the token appears in the document).
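Absolute frequency is only the default weighting. TermDocumentMatrix accepts a weighting function through its control argument, so, as a sketch, a tf-idf weighted matrix can be built from the same corpus with weightTfIdf (tm also ships weightBin and weightSMART):

# Weight each token/document pair by tf-idf instead of raw counts
tdm_tfidf <- TermDocumentMatrix(tm_corpus,
                                control = list(weighting = weightTfIdf))
as.matrix(tdm_tfidf)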


This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0. This website is not affiliated with Stack Overflow.