2019/2020

## Natural Language Processing (NLP)

Natural Language Processing is the technology used to help computers understand humans' natural language. Usually shortened to NLP, it is a branch of artificial intelligence that deals with the interaction between computers and humans through natural language.

The ultimate objective of NLP is to read, decipher, understand, and make sense of human language in a way that is valuable.

A typical interaction between humans and machines using Natural Language Processing could go as follows:

1. A human talks to the machine
2. The machine captures the audio
3. Audio to text conversion takes place
4. Processing of the text’s data
5. Data to audio conversion takes place
6. The machine responds to the human by playing the audio file
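
The six steps above can be sketched as a toy program. This is only an illustration of the pipeline's shape, not a real system: the audio capture, speech-to-text, and text-to-speech steps are stubbed out with placeholder functions, and all names here are invented for the example.

```python
# Toy sketch of the human-machine NLP loop; audio steps are stubbed out.

def speech_to_text(audio: bytes) -> str:
    # Placeholder: a real system would run a speech-recognition model here.
    return audio.decode("utf-8")

def process_text(text: str) -> str:
    # Placeholder NLP step: a real system would parse the intent, query
    # data sources, and generate a response.
    if "weather" in text.lower():
        return "It looks sunny today."
    return "Sorry, I did not understand."

def text_to_speech(text: str) -> bytes:
    # Placeholder: a real system would synthesize an audio waveform here.
    return text.encode("utf-8")

# 1-2. The human talks; the machine captures the "audio".
audio_in = "What's the weather like?".encode("utf-8")
# 3. Audio-to-text conversion takes place.
text_in = speech_to_text(audio_in)
# 4. Processing of the text's data.
reply = process_text(text_in)
# 5-6. Data-to-audio conversion; the machine "plays" the audio back.
audio_out = text_to_speech(reply)
print(audio_out.decode("utf-8"))  # It looks sunny today.
```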

Think of Google Translate, spellcheckers, or personal assistant applications such as OK Google, Siri, Cortana, and Alexa.

## Why is NLP difficult?

It’s the nature of human language that makes NLP difficult. The rules that govern how information is conveyed in natural languages are not easy for computers to grasp.

Some of these rules can be high-level and abstract; for example, when someone uses a sarcastic remark to convey information.

On the other hand, some of these rules can be low-level; for example, using the character “s” to signify the plurality of items.

The ambiguity and imprecision of natural languages are what make NLP difficult for machines to implement.

## What are the techniques used in NLP?

Syntactic analysis and semantic analysis are the main techniques used to convert the unstructured language data into a form that computers can understand.

### Syntactic analysis

• Apply grammatical rules

### Semantic analysis

• Understand the meaning (much harder)

## Regex

A regular expression (regex or regexp for short) is a special text string for describing a search pattern.

Typically, these patterns are used for four main tasks:

• Find text within a larger body of text
• Validate that a string conforms to a desired format
• Replace or insert text
• Split strings
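
Each of the four tasks maps to a function in Python's built-in `re` module. A quick sketch (the example strings are made up for illustration):

```python
import re

# 1. Find text within a larger body of text
m = re.search(r'\d+', 'Order 66 confirmed')
print(m.group())  # '66'

# 2. Validate that a string conforms to a desired format
print(bool(re.fullmatch(r'\d{4}-\d{2}-\d{2}', '2020-01-31')))  # True

# 3. Replace or insert text
print(re.sub(r'\s+', ' ', 'too   many    spaces'))  # 'too many spaces'

# 4. Split strings
print(re.split(r'[,;]\s*', 'a, b; c'))  # ['a', 'b', 'c']
```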

## Example

Extract hashtags from the following tweet:

“It’s our job to #GoThere & tell the most difficult stories. Join us! For more breaking news updates follow @CNNBRK & Download our app. Email [email protected] to get involved in the new year.”

Hashtags are identified by the “#” symbol followed by one or more alphanumeric characters.

```python
import re
re.findall(r'#\w+', text)
## ['#GoThere']
```

Extract callouts: strings identified by the “@” symbol followed by one or more alphanumeric characters.

```python
re.findall(r'@\w+', text)
## ['@CNNBRK', '@cnn']
```

Oops… part of the email address was also extracted. We need to check for word boundaries.

```python
re.findall(r'\B@\w+', text)
## ['@CNNBRK']
```

Online tool: https://regex101.com
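
To see why a word-boundary check solves the email problem: `\b` matches between a word character and a non-word character, and `\B` matches everywhere `\b` does not. An `@` preceded by a letter (as inside an email address) sits on a word boundary, while an `@` preceded by a space does not. A small sketch with a made-up string:

```python
import re

s = 'follow @handle or email someone@example.com'

# '@\w+' matches both the callout and the domain part of the email.
print(re.findall(r'@\w+', s))    # ['@handle', '@example']

# '\B@' requires the @ NOT to sit on a word boundary, i.e. not to be
# preceded by a word character; this excludes the email address.
print(re.findall(r'\B@\w+', s))  # ['@handle']
```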

## Meta-characters: Character matches

| Metacharacter | Description |
| --- | --- |
| `.` | wildcard, matches a single character |
| `^` | start of a string |
| `$` | end of a string |
| `[]` | matches one of the set of characters within `[]` |
| `[a-z]` | matches one of the range of characters a, b, …, z |
| `[^abc]` | matches a character that is not a, b, or c |
| `a\|b` | matches either a or b, where a and b are strings |
| `()` | scoping for operators |
| `\` | escape character for special characters (`\t`, `\n`, `\b`) |

## Meta-characters: Character symbols

| Metacharacter | Description |
| --- | --- |
| `\b` | matches word boundary |
| `\B` | matches where `\b` does not match |
| `\d` | any digit, equivalent to `[0-9]` |
| `\D` | any non-digit, equivalent to `[^0-9]` |
| `\s` | any whitespace, equivalent to `[ \t\n\r\f\v]` |
| `\S` | any non-whitespace, equivalent to `[^ \t\n\r\f\v]` |
| `\w` | alphanumeric character, equivalent to `[a-zA-Z0-9_]` |
| `\W` | non-alphanumeric, equivalent to `[^a-zA-Z0-9_]` |

## Meta-characters: Quantifiers

| Metacharacter | Description |
| --- | --- |
| `*` | matches zero or more occurrences |
| `+` | matches one or more occurrences |
| `?` | matches zero or one occurrence |
| `{n}` | exactly n repetitions |
| `{n,}` | at least n repetitions |
| `{,n}` | at most n repetitions |
| `{m,n}` | at least m and at most n repetitions |

## NLP Tasks

### Tokenization

## Tokenization

Tokenization is the act of breaking up a sequence of strings into units such as words, keywords, phrases, symbols, and other elements called tokens. One can think of tokens as parts of a whole: a word is a token in a sentence, and a sentence is a token in a paragraph. Tokenization is non-trivial.

## How would you split this sentence into words?

“Children shouldn’t drink a sugary drink before bed.”

```python
text.split(' ')  # Naive
## ['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

import nltk  # NLTK has an in-built tokenizer!
nltk.word_tokenize(text)
## ['Children', 'should', "n't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']
```

## How would you split sentences from a long text string?

“This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!”

```python
# NLTK has an in-built sentence splitter too!
nltk.sent_tokenize(text)
## ['This is the first sentence.', 'A gallon of milk in the U.S. costs $2.99.',
##  'Is this the third sentence?', 'Yes, it is!']
```
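
The regex machinery from earlier can serve as a simple word tokenizer of its own: a pattern for "a run of word characters, optionally with an internal apostrophe, or a single punctuation mark" already gets close to NLTK's result, although not identical (NLTK further splits the contraction into `should` + `n't`). A rough sketch using only `re`:

```python
import re

text = "Children shouldn't drink a sugary drink before bed."

# \w+(?:'\w+)? keeps contractions together; [^\w\s] grabs punctuation.
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
print(tokens)
## ['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']
```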
## PRP$: pronoun, possessive
## her his mine my our ours their thy your

## Tagging Methods

• default tagger: assigns the same tag to each token
• regular expression tagger: assigns tags to tokens on the basis of matching patterns
• supervised learning
  • unigram tagger: for each token, assign the tag that is most likely for that particular token
  • n-gram taggers: generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens
• machine learning

These methods can be combined using a technique known as backoff. Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger).

## NLP Tasks

### Chunking

## Chunking

Chunking is the process of extracting phrases from unstructured text.

• Chunking works on top of POS tagging: it uses POS tags as input and provides chunks as output.
• Similar to POS tags, there is a standard set of chunk tags, such as Noun Phrase (NP), Verb Phrase (VP), etc.
• There are libraries that give phrases out of the box, such as spaCy. NLTK provides a mechanism based on regular expressions to generate chunks.

## Chunking with NLTK

In order to create NP chunks, we define the chunk grammar using POS tags. We will define this using a regular expression.
```python
grammar = (r'''
NP: {<DT>?<JJ>*<NN>}
V: {<VB[\w]?>}
''')
```

The rules state that whenever the chunker finds:

• an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN), the chunk NP should be formed
• a verb (VB, VBD, VBG, VBN, VBP, VBZ), the chunk V should be formed

```python
text = "This is a simple example of chunking a sentence"
tagged = nltk.pos_tag(nltk.word_tokenize(text))
tree = nltk.RegexpParser(grammar).parse(tagged)
for subtree in tree.subtrees():
    print(subtree)
## (S
##   This/DT
##   (V is/VBZ)
##   (NP a/DT simple/JJ example/NN)
##   of/IN
##   (V chunking/VBG)
##   (NP a/DT sentence/NN))
## (V is/VBZ)
## (NP a/DT simple/JJ example/NN)
## (V chunking/VBG)
## (NP a/DT sentence/NN)
```

## NLP Tasks

### Named Entity Recognition (NER)

## Named Entity

Named Entities are definite chunks that refer to specific types of real-world objects, such as organizations, persons, dates, and so on.

## Named Entity Recognition (NER)

The goal of Named Entity Recognition (NER) is to identify all textual mentions of Named Entities. This can be broken down into two sub-tasks:

• identifying the boundaries of the Named Entity
• identifying the type of the Named Entity

The task is well-suited to the type of classifier-based approach that we saw for POS tagging and noun phrase chunking. In particular, we can:

• extract chunks corresponding to noun phrases
• build a tagger that labels each chunk with the appropriate type based on the training data (unigram/n-gram tagger, …)

## Named Entity Recognition with NLTK

NLTK provides a classifier that has already been trained to recognize Named Entities.

Example: “European authorities fined Google a record $5.1 billion on Wednesday for abusing its power…”

```python
nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
## Tree('S', [Tree('GPE', [('European', 'JJ')]), ('authorities', 'NNS'),
##   ('fined', 'VBD'), Tree('PERSON', [('Google', 'NNP')]), ('a', 'DT'),
##   ('record', 'NN'), ('$', '$'), ('5.1', 'CD'), ('billion', 'CD'),
##   ('on', 'IN'), ('Wednesday', 'NNP'), ('for', 'IN'), ('abusing', 'VBG'),
##   ('its', 'PRP$'), ('power', 'NN'), ('...', ':')])
```

## Named Entity Recognition with SpaCy

SpaCy features an extremely fast statistical entity recognition system that assigns labels to contiguous spans of tokens.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
## European NORP
## Google ORG
## $5.1 billion MONEY
## Wednesday DATE
```
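
The two NER sub-tasks — finding entity boundaries, then assigning a type — can be illustrated with a deliberately tiny rule-based tagger: chunk runs of capitalized tokens as candidate entities, then label each chunk via lookup in a small gazetteer. This is a toy sketch of the decomposition, not how NLTK's or spaCy's trained models actually work; the gazetteer entries and function name are invented for the example.

```python
# Toy NER: 1) boundary detection = runs of capitalized tokens,
#          2) type labeling = lookup in a tiny hand-made gazetteer.
GAZETTEER = {'Google': 'ORG', 'Wednesday': 'DATE', 'European': 'NORP'}

def toy_ner(tokens):
    entities, i = [], 0
    while i < len(tokens):
        if tokens[i][0].isupper():
            # Sub-task 1: extend the chunk while tokens stay capitalized.
            j = i
            while j < len(tokens) and tokens[j][0].isupper():
                j += 1
            chunk = ' '.join(tokens[i:j])
            # Sub-task 2: assign a type (fall back to a generic label).
            entities.append((chunk, GAZETTEER.get(chunk, 'MISC')))
            i = j
        else:
            i += 1
    return entities

tokens = 'European authorities fined Google on Wednesday'.split()
print(toy_ner(tokens))
## [('European', 'NORP'), ('Google', 'ORG'), ('Wednesday', 'DATE')]
```

Real systems replace both rules with statistical models, but the split into boundary detection and type classification is the same.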

## Take Home Concepts

• Regexes are useful text strings for matching patterns
• Case-folding, stop-word removal, and punctuation removal are usually, but not always, a good idea
• Tokenization, sentence splitting, stemming and lemmatization are non-trivial tasks
• Part of Speech Tagging helps the machine understand how a word is used in a sentence
• Chunking is the process of extracting phrases from unstructured text
• Named Entity Recognition allows us to identify real-world objects in a text