In this article, I will introduce some basic functions being uesd in Natural Language Processing using python package nltk.

To use the package nltk, you should download the library and corpus you need beforehands.

import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True

Word Segmentation

Word segmentation is the problem of dividing a string of written language into its component words.

Sentence Tokenization

sentence = "I do not like green eggs and ham, and I do not like them too!"
tokens = nltk.word_tokenize(sentence)
tokens
['I',
 'do',
 'not',
 'like',
 'green',
 'eggs',
 'and',
 'ham',
 ',',
 'and',
 'I',
 'do',
 'not',
 'like',
 'them',
 'too',
 '!']

N-Gram Model

An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of a $(n−1)$–order Markov model. N-gram models are now widely used in probability, communication theory, computational linguistics, and so on.

from nltk.util import ngrams
print("1-grams\n--------\n{}\n".format(list(ngrams(tokens, 1))))
print("2-grams\n--------\n{}\n".format(list(ngrams(tokens, 2))))
print("3-grams\n--------\n{}\n".format(list(ngrams(tokens, 3))))
1-grams
--------
[('I',), ('do',), ('not',), ('like',), ('green',), ('eggs',), ('and',), ('ham',), (',',), ('and',), ('I',), ('do',), ('not',), ('like',), ('them',), ('too',), ('!',)]

2-grams
--------
[('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'), ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', ','), (',', 'and'), ('and', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'them'), ('them', 'too'), ('too', '!')]

3-grams
--------
[('I', 'do', 'not'), ('do', 'not', 'like'), ('not', 'like', 'green'), ('like', 'green', 'eggs'), ('green', 'eggs', 'and'), ('eggs', 'and', 'ham'), ('and', 'ham', ','), ('ham', ',', 'and'), (',', 'and', 'I'), ('and', 'I', 'do'), ('I', 'do', 'not'), ('do', 'not', 'like'), ('not', 'like', 'them'), ('like', 'them', 'too'), ('them', 'too', '!')]

Next, with n-grams, we can count their frequencies for further probabilistic models generation.

#Create your bigrams
bgs = nltk.bigrams(tokens)

#compute frequency distribution for all the bigrams in the text
fdist = nltk.FreqDist(bgs)
for k,v in fdist.items():
    print(k,v)
('I', 'do') 2
('do', 'not') 2
('not', 'like') 2
('like', 'green') 1
('green', 'eggs') 1
('eggs', 'and') 1
('and', 'ham') 1
('ham', ',') 1
(',', 'and') 1
('and', 'I') 1
('like', 'them') 1
('them', 'too') 1
('too', '!') 1

Part-of-Speech Tagging

In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.

tokens = nltk.word_tokenize("On Thursday morning Arthur Lee didn't feel very good.")
tagged = nltk.pos_tag(tokens)
tagged
[('On', 'IN'),
 ('Thursday', 'NNP'),
 ('morning', 'NN'),
 ('Arthur', 'NNP'),
 ('Lee', 'NNP'),
 ('did', 'VBD'),
 ("n't", 'RB'),
 ('feel', 'VB'),
 ('very', 'RB'),
 ('good', 'JJ'),
 ('.', '.')]

Named Entity Extraction

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Except for NLTK, Stanford Parser is also popular.

entities = nltk.chunk.ne_chunk(tagged)
print(entities)
entities
(S
  On/IN
  Thursday/NNP
  morning/NN
  (PERSON Arthur/NNP Lee/NNP)
  did/VBD
  n't/RB
  feel/VB
  very/RB
  good/JJ
  ./.)

Sentiment Analysis

Sentiment analysis (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.

Train a sentiment classifier on our own

The following example is referenced from python tutorial page.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import names
 
def word_feats(words):
    return dict([(word, True) for word in words])
 
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]
 
positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]
 
train_set = negative_features + positive_features + neutral_features
 
classifier = NaiveBayesClassifier.train(train_set) 


neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
 
print("Sentence: '{}'\n--------------\n".format(sentence))
print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))
Sentence: 'awesome movie, i liked it'
--------------

Positive: 0.6
Negative: 0.2

Utilize Vader Sentiment Analyzer in NLTK

  • coumpound: 強度
  • neg: 負面成分
  • pos: 正面成分
  • neu: 中性成分
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize

sid = SentimentIntensityAnalyzer()

def sentiment_analysis(s):
    print("'{}'\n--------------\n{}\n".format(s, sid.polarity_scores(s)))
    
sentiment_analysis("VADER is smart, handsome, and funny.")
sentiment_analysis("VADER is smart, handsome, and funny!")
sentiment_analysis("VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!")
'VADER is smart, handsome, and funny.'
--------------
{'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}

'VADER is smart, handsome, and funny!'
--------------
{'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}

'VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!'
--------------
{'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}