Basic Functions in Natural Language Processing
In this article, I will introduce some basic functions used in Natural Language Processing with the Python package nltk.
To use nltk, you should download the libraries and corpora you need beforehand.
import nltk
nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True
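Instead of opening the interactive downloader, you can fetch only specific resources by name. A minimal sketch; the resource names below are my guess at what this article's examples need, and exact names can vary between NLTK versions:

```python
import nltk

# Fetch only the corpora/models the examples in this article rely on.
# (Names taken from the NLTK data index; adjust for your NLTK version.)
for resource in ["punkt",                        # tokenizer models
                 "averaged_perceptron_tagger",   # POS tagger
                 "maxent_ne_chunker", "words",   # named-entity chunker
                 "vader_lexicon"]:               # VADER sentiment lexicon
    nltk.download(resource, quiet=True)
```

Each call returns True on success, so you can also check the return value if a resource fails to install.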
Word Segmentation
Word segmentation is the problem of dividing a string of written language into its component words.
Word Tokenization
sentence = "I do not like green eggs and ham, and I do not like them too!"
tokens = nltk.word_tokenize(sentence)
tokens
['I',
'do',
'not',
'like',
'green',
'eggs',
'and',
'ham',
',',
'and',
'I',
'do',
'not',
'like',
'them',
'too',
'!']
N-Gram Model
An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an $(n-1)$-order Markov model. N-gram models are now widely used in probability, communication theory, computational linguistics, and so on.
from nltk.util import ngrams
print("1-grams\n--------\n{}\n".format(list(ngrams(tokens, 1))))
print("2-grams\n--------\n{}\n".format(list(ngrams(tokens, 2))))
print("3-grams\n--------\n{}\n".format(list(ngrams(tokens, 3))))
1-grams
--------
[('I',), ('do',), ('not',), ('like',), ('green',), ('eggs',), ('and',), ('ham',), (',',), ('and',), ('I',), ('do',), ('not',), ('like',), ('them',), ('too',), ('!',)]
2-grams
--------
[('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'green'), ('green', 'eggs'), ('eggs', 'and'), ('and', 'ham'), ('ham', ','), (',', 'and'), ('and', 'I'), ('I', 'do'), ('do', 'not'), ('not', 'like'), ('like', 'them'), ('them', 'too'), ('too', '!')]
3-grams
--------
[('I', 'do', 'not'), ('do', 'not', 'like'), ('not', 'like', 'green'), ('like', 'green', 'eggs'), ('green', 'eggs', 'and'), ('eggs', 'and', 'ham'), ('and', 'ham', ','), ('ham', ',', 'and'), (',', 'and', 'I'), ('and', 'I', 'do'), ('I', 'do', 'not'), ('do', 'not', 'like'), ('not', 'like', 'them'), ('like', 'them', 'too'), ('them', 'too', '!')]
Next, with n-grams, we can count their frequencies, which serve as the basis for building probabilistic language models.
# Create the bigrams
bgs = nltk.bigrams(tokens)
# Compute the frequency distribution of all bigrams in the text
fdist = nltk.FreqDist(bgs)
for k, v in fdist.items():
    print(k, v)
('I', 'do') 2
('do', 'not') 2
('not', 'like') 2
('like', 'green') 1
('green', 'eggs') 1
('eggs', 'and') 1
('and', 'ham') 1
('ham', ',') 1
(',', 'and') 1
('and', 'I') 1
('like', 'them') 1
('them', 'too') 1
('too', '!') 1
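These counts can be turned into a simple bigram language model by normalizing each bigram count by the count of its first word. A minimal sketch in plain Python (no NLTK required), using the same example tokens:

```python
from collections import Counter

tokens = ['I', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham', ',',
          'and', 'I', 'do', 'not', 'like', 'them', 'too', '!']

bigram_counts = Counter(zip(tokens, tokens[1:]))
# Count only the positions that start a bigram (i.e. all but the last token).
unigram_counts = Counter(tokens[:-1])

# Maximum-likelihood estimate: P(w2 | w1) = count(w1, w2) / count(w1)
def bigram_prob(w1, w2):
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

print(bigram_prob('I', 'do'))     # 2/2 = 1.0
print(bigram_prob('and', 'ham'))  # 1/2 = 0.5
```

In practice these raw estimates are smoothed (e.g. add-one smoothing) so that unseen bigrams do not get probability zero.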
Part-of-Speech Tagging
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context.
tokens = nltk.word_tokenize("On Thursday morning Arthur Lee didn't feel very good.")
tagged = nltk.pos_tag(tokens)
tagged
[('On', 'IN'),
('Thursday', 'NNP'),
('morning', 'NN'),
('Arthur', 'NNP'),
('Lee', 'NNP'),
('did', 'VBD'),
("n't", 'RB'),
('feel', 'VB'),
('very', 'RB'),
('good', 'JJ'),
('.', '.')]
Named Entity Extraction
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Besides NLTK, Stanford's NLP tools are also a popular choice for this task.
entities = nltk.chunk.ne_chunk(tagged)
print(entities)
(S
On/IN
Thursday/NNP
morning/NN
(PERSON Arthur/NNP Lee/NNP)
did/VBD
n't/RB
feel/VB
very/RB
good/JJ
./.)
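The result of ne_chunk is an nltk.Tree in which named entities appear as labeled subtrees. A sketch of pulling out (label, text) pairs; to keep it runnable without the chunker models, the tree here is built by hand to match the output above:

```python
from nltk.tree import Tree

# A hand-built stand-in for the ne_chunk output shown above.
entities = Tree('S', [('On', 'IN'), ('Thursday', 'NNP'), ('morning', 'NN'),
                      Tree('PERSON', [('Arthur', 'NNP'), ('Lee', 'NNP')]),
                      ('did', 'VBD'), ("n't", 'RB'), ('feel', 'VB'),
                      ('very', 'RB'), ('good', 'JJ'), ('.', '.')])

# Entity subtrees are Tree instances; plain tokens are (word, tag) tuples.
named = [(subtree.label(), " ".join(word for word, tag in subtree))
         for subtree in entities
         if isinstance(subtree, Tree)]
print(named)  # [('PERSON', 'Arthur Lee')]
```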
Sentiment Analysis
Sentiment analysis (sometimes known as opinion mining or emotion AI) refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
Training a sentiment classifier of our own
The following example is adapted from a Python tutorial page.
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]
positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]
train_set = negative_features + positive_features + neutral_features
classifier = NaiveBayesClassifier.train(train_set)
neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    # Note: word_feats expects a list of words; passing a single string
    # iterates over it, so each character becomes a feature.
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
print("Sentence: '{}'\n--------------\n".format(sentence))
print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))
Sentence: 'awesome movie, i liked it'
--------------
Positive: 0.6
Negative: 0.2
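Instead of voting word by word, the classifier can also score a whole sentence at once by passing all its words as a single feature dict. A self-contained sketch that repeats the tiny training set above:

```python
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return {word: True for word in words}

positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific',
                  'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors',
                 'did', 'know', 'words', 'not']

train_set = ([(word_feats([w]), 'pos') for w in positive_vocab] +
             [(word_feats([w]), 'neg') for w in negative_vocab] +
             [(word_feats([w]), 'neu') for w in neutral_vocab])
classifier = NaiveBayesClassifier.train(train_set)

# All words of the sentence become features of one classification instance.
words = "awesome movie , i liked it".split()
print(classifier.classify(word_feats(words)))
```

With such a tiny training set the prediction is driven almost entirely by the few vocabulary words that appear in the sentence.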
Utilize the VADER Sentiment Analyzer in NLTK
The polarity_scores method returns four components:
compound: overall intensity
neg: negative component
pos: positive component
neu: neutral component
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize
sid = SentimentIntensityAnalyzer()
def sentiment_analysis(s):
    print("'{}'\n--------------\n{}\n".format(s, sid.polarity_scores(s)))
sentiment_analysis("VADER is smart, handsome, and funny.")
sentiment_analysis("VADER is smart, handsome, and funny!")
sentiment_analysis("VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!")
'VADER is smart, handsome, and funny.'
--------------
{'neg': 0.0, 'neu': 0.254, 'pos': 0.746, 'compound': 0.8316}
'VADER is smart, handsome, and funny!'
--------------
{'neg': 0.0, 'neu': 0.248, 'pos': 0.752, 'compound': 0.8439}
'VADER is VERY SMART, really handsome, and INCREDIBLY FUNNY!!!'
--------------
{'neg': 0.0, 'neu': 0.294, 'pos': 0.706, 'compound': 0.9469}