Demo: keyword-based sentiment classifier

UMass CS 490A, 9/10/2020. An attempt to replicate the "Human 1, Human 2" keyword-baseline experiment from Pang et al. (2002).

The data is the "1.0" version from here, which I think is the one used in the original paper: http://www.cs.cornell.edu/people/pabo/movie-review-data/

Spreadsheet of keyword submissions: https://docs.google.com/spreadsheets/d/1sdYr-iHEe6ZQADWCK0XbRTIq5OlpIoRhXqTPnHnZSNQ/edit?usp=sharing
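
For reference, a sketch of how the data could be fetched and unpacked. The archive filename is my assumption about which file the v1.0 link on that page points to; the glob calls below expect the reviews under tokens/neg/ and tokens/pos/.

In [ ]:
import urllib.request, zipfile

# Assumed filename for the v1.0 archive (check the download page if it differs)
url = "http://www.cs.cornell.edu/people/pabo/movie-review-data/mix20_rand700_tokens_cleaned.zip"
urllib.request.urlretrieve(url, "tokens.zip")
with zipfile.ZipFile("tokens.zip") as z:
    z.extractall(".")   # should produce tokens/neg/ and tokens/pos/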

In [11]:
import glob
import numpy as np

def load_file(filename):
    # Read one review and split it on whitespace; ignore any undecodable bytes
    with open(filename, errors='ignore') as f:
        return f.read().split()
In [21]:
# Everything is just the test set, since we have no machine learning
true_labels = []
docs = []
for f in glob.glob("tokens/neg/*.txt"):
    docs.append(load_file(f))
    true_labels.append('neg')
for f in glob.glob("tokens/pos/*.txt"):
    docs.append(load_file(f))
    true_labels.append('pos')
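
A quick sanity check on the class balance (a sketch; with the v1.0 data, tokens/neg/ and tokens/pos/ should each hold 700 reviews, matching the 1400 total below).

In [ ]:
from collections import Counter
# How many loaded documents carry each gold label?
Counter(true_labels)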
In [56]:
# The keyword-based classifier
POS_KEYWORDS = set("""
Amazing
Awesome
Best
Cool
Dazzling
Fantastic
Good
Great
Inspirational
Intense
Phenomenal
Superb
Thunderstruck
Unbelievable
Worth
Wow
absorbing
accurate
amazing
art
artist
awesome
beautiful
best
blowing
breaking
brilliant
captivating
capturing
changing
chemistry
choreographed
cool
creative
deep
depth
emotional
energetic
engaging
enjoyable
enjoyed
enlightening
entertaining
enticing
epic
excellent
exciting
extraordinary
fabulous
fantastic
fascinating
feel
filmed
fresh
fun
funny
genius
good
great
gripping
ground
groundbreaking
happy
harmonious
healing
heart
hilarious
immersed
immersive
impactful
impressive
insightful
inspiring
intense
interesting
intriguing
justified
laugh
life
likable
like
lol
love
loved
lovely
magnificent
masterfully
memorizing
mind
moving
must
nevertheless
nice
nominated
nostalgia
oscar
outstanding
perfect
phenomenal
picking
pog
poggers
poignant
popcorn
powerful
professional
provoking
refreshing
revolutionary
rewatch
riveting
satisfying
saucy
saved
scintillating
strong
stunning
super
terrific
thought
thoughtful
thriller
thrilling
touching
transcendent
unseen
watch
watchable
well
wonderful
wow
wrenching""".split())



NEG_KEYWORDS = set("""
Awful
Bad
Confusing
Crap
Crappy
Horrible
Horrific
It
No
Suck
Terrible
Trash
Worst
accurate
amateur
annoying
anxious
appalling
asleep
atrocious
average
awful
bad
basic
big
boring
careless
cliche
cliched
cliches
contrived
cookie
cringe
cringeworthy
cutter
damned
desolate
disappointed
disappointing
disgusting
disorganized
distasteful
disturbing
drab
dragged
dreadful
dreary
dull
egregious
enraged
evil
fake
feeble
forgettable
garbage
good
gross
hate
hated
headache
horrendous
horrible
horrid
however
insulting
lackluster
lame
lazy
lifeless
long
lousy
mediocre
negative
no
nonsensical
not
nothing
offensive
old
passe
pathetic
plain
plot
poor
poorly
predictable
regretful
rough
sad
sadly
sh
shallow
shameful
shit
short
simple
skip
sleepy
slow
spooky
stupid
suck
tasteless
terrible
time
timepass
trash
unattractive
unbearable
underwritten
unfortunately
unfunny
uninspired
uninteresting
unoriginal
unpalatable
unrealistic
unstructured
untolerable
unwatchable
vapid
waste
weak
worst
worthy
""".split())
In [57]:
len(docs)
Out[57]:
1400
In [58]:
len(true_labels)
Out[58]:
1400
In [59]:
sum([True,True,False])
Out[59]:
2
In [60]:
def kw_classify(doc):
    # Count tokens that appear in each keyword set (booleans sum as 0/1)
    num_pos = sum([ (w in POS_KEYWORDS) for w in doc ])
    num_neg = sum([ (w in NEG_KEYWORDS) for w in doc ])
#    print(num_pos, num_neg)
    # Ties (including no hits on either side) default to "neg"
    return "pos" if num_pos > num_neg else "neg"
In [61]:
kw_classify(docs[100])
Out[61]:
'pos'
In [62]:
# Make predictions (classifications) into parallel list
preds = []
for doc in docs:
    preds.append( kw_classify(doc) )
In [63]:
from collections import Counter
In [64]:
Counter(preds)
Out[64]:
Counter({'pos': 594, 'neg': 806})
In [65]:
num_correct = sum([  (preds[i] == true_labels[i])
                    for i in range(len(preds))  ])
In [66]:
num_correct
Out[66]:
954
In [68]:
# Accuracy rate: evaluate predictions against the ground-truth labels
num_correct / len(preds)
Out[68]:
0.6814285714285714
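
About 68% accuracy. Since the predictions skew toward "neg" (806 vs. 594), it is worth breaking the result down by class; a sketch of per-class accuracy from the confusion counts:

In [ ]:
from collections import Counter
# Confusion counts: (gold label, predicted label) -> number of documents
confusion = Counter(zip(true_labels, preds))
for gold in ['neg', 'pos']:
    correct = confusion[(gold, gold)]
    total = sum(v for (g, p), v in confusion.items() if g == gold)
    print(gold, correct / total)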