from lecture, UMass CS 485, Spring 2024
import glob
(glob.glob("large_movie_review_dataset/test/pos/*.txt"))[:10]
['large_movie_review_dataset/test/pos/4715_9.txt', 'large_movie_review_dataset/test/pos/1930_9.txt', 'large_movie_review_dataset/test/pos/3205_9.txt', 'large_movie_review_dataset/test/pos/10186_10.txt', 'large_movie_review_dataset/test/pos/147_10.txt', 'large_movie_review_dataset/test/pos/7511_7.txt', 'large_movie_review_dataset/test/pos/616_10.txt', 'large_movie_review_dataset/test/pos/10460_10.txt', 'large_movie_review_dataset/test/pos/3240_9.txt', 'large_movie_review_dataset/test/pos/1975_9.txt']
import glob
pos_texts = [open(f).read() for f in glob.glob("large_movie_review_dataset/test/pos/*.txt")]
neg_texts = [open(f).read() for f in glob.glob("large_movie_review_dataset/test/neg/*.txt")]
labels_and_texts = []
for text in pos_texts:
    labels_and_texts.append(("pos", text))
for text in neg_texts:
    labels_and_texts.append(("neg", text))
len(labels_and_texts)
25000
labels_and_texts[13000]
('neg', 'Another pretentious film from Vicente Aranda. If "Juana la loca" shinned of the same, at least its quality was superior (mainly thanks to the great performance of Pilar López de Ayala), but "Carmen" is boring and full of topics (ardent brunette with a dagger in the stocking, poor man dragged to madness due to passion, Sierra Nevada gangs, "toreros",...)<br /><br />Obviously Paz Vega is a pretty woman, but about its talent there\'re more doubts, and Sbaraglia role is so stupid that results almost incredible. The script is weak and and Aranda\'s presumptuous character influences the entire film. With these ingredients the result could not be good.<br /><br />Not the worst film I\'ve seen, but a complete failure, in my opinion.')
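import re

def tokenize(text):
    """Return a list of lowercase tokens (note: differs from hw1)."""
    # Split on runs of non-word characters, so punctuation and whitespace are dropped.
    tokens = re.split(r'\W+', text.lower())
    tokens = [w for w in tokens if w]  # remove empty tokens
    return tokens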
for (label, text) in labels_and_texts[:3]:
    print("-------------------------------")
    print("LABEL:", label)
    print("TEXT:", text)
    toks = tokenize(text)
    print(toks)
------------------------------- LABEL: pos TEXT: Based on an actual story, John Boorman shows the struggle of an American doctor, whose husband and son were murdered and she was continually plagued with her loss. A holiday to Burma with her sister seemed like a good idea to get away from it all, but when her passport was stolen in Rangoon, she could not leave the country with her sister, and was forced to stay back until she could get I.D. papers from the American embassy. To fill in a day before she could fly out, she took a trip into the countryside with a tour guide. "I tried finding something in those stone statues, but nothing stirred in me. I was stone myself." <br /><br />Suddenly all hell broke loose and she was caught in a political revolt. Just when it looked like she had escaped and safely boarded a train, she saw her tour guide get beaten and shot. In a split second she decided to jump from the moving train and try to rescue him, with no thought of herself. Continually her life was in danger. <br /><br />Here is a woman who demonstrated spontaneous, selfless charity, risking her life to save another. Patricia Arquette is beautiful, and not just to look at; she has a beautiful heart. This is an unforgettable story. <br /><br />"We are taught that suffering is the one promise that life always keeps." 
['based', 'on', 'an', 'actual', 'story', 'john', 'boorman', 'shows', 'the', 'struggle', 'of', 'an', 'american', 'doctor', 'whose', 'husband', 'and', 'son', 'were', 'murdered', 'and', 'she', 'was', 'continually', 'plagued', 'with', 'her', 'loss', 'a', 'holiday', 'to', 'burma', 'with', 'her', 'sister', 'seemed', 'like', 'a', 'good', 'idea', 'to', 'get', 'away', 'from', 'it', 'all', 'but', 'when', 'her', 'passport', 'was', 'stolen', 'in', 'rangoon', 'she', 'could', 'not', 'leave', 'the', 'country', 'with', 'her', 'sister', 'and', 'was', 'forced', 'to', 'stay', 'back', 'until', 'she', 'could', 'get', 'i', 'd', 'papers', 'from', 'the', 'american', 'embassy', 'to', 'fill', 'in', 'a', 'day', 'before', 'she', 'could', 'fly', 'out', 'she', 'took', 'a', 'trip', 'into', 'the', 'countryside', 'with', 'a', 'tour', 'guide', 'i', 'tried', 'finding', 'something', 'in', 'those', 'stone', 'statues', 'but', 'nothing', 'stirred', 'in', 'me', 'i', 'was', 'stone', 'myself', 'br', 'br', 'suddenly', 'all', 'hell', 'broke', 'loose', 'and', 'she', 'was', 'caught', 'in', 'a', 'political', 'revolt', 'just', 'when', 'it', 'looked', 'like', 'she', 'had', 'escaped', 'and', 'safely', 'boarded', 'a', 'train', 'she', 'saw', 'her', 'tour', 'guide', 'get', 'beaten', 'and', 'shot', 'in', 'a', 'split', 'second', 'she', 'decided', 'to', 'jump', 'from', 'the', 'moving', 'train', 'and', 'try', 'to', 'rescue', 'him', 'with', 'no', 'thought', 'of', 'herself', 'continually', 'her', 'life', 'was', 'in', 'danger', 'br', 'br', 'here', 'is', 'a', 'woman', 'who', 'demonstrated', 'spontaneous', 'selfless', 'charity', 'risking', 'her', 'life', 'to', 'save', 'another', 'patricia', 'arquette', 'is', 'beautiful', 'and', 'not', 'just', 'to', 'look', 'at', 'she', 'has', 'a', 'beautiful', 'heart', 'this', 'is', 'an', 'unforgettable', 'story', 'br', 'br', 'we', 'are', 'taught', 'that', 'suffering', 'is', 'the', 'one', 'promise', 'that', 'life', 'always', 'keeps'] ------------------------------- LABEL: pos TEXT: This is a 
gem. As a Film Four production - the anticipated quality was indeed delivered. Shot with great style that reminded me some Errol Morris films, well arranged and simply gripping. It's long yet horrifying to the point it's excruciating. We know something bad happened (one can guess by the lack of participation of a person in the interviews) but we are compelled to see it, a bit like a car accident in slow motion. The story spans most conceivable aspects and unlike some documentaries did not try and refrain from showing the grimmer sides of the stories, as also dealing with the guilt of the people Don left behind him, wondering why they didn't stop him in time. It took me a few hours to get out of the melancholy that gripped me after seeing this very-well made documentary. ['this', 'is', 'a', 'gem', 'as', 'a', 'film', 'four', 'production', 'the', 'anticipated', 'quality', 'was', 'indeed', 'delivered', 'shot', 'with', 'great', 'style', 'that', 'reminded', 'me', 'some', 'errol', 'morris', 'films', 'well', 'arranged', 'and', 'simply', 'gripping', 'it', 's', 'long', 'yet', 'horrifying', 'to', 'the', 'point', 'it', 's', 'excruciating', 'we', 'know', 'something', 'bad', 'happened', 'one', 'can', 'guess', 'by', 'the', 'lack', 'of', 'participation', 'of', 'a', 'person', 'in', 'the', 'interviews', 'but', 'we', 'are', 'compelled', 'to', 'see', 'it', 'a', 'bit', 'like', 'a', 'car', 'accident', 'in', 'slow', 'motion', 'the', 'story', 'spans', 'most', 'conceivable', 'aspects', 'and', 'unlike', 'some', 'documentaries', 'did', 'not', 'try', 'and', 'refrain', 'from', 'showing', 'the', 'grimmer', 'sides', 'of', 'the', 'stories', 'as', 'also', 'dealing', 'with', 'the', 'guilt', 'of', 'the', 'people', 'don', 'left', 'behind', 'him', 'wondering', 'why', 'they', 'didn', 't', 'stop', 'him', 'in', 'time', 'it', 'took', 'me', 'a', 'few', 'hours', 'to', 'get', 'out', 'of', 'the', 'melancholy', 'that', 'gripped', 'me', 'after', 'seeing', 'this', 'very', 'well', 'made', 'documentary'] 
------------------------------- LABEL: pos TEXT: I really like this show. It has drama, romance, and comedy all rolled into one. I am 28 and I am a married mother, so I can identify both with Lorelei's and Rory's experiences in the show. I have been watching mostly the repeats on the Family Channel lately, so I am not up-to-date on what is going on now. I think females would like this show more than males, but I know some men out there would enjoy it! I really like that is an hour long and not a half hour, as th hour seems to fly by when I am watching it! Give it a chance if you have never seen the show! I think Lorelei and Luke are my favorite characters on the show though, mainly because of the way they are with one another. How could you not see something was there (or take that long to see it I guess I should say)? <br /><br />Happy viewing! ['i', 'really', 'like', 'this', 'show', 'it', 'has', 'drama', 'romance', 'and', 'comedy', 'all', 'rolled', 'into', 'one', 'i', 'am', '28', 'and', 'i', 'am', 'a', 'married', 'mother', 'so', 'i', 'can', 'identify', 'both', 'with', 'lorelei', 's', 'and', 'rory', 's', 'experiences', 'in', 'the', 'show', 'i', 'have', 'been', 'watching', 'mostly', 'the', 'repeats', 'on', 'the', 'family', 'channel', 'lately', 'so', 'i', 'am', 'not', 'up', 'to', 'date', 'on', 'what', 'is', 'going', 'on', 'now', 'i', 'think', 'females', 'would', 'like', 'this', 'show', 'more', 'than', 'males', 'but', 'i', 'know', 'some', 'men', 'out', 'there', 'would', 'enjoy', 'it', 'i', 'really', 'like', 'that', 'is', 'an', 'hour', 'long', 'and', 'not', 'a', 'half', 'hour', 'as', 'th', 'hour', 'seems', 'to', 'fly', 'by', 'when', 'i', 'am', 'watching', 'it', 'give', 'it', 'a', 'chance', 'if', 'you', 'have', 'never', 'seen', 'the', 'show', 'i', 'think', 'lorelei', 'and', 'luke', 'are', 'my', 'favorite', 'characters', 'on', 'the', 'show', 'though', 'mainly', 'because', 'of', 'the', 'way', 'they', 'are', 'with', 'one', 'another', 'how', 'could', 'you', 'not', 'see', 
'something', 'was', 'there', 'or', 'take', 'that', 'long', 'to', 'see', 'it', 'i', 'guess', 'i', 'should', 'say', 'br', 'br', 'happy', 'viewing']
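The token lists above show two side effects of splitting on `\W+`: the HTML line breaks `<br />` survive as `br` tokens, and contractions such as "it's" split into `it` and `s`. As a minimal sketch of one possible variant (not the lecture's tokenizer), the tags could be stripped before tokenizing. The name `tokenize_no_br` and the sample string are hypothetical, chosen only for illustration.

```python
import re

def tokenize(text):
    """Same splitting rule as the lecture's tokenizer."""
    tokens = re.split(r'\W+', text.lower())
    return [w for w in tokens if w]

def tokenize_no_br(text):
    """Hypothetical variant: replace <br /> tags with a space, then tokenize."""
    text = re.sub(r'<br\s*/?>', ' ', text)
    return tokenize(text)

sample = "It's long. <br /><br />Happy viewing!"
print(tokenize(sample))        # ['it', 's', 'long', 'br', 'br', 'happy', 'viewing']
print(tokenize_no_br(sample))  # ['it', 's', 'long', 'happy', 'viewing']
```

Neither keyword list contains `br`, so these tokens do not affect the keyword counts computed later; removing them mainly tidies the token lists.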
pos_keywords = """
Hardwork, determination, commitment
nice
"awesome
great
good
nice"
"good
amazing
great
awesome"
"good
great
brilliant
funny"
"good
awesome"
awesome, fun, happy, best
"great
fantastic
well
alright"
"great
good
awesome
brilliant
entertaining"
"amazing
stupendous
positive
incredible"
"awesome
cool
clean
pleasant"
happy, good, yay, puppy
"thrilling
funny
great
interesting"
daijobu fun baddy manic
"exciting
epic
influential
amazing"
"great
excellent
amazing
nice"
"happy
amazing
loving
caring"
"great
wonderful
cool
nice"
"exhilarating
thrilling
heart-racing
captivating"
"great
awesome
perfect
amazing"
"compelling
amazing
worth
awesome
novel
inviting"
"amazing
wonderful
interesting
perfect"
"amazing
beautiful
great
superb"
"super
nice
great
"
"great
awesome
fantastic
enjoy
like
appreciate
good
amazing
professional
creative"
"Great
Awesome
Good
Cool
Incredible
Wonderful
Superb
Fabulous
Whimsical
Lovely
Brilliant
Excellent
Extraordinary"
"good
cool
based
better
amazing"
"incredible
wow
amazing
good
gag
gagged
omg
cinematic
artistic
paced"
"excellent
thought provoking
thrilling
uplifting"
Positive, Congrats, Nice,
class, homework, accepted, internship
"great
excellent
amazing
superb
good
nice
fair
perfect"
"Lit
Good
Awesome
Excellent"
"adventurous
hilarious
memorable
fantastic"
Great
"""
pos_keywords = set(tokenize(pos_keywords))
pos_keywords
{'accepted', 'adventurous', 'alright', 'amazing', 'appreciate', 'artistic', 'awesome', 'baddy', 'based', 'beautiful', 'best', 'better', 'brilliant', 'captivating', 'caring', 'cinematic', 'class', 'clean', 'commitment', 'compelling', 'congrats', 'cool', 'creative', 'daijobu', 'determination', 'enjoy', 'entertaining', 'epic', 'excellent', 'exciting', 'exhilarating', 'extraordinary', 'fabulous', 'fair', 'fantastic', 'fun', 'funny', 'gag', 'gagged', 'good', 'great', 'happy', 'hardwork', 'heart', 'hilarious', 'homework', 'incredible', 'influential', 'interesting', 'internship', 'inviting', 'like', 'lit', 'lovely', 'loving', 'manic', 'memorable', 'nice', 'novel', 'omg', 'paced', 'perfect', 'pleasant', 'positive', 'professional', 'provoking', 'puppy', 'racing', 'stupendous', 'super', 'superb', 'thought', 'thrilling', 'uplifting', 'well', 'whimsical', 'wonderful', 'worth', 'wow', 'yay'}
neg_keywords = """
Too hard, quitting
stinky
"bad
aweful
terrible
unpleasant"
"sucks
worst
bad
horrible"
"bad
horrrible
terrible
exhausting"
bad
bad, horrible, awful, sad
"abysmal
bad
horrid
depressing"
"bad
terrible
annoying
boring"
"horrendous
terrible
horrible
dumb"
"poor
destructive
dirty
rude"
bad, poo, darn, terrible
"boring
waste
terrible
bad"
loss harm bad old skuffed FUBAR
"horrible
confusing
boring
mid"
"terrible
boring
horrible
awful"
"dreadful
genocide
terrified
complacent"
"awful
terrible
egregious
horrendous"
"trash
god-awful
terrible
snooze-worthy"
"horrible
bad
sad
awful"
"waste
terrible
disappointing
bad
boring"
"boring
sucks
disappointing
disappointed
horrible"
"disgusting
bad
sucks
awful"
"mean
hate
unfortunate
"
"disappointing
dislike
hate
trash
waste
underwhelming"
"Terrible
Horrible
Bad
Disgusting
Worst
Detestable
Loathsome
Gross
Awful
Untenable
Ugly
Ridiculous
Stupid
Dumb
Poor
Decrepit"
"sucks
bad
dogshit
failure"
"gross
bad
ugly
derogatory
poor"
"depressing
messy
unprepared
sad"
Negative, Rude, Death, Regret, Depression
offer, agency, transfer, competetive, rejected
"poor
bad
horrible
not great
not good
abysmal
trash
garbage
horrid
"
"Trash
Awful
Bad
"
"confusing
plot-lacking
garbage
disgusting"
Horrible
"""
neg_keywords = set(tokenize(neg_keywords))
neg_keywords.remove("too")
neg_keywords.remove("not")
neg_keywords
{'abysmal', 'agency', 'annoying', 'aweful', 'awful', 'bad', 'boring', 'competetive', 'complacent', 'confusing', 'darn', 'death', 'decrepit', 'depressing', 'depression', 'derogatory', 'destructive', 'detestable', 'dirty', 'disappointed', 'disappointing', 'disgusting', 'dislike', 'dogshit', 'dreadful', 'dumb', 'egregious', 'exhausting', 'failure', 'fubar', 'garbage', 'genocide', 'god', 'good', 'great', 'gross', 'hard', 'harm', 'hate', 'horrendous', 'horrible', 'horrid', 'horrrible', 'lacking', 'loathsome', 'loss', 'mean', 'messy', 'mid', 'negative', 'offer', 'old', 'plot', 'poo', 'poor', 'quitting', 'regret', 'rejected', 'ridiculous', 'rude', 'sad', 'skuffed', 'snooze', 'stinky', 'stupid', 'sucks', 'terrible', 'terrified', 'transfer', 'trash', 'ugly', 'underwhelming', 'unfortunate', 'unpleasant', 'unprepared', 'untenable', 'waste', 'worst', 'worthy'}
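Note that `good` and `great` appear in both printed sets: responses such as "not good" and "not great" are tokenized word by word, and only `not` is removed afterwards. A minimal sketch of how such overlap could be checked, using small toy keyword strings as stand-ins for the class responses (the toy strings are assumptions, not the real lists):

```python
import re

def tokenize(text):
    """Same splitting rule as the lecture's tokenizer."""
    tokens = re.split(r'\W+', text.lower())
    return [w for w in tokens if w]

# Toy stand-ins for the survey responses, not the full class lists.
pos_keywords = set(tokenize("good great nice"))
neg_keywords = set(tokenize("bad, not good, not great, god-awful"))
neg_keywords.remove("not")

# Words that ended up in both keyword sets.
overlap = pos_keywords & neg_keywords
print(sorted(overlap))  # ['good', 'great']
```

A word that is in both sets adds one to the positive count and one to the negative count each time it occurs, so it never changes which count is larger.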
num_correct = 0
for (label, text) in labels_and_texts:
    # print("-------------------------------")
    # print("LABEL:", label)
    # print("TEXT:", text)
    toks = tokenize(text)
    # Count tokens (with repeats) that appear in each keyword set.
    num_pos = len([w for w in toks if w in pos_keywords])
    # print([w for w in toks if w in pos_keywords])
    num_neg = len([w for w in toks if w in neg_keywords])
    # Ties (including 0 vs. 0) are predicted 'neg'.
    pred = 'pos' if num_pos > num_neg else 'neg'
    is_correct = (pred == label)
    # print(is_correct)
    num_correct += int(is_correct)
print(num_correct)
17012
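The decision rule in the loop above can also be wrapped in a function so it can be tried on single reviews. This is a sketch under assumptions: the function name `classify`, the tiny keyword sets, and the example sentences are all hypothetical and only illustrate the rule, including what happens on a tie.

```python
import re

def tokenize(text):
    """Same splitting rule as the lecture's tokenizer."""
    tokens = re.split(r'\W+', text.lower())
    return [w for w in tokens if w]

def classify(text, pos_keywords, neg_keywords):
    """Predict 'pos' only if positive keywords strictly outnumber negative ones."""
    toks = tokenize(text)
    num_pos = sum(1 for w in toks if w in pos_keywords)
    num_neg = sum(1 for w in toks if w in neg_keywords)
    return 'pos' if num_pos > num_neg else 'neg'

# Tiny hypothetical keyword sets, not the class lists.
pos_kw = {"great", "funny"}
neg_kw = {"boring", "awful"}

print(classify("A great, funny film.", pos_kw, neg_kw))      # pos
print(classify("Boring and awful.", pos_kw, neg_kw))         # neg
print(classify("I watched it on a plane.", pos_kw, neg_kw))  # neg (0 vs. 0 is a tie)
```

Because the comparison is strict, any review with no keyword hits at all is labeled `neg`.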
Accuracy: proportion of predictions that are correct
num_correct / len(labels_and_texts)
0.68048
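The definition of accuracy above can be written as a small helper. This is a sketch: the function name `accuracy` and the four-item toy labels are hypothetical, while the last line only redoes the arithmetic from the counts reported above.

```python
def accuracy(gold_labels, predicted_labels):
    """Proportion of predictions that match the gold labels."""
    assert len(gold_labels) == len(predicted_labels)
    num_correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return num_correct / len(gold_labels)

# Toy example: 3 of 4 predictions match.
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(accuracy(gold, pred))  # 0.75

# The counts reported above: 17012 correct out of 25000 reviews.
print(17012 / 25000)  # 0.68048
```

For comparison, assuming the test split is balanced between the two labels, always guessing the same label would score 0.5.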