# Sentiment Analysis with Logistic Regression


A quick recap of Logistic Regression: https://ml-cheatsheet.readthedocs.io/en/latest/logistic_regression.html#binary-logistic-regression
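
As a refresher, binary logistic regression passes a weighted sum of the features through the sigmoid function and reads the result as the probability of the positive class. The sketch below is a minimal pure-Python illustration of that idea. The names `sigmoid` and `predict_proba` and the toy weights are assumptions made for this example, not part of the notebook.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for a single example."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Toy example: two word-count features with hand-picked (hypothetical) weights
# for the words "great" and "bad".
toy_weights = [1.5, -2.0]
toy_bias = 0.0
p_positive = predict_proba([2, 0], toy_weights, toy_bias)  # "great" appears twice
p_negative = predict_proba([0, 2], toy_weights, toy_bias)  # "bad" appears twice
print(p_positive > 0.5, p_negative < 0.5)
```

A review is classified as positive when the predicted probability exceeds 0.5, which is the same as the weighted sum being greater than zero.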


If you don't have PyTorch installed on your local machine, please follow these instructions:

https://pytorch.org/get-started/locally/


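You can use the commands "nvidia-smi" and "nvcc --version" to check the CUDA setup on your machine: "nvidia-smi" reports the highest CUDA version the driver supports, while "nvcc --version" reports the installed CUDA toolkit version.
Colab already ships with the pre-compiled dependencies, so it is very easy to use GPUs there.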
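
The same check can be made from Python once PyTorch is installed. The sketch below is one possible way to do it. The helper name `describe_device` is hypothetical, and it falls back to the CPU when PyTorch or a GPU is not available.

```python
def describe_device():
    """Return 'cuda' if PyTorch can see a GPU, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; fall back to the CPU for this sketch.
        return "cpu"
    if torch.cuda.is_available():
        # torch.version.cuda is the CUDA version PyTorch was built against.
        print("CUDA build:", torch.version.cuda)
        print("GPU:", torch.cuda.get_device_name(0))
        return "cuda"
    return "cpu"

device_name = describe_device()
print("Using device:", device_name)
```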

In [None]:
# Load and install required libraries
!pip install tqdm # We will need tqdm to generate pretty progress bars
!pip install nltk
import os
import string
import nltk
nltk.download('stopwords')
import csv

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import torch
import torch.nn as nn
import torch.optim as optim
from torchtext import data
from tqdm import tqdm, trange
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Download data
# Download files from https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format/download
# Upload to your Colab if needed, and unzip it with the following command
!mkdir sentiment-analysis-data
!unzip movie_reviews.zip -d sentiment-analysis-data

# Uncomment the following lines to list files in data dir
# input_dir_path = '/content/sentiment-analysis-data'
#for dirname, _, filenames in os.walk(input_dir_path):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))


mkdir: cannot create directory ‘sentiment-analysis-data’: File exists
Archive:  movie_reviews.zip
   creating: sentiment-analysis-data/movie_reviews/
   creating: sentiment-analysis-data/movie_reviews/neg/
  inflating: sentiment-analysis-data/movie_reviews/neg/cv000_29416.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv001_19502.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv002_17424.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv003_12683.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv004_12641.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv005_29357.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv006_17022.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv007_4992.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv008_29326.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv009_29417.txt  
  inflating: sentiment-analysis-data/movie_reviews/neg/cv010_2906

In [None]:
!head -10 /content/sentiment-analysis-data/movie_reviews/neg/cv001_19502.txt

the happy bastard's quick movie review 
damn that y2k bug . 
it's got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . 
little do they know the power within . . . 
going for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . 
we don't know why the crew was really out in the middle of nowhere , we don't know the origin of what took over the ship ( just that a big pink flashy thing hit the mir ) , and , of course , we don't know why donald sutherland is stumbling around drunkenly throughout . 
here , it's just " hey , let's chase these people around with some robots " . 
the acting is below average , even from the likes of curtis . 
you're more likely to get a kick out of her work in hallow

In [None]:
def load_examples_from_dir(directory_path, label):
  l = list()
  for filename in os.listdir(directory_path):
    with open(os.path.join(directory_path, filename), 'r') as f:
      doc = ' '.join(f.readlines())
      csv_tuple = (doc, label)
      l.append(csv_tuple)
  return l

def write_list_of_docs_to_csv(file_path, doc_list):
  # newline='' lets the csv module control line endings (avoids blank rows on Windows)
  with open(file_path, 'w', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerows(doc_list)

# Convert the raw review files into Train/Valid/Test CSV files
def convert_to_csv_files(data_dir):
  positive_dir_path = os.path.join(data_dir, 'pos')
  negative_dir_path = os.path.join(data_dir, 'neg')
  # Because the dataset is small we can load it to memory
  pos_docs = load_examples_from_dir(positive_dir_path, 1)
  neg_docs = load_examples_from_dir(negative_dir_path, 0)
  all_docs = pos_docs + neg_docs

  # Shuffle the docs to distribute them to Train, Val and Test splits
  np.random.shuffle(all_docs)

  # We will split the data according to a 0.8 / 0.1 / 0.1 for Train/Val/Test
  number_of_docs = len(all_docs)
  print("The dataset has {} docs.".format(number_of_docs))
  train_idx_limit = int(0.8 * number_of_docs)
  print("The train data has {} docs.".format(train_idx_limit))
  train_docs = all_docs[:train_idx_limit]

  # Split the remaining docs evenly into the Val and Test sets
  remaining_docs = all_docs[train_idx_limit:]
  val_idx_limit = int(0.5 * len(remaining_docs)) # Note: 0.5 * 0.2 = 0.1
  val_docs = remaining_docs[:val_idx_limit]
  eval_docs = remaining_docs[val_idx_limit:]
  print("The dev/test data has {} docs, each.".format(val_idx_limit))

  # Write files
  write_list_of_docs_to_csv(os.path.join(data_dir, 'Train.csv'), train_docs)
  write_list_of_docs_to_csv(os.path.join(data_dir, 'Valid.csv'), val_docs)
  write_list_of_docs_to_csv(os.path.join(data_dir, 'Test.csv'), eval_docs)

convert_to_csv_files('/content/sentiment-analysis-data/movie_reviews')

The dataset has 2000 docs.
The train data has 1600 docs.
The dev/test data has 200 docs, each.
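
The shuffle in `convert_to_csv_files` is unseeded, so each run produces a different split. If reproducibility matters, one option is to seed the shuffle. The sketch below applies the same 0.8 / 0.1 / 0.1 arithmetic to a toy list. `split_docs` and its `seed` parameter are hypothetical names introduced here, and it uses Python's `random` module rather than `np.random`.

```python
import random

def split_docs(docs, seed=0):
    """Shuffle a copy of docs with a fixed seed and split it 80/10/10."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    train_end = int(0.8 * len(shuffled))
    remaining = shuffled[train_end:]
    val_end = int(0.5 * len(remaining))  # 0.5 * 0.2 = 0.1 of the full dataset
    return shuffled[:train_end], remaining[:val_end], remaining[val_end:]

# Toy stand-in for the (doc, label) tuples built from the review files.
toy_docs = [("review {}".format(i), i % 2) for i in range(2000)]
train, val, test = split_docs(toy_docs)
print(len(train), len(val), len(test))  # 1600 200 200
```

Calling `split_docs` again with the same seed returns the same three lists, which makes later experiments comparable.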


# Feature Extraction
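
The feature extractor in this section turns each review into a fixed-size Bag-of-Words count vector with one position per vocabulary index. The sketch below shows the idea in isolation on a tiny hypothetical vocabulary. `toy_itos`, `toy_stoi`, and `count_features` are names made up for illustration, and the sketch does not use torchtext.

```python
# Tiny hypothetical vocabulary; index 0 is reserved for unknown words.
toy_itos = ['<unk>', '<pad>', 'movie', 'good', 'bad']
toy_stoi = {word: idx for idx, word in enumerate(toy_itos)}

def count_features(tokens, stoi, vocab_size):
    """Return a fixed-size vector of per-index token counts."""
    feature_vec = [0] * vocab_size
    for token in tokens:
        idx = stoi.get(token, 0)  # out-of-vocabulary words map to <unk>
        feature_vec[idx] += 1
    return feature_vec

review = ['good', 'movie', 'really', 'good']
features = count_features(review, toy_stoi, len(toy_itos))
print(features)  # [1, 0, 1, 2, 0]
```

The word "really" is not in the toy vocabulary, so it is counted at index 0 together with every other unknown word.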

In [None]:
# Feature extractor function - It counts how many times each vocab word occurs in a review
def extract_text_features(batch, vocab):
  # Generate fixed tensor based on vocab words
  # Index-to-string mapping - This is a list
  itos = text_data_field.vocab.itos
  # This is a sample itos list for a vocabulary of size 100. Notice the first two special tokens (<unk> and <pad>).
  # ['<unk>', '<pad>', 'br', 'movie', 'film', 'one', 'like', 'good', 'even', 'would', 'time', 'really', 'see', 'story', 'much', 'well', 'get', 'also', 'great', 'bad', 'people', 'first', 'dont', 'movies', 'make', 'films', 'made', 'could', 'way', 'characters', 'think', 'watch', 'many', 'seen', 'two', 'character', 'never', 'love', 'acting', 'little', 'plot', 'best', 'know', 'show', 'life', 'ever', 'better', 'still', 'say', 'scene', 'end', 'man', 'scenes', 'something', 'go', 'im', 'back', 'real', 'watching', 'doesnt', 'thing', 'didnt', 'actors', 'years', 'actually', 'though', 'another', 'makes', 'look', 'funny', 'nothing', 'find', 'going', 'work', 'lot', 'new', 'every', 'old', 'part', 'us', 'director', 'cant', 'thats', 'quite', 'things', 'pretty', 'want', 'cast', 'seems', 'around', 'young', 'got', 'take', 'fact', 'however', 'enough', 'world', 'horror', 'give', 'big', 'thought', 'ive']

  # This is a sample stoi (text_data_field.vocab.stoi) dictionary that is also provided by the vocab object.
  # Notice the unique mapping from a word to an idx. Also notice that after idx 101 all other words map to idx 0 (<unk>)
  # {'<unk>': 0, '<pad>': 1, 'br': 2, 'movie': 3, 'film': 4, 'one': 5, 'like': 6, 'good': 7, 'even': 8, 'would': 9, 'time': 10, 'really': 11, 'see': 12, 'story': 13, 'much': 14, 'well': 15, 'get': 16, 'also': 17, 'great': 18, 'bad': 19, 'people': 20, 'first': 21, 'dont': 22, 'movies': 23, 'make': 24, 'films': 25, 'made': 26, 'could': 27, 'way': 28, 'characters': 29, 'think': 30, 'watch': 31, 'many': 32, 'seen': 33, 'two': 34, 'character': 35, 'never': 36, 'love': 37, 'acting': 38, 'little': 39, 'plot': 40, 'best': 41, 'know': 42, 'show': 43, 'life': 44, 'ever': 45, 'better': 46, 'still': 47, 'say': 48, 'scene': 49, 'end': 50, 'man': 51, 'scenes': 52, 'something': 53, 'go': 54, 'im': 55, 'back': 56, 'real': 57, 'watching': 58, 'doesnt': 59, 'thing': 60, 'didnt': 61, 'actors': 62, 'years': 63, 'actually': 64, 'though': 65, 'another': 66, 'makes': 67, 'look': 68, 'funny': 69, 'nothing': 70, 'find': 71, 'going': 72, 'work': 73, 'lot': 74, 'new': 75, 'every': 76, 'old': 77, 'part': 78, 'us': 79, 'director': 80, 'cant': 81, 'thats': 82, 'quite': 83, 'things': 84, 'pretty': 85, 'want': 86, 'cast': 87, 'seems': 88, 'around': 89, 'young': 90, 'got': 91, 'take': 92, 'fact': 93, 'however': 94, 'enough': 95, 'world': 96, 'horror': 97, 'give': 98, 'big': 99, 'thought': 100, 'ive': 101, 'monika': 0, 'mitchells': 0, 'showbiz': 0, 'satire': 0, 'laughs': 0, 'premeditated': 0, 'violence': 0, 'wouldnt': 0, 'bloodsoaked': 0, 'insult': 0, 'injury': 0, 'max': 0, 'matteojohn': 0, 'cassiniis': 0, 'actor': 0, 'quirky': 0, 'adaptable': 0, 'presence': 0, 'screen': 0, 'terrible': 0, 'track': 0, 'record': 0, 'chosen': 0, 'parts': 0, 'goes': 0, 'always': 0, 'producers': 0, 'nephew': 0, 'seemingly': 0, 'trivial': 0, 'reason': 0, 'awarded': 0, 'role': 0, 'seeks': 0, 'rid': 0, 'competitionmax': 0, 'becomes': 0, 'obsessed': 0, 'thoughts': 0, 'rewarding': 0, 'career': 0, 'swing': 0, 'push': 0, 'shot': 0, 'away': 0, 'members': 0, 'rene': 0, 'rivera': 0, 'molly': 0, 'parker': 0, 'jennifer': 0, 
'beals': 0, 'frank': 0, 'cassini': 0, 'cameos': 0, 'eric': 0, 'roberts': 0, 'sandra': 0, 'oh': 0, 'businessor': 0, 'anne': 0, 'wentworth': 0, 'looks': 0, 'younger': 0, 'louisa': 0, 'cheerful': 0, 'sister': 0, 'mary': 0, 'supposed': 0, 'average': 0, 'actress': 0, 'complains': 0, 'complainer': 0, 'lady': 0, 'russell': 0, 'crazy': 0, 'read': 0, 'novel': 0, 'annes': 0, 'older': 0, 'mature': 0, 'friend': 0, 'maybe': 0, 'mother': 0, '1820': 0, '50': 0, '70': 0, 'fit': 0, 'come': 0, 'darn': 0, 'happy': 0, 'beginning': 0, 'smiles': 0, 'says': 0, 'worst': 0, 'passed': 0, 'yeah': 0, 'right': 0, 'ok': 0, 'anyone': 0, '1995': 0, 'roger': 0, 'michell': 0, 'version': 0, 'compare': 0, 'youll': 0, 'mean': 0, 'alright': 0, 'gotcha': 0, 'evidence': 0, 'yet': 0, 'storylines': 0, 'couple': 0, 'handfulls': 0, 'stephanie': 0, 'zimbalist': 0, 'professional': 0, 'tv': 0, 'outstanding': 0, 'definitely': 0, 'former': 0, 'fed': 0, 'loan': 0, 'profiler': 0, 'along': 0, 'motley': 0, 'bunch': 0, 'special': 0, 'investigation': 0, 'unit': 0, 'cops': 0, 'assigned': 0, 'wasnt': 0, 'goofy': 0, 'youd': 0, 'roll': 0, 'eyes': 0, 'despise': 0, 'takes': 0, 'awhile': 0, 'murderer': 0, 'found': 0, 'surprised': 0, 'los': 0, 'angeles': 0, 'locations': 0, 'possibly': 0, 'today': 0, 'theyd': 0, 'use': 0, 'toronto': 0, 'vancouver': 0, 'downtown': 0, 'la': 0, 'semidespise': 0, 'liked': 0, 'id': 0, 'grade': 0, 'b': 0, 'watched': 0, 'rupert': 0, 'grint': 0, 'knew': 0, 'ron': 0, 'harry': 0, 'potter': 0, 'appreciated': 0, 'loved': 0, 'entire': 0, 'wonderful': 0, 'job': 0, 'hilarious': 0, 'fine': 0, 'without': 0, 'sex': 0, 'worked': 0, 'somehow': 0, 'wait': 0, 'ruperts': 0, 'future': 0, 'julie': 0, 'walters': 0, 'amazing': 0, 'small': 0, 'expect': 0, 'either': 0, 'dame': 0, 'evie': 0, 'swallows': 0, 'key': 0, 'absolutely': 0, 'overall': 0, 'bought': 0, 'bloodsuckers': 0, 'ebay': 0, 'ago': 0, 'deemed': 0, 'dumb': 0, 'review': 0, 'excessive': 0, 'amount': 0, 'watery': 0, 'blood': 0, 'plain': 0, 'obsolete': 0, 
'mention': 0, 'whiparound': 0, 'wind': 0, 'sounds': 0, 'friends': 0, 'super': 0, 'low': 0, 'budget': 0, 'effects': 0, 'exceeded': 0, 'crap': 0, 'festbr': 0, 'mistakes': 0, 'count': 0, 'believe': 0, 'theatre': 0, 'teacher': 0, 'ha': 0, 'final': 0, 'verdict': 0, 'bother': 0, 'flick': 0, '3': 0, 'stars': 0, 'possible': 0, '73': 0, 'start': 0, 'dark': 0, 'comedy': 0, 'fannntastic': 0, 'unfortunately': 0, 'else': 0, 'free': 0, 'buck': 0, 'spare': 0, 'mind': 0, 'price': 0, 'paid': 0, 'walmart': 0, 'meant': 0, 'thriller': 0, 'thrill': 0, 'kirklands': 0, 'lousy': 0, 'rendition': 0, 'wilkes': 0, 'misery': 0, 'sans': 0, 'snowy': 0, 'woodland': 0, 'area': 0, 'laugh': 0, 'rainy': 0, 'friday': 0, 'night': 0, 'highly': 0, 'recommend': 0, 'least': 0, 'half': 0, 'decent': 0, 'botherbr': 0, 'enjoy': 0, 'crappy': 0, 'worse': 0, 'cases': 0, 'wow': 0, 'person': 0, 'stink': 0, 'boy': 0, 'played': 0, 'vincent': 0, 'berry': 0, 'gave': 0, 'sweet': 0, 'entertaining': 0, 'tale': 0, '17': 0, '12': 0, 'year': 0, 'controlled': 0, 'overbearing': 0, 'religious': 0, 'withdrawn': 0, 'father': 0, 'finds': 0, 'retired': 0, 'eccentric': 0, 'tragic': 0, 'acted': 0, 'especially': 0, 'plays': 0, 'teenage': 0, 'showing': 0, 'talent': 0, 'last': 0, 'longer': 0, 'series': 0, 'laura': 0, 'linney': 0, 'ruthlessly': 0, 'strict': 0, 'hint': 0, 'redemption': 0, 'theres': 0, 'room': 0, 'british': 0, 'style': 0, 'likes': 0, 'keeping': 0, 'mum': 0, 'calendar': 0, 'girls': 0, 'early': 0, 'boys': 0, 'may': 0, 'satisfied': 0, 'others': 0, 'different': 0, 'stooge': 0, 'slapstick': 0, 'jokes': 0, 'rather': 0, 'poke': 0, 'eye': 0, 'slap': 0, 'larry': 0, 'grabs': 0, 'stethoscope': 0, 'moe': 0, 'sings': 0, 'gives': 0, 'smack': 0, 'crack': 0, 'ten': 0, 'minutes': 0, 'hit': 0, 'stooges': 0, 'simply': 0, 'trying': 0, 'glorify': 0, 'filmbr': 0, 'straight': 0, 'trash': 0, 'might': 0, 'visually': 0, 'stunning': 0, 'piece': 0, 'cinematography': 0, 'shortly': 0, 'thereafter': 0, 'large': 0, 'sack': 0, 'burlap': 0, 'fail': 0, 
'fighting': 0, 'barely': 0, 'martial': 0, 'teetering': 0, 'edge': 0, 'par': 0, 'music': 0, 'worth': 0, 'describing': 0, 'created': 0, 'excuse': 0, 'decisions': 0, 'deal': 0, 'situations': 0, 'weak': 0, 'frustrate': 0, 'came': 0, 'act': 0, 'bit': 0, 'fan': 0, 'service': 0, 'using': 0, 'yumiko': 0, 'shaku': 0, 'infuriated': 0, 'simple': 0, 'shepherd': 0, 'gay': 0, 'men': 0, 'murdered': 0, 'clearly': 0, 'wicked': 0, 'happened': 0, 'poor': 0, 'truly': 0, 'horrible': 0, 'tragedy': 0, 'hollywood': 0, 'four': 0, 'white': 0, 'kids': 0, 'executed': 0, 'forced': 0, 'perform': 0, 'host': 0, 'acts': 0, 'killers': 0, 'evil': 0, 'black': 0, 'wichita': 0, 'celebrities': 0, 'mug': 0, 'camera': 0, 'serves': 0, 'political': 0, 'purpose': 0, 'laramie': 0, 'portrayed': 0, 'light': 0, 'pseudodocumentary': 0, 'course': 0, 'hardly': 0, 'surprising': 0, 'backward': 0, 'hicks': 0, 'must': 0, 'educated': 0, 'omniscient': 0, 'enlightened': 0, 'californians': 0, 'treat': 0, 'hal': 0, 'roach': 0, 'short': 0, 'tough': 0, 'winter': 0, 'ninetyninth': 0, 'ganglittle': 0, 'rascals': 0, 'eleventh': 0, 'talkie': 0, 'bascally': 0, 'showcase': 0, 'comic': 0, 'stepin': 0, 'fetchit': 0, 'gets': 0, 'billing': 0, 'shack': 0, 'gang': 0, 'hangs': 0, 'farina': 0, 'retrieves': 0, 'letter': 0, 'mail': 0, 'told': 0, 'since': 0, 'day': 0, 'school': 0, 'happens': 0, 'sweetheart': 0, 'tennesse': 0, 'ears': 0, 'stuffed': 0, 'cotton': 0, 'hot': 0, 'hear': 0, 'weezer': 0, 'relays': 0, 'instructions': 0, 'ann': 0, 'making': 0, 'taffy': 0, 'radio': 0, 'keeps': 0, 'running': 0, 'forth': 0, 'kitchen': 0, 'misses': 0, 'announcers': 0, 'segue': 0, 'rice': 0, 'pudding': 0, 'spanish': 0, 'tamale': 0, 'confusing': 0, 'additions': 0, 'tabasco': 0, 'lux': 0, 'concoction': 0, 'completed': 0, 'jackie': 0, 'rest': 0, 'help': 0, 'awful': 0, 'tasting': 0, 'sticky': 0, 'substance': 0, 'everyone': 0, 'stuck': 0, 'walls': 0, 'result': 0, 'try': 0, 'clean': 0, 'mess': 0, 'works': 0, 'basement': 0, 'various': 0, 'pipes': 0, 'electrical': 
0, 'outlets': 0, 'mixes': 0, 'variable': 0, 'appliances': 0, 'functions': 0, 'telephone': 0, 'vacuums': 0, 'vacuum': 0, 'rings': 0, 'refrigerator': 0, 'described': 0, 'portends': 0, 'meandering': 0, 'nature': 0, 'served': 0, 'pilot': 0, 'potential': 0, 'took': 0, 'place': 0, 'fetchits': 0, 'characterization': 0, 'lazy': 0, 'negro': 0, 'amusing': 0, 'doses': 0, 'considered': 0, 'offensive': 0, 'sequence': 0, 'results': 0, 'blah': 0, 'summary': 0, 'curio': 0, 'seeing': 0, 'stepins': 0, 'name': 0, 'lincoln': 0, 'theodore': 0, 'perry': 0, 'gee': 0, 'cannot': 0, 'understand': 0, 'scary': 0, 'grudge': 0, 'trick': 0, 'admit': 0, 'brought': 0, 'stylized': 0, 'repeats': 0, 'consequence': 0, 'startled': 0, 'times': 0, 'quarter': 0, 'drill': 0, 'practically': 0, 'fell': 0, 'asleep': 0, 'grew': 0, 'predictable': 0, 'minute': 0, 'conclude': 0, 'genre': 0, 'begin': 0, 'socalled': 0, 'predecessor': 0, 'ring': 0, 'scarier': 0, 'buying': 0, 'ticket': 0, 'waste': 0, 'money': 0, 'sherman': 0, 'hemsley': 0, 'jeffersons': 0, 'family': 0, 'amen': 0, 'earth': 0, 'script': 0, 'luis': 0, 'avalos': 0, 'bankruptcy': 0, 'pointless': 0, 'ghost': 0, 'stick': 0, 'ghostbusters': 0, 'episode': 0, 'confused': 0, 'beginningbr': 0, 'clark': 0, 'blow': 0, 'head': 0, 'wakes': 0, 'floor': 0, 'fairview': 0, 'mental': 0, 'institution': 0, 'fun': 0, 'believing': 0, 'hes': 0, 'superhero': 0, 'delusional': 0, 'unusual': 0, 'marthas': 0, 'married': 0, 'lionel': 0, 'lex': 0, 'bound': 0, 'wheelchair': 0, 'limbs': 0, 'cut': 0, 'accident': 0, 'bridge': 0, 'lana': 0, 'devoted': 0, 'familiarity': 0, 'someone': 0, 'whos': 0, 'chloe': 0, 'patient': 0, 'known': 0, 'smallville': 0, 'doctor': 0, 'escapee': 0, 'phantom': 0, 'zonebr': 0, 'reminds': 0, 'buffy': 0, 'called': 0, 'normal': 0, 'begins': 0, 'vividdaydreams': 0, 'asylum': 0, 'tried': 0, 'convince': 0, 'figment': 0, 'imagination': 0, 'parents': 0, 'lived': 0, 'exist': 0, 'angel': 0, 'boyfriend': 0, 'dawn': 0, 'demons': 0, 'vampires': 0, 'episodes': 0, 'sad': 0, 
'doctors': 0, 'arent': 0, 'telling': 0, 'tells': 0, 'fiction': 0, 'brings': 0, 'reality': 0, 'nice': 0, 'sense': 0, 'shows': 0, 'escape': 0, 'overcome': 0, 'challenges': 0, 'teta': 0, 'luna': 0, 'symbolic': 0, 'spain': 0, 'everything': 0, 'occurs': 0, 'meaning': 0, 'totally': 0, 'usual': 0, 'accessbr': 0, 'advice': 0, 'sample': 0, 'please': 0, 'saw': 0, 'chorus': 0, 'line': 0, 'onstage': 0, 'exposure': 0, 'musicals': 0, 'broadway': 0, '4': 0, 'auditioned': 0, 'touring': 0, 'company': 0, 'memorized': 0, 'original': 0, 'production': 0, '1985': 0, 'dreadful': 0, 'levels': 0, 'theatrical': 0, 'let': 0, 'assure': 0, 'irl': 0, 'audition': 0, 'play': 0, 'producer': 0, 'choreographer': 0, 'ask': 0, 'personal': 0, 'questions': 0, 'wanted': 0, 'become': 0, 'performer': 0, 'whether': 0, 'musical': 0, 'rarely': 0, 'five': 0, 'youre': 0, 'auditioning': 0, 'dancer': 0, 'shown': 0, '64bar': 0, 'dance': 0, 'combination': 0, 'decide': 0, 'immediately': 0, 'michael': 0, 'bennetts': 0, 'concept': 0, 'flesh': 0, 'lives': 0, 'dancers': 0, 'introduce': 0, 'uninitiated': 0, 'passion': 0, 'performing': 0, 'sacrifice': 0, 'richard': 0, 'attenborough': 0, 'focus': 0, 'beefing': 0, 'cassiezach': 0, 'relationship': 0, 'casting': 0, 'douglas': 0, 'zach': 0, 'zachhe': 0, 'voice': 0, 'theater': 0, 'cassie': 0, 'touched': 0, 'upon': 0, 'cab': 0, 'traffic': 0, 'upstairs': 0, 'talking': 0, 'playwas': 0, 'added': 0, 'major': 0, 'numbers': 0, 'rethought': 0, 'opening': 0, 'number': 0, 'hope': 0, 'jazz': 0, 'ballet': 0, 'eliminated': 0, 'jam': 0, 'three': 0, 'hundred': 0, 'together': 0, 'closeup': 0, 'disguise': 0, 'audrey': 0, 'landers': 0, 'goodbye': 0, '13': 0, 'hello': 0, 'brilliant': 0, 'vocal': 0, 'exploration': 0, 'childhoods': 0, 'jaundiced': 0, 'memories': 0, 'reworked': 0, 'surprise': 0, 'mainly': 0, 'vehicle': 0, 'late': 0, 'gregg': 0, 'burge': 0, 'richie': 0, 'famous': 0, 'song': 0, 'touching': 0, 'allegory': 0, 'sung': 0, 'standard': 0, 'performed': 0, 'tiredly': 0, 'miscast': 0, 
'allyson': 0, 'reed': 0, 'jeffrey': 0, 'hornadays': 0, 'choreography': 0, 'dull': 0, 'unimaginative': 0, 'hold': 0, 'candle': 0, 'bennett': 0, 'staging': 0, 'previously': 0, 'mentioned': 0, 'michelle': 0, 'johnston': 0, 'bebe': 0, 'janet': 0, 'jones': 0, 'judy': 0, 'given': 0, 'opportunity': 0, 'walk': 0, 'chew': 0, 'gum': 0, '10': 0, 'finale': 0, 'dazzling': 0, 'almost': 0, 'hours': 0, 'devotee': 0, 'musicalbe': 0, 'afraidbe': 0, 'afraid': 0, 'lonely': 0, 'among': 0, 'season': 0, 'storyline': 0, 'although': 0, 'somewhat': 0, 'creates': 0, 'suspense': 0, 'supported': 0, 'creepy': 0, 'synthesizerdriven': 0, 'soundtrack': 0, 'typically': 0, 'alien': 0, 'body': 0, 'invasion': 0, 'scenario': 0, 'finally': 0, 'turning': 0, 'death': 0, 'assistant': 0, 'chief': 0, 'engineer': 0, 'singh': 0, 'delegate': 0, 'species': 0, 'deliver': 0, 'frame': 0, 'makeup': 0, 'far': 0, 'adding': 0, 'humor': 0, 'patrick': 0, 'stewart': 0, 'obviously': 0, 'enjoys': 0, 'stepping': 0, 'picard': 0, 'exploring': 0, 'terrain': 0, 'data': 0, 'posing': 0, 'sherlock': 0, 'holmes': 0, 'classic': 0, 'convincing': 0, 'cliff': 0, 'bole': 0, 'compensate': 0, 'trois': 0, 'lack': 0, 'ability': 0, 'improving': 0, 'beautiful': 0, 'neck': 0, 'dress': 0, 'improves': 0, 'appearance': 0, 'picards': 0, 'lightningscene': 0, 'slight': 0, 'air': 0, 'emperor': 0, 'star': 0, 'wars': 0, 'return': 0, 'jedi': 0, 'impression': 0, 'smilebr': 0, 'playing': 0, 'lighting': 0, 'corridors': 0, 'simulating': 0, 'aboard': 0, 'moving': 0, 'pulling': 0, 'entering': 0, 'transporter': 0, 'beam': 0, 'cloud': 0, 'clever': 0, 'cutting': 0, 'creating': 0, 'continuing': 0, 'dialog': 0, 'hypnosis': 0, 'report': 0, 'rounds': 0, 'crafted': 0, 'tng': 0, 'wesley': 0, 'crusher': 0, 'seemed': 0, 'tolerable': 0, 'ending': 0, 'p': 0, 'fetched': 0, 'easy': 0, 'left': 0, 'aside': 0, 'regarding': 0, 'strong': 0, 'moments': 0, 'offer': 0, 'warning': 0, 'reveal': 0, 'scoop': 0, 'ends': 0, 'reviewbr': 0, 'annie': 0, 'hall': 0, 'flukebr': 0, 'hugh': 0, 
'jackmans': 0, 'naked': 0, 'chest': 0, 'itbr': 0, 'woody': 0, 'allens': 0, 'misogyny': 0, 'fixation': 0, 'women': 0, 'point': 0, 'granddaughter': 0, 'crippled': 0, 'pointbr': 0, 'promising': 0, 'ian': 0, 'mcshane': 0, 'directs': 0, 'fluffy': 0, 'headed': 0, 'student': 0, 'scarlett': 0, 'johansson': 0, 'investigate': 0, 'english': 0, 'lord': 0, 'jackman': 0, 'notorious': 0, 'tarot': 0, 'card': 0, 'killer': 0, 'prostitutes': 0, 'magician': 0, 'allen': 0, 'helps': 0, 'girlbr': 0, 'notwithstanding': 0, 'completely': 0, 'lacks': 0, 'charm': 0, 'atmosphere': 0, 'amazingly': 0, 'leaden': 0, 'amateurish': 0, 'effort': 0, 'previous': 0, 'dozens': 0, 'perhaps': 0, 'stroke': 0, 'gone': 0, 'unreported': 0, 'pressbr': 0, 'unlike': 0, 'septuagenarian': 0, 'allowed': 0, 'male': 0, 'lead': 0, 'constructed': 0, 'girl': 0, 'onebr': 0, 'central': 0, 'allows': 0, 'gotten': 0, 'drunk': 0, 'seduced': 0, 'powerful': 0, 'euphemism': 0, 'slam': 0, 'bam': 0, 'gotta': 0, 'kind': 0, 'moment': 0, 'bears': 0, 'relation': 0, 'whatsoever': 0, 'cheapens': 0, 'viewers': 0, 'add': 0, 'unnecessary': 0, 'female': 0, 'cake': 0, 'eat': 0, 'toobr': 0, 'command': 0, 'except': 0, 'wearing': 0, 'tight': 0, 'top': 0, 'imitates': 0, 'weird': 0, 'sadbr': 0, 'scripted': 0, 'doll': 0, 'function': 0, 'elderly': 0, 'less': 0, 'aweinspiring': 0, 'turnbr': 0, 'approximately': 0, 'age': 0, 'comes': 0, 'across': 0, 'vapid': 0, 'togetherbr': 0, 'audience': 0, 'breasts': 0, 'deserve': 0, 'heroines': 0, 'deserves': 0, 'heroine': 0, 'intelligence': 0, 'agency': 0, 'convey': 0, 'qualitiesbr': 0, 'similarly': 0, 'cheated': 0, 'apparently': 0, 'stand': 0, 'stunningly': 0, 'looking': 0, 'used': 0, 'merely': 0, 'shame': 0, 'productions': 0, 'oklahoma': 0, 'x': 0, 'actbr': 0, 'heres': 0, 'twist': 0, 'suave': 0, 'charming': 0, 'letting': 0, 'shes': 0, 'prostitute': 0, 'punish': 0, 'beyond': 0, 'graspbr': 0, 'passive': 0, 'aggressive': 0, 'touch': 0, 'deprives': 0, 'killing': 0, 'leaving': 0, 'alone': 0, 'note': 0, 'screening': 
0, 'single': 0, 'member': 0, 'laughed': 0, 'sign': 0, 'advertised': 0, 'based': 0, 'title': 0, 'cute': 0, 'obvious': 0, 'direction': 0, 'within': 0, 'main': 0, 'contention': 0, 'remains': 0, 'tolerably': 0, 'consistent': 0, 'explain': 0, 'behind': 0, 'alcoholic': 0, 'overworked': 0, 'stressedout': 0, 'occasional': 0, 'resolution': 0, 'bottles': 0, 'randombr': 0, 'writing': 0, 'secondary': 0, 'concern': 0, 'werent': 0, 'desired': 0, 'niche': 0, 'pushed': 0, 'effect': 0, 'lame': 0, 'scripting': 0, 'glaringly': 0, 'filming': 0, 'connecticut': 0, 'southern': 0, 'california': 0, 'gods': 0, 'sake': 0, 'palm': 0, 'trees': 0, 'everywhere': 0, 'guy': 0, 'welcome': 0, 'throws': 0, 'newspaper': 0, 'greenwich': 0, 'herald': 0, 'stamford': 0, 'advocate': 0, 'refering': 0, 'makers': 0, 'done': 0, 'research': 0, 'god': 0, 'remotely': 0, 'care': 0, 'reviews': 0, 'said': 0, 'anything': 0, 'entertained': 0, 'intellectual': 0, 'popcorn': 0, 'shatner': 0, 'standout': 0, 'supporting': 0, 'cop': 0, 'russos': 0, 'coach': 0, 'funniest': 0, 'sit': 0, '2000s': 0, 'close': 0, 'king': 0, 'kongs': 0, 'adopted': 0, 'daughter': 0, 'went': 0, 'ahead': 0, 'tearful': 0, 'announcement': 0, 'coming': 0, 'endbr': 0, 'miss': 0, 'winfrey': 0, 'tearing': 0, 'laughing': 0, 'screaming': 0, 'wild': 0, 'indian': 0, 'westbr': 0, 'oprah': 0, 'puts': 0, 'whove': 0, 'suffered': 0, 'lost': 0, 'virginity': 0, 'theyve': 0, 'melted': 0, 'faces': 0, 'true': 0, 'missing': 0, 'paragraph': 0, 'spousal': 0, 'abuse': 0, 'tell': 0, 'havent': 0, 'heard': 0, 'tons': 0, 'bethany': 0, 'hamilton': 0, 'losing': 0, 'arm': 0, 'shark': 0, 'october': 0, '31st': 0, '2003': 0, 'hard': 0, 'feelings': 0, 'biggest': 0, 'probably': 0, 'jacksons': 0, 'interview': 0, '1993': 0, 'accused': 0, 'child': 0, 'molester': 0, 'sadly': 0, 'mr': 0, 'jackson': 0, 'particular': 0, 'michaels': 0, 'timebr': 0, 'oprahs': 0, 'influence': 0, 'middle': 0, 'aged': 0, 'soccer': 0, 'moms': 0, 'seem': 0, 'jesus': 0, 'sometimes': 0, 'ghetto': 0, 'rich': 0, 
'isbr': 0, 'glad': 0, 'soon': 0, 'need': 0, 'television': 0, 'programs': 0, 'underrated': 0, 'imaginative': 0, 'creative': 0, 'opinion': 0, 'tops': 0, 'forgotten': 0, 'became': 0, 'overlooked': 0, 'bill': 0, 'teds': 0, 'bogus': 0, 'journey': 0, 'bombed': 0, 'box': 0, 'office': 0, 'whereas': 0, 'popular': 0, 'problem': 0, 'released': 0, '1991': 0, 'ted': 0, 'already': 0, '80s': 0, 'landscape': 0, 'changed': 0, 'radically': 0, 'gangsta': 0, 'rap': 0, 'hip': 0, 'hop': 0, 'pearl': 0, 'nirvana': 0, 'grunge': 0, 'seattle': 0, 'sound': 0, 'ozzy': 0, 'osbourne': 0, 'van': 0, 'halen': 0, 'guns': 0, 'n': 0, 'roses': 0, 'outdated': 0, '91': 0, 'nobody': 0, 'surfers': 0, 'saying': 0, 'stuff': 0, 'excellent': 0, 'gremlins': 0, '2': 0, '90s': 0, 'similar': 0, 'fate': 0, 'associated': 0, 'transition': 0, 'faster': 0, 'change': 0, '00s': 0, '1988': 0, '1989': 0, '2002': 0, '2001': 0, 'lookslooked': 0, '1996br': 0, 'adventure': 0, 'instead': 0, 'quickly': 0, 'wildly': 0, 'received': 0, 'viva': 0, 'jackass': 0, 'spin': 0, 'focuses': 0, 'adventures': 0, 'margera': 0, 'pals': 0, 'johnny': 0, 'knoxville': 0, 'brandon': 0, 'dicamillo': 0, 'etc': 0, 'fair': 0, 'share': 0, 'grossout': 0, 'stunts': 0, 'bams': 0, 'torturing': 0, 'parentsbr': 0, 'sorry': 0, 'cool': 0, 'ego': 0, 'tripped': 0, 'painfully': 0, 'unfunny': 0, 'yes': 0, 'narcissistic': 0, 'belief': 0, 'overly': 0, 'intro': 0, 'coolly': 0, 'explaining': 0, 'whatever': 0, 'f': 0, 'wants': 0, 'camerawork': 0, 'idiot': 0, 'camcorder': 0, 'garage': 0, 'moved': 0, 'steady': 0, 'pace': 0, 'felt': 0, 'boring': 0, 'dangerous': 0, 'disgusting': 0, 'performedbr': 0, 'follow': 0, 'hero': 0, 'pranks': 0, 'tortures': 0, 'relatives': 0, 'feel': 0, 'mildly': 0, 'presented': 0, 'tedious': 0, 'fashion': 0, 'spinoff': 0, 'feed': 0, 'margeras': 0, 'wretched': 0, 'talk': 0, 'botched': 0, 'poseidon': 0, 'respect': 0, 'salvagers': 0, 'caine': 0, 'karl': 0, 'malden': 0, 'tow': 0, 'wreck': 0, 'eponymous': 0, 'ocean': 0, 'liner': 0, 'creaky': 0, 'tug': 0, 
'boat': 0, 'theyre': 0, 'challenged': 0, 'ruthless': 0, 'telly': 0, 'savalas': 0, 'machinegun': 0, 'toting': 0, 'goons': 0, 'sequel': 0, 'remake': 0, 'group': 0, 'survivors': 0, 'trek': 0, 'sinking': 0, 'ship': 0, 'shirley': 0, 'slim': 0, 'pickens': 0, 'peter': 0, 'boyle': 0, 'knight': 0, 'jack': 0, 'warden': 0, 'blind': 0, 'surely': 0, 'wish': 0, 'sally': 0, 'field': 0, 'particularly': 0, 'annoying': 0, 'stowaway': 0, 'board': 0, 'caines': 0, 'tugbr': 0, 'disaster': 0, 'master': 0, 'irwin': 0, 'produced': 0, 'decided': 0, 'direct': 0, 'thank': 0, 'fastforward': 0, 'rises': 0, 'falls': 0, 'stupid': 0, 'cliché': 0, 'difference': 0, 'javier': 0, 'bardem': 0, 'constructs': 0, 'buildings': 0, 'matter': 0, 'handsome': 0, 'cares': 0, 'car': 0, 'wrecks': 0, 'heroes': 0, 'struggle': 0, 'smells': 0, 'melodrama': 0, 'marries': 0, 'maria': 0, 'de': 0, 'madeiros': 0, 'insteadshe': 0, 'magnificently': 0, 'poetically': 0, 'heartshaped': 0, 'face': 0, 'oralinterface': 0, 'maribel': 0, 'verdu': 0, 'washes': 0, 'vulva': 0, 'beforehand': 0, 'handwashed': 0, 'sexy': 0, 'interest': 0, 'threesome': 0, 'minor': 0, 'theme': 0, 'spanishlanguage': 0, 'rise': 0, 'fall': 0, 'succeeds': 0, 'odds': 0, 'clear': 0, 'highclass': 0, 'soap': 0, 'opera': 0, 'premise': 0, 'execution': 0, 'entirely': 0, 'bminus': 0, 'gags': 0, 'appear': 0, 'spliced': 0, 'trailers': 0, '22orso': 0, 'waning': 0, 'anticipation': 0, 'morsel': 0, 'keep': 0, 'fidgeting': 0, 'remote': 0, 'counting': 0, 'carpet': 0, 'fibers': 0, 'exceptions': 0, 'comical': 0, 'overemoting': 0, 'gesticulating': 0, 'suited': 0, 'latenight': 0, 'infomercial': 0, 'primetime': 0, 'sitcom': 0, 'canadian': 0, 'admittedly': 0, 'cultural': 0, 'angle': 0, 'misfired': 0, 'cbc': 0, 'replicate': 0, 'success': 0, 'corner': 0, 'gas': 0, 'tone': 0, 'wrong': 0, 'prairies': 0, 'couldnt': 0, 'afford': 0, 'location': 0, 'actual': 0, 'town': 0, 'saskatchewan': 0, 'fooled': 0, 'regina': 0, 'exteriors': 0, 'proud': 0, 'primed': 0, 'cbcs': 0, 'publicists': 0, 
'forgets': 0, 'colossal': 0, 'embarrassment': 0, 'wasted': 0, '90min': 0, 'turn': 0, 'gangster': 0, 'dude': 0, 'cover': 0, 'false': 0, 'advertising': 0, 'dumbstupid': 0, 'membersbr': 0, 'solid': 0, 'didbr': 0, 'write': 0, 'wasting': 0, 'rubbish': 0, 'vote': 0, 'belongs': 0, 'bottom': 0, '100': 0, 'youve': 0, 'sucked': 0, 'bitter': 0, 'dare': 0, 'torture': 0, 'war': 0, 'criminals': 0, 'terriosts': 0, 'theyll': 0, 'spilling': 0, 'beans': 0, 'begging': 0, 'mercy': 0, 'despite': 0, 'apparent': 0, 'structural': 0, 'similarity': 0, 'simpsons': 0, 'loud': 0, 'fat': 0, 'dad': 0, 'housewifey': 0, 'children': 0, 'pet': 0, 'typical': 0, 'suburban': 0, 'home': 0, 'functionally': 0, 'stylistically': 0, 'opposite': 0, 'avid': 0, 'cutaway': 0, 'stay': 0, 'nail': 0, 'rhea': 0, 'perlman': 0, 'danny': 0, 'devito': 0, '5': 0, 'spinesnappingly': 0, 'successful': 0, 'contrived': 0, 'advance': 0, 'bits': 0, 'insultingbr': 0, 'chemistry': 0, 'stewie': 0, 'brian': 0, 'griffin': 0, 'lends': 0, 'pure': 0, 'gold': 0, 'chris': 0, 'meg': 0, 'manage': 0, 'fulfill': 0, 'obligatory': 0, 'teenagers': 0, 'dysfunctional': 0, 'familybr': 0, 'feature': 0, 'untold': 0, 'mindnumbingly': 0, 'goodness': 0, 'dvdbr': 0, 'contrary': 0, 'tiresome': 0, 'comparisons': 0, 'perennial': 0, 'seth': 0, 'macfarlanes': 0, 'approach': 0, 'politically': 0, 'incorrect': 0, 'brazenbr': 0, 'creator': 0, 'ren': 0, 'stimpy': 0, 'john': 0, 'kricfalusi': 0, 'famously': 0, 'criticized': 0, 'extremely': 0, 'graphic': 0, 'standards': 0, 'cartoonists': 0, 'standpoint': 0, 'detailing': 0, 'accuracy': 0, 'spoofs': 0, 'successfulbr': 0, 'bowl': 0, 'chips': 0, 'smothered': 0, 'ranch': 0, 'sauce': 0, 'ass': 0, 'biographical': 0, 'charles': 0, 'lindbergh': 0, 'fly': 0, 'solo': 0, 'nonstop': 0, 'atlantic': 0, '1927': 0, 'plane': 0, 'spirit': 0, 'st': 0, 'louisbr': 0, 'amongst': 0, 'billy': 0, 'wilders': 0, 'boast': 0, 'impressive': 0, 'values': 0, 'performances': 0, 'outstandingbr': 0, 'definite': 0, 'limiting': 0, 'factor': 0, 
'storytelling': 0, 'flew': 0, 'speak': 0, 'necessitated': 0, 'techniques': 0, 'internal': 0, 'monologues': 0, 'speaking': 0, 'housefly': 0, 'bouts': 0, 'exhaustion': 0, 'sets': 0, 'order': 0, 'avoid': 0, 'extended': 0, 'flight': 0, 'interspersed': 0, 'flashbacks': 0, 'methodical': 0, 'preparations': 0, 'flightbr': 0, 'huge': 0, 'era': 0, 'controversial': 0, 'beliefs': 0, 'taint': 0, 'legacy': 0, 'continue': 0, 'contribute': 0, 'aviation': 0, 'assisted': 0, 'civilian': 0, 'aircraft': 0, 'consultant': 0, 'wwiibr': 0, 'jimmy': 0, 'certainly': 0, 'flying': 0, 'background': 0, 'portrayal': 0, 'rose': 0, 'rank': 0, 'colonel': 0, 'force': 0, 'wwii': 0, 'reserves': 0, 'following': 0, 'reach': 0, 'brigadier': 0, 'general': 0, 'study': 0, 'wealth': 0, 'masterfully': 0, 'depicted': 0, 'written': 0, 'windbr': 0, 'kyle': 0, 'hadley': 0, 'kyles': 0, 'mitch': 0, 'difficulty': 0, 'finished': 0, 'college': 0, 'thrown': 0, 'wealthy': 0, 'oil': 0, 'controls': 0, 'townbr': 0, 'ny': 0, 'meets': 0, 'dreams': 0, 'nicely': 0, 'lauren': 0, 'bacall': 0, 'whirlwind': 0, 'romance': 0, 'fatherinlaw': 0, 'warns': 0, 'difficult': 0, 'sleeps': 0, 'gun': 0, 'pillow': 0, 'lee': 0, 'tramp': 0, 'fullest': 0, 'dorothy': 0, 'malone': 0, 'voted': 0, 'actressbr': 0, 'rock': 0, 'hudson': 0, 'faithful': 0, 'friendbr': 0, 'wedded': 0, 'bliss': 0, 'bride': 0, 'wife': 0, 'reveals': 0, 'indeed': 0, 'pregnant': 0, 'thinking': 0, 'mitchs': 0, 'drunken': 0, 'frenzy': 0, 'accidentally': 0, 'dead': 0, 'memorable': 0, 'scenebr': 0, 'tries': 0, 'unsuccessful': 0, 'blaming': 0, 'courtroom': 0, 'pulled': 0, 'stops': 0, 'admitting': 0, 'unfortunate': 0, 'oscar': 0, 'deservedbr': 0, 'surprisingly': 0, 'robert': 0, 'stack': 0, 'nominated': 0, 'upset': 0, 'victory': 0, 'anthony': 0, 'quinn': 0, 'paul': 0, 'gauguin': 0, 'lust': 0, 'lifebr': 0, 'sirk': 0, '1950s': 0, 'exception': 0, 'consider': 0, 'reviewer': 0, 'extolling': 0, 'virtues': 0, 'include': 0, 'gore': 0, 'uh': 0, 'setup': 0, 'codys': 0, 'comrade': 0, 'live': 0, 
'blown': 0, 'cody': 0, 'holding': 0, 'lifeless': 0, 'bloody': 0, 'daily': 0, 'basis': 0, 'viewing': 0, 'glasses': 0, 'defines': 0, 'persona': 0, 'erased': 0, 'viewer': 0, 'memorybr': 0, 'rambo': 0, 'roams': 0, 'country': 0, 'bike': 0, 'long': 0, 'hometown': 0, 'usa': 0, 'guise': 0, 'nevada': 0, 'city': 0, 'realization': 0, 'damaged': 0, 'goods': 0, 'co': 0, 'declares': 0, 'destruction': 0, 'explains': 0, 'none': 0, 'notice': 0, 'flat': 0, 'post': 0, 'traumatic': 0, 'stress': 0, 'disorder': 0, 'guessing': 0, 'remember': 0, 'ordered': 0, 'battle': 0, 'fieldbr': 0, 'accidental': 0, 'kiss': 0, 'noted': 0, 'exactly': 0, 'respecting': 0, 'faith': 0, 'hitting': 0, 'knowing': 0, 'full': 0, 'spoken': 0, 'nonfamily': 0, 'value': 0, 'announced': 0, 'immediate': 0, 'universe': 0, 'posted': 0, 'youtube': 0, 'faiths': 0, 'lapse': 0, 'fidelity': 0, 'woman': 0, 'plans': 0, 'marry': 0, 'xmas': 0, 'cheering': 0, 'cheaten': 0, 'hearts': 0, 'lipsbr': 0, 'fiancé': 0, 'professes': 0, 'nano': 0, 'second': 0, 'accept': 0, 'proposal': 0, 'waited': 0, 'generous': 0, 'loves': 0, 'believes': 0, 'marriage': 0, 'compromises': 0, 'discussed': 0, 'doers': 0, 'herebr': 0, 'asner': 0, 'hill': 0, 'dialogue': 0, 'son': 0, 'literally': 0, 'days': 0, 'met': 0, 'stranger': 0, 'named': 0, 'band': 0, 'brothers': 0, 'speech': 0, 'phrase': 0, 'intended': 0, 'apply': 0, 'virtual': 0, 'strangers': 0, 'candy': 0, 'fluff': 0, 'betrays': 0, 'ways': 0, 'grossly': 0, 'applauds': 0, 'disrespect': 0, 'physically': 0, 'redefining': 0, 'wit': 0, 'accepting': 0, 'fledged': 0, 'loving': 0

  new_batch = []
  for example in batch:
    # One counter per vocabulary entry, including the special unk and pad tokens
    feature_vec = [0] * len(itos)
    for idx in example:
      # Each example is a list of indices.
      # Each index is the unique id that the vocab assigns to a token of the text
      # Count how many times each word appears in the review (TF) and map the counts onto the fixed
      # Bag-of-Words vocab (max_vocab_size + 2 positions; increment the ones that appear in the review)
      feature_vec[idx]+=1
      # TODO you can modify the previous line to implement a TF-IDF weighting
      # TODO a simpler alternative is to use CTF as a feature, that is, replace every position in the feature vector by the collection term frequency (you need to manipulate the Counter returned by vocab.freqs.most_common(max_vocab_size))
    new_batch.append(feature_vec)
  # Position i of each feature vector holds the number of times vocab entry i occurs in the example
  return new_batch

# Reading the files

### Data processing

The following cell parses the data contained in the CSV files that you just downloaded.
Although the code block is long, it performs a few familiar steps:


1.   Clean and tokenize the text to generate TF (term frequency) features;
2.   Pair each text with its relevance judgement (label);
3.   Create Datasets and Iterators to work with PyTorch.
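
To make step 1 concrete, here is a minimal sketch of term-frequency (TF) features in plain Python, independent of torchtext. The helper names `tokenize` and `tf_features` and the three-word vocabulary are hypothetical and only serve as an illustration; stop-word removal is left out to keep the sketch short.

```python
import string
from collections import Counter

def tokenize(text):
    # Lowercase, strip punctuation, and split on whitespace
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.strip().split()

def tf_features(text, vocab):
    # Count how often each vocabulary word occurs in the text.
    # Words outside the vocabulary are simply ignored in this sketch.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]

vocab = ['film', 'good', 'bad']  # hypothetical 3-word vocabulary
print(tf_features("Good film. A really GOOD film, not bad!", vocab))  # [2, 2, 1]
```

Each review becomes a vector with one position per vocabulary word, which is the same idea that the feature extraction function applies to every batch.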

In [None]:
# Set the number of "input dimensions", this is the maximum vocab size that
# we will consider for sentiment analysis

# TODO you will need to tweak this number (increase it for more features, which may improve performance)
max_vocab_size = 1000
# Number of samples used to compute the gradients in each iteration
batch_size = 32

# Load and tokenize CSV files to memory
def load_file(filepath, device, max_vocab_size):
    # Step 1 - Data preprocessing
    # Remove punctuation and split the text into tokens.
    tokenizer = lambda x: str(x).translate(str.maketrans('', '', string.punctuation)).strip().split()
    
    # Use torchtext to create data fields for the text and labels
    # This will create our vocab, which we use to generate our features
    text_data_field = data.Field(sequential=True, lower=True, tokenize=tokenizer,
                                 stop_words=stopwords.words('english'), use_vocab=True,
                                 postprocessing=lambda b, vocab:extract_text_features(b, vocab))
    label_data_field = data.Field(sequential=False, use_vocab=False)
    
    print("Loading from csv...")
    tv_datafields = [("text", text_data_field), ("label", label_data_field)]
    
    # Step 2 - Build Pytorch Dataset.
    # This Dataset is an abstraction to pack our data into a data structure that
    # can easily be manipulated by PyTorch
    train_dataset, valid_dataset, test_dataset = data.TabularDataset.splits(path=filepath,
                                                    train="Train.csv", validation="Valid.csv",
                                                    test="Test.csv", format="csv",
                                                    skip_header=False, fields=tv_datafields)
    print(train_dataset[0].__dict__.keys())
    
    
    # Step 3 - Build our vocabulary from the text field that we created in step 1
    # We use a maximum threshold to save memory
    if max_vocab_size == 0:
      text_data_field.build_vocab(train_dataset)
    else:
      text_data_field.build_vocab(train_dataset, max_size = max_vocab_size)
    print("Text vocabulary built.")
    
    # Step 4 - Build our dataset iterators to go through the data. 
    train_iter = data.Iterator(train_dataset, device=device, train=True, batch_size=batch_size, shuffle=True)
    valid_iter = data.Iterator(valid_dataset, device=device, batch_size=batch_size, shuffle=True)
    test_iter = data.Iterator(test_dataset, device=device, batch_size=batch_size, shuffle=True)
    
    print("Iterator built.")
    return text_data_field, label_data_field, train_dataset, valid_dataset, test_dataset, train_iter, valid_iter, test_iter

# Select the device that is available for training/inference
# On your local machine, PyTorch can use the GPU if you have CUDA installed
# If not, PyTorch will use the CPU to train the model. CUDA comes pre-installed in Colab
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("I'm using this device: {}".format(device))
print("Don't forget to change the runtime type to GPU if you're working on Colab.")


text_data_field, label_data_field, train_dataset, valid_dataset, test_dataset, train_iter, valid_iter, test_iter = \
load_file('/content/sentiment-analysis-data/movie_reviews', device, max_vocab_size=max_vocab_size)

I'm using this device: cpu
Don't forget to change the runtime type to GPU if you're working on Colab.
Loading from csv...
dict_keys(['text', 'label'])
Text vocabulary built.
Iterator built.


### Exploring our data

In [None]:
# Let's explore the words that will be used as features for our model
# 1 - Using the Vocab structure that was created with our text_data_field obtain the following:

# Hint: https://torchtext.readthedocs.io/en/latest/vocab.html
# These 3 variables may help you
freqs = text_data_field.vocab.freqs
stoi = text_data_field.vocab.stoi
itos = text_data_field.vocab.itos

# Question 1.1 - What are the 15 most frequent words in our dataset?
# Another Hint: freqs is a collections.Counter object https://docs.python.org/3.7/library/collections.html#collections.Counter
print(text_data_field.vocab.freqs.most_common(100))

# Question 1.2 - How many times does the 51st most frequent word occur?
# print(YOUR CODE HERE)

[('film', 7163), ('one', 4433), ('movie', 4354), ('like', 2858), ('even', 2024), ('good', 1833), ('time', 1794), ('story', 1704), ('films', 1688), ('would', 1630), ('much', 1625), ('also', 1588), ('character', 1576), ('characters', 1558), ('get', 1525), ('two', 1458), ('first', 1436), ('see', 1386), ('well', 1382), ('way', 1348), ('make', 1290), ('really', 1249), ('little', 1209), ('life', 1197), ('plot', 1172), ('movies', 1145), ('bad', 1122), ('scene', 1113), ('could', 1102), ('never', 1088), ('people', 1078), ('new', 1046), ('best', 1040), ('doesnt', 1035), ('man', 1028), ('scenes', 1025), ('many', 1017), ('know', 982), ('dont', 960), ('hes', 936), ('great', 910), ('another', 904), ('go', 883), ('love', 870), ('us', 869), ('director', 855), ('end', 854), ('action', 850), ('something', 834), ('seems', 831), ('back', 828), ('still', 818), ('however', 801), ('made', 800), ('makes', 792), ('work', 791), ('world', 771), ('big', 771), ('though', 769), ('theres', 765), ('years', 748), ('ev

In [None]:
"""Use this cell if you want to debug the data
  Watch out, each time you run next() you will "remove" a batch from the iterator
  Don't forget to run the previous cell again before training the model to refresh the iterator
"""
e = next(iter(train_iter))
print(e.text.T.shape) # This will display the shape of the batch matrix (Nr_samples x Nr_Feature_Dimensions)
print(e.text.T[15]) # Row 15 is one review: position 15,0 is the count of <unk>, 15,1 the count of <pad>, and 15,i the count of the word itos[i]
print(text_data_field.vocab.freqs.most_common(100))
print(text_data_field.vocab.itos)
print(text_data_field.vocab.stoi)

torch.Size([32, 102])
tensor([334, 767,  10,   1,   7,   1,   2,   4,   1,   0,   0,   1,   2,   1,
          0,   1,   5,   0,   1,   2,   1,   2,   2,   2,   0,   0,   0,   1,
          3,   0,   0,   0,   0,   0,   1,   0,   0,   0,   1,   0,   5,   0,
          0,   1,   0,   0,   3,   0,   0,   0,   0,   2,   0,   0,   0,   0,
          1,   0,   3,   1,   0,   0,   0,   2,   0,   0,   3,   1,   3,   1,
          0,   0,   0,   0,   0,   3,   0,   0,   0,   0,   2,   6,   3,   1,
          0,   0,   0,   0,   0,   0,   1,   2,   3,   3,   0,   0,   0,   1,
          0,   0,   1,   0])
[('film', 7163), ('one', 4433), ('movie', 4354), ('like', 2858), ('even', 2024), ('good', 1833), ('time', 1794), ('story', 1704), ('films', 1688), ('would', 1630), ('much', 1625), ('also', 1588), ('character', 1576), ('characters', 1558), ('get', 1525), ('two', 1458), ('first', 1436), ('see', 1386), ('well', 1382), ('way', 1348), ('make', 1290), ('really', 1249), ('little', 1209), ('life', 1197), ('p

### Building a Logistic Regression model with Pytorch


In [None]:
# We will use a logistic regression, which is a binary classifier to determine if the text reviews are positive or negative
class LogisticRegression(nn.Module):
  def __init__(self, input_dim, nr_classes):
    super(LogisticRegression, self).__init__()
    self.linear = torch.nn.Linear(input_dim, nr_classes)
    self.activation_functions = [nn.Sigmoid(), nn.Softmax(),
                                 nn.LogSigmoid(), nn.LogSoftmax(),
                                 nn.ReLU(), nn.Tanh()]
    # TODO change the activation function here
    self.act_func = self.activation_functions[0]

  def forward(self, x):
    # Return the application of the linear layer over the input vector x
    logits = self.linear(x.T)
    probs = self.act_func(logits)
    # Squeeze values to suppress deprecation warnings due to different vector shapes
    return logits.T.squeeze(0), probs.T.squeeze(0)

  # Evaluating our Model with accuracy
  def binary_accuracy(self, preds, y):
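      # Return accuracy per batch.
      # Threshold the probabilities at 0.5 first: calling int() on them directly truncates every value below 1 to 0
      correct = ((preds > 0.5).int() == y.int()).float()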
      acc = correct.sum() / len(correct)
      return acc

### Preparing our model to train and evaluate

As part of our machine learning recipe, we also need a loss function (to measure by how much the prediction for an example is wrong) and an optimizer that updates the model weights to reduce the loss.
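
As a rough illustration of these two ingredients, the sketch below computes the binary cross-entropy by hand, once from a probability and once directly from a logit, and then applies a single gradient step to one weight. It uses plain Python and plain gradient descent rather than PyTorch and Adam; the function names and the toy values of `w`, `x` and `y` are assumptions made for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_from_prob(p, y):
    # Binary cross-entropy for a probability p and a label y in {0, 1}
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_from_logit(z, y):
    # The same loss computed directly from the logit z, in a numerically stable form
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Toy example: a single weight w, a single feature x, and a positive label
w, x, y = 0.0, 1.5, 1
loss_before = bce_from_logit(w * x, y)

# Gradient of the loss with respect to w, followed by one gradient-descent update
grad_w = (sigmoid(w * x) - y) * x
w = w - 0.1 * grad_w
loss_after = bce_from_logit(w * x, y)

print(loss_before, loss_after)  # the loss decreases after the update
print(bce_from_prob(sigmoid(2.0), 1), bce_from_logit(2.0, 1))  # both forms agree
```

PyTorch's `BCEWithLogitsLoss` corresponds to the second form: it applies the sigmoid internally and therefore expects raw logits, whereas `BCELoss` expects probabilities.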

In [None]:
# Initialize required variables
epochs = 10
# Question 3 - We want to use logistic regression as a binary classifier.
# What should the number of output dimensions be? Hint: it is probably one less than you are thinking. Why?
nr_classes = 1
learning_rate = 0.001

# This input dim was determined upon data processing time
# Add 2 to the vocab size due to the special tokens added to the vocab 'unk' and 'pad'
model = LogisticRegression(max_vocab_size + 2, nr_classes)

# Use binary cross entropy loss for binary classification
loss_func = torch.nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Let's send our model and loss function to the device. Hopefully it is a GPU
model = model.to(device)
loss_func = loss_func.to(device)

### The training and evaluation loop

In [None]:
# Let's create our training and eval functions
def train(model, iterator, loss_func, optimizer):
  epoch_loss = 0.0
  preds = []
  rel_judgements = []
  # Signal Pytorch that we are training our model
  model.train()

  total_steps = len(iterator)
  # for i, batch in enumerate(tqdm(iterator, total=total_steps, desc='Train - Iteration')):
  for i, batch in enumerate(iterator):
    optimizer.zero_grad()

    # This will invoke our logistic regression model
    logits, predictions = model(batch.text.float())

    # BCEWithLogitsLoss applies the sigmoid internally, so it takes the raw logits, not the probabilities
    loss = loss_func(logits, batch.label.float())
    
    loss.backward()
    optimizer.step()

    epoch_loss += loss.item()
    if i % 100 == 0:
      # Print loss update every 100 batches/steps
      print(f"[{i}/{total_steps}] : Loss: {loss.item():.2f}")
    preds.extend(predictions.tolist())
    rel_judgements.extend(batch.label.float().tolist())

  preds = np.where(np.asarray(preds) > 0.5, 1.0, 0.0)
  return epoch_loss / total_steps, preds, rel_judgements

from sklearn.metrics import classification_report
def evaluate(model, iterator, loss_func):
  epoch_loss = 0.0
  epoch_acc = 0.0
  preds = []
  rel_judgements = []
  
  # Signal pytorch that we are evaluating our model
  model.eval()
  
  total_steps = len(iterator)
  # Use torch.no_grad() to signal that we do not need to compute gradients during evaluation
  with torch.no_grad():
    #for i, batch in enumerate(tqdm(iterator, total=total_steps, desc='Eval - Iteration')):
    for i, batch in enumerate(iterator):
      # prediction [batch_size]
      logits, predictions = model(batch.text.float())

      # BCEWithLogitsLoss applies the sigmoid internally, so it takes the raw logits, not the probabilities
      loss = loss_func(logits, batch.label.float())
      # Let's use the function that we defined earlier
      acc = model.binary_accuracy(predictions, batch.label)
        
      epoch_loss += loss.item()
      epoch_acc += acc.item()

      preds.extend(predictions.tolist())
      rel_judgements.extend(batch.label.float().tolist())
      
  preds = np.where(np.asarray(preds) > 0.5, 1.0, 0.0)
  return epoch_loss / total_steps, preds, rel_judgements

### Putting everything together

In [None]:
# Let's run our training loop over the data for the number of epochs that we have set
best_valid_loss = float('inf') # Smaller loss is an indicator of a better model
target_names = ['Neg', 'Pos']
for epoch in trange(epochs, desc="Epoch"):
  train_loss, train_preds, train_qrels = train(model, train_iter, loss_func, optimizer)
  valid_loss, val_preds, val_qrels = evaluate(model, valid_iter, loss_func)

  # If our validation loss at the end of each epoch is better than our best, we want to save that model
  if valid_loss < best_valid_loss:
    best_valid_loss = valid_loss
    torch.save(model.state_dict(), 'my-sentiment-model.pt')

  print(f'\nEpoch: {epoch} | Train  Loss: {train_loss: .3f}')
  print(classification_report(train_qrels, train_preds, target_names=target_names))
  print(f'\nEpoch: {epoch} | Valid  Loss: {valid_loss: .3f}')
  print(classification_report(val_qrels, val_preds, target_names=target_names))

Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

[0/50] : Loss: 0.69


Epoch:  10%|█         | 1/10 [00:00<00:04,  1.85it/s]


Epoch: 0 | Train  Loss:  0.690
              precision    recall  f1-score   support

         Neg       0.50      0.99      0.67       793
         Pos       0.75      0.04      0.07       807

    accuracy                           0.51      1600
   macro avg       0.63      0.51      0.37      1600
weighted avg       0.63      0.51      0.37      1600


Epoch: 0 | Valid  Loss:  0.681
              precision    recall  f1-score   support

         Neg       0.56      1.00      0.72       109
         Pos       1.00      0.07      0.12        91

    accuracy                           0.57       200
   macro avg       0.78      0.53      0.42       200
weighted avg       0.76      0.57      0.45       200

[0/50] : Loss: 0.66


Epoch:  20%|██        | 2/10 [00:01<00:04,  1.88it/s]


Epoch: 1 | Train  Loss:  0.660
              precision    recall  f1-score   support

         Neg       0.57      0.93      0.71       793
         Pos       0.82      0.32      0.46       807

    accuracy                           0.62      1600
   macro avg       0.70      0.62      0.58      1600
weighted avg       0.70      0.62      0.58      1600


Epoch: 1 | Valid  Loss:  0.685
              precision    recall  f1-score   support

         Neg       0.65      0.92      0.76       109
         Pos       0.81      0.42      0.55        91

    accuracy                           0.69       200
   macro avg       0.73      0.67      0.66       200
weighted avg       0.72      0.69      0.67       200

[0/50] : Loss: 0.64


Epoch:  30%|███       | 3/10 [00:01<00:03,  1.86it/s]


Epoch: 2 | Train  Loss:  0.637
              precision    recall  f1-score   support

         Neg       0.64      0.95      0.77       793
         Pos       0.91      0.47      0.62       807

    accuracy                           0.71      1600
   macro avg       0.78      0.71      0.70      1600
weighted avg       0.78      0.71      0.69      1600


Epoch: 2 | Valid  Loss:  0.649
              precision    recall  f1-score   support

         Neg       0.65      0.98      0.78       109
         Pos       0.94      0.36      0.52        91

    accuracy                           0.70       200
   macro avg       0.80      0.67      0.65       200
weighted avg       0.78      0.70      0.66       200

[0/50] : Loss: 0.61


Epoch:  40%|████      | 4/10 [00:02<00:03,  1.87it/s]


Epoch: 3 | Train  Loss:  0.624
              precision    recall  f1-score   support

         Neg       0.69      0.95      0.80       793
         Pos       0.92      0.58      0.71       807

    accuracy                           0.76      1600
   macro avg       0.80      0.77      0.76      1600
weighted avg       0.81      0.76      0.76      1600


Epoch: 3 | Valid  Loss:  0.652
              precision    recall  f1-score   support

         Neg       0.69      0.95      0.80       109
         Pos       0.90      0.49      0.64        91

    accuracy                           0.74       200
   macro avg       0.80      0.72      0.72       200
weighted avg       0.79      0.74      0.73       200

[0/50] : Loss: 0.63


Epoch:  50%|█████     | 5/10 [00:02<00:02,  1.90it/s]


Epoch: 4 | Train  Loss:  0.613
              precision    recall  f1-score   support

         Neg       0.71      0.97      0.82       793
         Pos       0.95      0.60      0.74       807

    accuracy                           0.78      1600
   macro avg       0.83      0.79      0.78      1600
weighted avg       0.83      0.78      0.78      1600


Epoch: 4 | Valid  Loss:  0.660
              precision    recall  f1-score   support

         Neg       0.84      0.80      0.82       109
         Pos       0.77      0.81      0.79        91

    accuracy                           0.81       200
   macro avg       0.80      0.81      0.80       200
weighted avg       0.81      0.81      0.81       200

[0/50] : Loss: 0.65


Epoch:  60%|██████    | 6/10 [00:03<00:02,  1.87it/s]


Epoch: 5 | Train  Loss:  0.603
              precision    recall  f1-score   support

         Neg       0.74      0.95      0.84       793
         Pos       0.94      0.68      0.79       807

    accuracy                           0.81      1600
   macro avg       0.84      0.82      0.81      1600
weighted avg       0.84      0.81      0.81      1600


Epoch: 5 | Valid  Loss:  0.654
              precision    recall  f1-score   support

         Neg       0.81      0.84      0.83       109
         Pos       0.80      0.76      0.78        91

    accuracy                           0.81       200
   macro avg       0.80      0.80      0.80       200
weighted avg       0.80      0.81      0.80       200

[0/50] : Loss: 0.58


Epoch:  70%|███████   | 7/10 [00:03<00:01,  1.87it/s]


Epoch: 6 | Train  Loss:  0.598
              precision    recall  f1-score   support

         Neg       0.78      0.95      0.85       793
         Pos       0.93      0.73      0.82       807

    accuracy                           0.84      1600
   macro avg       0.86      0.84      0.84      1600
weighted avg       0.86      0.84      0.84      1600


Epoch: 6 | Valid  Loss:  0.629
              precision    recall  f1-score   support

         Neg       0.74      0.97      0.84       109
         Pos       0.95      0.58      0.72        91

    accuracy                           0.80       200
   macro avg       0.84      0.78      0.78       200
weighted avg       0.83      0.80      0.78       200

[0/50] : Loss: 0.57


Epoch:  80%|████████  | 8/10 [00:04<00:01,  1.88it/s]


Epoch: 7 | Train  Loss:  0.591
              precision    recall  f1-score   support

         Neg       0.78      0.97      0.87       793
         Pos       0.96      0.73      0.83       807

    accuracy                           0.85      1600
   macro avg       0.87      0.85      0.85      1600
weighted avg       0.87      0.85      0.85      1600


Epoch: 7 | Valid  Loss:  0.624
              precision    recall  f1-score   support

         Neg       0.81      0.84      0.83       109
         Pos       0.80      0.76      0.78        91

    accuracy                           0.81       200
   macro avg       0.80      0.80      0.80       200
weighted avg       0.80      0.81      0.80       200

[0/50] : Loss: 0.59


Epoch:  90%|█████████ | 9/10 [00:04<00:00,  1.87it/s]


Epoch: 8 | Train  Loss:  0.588
              precision    recall  f1-score   support

         Neg       0.79      0.96      0.87       793
         Pos       0.95      0.75      0.84       807

    accuracy                           0.86      1600
   macro avg       0.87      0.86      0.86      1600
weighted avg       0.87      0.86      0.86      1600


Epoch: 8 | Valid  Loss:  0.626
              precision    recall  f1-score   support

         Neg       0.77      0.94      0.84       109
         Pos       0.90      0.66      0.76        91

    accuracy                           0.81       200
   macro avg       0.83      0.80      0.80       200
weighted avg       0.83      0.81      0.80       200

[0/50] : Loss: 0.61


Epoch: 100%|██████████| 10/10 [00:05<00:00,  1.88it/s]


Epoch: 9 | Train  Loss:  0.583
              precision    recall  f1-score   support

         Neg       0.80      0.96      0.87       793
         Pos       0.95      0.76      0.84       807

    accuracy                           0.86      1600
   macro avg       0.87      0.86      0.86      1600
weighted avg       0.88      0.86      0.86      1600


Epoch: 9 | Valid  Loss:  0.639
              precision    recall  f1-score   support

         Neg       0.80      0.93      0.86       109
         Pos       0.89      0.73      0.80        91

    accuracy                           0.83       200
   macro avg       0.85      0.83      0.83       200
weighted avg       0.84      0.83      0.83       200






In [None]:
# Finally let's load our best model and evaluate on our test data
model.load_state_dict(torch.load('my-sentiment-model.pt'))
test_loss, test_preds, test_qrels = evaluate(model, test_iter, loss_func)
print(f'Test Loss: {test_loss:.3f}')
print(classification_report(test_qrels, test_preds))

Test Loss: 0.602
              precision    recall  f1-score   support

         0.0       0.72      0.91      0.81        98
         1.0       0.88      0.67      0.76       102

    accuracy                           0.79       200
   macro avg       0.80      0.79      0.78       200
weighted avg       0.80      0.79      0.78       200



How can we improve our results?

Go back to the code above and try tweaking the following parameters:
1. The maximum vocab size, which is the dimension of our feature vector;
2. The activation function, selected from the list of activation functions in the model definition. (Note that not all of them are useful for our problem.)
3. The number of epochs, the learning rate, and the batch size. These are, respectively, the number of times our model sees the data, how strongly it updates the model weights with respect to the losses, and how many examples it sees during each iteration.
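
To get a feel for the third group of parameters without retraining the full model, here is a small sketch of a one-feature logistic regression trained with mini-batch gradient descent in plain Python. It is only an illustration: the toy data and the names `train_toy` and `mean_loss` are made up for this example, and it uses plain gradient descent instead of the Adam optimizer used above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(w, b, data):
    # Average binary cross-entropy over (x, y) pairs
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

def train_toy(data, epochs, learning_rate, batch_size, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):  # epochs: number of passes over the data
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]  # batch_size: examples per update
            grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in batch) / len(batch)
            grad_b = sum(sigmoid(w * x + b) - y for x, y in batch) / len(batch)
            w -= learning_rate * grad_w  # learning_rate: size of each update
            b -= learning_rate * grad_b
    return w, b

# Hypothetical toy data: negative inputs are labelled 0, positive inputs 1
toy = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
for epochs in (1, 10, 100):
    w, b = train_toy(toy, epochs=epochs, learning_rate=0.1, batch_size=2)
    print(epochs, round(mean_loss(w, b, toy), 3))
```

Varying `epochs`, `learning_rate` and `batch_size` in the last loop shows how each one changes the final loss on this toy problem.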