
Training an RNN to generate dialogs for The Office

In this notebook I try out a Recurrent Neural Network to generate TV scripts for The Office. While the output is nowhere near realistic, it was still interesting to look at some of the lines generated this way.

Here’s some sample output. The complete outputs are here and here (lines from Phyllis only). Update: I later used the saved model from this notebook to generate somewhat better output here.

jim: i know, i’m not saying it. pam: how do you know a joke? phyllis: i don’t think we should get in here. dwight: thank you. pam: i want to be working on the phone? jim: oh, hey, darryl. what do you mean? michael: oh, thanks. jim: i want to go? andy: yeah. kevin: i’m sorry

The first thing to do was to fetch the lines from The Office, so I wrote some web scraping code for this. I first learned about Recurrent Neural Networks in my Deep Learning Nanodegree, where I did a similar project using TensorFlow, but I wanted to try this one out with Keras.

I found this notebook with a Keras implementation, but when I ran it as-is on the Office dataset as a first step, it exceeded the time limit on Kaggle Kernels with a free GPU, despite the very generous 9-hour limit.

So I started modifying the code and decided to use words instead of characters, thinking it would produce better results since the output from the original code contained some invalid words. This way the model would only have to learn to form sentences, not words too. However, this made the kernel run out of memory because of the huge increase in the number of building blocks: there were around 70 unique characters before, but around 10,000 unique words.
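
To get a sense of why the word-level one-hot encoding blows up, here is a rough back-of-the-envelope estimate of the input tensor size (the sequence count below is an assumed round number for illustration; the real figure depends on the corpus, maxlen and step):

# Rough size of the one-hot input x of shape (num_sequences, maxlen, vocab_size),
# one byte per boolean entry; 500,000 sequences is an assumed round number
maxlen = 20
num_sequences = 500_000
for vocab_size in (70, 10_000):  # roughly: unique characters vs. unique words
    size_gib = num_sequences * maxlen * vocab_size / 2**30
    print(f"vocab size {vocab_size:>6}: ~{size_gib:.0f} GiB")
# vocab size     70: ~1 GiB
# vocab size  10000: ~93 GiB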

To reduce the number of unique words, I first kept only Michael’s lines, but since he has so many, the data was still too large. Using just the lines for Phyllis gave some good results. Since I wanted to capture the styles of all the actors, though, I tried another way to reduce the data: taking the 2000 most common words and keeping only the lines made up entirely of these words.

All of these steps can be seen in this initial notebook.

# https://github.com/fchollet/deep-learning-with-python-notebooks/blob/master/8.1-text-generation-with-lstm.ipynb
import keras
import numpy as np
Using TensorFlow backend.
import ssl
# To avoid ssl: certificate_verify_failed error
ssl._create_default_https_context = ssl._create_unverified_context
# Get the script
def get_text():
    office_script_file_url = "https://raw.githubusercontent.com/Pradhyo/the-office-us-tv-show/master/the-office-all-episodes.txt"
    path = keras.utils.get_file('script.txt', origin=office_script_file_url)
    text = open(path).read().lower()
    return text
text = get_text()
print('Corpus length:', len(text))
Corpus length: 4108664
from collections import Counter
from pprint import pprint
char_counts = Counter()
for c in text:
    char_counts[c] += 1
pprint(char_counts.most_common())
pprint(len(char_counts))
[(' ', 683865),
('e', 325287),
('t', 257302),
('a', 255726),
('o', 249519),
('i', 231860),
('n', 192916),
('h', 182121),
('s', 168200),
('r', 145558),
('l', 140074),
('d', 111329),
('y', 104883),
('m', 102473),
('.', 100099),
('u', 97290),
('g', 81523),
('c', 79151),
('w', 78946),
('\n', 69462),
(':', 60428),
('p', 53410),
(',', 47960),
('k', 47018),
('b', 43158),
('f', 42647),
("'", 33029),
('v', 27859),
('?', 18735),
('j', 18556),
('[', 12031),
(']', 12021),
('!', 9991),
('-', 6245),
('x', 3625),
('’', 3209),
('"', 2435),
('z', 2005),
('q', 1331),
('0', 1038),
('1', 615),
('2', 424),
('…', 417),
('5', 360),
('”', 255),
('3', 251),
('4', 248),
('“', 246),
('9', 156),
('—', 149),
(';', 145),
('$', 138),
('8', 137),
('7', 115),
('6', 109),
('/', 88),
('&', 85),
('‘', 69),
('#', 59),
('*', 59),
('%', 58),
('–', 55),
(')', 45),
('(', 32),
('_', 6),
('é', 6),
('+', 4),
('ü', 4),
('@', 3),
('�', 3),
('ñ', 3),
('{', 2),
('}', 2),
('=', 1)]
74
# Get some sample strings for each character to explore the data
def sample_strings(char, string_length=20, num_samples=5):
    sample = 0
    samples = []
    for i, c in enumerate(text):
        if i < string_length:
            continue
        if char == c:
            samples.append(text[int(i-string_length/2):int(i+string_length/2)])
            sample += 1
            if sample == num_samples:
                break
    return samples
for c in char_counts:
    print(f"{c}: {sample_strings(c)}")
m: ['l right jim. your qu', 'ibrary?\njim: oh, i t', 'it. so...\nmichael: s', " you've come to the ", 'me to the master for']
i: ['ll right jim. your q', 'r quarterlies look v', 'how are things at th', 's at the library?\nji', 'library?\njim: oh, i ']
c: ["ld you. i couldn't c", " couldn't close it. ", '. so...\nmichael: so ', "so you've come to th", 'for guidance? is thi']
h: ['ery good. how are th', ' how are things at t', 'hings at the library', 'ry?\njim: oh, i told ', ' so...\nmichael: so y']
a: ['m. your quarterlies ', 'good. how are things', 're things at the lib', 't the library?\njim: ', 'so...\nmichael: so yo']
e: ['your quarterlies loo', ' quarterlies look ve', 'ies look very good. ', 'od. how are things a', 'ings at the library?']
l: ['ur quarterlies look ', 'arterlies look very ', 'gs at the library?\nj', ': oh, i told you. i ', "you. i couldn't clos"]
:: ['brary?\njim: oh, i to', "..\nmichael: so you'v", 'opper?\njim: actually', 'h.\nmichael: all righ', '.\n\nmichael: [on the ']
: ['right jim. your quar', ' jim. your quarterli', 'uarterlies look very', 'rlies look very good', ' look very good. how']
r: ['t jim. your quarterl', '. your quarterlies l', 'our quarterlies look', 'es look very good. h', 'ood. how are things ']
g: ['look very good. how ', 'w are things at the ', 'aster for guidance? ', "u're saying, grassho", 'e saying, grasshoppe']
t: [' your quarterlies lo', '. how are things at ', 'e things at the libr', 'things at the librar', 'im: oh, i told you. ']
j: [' library?\njim: oh, i', 'sshopper?\njim: actua', 'products. just wante', 'uh, yeah. just a fax', 'rumming]\n\njim: my jo']
.: [' right jim. your qua', ' very good. how are ', 'i told you. i couldn', 't close it. so...\nmi', 'ose it. so...\nmichae']
y: ['ight jim. your quart', 's look very good. ho', 'the library?\njim: oh', 'h, i told you. i cou', "chael: so you've com"]
o: ['ght jim. your quarte', 'rterlies look very g', 'terlies look very go', 'ook very good. how a', 'ok very good. how ar']
u: ['ht jim. your quarter', 'im. your quarterlies', ' i told you. i could', " you. i couldn't clo", "ael: so you've come "]
q: ['jim. your quarterlie', '-manger. [quick cut ', 'ut... uh, quantities', '\nmichael: question. ', ', you big queen.\nmic']
s: ['quarterlies look ver', ' are things at the l', "uldn't close it. so.", 'close it. so...\nmich', "\nmichael: so you've "]
k: ['erlies look very goo', "es, i'd like to spea", 'ke to speak to your ', 'ted to talk to you m', 'ger. [quick cut scen']
v: ['lies look very good.', "l: so you've come to", 'thank you very much,', 'she had a very low v', ' very low voice. pro']
d: ['k very good. how are', ' oh, i told you. i c', "ou. i couldn't close", 'er for guidance? is ', ' you called me in he']
w: ['y good. how are thin', "? is this what you'r", 'll right. well, let ', 'let me show you how ', "how you how it's don"]
n: ['ow are things at the', "u. i couldn't close ", ' for guidance? is th', "ou're saying, grassh", 'alled me in here, bu']
b: [' at the library?\njim', ' in here, but yeah.\n', 'voice. probably a sm', 'ice. probably a smok', " uh, i've been at du"]
?: ['he library?\njim: oh,', 'r guidance? is this ', 'rasshopper?\njim: act', 'right, pam?\npam: wel', '\npam: what?\nmichael:']
: ['e library?\njim: oh, ', ' it. so...\nmichael: ', 'asshopper?\njim: actu', ' but yeah.\nmichael: ', "it's done.\n\nmichael:"]
,: ['y?\njim: oh, i told y', "'re saying, grasshop", ': actually, you call', 'me in here, but yeah', 'ight. well, let me s']
': [". i couldn't close i", "el: so you've come t", "s what you're saying", "you how it's done.\n\n", "ne] yes, i'd like to"]
f: ['he master for guidan', ' to your office mana', 'to your office manag', ' manager of dunder m', ' dunder mifflin pape']
p: ['g, grasshopper?\njim:', ', grasshopper?\njim: ', ': [on the phone] yes', ' like to speak to yo', ' manager, please. ye']
[: ['\nmichael: [on the ph', 'a-manger. [quick cut', ' mistake. [hangs up]', 'er, so... [clears th', 'ears ago. [growls]\np']
]: [" the phone] yes, i'd", ' cut scene] all righ', ' [hangs up] that was', "ars throat] so that'", 'o. [growls]\npam: wha']
-: ['ou manager-a-manger.', ' manager-a-manger. [', '. pam! pam-pam! pam ', '... ringie-dingie-di', 'gie-dingie-ding!\njan']
1: ['fflin for 12 years, ', 't it for $1,200. fix', 'rofits by 17% or whe', 'it says, "100% post-', ' us erase 100 years ']
2: ['flin for 12 years, t', "got a '78 280z. boug", 'it for $1,200. fixed', ' up being 25% of my ', 'more like 200 years.']
!: [', pam. pam! pam-pam!', 'm! pam-pam! pam bees', 'ichael: oh! pam, thi', 'per basket! look at ', 'ok at that! look at ']
x: [' just a fax.\nmichael', 'well, i faxed one ov', 'e get a fax this mor', 'at with faxes.\njan: ', 'is fine. excellent.\n']
": [' they go, "god we\'ve', 'hilarious." "and you', 'larious." "and you g', 'out of us." [shows t', 'and said, "mr. scott']
z: [' be downsizing.\nmich', 'use downsizing is a ', 'out downsizing himse', 'not downsizing himse', 'out downsizing?\n\nmic']
7: [" i got a '78 280z. b", 'ofits by 17% or when', 'ill lose $7,000 if y', ' for like 7 seconds.', '..now, $9.78, signs ']
8: ["i got a '78 280z. bo", "ot a '78 280z. bough", " amazing '80s party ", 'e walked 18 miles.\nm', 'ftware, 128-bit encr']
0: ["t a '78 280z. bought", 't for $1,200. fixed ', ' for $1,200. fixed i', 'e worth, 50 cents?\nm', 'michael: 50 cents, y']
$: ['ht it for $1,200. fi', "there's a $1,200 ded", 've you... $25.\noscar', 'alkathon. $25.\noscar', 'will lose $7,000 if ']
5: ['se worth, 50 cents?\n', '\nmichael: 50 cents, ', ' notes at 50 cents a', 'up being 25% of my c', ' you... $25.\noscar: ']
%: ['fits by 17% or when ', 'says, "100% post-con', 'p being 25% of my co', 've that 99% of the p', 'wight: 100%.\npam: di']
9: ['ieve that 99% of the', 'eve that 99% of the ', 'ogi, only 99 cents p', 'gi, only 99 cents pl', "?\ntoby: '89.\nkaty: o"]
;: ["on't laugh; please d", "urt mozart; you're g", "s terrible; no one's", 'e his mind; he thoug', 'wait, wait; one thin']
3: ["at it's a 300ft drop", ': it goes 300 feet i', "at it's a 300ft drop", ', it goes 300 feet i', '0 pounds, 3 inches. ']
4: ['e turning 46, but, c', " um, he's 41 years o", 'can have 14. marjory', '\nmichael: 42897. ok.', '? this is 400 bucks.']
6: [' turning 46, but, co', ' pam. pam 6.0.\n[pam ', 'left me a 60 acre wo', "en you're 65. hey, i", 'lose on 126 over the']
): ['es phyllis)] um... y', 'iq (it guy): that ju', "iq (it guy): what's ", "iq (it guy): oh, it'", 'iq (it guy): by keyw']
&: ['ercrombie & fitch?\nm', "the most m&m's in th", " bowl of m&m's into ", 'y.\n\nkevin & oscar: o', " pack of m&m's, his "]
#: ['is "mambo #5." so...', "michael's #2 guy for", 'on worker #1: hey, y', 'on worker #2: ass, a', 'on worker #2: ...ass']
/: [' an amused/appreciat', ' too much / in this ', 'mmmpt.\npam/jim: [in ', 'ifty eyes / ryan - d', 'ed pupils / kelly - ']
(: ['...\nsadiq (it guy): ', '...\nsadiq (it guy): ', 'er]\nsadiq (it guy): ', 'ch?\nsadiq (it guy): ', '". [sadiq (it guy) t']
*: ['queer as f***, so...', 'ueer as f***, so...\n', 'eer as f***, so...\nj', 'is is bull****!\n\nmic', 's is bull****!\n\nmich']
+: ["' or an 'a+' but i c", "re's an 'a++'.\n\nkare", "e's an 'a++'.\n\nkaren", ' a solid b+. althoug']
{: [' kidding. {kevin tak', 'porpoise. {erin argu']
@: ['en] packer@dundermif', ' packaging@dundermif', ' mac.com, @ their we']
�: ['ang tao, j�rg ro�kop', 'o, j�rg ro�kopf, and', ': de class�.\nmichael']
_: ['ing out "s_an_ey is ', ' out "s_an_ey is che', 'ey is chea_in_ _n _e', 'is chea_in_ _n _eri]', ' chea_in_ _n _eri] t']
}: ['bumps toby}\ndwight: ', 'one in cab} after dw']
=: ['o smart. e=mc... squ']
’: ['h, for god’s sake. [', 'aces. that’s it. som', 'or you don’t. and i', ' and i don’t. but i', ' and i don’t really ']
…: ['california… for the ', 'd he chose…\n\nandy: [', 'h hands] i… it’s unb', '.\ndwight: … 2, 3! [p', 'f the list… attack!\n']
“: ['told him, “i need a ', 'en i say, “and shove', 'from jim. “this is g', 'meredith: “suck it l', 'was like, “who’s tha']
”: [' enforcer.” smart, r', 'your butt.” it’s stu', 'ain later.”\npam: [ev', 'it losers.”\n\nryan: o', ' like her.” now i’m ']
‘: ['k, it’s a ‘little pr', 'elly] and ‘big pregs', 'ff table] ‘specially', 'who in an ‘alive’ si', 't’s-\npam: ‘scuse me?']
ñ: ['ame is. señor loaden', 'ughing] señor loaden', ' called señor loaden']
–: ['reminders – no burpi', 'you asked – connecti', 'chapter 2 – announci', 'chapter 4 – one of t', 'chapter 9 – the tabl']
ü: ['rine and güiro]\ndarr', '[removes güiro and b', '. [plays güiro and s', ' playing güiro] fish']
é: ['elve clichés every t', ' her fiancé ravi was', 's ex-fiancé’s weddin', 's ex-fiancé.\npam: [e', 'y ex-fiancé.\npam: [s']
—: ['0 children—\npam: kay', 'rk and, um—\npete: pe', 'k: is this—is this l', 't a glance—\ndwight: ', 'ait, sales—what sale']
# See longer strings for non alphanumeric characters
for c in char_counts:
    if not c.isalnum():
        print(f"{c}: {sample_strings(c, 40)}")
:: [' at the library?\njim: oh, i told you. i ', "se it. so...\nmichael: so you've come to ", 'ng, grasshopper?\njim: actually, you call', 'e, but yeah.\nmichael: all right. well, l', " it's done.\n\nmichael: [on the phone] yes"]
: ['im. your quarterlies look very good. how', 'our quarterlies look very good. how are ', 'uarterlies look very good. how are thing', 'lies look very good. how are things at t', ' look very good. how are things at the l']
.: ['rlies look very good. how are things at ', "\njim: oh, i told you. i couldn't close i", " i couldn't close it. so...\nmichael: so ", "ouldn't close it. so...\nmichael: so you'", "uldn't close it. so...\nmichael: so you'v"]
?: ['hings at the library?\njim: oh, i told yo', " master for guidance? is this what you'r", ' saying, grasshopper?\njim: actually, you', " forever. right, pam?\npam: well. i don't", '. [growls]\npam: what?\nmichael: any messa']
: ['ings at the library?\njim: oh, i told you', "dn't close it. so...\nmichael: so you've ", 'saying, grasshopper?\njim: actually, you ', 'e in here, but yeah.\nmichael: all right.', "w you how it's done.\n\nmichael: [on the p"]
,: ['the library?\njim: oh, i told you. i coul', "s what you're saying, grasshopper?\njim: ", 'opper?\njim: actually, you called me in h', 'ou called me in here, but yeah.\nmichael:', 'ael: all right. well, let me show you ho']
': ["i told you. i couldn't close it. so...\nm", "o...\nmichael: so you've come to the mast", "ce? is this what you're saying, grasshop", "t me show you how it's done.\n\nmichael: [", "on the phone] yes, i'd like to speak to "]
[: ["t's done.\n\nmichael: [on the phone] yes, ", 'u manager-a-manger. [quick cut scene] al', ' sorry. my mistake. [hangs up] that was ', 'bly a smoker, so... [clears throat] so t', 'ouple of years ago. [growls]\npam: what?\n']
]: ["chael: [on the phone] yes, i'd like to s", 'er. [quick cut scene] all right. done de', 'y mistake. [hangs up] that was a woman i', "so... [clears throat] so that's the way ", 'f years ago. [growls]\npam: what?\nmichael']
-: [' talk to you manager-a-manger. [quick cu', 'alk to you manager-a-manger. [quick cut ', 'onist, pam. pam! pam-pam! pam beesly. pa', 'd of going... ringie-dingie-ding!\njan: i', "ing... ringie-dingie-ding!\njan: i've spo"]
!: ['ceptionist, pam. pam! pam-pam! pam beesl', 't, pam. pam! pam-pam! pam beesly. pam ha', 't a fax.\nmichael: oh! pam, this is from ', 'he wastepaper basket! look at that! look', 'basket! look at that! look at that face.']
": ['best boss. they go, "god we\'ve never wor', 'e. you\'re hilarious." "and you get the b', ' you\'re hilarious." "and you get the bes', ' the best out of us." [shows the camera ', 'me to me, and said, "mr. scott, would yo']
$: ['280z. bought it for $1,200. fixed it up.', "o vision, there's a $1,200 deductible.\n\n", "oing to give you... $25.\noscar: that's..", "hew's... walkathon. $25.\noscar: per mile", 'arol: you will lose $7,000 if you walk a']
%: ['reased profits by 17% or when i cut expe', 'e back it says, "100% post-consumer cont', 'all ends up being 25% of my commission f', '. we believe that 99% of the problems in', 'dential?\ndwight: 100%.\npam: did you just']
;: [", sorry. don't laugh; please don't laugh", " try and hurt mozart; you're going to ge", "r. devon is terrible; no one's gonna mis", 'ould change his mind; he thought that i ', 'pid.\njim: wait, wait; one thing. uh, by ']
): ['el: [ignores phyllis)] um... yeah. who e', 'ted...\nsadiq (it guy): that just means y', " oh...\nsadiq (it guy): what's your passw", "puter]\nsadiq (it guy): oh, it's 1-2-3.\nm", 'earch?\nsadiq (it guy): by keyword phrase']
&: ['g.\npam: abercrombie & fitch?\nmichael: uh', "o can put the most m&m's in their mouth?", ": [empties bowl of m&m's into his mouth]", 'led anybody.\n\nkevin & oscar: one, two, t', ".. a party pack of m&m's, his favorite c"]
#: ['hone ring is "mambo #5." so...\npam: [lau', "have been michael's #2 guy for about 5 y", 'efrigeration worker #1: hey, you wanna s', 'efrigeration worker #2: ass, ass, ass...', 'efrigeration worker #2: ...ass, ass, ass']
/: [' gives her an amused/appreciative grin]\n', "you've had too much / in this life.\njim:", 'stanley: hmmmpt.\npam/jim: [in unison] i ', '"creed -shifty eyes / ryan - dilated pup', 'an - dilated pupils / kelly - hyperactiv']
(: ['-protected...\nsadiq (it guy): that just ', "ichael: oh...\nsadiq (it guy): what's you", " on computer]\nsadiq (it guy): oh, it's 1", 'o you search?\nsadiq (it guy): by keyword', 'and "funny". [sadiq (it guy) types; resu']
*: [' i watch, queer as f***, so...\njan: that', "i watch, queer as f***, so...\njan: that'", " watch, queer as f***, so...\njan: that's", 'ichael: this is bull****!\n\nmichael: me w', 'chael: this is bull****!\n\nmichael: me wa']
+: ["r be an 'a' or an 'a+' but i completely ", "t that there's an 'a++'.\n\nkaren: [record", " that there's an 'a++'.\n\nkaren: [recordi", 'at lecture a solid b+. although, for the']
{: [' got to be kidding. {kevin takes bite of', 'ooth as a porpoise. {erin argues]\npete: ']
@: ['puter screen] packer@dundermifflin.com. ', 'fflin.com. packaging@dundermifflin.com. ', 's are just mac.com, @ their website, wha']
�: ['waldner, wang tao, j�rg ro�kopf, and of ', 'r, wang tao, j�rg ro�kopf, and of course', 't all.\njim: de class�.\nmichael: french. ']
_: ['ame, spelling out "s_an_ey is chea_in_ _', ', spelling out "s_an_ey is chea_in_ _n _', 'out "s_an_ey is chea_in_ _n _eri] that a', ' "s_an_ey is chea_in_ _n _eri] that and ', 's_an_ey is chea_in_ _n _eri] that and th']
}: ['er. [fist bumps toby}\ndwight: so long da', '\ntoby: [alone in cab} after dwight fired']
=: ["ic? 'i'm so smart. e=mc... squared. i dr"]
’: ['\n\noscar: oh, for god’s sake. [notices er', 'n weird places. that’s it. sometimes you', 'er get it or you don’t. and i don’t. b', 'ou don’t. and i don’t. but i am so exc', 'r own job. and i don’t really know how s']
…: ['on robert california… for the manager po', 'fill. and he chose…\n\nandy: [drumroll w', 'umroll with hands] i… it’s unbelievable.', '. yeah. oh.\ndwight: … 2, 3! [pulls phyll', 'eft side of the list… attack!\njim: wait,']
“: ['lk. and i told him, “i need a really str', 'ion and then i say, “and shove it up you', ' oh, text from jim. “this is getting ver', 'om kevin.\nmeredith: “suck it losers.”\n\nr', 'ough here was like, “who’s that receptio']
”: ['u to be my enforcer.” smart, right?\nkell', 'ove it up your butt.” it’s stupid, but i', ' will explain later.”\npam: [everyone’s p', 'th: “suck it losers.”\n\nryan: okay, not t', 'tionist? i like her.” now i’m just a fat']
‘: ['ngela: look, it’s a ‘little pregs’ [poin', 's to her belly] and ‘big pregs’ [points ', 'cks toby off table] ‘specially with me a', 'would eat who in an ‘alive’ situation. n', '’t think it’s-\npam: ‘scuse me?\ndwight: s']
–: ['er simple reminders – no burping, no slu', " i'm glad you asked – connecticut casual", 'rd!\n\njim: chapter 2 – announcing guests ', "lf.\n\njim: chapter 4 – one of the host's ", 'er.\n\njim: chapter 9 – the tableau vivant']
—: ['rade of 500 children—\npam: kay, well, yo', '. it’s clark and, um—\npete: pete!\nandy: ', ' man?\nclark: is this—is this like code f', 'ust give it a glance—\ndwight: ok\nclark: ', '\ndwight: wait, sales—what sales meeting?']

Looking at the above text, some of the characters like \n appear between words, but some of them like ' appear as part of a word. I am going to leave the ones within words as is but treat the others as separate words, so the model doesn’t consider the jim in \njim different from just jim. I am also going to treat all digits as the same.

# consider these as words
consider_words = ''.join(c for c in char_counts if not c.isalnum())
print(consider_words)
: .?
,'[]-!"$%;)&#/(*+{@�_}=’…“”‘–—

Looking at the symbols more closely, not many of them appear within words, so I am just going to treat all of them as separate words.

numbers = '0123456789'
def replace_numbers(text):
    for n in numbers:
        text = text.replace(n, "0")
    return text
text = replace_numbers(text)
consider_words += '0' # consider 0 also a word
print(consider_words)
: .?
,'[]-!"$%;)&#/(*+{@�_}=’…“”‘–—0
def split_into_words(text, consider_words):
    # Split text into words - the characters above are also considered words
    text = text.replace(' ', ' | ')   # pick a char not in the above list
    text = text.replace('\n', ' | ')  # pick a char not in the above list
    for char in consider_words:
        text = text.replace(char, f" {char} ")  # to split on spaces to get char
    words_with_pipe = text.split()
    words = [word if word != '|' else ' ' for word in words_with_pipe]
    return words
words = split_into_words(text, consider_words)
print(words[:50])
['michael', ':', ' ', 'all', ' ', 'right', ' ', 'jim', '.', ' ', 'your', ' ', 'quarterlies', ' ', 'look', ' ', 'very', ' ', 'good', '.', ' ', 'how', ' ', 'are', ' ', 'things', ' ', 'at', ' ', 'the', ' ', 'library', '?', ' ', 'jim', ':', ' ', 'oh', ',', ' ', 'i', ' ', 'told', ' ', 'you', '.', ' ', 'i', ' ', 'couldn']
# Length of extracted word sequences
maxlen = 20
# We sample a new sequence every `step` words
step = 3
def setup_inputs(words, maxlen, step):
    try:
        # This holds our extracted sequences
        sentences = []
        # This holds the targets (the follow-up words)
        next_words = []
        for i in range(0, len(words) - maxlen, step):
            sentences.append(words[i: i + maxlen])
            next_words.append(words[i + maxlen])
        print('Number of sequences:', len(sentences))
        # List of unique words in the corpus
        unique_words = sorted(list(set(words)))
        print('Unique words:', len(unique_words))
        # Dictionary mapping unique words to their index in `unique_words`
        word_indices = dict((word, unique_words.index(word)) for word in unique_words)
        # Next, one-hot encode the words into binary arrays
        print('Vectorization...')
        x = np.zeros((len(sentences), maxlen, len(unique_words)), dtype=np.bool)
        y = np.zeros((len(sentences), len(unique_words)), dtype=np.bool)
        for i, sentence in enumerate(sentences):
            for t, word in enumerate(sentence):
                x[i, t, word_indices[word]] = 1
            y[i, word_indices[next_words[i]]] = 1
        return x, y, unique_words, word_indices
    except MemoryError as e:
        print(e)
# Commenting out to avoid MemoryError
# Tried catching it but didn't seem to work
# x, y, unique_words, word_indices = setup_inputs(words, maxlen, step)

Reducing data

Since the above was throwing a MemoryError, I tried reducing the data by keeping only the lines of a single actor. Using Michael’s lines caused the same issue again, so I tried the lines for Phyllis.

text = get_text()
selected_actor = "phyllis"
def get_selected_lines(text, selected_actor):
    lines = text.split("\n")
    return "\n".join(line for line in lines if line.startswith(f"{selected_actor}:"))
text = get_selected_lines(text, selected_actor)
print(text[:200])
phyllis: so what does downsizing actually mean?
phyllis: what?
phyllis: well, uh, for decorations, maybe we could... it's stupid, forget it.
phyllis: i was just going to say, maybe we could have strea
text = replace_numbers(text)
words = split_into_words(text, consider_words)
x, y, unique_words, word_indices = setup_inputs(words, maxlen, step)
Number of sequences: 8328
Unique words: 1813
Vectorization...
from keras import layers
def build_model(maxlen, num_unique_words):
    model = keras.models.Sequential()
    model.add(layers.LSTM(128, input_shape=(maxlen, num_unique_words)))
    model.add(layers.Dense(num_unique_words, activation='softmax'))
    optimizer = keras.optimizers.RMSprop(lr=0.01)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)
    return model
model = build_model(maxlen, len(unique_words))
def sample(preds, temperature=1.0):
    # Reweight the predicted distribution and sample the next word index;
    # lower temperatures make the sampling more conservative
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
import random
import sys
def train_model(text, words, unique_words, word_indices, max_epoch, script_file, model_file=""):
    with open(script_file, "wt") as f:
        f.write("")  # Just to create/overwrite the file
    for epoch in range(1, max_epoch):
        with open(script_file, "at") as f:
            f.write(f'\n\nepoch {epoch}\n\n')
        # Fit the model for 1 epoch on the available training data
        model.fit(x, y,
                  batch_size=128,
                  epochs=1)
        # Select a text seed at random
        start_index = random.randint(0, len(words) - maxlen - 1)
        generated_text = words[start_index: start_index + maxlen]
        with open(script_file, "at") as f:
            f.write('--- Generating with seed: "' + ''.join(generated_text) + '"\n')
        with open(script_file, "at") as f:
            for temperature in [0.2, 0.5, 1.0, 1.2]:
                f.write('\n--- temperature: ' + str(temperature) + "\n")
                f.write(''.join(generated_text))
                for i in range(200):
                    sampled = np.zeros((1, maxlen, len(unique_words)))
                    for t, word in enumerate(generated_text):
                        sampled[0, t, word_indices[word]] = 1.
                    preds = model.predict(sampled, verbose=0)[0]
                    next_index = sample(preds, temperature)
                    next_word = unique_words[next_index]
                    generated_text.append(next_word)
                    generated_text = generated_text[1:]
                    f.write(next_word)
        if model_file:
            model.save(model_file)
train_model(text, words, unique_words, word_indices, 100, "phyllis_script.txt")

Using only the most common words

Since I wanted to learn lines from all the actors, I reduced the data by taking the 2000 most common words and keeping only the lines made up solely of these words.

text = get_text()
text = replace_numbers(text)
words = split_into_words(text, consider_words)
words_counter = Counter(words)
print(len(words_counter))
# Display just 200 on the blog post
# 2000th most common word occurred 25 times
print(words_counter.most_common(200))
20795
[(' ', 753327), ('.', 100099), (':', 60428), (',', 47960), ("'", 33029), ('i', 29843), ('you', 24675), ('?', 18735), ('the', 17982), ('to', 16773), ('a', 15401), ('michael', 15184), ('s', 14738), ('it', 13938), ('[', 12031), (']', 12021), ('and', 11393), ('that', 11344), ('!', 9991), ('dwight', 9905), ('jim', 8971), ('is', 8502), ('t', 8013), ('of', 7616), ('pam', 7208), ('in', 6985), ('what', 6608), ('-', 6245), ('no', 6047), ('we', 6032), ('this', 5906), ('on', 5283), ('andy', 5087), ('my', 5041), ('me', 5035), ('m', 4934), ('have', 4886), ('just', 4786), ('know', 4453), ('do', 4432), ('so', 4427), ('for', 4387), ('oh', 4340), ('not', 4332), ('don', 4071), ('are', 3965), ('re', 3696), ('be', 3612), ('was', 3608), ('he', 3554), ('your', 3490), ('can', 3484), ('0', 3453), ('with', 3433), ('like', 3381), ('all', 3309), ('yeah', 3237), ('’', 3209), ('okay', 2981), ('up', 2911), ('but', 2847), ('here', 2749), ('out', 2722), ('right', 2710), ('at', 2659), ('get', 2623), ('about', 2544), ('there', 2527), ('well', 2522), ('"', 2435), ('hey', 2431), ('go', 2418), ('kevin', 2367), ('angela', 2302), ('think', 2283), ('good', 2248), ('one', 2194), ('they', 2167), ('if', 2064), ('really', 2039), ('ryan', 2035), ('oscar', 2030), ('how', 2028), ('going', 1999), ('erin', 1969), ('she', 1903), ('want', 1862), ('yes', 1733), ('would', 1722), ('her', 1682), ('did', 1674), ('darryl', 1653), ('uh', 1627), ('ll', 1620), ('his', 1602), ('let', 1581), ('now', 1572), ('phyllis', 1569), ('gonna', 1561), ('who', 1534), ('ok', 1529), ('from', 1517), ('an', 1502), ('come', 1477), ('got', 1453), ('back', 1449), ('as', 1448), ('him', 1436), ('will', 1420), ('jan', 1403), ('am', 1399), ('toby', 1384), ('why', 1355), ('kelly', 1355), ('see', 1323), ('or', 1323), ('time', 1317), ('stanley', 1253), ('because', 1237), ('some', 1237), ('say', 1209), ('when', 1168), ('could', 1158), ('look', 1144), ('ve', 1142), ('need', 1137), ('then', 1133), ('great', 1099), ('people', 1076), ('um', 1066), ('thank', 1062), ('man', 1048), ('very', 1046), ('office', 1038), ('take', 1034), ('guys', 1029), ('should', 1006), ('little', 998), ('meredith', 988), ('been', 986), ('make', 974), ('down', 964), ('over', 953), ('had', 938), ('phone', 937), ('mean', 932), ('sorry', 926), ('our', 922), ('has', 920), ('tell', 917), ('way', 904), ('something', 902), ('god', 897), ('them', 895), ('us', 889), ('into', 873), ('were', 862), ('where', 860), ('didn', 853), ('day', 853), ('more', 852), ('thing', 850), ('holly', 844), ('work', 839), ('off', 837), ('love', 826), ('by', 821), ('guy', 796), ('d', 795), ('two', 783), ('everyone', 766), ('please', 759), ('doing', 757), ('too', 749), ('david', 734), ('said', 687), ('much', 686), ('call', 685), ('maybe', 675), ('new', 667), ('never', 665), ('creed', 663), ('lot', 655), ('nellie', 653), ('gabe', 647), ('give', 646), ('sure', 640), ('robert', 632), ('talk', 626), ('wait', 625), ('stop', 620), ('alright', 617), ('even', 617), ('any', 616), ('nice', 611), ('actually', 610), ('these', 582), ('put', 580), ('thought', 576), ('today', 564)]
top_words = []
for word, count in words_counter.most_common(2000):
    top_words.append(word)
print("Total number of top words: ", len(top_words))
def get_lines_with_words(top_words):
    selected_lines = []
    text = get_text()
    lines = text.split("\n")
    for line in lines:
        line = replace_numbers(line)
        words_in_line = split_into_words(line, consider_words)
        excluded_words = 0
        for word_in_line in words_in_line:
            if word_in_line not in top_words:
                excluded_words += 1
                break
        if not excluded_words:
            selected_lines.append(line)
    return selected_lines
selected_lines = get_lines_with_words(top_words)
print("Total number of selected lines: ", len(selected_lines))
print(selected_lines[:100])
Total number of top words: 2000
Total number of selected lines: 40366
["jim: oh, i told you. i couldn't close it. so...", 'jim: actually, you called me in here, but yeah.', "michael: all right. well, let me show you how it's done.", '', '', "pam: well. i don't know.", 'pam: what?', 'michael: any messages?', 'pam: uh, yeah. just a fax.', "pam: you haven't told me.", '', '', '', '', 'jim: nothing.', 'michael: ok. all right. see you later.', 'jim: all right. take care.', 'michael: back to work.', '', 'jan: [on her cell phone] just before lunch. that would be great.', '', '', "jan: what? i'm sorry?", "michael: really? i didn't... [looks at pam] did we get a fax this morning?", 'pam: uh, yeah, the one...', 'jan: do you want to look at mine?', 'michael: yeah, yeah. lovely. thank you.', 'michael: ok...', 'michael: no, no, no, no, this is good. this is good. this is fine. excellent.', 'michael: ok. no problem.', '', 'jan: go ahead.', "michael: oh, that's not appropriate.", "michael: uh, i don't know what you mean.", '', 'phyllis: so what does downsizing actually mean?', 'stanley: well...', '', '', "angela: i bet it's gonna be me. probably gonna be me.", "kevin: yeah, it'll be you.", '', 'pam: i have an important question for you.', 'jim: yes?', 'jim: yeah, stop. that is ridiculous.', '', '', 'michael: hey.', 'ryan: hey.', 'pam: this is mr. scott.', 'ryan: yup.', '', 'pam: dunder mifflin. this is pam.', '', 'dwight: what?', 'jim: what are you doing?', "jim: it's not on your desk.", "dwight: you can't do that.", 'jim: why not?', 'dwight: downsizing?', '', '', 'pam: you just still have these messages from yesterday.', 'pam: what?', "pam: don't we all?", "michael: i'm sorry?", 'pam: nothing.', '', '', '', "dwight: i'm assistant regional manager. i should know first.", 'michael: assistant to the regional manager.', "michael: i'm about to tell everybody. i'm just about to tell everybody.", "oscar: can't you just tell us.", "dwight: please, ok? do you want me to tell 'em?", "michael: you don't know what it is. [laughs]", 'dwight: go ahead.', '', '', 'michael: not gonna happen.', 'stanley: it could be out of your hands michael.', "michael: it won't be out of my hands stanley, ok. i promise you that.", 'stanley: can you promise that?', 'michael: no.', 'phyllis: what?', "stanley: it's just that we need to know.", 'michael: i know. hold on a second. i think pam wanted to say something. pam, you had a look that you wanted to ask a question just then.', 'man: are you sure about that?', 'dwight: pam, information is power.', "stanley: you can't say for sure whether it'll be us or them, can you?", '', '', '', 'michael: watch out for this guy. dwight schrute in the building. this is ryan, the new temp.', "ryan: what's up? nice to meet you.", 'dwight: dwight schrute, assistant regional manager.', '', 'dwight: damn it! jim!', 'pam: [laughing]', "dwight: that's real professional thanks. that's the third time and it wasn't funny the first two times either jim."]
selected_text = "\n".join(selected_lines)
selected_text = replace_numbers(selected_text)
selected_words = split_into_words(selected_text, consider_words)
x, y, unique_words, word_indices = setup_inputs(selected_words, maxlen, step)
Number of sequences: 189933
Unique words: 1992
Vectorization...
model = build_model(maxlen, len(unique_words))
train_model(selected_text, selected_words, unique_words, word_indices, 100, "generated_script.txt", "top_lines.h5")
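
The update linked at the top was generated by reloading this saved model. A minimal sketch of what that could look like, assuming selected_words, unique_words and word_indices are rebuilt exactly as above (this is not the exact code from the follow-up notebook):

# Sketch: reload the saved model and keep generating lines from a seed
saved_model = keras.models.load_model("top_lines.h5")
seed = selected_words[:maxlen]  # any maxlen-word window works as a seed
generated = list(seed)
for _ in range(200):
    sampled = np.zeros((1, maxlen, len(unique_words)))
    for t, word in enumerate(seed):
        sampled[0, t, word_indices[word]] = 1.
    preds = saved_model.predict(sampled, verbose=0)[0]
    next_word = unique_words[sample(preds, temperature=0.5)]
    generated.append(next_word)
    seed = seed[1:] + [next_word]
print(''.join(generated))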

Reflections

  • The output for lines from Phyllis is just Phyllis talking to herself all the time.

  • This output from the most common words seems more realistic, but I think it suffers from all the lines that were removed from the data, which interrupted the flow of the dialogs.

  • As the temperature increased, so did the randomness in the dialogs.

  • Getting longer stretches of dialog with no interruptions in the flow, using just the most commonly used words, should produce better results; a rough sketch of one way to do that follows.
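
One way to get those uninterrupted stretches (not something I tried in this notebook) would be to keep only runs of consecutive lines that all pass the top-words filter. A rough sketch, reusing the helpers above (min_run is a made-up parameter):

# Sketch: keep only runs of at least `min_run` consecutive lines that are all
# made up of the top words, so the dialogs used for training are not interrupted
def get_uninterrupted_runs(lines, top_words, min_run=4):
    runs, current = [], []
    for line in lines:
        words_in_line = split_into_words(replace_numbers(line), consider_words)
        if line and all(w in top_words for w in words_in_line):
            current.append(line)
        else:
            if len(current) >= min_run:
                runs.append("\n".join(current))
            current = []
    if len(current) >= min_run:
        runs.append("\n".join(current))
    return runs

The resulting runs could then be joined with newlines and fed through split_into_words and setup_inputs just like selected_text above.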

Resources

  1. Text generation with LSTM (notebook)