FQA: Frequently Questioned Answers

BDP documents the process of exploring poetry generation using deep learning and feature extraction from large bodies of poetry.

Corpora 2017:

639,813 lines of primarily contemporary poetry

Corpora 2012-2016:

10,573 poems from PoetryFoundation.org

57,434 rap songs from Ohhla.com

4,702 pop lyrics


Poetry generation via machine learning (neural nets, auto-encoders, unsupervised learning).

Libraries: TensorFlow, Keras, PyTorch

Language: Python

Code on GitHub

PROCESS 2011-2014
How does the poetry generation work (in simple terms)?

Parsing phase:

All poems (as HTML pages) are downloaded using SiteSucker.
The HTML is parsed using Beautiful Soup and the poems are extracted into plain-text files.
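The extraction step looks roughly like the following sketch. It uses Python's built-in html.parser (the project itself uses Beautiful Soup) and assumes, hypothetically, that each poem sits in a `<div class="poem">` element; the real markup varies by site.

```python
from html.parser import HTMLParser

class PoemExtractor(HTMLParser):
    """Collect the text of every <div class="poem"> (hypothetical markup)."""
    def __init__(self):
        super().__init__()
        self.in_poem = False
        self.current = []
        self.poems = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "poem") in attrs:
            self.in_poem = True

    def handle_endtag(self, tag):
        if tag == "div" and self.in_poem:
            self.in_poem = False
            self.poems.append("".join(self.current).strip())
            self.current = []

    def handle_data(self, data):
        if self.in_poem:
            self.current.append(data)

extractor = PoemExtractor()
extractor.feed('<html><div class="poem">So much depends\nupon</div></html>')
print(extractor.poems[0])
```

With Beautiful Soup the same idea is a one-liner along the lines of `soup.select("div.poem")`, after which each match's text is written to its own file.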

Analysis phase:

All words in all poems are tagged for part-of-speech (POS) using NLTK (the Natural Language Toolkit).
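The output of this step is essentially an index from POS tag to words, which the generation phase can then draw on. A minimal sketch, using a few pre-tagged (word, tag) pairs to stand in for what `nltk.pos_tag` would produce over the full corpus:

```python
from collections import defaultdict

# Stand-in for NLTK output: in the real pipeline these pairs come from
# nltk.pos_tag() run over every poem.
tagged = [("the", "DT"), ("red", "JJ"), ("wheelbarrow", "NN"),
          ("glazed", "VBN"), ("rain", "NN")]

# Index every word by its POS tag.
pos_index = defaultdict(set)
for word, tag in tagged:
    pos_index[tag].add(word)

print(sorted(pos_index["NN"]))  # nouns seen so far: ['rain', 'wheelbarrow']
```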

All poems are sent to AlchemyAPI, an online deep-learning natural-language-processing service that identifies entities. “Named entities specify things such as persons, places and organizations. AlchemyAPI’s named entity extraction is capable of identifying people, companies, organizations, cities, geographic features and other typed entities”. These entities then form an archive.

All words that cannot be matched to a synonym in WordNet are put into a ‘reservoir’.

Generation phase:

Every entity is replaced with an entity from another poem.

Words that are neither entities nor prepositions are replaced with a word from their WordNet synset (a synonym, homonym, meronym, …). If the synset offers no replacement, the word is replaced with a random word from the ‘reservoir’.

Rudimentary correction of verb tenses is done using pattern.en.
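The substitution logic above can be sketched in a few lines of pure Python. Here a small dict stands in for the WordNet synset lookup, and the entity list, preposition set, and reservoir are invented examples; the real pipeline draws these from AlchemyAPI, NLTK, and the analysis phase.

```python
import random

random.seed(0)  # deterministic for the example

synonyms = {"cold": ["chill"], "sweet": ["sugary"]}  # stand-in for WordNet
other_poem_entities = ["Brooklyn"]   # entities harvested from another poem
reservoir = ["zeppelin", "marrow"]   # words WordNet could not match
entities = {"Iceland"}
prepositions = {"in", "of", "on", "from"}

def substitute(word):
    if word in entities:              # entity -> entity from another poem
        return random.choice(other_poem_entities)
    if word in prepositions:          # prepositions pass through unchanged
        return word
    if word in synonyms:              # WordNet-style synset replacement
        return random.choice(synonyms[word])
    return random.choice(reservoir)   # fall back to the reservoir

poem = ["cold", "plums", "from", "Iceland"]
result = [substitute(w) for w in poem]
print(result)
```

The tense-correction pass with pattern.en then runs over the substituted text to repair verbs that no longer agree with their context.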

How about the graphs?

About the only true data science in the project is the t-SNE analysis.

Visuals are generated either in Python using matplotlib or in the browser using D3.js.
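A minimal sketch of the t-SNE step, using scikit-learn's `TSNE` to project high-dimensional vectors down to 2-D. The 20 random "poem vectors" here are stand-in data; in the project the inputs would be features extracted from the corpus.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: 20 poems described by 10 numeric features each.
rng = np.random.default_rng(0)
poem_vectors = rng.random((20, 10))

# Project to 2-D; perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=5,
                 init="random", random_state=0).fit_transform(poem_vectors)
print(embedding.shape)  # (20, 2)
```

The resulting 2-D points can be scattered with matplotlib or exported as JSON and rendered with D3.js.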

Where is the code?

On GitHub

So why?

Big data is a big meme. Fluffy flarfy retroactive regurgitation of probabilistic entrails: sophic oracles updated with statistical analysis. Extroverted introversion. All answers concerning its validity for qualitative practice are questionable.

It might also seem that numbers obliterate the ambiguous juicy core of poetry. Yet, the techniques of big-data offer a chance to generate poetry from models of language that emerge at a scale previously unimaginable.

In the era of the printing press, poets (writers and intellectuals) aspired to be well-read. Some aspired to breadth, others to depth, yet all recognized the cognitive benefits of reading: through the osmosis of many words, patterns, processes and modes of communication became clear. Now that the amount written exceeds any one reader's capacity, a brain beyond the brain is needed to analyze and interpret it. Big data can digest the literary torrent.