Wavenet for Poem Generation: preliminary results

For the past week, I’ve been running a port of the Wavenet algorithm to generate poems. A reasonable training result emerges in about 24 hours: a trained model that can generate immense amounts of text relatively quickly, on a laptop. (Code: github.) By reasonable I mean the poems do not have any real sense, no sentient self, no coherent narrative, no epic structure. But they do have cadence, they do not repeat, new words are plausible, and they have adopted the scattered open-line style characteristic of the late-twentieth-century corpus on which they were trained. Much more lucid than Schwitters’ Ursonate, the output is reminiscent of Beckett’s Not I: a ranting, incandescent, perpetual voice.


Remember, these are evolutionary amoebas, toddlers just learning to babble. The amazing thing is that without being given any syntax rules, they are speaking, generating a kind of prototypical glossolalia poem, character by character. Note: models are like wines, idiosyncratic reservoirs; the output of each has a distinct taste. Some have mastered open lines, others mutter densely, many mangle words into neologisms; each has its obsessions. The Wavenet algorithm is analogous to a winery: its processes ensure that all of the models are similar. Tensorflow is the local region; recursive neural nets form the ecosystem. The corpus is the grapes.

Intriguing vintage-models:

Dense, intricate Model 33380, trained with 1024 skip channels and dilations to 1024 (read a txt sample)

the mouth’s fruit
tiny from carrying
a generative cup

Loose, uncalibrated Model 13483 with loss = 0.456 (1.436 sec/step), trained on 2016-10-15T20-46-39 with 2048 skip channels and dilations to 256 (read a txt sample)

 at night, say, that direction.

      now. so you hear we are shaking
          from the woods

Full results (raw output, unedited txt files from the week of Oct 10-16, 2016) are here.

it’s there we brail,
  beautiful full
left to wish our autumn was floor

Edited micro poems

…extracted from the debris are here.

through lust,
and uptight winking cold
blood tree hairs
 in loss


Python source code, a few trained models, the corpus, and some sample txt are on github, which will be updated with new samples and code as they emerge.


The Model number refers to how many steps it trained for. Skip channels weave material from different contexts; on this corpus, larger skip channels produce more coherent output. Dilations refer to the dilation factors of the stacked convolution layers, e.g. [ 1, 2, 4, 8, 16, 32, 64, 128, 256, … ]; higher values up to 1024 seem to be of benefit, but take longer to train. Loss is the mathematical calculation of the distance between the goal and the model: a measure of how tightly the model fits the topological shape of the corpus. As models train, they are supposed to learn to minimize loss, and low loss is supposed to be good. For artistic purposes this is questionable (I describe why in the next section). For best results on this corpus, in general: 10k to 50k steps, dilations to 1024, skip channels of 512 or more, and (most crucial) loss less than 0.6.
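To make concrete why larger dilations see more context but cost more to train, the receptive field of a stack of dilated causal convolutions can be computed directly. A minimal sketch, assuming kernel size 2 (the usual default in WaveNet-style implementations); the helper name is mine, not part of the codebase:

```python
def receptive_field(dilations, kernel_size=2):
    """Number of past characters each prediction can condition on."""
    return (kernel_size - 1) * sum(dilations) + 1

# Dilations double per layer, and the whole pattern may be
# repeated in several stacks.
one_stack = [2 ** i for i in range(11)]   # [1, 2, 4, ..., 1024]
print(receptive_field(one_stack))         # one stack to 1024 sees ~2k characters
```

Doubling the maximum dilation roughly doubles both the visible context and the per-step compute, which is why the 1024-dilation runs are slower but more coherent.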


Loss is not everything. An early-iteration model with low loss will generate cruft with immense spelling errors; thousands of steps later, a model with the same loss value will usually produce more sophisticated variations with fewer errors. So there is more going on inside the system than is captured by the simple metric of loss optimization. Moreover, the system can undergo a catastrophic blowout of loss values, during which the loss ceases to descend along the gradient and instead oscillates exponentially (this occasionally occurs after approximately 60k steps). Text generated from models saved just before that point (with good loss values below 1.0, or even excellent loss values below 0.6) will contain some ok stuff interspersed with long stretches of nonsense or repeated **** symbols. These repetitive stretches are symptoms of the imminent collapse. So loss is not everything. Nonsense can be a muse. Mutating small elements, editing, flowing, falling across the suggestive force of words in raw tumult provides a viable medium for finding voice.
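The repeated-symbol symptom is easy to screen for mechanically. A small sketch (the function name and the threshold are mine, for illustration only):

```python
import re

def longest_run(text):
    """Length of the longest stretch of a single repeated character."""
    return max(len(m.group(0)) for m in re.finditer(r'(.)\1*', text))

sample = "it's there we brail, ******************** beautiful full"
# A very long single-character run flags output from a model
# drifting toward the blowout described above.
if longest_run(sample) > 10:
    print("suspect collapse")
```

In practice a scan like this only sorts the raw output; the nonsense itself, as noted above, can still be worth keeping.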


Contemporary neural nets do not in any way approach the complexity of biological processes. As many observers note, they are not anywhere near sentience or clever, articulate reasoning. Critics of the neural net hype are venerable and incisive:

Berwick, Robert C., and Noam Chomsky. Why Only Us. The MIT Press, 2016, pp. 50–51 (Kindle edition):
Yet we do not really know how this most foundational element of computation is implemented in the brain (Gallistel and King 2009). For example, one of the common proposals for implementing hierarchical structure processing in language is as a kind of recurrent neural network with an exponential decay to emulate a “pushdown stack” (Pulvermüller 2002). Unfortunately, simple bioenergetic calculations show that this is unlikely to be correct. As Gallistel observes, each action potential or “spike” requires the hydrolysis of 7 × 10⁸ ATP molecules (the basic molecular “battery” storage for living cells). Assuming one operation per spike, Gallistel estimates that it would take on the order of 10¹⁴ spikes per second to achieve the required data processing power. Now, we do spend lots of time thinking and reading books like this to make our blood boil, but probably not that much. Similar issues plague any method based on neural spike trains, including dynamical state approaches, difficulties that seem to have been often ignored (see Gallistel and King 2009 for details). Following the fashion of pinning names to key problems in the cognitive science of language, such as “Plato’s problem” and “Darwin’s problem,” we call this “Gallistel’s problem.”

John Cayley added the above emphasis when he sent this quotation in an email. So I recognize that neural net evangelism bugs some folks; it bugs them so much that they grow mildly incoherent. At first I thought maybe John had mistyped the phrase above, but that is actually how it is published. And yes, I agree, excessive unvalidated hype is annoying.

Yet human exceptionalism irritates me; it is vain and untenable given our scale within the multiverse. Certainly these neural nets are only tiny, subsidiary, rudimentary, crude prototypes of what might become blastocysts of silicon sentience, but they are potent, much more potent emulational processes than any computational tools seen before in history. I recognize the necessity for limits to utopian thinking. Yet I do feel that as this base architecture evolves in granularity, modularity, distribution, scale, multiplicity, and sophistication, its capacity to emulate poetic styles will emerge swiftly. They are astoundingly good mimics: mimics capable of replication without repetition. Thus, superb symbiotic catalysts to creativity. And I’ve always disagreed with Chomsky: it is Not Only Us.

More Backstory

On Sept 10th 2016, I sent an email to John Cayley, Chris Funkhouser, and Daniel Howe [screenshot]. I was pretty psyched. Cayley passed on the message to his list. Then 3 days later Jörg Piringer soberly observed [screenshot]. 6 days after that Chris Novello reported [screenshot].


TRAINING: a 256-dilation, 2048-skip-channel model that after almost 24k steps has reached a good loss value of 0.594. Since the loss is under 1.0, the model is saved. Later generation shows that the output contains stretches of repeats, which suggests the model is heading toward a catastrophic blowout into incoherence.
GENERATING: the code highlighted in the centre shows how to run a generation of 6000 characters with the json parameters file, in this case using 2048 skip channels and 256 dilations (I name the files with key parameter values). This generation runs in about 20 seconds on a MacBook Air. The results are feasible, yet because there are only 15k steps, the coherence is low. Because the skip channel values are set relatively high and loss in this case was under 0.4, there are some good conjunctions that glisten like ore before processing.
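For reference, a parameters file for a run like this might be generated as follows. This is a hypothetical sketch: the field names assume the json schema used by ibab/tensorflow-wavenet (the codebase this port derives from), and the filename mirrors my habit of naming files after key parameter values; check the repo’s wavenet_params.json for the authoritative schema.

```python
import json

# Hypothetical parameter file for a 2048-skip-channel, 256-dilation run.
# Field names are assumptions based on the ibab/tensorflow-wavenet schema.
params = {
    "dilations": [2 ** i for i in range(9)] * 4,  # 1..256, four stacks
    "skip_channels": 2048,
    "filter_width": 2,
    "residual_channels": 32,
    "dilation_channels": 32,
    "quantization_channels": 256,
    "use_biases": True,
}

with open("wavenet_params_2048skip_256dil.json", "w") as f:
    json.dump(params, f, indent=2)
```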


On a 2013 MacBook Air, training on a corpus of 11k poems (all lower-case letters, with most extraneous characters and punctuation removed; this is small data) with default parameters, a training step occurs in under half a second; with dilations up to 1024 and skip channels set to 1024, each step takes about 1.5 seconds.
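Those step timings put the “about 24 hours” figure within reach of back-of-the-envelope arithmetic:

```python
# Large model: ~1.5 s/step on a 2013 MacBook Air.
sec_per_step = 1.5
steps = 50_000                       # upper end of the useful 10k-50k range
hours = sec_per_step * steps / 3600
print(f"{hours:.1f} hours")          # roughly a day of laptop time
```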


This work was completed as a component of the final phase of coursework for Creative Applications Of Deep Learning With Tensorflow on Kadenze. I thank Parag Mital for his teaching and much generous helpful feedback on the forums.