Category: info-viz

THAT ETERNAL HEAVY COLOR OF SOFT

The title of this post is the title of the first poem generated by a re-installed Wavenet for Poetry Generation (still using TensorFlow 0.10, but now on Python 3.5), working on an expanded corpus (approx. 600,000 lines of poetry). The first model was Model: 7365  |  Loss: 0.641  |  Trained on: 2017-05-27T14-19-11 (full txt here).

Wavenet is a rowdy poet, jolting neologisms, careening rhythms, petulant insolence, even the occasional glaring politically-incorrect genetic smut. Tangents codified into contingent unstable incoherence. 

Compared to PyTorch, which aspires to a refined smooth reservoir of cadenced candy, Wavenet is a drunken street brawler: rude, slurring blursh meru crosm nanotagonisms across rumpled insovite starpets.

PyTorch is Wallace Stevens; Wavenet is Bukowski (if he’d been born a mathematician neologist).

Here’s a random poem:

____________________________________________
Model: 118740  |  Loss: 0.641  |  2017-05-28T00-35-33

Eyes calm, nor or something cases.

from a wall coat hardware as it times, a fuckermarket
in my meat by the heart, earth signs: a pupil, breaths &

stops children, pretended. But they were.

Case study: Folder 2017-05-28T12-16-50 contains 171 models (each saved because its loss was under the 0.7 threshold). But what does loss really mean? In principle it is a measurement of the gap between the generated text and the validation text (how close is it?); yet however many different schemas proliferate, loss (like pain) cannot be measured by instrumentality.
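For a character-predicting network, loss is commonly a cross-entropy: the average negative log-probability the model assigns to the symbol that actually comes next in the reference text. Here is a minimal sketch under that assumption (the actual training script may compute or average it differently). The toy predictions, the function name and the variable names are hypothetical; only the 0.7 threshold comes from the post.

```python
import math

def cross_entropy(predicted_probs, target_indices):
    """Average negative log-probability given to the true next symbols."""
    total = 0.0
    for probs, target in zip(predicted_probs, target_indices):
        total += -math.log(probs[target])
    return total / len(target_indices)

# Hypothetical toy data: 3 prediction steps over a 3-symbol vocabulary.
predictions = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
]
targets = [0, 1, 2]  # index of the symbol that really came next

LOSS_THRESHOLD = 0.7  # the save threshold mentioned in the post

loss = cross_entropy(predictions, targets)
should_save = loss < LOSS_THRESHOLD
print("loss=%.3f save=%s" % (loss, should_save))
```

A lower number only says the model was less surprised by the reference text; it says nothing about whether the poem is any good.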

Here’s another random poem:

____________________________________________
Model: 93286  |  Loss: 0.355  |   2017-05-28T12-16-50

would destroying through the horizon. The poor
Sequel creation rose sky.

So we do not how you bastard, grew,
there is no populously, despite bet.
Trees me that he went back
on tune parts.

I will set
a girl of sunsets in the glass,

and no one even on the floral came

Training

I’m slowly learning the hard way to wrap each install in its own virtual environment. Without that, as upgrades happen, code splinters and breaks, leaking a fine luminous goo of errors.

The current virtual environment was created using

$ sudo apt-get install python3-pip python3-dev python-virtualenv
$ virtualenv -p python3.5 ~/tf0p1-py3.5-wvnt

After that, I followed the install instructions:

$ TF_BINARY_URL=https://storage.googleapis.com/tensorflow/mac/gpu/tensorflow-0.10.0-py3-none-any.whl
$ pip3 install --upgrade $TF_BINARY_URL

 

Then I got snarled in a long, terrible struggle with log messages cluttering the output, resolved by inserting:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # hide INFO and WARNING log lines; set before importing tensorflow
# inserted into generate_Poems_2017-wsuppressed.py

And to train on Ubuntu, using the lovely Nvidia Titan X GPU so generously donated by Nvidia under their academic grant program:

$ cd Documents/Github/Wavenet-for-Poem-Generation/
$ source ~/tf0p1-py3.5-wvnt/bin/activate
(tf0p1-py3.5-wvnt)$ python train_2017_py3p5.py --data_dir=data/2017 --wavenet_params=wavenet_params_ORIG_dilations1024_skipChannels4096_qc1024_dc32.json

Text Files

tf0p1-py3.5-wvnt_jhave-Ubuntu_Screencast 2017-05-28 11:31:40_TrainedOn_2017-05-28T00-35-33
tf0p1-py3.5-wvnt_jhave-Ubuntu_2017-05-28 09:18:00_TRAINED_2017-05-28T00-35-33
tf0p1-py3.5-wvnt_jhave-Ubuntu_Screencast-2017-05-27 23:50:14_basedon_2017-05-27T14-19-11
tf0p1-py3.5-wvnt_jhave-Ubuntu_Screencast 2017-05-28 22:36:35_TrainedOn_2017-05-28T12-16-50.txt

T-SNE Animator

Same Data, Same Code (Different Parameters)

Jhave (2016)
Python code, T-SNE algorithm, 6689 poems
Cloud support by Karteek Addanki

Information visualizations normally change as the data changes. In this demo, the data (6689 poems) stays the same, but visualizations change as the parameters change.

HD video generated from Python script.
Project Code on Github: https://github.com/jhave/TSNE-animator
Exhibited as part of pop-up exhibit for Digital Asia Hub forum on AI in Asia (21/11/2016) at Maritime Museum, Hong Kong

 

t-SNE: Classification of 10,557 poems

Once again: there is much magic in the math. The era of numeration discloses a field of stippled language. Songlines, meridians, tectonics, the soft shelled crab, a manta ray, a flock of starlings.

In the image below, each dot is a poem. Its position is calculated by an algorithm called t-SNE (t-Distributed Stochastic Neighbour Embedding).

Screen Shot 2014-08-23 at 9.16.28 pm

The image above is beautiful, but it’s impossible to know what is actually going on. So I built an interactive version (it’s a bit slow, but, functions…) where rolling over a dot reveals all the poems by that author.

Screengrabs (below) of the patterns suggest that poets do have characteristic forms discernible by algorithms. Position is far from random. Note that the algorithm did not know the author of any of the poems; it was fed only the poems. This is the equivalent of a blind taste test.

Still, these images don’t tell us much about the poems themselves, except that they exist in communities. That the core of poetry is a spine. That some poets migrate across styles, while others define themselves by a style. The real insights will emerge as algorithms like t-SNE are applied to larger corpora, and allow nuanced investigation of the features extracted: on what criteria exactly did the probabilities grow? What are the 2 or 3 core dimensions?

What is t-SNE?

My very basic non-math-poet comprehension of how it works: t-SNE performs dimensionality reduction, meaning it reduces the number of parameters considered. Dimensionality reduction is useful when visualizing data; think about trying to graph 20 different parameters (dimensions) at once. Another technique that does this is PCA (principal component analysis). Dimensionality reduction is in a sense a distillation process: it simplifies. In this case, t-SNE converts ‘pairwise similarities’ between poems into probability distributions, once in the original high-dimensional space and once in the low-dimensional map. Then it uses gradient descent to shift the map points until the (mysterious) Kullback-Leibler divergence between the two distributions is as small as it can get.
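The two ingredients named above, similarities turned into probabilities and a Kullback-Leibler divergence to shrink, can be sketched in plain Python. This is a deliberately simplified illustration, not the sklearn implementation: it uses a Gaussian kernel with one fixed width in both spaces (real t-SNE tunes the width per point from the perplexity and uses a Student-t kernel in the map), it looks at a single poem's neighbourhood, and it compares two hand-made layouts instead of running gradient descent. The feature vectors and function names are hypothetical.

```python
import math

def conditional_probs(points, i, sigma=1.0):
    """Turn distances from point i into a probability distribution
    over the other points, using a Gaussian kernel of width sigma."""
    weights = {}
    for j, p in enumerate(points):
        if j == i:
            continue
        sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], p))
        weights[j] = math.exp(-sq_dist / (2 * sigma ** 2))
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}

def kl_divergence(p, q):
    """KL(P || Q) = sum_j p_j * log(p_j / q_j), measured in nats."""
    return sum(p[j] * math.log(p[j] / q[j]) for j in p if p[j] > 0)

# Hypothetical feature vectors for four poems (three features each).
high_dim = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2), (2.0, 2.0, 2.0), (2.1, 1.9, 2.0)]

# Two candidate 1-D maps: one keeps neighbours together, one scrambles them.
good_map = [(0.0,), (0.2,), (3.0,), (3.1,)]
bad_map = [(0.0,), (3.0,), (0.2,), (3.1,)]

p = conditional_probs(high_dim, 0)
kl_good = kl_divergence(p, conditional_probs(good_map, 0))
kl_bad = kl_divergence(p, conditional_probs(bad_map, 0))
print("good layout:", kl_good)
print("bad layout: ", kl_bad)
```

The layout that preserves who is near whom scores the lower divergence; gradient descent is the process of nudging the map points toward such a layout.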

To know more about the Python version of t-SNE bundled into sklearn, read Alexander Fabisch

One of the few parameters I bothered tweaking over numerous runs is the (appropriately named) perplexity. In the FAQ, LJP van der Maaten (who created t-SNE) wrote:

 What is perplexity anyway?

Perplexity is a measure for information that is defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with k sides is equal to k. In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors-k that is employed in many manifold learners.
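The definition quoted above is short enough to check directly. A small sketch, with hypothetical function names, that computes perplexity as 2 to the power of the Shannon entropy (in bits) and compares a fair six-sided die with a heavily loaded one:

```python
import math

def shannon_entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def perplexity(probs):
    """Perplexity = 2 ** (Shannon entropy in bits)."""
    return 2 ** shannon_entropy_bits(probs)

fair_die = [1 / 6] * 6
loaded_die = [0.9] + [0.02] * 5  # hypothetical die that almost always lands on one face

print("fair die:  ", perplexity(fair_die))
print("loaded die:", perplexity(loaded_die))
```

The fair die comes out at (very nearly) 6, matching the FAQ; the loaded die behaves like a die with far fewer effective sides, which is the sense in which perplexity counts "effective" neighbours.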

A few rudimentary visuals of Poetry Foundation corpus (preliminary buggy results)

Word counting is the ‘Hello World’ of big data. And my data is relatively tiny.

Below are 25 images in 5 increasingly small time scales for 5 different variables (line length, verse length, avg word length, # of words per poem, # of verses per poem) derived from an analysis of a corpus of 10.5k poems scraped from poetryfoundation.org.
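As a rough idea of how such variables could be computed for a single poem, here is a minimal sketch. It is not the code behind the plots; it assumes that verses are blocks separated by blank lines, that words are whitespace-delimited, that line length is counted in characters, and that verse length is counted in lines. The function name and dictionary keys are hypothetical.

```python
import re

def poem_stats(text):
    """Basic counts for one non-empty poem.

    Assumes verses are separated by blank lines and words by whitespace."""
    verses = [v for v in re.split(r"\n\s*\n", text.strip()) if v.strip()]
    lines = [ln for v in verses for ln in v.splitlines() if ln.strip()]
    words = [w for ln in lines for w in ln.split()]
    return {
        "num_verses": len(verses),
        "num_lines": len(lines),
        "num_words": len(words),
        "avg_line_length": sum(len(ln) for ln in lines) / len(lines),   # characters per line
        "avg_verse_length": len(lines) / len(verses),                   # lines per verse
        "avg_word_length": sum(len(w) for w in words) / len(words),     # characters per word
    }

sample = """I will set
a girl of sunsets in the glass,

and no one even on the floral came"""

stats = poem_stats(sample)
print(stats)
```

Punctuation is left attached to words here, so average word length runs slightly high; stripping it first would be a natural refinement.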

plot_# of LINES_0_2015
