
T-SNE Animator

Same Data, Same Code (Different Parameters)

Jhave (2016)
Python code, t-SNE algorithm, 6689 poems
Cloud support by Karteek Addanki

Information visualizations normally change as the data changes. In this demo, the data (6689 poems) stays the same, but the visualizations change as the parameters change.

HD video generated from a Python script.
Project Code on Github: https://github.com/jhave/TSNE-animator
Exhibited as part of a pop-up exhibit for the Digital Asia Hub forum on AI in Asia (21/11/2016) at the Maritime Museum, Hong Kong.
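In outline, the loop is simple: hold the corpus and the code fixed and vary one t-SNE parameter per frame. A minimal sketch of that idea (not the actual project code; the feature file and output paths are hypothetical), assuming the poems have already been converted to a numeric feature matrix:

import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical precomputed features: one row per poem.
X = np.load("poem_features.npy")
os.makedirs("frames", exist_ok=True)

# Same data, same code: only the perplexity changes per frame.
for frame, perplexity in enumerate(range(5, 100, 5)):
    coords = TSNE(n_components=2, perplexity=perplexity,
                  random_state=0).fit_transform(X)
    plt.figure(figsize=(8, 8))
    plt.scatter(coords[:, 0], coords[:, 1], s=2)
    plt.title("perplexity = %d" % perplexity)
    plt.axis("off")
    plt.savefig("frames/tsne_%04d.png" % frame, dpi=150)
    plt.close()
# The saved frames can then be assembled into an HD video (e.g. with ffmpeg).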


t-SNE: Classification of 10,557 poems

Once again: there is much magic in the math. The era of numeration discloses a field of stippled language. Songlines, meridians, tectonics, the soft-shelled crab, a manta ray, a flock of starlings.

In the image below, each dot is a poem. Its position is calculated by an algorithm called t-SNE (t-distributed Stochastic Neighbor Embedding).

[Image: t-SNE map of the poem corpus; each dot is a poem]

The image above is beautiful, but it’s impossible to know what is actually going on. So I built an interactive version (it’s a bit slow, but it functions…) where rolling over a dot reveals all the poems by that author.

Screengrabs (below) of the patterns suggest that poets do have characteristic forms discernible by algorithms. Position is far from random. Note that the algorithm was never told the author of any poem; it was fed only the poems themselves, the equivalent of a blind taste test.

Still, these images don’t tell us much about the poems themselves, except that they exist in communities. That the core of poetry is a spine. That some poets migrate across styles, while others define themselves by a style. The real insights will emerge as algorithms like t-SNE are applied to larger corpora and allow nuanced investigation of the features extracted: on what criteria exactly did the probabilities grow? What are the 2 or 3 core dimensions?

What is t-SNE?

My very basic non-math-poet comprehension of how it works: t-SNE performs dimensionality reduction: it reduces the number of parameters (dimensions) considered. Dimensionality reduction is useful when visualizing data; think about trying to graph 20 different parameters (dimensions). Another technique that does this is PCA: principal component analysis. Dimensionality reduction is in a sense a distillation process; it simplifies. In this case, it converts ‘pairwise similarities’ between poems into probability distributions. Then it uses gradient descent to minimize the (mysterious) Kullback-Leibler divergence between the high-dimensional and low-dimensional distributions.
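A minimal sketch of that distillation using sklearn, assuming the poems sit in a plain-text file (one per blank-line-separated block) and are vectorized with simple word counts; the file name and feature choices are illustrative, not the actual pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# Hypothetical corpus file: poems separated by blank lines.
poems = open("poems.txt").read().split("\n\n")

# Each poem becomes a vector of word counts: many dimensions.
X = CountVectorizer(max_features=1000).fit_transform(poems).toarray()

# t-SNE distills those dimensions down to 2; each poem becomes an
# (x, y) point, and nearby points are poems the algorithm judged similar.
coords = TSNE(n_components=2, perplexity=30).fit_transform(X)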

To learn more about the Python version of t-SNE bundled into sklearn, read Alexander Fabisch.

One of the few parameters I bothered tweaking over numerous runs is the (appropriately named) perplexity. In the FAQ, LJP van der Maaten (who created t-SNE) wrote:

 What is perplexity anyway?

Perplexity is a measure for information that is defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with k sides is equal to k. In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors k that is employed in many manifold learners.
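To make that concrete, a quick sketch of my own (not from the FAQ) that computes perplexity as 2 to the power of the Shannon entropy:

import numpy as np

def perplexity(p):
    # Perplexity = 2 ** (Shannon entropy), with entropy measured in bits.
    p = np.asarray(p, dtype=float)
    return 2 ** -np.sum(p * np.log2(p))

print(perplexity([1/6] * 6))           # fair 6-sided die -> 6.0
print(perplexity([0.9] + [0.02] * 5))  # loaded die -> ~1.63 "effective" sides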

A few rudimentary visuals of the Poetry Foundation corpus (preliminary, buggy results)

Word counting is the ‘Hello World’ of big data. And my data is relatively tiny.

Below are 25 images: 5 increasingly small time scales for each of 5 different variables (line length, verse length, average word length, # of words per poem, # of verses per poem), derived from an analysis of a corpus of 10.5k poems scraped from poetryfoundation.org.
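The counting behind these variables is rudimentary; a sketch of the kind of per-poem statistics involved (my own illustration, with a hypothetical helper, not the actual analysis script):

# Assumes each poem is a string with verses (stanzas) separated by blank lines.
def poem_stats(poem):
    verses = [v for v in poem.split("\n\n") if v.strip()]
    lines = [l for l in poem.splitlines() if l.strip()]
    words = poem.split()
    return {
        "line length": sum(len(l) for l in lines) / max(len(lines), 1),
        "verse length": sum(len(v.splitlines()) for v in verses) / max(len(verses), 1),
        "avg word length": sum(len(w) for w in words) / max(len(words), 1),
        "# of words": len(words),
        "# of verses": len(verses),
    }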

[Plot: # of LINES]

