Category: data science

Another 10k day

I’m beginning to understand the exultation of spam-lords, the rapturous power narcotic that arises from watching thousands of words of perhaps-dubious quality arise & spew in a rapid unreadable scrawl across a screen.

Beyond semantics, words like sperm procreate incessantly in abundant sementics. Quality in this inverted world is a quantity.

On the technical side: today, I fixed the repetition hatching; used pattern.en to correct articles (‘a’ vs. ‘an’) and to conjugate verb participles correctly (as in ‘I’m walking home…’); and created FAKE_authors (because who wants to read a poem written by a bot… unless it’s good, which these poems are not yet).
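
Roughly, the pattern.en calls look like this (a minimal sketch; the generator’s exact calls may differ):

from pattern.en import referenced, conjugate

# pick the right indefinite article from the word's sound
print(referenced('apple'))        # 'an apple'
print(referenced('book'))         # 'a book'

# force a verb into its present participle ('I'm walking home...')
print(conjugate('walk', 'part'))  # 'walking'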

It all took much longer than anticipated.

The poems are now output in hourly batches.

Here’s a weird sample:

Body The New Road: Clark
by Anthony Lazarus

Wait.

                                                 look.

                                     hold.

                        hold back. expect.
                look forward
                                                          kick one’s heels.
                                  kick one’s heels an i kick one’s heels.

                        kick one’s heels.

hold off.

                                                               look.
                                                          look to.
                                                         stand by.
kick one’s heels.



                                    NOW.

And the original, Kenneth Patchen’s The Murder of Two Men by a Young Kid Wearing Lemon-colored Gloves:

Wait.
                                                 Wait.
                                        Wait.
                        Wait. Wait.
                Wait.
                                                          Wait.
                                  W a i t.
                        Wait.
                                              Wait.
                                                               Wait.
                                    Wait.
                                                          Wait.
Wait.



                                    NOW.

Code on GitHub
Made by Glia.ca  


[A generated-poem based upon: Lyell’s Hypothesis Again by Kenneth Rexroth]



Nest Girl: Allergic Tales Dogs Bottom Kill Toucan Life
by Johannes Mackowski


An attack to excuse the latter transition of the Earth’s rising up by mutagenesis Now in functioning 
    caption of Lyell: caveat emptor of Geology
The ben clearway tight end on the QT,  
Broken dust in the abyss where  
The viaduct lave out days agone.

Hatching (trance bug poem set)

Another day, another 10k.

I received an email today from a friend, the poet Ian Hatcher; it included an MP3 of him reading a poem I had generated using code that was a bit broken.

Ian Hatcher reads “woodland_pattern” 

I sent this particular poem to Ian because it included a lot of repetition; Ian’s style involves repetition, productive repetitions, calibrated repetitions, sung repetitions, drone repetitions, blind repetitions, profound repetitions.

To my mind the repetition was a bug; from the perspective of efficient communication, a repeated word is an inefficient redundant symbol. In my mind, an ancient mantra chanted: effective text is succinct and to the point. But Hatcher’s work (among others) reminds me of the parallel/opposite tradition of trance ritual (incantation incantation incantation …) appropriated by post-modern poetics. Time does not matter: matter cycles.


Poetry belongs to both traditions (efficient condensation and redundant trance-inducing repetition). It is both a sustainment of efficient, affective communication in which redundancy is reduced, and a mode of being that invokes states of consciousness through rhythm and repetition.


Repetition is exactly what my code started churning out unexpectedly today, after I made an error in how I dealt with two-letter prepositions followed by punctuation. So I generated 10k+ poems in that style.
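
The broken code isn’t preserved in this post, but here is a purely hypothetical sketch of how that kind of duplication can slip in when punctuation is stripped and re-attached around short words:

import re

def replace_token(tok):
    # split a token into word + trailing punctuation
    m = re.match(r"^([A-Za-z']+)([.,;:!?]*)$", tok)
    if not m:
        return tok
    word, punct = m.groups()
    if len(word) == 2 and punct:
        # buggy branch: meant to pass the token through untouched,
        # but emits the bare word AND the original token
        return word + ' ' + tok
    return tok

print(' '.join(replace_token(t) for t in 'sow in, the sagas'.split()))
# -> 'sow in in, the sagas'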


A computer-generated stanza

predictably, aluminum business and USDA diverge diverge
diverge            and the deflagration catapult 
sow in the sagas van chicken farm carnival  carnival
carnival                 with shallot desktop and radicchio 

based on a template derived from D. A. Powell’s republic (2008)

soon, industry and agriculture converged
                        and the combustion engine
sowed the dirtclod truck farms green
                                  with onion tops and chicory

To read the 10,118 poems (laden with repetitive bug-trance cruft) generated in 6966.78432703 seconds on 2014-07-30 at 22:42, click here


Code on GitHub
Made by Glia.ca  


p.s. If you think trance is a thing of the past, consider the repetitive contemporary pop dance track just released by Airhead (also sent to me as a link in an email today). Incantation is secular.

Writing 10,118 poems in 5 hours

In the same way that climate change signals something irrevocable in the atmosphere, machine-learning constitutes an erosion of the known habitat of culture.

Today I wrote 12,000 poems. Most of them are crap. But if even 1% are halfway decent, that’s 120 poems.


Numbers aren’t everything. We love what we love. Quantification does not measure value. The quality of things defies statistics.

Yet, few can deny statistics (of climate change), scale (of Moore’s law), grandeur (of uni/multiverse), immensity (of blogosphere), complexity (of evolution), dexterity (of language), velocity (of singularity). Emergence.


Augmented humans have existed since the first horse was tamed, since fire was carried in coals slung in a goat’s bladder lined with moss and pebbles; since the first toilet was born in a rock’s gullet.

Run a car down with a bicycle. Chase a sprinter with an airplane. Make a nuclear bomb with matches. I dare you.


Welcome the cyber poet. Touch its silicon tongue, algorithm-rich, drenched in data. Obligatory obliteration.

Keep in mind, the results emerge from recipes. I am not the best chef in the world. This is mere crawling. But it is instinct which suggests a path, a motion and a blur over the meaning of creativity. Symbiosis.


B/IOs

Biographies of poets. Generated with code.

2,513 bios of poets scraped from PoetryFoundation.org were batch-analyzed by Alchemy API (an online text-mining engine) for entities (employment roles, organizations, people, locations, etc.), concepts, keywords, and relations (subject, action, object).

This analysis then guided word replacement and the generation of new bios using NLTK (Natural Language Toolkit) part-of-speech tagging.
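
A minimal sketch of that kind of POS-guided swap (the person pool and the swap-every-proper-noun policy are stand-ins, not the project’s exact logic):

import random
import nltk   # needs the 'punkt' tokenizer and POS-tagger data

def remix_bio(bio, person_pool):
    # person_pool: names harvested from the AlchemyAPI entity analysis
    out = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(bio)):
        if tag == 'NNP' and person_pool:
            out.append(random.choice(person_pool))   # swap proper nouns
        else:
            out.append(word)
    return ' '.join(out)

print(remix_bio('Emily Dickinson lived in Amherst.', ['Basho', 'Rumi']))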

Approx. 2,000+ BIOs are generated in each run.

Code on Github
Made by Glia.ca

Text analysis by AlchemyAPI

Prosody: using the CMUdict in NLTK

OK. Parsing. Prosody. Metre. Rhythm. It seems prehistoric in the age of free verse. But if poems are rhythm, with or without rhyme, then parsing lines into metrical feet seems one precondition on the path to accurately generating poems. Unfortunately, as far as I could tell, few folks have done it. A Google search returned a few academic papers and no code. There was one Stack Overflow question. So I wrote an email to Charles Hartman, who had written Virtual Muse, and who kindly replied: “I’ve been away from programming for quite a while. But by the end of this year Wiley-Blackwell will be publishing my textbook Verse: An Introduction to Prosody…” So I did it myself.


INPUT WORDS  >>> OUTPUT NUMBERS:  An Example

If by real you mean as real as a shark tooth stuck

‘1  1  1  1  1  1  1  1  0  1  1  1’

in your heel, the wetness of a finished lollipop stick,

‘0  1  1 *,* 0  1  0  1  0  1  0  1  0  2  1 *,*’

Aimee Nezhukumatathil, “Are All the Break-Ups in Your Poems Real?” http://www.poetryfoundation.org/poem/245516

## parseStressOfLine(line) 
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings 
#
# 'stress' in form '0101*,*110110'
#   --Note: 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
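
A minimal sketch of such a function, using NLTK’s copy of the CMU Pronouncing Dictionary (the tokenizing regex and the function-word list used to offset the bias are my assumptions, not the exact code):

import re
from nltk.corpus import cmudict   # needs: nltk.download('cmudict')

PRON = cmudict.dict()
# cmudict marks most monosyllables, even function words, as stressed ('1');
# treating a few common words as unstressed ('0') offsets that bias
UNSTRESSED = {'a', 'an', 'the', 'and', 'but', 'or', 'in', 'on', 'at', 'to'}

def parseStressOfLine(line):
    stress, stress_no_punct = '', ''
    for tok in re.findall(r"[A-Za-z']+|[^\sA-Za-z']", line):
        if not tok[0].isalpha():
            stress += '*' + tok + '*'               # punctuation, e.g. '*,*'
        elif tok.lower() in UNSTRESSED:
            stress += '0'
            stress_no_punct += '0'
        elif tok.lower() in PRON:
            # keep the stress digit (0/1/2) of each vowel phoneme
            s = ''.join(p[-1] for p in PRON[tok.lower()][0] if p[-1].isdigit())
            stress += s
            stress_no_punct += s
        else:
            stress += '*' + tok + '*'               # word not in cmudict
    return stress, stress_no_punct

print(parseStressOfLine('in your heel, the wetness of a finished lollipop stick,'))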


FQA: Frequently Questioned Answers

BDP documents the process of exploring poetry generation using deep learning and feature extraction from large bodies of poetry.

 

LANGUAGE IS A PATTERN

POETRY IS ENCRYPTION


Corpora:

10,573 poems from PoetryFoundation.org

57,434 rap songs from Ohhla.com

4,702 pop lyrics


CURRENT PROCESS

Machine learning (neural nets, auto-encoders, unsupervised learning) poetry generation.

ARCHIVE 2011-2014
How does the poetry generation work (in simple terms)?

Parsing phase:

All poems in HTML pages are downloaded using SiteSucker.
The HTML is parsed using Beautiful Soup and the poems are extracted into text files, roughly as in the sketch below.
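
A minimal sketch of that extraction step (the file layout and the CSS selector are assumptions; the real PoetryFoundation markup differs):

import glob, os
from bs4 import BeautifulSoup

os.makedirs('poems', exist_ok=True)
for path in glob.glob('pages/*.html'):
    with open(path, encoding='utf-8') as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    # hypothetical selector; inspect the downloaded pages for the real one
    for i, div in enumerate(soup.select('div.poem')):
        name = '%s_%d.txt' % (os.path.splitext(os.path.basename(path))[0], i)
        with open(os.path.join('poems', name), 'w', encoding='utf-8') as out:
            out.write(div.get_text('\n').strip())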

Analysis phase:

All words in all poems are analyzed using NLTK (Natural Language Toolkit) for POS (part-of-speech).

All poems are sent to an online deep-learning natural language processing API called Alchemy which identifies entities. “Named entities specify things such as persons, places and organizations. AlchemyAPI’s named entity extraction is capable of identifying people, companies, organizations, cities, geographic features and other typed entities”. These entities then form an archive.

All words that are not matched to a synonym in WordNet are put into a ‘reservoir’.

Generation phase:

Every entity is replaced with an entity from another poem.

Words that are not entities and not prepositions are replaced with a related WordNet word (synonym, hyponym, meronym…). If no replacement exists, the word is replaced with a random word from the ‘reservoir’ (see the sketch after this list).

Rudimentary correction of verb tenses is done using pattern.en.
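
A minimal sketch of that replacement-with-fallback logic (the choice policy here is an assumption):

import random
from nltk.corpus import wordnet as wn   # needs: nltk.download('wordnet')

def replace_word(word, reservoir):
    # gather lemmas from the word's synsets, minus the word itself
    related = {l.name().replace('_', ' ')
               for s in wn.synsets(word) for l in s.lemmas()} - {word}
    if related:
        return random.choice(sorted(related))
    # no WordNet relatives: fall back to the 'reservoir'
    return random.choice(reservoir) if reservoir else word

print(replace_word('chicory', ['zeon', 'clearway']))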


How about the graphs?

About the only true data science in the project is the t-SNE analysis.

Visuals are generated either in Python using matplotlib or with D3.js.
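
For the curious, a minimal sketch of the t-SNE step (the feature matrix is a hypothetical input; this project’s actual features may differ):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# hypothetical input: one feature vector per poem (e.g. word-count stats)
vectors = np.load('poem_features.npy')
coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1], s=4)
plt.title('t-SNE projection of poem feature vectors')
plt.savefig('tsne_poems.png', dpi=150)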


Where is the code?

On GitHub


So why?

Big-data is big meme. Fluffy flarfy retroactive regurgitation of probabilistic entrails: sophic oracles updated with statistical analysis. Extroverted introversion. All answers concerning its validity for qualitative practice are questionable.

It might also seem that numbers obliterate the ambiguous juicy core of poetry. Yet, the techniques of big-data offer a chance to generate poetry from models of language that emerge at a scale previously unimaginable.

In the era of the printing press, poets (writers and intellectuals) aspired to be well-read. Some aspired to breadth, others to depth, yet all recognized the cognitive benefits of reading: through the osmosis of many words, patterns and processes and modes of communication became clear. Now that the amount written exceeds the capacity to read, a brain beyond the brain is needed to analyze and interpret the results. Big data can digest the literary torrent.

Even the OpenLibrary is Locked

In my amateur quest to retrieve an archive of moderately large data for a poetry-analysis project, I imagined OpenLibrary.org might offer an opportunity to download some poetry in the public domain. My first encounter was not encouraging. Thousands of the books listed there under poetry are distributed by the Library of Congress under what is known as a DAISY lock, which requires a key to open and is accessible only to the blind. Imagine a library where a significant portion of the books are locked shut. Aaron Swartz would not be amused.


A few rudimentary visuals of Poetry Foundation corpus (preliminary buggy results)

Word counting is the ‘Hello World’ of big data. And my data is relatively tiny.

Below are 25 images: 5 increasingly small time scales for each of 5 variables (line length, verse length, average word length, number of words per poem, number of verses per poem), derived from an analysis of a corpus of 10.5k poems scraped from poetryfoundation.org.

[Plot: # of LINES per poem]
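
The per-poem counts behind these plots are simple to compute; a minimal sketch (assuming one poem per text file, with blank lines separating verses):

import glob

for path in glob.glob('poems/*.txt'):
    text = open(path, encoding='utf-8').read()
    lines = [l for l in text.splitlines() if l.strip()]
    verses = [v for v in text.split('\n\n') if v.strip()]
    words = text.split()
    if not words or not lines:
        continue
    print(path,
          len(words),                               # words per poem
          len(verses),                              # verses per poem
          sum(len(l) for l in lines) / len(lines),  # avg line length (chars)
          sum(len(w) for w in words) / len(words))  # avg word length (chars)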



Review: Socher et al. Recursive Deep Models …

Review

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Chris Manning, Andrew Ng and Chris Potts. Conference on Empirical Methods in Natural Language Processing (EMNLP 2013, oral). PDF; website with live demo and downloads.

Objective/Abstract

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank by Socher et al. introduces the Recursive Neural Tensor Network (RNTN), a model for extracting sentiment from longer phrases that outperforms previous models. The model is trained on a corpus of 215,154 phrases labelled using Amazon Mechanical Turk. Using the RNTN, single-sentence accuracy increases from 80% up to 85.4%, and negation is accurately captured (a task at which no previous model succeeded).


On Numeration (Khan, meet Stiegler)

I’ve been spending some hours this weekend reviewing math at the amazing Khan Academy.  The following reflection is meant as a contemplation of a trend and not in any way a critique of their valuable work.

Consider the screenshot below, where the value assigned to angle IAK, 66º, does not accurately reflect its measure. Both angles IAK and GCJ, if measured with a tool like a protractor, are 45º angles. Yet IAK is labelled 66º and the correct answer for angle GCJ is 24º. It does not seem like an important mislabelling, yet there is a fundamental conceptual issue at stake here. It has a (perhaps tenuous) relation to poetry, but above all it is an issue of trust. As Tom Waits said: “The large print giveth, the small print taketh away”. Exactly so here, where the faint inscription at the bottom states: “Note: Angles not necessarily drawn to scale.”

[Screenshot: Khan Academy angle exercise]

 
