A few rudimentary visuals of the Poetry Foundation corpus (preliminary, buggy results)

Word counting is the ‘Hello World’ of big data. And my data is relatively tiny.

Below are 25 images: 5 different variables (line length, verse length, avg word length, # of words per poem, # of verses per poem), each shown at 5 increasingly small time scales. All are derived from an analysis of a corpus of 10.5k poems scraped from poetryfoundation.org.
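For reference, the five variables could be computed per poem roughly as follows. This is a minimal sketch, not the script used for these images. It assumes that verses are separated by blank lines, that words are whitespace-separated, that line length is measured in characters and that verse length is measured in lines. The function name `poem_stats` and the dictionary keys are hypothetical.

```python
import re


def poem_stats(text):
    """Return the five per-poem variables for one poem given as plain text.

    Assumptions (not confirmed by the original analysis):
    - verses are blocks of lines separated by one or more blank lines
    - words are whitespace-separated tokens
    - line length is counted in characters, verse length in lines
    """
    # Split into verses on blank lines, dropping empty blocks.
    verses = [v for v in re.split(r"\n\s*\n", text.strip()) if v.strip()]
    # Collect non-empty lines across all verses.
    lines = [ln.strip() for v in verses for ln in v.splitlines() if ln.strip()]
    # Naive whitespace tokenization.
    words = [w for ln in lines for w in ln.split()]

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return {
        "avg_line_length": mean([len(ln) for ln in lines]),
        "avg_verse_length": len(lines) / len(verses) if verses else 0.0,
        "avg_word_length": mean([len(w) for w in words]),
        "words_per_poem": len(words),
        "verses_per_poem": len(verses),
    }


if __name__ == "__main__":
    sample = "a bb\nccc\n\ndddd ee"
    print(poem_stats(sample))
```

Note that whitespace tokenization like this is exactly what lets runs of dots or underscores slip through as "words".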


Dirt in data and perhaps a bad algorithm: Mark Tardi’s result is due to long horizontal lines inserted into his poem “Eventual History”. Other bug: Wang Ping’s only poem in the data, “The Last Son of China”, contains “ ……………. ”, which the algorithm mistakenly counts as words. Yet another bug: Alan Shapiro’s results do not seem to reflect his status as top of category in several previous iterations.

Notes concerning anomalies:

I removed a radio play (too long). And a Christmas carol (because I hated it and it skewed results with its repetitive refrains).

The outlier Frank Bidart in # of verses reflects his preoccupation with narrative poems broken into verses of 2 to 3 lines. Caroline Bergvall’s prose poem “Drift” skews the line length data because its lines are paragraphs, which throws her into the outlier class. She is also the poet with the lowest average word length, because “Drift” includes entire paragraphs consisting of just the letter ‘t’. These results show that counts from such a limited sample are not to be trusted; on another level, they at least show that the algorithms are functioning correctly.

A poet’s creativity with underscores and dots resulted in dirt in the data that required an amended algorithm: Mark Tardi’s long horizontal lines inserted into his poem “Eventual History” had him as the poet with the longest average word length. Other bug: Wang Ping’s only poem in the data, “The Last Son of China”, contains “ ……………. ”, which the algorithm mistakenly counts as words. Yet another bug: Alan Shapiro’s results did not seem to reflect his status as top of category in several previous iterations.
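One way to keep runs of dots and underscores out of the word counts is to extract only tokens that contain letters or digits. The following is a minimal sketch under that assumption, not the amended algorithm actually used here; the names `clean_words` and `avg_word_length` are hypothetical. It would not change a case like “Drift”, where the single-letter ‘t’ tokens are genuine letters.

```python
import re

# A word is a run of letters/digits, optionally joined by internal
# apostrophes or hyphens (e.g. "don’t", "well-known").
# [^\W_] matches any Unicode letter or digit but not the underscore.
WORD_RE = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")


def clean_words(text):
    """Return the tokens in text that contain letters or digits.

    Runs of dots, ellipsis characters, underscores and other
    punctuation-only material are dropped rather than counted as words.
    """
    return WORD_RE.findall(text)


def avg_word_length(words):
    """Mean length in characters of the given words (0.0 if empty)."""
    return sum(len(w) for w in words) / len(words) if words else 0.0


if __name__ == "__main__":
    print(clean_words("The Last ……………. Son"))
    print(clean_words("________ history ________"))
    print(avg_word_length(clean_words("t t t t")))
```

Because the filter works at the token level, it can be dropped in wherever the poem text is split into words, before any averages are taken.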