What, where and how?

This small piece of work came about as part of a hack day for the Being Human Festival 2014.

It is built on data retrieved from The Keep's digital archives using an API.

For this exercise, a small set of documents titled "WEEKLY REPORT FOR HOME INTELLIGENCE" was chosen. These texts had been scanned and digitised using Optical Character Recognition.

Before a machine could analyse the texts, they needed a considerable amount of manual cleaning. The raw slabs of text had to be extracted from the XML files, stripping away formatting and structures such as the table of contents, page numbers and section numbering. Regular expressions were used to search for and replace certain characters and character combinations, followed by spell-checking and proof-reading.
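As a rough illustration of this kind of cleaning, here is a minimal Python sketch. It is not the code used on the hack day: the function names, the sample XML and the patterns for page numbers and section numbering are all assumptions, and real OCR output would need more rules plus the manual proof-reading described above.

```python
import re
import xml.etree.ElementTree as ET


def extract_text(xml_string: str) -> str:
    """Return every non-empty text node of an XML document, one per line."""
    root = ET.fromstring(xml_string)
    return "\n".join(t.strip() for t in root.itertext() if t.strip())


def clean_text(raw: str) -> str:
    """Apply a few regular-expression clean-ups (all patterns are assumptions)."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines holding only a page number, e.g. "12" or "- 12 -".
    text = re.sub(r"^[ \t]*-?[ \t]*\d+[ \t]*-?[ \t]*$\n?", "", text, flags=re.M)
    # Strip leading section numbering such as "1. " or "2.3. ".
    text = re.sub(r"^[ \t]*\d+(?:\.\d+)*\.[ \t]+", "", text, flags=re.M)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Hypothetical sample, invented for this example.
sample = "<doc><p>1. Public mor-\nale is steady.</p><p>- 12 -</p><p>2. Food  supplies</p></doc>"
cleaned = clean_text(extract_text(sample))
print(cleaned)
```

The order of the substitutions matters: hyphenated words are rejoined first so that later line-based patterns see whole lines.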

The cleaned texts were then cached (saved) and sent for Natural Language Processing using the Alchemy API. Results of the analysis were cached and used to display various pieces of information about the texts in HTML format. Each of the extracted entities, keywords and taxonomies is rated for relevance and sentiment. The sentiment score has been used to apply a colour from a scale of reds for positive sentiment, blues for negative sentiment and white for neutral sentiment. See Sentimental Hex for an example.
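The colour scale can be sketched as a small Python function. This is an illustration, not the code behind the site: it assumes the sentiment score lies between -1 (most negative) and 1 (most positive), and the function name and the linear fade towards white are choices made for this example.

```python
def sentiment_to_hex(score: float) -> str:
    """Map a sentiment score to a hex colour: red for positive,
    blue for negative, white for neutral (assumed range -1 to 1)."""
    # Clamp out-of-range scores rather than failing.
    score = max(-1.0, min(1.0, score))
    # The weaker the sentiment, the closer the colour is to white.
    fade = round(255 * (1 - abs(score)))
    if score >= 0:
        r, g, b = 255, fade, fade  # shades of red; score 0 gives white
    else:
        r, g, b = fade, fade, 255  # shades of blue
    return f"#{r:02x}{g:02x}{b:02x}"


print(sentiment_to_hex(0.0))   # neutral
print(sentiment_to_hex(1.0))   # strongest positive
print(sentiment_to_hex(-1.0))  # strongest negative
```

The returned string can be dropped straight into an HTML style attribute, for example as a background colour for each keyword.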

The SRU Search Client is a copy of the original search page with links to the raw documents and their analysis inserted. Two of the original search results were left out because their text exceeded the maximum size allowed for submission to the Alchemy API.
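A size check of this kind could be run before submission, along the lines of the Python sketch below. The function name is hypothetical and the limit is passed in as a parameter, since the actual maximum is not stated here; measuring the size in UTF-8 bytes is also an assumption.

```python
def split_by_size(docs: dict[str, str], max_bytes: int) -> tuple[dict[str, str], list[str]]:
    """Separate documents small enough to submit from those that are too large.

    max_bytes is supplied by the caller; no real API limit is assumed here.
    """
    accepted: dict[str, str] = {}
    rejected: list[str] = []
    for name, text in docs.items():
        # Measure the encoded size, not the character count.
        if len(text.encode("utf-8")) <= max_bytes:
            accepted[name] = text
        else:
            rejected.append(name)
    return accepted, rejected


# Hypothetical documents and limit, for illustration only.
docs = {"report_a": "short text", "report_b": "x" * 100}
accepted, rejected = split_by_size(docs, max_bytes=50)
print(sorted(accepted), rejected)
```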

Next steps

Enabling human analysis of the text for comparison.

Linking to images.

Use of NLP machines other than Alchemy would also prove interesting, although not many are so easily accessible. Creation of a custom NLP machine would be ideal but would require resources beyond those of a weekend side-project!

External links

Alchemy: Sentiment Analysis API

Scholarly articles on the subject of Sentiment Analysis

Creating a Sentiment Analysis Model

Amazon AWS Tutorial: Sentiment Analysis

Semantria: What is Sentiment Analysis?