What, where and how?
For this exercise, a small set of documents titled "WEEKLY REPORT FOR HOME INTELLIGENCE" was chosen. These texts had been scanned and digitised using Optical Character Recognition.
Before a machine could process the texts, they needed a considerable amount of manual cleaning. The raw slabs of text had to be extracted from the XML files, stripping away formatting and structures such as the table of contents, page numbers and section numbering. Regular expressions were used to search for and replace certain characters and character combinations, followed by spell-checking and proof-reading.
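The kind of regular-expression cleaning described above can be sketched as follows. This is a minimal illustration, not the rules actually used for these reports: the function name `clean_ocr_text` and every pattern in it are assumptions, chosen only to show tags, page-number lines and section numbering being stripped from OCR output.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Reduce OCR'd markup to plain running text (illustrative rules only)."""
    # Drop XML/HTML tags, leaving a space so neighbouring words stay apart.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Remove lines that hold only a page number, e.g. "12" or "- 3 -".
    text = re.sub(r"^[ \t]*-?[ \t]*\d+[ \t]*-?[ \t]*$", "", text, flags=re.MULTILINE)
    # Remove leading section numbering such as "1." or "2.1" at the start of a line.
    # Caveat: this also strips a genuine number that happens to start a line.
    text = re.sub(r"^[ \t]*\d+(?:\.\d+)*\.?[ \t]+", "", text, flags=re.MULTILINE)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse all remaining whitespace runs into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "<p>1. GENERAL COM-\nMENTS</p>\n- 3 -\n<p>2.1 Morale  is   steady.</p>"
print(clean_ocr_text(sample))
```

Rules like these only handle the mechanical part of the job. The spell-checking and proof-reading mentioned above still have to be done by hand.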
The cleaned texts were then cached (saved) and sent for Natural Language Processing using the Alchemy API. The results of the analysis were also cached and used to display various pieces of information about the texts in HTML format. Each of the extracted entities, keywords and taxonomies is rated for relevance and sentiment. The sentiment score is used to apply a colour from a scale: reds for positive sentiment, blues for negative sentiment and white for neutral sentiment. See Sentimental Hex for an example.
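One way to turn a sentiment score into a colour on such a scale is sketched below. It assumes the score lies between -1 (most negative) and +1 (most positive) and fades linearly towards white as the score approaches zero. The function name `sentiment_to_hex` and the linear fade are illustrative assumptions, not necessarily how the colours on this site were produced.

```python
def sentiment_to_hex(score: float) -> str:
    """Map a sentiment score in [-1, 1] to an HTML hex colour.

    Positive scores give reds, negative scores give blues and zero gives white.
    """
    # Clamp out-of-range scores so the channel arithmetic stays within 0..255.
    score = max(-1.0, min(1.0, score))
    # The weaker the sentiment, the closer the faded channels get to 255 (white).
    fade = round(255 * (1 - abs(score)))
    if score > 0:
        r, g, b = 255, fade, fade   # red scale for positive sentiment
    elif score < 0:
        r, g, b = fade, fade, 255   # blue scale for negative sentiment
    else:
        r, g, b = 255, 255, 255     # neutral is plain white
    return f"#{r:02x}{g:02x}{b:02x}"


for s in (-1.0, 0.0, 1.0):
    print(s, sentiment_to_hex(s))
```

The returned string can be used directly as a CSS `background-color` value when the HTML for an entity or keyword is generated.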
The SRU Search Client is a copy of the original search page, with links to the raw documents and their analysis inserted. Two of the original search results were omitted because their text exceeded the maximum size allowed for submission to the Alchemy API.