16 Sep 2017 22:23 GMT
My latest project is Westminster Bubble, a Twitter bot that shows what Westminster journalists are talking about on Twitter, in one word cloud a day.
The @wmbubble account follows all journalists who are registered with Parliament and who primarily cover Westminster politics in their work and on Twitter.
This is a reasonably objective set of criteria for Westminster journalism but it's not perfect. There is some fuzziness around who primarily covers Westminster and whether that is their focus on Twitter, but the aim is to exclude journalists who are covering only one or two specific areas of policy and to include journalists who are talking about Westminster politics or policy generally.
The bot does not analyse or store individual tweets, or keep any record of who is saying what. It takes the text of each tweet in the stream of tweets from people it follows, cleans the text to remove mentions, hashtags and links, and adds what's left to an anonymous corpus. That corpus is used to make the word cloud at the end of each day.
If you are a journalist the account follows and you are uncomfortable with your tweets being included in this analysis, please let me know and I will remove your account from the group. Alternatively, you can just block the @wmbubble account. But to reiterate, the bot is not storing any data about individual accounts, just a single block of anonymised text which does not include any ids, account names, hashtags or URLs.
Word clouds are posted daily at around 9:00pm, just before the newspaper front pages come out. The word clouds show the most common words each day and the size of a word represents a measure of its relative frequency within that day's corpus. This measure is not based on frequency alone, but an equally weighted combination of frequency and rank on frequency.
This is because on some days a small number of words are so dominant in the corpus they would make most other words too small to render at the minimum font size, if size was scaled only to frequency. Giving equal weight to rank on frequency preserves the diversity of words that are visible in the cloud, and ensures you get a better sense of the content of the day's big stories. But this does mean that the relative size of the words is not a direct measure of their relative frequency alone.
Frequently collocated words are treated as a single term, so names like Theresa May and Jeremy Corbyn are preserved. The colours are not meaningful and are just randomly chosen from a range of colour maps for variety.
Feel free to address any questions to me via @olihawkins on Twitter. Don't try addressing questions to @wmbubble: it's programmed to ignore those tweets.