Clean data and Twitter blue ticks

I’m not a very good cook, although I can make a good burger, and this opinion is obviously in no way biased! One of the few things I do know is that before preparing the vegetables for your burger, like lettuce and tomato, you need to clean them with water. This helps get rid of soil particles and other contaminants. I doubt that’s a controversial opinion either. Once everything is cleaned, you can start preparing the ingredients in earnest.


With data, just like lettuce, you have to clean it first! If your dataset is riddled with random invalid values, missing values, outliers and so on, it can severely impact any analysis further down the line. Cleaning data is probably one of the most time-consuming parts of any data analysis problem, but it is also one of the most necessary. Indeed, at Turnleaf Analytics, which Alexander Denev and I cofounded to forecast inflation, one major part of our forecast pipeline involves cleaning data.
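
To make the idea concrete, here is a minimal sketch of what a first cleaning pass might look like in Python. It is only an illustration, not a description of the Turnleaf pipeline: the function name `clean_series`, the `max_deviations` threshold and the sample readings are all made up for the example, and using the median absolute deviation to spot outliers is just one of many possible choices.

```python
import math
import statistics


def clean_series(values, max_deviations=5.0):
    """Drop missing/invalid entries, then drop outliers far from the median.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    # Step 1: keep only finite numbers (drops None, NaN, infinities, strings).
    valid = [
        float(v)
        for v in values
        if isinstance(v, (int, float))
        and not isinstance(v, bool)
        and math.isfinite(v)
    ]
    if not valid:
        return []

    # Step 2: measure the typical spread with the median absolute deviation,
    # which is not distorted by the very outliers we want to find.
    median = statistics.median(valid)
    mad = statistics.median([abs(v - median) for v in valid])
    if mad == 0:
        # All remaining values are (almost) identical, so nothing to flag.
        return valid

    # Step 3: keep values within max_deviations MADs of the median.
    return [v for v in valid if abs(v - median) / mad <= max_deviations]


# Made-up readings with a missing value, a NaN, a text entry and one outlier.
readings = [2.1, 2.3, None, float("nan"), "n/a", 2.2, 250.0, 2.4, 2.0]
print(clean_series(readings))
```

In practice the right rules depend heavily on the dataset. Sometimes an outlier is a genuine observation rather than an error, which is part of why cleaning takes so long.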


Data appears everywhere in our society these days. Twitter is effectively a massive dataset. Folks interact with it via a graphical interface and, if they have a sufficient budget, can ingest vast quantities of the data via an API. Clearly, the cleaner the Twitter dataset is, as with any dataset we might have, the more valuable it is to its users and the more you can do with it.


So how has Twitter historically tried to ensure that its dataset is clean? One way has been to verify well-known accounts and give them a “blue tick”. It is a time-consuming process, but it at the very least ensures that whoever has historically tweeted from such an account is who they say they are. As has been well publicised, in recent weeks the concept of a blue tick has been totally upended. Now anyone who wants to pay 8 dollars a month, and can provide a telephone number, can buy a blue tick. There aren’t any proper checks on who exactly you are.


I can understand why Elon Musk has done this. It provides an additional income stream for Twitter. An additional income stream for additional services for his users sounds like a good idea (and I’d be willing to pay for a nicer version of Tweetdeck, as an example). However, the way it has been implemented means he has simultaneously devalued the vast amount of data amassed in his firm, by mixing up the concept of a Twitter Blue subscription and the old notion of being verified. Over time, a blue tick will simply come to mean someone willing to pay to get their content boosted in searches. It won’t mean that they are who they say they are.


Twitter can be made better. It needs to provide an income stream to whoever runs it (I recognise it isn’t cheap to run). However, making the dataset less accurate and less clean reduces its value to everyone. Why would you want to spend millions advertising or using the API, if you know the dataset is being continually degraded over time? I suspect that, in the end, some sort of verification will be reinstated as a way to clean up the dataset once more.