Blog

Viral tweets

I won’t lie, I am desperate for likes on Twitter. Here are some of my tweets that went viral:

bioRxiv first said

While endlessly scrolling through my Twitter feed during lockdown, I stumbled upon the bot New New York Times (@NYT_first_said). The idea behind the bot is simple, effective, and at times hilarious: every time a word appears in the New York Times that has never been published there before, the bot tweets that word. The result is a jumble of funny made-up words, slang, onomatopoeias, and jokes. Here’s an example:

I instantly loved this idea for its simplicity and effectiveness. My first instinct was to steal it, and that is what I did. Scientists often use words they made up themselves, or words that are extremely niche, so I thought it would be funny to apply this idea to bioRxiv. For those of you who don’t know: bioRxiv is the most popular pre-print server in the biological sciences, receiving over 3,000 submissions of research articles every month. Going through all these publications by hand would be intractable, but the abstracts are easily accessible using a web crawler (I used biorxiv-retriever).

I set up a Linux server and had it crawl all abstracts ever published on bioRxiv, extract the words, and build a library of unique words used in these abstracts. When selecting words I had to do some clean-up, like removing punctuation, and impose some selection criteria. Biological scientists use a lot of acronyms for genes, proteins, brain regions, etc. These acronyms are not particularly funny, so I excluded all words containing at least one capital letter; since acronyms are almost always written in capitals, this got rid of them. I also excluded words with more than one hyphen, to filter out constructions like “to-be-determined”.

With this word library in hand, I set up a cron job on my server that runs a Python script every day. The script crawls the new abstracts of that day, cleans up the words, and checks them against the library. Because there are usually many new words each day, the script randomly picks five to tweet, with a variable delay of up to four hours between tweets. And that’s it, it works! Here is the very first tweet it produced:
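The filtering and daily selection steps can be sketched like this. The function names and punctuation list below are my own illustration, not the bot’s actual code (which lives in the GitHub repository):

```python
# Sketch of the word clean-up and daily selection described above.
# Names and the punctuation list are illustrative, not the bot's real code.
import random

PUNCTUATION = ".,;:!?()[]{}\"'"

def extract_words(abstract):
    """Extract candidate words from one abstract."""
    words = set()
    for token in abstract.split():
        word = token.strip(PUNCTUATION)     # remove surrounding punctuation
        if not word:
            continue
        if any(c.isupper() for c in word):  # drop acronyms (written in capitals)
            continue
        if word.count("-") > 1:             # drop words like "to-be-determined"
            continue
        words.add(word)
    return words

def pick_new_words(todays_abstracts, library, n=5):
    """Check today's words against the library and randomly
    pick up to n words that have never been used before."""
    todays_words = set()
    for abstract in todays_abstracts:
        todays_words |= extract_words(abstract)
    new_words = sorted(todays_words - library)
    return random.sample(new_words, min(n, len(new_words)))
```

The returned words would then be posted by the cron-driven script, with a random delay between tweets.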

If you like this Twitter bot you can give it a follow at @bioRxiv_first, and you can find all the code in my GitHub repository.

Trump’s tweets blur the boundary between reality and fiction.

Can a machine learning algorithm tell the difference?

We live in an age of misinformation in which it is becoming increasingly difficult to discern truth from falsehood. Part of the blame rests with technological advances conceived in the age of internet bubbles: neural networks which can fabricate deep-fake videos, and bots which flood the internet with fake news. In other cases the blame lies with individuals who intentionally spread misleading or outright deceitful messages, because they personally benefit from confusion about what is real and what is not. One of these people is the current President of the United States. Mr. Trump routinely discredits news sources which are widely regarded as trustworthy. Among the outlets he has labelled “fake news media” are The New York Times, CNN, and The Washington Post. It’s hardly a coincidence that there is a large overlap between the sources he deems fake and those which disagree with his political views.

To amplify his often misleading statements, Mr. Trump regularly takes to Twitter. For example, he has falsely claimed that mail-in ballots lead to a fraudulent election (they do not), and that he fired Marine Corps general James Mattis (he resigned). In Twitter’s defense, the company has labelled several of Mr. Trump’s tweets as containing false information. Beyond the accuracy of his tweets, Mr. Trump’s style of writing is often exceptionally un-presidential, with myriad exclamation marks and random capitalization. Misleading content written in an outlandish style has become his distinctive signature, and has inspired many parody Twitter accounts which jokingly mimic him. This adds yet another blurred boundary between truth and fiction on top of an already opaque situation. At times, I have been hard pressed to determine whether a tweet was real or a parody. That’s when I decided to see if a machine learning algorithm could make this distinction for me.

I scraped 1000 tweets from Mr. Trump himself and 500 tweets from two parody accounts (@realDonaldTrFan and @RealDonalDrumpf) to train a machine learning algorithm (see the GitHub repository for details) to tell the difference between real and parody tweets. The trained classifier performed remarkably well, correctly classifying 88% of new tweets as either real or fake. Unfortunately, it doesn’t quite reach human-level performance yet. My girlfriend was kind and bored enough to sit behind a laptop and classify 100 tweets for me; she got 96% correct. Only twice did I hear “He actually **expletives** said that?!”.
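The post doesn’t spell out which model is behind the 88%; as a self-contained stand-in, here is a toy naive Bayes text classifier over bag-of-words counts (the actual model is in the GitHub repository, and the example tweets below are invented placeholders, not real data):

```python
# Toy real-vs-parody text classifier: naive Bayes over bag-of-words
# counts. A stand-in for illustration, not necessarily the bot's model.
import math
from collections import Counter

def train(texts, labels):
    """Collect per-class word counts and class priors (1 = real, 0 = parody)."""
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    vocab = set(counts[0]) | set(counts[1])
    return counts, priors, vocab

def predict(model, text):
    """Return the most likely class for a new tweet."""
    counts, priors, vocab = model
    total = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in (0, 1):
        n = sum(counts[label].values())
        score = math.log(priors[label] / total)
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out a class
            score += math.log((counts[label][word] + 1) / (n + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In the real setup the two classes would be the 1000 scraped tweets from Mr. Trump and the 500 from the parody accounts, with accuracy measured on held-out tweets.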

I took the trained classifier and put it into a Twitter bot; @real_fakeTrump was born. The bot shows you a tweet from either Mr. Trump himself or one of the parody accounts, and you have to guess whether it’s real or fake. In the thread, the bot then tells you its own prediction, the correct answer, and a link to the original tweet. Let’s dive in and look at a couple of examples: first a tweet which the classifier correctly identified, then two where it was mistaken. The latter cases are the most interesting, because they are either parody tweets that could be real, or tweets from Mr. Trump so absurd that they cannot be distinguished from parody.

Let’s give it a go shall we? What do you think of this one:

If you thought this tweet was fake, you were sadly mistaken. The classifier correctly identified it as real (though only with 55% certainty); it is indeed from Mr. Trump himself. Let’s try another one:

This is an interesting example of the classifier making a mistake: it identified this tweet as fake, but it was actually from Mr. Trump. This suggests the tweet is very close to parody in its choice of words and sentence construction. However, because the algorithm builds a high-dimensional model to make its predictions, it is unfortunately not possible to pin down exactly how it reached its conclusion. OK, last one:

You might have guessed that this one is fake; it was indeed from the parody account @RealDonalDrumpf. The classifier, however, thought it was so close to Mr. Trump’s style of tweeting that it predicted it was real.

At the moment, deep-fake videos and fake news articles written by bots are relatively easy to spot, although in some cases spotting them is already impossible. Take, for example, the college student who created a productivity blog with content generated by the language model GPT-3. Nobody realized the posts weren’t written by a human, and the blog reached the top of Hacker News. As technology advances, it will become increasingly difficult to discern truth from fiction. Ironically, we might need to depend on AI to make this distinction for us.

Collaborative science is the way forward

Two years ago I decided to join a new collaboration in neuroscience called the International Brain Laboratory (IBL). It was a gamble. Systems neuroscience doesn’t really do collaborations of this magnitude. Sure, there are top-down initiatives aimed at fostering collaboration between existing labs, but the idea of the IBL is that it is a single lab, albeit one physically distributed over several locations. The aim of this virtual laboratory is to standardize and reproduce a behavioral paradigm across all its experimental locations, thereby tackling several problems that haunt systems neuroscience: poor reproducibility, low statistical power, and a lack of data sharing. Everything the IBL produces is open-source: hardware (as far as possible), software, data, publications, protocols, etc.

The big collaboration-wide publications that the IBL puts out are in the style of particle physics: a large number of authors, all listed in alphabetical order. Contrary to particle physics, neuroscience is not used to this kind of publication. Traditionally in neuroscience, the researcher who has done the largest part of the work is listed as first author and the head of the lab is listed last. That brings me to why joining the IBL was a gamble: unless there is a shift in how publications are valued on the job market, I will have a hard time convincing any hiring committee that I made a substantial contribution to a paper where I’m listed as 22nd author. I hope that CERN-style papers like this one will become more accepted and valued within the neuroscience community.

I believe that the way we traditionally do neuroscience is no longer viable. Both the theoretical and the experimental aspects of neuroscience have become too complex for individual labs to tackle. Researchers are expected to be polymaths: skilled in physiology, programming, hardware, behavior, theoretical models, etc. All of these topics are becoming increasingly complicated and specialized. In a traditional neuroscience publication, the first author usually built the recording setup, trained subjects to perform a certain task, gathered the data, processed and analyzed the data, and wrote the paper. This was fine twenty years ago, when most recordings were done with single-channel electrodes and the behaviors under scrutiny were relatively simple. Now, after an explosion of technological advancement, everything from gathering to analyzing the data has become exponentially more complex. I’m not saying that the way the IBL is set up is the only, or even the best, way forward. What is clear, however, is that the current system is unsustainable. The times, they are a-changin’.