published August 13, 2018
For a current side project I am using tweets as the input to a machine learning model. In this article I will share a short and practical guide to converting tweets into something suitable for machine learning use. I will assume that you have already obtained one or more tweets from the twitter API to analyse. I will refer directly to the structure of the JSON provided by the twitter API as we go on. This structure is described in more detail here.
Assume that in our programming language of choice we have a variable called tweet and that element access is implied by square brackets.
If you are familiar with the specifics of machine learning then you can probably skip this section. If you find yourself wondering why tweets need any conversion at all for machine learning use, then read on.
We'll start with the main issue, tweets are essentially strings with some metadata attached. No machine learning algorithm I have ever encountered can take a string as input directly, and I would bet that no such algorithm exists. This does not however mean that strings cannot be used as input for machine learning. They just have to be converted into a numerical representation first, through a process called feature generation.
A feature is simply an input dimension for a machine learning model, e.g. if we are learning from data about football teams one feature might be the number of goals they scored and another might be the number of yellow cards they received. So feature generation is the process of turning raw input into features suitable for our machine learning model.
Before we get stuck into the main body of the tweet text, we will extract some simple features from the associated metadata.
For our application we care about the "trustworthiness" and social impact of the twitter account making the tweets. One simple metric for this is to look at the age of the twitter account. We can extract the time and date of account creation from
tweet["user"]["created_at"], with the example format "Sat Feb 08 02:43:01 +0000 2015". Then it is a simple matter of extracting the time difference between now and that creation date.
We may also be interested in how active the twitter user is, which can be represented by the number of tweets they have made. This can be accessed directly from
We can extract whether the twitter user is verified or not by extracting tweet["user"]["verified"] and converting from boolean to integer. Although verification explicitly "does not imply an endorsement by Twitter" it does highlight that this account may be of special interest. Of course it is up to the machine learning algorithm to decide whether verification is a useful piece of information.
Probably the clearest indicator of the social impact of a twitter account is its number of followers, which can be extracted from
Potentially less interesting than the number of followers, is the number of friends from
tweet["user"]["friends_count"]. However the ratio of followers to friends may well impart some useful information about the way in which the twitter account is being used.
Twitter has the option to create curated lists of twitter accounts, and we can find out how many lists the account is on from
tweet["user"]["listed_count"]. Being listed a large number of times might well highlight a particularly influential account.
It's possible that whether a tweet is a reply to an existing tweet or not may be important for your model. For example, perhaps a large number of single tweets suggests a different situation compared to one large chain of replies between a few users. Whether the tweet is a reply or not is simply found from checking whether the inreplytostatusid property is present on the root object.
Now for the main body of the feature generation process, processing the tweet text. The main process I am using is the classic NLP technique of modelling n-gram frequency.
An n-gram is simply a phrase of n consecutive words. For example an unigram (1-gram) could be "the", a bigram (2-gram) "the quick" and a trigram (3-gram) "the quick brown". Before we get on to n-gram modelling in detail however, we must pass the tweet through a number of other processing steps. The order of these steps is important.
Tweets can contain a number of special entities, such as URLs or hashtags. For some of these entities, particularly photos, we do not particularly care what their exact content is, but simply that they are present. We can represent this by replacing those entities with a placeholder word, for example we might map URLs to "URL".
Conveniently the twitter API identifies entities for us at
tweet["entities"]. This is an associative map from entity type (e.g. "urls") to a list of objects containing useful properties. These properties include a start and stop index (
["indices"]) to locate the entity in the text and other entity-type-specific data such as converting a shortened URL to its full form.
For our purposes will simply use the indices provided to replace those sections of the tweet text
tweet["text"] with our placeholder words. For our particular use case we chose not to replace hashtags and cashtags (entity type "symbols") as their particular content is useful to know. That is, we care not only that the tweet contained 4 hashtags but also which particular hashtags they were.
To complicate things there is a further source of entities. If content such as photos or gifs are embedded directly into the link then there will also be a list of
tweet["extended_entities"] to be handled in the same way as the other entities.
Once we have encoded the appropriate entities (i.e. replaced with placeholders) we pass the text onto the next processing steps.
As we will not be analysing the structure of the text, we also remove the common punctuation marks such as "." or ";".
To make the text easier to handle, we now replace any kind of whitespace with a single space. At this point we should now have a space-separated list of words.
For our n-gram frequency calculation we do not care what case the words are written in, therefore we lowercase every word in the text.
Stop words are very common words which provide little additional overall meaning to text such as "a" or "I". These are very commonly removed before any kind of NLP processing. For processing tweets in English we remove any word found on the stop word list provided by the snowball stemmer.
It is very common to use emojis in tweets, and often without putting whitespace between them and the rest of the text. We do not want "happiness" and "happiness😀" to appear as different unigrams in our later analysis, but rather the bigram "happiness 😀". Therefore we enforce that there must be a single space between consecutive emoji and non-emoji characters in the text.
We considered putting space between consecutive emoji characters so that they could be treated as separate words, but some emojis consist of multiple consecutive emoji characters (e.g. 👨👨👧👧 ). We might revisit this later to split up emoji characters that do not change meaning when combined, but this was deemed too complex to start with.
Stemming is the process of converting a particular form of a word to a particular common root form. For example stemming both "running" and "ran" would produce "run". We stem each word in the text as when calculating n-gram frequency we only care about the meaning of the individual words, rather than taking into account grammar. We use the snowball stemmer.
After passing the tweet through the previous steps, we now have text appropriate for calculating n-gram frequency. This is now a simple case of running a sliding window of size n over the words, and calculating how many times each unique combination appears. For our purposes we decided to initially calculate frequencies for n = 1, 2, 3 but we may revise this after further model testing. In Python this calculation might look like,
If we calculate the frequency of all n-grams in our tweet corpus we are going to collect a huge amount of data. Furthermore very rare n-grams are unlikely to have much predictive power in our final model unless we have a very large dataset to learn from. Therefore we cull the n-gram frequencies we store by keeping only the N most frequent n-grams for each value of n. The choice of N depends on your exact application but starting at 1000 might make sense, and then decreasing or increasing as you develop your model.
We want to use the most common n-grams for all possible future data. However we do not have all possible future data. One could at this point reach out to an external data source for the most frequent n-grams, but we decided to bootstrap these values by instead calculating n-gram frequencies from samples of our entire tweet training corpus and treating those as population frequencies.
In practice we choose which n-gram frequencies to calculate and store before the previous step of our feature generation pipeline. It goes without saying that the set of n-gram frequencies calculated must be the same for each tweet.
The final step is to convert all the features we have generated from metadata and n-gram frequencies into a matrix. Each row represents a tweet and each column the value of one of our features.
A note for those using Python, dictionary traversal orders are not guaranteed to remain constant on data insertion or between different executions of the same program. Therefore some slight care is required when mapping a dictionary of n-gram frequencies to a matrix, as column order must be maintained.
One slight difficulty with handling tweets is the prevalence of retweets and quotes (retweets with extra text added). You can determine if a tweet is a retweet by checking for the presence of
tweet["retweeted_status"] and a quote with
tweet["quoted_status"]. These two cases must be handled separately. We can track whether a given tweet is a retweet or quote by adding a further two features which are 1 if the tweet is a retweet / quote and 0 otherwise.
If a tweet is a retweet, then we can access the text and entities (for entity extraction) of the original tweet from
tweet["retweeted_status"]["entities"]. We can safely ignore the content of the retweet as nothing extra has been added. If user info for the original tweet is desired it can be found at
If a tweet is a quote then we both need to extract the content of the original tweet through
tweet["quoted_status"] and of the quote tweet through the root object as above. Depending on the exact analysis you are planning you might want to concatenate the content of the quote and original tweet together, consider them separately or only look at one or the other.
It is also likely that you will want to strip out the word "RT" (retweet) before calculating n-grams, as the presence of this word will be highly correlated with our features identifying a tweet as a retweet or quote. Highly correlated features are often not a good idea.
In November 2017 extended tweets were released. The character limit was increased from 140 characters to 280. In order to maintain backwards compatibility the text of tweets returned from the twitter is truncated if larger than the original length limit. To check if a tweet is truncated you can check the
tweet["truncated"] property. If it is truncated the text and entities are obtained from
tweet["extended_tweet"] rather than the root object.
The steps above represent just one way of preparing tweets for machine learning, and you may well have to refine the process to work better for your particular model and use-case. Of course, you should aim to change your feature generation process in an empirical fashion by using model performance metrics to identify whether any particular change to the process was a good idea or not.
Something we have not touched on in this article is how to handle the time series aspect of tweet data. That is, tweets are produced at a specific point in time, and that timing provides extra information. Time series analysis is a field all of its own, but we will revisit this in a later blog post.
Hopefully this information will prove useful for other machine learning practitioners working with the twitter API for the first time. Twitter is a great source of data with a rich API, so get out there and build something awesome with it. Tweet @jadelderfield or @verifaio and add some more points to the dataset.
Our technology approach is vendor agnostic, solution based, and always focussed on the needs of our customers.
This article describes the problem of automatically deploying code to a private GKE cluster and my workaround solution to it.
How to run JCasC without using an orchestrator.
By submitting your email you agree to our terms and conditions.
We will never share your email with anybody else.