Comparing geotagged tweet volumes available with the Twitter Premium Search API, the Twitter Search API, and TWINT

Set up and search for tweets using the Twitter Premium Search API (counts endpoint)

How do the hourly and daily endpoints compare?

When grouped by day, the hourly data closely matches the daily data. However, the aggregated hourly data appears to slightly overestimate the daily totals, especially on days with low tweet volume. This is explained by the quantization of low-value counts, described below.
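The comparison can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: it assumes each bucket is a dict with a `timePeriod` string (`YYYYMMDDHHmm`) and a `count`, the function names are hypothetical, and the sample buckets are made up.

```python
from collections import defaultdict


def daily_totals_from_hourly(hourly_buckets):
    """Sum hourly buckets into per-day totals keyed by 'YYYYMMDD'."""
    totals = defaultdict(int)
    for bucket in hourly_buckets:
        day = bucket["timePeriod"][:8]  # first 8 characters are the date
        totals[day] += bucket["count"]
    return dict(totals)


def hourly_minus_daily(hourly_buckets, daily_buckets):
    """Return, per day, the summed hourly count minus the daily count."""
    summed = daily_totals_from_hourly(hourly_buckets)
    differences = {}
    for bucket in daily_buckets:
        day = bucket["timePeriod"][:8]
        differences[day] = summed.get(day, 0) - bucket["count"]
    return differences


# Made-up buckets for illustration only (assumed response shape).
hourly = [
    {"timePeriod": "202105010000", "count": 5},
    {"timePeriod": "202105010100", "count": 5},
    {"timePeriod": "202105010200", "count": 12},
]
daily = [{"timePeriod": "202105010000", "count": 17}]

differences = hourly_minus_daily(hourly, daily)
print(differences)
```

A positive difference means the summed hourly counts exceed the daily count for that day.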

Quantization of data

The counts endpoint appears to quantize (or obfuscate) any count between 1 and 5 to a value of 5. Interestingly, this appears to happen only for georeferenced queries, as shown further below in this notebook.
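To see why this produces an overestimate when hourly counts are summed, here is a small model of the observed behavior. The `quantize` function is a hypothetical stand-in for whatever the endpoint does internally, and the hourly values are invented.

```python
def quantize(count, floor=5):
    """Model of the observed behavior: counts from 1 up to `floor` are reported as `floor`."""
    return floor if 0 < count < floor else count


# Invented "true" hourly counts for a single low-volume day.
true_hourly = [0, 1, 2, 7, 3]

# What the hourly endpoint would report under this model.
reported_hourly = [quantize(c) for c in true_hourly]

# The daily bucket is quantized once, after aggregation, so a total
# above the floor passes through unchanged.
reported_daily = quantize(sum(true_hourly))

# Each small hourly bucket is rounded up separately, so the errors add up.
overestimate = sum(reported_hourly) - reported_daily
print(reported_hourly, reported_daily, overestimate)
```

Under this model the error grows with the number of small non-zero hourly buckets, which would be consistent with the discrepancy being largest on low-volume days.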

Set up and search for tweets using TWINT

How does this compare to the Premium Search API counts?

TWINT appears to return many more tweets for the previous 7 days than the Premium Search API counts. Beyond this period, however, the two are very similar, except for a few cases where the Premium Search API counts endpoint reports more tweets, which could be a consequence of deleted tweets.
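One way to quantify this is to compute the per-day difference between the two sources and split it at the seven-day boundary. The sketch below assumes the TWINT results have been reduced to a list of `date` objects and the Premium counts to a `{date: count}` dict; the function names and the sample data are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def daily_difference(twint_dates, premium_counts):
    """Per-day TWINT tweet count minus the Premium count."""
    twint_counts = Counter(twint_dates)
    days = sorted(set(twint_counts) | set(premium_counts))
    return {d: twint_counts.get(d, 0) - premium_counts.get(d, 0) for d in days}


def split_by_recency(differences, reference_day, window_days=7):
    """Separate days inside the most recent window from older days."""
    cutoff = reference_day - timedelta(days=window_days)
    recent = {d: v for d, v in differences.items() if d > cutoff}
    older = {d: v for d, v in differences.items() if d <= cutoff}
    return recent, older


# Invented data: four TWINT tweets on a recent day, two on an older day.
reference_day = date(2021, 5, 10)
twint_dates = [date(2021, 5, 9)] * 4 + [date(2021, 5, 1)] * 2
premium_counts = {date(2021, 5, 9): 1, date(2021, 5, 1): 2}

recent, older = split_by_recency(
    daily_difference(twint_dates, premium_counts), reference_day
)
print(recent, older)
```

Note that quantized Premium counts would bias the difference downward on low-volume days, so small negative values should be read with that in mind.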

In the lower left of the graph, the quantization of low-value counts is clearly visible.

What is different about the tweets that explains this volume discrepancy?

Get tweets using the Premium Search API

Use the search endpoint instead of the counts endpoint to get actual tweets.
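As a rough sketch of what such a request might contain, the snippet below only builds the query string and request body and sends nothing. It assumes the `point_radius:[lon lat radius]` operator and the `query`, `fromDate`, `toDate`, and `maxResults` parameters; the helper names, coordinates, and dates are illustrative assumptions, not values from this notebook.

```python
def build_point_radius_query(longitude, latitude, radius_km, max_radius_km=40):
    """Build a georeferenced query, enforcing the 40 km radius limit."""
    if not 0 < radius_km <= max_radius_km:
        raise ValueError(f"radius must be in (0, {max_radius_km}] km")
    return f"point_radius:[{longitude} {latitude} {radius_km}km]"


def build_search_payload(query, from_date, to_date, max_results=100):
    """Assemble a request body; dates use the 'YYYYMMDDHHmm' format."""
    return {
        "query": query,
        "fromDate": from_date,
        "toDate": to_date,
        "maxResults": max_results,
    }


# Illustrative coordinates and dates only.
payload = build_search_payload(
    build_point_radius_query(-122.33, 47.61, 40),
    from_date="202105010000",
    to_date="202105080000",
)
print(payload)
```

The same query string can be reused for the counts endpoint, so the counts and the returned tweets refer to the same search.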

Most tweets do not have explicit geo information.

However, all seem to contain "place" information.
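A tally like the following can back up both observations. It assumes tweets are available as dicts with optional `coordinates` and `place` keys, as in the v1.1 tweet JSON; the function name and the sample tweets are made up.

```python
def tally_location_fields(tweets):
    """Count how many tweets carry exact coordinates and how many carry a place."""
    tally = {"coordinates": 0, "place": 0, "neither": 0}
    for tweet in tweets:
        has_coordinates = tweet.get("coordinates") is not None
        has_place = tweet.get("place") is not None
        if has_coordinates:
            tally["coordinates"] += 1
        if has_place:
            tally["place"] += 1
        if not has_coordinates and not has_place:
            tally["neither"] += 1
    return tally


# Made-up tweets for illustration only.
sample_tweets = [
    {"id": 1, "coordinates": None, "place": {"full_name": "Example City"}},
    {"id": 2, "coordinates": {"type": "Point", "coordinates": [0.0, 0.0]},
     "place": {"full_name": "Example City"}},
    {"id": 3, "coordinates": None, "place": {"full_name": "Example Town"}},
]

tally = tally_location_fields(sample_tweets)
print(tally)
```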

Tweets that were found by both TWINT and the Premium Search API include a Twitter "place," as in this example: https://twitter.com/SumatiThusoo/status/1389522172671037442

Tweets that were found by TWINT but not by the Premium Search API do not include a place. In these cases, the location appears to be sourced from the user's profile location.
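The two groups can be separated by comparing tweet IDs across the two result sets. This is a minimal sketch with a hypothetical function name and invented IDs.

```python
def split_by_overlap(twint_ids, premium_ids):
    """Partition tweet IDs by which source returned them."""
    twint_ids, premium_ids = set(twint_ids), set(premium_ids)
    return {
        "both": twint_ids & premium_ids,
        "twint_only": twint_ids - premium_ids,
        "premium_only": premium_ids - twint_ids,
    }


# Invented IDs for illustration only.
groups = split_by_overlap(twint_ids=[101, 102, 103, 104], premium_ids=[101, 102, 105])
print(groups)
```

The `twint_only` group is the one to inspect for missing place information, while `premium_only` may contain tweets that have since been deleted.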

Unfortunately, very few TWINT tweets contain populated "place" information, even for tweets that show a place in the Twitter web interface. It is unknown why this is the case.

The TWINT "Location" option is also currently non-functional, so it is not possible to distinguish the tweets that match only because of the user's profile location.

From these observations we can conclude that the difference in tweet volume is caused by the Twitter Advanced Search (and therefore TWINT as well) including tweets based on the user's profile location information, but only for the most recent seven days. This means that tweet volume measured using TWINT from the most recent seven days cannot be compared with volume beyond that time range.

Getting tweets with the standard Twitter Search API

The standard Twitter Search API appears to match the behavior of TWINT, although it returns slightly fewer tweets. It is not immediately clear why certain tweets are not included.

Since the data returned by the standard Twitter Search API contains populated place information, it would be possible to merge it with TWINT data to form a consistent dataset of tweets containing place data within a specified radius, using only free APIs. However, the Twitter Search API is rate limited, so the merged dataset would include only a limited selection of tweets.
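Such a merge might look like the sketch below. It assumes both sources have been reduced to lists of dicts sharing an `id` key, with `place` populated only in the standard API results; the function name, field names, and sample records are assumptions for illustration.

```python
def attach_places(twint_tweets, api_tweets):
    """Keep TWINT tweets whose ID has place data in the standard API results."""
    places = {t["id"]: t["place"] for t in api_tweets if t.get("place")}
    merged = []
    for tweet in twint_tweets:
        place = places.get(tweet["id"])
        if place is not None:
            # Copy the TWINT record and add the place from the API record.
            merged.append({**tweet, "place": place})
    return merged


# Made-up records for illustration only.
twint_tweets = [
    {"id": 1, "text": "first"},
    {"id": 2, "text": "second"},
    {"id": 3, "text": "third"},
]
api_tweets = [
    {"id": 1, "place": {"full_name": "Example City"}},
    {"id": 2, "place": None},
]

merged = attach_places(twint_tweets, api_tweets)
print(merged)
```

Tweets without a match in the standard API results are dropped, which is where the limited coverage would show up.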

Another limitation of both the Premium Search API and the standard Twitter Search API is that the search radius is capped at 40 km. There is no (observed) maximum radius when searching with Twitter Advanced Search.

It appears that they match quite closely, with the Premium Search API counting slightly more tweets in certain circumstances. This is consistent with the official Twitter documentation.

Notebook state saving/restoration