Twitter Scrapers Are All Broken. What Should We Do?
Edit April 9, 2021:
Hey! Update here. At this point, I don’t see a way forward with libraries like Taspinar’s Twitterscraper repo, which scraped large volumes of tweets quickly, due to Twitter’s changes in 2020.
This article walks readers through Selenium and how to implement it with Twitter. That being said, if you’re trying to get some data ASAP and don’t need to learn about Selenium right now, @Altimis has a repo that wraps up Selenium nicely for Twitter. Here’s the link to that repo: https://github.com/Altimis/Scweet
Whenever Twitter updates its front end, scrapers break, data scientists groan, and temporary fixes are needed. In this article, let’s quickly go through the reasons why an individual would scrape data instead of use an official API. We’ll also cover temporary remedies via headless browsers and why this (inefficient) remedy should be temporary at best.
Why Scrape?
Last week, I needed historical Twitter data for a single query, yet it seemed that every Twitter-scraper repo had broken! I needed a fix before the talented few on the internet pushed changes to these repos and decided this was a good chance to talk a bit about scraping and headless browsers and to write a quick Selenium implementation for others in the same predicament.
As I’ll note throughout this post, Selenium is possibly the least optimal method of collecting Twitter data apart from by hand — nevertheless, it’s a wonderful gateway into a whole host of other topics, from network analysis and natural language processing to, of course, web scraping.
Approaching scraping for newcomers
When data needs to be gathered in an organized format, an individual often has the option between using the official API of the site in question or scraping the data themselves. Unfortunately, the reasons to choose one option over another are oftentimes not discussed. I’ve met individuals who’ve written scrapers but have never gone near an API — as well as others (more commonly) who’ve worked with APIs but never built a scraper before.
Therefore, when approaching scrapers, the first things to talk about are ethics, responsibilities, and best practices, topics that rarely come up in the classroom.
Unlike the sanitized, orderly data that students usually work with, scraped data requires know-how about how to collect it responsibly as well as how to effectively clean and transform it.
Before diving into the following code, I encourage you to read James Densmore’s piece “Ethics in Web Scraping.” It’s extremely short and serves as a no-nonsense orientation for newcomers to web scraping. While all of the guidelines in his piece are key, one that I often emphasize is:
“Scrape for the purpose of creating new value from the data, not to duplicate it.”
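One practical habit that goes hand in hand with those guidelines is checking a site’s robots.txt before you scrape it. Here’s a minimal sketch using only the Python standard library (the URL and path are placeholders, not Twitter-specific advice):
from urllib import robotparser

# placeholder site; swap in the domain and path you actually intend to scrape
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/some/path"):
    print("robots.txt allows fetching this path")
else:
    print("robots.txt disallows this path -- reconsider, or use the official API")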
Twitter Scrapers Down
Twitter’s API limits users to queries of tweets posted within the last seven days. For a longitudinal analysis, this is extremely limiting, and accessing data over a week old raises a whole new host of issues regarding the pricing and scale of the data being collected.
Twitter scrapers allow individuals to gather data from any point in time and have the functionality to accept start and end dates when querying for tweets.
Unfortunately, these scrapers rely on Twitter’s front end, meaning that when the front end changes, the scrapers stop working. The GitHub issue threads on these repos show plenty of discussion about particular scrapers breaking after recent changes.
A Temporary Selenium Solution
Generally, an individual interested in gathering data from a site should try (in the following order):
- Looking through the official site API and documentation.
- Taking a look into network traffic and requests. (I’d recommend trying Postman for testing.)
- Working with libraries like Scrapy or bs4 + requests. (There are other, similar options; these two just happen to be Python libraries. See the short sketch after this list.)
- Working with a headless browser like Selenium.
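To make option three concrete, here’s a minimal requests + BeautifulSoup sketch against a placeholder page. It only works for server-rendered HTML; Twitter’s search results are rendered with JavaScript, which is exactly why we end up at option four.
import requests
from bs4 import BeautifulSoup

# placeholder URL; fetch the raw HTML of a server-rendered page
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# pull out the page title and the first few link targets as a quick demo
print(soup.title.get_text() if soup.title else "no title found")
print([a.get("href") for a in soup.find_all("a")][:10])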
In this particular case, I needed a quick solution to gather tweets from a single advanced search query over the past three months. To be clear: Selenium is the least ideal and least practical option, particularly for Twitter, which has so many libraries built for gathering its data. But with none of them working at a moment’s notice … Selenium.
I’d also recommend checking out Selenium’s documentation directly — it’s very straightforward in directing the reader to the most important aspects of scraping using Selenium.
Scraping using Selenium
The scraping function itself is very straightforward. First, initialize the webdriver (which automates usage as a user would) and create an empty DataFrame to populate.
Heads up: there’s a great library for setting up browsers without having to deal with matching Chrome versions (link here).
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd

# setting the chromedriver path and initializing driver
browser_path = 'C:/vnineteen/chromedriver.exe'
driver = webdriver.Chrome(executable_path=browser_path)

# create master df to append to
master_df = pd.DataFrame()
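As an optional tweak (not required for the rest of the walkthrough), Chrome can also run headless so no browser window pops up. The flags below are the commonly used ones, though exact behavior can vary by Chrome version:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')               # run without a visible window
options.add_argument('--window-size=1920,1080')  # give the page a realistic viewport

# same chromedriver path as above
driver = webdriver.Chrome(executable_path=browser_path, options=options)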
Also, create a function that waits a certain number of seconds before continuing the script. This is a very important part of scraping with Selenium: it makes your scraper behave more like a real user and keeps it from overburdening the site with its actions.
from time import sleep
import random

def sleep_for(opt1, opt2):
    time_for = random.uniform(opt1, opt2)
    time_for_int = int(round(time_for))
    sleep(abs(time_for_int - time_for))
    for i in range(time_for_int, 0, -1):
        sleep(1)
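The countdown loop above sleeps one second at a time, which is handy if you ever want to print a countdown while waiting. If all you need is a single randomized pause, a simpler drop-in variant (same idea, not the original helper) would be:
from time import sleep
import random

def sleep_for(opt1, opt2):
    # pause for a random number of seconds between opt1 and opt2
    sleep(random.uniform(opt1, opt2))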
Next, loop through the Twitter advanced search URLs you’ve created (copy and paste the URLs from your searches into the urls list variable), scroll down the page a specified number of times, create a list containing each tweet, and append this list-turned-DataFrame to the master DataFrame.
There’s also a variable to define called post_element_xpath. This is the element for each tweet; we’ll use it to retrieve each tweet after scrolling down the page. Note that the following code appends to a pandas DataFrame row by row, which is inefficient but conceptually easy to follow, especially if you’re coming from a humanities + Excel background rather than a computer science one. A slightly more efficient pattern is sketched just after the code.
from progressbar import ProgressBar

pbar = ProgressBar()

urls = ['https://twitter.com/search?q=simpsons%20(predicted%20OR%20covid)&src=typed_query']

# how many times should the browser scroll down
scroll_down_num = 5

# the element we are obtaining from the webpage
post_element_xpath = '//div/div/article/div/div'

# loop through your list of urls
for url in pbar(urls):
    driver.get(url)
    sleep_for(10, 15)  # sleep a while

    # scroll x number of times
    for i in range(0, scroll_down_num):
        # scroll down
        driver.find_element_by_xpath('//body').send_keys(Keys.END)
        sleep_for(4, 7)

    # get a list of tweets
    post_list = driver.find_elements_by_xpath(post_element_xpath)

    # get the text only from each element
    post_text = [x.text for x in post_list]

    # create temporary dataset of each tweet
    temp_df = pd.DataFrame(post_text, columns=['all_text'])

    # append the temporary dataset to the dataset we will save
    master_df = master_df.append(temp_df)
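As noted above, appending to a DataFrame inside a loop is the slow way to do this. A slightly more efficient pattern, reusing the same driver, urls, and helpers defined above, is to accumulate a plain Python list and build the DataFrame once at the end:
all_text = []

for url in ProgressBar()(urls):
    driver.get(url)
    sleep_for(10, 15)

    # scroll down the same number of times as before
    for i in range(0, scroll_down_num):
        driver.find_element_by_xpath('//body').send_keys(Keys.END)
        sleep_for(4, 7)

    # extend the running list with the text of each tweet on the page
    post_list = driver.find_elements_by_xpath(post_element_xpath)
    all_text.extend(x.text for x in post_list)

# build the DataFrame once, after all URLs have been visited
master_df = pd.DataFrame(all_text, columns=['all_text'])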
Regardless of whether you’re signed in, Twitter limits the number of scroll-downs possible, meaning that if you want to access a greater number of tweets, I’d suggest generating a list of start and end dates to query by editing the URL string you pass to driver.get().
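For example, a small helper along these lines (hypothetical, not part of the original script; the since:/until: operators reflect Twitter’s advanced-search syntax at the time of writing) could split a date range into windows and build one search URL per window to feed into the loop above:
from datetime import date, timedelta
from urllib.parse import quote

def build_date_window_urls(query, start, end, window_days=7):
    # split [start, end] into windows and build one advanced-search URL per window
    urls = []
    window_start = start
    while window_start < end:
        window_end = min(window_start + timedelta(days=window_days), end)
        full_query = f"{query} since:{window_start} until:{window_end}"
        urls.append("https://twitter.com/search?q=" + quote(full_query) + "&src=typed_query")
        window_start = window_end
    return urls

# e.g. weekly windows from July through September 2020
urls = build_date_window_urls("simpsons (predicted OR covid)",
                              date(2020, 7, 1), date(2020, 10, 1))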
The tweets we’ve gathered will have the user handle, tweet text, number of likes, replies, etc. as separate lines within each string. Therefore, to parse the resulting data, use .splitlines() to create a list of elements, one per line returned from the scrape.
Here’s an example of the tweet text we’ve gathered.
Jeremy Gothem
@jeremygothem
·
Jun 4
Hotel Pivots - Hotels are repurposing their rooms during COVID-19
1
Since the elements vary depending on the kind of tweet and profile (one tweet might contain six elements, another might contain eight), add some flexibility into your parsing, allowing the script to search for the proper element within a limited range of the list index.
Run the parsing function below, applying it to the all_text column of your DataFrame, then export the data set.
def parse_text(text):
    # split by new line
    text_list = text.splitlines()

    # get the username (always the first list element)
    username = text_list[0]

    # within the first few elements, find the element
    # with the @ symbol; this will be the user handle
    handle = ''.join(x for x in text_list[1:3] if '@' in x)

    # get the date, using the single dot to identify its
    # index location
    dot_position = text_list[1:4].index('·')
    date = text_list[dot_position + 2]  # date comes after dot

    # check if it's a reply to someone else
    if text_list[4] == "Replying to ":
        reply_to = True
        reply_to_handle = text_list[5]
        text = text_list[6]
    else:
        reply_to = False
        reply_to_handle = ''
        # find the longest string within list index 4:6;
        # this will be the tweet text
        text = max(text_list[4:6], key=len)

    # return the variables we have parsed from the text
    return pd.Series([username, handle, date, reply_to, reply_to_handle, text])


# run the parse function via pandas apply
master_df[['username', 'handle', 'date', 'reply_to',
           'reply_to_handle', 'text']] = master_df['all_text'].apply(parse_text)

# export csv
master_df.to_csv('output.csv')
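Because tweet layouts vary, parse_text will occasionally hit a list index that doesn’t exist. A defensive wrapper (my addition, not part of the original script) keeps one odd tweet from crashing the whole run by falling back to empty values:
import numpy as np

def safe_parse_text(text):
    # fall back to a row of NaNs if a tweet doesn't match the expected layout
    try:
        return parse_text(text)
    except (IndexError, ValueError):
        return pd.Series([np.nan] * 6)

master_df[['username', 'handle', 'date', 'reply_to',
           'reply_to_handle', 'text']] = master_df['all_text'].apply(safe_parse_text)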
Going Forward
Twitter is a great place to start if you’re learning about data mining, natural language processing, or network analysis. While the Twitter API is perfectly good for getting started (and you can use libraries like Tweepy to ease into it), longitudinal analysis is sadly a no-go using the API, and I hope this code can serve as a makeshift remedy if Twitter scrapers aren’t working.
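If you do start with the official API, the Tweepy pattern looks roughly like this (a sketch based on Tweepy 3.x; the keys are placeholders, and remember the standard search endpoint only reaches back about seven days):
import tweepy

# placeholder credentials from a Twitter developer account
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# standard search: recent tweets only (roughly the last seven days)
for tweet in api.search(q="simpsons (predicted OR covid)", count=100):
    print(tweet.created_at, tweet.user.screen_name, tweet.text)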
Again, I’d recommend automating the start and end dates to query more data with this scraper; I was able to gather ~3,000 tweets from July through October for a pretty niche topic.
Scweet repo script (from April edit up top)
- make a folder & cd into it
- git clone https://github.com/Altimis/Scweet.git
- run script below
import pandas as pd
import os

# sets the path to the current folder you're cd'ed into
folder_loc = os.path.dirname(os.path.realpath("__file__"))
os.chdir(folder_loc)

# make sure you git cloned the repo into your working directory folder
from Scweet.Scweet.scweet import scrap

# set parameters
list_handles = ['theo_goe', 'handle2', 'etc']
output_file_name = 'output_file.csv'
start_date = "2020-11-01"
max_date = "2020-12-15"

# interval of how many days to search in between
interval_in = 30

# make an empty dataframe to fill
tweets_df = pd.DataFrame()

for handle in list_handles:
    # https://github.com/Altimis/Scweet
    data = scrap(start_date=start_date, max_date=max_date, from_account=handle,
                 interval=interval_in, headless=True, display_type="Top",
                 save_images=False, resume=False, filter_replies=True, proximity=True)
    tweets_df = tweets_df.append(data)

tweets_df.to_csv(output_file_name)