Collecting Tweets Using R and the Twitter Search API

Sentiment analysis and natural language processing (NLP) have always fascinated me, yet I never really understood the inner workings of this type of analysis and never made the time to dig into the science. Until recently, I didn’t even know that you could collect tweets for free using Twitter’s Search and Streaming APIs. A few days and several blogs later, I’ve set up R to work with both. Since much of the information was scattered across disparate websites, I thought I’d give a general recap here. This first post deals with using the Twitter Search API and R to collect tweets. Before I dig into the code, there are a few notes I want to touch on (which I only later learned from Twitter’s documentation).

 

What is the Twitter Search API?

The Twitter Search API, one of three such APIs (search, streaming, “firehose”), allows access to a subset of popular or recent tweets (roughly the past week). That is, it allows querying past tweets, though only a small fraction of all tweets. To me, this is a great way to get one’s feet wet collecting and cleaning tweet datasets. However, it doesn’t provide much utility for research, as the fraction of tweets received may not be representative of the entire tweet stream.

 

Who can access the Twitter Search API?

Anyone! That’s right…if you have an account, you can create an authorization token and get started with Big Brotheresque collection of people’s thoughts and locations (that’s right….locations).

 

How do I get started with Twitter Search API and R?

To be able to query the Twitter Search API and import the data into R, we’ll need to accomplish the following tasks:

  1. Sign up for Twitter & create an application.
  2. Install R and required R packages
  3. Understand the Twitter Search API query structure
  4. Run our first query, and save the results to a data frame

 

Sign up for Twitter & Create an Application

If you don’t yet have a Twitter account, head on over to twitter.com and grab yourself an account. The process is pretty self-explanatory. Once you have an account, we’ll need to set up an application (this will allow us to connect R to the Twitter stream).

The first step is to head on over to dev.twitter.com. After logging in (yes, you might be prompted for another login), click on your Twitter thumbnail (upper right-hand corner of the screen) and click on “My Applications.” In the following screen, click on “Create New App.” You’ll need a name, description, and website. I’ve used my blog address as a website, though I’d imagine that anything works.

Once created, click on “modify app permissions” and allow the application to read, write and access direct messages (this might come in handy later on). Lastly, click on the API Keys tab and scroll to the bottom of the page. Under token actions, click on “Create my access token.” We’ll need these access tokens when we fire up R. That’s it! We’re done with the Twitter part of this setup.

 

Install R and Required R Packages

If you have not yet installed R, head on over to r-project.org and install the version appropriate for your platform. I also highly suggest installing RStudio, an integrated development environment and GUI for R. I am running RStudio throughout the tutorial.

To use the Twitter Search API, we need the following packages installed:

Some of these packages have dependencies on other packages, so make sure you install all required packages before moving on. To install all of these packages in one run, just copy the following code and run it in R (or RStudio):
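A minimal sketch of that install step, assuming twitteR is the one package you strictly need (it provides the setup_twitter_oauth, searchTwitter, and twListToDF functions used in this post). The helper name missing_packages is hypothetical; add any other packages you want to the required vector.

```r
# Packages this sketch assumes; extend the vector as needed.
required <- c("twitteR")

# Hypothetical helper: return the packages in `wanted` that are not installed yet.
missing_packages <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

to_install <- missing_packages(required)

# Only install from an interactive session (R console or RStudio),
# so that sourcing this script in batch mode never triggers a download.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

install.packages pulls in declared dependencies by default, so listing the top-level packages is usually enough.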

Once these packages are installed, it’s time to set up our connection to the Twitter Search API. To do so, we’ll need to copy & paste our API credentials into either a text file, or, preferably, an R script. Just copy the following code into your script, noting the requirements in quotation marks:
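A minimal sketch of such a credentials script, assuming the four values from your application’s API Keys tab. The variable names and the has_placeholders helper are hypothetical; setup_twitter_oauth is the twitteR function that takes the consumer key, consumer secret, access token, and access token secret, in that order.

```r
# Placeholder credentials: replace each string with your own values.
api_key       <- "YOUR_API_KEY"
api_secret    <- "YOUR_API_SECRET"
access_token  <- "YOUR_ACCESS_TOKEN"
access_secret <- "YOUR_ACCESS_TOKEN_SECRET"

# Hypothetical helper: TRUE while any value still looks like a placeholder.
has_placeholders <- function(...) any(grepl("^YOUR_", c(...)))

# Only attempt to authenticate once real keys are in place and twitteR is installed.
if (!has_placeholders(api_key, api_secret, access_token, access_secret) &&
    requireNamespace("twitteR", quietly = TRUE)) {
  twitteR::setup_twitter_oauth(api_key, api_secret, access_token, access_secret)
}
```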

Keep the API keys in quotation marks, but remember to replace the text with your actual API and token keys. The setup_twitter_oauth function will create a connection to Twitter’s Search API. If you are successful, the following message should show up in your console:

[1] "Using direct authentication"

 

Understanding the Twitter Search API Structure

Our R instance is now ready to receive tweets from Twitter. However, before we can receive any information, we’ll need to understand the format of a Twitter Search query. Per Twitter, the best way to build a query and test if it’s valid and will return matched tweets is to first try it at twitter.com/search. This in essence uses the same API that we are calling. Once your results are accurate, we can load that search string into R.

The query has multiple operators and will behave in the following way:

Operator                     Behavior
Obamacare ACA                will find tweets containing both "Obamacare" and "ACA"; not case sensitive
"Obamacare ACA"              will find tweets containing the exact phrase "Obamacare ACA"; not case sensitive
Obamacare OR ACA             will find tweets containing either "Obamacare" or "ACA" or both; not case sensitive, though the OR operator itself IS case sensitive
Obamacare -ACA               will find tweets containing "Obamacare" but not "ACA"
#Obamacare                   will find tweets containing the hashtag "Obamacare"
from:BarackObama             will find tweets sent from Barack Obama
to:BarackObama               will find tweets sent to Barack Obama
@BarackObama                 will find tweets referencing Barack Obama's account
Obamacare since:2014-08-25   will find tweets containing "Obamacare" and sent since 2014-08-25 (year-month-day)
ACA until:2014-08-22         will find tweets containing "ACA" and sent before 2014-08-22

There are a few other query operators that you can review on the Twitter Search API documentation page, though the ones in the table above will suffice for this tutorial.

So now that we know the basic structure of a Twitter Search API query, let’s build one in R and run it. Remember, if you are building more sophisticated queries, it’s worth running it through twitter.com/search. Let’s say I’d like to run a query on tweets that have mentioned “Obamacare,” “ACA,” “Affordable Care Act,” or “#ACA.” I’d also like to run it with tweets since 2014-08-20, and I’d like the query to run until it returns 100 tweets. The following code should do the trick:
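A minimal sketch of that query, assuming the twitteR package and an authenticated session. The names search_terms, query, and run_search are hypothetical; searchTwitter and its n and since arguments come from twitteR.

```r
# Terms to match; the phrase keeps its double quotes so it is searched exactly.
search_terms <- c("Obamacare", "ACA", '"Affordable Care Act"', "#ACA")

# Join the terms with the (case-sensitive) OR operator.
query <- paste(search_terms, collapse = " OR ")

# Hypothetical wrapper around twitteR::searchTwitter; needs setup_twitter_oauth first.
run_search <- function(query, n = 100, since = "2014-08-20") {
  twitteR::searchTwitter(query, n = n, since = since)
}

# After authenticating, run:
# tweets <- run_search(query)
```

Building the string with paste keeps each term easy to add or remove, and printing query gives you something to test at twitter.com/search first.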

Running the code results in a list called “tweets” containing 100 elements. To make this list easier to read, let’s transform it into a data frame by running the following:
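A minimal sketch of the conversion, assuming a search result named tweets already exists in your session. twListToDF is the twitteR function that turns a list of tweets into a data frame; the guard around it is only there so the snippet does nothing when no search has been run yet.

```r
# Convert only if a search result named `tweets` exists and twitteR is installed.
if (exists("tweets") && requireNamespace("twitteR", quietly = TRUE)) {
  tweets.df <- twitteR::twListToDF(tweets)
  print(dim(tweets.df))  # number of rows and columns
}
```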

We now have a data frame (tweets.df) with 100 tweets. You’ll notice that the data frame contains 16 columns. While you can figure out on your own what the different columns mean, the following are likely the most important ones:

  • Text: the text of the actual tweet
  • Created: the date and timestamp of creation
  • ID: the ID of the tweet (useful when needing to remove duplicates)
  • Longitude and Latitude: if available, the longitude and latitude of the tweet.
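As a rough illustration of working with these columns, the sketch below uses a small made-up data frame (toy.df, not real tweets) and assumes the lowercase column names id, text, created, longitude, and latitude. It keeps only those columns, drops duplicate IDs, and then keeps the rows that carry coordinates.

```r
# Made-up stand-in for tweets.df, for illustration only.
toy.df <- data.frame(
  id        = c("1", "2", "2", "3"),
  text      = c("first", "second", "second", "third"),
  created   = as.POSIXct(c("2014-08-20 10:00:00", "2014-08-20 11:00:00",
                           "2014-08-20 11:00:00", "2014-08-21 09:30:00"),
                         tz = "UTC"),
  longitude = c(NA, "-118.24", "-118.24", NA),
  latitude  = c(NA, "34.05", "34.05", NA),
  stringsAsFactors = FALSE
)

key_columns <- c("id", "text", "created", "longitude", "latitude")

# Keep the first occurrence of each tweet ID and only the columns of interest.
unique_tweets <- toy.df[!duplicated(toy.df$id), key_columns]

# Keep only tweets that come with both coordinates.
geotagged <- unique_tweets[!is.na(unique_tweets$longitude) &
                           !is.na(unique_tweets$latitude), ]
```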

It’s important to note that the searchTwitter function allows users to enter a latitude/longitude pair. A search string that includes lat/lng is of the following format:
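A minimal sketch, assuming the twitteR convention that the geocode argument is a single string of the form "latitude,longitude,radius", with the radius given in mi or km. The helper names make_geocode and search_near are hypothetical, and the coordinates are an approximate point in downtown Los Angeles.

```r
# Hypothetical helper: build a "lat,lng,radius" string with the radius in miles.
make_geocode <- function(lat, lng, radius_miles) {
  paste0(lat, ",", lng, ",", radius_miles, "mi")
}

# Approximate coordinates for Los Angeles, 50-mile radius.
la_geocode <- make_geocode(34.0522, -118.2437, 50)

# Hypothetical wrapper around twitteR::searchTwitter; needs an authenticated session.
search_near <- function(query, geocode, n = 100, since = "2014-08-20") {
  twitteR::searchTwitter(query, n = n, lang = "en", since = since, geocode = geocode)
}

# After authenticating, run:
# tweets <- search_near('Obamacare OR ACA OR "Affordable Care Act" OR #ACA', la_geocode)
```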

When this query item is used, tweets returned will be of two types:

  1. Tweets that have a designated latitude/longitude specified
  2. Tweets whose users specified a location in their profile within the specified radius.

The search string above will provide up to 100 tweets in English that contain our search terms, were sent since 2014-08-20, and fall within 50 miles of Los Angeles.

That’s it! We’ve now saved our first set of tweets to a data frame within R. Much still needs to be done before we can actually do some interesting analysis on the text (like removing retweets, readying the text for sentiment analysis, processing the text and assigning sentiment, etc.). My next blog post will focus on using the Twitter Streaming API to capture tweets (different from the Search API in that we can capture tweets in real time, vs. doing a historical search).

For your reference, see the full code below:

 

46 Comments

  1. raffaele

    Hi Bogdan, thanks for the tutorial. I have to get tweets with R for a hashtag, but I have a big problem: I’m not able to get tweets older than ten days. Can you help me? Thanks.

    • bogdanrau

      Hi Raffaele. Unfortunately, the search API does not go back further than about 2 weeks. You can try the streaming API going forward for that. The only other solution is to actually buy that data from Twitter, which could get pricey.

  2. Gaurav

    How to detect spam tweets ?

    • bogdanrau

      I don’t believe there is a way to do that through the API. You’d need to build your own mechanism for detecting spam. The definition of spam is also different depending on your use case.

  3. SHUBHAM UPADHYAYA

    Hello Bogdan,

    This write-up has been very useful for me to understand the Twitter Search API. I want to extract only the tweet ID from Twitter rather than the whole tweet text and all the other stuff that comes with it, since when I work with a large number of tweets this API crashes very frequently due to the amount of data involved. So is it possible to extract only the tweet ID directly from Twitter?

    • bogdanrau

      I don’t believe it’s easy to just extract the tweet ID, however, you could just run your filterStream in a timed for loop (or a cron job if running on linux), then drop all columns with the exception of tweet id. Running filterStream on a timed loop or cron job would ensure that the overall dataset never gets big enough to crash your computer.

  4. Alishba

    Hey Sir. Thanks for the tutorial. Can you please tell me how to get the tweets of a specific user? I need to fetch tweets from another person’s account and I have no idea how to fetch that account’s tweets. Please help.

  5. Sami

    Hey Bogdan,
    Thanks for this post, it’s very helpful.
    I’m having problems running searchTwitter on tweets with an Arabic hashtag or Arabic keyword. It doesn’t return tweets with the searched word. I can search for Arabic tweets, but only by searching for an English keyword or a hashtag in English.

    would be great if you can help.

  6. Adrian F

    Hi Bogdan! First, thanks for your tutorial. It’s been really helpful and I almost completed the task, but I’m stuck at the end because the API doesn’t return any tweets. I get an error (if I query 3 tweets, for example): “3 tweets were requested but the API can only return 0”

    Would you know how to help me?

    Thanks in advance 🙂

    • bogdanrau

      Can you try requesting more than 3? What happens if you try to request 100? And what is the topic you are searching for?

  7. suhas

    Thank you for this post. It’s really helpful.

    I was able to get through the code.

    q1) I am requesting n=50000 but it’s providing only 20000.
    q2) I couldn’t get the lat and long location; it’s showing NA for all the tweets.

    Thank You

    • bogdanrau

      Only about 10% of tweets actually come geotagged (per some article posted a while ago…might have changed recently). This means that most of your tweets won’t come with a latitude/longitude. As far as requesting 50,000 and only getting 20,000, I’m not sure how to answer that. Seeing your code might help. It could be due to the fact that you are searching for a very specific term and there are only 20,000 tweets total. I believe the search API only goes back about 2 weeks.

  8. Heather Evans

    Thanks for posting this information and walking us through how to do this. I’m wondering — I’ve seen a couple different posts online about getting someone’s entire tweet history. Have you had any luck with that?
    I guess I could have it running all the time…..

    • bogdanrau

      The best I can come up with is the userStream function in the streamR package. You’d still need to have it running all the time, but you’d only be collecting from this one specific user vs. collecting all.

  9. Ronak Shah

    Hi,

    Can you help me in following a person with the twitter API in R? I have the id of the user and would like to follow them using API. I went through this (https://dev.twitter.com/rest/reference/post/friendships/create) document but was unsuccessful. I was using the POST method from httr package. However, I feel the URL which I was generating was wrong. I have also asked the same question on SO (http://stackoverflow.com/questions/37560164/follow-someone-using-twitter-rest-api-in-r) It would be great if you can help.

  10. Rosebud

    Hi
    Thanks for this brilliant post.

    I keep on running into errors intermittently like: Error in curl::curl_fetch_memory(url, handle = handle) :
    Couldn’t connect to server and Error in curl::curl_fetch_memory(url, handle = handle) :
    Timeout was reached
    How do I fix this?

    • bogdanrau

      I’ve never encountered those errors before, but it sounds like it might have a problem connecting to the server. Can you share your actual code? Hard to figure out what’s going on without seeing that.

  11. sharanya

    Hi
    This information was very much helpful.
    But, I need the tweets with the location mentioned, instead of the latitude and longitude. so that I can do some health care data analysis.
    Please help me in this regard

    • bogdanrau

      The resulting data frame often has location specified in the location column, though often times it is unreliable. Your best bet is still to filter by lat/lon and then use a service like Google’s geocoder or others to get the actual location of the user. Sorry I can’t be more specific but this answer is not paragraphs long, but rather pages long.

  12. Hello Bogdan,

    This write-up has been very useful for me to understand the Twitter Search API. I am facing an issue while running this script. It accepts all the code lines; however, after I punch in the last line, tweets.df <- twListToDF(tweets), nothing really happens. Do I need to punch in another set of code to view the data frame?

    • bogdanrau

      That should work. Do you get any error messages?

  13. Divye

    Hi
    First of all, this is a really helpful blog. It works perfectly, like a rocket; however,
    I have two questions.
    1. I am not able to search less popular words like tagbin (it shows 0 results, whereas there are posts containing tagbin).
    2. Why are we getting only 16 variables? When I run the API from the Ubuntu terminal and get a JSON, it contains 147 variables embedded in 25 elements. Some of the useful variables are information about the user who tweeted the post, whose details contain 41 variables like id, name, location etc. How do I access that information?

    Have you written a blog on parsing Twitter JSON files with R?
    Thanks,

    • bogdanrau

      Divye,

      Thanks for the note & the kind words! With regards to your questions:
      1. Neither the search nor the streaming API provide the entirety of tweets that Twitter collects. It is only a fraction (I believe streaming API is rumored to be ~2%). Less popular words like tagbin have a higher likelihood of not showing up. Have you tried the streaming API? I have a blog post about that as well.
      2. I’ve not played around with parsing the twitter json myself, though I’m sure it can be done using RJSON, or RJSONIO, or many of the other JSON-related packages out there. I’ve seen name and location show up in my streaming data. I’ve not used search for quite a while so Twitter may have changed the way they send that information in. Have you had a look at the API docs?

  14. Mark

    Thanks for the great post.
    Is it possible to only get tweets that has lat lng data?

    • bogdanrau

      Based on my experience, it’s not 100% possible as many times, you’ll need additional data management steps to weed out reports. Look into the streaming API as they have some information and suggestions on your question.

  15. Shovon

    Hi,
    Thanks for the walk-through. This approach worked for me while it was on my local machine. However, it just does not work when I publish it on shinyapps.io server.
    Diagnosis 1) On local server it waits for input to save session. Which could be solved by
    options(httr_oauth_cache=T) # Adding this line
    setup_twitter_oauth(your_cons … … …. )

    But, still it does not work on the shinyapp.io server.

    Probably I need to authenticate differently while the R-prog runs on shiny… any idea how to resolve this? Thank you.

    • bogdanrau

      Hmmm….are you able to see any of the console output? When you host it on an actual shiny-server, it outputs all of the console output to a log file. Does shinyapps output a log file? That would be very helpful in figuring this out.

  16. I am following the instructions above but am getting the error:
    “Error in check_twitter_oauth() : OAuth authentication error:
    This most likely means that you have incorrectly called setup_twitter_oauth()”

    I do get the message [1] “Using direct authentication”
    Use a local file to cache OAuth access credentials between R session?
    1: Yes
    2: No

    I have searched around on Google and it looks like it’s a problem with the version of the httr package, but I can’t seem to find a solution that will work. Any ideas?

    Thanks!

    • Wendy

      Hi

      This is a great blog and very useful.

      I am getting the same error as Brittany, any ideas?

      Cheers
      Wendy

      • Wendy

        Hi

        I found the solution was to use the library base64enc. Good explanation on github.

        Thanks
        Wendy

  17. Amanda

    This is helping me tremendously! However, before I even apply the geo parameters R tells me that “1000 tweets were requested but the API can only return 245.” I want all the tweets since my date, but just put in 1,000 to start the process. I’m using a very popular hashtag- so I don’t see why only 245 would be returned. I worry that the sample will shrink even more once I add geo parameters. Any help is most appreciated! Thanks.

    • bogdanrau

      I have a feeling this might be due to the fact that Twitter only gives you a sample of tweets, not ALL, so its basically saying that out of 1000 available, it was only able to get 245. Basically, there were 1000 total, but twitter is only giving you access to 245. This is total speculation, however.

  18. Luis

    Hi!

    Thanks for the guide, I find it very useful =)

    I’ve been doing some tests for a Master’s degree project. I’m trying to cross data between tweets and flights, but I think that the geocode function doesn’t work properly… for example:

    # Tweets near Dublin (see geocode)
    tweets_geolocated <- searchTwitter("onboard", n=10, lang="en", geocode="57.3906,-46.8855, 35mi", since="2015-05-06")
    tweets_geolocated.df <- twListToDF(tweets_geolocated)
    tweets_geolocated.df$text

    # Tweets near America (see geocode)
    tweets_geolocated <- searchTwitter("onboard", n=10, lang="en", geocode="53.5822,-11.6383, 35mi", since="2015-05-06")
    tweets_geolocated.df <- twListToDF(tweets_geolocated)
    tweets_geolocated.df$text

    — in both cases I get the same tweets and I don't understand that because the location from example 1 is very distant from example 2…

    Thanks

    • bogdanrau

      Hi Luis,

      Can you try naming the dfs uniquely? maybe tweets_geolocated_US and tweets_geolocated_Dublin? I’m thinking R might just overwrite dfs since you’re technically writing to the same file. Let me know how that goes!

  19. Brilliant stuff, can’t wait for the streaming API post, waiting….

  20. Rachit

    can we search the tweets based on city name ?

    • Twitter only gives lat/lon pairs, so I don’t believe you can search by city though I believe there was an option to search by the city in the user’s profile, not the actual location from where the user sent the tweet. What you CAN do however is take those lat/lon pairs and geocode them back to cities or any other administrative region for that matter.

  21. Roberto

    Good work, I’m waiting for the next post!

    Best

      • Dust

        Hey guys,

        What happens when you want to look for tweets in Arabic?
        R understands that…?

      • bogdanrau

        Interesting question. A quick search reveals that the search API has had some issues with Arabic and other languages. Has that been fixed? If so, you should be able to pass the characters in your search string the same way as you’d pass the characters in twitter’s API request. Let us know how that goes!
