Collecting Tweets Using R and the Twitter Streaming API


This post has been a long time coming: I've gotten many good questions and requests via comments and email for a streaming API tutorial, so here it is! This second post is a follow-up to my initial Collecting Tweets Using R and the Twitter Search API post. As always, before I dig into the code, below are some notes I want to touch on.

 

What is the Twitter Streaming API?

The Twitter Streaming API, one of three Twitter APIs (search, streaming, and "firehose"), gives developers (and data scientists!) access to multiple types of streams (public, user, and site). The key difference is that the streaming API collects data in real time, as opposed to the search API, which retrieves past tweets.

 

What is the difference between the different types of streams?

Twitter explains it best, though I'll give a short recap here. Public streams are streams of public data that flow through Twitter. You can use these for following specific users or topics, and for data mining (cha-ching!). User streams contain data corresponding to a single user's stream. Lastly, site streams are a multi-user version of the user stream.

 

Who can access the Twitter Streaming API?

As with the search API, anyone who creates a dev account can access live streaming data. See my search API blog post for how to sign up for a dev account.

 

What R packages will we need?

To use the Twitter Streaming API, we will need to install the following packages: streamR, ROAuth, and RCurl.

Some of these packages have dependencies on other packages, so make sure you install all required packages before moving on. To install all of these packages in one run, just copy the following code and run it in R (or RStudio):
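The snippet below is a minimal sketch, assuming only the three packages used in the rest of this post (streamR for the streaming functions, ROAuth for the OAuth handshake, and RCurl for the SSL certificate bundle):

# Install the packages used in this post, along with their dependencies.
install.packages(c("streamR", "ROAuth", "RCurl"), dependencies = TRUE)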

The next step varies slightly from using the twitteR package, in that we need to set up an OAuth handshake. This is accomplished using the ROAuth package. See the code below. NOTE: you will only need to do this once, so long as you save the result to an .Rdata file (also included in the code). Also, please make sure that you run parts 1 and 2 separately (don't run the entire code at once). After you run part 1, a browser window will open requesting that you authorize the application. Once you click on authorize, copy the PIN provided by Twitter into the R console and hit enter. Then run part 2.
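A sketch of the two-part handshake, assuming the standard ROAuth pattern; the consumer key and secret are placeholders for the values from your Twitter dev app, and my_oauth.Rdata is just an example file name.

# PART 1: build the OAuth credential and start the handshake.
library(ROAuth)

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL  <- "https://api.twitter.com/oauth/access_token"
authURL    <- "https://api.twitter.com/oauth/authorize"
consumerKey    <- "YOUR_CONSUMER_KEY"    # placeholder: from your app at dev.twitter.com
consumerSecret <- "YOUR_CONSUMER_SECRET" # placeholder

my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
                             consumerSecret = consumerSecret,
                             requestURL = requestURL,
                             accessURL = accessURL,
                             authURL = authURL)

# A browser window will open asking you to authorize the application;
# paste the PIN Twitter gives you into the R console when prompted.
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

# PART 2: save the credential so the handshake never has to be repeated.
save(my_oauth, file = "my_oauth.Rdata")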

Now that the handshake is set up, we can start with a clean (empty) R script and begin collecting data. This is done with the filterStream function from the streamR package, which takes the following parameters (an example call follows the list):

  • file.name = the name of the file to which tweets will be written.
  • track = a string or character vector of keywords to track.
  • follow = a string or vector of Twitter user IDs, used when we only want to track tweets from specific users.
  • locations = a vector of longitude/latitude pairs (southwest corner coming first) specifying a set of bounding boxes to filter incoming tweets.
  • language = a string or vector of BCP 47 language identifiers.
  • timeout = the maximum length of time, in seconds, to keep the connection open. Setting timeout = 10 ends the connection after 10 seconds. The default is 0, which keeps the connection open indefinitely.
  • tweets = the number of tweets to collect. For example, if you only want to collect 100 tweets, you could leave timeout = 0 and specify tweets = 100; the connection would end after the 100th tweet is collected.
  • oauth = the OAuth credential object we created during the handshake (my_oauth).
  • verbose = TRUE or FALSE; if TRUE, prints information about the capturing process to the R console.
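The example below is a minimal sketch of how these pieces fit together; the search terms and file name are illustrative (they mirror the ones discussed in the comments below).

library(streamR)
load("my_oauth.Rdata")  # the OAuth credential saved during the handshake

# Collect 60 seconds of English-language tweets about the ACA and append them to a JSON file.
filterStream(file.name = "tweets.json",
             track = c("Affordable Care Act", "ACA", "Obamacare"),
             language = "en",
             timeout = 60,
             oauth = my_oauth,
             verbose = TRUE)

# Parse the JSON file into a data frame; simplify = FALSE (the default) keeps
# the geolocation (latitude/longitude) fields.
tweets.df <- parseTweets("tweets.json", simplify = FALSE)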
You'll notice that I also used parseTweets in the code above. This goes through the JSON file and converts all of the information into a data frame. The simplify argument, when left at FALSE (the default), keeps the geolocation information (latitude/longitude) in the data frame.

We could also specify exactly where we want to pull tweets from. The code below requests tweets from an area that roughly approximates Los Angeles County. Note that with a small timeout value we likely won't get any geotagged tweets at all.
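A sketch of that call, using the bounding box quoted in the comments below (longitude/latitude of the southwest corner first, then the northeast corner); the file name and five-minute timeout are illustrative.

# Collect tweets geotagged inside a box that roughly covers Los Angeles County.
filterStream(file.name = "tweets_la.json",
             locations = c(-119, 33, -117, 35),
             timeout = 300,
             oauth = my_oauth)

tweets_la.df <- parseTweets("tweets_la.json", simplify = FALSE)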

Some things to keep in mind:

  • The tweets.json file will NOT be overwritten. Instead, information is appended to it. This is important, as the file could get quite large over time. For new searches, I suggest using a separate file, or deleting the old one before running the filterStream function.
  • Not all tweets will have lat/lon information in them. It's up to us to filter out the non-geocoded tweets.
  • There is a LOT more information available in the streamR CRAN documentation.
  • There is a LOT more information available on the Twitter streaming API documentation.

And there you have it! Using the streamR package, we've saved near real-time tweets to a JSON file and converted that file to a data frame. Happy analyzing!

 


41 Comments

  1. Khandis

    So brilliant, thanks for the blog.

  2. Vidya

    Hi! Thanks for this blog. You mentioned in the above comment that we can try and store the tweets as they come in into MySQL tables. I am doing something similar, but I want to write the tweets into elasticsearch as and when they come in. My problem is, if I give timeout = 0 and file.name = "", I am not able to see/print/access/process any of the tweets as they come in. My code just says "Capturing tweets …".
    Here is the line of code I am using.
    filterStream(file.name = "", track = c("Obama"), timeout = 0, oauth = my_oauth, verbose = TRUE)

    Is there any way for me to store the tweets into an elasticsearch/mysql table as soon as each tweet comes in?

    • bogdanrau

      There are a few options, I think. One would be to actually save the tweets to a file (so use the file.name option). You could then parse what's in the file on a regular basis (every x minutes?) and import them into elasticsearch/mysql. The other option would be to not use timeout. So turn on streaming with a timeout of 10 minutes, import, turn on for another 10 minutes, import, and so forth. This would require either a cron job or a scheduled task.

  3. utsav

    Hello

    Thank you for this post. I am a newb here and this really helped me. I have a couple of questions though.
    1) Is there a solution where I can just continuously download, append and parse tweets real-time?
    2) How can I download tweets from Brexit day? I'm doing a rookie analysis on it. Or do you know any place I can just download the REST data of that day.

    This would really save my life. Looking forward to your reply.

    • bogdanrau

      1. I think you’d need to save them to a db somewhere. You can use the streaming API to continuously stream tweets to a file, which then can be parsed and stored by a separate process.
      2. You could use the search API instead of streaming. This post discusses the search API, though it might be outdated. I also know that there are a variety of new packages out there that might help in that process. Keep in mind that, to my knowledge, the search API only goes about 2 weeks in the past. Anything more than that and you’d need to go to one of Twitter’s data services (and pay $$$).

  4. Ajay

    One of the best tutorials to capture Twitter Stream data. We had two hours to set this up to download tweets for the 3rd debate and the code worked seamlessly.

    Great work!

    • bogdanrau

      Glad I could help! Are you posting any of your findings online? Feel free to share here if you can. I’d love to see some analysis on that debate.

  5. CarlosFra

    Hi
    I am having an issue.
    After running the code, I have:
    “Capturing tweets…
    Connection to Twitter stream was closed after 1 seconds with up to 1 tweets downloaded.”
    The API was set up correctly, with no modification of your code.
    Any idea?
    Thanks for your help

    • bogdanrau

      Have you tried changing the search term? It might be because you are searching for an empty string.

  6. Hello, first, thanks for the example. I have already used twitteR before. Recently I'm digging into the streaming API. So my question is, when we say "the search API goes back in time" and "the streaming API is for real-time data", how does this historical data work? I mean, if I run my script with the search API now, I'm able to get tweets posted like 20 minutes ago, right? So how can a tweet be in the streaming API and not in the search API? Why can't we just use the search API instead?

    • bogdanrau

      The difference between streaming and search is that streaming is a real-time, always-on API service. This means that if you start listening to tweets between 10:00 and 11:00, you will get ~1% of Twitter's stream during that 1-hour interval, and nothing else. Nothing before 10:00 or after 11:00. The search API searches Twitter's historical tweets within the last ~2 weeks, meaning that if you run the search API at 10:00, you'll get search results from ~2 weeks ago up until 10:00. I'm not sure if the search API returns ~1% or not, and I don't think the documentation mentions this either. I hope this clarifies the difference.

  7. Katrin

    Hi Bogdan,
    Thanks so much for your explanations. They were very helpful and the code runs perfectly.
    I have a question regarding data collection and storage. I would like to start collecting tweets and am unsure about the right way to do it. As I understand it, the streamR package collects data in real time, so if I set a time period (say, an hour), I would have to restart the code every hour to get all tweets? Is that correct, or is there another (automated) way to do it? I read that you suggest the streamR package and filterStream() function to collect data. What would be the advantage over the twitteR package and the searchTwitter() function if you want to collect longer (and complete) time series?
    Thank you!

    • bogdanrau

      To do it purely in R, you'd need to run it on a server and run the R data collection script on a cron job. There are other tools out there made for ingesting data from streaming APIs (Apache Hive), but I am not familiar with that implementation. The difference between twitteR and streamR is that the former uses Twitter's SEARCH API, which only goes back about 2 weeks or so. streamR uses Twitter's STREAMING API, which only collects data in real time (no historical data). You could use both at the same time, though you'd need to de-duplicate.

      Hope this helps!

      Bogdan

  8. jeremy

    Hi Bogdan,

    Do you know how to run this on Shiny servers without having to verify the PIN each time?

    Thanks,
    Jeremy

    • bogdanrau

      Hi Jeremy,

      What would be the use case here? Collect and display, or allow users to specify what to collect? You can store the credentials locally after the handshake and load them each time you make a call to the API.

      Bogdan

  9. Francesco Piccinelli

    Hey Bogdan! It’s all great but what if I wanted tweets within two different dates?

    • bogdanrau

      Hi Francesco,

      You can use the search API to get tweets from about 2 weeks back. I don't believe they have an API where you can historically mine the data. So your options are to start collecting now (with streaming) until you have a large enough sample, or to use streaming and add the search API data on top. There are a few data vendors that are authorized by Twitter to sell historical data. Depending on your budget, you might want to reach out to them. Have a look at gnip: https://gnip.com/historical/

  10. Amar

    I am not able to proceed after this line.
    my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
    After running this line, browser opens for login and then it gets disconnected.

    Any thoughts on what is happening and how to proceed?

    • bogdanrau

      Could you try logging into twitter first before running the code? Hard to say what’s going on without seeing the code in action.

  11. brigitte

    Thank you for this example.
    I think there is something misleading in your example.
    filterStream(file.name = "tweets.json", # Save tweets in a json file
    track = c("Affordable Care Act", "ACA", "Obamacare"),
    language = "en",
    location = c(-119, 33, -117, 35),
    timeout = 60,
    oauth = my_oauth)

    According to the Twitter streaming API webpage, this won't work as you want, because with the streaming API you can only choose EITHER location OR words (track); you can't search for both at the same time. I.e., this should give you all tweets in the bounding box rather than those with the terms "Affordable Care" etc.
    Or did I misunderstand the streaming API documentation?

    If you were to use twitteR instead to look into the last 2 weeks, how would you set up the authentication? The searchTwitter() function doesn't take the oauth = my_oauth argument. I have been trying for a while. Any help would be appreciated.

    • bogdanrau

      Brigitte,

      Thank you for pointing that out, and you are correct. It looks like the streaming API matches on EITHER a location bounding box OR a search term, but not both. This might have been a change in their API. They do mention that additional filtering steps need to be taken, so you could just do the filtering in R (i.e., collect all tweets in a location, then filter those again using grep or something else). With regard to twitteR, see this: http://bogdanrau.com/blog/collecting-tweets-using-r-and-the-twitter-search-api/. It was written a while ago so it might not be fully relevant, though it should hopefully help.

  12. Sruteesh

    Hi,
    Very Helpful Article.
    I have a small query. I tried running the above code for 120 seconds and was able to stream only 5 tweets, which I feel is very slow, whereas searchTwitter from the twitteR package is much faster. Can you explain how streaming tweets using the above-mentioned method is useful?

    Thanks

    • bogdanrau

      The twitteR package I believe uses the SEARCH API, which can only look into the past about 2 weeks or so. The streamR package allows you to take in “real-time” tweets as opposed to searching for them in the past. Depending on what your search terms are, you may get few records if any.

  13. Chris

    You have a typo in part 1, line 15. You are missing a closing ).

  14. Siya

    Hi,

    Everything went well following the instructions from your first blog, as I am able to connect and retrieve archived tweets, but I have been unable to connect to Twitter for the live feed option.

    After declaring the API credentials, I run this final code: y_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")), and after a while, it returns this error:

    Error in function (type, msg, asError = TRUE) :
    Failed to connect to api.twitter.com port 443: Timed out

    and this is the traceback
    7: fun(structure(list(message = msg, call = sys.call()), class = c(typeName,
    "GenericCurlError", "error", "condition")))
    6: function (type, msg, asError = TRUE)
    {
    if (!is.character(type)) {
    i = match(type, CURLcodeValues)
    typeName = if (is.na(i))
    character()
    else names(CURLcodeValues)[i]
    }
    typeName = gsub("^CURLE_", "", typeName)
    fun = (if (asError)
    stop
    else warning)
    fun(structure(list(message = msg, call = sys.call()), class = c(typeName,
    "GenericCurlError", "error", "condition")))
    }(7L, "Failed to connect to api.twitter.com port 443: Timed out",
    TRUE)
    5: .Call("R_post_form", curl, .opts, .params, TRUE, matchPostStyle(style),
    PACKAGE = "RCurl")
    4: .postForm(curl, .opts, .params, style)
    3: postForm(url, .params = params, curl = curl, .opts = opts, style = "POST")
    2: oauthPOST(.self$requestURL, .self$consumerKey, .self$consumerSecret,
    NULL, NULL, signMethod = .self$signMethod, curl = curl, handshakeComplete = .self$handshakeComplete,
    ...)
    1: my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem",
    package = "RCurl"))

    Can you please help, as I also did set setup_config(use_proxy(url = 'xxxx', port = 8080, username = 'xxxx', password = 'xxxxx')).
    I did also change the port to 443, but the same problem persists.

    • bogdanrau

      Some search results for that error reveal that some IP addresses might have been blacklisted. Given that you were able to search the archive, I’m not certain that might be the case. Can you try switching from https to http in your url?

  15. Aidan

    This is really fantastic! Thanks for making it so easy. Never used R before but got a harvester up and running in about 15 minutes.

    I wonder, instead of a handshake over http, could that be done headless by getting one’s accessToken and accessTokenSecret from dev.twitter.com and passing them as variables?

    • bogdanrau

      Have you tried it? Would be interested in this as well! I mainly do R so connecting headless to APIs is not something I’m very versed in.

  16. julka

    Dear Bogdan,

    Thank you very much for this, it's really great! However, when I get to this line I get an error message (please see below):
    my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
    Error in function (type, msg, asError = TRUE) :
    SSL certificate problem: unable to get local issuer certificate

    I have been looking on Stackoverflow, but can’t find a reply (granted I’m quite a noob at R, so I might be not looking for the right thing).

    Thank you for your help in advance.

    Best,

    Julia

      • Siya

        I did try to download the cert again:

        download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")

        but the same problem persists: "Error in function (type, msg, asError = TRUE) :
        Failed to connect to api.twitter.com port 443: Timed out"

      • Dorris

        Hi Bogdan,

        Thanks for the tutorial on using R. I want to collect Twitter data via the streaming API using a server, but I don't know where to start. Do you know of any good tutorials on this?

        Thanks!

      • bogdanrau

        Hi Dorris,

        I’ve not found any tutorials online regarding your question, however, I’ve dabbled a bit in doing just that. You’ll want to write a script that collects the data and writes it to a database (or file), and run that either on a cron job, or continuously (which might cause problems since the API sometimes goes down).

        Bogdan

  17. Laila

    Hello! I am following the same exact steps, but when I reach part one of the code (the handshake with the Twitter API), RStudio gets stuck and no browser appears for authentication! Please advise!

    • bogdanrau

      I’m not really sure why that would happen. I have a feeling that might be an issue with your browser. Try searching for it on stackoverflow and see if anyone else experienced the same issue.

    • Alina

      Hey!
      Your problem is that after running the code line my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")) it opens the website, you click the allow-application button, and then you get blocked with the message "No page available"? I had this problem, and the solution is to remove the callback URL from the Twitter application. Set the callback URL only when you want to download historical data.
      Best,
      Alina

      • bogdanrau

        Thanks for jumping in Alina!

  18. Hey Bogdan,

    I'm streaming tweets by location, setting the locations argument, and it works fine. The structure of my code is like yours: I first stream, writing a file to memory, and then parse and write the file to memory again. I would like to run the code on a Windows server and leave the connection open all the time by setting the timeout argument equal to 0.

    Any idea how to write the code so that it'd write a file every hour while leaving the connection open? I've been trying around with a second R script loading the json file and parsing it, but I didn't manage to get it to work. Any hint is highly appreciated!

    Cheers,
    Jessica

    • bogdanrau

      Hi Jessica,

      That’s an excellent question, and one that I’ve thought about myself, though haven’t had time to investigate. This is the way I would try and develop a solution (there’s never a single solution so this is probably one of many):

      1. The code to start collecting should stay the same as long as timeout = 0. I’m not sure if there are any rules on Twitter’s end that you’d need to be aware of, so definitely check their API docs for that.

      2. You'll need to create a function that uses the Sys.time() function in R. You basically set your start time at hh:mm:ss, and that function would copy your "collection" file into a separate file (you could use a loop to name it accordingly, maybe including some date/time info), then empty the "collection" file after copying so that R can continue filling it.
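      A rough sketch of that idea (the file names and the one-hour interval are just placeholders, and this would run in its own R session alongside the one that is streaming with timeout = 0):

      # Every hour, archive the live collection file and empty it so the
      # stream keeps appending to a fresh file. Tweets that arrive between
      # the copy and the truncation could be lost.
      repeat {
        Sys.sleep(60 * 60)
        archive <- paste0("tweets_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".json")
        file.copy("tweets.json", archive)
        file.create("tweets.json")  # truncates the existing file
      }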

      I'm not sure if this makes sense, or if there's a better way to do it. The other way that comes to mind is to use RMySQL to write to SQL tables that you have set up (you'd need a server for that). So basically, each time a new tweet comes in, write it directly to the database. This would keep updating your tables indefinitely, so long as nothing breaks.

      Hope this helps! Please do let me know if you find a solution. Very intriguing question!

      Cheers,
      Bogdan

  19. Thank you, it works! It saves my time and I can happily analyze. Thank you. 🙂
