End to End Example Twitter Study

numeroteca edited this page Sep 24, 2021 · 3 revisions

[DRAFT]

As a motivating example, we will gather tweets about a breaking news trend. We will use the Search API for older data and the Streaming API for new tweets, then process the results to extract extra metadata, analyze the data, prepare the dataset for publication, and publish the results.

Exploratory Search

https://github.com/igorbrigadir/twitter-advanced-search

Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls

https://arxiv.org/abs/1403.7400

Identifying Good Search Keywords and/or Accounts

https://firstdraftnews.org/latest/sources-and-keywords-the-fundamentals-of-online-newsgathering/

Writing Search and Streaming Queries

https://developer.twitter.com/en/docs/tutorials/building-high-quality-filters

https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query

https://developer.twitter.com/en/docs/twitter-api/tweets/filtered-stream/integrate/build-a-rule

Working with Twitter Data

JSON and CSV Formats

Storage and Querying

Tweet Specific Data Issues

Sharing Datasets and Compliance

Extracting IDs
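
As a minimal sketch of this step, the Python below collects the unique tweet IDs from a flattened file. It assumes one tweet object per line with a string "id" field; the function name extract_ids and the sample data are made up for illustration.

```python
import json


def extract_ids(lines):
    """Return unique tweet IDs from flattened JSONL lines, in first-seen order."""
    seen = set()
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet_id = json.loads(line).get("id")
        if tweet_id and tweet_id not in seen:
            seen.add(tweet_id)
            ids.append(tweet_id)
    return ids


# Made-up sample standing in for a flattened file; tweet "1" appears twice.
sample_lines = [
    json.dumps({"id": "1", "text": "first"}),
    json.dumps({"id": "2", "text": "second"}),
    json.dumps({"id": "1", "text": "first"}),
]
unique_ids = extract_ids(sample_lines)
print("\n".join(unique_ids))

# With a real file (hypothetical name):
# with open("search_flatten.jsonl", encoding="utf-8") as f:
#     print("\n".join(extract_ids(f)))
```

The resulting list of IDs, one per line, is the form in which tweet datasets are usually shared.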

Checking for Deletes

Sharing Data

A simple step-by-step Twitter analysis guide with twarc2

A short list of commands to help others get started with twarc2 on Linux. The complete documentation is at https://twarc-project.readthedocs.io/en/latest/

0. Set up twarc2

In my case I use pyenv to manage the Python version (3.8.1).

Install or upgrade to the latest version:

pip install --upgrade twarc

Before the first run, store your Twitter API credentials; twarc2 prompts for them interactively:

twarc2 configure

1. Get the tweets

Simple search for "blacklivesmatter":

twarc2 search blacklivesmatter > search.jsonl
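
To check how many tweets a search returned, you can count them in Python. This is a minimal sketch that assumes each line of search.jsonl is one API response page with its tweets in a "data" list; the function name count_tweets and the sample pages are made up for illustration.

```python
import json


def count_tweets(lines):
    """Count tweets in unflattened twarc2 output (one response page per line)."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        page = json.loads(line)
        # Assumption: each page keeps its tweets in a "data" list.
        total += len(page.get("data", []))
    return total


# Made-up sample standing in for search.jsonl: two pages, three tweets.
sample_pages = [
    json.dumps({"data": [{"id": "1"}, {"id": "2"}], "meta": {"result_count": 2}}),
    json.dumps({"data": [{"id": "3"}], "meta": {"result_count": 1}}),
]
total_tweets = count_tweets(sample_pages)
print(total_tweets)

# With a real file:
# with open("search.jsonl", encoding="utf-8") as f:
#     print(count_tweets(f))
```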

A more complex search of the full archive (requires the Academic Research track). This query matches tweets that contain either of these two URLs or the word miguelenlared:

twarc2 search 'url:"https://www.elconfidencial.com/espana/madrid/2021-09-07/universidad-periodismo-complutense-profesores_3218500" OR url:"https://www.infolibre.es/noticias/opinion/columnas/2021/09/08/la_verdad_sobre_caso_quiros_una_cronica_primera_persona_124235_1023.html" OR miguelenlared' --start-time 2021-09-07T00:00:01 --archive > 210907-21_2url_and_miguelenlared_con-y-sin-arroba.json

2. Count the tweets

If you just want to count the tweets per day that match a query, without downloading them:

twarc2 counts blacklivesmatter --csv --granularity day > blacklivesmatter_count.csv

3. Flatten the JSON file

twarc2 writes one API response (a page holding many tweets) per line. Flattening rewrites the file so that each line holds a single tweet:

twarc2 flatten search.jsonl search_flatten.jsonl
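
To make the idea concrete, here is a simplified Python illustration of what flattening does to one response page. It is not twarc's actual implementation: it only inlines the author from "includes", while the real command also inlines media, referenced tweets, and other expansions. The function name flatten_page and the sample page are made up.

```python
import json


def flatten_page(page):
    """Yield one dict per tweet in a response page, with the author inlined."""
    # Index the users from the page's "includes" section by user ID.
    users = {u["id"]: u for u in page.get("includes", {}).get("users", [])}
    for tweet in page.get("data", []):
        flat = dict(tweet)  # shallow copy so the input page is not modified
        author = users.get(tweet.get("author_id"))
        if author is not None:
            flat["author"] = author
        yield flat


# Made-up response page with two tweets by two different authors.
sample_page = {
    "data": [
        {"id": "1", "author_id": "10", "text": "hello"},
        {"id": "2", "author_id": "11", "text": "world"},
    ],
    "includes": {
        "users": [
            {"id": "10", "username": "alice"},
            {"id": "11", "username": "bob"},
        ]
    },
}
flat_tweets = list(flatten_page(sample_page))
for tweet in flat_tweets:
    print(json.dumps(tweet))  # one tweet per output line
```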

4. Convert to CSV

The csv subcommand comes from the twarc-csv plugin, which is installed separately with pip install twarc-csv:

twarc2 csv search_flatten.jsonl search_flatten.csv
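
Once the data is in CSV form it can be summarized with the Python standard library. The sketch below counts tweets per day; it assumes the CSV has a created_at column holding ISO 8601 timestamps, and the function name tweets_per_day and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def tweets_per_day(csv_file):
    """Count rows per calendar day using the `created_at` column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        created = row.get("created_at") or ""
        if created:
            counts[created[:10]] += 1  # "YYYY-MM-DD" prefix of the timestamp
    return dict(sorted(counts.items()))


# Made-up rows standing in for search_flatten.csv.
sample_csv = io.StringIO(
    "id,created_at,text\n"
    "1,2021-09-07T10:15:00.000Z,first\n"
    "2,2021-09-07T23:59:59.000Z,second\n"
    "3,2021-09-08T00:00:01.000Z,third\n"
)
daily_counts = tweets_per_day(sample_csv)
print(daily_counts)

# With a real file:
# with open("search_flatten.csv", newline="", encoding="utf-8") as f:
#     print(tweets_per_day(f))
```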

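5. Convert to a network file (GEXF) for analysis in Gephi

A list of tweets can be displayed as a network in which the nodes are users and the edges are their interactions (retweet, quote, mention, reply). The network subcommand comes from the twarc-network plugin, which is installed separately with pip install twarc-network: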

twarc2 network search_flatten.jsonl --format gexf search_flatten.gexf

Then open the .gexf file in Gephi.
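
To see what such a network contains, the sketch below builds a weighted edge list in Python from mentions alone. It is a rough illustration rather than what the network command does internally, and it assumes flattened tweets with an inlined "author" object and an "entities.mentions" list; the function name mention_edges and the sample tweets are made up.

```python
import json
from collections import Counter


def mention_edges(lines):
    """Count (author, mentioned user) pairs in flattened JSONL lines."""
    edges = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        source = tweet.get("author", {}).get("username")
        if not source:
            continue  # cannot place an edge without a source user
        for mention in tweet.get("entities", {}).get("mentions", []):
            target = mention.get("username")
            if target:
                edges[(source, target)] += 1
    return edges


# Made-up flattened tweets: alice mentions bob twice and carol once.
sample_tweets = [
    json.dumps({"id": "1", "author": {"username": "alice"},
                "entities": {"mentions": [{"username": "bob"}]}}),
    json.dumps({"id": "2", "author": {"username": "alice"},
                "entities": {"mentions": [{"username": "bob"}, {"username": "carol"}]}}),
    json.dumps({"id": "3", "author": {"username": "bob"}}),
]
edge_weights = mention_edges(sample_tweets)
for (source, target), weight in sorted(edge_weights.items()):
    print(source, target, weight)
```

Each printed row is one directed edge with its weight, which is the same kind of information Gephi reads from the .gexf file.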