Exploring Author Context for Detecting Intended vs Perceived Sarcasm, Oprea and Magdy, 2019

Paper, Tags: #sarcasm-detection

We investigate the impact of using author context on textual sarcasm detection. Author context = an embedded representation of the author's historical posts on Twitter. We experiment with 2 tweet datasets, one labelled manually and the other via tag-based distant supervision. We achieve state-of-the-art performance on the second dataset but not on the first, indicating a difference between intended and perceived sarcasm.

Sarcasm is a form of irony that occurs when there is a discrepancy between the literal meaning of an utterance and its intended meaning. It is used to express a dissociative attitude towards a previous proposition, often in the form of contempt or derogation.

Given a tweet t posted by user u_t with user embedding e_t, we address two questions:

  1. Is e_t predictive of the sarcastic nature of t?
  2. Is the predictive power of e_t on the sarcastic nature of t the same when t is labelled manually vs. via distant supervision?

To our knowledge, previous work considering author context only experiments on distant supervision datasets. We use 2 datasets: Riloff, labelled manually, and Ptacek, labelled via distant supervision. We use NNs to build user embeddings and achieve state-of-the-art results on Ptacek, but not on Riloff.

Our findings indicate a difference between the sarcasm intended by the author, captured by distant supervision (Ptacek), and sarcasm perceived by the audience, captured by manual labelling (Riloff). Our work suggests a future research direction in sarcasm detection where the 2 types of sarcasm are treated as separate phenomena and socio-cultural differences are taken into account.

1. Background

We identify 2 classes of models in the literature:

  • Local models: only consider information available within the text being classified. Most work looks for linguistic incongruity, e.g. a positive verb used in a negative sentiment context.
  • Contextual models: use both local and contextual information. Some work includes information about the type of forum where the post was published. For Twitter, previous work represents user context by a set of manually-curated features extracted from the user's historical tweets. Others use the ParagraphVector model to build a representation of the user's historical documents. Hazarika et al., 2018 extract personality features from the user's history. These models have only been tested on datasets labelled via distant supervision.

2. Datasets

2.1. Riloff dataset

Consists of 3.2k tweet IDs, manually labelled by third-party annotators. The labels capture the subjective perception of the annotators (perceived sarcasm). We collected the tweets using the Twitter API, as well as the historical timeline tweets for each user: for a user with tweet t in Riloff, we collected the historical tweets posted before t. Only 701 original tweets, along with the corresponding timelines, could be retrieved.
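As an illustration, timeline collection of this kind might look as follows with tweepy (the paper does not specify its collection code; the helper name and pagination limits here are our assumptions):

```python
import tweepy  # assumes Twitter API v1.1 credentials are already configured

def collect_history(api, user_id, tweet_id, max_tweets=3200):
    """Fetch a user's timeline tweets posted strictly before `tweet_id`."""
    history, max_id = [], tweet_id - 1  # max_id is inclusive, so step back one
    while len(history) < max_tweets:
        page = api.user_timeline(user_id=user_id, max_id=max_id,
                                 count=200, tweet_mode="extended")
        if not page:  # no older tweets available
            break
        history.extend(page)
        max_id = page[-1].id - 1  # continue paging backwards in time
    return history
```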

2.2. Ptacek dataset

Consists of 50k tweet IDs, labelled via distant supervision. The tags used as markers of sarcasm are #sarcasm, #sarcastic, #satire and #irony. This dataset reflects intended sarcasm. In the same setting as with Riloff, we could only collect 27.1k tweets and the corresponding timelines.

3. Contextual sarcasm detection models

As a baseline we implement the SIARN model proposed by Tay et al., 2018, since it achieves the best published results on both of our datasets. SIARN only looks at the tweet being classified.

Further, we introduce two classes of models: exclusive and inclusive. In exclusive models, the decision of whether a tweet is sarcastic is independent of the content of t: only the user's historical tweets are used for prediction. We feed the user embedding e_t to a layer with softmax activations to output a probability distribution over Y (sarcastic or non-sarcastic). We name these models EX-[emb], where emb is the name of the user embedding model.
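A minimal sketch of such an exclusive head in PyTorch (dimensions and class names are ours; the user embedding e_t is assumed precomputed by a separate emb model):

```python
import torch
import torch.nn as nn

class ExclusiveHead(nn.Module):
    """EX-[emb]: predicts sarcasm from the user embedding e_t alone."""
    def __init__(self, emb_dim=100, n_classes=2):
        super().__init__()
        self.out = nn.Linear(emb_dim, n_classes)

    def forward(self, e_t):  # e_t: (batch, emb_dim) user embedding
        # softmax layer over Y = {sarcastic, non-sarcastic}
        return torch.softmax(self.out(e_t), dim=-1)
```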

Inclusive models account for both t and e_t. We take the feature vector f_t extracted by SIARN from t, concatenate f_t with e_t, and use an output layer with softmax activations; we name these models IN-[emb] (see the sketch after the list below). For emb, we consider several user embedding models:

  • CASCADE embeddings: reported as the most informative user embedding in previous work, but only tested on a dataset of Reddit posts labelled via distant supervision. We test it on our datasets.
  • W-CASCADE embeddings: CASCADE treats all historical tweets in the same manner, but long-term working memory plays an important role in verbal reasoning and textual comprehension, so we expect recent historical tweets to have a greater influence on the current behaviour of a user. The construction of these embeddings is detailed in the paper, page 3.
  • ED embeddings: encoder-decoder models can handle inputs and outputs of variable length. The encoder, an RNN, transforms an input sentence into a fixed-dimension internal representation; the decoder, another RNN, produces the output sequence from this representation. We use bi-directional LSTM cells, and the training goal is to reconstruct the input.
  • SUMMARY embeddings: we use an encoder-decoder model as above, but train it to summarize the input rather than reconstruct it.
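To make the inclusive combination concrete, a sketch under the same assumptions, treating SIARN as a black-box feature extractor:

```python
import torch
import torch.nn as nn

class InclusiveHead(nn.Module):
    """IN-[emb]: combines SIARN tweet features f_t with the user embedding e_t."""
    def __init__(self, feat_dim, emb_dim=100, n_classes=2):
        super().__init__()
        self.out = nn.Linear(feat_dim + emb_dim, n_classes)

    def forward(self, f_t, e_t):
        z = torch.cat([f_t, e_t], dim=-1)  # tweet content + author context
        return torch.softmax(self.out(z), dim=-1)
```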

4. Effect of context on sarcasm detection

We filter out all tweets shorter than three words and replace all words that appear only once in the entire corpus with an UNK token. Then we encode each tweet as a sequence of word vectors initialized with GloVe embeddings. We set the word embedding dimension to 100 and tune the dimension of all CASCADE embeddings to 100 on the validation set. We set the W-CASCADE embeddings to the same dimension.
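A sketch of this preprocessing, assuming tokenized tweets and a standard GloVe text file (the helper names and file handling are ours):

```python
from collections import Counter
import numpy as np

def preprocess(tweets, min_len=3):
    """Drop tweets shorter than three tokens and map hapax words to UNK."""
    tweets = [t for t in tweets if len(t) >= min_len]
    counts = Counter(w for t in tweets for w in t)
    return [[w if counts[w] > 1 else "UNK" for w in t] for t in tweets]

def glove_init(path, vocab, dim=100):
    """Initialize word vectors from a GloVe file; unseen words stay random."""
    vecs = {w: np.random.normal(scale=0.1, size=dim) for w in vocab}
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            word, *vals = line.rstrip().split(" ")
            if word in vecs:
                vecs[word] = np.asarray(vals, dtype="float32")
    return vecs
```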

Results

User embeddings show remarkable predictive power on the Ptacek dataset. In particular, with the EX-W-CASCADE model we get better results than the baseline without even looking at the tweet being classified. On the Riloff dataset, however, user embeddings seem far less informative.

The state-of-the-art performance of exclusive models on Ptacek indicates that users have a prior disposition to being either sarcastic or non-sarcastic, which can be deduced from their historical behaviour. This behaviour can change over time, however: we achieve better performance when accounting for the temporal arrangement of historical tweets, as W-CASCADE does.

Performance analysis

Since the Riloff dataset is manually annotated, we could expect user embeddings to have poor predictive power there; perhaps annotator embeddings would shed more light.

We noticed that many of the tweets in Riloff contain one or more of the tags used to mark sarcasm in Ptacek. For all tweets in Riloff, we checked the agreement between containing such a tag and being manually annotated as sarcastic, and found a large disagreement: 217 of the 509 tweets manually annotated as non-sarcastic contained such a tag. Given the subjective nature of sarcasm, this lack of coherence between the presence of sarcasm tags and the manual annotations suggests that the 2 labelling methods capture distinct phenomena.
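A quick sketch of this agreement check (the data layout is hypothetical; the paper reports only the resulting counts):

```python
SARCASM_TAGS = {"#sarcasm", "#sarcastic", "#satire", "#irony"}

def tag_label_agreement(tweets):
    """Compare the presence of Ptacek's sarcasm tags with manual labels.

    `tweets` is assumed to be a list of (tokens, manually_sarcastic) pairs.
    Returns the fraction of tweets where the two labelling methods agree.
    """
    has_tag = lambda toks: any(w.lower() in SARCASM_TAGS for w in toks)
    agree = sum(has_tag(toks) == label for toks, label in tweets)
    return agree / len(tweets)
```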

To investigate further, we re-labelled the Riloff dataset via distant supervision, taking these tags as markers of sarcasm, to create the #Riloff dataset. Applying the exclusive models to #Riloff, we observed considerably higher predictive power than on Riloff. Author history therefore seems predictive of authorial sarcastic intention, but not of external perception.

This could indicate that future work should differentiate between the two types of sarcasm: intended and perceived.

Conclusion

We studied the predictive power of user embeddings in textual sarcasm detection across datasets labelled via manual labelling and via distant supervision. We proposed several NNs to build user embeddings, achieving state-of-the-art results under distant supervision, but not under manual labelling. The two labelling methods seem to capture 2 different types of sarcasm: intended and perceived. We suggest a future research direction in sarcasm detection where the two types are treated as separate phenomena and socio-cultural differences are taken into account.