Sarcasm Detection on Czech and English Twitter

Authors: Tomáš Ptáček, Ivan Habernal and Jun Hong ( tigi | habernal @ | j.hong @

Our paper presents a machine learning approach to sarcasm detection on Twitter in two languages, English and Czech. We also created a large human-annotated Czech Twitter corpus.
More details can be found in our paper.

Article Resources


Unfortunately we can provide only tweet IDs. The actual tweets will be sent on request via email (tigi at

Twitter CZ Corpus contains 7,000 tweets (325 sarcastic and 6,675 normal tweets). Corpus: (~270 kB)

The archive contains annotated tweets in an Excel file (Annotations.xlsx) and two text files, one with normal tweets and one with sarcastic tweets.
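
As a rough sketch of how the two text files could be loaded for an experiment, the following Python snippet reads them into one labeled list. It assumes one tweet ID per line, which this page does not confirm. The function names load_tweet_ids and load_corpus and the 0/1 label encoding are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def load_tweet_ids(path, label):
    """Read tweet IDs from a text file and pair each with a class label.

    Assumption: the file holds one tweet ID per line.
    """
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        tweet_id = line.strip()
        if tweet_id:  # skip blank lines
            # IDs are kept as strings so they survive round-trips unchanged.
            pairs.append((tweet_id, label))
    return pairs


def load_corpus(normal_path, sarcastic_path):
    """Combine both files into one list of (tweet_id, label) pairs.

    Hypothetical encoding: 0 = normal, 1 = sarcastic.
    """
    return load_tweet_ids(normal_path, 0) + load_tweet_ids(sarcastic_path, 1)
```

A call such as load_corpus("normal.txt", "sarcastic.txt") would then return the combined list. Both file names are placeholders and should be replaced with the names actually found in the archive.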

Twitter EN Balanced Corpus consists of 100,000 tweets (50,000 sarcastic, 50,000 normal tweets).
Corpus: (~850 kB)

Twitter EN Imbalanced Corpus consists of 100,000 tweets (25,000 sarcastic, 75,000 normal tweets).
Corpus: (~850 kB)
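
The class balance of the three corpora above differs considerably, which matters when reading accuracy figures. The short Python sketch below uses only the tweet counts stated on this page. It computes the share of sarcastic tweets and the accuracy that a trivial classifier would reach by always predicting the larger class. The helper names are hypothetical.

```python
# Tweet counts as stated on this page: (sarcastic, normal).
CORPORA = {
    "Twitter CZ": (325, 6675),
    "Twitter EN Balanced": (50000, 50000),
    "Twitter EN Imbalanced": (25000, 75000),
}


def sarcastic_share(sarcastic, normal):
    """Fraction of tweets in the corpus that are labeled sarcastic."""
    return sarcastic / (sarcastic + normal)


def majority_baseline_accuracy(sarcastic, normal):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(sarcastic, normal) / (sarcastic + normal)


for name, (s, n) in CORPORA.items():
    print(
        f"{name}: {sarcastic_share(s, n):.1%} sarcastic, "
        f"majority baseline accuracy {majority_baseline_accuracy(s, n):.1%}"
    )
```

On the Czech corpus, for example, always predicting "normal" already gives roughly 95% accuracy, so a per-class measure on the sarcastic class is usually more informative than accuracy alone.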


Extract the archive (~160 MB) and run the experiments through the class SarcasmEvaluationApp, which is controlled by command-line options. Simple batch files that run all experiments are included.
Please direct questions regarding software to tigi (at)


The corpus is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The software is licensed under the GNU General Public License.


Please cite our article if you use any of the available resources.

  author    = {Pt\'{a}\v{c}ek, Tom\'{a}\v{s}  and  Habernal, Ivan  and  Hong, Jun},
  title     = {Sarcasm Detection on Czech and English Twitter},
  booktitle = {Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers},
  month     = {August},
  year      = {2014},
  address   = {Dublin, Ireland},
  publisher = {Dublin City University and Association for Computational Linguistics},
  pages     = {213--223},
  url       = {}

Last change: 2014-08-12