Authors: Tomáš Ptáček, Ivan Habernal and Jun Hong ( tigi | habernal @ kiv.zcu.cz | j.hong @ qub.ac.uk)
Our paper presents a machine learning approach to sarcasm detection on Twitter in two languages: English and Czech.
We also created a large human-annotated Czech Twitter corpus.
More details can be found in our paper.
Twitter CZ Corpus contains 7,000 tweets (325 sarcastic and 6,675 normal tweets). Corpus: cs.zip (~270 kB)
The archive contains annotated tweets in an Excel file (Annotations.xlsx) and two text files, one with normal tweets and one with sarcastic tweets.
Twitter EN Balanced Corpus consists of 100,000 tweets (50,000 sarcastic, 50,000 normal tweets).
Corpus: en-balanced.zip (~850 kB)
Twitter EN Imbalanced Corpus consists of 100,000 tweets (25,000 sarcastic, 75,000 normal tweets).
Corpus: en-imbalanced.zip (~850 kB)
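The Czech archive described above ships one text file of normal tweets and one of sarcastic tweets. The following Python sketch shows one way such a pair of files might be loaded into labeled examples. It is a sketch under stated assumptions, not the authors' loader: the one-tweet-per-line layout and the UTF-8 encoding are assumptions, and the function names load_labeled_tweets and sarcastic_share are hypothetical. The file names inside the archives are not stated here, so the paths are left as parameters.

```python
from pathlib import Path


def load_labeled_tweets(normal_path, sarcastic_path, encoding="utf-8"):
    """Return a list of (text, label) pairs; label 1 = sarcastic, 0 = normal.

    Assumption: each file holds one tweet per line. Blank lines are skipped.
    """
    examples = []
    for path, label in ((normal_path, 0), (sarcastic_path, 1)):
        with Path(path).open(encoding=encoding) as handle:
            for line in handle:
                text = line.strip()
                if text:
                    examples.append((text, label))
    return examples


def sarcastic_share(examples):
    """Fraction of examples labeled sarcastic (0.0 for an empty list)."""
    if not examples:
        return 0.0
    return sum(label for _, label in examples) / len(examples)
```

With the counts given above, the sarcastic share would be 325 / 7,000 (about 4.6%) for the Czech corpus, 50% for the EN balanced corpus and 25% for the EN imbalanced corpus, which is worth keeping in mind when choosing evaluation metrics.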
The corpus is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
The software is licensed under the GNU General Public License.
Please cite our article if you use any of the available resources.
@InProceedings{ptacek-habernal-hong:2014:Coling,
  author    = {Pt\'{a}\v{c}ek, Tom\'{a}\v{s} and Habernal, Ivan and Hong, Jun},
  title     = {Sarcasm Detection on Czech and English Twitter},
  booktitle = {Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers},
  month     = {August},
  year      = {2014},
  address   = {Dublin, Ireland},
  publisher = {Dublin City University and Association for Computational Linguistics},
  pages     = {213--223},
  url       = {http://www.aclweb.org/anthology/C14-1022}
}
Last change: 2014-08-12