MC2 2018 Lab

Multilingual Cultural Mining and Retrieval

Content Analysis Results: Language identification 2017

Thursday 15 March 2018, by Malek Hajjem


  • Topics are a random selection of original microblogs posted in June 2016 without external links and with more than 80 characters.
  • Submissions and scores for the two best teams, Syllabs and Lia, can be found here.
  • The task paper can be found here and cited as:
@inproceedings{ErmakovaMotheSanJuan2017,
 author    = {Liana Ermakova and
              Josiane Mothe and
              Eric SanJuan},
 title     = {{CLEF} 2017 Microblog Cultural Contextualization Content Analysis
              Task Overview},
 booktitle = {Working Notes of {CLEF} 2017 - Conference and Labs of the Evaluation
              Forum, Dublin, Ireland, September 11-14, 2017.},
 year      = {2017},
 crossref  = {DBLP:conf/clef/2017w},
 url       = {},
 timestamp = {Thu, 16 Nov 2017 14:36:59 +0100},
 biburl    = {},
 bibsource = {dblp computer science bibliography}
}

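The topic selection criteria above can be sketched as a simple filter over Tweet objects. The field names (`created_at`, `text`, `entities`, `retweeted_status`) follow the standard Tweet JSON layout; the helper itself is only an illustration, not the lab's actual selection script.

```python
from datetime import datetime

def is_topic_candidate(tweet: dict) -> bool:
    """Illustrative filter for the topic selection criteria:
    original microblogs posted in June 2016, without external
    links, and with more than 80 characters."""
    # Original posts only: retweets carry a "retweeted_status" field.
    if "retweeted_status" in tweet:
        return False
    # Posted in June 2016 (Twitter's created_at date format).
    posted = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    if (posted.year, posted.month) != (2016, 6):
        return False
    # No external links.
    if tweet.get("entities", {}).get("urls"):
        return False
    # Strictly more than 80 characters of text.
    return len(tweet["text"]) > 80

tweet = {
    "created_at": "Wed Jun 15 12:00:00 +0000 2016",
    "text": "x" * 100,
    "entities": {"urls": []},
}
print(is_topic_candidate(tweet))  # True
```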
Evaluation process

The evaluation process assesses the reliability of language identification on Twitter.
Tweet objects have a long list of root-level attributes, including "lang". When present, this attribute holds a BCP 47 language identifier corresponding to the machine-detected language of the tweet text. This machine-detected language may of course differ from the actual language of the microblog.
Scores in this evaluation were assigned by a human expert. Only the tweets for which a participant's language detector disagreed with the tweet's "lang" attribute were examined. Tweets written in several languages receive a graduated score describing how prevalent each language is in the tweet.