MC2 2018 Lab

Multilingual Cultural Mining and Retrieval

Content Analysis Results: Language identification 2017

Thursday 15 March 2018, by Malek Hajjem


  • Topics are a random selection of original microblogs posted in June 2016 without external links and with more than 80 characters.
  • Submissions and scores for the two best teams, Syllabs and Lia, can be found here.
  • The task paper can be found here and cited as:
@inproceedings{ErmakovaMotheSanJuan2017,
 author    = {Liana Ermakova and
              Josiane Mothe and
              Eric SanJuan},
 title     = {{CLEF} 2017 Microblog Cultural Contextualization Content Analysis
              Task Overview},
 booktitle = {Working Notes of {CLEF} 2017 - Conference and Labs of the Evaluation
              Forum, Dublin, Ireland, September 11-14, 2017.},
 year      = {2017},
 crossref  = {DBLP:conf/clef/2017w},
 url       = {},
 timestamp = {Thu, 16 Nov 2017 14:36:59 +0100},
 biburl    = {},
 bibsource = {dblp computer science bibliography}
}

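The topic selection criteria above can be sketched as a simple filter over Tweet objects. The field names (`created_at`, `text`, `entities`, `retweeted_status`) follow the standard Tweet JSON layout; the helper itself is only an illustration, not the lab's actual selection script.

```python
from datetime import datetime

def is_topic_candidate(tweet: dict) -> bool:
    """Illustrative filter for the topic selection criteria:
    original microblogs posted in June 2016, without external
    links, and with more than 80 characters."""
    # Original posts only: retweets carry a "retweeted_status" field.
    if "retweeted_status" in tweet:
        return False
    # Posted in June 2016 (Twitter's created_at date format).
    posted = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    if (posted.year, posted.month) != (2016, 6):
        return False
    # No external links.
    if tweet.get("entities", {}).get("urls"):
        return False
    # Strictly more than 80 characters of text.
    return len(tweet["text"]) > 80

tweet = {
    "created_at": "Wed Jun 15 12:00:00 +0000 2016",
    "text": "x" * 100,
    "entities": {"urls": []},
}
print(is_topic_candidate(tweet))  # True
```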
Evaluation process

The evaluation process assesses the reliability of language identification on Twitter.
Tweet objects have a long list of root-level attributes, including "lang". When present, this attribute holds a BCP 47 language identifier corresponding to the machine-detected language of the tweet text. This machine-detected language may of course differ from the actual language of the microblog.
Scores in this evaluation were assigned by a human expert. Only the tweets for which a participant's language detector disagreed with the tweet's "lang" attribute were examined. Tweets written in several languages receive a graduated score describing how prevalent each language is in the tweet.