Content Analysis Results: Language identification 2017
Thursday 15 March 2018
Results
- Topics are a random selection of original microblogs posted in June 2016, without external links and with more than 80 characters.
- Submissions and scores for the two best teams, Syllabs and Lia, can be found here.
- The task paper can be found here:
@inproceedings{DBLP:conf/clef/ErmakovaMS17,
author = {Liana Ermakova and
Josiane Mothe and
Eric SanJuan},
title = {{CLEF} 2017 Microblog Cultural Contextualization Content Analysis
task Overview},
booktitle = {Working Notes of {CLEF} 2017 - Conference and Labs of the Evaluation
Forum, Dublin, Ireland, September 11-14, 2017.},
year = {2017},
crossref = {DBLP:conf/clef/2017w},
url = {http://ceur-ws.org/Vol-1866/invited_paper_14.pdf},
timestamp = {Thu, 16 Nov 2017 14:36:59 +0100},
biburl = {https://dblp.org/rec/bib/conf/clef/ErmakovaMS17},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
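The topic-selection criteria above can be sketched as a simple filter over Tweet objects. This is a sketch under the assumption that tweets carry the `created_at`, `text`, `entities.urls`, and `retweeted_status` fields of the public Twitter API; the actual dump may use different field names.

```python
from datetime import datetime

def is_topic_candidate(tweet):
    """Mirror the stated selection criteria: original microblogs posted in
    June 2016, without external links, with more than 80 characters.
    Field names follow the public Twitter API (an assumption)."""
    # Twitter's created_at format, e.g. "Wed Jun 15 19:02:12 +0000 2016"
    posted = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    if (posted.year, posted.month) != (2016, 6):
        return False
    if tweet.get("retweeted_status"):          # keep original tweets only
        return False
    if tweet.get("entities", {}).get("urls"):  # drop tweets with external links
        return False
    return len(tweet["text"]) > 80

sample = {
    "created_at": "Wed Jun 15 19:02:12 +0000 2016",
    "text": "A" * 81,
    "entities": {"urls": []},
}
print(is_topic_candidate(sample))  # True
```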
Evaluation process
The evaluation process assesses the reliability of language identification on Twitter.
In fact, Tweet objects have a long list of ‘root-level’ attributes, including fundamental ones such as "lang". When present, this attribute holds a BCP 47 language identifier corresponding to the machine-detected language of the tweet text. Obviously, this machine-detected language may differ from the microblog's actual language.
Scores in this evaluation are assigned by a human expert. Only the tweets for which the participants’ language-detection systems disagree with the tweet’s "lang" attribute were examined. Tweets written in several languages receive a graduated score describing how much each language is present in the tweet.
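The filtering step described above can be sketched as follows: keep only the tweets where a participant system and Twitter's "lang" attribute disagree, since only those go to the human expert. The `detect_language` function below is a hypothetical toy stand-in for a participant's detector, not part of the task.

```python
# Hypothetical stand-in for a participant's language detector:
# a real system would use statistical models, not a keyword rule.
def detect_language(text):
    return "fr" if "bonjour" in text.lower() else "en"

def tweets_needing_review(tweet_objects):
    """Keep only tweets whose detected language differs from the
    Twitter-assigned 'lang' attribute (a BCP 47 code); only these
    disagreements are passed to the human expert for scoring."""
    disagreements = []
    for tweet in tweet_objects:
        twitter_lang = tweet.get("lang")  # may be absent
        detected = detect_language(tweet["text"])
        if twitter_lang is not None and detected != twitter_lang:
            disagreements.append(tweet)
    return disagreements

tweets = [
    {"id": 1, "lang": "en", "text": "Bonjour tout le monde"},
    {"id": 2, "lang": "en", "text": "Good morning everyone"},
]
print([t["id"] for t in tweets_needing_review(tweets)])  # [1]
```

Tweets on which the detector and Twitter agree are assumed correct and skipped, which keeps the expert's workload to the genuinely ambiguous cases.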