Wikipedia is released under a Creative Commons license, and its contents can be used to contextualize tweets or to build complex queries referring to Wikipedia entities.
We have extracted an average of 10 million XML documents from Wikipedia per year since 2012 in the four main Twitter languages: en, es, fr and pt.
These documents reproduce, in an easy-to-use XML structure, the contents of the main Wikipedia pages: title, abstract, sections and subsections, as well as Wikipedia internal links. Other (…)
1 - Content Analysis
Organisers: IRIT, Université de Montréal, LISIS
Synopsis
Given a stream of microblogs, the content analysis task consists of:
- filtering microblogs dealing with festivals;
- language(s) identification;
- event localization;
- author categorization (official account, participant, follower or scam);
- Wikipedia entity recognition and translation in four target languages: English, Spanish, Portuguese and French;
- automatic summarization of linked Wikipedia pages in the four target languages.
Each item will be evaluated independently; however, language identification could impact Wikipedia linking and the resulting summaries.
A login is required to access the data. Once registered on CLEF, each registered team can obtain up to 4 extra individual logins by writing to admin@talne.eu.
Data
- The complete stream of microblogs is available here for registered participants.
- Topics are a random selection of original microblogs posted in June 2016 without external links and with more than 80 characters.
- Entities found must refer to page titles in the Wikipedia versions of the 2016 CLEF CMC workshop.
- Summaries must also be extracted from the 2016 CLEF CMC workshop Wikipedia versions. Online Indri indexes are available in English, Spanish, French and Portuguese.
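The topic criteria listed above can be illustrated with a minimal sketch. The function name, its parameters and the URL pattern are assumptions made for illustration only; the organisers' actual selection procedure is not described here.

```python
import re
from datetime import date

# Assumption: an "external link" is any token starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
MIN_LENGTH = 80  # topics have more than 80 characters


def is_topic_candidate(text: str, posted: date, is_retweet: bool = False) -> bool:
    """Return True if a microblog meets the stated topic criteria (hypothetical helper)."""
    if is_retweet:  # only original microblogs qualify
        return False
    if (posted.year, posted.month) != (2016, 6):  # posted in June 2016
        return False
    if URL_PATTERN.search(text):  # no external links
        return False
    return len(text) > MIN_LENGTH  # strictly more than 80 characters
```

Topics are then a random selection among the microblogs that pass such a check.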
Submission
Each individual participant can submit only one run per sub-task, so up to 5 runs per team. Submissions will be uploaded to a MySQL server through a web interface.
The expected format for each sub-task is a table whose primary key is the micro-blog id, followed by some extra fields:
- filtering: one extra field with a normalized score between 0 and 1, 1 being the maximum score, given to a micro-blog certainly related to a specific festival event.
- language(s) identification: three extra fields containing two-letter ISO 639-1 language codes, the first field for the main language, the last field for a subsidiary or less probable language.
- event localization: five extra fields for a ranked list of cities (IATA codes) related to the micro-blog.
- author categorization: one extra field with one of the following categories: 'official' when the microblog has been posted by the organizers of the festival, a media outlet broadcasting the event or an invited artist; 'participant' when it has been posted by a non-official individual in the audience; 'follower' for individuals following the festival but not taking part in it; and scam or troll.
- entity recognition: one table per target language, each one with 10 extra fields corresponding to a ranked list of Wikipedia entries (page titles) related to the micro-blog. The list is ranked by decreasing relevance. Participants can submit fewer than four languages.
- automatic summarization of linked Wikipedia pages in every language: one table per language, one extra field with a short summary of 120 words (sequences of characters separated by spaces).
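Before uploading a run, the field constraints above could be checked locally with something like the following sketch. All function names are hypothetical, and several points are assumptions rather than rules stated by the organisers: that unused trailing fields in a ranked list are left empty (None), that the scam or troll category is labelled "scam", and that 120 words is an upper bound. The code checks the shape of ISO 639-1 and IATA codes only, not whether a code actually exists.

```python
import re

# Assumption: exact label strings are not fixed by the task description;
# "scam" stands here for the "scam or troll" category.
AUTHOR_CATEGORIES = {"official", "participant", "follower", "scam"}
ISO_639_1 = re.compile(r"[a-z]{2}")  # shape check only
IATA_CODE = re.compile(r"[A-Z]{3}")  # shape check only
MAX_SUMMARY_WORDS = 120              # assumed to be an upper bound


def _ranked_codes_ok(values, pattern, size):
    """Fixed-size ranked list: first slot filled, empty slots only at the end."""
    if len(values) != size or values[0] is None:
        return False
    seen_empty = False
    for value in values:
        if value is None:
            seen_empty = True
        elif seen_empty or not pattern.fullmatch(value):
            return False
    return True


def valid_filtering(score):
    """Normalized score between 0 and 1."""
    return 0.0 <= score <= 1.0


def valid_languages(codes):
    """Three fields, main language first."""
    return _ranked_codes_ok(codes, ISO_639_1, 3)


def valid_localization(cities):
    """Five fields holding a ranked list of IATA codes."""
    return _ranked_codes_ok(cities, IATA_CODE, 5)


def valid_author(category):
    return category in AUTHOR_CATEGORIES


def valid_entities(titles):
    """Ranked list of at most 10 non-empty page titles."""
    return len(titles) <= 10 and all(
        isinstance(t, str) and bool(t.strip()) for t in titles
    )


def valid_summary(text):
    """Words are sequences of characters separated by whitespace."""
    return len(text.split()) <= MAX_SUMMARY_WORDS
```

For example, `valid_languages(["en", "fr", None])` accepts a micro-blog tagged as mainly English with French as a second language, while `valid_languages(["en", None, "fr"])` is rejected because an empty slot precedes a filled one. Note that `str.split()` splits on any whitespace, which is slightly broader than "separated by spaces".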