
2016 CMC Workshop experiment

Cultural Microblog Contextualization based on Wikipedia

Thursday 31 March 2016, by Jian-Yun Nie, Josiane Mothe, Liana Ermakova

Organizers:

Liana Ermakova, Josiane Mothe, Jian-Yun Nie (cmct1@irit.fr)

Task 1 participation deadline extended to 23 May 2016

Objective

The aim of this task is to generate a short summary providing background information for a tweet, to help a user understand it. For instance, if a microblog announces a cultural event, participants have to provide a short summary extracted from Wikipedia that gives extensive background about this event. The summary must contain information about the context of the event in order to help answer questions like "what is this tweet about?", using a recent cleaned dump of Wikipedia. The context should be in the form of a readable summary, not exceeding 500 words, composed of passages from the provided Wikipedia corpus.

Any open-access resource can be used in addition to the data we provide to participants, subject to describing it and providing a valid URL.
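
For orientation, a minimal baseline can be sketched in a few lines of Python: score candidate Wikipedia passages against the tweet by simple term overlap and keep the best-scoring ones within the 500-word budget. The sketch below is purely illustrative; the function and variable names are ours, and extracting candidate passages from the corpus is assumed to be done elsewhere.

import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def overlap_score(tweet, passage):
    # fraction of passage terms that also occur in the tweet
    tweet_terms = set(tokenize(tweet))
    passage_terms = tokenize(passage)
    if not passage_terms:
        return 0.0
    return sum(t in tweet_terms for t in passage_terms) / len(passage_terms)

def summarize(tweet, candidate_passages, max_words=500):
    # greedily keep the best-scoring passages while staying within the word budget
    ranked = sorted(candidate_passages, key=lambda p: overlap_score(tweet, p), reverse=True)
    summary, used = [], 0
    for passage in ranked:
        length = len(tokenize(passage))
        if used + length > max_words:
            continue  # skip passages that would exceed the 500-word limit
        summary.append(passage)
        used += length
    return summary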

Data

  • Tweets to contextualize: We selected a set of 1001 English tweets to be contextualized by the participants using the English version of Wikipedia. These tweets were collected from a set of public microblogs on Twitter and are related to the keyword “festival”. The microblogs are provided in UTF-8 CSV format with various fields. In this task, the tweets do not contain URLs; the other tasks will use additional information.
  • Wikipedia Crawl: Unlike tweets, Wikipedia content is under a Creative Commons license, and it can be used to contextualize tweets or to build complex queries referring to Wikipedia entities. We have extracted from Wikipedia an average of 10 million XML documents per year since 2012 in the four main Twitter languages: en, es, fr and pt. These documents reproduce, in an easy-to-use XML structure, the contents of the main Wikipedia pages: title, abstract, sections and subsections, as well as Wikipedia internal links.
    Other contents such as images, footnotes and external links are stripped out in order to obtain a corpus that is easy to process with standard NLP tools. By comparing contents over the years, it is possible to detect long-term trends.
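
As an illustration of how this data might be loaded, the Python sketch below reads the tweet CSV and one Wikipedia XML document; the CSV field names and the XML tag names ("title", "abstract", "section") are assumptions to be adapted to the actual schema of the distributed files.

import csv
import xml.etree.ElementTree as ET

def read_tweets(csv_path):
    # one dict per tweet, keyed by the fields of the CSV header
    with open(csv_path, encoding="utf-8") as f:
        return list(csv.DictReader(f))

def read_wikipedia_page(xml_path):
    # "title", "abstract" and "section" are placeholder tag names,
    # to be replaced by the actual ones used in the corpus
    root = ET.parse(xml_path).getroot()
    title = root.findtext("title", default="")
    abstract = root.findtext("abstract", default="")
    sections = ["".join(sec.itertext()).strip() for sec in root.iter("section")]
    return {"title": title, "abstract": abstract, "sections": sections}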

Format of the results

Results should be provided in the following tab-separated format:

<tid> Q0 <file> <rank> <rsv> <run_id> <text of passage 1>  
<tid> Q0 <file> <rank> <rsv> <run_id> <text of passage 2>  
<tid> Q0 <file> <rank> <rsv> <run_id> <text of passage 3>  
...

where:

  • The first column is the tweet id (id field of the JSON format).
  • The second column is currently unused and should always be Q0.
  • The third column is the file name (without .xml) from which a result is retrieved; it is identical to the one in the Wikipedia document. Alternatively, the Wikipedia page title can be used.
  • The fourth column is the position number of the passage in the summary, independent of its informativeness.
  • The fifth column shows the score (integer or floating point) that should reflect the estimated informativeness of the passage. This score is used in the pooling process to build informativeness q-rels.
  • The sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used.
  • The seventh column is the raw text of the Wikipedia passage. Text is given without XML tags and without formatting characters (avoid "\n","\r","\l"). The resulting word sequence has to appear in the file indicated in the third field.
  • The columns are separated by tabs.

Example:

  • Topic 610507526174601216:
    Classes "The scenic writings to the manipulated object." Francis and Peter were very promising in the art of manipulation. Some pictures of the live performances at Usine Tournefeuille.
  • Possible abstract:
    Marionnettissimo is a puppet festival, created by the association Et Qui Libre / Marionnettissimo (or EQL / Marionnettissimo), whose objective is the development of "puppet culture", considering the public, artists, and cultural actors. The Marionnettissimo festival is part of a series of cultural actions, programming, training, conducted by the association since 1990. It takes place in the Toulouse area and the Midi-Pyrenees region, annually since 2006. 
    The “scenic writings to the manipulated object” training was presented by Francis Monty from the La Pire Espèce group (Quebec) and Pier Porcheron from the Elvis Alatac troupe (Poitou-Charentes) at the Marionnettissimo festival from the 8th to the 19th of February 2016.
  • Formatted result:
    610507526174601216 Q0 1693938 0 14.0 <run_id>	Marionnettissimo is a puppet festival, created by the association Et Qui Libre / Marionnettissimo (or EQL / Marionnettissimo), whose objective is the development of "puppet culture", considering the public, artists, and cultural actors.
    610507526174601216 Q0 1693938 1 12.0 <run_id>	The Marionnettissimo festival is part of a series of cultural actions, programming, training, conducted by the association since 1990.
    610507526174601216 Q0 1693938 2 11.0 <run_id>	The “scenic writings to the manipulated object” training was presented by Francis Monty from the La Pire Espèce group (Quebec) and Pier Porcheron from the Elvis Alatac troupe (Poitou-Charentes) at the Marionnettissimo festival from the 8th to the 19th of February 2016.
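
For convenience, the sketch below shows one way such a run could be serialized in the tab-separated layout defined above; the result tuples, the run tag and the output file name are hypothetical.

def write_run(results, run_id, out_path="run.tsv"):
    # results: iterable of (tweet_id, file_name, score, passage) tuples,
    # already grouped per tweet and ordered by decreasing informativeness
    with open(out_path, "w", encoding="utf-8") as out:
        rank = 0
        previous_tid = None
        for tid, file_name, rsv, passage in results:
            rank = rank + 1 if tid == previous_tid else 0  # restart ranks for each tweet
            previous_tid = tid
            text = " ".join(passage.split())  # drop newlines and other formatting characters
            out.write(f"{tid}\tQ0\t{file_name}\t{rank}\t{rsv}\t{run_id}\t{text}\n")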

Evaluation

The summaries will be evaluated according to:

  • Informativeness: the way they overlap with relevant passages (number of them, vocabulary and bi-grams included or missing). For each tweet, all passages from all participants will be merged and displayed to the assessor in alphabetical order. Therefore, each passage’s informativeness will be evaluated independently of the others, even within the same summary. Assessors will have to provide a binary judgment on whether the passage should appear in a summary on the topic or not (a toy overlap illustration is sketched after this list).
  • Readability: assessed by evaluators and participants. Each participant will have to evaluate the readability of a pool of summaries through an online web interface. Each summary consists of a set of passages, and for each passage, assessors will have to tick four kinds of check boxes:
  1. Syntax (S): tick the box if the passage contains a syntactic problem (bad segmentation for example),
  2. Anaphora (A): tick the box if the passage contains an unsolved anaphora,
  3. Redundancy (R): tick the box if the passage contains redundant information, i.e. information that has already been given in a previous passage,
  4. Trash (T): tick the box if the passage does not make any sense in its context (i.e. after reading the previous passages). These passages must then be considered as trashed, and the readability of following passages must be assessed as if these passages were not present.
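
For intuition about the informativeness criterion, the toy functions below compute vocabulary (unigram) and bi-gram overlap between a summary and a pool of reference passages; the official evaluation uses its own normalization and tooling, so this is not the exact metric.

import re

def ngrams(text, n):
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap(summary, reference_passages, n):
    # share of reference n-grams that also occur in the summary
    reference = set().union(*(ngrams(p, n) for p in reference_passages)) if reference_passages else set()
    candidate = ngrams(summary, n)
    return len(candidate & reference) / len(reference) if reference else 0.0

# e.g. overlap(summary_text, relevant_passages, 1) for vocabulary overlap,
#      overlap(summary_text, relevant_passages, 2) for bi-gram overlap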

Download the data:

Tweets to contextualize (download)
Wikipedia collection to use to contextualize the tweets (download)

Submission

Participants should register at http://clef2016-labs-registration.dei.unipd.it/registrationForm.php. Personal access to the submission form is sent after registration.

2016 Schedule

  • Topics and task guidelines released: 1 April
  • Run submission deadline: 23 May (extended)
  • Informativeness Evaluation results sent out: 5 June
  • Readability Evaluation results sent out: 5 June
  • Participant papers (CLEF proceedings) due: 7 June
  • Overview paper due: 30 June