MC2 2018 Lab

Multilingual Cultural Mining and Retrieval


3 - Time Line Illustration

1. Goal

The goal is to retrieve all relevant tweets related to each event of a festival, according to the provided program. This is essentially a "total recall" retrieval task, based on the show names, artist names, and the date and time of the shows.

This task focuses on 4 festivals: two French music festivals, one French theater festival and one British theater festival.

2. Topics

Topics are given in the file clef_mc2_task3_topics.xml.

Each topic is related to one cultural event.
In our terminology, an event is one occurrence of a show (theater, music, ...).
Several occurrences of the same show therefore correspond to several events (e.g. plays can be staged several times during theater festivals).
More precisely, one topic is described by: an id, a festival name, a title, an artist (or band) name, a timeslot (begin and end date/time), and a venue.

An excerpt from the topic list is:


The id is an integer ranging from 1 to 664.
We see from the excerpt above that, for a live music show without any specific title, the title field is empty.
The artist name is a single artist name, a list of artist names,
an artistic company name, or an orchestra name, as they appear in the official programs of the festivals.
The festival labels are:

  • charrues for Vieilles Charrues 2015,
  • transmusicales for Transmusicales 2015,
  • avignon for Avignon 2016,
  • edinburgh for Edinburgh 2016.

For the begin and end fields of the timeslot, the format is DD/MM/YY-HH:MM.
If the start or end time is unknown, it is given as DD/MM/YY-xx:xx.
If the day is unknown, the date part is omitted and the format is -HH:MM.
The venue is a string corresponding to the name of the location, given by the official programs.

3. Dataset

A login is required to access the data. Once registered on CLEF, each registered team can obtain up to 4 extra individual logins by writing to

Participants are required to use the full dataset to conduct their experiments.

4. Runs

The runs are expected to follow the classic TREC run file format. Only the top 1000 results per topic must be submitted. Each retrieved document is identified by its tweet id.
The evaluation will be carried out on a subset of the full set of topics, depending on the richness of the results obtained.
The planned official evaluation measures are recall values at 5, 10, 25, 50 and 100 documents.
Each registered participant may submit at most 6 runs. The protocol for submitting the runs will be described later.
The evaluation protocol may change depending on the submissions received.

5. Evaluation

As many retweets as possible will be excluded from the pools.
Tweet relevance will be based on a 3-level scale:

  • Not relevant: the tweet is not related to the topic
  • Partially relevant: the tweet is somehow related to the topic (e.g. it is related to the artist, song, or play but not to the event, or it is related to a similar event with no way to check whether they are the same)
  • Relevant: the tweet is related to the event
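Recall at a cutoff k is the fraction of all relevant tweets that appear among the top k results. A small sketch of this measure (the function name and the treatment of partially relevant tweets are assumptions; the official scoring may differ):

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant tweets found in the top k results.

    `ranked_ids` is the score-ordered list of retrieved tweet ids
    for one topic; `relevant_ids` is the set of tweet ids judged
    relevant (whether partially relevant tweets count is left to
    the official setting).
    """
    if not relevant_ids:
        return 0.0
    hits = sum(1 for tweet_id in ranked_ids[:k] if tweet_id in relevant_ids)
    return hits / len(relevant_ids)
```

The planned official cutoffs would then be obtained by evaluating this at k = 5, 10, 25, 50 and 100.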