data

Articles

Wikipedia XML corpus for summary generation
18 October 2016, by sanjuan

Wikipedia is published under a Creative Commons license, and its contents can be used to contextualize tweets or to build complex queries referring to Wikipedia entities.
Since 2012, we have extracted an average of 10 million XML documents per year from Wikipedia in the four main Twitter languages: en, es, fr and pt.
These documents reproduce, in an easy-to-use XML structure, the contents of the main Wikipedia pages: title, abstract, sections and subsections, as well as Wikipedia internal links. Other (…)
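As a rough illustration of how such a document might be processed, the sketch below reads the title, abstract, section and subsection titles, and internal links from one page. The teaser does not give the actual schema, so every element and attribute name here (`page`, `title`, `abstract`, `section`, `subsection`, `link`, `target`) and the sample content are assumptions made for the example, not the corpus's real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the tag and attribute names are assumptions,
# chosen only to mirror the fields listed above (title, abstract, sections,
# subsections, internal links).
SAMPLE = """<page>
  <title>Example page</title>
  <abstract>Short summary mentioning <link target="Other page">another page</link>.</abstract>
  <section title="First section">
    <subsection title="Detail">Body text with a <link target="Third page">link</link>.</subsection>
  </section>
</page>"""


def summarize_page(xml_text):
    """Return the main fields of one page as a plain dictionary."""
    root = ET.fromstring(xml_text)

    # Flatten the abstract to plain text, keeping the anchor text of links.
    abstract_el = root.find("abstract")
    abstract = ""
    if abstract_el is not None:
        abstract = "".join(abstract_el.itertext()).strip()

    return {
        "title": root.findtext("title", default=""),
        "abstract": abstract,
        # iter() walks the whole tree in document order.
        "sections": [s.get("title", "") for s in root.iter("section")],
        "subsections": [s.get("title", "") for s in root.iter("subsection")],
        "links": [l.get("target", "") for l in root.iter("link")],
    }


if __name__ == "__main__":
    for field, value in summarize_page(SAMPLE).items():
        print(field, "->", value)
```

With the real corpus, the names passed to `find` and `iter` would need to be replaced by those of the distributed schema.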
The festival galleries dataset
18 October 2016, by sanjuan

This data set supports experiments in microblog search and stream summarization.
Microblog collection
The document collection is provided to registered participants by the ANR GAFES project. It consists of a pool of more than 50M unique microblogs from different sources, with their meta-information, as well as ground truth for the evaluation.
The microblog collection contains a very large pool of public Twitter posts using the keyword "festival" since June 2015. These microblogs are (…)
Microblog Data Set
2 November 2015, by sanjuan

The document collection provided by the GAFES project consists of a pool of more than 70M unique microblogs from different sources, with their meta-information and expanded URLs, hosted on a MySQL server. For legal reasons, access to this database is restricted to registered participants under a privacy agreement.
Along with the microblog corpus, a clean, simplified XML dump of Wikipedia that is easy to index and to process with state-of-the-art NLP tools is made available to participants. Ground truth (…)