Data
The data for the shared task is scheduled to be released in three stages: sample data, training data, and evaluation data. The sample data will be released first to give participants an idea of the data format and the task. The training data will be released next to allow participants to train their systems. The evaluation data -- without the target references -- will be released last for participants to evaluate their systems. The target references for the evaluation data will be released after the evaluation period ends.
- Jump to the "Download" section to download the data.
Important Dates
- Sample data ready: 15 July 2024
- Training data ready: November 2024
- Evaluation data ready: 10 January 2025
- Evaluation data with target references ready: After the evaluation period ends
All deadlines are 23:59 UTC-12 ("anywhere on Earth").
Overview
The data for the shared task will be provided in JSONL format. The data will contain English source text and translations in multiple target languages. The data will also contain named entities mentioned in the translations. The data will be based on Wikidata entities.
Languages
The data will be provided in the following languages:
Source languages
- English (
en
)
Target Languages
- Italian (
it
): sample, training, and evaluation data - Spanish (
es
): sample, training, and evaluation data - French (
fr
): sample, training, and evaluation data - German (
de
): sample, training, and evaluation data - Arabic (
ar
): sample, training, and evaluation data - Japanese (
ja
): sample, training, and evaluation data - Chinese (
zh
): sample and evaluation data - Korean (
ko
): sample and evaluation data - Thai (
th
): sample and evaluation data - Turkish (
tr
): sample and evaluation data
Format
Here is a brief overview of the data format for the shared task. The data will be provided in JSONL format, where each line in the file will contain a JSON object. Note that the format of the training data is different from the format of the sample and evaluation data.
Sample/evaluation data format
Each row in the dataset contains the following fields:
{
"id": "Q2461698_0",
"wikidata_id": "Q2461698",
"entity_types": [
"Fictional entity"
],
"source": "Who are the main antagonistic forces in the World of Ice and Fire?",
"targets": [
{
"translation": "Chi sono le principali forze antagoniste nel mondo delle Cronache del ghiaccio e del fuoco?",
"mention": "mondo delle Cronache del ghiaccio e del fuoco"
}
],
"source_locale": "en",
"target_locale": "it"
}
Where:
id
is a unique identifier for the row, usually in the format<entity_id>_<q_id>
, where<entity_id>
is the entity ID in Wikidata and<q_id>
is the question ID (from 0 to 4).wikidata_id
is the QID of the entity in Wikidata.entity_types
is a list of types of the entity; not all entities have types.source
is the source text in English.targets
is a list of translations in the target language, where each translation contains:translation
is the translation of the source text in the target language.mention
is the mention of the entity in the translation.
source_locale
is the source language.target_locale
is the target language.
In the example above, the entity is the "World of Ice and Fire" and the translation is "mondo delle Cronache del ghiaccio e del fuoco", which are not 1-to-1 translations as the Italian version also includes "delle Cronache" ("of the Chronicles"). You can check out more examples below.
Examples
-
Ring a Ring o' Roses is translated as Girotondo in Italian:
{
"id": "Q746666_0",
"wikidata_id": "Q746666",
"entity_types": [
"Musical work"
],
"source": "Can you sing the chorus of the folk song Ring a Ring o' Roses?",
"targets": [
{
"translation": "Puoi cantare il ritornello della canzone popolare Girotondo?",
"mention": "Girotondo"
},
{
"translation": "Sai cantare il ritornello del girotondo, la canzone popolare?",
"mention": "girotondo"
}
],
"source_locale": "en",
"target_locale": "it"
} -
Mary of Burgundy is translated as Maria di Borgogna and Maximilian I is translated as Massimiliano I in Italian:
{
"id": "Q157073_0",
"wikidata_id": "Q157073",
"entity_types": [
"Person"
],
"source": "How long was Mary of Burgundy married to Emperor Maximilian I?",
"targets": [
{
"translation": "Per quanto tempo Maria di Borgogna è stata sposata con l'imperatore Massimiliano I?",
"mention": "Maria di Borgogna"
},
{
"translation": "Per quanto tempo Maria di Borgogna è stata sposata con l'imperatore Massimiliano I",
"mention": "Maria di Borgogna"
}
],
"source_locale": "en",
"target_locale": "it"
} -
Little Women is translated as Mujercitas in Spanish:
{
"id": "Q850522_0",
"wikidata_id": "Q850522",
"entity_types": [
"Movie"
],
"source": "Who are the main characters in the movie Little Women?",
"targets": [
{
"translation": "¿Quiénes son los personajes principales de la película Mujercitas?",
"mention": "Mujercitas"
}
],
"source_locale": "en",
"target_locale": "es"
} -
A Room of One's Own is translated as Una habitación propia in Spanish:
{
"id": "Q1204366_1",
"wikidata_id": "Q1204366",
"entity_types": [
"Book"
],
"source": "Who is the author of the book A Room of One's Own?",
"targets": [
{
"translation": "¿Quién es el autor del libro Una habitación propia?",
"mention": "Una habitación propia"
},
{
"translation": "¿Quién es el autor del libro Una habitacion propia?",
"mention": "Una habitacion propia"
}
],
"source_locale": "en",
"target_locale": "es"
}
Training data format
The training data will be provided in a different format, where each row in the dataset contains the following fields:
{
"source": "Did Gone With The Wind come out before 1940?",
"target": "Via col vento è uscito prima del 1940?",
"entities": [
"Q2875"
],
"source_locale": "en",
"target_locale": "it",
"instance_id": "826528e6",
"from": "mintaka",
}
Where:
source
is the source text in English.target
is the translation of the source text in the target language.entities
is a list of Wikidata IDs of the entities mentioned in the source text.source_locale
is the source language.target_locale
is the target language.instance_id
is a unique identifier for the row.from
is the source of the data.
Source of the data
The training data will be provided by different sources. The source of the data will be provided in the from
field (currently, the only source is mintaka
but we plan to add more sources in the future).
Source | Description | Notes | License | Link |
---|---|---|---|---|
mintaka | Mintaka is a multilingual question answering dataset based on Wikidata. | Entity subset. | CC-BY | Mintaka |
Downloads
Here are the download links for the data for the shared task:
Data Type | Description | Download |
---|---|---|
Sample data | Sample data to show format and task requirements | link (.zip file) |
Validation data | Dataset for model development/validation | link (.zip file) |
Training data | Additional data for model training | link (.zip file) |
Test data (no targets) | Dataset for evaluation where the targets are hidden | link (.zip file) |
Prediction data | Predictions by GPT4o and GPT4o-mini | link (.zip file) |
Sample Data
The sample data contains a small subset of the data to give participants an idea of the data format and the task. When you are comfortable with the data format, you can use the sample data as a starting point to develop your models, including training and evaluation.
Validation Data
The validation data is a larger dataset that you can use to develop and validate your models. The validation data is similar to the sample data but contains more examples. The main uses of the validation data are model selection and hyperparameter tuning, but you can also use (part of) the validation data for fine-tuning your models.
Training Data
The training data is a larger dataset that you can use to train your models. Note that the training data is different from the sample and validation data in terms of distribution and size. The main use of the training data is to train your models.
Test Data
The test data is the dataset for evaluation. The test data does not contain the target references, so you can use the sample and validation data to evaluate your systems. The test data is similar to the sample and validation data but contains more examples. The main use of the test data is to evaluate your systems. We will release the target references after the evaluation period ends.
Prediction Data
The prediction data contains the predictions by GPT4o and GPT4o-mini. You can use this data to compare your predictions with the predictions by GPT4o and GPT4o-mini. You can also use this data to analyze the performance of your models or build on top of the predictions by GPT4o and GPT4o-mini.