An Automated Approach for Large-Scale Lexical Substitution

This website contains new large-scale datasets for the task of lexical substitution. Each instance in the datasets contains a target word in its context, associated with a silver list of possible substitutes. The automated procedure to obtain the datasets is described in the ALaSca paper.


The lexical substitution task aims at finding suitable replacements for words in context. The paucity of annotated data has forced researchers to apply mainly unsupervised approaches, limiting the applicability of large pre-trained models and thus hampering the potential benefits of supervised approaches to the task. We mitigate this issue by proposing ALaSca, a novel approach to automatically creating large-scale datasets for English lexical substitution. ALaSca allows examples to be produced for potentially any word in a language vocabulary and to cover most of the meanings it lists. Thanks to this, we can unleash the full potential of neural architectures and finetune them on the lexical substitution task. Indeed, when using our data, a transformer-based model performs substantially better than when using manually-annotated data only.


The datasets used in the paper can be downloaded here. The .zip folder contains the following files:

alasca_train.tsv            # the dataset (alasca_t) built with the procedure described in our paper.
alasca_coinco_train.tsv     # alasca_t concatenated with the training split of CoInCo.
alasca_twsi_train.tsv       # alasca_t concatenated with the training split of TWSI.
alasca_with_gold_train.tsv  # alasca_t concatenated with the training splits of CoInCo and TWSI.
coinco_train.tsv            # the training split of CoInCo.
twsi_train.tsv              # the training split of TWSI.
coinco_twsi_train.tsv       # the training splits of CoInCo and TWSI.
coinco_twsi_dev.tsv         # dev splits of CoInCo and TWSI.
lst_test.tsv                # test set (LST).

Each dataset has the same format. Consider, as an example, a row from alasca_train.tsv, i.e.

arrest.VERB 1498 5 All of the protesters were arrested . apprehend capture catch jail convict detain charge confront imprison detain::10

It is formatted as:

lexeme \t instance_id \t target_index \t sentence \t mask \t gold


lexeme is the target word, formatted as {lemma}.{pos};

instance_id is an integer that uniquely identifies the instance, i.e. the triple (lexeme, target_index, sentence);

target_index is an integer giving the (0-based) position of the target word in the sentence;

sentence is the space-separated context in which the target appears;

mask is the space-separated set of candidate substitutes collected across all instances of the given target lexeme. Spaces inside multi-word substitutes are replaced with underscores;

gold is a space-separated list of word::score elements, where each element is a possible substitute for the target in context together with its score. Here too, spaces inside multi-word substitutes are replaced with underscores. The score indicates how suitable a substitute is for the given context: the higher the score, the better the substitute.
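The fields above can be read with a few lines of Python. The following is a minimal sketch of a row loader based solely on the format described on this page (the function and key names are illustrative, not part of an official script):

```python
def parse_row(line):
    """Parse one tab-separated dataset row into its six fields."""
    lexeme, instance_id, target_index, sentence, mask, gold = \
        line.rstrip("\n").split("\t")
    # gold is a space-separated list of word::score elements;
    # underscores stand in for spaces inside multi-word substitutes.
    substitutes = {}
    for item in gold.split(" "):
        word, score = item.rsplit("::", 1)
        substitutes[word.replace("_", " ")] = float(score)
    return {
        "lexeme": lexeme,
        "instance_id": int(instance_id),
        "target_index": int(target_index),
        "tokens": sentence.split(" "),
        "candidates": mask.split(" "),
        "gold": substitutes,
    }

# The example row from alasca_train.tsv shown above:
row = ("arrest.VERB\t1498\t5\tAll of the protesters were arrested .\t"
       "apprehend capture catch jail convict detain charge confront imprison\t"
       "detain::10")
parsed = parse_row(row)
print(parsed["tokens"][parsed["target_index"]])  # -> arrested
print(parsed["gold"])                            # -> {'detain': 10.0}
```

Note that the sentence is already tokenized, so indexing the space-split tokens with target_index directly recovers the target word.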

Supplementary Material

In the supplementary material we provide the results for each tuned value of the similarity threshold.


Please cite our work as

  @inproceedings{lacerra2021alasca,
    title={ {AL}a{S}ca: an {A}utomated approach for {L}arge-{S}cale {L}exical {S}ubstitution},
    author={Lacerra, Caterina and Pasini, Tommaso and Tripodi, Rocco and Navigli, Roberto},
    booktitle={Proceedings of the 30th International Joint Conference on Artificial Intelligence},
    publisher={International Joint Conferences on Artificial Intelligence},
    year={2021}
  }

Caterina Lacerra (contact author)
PhD Student @ Sapienza University of Rome
lacerra [at]

Tommaso Pasini
Post-Doctoral Research Fellow @ University of Copenhagen
tommaso.pasini [at]

Rocco Tripodi
Assistant Professor @ University of Bologna
rocco.tripodi [at]

Roberto Navigli
Full Professor @ Sapienza University of Rome
navigli [at]


The authors gratefully acknowledge the support of the ERC Consolidator Grant
MOUSSE No. 726487 under the European Union’s Horizon 2020 research and
innovation programme.

This work was supported in part by the MIUR under grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of the Sapienza University of Rome.


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.