Resources
- Resources and Tools for Computational Historical Linguistics
- Neural Text Simplification Models
- Europarl Corpus of Native, Non-native and Translated Texts - ENNTT
- A Visual Representation of Wittgenstein’s Tractatus Logico-Philosophicus
- Romanian Determiners Lexicon - RoDetLexicon 1.1
- Comparing Speech and Text Classification of Native and Non-native English
- Degrees of Similarity Between Romanian and Related Languages
- Cross-lingual Named Entity Recognition
- A Computational Exploration of Pejorative Language in Social Media
Resources and Tools for Computational Historical Linguistics
- The Java code for automatically identifying and producing related words for historical linguistics: link.
- The translations used for dictionary-based identification of cognates: link.
- The input and output files for experiments on identification and production of related words: link.
Neural Text Simplification Models
- We present the first attempt at using sequence-to-sequence neural networks to model text simplification (TS). Unlike previously proposed automated methods, our neural text simplification (NTS) systems can perform lexical simplification and content reduction simultaneously. An extensive human evaluation of the output has shown that NTS systems achieve good grammaticality and meaning preservation of output sentences, and a higher level of simplification than state-of-the-art automated TS systems.
To generate simplified text, follow these steps in order:
- Check out our repository, including the submodules:
git clone --recursive https://github.com/senisioi/NeuralTextSimplification.git
- Download the released pre-trained models, NTS and NTS-w2v (this may take a while):
python src/download_models.py ./models
- Run translate.sh from the scripts directory:
cd src/scripts && ./translate.sh
Europarl Corpus of Native, Non-native and Translated Texts - ENNTT
- A complete description of this resource is available here: A Corpus of Native, Non-native and Translated Texts, LREC, 2016, PDF
- For the raw corpus, please check the dataset available here
- For the experiments presented in the ACL 2016 paper, please check the dataset available here
- For the experiments presented in the LREC 2016 paper, please check the dataset available here
Short description:
- This is a monolingual English corpus of native, non-native and (human) translated texts extracted from the European Parliament. The translated texts from different source languages represent a subset of the Haifa Corpus of Translationese. We preserved the same annotation style and included an ID and the EU state that each member of the European Parliament represents.
- We hope this dataset will facilitate a unified comparative study of translations and language produced by highly fluent non-native speakers, two closely-related phenomena that have only been studied in isolation so far.
A Visual Representation of Wittgenstein’s Tractatus Logico-Philosophicus
- This work is the result of our collaboration with Anca Bucur, Ph.D. candidate, from the Center of Excellence in Image Study.
- We compile a multilingual parallel corpus from different versions of Wittgenstein’s Tractatus Logico-Philosophicus, including the original in German and translations into English, Spanish, French, and Russian. Using this corpus, we compute a similarity measure between propositions and render a visual network of relations for different languages.
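The idea of scoring pairs of propositions and keeping the strongest links as network edges can be illustrated with a minimal sketch. The similarity measure used in this work is not specified here, so the sketch substitutes a simple Jaccard overlap of word sets as a stand-in. The function names, the threshold value and the toy sentences are all hypothetical and are not taken from the corpus.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two texts.

    This is an assumed stand-in measure, not the one used in the project.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def similarity_edges(propositions, threshold=0.2):
    """Return (id_a, id_b, score) for every pair scoring at least `threshold`."""
    edges = []
    for (id_a, text_a), (id_b, text_b) in combinations(propositions.items(), 2):
        score = jaccard(text_a, text_b)
        if score >= threshold:
            edges.append((id_a, id_b, round(score, 3)))
    return edges


# Toy placeholder sentences (hypothetical, not text from the Tractatus).
toy = {
    "p1": "the cat sat on the mat",
    "p2": "the cat lay on the mat",
    "p3": "stock prices fell sharply today",
}
edges = similarity_edges(toy)
```

The resulting edge list can be passed to any graph-drawing tool; running the same procedure once per language would give one network per version of the text.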
Comparing Speech and Text Classification of Native and Non-native English
- We provide a comparison of speech and text classification of native and non-native English using a subset of the International Corpus Network of Asian Learners of English (ICNALE)
- The analysis is reported in the paper Nisioi, S., Comparing Speech and Text Classification on ICNALE, LREC 2016
Romanian Determiners Lexicon - RoDetLexicon 1.1
- The first version of the Romanian Determiners Lexicon (RoDetLexicon 1.1) specifies the relevant features for the determiners studied so far during the research project “The structure and interpretation of Romanian Determiner Phrase in Discourse Representation Theory: the determiners”. Determiners are important for both syntax and semantics. From the point of view of syntactic theory, specifying a determiner’s relevant features leads naturally to determining the parameters of syntactic variation in the Determiner Phrase domain. From the discourse perspective, determiners play a fundamental role, being the most important constituents for establishing the logical structure of the sentence or of the discourse.
- The feature matrix of each determiner contains morpho-syntactic and semantic features, as they emerged from the studies developed during the project, such as: syntactic category, selectional features, phi-features (person, number, gender), definiteness, quantificational features, cardinality, focus, topic, deixis, proximity, contrastive, location, anaphoric, cataphoric or classifier.
- More details are available in this paper.
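As a rough illustration of what a per-determiner feature matrix might look like in code, the sketch below models one entry as a record with a handful of the features listed above. The class name, field names, value encodings and the example entry are all assumptions made for illustration; they do not reproduce the actual format or contents of RoDetLexicon.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class DeterminerEntry:
    """Hypothetical record for one determiner (field names are illustrative)."""
    form: str
    syntactic_category: str
    person: Optional[int] = None      # phi-feature
    number: Optional[str] = None      # phi-feature
    gender: Optional[str] = None      # phi-feature
    definite: Optional[bool] = None
    proximity: Optional[str] = None
    anaphoric: Optional[bool] = None


def feature_row(entry, columns):
    """Project an entry onto the chosen feature columns, in order."""
    values = asdict(entry)
    return [values.get(column) for column in columns]


# Illustrative values only; not copied from the lexicon.
example = DeterminerEntry(
    form="acest",
    syntactic_category="D",
    number="singular",
    gender="masculine",
    definite=True,
    proximity="proximal",
)
row = feature_row(example, ["form", "number", "gender", "proximity"])
```

Features that do not apply to a given determiner are left as `None`, which keeps every row the same width when entries are stacked into a matrix.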
Degrees of Similarity Between Romanian and Related Languages
- More details are available in Ciobanu, A.M. and Dinu, L.P., An Etymological Approach to Cross-Language Orthographic Similarity. Application on Romanian, EMNLP 2014 PDF
Cross-lingual Named Entity Recognition
- Experiments on named entity translation using word embeddings are described in Şulea, O. M., Nisioi, S., and Dinu, L. P., Using Word Embeddings to Translate Named Entities, LREC 2016
- This resource is an annotated parallel corpus of named entities and is currently a work in progress
A Computational Exploration of Pejorative Language in Social Media
- More details about this resource can be found in Dinu, L. P., Iordache, I. B., Uban, A. S., Zampieri, M.: A Computational Exploration of Pejorative Language in Social Media, Findings of the Association for Computational Linguistics: EMNLP 2021.
- The dataset can be downloaded via this link.