Distributional semantic models learn how linguistic items (words, terms, phrases, constructions) are related to each other by modelling how they occur and co-occur in naturally occurring text or speech. Meaningful relations between items are inferred from how the items are used.
Distributional models are attractive both for the straightforward path from theory to implementation and for their intuitive transparency and plausibility. They agree well with our own understanding of how we, as human language learners, most often learn new words or turns of phrase: we internalise a number of observed usages without explicit definition or instruction.
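To make the path from theory to implementation concrete, here is a minimal sketch of a count-based distributional model. It is an illustration only and not Gavagai's model: the toy corpus, the window size, and the function names `cooccurrence_vectors` and `cosine` are all assumptions made for this example.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical); a real model would be trained on far more text.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the dog".split(),
]
vectors = cooccurrence_vectors(corpus)

# Words used in similar contexts end up with similar vectors.
print(cosine(vectors["cat"], vectors["dog"]))
print(cosine(vectors["cat"], vectors["milk"]))
```

In this toy corpus, "cat" and "dog" share more contexts than "cat" and "milk", so the first similarity is higher than the second. No definitions are supplied anywhere: the relation emerges from observed usage alone.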
But there are important differences between human language learning and implemented distributional models.
- (1) humans have a much richer representation of usage than does a machine;
- (2) humans are constrained by the human experience to communicate about what is humanly communicable and comprehensible.
On the first point: the words in discourse are used in a situation with some specific speakers, specific constraints and observanda, at some specific time of day, year, and week. These factors are not taken into account when building a linguistic model: they are either discarded as being extralinguistic or relegated to the pragmatic wastebasket of non-formalisable facets of human-human communication.
On the second point: people do not use all conceivable combinations of expression with respect to known referents and relations in discourse. There are numerous obligatory constraints, somewhat more flexible conventional practices, and defaults of more or less imperative character that constrain what can be said or understood about what. These factors are seldom taken into account when building a linguistic model and are generally disregarded in work on distributional semantics.
This does not need to be so.
During 2017-2018, Gavagai will work on an 18-month research project to include extralinguistic and pragmatic facets of language use in a distributional model. The project is co-funded by Gavagai, Vinnova (the Swedish Governmental Agency for Innovation Systems), and the EU.
The project will build models to accommodate aspects of the meaning of language which are strongly bound to the here and now of the situation it is used in. Specifically, the project will investigate how those aspects of meaning can be modelled using distributional semantics, with an eye to future application to data from the Internet of Things, which involves vast numbers of interconnected sensors, actuators, and interaction devices alongside human language use.
The project will involve implementing an experimental extension of Gavagai’s current model for distributional semantics, theoretical work on linguistic representation, and large-scale experimentation.
The implementational aspects of the model will be developed at Gavagai during the first half of the project, in the spring of 2017. The first main challenge is how continuous data such as weather readings, temperature curves, or images of the sky can be incorporated in a distributional processing model that is currently built to handle discrete symbolic input, such as words.
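As a purely illustrative sketch of this challenge, one naive option is to bin a continuous reading into a discrete symbol and treat that symbol as one more context item next to the words of an utterance. This is an assumption for the sake of example, not the approach the project has committed to; the bin edges, the token labels, and the function names below are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical bin edges in degrees Celsius and one label per resulting bin.
TEMP_EDGES = [0.0, 10.0, 20.0, 30.0]
TEMP_LABELS = ["TEMP_FREEZING", "TEMP_COLD", "TEMP_MILD", "TEMP_WARM", "TEMP_HOT"]


def temperature_token(celsius):
    """Map a continuous temperature reading to a discrete symbolic token."""
    return TEMP_LABELS[bisect_right(TEMP_EDGES, celsius)]


def situated_tokens(utterance, celsius):
    """Return the words of an utterance followed by a token for its situation."""
    return utterance.split() + [temperature_token(celsius)]


# The sensor reading now looks like any other symbol to a word-based model.
print(situated_tokens("lovely day for a swim", 25.0))
```

Binning discards the ordering and distance between readings (the model has no way of knowing that "mild" lies between "cold" and "warm"), which is one reason the question of how to represent continuous data remains a research challenge rather than a solved preprocessing step.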
The second half, the academic year 2017-18, will be spent on a research visit to the Department of Linguistics at Stanford, extending the representation of distributional models to handle non-referential aspects of the semantics of utterances, such as aspect, tense, manner, process, and implicature. The enhanced distributional model developed during the research visit will then be tested using text data paired with streams from sensors and devices. Such a model will be a first step towards a practically applicable theory of distributional pragmatics.
The models designed in the project will be built to accommodate textual data that we collect from both editorial and social media on the attitudes of the general population towards a number of issues related to quality of life, climate change, and water quality. These texts are being collected for the purposes of the Ground Truth project on citizen observatories, funded by the European Commission, and will form a highly heterogeneous data set, in many languages, from many cultural areas, and in many genres. We expect many texts to be explicitly anchored in the situation of the writer, which will provide rich testing and validation materials.