What are Word Embeddings? Why are they important?
Engineering a computer to understand the meaning of words is a notoriously difficult problem in Data Science, Natural Language Processing and Artificial Intelligence.
Any effective text analysis software must be able to take semantics into account. Knowing that ‘toilet’ and ‘loo’ have essentially the same meaning is a basic prerequisite to analysing reviews of customer facilities. Relying on string similarity (words that appear similar because of the letters they contain: ‘tram’, ‘tramp’, ‘car’, ‘carpet’) is not sufficient: matching letters alone tells us nothing about semantic similarity.
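To make this concrete, here is a minimal sketch (plain Python using the standard library, not part of our product) showing how a standard string-similarity measure rewards surface look-alikes rather than synonyms:

```python
# Illustrative only: surface string similarity says nothing about meaning.
from difflib import SequenceMatcher

def string_similarity(a: str, b: str) -> float:
    """Ratio of matching characters between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a, b).ratio()

pairs = [("tram", "tramp"), ("car", "carpet"), ("toilet", "loo")]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {string_similarity(a, b):.2f}")

# The surface look-alikes 'tram'/'tramp' and 'car'/'carpet' score far higher
# than the genuine synonyms 'toilet'/'loo'.
```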
One possibility is to make use of resources such as Princeton WordNet, essentially a sophisticated online dictionary in a programmable format. There are problems with this approach, however. A static, discrete ontology does not provide any meaningful way to quantify the similarity of words. And language is dynamic – constantly evolving, updating and developing at a thunderous pace – so a hand-curated resource inevitably lags behind actual usage.
Here at Gavagai, we use an AI-based approach. Our lexicon leverages word space technology, also known as word embeddings, semantic word vectors or distributional semantic models. For each language (be it English, Finnish, Arabic etc.) the meaning of a word is represented as a point in a multi-dimensional vector space, and that representation is learned by digesting vast amounts of text.
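As a rough illustration (the vectors below are made up and far smaller than in a real word space, and this is not our actual model), words become vectors and semantic similarity becomes the cosine of the angle between them:

```python
# Toy sketch: each word is a point in a vector space, and similarity of
# meaning is measured as cosine similarity between vectors.
import numpy as np

# Hypothetical 4-dimensional embeddings; real word spaces use hundreds or
# thousands of dimensions learned from large amounts of text.
embeddings = {
    "toilet": np.array([0.9, 0.1, 0.0, 0.2]),
    "loo":    np.array([0.8, 0.2, 0.1, 0.3]),
    "tram":   np.array([0.0, 0.7, 0.9, 0.1]),
}

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["toilet"], embeddings["loo"]))   # high
print(cosine_similarity(embeddings["toilet"], embeddings["tram"]))  # low
```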
What is different about Gavagai’s Word Embeddings?
Word embedding technology has been around the block in a range of different formats (cf. a brief history of word embeddings). Our own word spaces are much more powerful than standard implementations.
To begin with, our system is trained from online data. The internet is the largest corpus of real-world language usage that exists. Thanks to this, we are able to exploit a rich amount of text from many different domains, which enhances our Natural Language Understanding of many different types of data. In addition, our system continuously learns and incrementally updates, so we are able to keep abreast of the latest coinages and nuances in online language in a completely unsupervised way.
This incrementality also means we do not need to retrain our models from scratch. (In this case, the word ‘model’ refers to the set of word meanings we have learned for a language.) Almost all Machine Learning and Deep Learning models use training sets of fixed size. Once such a model is trained, it is immutable and fixed in place: as soon as new data becomes available, the existing representations cannot be updated without starting completely from scratch. This is gobsmackingly inefficient – work already carried out is discarded and, in some cases, the train-and-redeploy cycle can take several days. When dealing with online data, the problem is exacerbated: new documents and content are being created every minute of every day. Our technology works on-the-fly, updating incrementally as each new document comes in. No retraining is needed.
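To show the idea, here is a toy sketch of incremental word-space learning, loosely in the spirit of random indexing. The names, window size and dimensionality are illustrative, not our production system: each incoming document simply nudges the existing word vectors, rather than triggering a retrain.

```python
# Toy sketch of incremental (online) word-space learning: each word gets a
# fixed, mostly-zero random "index vector", and its meaning vector is built
# up by adding the index vectors of its neighbours as new documents arrive.
import numpy as np

DIM = 300          # dimensionality of the word space (illustrative)
WINDOW = 2         # how many neighbouring words count as context

rng = np.random.default_rng(0)
index_vectors = {}    # word -> fixed random index vector
context_vectors = {}  # word -> accumulated meaning vector

def _index_vector(word: str) -> np.ndarray:
    """Assign each word a fixed, sparse random index vector on first sight."""
    if word not in index_vectors:
        index_vectors[word] = rng.choice([-1.0, 0.0, 1.0], size=DIM,
                                         p=[0.05, 0.9, 0.05])
    return index_vectors[word]

def update(document: str) -> None:
    """Incrementally fold one new document into the word space."""
    tokens = document.lower().split()
    for i, word in enumerate(tokens):
        vec = context_vectors.setdefault(word, np.zeros(DIM))
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                vec += _index_vector(tokens[j])

# Each new document updates the model in place; nothing is retrained.
update("the loo was clean and tidy")
update("the toilet was clean but busy")
```

Because each update only adds to existing vectors, the cost of folding in a new document stays small and none of the previous work is thrown away.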
Due to the domain diversity of our training data, we are also able to handle variations, misspellings and other corruptions of the data. This is vital for accurately processing the extreme variability of natural language.
Our word space technology powers the core of what we do. It allows us to take previous developments in topic modelling and sentiment analysis to commercially deployable levels.