- We use Ethersource to monitor usage of racist terminology in the Swedish blogosphere.
- We find that one of the largest demographic groups to use such terminology is young female bloggers.
- We demonstrate how we are able to cluster and profile users of racist terminology.
One of the many benefits of Ethersource is that it is not limited to the standard positive/neutral/negative sentiment palette, but can be used to analyze and monitor any type of textually manifested phenomenon. Previous examples in this blog include artist popularity, flu trend, aversive language, and positivity vs headache.
In this post, we report some observations from using Ethersource to monitor racist expressions in the Swedish blogosphere.
The following image shows how often racist terminology occurred in the Swedish blogosphere from late March to the end of May 2012. Clearly, racist terminology is an everyday occurrence on Swedish blogs.
However, merely counting the frequency of occurrence of racist terminology is of limited usefulness for understanding what people say and mean, since there are many ways to use terminology. Some uses may signal ideological or political standpoints, but other uses may not (e.g. discussions about the terminology itself, such as the origin and appropriateness of various terms). Thus, only counting the frequency of occurrence of racist terminology in the blogosphere may lead to premature or misleading conclusions. We therefore also monitor negative or degrading usage of racist terminology, as well as aggressive or hateful usage. And there is a difference between counting frequencies and counting opinionated usage, as we can see in the image below, which shows frequency (in blue), degrading usage (in green), and aggressive usage (in red).
The total frequency of racist terminology is clearly much higher than the frequency of either degrading or aggressive use. As a rough estimate, approximately 10% of the posts containing racist terminology are negative or degrading, while approximately 5% are aggressive or hateful.
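The three lines in the chart, and the rough shares quoted above, can be thought of as simple counts over labeled posts. The following is only a minimal sketch of that bookkeeping, not a description of how Ethersource works internally: the function names, the label names (`"degrading"`, `"aggressive"`, `"other"`), and the assumption that each post carries exactly one label are all hypothetical, and the toy data is invented for illustration.

```python
from collections import Counter


def daily_counts(posts):
    """Count posts per day: all posts, degrading posts, aggressive posts.

    `posts` is an iterable of (day, label) pairs. The label scheme is an
    assumption made for this sketch, with one label per post.
    """
    total, degrading, aggressive = Counter(), Counter(), Counter()
    for day, label in posts:
        total[day] += 1  # every post contains the target terminology
        if label == "degrading":
            degrading[day] += 1
        elif label == "aggressive":
            aggressive[day] += 1
    return total, degrading, aggressive


def overall_share(part, total):
    """Share of all posts that fall into `part` (0.0 if there are no posts)."""
    n = sum(total.values())
    return sum(part.values()) / n if n else 0.0


# Invented toy data; it does not reproduce the figures reported in this post.
toy_posts = [
    ("2012-04-01", "other"),
    ("2012-04-01", "degrading"),
    ("2012-04-02", "other"),
    ("2012-04-02", "aggressive"),
]

total, degrading, aggressive = daily_counts(toy_posts)
degrading_share = overall_share(degrading, total)
aggressive_share = overall_share(aggressive, total)
```

Plotting `total`, `degrading`, and `aggressive` per day would give three curves of the kind described above, while the two shares correspond to the rough percentage estimates.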
The general trends in these graphs are not of lasting value, since the time span is relatively short. What is interesting, and surprising, is the demographic profile of the bloggers behind the two bottom lines. Since Ethersource enables an analyst to retrieve individual blog posts which contain a given target (in this case, racist terminology), it is possible to further analyze the material. Looking at the blog posts that use racist terms in degrading ways, we find that roughly 25% are written by young female bloggers who write about their own lives. Perhaps even more striking, around 10% of blog posts using racist terms in aggressive ways are written by the same group. This is unexpected, considering that the topical content of these blogs revolves around everyday events, lifestyle, and fashion.
Demographic clustering and stylometric profiling
The noteworthy observation above suggests that it may be interesting to also look more closely at the non-opinionated usage of racist terminology (i.e. the occurrences that are neither aggressive nor degrading). We do so by automatically clustering all the blog posts containing racist terminology during 2012. Always keeping the obvious risk of overgeneralizing in mind, we infer from manual inspection of the material that the four main clusters represent the following groups of bloggers:
Imagine that, for some reason, we could not inspect the material manually and therefore did not know the demographics of the clusters we found. In such cases, we can use stylometric profiling to characterize the stylistic differences between clusters, and based on these differences we can infer demographic information. As an example, consider the following comparison between the stylometric profile for the cluster containing the young female bloggers and the stylometric profile for the cluster containing mainly political bloggers.
The comparison between these two stylometric profiles shows that the main stylistic differences between these two groups of bloggers (let’s call them group F for the young female bloggers and group P for the political bloggers) can be found in the following variables:
- Self
  - Group F is more self-oriented, which indicates that this group talks mainly about things that happen to the author, stuff the author thinks or worries about, or things that the author does.
- Address
  - Group F refers directly to the reader more often than does group P.
- Abstract vocabulary
  - Group P tends to use more abstract and complex vocabulary than group F.
- Anchoring
  - Blog posts from group P contain more explicit temporal and spatial references than do posts from group F.
These differences suggest that authors in group F (the young female bloggers) write mainly from a subjective point of view, while authors in group P (the political bloggers) adopt a more factual perspective. Based on such differences, we may formulate hypotheses about the demographics of the two groups: since one group writes from a more personal and immediate perspective, its authors can be assumed to be younger and more personally engaged in their narration than those of the other group. This characterisation of author style is actually more salient than the objective notion of author age and gender, since writing style and authoring background are more useful for understanding blog posts than age, gender, or other demographic variables.
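To make the four variables above more concrete, here is a minimal sketch of how a crude stylometric profile could be computed and compared. It is an illustration under stated assumptions and not the profiling method used in Ethersource: the word lists are hypothetical and in English (the blogs in this study are Swedish), the share of long words is only a rough stand-in for abstract vocabulary, and the two sample texts are invented.

```python
import re

# Hypothetical marker lists, chosen only for illustration.
SELF_WORDS = {"i", "me", "my", "mine", "myself"}
ADDRESS_WORDS = {"you", "your", "yours", "yourself"}
ANCHOR_WORDS = {"today", "yesterday", "tomorrow", "now", "here", "there"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def profile(text, long_word_len=8):
    """Return the relative frequency of each stylistic marker in `text`."""
    tokens = tokenize(text)
    n = len(tokens)
    if n == 0:
        return {"self": 0.0, "address": 0.0, "abstract": 0.0, "anchoring": 0.0}
    return {
        "self": sum(t in SELF_WORDS for t in tokens) / n,
        "address": sum(t in ADDRESS_WORDS for t in tokens) / n,
        # Crude proxy (an assumption): long words stand in for
        # abstract and complex vocabulary.
        "abstract": sum(len(t) >= long_word_len for t in tokens) / n,
        "anchoring": sum(t in ANCHOR_WORDS for t in tokens) / n,
    }


def compare(profile_a, profile_b):
    """Per-variable difference; positive means higher in profile_a."""
    return {key: profile_a[key] - profile_b[key] for key in profile_a}


# Invented sample texts standing in for the two clusters.
text_f = "I think you would love my new shoes"
text_p = "Yesterday the parliament in Stockholm debated immigration legislation"

diff = compare(profile(text_f), profile(text_p))
```

With these toy texts, `diff` is positive for `self` and `address` and negative for `abstract` and `anchoring`, which mirrors the direction of the differences between group F and group P described above. In practice, a profile would be averaged over all posts in a cluster rather than computed from a single sentence.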
The analysis and discussion above serve as an illustrative example of how stylometric profiling correlates well with human intuition about demographic clustering, and of how such profiles may serve as explanatory constructs for a demographic clustering solution. We conclude this blog post with the observation that the combination of attitude analysis, clustering, and profiling provides a very powerful framework for analysis of online content.