ences. Our proposed approach captures many more facets of people by directly mining the interests of the crowd.
• Our research may improve the process of geocoding of geo-textual data.
Given a user who specified "London" as his location, the probabilistic distribution might be shifted towards the city of London, Ontario, Canada, if the vocabulary, topics and keywords of his tweets are more similar to regions within that area. This can be done by describing the user who is to be geocoded by the set of his own tweets, obtaining a proper feature representation, and comparing this representation to candidate geo-locations.
• For targeted marketing, it may be much more interesting for a company to direct its advertisements to an area of people having a similar mindset, even if this area spans multiple political regions. For example, an upper-class car manufacturer may be looking to direct an advertisement at a wealthy city district. However, parts of the administrative city district may not actually be wealthy, or the actual wealthy population may extend beyond the city district. With our approach, the car manufacturer can target its advertisement at the mental cluster that is rooted in the wealthy city district.
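The geocoding idea from the list above (describe a user by his own tweets, then compare that representation to candidate geo-locations) can be sketched as follows. This is only an illustration under stated assumptions: the bag-of-words representation, the cosine similarity, the helper names `term_vector`, `cosine` and `rank_candidates`, and all example texts are hypothetical and not taken from the source.

```python
import math
from collections import Counter


def term_vector(texts):
    """Bag-of-words term counts over a collection of texts (assumed representation)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(user_tweets, candidate_texts):
    """Rank candidate locations by similarity of their texts to the user's tweets."""
    user_vec = term_vector(user_tweets)
    scores = {
        name: cosine(user_vec, term_vector(texts))
        for name, texts in candidate_texts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy data: two candidate meanings of "London".
candidates = {
    "London, UK": ["tube delays on the northern line", "rainy day by the thames"],
    "London, Ontario": ["knights hockey game tonight", "snow on highway 401 again"],
}
user = ["stuck on highway 401", "hockey game was great"]

ranking = rank_candidates(user, candidates)
print(ranking[0][0])  # the candidate whose texts best match the user's tweets
```

In this toy setting the user's vocabulary overlaps far more with the Ontario texts, so that candidate is ranked first. A real system would turn such scores into the probabilistic distribution mentioned above rather than a hard ranking.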
Obviously, the performance of the hierarchical clustering step strongly depends on the quality of the text representation vectors that are used to fit the clustering model. Although we limit the proof of concept to fairly simple representations, summarized in Section 3.4, we want to emphasize that more sophisticated text mining approaches, such as those surveyed in [18], or even deep learning approaches that learn vector representations [200], might boost the performance of the clustering step.
The remainder of this chapter is organized as follows. In Section 7.2, we formalize our search for a socio textual map and identify the research challenges that need to be solved to realize this vision. In Section 7.3, we implement a first solution by addressing each of the research challenges in an initial way. We show that our vision is feasible: if the necessary research steps are all solved thoroughly, then a large-scale solution to map the minds of people may become reality.
[Figure: workflow from text messages per region (e.g., "Hello, World!", "Good night", "I like ham.") via feature selection and transformation (binary feature vectors) and a regions x regions distance measure matrix to hierarchical metric clustering.]
Figure 7.1: Searching in collections of multi-represented users.
socio textual map is feasible and discuss problems, open research questions and challenges. The first step requires obtaining a feature representation of a potentially large and dynamic set of textual documents.
Definition 20 (Feature Selection). Let $S \in \mathit{String}^*$ be a set of text documents. A function $f : S \mapsto \mathbb{R}^d$ is called a $d$-dimensional feature representation of $S$.
The choice of function f is one of the main challenges. This function should be chosen such that two sets of text documents S1 and S2 are similar in terms of the topics, interests and experience expressed in these texts if and only if f(S1) is similar to f(S2). Thus, a proper feature selection method should discard terms without informative content, such as words that appear very frequently.
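To make the notion of a feature representation concrete, the following minimal sketch builds a binary function f over a fixed vocabulary, in the spirit of the binary feature strings shown in Figure 7.1. The vocabulary, the whitespace tokenization and the helper name `make_binary_representation` are assumptions made for illustration; the source does not prescribe them.

```python
def make_binary_representation(vocabulary):
    """Return a function f mapping a set of documents to a vector in {0,1}^d.

    d is the size of the (assumed, fixed) vocabulary; component i is 1 if
    vocabulary term i occurs in at least one of the documents.
    """
    index = {term: i for i, term in enumerate(vocabulary)}

    def f(documents):
        vector = [0] * len(index)
        for document in documents:
            # Naive whitespace tokenization; punctuation handling is omitted.
            for token in document.lower().split():
                position = index.get(token)
                if position is not None:
                    vector[position] = 1
        return vector

    return f


# Hypothetical four-term vocabulary, so d = 4.
f = make_binary_representation(["hello", "world", "ham", "night"])
features = f(["Hello world", "I like ham"])
print(features)
```

Terms outside the vocabulary ("i", "like") are simply ignored, which is where the choice of informative terms discussed above enters.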
A common approach is referred to as document frequency-based selection, introduced by [175]. The idea is to weight those terms that appear more frequently with a higher value. To avoid prioritizing stop words like "the", [238] extended this approach by using inverse document frequencies. However, these approaches give no information about the importance of a keyword in terms of describing the mental topic of the user generating the text. This is the challenge of feature extraction for socio textual mapping. In [76], a concept is presented that uses an entropy measure to select those features that carry the most information. Taking this concept to feature selection for geo-textual data, the idea is to select terms which are highly frequent in only a few regions and extremely rare in others. Such local trends may be extremely useful to distinguish regions at a local scale, but may become useless for other scales and other areas. However, it seems intuitive that a proper approach should also include global trends that most of the world has (to different degrees) on their mind.
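One way to read the entropy idea for geo-textual data is to look at how the occurrences of a term are distributed over regions: a term concentrated in few regions has low entropy, while a term spread evenly over all regions has high entropy. The sketch below follows this reading. It is not the method of [76]; the function names and the per-region term counts are invented for illustration.

```python
import math
from collections import Counter


def regional_entropy(term, region_counts):
    """Shannon entropy (in bits) of a term's occurrence distribution over regions.

    Returns None if the term does not occur in any region.
    """
    counts = [c[term] for c in region_counts.values() if c[term] > 0]
    total = sum(counts)
    if total == 0:
        return None
    return -sum((c / total) * math.log2(c / total) for c in counts)


def rank_terms_by_locality(region_counts):
    """Sort all observed terms from most regionally concentrated to most spread out."""
    vocabulary = set()
    for counts in region_counts.values():
        vocabulary.update(counts)
    scored = [(term, regional_entropy(term, region_counts)) for term in vocabulary]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


# Invented term counts per region.
region_counts = {
    "Bavaria": Counter({"the": 50, "oktoberfest": 9}),
    "Hamburg": Counter({"the": 50, "harbour": 8}),
    "Berlin": Counter({"the": 50, "oktoberfest": 1}),
}

ranked = rank_terms_by_locality(region_counts)
for term, entropy in ranked:
    print(f"{term}: {entropy:.3f} bits")
```

The stop word "the" is spread evenly and ends up last, while terms confined to one or two regions come first. Note that entropy alone ignores how frequent a term is, so a practical selector would likely combine it with a minimum frequency, and would need a separate mechanism for the global trends mentioned above.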
Next, we apply function f to geo-spatial regions.
Definition 21 (Feature Transformation). Let $W$ denote a hierarchical partitioning of the geo-spatial region representing the surface of the Earth. Each level of $W$ corresponds to a geographic scale, e.g., continental level, country level, state level and city level. The function $\mathit{text} : W \mapsto \mathit{String}^*$ returns, for each region $w \in W$, the set $\mathit{text}(w)$ of text documents that are associated with $w$.
The first step of our workflow in Figure 7.1 illustrates this step. In the top-left of Figure 7.1, we consider a spatial region w corresponding to Bavaria, Germany. We take the set of text messages text(Bavaria) and apply a feature transformation (binary in this example). The same feature transformation is applied to all other German states. In the next step, we need to assess the pair-wise dissimilarity between these regions, using standard vector distance functions. Using the resulting dissimilarity matrix, depicted by way of example in the lower-left of Figure 7.1, we can apply a metric clustering approach to find groups of similar regions. The choice and the parameterization of this clustering approach are another challenging step. For instance, the clustering approach needs to account for different geographic scales. That is, regions at the country level should be allowed much more freedom to be considered similar than regions at the city level.
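The two steps just described (a pair-wise dissimilarity matrix over regions, followed by a metric clustering) can be sketched with standard-library Python. The source does not fix a distance function or a clustering algorithm, so the Hamming distance, the single-linkage agglomerative scheme, the `threshold` parameter and the binary region vectors below are all assumptions chosen for brevity.

```python
def hamming(u, v):
    """Number of positions in which two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def distance_matrix(vectors):
    """Return region names and the regions x regions dissimilarity matrix."""
    names = list(vectors)
    matrix = [[hamming(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


def single_linkage(names, dist, threshold):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Merging stops once the closest pair of clusters is farther apart than
    `threshold` (an assumed, scale-dependent parameter).
    """
    clusters = [{i} for i in range(len(names))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i] |= clusters[j]
        del clusters[j]
    return [sorted(names[k] for k in cluster) for cluster in clusters]


# Invented binary feature vectors for four German states.
vectors = {
    "Bavaria": [1, 0, 1, 1, 0],
    "Baden-Wuerttemberg": [1, 0, 1, 0, 0],
    "Hamburg": [0, 1, 0, 0, 1],
    "Bremen": [0, 1, 0, 1, 1],
}

names, dist = distance_matrix(vectors)
groups = single_linkage(names, dist, threshold=2)
print(groups)
```

One simple way to account for geographic scale in such a scheme would be to choose a larger `threshold` at the country level than at the city level, although how to parameterize this properly is exactly the open challenge noted above.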
Theoretic Foundation
There are two main theoretical reasons why finding a socio textual map is viable and feasible. The first is the law of large numbers, and the second is Tobler’s first law of geography.
The law of large numbers states that, for a random variable, the empirical probability approaches the actual probability as more trials are performed. Applied to our problem, we can treat the topic of a tweet as a random variable. For a sufficiently large number of tweets drawn from a region, the law of large numbers states that the fraction of tweets having a specific topic converges to the true fraction of people having this topic on their mind.² The flood of daily tweets and other geo-textual data sets allows us to exploit the law of large numbers to obtain a representative sample of the minds of people in a region.
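The convergence argument can be illustrated with a small simulation. It assumes, purely for illustration, that each tweet from a region independently has a given topic with a fixed probability of 0.3; the helper name `empirical_fraction` and the sample sizes are likewise hypothetical.

```python
import random


def empirical_fraction(true_fraction, n_tweets, rng):
    """Fraction of n_tweets simulated tweets that carry the topic of interest."""
    hits = sum(rng.random() < true_fraction for _ in range(n_tweets))
    return hits / n_tweets


rng = random.Random(0)  # fixed seed so the run is repeatable
true_fraction = 0.3     # assumed share of people with this topic on their mind

estimates = {
    n: empirical_fraction(true_fraction, n, rng) for n in (10, 1000, 100000)
}
for n, estimate in estimates.items():
    print(f"{n:>6} tweets: estimate {estimate:.4f}, error {abs(estimate - true_fraction):.4f}")
```

With few tweets the estimate can be far from the assumed true fraction, whereas with many tweets it is expected to lie close to it. Real tweets are of course neither independent nor a uniform sample of the population, so this only illustrates the statistical principle.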
Tobler’s first law of geography states that “Everything is related to everything else, but near things are more related than distant things” [248] and is one of the key reasons why “spatial is special” [162]. It is the reason why we expect that a clustering of the minds of regions results in a clustering that is also spatially correlated. And it is the reason why we envision that we can obtain a socio textual map that captures not only lingual vocabulary, but also topics and trends that people think about.