
is used, there is no model learning phase. The run-time results for the classification are shown in Figure 6.8(b). A linear run-time is observed, which is attributed to the extremely high dimensionality of the feature vectors, which cannot be beneficially supported by an index structure for the kNN search.

But even at the full 15,989 users, the time to classify each trace is less than 1 ms.

Socio-Textual Mapping

The work presented in this chapter has been published as the article "Socio Textual Mapping" in the Proceedings of the 8th ACM SIGSPATIAL International Workshop on Location-Based Social Networks, 2015 [262].

7.1 Introduction

Traditionally, a spatio-temporal database consists of triples (objectID, time, location), mapping objects (e.g., users) and time to a position in geo-space where the object was, is, or will be located. In recent applications, this information is further enriched by textual information. For example, in geo-social networks, users can check in at their current location, such as a restaurant, and publish a textual description of their experience at this location. Another example is Twitter, where users can broadcast short messages of no more than 140 characters. Many of these tweets contain a geographical tag corresponding to the geo-spatial position of the user. Loosely speaking, the textual content of a tweet contains information about what's on the mind of a user: a tweet may describe an experience that a user wants to share, a restaurant that a user wants to recommend, an achievement that the user wants to boast about, or simply anything the user wants to say. In this chapter, we generalize this concept by assuming that the collection of recent tweets of a region reflects what's on the mind of that region.

As an example, consider the two topics "Justin Bieber" and "Greek Bankruptcy", and consider two geo-spatial regions, such as Ontario, Canada and Germany. It may turn out that in Ontario, one percent of all tweets contain the keyword "Justin", and five percent of all tweets contain the keyword "Greece". In contrast, the Twitter users in Germany may use the keyword "Justin" in only 0.1 percent of their tweets, but use the keyword "Greece" in ten percent of their tweets. Clearly, these two distributions of keywords are different. Thus, people in Ontario and people in Germany tweet about different things: different things are on their minds. We want to automatically extract a feature representation of what's on the mind of people.
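The keyword comparison in this example can be illustrated with a minimal sketch. The tweets below are invented toy data, the function names `keyword_shares` and `l1_distance` are hypothetical, and the L1 distance is only one assumed way of quantifying how far two keyword distributions lie apart; none of this is taken from the approach developed later in the chapter.

```python
from collections import Counter


def keyword_shares(tweets, keywords):
    """Return, for each keyword, the fraction of tweets that contain it."""
    counts = Counter()
    for text in tweets:
        tokens = set(text.lower().split())  # count each keyword once per tweet
        for kw in keywords:
            if kw in tokens:
                counts[kw] += 1
    n = len(tweets)
    return {kw: (counts[kw] / n if n else 0.0) for kw in keywords}


def l1_distance(p, q):
    """Sum of absolute differences between two keyword-share dictionaries."""
    return sum(abs(p[k] - q[k]) for k in p)


# Invented toy tweets; real regions would contribute far larger collections.
ontario = ["justin is on tour", "snow again today", "greece news", "hockey night"]
germany = ["greece debt talks", "greece again", "football tonight", "rain in berlin"]
keywords = ["justin", "greece"]

shares_ontario = keyword_shares(ontario, keywords)
shares_germany = keyword_shares(germany, keywords)
difference = l1_distance(shares_ontario, shares_germany)
```

With only a handful of tweets per region, such shares are dominated by noise, which is the sample-size problem discussed next.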

In the past, such a vision of describing a region by the text messages published in that region was entirely infeasible. Even in the example above, if we only have a few hundred tweets per day in Germany, then making a significant statement about the frequency of the topic "Justin Bieber" is hard. Drawing conclusions about the frequency of rare topics such as "Databases" was even harder, and drawing conclusions for smaller spatial regions, such as cities or parts of cities, was completely impossible. But now, both current trends in technology, such as smart phones, general mobile devices, stationary sensors and satellites, and a new user mentality of utilizing this technology to voluntarily share information produce a huge flood of geo-textual data.

Today, with 500 million tweets per day, in addition to other sources of geo-textual data such as travel blogs and social networks, we are suddenly able to draw significant conclusions about the frequency of rare terms even in small spatial regions. It's time to use this data. In [69], Cheng et al. proposed a framework to predict a Twitter user's city-level location based solely on the words in the user's tweets. We generalize this idea to not only determine words that classify cities, but also find the latent concepts and topics that describe a generic region. Thus, we extend the relationship to areas such as districts, cities, states, or even artificial regions not tied to political borders (e.g., a music festival at an off-site location). Our vision is to describe geo-spatial regions by a representation of their thoughts. We therefore derive simple representations from the textual contents of microblog data. Using these representations, we want to hierarchically cluster the world in terms of what's on the mind of its people. We call the result a socio textual map, which we envision to be useful in a large variety of application fields:

• Research in sociology has focused on the problem of ghettos and social tensions in modern cities [130, 75, 131]. This kind of research relies on census data using "up to six race/ethnicity groups (white, black, Hispanic, Indian, Asian and other)" [131]. We claim that, especially in the 21st century, the race/ethnicity distribution of regions is not the sole source of social tension. Social tension may be caused simply by having different opinions and beliefs. With our solution, we can find spatial regions, on a city scale, whose people have significantly different interests. This may or may not be a result of ethnic differences. Our proposed approach captures many more facets of people by directly mining the interests of the crowd.

• Our research may improve the process of geocoding geo-textual data. Given a user who specified "London" as their location, the probability distribution might be shifted towards the city of London, Ontario, Canada, if the vocabulary, topics and keywords of the user's tweets are more similar to those of regions within that area. This can be done by describing the user who is to be geocoded by the set of their own tweets, obtaining a proper feature representation, and comparing this representation to those of the candidate geo-locations.

• For targeted marketing, it may be much more interesting for a company to direct its advertisements to an area of people having a similar mindset, even if this area covers multiple political regions. For example, an upper-class car manufacturer may be looking to direct an advertisement at a wealthy city district. However, parts of the administrative city district may not actually be wealthy, or the actual wealthy population may reach outside of the city district. With our approach, the car manufacturer can target its advertisement at the mental cluster that is rooted in the wealthy city district.
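The geocoding application listed above (disambiguating "London") can be sketched as follows. This is a minimal illustration under stated assumptions: plain bag-of-words counts serve as the feature representation, cosine similarity serves as the comparison, and the tweets as well as the names `term_vector`, `cosine` and `rank_candidates` are invented for this example rather than taken from the chapter.

```python
import math
from collections import Counter


def term_vector(texts):
    """Bag-of-words term counts over a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(user_tweets, candidate_regions):
    """Order candidate regions by similarity to the user's own tweets."""
    user_vec = term_vector(user_tweets)
    scores = {
        name: cosine(user_vec, term_vector(tweets))
        for name, tweets in candidate_regions.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy tweets for the two candidate interpretations of "London".
candidates = {
    "London, UK": ["tube strike on monday", "rain over the thames"],
    "London, Ontario": ["leafs game tonight", "snow on the 401 again"],
}
user = ["watching the leafs game", "so much snow today"]
ranking = rank_candidates(user, candidates)
```

In a real geocoder, such similarity scores would more plausibly reweight a prior over the candidate locations than replace it outright.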

Obviously, the performance of the hierarchical clustering step strongly depends on the quality of the text representation vectors that are used to fit the clustering model. Although we limit the proof of concept to the fairly simple representations summarized in Section 3.4, we want to emphasize that more sophisticated text mining approaches, such as those surveyed in [18], or even deep learning approaches that learn vector representations [200], might boost the performance of the clustering step.
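To make the role of the representation vectors concrete, the following sketch clusters a few regions bottom-up from their vectors. It assumes Euclidean distance and average linkage, which are illustrative choices not prescribed by the text; the region names and the two-dimensional vectors are invented, and `agglomerative` is a hypothetical helper name.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(vectors):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(name,) for name in vectors]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges


# Invented keyword-share vectors (e.g. shares of two keywords) per region.
region_vectors = {
    "district_a": [0.10, 0.01],
    "district_b": [0.09, 0.02],
    "district_c": [0.01, 0.10],
    "district_d": [0.02, 0.12],
}
merges = agglomerative(region_vectors)
```

The merge history forms the hierarchy: regions with similar vectors join early, so any weakness in the representation propagates directly into the resulting map.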

The remainder of this chapter is organized as follows. In Section 7.2, we formalize our search for a socio textual map and identify the research challenges that need to be solved towards this vision. In Section 7.3, we implement a first solution by addressing each of the research challenges in an initial way. We show that our vision is feasible: if the necessary research steps are all solved thoroughly, then a large-scale solution to map the minds of people may become reality.
