1. Introduction

Background

Mobile phones are now a fundamental part of our lives; they are practically always with us, wherever we go, almost as if they were a part of our body. The continuous communication between a device and the mobile network leaves traces of our whereabouts in the operator’s system. Via these devices, the mobile network can “sense” our movements, which is the basis of the “Smart City” concept [1].

In the last few decades, anonymized Call Detail Records (CDR) have become a common information source for analyzing the characteristics of human mobility. Numerous studies utilizing this source of data have been published from all around the world, including Hungary, e.g., [2, 3, 4]. This research focuses on Budapest and its agglomeration. CDRs contain the billed activities of the subscribers, providing information about the whereabouts of the population. Based on this massive information source, human mobility analysis is utilized in fields such as social sensing, epidemiology, transportation engineering, urban planning, and sociology, among others. Furthermore, the human sleep-wake cycle (SWC) is also studied by analyzing mobile phone network data.

The analysis of human movement patterns based on CDR data, which makes it possible to examine a large population cost-effectively, has resulted in several discoveries about human dynamics. These works usually consider the population as a homogeneous group, and the classification is based on some mobility indicators [5]. The next step was adding external data sources to the mobile network data, extending the investigation to other population characteristics. Often, this external source is used to classify the data (e.g., by gender) and to analyze the classes using the previously introduced mobility indicators. Such external sources include, among others, social network data [6], transportation data [7], taxi trips [8], socioeconomic indicators (income, education rate, unemployment rate, and deprivation index) [9], and the sale price of residential properties [10].

A part of this work also fits into this trend, using housing prices as a socioeconomic indicator. Housing price, however, is an indirect indicator of socioeconomic status (SES), so a more direct indicator, the cellphone price, was also investigated. While Blumenstock et al. used the call history as a factor of socioeconomic status [11], Sultan et al. [12] applied mobile phone prices as a socioeconomic indicator and identified areas where more expensive phones appear more often. However, only manually collected phone prices were used, and the analysis was not performed on the subscriber level.

Research Goals

The goal of my research was to develop a methodology and implement a data processing framework that can evaluate mobile network data. This framework should also be able to calculate mobility indicators and associate socioeconomic indicators with the subscribers. As the mobility patterns of the subscribers can be extracted from the CDR, people can be distinguished based on their mobility habits. One of the main goals was to find a correlation between mobility and socioeconomic status.

As the mobile network data does not contain information about the subscribers’ income, the CDRs had to be enriched with other data sources. My research focused on an indirect and a more direct feature in this regard. The indirect feature is the housing price, because the price level of a neighborhood can be used to infer the SES. When the home location of a subscriber is known, the typical housing price of that area can be associated with the subscriber. The assumption is that if someone lives in a more expensive area, their socioeconomic status should be higher as well.
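The association between home location and housing price can be sketched as a simple lookup. This is only a minimal illustration, not the framework's actual implementation: using the median as the "typical" price, the area identifiers, the function names, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def typical_price_by_area(listings):
    """Return the median listing price for each area (assumed 'typical' price)."""
    prices = defaultdict(list)
    for area, price in listings:
        prices[area].append(price)
    return {area: median(values) for area, values in prices.items()}


def assign_housing_price(home_areas, area_prices):
    """Map each subscriber to the typical price of their estimated home area.

    Subscribers whose home area has no price data get None.
    """
    return {sub: area_prices.get(area) for sub, area in home_areas.items()}


# Made-up illustrative data, not taken from the study.
listings = [
    ("area_a", 900), ("area_a", 1100), ("area_a", 1000),
    ("area_b", 500), ("area_b", 700),
]
home_areas = {"sub_1": "area_a", "sub_2": "area_b", "sub_3": "area_c"}

area_prices = typical_price_by_area(listings)
subscriber_prices = assign_housing_price(home_areas, area_prices)
```

Here `sub_3` lives in an area without any listing, so no price is assigned; how such subscribers should be handled is a design choice left open in this sketch.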

A more direct feature would be the price of the subscriber’s cellphone. As the actual purchase price of a mobile phone can depend on several factors, the age of the device may be considered together with the recommended retail price.

Another goal was to analyze the commuting tendencies of the capital and its agglomeration. Kiss et al. state that although commuting is an essential and common phenomenon, its measurement is occasional and inadequate [13]. Commuting is predominantly analyzed through the census, but that is performed only once a decade; thus, it cannot follow sudden but permanent changes. They also stress that commuting should be examined frequently, and its methodology should be established [13].

As surveying the population is a slow, tedious, and expensive task, it is a natural step to automate the process with the available information and communication technologies (ICT). In this research, CDR processing was applied to examine commuting, mainly to Budapest, and the findings were validated against the results of studies that analyzed commuting using census data.

Economic models distinguish city parts such as residential areas, industrial areas, business districts, and so on, but that is a relatively static, slowly evolving city layer. Mobile phone network data has the potential to describe the city structure via the inhabitants’ mobility patterns. The next part of this research focuses on the effect of the SWC on the city structure. In this regard, it continued the commuting analysis, but the city structure was analyzed by the circadian rhythm of the people who live and work in a given area of Budapest.

Is it possible to cluster city areas by the time when the inhabitants, the workers, or the passers-by start their activity in the morning or halt it in the evening? Do city parts have “chronotypes”? Is there a structural or socioeconomic connection between the areas with the same “chronotype”? Can neighborhoods or districts be described by the terms “morningness” or “eveningness”? Another goal of this research was to answer these questions.

Meso-data

Is this work dealing with Big Data? No, not really, as one of the V’s of Big Data [14], volume, does not apply here. Historical data has been processed for a limited area (the capital and its agglomeration) and a limited period (one month). Increasing the observation area and period would increase the volume of the data. However, even a year of data for the whole country should still be easily manageable with current, ordinary tools.

Could it be Big Data? Yes, it could, if, for example, the analytics had to be updated in real time or near real time, so that velocity would become an issue. Some of the current analyses can take an hour or two, though there is probably plenty of room for optimization in my queries and scripts.

What is it, if not Big Data? I like the term “meso-data” (and “meso-computing”) that Matt Williams proposed in his essay [15] to describe the processing challenges of the order of magnitude of this data.

Data Science is like Origami

Origami¹ is the art of paper folding, where after a specific sequence of simple folding steps, a sheet of paper becomes a figurine of an animal or an object. Data science is like origami: during data processing, simple steps are applied one after another while the data starts to take shape and new information is born. Just as everyone has a sheet of paper, everyone has a pile of data nowadays.

Figure 1.1: Paper crane by Caro Asercion, CC BY 3.0

The question is, who can fold that data into art?


  1. In the Japanese word origami, ori means “folding” and kami means “paper” (“ka” changes to “ga” in the compound word).
