3. Data Sources

This chapter describes the data sets used in this research: the “April 2017” (Section Vodafone April 2017) and the “June 2016” (Section Vodafone June 2016) CDR data sets, along with other sources such as the estate price data (Section Estate Price Data). Unless otherwise indicated, “the CDR data” denotes the “April 2017” data set throughout this work.

Vodafone April 2017

The CDR data were collected from Budapest, the capital of Hungary, and the surrounding county. Vodafone Hungary is one of the three mobile phone operators providing services in Hungary. The market share of the three big operators in Hungary has not changed significantly in the last few years. Vodafone Hungary had 25.5% in 2017 Q2 nationwide[109].

The communication between a cellular device and the mobile phone network can be divided into two categories: i) An administrative communication maintaining the connection with the service, for example, registration of the cell-switching, can be called passive communication. ii) When the device actively uses the network for voice calls, messages, or data transfer, that can be called active communication. The available data contains only active communication, which is sparser, so it cannot be used to track continuous movements.

The raw CDR data contains a long alphanumeric hash to identify the Subscriber Identity Module (SIM), a timestamp that was truncated to 10 seconds, and an ID of the cell; thus, a subscriber can be mapped to a geographic location at a given time. These are extended with the customer type (business, consumer), the subscription type (prepaid, postpaid), the age and gender of the subscriber, and the Type Allocation Code (TAC) of the device. The TAC is the first eight digits of the International Mobile Equipment Identity (IMEI) number and refers to the manufacturer and the model of the device in which the SIM card is active. These values are also present in every record, so, for example, device changes can be tracked as well.

As for the cells, a separate table was provided with a cell ID, the geographic location of the cell centroid, the area of the cell, and the distance between the centroid and the base station. These values are an estimation based on a momentary state, especially with the UMTS (3G) cells due to their breathing mechanism, which can change the geographic size of the serving area for load balancing. The heavily loaded cells shrink, and the neighboring ones grow to compensate[110].

The rationale for this “wide” format may be that the subscriber data and the device can be changed within the observation period. This occurs about 3000 times during the observation period. The owner of the subscription can change its details and, of course, change the device if they bought a new mobile phone, for example. Subscriber and customer type were provided for every SIM, but age and gender were missing in many cases, presumably due to the privacy options requested by the subscriber.

The records include neither the type of the activity (voice call, message, data transfer) nor the direction (incoming, outgoing), and there was no data provided by the operator to resolve the TACs to manufacturer and model.

The “April 2017” CDR data set includes mobile phone network activity of the Vodafone users from Budapest (and the surrounding areas) in April 2017. It contains 955035169 activity records from 1629275 SIM cards. Figure 3.1 shows the activity distribution between the activity categories of the SIM cards. Only 17.67% of the SIM cards have more than 1000 activity records, yet they provide the majority (75.48%) of the mobile phone activity during the observation period. Figure 3.2 shows the distribution of the SIM cards by the number of active days. Only about one-third (33.23%) of the SIM cards have activity on at least 21 different days. Despite the relatively large number of SIM cards present in the data set, most of them are not active enough to provide sufficient information about their mobility habits. For the exact selection criteria, see Section Selecting Active SIMs.

Figure 3.1.: Subscriber Identity Module cards in the 2017-04 data set categorized by the number of activity records. The Subscriber Identity Module cards with more than 1000 activity records (17.67% of the Subscriber Identity Module cards) provide the majority (75.48%) of the activity.
Figure 3.2.: Subscriber Identity Module card distribution in the 2017-04 data set by the number of active days, in contrast to the number of records generated per category.
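The two shares quoted for Figure 3.1 (the fraction of SIM cards above a record threshold and the fraction of activity they generate) can be derived from a plain list of SIM identifiers, one per activity record. The following is a minimal sketch; the function name and the toy input are illustrative assumptions, not the code used for the thesis.

```python
from collections import Counter


def heavy_user_shares(sim_ids, threshold=1000):
    """Return (share of SIMs above the threshold, share of records they generate).

    sim_ids: one SIM identifier per activity record (hypothetical input format).
    """
    records_per_sim = Counter(sim_ids)
    total_sims = len(records_per_sim)
    total_records = sum(records_per_sim.values())
    # Record counts of the SIMs with strictly more than `threshold` records.
    heavy = [n for n in records_per_sim.values() if n > threshold]
    return len(heavy) / total_sims, sum(heavy) / total_records


# Toy example with a threshold of 2 records instead of 1000.
toy_records = ["a"] * 5 + ["b"] * 2 + ["c"] * 1 + ["d"] * 2
sim_share, record_share = heavy_user_shares(toy_records, threshold=2)
print(f"{sim_share:.2%} of the SIMs produce {record_share:.2%} of the records")
```

In the toy input, one SIM out of four exceeds the threshold and produces five of the ten records.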

Figure 3.3a displays the mobile phone activity as a time series for the “April 2017” dataset. It is quite regular, considering the 4-day weekend in the middle of the month due to Easter¹. Figure 3.3b shows its Fourier decomposition to highlight the seasonality of the data. As expected, this 30-day dataset has a 24-hour periodicity.

(a) Mobile phone activity through the “April 2017” dataset.
(b) Fourier decomposition.
Figure 3.3.: The mobile phone network activity (a) during the observation period of April 2017, and its Fourier decomposition (b).
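To illustrate how a Fourier decomposition exposes the daily periodicity, the sketch below computes a naive discrete Fourier transform of an hourly series and reports the period of the strongest non-constant component. The series is synthetic (a daily sine with a weaker weekly one), and the function name is an assumption. A real analysis would more likely use an FFT routine on the actual hourly record counts.

```python
import cmath
import math


def dominant_period(series, sample_spacing_hours=1.0):
    """Return the period (in hours) of the strongest non-constant DFT component."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # removes the zero-frequency term
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        # Naive O(n^2) DFT coefficient for frequency index k.
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n * sample_spacing_hours / best_k


# Synthetic 30-day hourly series: strong daily cycle plus a weaker weekly cycle.
synthetic = [
    100
    + 60 * math.sin(2 * math.pi * h / 24)
    + 10 * math.sin(2 * math.pi * h / (24 * 7))
    for h in range(30 * 24)
]
period = dominant_period(synthetic)
print(f"dominant period: {period:.1f} hours")
```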

Missing Properties

The subscriber type and subscription type were provided for all the SIM cards, whereas age and gender were frequently missing. For a group of SIM cards, some of these values changed during the observation period. These changes could be realistic, as the subscription type, the subscriber (the ownership), or the subscription plan can be changed at will. Sometimes, the change affects the age or the gender, as they become unknown, or known when they were unknown before.

Suppose that someone changed their subscription plan, which is treated as a new contract between the subscriber and the operator; this might have affected the usability of the personal data for marketing purposes. Alternatively, someone may simply have revoked that permission. In any case, this affects only about 2000 SIM cards out of about 1.6 million. In these cases, all the subscriber properties were set to unknown without trying to decide which value should be used.
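The rule described above (keep the properties of a SIM card if they are consistent, otherwise set all of them to unknown) can be sketched as follows. The tuple layout and the use of `None` for unknown values are assumptions made for the illustration.

```python
from collections import defaultdict

UNKNOWN = None  # assumed placeholder for an unknown property value


def resolve_subscriber_properties(rows):
    """rows: iterable of (sim_id, properties) pairs, properties being a tuple.

    Returns one property tuple per SIM. SIMs that appear with more than one
    distinct tuple get every property set to UNKNOWN.
    """
    seen = defaultdict(set)
    for sim_id, props in rows:
        seen[sim_id].add(props)

    resolved = {}
    for sim_id, variants in seen.items():
        first = next(iter(variants))
        if len(variants) == 1:
            resolved[sim_id] = first
        else:
            # Conflicting values: do not guess which one is correct.
            resolved[sim_id] = (UNKNOWN,) * len(first)
    return resolved


rows = [
    ("s1", ("consumer", "postpaid", 34, "F")),
    ("s1", ("consumer", "postpaid", 34, "F")),
    ("s2", ("consumer", "prepaid", 51, "M")),
    ("s2", ("consumer", "prepaid", None, None)),  # age and gender became unknown
]
resolved = resolve_subscriber_properties(rows)
```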

Outliers

The age property contains some unrealistic values, such as a subscriber more than 210 years old, and there were quite a few 115-year-old subscribers. These values are assumed to be administrative errors. According to the data, 2926 subscribers were at least 90 years old, and only 554 of them were between 90 and 100, which is unlikely. For comparison, 59470 people over 90 years old lived in Hungary in 2017 according to KSH [111], which is 0.6% of the population.

Vodafone June 2016

Although this data set predates the formerly introduced one, I consider it secondary, as almost all of my work was performed on the “Vodafone April 2017” data set (Section Vodafone April 2017), which is more recent and of better quality. The collaboration between Vodafone Hungary and Óbuda University was an iterative process. Based on our remarks, they improved the format of the provided data and, where possible, its quality.

This section follows the structure of the previous one, describing the “Vodafone June 2016” data set.

The observation area of the CDR data was Budapest, the capital of Hungary, and the surrounding county. In 2016 Q2, the nationwide market share of Vodafone Hungary was 25.3% [109]. The data format is similar to the one described in Section Vodafone April 2017. This data set contains 2291246932 records from 2063005 unique SIM cards. That is more than twice as many records for practically the same length of time, and an increase of more than 26.6% in the number of unique SIM cards.

This data set also covers one month, but June instead of April. This later period of the year has some consequences: June, as an early summer month, is warmer with longer days, and there are usually more tourists in Budapest. Also, the school semester ends in June, which might have some impact on the mobility trends, and the 2016 UEFA European Football Championship took place during this observation period (for the Euro 2016 case study, see Section Euro 2016).

The main drawbacks of this data set are the large number of missing cell coordinates and the possibly dirty data on the second Sunday of the month (June 12). Out of the 7239 cell IDs that appear in the data set, only 6268 have known geographic coordinates. 22.5% of the records (515583442) take place in unknown locations, which makes this data set highly unreliable when it comes to mobility analysis.

Figure 3.4.: Number of cells stopping to operate per day of June 2016

Moreover, 419 cells ceased to operate (did not have any activity) before the end of the month. Figure 3.4 shows the distribution of these cells during the month². To my knowledge, operators regularly adjust the network, including installing new cells and shutting down others. The problem is that the cell map does not associate temporal information with the geographic locations, so these changes are not represented. The numerous cells without geographic location might be the replacement cells of the removed ones or just temporary cells.
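A count like the one behind Figure 3.4 can be obtained by taking the last active day of every cell and keeping those that fall before the end of the observation period. The sketch below assumes records reduced to (cell ID, day) pairs; the function name and input format are illustrative.

```python
from collections import Counter
from datetime import date


def cells_ceasing_per_day(records, last_day):
    """records: iterable of (cell_id, day) pairs, day being a datetime.date.

    A cell counts as having ceased on its last active day if that day
    precedes `last_day`. Returns a Counter mapping day -> number of cells.
    """
    last_seen = {}
    for cell_id, day in records:
        if cell_id not in last_seen or day > last_seen[cell_id]:
            last_seen[cell_id] = day
    return Counter(day for day in last_seen.values() if day < last_day)


records = [
    ("c1", date(2016, 6, 1)), ("c1", date(2016, 6, 30)),   # active until the end
    ("c2", date(2016, 6, 3)), ("c2", date(2016, 6, 10)),   # last seen on June 10
    ("c3", date(2016, 6, 10)),                             # last seen on June 10
]
ceased = cells_ceasing_per_day(records, last_day=date(2016, 6, 30))
```

Note that this approach cannot distinguish a cell that was shut down from one that simply had no traffic for the rest of the month.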

For example, Óbuda Island is a recreational area with marginal mobile phone activity during most of the year. However, it hosts the Sziget Festival, which had approximately 441000 visitors in 2015, with a daily capacity of at most 90000 visitors [112]. For this event, the operators install temporary cells to serve the massively increased demand for mobile phone network capacity.

Figure 3.5 displays the mobile phone activity as a time series for the “June 2016” dataset. Unlike Figure 3.3a, it shows some considerable irregularities. Anomaly 1 was caused by the missing activity records on June 1, 2016. The data set contains about 16.5 million records for that day, which is only 18.5% of the record count of an average weekday. Interestingly, the obtained records are distributed across the whole day and come from numerous cells, so it is not the case that a portion of the cells or a period of the day was omitted. The majority of the expected records are missing without apparent reason. Anomaly 2 is quite the opposite: June 12 contains an inexplicable activity surplus, which is detailed later in this section. Anomalies 3, 5, 6, and 8 in Figure 3.5 are covered in Section Euro 2016. However, the reason for the sudden negative peaks denoted by numbers 7 and 9 is unknown, just as in the case of anomaly 4.

Figure 3.5.: Mobile phone activity through the “June 2016” dataset, with several anomalies.
Figure 3.6.: Hourly aggregated activity on Sundays

Figure 3.6 shows the hourly aggregated activity on the Sundays of June 2016. All four days follow the same tendency. Between the last two Sundays of the month, there was hardly any difference. The one notable difference is around 20:00, for which Section Hungary vs. Belgium offers a possible explanation. On June 5, the activity levels are notably higher during the daytime, but this is within reason. However, on June 12, the activity was abnormally intense, almost twice as much as would be expected, even during nighttime. I have no information about any events or circumstances that could have plausibly caused this activity surplus on June 12, 2016. It was not concentrated on a definite time interval or area of the city. At first, it could seem as if the records of that day had been duplicated, but there is no direct evidence of it in the data. As the surplus cannot be explained, I consider it a sign of dirty data.

Figure 3.7.: Subscriber Identity Module cards in the 2016-06 data set categorized by the number of activity records. The Subscriber Identity Module cards with more than 1000 activity records (26.98% of the Subscriber Identity Module cards) provide the majority (91.31%) of the activity.

Apart from these issues, the same descriptive analysis has been performed as on the “Vodafone April 2017” data set. Figure 3.7 shows the activity distribution between the activity categories of the SIM cards. The dominance of the last category, the SIM cards with more than 1000 activity records, is even more significant. These SIM cards, almost 27% of the total, produce more than 91% of the activity.

Figure 3.8 shows the SIM card distribution by the number of active days. Only 34.59% of the SIM cards have activity on at least 21 different days. The share of SIM cards present only for a short term (less than a week) is larger (more than 50%) than in the “Vodafone April 2017” data set. There are 241824 SIM cards (11.72%) that appeared on at least two days, but whose first and last activities are not more than seven days apart. High levels of tourism are usual during this part of the year.

Figure 3.8.: Subscriber Identity Module card distribution in the 2016-06 data set by the number of active days, in contrast to the number of records generated per category.
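The two quantities used in this part of the analysis, the number of active days of a SIM card and the time between its first and last activity, can be computed as sketched below. Records are assumed to be reduced to (SIM ID, day) pairs; the names are illustrative and the seven-day limit follows the text.

```python
from datetime import date


def presence_stats(records):
    """records: iterable of (sim_id, day). Returns {sim_id: (active_days, span_days)}."""
    days = {}
    for sim_id, day in records:
        days.setdefault(sim_id, set()).add(day)
    return {sim: (len(d), (max(d) - min(d)).days) for sim, d in days.items()}


def short_term_sims(stats, max_span_days=7):
    """SIMs seen on at least two days whose first and last activity
    are at most `max_span_days` apart."""
    return {sim for sim, (active, span) in stats.items()
            if active >= 2 and span <= max_span_days}


records = [
    ("tourist", date(2016, 6, 3)), ("tourist", date(2016, 6, 5)),
    ("tourist", date(2016, 6, 5)),                      # same day counted once
    ("resident", date(2016, 6, 1)), ("resident", date(2016, 6, 15)),
    ("resident", date(2016, 6, 30)),
    ("one_off", date(2016, 6, 7)),
]
stats = presence_stats(records)
short_term = short_term_sims(stats)
```

The labels "tourist" and "resident" are only mnemonic; short-term presence alone does not prove that a SIM card belongs to a tourist.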

The subscriber property changes affected this data set as well. About 3000 SIM cards were affected out of about 2 million. Some very old subscribers are also present in this data set.

Device Types

Both CDR data sets contain Type Allocation Codes (TACs). The TAC is the first eight digits of the International Mobile Equipment Identity (IMEI) number; it is allocated by the GSM Association and uniquely identifies the model of a mobile phone or of any other GSM-capable device.

The TACs are provided for every record because a subscriber can change their device at any time. Naturally, most of the subscribers (95.71% in June 2016, and 95.8% in April 2017) used only one device during the whole observation period, but there were some subscribers, maybe mobile phone repair shops, who used multiple devices (see Figure 3.9). As a part of the data cleaning, the wide-format has been normalized. The CDR table contains only the SIM ID, the timestamp, and the cell ID. A table is formed from the subscriber and the subscription details, and another table tracking the subscribers’ device changes.

(a) June 2016
(b) April 2017
Figure 3.9.: The number of different Type Allocation Codes used by the subscribers.
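The normalization of the wide format into a slim CDR table, a subscriber table, and a device-change table could look like the following sketch. The field names, types, and the choice to keep the first seen subscriber properties are assumptions for the illustration, not the actual schema.

```python
from typing import NamedTuple, Optional


class WideRecord(NamedTuple):
    """One raw record in the assumed wide format."""
    sim_id: str
    timestamp: int
    cell_id: int
    customer_type: str
    subscription_type: str
    age: Optional[int]
    gender: Optional[str]
    tac: str


def normalize(wide_records):
    """Split wide records into (cdr, subscribers, devices).

    cdr:         list of (sim_id, timestamp, cell_id)
    subscribers: {sim_id: (customer_type, subscription_type, age, gender)}
    devices:     list of (sim_id, timestamp, tac), one entry per device change
    """
    cdr, subscribers, devices = [], {}, []
    current_tac = {}
    for r in sorted(wide_records, key=lambda r: (r.sim_id, r.timestamp)):
        cdr.append((r.sim_id, r.timestamp, r.cell_id))
        subscribers.setdefault(
            r.sim_id, (r.customer_type, r.subscription_type, r.age, r.gender))
        if current_tac.get(r.sim_id) != r.tac:
            # First record of the SIM, or the SIM moved to another device.
            devices.append((r.sim_id, r.timestamp, r.tac))
            current_tac[r.sim_id] = r.tac
    return cdr, subscribers, devices


wide = [
    WideRecord("s1", 10, 1, "consumer", "postpaid", 34, "F", "11111111"),
    WideRecord("s1", 20, 2, "consumer", "postpaid", 34, "F", "11111111"),
    WideRecord("s1", 30, 2, "consumer", "postpaid", 34, "F", "22222222"),
    WideRecord("s2", 15, 3, "business", "prepaid", None, None, "33333333"),
]
cdr, subscribers, devices = normalize(wide)
```

The number of distinct TACs per SIM card, as plotted in Figure 3.9, then follows directly from the device table.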

Resolving Type Allocation Codes

To the best of my knowledge, there is no publicly available TAC database to resolve the TACs to manufacturer and model, although some vendors (e.g., Apple, Nokia) publish the TACs of their products. The exact model of the phone is required to know how recent and expensive a mobile phone is. Even this is not enough to determine how much the subscriber paid for the phone, as they could have bought it on sale or at a discount from the operator in exchange for signing an x-year contract. Still, the consumer price should indicate the order of magnitude of the phone price.

The dataset of TACs provided by 51Degrees has been used, representing the model information with three columns: “HardwareVendor”, “HardwareFamily” and “HardwareModel”. The company mostly deals with smartphones that can browse the web, so the data set usually does not cover feature phones and other GSM-capable devices. Release date and inflated price columns were also included, but these were usually not known, making the data unsuitable to use on its own.

Although the records cannot be separated by type, the CDR data contains not only call and text message records but data transfer as well. Furthermore, some SIM cards do not operate in phones but in other, often immobile, devices like a 3G router or a modem. 51Degrees managed to annotate several TACs as modems or other non-phone devices, and this annotation was extended by a manual search on the most frequent TACs. There were 324793 SIM cards that used only one device during the observation period and operated in a non-phone device.

Fusing Databases

For a more extensive mobile phone price database, a scraped GSMArena database [113] has been used. GSMArena³ has a large and reputable database that is also used in other studies [114, 115]. The concatenation of the brand and model fields of the GSMArena database could serve as an identifier for the database fusion. 51Degrees stores the hardware vendor, family, and model, where the hardware family often contains a marketing name (e.g., [Apple, iPhone 7, A1778]). As these fields were not always properly distinguished, their concatenation may contain duplications (e.g., [Microsoft, Nokia Lumia 820, Lumia 820]). So, for the 51Degrees records, three identifiers were built using the concatenation of the fields (i) vendor + family, (ii) vendor + model, and (iii) vendor + family + model, and all three versions were matched against the GSMArena records.

Another step of the data cleaning is to correct the name changes. For example, BlackBerries were manufactured by RIM (e.g., [RIM, BlackBerry Bold 9700, RCM71UW]), but later, the company name was changed to BlackBerry, and the database records are not always consistent in this matter. The same situation occurs due to the Nokia acquisition by Microsoft.

Simple string equality cannot be used to match these composite identifiers because of differences in spelling, so fuzzy string matching is applied using the FuzzyWuzzy 0.18 Python package, which uses the Levenshtein distance to calculate the differences between strings. This method was applied to all three identifiers from the 51Degrees data set, and the duplicated matches (e.g., when the family and the model are the same) were removed. Mapping the GSMArena database to the 51Degrees data adds phone price and release date information to the TACs, which can then be merged with the CDRs.
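The matching step can be illustrated with a self-contained sketch. It does not call FuzzyWuzzy; instead it uses a plain Levenshtein distance with a simple length-normalized score, which is a simplified stand-in and will not reproduce FuzzyWuzzy's exact scores. The cutoff of 90, the function names, and the spelling variant "iPhone7" are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Case-insensitive similarity score between 0 and 100 (assumed scoring)."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 100 if longest == 0 else round(100 * (1 - levenshtein(a, b) / longest))


def best_match(candidates, choices, cutoff=90):
    """Best (choice, score) over all candidate identifiers, or None below the cutoff."""
    best = None
    for candidate in candidates:
        for choice in choices:
            score = similarity(candidate, choice)
            if score >= cutoff and (best is None or score > best[1]):
                best = (choice, score)
    return best


# Three identifiers built from one 51Degrees-style record (hypothetical spelling).
vendor, family, model = "Apple", "iPhone7", "A1778"
identifiers = [f"{vendor} {family}", f"{vendor} {model}", f"{vendor} {family} {model}"]
gsmarena_names = ["Apple iPhone 6", "Apple iPhone 7", "Apple iPhone 7 Plus"]
match = best_match(identifiers, gsmarena_names)
```

Trying all three identifiers and keeping the single best-scoring choice also takes care of the duplicated matches mentioned above.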

From the GSMArena data, two indicators have been extracted: (i) the price of the phone (in EUR) and (ii) the relative age of the phone (in months). The phone price was left intact without taking into consideration the depreciation, and the relative age of the phone was calculated as the difference between the date of the CDR data set and the release date of the phone.
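As a worked example of the second indicator, the relative age can be expressed in whole calendar months between the release date and a reference date of the CDR data set. The function name, the release date, and the choice of the first day of the observation month as reference are assumptions for this sketch.

```python
from datetime import date


def relative_age_months(release, reference):
    """Whole calendar months between the phone's release and the reference date."""
    return (reference.year - release.year) * 12 + (reference.month - release.month)


# Hypothetical phone released in September 2015, observed in April 2017.
age = relative_age_months(date(2015, 9, 1), date(2017, 4, 1))
print(age)
```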

Figure 3.10 shows the distribution of the phone prices and relative ages within the April 2017 data set. The relative age distribution is plausible, showing that most cellphones are 1 to 3 years old, while some new and very old phones were also in use. The cellphone price distribution follows the relative ages. However, the number of expensive phones seems to be unrealistically low⁴, so an analysis has been performed on an expensive and well-known brand, the iPhone.

(a)
(b)
Figure 3.10.: Distribution of the mobile phone prices (a), and the mobile phone relative ages (b).

iPhones

(a) Model distribution
(b) Price comparison
Figure 3.11.: The iPhone models in use in the “April 2017” dataset (a), and a comparison of the Apple iPhone prices [117] with the GSMArena-based source [113] (b). Versions with the lowest amount of storage are denoted by “budget”, and the most expensive versions are categorized as “high-end”.

As Apple iPhones are considered a status symbol [116], they are suitable for validating the phone price database [113]. Figure 3.11a shows the number of subscribers that exclusively use a given iPhone model in the “April 2017” dataset. Using TAC values, it is not possible to distinguish the iPhone models based on specifications like storage. However, it is clear that the most expensive models (“Plus” versions) do not have a significant user base, in contrast with some older models like the iPhone 4 and iPhone 5 series.

The launch prices of the iPhone models released until April 2017 are obtained from [117]. Figure 3.11b compares the two sources. As there are different versions of a given model, a “budget” (with the lowest amount of storage) and a “high-end” (the most expensive) version are also displayed. Although the GSMArena price property is supposed to be a launch price, Figure 3.11b clearly shows that these prices are much lower than the original ones. Moreover, the older the phone is, the lower the available price is, except for the first iPhone. Note that the GSMArena prices are in EUR, whereas the ground truth prices are in USD, but the currency alone cannot explain the difference. The results of this analysis imply that the phone prices in the GSMArena-based source might have been depreciated.

Estate Price Data

Real estate price data was provided by ingatlan.com, a real estate listing website. The data contains slightly more than 60 thousand estate locations, floor spaces, and selling prices from the advertisements. The prices may not be the actual value that the buyer paid, but even if there was some bargaining, the order of magnitude should be reasonably accurate.

Figure 3.12.: Spatial distribution of the normalized real estate prices (million Hungarian forint).

The data is from 2018, not from the same year as the CDR data. However, the price differences between the areas of Budapest have not changed significantly during those years, so it should be adequate to describe the average estate price of an area. The price of one square meter was calculated from the floor space and the selling price. In this way, the price level of two different estates in two very different parts of the city can be compared.

The data source contains slightly more than 85 thousand estate locations with floor spaces and selling prices. Figure 3.13 shows its distribution. Figure 3.12 shows the estate advertisements over Pest county, and the administrative border of Budapest is also displayed. The more expensive estate advertisements are represented both by color and larger markers.

Figure 3.13.: Real estate price histogram from the ingatlan.com data source.

Although 70.78% of the data points are within Budapest, there are some areas without property price samples, even in Budapest. Indicators are often aggregated by cells (Section Cell-Map Mapping), so it is crucial to know the average property price for the area covered by a given cell. Besides the cell-level aggregation, the property prices were aggregated on the suburb and district or settlement level. Budapest has more than 200 suburbs of varying sizes, and the average property price of each suburb was also determined. When a cell does not have a property price sample (or not enough), the average property price of the underlying suburb was used.

For the Voronoi polygon of every cell, it was determined how large a part of the cell overlaps with each suburb. The average property price of a cell is the weighted mean of the property prices of the overlapping suburbs. This method significantly reduced the number of cells without estate price data and compensated for the extreme differences between neighboring cells that might come from the advertisements. Figure 3.14 shows the resulting mean housing prices by cell polygon.
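The weighted mean can be written as a short function, assuming that the overlap between a cell polygon and each suburb has already been computed (for example, as an area or an area fraction) and that the weights are these overlaps. The names and the numbers below are hypothetical.

```python
def cell_price(overlaps, suburb_prices):
    """Area-weighted mean of suburb prices for one cell.

    overlaps:      {suburb: area of the cell polygon overlapping that suburb}
    suburb_prices: {suburb: average price per square meter}
    Suburbs without a known price are ignored; returns None if none remains.
    """
    weighted_sum = 0.0
    total_area = 0.0
    for suburb, area in overlaps.items():
        price = suburb_prices.get(suburb)
        if price is None:
            continue
        weighted_sum += area * price
        total_area += area
    return weighted_sum / total_area if total_area > 0 else None


# Hypothetical cell: 75% of its area lies in suburb A, 25% in suburb B.
overlaps = {"A": 0.75, "B": 0.25}
suburb_prices = {"A": 800_000, "B": 400_000}  # HUF per square meter, made up
price = cell_price(overlaps, suburb_prices)
```

Here the cell price is 0.75 × 800000 + 0.25 × 400000 = 700000 per square meter.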

Figure 3.14.: Average real estate price per cell polygons from the ingatlan.com data source.

OpenStreetMap

OpenStreetMap (OSM) provides community-built map data about administrative boundaries (e.g., country, county, city, district), roads, railways, stations, and Points of Interest (e.g., museums, cafés) all over the world. I predominantly use OSM map data to visualize the mobile phone data in a spatial context.

Budapest is divided into 23 districts and more than 200 suburbs. Moreover, KSH groups the districts into three city parts (Figure 3.15b) and seven district groups (Figure 3.15a) as well. The agglomeration is divided into six sectors (Figure 3.15c). The administrative borders of these settlements, including the districts and suburbs of Budapest, are derived from OSM. Historically, District 21 (Csepel) is not part of either Buda or Pest, as it is located on the northern end of Csepel Island. Still, statistics usually classify it as a Pest-side district (e.g., [118]).

Administratively, Margaret Island was part of the 13th district. Since July 2013, it has been directly under the control of the city, being a part of Budapest without belonging to any districts. However, some maps in this work still denote it as a part of District 13. Practically, the whole island is a recreational area covered with landscape parks.

(a) District groups
(b) City parts
(c) Sectors of the Budapest agglomeration
Figure 3.15.: District groups (a) and city parts (b) of Budapest, and the sectors of the Budapest agglomeration (c).

Other Data Sources

Statistical Data

As a validation for the results (e.g., estimated population), statistical data is obtained from the Hungarian Central Statistical Office (KSH), using its interface of the spreadsheet sets, called STADAT [119].

Astronomical Data

Also, as a validation, astronomical information (sunrise and sunset) has been obtained, for Budapest, from Visual Crossing, which collects global weather data [120]. This data was applied in Section The Length of the Day to compare with the calculated day lengths.

Twitter

In Section In Social Media, Twitter data is utilized that was obtained via its academic research access program [121], which provides access for non-commercial research purposes. To download historical tweets based on hashtags, the twarc software was used [122].


  1. From 2017, Good Friday is also a holiday in Hungary.

  2. This phenomenon also affects the “April 2017” data set, but not to this extent, and almost every cell has a valid location, even if the adjustments are not documented in the received cell-map.

  3. https://www.gsmarena.com/

  4. This was pointed out by an anonymous reviewer of my paper [123].
