4. Data Processing Framework
A computational framework has been developed to process the CDR data, including preprocessing, cleaning, home and work location estimation, and calculating mobility indicators (such as Radius of Gyration and Entropy). This simplified process is shown in Figure 4.1 and is discussed further in this chapter.
Software Environment
I used only open-source software to build the data processing and visualization environment. The framework, which is practically a set of scripts, is primarily written in Python and relies on the following essential packages: Pandas and GeoPandas (Pandas for geospatial data) for data manipulation, Matplotlib and Seaborn for plotting, NumPy, SciPy, and Scikit-learn for numeric and scientific methods, NetworkX for graph/network analysis, and OSMnx for downloading streets (and other objects) from OpenStreetMap as a NetworkX graph.
For tasks where the processing speed of interpreted Python became a bottleneck, I wrote small, task-specific programs in Go (at the beginning) or Nim (later). These compiled stream-processing tools read a large CSV file line by line, perform some operation on the records, and either write the result directly to disk or accumulate it in memory. This approach worked well because the activity records are independent of each other.
Otherwise, the usual procedure was to group the records by SIM card ID and process a given subscriber’s data on a single thread (for example, to calculate a mobility metric). Multiple subscribers’ data were processed in parallel to speed up the computation.
All the CDR data is stored in a PostgreSQL database with the PostGIS extension enabled to support geospatial operations within the database. Spatial filtering is faster this way, and only the minimum number of records needs to be loaded into a (Geo)Pandas (Geo)DataFrame.
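As a minimal sketch of this workflow, a spatial filter can be pushed down to PostGIS so that only the matching rows are transferred into Python; the connection string, table, and column names below are assumptions, not the original schema.

```python
# Load only the spatially relevant cells into a GeoDataFrame.
# Table name (cell), geometry column (geom) and credentials are placeholders.
import geopandas as gpd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/cdr")

query = """
    SELECT cell_id, geom
    FROM cell
    WHERE ST_Intersects(
        geom,
        ST_GeomFromText(
            'POLYGON((19.04 47.49, 19.06 47.49, 19.06 47.51, 19.04 47.51, 19.04 47.49))',
            4326
        )
    );
"""
cells = gpd.read_postgis(query, engine, geom_col="geom")
```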
Visualization
Visualization is a crucial step, not only when the results are presented but also during the analysis. One could argue that it is even more important to have a fast and reliable way to visualize the results during data analysis. Besides the descriptive statistics, the spatial distributions are also essential for validating a result.
Both Pandas and GeoPandas have APIs to plot (Geo)DataFrames using Matplotlib for a quick view of the data, which can provide very fast feedback, but sometimes a more interactive tool is required. QGIS, an open-source Geographic Information System, can provide this. QGIS is an excellent piece of software with an enormous number of features; I used only a small subset of its capabilities, usually just to visualize a GeoDataFrame dump or a GeoJSON with some linked data from a CSV or directly from the database.
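For illustration, a quick-look plot of a GeoDataFrame (here the cells loaded in the previous sketch, with an assumed activity column) takes only a few lines.

```python
# Quick visual check of a GeoDataFrame; column name "activity" is an assumption.
import matplotlib.pyplot as plt

ax = cells.plot(column="activity", cmap="viridis", legend=True, figsize=(8, 8))
ax.set_title("Cell activity")
plt.show()
```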
Data Preparation
As the received data had a wide format (see Figure 4.2), it was normalized before being imported into the database. The CDR table contains only the SIM ID, the timestamp, and the cell ID. A separate table was introduced to store the SIM-related properties (such as subscription type, customer type, age, and gender) and another to store the cell properties (cell centroid and base station coordinates).
Sequential numbers, assigned in order of appearance, were used as SIM IDs instead of the long hash values to save disk space. The longitude and latitude values, provided in the EPSG:4326 projection (also known as WGS84), were rounded to 6 decimals because further decimals have no practical meaning in CDR positioning1. The received data uses labels such as “MALE” and “FEMALE” to indicate gender, “CONSUMER” and “BUSINESS” for customer type, “PREPAID” and “POSTPAID” for subscription type, and the string “UNKNOWN” to denote unknown values. These strings were shortened, and the unknown values were represented with a proper NULL value, again to save disk space.
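The following sketch illustrates these preparation steps with Pandas; the input file, column names, and the exact short codes are placeholders rather than the original specification.

```python
# Illustrative preprocessing of the raw wide-format table.
import numpy as np
import pandas as pd

raw = pd.read_csv("cdr_raw.csv")  # hypothetical input file

# Round coordinates to 6 decimals; finer precision is meaningless for CDR positioning.
raw["lon"] = raw["lon"].round(6)
raw["lat"] = raw["lat"].round(6)

# Shorten categorical labels and turn "UNKNOWN" into proper missing (NULL) values.
raw["gender"] = raw["gender"].replace({"MALE": "M", "FEMALE": "F", "UNKNOWN": np.nan})
raw["subscription"] = raw["subscription"].replace(
    {"PREPAID": "PRE", "POSTPAID": "POST", "UNKNOWN": np.nan}
)

# Replace the long SIM hashes with integer IDs assigned in order of appearance.
raw["sim_id"] = pd.factorize(raw["sim_hash"])[0]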
Normalization
The received single-table data was normalized into three new tables (see Figure 4.3): (i) a cell table, containing the cell ID and the coordinates of the cell centroid; (ii) a device table, containing the device/SIM ID and information about the subscriber (age, sex), the subscription (consumer or business, prepaid or postpaid), and the Type Allocation Code (TAC) that can be used to identify the device in which the SIM card operates; and (iii) the CDR table, which serves as a link table mapping a subscriber to a geographic location at a given time. Additionally, (iv) the TACs were loaded into another table after the dominant device had been determined for each SIM card, and (v) indexes were built for all the tables.
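Continuing the previous sketch, the split into the three tables could look roughly like this; the column names are again assumptions, and the engine is the SQLAlchemy engine from the earlier sketch.

```python
# Normalize the wide table into cell, device and CDR tables and write them to PostgreSQL.
cell = raw[["cell_id", "cell_lon", "cell_lat"]].drop_duplicates("cell_id")
device = raw[["sim_id", "age", "gender", "customer_type", "subscription", "tac"]].drop_duplicates("sim_id")
cdr = raw[["sim_id", "timestamp", "cell_id"]]

cell.to_sql("cell", engine, if_exists="replace", index=False)
device.to_sql("device", engine, if_exists="replace", index=False)
cdr.to_sql("cdr", engine, if_exists="replace", index=False)
```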
Cell-Map Mapping
In most cases, operators provide only the base stations’ coordinates. Therefore, it is common practice [18, 49, 124, 125, 126] in CDR processing to use the base stations to map Call Detail Records to geographic locations and to apply Voronoi tessellation to estimate the area covered by each base station.
Vodafone Hungary provided both the base station coordinates and the (estimated) cell centroids. With this, the position of the SIM cards is known with a finer granularity than if only the base station locations were known, as a base station can serve several cells.
Cells within 100 m of each other are merged using the DBSCAN algorithm of the Scikit-learn [127] Python package with the cell activity as weight, and Voronoi tessellation is applied to the merged cell centroids, similarly to [51].
A merged cell could be represented by an imaginary cell centroid, such as the center of mass of the merged points, but I wanted an existing cell (centroid), the medoid, to represent the cluster. The number of activity records of each cell was used as a weight for the DBSCAN algorithm; this way, the most active, most significant cell was chosen to represent multiple cells of an area.
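The following sketch illustrates this merging step. It assumes the cell centroids are already projected to a metric CRS (e.g., EPSG:23700, the Hungarian EOV) so that the 100 m threshold is meaningful, and that the cells GeoDataFrame has x, y, and activity columns; these names are placeholders.

```python
# Merge nearby cells with DBSCAN and keep the most active cell of each cluster.
import numpy as np
from scipy.spatial import Voronoi
from sklearn.cluster import DBSCAN

coords = cells[["x", "y"]].to_numpy()
weights = cells["activity"].to_numpy()

# Cells closer than 100 m end up in the same cluster; activity is used as the sample weight.
cells["cluster"] = DBSCAN(eps=100, min_samples=1).fit(coords, sample_weight=weights).labels_

# The most active cell of each cluster represents the merged cells.
representatives = cells.loc[cells.groupby("cluster")["activity"].idxmax()]

# Voronoi tessellation over the merged (representative) cell centroids.
vor = Voronoi(representatives[["x", "y"]].to_numpy())
```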
After merging, the 5412 cells were reduced to 3634. The merge mostly affects downtown Budapest, since more mobile phone cells are deployed in densely populated areas.
When mapping data with exact GPS coordinates to the cells, the Voronoi polygons were used to determine which cell a given point belongs to. As shown in Section Visualization, this method has some drawbacks, but without exact cell geometries, it is the only available solution. The estate price data (see Section Estate Price Data) were mapped to cells via these Voronoi polygons. When a cell contained no property price data, the average price of the underlying administrative area (e.g., suburb, district, or settlement) was used.
Gábor Bognár, a colleague of mine, tried to reconstruct the cell geometries from the received information, but his work resulted in unrealistic coverage (Figure 4.4). He suspected that the data was erroneous with respect to the geometries.
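A hedged sketch of this point-to-cell mapping with GeoPandas follows; the file and column names are assumptions.

```python
# Assign point data (e.g., estate prices with GPS coordinates) to cells via the Voronoi polygons.
import geopandas as gpd

voronoi = gpd.read_file("cell_voronoi.geojson")   # Voronoi polygons with a cell_id column
estates = gpd.read_file("estate_prices.geojson")  # points with a price column

# Spatial join: each point inherits the cell_id of the Voronoi polygon it falls into.
estates_with_cell = gpd.sjoin(
    estates, voronoi[["cell_id", "geometry"]], predicate="within", how="left"
)

# Average price per cell.
price_per_cell = estates_with_cell.groupby("cell_id")["price"].mean()
```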
Rasterization of Locations
With an accurate geographic data source such as GPS traces, it can be more natural to map locations to a spatial grid. With CDR data, the geographic locations are bound to the cell or base station coordinates, and Voronoi tessellation is often applied instead; but what about rasterization?
The problem is that it is not possible to know where a subscriber is within the cell, so the maximum available accuracy is the cell itself. A finer-resolution spatial raster works well with point coordinates that can be assigned to raster elements, but in the case of CDR, the assignment can only be made probabilistically.
The intersection ratio with the cell polygons is calculated for every element of the grid (see Figure 4.5), which also serves as the probability of being in a given raster element when the subscriber is in a given cell. Theoretically, this probability could be improved by removing rasters where people usually cannot be present. Using the example of Figure 4.7a, several cells cover the river; when a subscriber is in one of those cells, they are most likely on the riverbank, since being on a ship is usually improbable. OSM could provide land-cover data for removing these rasters, but this functionality was not implemented in this framework.
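The intersection ratios can be computed with a GeoPandas overlay, roughly as sketched below; the grid and Voronoi layers, their raster_id and cell_id columns, and a common projected CRS are assumptions.

```python
# Compute, for every grid element, the share of each cell polygon that falls into it.
import geopandas as gpd

grid = gpd.read_file("grid.geojson")             # raster polygons with a raster_id column
voronoi = gpd.read_file("cell_voronoi.geojson")  # Voronoi polygons with a cell_id column

# Intersect every raster element with every overlapping cell polygon.
pieces = gpd.overlay(grid, voronoi, how="intersection")

# The share of the cell area falling into a raster element is used as the probability
# of being in that raster element, given that the subscriber is in the cell.
cell_areas = voronoi.set_index("cell_id").geometry.area
pieces["probability"] = pieces.geometry.area / pieces["cell_id"].map(cell_areas)
```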
Once the probabilities were determined, I applied the Jefferson apportionment method (also known as the D’Hondt method or the Hagenbach-Bischoff method) from the Python package “voting” to distribute the subscribers of a cell between the rasters, using the probabilities as weights. This is feasible when only cardinality matters, for example, when one wants to plot a heatmap. However, as the distribution cannot take the subscribers’ properties into account, it cannot be stated that subscribers in a given raster have certain mobility habits; at the cell level, this kind of statement can be made.
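The original implementation relies on the “voting” package; purely as an illustration of the rule, a minimal stand-in implementation of the Jefferson/D’Hondt apportionment could look like this.

```python
# Distribute an integer number of subscribers among rasters according to the
# intersection probabilities, using the D'Hondt (Jefferson) highest-quotient rule.
def dhondt(total, weights):
    seats = [0] * len(weights)
    for _ in range(total):
        # The next unit goes to the raster with the highest weight / (seats + 1) quotient.
        quotients = [w / (s + 1) for w, s in zip(weights, seats)]
        seats[quotients.index(max(quotients))] += 1
    return seats

# Example: 10 subscribers in a cell, split among three rasters.
print(dhondt(10, [0.6, 0.3, 0.1]))  # -> [6, 3, 1]
```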
In [128], Galiana et al. also map Voronoi polygons to the grid but use a different approach for distributing subscribers between rasters.
Selecting an Area
In order to select an area for further investigation, a polygon has to be defined. geojson.io provides an easy-to-use tool to draw polygons over the map, which can be downloaded in multiple GIS formats, such as GeoJSON or Well-known text (WKT). Figure 4.6 shows the observation area from the Central European University (CEU) demonstration case study (Section Those Who Stood with CEU). It contains three different areas: (i) one from the Castle Garden to the Chain Bridge on the Buda side of the river, (ii) one in the middle from the bridge to Kossuth Lajos square, covering the CEU buildings, and (iii) the Parliament and Kossuth Lajos square at the top of the map (Figure 5.8 highlights these buildings).
The affected cells can be easily selected using a polygon (or polygons) that defines the observation area. The cells are represented by Voronoi polygons (Section Cell-Map Mapping), and the relevant cells can be selected in two ways: (i) selecting the cells whose Voronoi polygon intersects the selector polygon (Figures 4.7a and 4.7b), or (ii) selecting the cells whose centroid falls within the polygon (Figures 4.7c and 4.7d). The former method returns more cells, which can be more favorable in some cases, for example, when the target area is relatively small compared to the neighboring Voronoi polygons.
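Both selection strategies are straightforward with GeoPandas; in the following sketch, the file and column names are assumptions.

```python
# Select the cells relevant to an observation area defined by a polygon.
import geopandas as gpd

area = gpd.read_file("observation_area.geojson")   # polygon(s) drawn on geojson.io
voronoi = gpd.read_file("cell_voronoi.geojson")    # Voronoi polygons of the cells
centroids = gpd.read_file("cell_centroids.geojson")  # cell centroids as points

selector = area.unary_union

# (i) cells whose Voronoi polygon intersects the selector polygon
cells_by_polygon = voronoi[voronoi.geometry.intersects(selector)]

# (ii) cells whose centroid falls within the selector polygon
cells_by_centroid = centroids[centroids.geometry.within(selector)]
```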
Home and Work Locations
Most city inhabitants spend a significant part of the day at two locations: their homes and workplaces. In order to find the relationship between these most important locations and Socioeconomic Status (SES), the positions of these locations first have to be determined. There are a few approaches for finding home locations via mobile phone data analysis [129, 130, 58].
The work location was determined as the most frequent cell in which a device was present during working hours on workdays, with working hours considered to be from 09:00 to 16:00. The home location was determined as the most frequent cell in which a device was present during the evening and night of workdays (from 22:00 to 06:00) and all day on holidays. Although people do not always stay at home on weekends, it is assumed that most of their activity is still generated from their home locations.
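A simplified sketch of this rule with Pandas follows; the cdr DataFrame layout is assumed, and public holidays are ignored for brevity.

```python
# Determine home and work cells as the most frequent cells in the respective time windows.
import pandas as pd

cdr["timestamp"] = pd.to_datetime(cdr["timestamp"])
cdr["hour"] = cdr["timestamp"].dt.hour
cdr["workday"] = cdr["timestamp"].dt.dayofweek < 5  # public holidays not handled here

work_mask = cdr["workday"] & cdr["hour"].between(9, 15)              # 09:00-16:00 on workdays
home_mask = (~cdr["workday"]) | (cdr["hour"] >= 22) | (cdr["hour"] < 6)  # nights and holidays

def most_frequent_cell(cell_ids):
    # The cell that generated the most activity records.
    return cell_ids.value_counts().idxmax()

work_location = cdr[work_mask].groupby("sim_id")["cell_id"].agg(most_frequent_cell)
home_location = cdr[home_mask].groupby("sim_id")["cell_id"].agg(most_frequent_cell)
```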
Most of the mobile phone activity occurs in the daytime (see Figure 4.8), which is associated with work activities. This may cause the home location determination to be inaccurate or even impossible for some devices.
This method assumes that everyone works during the daytime and rests in the evening. Although 6.2% of employed persons in Hungary regularly worked at night in 2017 [131], the current version of the algorithm does not attempt to handle night workers. Some of them might be identified as regular daytime workers, but then their home and work locations are interchanged.
Selecting Active SIMs
As shown in Chapter Data Sources, most of the SIM cards have very few activity records (Figure 3.1), which is not enough to provide adequate information about the subscribers’ mobility customs. Thus, only those SIMs are considered active enough that had activity on at least 20 days, with an average of at least 40 records on weekdays and at least 20 on weekend days. Additionally, the average activity of a SIM cannot exceed 1000 records, to filter out SIM cards that possibly did not operate in a cell phone but, for example, in a 3G modem.
Note that an activity record can represent a voice call, a text message, or data transfer, either incoming or outgoing; these types cannot be distinguished in the dataset.
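The filter can be expressed with Pandas roughly as sketched below, reusing the assumed cdr DataFrame (with a datetime timestamp column); the thresholds follow the text.

```python
# Keep only SIMs with enough, but not implausibly much, activity.
daily = (
    cdr.groupby(["sim_id", cdr["timestamp"].dt.date])
    .size()
    .rename("records")
    .reset_index()
)
daily["weekend"] = pd.to_datetime(daily["timestamp"]).dt.dayofweek >= 5

stats = daily.groupby("sim_id").agg(
    active_days=("records", "size"),
    mean_daily=("records", "mean"),
)
stats["mean_weekday"] = daily[~daily["weekend"]].groupby("sim_id")["records"].mean()
stats["mean_weekend"] = daily[daily["weekend"]].groupby("sim_id")["records"].mean()

active_sims = stats[
    (stats["active_days"] >= 20)
    & (stats["mean_weekday"] >= 40)
    & (stats["mean_weekend"] >= 20)
    & (stats["mean_daily"] <= 1000)
].index
```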
Calculating Indicators
During my research, the Scikit-mobility [132] Python package was published, which is capable of calculating well-known mobility indicators, such as Radius of Gyration or Entropy. However, by the time it was published, my own implementation was complete, tested, and well-optimized (within reason), so I kept using my own software stack.
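For illustration, two of these indicators can be computed for a single subscriber’s records as follows; this is not the original implementation, and it assumes projected (metric) coordinates in x and y columns.

```python
# Minimal, illustrative mobility indicators for one subscriber's activity records.
import numpy as np

def radius_of_gyration(records):
    # Root mean square distance of the visited locations from their center of mass.
    coords = records[["x", "y"]].to_numpy()
    center_of_mass = coords.mean(axis=0)
    return np.sqrt(np.mean(np.sum((coords - center_of_mass) ** 2, axis=1)))

def location_entropy(records):
    # Shannon entropy of the visitation probabilities of the visited cells.
    p = records["cell_id"].value_counts(normalize=True).to_numpy()
    return -(p * np.log2(p)).sum()
```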
As mentioned before, the data was stored in a PostgreSQL database, and the query engine of PostgreSQL was also utilized during the computation. Python is excellent for data science, as its interactive shell (a read–eval–print loop, or REPL) provides a convenient environment for data analysis. However, execution speed is not among the strong points of interpreted languages. Considering that most of the CDR processing happens at the subscriber level, and the subscribers are independent of each other, partitioning the processing by subscriber seemed the best solution. This way, the activity records of a single subscriber can be processed in a single thread, even with a relatively slow interpreted language.
The Pool object from the multiprocessing package provides a convenient tool for parallelizing the execution. Each SIM ID is assigned to a worker that executes a query to fetch its activity records; naturally, the query can contain spatial or temporal constraints as well. The database engine can efficiently select the activity records of a SIM card, utilizing the available many-core environment2. The indicators were then calculated per SIM card, and the partial results were collected and saved. The analysis and the visualization did not require considerable computing power and were thus often performed on a laptop.
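A hedged sketch of this pattern follows; the query, the connection details, the assumed projected cell coordinates, and the indicator functions from the earlier sketch are illustrative, not the original code.

```python
# Per-SIM parallel processing: each worker queries one subscriber's records
# and computes the indicators for it.
from multiprocessing import Pool

import pandas as pd
import psycopg2

def process_sim(sim_id):
    # Every worker opens its own connection and loads only one SIM's records.
    conn = psycopg2.connect("dbname=cdr user=user password=password host=localhost")
    records = pd.read_sql(
        "SELECT cdr.cell_id, cell.x, cell.y "
        "FROM cdr JOIN cell ON cdr.cell_id = cell.cell_id "
        "WHERE cdr.sim_id = %s",
        conn,
        params=(sim_id,),
    )
    conn.close()
    return sim_id, radius_of_gyration(records), location_entropy(records)

if __name__ == "__main__":
    sim_ids = range(1000)  # placeholder for the list of active SIM IDs
    with Pool(processes=72) as pool:
        results = pool.map(process_sim, sim_ids)
```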
The sixth decimal represents about 0.111 meters at the Equator. As CDRs provide roughly house-block-level accuracy downtown, six decimals are more than enough in this use case. ↩︎
Lenovo x3650 M5 server with 1024 GB of RAM and two 18-core CPUs with Hyper-Threading, resulting in 72 logical cores. The nickname of this server was “naggykuttya” (“biggdogg”). ↩︎