geolocating cities using human activity traces

Gergő Pintér, PhD

ANETI Lab / CIAS
Department of Network Science / IDAIS
Corvinus University of Budapest

gergo.pinter@uni-corvinus.hu

 
37 data points from Weeplace checkins [1]
37 data points (0.5% of unique places)
70 data points (2% of unique places)
697 data points (20% of unique places)
3485 data points (all unique places)

human activity is tightly connected to the urban areas

  • only ~700 data points and Budapest is almost recognizable
  • ~3500 data points reveal the silhouette of the city
    • although the distribution is uneven

photo by NASA | NASA’s Earth Observatory

YJMob100K

  • data is from the Yahoo! Japan smartphone application
  • published as open data in 2023 [2]
    • as a part of the HuMob Challenge 2023 [3]
  • follows 100,000 people
  • over 75 consecutive days
    • which days is not disclosed
  • in a 100 km by 100 km area
  • somewhere in Japan

data sample

uid d t x y
0 0 1 79 86
0 0 2 79 86
0 0 8 77 86
0 0 9 77 86
0 0 19 81 89

where d is the day, t is the time slot (0–47, i.e. 48 half-hour slots per day), and x and y are the location in a 200 × 200 grid

visualizing mobility data

activity count (log)
unique user count (log)

reproductions of figure 8 [4]
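A minimal sketch of how the two heatmaps could be aggregated from the records. It assumes 1-based cell indices (hence `offset=1`) and uses log10(1 + count) so that empty cells stay finite; both choices are assumptions, not taken from the talk.

```python
import math
from collections import defaultdict


def build_heatmaps(records, grid=200, offset=1):
    """Aggregate (uid, d, t, x, y) records into two grid x grid heatmaps.

    Returns (activity_log, user_log), both indexed as [row][col] = [y][x].
    `offset` is the index of the first cell (assumed to be 1 here).
    """
    counts = [[0] * grid for _ in range(grid)]
    users = defaultdict(set)  # (row, col) -> set of user ids seen there
    for uid, _d, _t, x, y in records:
        row, col = y - offset, x - offset
        counts[row][col] += 1
        users[(row, col)].add(uid)
    activity_log = [[math.log10(1 + c) for c in r] for r in counts]
    user_log = [
        [math.log10(1 + len(users.get((r, c), ()))) for c in range(grid)]
        for r in range(grid)
    ]
    return activity_log, user_log


# The five rows of the data sample above
sample = [
    (0, 0, 1, 79, 86),
    (0, 0, 2, 79, 86),
    (0, 0, 8, 77, 86),
    (0, 0, 9, 77, 86),
    (0, 0, 19, 81, 89),
]
activity, unique_users = build_heatmaps(sample)
```

The first heatmap counts every record, the second counts each user at most once per cell.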

finding out which city it is

  1. Japan is an island country
  2. the largest cities are on the shores
  3. 100 km by 100 km observation area
  4. some low-activity parts of the heatmap must be water

finding out which city it is – largest cities

Tokyo (1) and Yokohama (2)
Osaka (3)
Nagoya (4)

finding out which city it is – Nagoya

problem: the heatmap contains no geographic information

template matching

via Wikipedia by Laserlicht and Benjamin Watson | CC BY 4.0

applying template matching to the problem

  • a map is required where 1 pixel is 500 m

  • template: 200×200 image
  • 1 pixel is 500 m
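The idea of template matching can be sketched as an exhaustive search: slide the template over the map and keep the placement with the smallest difference. The score used here (sum of squared differences) is an assumption, since the talk does not say which matching score was used. In practice a library routine such as OpenCV's `cv2.matchTemplate` would be the usual choice.

```python
def match_template(image, template):
    """Find the best placement of `template` inside `image`.

    Both are 2D lists of numbers indexed as [row][col].
    Returns (x, y, score): the upper-left corner of the best placement
    and its sum of squared differences (lower is better).
    """
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best = None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            ssd = 0
            for j in range(th):
                image_row = image[y + j]
                template_row = template[j]
                for i in range(tw):
                    diff = image_row[x + i] - template_row[i]
                    ssd += diff * diff
            if best is None or ssd < best[2]:
                best = (x, y, ssd)
    return best


# Toy example: a 2 x 2 pattern hidden in a 5 x 6 map at x=3, y=2
toy_map = [[0] * 6 for _ in range(5)]
toy_map[2][3] = 1
toy_map[3][3] = 1
toy_map[3][4] = 1
pattern = [[1, 0], [1, 1]]
print(match_template(toy_map, pattern))  # (3, 2, 0)
```

The returned corner is the relative location of the template within the map, in pixels.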

preparing the template – thresholding

 
threshold = 75
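A minimal sketch of the thresholding step, assuming it is applied to the raw activity counts and that cells at or above the threshold become white (255). Whether the comparison is inclusive is not stated in the talk.

```python
def binarize(counts, threshold=75):
    """Turn a 2D list of counts into a black-and-white template."""
    return [[255 if c >= threshold else 0 for c in row] for row in counts]


print(binarize([[10, 75], [300, 0]]))  # [[0, 255], [255, 0]]
```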

geolocating with template matching

 
 
 
 

  • relative location: 127, 358
  • absolute

verifying the geolocation

did more transformations happen?

  • the grid was rotated and mirrored
  • what if a raster is not exactly 500 m × 500 m?
    • because the grid was also shrunk or stretched

the template treats one pixel as one raster of the grid

it is difficult to shrink or stretch a pixel

instead, the map can be shrunk and stretched inversely

the best fit turned out to be a 10% shrink in width and a 10% stretch in height

a raster of the grid is actually 450 m × 550 m
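The arithmetic behind these figures, as a small sketch. The function and parameter names are illustrative only.

```python
def effective_raster_size(nominal_m=500.0, width_scale=0.9, height_scale=1.1):
    """Raster size in meters after a relative shrink/stretch of the nominal size."""
    return nominal_m * width_scale, nominal_m * height_scale


width_m, height_m = effective_raster_size()
print(round(width_m), round(height_m))  # 450 550
```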

verification

  1. estimate each user’s home location
  2. associate rasters with municipalities
  3. sum the estimated inhabitants per municipality
  4. compare with census data [5]
correlation coefficient (Pearson’s R): 0.8879
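The verification steps can be sketched as follows. The home estimate (most visited cell during night slots, taken here as 22:00 to 06:00) and the `cell_to_municipality` lookup are assumptions for illustration; the talk does not state how homes were estimated or how rasters were assigned.

```python
import math
from collections import Counter, defaultdict

# Assumed night window: half-hour slots 44-47 (22:00-24:00) and 0-11 (00:00-06:00)
NIGHT_SLOTS = set(range(0, 12)) | set(range(44, 48))


def estimate_homes(records):
    """Map each uid to its most visited (x, y) cell during night slots."""
    visits = defaultdict(Counter)
    for uid, _d, t, x, y in records:
        if t in NIGHT_SLOTS:
            visits[uid][(x, y)] += 1
    return {uid: cells.most_common(1)[0][0] for uid, cells in visits.items()}


def inhabitants_per_municipality(homes, cell_to_municipality):
    """Sum estimated inhabitants; cells without a municipality are skipped."""
    totals = Counter()
    for cell in homes.values():
        name = cell_to_municipality.get(cell)
        if name is not None:
            totals[name] += 1
    return totals


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (sd_x * sd_y)
```

The estimated totals and the census figures would then be passed to `pearson_r` in the same municipality order.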

robustness check – thresholding

50

500

1000

5000

anchor difference by threshold

threshold   25   50  100  200  300  400  500  600  700  800  900 1000 2000 3000 4000 5000
x            0    0    0    0    0    0    0    0    0    0 -228 -228   65   61 -387 -387
y            0    0    0    1    1    1    1    1    1    1   44   45  150  149   78   79

different cities, different data

Toronto

London

different cities, different data

Helsinki

Dallas–Fort Worth

different cities, different data – results

Toronto
Helsinki
London
London
Dallas–Fort Worth

what did we learn so far?

  1. human activity is closely linked to urban areas
  2. even if the activity is transformed to a virtual plane (blind map)
    the city is geolocatable

upscaling

  • spatial discretization of the mobility data is a 500 m × 500 m grid
  • what if a coarser grid was used?
    • could it prevent the geolocation?

merging  width (m)  resolution
none     500        200×200
2 × 2    1000       100×100
4 × 4    2000       50×50
8 × 8    4000       25×25
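A minimal sketch of the merging step, assuming that upscaling simply sums the counts of each k × k block of rasters.

```python
def upscale(counts, k):
    """Merge k x k blocks of a 2D count grid by summation."""
    height, width = len(counts), len(counts[0])
    if height % k or width % k:
        raise ValueError("grid size must be divisible by k")
    return [
        [
            sum(counts[row * k + j][col * k + i] for j in range(k) for i in range(k))
            for col in range(width // k)
        ]
        for row in range(height // k)
    ]


grid = [[1] * 200 for _ in range(200)]
coarse = upscale(grid, 8)
print(len(coarse), len(coarse[0]), coarse[0][0])  # 25 25 64
```

With k = 8, the 200 × 200 grid of 500 m rasters becomes a 25 × 25 grid of 4 km rasters.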

upscaled geolocation

100×100 pixel

50 × 50 pixel

25 × 25 pixel

 
threshold = 75
threshold = 375

1 km × 1 km discretization

2 km × 2 km discretization

4 km × 4 km discretization

4 km × 4 km discretization

why does upscaling even matter?

privacy considerations

  • location data is likely to lead to privacy risks [6]
    • must be coarse in either the time domain or the space domain
  • even partial location data can be used to infer the location
    • like altitude information from fitness tracker applications [7]
  • the top four locations are enough to stand out from the crowd [8]
    • works almost like a fingerprint

distinguishable by top four locations

location  500 m × 500 m  1 km × 1 km  2 km × 2 km  4 km × 4 km
4                 99998        35469        12882         5090
3                     2        48228        42323        28457
2                     0        15582        38548        50987
1                     0          721         6247        15466
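An illustrative sketch of the fingerprint idea: take each user's most visited cells and count how many users share that set with nobody else. This is not necessarily the computation behind the table above. The `merge` parameter (rasters merged per side) and the 1-based `offset` are assumptions.

```python
from collections import Counter, defaultdict


def top_locations(records, n=4, merge=1, offset=1):
    """Map each uid to the set of its n most visited (merged) cells."""
    visits = defaultdict(Counter)
    for uid, _d, _t, x, y in records:
        cell = ((x - offset) // merge, (y - offset) // merge)
        visits[uid][cell] += 1
    return {
        uid: frozenset(cell for cell, _count in cells.most_common(n))
        for uid, cells in visits.items()
    }


def count_distinguishable(top):
    """Count users whose set of top locations is not shared with anyone."""
    frequency = Counter(top.values())
    return sum(1 for cells in top.values() if frequency[cells] == 1)
```

Running this with `merge` set to 1, 2, 4 and 8 would show how a coarser grid makes users harder to tell apart.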

there’s a trade-off between privacy preservation and researchers’ interest in working with granular mobility data

distinguishing does not mean identifying, but with additional information that could happen as well, especially for famous people

how could privacy be preserved?

  • exclude location information completely
  • add noise to (a part of) the locations
  • or add fake appearances

adding noise

  • Gaussian noise to every location
    • with a standard deviation of 250, 500, 1000 (shown in the figure), or 2000 meters
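A minimal sketch of the noise addition, assuming the noise is drawn independently for x and y, converted from meters to rasters (500 m each), rounded to the nearest raster and clamped to the grid. The rounding, the clamping and the 1-based index range are assumptions.

```python
import random


def add_gaussian_noise(points, sigma_m, cell_m=500.0, grid=200, seed=None):
    """Shift each (x, y) cell by Gaussian noise with std. dev. sigma_m meters."""
    rng = random.Random(seed)

    def clamp(value):
        # keep the noisy cell inside the assumed 1..grid index range
        return min(max(value, 1), grid)

    noisy = []
    for x, y in points:
        nx = round(x + rng.gauss(0.0, sigma_m) / cell_m)
        ny = round(y + rng.gauss(0.0, sigma_m) / cell_m)
        noisy.append((clamp(nx), clamp(ny)))
    return noisy


points = [(79, 86), (77, 86), (81, 89)]
noisy_points = add_gaussian_noise(points, sigma_m=1000.0, seed=42)
```

A standard deviation of 1000 m corresponds to two rasters on the original grid.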

geolocating after noise addition

Euclidean distance between the upper-left corner of the observation area and the result of the template matching operation (in meters)

Toronto
London
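The error measure in the figure can be sketched as the Euclidean distance between the expected and the matched upper-left corner, converted from pixels to meters. The 500 m pixel size follows the map scale used earlier; the function name is illustrative.

```python
import math


def anchor_error_m(found, expected, pixel_m=500.0):
    """Distance in meters between two (x, y) anchors given in pixels."""
    dx = (found[0] - expected[0]) * pixel_m
    dy = (found[1] - expected[1]) * pixel_m
    return math.hypot(dx, dy)


print(anchor_error_m((3, 4), (0, 0)))  # 2500.0
```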

what did we learn so far?

  1. human activity is closely linked to urban areas
  2. even if the activity is transformed to a virtual plane (blind map)
    the city is geolocatable
    • even after strong discretization
    • or noise addition

observation period

a follow-up study by Abhishek Kumar Mishra, Mathieu Cunche, and Héber H. Arcolezi inferred the observation period [9]

  • daily activity shows the weekday–weekend differences [ ]
  • holidays [ ]
    • Respect for the Aged Day (16/09/2019), Autumn Equinox (23/09/2019), Health and Sports Day (14/10/2019), Enthronement Ceremony Day (22/10/2019), and Culture Day (04/11/2019)
  • typhoon Hagibis made landfall in Japan (12 October 2019) [ ]

observation period – verification

Port Messe Nagoya – event center

 
events collected by Mishra et al. [9]

day 0 is 15 September 2019 [9]
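With day 0 known, the anonymized (d, t) pairs can be turned into calendar timestamps. A minimal sketch, assuming each of the 48 daily time slots is 30 minutes long:

```python
from datetime import datetime, timedelta

DAY_ZERO = datetime(2019, 9, 15)  # day 0 according to [9]


def to_datetime(d, t):
    """Convert day index d and half-hour slot t (0-47) to a timestamp."""
    return DAY_ZERO + timedelta(days=d, minutes=30 * t)


print(to_datetime(0, 1))          # 2019-09-15 00:30:00
print(to_datetime(27, 0).date())  # 2019-10-12
```

Under this mapping, day 27 is 12 October 2019, the date given above for the landfall of typhoon Hagibis.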

takeaway

  1. any data that describes human activity reflects human behavior
    • contains a lot of implicit information
  2. human activity is closely linked to urban areas
  3. even if the activity is transformed to a virtual plane (blind map)
    the urban landscape is recognizable
    • obscuring the spatial dimension doesn’t increase privacy
  4. the temporal characteristics of human mobility data also reflect the circadian rhythm
    • and the social / economic routine

thanks for the attention!

Gergő Pintér, gergo.pinter @ uni-corvinus.hu, @pintergreg

references

[1] Z. Chen, “Spatiotemporal checkins with social connections.” Zenodo, Mar. 2022.
[2]
[3]
[4] T. Yabe et al., “YJMob100K: City-scale and longitudinal dataset of anonymized human mobility trajectories,” Scientific Data, vol. 11, no. 1, p. 397, 2024.
[5] Official Statistics of Japan, “Population census 2020.” 2020.
[6] H. Zang and J. Bolot, “Anonymization of location data does not work: A large-scale measurement study,” in Proceedings of the 17th Annual International Conference on Mobile Computing and Networking, 2011, pp. 145–156.
[7] U. Meteriz-Yildiran, N. F. Yildiran, J. Kim, and D. Mohaisen, “Learning location from shared elevation profiles in fitness apps: A privacy perspective,” IEEE Transactions on Mobile Computing, 2022.
[8] Y.-A. De Montjoye, C. A. Hidalgo, M. Verleysen, and V. D. Blondel, “Unique in the crowd: The privacy bounds of human mobility,” Scientific Reports, vol. 3, no. 1, pp. 1–5, 2013.
[9] A. K. Mishra, M. Cunche, and H. H. Arcolezi, “Breaking anonymity at scale: Re-identifying the trajectories of 100K real users in Japan,” arXiv preprint arXiv:2506.05611, 2025.

other cities upscaled
