Trajectory Simulation
Problem
Trajectories are sequences of (ObjectID, Location, Time) triples. There is a plethora of research on trajectory mining solving problems such as:
Clustering Trajectories
Outlier Detection on Trajectories
Next Location Prediction
User Identification/Classification
Location Recommendation
and many more
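As a minimal sketch of the data structure behind all of these tasks, the following Python snippet stores raw (ObjectID, Location, Time) triples and groups them into one time-ordered trajectory per object. The names Point and build_trajectories, the latitude/longitude representation of a location, and the sample coordinates are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import NamedTuple


class Point(NamedTuple):
    """One (ObjectID, Location, Time) triple; location is split into lat/lon."""
    object_id: str
    lat: float
    lon: float
    t: float  # seconds since an arbitrary start time (assumed unit)


def build_trajectories(points):
    """Group raw points by object and sort each group by time."""
    trajectories = defaultdict(list)
    for p in points:
        trajectories[p.object_id].append(p)
    for traj in trajectories.values():
        traj.sort(key=lambda p: p.t)
    return dict(trajectories)


# Made-up sample points, deliberately out of time order.
points = [
    Point("car_1", 47.6101, -122.3421, 60.0),
    Point("car_1", 47.6097, -122.3331, 0.0),
    Point("car_2", 47.6205, -122.3493, 30.0),
]
trajectories = build_trajectories(points)
```

Each value in trajectories is then a list of points for a single object in increasing time order, which is the form most trajectory mining algorithms expect as input.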
GPS trajectory taken from a thirty-minute drive through Seattle
Dataset provided by: P. Newson and J. Krumm. Hidden Markov Map Matching Through Noise and Sparseness. ACMGIS 2009.
A problem in this field is the lack of publicly available trajectory datasets. Researchers commonly use the GeoLife GPS Trajectory dataset, which captures 182 users, or location-based social network data, which only covers a handful of users per city per day. This lack of real-world data severely limits progress in this field. It is impossible to conclude whether patterns mined from such datasets generalize to the broader population or overfit to the biased sample of 182 users. As precise location data is considered Personally Identifiable Information (PII), it is unlikely that any large real-world trajectory datasets will ever be published, for privacy reasons.
Instead of collecting data from the real world, however, we can create our own world. This is what we did in DARPA's Ground Truth Program (Principal Investigator, 2018-2020), where we simulated urban regions with tens of thousands of individuals (agents) who have a home, go to work, go to recreational places, meet and make friends, and follow other patterns of life. The video on the right shows a short period of such a simulation, with the spatial network (roads, buildings, locations of agents) on the left and the agents' emerging social network on the right. It also shows the simulated spread of an infectious disease across space, time, and social space.
This simulated world was created in close collaboration with social scientists and implements many social science theories (details of the model can be found in [1]). It is not real-world data, however, so any patterns found in the simulated data may not hold in the real world. Still, our simulated world follows socially plausible patterns of life and allows us to mine and publish generated datasets without any privacy concerns, as none of the simulated individuals are real.
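To make the idea of simulated patterns of life more concrete, here is a deliberately tiny sketch in Python. It is not the model described in [1]: the fixed schedule (work from 9:00 to 17:00, an optional outing from 18:00 to 20:00), the place names, and the helper simulate_day are all assumptions made for illustration. What it does show is that every simulated visit can carry its ground-truth purpose.

```python
import random
from typing import NamedTuple


class Visit(NamedTuple):
    agent_id: int
    place: str
    hour: int
    purpose: str  # ground-truth reason for the visit, known only in simulation


def simulate_day(agent_id, rng, recreation_sites):
    """Generate 24 hourly visits for one agent under a toy daily schedule."""
    home = f"home_{agent_id}"
    work = f"work_{agent_id % 3}"        # assumption: three shared workplaces
    goes_out = rng.random() < 0.5        # assumption: 50% chance of an outing
    site = rng.choice(recreation_sites)
    visits = []
    for hour in range(24):
        if 9 <= hour < 17:
            place, purpose = work, "work"
        elif goes_out and 18 <= hour < 20:
            place, purpose = site, "recreation"
        else:
            place, purpose = home, "home"
        visits.append(Visit(agent_id, place, hour, purpose))
    return visits


sites = ["park", "cafe", "gym"]
# One independent, seeded random generator per agent keeps runs reproducible.
world = {a: simulate_day(a, random.Random(a), sites) for a in range(5)}
```

Because the purpose field is generated together with the location, a mining algorithm can be scored against it directly, something that is not possible with real GPS traces.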
This sandbox world allows us to evaluate theories and algorithms, and to understand the potential and limitations of research on trajectory data: What patterns would we be able to find if we had perfect trajectory data? And can existing algorithms scale beyond 182 users?
Most importantly, our simulated world provides us with a ground truth of the entire world. Not only do we have trajectories for 100% of our simulated population, we also know the underlying semantics of each trajectory: Why did agents visit places? This sandbox allows us to create data mining challenges, such as having specific agents behave abnormally and challenging data mining solutions to find these outliers in a haystack of simulated normal trajectories.
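The following Python sketch illustrates such a challenge on a toy scale. A few agents are planted with abnormal behavior, and a simple z-score detector is checked against the known ground truth. Everything specific here is an assumption for illustration: the night-time feature, the place names, the 5% chance of a late night, the threshold of 3 standard deviations, and the planted agent IDs. The z-score test stands in for whatever outlier detection method is actually being evaluated.

```python
import random
import statistics

NIGHT_HOURS = range(0, 5)  # assumption: 0:00 to 4:59 counts as night


def simulate_agent(agent_id, days, rng, abnormal=False):
    """Return (agent_id, place, day, hour) records for the night hours only."""
    records = []
    for day in range(days):
        out_late = rng.random() < 0.05  # normal agents are occasionally out late
        for hour in NIGHT_HOURS:
            if abnormal and hour in (1, 2, 3):
                place = "warehouse"     # planted abnormal behavior
            elif out_late and hour == 0:
                place = "bar"
            else:
                place = "home"
            records.append((agent_id, place, day, hour))
    return records


def night_hours_away(records):
    """Feature: number of night hours not spent at home."""
    return sum(1 for _, place, _, _ in records if place != "home")


rng = random.Random(42)
planted = {17, 123}  # ground truth: the agents told to behave abnormally
features = {
    a: night_hours_away(simulate_agent(a, 30, rng, abnormal=a in planted))
    for a in range(200)
}

# Flag agents whose feature lies more than 3 standard deviations above the mean.
mean = statistics.mean(features.values())
stdev = statistics.stdev(features.values())
flagged = {a for a, x in features.items() if (x - mean) / stdev > 3}

# With ground truth available, precision and recall can be computed exactly.
true_positives = len(flagged & planted)
precision = true_positives / len(flagged) if flagged else 0.0
recall = true_positives / len(planted)
```

In a real dataset the set planted would be unknown, so precision and recall could only be estimated. In the simulated world they are exact, which is what makes the sandbox useful as a benchmark.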
For more details, we have published a vision paper on trajectory and social network simulation [2]. We've also published a first paper on massive location-based social network data generation that provides datasets many orders of magnitude larger than real-world datasets [3].
A glimpse of future directions for this project can be found on IARPA's HAYSTAC Program website, which includes a lightning talk of mine (listed as "George Mason University Lightning Talk", as I was still at GMU at the time). Interested students and postdocs are encouraged to take a look at the HAYSTAC Broad Agency Announcement (BAA) found on that website.
For this project, I have funding for two PhD students (fully funded) and for one postdoc. Together, we will work on both:
1. Massive Trajectory Microsimulation to scale our simulation to millions of agents. This will require 1a) expertise in big data management, as our generated dataset will be larger than any dataset ever observed in the real world, because we have a sample of 100% of the population including 100% of their location updates, and 1b) expertise in efficient algorithms and indexing to increase the scalability of the simulation.
2. Big Trajectory Data Mining to automatically find patterns that are planted in the simulation, including clusters and outliers. This will require expertise in 2a) traditional data mining (scalable clustering and outlier detection algorithms) as well as 2b) deep learning for trajectories to find suitable representations of trajectory data, so that clusters and outliers can be found efficiently in the learned feature space.
Funding
Extramural funding (source undisclosed at this time) is available for two PhD students (full funding for three years each) and one postdoc. If you are interested in either, feel free to email me.
Collaborations
This work will be in collaboration with the Department of Computer Science at Tulane University, the Department of Geography at the University at Buffalo, and the Department of Geography and Geoinformation Science at George Mason University.
[1] Züfle, A., Wenk, C., Pfoser, D., Crooks, A., Kim, J.S., Kavak, H., Manzoor, U. and Jin, H., 2021. Urban life: a model of people and places. Computational and Mathematical Organization Theory, pp.1-32.
[2] Kavak, H., Kim, J.S., Crooks, A., Pfoser, D., Wenk, C. and Züfle, A., 2019, August. Location-based social simulation. In Proceedings of the 16th international symposium on spatial and temporal databases (pp. 218-221).
[3] Kim, J.S., Jin, H., Kavak, H., Rouly, O.C., Crooks, A., Pfoser, D., Wenk, C. and Züfle, A., 2020, June. Location-based social network data generation based on patterns of life. In 2020 21st IEEE International Conference on Mobile Data Management (MDM) (pp. 158-167). IEEE.