In recent years, ride-hailing and carpooling platforms, which match riders with drivers, have become an increasingly popular and convenient way of moving around most modern cities, with Uber, Lyft and Didi being the biggest providers in the industry. In light of increased environmental awareness and concerns about minimizing carbon footprints, ridesharing and carpooling have become increasingly important.
Carpooling has numerous societal and individual benefits, including but not limited to the reduction of greenhouse-gas emissions and cost savings from shared travel costs for public agencies and employers.
In their paper, Furuhata et al. (2013) present the key aspects of existing ridesharing systems and design a framework to identify the challenges facing ridesharing, thereby fostering the development of mechanisms to overcome them and promote widespread use.
Emerging studies demonstrate that psychological factors such as monetary and time benefits are becoming more dominant in decisions to use ride-hailing and carpooling services. In relation to rider satisfaction, Kooti et al. (2017) found that surge pricing does not bias Uber towards higher-income riders; rather, homophilous matching (that is, matching riders to drivers of a similar age) resulted in higher ratings. They went on to use these insights to predict driver and rider retention. Examining ridesharing platforms, Hahn and Metcalfe (2017) concluded that, moving forward, these platforms will do more good than harm. They also found that relatively little is known about the platforms' efficiency and equity, although this is likely to change with growing research interest. Using online reviews by drivers of the popular ride-hailing companies Uber and Lyft, Shokoohyar (2018) demonstrated a preference for Uber over Lyft. In addition, the analysis shows increased competition to attract more drivers, with drivers counting job flexibility and meeting new people as the main advantages. In contrast, insufficient compensation, poor job security, poor rider behavior and poor customer service were identified as impeding factors.
2. Problem Statement
The Braess Paradox is a network phenomenon in which the addition of extra capacity reduces overall network performance over time, with the lack of cooperation among users being the ultimate culprit for network breakdown.
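The paradox can be made concrete with the classic textbook road network, which is not taken from this study: 4000 drivers travel from Start to End via A or via B, two links take a fixed 45 minutes, and two links take (number of drivers)/100 minutes. The sketch below, with illustrative function names, compares the selfish equilibrium before and after a zero-cost shortcut from A to B is added.

```python
# Textbook Braess network (illustrative values, not data from this study):
#   Start -> A : n/100 minutes (congestible)    A -> End : 45 minutes (fixed)
#   Start -> B : 45 minutes (fixed)             B -> End : n/100 minutes (congestible)
FIXED_MINUTES = 45.0


def congestible(n_on_link: float) -> float:
    """Travel time on a congestible link carrying n_on_link drivers."""
    return n_on_link / 100.0


def time_without_shortcut(n_drivers: int) -> float:
    """By symmetry, drivers split evenly over the two routes at equilibrium."""
    per_route = n_drivers / 2
    return congestible(per_route) + FIXED_MINUTES


def time_with_shortcut(n_drivers: int) -> float:
    """With a free A -> B link, each congestible link costs at most
    n_drivers/100 minutes. While that is below 45, every driver prefers
    Start -> A -> B -> End, so both congestible links carry everyone."""
    return congestible(n_drivers) + 0.0 + congestible(n_drivers)


def time_if_one_driver_deviates(n_drivers: int) -> float:
    """A single driver leaving the shortcut route for Start -> A -> End."""
    return congestible(n_drivers) + FIXED_MINUTES


if __name__ == "__main__":
    n = 4000
    print(time_without_shortcut(n))        # 65.0 minutes per driver
    print(time_with_shortcut(n))           # 80.0 minutes per driver
    print(time_if_one_driver_deviates(n))  # 85.0, so nobody deviates
```

Adding capacity raises every driver's travel time from 65 to 80 minutes, yet no individual driver can do better by switching routes. Only coordinated behavior avoids the loss.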
Congestion & Vehicle-Clustering: Over the years, the number of vehicles engaged in ride-hailing has increased astronomically, surpassing taxis in many cities. A report from the Union of Concerned Scientists (2020) shows that ride-hailing trips are responsible for 69 percent more emissions than the trips they displace, with a significant share of travel being deadheading (dead mileage). Deadheading is the travel without a passenger between a drop-off and the next pick-up, and is associated with increased costs (Nair et al., 2020). Surge pricing, where prices are adjusted upwards in response to an acute driver shortage, is viewed as a disincentive by many riders, leading to lost revenue.
Solving the problem
In order to combat the problems above, it is necessary to develop a sound driver deployment strategy. Collective Intelligence (COIN) was first suggested as a way of solving the Braess paradox. It involves all network users acting centrally for the benefit of all. Chaudhari et al. (2018) observed that strategic repositioning is key to maximizing driver earnings, as opposed to surge chasing, which increases dead mileage. The first step is therefore to be able to predict demand and deploy vehicles accordingly. Merlin (2019) concludes that centralized fleet coordination offers substantial benefits towards sustainable growth and market share.
Research Purpose and Objective
The objective is to develop a city-wide prediction algorithm capable of predicting trip pick-up and drop-off points, as well as potential pick-up locations after each drop-off, from historical data using data mining techniques.
Case study: City of Chicago, Illinois.
3. Related Works
The growth in demand for ride-hailing services has disrupted urban transportation and is changing the way in which people travel. Modern ride-hailing services require efficient recommendation systems in order to improve both the rider and the driver experience. In response, many researchers have conducted experiments to predict ride-hailing demand and thereby enable more effective vehicle deployment.
In attempting to maximize the number of pick-ups whilst minimizing waiting time for taxi services, Wan et al. (2020) developed a ride-hailing recommendation system that works in three phases. The model first estimates future customer demand in different clusters within the area of interest. This is followed by taxi-to-region matching according to preset rules and conditions, including driver preference. Finally, an optimized geo-routing algorithm helps drivers minimize dead mileage. The main problem with this approach is the instability of driver preference, which changes frequently and makes the approach difficult to deploy in real-world situations.
Dead-mileage comprises a significant share of total travel covered by drivers within the ride-hailing industry in terms of miles travelled and number of trips overall. Accurate demand prediction within the ride-hailing industry can greatly improve vehicle utilization whilst reducing waiting time. Customers mainly desire minimization of waiting time whilst drivers on the other hand aim to minimize deadheading and idle time after trips. This subsection of the industry comprises another area of strong research interest.
One of the first studies in this emerging field develops a model which predicts the gap between rider demand and driver supply within a given time period and specific geographic area, using Point of Interest (POI), traffic and weather data as well as data from car-sharing orders. A data sampling technique is used to determine patterns and generalizations which can be applied in real-world scenarios, forming the basis for future work. This concept of finding the supply and demand gap is important, as it allows drivers to be deployed in a way that improves the level of service.
Time based demand prediction is another research area fast gaining ground. This is based on the premise of predicting ride-hailing vehicle demand in the next hour.
3.1. Operational Research Mobility Optimization
The vast majority of human interaction takes place in one of two areas: home or work. To further understand the mobility patterns of ridesharing users across home and work locations, as well as the social ties between users, Cici et al. (2017) developed an algorithm for matching users with similar mobility patterns under constraints, and concluded that there was a decrease in social distance of as much as 31% when users shared rides with others. These findings indicate the importance of studying mobility patterns and the benefits which can be derived from optimizing ride-hailing services at an operational level. Using a more flexible and extensible mobility model representing ridesharing users' movements and habits, Roor et al. (2017) deploy a Variable-Order Markov Model (VOMM) underpinned by a Prediction by Partial Matching (PPM) algorithm for next-location prediction, with a prediction accuracy ranging from 60% to 81%. A major limitation of the PPM algorithm lies in its compression process, which tends to limit performance over time. Comparing the use of privately owned vehicles with two Autonomous Mobility-on-Demand (AMoD) systems simulated on a real transport network under different scenarios based on the current situation, Dia and Javanshour (2017) found that deploying an AMoD system resulted in a major decrease in both the number of vehicles required to meet transport needs (43% in AMoD1 and 88% in AMoD2) and the street parking space required (58% in AMoD1 and 83% in AMoD2). Sonet et al. (2019) also cite effective road utilization as another advantage of designing the matching algorithm.
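As a rough illustration of variable-order next-location prediction, the sketch below counts which location follows each recent context and backs off to shorter contexts when a long one has never been seen. It is a simplification written for this discussion, not the model from the cited work: a full PPM implementation uses escape probabilities rather than a hard back-off, and the class name, `max_order` parameter and location labels are all assumptions.

```python
from collections import Counter, defaultdict


class BackoffMarkovPredictor:
    """Variable-order Markov predictor with hard back-off (simplified PPM idea)."""

    def __init__(self, max_order: int = 2):
        self.max_order = max_order
        # Maps a context (tuple of recent locations) to counts of the next location.
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for i, nxt in enumerate(seq):
                # Record the successor under every context length from 0 to max_order.
                for order in range(self.max_order + 1):
                    if i - order < 0:
                        break
                    context = tuple(seq[i - order:i])
                    self.counts[context][nxt] += 1
        return self

    def predict(self, history):
        # Try the longest available context first, then back off to shorter ones.
        longest = min(self.max_order, len(history))
        for order in range(longest, -1, -1):
            context = tuple(history[len(history) - order:])
            if context in self.counts:
                return self.counts[context].most_common(1)[0][0]
        return None  # the model has not been fitted on any data


if __name__ == "__main__":
    # Hypothetical location sequences for illustration only.
    trips = [
        ["home", "work", "gym"],
        ["home", "work", "gym"],
        ["cafe", "work", "home"],
    ]
    model = BackoffMarkovPredictor(max_order=2).fit(trips)
    print(model.predict(["home", "work"]))  # order-2 context was observed
    print(model.predict(["park", "work"]))  # backs off to the order-1 context
```

The back-off step is what makes the order variable: a long, specific context is used when the data supports it, and a shorter one otherwise.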
Autonomous Mobility-on-Demand vehicles are viewed by many as the future of transport. However, their effectiveness hinges largely on the ability to coordinate their movement and predict demand as accurately as possible using the vast quantity of data now available, which is the direction this paper pursues.
In an attempt to handle the surge of homeward-bound travellers during the holiday seasons, Jiang et al. (2018) proposed a large-scale ridesharing system called CountryRoads®, which uses an online greedy matching algorithm to match drivers and passengers and recorded a success rate of 23.2%. Online greedy matching algorithms perform comparatively poorly when applied to complex systems such as ride-hailing services, as the authors experienced. This is largely due to the rigidity of the process, which makes it less than ideal for location prediction. Based on the concept of space-time windows, Zhao et al. (2018) develop an approach based on Lagrangian relaxation and conclude that adopting flexible pickup and delivery locations reduces system-wide cost whilst improving service quality. Although this hypothesis was found to be true, it runs against the door-to-door purpose of ride-hailing services. Flexible pickup and delivery have not been widely accepted even within the carpooling sphere, as centralized pick-up locations are yet to gain wide acceptance.
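The rigidity of online greedy matching can be seen in a minimal sketch. The code below is not the CountryRoads implementation; the function name, the planar coordinates and the `max_distance` cut-off are assumptions made for illustration. Each request is assigned, in arrival order, to the nearest free driver within range, and that decision is never revisited.

```python
from math import hypot


def greedy_match(requests, drivers, max_distance):
    """Assign each rider, in arrival order, to the nearest free driver in range.

    requests: list of (rider_id, (x, y)) in arrival order
    drivers:  dict of driver_id -> (x, y)
    Returns a dict rider_id -> driver_id; riders with no driver in range are omitted.
    """
    available = dict(drivers)  # copy so the caller's dict is left untouched
    matches = {}
    for rider_id, (rx, ry) in requests:
        best_id, best_dist = None, None
        for driver_id, (dx, dy) in available.items():
            dist = hypot(rx - dx, ry - dy)
            if dist <= max_distance and (best_dist is None or dist < best_dist):
                best_id, best_dist = driver_id, dist
        if best_id is not None:
            matches[rider_id] = best_id
            del available[best_id]  # a matched driver is no longer free
    return matches


if __name__ == "__main__":
    drivers = {"d1": (0.0, 0.0), "d2": (3.0, 0.0)}
    requests = [("r1", (1.0, 0.0)), ("r2", (-1.0, 0.0))]
    # r1 takes d1 (the nearest driver), leaving r2 with only d2, which is out of range.
    print(greedy_match(requests, drivers, max_distance=2.0))  # {'r1': 'd1'}
```

In this toy case a look-ahead assignment (r1 to d2, r2 to d1) would serve both riders, whereas the greedy rule serves only one. This is the kind of loss that better demand and location prediction aims to reduce.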
3.2. Linear Programming & Statistical Methods
Among optimization solutions based on linear programming, a Tabu-based meta-heuristic algorithm has been deployed to solve the mixed integer linear program (MILP) under differing scenarios. The algorithm is observed to have higher computational accuracy than the control, and the introduction of meet points to the ridesharing system reduces total travel time by 2.7% to 3.8% in scaled tests. Because meet points have not been widely accepted within the ride-hailing and carpooling industry, especially given Covid-19 social distancing protocols, the benefits of reduced travel time and the associated reduction in travel costs cannot be fully quantified. This demonstrates the need to improve location prediction as a lasting solution.
From the domain of probability and statistics, Tachet et al. (2017) collected data on taxi trips in New York, Singapore, San Francisco and Vienna and computed shareability curves for each city. Through natural rescaling, these collapse into a universal curve which is used to predict the potential of ridesharing in any given city based on a few qualities and parameters. The statistical methods employed give a general overview of the growth potential of ride-hailing services in any given city. This helps with city planning, but does not examine rider-driver interaction.
Examining the relationship between the frequency and probability of ridesharing usage and the frequency of public transit usage, Zhang and Zhang (2018) develop a zero-inflated negative binomial regression model. Results show a positive relationship between ridesharing and public transit use, particularly for people living in areas of high population density with comparatively fewer vehicles. This is significant because it allows ride-hailing service utilization to be measured across population densities within any given city, taking anticipated demand into consideration, and it informed the selection of the case study for this research.
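For readers unfamiliar with the model class, the general form of a zero-inflated negative binomial distribution is sketched below. This is the standard textbook form, not the specification fitted in the cited study, and the symbols $\pi$ (probability of a structural zero, e.g. a respondent who never uses ridesharing), $\mu$ (mean of the count process) and $r$ (dispersion) are notation assumed here.

```latex
\begin{align}
P(Y = 0) &= \pi + (1 - \pi)\left(\frac{r}{r + \mu}\right)^{r}, \\
P(Y = y) &= (1 - \pi)\,\frac{\Gamma(y + r)}{\Gamma(r)\, y!}
            \left(\frac{r}{r + \mu}\right)^{r}
            \left(\frac{\mu}{r + \mu}\right)^{y}, \qquad y = 1, 2, \dots
\end{align}
```

In a regression setting, $\pi$ is typically linked to covariates through a logit function and $\mu$ through a log link, which is how one model can address both the probability and the frequency of use.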
4. Research Framework and Design
To reduce the number of vehicles, alleviate traffic jams and curb pollution when transporting people to office hubs in Poland, Olszewski et al. (2018) collected a representative sample of the population and used spatial data mining techniques to develop a set of parameters for a multi-agent system. Using the distributed, model-free system DeepPool®, based on deep Q-network (DQN) techniques, Alabbasi et al. (2019) develop an algorithm able to learn the optimal dispatch policy through interaction with the environment, incorporating travel demand statistics and a dataset of taxi trips in New York to dispatch vehicles and anticipate future demand. Deploying a deep convolutional neural network (CNN) for multi-step ride-hailing demand prediction using trip request data from Chengdu, Wang et al. (2019) show that CNN models train and predict faster than Long Short-Term Memory (LSTM) models.
In conducting this research, a large scale dataset of rideshare and taxi trips spanning 2018/2019 in Chicago is collected, as shown in Table 1, with each observation consisting of the following elements:
The data is processed and cleaned. As a first step, a comprehensive understanding of the individual features within the dataset is required, as well as knowledge of the trip distribution across the city, from origin (O) to destination (D). Numerous studies have demonstrated the importance of regional partitioning in location prediction. Research and experiments by Niu et al. (2019) demonstrated that regional partitioning led to better forecasting and demand prediction for geospatial data.
This is followed by scenario development. Figure 1 shows a color-coded layout of the City of Chicago, detailing its community areas as well as census tracts.
Figure 1. Map of City of Chicago including its community Areas and Census tracts. Visualization of potential pick-up and drop-off points across the city.
Table 1. Data points used for data mining and the development of the predictive algorithms.
4.2. Research Framework and Scope
Multidimensional Scenario Formulation
Scenario performance analysis allows performance to be measured under varied rider privacy limitations.
Scenario 1: Location prediction with no additional information, i.e. drop-off community area (destination) prediction with only pick-up (origin) data, and vice versa. This accommodates riders with strict privacy concerns about releasing information, and measures the ability to predict trip start and end points under rider privacy restrictions.
Scenario 2: Location prediction with partial information, i.e. drop-off community area (destination) prediction with pick-up data and Census Tract (destination zone) information, and vice versa. This measures the ability to predict trip start and end points under rider uncertainty.
Steps and Methodological process
Figure 2. Step by step methodological process in designing and evaluating predictive models used.
Figure 2 shows the steps taken in the design, evaluation and interpretation of the research framework employed in carrying out this work.
1) Perform Principal Component Analysis (PCA) on trip dataset. Record and analyze results against degree of variance covered by each principal component.
2) Perform feature scoring and ranking using Relief metrics. Record and analyze results.
3) Reevaluate steps 1 and 2. Determine features and variables with largest weight in designing and building the model.
4) Evaluate and score prediction accuracy and error (MAE, MSE, and R²) under both Scenario 1 and Scenario 2.
5) Carry out an in-depth analysis of Scenarios 1 and 2, first on drop-off community area prediction and then on pick-up community area prediction.
6) Analyze the implications for surge pricing policy and ridesharing efficiency.
Principal Component Analysis (PCA)
Principal component analysis (PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated principal components. The eigenvalue associated with each component measures the share of total variance that the component explains.
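In symbols, a standard formulation is as follows. The notation is assumed here rather than taken from the original: $X_c$ is the $n \times d$ mean-centred data matrix, $\Sigma$ its sample covariance matrix, and $(\lambda_k, v_k)$ the eigenvalue-eigenvector pairs sorted so that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$.

```latex
\begin{align}
\Sigma &= \frac{1}{n - 1}\, X_c^{\top} X_c, \qquad \Sigma\, v_k = \lambda_k v_k, \\
\text{explained variance ratio of component } k &= \frac{\lambda_k}{\sum_{j=1}^{d} \lambda_j}, \\
\text{cumulative variance of the first } m \text{ components} &= \frac{\sum_{k=1}^{m} \lambda_k}{\sum_{j=1}^{d} \lambda_j}.
\end{align}
```

The cumulative ratio is the quantity usually plotted when deciding how many components to retain.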
Feature Ranking Using RReliefF
The RReliefF algorithm estimates the quality of an attribute according to how well it discriminates between instances that are near each other. An instance R is selected at random, and its K nearest instances are then identified. For each attribute A, the difference between the value of A for R and the value of A for each of the K neighbouring instances is compared with the difference between their class (target) values. This process is repeated and ultimately yields a weight for each attribute ranging between −1 and 1.
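The weight is commonly written as follows, in the form given by Robnik-Šikonja and Kononenko, with notation assumed here: $m$ is the number of sampled instances, $N_{dC}$ accumulates the (normalized) differences in target value between each sampled instance and its neighbours, $N_{dA}[A]$ accumulates the differences in attribute $A$, and $N_{dC \& dA}[A]$ accumulates the products of the two.

```latex
\begin{equation}
W[A] \;=\; \frac{N_{dC \& dA}[A]}{N_{dC}} \;-\; \frac{N_{dA}[A] - N_{dC \& dA}[A]}{m - N_{dC}}
\end{equation}
```

The first term rewards an attribute that differs when the target differs. The second term penalizes an attribute that differs between neighbours whose targets are similar.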
Cross Validation Model Evaluation and Scoring
The Leave-P-Out Cross Validation (CV) approach holds out p data points as the validation set and trains on the remaining n − p points. This process is repeated for all possible combinations of p points, and the error is averaged over all trials in order to determine overall effectiveness.
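A minimal sketch of the procedure, using only the Python standard library, is shown below. The function names are illustrative, and a mean predictor stands in for the actual regression models purely to keep the example short.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n_samples: int, p: int):
    """Yield (train_indices, validation_indices) for every way of holding out p points."""
    indices = range(n_samples)
    for held_out in combinations(indices, p):
        held = set(held_out)
        train = [i for i in indices if i not in held]
        yield train, list(held_out)


def cv_mean_predictor_mae(y, p: int) -> float:
    """Average validation MAE of a predictor that always outputs the training mean."""
    fold_errors = []
    for train, val in leave_p_out_splits(len(y), p):
        prediction = sum(y[i] for i in train) / len(train)
        fold_errors.append(sum(abs(y[i] - prediction) for i in val) / len(val))
    return sum(fold_errors) / len(fold_errors)


if __name__ == "__main__":
    print(len(list(leave_p_out_splits(4, 2))), comb(4, 2))  # both are 6
    print(cv_mean_predictor_mae([1.0, 2.0, 3.0], p=1))      # 1.0
```

The number of splits is the binomial coefficient C(n, p), which grows very quickly with n. On a large trip dataset the exhaustive procedure is therefore normally approximated, for example by evaluating a random subset of the splits.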
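To measure the degree of error of the developed models, error metrics are used to judge model quality and compare the different regression models. The Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R²) are used for evaluation:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|Y_i - \hat{Y}_i\right|$$

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

$$R^2 = 1 - \frac{SSE_R}{SSE_M}, \qquad SSE_R = \sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2, \qquad SSE_M = \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2$$

where $Y_i$ is the observed value of Y, $\hat{Y}_i$ is the predicted value of Y, $\bar{Y}$ is the mean value of Y, $SSE_M$ is the sum of squared errors about the mean line, and $SSE_R$ is the sum of squared errors about the regression line.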
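The three metrics can be computed directly from their standard definitions, as in the short sketch below. The function name and the sample values are illustrative only, and R² is computed as one minus the ratio of the regression-line squared error (SSER) to the mean-line squared error (SSEM).

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R2) for paired observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    mean_y = sum(y_true) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    sse_r = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # about the regression line
    sse_m = sum((t - mean_y) ** 2 for t in y_true)             # about the mean line
    mse = sse_r / n
    # R2 is undefined when the observed values are all identical (sse_m == 0).
    r2 = 1.0 - sse_r / sse_m if sse_m > 0 else float("nan")
    return mae, mse, r2


if __name__ == "__main__":
    # Illustrative community-area numbers, not results from this study.
    mae, mse, r2 = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
    print(mae, mse, round(r2, 3))  # 0.5 1.0 0.2
```

When the target is a community area number, MAE has a direct interpretation: it is the average number of community areas by which a prediction misses.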
Predictive Modelling using Ensemble Learning
Generally, ensemble learning is the term used to describe meta-algorithms that make predictions based on inputs from several different models. By combining multiple individual models, the ensemble tends to have lower bias and variance and is less prone to overfitting, culminating in improved predictions.
AdaBoost and Random Forest are among the most commonly used ensemble methods.
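The variance reduction can be illustrated with a standard result for averaging, stated here under simplifying assumptions that are not part of the original text: $M$ base models $f_1, \dots, f_M$, each with prediction variance $\sigma^2$ and pairwise correlation $\rho$.

```latex
\begin{equation}
\operatorname{Var}\!\left(\frac{1}{M}\sum_{m=1}^{M} f_m(x)\right)
  \;=\; \rho\,\sigma^{2} \;+\; \frac{1 - \rho}{M}\,\sigma^{2}
\end{equation}
```

As $M$ grows the second term vanishes, so the benefit of averaging depends on keeping the base models weakly correlated. Random Forest encourages this through bootstrap sampling and random feature selection, whereas AdaBoost primarily targets bias by reweighting the examples that earlier models predicted poorly.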
5. Framework and Results
5.1. Principal Component Analysis (PCA)
To analyze the weights of the individual features within the data sample collected, PCA is performed, measuring the degree of variance covered by each principal component within the dataset.
The PCA results show that the cumulative variance explained increases as more attributes are included.
Figure 3 presents the results obtained from PCA. Certain attributes within the dataset are able to explain 55.9% of the recorded variance, with 5 attributes able to explain 73.7% of the variance, and so on. This aids in selecting the most important data attributes, which in turn improves the models' prediction accuracy. The analysis indicates that 9 attributes is the optimal number of features to incorporate in building the models.
Feature Scoring and Rank
After PCA, the features within the dataset are ranked according to their influence on the prediction output. Figure 4 details the weight associated with each attribute used in designing the model, with some attributes being more critical to predictive performance than others.
RReliefF is used to score and rank the individual features by level of importance, as shown in Figure 4.
5.2. Re-Evaluation and Model Calibration
Scenario 1: predicting the drop-off community area (destination) with only pick-up (origin) data. Model results show that linear regression models are able to predict potential drop-off areas within a radius of 13 blocks (community areas). This is in the absence of any information other than the pick-up point (origin).
The results are shown in Figure 5:
Figure 3. PCA analysis displaying the level of variance covered by each of the inputs.
Figure 4. Feature rank displaying the weight of each Data point in prediction performance and drop-off Community area distribution graph.
Figure 5. Evaluation results of predictive accuracy of algorithms and results comparison under scenario 1.
Figure 5 is divided into two parts: the first part (top) displays the results of model evaluation, whilst the second part displays the predicted locations against the actual ones. The dark column displays the actual drop-off community areas, with the predicted values to its left.
Scenario 2: predicting the drop-off community area (destination) with partial information, that is, with pick-up data and Census Tract (destination zone) information.
Model results show that ensemble learning models such as AdaBoost and Random Forest are able to predict potential drop-off areas precisely, with an error of under 1 block (community area).
This is with only the pick-up point (origin) and Census Tract (destination zone) information available. The results are shown in Figure 6.
Figure 6 is divided into two parts: the first part (top) displays the results of model evaluation, whilst the second part displays the predicted locations against the actual ones. The dark column displays the actual drop-off community areas, with the predicted values to its left.
Figure 6. Evaluation results of predictive accuracy of algorithms and results comparison under scenario 2.
Pick-up point Prediction after Drop-off
This part focuses on predicting demand centers within the city after any given drop-off. The aim is to predict rideshare demand centers, anticipating demand and price surges before they happen.
In an effort to optimize rideshare vehicle distribution, it is imperative to be able to predict where demand will occur ahead of time, taking advantage of imbalance of supply and demand as well as revenue per trip, with the results displayed in Figure 7 below.
Figure 7 is divided into two parts: the first part (top) displays the results of model evaluation, whilst the second part displays the predicted locations against the actual ones. The dark column displays the actual drop-off community areas, with the predicted values to its left.
Figure 7. Evaluation results of predictive accuracy of algorithms and results comparison under scenario.
Research into the field of mobility remains a hot topic amongst many researchers. Mobility-as-a-Service (MaaS), where vehicle trips are used to render services, has come to stay in an era that has seen a boom in ride-hailing services. The need to optimize the operations of these services remains of utmost importance. The results show that neural network algorithms perform best in generalizing pick-up and drop-off points when provided with only starting point information. This is significant because it allows for trip generalization in pooled trips, where riders are most likely to have a common drop-off point, e.g. coworkers' trips to work or a shared trip to a sporting event. The ensemble learning methods, AdaBoost and Random Forest, are able to predict both drop-off and pick-up points with an MAE of 1 community area knowing only the rider's pick-up point and Census Tract information, and in reverse to predict potential pick-up points using the drop-off point as the new starting point. This allows the algorithm to confidently predict the most likely pick-up point of potential riders following a drop-off, thereby increasing the supply of drivers into potential surge zones and making trip deployment less reactive and more proactive. It can be seen that the introduction of more data and ensemble learning techniques greatly increases the predictive accuracy of the model. This demonstrates the influence of data management within the ride-hailing industry, especially at a time when privacy concerns and the right to privacy have become a matter of safety and security that varies from rider to rider.
Direct implications for the ride-hailing industry and its operations include:
1) Improved vehicle utilization, and time efficiency.
2) Reduced dead-mileage and idle time after trips.
3) Improved rider and driver experience.
In conclusion, results from the research indicate that predictive modelling and analytics can be used to optimize driver positioning and deployment by predicting surge zones before they occur, irrespective of rider privacy settings.
The implications of these results for the transport industry include:
• Reduced incidence of surge pricing and increased rider satisfaction.
• Reduced transport costs.
• Increase in the ease of parking particularly in high-demand (downtown) areas.
• From a social and environmental point of view, fewer wasted miles would translate into lower emissions overall.
 Furuhata, M., Dessouky, M., Ordóñez, F., Brunet, M.-E., Wang, X. and Koenig, S. (2013) Ridesharing: The State-of-the-Art and Future Directions. Transportation Research Part B: Methodological, 57, 28-46.
 Kooti, F., Grbovic, M., Aiello, L.M., Djuric, N., Radosavljevic, V. and Lerman, K. (2017) Analyzing Uber’s Ride-Sharing Economy. International World Wide Web Conference Committee (IW3C2), Perth, 3-7 April 2017.
 Hahn, R. and Metcalfe, R. (2017) The Ridesharing Revolution: Economic Survey and Synthesis. Volume IV: More Equal by Design: Economic Design Responses to Inequality. Oxford University Press, Oxford.
 Shokoohyar, S. (2018) Ride-Sharing Platforms from Drivers’ Perspective: Evidence from Uber and Lyft Drivers. International Journal of Data and Network Science, 2, 89-98.
 Nair, G.S., Bhat, C.R., Batur, I., Pendyala, R.M. and Lam, W.H.K. (2020) A Model of Deadheading Trips and Pick-Up Locations for Ride-Hailing Service Vehicles. Transportation Research Part A: Policy and Practice, 135, 289-308.
 Chaudhari, H.A., Byers, J.W. and Terzi, E. (2018) Putting Data in the Driver’s Seat: Optimizing Earnings for On-Demand Ride-Hailing. Eleventh ACM International Conference on Web Search and Data Mining, New York, 5-9 February 2018, 9 p.
 Merlin, L.A. (2019) Transportation Sustainability Follows from More People in Fewer Vehicles, Not Necessarily Automation. Journal of the American Planning Association, 85, 501-510.
 Wan, X., Ghazzai, H. and Massoud, Y. (2020) A Generic Data-Driven Recommendation System for Large-Scale Regular and Ride-Hailing Taxi Services. Electronics, 9, 648.
 Cici, B., Markopoulou, A., Laoutaris, F.-M. and Nikolaos, E.A. (2017) Assessing the Potential of Ride-Sharing Using Mobile and Social Data: A Tale of Four Cities.
 Roor, R., Karg, M., Liao, A., Lei, W. and Kirsch, A. (2017) Predictive Ridesharing Based on Personal Mobility Patterns. Intelligent Vehicles Symposium (IV), Los Angeles, 11-14 June 2017.
 Dia, H. and Javanshour, F. (2017) Autonomous Shared Mobility-on-Demand: Melbourne Pilot Simulation Study. Transportation Research Procedia, 22, 285-292.
 Sonet, K.M.H., Rahman, M.M., Mehedy, S.R. and Rahman, R.M. (2019) A Dynamic Ridesharing and Carpooling Solution Using Advanced Optimised Algorithm. International Journal of Knowledge Engineering and Data Mining, 6, 1-31.
 Jiang, W., Dominguez, C.R., Zhang, P., Zhang, S., et al. (2018) Large-Scale Nationwide Ridesharing System: A Case Study of Chunyun. International Journal of Transportation Science and Technology, 7, 45-59.
 Zhao, M., Yin, J., An, S., Wang, J. and Feng, D. (2018) Ridesharing Problem with Flexible Pickup and Delivery Locations for App-Based Transportation Service: Mathematical Modeling and Decomposition Methods. Journal of Advanced Transportation, 2018, Article ID: 6430950.
 Tachet, R., Sagarra, O., Santi, P., Resta, G., Szell, M., Strogatz, S.H. and Ratti, C. (2017) Scaling Law of Urban Ride Sharing. Scientific Reports, 7, Article No. 42868.
 Zhang, Y. and Zhang, Y. (2018) Exploring the Relationship between Ridesharing and Public Transit Use in the United States. International Journal of Environmental Research and Public Health, 15, 1763.
 Olszewski, R., Palka, P. and Turek, A. (2018) Solving “Smart City” Transport Problems by Designing Carpooling Gamification Schemes with Multi-Agent Systems: The Case of the So-Called “Mordor of Warsaw”. MDPI Sensors, 18, 141.
 Alabbasi, A., Ghosh, A. and Aggarwal, V. (2019) DeepPool: Distributed Model-Free Algorithm for Ride-Sharing Using Deep Reinforcement Learning. IEEE Transactions on Intelligent Transportation Systems, 20, 4714-4727.
 Wang, C., Hou, Y. and Barth, M. (2019) Data-Driven Multi-Step Demand Prediction for Ride-Hailing Services Using Convolutional Neural Network. Computer Vision Conference (CVC), Las Vegas, 25-26 April 2019.
 Niu, K., Wang, C., Zhou, X. and Zhou, T. (2019) Predicting Ride-Hailing Service Demand via RPA-LSTM. IEEE Transactions on Vehicular Technology, 68, 4213-4222.