U.S./Mexico Joint Working Committee on Transportation Planning
A Report to the Arizona Department of Transportation
Forecast and Capacity Planning for Nogales' Ports of Entry
Chapter 6 - Model Alternatives
In this section, we test different types of models on the historical data to find the best alternative for forecasting. Generally, the models can be categorized into two types, regression based and time series based models. We begin with a brief introduction of each type of model, and then we use the commercial vehicle model as an example to describe the way we selected the models. Following this section, we present the models we built and the resulting forecasts. As with the baseline analysis section, an appendix to this section provides a more detailed review of the related technical issues.
6.1 Regression models
Univariate regression model
The univariate linear regression model, the simplest type of regression model, takes time as its only regressor (regression variable). Its basic form is shown in equation (6.1.1) (Montgomery, Peck, and Vining 2006a). In this equation, y is the target traffic, t is the time, and ε is the irregular fluctuation around the trend, which is usually assumed to follow a normal distribution.
$$y = \beta_0 + \beta_1 t + \varepsilon \qquad (6.1.1)$$
The βs are the coefficients we need to estimate. This was the first type of model we applied; we explain later in this report why it was not the best choice for our forecasts.
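As a minimal sketch of how the coefficients in equation (6.1.1) can be estimated by ordinary least squares, the following uses the closed-form solution for a single regressor. The function names and the yearly totals are hypothetical and serve only as an illustration; they are not the report's data or code.

```python
def fit_linear_trend(y):
    """Least-squares fit of y = b0 + b1 * t, with t = 0, 1, ..., n - 1."""
    n = len(y)
    t = list(range(n))
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    b1 = sxy / sxx               # slope: estimated change per time step
    b0 = y_mean - b1 * t_mean    # intercept: estimated level at t = 0
    return b0, b1


def forecast_trend(b0, b1, n_obs, horizon):
    """Extend the fitted line for `horizon` steps past the last observation."""
    return [b0 + b1 * t for t in range(n_obs, n_obs + horizon)]


# Hypothetical yearly totals, for illustration only (not report data).
yearly = [100.0, 104.0, 108.0, 112.0, 116.0]
b0, b1 = fit_linear_trend(yearly)
next_two = forecast_trend(b0, b1, len(yearly), 2)
```

Because the only regressor is time, a forecast from this model is simply the fitted line extended forward, so it cannot respond to changes in economic conditions.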
Multivariate regression model
The second model we tested was the multivariate model, because the available literature indicates that border crossing traffic may be influenced by several exogenous variables, such as the GDP of the countries that share a common border. Unlike the univariate regression model, this type of model takes exogenous variables into consideration. The model has the form shown in equation (6.1.2) (Montgomery, Peck, and Vining 2006b), where y represents the target traffic and $x_i$, $i = 1, 2, \ldots, k$, are the exogenous variables. In our study, each economic index is an exogenous variable.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \qquad (6.1.2)$$
Based on previous research results and the conditions of the Nogales POEs, we identified a list of candidate exogenous variables as shown in Table 6-1. However, since there were only 14 years of available data, a limited number of variables could be used in the regression model. Thus, a variable selection procedure was used to identify the "best" variables to include in the model.
Two tier regression model for the truck traffic
We noticed from our baseline analysis that the truck traffic has a stable cyclic pattern. This cyclic pattern prevents us from using the multiple regression models directly; however, because the pattern is stable, we can build a two tier regression model. In the two tier model, we first built a regression model on the yearly data and then split the yearly result into months according to monthly percentages. In contrast, for POV and pedestrian traffic, we built the model directly on the original monthly data, as they had no obvious seasonality. Note also that all of the regression models used are linear models.
Figure 6-1 is the box plot of the truck crossings of each month. This plot reveals some useful information about the truck data:
Figure 6-1 Box plot of truck crossings by month of the year
Mathematically, the two tier model works as follows. Suppose we have N years of monthly data available. Let $y_{ij}$ be the number of crossings for month i in year j, and let $Y = \sum_{i=1}^{12} \sum_{j=1}^{N} y_{ij}$ be the total number of crossings in the data set. Then the portion corresponding to month i can be calculated as $p_i = \sum_{j=1}^{N} y_{ij} / Y$. Therefore, once the number of crossings for year j, namely $y_j$, has been calculated, the estimated number of crossings for month i in year j is $\hat{y}_{ij} = y_j \times p_i$. Note that all the $p_i$ are calculated from the training data set, that is, the data set we used to build the model. When applying the method to new data, we still use the $p_i$ calculated from the training data set.
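The second tier described above can be sketched in a few lines: compute each month's share of all crossings in the training data, then multiply a yearly figure by those shares. The function names and the counts below are hypothetical placeholders, not report data.

```python
def monthly_shares(training):
    """training maps year -> list of 12 monthly counts.

    Returns 12 shares: each month's fraction of all crossings in the training set.
    """
    total = sum(sum(months) for months in training.values())
    return [sum(months[i] for months in training.values()) / total
            for i in range(12)]


def split_year(yearly_total, shares):
    """Allocate a yearly total to months in proportion to the training shares."""
    return [yearly_total * p for p in shares]


# Hypothetical training data, for illustration only.
training = {
    2000: [10, 10, 10, 5, 5, 5, 5, 5, 5, 10, 15, 15],
    2001: [20, 20, 20, 10, 10, 10, 10, 10, 10, 20, 30, 30],
}
shares = monthly_shares(training)
# 400.0 stands in for a yearly value produced by the first-tier regression.
estimate = split_year(400.0, shares)
```

The shares sum to one by construction, so the monthly estimates always add back up to the yearly figure they were split from.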
Given the small size of the variable pool and the limited number of data points, we used an exhaustive method for variable selection. With this method, we enumerated all possible combinations of up to 5 variables and built the corresponding regression models. The resulting models were then evaluated using several criteria:
For our variable selection, we applied the above criteria to both the training data set and the validation data set. When there was a tie, we chose the model with fewer variables.
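A minimal sketch of an exhaustive search of this kind follows. It assumes validation R square as the only ranking criterion and treats R square values that agree to six decimals as tied; both are simplifications made for the example, since the report applies several criteria. All names and data are hypothetical.

```python
from itertools import combinations


def solve(A, c):
    """Solve A b = c by Gaussian elimination with partial pivoting."""
    k = len(c)
    M = [row[:] + [ci] for row, ci in zip(A, c)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, k):
            f = M[r][col] / M[col][col]
            for j in range(col, k + 1):
                M[r][j] -= f * M[col][j]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (M[r][k] - sum(M[r][j] * b[j] for j in range(r + 1, k))) / M[r][r]
    return b


def fit_ols(columns, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    cols = [[1.0] * len(y)] + columns
    A = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    c = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    return solve(A, c)


def predict(beta, columns):
    n = len(columns[0])
    return [beta[0] + sum(b * col[i] for b, col in zip(beta[1:], columns))
            for i in range(n)]


def r_squared(actual, fitted):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def exhaustive_selection(train_x, train_y, valid_x, valid_y, max_vars=5):
    """Fit every subset of up to max_vars variables and rank by validation R^2."""
    results = []
    names = sorted(train_x)
    for k in range(1, min(max_vars, len(names)) + 1):
        for subset in combinations(names, k):
            beta = fit_ols([train_x[v] for v in subset], train_y)
            fitted = predict(beta, [valid_x[v] for v in subset])
            results.append((r_squared(valid_y, fitted), subset))
    # Best validation R^2 first; on a tie, prefer the model with fewer variables.
    results.sort(key=lambda r: (-round(r[0], 6), len(r[1])))
    return results


# Hypothetical data: y depends only on a and b (y = 3 + 2a - b).
train_x = {"a": [1, 2, 3, 4, 5, 6, 7, 8],
           "b": [2, 1, 4, 3, 6, 5, 8, 7],
           "c": [5, 3, 8, 1, 9, 2, 7, 4]}
train_y = [3, 6, 5, 8, 7, 10, 9, 12]
valid_x = {"a": [9, 10, 11], "b": [10, 9, 12], "c": [6, 2, 5]}
valid_y = [11, 14, 13]
ranking = exhaustive_selection(train_x, train_y, valid_x, valid_y)
```

With three candidate variables the search fits seven models. The number of subsets grows quickly with the size of the pool, which is why an exhaustive search is only practical for a small pool such as the one used here.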
6.2 Time series model
Another type of model commonly used in previous studies was the time series model. In particular, we considered the ARIMA (autoregressive integrated moving average) model (Farnum and Staton 1989; Shumway and Stoffer 2006a). To build a credible time series model, we needed to explore the characteristics of the data further. For example, the first question we needed to address was whether to use a regular model or a seasonal model.
The ACF (autocorrelation function) and PACF (partial autocorrelation function) act as tools for determining the appropriate type of time series model as well as the structure of the model. Figure 6-2 depicts the ACF and PACF of the truck data. These functions allow us to identify seasonal and other patterns in the data. Note that the unit of the lag is one year, so a lag of 0.5 means 6 months. The ACF at lag 0.5 has a negative value near -1, while the value at lag 1.0 is near 1, which confirms the need to use a seasonal ARIMA model to forecast border crossings.
Figure 6-2 ACF and PACF plot of the Truck data
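To illustrate why a strong yearly cycle produces this ACF pattern, the sketch below computes the sample ACF of a synthetic monthly series with a 12-month cycle. The series is artificial and the function name is hypothetical; lags here are counted in months, so lag 6 corresponds to 0.5 and lag 12 to 1.0 in the figure's units.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag (lag measured in observations)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]


# Synthetic monthly series with a pure 12-month cycle (not report data).
series = [100 + 20 * math.sin(2 * math.pi * m / 12) for m in range(120)]
acf = sample_acf(series, 12)
```

For this series the autocorrelation is strongly negative at a lag of 6 months and strongly positive at a lag of 12 months, which is the same signature described for the truck data.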
Univariate time series model
We mainly considered the ARIMA model and the Holt-Winters model for the univariate time series models. The ARIMA model was introduced above. The Holt-Winters model is a more specific time series model that can handle both trend and seasonality in the data simultaneously. Because of the strong seasonality in the truck traffic, we first tried the additive Holt-Winters model on the truck traffic data. For the POV and pedestrian flows, we used the non seasonal ARIMA model. Note that the Holt-Winters model can be converted to a corresponding ARIMA model. The details of these two models are explained in the appendix of statistical details.
The Holt-Winters model decomposes the target data into three parts: level, which is the non seasonal mean of the data; trend, which is the slope of the fitted line through the data points; and an index of seasonality. Mathematically, it can be written as:
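For reference, one common textbook statement of the additive Holt-Winters recursions is sketched below. The notation is assumed for this sketch and may differ from the report's appendix: $\ell_t$ is the level, $b_t$ the trend, $s_t$ the seasonal index, $L$ the seasonal period, and $\alpha$, $\beta^{*}$, $\gamma$ are smoothing constants between 0 and 1 (written $\beta^{*}$ to avoid confusion with the regression coefficients in section 6.1).

```latex
\begin{aligned}
\ell_t &= \alpha\,(y_t - s_{t-L}) + (1-\alpha)\,(\ell_{t-1} + b_{t-1}) \\
b_t    &= \beta^{*}\,(\ell_t - \ell_{t-1}) + (1-\beta^{*})\,b_{t-1} \\
s_t    &= \gamma\,(y_t - \ell_t) + (1-\gamma)\,s_{t-L} \\
\hat{y}_{t+h} &= \ell_t + h\,b_t + s_{t+h-L}, \qquad 1 \le h \le L
\end{aligned}
```

Each line updates one of the three components from the newest observation and the previous estimates, and the last line combines them into an h-step-ahead forecast.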
An ARIMA model is usually written as ARIMA(p,d,q), where p is the AR (autoregressive) order, d is the degree of differencing, and q is the MA (moving average) order. When applying the ARIMA model, it is important to decide the structure of the model first. The PACF and ACF act as tools for determining that structure. Because several candidate models may work equally well, it is preferable to draw up a list of reasonable ARIMA models and then select from this list. Therefore, instead of deciding (p,d,q) directly from the ACF and PACF, we defined ranges for (p,d,q) and tested all possible combinations of the parameters within those ranges. We used Theil's U statistic, which is a measure of the similarity between two time series, as a criterion for model selection. R square was not used because, when a data set contains nonlinearities, a large R square does not necessarily imply a good model. We used the same method to find the structural parameters in our multivariate time series models.
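The search over parameter ranges and the ranking by Theil's U can be sketched as follows. The report does not state which variant of Theil's U was used; this example assumes the form that is bounded between 0 and 1, where 0 means the two series are identical. Fitting the ARIMA models themselves is outside the scope of the sketch, so the forecasts are passed in as plain lists, and all names and numbers are hypothetical.

```python
import math
from itertools import product


def theil_u(actual, forecast):
    """Theil's U in its bounded form (assumed variant): 0 = identical series."""
    n = len(actual)
    rmse = math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / n)
    scale = (math.sqrt(sum(a * a for a in actual) / n)
             + math.sqrt(sum(f * f for f in forecast) / n))
    return rmse / scale


def candidate_orders(p_max, d_max, q_max):
    """Every (p, d, q) combination within the given inclusive ranges."""
    return list(product(range(p_max + 1), range(d_max + 1), range(q_max + 1)))


def rank_candidates(actual, forecasts_by_order):
    """Sort candidate models by Theil's U, lowest (best) first."""
    return sorted((theil_u(actual, fc), order)
                  for order, fc in forecasts_by_order.items())


# Placeholder values standing in for real data and fitted-model output.
actual = [3.0, 4.0, 5.0]
forecasts_by_order = {(1, 1, 0): [3.0, 4.0, 5.0], (0, 1, 1): [6.0, 8.0, 10.0]}
ranked = rank_candidates(actual, forecasts_by_order)
grid = candidate_orders(2, 1, 2)
```

Producing a ranked list rather than a single winner matches the approach described above, where the final choice is made from a short list of reasonable candidates.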
Multivariate time series models were another type of model chosen to forecast border crossings. To build this kind of model, we introduced exogenous variables into the model rather than only taking the data itself into consideration. We referred to the previous studies we reviewed to decide what exogenous variables should be incorporated in the model. We also referred to the variables selected in the multivariate regression model, and field knowledge.
A seasonal ARIMA model has seven structural parameters to determine (Shumway and Stoffer 2006b), which are shown in Table 6-2. A model with those parameters is usually reported as ARIMA$(p,d,q)(P,D,Q)_L$.
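Of the seven parameters, d and D control differencing, which is the easiest part to illustrate. The sketch below applies D seasonal differences at period L and then d regular differences to a synthetic series; the function names and the series are hypothetical.

```python
def difference(series, lag=1):
    """Lagged difference: series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def apply_differencing(series, d=0, D=0, period=12):
    """Apply D seasonal differences at the given period, then d regular ones."""
    out = list(series)
    for _ in range(D):
        out = difference(out, period)
    for _ in range(d):
        out = difference(out, 1)
    return out


# Synthetic series: linear trend plus a fixed 12-month pattern (not report data).
pattern = [5, 3, 0, -2, -4, -5, -3, 0, 2, 4, 5, 3]
series = [2 * t + pattern[t % 12] for t in range(36)]
stationary = apply_differencing(series, d=1, D=1, period=12)
```

The seasonal difference removes the repeating monthly pattern and the regular difference removes the remaining trend, leaving a series that the AR and MA terms can then model.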
We used the same method as we used in the univariate ARIMA model building to get a list of good model candidates, and then selected models from this candidate list.
6.3 Comparison of the models
Before applying the models to generate forecasts, we first tested the performance of the models on our data. We split the historical data into two subsets, a training set and a validation set. As described in the Historical Data section, we have data available from January 1995 to December 2008. We designated the last three years' data as the validation set, and used the rest as the training set. We use the truck data to illustrate the procedure we used to compare the models:
We defined some criteria for model selection; however, we did not always select the model that scored best on those criteria. There are many reasons for doing this:
We used the variable selection procedure described in section 6.1. Table 6-3 shows the best R square values we obtained using different numbers of regressors (independent regression variables). These values were obtained by applying the resulting forecast models to the validation data. As the table shows, there was a significant increase in the R square value when using two regressors as opposed to just one.
However, when the number of regressors was greater than 2 there was not much benefit in terms of the gain in R square value. In addition, some of the variables were highly correlated, which could cause multicollinearity issues, resulting in a forecast model that was unstable. In order to minimize multicollinearity issues we used the Variance Inflation Factor (VIF) metric to choose variables that were not highly correlated.
Note: AZpop: Arizona Population; AZemp: Arizona Employment; Xrate: Exchange rate; RXrate: real Exchange Rate; sonpop: Sonora Population; IIP: Index of Industrial Production; MX: Mexico
Table 6-4 shows part of the variable selection process for the trucks. Column 1 is the model we used, the variable to the left side of "~" is the target mode of traffic, and the variables to the right side of "~" are the variables in the model. Column 2 is the R square value on the training data, and Column 3 is the R square value on the validation data. All the columns after Column 3 are VIF values. The items were sorted according to the Validation R Square values in descending order.
Models with VIF values greater than 10 were excluded from this table, as such values indicate multicollinearity issues (Montgomery, Peck, and Vining 2006c). If there was only one regressor, no VIF value was provided; for two regressors, the VIF values of the two regressors are identical; and for three or more regressors, each regressor has its own VIF. This table shows that, of the models analyzed, a regression model using the Index of Industrial Production for the US (US IIP) and the exchange rate between the US dollar and the Mexican peso would give the best results for forecasting truck border crossings.
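The VIF of regressor j is $1/(1 - R_j^2)$, where $R_j^2$ is the R square obtained by regressing that regressor on all the others. With exactly two regressors, $R_j^2$ is the squared correlation between them, which is why both receive the same VIF. A minimal sketch of that two-regressor case follows, using hypothetical names and placeholder values.

```python
def correlation(x, z):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, mz = sum(x) / n, sum(z) / n
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    return sxz / (sxx * szz) ** 0.5


def vif_two_regressors(x, z):
    """VIF shared by both regressors in a two-regressor model: 1 / (1 - r^2)."""
    r = correlation(x, z)
    return 1.0 / (1.0 - r * r)


# Placeholder regressor values (not report data); their correlation is 0.8.
x = [1.0, 2.0, 3.0, 4.0]
z = [1.0, 3.0, 2.0, 4.0]
vif = vif_two_regressors(x, z)
keep_model = vif <= 10  # the exclusion threshold applied in Table 6-4
```

Under the threshold of 10, a pair of regressors is rejected only when their correlation exceeds roughly 0.95 in absolute value.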
We first decided the structural parameters of the multivariate ARIMA model. Table 6-5 lists the results of some of the ARIMA models tested, sorted by Theil's U statistic, which was obtained by applying the time series model to the training data set (a lower U indicates a better fit). In Table 6-5, we filtered out the models whose residuals violate the normality assumption, as these could produce misleading forecasts. Care is needed when choosing the parameters. All of the parameter sets listed in Table 6-5 were generally good candidates. When selecting among them, the experience of the modeler, the plots of the fitted values and the residuals, and the reasonableness of the model all play important roles. Here we chose the model with $(p,d,q)(P,D,Q)_L = (1,1,4)(2,1,2)_{12}$, which is No. 4 in Table 6-5, to compare with the other types of models. The subscript 12 means that we use a seasonal ARIMA model with a seasonal period of 12 months.
The comparison result
We were mostly concerned with the ability of the models to forecast future traffic crossings. Therefore, we used the models built on the training data set to forecast three years ahead and compared the forecasted values with the real data in the validation data set. Table 6-6 shows the comparison among the multivariate regression model, the Holt-Winters method, and the multivariate time series model. The multivariate time series model (ARIMA) outperforms the other two methods in terms of both R square and Theil's U statistic.
Figure 6-3 is a graph of the three model forecasts. The graph shows that all the models fit the real data well at the beginning. Later in the validation period, however, the regression model tended to underestimate and the Holt-Winters method tended to overestimate. Based on this example, we preferred to use the multivariate ARIMA model in our forecast.
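The hold-out comparison can be sketched as follows: label the months from January 1995 to December 2008, hold back the last three years, and score a forecast against the held-back values. The series and the stand-in forecast are placeholders, and the function names are hypothetical; in the study the forecasts came from the fitted regression, Holt-Winters, and ARIMA models.

```python
def month_labels(first_year, last_year):
    """(year, month) labels from January first_year to December last_year."""
    return [(y, m) for y in range(first_year, last_year + 1)
            for m in range(1, 13)]


def holdout_split(labels, values, validation_years=3):
    """Hold back the last validation_years * 12 months as the validation set."""
    cut = len(values) - 12 * validation_years
    return (labels[:cut], values[:cut]), (labels[cut:], values[cut:])


def r_squared(actual, forecast):
    mean = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


labels = month_labels(1995, 2008)
values = [1000.0 + 5.0 * i for i in range(len(labels))]  # placeholder series
(train_labels, train_values), (valid_labels, valid_values) = holdout_split(labels, values)

# Stand-in forecast: repeat the training set's last month-to-month change.
step = train_values[-1] - train_values[-2]
forecast = [train_values[-1] + step * (h + 1) for h in range(len(valid_values))]
score = r_squared(valid_values, forecast)
```

Because the models never see the validation months during fitting, the score reflects forecasting ability rather than goodness of fit.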
Figure 6-3 Forecasts vs. Actual
6.4 Model alternatives for other modes
In section 6.3 above, we used the truck data as an example to compare the model alternatives, and the results show that the ARIMA model outperforms the other models. Therefore, our first choice was to use the ARIMA model on the other modes of traffic. Since the Holt-Winters method can be converted into an ARIMA model, we only considered ARIMA models on the other data sets. Before making this choice, we applied the same variable selection procedure to the other modes of traffic. This testing did not identify any variables that improved the quality of the models in the way that the US IIP and the exchange rate did for the truck data. Thus, we mainly relied on the ARIMA model with no exogenous variables to forecast the other traffic modes.
In particular we note that the bus and train modes of traffic run on a relatively stable schedule, which did not typically change in response to economic variations as do the other modes of traffic.
We first generated the ACF and PACF of the POV data, shown in Figure 6-4. The ACF tails off and the PACF cuts off after lag 3. Although we did not see any spike after lag 1, we chose to use a seasonal ARIMA model, because we believed the POV traffic also had some patterns that repeated from year to year. Table 6-7 lists some ARIMA models we tested on the POV historical data. We held out the last three years' data for validation, as we did for the truck data. The last three columns are the R square value on the validation set, Theil's U statistic on the training set, and Theil's U statistic on the validation set, respectively. The table is ordered by Theil's U statistic on the validation set in ascending order. As with the truck data, the choice of model was not based solely on the order in which the parameters are listed in this table. Other factors were also considered, such as the validation plots of the models.
We picked the model with structural parameters (p,d,q)(P,D,Q)L, and plotted the forecasted result against the real data in Figure 6-5. The fitted data seemed to overestimate the traffic. However, we also noticed that the real data had several fluctuations and that the model was only able to capture the main trend, not these fluctuations. If we tuned the parameters to follow this fluctuating pattern, we might end up overfitting the model and thus generating an extremely implausible forecast. Instead, it may be appropriate to estimate a fixed correction, depending on discussions with subject experts.
Figure 6-4 ACF and PACF of the POV data
Figure 6-5 Plot of the fitted data to the real data on validation set (POV)
As we stated at the beginning of the section, we could not find an exogenous variable that would allow us to build a reasonable regression model for the pedestrian data. However, since we believe the majority of the people crossing the border by foot are locals, we thought that employment in Arizona might influence this crossing. Therefore, we incorporated Arizona employment into our time series model. We show the ACF and PACF of the data as in Figure 6-6. The ACF tails off, while the PACF dies off after 4 steps.
Figure 6-6 ACF and PACF of the Pedestrian data
Similarly, we have a list of relatively good models, which are listed in Table 6-8. We chose the model with structural parameters to see how it performed on the forecast. Figure 6-7 plots the fitted data against the real data in the validation set. We can see before the middle of 2008, the fitted values follow the real data relatively well. However, a big drop occurred in late 2008, which was not captured by the model.
Figure 6-7 Plot of the fitted data to the real data on validation set (Pedestrian)
We described the historical data of the bus crossings in section 5.2 Historical Data. The graph of historical bus traffic and bus passengers crossing the border is shown again as Figure 6-8 for review. Note that the bus traffic started to increase by the end of 1997 and then increased faster in 1998. The amount of bus traffic jumped significantly in the middle of 1999. According to a fact sheet from USDOT (U.S. DOT 2002), "the NAFTA timetable also called for the United States and Mexico to lift all restrictions on regular route, scheduled cross-border bus service by January 1, 1997." We believe this jump was associated with the implementation of NAFTA. Therefore, we decided to use only the data from after NAFTA had been implemented and its impact had stabilized. For convenience, we used crossings since January 2000.
Comparing the bus traffic and the bus passenger data, we found that there was a slight difference between the patterns of these two data sets. For example, the bus passenger data did not show any decrease in its general trend between 2000 and 2008, while the bus traffic started to decrease after 2000, and then began increasing in 2005. We decided to build the model based on bus passenger data rather than the number of buses crossing the border. First, there are many companies involved in the bus operations. There are always new companies joining in and other companies leaving this business. This makes the number of bus crossings more difficult to predict. Secondly, bus capacities may not be fully utilized. If this is the case, predicting the number of buses will not reflect the number of passengers crossing the border.
Figure 6-9 depicts the ACF and PACF of the bus passenger data. Note that there was a spike at lag 1, which is 1 year. This spike indicated that there was some autocorrelation at an interval of 12 months. However, when examining the bus passenger data, we could not find a stable seasonality effect such as we found in the truck data. Thus, the two tier regression model we used for the truck data was not viable here. Instead, we decided to use the time series model. However, as with the POV and pedestrian data, we did not think the time series model was capable of giving a good extended forecast; therefore, a regression model based on the yearly bus passengers was also built to produce the extended forecast.
We tested different models to find a relatively good time series model for the bus data. We used the ARIMA model with $(p,d,q)(P,D,Q)_L = (9,0,7)(1,0,0)_{12}$. In this case, the training data ran from January 2000 to December 2005, and the validation data ran from January 2006 to December 2008. The data point for February 2003 was abnormally high, which prevented us from finding a good model, so we replaced it with the average of the January 2003 and March 2003 values. Figure 6-10 shows the fitted values against the real values on the validation data. Because of the variability in the data, the model was unable to follow each fluctuation in the real data, but the general trend does not deviate. Theil's U statistic is 0.091 on the validation set, which is high compared to those of the other modes. However, this was a relatively good result among the models we tested.
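The outlier treatment described above amounts to replacing one point with the mean of its two neighbors. A minimal sketch follows, with a placeholder series in which only the position of the spike mirrors the report; the function names are hypothetical.

```python
def month_index(year, month, start_year=2000, start_month=1):
    """Position of (year, month) in a monthly series that starts at the given month."""
    return (year - start_year) * 12 + (month - start_month)


def replace_with_neighbor_mean(series, index):
    """Return a copy with series[index] replaced by the mean of its two neighbors."""
    cleaned = list(series)
    cleaned[index] = (series[index - 1] + series[index + 1]) / 2
    return cleaned


# Placeholder monthly series for January 2000 to December 2008 (not report data).
passengers = [100.0] * 108
feb_2003 = month_index(2003, 2)
passengers[feb_2003] = 500.0  # artificial spike standing in for the outlier
cleaned = replace_with_neighbor_mean(passengers, feb_2003)
```

The function returns a copy, so the original series remains available for comparison with the cleaned one.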
Figure 6-8 Historical data of bus crossings and bus passengers
Figure 6-9 ACF and PACF of the bus passengers
Figure 6-10 Plot of the fitted data to the real data on validation set (Bus Passenger)
Besides following a relatively stable schedule, train traffic was also highly dependent on the availability of equipment and underlying customer demand. Recall the historical data of the train crossings presented in section 5.2 Historical Data. We plot the graph again here in Figure 6-11. There were three huge spikes during the last 14 years. During 2003 and 2004, traffic was significantly lower than in other years. Aside from these two instances, the railway traffic was relatively stable, though some fluctuations existed. Realistic projections of rail traffic will depend critically on Union Pacific's assessment of customer demand and on other external factors such as the success of Punta Colonet, rerouting away from the center of Nogales, and expansion of the port of Guaymas.
Figure 6-11 Historical data of the number of trains crossing the border
1 Refer to the R square section of the appendix of statistical details for further explanation.