Introduction
The first publication on forecasting models for cancellations and no-shows dates from 1958, when Beckmann and Bobkoski applied three different distributions to total passenger arrivals, under assumptions about demand arrival that may no longer be valid. In the almost sixty years since that publication, a lot has changed in the airline industry. Nowadays, Passenger Name Record-based cancellation and no-show forecasting is commonly viewed in the airline industry as one of the best methods available. These methods are part of revenue management. The objective of revenue management is to maximize profits; however, airline short-term costs are largely fixed and variable costs per passenger are small, so in most situations it is sufficient to seek booking policies that maximize revenues (McGill & Van Ryzin, 1999). A revenue management system must take into account the possibility that a booking may be cancelled, or that a booked customer may fail to show up at the time of service (no-show), which is a special case of cancellation that happens at the time of service (Morales & Wang, 2008). Iliescu, Garrow and Parker (2006) studied airline passenger cancellation behaviour and stated that leisure passengers, who are more likely to book further in advance of flight departure, are less likely to cancel than business passengers. However, as the flight nears departure, both leisure and business travellers become more likely to refund and exchange their tickets. Their study also points out that cancellation proportions of 30% or more are not uncommon today. Cancellation forecasting is one important aspect of revenue management. Accurate forecasts of the expected number of cancellations for each flight can increase airline revenue by reducing the number of spoiled seats (empty seats that might otherwise have been sold) and the number of involuntary denied boardings at the departure gate (Lawrence, Hong, & Cherrier, 2003). Another important aspect of revenue management in airlines is overbooking. Overbooking aims to increase revenue by choosing the number of seats offered for sale (the virtual capacity) such that the chance of the aircraft's physical capacity being occupied at departure is maximized (Talluri & van Ryzin, 2004). Overbooking levels are based on a cancellation forecast combined with service criteria that keep the risk of too many passengers showing up very small. It is therefore critical to have accurate cancellation forecasts.
The task of forecasting the probability of cancellation of a single booking can be modelled as a two-class probability estimation problem with the two classes being “cancelled” and
“not cancelled” (Morales & Wang, Cancellation forecasting Using Support Vector Machine with Discretization, 2008). There are different classification techniques one can use to solve
this estimation problem such as Decision Trees (DT), Logistic Regression (LR), Stochastic Gradient Descent (SGD), Support Vector Machines (SVM), Naive Bayes (NB) and Random
Forests (RF). Morales and Wang (2008) showed that decision trees perform better than logistic regression, owing to practical limitations of logistic regression. Furthermore, a study by Wang on airline data showed that dynamic decision trees outperform logistic regression in terms of runtime and forecast accuracy (Morales & Wang, Identify Critical Values for Supervised Learning Involving Transactional Data, 2009). With the latest developments in machine learning techniques, however, it may now be possible for logistic regression to outperform the dynamic decision tree, since modern techniques are able to handle large numbers of dummy variables. To forecast cancellation rates, KLM uses dynamic decision trees. In practice these trees raise several questions: how dynamic or static should the tree be, what are the best thresholds for creating a node, which attributes should be considered in the dynamic decision tree model, and what is the best pruning method for decision trees? The objective of this report is to investigate whether a decision tree is the best model to predict cancellations. This is examined by comparing the decision tree model with the five other classification models mentioned earlier to see which one performs best. The rest of this report is structured as follows. The second chapter describes the cancellation forecasting problem, gives detailed background on previous research on this topic and introduces some basic terminology used throughout this paper. The third chapter describes the real-world dataset used for this research and explains the attributes. Chapter 4 explains the five different methods commonly used for binary classification; in addition, the techniques used to evaluate and compare them are discussed. Chapter 5 discusses the main results, followed by the conclusion, discussion and possibilities for future research in the sixth and last chapter.
Theoretical Background
To optimize the expected revenue of an airline company, it is essential to have an accurate passenger cancellation forecast. With this forecast, the risk of unnecessary empty seats on a flight can be reduced by overbooking. Overbooking means that the number of seats available for sale is higher than the physical capacity of the airplane. An optimized overbooking rate reduces expenses due to denied boardings and reduces revenue loss due to seats that remain unsold although there is demand for them (Hueglin & Vanotti, 2001). In this chapter, we discuss papers on different cancellation forecasting models proposed in the literature and describe some basic definitions used in this report.
2.1 Existing Forecasting Models
Most of the proposed forecasting models in the literature focus on the no-show case. However, these models can also be used to forecast cancellation rates. Conventional forecasting
methods predict the number of cancellations using time-series methods such as taking the seasonally-weighted moving average of cancellations for previous instances of the same flight leg (Lawrence, Hong, & Cherrier, 2003). Time series forecasting looks at sequences of data points, trying to identify patterns and regularities in their behaviour that might also apply to future values (Lemke & Gabrys, 2008). Weatherford, Gentry and Wilamowski (2002) compared traditional forecasting methods such as moving averages, exponential smoothing
and regression with the neural network method. Neural networks represent a promising generation of intelligent machines that are capable of processing large and complex forms of information (Weatherford, Gentry, & Wilamowski, 2002). Weatherford, Gentry and Wilamowski (2002) concluded that the most basic neural network can outperform the traditional forecasting methods. Lawrence et al. (2003) used two
different passenger-based forecast models to predict no-show rates based on the Passenger Name Record (PNR) and implemented these models by using different classification methods such as Naive Bayes, Adjusted Probability Model
(APM), which is an extension of Naive Bayes, ProbE (based on tree-algorithms) and C4.5 (an algorithm for making decision trees). They have shown that ”models incorporating specific information on individual passengers can produce more accurate predictions of noshow rates than conventional, historical-based, statistical methods”. Neuling, Riedel and
Kalka (2003) also used C4.5 decision tree based on PNRs. Hueglin and Vanotti (2001) used classification trees and logistic regression models to predict the cancellation probability of
passengers. They concluded that ”the accuracy of no-show forecasts can be improved when individual passenger information extracted from passenger name records (PNRs) is used as input”. The three publications mentioned above conclude that making use of PNR data improves forecasting performance. The PNR data mining approach models cancellation rate forecasting as a two-class probability estimation problem (Morales & Wang, Forecasting
Cancellation Rates for Services Booking Revenue Management Using Data Mining, 2009). Popular two-class probability estimation methods are tree-based methods and kernel-based methods. Probability estimation trees estimate the probability of class membership, in our case the probability that a booking will be cancelled or not. Quinlan (1993) developed
an algorithm, C4.5, that generates decision trees. The trees produced by C4.5 are small and accurate, resulting in fast reliable classifiers and therefore decision trees are valuable
and popular methods for classification. In contrast, Provost and Domingos (2003) concluded that the performance of conventional decision-tree learning programs is poor, and they therefore made some modifications to the C4.5 algorithm, resulting in C4.4. C4.4 uses the information gain criterion to split the tree nodes and applies no pruning. Fierens, Ramon,
Blockeel and Bruynooghe (2005) concluded that overall the C4.4-approach outperforms the C4.5-approach. However, the trees of the C4.5-approach are much smaller than for the C4.4-
approach. The C4.4 method builds a single tree; random forests, however, can improve the predictive performance of a single tree by aggregating many decision trees. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest (Breiman, 2001). For a large number of trees, it follows from the Strong Law of Large Numbers and the tree structure that random forests always converge, so that overfitting is not a problem (Breiman, 2001). In random forests, the idea is to decorrelate the individual trees and then reduce the variance by averaging them (Random Forests in R, 2017). Averaging the trees reduces the variance, improves performance and helps to avoid overfitting. Kernel-based methods make use of kernel functions which map input data points to a
higher dimensional space, such that a linear method in a new space becomes non-linear in the original space and therefore these methods are able to model non-linear relationships between dependent and independent variables (Morales & Wang, Forecasting Cancellation Rates for Services Booking Revenue Management Using Data Mining, 2009). One of the most popular kernel based methods for class probability estimation is Support Vector Machine (SVM). If we have labelled data, SVM can be used to generate multiple separating hyperplanes such that the data space is divided into segments and each segment contains only one kind of data (Machine Learning Using Support Vector Machines, 2017). SVM is able to find the hyperplane that creates the biggest margin between the training points for class 1 and −1 (Hastie, Tibshirani, & Friedman, 2001).
Caruana and Niculescu-Mizil (2006) evaluated the performance of SVMs, logistic regression, naive Bayes, random forests, decision trees and more supervised learning algorithms
on binary classification problems. Of the five methods mentioned, random forests were the best learning method overall, followed by SVMs. The poorest performing models were logistic regression, naive Bayes and decision trees. However, even the best models sometimes perform poorly, and models with poor average performance occasionally perform exceptionally well (Caruana & Niculescu-Mizil, 2006). Rish (2001) did an empirical study of the NB classifier and concluded that ”despite its unrealistic independence assumption, the naive
Bayes classifier is surprisingly effective in practice since its classification decision may often be correct even if its probability estimates are inaccurate”. In this report, we evaluate the performance of decision tree, logistic regression, support
vector machines, naive Bayes and random forests on a real-world data set to forecast cancellation rates. In the next section, the terminology used throughout this report is given.
2.2 Terminology
In this section, we introduce some basic definitions used throughout this report: a cancellation, a no-show, a show-up and a denied boarding. Passengers are said to cancel when the associated confirmed seat reservations are freed and returned to the inventory for sale. For every passenger, Ck is the random variable associated with the realization of the cancellation indicator, which is equal to 1 if the passenger has cancelled his booking before departure and 0 otherwise. The passenger cancellation probability is denoted as ck = P(Ck = 1). Passengers are said to no-show when their confirmed booking was not cancelled but they do not show up at the departure time. For every passenger, Nk is the random variable associated with the realization of the no-show indicator, which is equal to 1 if the passenger does not cancel but does not show up at the time of boarding, and 0 otherwise. The passenger no-show probability is denoted as nk = P(Nk = 1|Ck = 0). Passengers who neither cancel nor no-show are said to show up. In this case, both the cancellation indicator and the no-show indicator are equal to 0. We denote by Sk the random variable associated with the realization of the show-up indicator. Hence, the passenger show-up probability is
P(Sk = 1) = P(Nk = 0 ∩ Ck = 0) = P(Ck = 0)P(Nk = 0|Ck = 0) = (1 − ck)(1 − nk).
Passengers who show up with a confirmed reservation but cannot be accommodated on the flight are said to be denied boarding.
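For illustration, suppose a booking has a (purely hypothetical) cancellation probability ck = 0.30 and no-show probability nk = 0.05; its show-up probability is then P(Sk = 1) = (1 − 0.30)(1 − 0.05) = 0.665.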
Data
Information about bookings is available in the form of Passenger Name Records (PNRs), which are typically transferred to a PNR database from an airline’s flight reservation system.
A new PNR is generated whenever a customer makes a flight reservation and contains information such as the creation date, the number of passengers, departure date, ticketing status, price class and many other attributes about the booking. Each time a customer contacts the airline in order to change the state of the booking (confirmation, cancellation, etc.), an additional transaction record is written into the PNR and stored in the reservation system. A PNR may include more than one passenger flying the same itinerary. If one of the
passengers in a PNR decides to deviate from the existing itinerary, then the PNR is split. For this passenger, a new PNR is generated. Each PNR is tagged with a label indicating
whether the booking is cancelled or not; 1 for a cancellation and 0 otherwise. When a PNR is cancelled, all passengers in the PNR have cancelled. This label is used as the target variable
for modelling the cancellation probability. The database that is
investigated contains booking records of flights from KLM and
AirFrance. All the booking records with a departure date between 01.10.2016 and 01.10.2017 are taken from the database, which is almost 92 million bookings. The datasets that are
used in this report are random samples of those 92 million bookings. The PNRs of these data samples were created in the time period between 22.05.2015 and 01.10.2017. Table 3.1
summarizes the characteristics of the datasets. For all datasets the mean cancellation rate is calculated as the fraction of cancelled bookings, i.e. the number of bookings with IsCancelled = 1 divided by the total number of bookings in the dataset (equivalently, the mean of the IsCancelled label).
Note that the mean cancellation rate is more than 41% for all datasets. To investigate whether the fitted models on each dataset make the same predictions, another sample is generated from the 92 million bookings, which contains 10561 observations. Predictions will be made on this out-of-sample set. Figure 3.1a shows the number of bookings per month and figure 3.1b the number of bookings per day for all bookings in dataset L with a departure date between 01.10.2016 and
01.10.2017. Most flights are booked in January, March and May; December is the least popular month for booking. The number of bookings on weekdays is more or less the same, whereas the number of bookings at the weekend is much lower. Attributes are used to predict whether a PNR is cancelled or not. Table 3.3 summarizes the set of attributes extracted from the PNR database. The class-label attribute, IsCancelled, tells whether a booking is cancelled or not and has two values: 1 if the booking is cancelled and 0 if not. The remaining attributes are used to predict cancellations. Figure 3.2 visualizes the influence of three different attributes on the observed cancellation frequency. The thickness of the bars in the bar charts indicates the relative number of observations in the dataset. For example, figure 3.2c shows that there are many observations with value N or V for the PricingClass attribute and only a few observations with value F, O or P. The number of observations for the different values of the DepartureMonth and DepartureDayofWeek attributes is more or less the same. All departure months have on average the same cancellation probability (≈ 40%), and the same holds for the departure day of week. In contrast to these two attributes, the cancellation rate differs across price classes. For example, the Z price class, one of the cheapest seats in the business cabin with flexible cancellation conditions, has a higher cancellation rate than the G price class, the cheapest seat in the economy cabin without flexible cancellation conditions.
Some of the attributes need more explanation. First the attributes IsTrueLocal, TrueOriginAirport, TrueDestinationAirport, KarmaOriginAirport and KarmaDestinationAirport are explained. Suppose that a certain booking consists of a set of two flight legs, for example,
LHR-CDG and CDG-FRA (London-Paris-Frankfurt). The TrueOriginAirport is always LHR and the TrueDestinationAirport is FRA. If both flight legs are executed by KLM or AirFrance, then the IsTrueLocal attribute is 1, the KarmaOriginAirport is LHR (the same as the TrueOriginAirport) and the KarmaDestinationAirport is FRA (the same as the TrueDestinationAirport). If for example only the first flight leg is executed by KLM or AF, then the IsTrueLocal attribute is 0, the KarmaOriginAirport is LHR and the KarmaDestinationAirport
is CDG. And if only the last flight leg is executed by KLM or AF, then the IsTrueLocal attribute is also 0, the KarmaOriginAirport is CDG and the KarmaDestinationAirport is FRA.
The IsOutboundFlow attribute is 1 if the passenger or passengers in the PNR are beginning their journey; otherwise the attribute is 0. The NegoSpaceType attribute describes a special kind of group booking made by a travel organization. Suppose a certain travel organization makes a group booking for fifty passengers; all these passengers are then on the same flight with the same departure date. This booking is called the ’master’ booking and the NegoSpaceType attribute has value 1. If, for example, four of these passengers desire a different departure date, they are split from the ’master’ booking and a new booking is created. Because this booking stems from the original group booking made by the travel agent, the value of the NegoSpaceType attribute is 2. In all other cases, the NegoSpaceType attribute is 0.
Table 3.2 gives the ranges for the attributes LengthofStayRange, NbPaxRange and TimeFrameLabel. The TimeFrameLabel attribute of a booking is calculated as the departure date minus the demand (booking) date and gives the booking time in days before departure. Figure 3.3 shows the fraction of cancellations for each time frame. The thickness of the bars in the bar chart indicates the relative number of observations in the dataset. This means that our dataset contains many observations booked between 31 and 90 days before the departure date of the flight and much fewer observations booked 1 day before departure or even on the departure date. The cancellation rate decreases as the day of departure comes closer: a booking that was made 200 days before departure and is still active 5 days before departure is unlikely to cancel.
All four datasets are stored as an n × p matrix, where n is the number of bookings and p the number of attributes. These four datasets, also called the training sets, are used to fit the models. The out-of-sample set is stored as an n × p matrix as well and is used to estimate the prediction error of the models. After this, for both the training sets and the out-of-sample set, dummy variables are created for the explanatory attributes, i.e. all attributes except the class-label attribute IsCancelled. Each explanatory attribute is encoded as a number of dummy variables equal to its number of levels minus 1.

Example: the explanatory attribute DepartureDayofWeek has 7 levels, namely Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday. When we turn this attribute into dummies, 6 variables are created: DepartureDayofWeek_i for i = 2, …, 7. At most one of these variables can have the value 1, all others have the value 0. If all six variables have the value 0, the departure day is Monday; if the variable DepartureDayofWeek_2 has value 1, the departure day is Tuesday, etc. With these dummy variables, the training and out-of-sample sets are now n × k matrices, where n is the number of observations in the training or out-of-sample set and k is the total number of (levels − 1) dummies over all explanatory attributes. These matrices contain only zeros and ones. For faster computation in R we ’delete’ the zeros and create a sparse matrix, for both the training and out-of-sample sets. Using the sparse matrix of the training set, six different models are fit, and the sparse matrix of the out-of-sample set is used to estimate the prediction error of the model. The next chapter describes the classification models used in this report. The models have to work well with sparse matrices, or with many factor attributes with a lot of levels, in order to fit the model.
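The sketch below illustrates this encoding step in R. It is only a minimal, hedged example: the data frame names train and out_of_sample and the use of the Matrix package are assumptions made for illustration, not the actual implementation used at KLM.

# Minimal sketch of the dummy-variable and sparse-matrix step in R, assuming
# the training set is a data frame `train` whose explanatory attributes are
# factors and whose class label is the 0/1 column IsCancelled (the same is
# assumed for the out-of-sample data frame `out_of_sample`).
library(Matrix)

# Every factor with m levels is expanded into m - 1 dummy columns; the
# intercept column added by the formula interface is dropped afterwards.
x_train <- sparse.model.matrix(IsCancelled ~ ., data = train)[, -1]
y_train <- train$IsCancelled

# The out-of-sample factors are assumed to share the training factor levels,
# so that both matrices end up with the same k dummy columns.
x_test <- sparse.model.matrix(IsCancelled ~ ., data = out_of_sample)[, -1]
y_test <- out_of_sample$IsCancelled

dim(x_train)   # n x k, with k the total number of (levels - 1) dummies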
Methodology and Techniques
In this chapter, the methodology and techniques used in this report are discussed. The first section gives an overview of the classification models that are compared to each other. This is done with the use of five accuracy measures, which are discussed in the second section of this chapter. Furthermore, a test for significance is explained in the last section.
4.1 Classification models
The task of forecasting the probability of cancellation of a single booking can be modelled as a two-class probability estimation problem with the two classes being “cancelled” and “not cancelled”
(Morales & Wang, Cancellation forecasting Using Support Vector Machine with Discretization, 2008). Classification is a process for predicting qualitative responses. Some of the methods commonly used for binary classification are:
1. Decision Trees
2. Logistic Regression
3. Support Vector Machines
4. Naive Bayes classifier
5. Random Forests
In this chapter, the above five models are described.
4.1.1 Decision Trees
The first model is the decision tree, a widely used data mining technique. A decision tree is a structure that can be used to divide a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules (Berry & Linoff, 2004). The model construction method proceeds in two steps. In the first step, a tree is grown using a greedy heuristic to create the nodes. The algorithm starts with a root node containing the entire population and proceeds by recursively selecting an attribute and splitting the node into child nodes, each containing the records with the same value of that attribute. Splitting is performed until a termination criterion is met or there is nothing left to split. The attribute for a split is selected as the one which locally maximizes a heuristic criterion known as the gain function. In the second step, some of the tree's nodes and branches can be pruned off. This phase is necessary to remove nodes that might cause over-fitting of the data. Pre-pruning is done during calibration of the tree; post-pruning is done after the decision tree has been generated and tries to simplify the tree by removing superfluous nodes.
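To make the two steps concrete, the R sketch below grows and post-prunes a classification tree with the rpart package. It is a hedged illustration only: the selected attributes, the entropy-based splitting and the pruning by complexity parameter are standard rpart options, not the dynamic decision tree implementation used at KLM, and the data frame name train is assumed.

# Grow a deep classification tree (cp = 0 disables pre-pruning on gain)
# on a few PNR attributes from chapter 3, using entropy-based splits.
library(rpart)

tree <- rpart(IsCancelled ~ PricingClass + DepartureDayofWeek + TimeFrameLabel,
              data = train, method = "class",
              parms = list(split = "information"),
              control = rpart.control(cp = 0, minsplit = 20))

# Post-pruning: keep the subtree whose complexity parameter minimises the
# cross-validated error stored in the cp table.
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)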
There are several strategies available for splitting and pruning the tree. Different strategies are discussed in the following subsections.
Splitting
The objective of the splitting algorithm is to select the best attribute on which to perform a split. The selection of split attributes is done dynamically, and to find the best split condition the algorithm makes use of an impurity measure and a gain function. The impurity function measures how well the classes are separated; it should be 0 when all data belongs to the same class (low impurity means high coherence). There are multiple impurity functions in the literature; two of them are discussed here. In the following, we consider a node N with data set S that contains examples from k classes, and a split of N creates child nodes N_i. Then p_i(S) is the relative frequency of class i in S. The first impurity function is the Entropy:

Entropy(S) = −∑_{i=1}^{k} p_i(S) log p_i(S),    (4.1)

which is mostly used by full-split dynamic decision trees, which create one child node per value of an attribute. Entropy is intended for categorical attributes and is used by the C4.5 algorithm. The second impurity function is the Gini-index:

Gini(S) = 1 − ∑_{i=1}^{k} p_i(S)^2,    (4.2)

which is intended for continuous attributes and minimizes misclassification. This index is mostly used by binary dynamic decision trees, which split the values of an attribute into two sets. The Gini-index is used by the CART algorithm. There is a large selection of gain functions to measure the worth of an attribute; the attribute worth the most gets selected. The simplest gain function is

Gain(S, A) = I(S) − ∑_{v∈A} (|S_v| / |S|) · I(S_v),    (4.3)

where I(S) represents the selected impurity function; v represents any possible value of attribute A; S_v is the subset of S for which attribute A has value v; |S_v| is the number of elements in S_v; and |S| is the number of elements in S (Du & Zhan, 2002). This gain function favours attributes that have a large number of values. A second gain function, known as Gain Ratio, compensates for this bias by dividing by the intrinsic value (split information) of the split:

GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A),  with  SplitInfo(S, A) = −∑_{v∈A} (|S_v| / |S|) log(|S_v| / |S|).    (4.4)
The attribute with the highest gain ratio is chosen for the split.
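As an illustration of these formulas, the following R sketch computes the entropy, Gini-index, gain and gain ratio of a candidate split. It is a simplified, hypothetical implementation (base-2 logarithms, one call per candidate attribute), not the code of the dynamic decision tree used at KLM.

# Impurity of a vector of class labels y (works for 0/1 labels or factors).
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]                          # drop empty classes to avoid log(0)
  -sum(p * log2(p))
}

gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Gain of splitting the labels y on candidate attribute x, equation (4.3).
gain <- function(x, y, impurity = entropy) {
  x <- droplevels(factor(x))
  w <- table(x) / length(x)              # |S_v| / |S| for every value v
  impurity(y) - sum(w * tapply(y, x, impurity))
}

# Gain Ratio divides the gain by the split information, equation (4.4).
gain_ratio <- function(x, y, impurity = entropy) {
  x <- droplevels(factor(x))
  w <- table(x) / length(x)
  split_info <- -sum(w * log2(w))
  gain(x, y, impurity) / split_info
}

# Hypothetical usage on the training sample:
# gain_ratio(train$PricingClass, train$IsCancelled)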
Earlier research on cancellation forecasting by KLM shows that the best dynamic decision tree model consists of dynamic full-split decision trees calibrated using Entropy as the impurity function and Gain Ratio as the gain function.
Pruning
The objective of the pruning algorithm is to remove nodes which over-fit the data, in order to improve the forecasting capability of the tree. The simplest pruning method is based on a fixed pruning threshold, a lower bound on the number of observations in a node. If the number of observations in a certain node does not exceed the pruning threshold, the node is not created. There are three pruning thresholds: parent pruning, child pruning and gain pruning. A node with fewer observations than the parent pruning threshold is not split further and becomes a leaf. A child node (a leaf node) whose number of observations does not exceed the child pruning threshold is pruned. If the gain value of a certain split is lower than the gain pruning threshold, the split does not bring much information and it is not worth increasing the tree size for such a gain, so the corresponding candidate attribute is not chosen for the split. It can be interesting to combine pruning with the gain ratio computation.
A second method to prune decision trees is the pruning algorithm of the C4.5 decision tree algorithm, an error-based pruning algorithm. The algorithm is a kind of post-pruning
algorithm; it allows the tree to overfit the examples and then post-prune the tree (Kijsirikul & Chongkasemwongse, 2001). The algorithm starts from the bottom of the tree and examines
each non-leaf subtree. If replacement of this subtree with a leaf would lead to a lower estimated error, then prune the tree (Kijsirikul & Chongkasemwongse, 2001). The error rate
is defined as

q_S = min(p_S, 1 − p_S),    (4.5)

where p_S is the average calibration probability (the event probability) of data set S. For a parent node we calculate the upper bound for the error rate as

q_max(S) = q_S + 1.96 · √( q_S (1 − q_S) / |S| ).    (4.6)

We also calculate the upper bound for the error rate for all the child nodes: q_max(S_i) for i = 1, …, k. If the error rates of the child nodes satisfy the following criterion,

q_max(S) − ∑_{i=1}^{k} (|S_i| / |S|) · q_max(S_i) ≤ 0,    (4.7)

all the child nodes of the parent node are eliminated (pruned) from the tree.
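The following R sketch spells out this pruning test for a single parent node. The function and variable names (y_parent, children) are hypothetical; the sketch only illustrates equations (4.5)-(4.7) and is not the C4.5 implementation.

# Error rate of a node, given its 0/1 class labels y, equation (4.5).
error_rate <- function(y) {
  p <- mean(y)                                 # calibration probability p_S
  min(p, 1 - p)                                # q_S = min(p_S, 1 - p_S)
}

# Upper bound for the error rate of a node, equation (4.6).
q_max <- function(y) {
  q <- error_rate(y)
  q + 1.96 * sqrt(q * (1 - q) / length(y))
}

# TRUE if the subtree should be replaced by a leaf, equation (4.7);
# children is a list of 0/1 label vectors, one per child node.
prune_children <- function(y_parent, children) {
  weights     <- sapply(children, length) / length(y_parent)
  child_bound <- sum(weights * sapply(children, q_max))
  q_max(y_parent) - child_bound <= 0
}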
4.1.2 Logistic Regression
The second model is logistic regression, which models the probability that a certain booking x_i, with i = 1, …, n, belongs to a particular category. Multiple logistic regression is a statistical method to estimate an event probability using multiple attributes X. In other words, we model the conditional distribution of the response Y given the attributes X. To model the relationship between p(X) = Pr(Y = 1|X) (the probability that a certain booking is cancelled given attributes X) and X, the logistic function is used,

p(X) = exp(β_0 + β_1X_1 + … + β_pX_p) / (1 + exp(β_0 + β_1X_1 + … + β_pX_p)),    (4.8)

which gives outputs between 0 and 1 for all values of X, where X = (X_1, …, X_p) are p attributes. To fit the model the maximum likelihood method is used, which is discussed in the next subsection. After some manipulation of (4.8), we find the odds:

p(X) / (1 − p(X)) = exp(β_0 + β_1X_1 + … + β_pX_p),    (4.9)

which can take on any value between 0 and ∞. Taking the logarithm of both sides of (4.9) gives

log( p(X) / (1 − p(X)) ) = β_0 + β_1X_1 + … + β_pX_p,    (4.10)

where the left-hand side is called the log-odds or logit. Increasing X_1 by one unit changes the logit by β_1. However, because the relationship between p(X) and X in (4.8) is not a straight line, β_1 does not correspond to the change in p(X) associated with a one-unit increase in X_1. The amount that p(X) changes due to a one-unit change in X_1 depends on the current value of X (James, Witten, Hastie, & Tibshirani, 2013).
The logistic regression gives the opportunity to use qualitative attributes such as DominantAirline or TrueOriginAirport (see chapter 3) by creating dummy variables.
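For intuition, the short R snippet below evaluates equations (4.8)-(4.10) for a single booking; the coefficients and attribute values are made up purely for this illustration.

# Hypothetical coefficients and dummy-coded attribute values for one booking.
beta0 <- -0.4
beta  <- c(0.8, -1.2)
x     <- c(1, 0)

logit <- beta0 + sum(beta * x)       # log-odds, eq. (4.10)
odds  <- exp(logit)                  # eq. (4.9)
p     <- odds / (1 + odds)           # cancellation probability, eq. (4.8)
p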
Regularized Maximum Likelihood
The coefficients β0, …, βp in the logistic function are unknown and must be estimated based on training data. To estimate the unknown regression coefficients the regularized maximum
likelihood method can be used. This method estimates the coefficients by finding the parameter values that maximize the penalized log-likelihood (Friedman, Hastie, & Tibshirani, 2009):
l(β_0, …, β_p) = (1/n) ∑_{i=1}^{n} [ y_i log p_i + (1 − y_i) log(1 − p_i) ] − λ ∑_{j=1}^{p} |β_j|,    (4.11)

where ∑_{j=1}^{p} |β_j| is the lasso penalty, i.e. the sum of the absolute values of the β coefficients of all attributes. The effect of the penalty term is to shrink the coefficients and set many of them exactly to zero; i.e. the lasso penalty picks out the most predictive coefficients. This penalty is useful in any situation where there are many correlated predictor variables. After the estimation of the logistic regression coefficients β̂_0, …, β̂_p we can compute the probability of a cancellation for any combination of attributes:

p̂(X) = exp(β̂_0 + β̂_1X_1 + … + β̂_pX_p) / (1 + exp(β̂_0 + β̂_1X_1 + … + β̂_pX_p)).    (4.12)

However, evaluating the penalized log-likelihood over the full data set is computationally challenging when the number of bookings and attributes is large. Therefore the SGD method is also used for estimating the coefficients β_0, …, β_p in the logistic function. This method is described in the next subsection.
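As a hedged sketch of this estimation step, the R code below fits a lasso-penalized logistic regression with the glmnet package, reusing the sparse matrices built in chapter 3 (the names x_train, y_train and x_test are assumptions); it illustrates equation (4.11) but is not the exact configuration used in the report.

library(glmnet)

# alpha = 1 selects the lasso penalty; family = "binomial" gives logistic
# regression. cv.glmnet chooses the penalty weight lambda by cross-validation.
cv_fit <- cv.glmnet(x_train, y_train, family = "binomial", alpha = 1)

# Predicted cancellation probabilities for the out-of-sample bookings,
# i.e. equation (4.12) evaluated with the estimated coefficients.
p_hat <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "response")
head(p_hat)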
Stochastic Gradient Descent
Another method for estimating the coefficients in the logistic function is stochastic gradient descent, which is popular for large-scale optimization (Johnson & Zhang, 2013). To minimize the error of the logistic regression on the real-world dataset, stochastic gradient descent (SGD) evaluates and updates the coefficients at every iteration. The aim is to find the set of coefficients that results in the smallest error for the model on the training data. For the logistic regression model with a class label y taking only two values, 0 for no cancellation and 1 for a cancellation, this is done by finding the coefficients that minimize the following error measure function:
LogLoss = −(1/n) ∑_{i=1}^{n} [ y_i log p_i + (1 − y_i) log(1 − p_i) ],    (4.13)

where n is the number of observations, p_i the predicted probability of a cancellation of booking i and y_i the class-label attribute of booking i. The predicted probability of a cancellation of booking i is calculated as

p_i = 1 / (1 + exp(−x_i · w)),    (4.14)

where the parameter vector w is estimated iteratively. Each iteration performs a stochastic gradient descent update on the basis of a randomly picked example (x_i, y_i) (Bottou, 2010):

w_{t+1} = w_t + γ (y_i − p_i) x_i,    (4.15)

where γ is the learning rate. The stochastic process {w_t, t = 1, …} depends on the examples (x_i, y_i) randomly picked at each iteration (Bottou, 2010). Stochastic gradient descent is efficient because it evaluates the Log Loss function at a single observation y_i given x_i, rather than over the entire data set, which saves significant computation time (Tran, Toulis, & Airoldi, 2015). Johnson and Zhang (2013) conclude that SGD does not require the storage of gradients and is easily applicable to complex problems such as structured prediction or neural network learning. In the rest of this report, the Logistic Regression model refers to the logistic regression model with estimations based on the regularized maximum likelihood and the Stochastic Gradient Descent model refers to the logistic regression model estimated with stochastic gradient descent.
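A minimal R sketch of the update rule (4.14)-(4.15) is given below; the learning rate, the number of iterations and the omission of an intercept follow the simplified formulas above and are illustrative choices only, not the settings used in the report.

# Stochastic gradient descent for logistic regression, assuming a design
# matrix x (n x k, dense or sparse) and 0/1 labels y.
sgd_logistic <- function(x, y, gamma = 0.01, iterations = 1e5) {
  w <- rep(0, ncol(x))                         # start from zero coefficients
  for (t in seq_len(iterations)) {
    i   <- sample(nrow(x), 1)                  # pick one booking at random
    x_i <- as.numeric(x[i, ])
    p_i <- 1 / (1 + exp(-sum(x_i * w)))        # eq. (4.14)
    w   <- w + gamma * (y[i] - p_i) * x_i      # eq. (4.15)
  }
  w
}

# Hypothetical usage: w_hat <- sgd_logistic(x_train, y_train)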
4.1.3 Support Vector Machine
The third data classification method is the Support Vector Machine (SVM). The basic idea is to find a hyperplane which separates the p-dimensional data perfectly into its two classes (Boswell, 2002). In a p-dimensional space, a hyperplane is a subspace of dimension p − 1 which need not pass through the origin (James, Witten, Hastie, & Tibshirani, 2013). The formula

β_0 + β_1X_1 + β_2X_2 + … + β_pX_p = 0    (4.16)

defines a hyperplane in p-dimensional space, in the sense that if a point X = (X_1, X_2, …, X_p)^T in p-dimensional space satisfies (4.16), then X lies on the hyperplane. If a point X does not lie on the hyperplane, then X satisfies

β_0 + β_1X_1 + β_2X_2 + … + β_pX_p < 0    (4.17)

or

β_0 + β_1X_1 + β_2X_2 + … + β_pX_p > 0,    (4.18)
which means that X lies on one of the two sides of the hyperplane. So a hyperplane divides the p-dimensional space into two classes, y_i ∈ {−1, 1}, where −1 represents one class and 1 the other. If a separating hyperplane exists, we can use it to construct a very natural classifier: a test observation is assigned a class depending on which side of the hyperplane it is located (James, Witten, Hastie, & Tibshirani, 2013). A certain booking x* is assigned to class 1 if f(x*) is positive and to class −1 if f(x*) is negative, where

f(x*) = β_0 + β_1x*_1 + β_2x*_2 + … + β_px*_p.    (4.19)
The support vector classifier allows some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane. It does not perfectly separate the two classes, because it can be worthwhile to misclassify a few training observations in order to do a better job of classifying the remaining observations (James, Witten, Hastie, & Tibshirani, 2013). The support vector classifier is the solution to the optimization problem

maximize over β_0, β_1, …, β_p, ε_1, …, ε_n, M:  M    (4.20)

subject to

∑_{j=1}^{p} β_j^2 = 1,    (4.21)

y_i(β_0 + β_1x_{i1} + β_2x_{i2} + … + β_px_{ip}) ≥ M(1 − ε_i),    (4.22)

ε_i ≥ 0,  ∑_{i=1}^{n} ε_i ≤ C,    (4.23)

where M represents the margin of the hyperplane and C is a nonnegative tuning parameter. The higher C, the higher the tolerance of violations to the margin and to the hyperplane. The slack variables ε_1, …, ε_n allow individual observations to be on the wrong side of the margin or the hyperplane. ε_i tells us where the i-th observation is located relative to the hyperplane and relative to the margin: if ε_i = 0 the i-th observation is on the correct side of the margin; if ε_i > 0 the i-th observation is on the wrong side of the margin; and if ε_i > 1 the i-th observation is on the wrong side of the hyperplane (James, Witten, Hastie, & Tibshirani, 2013). Support vectors are observations that lie directly on the margin, or on the wrong side of the margin for their class, and they affect the support vector classifier. An observation that lies strictly on the correct side of the margin does not affect the support vector classifier
(James, Witten, Hastie, & Tibshirani, 2013). The solution to the optimization problem above only involves the inner products of the observations; the inner product of two bookings x_i and x_{i′} is given by

⟨x_i, x_{i′}⟩ = ∑_{j=1}^{p} x_{ij} x_{i′j}.    (4.24)

With some technical computations, it can be shown that the support vector classifier described above can be represented as

f(x) = β_0 + ∑_{i∈S} α_i ⟨x, x_i⟩,    (4.25)

where S is the set that contains all support vector points; α_i is nonzero only for the support vectors, so summing over the other points is not necessary.
The support vector classifier described above assumes a linear boundary between the two classes, but when we have to deal with non-linear class boundaries we have to enlarge the feature space using functions of the attributes. The support vector machine (SVM) is an extension of the support vector classifier that enlarges the feature space by using kernels (James, Witten, Hastie, & Tibshirani, 2013). A kernel is a function that quantifies the similarity of two observations and is a generalization of the inner product of two observations, of the form

K(x_i, x_{i′}).    (4.26)
There are different kinds of kernels. For example, a linear kernel,

K(x_i, x_{i′}) = ∑_{j=1}^{p} x_{ij} x_{i′j},    (4.27)

is used by the support vector classifier. An SVM uses a non-linear kernel, such as a polynomial kernel with degree d > 1,

K(x_i, x_{i′}) = ( 1 + ∑_{j=1}^{p} x_{ij} x_{i′j} )^d,    (4.28)

or a radial kernel with a positive constant γ,

K(x_i, x_{i′}) = exp( −γ ∑_{j=1}^{p} (x_{ij} − x_{i′j})^2 ).    (4.29)
When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known as a support vector machine and can be represented as a non-linear function
f(x) = β_0 + ∑_{i∈S} α_i K(x, x_i),    (4.30)

where S is the set that contains all support vector points (James, Witten, Hastie, & Tibshirani, 2013). One advantage of using a kernel, rather than simply enlarging the feature space with functions of the original features, is computational: with kernels we only need to compute K(x_i, x_{i′}) for pairs of observations, which can be done without explicitly working in the enlarged feature space. Without kernels, the enlarged feature space can be so large that computations become intractable.
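To close this section, the R sketch below fits an SVM with a radial kernel using the e1071 package on the training data from chapter 3 (converted to a dense matrix here for simplicity). The cost and gamma values are illustrative guesses, not tuned values from the report.

library(e1071)

# Radial-kernel SVM on the dummy-coded training data; probability = TRUE
# requests class probability estimates in addition to the class labels.
svm_fit <- svm(x = as.matrix(x_train), y = factor(y_train),
               kernel = "radial", cost = 1, gamma = 0.01,
               probability = TRUE)

# Predicted cancellation probabilities for the out-of-sample bookings.
pred     <- predict(svm_fit, newdata = as.matrix(x_test), probability = TRUE)
p_cancel <- attr(pred, "probabilities")[, "1"]
head(p_cancel)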