Introduction
Statistical matching, also referred to as ‘data fusion’ or ‘synthetical matching’ (D’Orazio et al., 2006:2), is a class of methods designed to integrate two or more data sets that share some common variables but also contain variables that have not been jointly observed, and that are based on disjoint sets of units. The purpose of statistical matching is to make use of existing data from independently conducted research, so that a new study does not have to be carried out (D’Orazio et al., 2006). Statistical matching has been widely used since public use files became available in the 1960s, and the desire to use statistical matching was among the reasons for making census data publicly available (Moriarity and Scheuren, 2001).
Normally the problem of statistical matching applies to survey studies, where the identities of individuals, families, or organisations have been obscured for confidentiality reasons (D’Orazio et al., 2006). However, the various methodologies of statistical matching can be applied to any research with quantifiable data, where the samples are assumed to be drawn from the same population. This discussion will present several of these methodologies, along with their strengths and weaknesses, with the end goal of choosing an appropriate method for use in. Along with each method, it is important to consider the general assumptions that underlie the model; if they are incorrect, then the resulting model will reflect the incorrect assumption rather than reality (D’Orazio et al., 2006).
Approaches and General Assumptions
Matching may be constrained, which requires that every record from the files being merged is used once and only once, or unconstrained, which imposes no such requirement (Moriarity and Scheuren, 2001). Additionally, statistical matching may apply the micro approach or the macro approach, although the two are not truly distinct (D’Orazio et al., 2006). The micro approach, which constructs a complete synthetic file from the multiple sets of data, is most often used because it maintains confidentiality and is easier to work with (D’Orazio et al., 2006). The macro approach is used when the researcher requires a direct estimate of the joint distribution function of the variables that have not been observed jointly. However, D’Orazio et al. (2006:3) caution that “the micro approach is always a by-product of an estimation of the joint distribution of all the variables of interest”.
Moriarity and Scheuren (2001), interpreting the pioneering theoretical work of Kadane (1978), do not assume that the samples are the same; they argue that statistical matching is performed for the purpose of combining records for similar, rather than actual, entities. However, the widely cited text by D’Orazio et al. (2006) states that one of the strongest assumptions of statistical matching is that, given two sets of data measuring variables that are not jointly observed, the samples are drawn from the same population. Typical matching samples will have missing data, which is treated as the outcome of a missing data generation mechanism. Additionally, it is assumed that joint information on the variables of interest is missing. However, if samples are drawn at different times, then real statistical matching cannot truly be performed. When this occurs, additional information drawn from the second sample B should be lent to the first sample A. Procedures for using donor data are discussed under the non-parametric micro approach.
Conditional Independence Assumption: Macro and Micro Approaches
The conditional independence assumption (CIA) states that the variables that are not jointly observed are independent of one another, conditional on the common variables (D’Orazio et al., 2006). It is not possible to test this assumption using the observed data alone, but the remainder of this section assumes the CIA to be true. Using the macro approach in a parametric setting involves direct estimation of the joint distribution or of one of its characteristics. Estimators are based both on the overall sample and on subsets of the data, without iteration (Rubin, 1974, 2006). These methods are appropriate for univariate normal distributions, multivariate normal distributions, and multinomial distributions (D’Orazio et al., 2006).
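A minimal sketch of the parametric macro approach under the CIA may help here. It assumes a normal setting in which file A observes (X, Y) and file B observes (X, Z); under the CIA the unobserved covariance of Y and Z is implied by the two regression slopes and the variance of X. The function names and the toy data are illustrative assumptions and are not taken from the cited works.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Covariance with divisor n (maximum-likelihood form)."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def slope(x, y):
    """Least-squares slope of y on x."""
    return covariance(x, y) / covariance(x, x)

def cia_covariance(x_a, y_a, x_b, z_b):
    """Estimate cov(Y, Z) under the CIA from A = (X, Y) and B = (X, Z)."""
    beta_yx = slope(x_a, y_a)      # estimated on file A only
    beta_zx = slope(x_b, z_b)      # estimated on file B only
    x_all = x_a + x_b              # X is observed in both files
    var_x = covariance(x_all, x_all)
    # Under the CIA: cov(Y, Z) = beta_yx * beta_zx * var(X)
    return beta_yx * beta_zx * var_x

# Hypothetical toy data: Y = 2X in file A, Z = 3X in file B
x_a = [1.0, 2.0, 3.0, 4.0]
y_a = [2.0, 4.0, 6.0, 8.0]
x_b = [2.0, 3.0, 4.0, 5.0]
z_b = [6.0, 9.0, 12.0, 15.0]
sigma_yz = cia_covariance(x_a, y_a, x_b, z_b)
```

Note that each quantity is computed in a single pass from the subset of data in which it is observed, which is the sense in which no iteration is needed.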
Using the micro (predictive) approach in the parametric framework, the analyst constructs a synthetic complete data set by predicting missing values from known values, and then filling them in (D’Orazio et al., 2006). This can be performed by either conditional mean matching (CMM) or draws based on conditional predictive distribution (DBCPD) (D’Orazio et al., 2006). CMM, which “substitutes each missing item with the expectation of the missing variable given the observed ones”, may fill in predicted values for variable X that are not equal to the actual values, and the synthetic distribution of the predicted values for variable Y is based on those predicted for X (D’Orazio et al., 2006: 26). Assuming a MAR mechanism for data with multivariate distributions, Kadane (1978) notes that DBCPD is more likely than CMM to result in a synthetic data set that closely resembles the actual distribution of values because it preserves the variance that would naturally be found.
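The contrast between CMM and DBCPD can be sketched as follows, assuming a simple linear regression of Y on X with normal residuals fitted on the file where both are observed. The helper names, the normality assumption, and the toy data are all illustrative assumptions. The recipients deliberately share one value of X, so that the loss of variance under CMM is easy to see.

```python
import random

def fit_line(x, y):
    """Least-squares intercept, slope, and residual standard deviation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sd = (sum(r ** 2 for r in resid) / n) ** 0.5   # divisor n, ML form
    return b0, b1, sd

def impute_cmm(x_recipient, b0, b1):
    """Conditional mean matching: fill in the estimated E[Y | X = x]."""
    return [b0 + b1 * x for x in x_recipient]

def impute_dbcpd(x_recipient, b0, b1, sd, rng):
    """Draw each value from the estimated N(b0 + b1 * x, sd^2)."""
    return [rng.gauss(b0 + b1 * x, sd) for x in x_recipient]

# Hypothetical donor file with X and Y both observed
x_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_a = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
b0, b1, sd = fit_line(x_a, y_a)

# Hypothetical recipient file: X observed, Y missing
x_b = [3.0] * 50
rng = random.Random(0)          # seeded so the draws are reproducible
y_cmm = impute_cmm(x_b, b0, b1)
y_draws = impute_dbcpd(x_b, b0, b1, sd, rng)
```

Every CMM value for these recipients is the same point on the fitted line, whereas the DBCPD values scatter around it with the estimated residual spread.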
Non-parametric macro methods involve the estimation of the joint distribution of the values of interest. There are several estimators with asymptotic properties that are consistent when each of their assumptions is met (D’Orazio et al., 2006). Non-parametric macro methods are congruent with their parametric macro counterparts (D’Orazio et al., 2006).
The non-parametric micro approach, which forms a complete synthetic data set, can utilise either CMM or DBCPD (D’Orazio et al., 2006). Either approach will employ one of several nonparametric estimation procedures, but analysts will commonly apply nonparametric or ‘hot deck’ imputation procedures, which substitute missing values in the data set of the host or recipient file with observed values from a donor file (D’Orazio et al., 2006: 35; Singh et al., 1993). One method is to assign one sample to be the recipient and the other to be the donor. If recipient A is smaller, then donor B can donate values from a larger range of actual values, but this results in a smaller synthetic file. On the other hand, if recipient A is larger, then a larger synthetic file will result but the range of values will be smaller. Another method is to switch the roles of recipient and donor, so that missing values of B are donated by A and missing values of A are donated by B, thus making maximally efficient use of all of the available data. However, in most cases the roles of recipient and donor should be fixed, and this is assumed to be the case for three different methods of hot deck imputation (Singh et al., 1993).
Non-Parametric Micro Hot Deck Imputation
Random hot deck imputation involves the use of two homogeneous subsets from recipient file A and donor file B. The analyst randomly matches each recipient record from file A with a donor record from file B, then fills in any missing values for a given record from the matching donor record. The result is analogous to what would be obtained from a random draw from the distribution estimated from the donor file. This method assumes that values are independent of one another; this assumption can be tested by running a correlation on the complete donor file B.
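The procedure described above can be sketched in a few lines, assuming that the homogeneous subsets are defined by a single categorical variable and that records are plain dictionaries. The field names (`region`, `income`) and the toy records are hypothetical.

```python
import random
from collections import defaultdict

def random_hot_deck(recipients, donors, class_key, target, rng):
    """Fill `target` in each recipient from a random donor in the same class."""
    pools = defaultdict(list)
    for d in donors:
        pools[d[class_key]].append(d)
    completed = []
    for r in recipients:
        pool = pools.get(r[class_key])
        filled = dict(r)   # copy so the recipient file is left untouched
        # If the class has no donor, leave the value missing rather than guess
        filled[target] = rng.choice(pool)[target] if pool else None
        completed.append(filled)
    return completed

# Hypothetical recipient file A (income missing) and donor file B
file_a = [
    {"id": 1, "region": "north"},
    {"id": 2, "region": "south"},
    {"id": 3, "region": "north"},
]
file_b = [
    {"region": "north", "income": 30},
    {"region": "north", "income": 34},
    {"region": "south", "income": 22},
]
synthetic = random_hot_deck(file_a, file_b, "region", "income", random.Random(1))
```

Because donors are drawn with replacement, a donor record may be used more than once, so this sketch corresponds to an unconstrained match.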
If there is an ordinal matching variable, rank hot deck imputation can be used to rank-order records from files A and B on that variable, and then match records based on their respective ranks. Similarly, distance hot deck imputation uses a known matching variable to compute a distance measure between recipient and donor records, and each recipient is matched to the nearest donor. The constrained distance hot deck requires that each donor record be used only once, while the unconstrained distance hot deck allows any given donor record to be used more than once.
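The three variants can be compared on a toy example. The sketch below assumes a single numeric matching variable, compares ranks through the empirical cumulative distribution function so that files of different sizes are comparable, and solves the constrained case by brute force over all assignments, which is only feasible for very small files. All function names and data are illustrative assumptions.

```python
from bisect import bisect_right
from itertools import permutations

def ecdf(sample):
    """Return a function giving the empirical CDF of `sample`."""
    ordered = sorted(sample)
    return lambda v: bisect_right(ordered, v) / len(ordered)

def rank_hot_deck(x_rec, x_don):
    """Match each recipient to the donor with the closest ECDF value."""
    f_rec, f_don = ecdf(x_rec), ecdf(x_don)
    don_ranks = [f_don(v) for v in x_don]
    return [min(range(len(x_don)), key=lambda j: abs(f_rec(v) - don_ranks[j]))
            for v in x_rec]

def unconstrained_distance(x_rec, x_don):
    """Nearest donor for each recipient; a donor may be reused."""
    return [min(range(len(x_don)), key=lambda j: abs(v - x_don[j]))
            for v in x_rec]

def constrained_distance(x_rec, x_don):
    """Each donor used at most once; minimise total distance by brute force."""
    best = min(permutations(range(len(x_don)), len(x_rec)),
               key=lambda p: sum(abs(v - x_don[j]) for v, j in zip(x_rec, p)))
    return list(best)

# Hypothetical values of the matching variable in each file
x_rec = [10, 11, 30]
x_don = [10.5, 20, 31, 40]

by_rank = rank_hot_deck(x_rec, x_don)                  # donor index per recipient
by_distance = unconstrained_distance(x_rec, x_don)
by_constrained = constrained_distance(x_rec, x_don)
```

On these data the unconstrained match gives the first donor to two recipients, while the constrained match must send one of them to a more distant donor. For files of realistic size, the brute-force search would normally be replaced by an assignment solver.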
The quality of the synthetic data set and its ability to preserve the distribution of the imputed variable are affected by matching noise, which arises from the sample distributions and from the variance contributed by mismatches between donor and recipient values. Generally, allowing the sample sizes of A and B to differ produces less matching noise (D’Orazio et al., 2006). Moriarity and Scheuren (2001) have observed that samples of under 100 typically produce too much noise, while samples of at least 500 are customarily adequate.
An Example
Ferrando and Mulier (2013) employed the non-parametric nearest neighbour distance hot deck (NNDHD) matching procedure (D’Orazio et al., 2006) to match survey replies to balance sheet information. The researchers administered the Survey on the Access to Finance of Enterprises (SAFE) to 11,886 European firms regarding their perceived financial constraints, and attempted to match these answers to actual balance sheet information, which was available on 2.3 million firms in the Bureau van Dijk Amadeus database. First, all firms from both files were classified into a priori defined groups so that matching could only occur within the same group. Second, NNDHD matching was conducted based on number of employees and age of firm. The analysts employed the Gower distance function, wherein distance is normalised between 0 and 1 (D’Orazio et al., 2006). A firm in SAFE was matched to the Amadeus firm at the smallest distance; if more than one firm shared that minimum distance, then one was chosen at random. A zero minimum distance was achieved for 31 percent of the matches, which means that these firms had the same group, age, and number of employees; 77 percent had a distance of less than 0.01. The drawback to this technique was that there was no guarantee that the matched firms had the same financial characteristics. However, the construction of the groups was such that most firms within each defined group shared similar financial characteristics.
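The matching steps in this example can be illustrated with a small sketch. It is not the authors' implementation: the records, field names, and group variable are hypothetical, and the sketch assumes that the Gower distance for numeric variables is the average of absolute differences scaled by each variable's range, with ranges taken over the pooled records.

```python
import random

def gower_distance(a, b, ranges, keys):
    """Gower distance for numeric variables: mean of range-scaled gaps."""
    parts = [abs(a[k] - b[k]) / ranges[k] if ranges[k] else 0.0 for k in keys]
    return sum(parts) / len(parts)

def nnd_hot_deck(recipients, donors, group_key, keys, rng):
    """Match each recipient to its nearest same-group donor, ties at random."""
    pooled = recipients + donors
    ranges = {k: max(r[k] for r in pooled) - min(r[k] for r in pooled)
              for k in keys}
    matches = []
    for rec in recipients:
        pool = [d for d in donors if d[group_key] == rec[group_key]]
        if not pool:
            matches.append((rec["id"], None, None))   # no donor in this group
            continue
        dists = [gower_distance(rec, d, ranges, keys) for d in pool]
        best = min(dists)
        ties = [d for d, dist in zip(pool, dists) if dist == best]
        matches.append((rec["id"], rng.choice(ties)["id"], best))
    return matches

# Hypothetical survey firms (recipients) and database firms (donors)
survey = [
    {"id": "s1", "sector": "retail", "employees": 20, "age": 10},
    {"id": "s2", "sector": "manufacturing", "employees": 150, "age": 30},
]
database = [
    {"id": "a1", "sector": "retail", "employees": 20, "age": 10},
    {"id": "a2", "sector": "retail", "employees": 500, "age": 5},
    {"id": "a3", "sector": "manufacturing", "employees": 140, "age": 28},
    {"id": "a4", "sector": "manufacturing", "employees": 20, "age": 10},
]
matches = nnd_hot_deck(survey, database, "sector", ["employees", "age"],
                       random.Random(0))
```

Here the first survey firm finds a donor at distance zero, meaning an identical group, size, and age, while the second is matched to a close but not identical firm. As in the study, nothing in the distance itself guarantees that matched firms share financial characteristics.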