INTRODUCTION
Use of soil information to boost achievements of UN-Sustainable Development Goals (SDGs) require interdisciplinary approaches placing soil scientist in a unique position to provide adequate and reliable soil data. Soils functions have direct link to the ecosystem goods and services whose vigour gurantee provision of food, adequate and clean water, resilience to climate change shocks and an enhanced biodiversity. Many challenges with regard to environment, social, economic, geologic, and human health can be addressed if we follow the path to better manage the soils (e.g Howitt et al., 2009; Brevik, 2013; McBratney et al., 2014). However, human interventions while utilizing soil resources coupled with climate change and variability are having unanticipated consequences. The Earth ‘s soils are being compacted (Jones et al., 2003); washed away by erosion (Cerdàr & Doerr, 2005); depleted in organic carbon (Bellamy et al., 2005); at a rate that cannot be sustained considering soil formation is a very slow process (Verheijen et al., 2009). This has continued to limit the soils capacity to optimally perform important soil functions like; biomass production, nutrient recycling, carbon regulation; water storage and filtration just to name a few.
Integrated soil fertility management has been identified as one of the best soil management approach to achieve agricultural intensification in sub-Saharan Africa (Vanlauwe et al., 2015). Despite Kenya’s economy being agricultural based, most of the existing soil inventories are very old (legacy data) and worse still a big record of these inventories do not provide adequate soil information at relevant scales that can guide soil and land management (Vagen et al., 2013). The inventories also lack a harmonized sampling design that satisfies data quality checks as described by Hengl et al. (2015). Efforts through projects like the Alliance for Green Revolution in Africa (AGRA) to increase crop yields and help smallholder farmers out of poverty traps (Nziguheba et al., 2010), has not been successful as anticipated partly due to poor soil fertility and lack of affordable soil assessment mechanisms. Winoweick et al. (2016) have pointed the need to understand soil properties in view of identifying limitations that hinder increased agricultural production.
Conventional methods for soil fertility assessment have been highlighted as expensive and time consuming (Shepherd & Walsh, 2002; McBratney et al., 2003). Farmers can hardly afford to have their soils analysed to ascertain soil fertility levels. Data quality checks described by Hengl et al. (2015) are important challenges of conventional methods of soil analysis. The commonly used plot experiments are expensive and do not capture the geographical variability of soil properties over a wide area. Thus, taking up new analytical procedures that can provide reliable soil properties data in a much more cost efficient manner are urgently needed. Over the past 30 years, reflectance spectroscopy in the near and mid infrared (NIR & MIR) has been used as a dry chemistry analytical tool to provide quantitative and qualitative data of soil properties in a much faster, non destructive, cost efficient and environmentally friendly because few chemicals are required compared to wet chemistry laboratory measurements (Nocita et al., 2015, Madari et al., 2006; Vagen et al., 2016). An added advantage is that a single spectrum may contain comprehensive information on various soil properties, and can be used to predict these simultaneously. MIR is integrative making it a good soil health diagnostic tool (Shepherd & Walsh, 2002). Inferences are made using multivariate statistics to calibrate soil spectra data with selected representative laboratory measurements.
Traditionally, farmers do not take into account field variability when applying inputs like fertilizers. This may lead to wastage in parts of the field that are well endowed with nutrients and under application in parts of the field high nutrient deficiency. Site specific management systems are possible with the input of geostatistical approaches that enable spatial estimation of soil properties in unsampled locations (Saito et al., 2005; Behera & Shukla, 2015). This study was an effort to quantify soil properties using rapid and cost effective approaches to support development of spatial distribution maps and management decisions.
MATERIALS AND METHODS
Soil sampling was done in Mt. Kenya region covering an area of 1200 km2 within latitudes 370 36 ‘E and 380 0 ‘ E and longitudes 00 6 ‘ N and 00 18 ‘ S. The major land use is rainfed agriculture. The altitude ranges was 700 m to 2000 m above sea level. The agro-climatic zone is mainly humid to semi arid (Jaetzold et al., 2007). The area is located on the windward side of Mount Kenya characterized by large rainfall differences. The rainfall gradient shows a strong east-west tendency and increases from east towards west. Annual rainfall is distributed in two seasons with short and long rains during March to May and October to December respectively. Rainfall gradient is from west to east ranging from 1500 mm in the upper humid zones to 600 mm in the lower semi arid zones. Temperature is correlated with altitude, showing a strong east west tendency, warm parts in the eastern lowlands and cooler zones high up towards western parts. The annual average temperature ranges from 100C to 350C. The geology is mainly volcanic rock and ash and some old metamorphic rocks (Schoeman, 1952).
Dominant soils and soil forming factors
The dominating soils are: Nitisols, Ferrasols, Regosols, Vertisols, Phaeozems and Arenosols. This is according to the 1:1 M KENSOTER map and database (Dijkshoorn, 2007). Based on the profile descriptions and laboratory results of the survey of this work the dominant soil types of the area are: Andosols, Umbrisols, Nitisols, Alisols, Cambisols, Leptosols, Plinthosols and Regosols. Soil forming factors; climate, topography and lithology influence distribution of these soil types. Western part is relatively humid with lower temperatures. Low rate of mineralisation of organic matter, strong leaching and eluviation give rise to humic topsoil, acid soils with low base saturation like Andosols, Umbrisols and Alisols. In the middle elevation the rainfall and temperatures are moderate. Hence less leaching and moderate organic matter decomposition. The soils are deep and well drained evidenced by presence of Nitisols.
The effect of past climate (alternating wet and dry conditions) resulted in the formation of pisolithic material as evidenced by presence of Plinthosols in the lower semi arid zones. Cambisols show incipient subsurface soil formation on alluvial plains and shallow Leptosols are mainly found in areas with basement rock. Presence of Regosols in the eastern semi arid zones is evident due to extensive erosion and accumulation especially in the mountainous terrain. Topography influences the drainage conditions and soil redistribution.
Soil sampling design
Latin Hypercube Sampling (CLHS) was adapted. The choice was based on the foreseen constraints (inaccessibility due to poor weather roads, very steep slopes, possibility of having sampling locations coinciding with water bodies, national parks or built environment). The need to reduce the sample size yet cover a wide geographical area with limited budget was put into consideration.
Conditional Latin hypercube sampling (CLHS) is an optimization procedure that picks sampling sites which can form a Latin hypercube in the feature space. The need to input the distribution of environmental variables in the design of the soil sampling scheme justified use of CHLS which is a stratified random sampling. This method aims to pick sampling sites that form a Latin hypercube feature space as demonstrated below:
- Assuming K variables Xi…………………XK the array of each variable X is divided into n equal strata.
- In this case K variables are: the environmental covariates(soil forming factors derivatives)
- Then samples are picked randomly for every variable Xi…………………XK.
- In total n samples covering n intervals are selected. [they can be randomly paired guided by some conditions (CLHS)]
- Use of conditions involved addition of constraints to the objective function formally formulated by (Minasny & McBratney, 2006).
- These constraints are based on field operation costs which are a function of time, sample size and accessibility to sampling locations.
- Finally addition of constraints to the objective function leads to equation (1)
………………………… (1)
n=samples; k= variables; nij =sampling frequency (where i= interval and j=variable); cp=cost associated with sampling. W1&w2 are the weights.
The ‘ease of reach’ was determined by generating an arbitrary ‘cost of reach’ layer (Figure 1.) using GRASS GIS. Away from road network and very steep slopes corresponded to a long cost-distance. Similar approach by Roudier et al. (2012) in Australia and Mulder et al. (2013) in Morocco reduced the operational cost of soil survey significantly by identifying easy to reach points yet covering the full range of the sampling area.
A comparison of CLHS and the commonly used Monte Carlo simple random sampling, CHLS is superb in that it ensures a more even distribution of sampling points than simple random sampling (Figure 2). CHLS selects points with maximum euclidean distance while simple random sampling can sometimes form clusters leaving large unsampled areas. A sensitivity analysis by Helton et al. (2005) proves CLHS to be a more robust sampling method compared to random sampling. On the other hand, simple random sampling guarantees independence of samples (Gentle, 2003).
Environmental variables
Good expressions of soil forming factors in remote sensing data have been reported (Dobos et al., 2000, Vagen et al., 2013). Jenny’s (1941) state equation for soil formation: S=f (cl, o, r, p, and t) clearly outlines the influence of each soil forming factor in the soil forming matrix. Climate (cl) which is surrogate of rainfall and temperature influences the rate of soil forming processes like humification processes (McBratney et al., 2003).
Organisms (O) were represented using the Normalized Difference Vegetation index (NDVI) derived from Landsat 8 satellite imagery with a resolution of 30 m for dry season from http://earthexplorer.usgs.gov/: row/path 168/60 from August 2009. The NDVI is a normalized difference ratio index of the near infrared (NIR) and red bands of multispectral image, NDVI = (NIR band – Red) / (NIR band + Red).
Relief (r) was represented by terrain derivatives (elevation, slope, topographic wetness Index and aspect).These were calculated from Global Digital Elevation Model (DEM), Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) with a resolution of 30 m. SAGA GIS 2.0.6 (available at: <http://www.saga-gis.org>) was used to generate these terrain derivatives.
Lithological map scale 1:5 M (Schoeman JJ, 1952) was used for parent material (p). The scale of 1:5 M was too coarse in relation to the sampling area. Fortunately, the geomorphology is closely related to the geology (van de Weg, 1978) and shows two clear bisections comprising of the volcanic and basement system rocks. This supported with information that the variability of geology/parent material may be insignificant. Dominant soils of the polygons of the KENSOTER units (Dijkshoorn, 2007) were used as categorical data to ensure sampling was done in every dominant soil type. R ‘clhs’ algorithm (R Core Team, 2013), and Quantum GIS processing were used to design the sampling frame. Box plots were used for validation based on natural and sampled distribution of environmental variables (Figure 3).
Mid-infrared (MIR) spectral-reflectance measurements
Out of the 77 sampling sites visited, 28 were open soil profiles and 47 were augured profiles. The sampling depth was 1 m and 269 samples were collected at three depth intervals (0-20 cm, 20-50 cm & 50-100 cm). Preprocessing of samples involved air drying, crushing and sieving using a 2 mm sieve, further crushing (< 100 μm) using agate pestle and mortal. Loading into A 752-96, Bruker Optics, Karlsruhe aluminium micro titer plates with wells measuring 6-mm diameter was done into four replications. MIR soil spectra measurements for the 269 samples were taken using a Fourier-transform MIR spectrometer (FTIR; Tensor 27). Liquid N2–cooled HgCdTe detector was used. The waveband ranged from 4000-400cm-1 (Figure 4a) and a resolution of 4 cm-1 at a waveband distance of about 2 cm−1. Sample variability and differences in particle size and packing density were solved by averaging the four replicates. The first derivative of the reflectance spectra was computed based on Savitzky–Golay smoothing filters (Wand & Ripley, 2008).
Calibration sample selection
There is no rule of thumb for the determination of the optimal number of calibration samples. However, selection of samples with normal distribution should be avoided because results of subsequent analysis regress towards the mean. Kennard-Stone algorithm (Kennard & Stone, 1969) was a solution because it ensures that the selected calibration samples are uniformly distributed by choosing the samples that maximize the Euclidean distances between each other. This aspect complimented well with our sampling design (CLHS) that ensured the full range of sampling area is uniformly covered. Principal Component Analysis that was computed on reflectance spectral matrix further reduced data overlap. The selected calibration set was n=30 with a tolerance of ±2 sampling points.
Laboratory methods for the calibration samples
After examining the spread of the calibration and prediction sets (Figure 4b.), the calibration samples were analysed for pH, Cation Exchange Capacity (CEC) following (Van Reewijk, 2002), Na, K, Ca, Mg and Al following Mehlich 3 extraction (Mehlich, 1984), SOC and TN following Thermal Oxidation (Skjemstad & Baldock, 2008). Laser Diffraction Particle Size Analyzer (LDPSA) was used for analysis sand, silt and clay particles. Calibration for LDPSA was done using the pipette method (Van Reewijk, 2002).
Data analysis
Statistical analysis.
Exploratory data analysis was performed using descriptive statistics and data normality testing using Shapiro-Wilk test at 5% significant level. Random Forest (RF) Regression was used for the calibration. The utility of RF for regression and classification is described by Gislason et al. (2006). Prediction performance was based on values of coefficient of determination (r2) and Root Mean Square Error (RMSE) calculated as shown in equation 2.
…………………………………………………. (2)
Where Xobs is observed values and Xmodel is modelled values at place i.
Geostatistical analysis
To allow estimation of soil properties at the unvisited locations, geostatistical approaches were used for spatial prediction. To allow estimation of soil properties at the unvisited locations, geostatistical approaches were used for spatial prediction. Ordinary Kriging (OK) was used following guidelines of selecting spatial prediction models as described by Hengl, (2007).
I. Dependent variables showed no significant correlation with explanatory variables (Table 1.)
II. The variables showed spatial autocorrelation calculated using Nugget to sill ratio (Table 2.)
III. A semivariogram with more than one parameters could be fitted.
OK integrates statistical properties of the measured data (spatial autocorrelation) while the kriging approach uses semivariogram to express spatial continuity (autocorrelation).
Kriging estimate z*(xo) was calculated as follows:
…………………………………………… (3)
Where Wi is the weight and (x0, xi) are the corresponding distances (Agrawal et al., 1995).
Evaluation of the spatial dependency of the data
Semivariograms are useful for measuring data correlation as a function of distance. They evaluate the spatial structure of data. The common theoretical variogram models are outlined by Webster & Oliver (2001). The nugget is the value of the semi-variogram as distance (h) approaches zero. The nugget effect is due to error in measurement, spatial variation that occurs within the sampling distance interval, and random events. Three parameters range, sill, nugget and nugget to sill ratio are used to interpret a semivariogram. Range is defined by the distance beyond which data autocorrelation no longer exists. Sill is the value of the semivariogram beyond which no spatial dependence of data is exhibited.
A semivariogram is expressed as follows (Nielsen & Wendroth, 2003):
……………………………………. (4)
S(H) is the semivariance, N(H) is the number of pairs of locations separated by a lag distance H, h is the lag distance, Z is the parameter of the soil properties, Z(xi), and Z (xi + H) are values of Z at positions xi and xi + H .
The strength of spatial data dependence is defined by the Nugget to sill ratio. Three classifications of spatial dependence of data based on the nugget/sill ratios are described by Zuo et al. (2008). A nugget/sill ratio of < 25% means strong dependence, a proportion between 25% and 75% indicates a moderate spatial dependence and > 75% often signify weak spatial dependency of data. Usually high nugget/sill ratios is an indication that the spatial variability of data is not strongly influenced by the natural factors but rather stochastic factors like land management practices. On the other hand, low ratios of nugget to sill suggest that structural factors like soil forming factors are a major cause of spatial variability of data.
Two concerns are important in the analysis of semivariograms
I. Fitting the semivariogram to the experimental data requires careful choice of the total lag distance. This was addressed by setting conditions such that selection of the separation distance involved 95% pairs to fit the semivariogram model.
II. The choice of appropriate model to fit the experimental semivariogram data is also a big concern. Spherical model was selected compared to exponential and Gaussian models based on analysis of our data (Table 2.).
Evaluation of spatial prediction accuracy
Model performance was evaluated on the basis of variations between the observed and the predicted values of soil properties. This was calculated using the Mean Error (ME), Root Mean Square Error (RMSE) and the Standardized Root Mean Square Error (SRMSE) using equations (5), (6) and (7) respectively where n is the number of sample points.
…………………………………… (5)
……………………………… (6)
……………….. (7)
Where Obsi is observed values and Predi is modelled values at place i. Predvar is the variance of the prediction. The best model is one that has ME nearest zero, the smallest RMSE and SRME nearest to 1. For good predictive model the RMSE values should be low (<0.3). RESULTS AND DISCUSSIONS Accuracy of soil properties predictions The model performance was classified as follows: excellent when r2>0.90, good when 0.81<r2<0.90, moderately successful when 0.66<r2<0.80 and unsuccessful when 0.50<r2<0.65. This classification was adopted from Saeys et al. (2005). Moderately successful predictions were achieved for SOC % at r2=0.78 and RMSE=1.64. Comparable accuracy (r2=0.77 and RMSE=1.64) for SOC has been reported by Terhoeven-Urselmans et al. (2010). But our coefficient of determination (r2) was better. This can be associated to the use of different calibration statistics as the sample size and all other methods were similar. In the case of Terhoeven-Urselmans et al. (2010), partial least squares regression (PLS) was used as the calibrating statistics while in our case random Forest regression (RF) was used to calibrate the soil MIR spectra. Findings by vasques et al. (2010) that RF was more accurate than PLS while predicting SOC in Florida, USA further supports our choice of RF as the calibrating statistic. PLS is a linear multivariate statistic and not good in presence of non-linear vibration common in FTIR (Hoffmann and Knözinger, 1987) that was used to take MIR soil spectra measurements in our study. RF can model both linear and non linear data (Vega et al., 2009). Higher accuracy of SOC prediction (r2 = 0.96) was achieved by McDowell et al. (2012), this may be attributed to high and wide range of SOC (%) of the calibration set (0.24 to 55.29%) compared to (0.56 to 10.83%) in our soil samples. Our accuracy for the calibration of pH was good (r2 = 0.88 and RMSE = 0.48) and better in terms of r2 and RMSE compared to those achieved by Terhoeven-Urselmans et al. (2010) (r2 = 0.81, RMSE = 0.63) and Shepherd & Walsh (2002) (r2 = 0.83, RMSE = 0.54) while analysing a spectral library of soils from Eastern and Southern Africa. Calibrations for Mehlich extraction of Na, Mg, K and Ca were excellent (r2=0.93 and RMSE=1.45; r2=0.88 and RMSE=1.60; r2=0.92 and RMSE=2.32 and r2=0.92 and RMSE=189.16 respectively. The prediction of Ca was better than that of Sila et al., 2016 who reported r2=0.91 for Ca. Again this could be associated with the efficiency of RF, a multivariate statistic that was used for the calibration of soil MIR spectra. The high RMSE values especially in the case of Ca can be attributed to under prediction by MIR due to low calcium abundant in all our sampling locations. Calibration of Mehlich extraction of Al was good (r2=0.85 and RMSE=0.51).Calibration of P was satisfactory (r2=0.78 and RMSE=71.25). However, poor performance was achieved for particle size distribution (clay r2 = 0.59 and RMSE = 9.9; Silt r2 = 0.63 and RMSE = 7.3 and even worse for sand (r2= 0.30 and RMSE=5). We suspect errors during the processing of the samples for LDPSA. Janik et al. (2016) reported poor results for sand and clay and after fine grinding of the samples good predictions (r2 = 0.88, for clay; r2 = 0.84, for sand) were achieved. Spatial interpolation of soil properties SOC, TN, pH and P were selected for spatial interpolation because they present important agronomic soil properties that can be restored through good soil management practices. Moderate spatial dependence was exhibited by all the soil properties (Table 2.). Spherical model and OK resulted in spatial distribution maps (Figure 6.). The performance of spatial prediction for TN and pH was satisfactory based on the cross validation results (Table 3.). Low accuracy was registered for P (RMSE=I.11). Soil type data layers and land management practices were not used as explanatory variables during the prediction process yet they influence P availability. Inventories for land management practices were unfortunately not available and soil type data layers are only available in the KENSOTER database with limitations highlighted earlier. Prediction of SOC (RMSE=0.81) was not as accurate compared to results by Wen et al. (2015). Based on zuo et al. (2008) earlier mentioned guidelines for classifying spatial structure of data, a nugget to sill ratio of 0.46 (Table 2) for SOC means a moderate spatial dependency of data. According to Kravchenko et al. (2006a), strong dependency of spatial data is a prerequisite for good performance of a spatial prediction model. Kravchenko et al. (2006a) reported strong spatial structure of data for SOC. But in his research samples were collected only 100 m apart. In our study the total sampling area was 1200 km2 and sampling distance was > 100 m. Differences in sampling distance affect the strength of association soil properties between the sampling points and ultimately influence the spatial structure of data. Land management practices influence the spatial structure of SOC data Kravchenko et al. (2006a). Different land use practices were evident during the soil sampling campaign and these have strong influence on soil properties like SOC. Those areas with tea plantations had better nutrient management practices than in the areas on subsistence food crop farming. Absence of land management explanatory variables as an input to the model might have weakened our model for SOC prediction.
Inferences from predicted soil properties
Our results show the soils were acidic with pH range of 3.9 – 5.3 (Table 4.). The causes and effects of acidification on soils are well explained by Goulding, 2016. Spatial distribution of pH values (low to high values) shows a west- east tendency (Figure 6.). High rainfall events in high altitudes in the west result in high leaching and extremely low pH. The spatial distribution map show that areas with Andosol reference soil group were associated with extreme low pH except within the tea management areas. The far east of the study area is characterised by less rainfall, less leaching, less or no andic material and less acidic pH values.
There was no clear spatial tendency of P (Figure 6). The range for P (Table 4.) was 3.15 mg/kg and 11.63 mg/kg, where 43.7% of total samples did not meet the threshold of 5mg/kg (Okalebo et al., 2006). The high phosphate retention may be due existence of andic material occasioned by presence of Andosol reference soil groups in higher altitudes.
There was a west-east tendency of SOC and TN distribution in the study area (Figure 6.). Land management had significant contribution to the variations of SOC and TN with highest values in tea plantations. The minimum and maximum values were 3.6 g/kg and 57.2 g/kg (Table 4.) and 93.07% did not meet the threshold of 20 g/kg using the assessment of (Loveland & Webb, 2003). TN results show a minimum 0.4 g/kg and Max 6.8 g/kg but only 15% of the total sampling area met the threshold of 2.0 g/kg based on suggestions from Ndakidemi & Semoka (2006).
CONCLUSION
Combining Conditional Latin Hypercube Sampling, Mid Infrared spectroscopy, and Random Forest Regression (RF), soil properties were satisfactorily predicted except for the soil particle distribution. The results of the linear regression show strong positive association between the measured and predicted values of soil properties. Out of the 269 samples, only 30 to 32 samples were selected for calibration and analysed through tedious and costly laboratory procedures. Laboratory measurements were calibrated to MIR spectra using RF regression and soil properties were simultaneously predicted with satisfactorily results. This supports the fact that our methodology is rapid, cost effective and environmentally friendly as compared to the dense sampling and intense wet chemistry laboratory procedures. Geostatistical techniques were used to enable spatial interpolation of soil properties in unsampled locations. Spherical model was selected as the most robust while fitting a semovariogram for geostatistical analysis of our data. Cross validations result show that SOC and P did not achieve high prediction accuracy because important explanatory variables like land management were missing during the prediction procedure due to lack of data. Moderate spatial dependency of the test soil properties may further explain the low performance of our prediction model because good performance of prediction models largely depends on strong spatial dependence of data. Spatial distribution maps of soil properties are in a relevant management scale and are extremely useful in planning and prioritising soil fertility restorative activities in the study area. Moreover, they form a good monitoring network that has not been there before for this study area. We recommend more research to improve calibration and prediction of soil particle distribution as soil texture is an important parameter in soil management. Development of land management inventories is required to enhance predictions of soil properties for more informed land use decision making.
ACKNOWLEDGEMENTS
This work was financed by the Hungarian Scientific Research Fund (project 113171.). The authors are indebted to the Kenya Agricultural and Livestock Research Organization for logistical support during the sampling campaign.
2017-5-3-1493818360