Machine Learning Approaches to Corn Yield Estimation Using Satellite Images and Climate Data: A Case of Iowa State
Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography. 2016. Aug, 34(4): 383-390
Copyright © 2016, Korean Society of Surveying, Geodesy, Photogrammetry and Cartography
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
  • Received : July 22, 2016
  • Accepted : August 23, 2016
  • Published : August 31, 2016
About the Authors
Nari, Kim
Division of Earth Environmental System Science, Pukyong National University
Yang-Won, Lee
Corresponding Author, Department of Spatial Information Engineering, Pukyong National University

Abstract
Remote sensing data have been widely used to estimate crop yields with statistical methods such as regression models. Machine learning, an efficient empirical method for classification and prediction, is another approach to crop yield estimation. This paper describes corn yield estimation in the state of Iowa using four machine learning approaches: SVM (Support Vector Machine), RF (Random Forest), ERT (Extremely Randomized Trees) and DL (Deep Learning). Comparisons of their validation statistics are also presented. To examine the seasonal sensitivities of the corn yields, three period groups were set up: (1) MJJAS (May to September), (2) JA (July and August) and (3) OC (optimal combination of months). Overall, the DL method showed the highest accuracies in terms of the correlation coefficient for the three period groups. The accuracies were relatively favorable in the OC group, which indicates that the optimal combination of months can be significant in statistical modeling of crop yields. The differences between our predictions and USDA (United States Department of Agriculture) statistics were about 6-8 %, which shows that machine learning approaches can be a viable option for crop yield modeling. In particular, DL showed more stable results by overcoming the overfitting problem of generic machine learning methods.
1. Introduction
Monitoring crop yield is important for many agronomic issues such as farm management, food security and international crop trade. Because South Korea depends heavily on imports of most major grains except rice, reliable estimates of crop yields are increasingly needed under recent conditions of climate change and various disasters.
Remote sensing data have been widely used to estimate crop yields with statistical methods such as regression models. Prasad et al. (2006) conducted multivariate regression analyses to estimate corn and soybean yields in Iowa using MODIS (Moderate Resolution Imaging Spectroradiometer) NDVI (Normalized Difference Vegetation Index), climate factors and soil moisture. Ren et al. (2008) presented regression models for the estimation of winter wheat yields using MODIS NDVI and weather data in Shandong, China. Kim et al. (2014) estimated corn and soybean yields using several MODIS products and climatic variables for the Midwestern United States (US) and reported prediction errors of about 10 %. Hong et al. (2015) built multiple regression models using MODIS NDVI and weather data to estimate rice yields in North Korea and reported an RMSE of 0.27 ton/ha. Most previous studies are based on multivariate regression analysis of the relationship between crop yields and agro-environmental factors such as vegetation indices, climate variables and soil properties.
Machine learning, an efficient empirical method for classification and prediction, is another approach to crop yield estimation. Jiang et al. (2004) adopted an ANN (Artificial Neural Network) technique to estimate winter wheat yields using an AVHRR (Advanced Very High Resolution Radiometer) dataset, and the ANN model showed higher accuracy than multivariate regression models. Jaikla et al. (2008) estimated rice yields using SVM (Support Vector Machine) and compared the result with a simulation by the DSSAT (Decision Support System for Agrotechnology Transfer) model, which showed similar performance. Kuwata and Shibasaki (2015) employed DL (Deep Learning) methods to estimate corn yields for Illinois and showed that DL achieved higher accuracy than SVM. Despite the predictive power of machine learning techniques, their applications to crop yield estimation remain relatively scarce, and comparative studies among various machine learning methods for crop yield estimation have not yet been reported.
The objective of this study is to estimate crop yields by employing several major machine learning techniques, namely SVM, RF (Random Forest), ERT (Extremely Randomized Trees) and DL, and to compare their validation statistics. We used satellite images from MODIS and the climate reanalysis data created by PRISM (Parameter-Elevation Regressions on Independent Slopes Model) for the machine learning analyses. To improve the prediction accuracies by accounting for phenology effects, we set up three types of data period: (1) May to September, (2) July and August and (3) an optimal combination of the months.
2. Data and Method
- 2.1 Study area
Iowa is a state in the Midwestern US and belongs to the Corn Belt (Fig. 1). Iowa produces approximately 18 % of US corn, ranking first among the states (USDA, 2012). Of the 99 counties of Iowa, we selected the 94 counties in which cropland exceeded 10 % of the county area. The study period is 2004 to 2014, as determined by data availability.
Fig. 1. Study area
- 2.2 Data
- 2.2.1 Remote sensing data
Satellite remote sensing data were acquired from NASA (National Aeronautics and Space Administration) and the ESA (European Space Agency) CCI (Climate Change Initiative). The Terra/MODIS products from NASA, namely NDVI, EVI (Enhanced Vegetation Index), LAI (Leaf Area Index), FPAR (Fraction of Photosynthetically Active Radiation), GPP (Gross Primary Production) and ET (Evapotranspiration), are closely related to crop yields. In addition, the SM (Soil Moisture) dataset was obtained from ESA CCI, which produces the most complete and consistent global soil moisture data on a 0.25° grid using active and passive microwave sensors. Table 1 summarizes the datasets used. Previous studies (Prasad et al., 2006; Na et al., 2014; Kim et al., 2014) showed that these variables are associated with corn yield.
Table 1. Summary of dataset used in this study
- 2.2.2 Climate data
The PRISM Climate Group (http://www.prism.oregonstate.edu/) provides daily and monthly reanalyses of seven climate elements for the US: precipitation (PPT), maximum temperature (Tmax), minimum temperature (Tmin), mean temperature (Tmean), mean dew point temperature (TDmean), minimum vapor pressure deficit (VPDmin) and maximum vapor pressure deficit (VPDmax). We used monthly data for PPT, Tmax, Tmin and Tmean at 4-km resolution.
- 2.2.3 Crop yield data
As a reference dataset, county-level corn yield statistics were obtained from the NASS (National Agricultural Statistics Service) of the USDA (United States Department of Agriculture) (http://quickstats.nass.usda.gov). The corn yield unit (bushels per acre) was converted to tons per hectare for convenience.
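The paper does not state the conversion factor it used. A minimal sketch of the unit conversion, assuming the standard US bushel weight of 56 lb for shelled corn and metric tons, could look as follows; the function name and the sample value are illustrative only.

```python
# Sketch: convert corn yield from bushels per acre to metric tons per hectare.
# Assumptions (not stated in the paper): 1 bushel of shelled corn = 56 lb,
# 1 lb = 0.45359237 kg, 1 acre = 0.40468564224 ha.
LB_PER_BUSHEL_CORN = 56.0
KG_PER_LB = 0.45359237
HA_PER_ACRE = 0.40468564224


def bu_per_acre_to_t_per_ha(yield_bu_ac: float) -> float:
    """Return the yield in metric tons per hectare."""
    kg_per_acre = yield_bu_ac * LB_PER_BUSHEL_CORN * KG_PER_LB
    return kg_per_acre / HA_PER_ACRE / 1000.0


if __name__ == "__main__":
    # Hypothetical county yield of 170 bu/acre.
    print(round(bu_per_acre_to_t_per_ha(170.0), 2))
```

Under these assumptions one bushel per acre corresponds to roughly 0.063 ton/ha.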
- 2.2.4 Data processing
Because the cropland area of each county had to be determined first, we extracted the pixels that were recorded as cropland (land cover ID = 12) throughout the period of 2004-2014 from the MODIS land cover data. Fig. 2 shows that the distribution of the cropland pixels is similar to the pattern of the major corn-producing counties in Iowa. For these cropland pixels, we constructed a database of satellite images and climate variables. Because the crop yield statistics are aggregated by county, the satellite and climate data also needed to be averaged at the county level. We employed a zonal operation to summarize the pixel values for each county.
Fig. 2. (a) Corn yields by county and (b) cropland pixels derived from MODIS land cover data (Iowa State in the dashed line)
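The cropland masking and zonal averaging described above can be illustrated with a small sketch. It assumes that all grids have already been resampled to a common pixel grid and are held as nested lists; the function names and the toy values are hypothetical and do not reproduce the authors' actual processing chain.

```python
from collections import defaultdict

CROPLAND_ID = 12  # MODIS land cover class used for cropland in the paper


def persistent_cropland(landcover_by_year, cropland_id=CROPLAND_ID):
    """Return a mask that is True where a pixel is cropland in every year."""
    first = landcover_by_year[0]
    return [
        [all(grid[i][j] == cropland_id for grid in landcover_by_year)
         for j in range(len(first[0]))]
        for i in range(len(first))
    ]


def zonal_mean(values, zones, mask):
    """Average pixel values per zone (county), using only masked-in pixels.

    values, zones and mask are equally shaped 2-D lists. Pixels whose value
    is None are treated as missing and skipped.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v_row, z_row, m_row in zip(values, zones, mask):
        for v, z, m in zip(v_row, z_row, m_row):
            if m and v is not None:
                sums[z] += v
                counts[z] += 1
    return {z: sums[z] / counts[z] for z in counts}


if __name__ == "__main__":
    # Toy 2 x 3 grids: two yearly land cover maps, county IDs and NDVI.
    landcover = [[[12, 12, 12], [12, 12, 12]],
                 [[12, 12, 12], [10, 12, 12]]]
    county = [[1, 1, 2], [1, 2, 2]]
    ndvi = [[0.8, 0.6, 0.4], [0.7, None, 0.2]]
    mask = persistent_cropland(landcover)
    print(zonal_mean(ndvi, county, mask))
```

A county with no valid cropland pixel simply does not appear in the returned dictionary.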
Various environmental factors related to crop yields can differ in their sensitivity to the growing season. Hence, we derived 13 month combinations for calculating the correlation coefficients (Table 2): MJJAS (from May to September), each individual month between May and September (May, Jun, Jul, Aug and Sep), two successive months (MJ, JJ, JA and AS) and three successive months (MJJ, JJA and JAS). From these combinations, we selected three period groups: (1) MJJAS for the whole growing season, (2) JA as the group with the highest correlation coefficients for most variables and (3) OC for the optimal combination of the periods in terms of the correlation coefficient (shaded in gray in Table 2). To estimate the corn yield in the 94 counties in Iowa, we built a matchup database consisting of 11 input variables from satellite images (NDVI, EVI, LAI, FPAR, GPP, ET and SM) and the climate dataset (PPT, Tmin, Tmax and Tmean) for the three period groups between 2004 and 2014.
Table 2. Correlation coefficients of the variables against corn yields, 2004-2014
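The 13 month combinations and the per-variable choice of the best period can be sketched as follows. Two points are assumptions rather than statements from the paper: each variable is averaged over the months of a window, and the "optimal" window is the one with the largest absolute correlation. All function names are illustrative.

```python
from math import sqrt

MONTHS = ["May", "Jun", "Jul", "Aug", "Sep"]


def month_windows(months=MONTHS, lengths=(1, 2, 3, 5)):
    """Enumerate successive-month windows: 5 + 4 + 3 + 1 = 13 combinations."""
    windows = []
    for n in lengths:
        for start in range(len(months) - n + 1):
            windows.append(tuple(months[start:start + n]))
    return windows


def pearson_r(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def best_window(monthly, yields):
    """Pick the window whose mean value correlates best with yield.

    monthly maps a month name to a list of values, one per county-year
    sample, aligned with the yields list.
    """
    best = None
    for window in month_windows():
        mean_values = [
            sum(monthly[m][i] for m in window) / len(window)
            for i in range(len(yields))
        ]
        r = pearson_r(mean_values, yields)
        if best is None or abs(r) > abs(best[1]):
            best = (window, r)
    return best


if __name__ == "__main__":
    for w in month_windows():
        print("-".join(w))
```

Running `best_window` once per input variable would give one period per variable, which corresponds to the OC group as read here.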
- 2.3 Methods
- 2.3.1 Support vector machine
SVM is a powerful technique for general classification that can reduce the classification error of existing machine learning techniques (Vapnik, 1998). For estimation or prediction, regression methods are combined with each classified group. SVM finds the optimal separating classifier between two classes by maximizing the margin between support vectors, using kernel functions such as the linear, Gaussian RBF (Radial Basis Function), polynomial and hyperbolic tangent kernels (Cortes and Vapnik, 1995; Karatzoglou et al., 2006). The Gaussian RBF kernel was used in our experiment.
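For reference, the Gaussian RBF kernel is commonly written as k(x, x') = exp(-gamma * ||x - x'||^2). The sketch below evaluates it for plain Python lists; the value of gamma is hypothetical, since the paper does not report its kernel hyperparameters.

```python
from math import exp


def rbf_kernel(x, y, gamma):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * sq_dist)


def kernel_matrix(samples, gamma):
    """Symmetric matrix of pairwise kernel values between samples."""
    return [[rbf_kernel(a, b, gamma) for b in samples] for a in samples]


if __name__ == "__main__":
    # Three hypothetical, already standardized feature vectors.
    samples = [[0.0, 0.0], [1.0, 1.0], [3.0, 0.5]]
    for row in kernel_matrix(samples, gamma=0.5):
        print([round(v, 3) for v in row])
```

The kernel equals 1 for identical inputs and decays toward 0 as samples move apart, so gamma controls how local the fitted function is.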
- 2.3.2 Random forest
RF, an improved version of CART (Classification and Regression Trees), is an ensemble method using bootstrap aggregating (Breiman, 2001). RF builds decision trees from random samples drawn from the training data and predicts results by voting (classification) or averaging (regression) over a large number of trees (Ali et al., 2012). In our experiment, the number of trees was 500, and the number of variables used for splitting nodes was set to n/3 (n = number of input variables). In addition, the out-of-bag error was used as the criterion of model suitability.
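Two of the settings above can be made concrete with a short sketch: the number of candidate variables per split and the bootstrap sample with its out-of-bag set. Rounding n/3 down (with a minimum of 1) is an assumption, as the paper gives only "n/3"; the function names are illustrative.

```python
import random


def mtry_for_regression(n_features):
    """Candidate variables per split: n/3, rounded down, at least 1 (assumed)."""
    return max(1, n_features // 3)


def bootstrap_with_oob(n_samples, rng):
    """Draw a bootstrap sample (with replacement) and list the out-of-bag rows."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


if __name__ == "__main__":
    rng = random.Random(0)
    print("variables tried per split:", mtry_for_regression(11))
    in_bag, oob = bootstrap_with_oob(20, rng)
    print("out-of-bag rows for this tree:", oob)
```

Each tree is evaluated on the rows it never saw, and averaging those errors over all trees gives the out-of-bag error used as the model suitability criterion.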
- 2.3.3 Extremely randomized trees
ERT is an ensemble method using unpruned decision trees. ERT differs from other tree-based ensemble methods such as RF in that it splits nodes by choosing cut-points at random and that it uses the complete learning sample (no bootstrap copying) to grow the trees (Geurts et al., 2006). Such randomization is motivated by bias-variance analysis such as that of Friedman (1997). Randomization increases the bias and variance of individual trees, but these effects can be attenuated by averaging over a sufficiently large ensemble of trees. In our experiment, the number of trees and the number of variables used for splitting nodes were set to the same values as those of RF.
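The node-splitting rule that distinguishes ERT from RF can be sketched for the regression case: a few features are drawn at random, one cut-point is drawn uniformly between the minimum and maximum of each, and the candidate with the largest variance reduction is kept. This is a simplified illustration of the idea in Geurts et al. (2006), not the implementation used in the paper; names and toy data are hypothetical.

```python
import random


def variance(values):
    """Population variance of a list (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def pick_extra_tree_split(rows, targets, k, rng):
    """Return (score, feature index, cut-point) or None if no split is possible."""
    n_features = len(rows[0])
    best = None
    for f in rng.sample(range(n_features), k):
        column = [row[f] for row in rows]
        lo, hi = min(column), max(column)
        if lo == hi:
            continue  # a constant feature cannot split the node
        cut = rng.uniform(lo, hi)  # random cut-point, not an optimized one
        left = [t for v, t in zip(column, targets) if v < cut]
        right = [t for v, t in zip(column, targets) if v >= cut]
        if not left or not right:
            continue
        weighted = (len(left) * variance(left)
                    + len(right) * variance(right)) / len(targets)
        score = variance(targets) - weighted  # variance reduction
        if best is None or score > best[0]:
            best = (score, f, cut)
    return best


if __name__ == "__main__":
    rng = random.Random(1)
    rows = [[0.2, 5.0], [0.4, 5.0], [0.6, 5.0], [0.8, 5.0]]
    targets = [8.0, 8.5, 10.5, 11.0]
    print(pick_extra_tree_split(rows, targets, k=2, rng=rng))
```

Because every tree is grown on the full learning sample, the diversity of the ensemble comes only from the random features and cut-points.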
- 2.3.4 Deep learning
DL is a machine learning method similar to ANN but capable of processing large, complicated input data by learning tasks with a feed-forward multi-layer network (Ali et al., 2015). The training process of DL usually consists of pre-training and fine-tuning. Pre-training is a phase of unsupervised learning that reduces the generalization error of trained deep architectures. Fine-tuning by supervised learning is then performed to reduce the classification error (Erhan et al., 2010). Our experiment used a 200×200 multi-layer network.
- 2.3.5 Validation
The leave-one-year-out cross-validation, also known as the Jackknife, was conducted to examine the accuracies of the corn yield estimation by machine learning methods. We calculated the mean bias, MAE (Mean Absolute Error), RMSE (Root-Mean-Square Error), MAPE (Mean Absolute Percentage Error) and the correlation coefficient (r) between the observed and predicted yields during the period of 2004-2014.
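The validation scheme and the statistics named above can be sketched as follows. The sign convention for the mean bias (predicted minus observed) is an assumption, and the function names are illustrative; model fitting itself is left out.

```python
from math import sqrt


def leave_one_year_out(years):
    """Yield (held_out_year, train_indices, test_indices) for each distinct year."""
    for held_out in sorted(set(years)):
        train = [i for i, y in enumerate(years) if y != held_out]
        test = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train, test


def validation_stats(observed, predicted):
    """Mean bias, MAE, RMSE, MAPE (%) and Pearson r.

    Assumes non-zero observed values and non-constant inputs.
    """
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mo, mp = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    return {
        "bias": sum(errors) / n,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": sqrt(sum(e * e for e in errors) / n),
        "mape": 100.0 * sum(abs(e) / abs(o) for e, o in zip(errors, observed)) / n,
        "r": cov / sqrt(var_o * var_p),
    }


if __name__ == "__main__":
    # Hypothetical yields in ton/ha for three counties.
    print(validation_stats([10.0, 12.0, 8.0], [11.0, 11.0, 9.0]))
    years = [y for y in range(2004, 2015) for _ in range(2)]
    print(sum(1 for _ in leave_one_year_out(years)), "folds")
```

For the 2004-2014 study period the generator produces 11 folds, one per held-out year, and the per-fold statistics can then be averaged.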
3. Results and Discussion
We implemented the machine learning methods (SVM, RF, ERT and DL) using R libraries (https://www.r-project.org/). We first estimated the corn yields using the MJJAS dataset for the whole growing season and compared the results with the USDA yield statistics. The leave-one-year-out cross-validation produced 11 sets of validation results, one for each year between 2004 and 2014. Table 3 shows the averages of the 11-year validation results in terms of the mean bias, MAE, MAPE, RMSE and r. Fig. 3 shows the scatter plots of the predicted corn yields against USDA statistics between 2004 and 2014. According to the results, DL achieved the highest accuracy, with a correlation coefficient of 0.776 and an RMSE of 0.844 ton/ha, although three methods (RF, ERT and DL) presented similar accuracies. In particular, RF and ERT showed very similar results, with correlation coefficients of 0.651 and 0.654 and RMSEs of 0.879 and 0.891 ton/ha, respectively. This is because both approaches are based on regression trees, even though their randomization strategies for tree splitting differ somewhat. SVM showed the lowest accuracy, with a correlation coefficient of 0.560 and an RMSE of 0.959 ton/ha.
Table 3. Validation statistics for the period group MJJAS (May to September)
Fig. 3. Scatter plots for observed vs. predicted corn yields, 2004-2014 (red dots: 2012, black dots: all years except for 2012)
Tables 4 and 5 show the 11-year averaged statistics for JA and OC, respectively. Comparing the results of the three period groups (MJJAS, JA and OC), the correlation coefficients for SVM were almost the same (MJJAS=0.590, JA=0.575, OC=0.606), but the RMSE of OC (0.852 ton/ha) was somewhat better than those of MJJAS (0.959 ton/ha) and JA (0.936 ton/ha). For RF and ERT, the correlation coefficients (JA=0.774 and 0.774, OC=0.772 and 0.785, respectively) and the RMSEs (JA=0.803 and 0.802 ton/ha, OC=0.767 and 0.756 ton/ha, respectively) were similar for JA and OC, and both were better than those of MJJAS. Hence, it is notable that the seasonal sensitivities of corn yields were well captured by the RF and ERT methods. The DL method produced the highest accuracies for all three period groups in terms of the correlation coefficient (MJJAS=0.776, JA=0.796 and OC=0.800).
Table 4. Validation statistics for the period group JA (July and August)
Table 5. Validation statistics for the period group OC (optimal combination of months)
Moreover, DL presented more stable results in the scatter plots, while the other three methods showed a tendency toward overfitting. Machine learning techniques such as SVM, RF and ERT can suffer from overfitting, which occurs when a model is very complex, with many parameters, and predicts poorly because it overreacts to minor fluctuations in the dataset. The red dots in Fig. 3 are the cases of 2012, when an extreme drought occurred in the Midwestern US. The models used to predict 2012 (that is, the Jackknife models built from the data of all years except 2012) were fitted too closely to non-drought years, so they could not predict the corn yield under conditions of abrupt drought. However, the DL method can overcome the overfitting problem through a pre-training process based on unsupervised learning (Erhan et al., 2010). Fig. 3(j), 3(k) and 3(l) for the DL method show that the red dots for 2012 lie closer to the 1:1 line.
4. Conclusions
This paper described the estimation of corn yields in the state of Iowa using four machine learning techniques (SVM, RF, ERT and DL) and compared their validation statistics. We set up three period groups (MJJAS, JA and OC) to examine the seasonal sensitivities of the corn yields. Overall, the DL method showed the highest accuracies in terms of the correlation coefficient for all the period groups. The accuracies were relatively favorable in the OC group, which indicates that an optimal combination of months can be influential in statistical modeling of crop yields. The differences between our predictions and the USDA statistics were about 6-8 %, which shows that machine learning approaches can be a viable option for crop yield modeling. In particular, DL showed more stable results by overcoming the overfitting problem of generic machine learning methods. To exploit the temporal characteristics of crop yields, time-series machine learning techniques such as RNN (Recurrent Neural Network) are worth pursuing in future work. A sensitivity test that examines the contribution of climate change to crop yields by including or excluding the climate variables is another topic for future work.
Acknowledgements
This work was supported by the Research Grant of Pukyong National University (2015).
References
Ali I., Greifeneder F., Stamenkovic J., Neumann M., Notarnicola C. (2015) Review of machine learning approaches for biomass and soil moisture retrievals from remote sensing data. Remote Sensing, 7(12), 16398-16421. DOI: 10.3390/rs71215841
Ali J., Khan R., Ahmad N., Maqsood I. (2012) Random forests and decision trees. International Journal of Computer Science Issues, 9(5), 272-278.
Breiman L. (2001) Random forests. Machine Learning, 45(1), 5-32. DOI: 10.1023/A:1010933404324
Cortes C., Vapnik V. (1995) Support-vector networks. Machine Learning, 20(3), 273-297.
Erhan D., Bengio Y., Courville A., Manzagol P.A., Vincent P. (2010) Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11, 625-660.
Friedman J.H. (1997) On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1, 55-77. DOI: 10.1023/A:1009778005914
Geurts P., Ernst D., Wehenkel L. (2006) Extremely randomized trees. Machine Learning, 63(1), 3-42. DOI: 10.1007/s10994-006-6226-1
Hong S.Y., Na S.I., Lee K.D., Kim Y.S., Baek S.C. (2015) A study on estimating rice yield in DPRK using MODIS NDVI and rainfall data. Korean Journal of Remote Sensing, 31(5), 441-448 (in Korean with English abstract). DOI: 10.7780/kjrs.2015.31.5.8
Jaikla R., Auephanwiriyakul S., Jintrawet A. (2008) Rice yield prediction using a support vector regression method. Proceedings of Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology 2008, Krabi, Thailand, 14-17 May, 908-913.
Jiang D., Yang X., Clinton N., Wang N. (2004) An artificial neural network model for estimating crop yields using remotely sensed information. International Journal of Remote Sensing, 25(9), 1723-1732. DOI: 10.1080/0143116031000150068
Karatzoglou A., Meyer D., Hornik K. (2006) Support vector machines in R. Journal of Statistical Software, 15(9), 1-28.
Kim N., Cho J., Shibasaki R., Lee Y.W. (2014) Estimation of corn and soybean yields of the US Midwest using satellite imagery and climate dataset. Journal of Climate Research, 9(4), 315-329 (in Korean with English abstract). DOI: 10.14383/cri.2014.9.4.315
Kuwata K., Shibasaki R. (2015) Estimating crop yields with deep learning and remotely sensed data. Proceedings of 2015 IEEE International Geoscience and Remote Sensing Symposium, Milan, Italy, 26-31 July, 858-861.
Na S., Hong S., Kim Y., Lee K. (2014) Estimation of corn and soybean yields based on MODIS data and CASA model in Iowa and Illinois, USA. Korean Journal of Soil Science and Fertilizer, 47(2), 92-99 (in Korean with English abstract). DOI: 10.7745/KJSSF.2014.47.2.092
Prasad A.K., Chai L., Singh R.P., Kafatos M. (2006) Crop yield estimation model for Iowa using remote sensing and surface parameters. International Journal of Applied Earth Observation and Geoinformation, 8, 26-33. DOI: 10.1016/j.jag.2005.06.002
Ren J.Q., Chen Z.X., Zhou Q.B., Tang H.J. (2008) Regional yield estimation for winter wheat with MODIS-NDVI data in Shandong, China. International Journal of Applied Earth Observation and Geoinformation, 10, 403-413. DOI: 10.1016/j.jag.2007.11.003
USDA (2012) Census of agriculture. United States Department of Agriculture, https://www.agcensus.usda.gov/
Vapnik V. (1998) Statistical Learning Theory. Wiley, New York, NY.