White mold of soybean (Glycine max), also known as Sclerotinia stem rot (SSR), is among the most important fungal diseases affecting soybean yields and represents a recurring annual threat to soybean production in South Dakota. Initially reported in Poland in 1982 as a disease of local importance, white mold was, more than a decade later, ranked among the top ten diseases that suppress soybean yields. The apothecia of white mold generally appear after the crop canopy develops, around mid to late July, and the environmental conditions that favor the development of white mold are cool (air temperature around 12˚C - 24˚C) and wet (enough rain for 70 - 120 hours of continuous wetness). These conditions are also favorable for optimal yield, so the disease is more likely to develop where yield potential is high, and the incidence of white mold has been negatively correlated with yields. Mapping and quantifying the disease is therefore crucial to understanding its impact on yields, and two options are available: field scouting provides an accurate assessment but is time-consuming and does not give a global view of the variations in the field, while remote sensing is better suited because it provides a synoptic view and allows observations to span large areas in a short period.
The rationale behind the use of large-scale imagery techniques is that they are fast and non-destructive, and rely on biophysical characteristics that depend on the wavelength used for crop status monitoring. Malthus and Madeira highlighted the value of using images to detect crop diseases by examining the spectral leaf reflectance properties of field bean infected by the fungus Botrytis fabae. Later, Polischuk et al. studied the correlation between chlorophyll content and spectral reflectance in virus-affected plants. In the 2000s, several authors explored diverse options for disease detection: Kobayashi et al. used multispectral radiometers and an airborne multispectral scanner to identify rice panicle blast. Qin and Zhang collected ADAR (Airborne Data Acquisition and Registration) remote sensing images to map rice sheath blight. Further, Huang and Apan used a portable spectroradiometer to collect hyperspectral data and detect Sclerotinia rot disease in celery. Naidu et al. later identified grapevine viral infections using leaf spectral reflectance collected with a portable spectrometer. Hyperspectral images are necessary to characterize plant stress, and spectral indices are crucial in detecting and identifying plant diseases. However, most of these studies required a portable spectroradiometer or airborne remotely sensed images, which are costly and not readily accessible to common users and farmers.
While vegetation stress has received a lot of scientific attention, soybean stress mapping has received little, and the existing studies focused either on diseases other than white mold or on water stress. Vigier, Pattey and Strachan used hyperspectral reflectance to compute several vegetation indices to detect white mold, but the study focused on inoculated disease rather than in-situ observation, and reflectance was collected using a field spectrometer. Recent studies have focused on mapping soybean at national scale, but these efforts have not addressed disease detection. In South Dakota, which is one of the main soybean-producing states, no studies have used remote-sensing approaches to quantify soybean diseases, especially white mold.
There is still a knowledge gap regarding the effectiveness of free-of-charge, moderate-resolution remotely sensed images such as Landsat in accurately mapping crop diseases, especially the occurrence and evolution of white mold in the Midwest. The current study employs free Landsat 8 images to map and quantify white mold in selected counties in South Dakota. Random forest (RF) classifiers were used to extract the spectral characteristics of soybean and white mold, leading to a map of the spatial extent of the disease.
2. Materials and Methods
2.1. Study Area and Data Gathering
The study area is located in northeastern South Dakota and includes three counties: Marshall, Day and Codington. Soybeans are planted in South Dakota between May 8 and June 21, with the most active period between May 15 and June 11. The harvest occurs between September 22 and November 3, with the most active period between September 28 and October 24. Field data consisted of scouting and reporting on the presence or absence of white mold during July and August 2017. In the study area, a total of 11 fields were scouted in which white mold was reported and confirmed, as shown in Figure 1.
We downloaded 30-meter spatial resolution Landsat Analysis Ready Data (ARD) from EarthExplorer (https://earthexplorer.usgs.gov/) for the 2017 growing season, covering the three counties in northeastern South Dakota (Marshall, Day, and Codington), as shown in Figure 1. The cloud-free images were acquired on May 11, July 14, and August 31, and were derived from Landsat Collection 1 Level-1 precision and terrain-corrected scenes. The ARD consist of Top-of-Atmosphere (TOA) reflectance, Surface Reflectance (SR), Brightness Temperature (BT) and Quality Assessment (QA) products. In our study, the product of interest was SR, and the selected bands are summarized in Table 1. Landsat images were particularly hard to obtain during the growing season because persistent clouds often stretch the interval between usable acquisitions beyond the 16-day revisit period of Landsat. As a result, only two Landsat images (May and July) were available for soybean classification and one image (August) for white mold mapping.
The Crop Data Layer (CDL) is a land cover dataset developed by the National Agricultural Statistics Service (NASS) of the United States Department of Agriculture (USDA). This dataset can be used to extract soybean masks or other land cover of interest; however, the timing of the CDL publication might not always match the need to map land cover within the growing season. The CDL is generally produced early in the year, for the land cover map of the previous year. We used the CDL as reference data in our study to guide the collection of training samples for land cover mapping. These data also served in the comparison with our resulting land cover map.
Figure 1. Study area showing the three counties (Marshall, Day, and Codington) in northeastern South Dakota and the training polygons. The background image is a Landsat false color combination of bands 6-5-4 for July 14, 2017.
Table 1. Original Landsat 8 bands including the Shortwave Infrared (SWIR), the Near Infrared (NIR), the red (RED), the green (GREEN) and the blue (BLUE) bands, and their corresponding names used in the Random Forest (RF) classification, and in the stacked image.
2.2. Random Forest (RF) Classifiers for Mapping Soybean and White Mold
2.2.1. The Random Forest Algorithm for Image Classification
Methods that produce many classifiers and aggregate their results have recently attracted much interest in the machine learning field. The underlying principle is the same: based on a set of training samples used to extract the spectral characteristics of the defined classes, these non-parametric classifiers (meaning that they require no statistical assumptions, such as a normal distribution of the input dataset) build models that decide which class to assign to each observation. Among them are boosting, in which successive trees give extra weight to samples that were incorrectly predicted by earlier predictors, and bagging, in which successive trees are independent of earlier trees. At the end of the prediction process, a weighted vote is taken in boosting, while a simple majority vote is taken in bagging.
The RF algorithm is a learning method that adds a further layer of randomness to bagging: each node is split using the best among a subset of predictors randomly chosen at that node, unlike a standard Decision Tree (DT), where each node is split using the best split among all variables. In the remote sensing field, especially in image or land cover classification, RF has been shown to perform on par with the Support Vector Machine (SVM) or to outperform the DT. Other studies have shown that RF outperformed SVM in terms of robustness and stability, and in terms of accuracy. RF is preferred in our study because it can deal with classification problems involving unbalanced, multiclass and small-sample data. In fact, when collecting training data, some classes may require more training samples than others in order to capture the maximum variability in their spectral differences. RF can handle this type of data collection without further processing.
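The two ingredients described above, bootstrap resampling (bagging) and a random subset of predictors at each split, can be illustrated with a deliberately small sketch. It is not the R implementation used in the study: each tree is reduced to a single split (a stump), the reflectance values are hypothetical, and the names `train_forest`, `predict` and `mtry` are chosen here for illustration only.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_stump(X, y, feature_ids):
    """Best single split among the candidate features (lowest weighted Gini)."""
    best = None
    for f in feature_ids:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    return best  # None if no candidate feature can be split


def train_forest(X, y, n_trees=25, mtry=2, seed=0):
    """Bagging plus a random subset of `mtry` predictors for each split."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        candidates = rng.sample(range(p), mtry)      # random predictor subset
        forest.append((best_stump(Xb, yb, candidates), majority(yb)))
    return forest


def predict(forest, row):
    """Simple majority vote over all trees."""
    votes = []
    for stump, fallback in forest:
        if stump is None:
            votes.append(fallback)
        else:
            _, f, t, left_label, right_label = stump
            votes.append(left_label if row[f] <= t else right_label)
    return majority(votes)


def with_ndvi(red, nir):
    """Feature vector (red, nir, ndvi) for one pixel."""
    return (red, nir, (nir - red) / (nir + red))


# Hypothetical surface reflectance values, for illustration only.
healthy = [(0.04, 0.50), (0.05, 0.52), (0.06, 0.48),
           (0.05, 0.55), (0.04, 0.51), (0.06, 0.49)]
mold = [(0.10, 0.30), (0.12, 0.28), (0.11, 0.32),
        (0.13, 0.35), (0.10, 0.29), (0.12, 0.31)]
X = [with_ndvi(r, n) for r, n in healthy + mold]
y = ["healthy"] * len(healthy) + ["white_mold"] * len(mold)

forest = train_forest(X, y)
print(predict(forest, with_ndvi(0.03, 0.60)))
print(predict(forest, with_ndvi(0.15, 0.25)))
```

A full implementation grows each tree to many nodes and redraws the predictor subset at every node; the stump keeps the sketch short while preserving the voting logic.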
2.2.2. Soybean Mapping and Validation
To classify land cover, we collected a set of training samples (183,810 pixels) in ArcMap to extract the spectral characteristics of the different classes. We trained four classes: Water, Corn, Soybean and Other Land Cover. To guide the digitizing of the training polygons, three types of information were displayed to better interpret the land cover: 1) Landsat-8 composites, 2) the Crop Data Layer (CDL) serving as a cross-reference, and 3) high-resolution Google Earth images. The quality of the training samples was evaluated using the Jeffries-Matusita (JM) spectral separability index, which provides a good means of estimating the difference between classes. This index is a distance-based measure of statistical separability for the two-class case, and can be extended to the separability of multiple classes. The JM distance between classes ωi and ωj is formulated as shown in Equation (1). In general, a JM greater than 1.9 represents a good difference, while a JM of less than 1 implies that the classes should be combined (no difference); a JM between 1 and 1.8 generally suggests that the training classes need improvement. The JM index was computed in ENVI.
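$$ J_{ij} = \int_{x} \left( \sqrt{p(x \mid \omega_i)} - \sqrt{p(x \mid \omega_j)} \right)^{2} \, dx \qquad (1) $$
where x is the feature vector of dimension k, and p(x | ωi) and p(x | ωj) are the class-conditional probability distributions of x.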
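As an illustration of how such a separability index behaves, the sketch below computes the JM distance for a single band under the additional assumption that each class is normally distributed, in which case the JM distance equals 2(1 - exp(-B)), with B the Bhattacharyya distance. The Gaussian assumption, the function names and the reflectance statistics are illustrative choices, not details taken from the ENVI computation used in the study.

```python
import math


def bhattacharyya_gaussian(mean_i, std_i, mean_j, std_j):
    """Bhattacharyya distance between two univariate normal distributions."""
    var_i, var_j = std_i ** 2, std_j ** 2
    mean_term = (mean_i - mean_j) ** 2 / (4.0 * (var_i + var_j))
    spread_term = 0.5 * math.log((var_i + var_j) / (2.0 * std_i * std_j))
    return mean_term + spread_term


def jm_distance(mean_i, std_i, mean_j, std_j):
    """Jeffries-Matusita distance, bounded between 0 and 2."""
    b = bhattacharyya_gaussian(mean_i, std_i, mean_j, std_j)
    return 2.0 * (1.0 - math.exp(-b))


# Hypothetical single-band reflectance statistics (mean, standard deviation).
identical = jm_distance(0.30, 0.05, 0.30, 0.05)   # same class statistics
distinct = jm_distance(0.10, 0.05, 0.90, 0.05)    # widely separated classes
print(identical, distinct)
```

Identical class statistics give a distance of essentially 0, and widely separated classes approach the upper bound of 2, which is why values above 1.9 are read as good separability.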
The training polygons were imported into R, and seventy percent of the pixels (128,667) were used to build the model while thirty percent (55,143) were used for validation. The bands of the two early images (May and July) were stacked using ENVI 5.0, and the resulting stacked image was classified using the RF algorithm in R. The ten Landsat bands (Table 1) were used as independent variables, while the land cover class to predict (four classes) was the response variable. The soybean mask was extracted from the resulting land cover classification map. The set-apart thirty percent of the samples were used to assess the accuracy of the land cover map. A confusion matrix was built to assess the accuracy of each class as well as the overall accuracy, and to estimate the classification errors.
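The 70/30 partition can be sketched as a random shuffle of pixel indices followed by a cut at seventy percent. The function name and the seed are illustrative assumptions; the study performed this step in R and does not describe the sampling scheme in more detail.

```python
import random


def split_train_validation(n_samples, train_fraction=0.7, seed=42):
    """Randomly partition sample indices into training and validation sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * train_fraction)
    return indices[:n_train], indices[n_train:]


# 183,810 training pixels, as reported for the land cover classification.
train_idx, valid_idx = split_train_validation(183_810)
print(len(train_idx), len(valid_idx))  # 128667 55143
```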
2.2.3. White Mold Mapping and Validation, and Areas Estimates
The August 31 Landsat image was used to evaluate soybean health and to characterize white mold. Field locations of known white mold occurrence were used to extract the spectral characteristics of white mold using the Normalized Difference Vegetation Index (NDVI) computed from the same image. NDVI is a measure of vegetation health and greenness, computed as the ratio between the difference and the sum of the Near Infrared (NIR) band and the Red band, which respectively represent the regions of high reflectance and high chlorophyll absorption (Equation (2)).
$$ \mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}} \qquad (2) $$
Locations presenting an NDVI similar to that of the known fields were targeted to train the data for modeling; a total of 3981 pixels were collected as training samples. The classes, representing the response variable, consisted of white mold (unhealthy) and other soybean (healthy), while the explanatory variables consisted of the 5 individual Landsat bands and the NDVI. To maximize the accuracy of white mold detection and reduce false positives, all pixels with low NDVI that do not correspond to white mold were excluded from the soybean mask. In fact, soybean disturbances occurring in July are not white mold because at this stage the canopy has not yet closed. While healthy soybean in mid-July has an expected NDVI around 0.5, all pixels with NDVI lower than 0.45 within the soybean mask were excluded.
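The NDVI computation and the mid-July exclusion rule can be sketched as follows. The reflectance values are hypothetical, and the function names and the handling of a zero denominator are illustrative assumptions rather than details from the study.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and Red reflectance."""
    denominator = nir + red
    if denominator == 0:
        return None  # undefined for a pixel with no signal in either band
    return (nir - red) / denominator


def keep_in_soybean_mask(july_ndvi, threshold=0.45):
    """Keep a soybean pixel only if its mid-July NDVI reaches the threshold."""
    return july_ndvi is not None and july_ndvi >= threshold


# Hypothetical mid-July surface reflectance pairs (NIR, Red).
pixels = [(0.50, 0.10), (0.30, 0.15), (0.00, 0.00)]
july_ndvi = [ndvi(nir, red) for nir, red in pixels]
mask = [keep_in_soybean_mask(value) for value in july_ndvi]
print(july_ndvi)
print(mask)
```

Pixels dropped at this stage show early-season disturbance, so they cannot later be mistaken for white mold in the August image.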
The RF algorithm was run on the soybean mask extracted from the land cover classification; as with the land cover, seventy percent (2787 pixels) of the total sample pixels were used to build the model while thirty percent (1194 pixels) were used for accuracy assessment. To assess the accuracy of the results, the set-apart thirty percent of the samples were used to produce the confusion matrix and to estimate the individual class errors and the overall map accuracy. The resulting mapped white mold pixels were used to estimate areas from the pixel counts and the Landsat pixel size (Equation (3)).
$$ TA = N \times A \qquad (3) $$
where TA is the Total Area, N is the number of pixels, and A is the area of a pixel (30 m × 30 m).
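The area estimate is a direct product of pixel count and pixel area, as in the short sketch below. The pixel count is a hypothetical round number and the conversion to square kilometers is added for convenience.

```python
PIXEL_SIDE_M = 30.0  # Landsat pixel size in meters


def total_area_km2(n_pixels, pixel_side_m=PIXEL_SIDE_M):
    """Total area TA = N x A, converted from square meters to square kilometers."""
    pixel_area_m2 = pixel_side_m ** 2
    return n_pixels * pixel_area_m2 / 1e6


# Hypothetical count of mapped pixels, for illustration only.
print(total_area_km2(1_000_000))  # 900.0
```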
3. Results and Discussion
3.1. Land Cover Spectral Separability
The quality of the training samples was assessed using the computed Jeffries-Matusita index, which measures the spectral separability of the classes. Overall, the classes exhibit good spectral separability: the Soybean/Corn pair has the lowest index (1.86), just below the 1.9 threshold, while water shows the highest separability (JM = 2). Table 2 provides the values of the JM index between classes as trained for the Landsat bands in the northern part of the study area.
The original input Landsat bands were stacked in a composite image combining both May and July bands. The corresponding output band designations are listed in Table 1. Figure 2 provides a visual display of each band’s ability to discriminate individual classes. Both NIR and SWIR bands in May and July separated water successfully; corn tended to stand out particularly in July using the visible bands (Blue, Green, and Red), while soybean (areas where soybean will grow) was distinguished in the visible bands in May. In fact, soybean is not yet visible in the fields at this period, but its areas can be distinguished from corn. The “OtherLC” class looks particularly difficult to extract because of the high variability of the land covers included (grass, pasture, other crops).
3.2. Land Cover Classification Results
The stacked May and July images were classified using the RF algorithm and the land cover map was generated using the R software. The four classes (Water, Corn, OtherLC and Soybean) were labeled and colored to match the Crop Data Layer (CDL) dataset. Figure 3 shows a comparison between the July false color (6-5-4) Landsat composite, the CDL and the classified images. Water (upper right) is in some cases classified as other land cover, especially when it corresponds to swamps as mapped by the CDL. Overall, the classified image is close to the CDL but reflects more closely what is observed in the composite Landsat image, especially the field roads between soybean fields, which are excluded from the classified map and therefore do not produce false positives when mapping the disease. The rationale behind computing a land cover map instead of using existing datasets such as the CDL is the timing: the CDL for a given year is released early the following year, while an estimate of the disease extent may be needed earlier than that. However, extracting the mask of interest from the CDL is a good alternative provided it is released on time.
Table 2. Jeffries-Matusita (JM) spectral separability index, showing the goodness of the training.
Figure 2. Class spectral separability shown by the density plot of each band's reflectance. The Near Infrared (NIR) band discriminates Water very well in both images (B5_05 and B5_07), while Other Land Cover can be distinguished using the July green band (B3_07); Corn is well distinguished using the May NIR band; the shortwave infrared is the remaining band able to separate soybean once the above-mentioned classes have been extracted.
3.3. Land Cover Map Accuracy Assessment
The accuracy of the resulting classification map was assessed using the confusion matrix (Table 3), with the 30% set-apart pixels that were not used in the RF classification process. The classification achieved an overall accuracy of 95%. The “Water” class performed best (98% accuracy) while “Corn” performed worst (91% accuracy); OtherLC was classified with 97% accuracy while soybean achieved an accuracy of 94%. Table 3 reports the individual class accuracies as well as the errors. The commission and omission errors are reported as well: soybean is accurately classified, with a 94% producer’s accuracy (meaning that approximately 94% of the soybean ground truth pixels also appear as soybean pixels in the classified image) and a 93% user’s accuracy (meaning that 93% of the soybean pixels in the image actually represent soybean on the ground).
Figure 3. A comparison between a July 14 false color 6-5-4 Landsat 8 image composite (A), the Crop Data Layer (CDL) map (B) and the resulting classification (C) of the stacked May and July images. The classification image is very similar to the CDL; the classified map clearly delineates soybean, especially along the roads between the fields.
Table 3. Confusion matrix of the land cover map accuracy assessment.
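The relationship between a confusion matrix and the reported accuracies can be sketched as below. The two-class counts are hypothetical and are not the values of Table 3; the sketch also assumes that rows hold the mapped class and columns the reference class, and that every class appears at least once in both.

```python
def accuracy_report(matrix, classes):
    """Overall, user's and producer's accuracy from a square confusion matrix.

    Rows are the mapped (classified) classes, columns the reference classes.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(classes)))
    report = {}
    for i, name in enumerate(classes):
        mapped_total = sum(matrix[i])                      # row sum
        reference_total = sum(row[i] for row in matrix)    # column sum
        users = matrix[i][i] / mapped_total
        producers = matrix[i][i] / reference_total
        report[name] = {
            "users": users,
            "producers": producers,
            "commission_error": 1.0 - users,
            "omission_error": 1.0 - producers,
        }
    return correct / total, report


# Hypothetical validation counts, for illustration only.
classes = ["Soybean", "Corn"]
matrix = [[90, 10],
          [5, 95]]
overall, report = accuracy_report(matrix, classes)
print(overall)
print(report["Soybean"])
```

Commission error is the complement of user's accuracy and omission error the complement of producer's accuracy, which is why the two pairs are usually reported together.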
3.4. White Mold Mapping
Figure 4 shows the NDVI (B) computed on the August Landsat image (A), and the resulting mapped soybean and white mold (C). In late August, the soybean crops are mature and therefore the vegetation index is high. The NDVI of detected white mold ranges between 0.28 and 0.78, while healthy soybean exhibits a high NDVI of more than 0.79.
Some unhealthy areas can also be detected with very low NDVI values, corresponding to early soybean damage that is not white mold. However, these cases are sparse, isolated pixels and were not included in the training. Despite the efforts to accurately detect white mold, other disturbances can present a similar spectral index, especially since the white mold mapping uses only one image. Including several images in the white mold mapping would allow the exclusion of disturbances that have the same index as white mold but represent something else. When several images are available, information on the timing of white mold is crucial for excluding such disturbances. Moreover, unplanned disturbances such as drought or hail damage would not exhibit the same spatial patterns as white mold in the field, and can therefore be distinguished from the mapped disease.
3.5. White Mold Map Accuracy Assessment
The accuracy of the resulting white mold map was assessed using the 30% set-apart samples that were not used in model building. The map achieved an overall accuracy of 99%. Table 4 reports the accuracy and the commission/omission errors of the resulting white mold map. This high accuracy can be explained by the quality of the independent variables, which include not only the individual bands but also the NDVI. However, this accuracy depends largely on the set-apart pixels used for the validation process and considered as ground truth. One limitation of RF, often described as a black box, is that the contribution of each variable to the model is not directly visible, although variable importance measures can help interpret it. More importantly, we checked the known fields that were affected by white mold and all of them were correctly mapped. The final white mold map is shown in Figure 5, together with the classified Landsat images and the field locations.
Figure 4. August Landsat composite (A), August Landsat NDVI with white mold range (B), and mapped soybean and white mold (C): White mold is accurately mapped from the soybean mask, using the appropriate NDVI signal.
Table 4. White mold accuracy assessment: Confusion matrix table comparing the mapped classes with ground truth.
Figure 5. White mold in northeastern South Dakota: the map shows a classified image in background with the four important classes and the quantified white mold over the soybean mask.
3.6. Quantified Soybean and White Mold
Using the Landsat pixel size (30 m × 30 m), we estimated the total area of the classified soybean in the three counties based on the total number of pixels mapped. Table 5 reports the total soybean area estimates from both the classification and the USDA report, as well as the estimated white mold areas per county. The USDA areas reported are harvested-area statistics, but the values are very similar to those obtained from the classified Landsat images. The white mold area estimates are respectively 132 km², 88 km², and 190 km², and represent 31%, 22% and 29% of the total soybean area for Marshall, Codington and Day counties.
Table 5. Comparison between soybean area estimates from the United States Department of Agriculture (USDA) and the classified map in this study, as well as white mold extent estimated for each county, based on the calculations from the Landsat pixel size (30 m × 30 m) and the total number of pixels.
This study demonstrated that free of charge remotely sensed images could be used to detect and quantify white mold. The RF algorithm used was efficient in mapping the land cover and detecting white mold as reflected in the accuracy assessment. To improve the accuracy in the disease detection, this study combined both Landsat individual bands and NDVI. Including NDVI in the model provides more information, especially since the index puts together the strengths of the NIR band and the Red band.
Good knowledge of the investigated fields is necessary to complement image processing and ensure a proper validation. Constraints such as image availability or the timing of the disease should be addressed carefully when mapping the disease. To improve the classification results, more images could be obtained, for instance, by fusing medium spatial resolution Landsat data (30 m, 16 days) with high temporal resolution Moderate Resolution Imaging Spectroradiometer (MODIS) data (500 m, 1 day). Disease extents may be underestimated because the Landsat pixel size may not capture small disease patches. Satellite images with a shorter revisit period and a higher spatial resolution, such as Sentinel-2 (10 m, 5-day revisit period) or daily RapidEye, may provide a better way of quantifying the disease, but covering the study area might require many scenes, depending on its size.
The disease rating might also be an important factor in mapping the occurrence of white mold: depending on the latitude and on differences in planting dates, for instance, some phenological differences might be observed in the signal of white mold. Disease severity can help account for these differences while mapping crop stress, which may result in better disease quantification.
Financial support for the study came from USDA-NIFA Hatch grant #SD00H662-18 and South Dakota State University Agriculture Experiment Station. We wish to thank the anonymous reviewers for their valuable and useful suggestions and comments on the manuscript.