Fusarium Head Blight disease (FHB), also known as scab, leads to significant losses in crop yield, low grain weights, low seed germination and contamination of grain with mycotoxins. Several species of Fusarium can cause FHB. Among these, F. graminearum (teleomorph Gibberella zeae (Schwein.) Petch), infects different parts of the plant. Fusarium penetrates through the stomata to the palea and lemma and destroys these tissues. The disease has a symptomless phase where the hyphae advance between the plant cells prior to host cell death  . Fusarium infection introduces several mycotoxins into the food chain, such as the trichothecene nivalenol (NIV), deoxynivalenol (DON) and the estrogenic mycotoxin zearalenone (ZEN), posing significant threat to humans and animals health. Fusarium graminearum has a very broad host range and can suppress the plant defense responses. Although agricultural practices such as crop rotation and an invasive tillage can limit the development of FHB, some critical factors for the control of F. graminearum are still unknown  .
Synthetic fungicides are globally used to control plant diseases. According to Cools and Hammond-Kosack (2013) 95.5% of the wheat-growing area in UK is receiving fungicide sprays  . Nonetheless, fungicide resistance is a threat and leads to the search for new and more efficacious compounds  . The development of new fungicides in combination with genomic data can improve their specificity, lower costs and environmental impact. The Fusarium graminearum PH-1 genome was one of the first fungal pathogens to be sequenced  and recently an Australian isolate (CS3005) of F. graminearum genome senso stricto was reported by Gardiner et al. (2014)  . Ma et al. (2013)  reviewed the major discoveries performed by the genomics analyses of several Fusaria, focused on plant pathogenicity and secondary metabolites production. Furthermore, the compartmentalization of Fusaria genomes into core regions, and adaptive areas with a higher frequency of virulence genes was discussed. The Fusarium graminearum genome size is 36 Mb distributed in 4 chromosomes with a total of 13,718 annotated coding genes, 319 tRNAs, and genetic markers   .
Searching for novel targets to control wheat head blight disease remains a major challenge   . The goal of selecting protein targets from the genome is to develop chemical compounds against isolated biological targets from a pathogenic organism or a disease using computer-aided design. The definition of a drug target is a naturally existing cellular or molecular structure involved in the pathology of interest where a chemical compound can act on  . To find a new compound for agricultural application, we firstly select protein targets from relevant databases and prepare those targets to the virtual screening (VS) step   . This work is the first report of genomics analysis to search for new targets and to develop fungicides using molecular structural based approach applied for agricultural purposes.
2. Results and Discussion
2.1. EST Sequences Analysis
At the early phase of infection, fungal growth is symptomless representing the majority of mycelium colonizing the host tissues. Secondly the hyphal colonization is observed where the plant cell collapses. During the infection process the fungus maintains living host cells, probably with the expression of various secreted proteins, that might increase during the progress of the infection   . In order to explore the relationship between plant and pathogen induced in planta, we obtained ESTs from four infected wheat varieties at the early stage of infection   . The total amount of 57,282 ESTs from four interactions between wheat and Fusarium graminearum, processed at Rothamsted Research were assembled to build 34,970 unigenes from which we trimmed those related to the Triticum aestivum plant to those related to Fusarium. For that purpose, BLASTx searches were performed on these 34,970 unigenes against the plant and the pathogen genomes. Moreover, as we were only interested in contigs and not in singlets, this procedure eliminated 10,242 contigs similar to wheat, 951 unknowns and retained 5,094 analogous to the Fusarium, which were considered to the database. These results are summarized in Table 1. The annotation of 10,242 wheat sequences
Table 1. Summarized results for targets candidates selection using the data set from Fusarium graminearum PH-1 genome, EST analysis, PHI-base and Secretome  -  .
*Number of copies showed below, !validated by proteome experiments; (PHI-base reference number 1 PHI: 44; 2 PHI:432; 3 PHI:1235).
and of the 5,094 from Fusarium graminearum (named subgroup EST FG) are described in Table 1 and Table 2, respectively (http://lbi.cenargen.embrapa.br/Fusarium_target_supp/). Table 2 summarizes the ortholog search for the selected candidates.
2.2. Target Selection
The selection strategy is summarized in Figure 1. To select the target a total of 20,466 protein sequences were retained: 13,321 sequences from the Fusarium genome, in which 31 proteins were identified as involved in the trichothecene pathway from GenRE-FGDB (Fusarium graminearum database). Additionally, 5094 EST sequences expressed in four wheat varieties infected with the pathogen and 1369 proteins from the wheat infected secretome prediction and 682 sequences from the Pathogen-Host Interactions database (PHI-base).
From 19,892 sequences only 2835 candidates were retained after considering protein annotation, phenotype description and expression in planta. Followed by the filtering for redundancy, cell localization and accessibility (cytoplasmic proteins), only 92 satisfied the criteria. Among them, 30 had a low number of copies in the genome and were considered small proteins. The final filter produced a set of 13 candidates, which had no orthologues within non host genomes. Among them, only three were finally cho- sen for the 3D modeling phase, namely trichodiene synthase, endoglucanase-5 and the ERG6 sterol C-methyltransferase. These proteins were considered as potential targets
Table 2. Ortholog identification for selected protein targets endoglucanase-5, trichodiene synthase and sterol C-methyltransferase in fungi database, plants, insects and human genome.
Figure 1. Schematic strategy for protein target selection for Fusarium graminearum databases.
representing a group of essential genes, for life or virulence on wheat, with a broad diversity of functions and cellular processes. The three potential targets are significant, unique and have a single function for the pathogen; all are expressed only by the pathogen and are assayable. Therefore, these candidates were selected as potentially and readily inhibited by low molecular weight compounds.
The gene FGSG 02658 is located on chromosome 1 and is predicted to encode an endoglucanase type K protein (E.C. number 18.104.22.168.), belonging to the cellulase class that catalyze the hydrolysis of the β-(1,4) glycosidic bonds of cellulose that primarily hydrolyses less well-ordered regions of cellulose by cutting at internal glycosidic bonds. Cellulases play an important role during infection, enabling the degradation of plant cell walls, pathogen penetration and growth through plant tissue   .
Protein sequence alignments between Fusarium endoglucanase FGSG 02658 and orthologs show high similarity with a well-defined conserved domain glycosyl hydrolase family 61 (GH61). To obtain a 3D homology model corresponding to the protein from Fusarium FGSG 02658 sequence (residue 1 to 380) was necessary to use complementary protocols for molecular modeling. Complete models were proposed by the HHPRED (one model), PHYRE2 (one model), TASSER (5 models) and ROBETTA (one model) server (Table 3). The comparison between these 3D models showed that the core se-
Table 3. Homology 3D predictions for the target endoglucanase-5 (FGSG_02658).
quence (residues 20 - 220) is very similar in all predictions (and even with the predictions not retained for the incomplete sequence), an enormous diversity of proposals is given. These models were all submitted to a short 10 ns of molecular dynamics (MD) simulation to check their conformational changes compared to the starting homology structures. According to our selection procedure, the model given by the Robetta server appeared to be the most suitable and was the most stable from short 10 ns of MD simulations. Consequently, this last model (Figure 2(C)) was submitted to 100 ns of MD to validate the stability of the model. The evolution of this 3D structure was followed by the RMSD curve (Figure 2(A)) and mapped (Figure 2(B)), showing that the stability of the structure was achieved during the MD. Conformation obtained at frame 75 ns (Figure 2(C)) was retained as the representative conformer of this family and used later to perform the virtual screening campaign. In this conformation, the overall 3D core structure was conserved during the MD trajectory and consists of a six-stranded beta- barrel and a seventh strand that is not part of the barrel itself surrounded by short helices. The catalytic region is connected to a cellulose-binding domain through a flexible linker region. Two aspartic residues located on either side of the catalytic groove and positioned above the tyrosine residue in the N-terminus were maintained favorably to ensure their catalytic role similarly to experimentally described structures for endoglucanases  .
2.4. Trichodiene Synthase
The trichodiene synthase gene (TRI5) is located on chromosome 2, locus FG03537.1
Figure 2. Molecular dynamics (MD) trajectory, RMSD map and three dimensional models for the three selected targets. FGSG 02658 Endoglucanase-5 (A) RMSD map obtained during 100 ns of MD; (B) RMSD map showing one representative conformer. Black color corresponds to high RMSD values between conformers; the white square represents the stable conformational family range; the arrow shows the representative conformer; (C) Proposed 3D model for F. graminearum endoglucanase-5. Colors represent the secondary structures. FGSG 03537 TRI5 (A) the RMSD map obtained during 100 ns of MD; (B) RMSD map showing three representative conformers. Black color corresponds to high RMSD values between conformers, the white square represents the stable conformational family range; (C) The proposed 3D structures, in purple the conformational state 1 (frame 200), in green 2 frame 500, in orange 3 frame 800. FGSG 02783 ERG6 (A) RMSD map obtained during 100 ns of MD. (B) RMSD plot for 100 ns MD simulation showing two representative conformers; (C) Proposed 3D structures, in purple conformer 1 (frame 450), in green conformer 2 (frame 850).
described as TRI5 GIBZE Trichodiene synthase (sesquiterpene cyclase) (TS). It contains two exons and encodes for a protein with 375 amino acids (XP 383713). Recently, the TRI5 gene was studied in terms of phylogenetic relationships among Fusarium chemotypes and reported as having a highly conserved organization and a common expression pattern  -  . Trichodiene synthase catalyzes the production of trichodiene, which is the first committed step in the pathway  as part of the 15 biochemical steps required for DON biosynthesis. The enzyme is a sesquiterpene cyclase that catalyzes the formation of trichodiene in the biosynthesis of antibiotics and mycotoxins   . It catalyzes the isomerization and cyclization of farnesyl pyro-phosphate to form trichodiene, the first cyclic intermediate in the biosynthetic pathway for tricho- thecenes. TRI5 serves to branch trichothecene biosynthesis from the isoprenoid pathway. Trichodiene synthase produces the intermediate trichodiene and is specifically induced in planta. This feature provides a substantial advantage over direct pathogen control by a chemical compound although it has been shown that the infection of F. graminearum in Arabidopsis can occur without the synthesis of DON toxins  .
The homology modeling procedure used was similar to the one described above for FGSG 02858. Nevertheless, the homology modeling phase was straightforward as all homology servers gave similar results and due to the crystal structure PDB templates existing for an ortholog for trichodiene synthase (1 PDB structure for an apo and 5 for complexes forms), such as the 1JFA structure of trichodiene synthase from Fusarium sporotrichioides  . Compared to the X-ray crystal structure of this free recombinant, our 3D homology model of FGSG 03537 presented the overall conservation between the templates and the proposed model with an average RMSD 2.0Å for the peptide backbone. The trichodiene synthase structure is, therefore, formed by well conserved 17 α-helices, six of which (C, D, G, H, I and J) define a conical and hydrophobic active site cleft. The structural validation step through 100 ns MD simulation and the RMSD map analysis (Figure 2), shows three conformational families that appear as stable states for this protein. These conformers differed mainly by the position of the C-terminal helix and by some loops arrangements (Figure 2(C)). The position of the aspartate-rich motif DDSKD from residue 100 at the C-terminal up to the end of helix D was very well conserved during the MD trajectory, in good agreement with studies that investigated the mutagenesis of D100 and D101 which are important for catalytic activity   . On the opposite wall of the active site cavity, at the C-terminal end of helix J, the “basic motif” DRRYR starting at residue 302 is also well positioned in all three of the models generated in this study. Mutagenesis studies of R304, Y305, and R306 in this motif similarly indicate the importance for catalysis. Consequently, these three models will be used in our structure-based virtual screening selection using an ensemble docking approach    .
2.5. ERG6-D-(24)-Sterol C-Methyltransferase
The ERG6 gene encodes for the enzyme D-(24)-sterol C-methyltransferase that catalysis the attachment of a methyl group acting in a bifurcation point of the ergosterol biosynthesis pathway, locus FGSG 02783. The protein is located in the endoplasmic reticulum with a transmembrane portion and an active site positioned toward the cytoplasm  . The ERG6 protein is required for the formation of fungal plasma membranes, with an essential role in membrane stabilization and signaling. ERG6 does not occur in plants or insects and sterols can be recognized by a plant cell as a “non-self” signal in host defense  . In Candida albicans, mutants that do not produce D-(24)- sterol C-methyltransferase showed an increase in plasma membrane permeability. In Saccharomyces cerevisae, ERG6 mutants have altered membrane fluidity and permeability  . The homology modeling of this protein was not trivial, and the Robetta server gave the more convincing model (mostly obtained from the 3BUS PDB structure) after the selection. Starting from this model and followed by 100 ns MD trajectory, it appeared that the molecular system was stable and that two conformational families depicted the protein conformational behavior (Figure 2(B)). Recently, Azam et al. (2014)  examined the structure of a 24-C-methyltransferase from Leishmania infantum and describe the prediction of the three-dimensional structure for the Leishmania orthologs based on theoretical data and structural details. The overall structure of ERG6 comprises ten alpha-helices and eight beta-strands and loops. The active site was previously described by Nes and co-workers (2002)  where the mutated residues 81?86 form a conserved motif of aromatic residues which is of considerable importance in inhibitor binding. Although ERG6 is a non-essential protein, as previously described in C. albicans and A. fumigatus   , this protein was added to the initial screening list since it was previously described as an already known drug target for the human pathogen Paraccocidioides lutzi and P. brasiliensis   .
2.6. Virtual Screening (VS) Results
All the 10,240 compounds selected from our virtual screening campaign were docked within the binding sites of the targets, as described above, for the major conformers’ population. These molecules were next ranked according to their docking scores. After analysis of the whole scores distributions, the best 15 scored compounds obtained for each target were selected. These compounds strongly interact with several target amino acid residues through polar, aromatic and hydrophobic interactions to form the more stable protein/ligand complexes. These 45 ligands thus selected from their docking scores were consequently retained as valuable candidates for further experimental validation. Detailed information for each target is presented below:
VS on Endoglucanase-5 (FGSG 02658)
The top 15 candidate compounds retained for the endoglucanase-5 target are shown in Table 4. Main interacting residues were Cys347, Gly349 and Lys351. Furthermore, these three residues were interacting with the majority of the 15 compounds. Additionally, the protein residues Gly348, Ala362, Asp356 and Ser352 were associated to a few compounds among the 15 ranked. Examples of protein ligand interactions for the three best score and non-toxic compounds are given in Figure 3(A).
Figure 3. (A) Protein-ligand interactions between the top three compounds and endoglucanase-5; (B) Protein-ligand interactions between the top three compounds and trichodiene synthase; (C) Protein-ligand interactions between the top three compounds and ERG6. Hydrophobic interactions are in red with black residue name and H-bonds are in green.
VS on Trichodiene synthase (FGSG 03537)
The trichodiene synthase docking results are shown in Table 5. Several amino acid residues were found to interact with the majority of the top-ranked ligands, such as Asn225 (8 ligands among the 15 top-ranked), Gln240 (6 among 15 compounds), Ser229 (5 among 15), Tyr305 (8 among 15) and Ser229 (all compounds). The protein-ligand interactions found for the three best non-toxic compounds retained are given in Figure 3(B).
VS on ERG6-D-(24)-sterol C-methyltransferase (FGSG 02783)
Table 4. Top-ranking synthetic compounds from Life Chemicals library after virtual screening against endoglucanase-5 unique conformer using GOLD docking software.
Table 5. Top-ranking synthetic compounds from Life Chemicals library after virtual screening against trichodiene synthase conformers’ using GOLD docking software.
The docking results for this target are presented in Table 6. The amino acid residues found to interact with the selected ligands were Trp97 (7 ligands among the 15 top- ranked) and Tyr93 (8 compounds among 15), through pi-pi interactions. The protein- ligand interactions found for the three best non-toxic compounds are given in Figure 3(C).
Overall, the searching approach for novel protein targets applied for agricultural purposes is a response to the demand for chemical crop protection. The search for new fungicides or antifungal compounds should be specific with no effect on the host or other organisms. The first part of this analysis was to select specific targets to start the virtual screening selection of chemical compounds. The availability of the genome, secretome, and PHI-base allowed and enriched the search, when used in combination with the experimental data provided from various in planta experiments (ESTs).
Several steps of selection enabled the choice of three major targets: an endoglucanase-5, a trichodiene synthase, and a 24-sterol C-methyltransferase. These three proteins are likely to represent a promising target group: the endoglucanase may be re-
Table 6. Top-ranking synthetic compounds from Life Chemicals library after virtual screening against ERG6 conformers’ using GOLD docking software.
quired for cell wall degradation and, therefore, the possibility to stop the disease before the development of symptoms; the trichodiene synthase, known to be required for mycotoxin synthesis, once inhibited could reduce the amount of toxins produced during infection; the 24-sterol C-methyltransferase is a well-known target for cellular growth inhibition, especially due to its specificity and absence in mammals.
The structure modeling was an important aspect when searching for new targets. Homology modeling and molecular dynamics proposed convincing 3D models at the atomic level, enhancing the future perspectives for the development of novel, efficient and specific fungicides to control plant diseases that do not impact on the environment. The experimental validation of the present virtual screening campaigns using the three targets models is under development.
4.1. Dataset Setting
To rank and select proteins from the Fusarium graminearum genome according to their potential as fungicides targets, we screened 13,321 proteins from the F. graminearum PH-1 genome  , 5094 assembled sequences from ESTs in planta, the F. graminearum refined secretome with 1369 proteins  and the filtered PHI-base databank with 682 proteins from F. graminearum  . Thus, the identification of potential targets was based on the ensemble of sequences coming from the Fusarium genome, assembled ESTs from wheat infected with Fusarium graminearum, the secretome and PHI-base sequences. The expressed sequences in planta were produced by the construction of 18 cDNA libraries from Fusarium graminearum-infected wheat from four different varieties: Bobwhite, Piko, Sumai-3 and Gottingen-2 (supplementary material: http://lbi.cenargen.embrapa.br/Fusarium_target_supp/). The in planta libraries were designed to reveal both plant and fungal genes expressed at the early stage of infection. The BLASTx search was performed for the 5094 assembled ESTs to characterize their functions. Additionally, another BLASTx search was performed against three specific datasets: the predicted secretome, PHI-base version 3.4 and a group of selected TRI genes to avoid redundancy. All unigenes were submitted to BLAST2GO for further categorization of biological processes, cell localization, and molecular function. All ESTs were also submitted to BLASTx search against the high confidence reference proteins from the wheat genome   . A total of 795 sequences from PHI-base which is a web-accessible database that catalogs experimentally verified pathogenicity, virulence and effectors genes, trimmed for Fusarium graminearum  and those from the total secretome  .
4.2. Target Selection Strategy
Our systematic criteria for target filtering were defined in terms of priority, as our primary objective was to identify genes that could be used as targets for the development of new control options: 1) protein annotation, phenotype description and transcriptomic evidence coming from EST analysis as unique genes expressed in early days of infection; 2) we discarded redundant sequences, proteins with low structure similarity within the Protein Databank (PDB), cell localization prediction in nuclei, and candidates with low accessibility to chemical compounds; 3) the number of gene copies in the genome, and protein molecular size; 4) the possible target selection also considered the absence of orthologues in another organism such as insects, plants, humans and evolutionary conservation among other fungi. Therefore, our selection process consisted of several elimination rounds. The first step of this funnel consisted in using protein annotation, phenotype characterization at PHI-base and gene expression in the early infection phase from EST data collections. The next stage of our selection was: 1) the elimination of redundancy and choice of candidates expressed during the early days of infection; followed by 2) a B2GO categorization and cell localization within the cytoplasm and accessibility to chemicals; and finally 3) a BLAST search of the retained candidates against the PDB  . The third step consisted in a manual curation for some gene copies (one or two) in the genome and protein size (limited to 500 amino acids). Finally, searching for and then eliminating orthologs in other species with sequence identity criteria (above 60%) resulted in the final list. The end of these selections aimed to propose a first set of protein targets necessary to perform a structure-based design of putative new fungicides for agricultural purposes. We kept as representative the F. graminearum three proteins for further analysis.
4.3. Target Modeling
In the absence of experimentally solved 3D structures, computational methods were used to predict 3D protein models and provide information regarding protein functions and structures. Homology modeling has been shown to be efficient in methods to reach reasonable theoretical 3D models as soon as a suitable sequence alignment exists between the sequences of the template and the query   . To build appropriate 3D models of our three selected targets, we used several homology modeling servers such as I-TASSER   , SwissProt  , PHYRE2  , M4T  , HHPRED  , MOD- WEB  and ROBETTA  . Once the target protein was identified as the most suitable template for homology modeling, we used MUSCLE 3.8.31 for multiple sequence alignment with default parameters to check for sequence similarity and to verify the conservation of structural signatures (find diagonals option disabled, a maximum number of iterations: 16, no duration limitation and no more than 200 sequences). The software MODELLER was used to perform the homology modeling task with its default settings  . The crude model for each selected target was obtained and equilibrated. Next, their predictions were compared and analyzed, and the models corresponding to the largest part of the query protein sequence and presenting high similarities between several servers’ proposals were kept according to structural index such as coverage and percentage of similarity. These 3D homology models were checked for their stability using short 10ns molecular dynamics and the most stable ones retained to be submitted to a long 100ns molecular dynamics simulation to ensure their structural behavior. For this purpose, each protein 3D model was first solvated with an 80 Å 3 box of TIP3P explicit water molecules. Next, ions were added for ensuring the electrostatic neutrality of the whole protein and solvent systems. The NAMD program  was employed in conjunction with the CHARMM27 force field. The initial states of dynamics were generated from the homology models after 64,000 steps of conjugate gradients minimization protocol followed by an equilibration stage of 1ns. The simulations were carried out in the isobaric-isothermal ensemble, maintaining the pressure and the temperature at 1 atm and 300 K respectively by using Langevin dynamics and the Langevin piston approaches. The equations of motion were integrated with a 1fs time step. Long-range interactions were treated using the particle-mesh Ewald approach with an 11Å cut-off (switching distance 9Å) for the real space calculation. The calculation of forces and motion equations was repeated to generate the trajectories corresponding to the requested simulation times. A conformation of the whole molecular system was recorded every 1ps, generating for the long 100 ns runs conformational samples of 1000 frames each which was then analyzed using the VMD graphic software  . The analysis of the models was achieved by monitoring their Root-Mean-Square deviations (RMSD) during the simulations and by checking the conservation of their secondary structure elements along the molecular dynamic trajectories.
4.4. Virtual Screening
Virtual screening uses computer-based methods to discover new ligands on the basis of biological structures. This technique reduces the molecular database to a few hit compounds for a protein target based on structural features such as size and toxicity of the synthesizable chemical structures. The chemical library construction was performed according to the steps described at Beautrait et al.  ; database creation and handling, substructure search, toxicity prediction and 3D structure generation. In order to limit the number of possible side effects, they were detected using three toxicity prediction web servers: Badapple  , PAINS-Remover  and Protox  . The chemical library used was the one providing a high diversity of compounds and was accessed from Life Chemicals  . This databank contains 10,240 molecules and each one was docked within the binding site of the selected targets. For each stable conformer identified thanks to the MD simulation of each target, the binding pocket was detected and characterized with the LigSite program  . The docking was performed by GOLD  which has been recognized as one the best docking software  . As several stable conformers were identified for each target, we used the ensemble docking possibility available in GOLD. The use of such conformational ensembles was considered as an improved strategy in structure-based docking calculations  -  . For each docking, 100 starting ligand conformers were used in GOLD. All target conformers used were aligned in a common reference system and the center of the pocket cavity is an average of the individual centers found in each conformation. A sphere of 15Å was selected to define the binding region around this center.
The authors gratefully acknowledge CNPq (funding grant 400432/2012-9) EMBRAPA Labex-Europe, University of Brasilia and CAPES for postdoc fellowship (#51/2013). MU and KHK receive support from BBSRC ISP Grant 20:20 Wheat (BB/J/00426X/1). The authors would also thank the funding resources of PHI-base, the UK Biotechnology and Biological Sciences Research Council (BBSRC) (BB/I/001077/1, BB/K020056/1) and receives additional support from the BBSRC as a National Capability (BB/J/ 004383/1).