1.1. Bioinformatics Data Set
Bioinformatics has evolved and expanded continuously over the past years and has become a basic requirement of life science research. Biological data on networks and in databases are growing enormously because of the massive amount of research carried out every day. Public databases are growing exponentially: for example, the NCBI Gene, Protein and Nucleotide databases reached 24, 300 and 210 million records in 2016, with annual growth rates of 13.8%, 37.7% and 5.2%, respectively.
1.2. User-Friendly Bioinformatics Tools
The analysis and interpretation of biological data is becoming a major bottleneck in bioinformatics. To extract target information from different biological data, a plethora of publicly available analysis tools can be used to extract, analyze and visualize data. The main differences between these software packages are availability, the user-friendliness of the GUI, visualization methods and performance. Each package requires specific parameters in order to perform an analysis or to extract information about genes or gene clusters through simple, routine procedures. Many of the available open source bioinformatics tools use the command line to perform different analyses; others have a graphical user interface (GUI) that can simplify complex analytical procedures and provide a simple way to enter different parameters.
1.3. Similar Multi-Tool Software
Tens of general-use bioinformatics software packages are publicly available. DNASTAR (Lasergene) is a commercial bioinformatics package that comprises different applications such as gene discovery, genomic visualization, NGS assembly with Sanger validation, primer design, Sanger sequence assembly, sequence alignment and others. CLC Workbench is another bioinformatics pipeline, provided by QIAGEN (www.qiagenbioinformatics.com), which offers data analysis tools such as NGS read mapping, de novo assembly, variant analysis, assembly of DNA sequence data, multiple sequence alignment and reverse complement. EMBOSS, the European Molecular Biology Open Software Suite, integrates existing analytical packages and databases with over 100 applications and can be run with advanced graphical user interfaces.
In this study, we introduce Bioanalyzer, a bioinformatics tool that comprises simple and common data analysis applications behind a user-friendly GUI. Bioanalyzer is open source and freely available; its code can be modified, extended or integrated into different bioinformatics pipelines. It implements a variety of tools for performing common data analyses on different biological data types and databases.
2. Materials and Methods
Bioanalyzer was developed using Python libraries for data manipulation and the Tkinter package for the interface design. It uses about forty modules and functions from Biopython, integrated with open source scripts and scripts written by the authors.
Biopython is a Python library for genomic data analysis and annotation. It provides a plethora of scripts for tasks such as reading and extracting data from FASTA and GenBank files, multiple sequence alignment, BLAST searches against the NCBI database and direct access to the NCBI database itself. We used Biopython to create an environment for data mining and annotation: the SeqIO module reads and manipulates FASTA and GenBank files, and the results are produced either as annotated text (multiple sequence alignments, open reading frames, BLAST searches or NCBI query searches) or as figures (chromosome genes, mRNA or tRNA visualization, and restriction enzyme sites) drawn with matplotlib and the network module in combination with Biopython modules.
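To make the FASTA reading step concrete, the following is a minimal sketch that uses only the Python standard library. It is not Bioanalyzer's code, which relies on Biopython's SeqIO module; the function name read_fasta is hypothetical and the parser assumes plain, well-formed FASTA text.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {record_id: sequence} dictionary.

    Illustrative stand-in for a FASTA reader; the record ID is taken to be
    the first whitespace-delimited word after '>'.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            current = fields[0] if fields else ""
            records[current] = []
        elif current is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[current].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {name: "".join(parts) for name, parts in records.items()}


example = ">seq1 sample record\nACGT\nAC\n>seq2\nGG\n"
print(read_fasta(example))
```

Each header line starts a new record, and wrapped sequence lines are concatenated, which is the behaviour the downstream tools in this paper depend on.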
2.2. Matrices and Algorithms for Protein and Nucleotide Alignment
To maximize the accuracy of protein alignment, PAM and BLOSUM matrices are used to score accepted mutations and to find functional domains.
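How a substitution matrix turns an alignment into a score can be sketched as follows. The matrix values below are toy numbers chosen for illustration only; they are not real PAM or BLOSUM entries, and the function names and the gap penalty are assumptions rather than Bioanalyzer's settings.

```python
# Toy substitution scores (NOT real PAM/BLOSUM values), stored once per
# unordered pair of residues.
TOY_MATRIX = {
    ("A", "A"): 4, ("L", "L"): 4, ("I", "I"): 4,
    ("I", "L"): 2,   # similar residues: mild positive score
    ("A", "L"): -1, ("A", "I"): -1,
}


def pair_score(x, y, matrix):
    """Look up the score of a residue pair in a symmetric matrix."""
    if (x, y) in matrix:
        return matrix[(x, y)]
    return matrix[(y, x)]


def score_alignment(row_a, row_b, matrix, gap=-4):
    """Sum substitution scores over the columns of an existing alignment.

    '-' marks a gap; every gap column costs the (assumed) linear penalty.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        else:
            total += pair_score(x, y, matrix)
    return total


print(score_alignment("ALI", "AIL", TOY_MATRIX))
print(score_alignment("AL-", "ALI", TOY_MATRIX))
```

With a real matrix, conservative substitutions receive higher scores than unlikely ones, which is what lets the score reflect accepted mutations.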
2.3. Packages Used in Data Visualization
Matplotlib is well suited to accurate data visualization. In the software it is used to draw and visualize chromosomes, restriction sites and dot plot graphs.
2.4. Designing the Graphical User Interface with Tkinter
The Tkinter library is used to build the GUI, which consists of frames, buttons, text boxes, etc. Tkinter makes it possible to link scripts and functions to button presses and to display the resulting text in a text viewer.
2.5. Converting Python to Stand-Alone Executable Application
We used PyInstaller (http://www.pyinstaller.org/) to convert the Python files into a standalone executable application. PyInstaller collects the packages used by the Python software and bundles them as local copies in the software's directory, so that the software retrieves any function from the packages in this directory instead of calling packages and functions installed on the system.
3. Results and Discussion
Bioanalyzer covers general aspects of data analysis such as handling nucleotide data, fetching information in different data formats, NGS quality control, data visualization, multiple sequence alignment and sequence BLAST. Each section of the software is described below with sample results.
The nucleotide tools accept nucleotide sequence(s) or NCBI accessions as input. These tools provide DNA translation, GC%, reverse complement, transcription (Figure 1), back transcription and open reading frame (ORF) finding. GC% content can be used in transcriptome mapping (HTM) of gene-dense domains with high GC content. DNA translation can be used in protein sequence classification or in finding statistically significant functional associations in genomic experiments. The tool also offers a choice between different translation tables and an option to stop translation at the first stop codon.
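Several of the nucleotide operations listed above can be sketched with the standard library alone. This is an illustration, not Bioanalyzer's implementation (which builds on Biopython): the function names are hypothetical, only the standard genetic code is included, and ORF finding and alternative translation tables are left out.

```python
from itertools import product

# Complement table for DNA, preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

# Standard genetic code in the conventional TCAG ordering of codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(map("".join, product(BASES, repeat=3)), AMINO_ACIDS))


def gc_percent(seq):
    """Percentage of G and C bases in the sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq):
    """Complement every base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def transcribe(dna):
    """DNA coding strand to mRNA (T becomes U)."""
    return dna.upper().replace("T", "U")


def back_transcribe(rna):
    """mRNA back to DNA (U becomes T)."""
    return rna.upper().replace("U", "T")


def translate(dna, to_stop=False):
    """Translate in frame 1; unknown codons become 'X'.

    With to_stop=True, translation ends before the first stop codon.
    """
    dna = dna.upper()
    protein = []
    for start in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE.get(dna[start:start + 3], "X")
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


print(gc_percent("ATGC"), reverse_complement("ATGC"), transcribe("ATGC"))
print(translate("ATGGCCTAAGGG"), translate("ATGGCCTAAGGG", to_stop=True))
```

The to_stop flag mirrors the option of stopping translation at the first stop codon; a different translation table would replace the AMINO_ACIDS string.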
Data Extraction can be used to extract specific targeted information from GenBank sequence(s), with the option of choosing the content and name of each output file (Figure 2). This tool creates a folder of text files, each holding specific user-defined information extracted separately from the genomic data. It can be used for the exploration of biological databases based on GenBank file data for drug discovery. The resulting file is a text file that contains the information selected by the user (Figure 3).
Figure 1. Transcription of a multi-record FASTA file generated by the transcription tool.
Figure 2. The data extraction tool is used to extract selected information from a large GenBank file. The left section of the figure sets the content of each file, and the right section sets the name of each file.
Database tools can be useful for handling specific NCBI accessions in different databases and retrieving their sequences in FASTA or GenBank format. These tools can help in discovering new mutations responsible for diseases by comparing different database records for the same gene in a specific gene family. The BLAST tool (Figure 4), on the other hand, can be used to align a specific sequence or ID against the public NCBI database in order to discover similar published sequence(s). This option can help in the characterization of novel genes that belong to different gene families.
Alignment is among the most frequently used tools in bioinformatics; it performs local, global, Needleman or Water alignment of nucleotide or protein sequences (Figure 5). Bioanalyzer offers options to change the alignment matrix (e.g., BLOSUM) for either global or local alignments, and the resulting score indicates how similar the aligned sequences are by assigning a score to every match, mismatch and gap. Sequence alignment illustrates how the aligned sequences are related to each other, and can be used to discover genes with a common ancestor or to improve protein secondary structure prediction.
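The idea of scoring every match, mismatch and gap can be sketched with a small global alignment routine of the Needleman-Wunsch type. The scoring values (+1, -1, -2) and the function name are assumptions made for illustration; Bioanalyzer's own alignment relies on Biopython and on the matrices chosen by the user.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score by dynamic programming (linear gap cost).

    Only the score is returned, so two rows of the table are enough.
    """
    # Row 0: aligning an empty prefix of seq_a against prefixes of seq_b.
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i in range(1, len(seq_a) + 1):
        current = [i * gap]  # column 0: prefix of seq_a against nothing
        for j in range(1, len(seq_b) + 1):
            same = seq_a[i - 1] == seq_b[j - 1]
            diagonal = previous[j - 1] + (match if same else mismatch)
            up = previous[j] + gap        # gap in seq_b
            left = current[j - 1] + gap   # gap in seq_a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(global_alignment_score("ACGT", "ACGT"))
print(global_alignment_score("ACGT", "AGT"))
```

Identical sequences score one point per position, while the second call must pay for one gap, so a higher score indicates more similar sequences.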
Visualization tools draw large nucleotide sequences such as chromosome files and illustrate the positions of genes/CDS (Figure 6 and Figure 7). The illustrations are exported in PDF format. This tool can be used to position genes on a chromosome and to depict their rearrangement in different chromosomes. The phylogenetic tree tool reconstructs phylogenetic tree(s) from different nucleotide sequences in order to screen their genetic diversity. The dotplot tool draws a graph of two sequences to show their similarity, which can be used to compare complete genome sequencing data. The GC% visualization tool creates a chart of the GC% content of two FASTA file records, which can depict how recombination drives the evolution of GC content in different genomes. The restriction site tool builds circular and linear representations of the positions of restriction sites in DNA sequences, which can be helpful in rapid polymorphism identification and genotyping using restriction site associated markers.
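The data behind a dot plot and a restriction map can be computed before any drawing takes place. The sketch below is a stdlib-only illustration with hypothetical function names; it only produces the coordinates, which a plotting library such as matplotlib could then draw. GAATTC is the EcoRI recognition site and is used purely as an example.

```python
def find_sites(seq, site):
    """0-based start positions of every occurrence of a recognition site."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # allow overlapping hits
    return positions


def dotplot_points(seq_a, seq_b, window=3):
    """(i, j) pairs where a window of seq_a exactly matches one of seq_b."""
    # Index every window of seq_b so that each window of seq_a is a lookup.
    index = {}
    for j in range(len(seq_b) - window + 1):
        index.setdefault(seq_b[j:j + window], []).append(j)
    points = []
    for i in range(len(seq_a) - window + 1):
        for j in index.get(seq_a[i:i + window], []):
            points.append((i, j))
    return points


print(find_sites("AAGAATTCTTGAATTC", "GAATTC"))
print(dotplot_points("ACGT", "ACGT", window=2))
```

Two identical sequences give points along the main diagonal; insertions, deletions and repeats show up as breaks or extra diagonals in the plotted points.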
Figure 3. The output files containing the information extracted with the extraction tool.
Figure 4. BLAST result containing the sequence ID, description, length, e-value and the sequence with its alignment.
Figure 5. Alignment of two different proteins, which can be produced with the PAM or BLOSUM tools.
The weblogo tool illustrates the consensus sequence of the given record(s), which reflects the presence of functional domains in a protein, such as an active site or a ligand binding site (Figure 8).
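A sequence logo is built from per-column residue counts, and the consensus is the most frequent residue in each column. The sketch below shows only that counting step, under the assumption that the input sequences are already aligned and of equal length; the function names are hypothetical and no logo is drawn.

```python
from collections import Counter


def column_counts(seqs):
    """Residue counts for each column of equal-length, aligned sequences."""
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [Counter(column) for column in zip(*seqs)]


def consensus(seqs):
    """Most frequent residue per column (ties go to the first one seen)."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(seqs))


aligned = ["ACGT", "ACGA", "ACTT"]
print(consensus(aligned))
print(column_counts(aligned)[3])
```

Columns in which one residue dominates are the conserved positions that a logo displays as tall letters.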
Quality control tools work on FASTQ files to perform post-sequencing processing such as primer and adapter trimming, which prepares the reads for analyses such as genome assembly, mapping or any other application. They also convert FASTQ to FASTA format so that the user can run other analyses, such as de novo transcript sequence reconstruction from NGS data.
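Adapter trimming and FASTQ to FASTA conversion can be sketched as follows. This is a simplified, stdlib-only illustration rather than Bioanalyzer's code: it assumes the plain four-lines-per-record FASTQ layout, trims only exact adapter matches, and uses hypothetical function names and an example adapter string.

```python
def parse_fastq(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("FASTQ input must contain four lines per record")
    for k in range(0, len(lines), 4):
        header, seq, plus, qual = lines[k:k + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record near line %d" % (k + 1))
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        yield header[1:], seq, qual


def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact adapter match, if there is one."""
    cut = seq.find(adapter)
    if cut == -1:
        return seq, qual
    return seq[:cut], qual[:cut]


def fastq_to_fasta(text, adapter=None):
    """Convert FASTQ text to FASTA text, optionally trimming an adapter."""
    out = []
    for name, seq, qual in parse_fastq(text):
        if adapter:
            seq, qual = trim_adapter(seq, qual, adapter)
        out.append(">%s\n%s\n" % (name, seq))
    return "".join(out)


reads = "@r1\nACGTAGATCGGAAG\n+\n" + "I" * 14 + "\n@r2\nTTGCA\n+\nIIIII\n"
print(fastq_to_fasta(reads, adapter="AGATCGGAAG"))
```

Production trimmers also handle partial adapters at the read end and tolerate mismatches, which this exact-match sketch does not attempt.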
Figure 6. The chromosome drawing produced by the chromosome tool, illustrating the position of each gene on the chromosome (zoomed out).
Figure 7. Gene IDs generated by the chromosome visualization tool, showing the position of each gene ID on the chromosome (the magnification may differ depending on the density of genes on the chromosome; 500% in this figure).
Figure 8. The weblogo tool constructs a graph that represents the consensus between the sequences in a given multi-FASTA file.
Bioanalyzer was written in the Python programming language (version 3.4+). It provides a set of new functions and tools, written as additional Python code, together with existing tools that were slightly modified to improve their functionality and to present their output in a more ordered way, so that data analysis, extraction and visualization are gathered in one software package. The source code, installer and manual are publicly available at (http://www.ageri.sci.eg/index.php/facilities-services/ageri-softwares/bioanalyzer or https://github.peterhabib/com/bioanalyzer).
This research was funded by the corresponding author, Aladdin Hamweih, senior scientist, Department of Biotechnology at the International Center for Agricultural Research in the Dry Areas (ICARDA).