Splice Site Detection in DNA Sequences Using a Fast Classification Algorithm

Problems

Biological research raises several problems in processing DNA data, and these fall within the field of bioinformatics. DNA-related problems in bioinformatics include classifying groups of sequences, detecting similarity, separating the protein-coding regions of a DNA sequence (splicing), predicting molecular structure, and searching for new drug structures. Research on DNA involves large data sets containing genes, protein sequences, and other biological data, so the processing time and memory required are relatively large.

Pattern recognition in DNA is not an easy problem. Besides the relatively large size of the data, DNA is composed of exons (which encode proteins) and introns (which do not), and no explicit marker character separates the two.
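Because no marker character separates exons from introns, a classifier has to judge each position from the bases around it. The sketch below shows one common way to turn a raw sequence into fixed-length feature vectors. The window width, the one-hot encoding, the helper names, and the example sequence are all assumptions for illustration and are not taken from the paper.

```python
# Illustrative only: encoding and helper names are assumptions, not the paper's.
BASES = "ACGT"

def one_hot(window):
    """Encode a DNA window as a flat 0/1 vector, four entries per base.

    Characters outside A, C, G, T (for example N) become four zeros.
    """
    vector = []
    for base in window.upper():
        vector.extend(1 if base == b else 0 for b in BASES)
    return vector

def sliding_windows(sequence, width):
    """Yield (start, window) for every position where a full window fits."""
    for start in range(len(sequence) - width + 1):
        yield start, sequence[start:start + width]

sequence = "ATGGTAAGTCCAGGT"  # made-up sequence
features = [(start, one_hot(w)) for start, w in sliding_windows(sequence, 6)]
print(len(features), "windows,", len(features[0][1]), "features each")
```

Each window can then be labelled (boundary or not) and passed to any classifier that accepts numeric vectors.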

Goal

The main aim of the method described in the paper is to predict the locations of exons and introns in large data sets with high accuracy and in reasonable time. More generally, the goal for pattern recognition in DNA is to implement a system that can handle the storage, processing, and analysis of large DNA data.

Method

Previous research tends to use the SVM method to determine the boundaries that separate exons and introns. The paper describes a fundamental weakness of SVM: its memory requirement grows with the square of the number of input samples, so the complexity of SVM depends heavily on the size of the data set. The proposed method is motivated by this high complexity of SVM training. The improvement reduces the number of samples used for training, on the grounds that samples close to the decision boundary are the important ones, while samples far from the hyperplane contribute nothing to SVM training. As a result, the training set is much smaller than the entire data set used by a regular SVM.
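The two ideas in this paragraph, quadratic memory growth and keeping only samples near the boundary, can be illustrated with a small sketch. It assumes a dense kernel matrix of 8-byte floats and a linear hyperplane that is already known; in the paper the boundary has to be estimated first. The function names and the sample points are hypothetical.

```python
import math

def kernel_matrix_bytes(n_samples, bytes_per_value=8):
    """Memory for a dense n x n kernel matrix (assumes 8-byte floats)."""
    return n_samples * n_samples * bytes_per_value

def near_boundary(points, w, b, band):
    """Keep points whose distance to the hyperplane w.x + b = 0 is at most `band`."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    kept = []
    for x in points:
        distance = abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        if distance <= band:
            kept.append(x)
    return kept

points = [(0.2, 1.0), (-0.5, 3.0), (4.0, 0.0), (-6.0, 2.0)]  # made-up data
kept = near_boundary(points, w=(1.0, 0.0), b=0.0, band=1.0)
print(kept)
print(kernel_matrix_bytes(len(points)), "bytes ->", kernel_matrix_bytes(len(kept)), "bytes")
```

Under the same assumption, 100,000 samples would need 100,000 x 100,000 x 8 bytes, about 80 GB, for the kernel matrix alone, while halving the sample count cuts that figure to a quarter.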

The new method, an improvement on SVM, is divided into three stages. The first stage determines the support vectors (SVs) of a small data set. The second stage trains a Bayesian classifier on the SVs and non-SVs obtained in the first stage, discards the input data considered less representative, and keeps the important data as candidate SVs. In the third stage, the candidate SVs generated by the second stage are used to train the SVM.
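The three stages can be sketched in plain Python as follows. This is a rough reading of the description above, not the paper's implementation: a hinge-loss linear classifier trained by subgradient descent stands in for the SVM, Gaussian naive Bayes with equal class priors stands in for the Bayesian step, and every function name, parameter value, and data point is an assumption.

```python
import math

def decision(model, x):
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(data, epochs=500, lr=0.02, lam=0.01):
    """SVM stand-in: subgradient descent on the L2-regularised hinge loss."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            violated = y * decision((w, b), x) < 1.0
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularisation shrinkage
            if violated:                             # hinge-loss correction
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def fit_gaussian(points):
    """Per-feature mean and variance of one class (naive Bayes)."""
    n, dim = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(dim)]
    variances = [sum((p[j] - means[j]) ** 2 for p in points) / n + 1e-6
                 for j in range(dim)]
    return means, variances

def log_likelihood(params, x):
    means, variances = params
    return sum(-0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
               for xi, m, v in zip(x, means, variances))

def three_stage_train(data, step=5, tol=0.1):
    # Stage 1: train on a small subset and split it into SVs and non-SVs.
    subset = data[::step]
    rest = [d for i, d in enumerate(data) if i % step]
    first = train_linear_svm(subset)
    svs = [(x, y) for x, y in subset if y * decision(first, x) <= 1.0 + tol]
    non_svs = [(x, y) for x, y in subset if y * decision(first, x) > 1.0 + tol]
    if not svs or not non_svs:  # nothing to learn from: fall back to all data
        return train_linear_svm(data), data
    # Stage 2: learn what SVs look like, then keep SV-like points from the rest.
    sv_model = fit_gaussian([x for x, _ in svs])
    non_model = fit_gaussian([x for x, _ in non_svs])
    candidates = svs + [(x, y) for x, y in rest
                        if log_likelihood(sv_model, x) > log_likelihood(non_model, x)]
    # Stage 3: train the final classifier on the candidate SVs only.
    return train_linear_svm(candidates), candidates

# Made-up, linearly separable 1-D data: negatives left of zero, positives right.
data = [((-(0.5 + 0.1 * i),), -1) for i in range(50)] \
     + [((0.5 + 0.1 * i,), 1) for i in range(50)]
model, candidates = three_stage_train(data)
accuracy = sum((decision(model, x) > 0) == (y > 0) for x, y in data) / len(data)
print(len(candidates), "of", len(data), "points kept; accuracy", accuracy)
```

Equal priors are used in the second stage because SVs are rare in the subset and a frequency-based prior would tend to reject every point. The sketch also does not check that the candidate set contains both classes, which a real implementation would need to do.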

Result

The tests in the paper measure the accuracy and the time spent on the data set during training.

The table compares the error, true negative, and false negative values for the two data sets tested, Acceptor and Donor.
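For reference, the quantities named here can be computed from true and predicted labels as in the sketch below. It uses the standard confusion-matrix definitions, which are assumed to match the paper's, and the labels are made up rather than taken from the Acceptor or Donor data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts

def error_rate(counts):
    """Fraction of samples that were misclassified."""
    total = sum(counts.values())
    return (counts["fp"] + counts["fn"]) / total

y_true = [1, 1, -1, -1, 1, -1]  # made-up labels: 1 = splice site, -1 = not
y_pred = [1, -1, -1, 1, 1, -1]
counts = confusion_counts(y_true, y_pred)
print(counts, "error rate:", error_rate(counts))
```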

Conclusion

The paper describes an improvement to SVM for classifying large data sets. The algorithm selects which data are relevant enough to include as training data and which are not. This is intended to reduce the complexity of the training process that builds the model. The results show that the time spent on training is reduced significantly when building the model.

ADVANTAGE
  • The proposed method is simple yet significantly reduces the processing time needed to train the model
  • Guidelines for conducting the experiments are included, along with clear and detailed research results

DISADVANTAGE

  • The title does not indicate that the method is derived from an existing method, namely SVM
  • The first stage gives neither a rationale nor the specific number of samples used; it only states that the data set used is small.

SUGGESTION

  • The title or the opening section should mention that the method is derived from another method (SVM) so that readers get a clear picture of the proposed method.
  • Add a comparison showing whether the accuracy of the proposed method is lower than or equal to that of other methods, which would further highlight the training time saved by building the model more efficiently.