A MAPREDUCE FRAMEWORK FOR DETECTION OF TANDEM REPEATS IN DNA SEQUENCES

[ 31 Mar 2019 | vol. 12 | no. 1 | pp. 13-24 ]

About Authors:

Vandanababu T1, Raju Bhukya2, D Veeraiah3 and P. Victer Paul3
-1Department of Computer Engineering, Government Polytechnic Daman
-2Department of CSE, National Institute of Technology Warangal
-3Department of Computer Science and Engineering, Vignan’s Foundation for Science, Technology & Research, Guntur, Andhra Pradesh, India

Abstract:

DNA data produced by highly parallel next-generation sequencing technologies is flooding the field of bioinformatics, but traditional single-processor methods for computing tandem repeats cannot keep pace. A tandem repeat in DNA data is a sequence of two or more contiguous copies of the same nucleotide sequence; microsatellites, also called simple sequence repeats (SSRs), are tandem repeats with a unit size of 1-6 bp. They play very important roles in genetic diversity studies, forensic applications, genetic markers, mapping studies of the human genome, and disease diagnosis. Many tools exist to detect these repeats, but most were built around assumed constraints and tuned to the available hardware to minimise execution time, and each has its own advantages and disadvantages relative to the others. Hence a tool that searches exhaustively for all tandem repeats is required. We propose an efficient, constraint-free approach based on Hadoop MapReduce for detecting tandem repeats. It works in parallel and takes functional advantage of key-value based implementations. It displays each kind of repeat together with all of its occurrences in the entire file at once, and consolidates all kinds of repeats. Our experimental results are more readable and more easily interpretable, and they show that the proposed approach is more efficient as data size increases, where other tools have limitations.
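As an illustration only, the key-value idea in the abstract can be sketched in plain Python without Hadoop. This is not the authors' implementation: the function names (`is_primitive`, `map_repeats`, `reduce_repeats`, `find_repeats`), the unit-size limit of 6 bp, the minimum of two copies, and the rule of reporting each run once at its leftmost position are all assumptions made for the sketch. The map step emits a pair (repeat unit, (start, copies)) for every tandem run, and the reduce step groups the pairs by unit so that each kind of repeat is listed with all of its occurrences. Handling of runs that cross input-split boundaries is omitted.

```python
from collections import defaultdict


def is_primitive(unit):
    """Return True if unit is not itself a repetition of a shorter string."""
    n = len(unit)
    return not any(n % d == 0 and unit[:d] * (n // d) == unit
                   for d in range(1, n))


def map_repeats(offset, chunk, max_unit=6, min_copies=2):
    """Map step (hypothetical): emit (unit, (start, copies)) per tandem run."""
    for k in range(1, max_unit + 1):
        for i in range(len(chunk) - 2 * k + 1):
            # Skip positions where the period-k run extends to the left;
            # that run was already emitted at its leftmost start.
            if i > 0 and chunk[i - 1] == chunk[i - 1 + k]:
                continue
            unit = chunk[i:i + k]
            # Skip units such as "ACAC"; they are covered by "AC".
            if not is_primitive(unit):
                continue
            copies = 1
            while chunk[i + copies * k:i + (copies + 1) * k] == unit:
                copies += 1
            if copies >= min_copies:
                yield unit, (offset + i, copies)


def reduce_repeats(pairs):
    """Reduce step (hypothetical): group all occurrences under their unit."""
    grouped = defaultdict(list)
    for unit, occurrence in pairs:
        grouped[unit].append(occurrence)
    return {unit: sorted(occ) for unit, occ in grouped.items()}


def find_repeats(sequence):
    """Run map and reduce in-process on a single chunk starting at offset 0."""
    return reduce_repeats(map_repeats(0, sequence))


if __name__ == "__main__":
    # Each value is a list of (start position, number of copies).
    print(find_repeats("ACACGGTACACAC"))
    # {'G': [(4, 2)], 'AC': [(0, 2), (7, 3)]}
```

In this toy run the unit AC appears under a single key with both of its occurrences, which mirrors the consolidated, per-repeat output the abstract describes.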

Keywords:

Next generation sequencing, Tandem repeats, Exhaustive search, Hadoop MapReduce framework, Map task, Reduce task

About this Article: