Big Genomic Data in Bioinformatics Cloud

Abstract

The completion of the Human Genome Project has led to a proliferation of genomic sequencing data. Together with next-generation sequencing, it has driven down the cost of sequencing, which in turn has increased the demand for analysis of these large genomic data sets. Such data sets and their processing have aided medical research, so expertise is required to deal with biological big data. Cloud computing and big data technologies, such as the Apache Hadoop project, are therefore needed to store, handle and analyse these data: they provide distributed, parallelised data processing and can efficiently analyse even petabyte (PB) scale data sets. However, they also have drawbacks, chiefly long data-transfer times and limited network bandwidth.
Introduction

The introduction of next-generation sequencing has produced unrivalled volumes of sequence data, so modern biology now faces challenges in data management and analysis. A single human's DNA comprises around 3 billion base pairs (bp), representing approximately 100 gigabytes (GB) of data, and bioinformatics is encountering difficulty in storing and analysing data at this scale. Moore's Law implies that computers double in speed and halve in size every 18 months, yet reports say that biological data will accumulate at an even faster pace [1]. The cost of sequencing a human genome fell from $1 million in 2007 to $1 thousand in 2012. With this falling cost, and after the completion of the Human Genome Project in 2003, a flood of biological sequence data was generated; the sequencing and cataloguing of genetic information has increased many fold (as can be observed in the GenBank database of NCBI). Medical research institutes such as the National Cancer Institute are targeting the sequencing of a million genomes to understand biological pathways and genomic variations and to predict the causes of disease. Given that the whole genome of a tumour and a matching normal tissue sample consumes 0.1 TB of compressed data, one million genomes will require 0.1 million TB, i.e. 100 PB (petabytes) [2]. The explosion of biology's data (its scale exceeds a single machine) has made storage, processing and analysis more expensive than its generation, which has stimulated the use of the cloud to avoid large capital infrastructure and maintenance costs. In fact, it requires a shift from common structured data (row-column organisation) to semi-structured or unstructured data, and applications must be developed that execute in parallel on distributed data sets.
With the effective use of big data in the healthcare sector, a reduction of around 8% in expenditure is possible, which would amount to savings of $300 billion annually.
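The storage estimate above can be reproduced with a back-of-envelope calculation. The figures below are the ones quoted in the text (0.1 TB of compressed data per tumour/normal pair, one million genomes, and the decimal convention of 1 PB = 1,000 TB); they are illustrative inputs, not measurements:

```python
# Back-of-envelope storage estimate using the figures quoted above.
TB_PER_GENOME_PAIR = 0.1   # compressed tumour genome + matched normal tissue
NUM_GENOMES = 1_000_000    # the million-genome sequencing target
TB_PER_PB = 1_000          # decimal convention: 1 PB = 1,000 TB

total_tb = TB_PER_GENOME_PAIR * NUM_GENOMES   # 100,000 TB
total_pb = total_tb / TB_PER_PB               # 100 PB

print(f"{total_tb:,.0f} TB = {total_pb:.0f} PB")
```

This confirms the 100 PB figure: well beyond the capacity of any single machine, hence the case for distributed cloud storage.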
Review

Cloud computing

Cloud computing is defined as "a pay-per-use model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction" [3]. The major concepts involved include grid computing, distributed systems, parallelised programming and virtualisation technology, through which a single physical machine can host multiple virtual machines. The problem with grid computing was that effort went mainly into maintaining the robustness and resilience of the cluster itself. Big data technologies have now identified solutions for processing huge parallelised data sets cost-effectively. Cloud computing and big data technologies are two different things: the former facilitates cost-effective storage, while the latter acts as a Platform as a Service (PaaS).
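The parallelised programming model these technologies rely on can be illustrated with a minimal, single-process sketch of MapReduce, the paradigm Hadoop implements. The toy DNA reads and the k-mer counting task are assumptions chosen for illustration; on a real cluster the map and reduce phases would run on many machines over sharded input:

```python
from collections import Counter
from functools import reduce

def mapper(read, k=3):
    """Map phase: emit (k-mer, 1) pairs for every k-mer in one sequencing read."""
    return [(read[i:i + k], 1) for i in range(len(read) - k + 1)]

def reducer(counts, pair):
    """Reduce phase: sum the counts emitted under each k-mer key."""
    kmer, n = pair
    counts[kmer] += n
    return counts

reads = ["GATTACA", "ATTAC", "TACAG"]  # toy sequencing reads (illustrative)

# Map every read independently (parallelisable), then reduce the shared keys.
mapped = [pair for read in reads for pair in mapper(read)]
kmer_counts = reduce(reducer, mapped, Counter())
print(dict(kmer_counts))
```

Because each read is mapped independently and reduction only groups by key, the same logic scales from this toy example to petabyte-scale genomic data distributed across a cluster.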