###### Professional English Course Paper

Student ID: **********
Name: **********
Paper title: The Relationship and Distinction Between Big Data and Data Mining
Instructor: ************
Major: Computer Technology
School: School of Computer Science and Engineering
Graduate School, Guilin University of Electronic Technology
Date: ** (year) * (month) * (day)

The Relationship and Distinction Between Big Data and Data Mining

Student ID: * Name: * Adviser: *
Guilin University of Electronic Technology * *,*

Abstract: This paper discusses data mining in the context of big data. First, we elaborate on the fact that big data is attracting the attention of the academic community, industry and governments. Second, we discuss the adverse aspects of big data, such as the large amount of garbage, heavy contamination and the difficulty of using it. Finally, we dissect the value of big data, expound the techniques for discovering knowledge from big data, and investigate the transformation of knowledge into data intelligence.

Key words: big data; data mining; data intelligence

1. Introduction

As data volumes continue to increase exponentially, the data tsunami can easily overwhelm traditional analytics tools and platforms designed to ingest, analyze and report. Every day, 2.5 quintillion bytes of data are created, and 90 percent of the data in the world today were produced within the past two years[1]. The challenge we face is not only how to store and manage diverse data but also how to analyze the data effectively to gain insight and make smarter decisions.

A number of works have been presented so far. These studies introduce big data mining and analysis from different aspects, such as the status quo, ideas or implementations. For example, one work introduces the “Lambda Architecture”, which provides a general-purpose approach to implementing arbitrary functions on massive datasets in real time; in another, a scalable deep analytics platform has been implemented. Because of this complexity, there is no single tool or one-size-fits-all solution for deeply mining and analyzing big data. 
Moreover, extracting valuable knowledge from massive datasets requires further studies and experiments, as well as scalable and smart services, programming tools and applications.

The remainder of this paper is structured as follows. Section 2 elaborates on the fact that big data plays a primary role in every field. The adverse aspects of big data are then discussed in Section 3. After analyzing the value of big data in Section 4, we introduce the related knowledge and development of data mining in Section 5. In Section 6, the effectiveness of data mining is introduced. Finally, the conclusion follows.

2. About big data

Big data is a complex data set with the following main characteristics: Volume, Variety, Velocity and Veracity[2][3]. These make it difficult to manage and manipulate with existing tools. Of all these data, big data accounts for the vast majority. Big data is the data basis and the source of wisdom for people to understand the real world through the information world. Big data is closely related to applications[4][5], and big data mining is its principal application.

2.1 From understanding the real world to creating the information world

Human civilization is a process that leads from understanding the real world to creating the information world. It has gone through the following stages: preliminary sensing of the world, aiding memory with information, recording and inheriting through information, exchanging and communicating through information, and understanding the world once again through information. Initially, humans used stones and shells to count according to the one-to-one principle, and they tied knots to aid memory. Later, humans used simple graphics and drawn notes to pass on more accurate memories, prompted by their own feelings. When the graphics became relatively fixed common symbols and were associated with the words of a language, texts were produced. 
Texts abstract and generalize the world, promote cultural understanding, and lay the necessary foundation for the development of science. To break through the restriction that written symbols depended on manual copying or engraving, humans used machines after the industrial revolution for mechanized mass production, which improved the efficiency of cultural transmission. The computer centers on high-speed computing and separates software from hardware, contributing to the “electronic” and “automatic” dissemination of information. The Internet centers on the network and interconnects computers, breaking local restrictions on information. Mobile communication centers on users, making the machine follow the user's movements and freeing humans from the machine. The Internet of Things centers on applications and automatically identifies objects to enable information sharing between humans and things. Cloud computing centers on services by consolidating expertise and optimizing the allocation of resources. Big data centers on data and mines knowledge from the entire data set, overcoming the randomness of sampling[6][7], and presents the results on big data centers and mobile terminals. All these information technologies serve the understanding and transformation of the real world.

2.2 Big data is attracting much attention

Just as humans explore the real world through scientific research, they unravel the mysteries of the information world through big data and data mining, which are attracting much attention from academia. 
In May 2011, McKinsey published “Big data: the next frontier for innovation, competition, and productivity”, which analyzed the application potential of big data in different industries from the economic and commercial dimensions and spelled out development policy for government and industry decision makers dealing with big data. In January 2012, the “Wall Street Journal” argued that big data, smart production and wireless networks will lead to new economic prosperity[8]. In March 2012, the United States government released the “Big Data Research and Development Initiative”, which raised the development and application of big data from business conduct to a national strategic deployment, in order to improve the ability to extract knowledge from large and complex data and to help solve some of the nation's most pressing challenges. In April 2012, in a paper titled “Finding correlations in big data”, “Nature Biotechnology” invited eight biologists to evaluate an article published in “Science” in December 2011 under the title “Detecting Novel Associations in Large Data Sets”. In July 2012, Gartner released its first survey report on big data, “Hype Cycle for Big Data, 2012”, which examined big data in depth[9].

In China[10], big data attracts as much attention as it does around the world. Baidu has used Hadoop for off-line processing since 2007. Currently, Baidu has over 10,000 Hadoop servers, more than Yahoo and Facebook, and it plans to reach 20,000 in 2013. Among these, 80% of the Hadoop clusters process a total of 6TB of data every day for log analysis. Tencent, Taobao and Alipay also use Hadoop to build data warehouses and handle big data. In April 2010, Taobao launched a data mining platform, “data cube”, based on a database at the hundred-billion-record level named OceanBase, which supports 4 to 5 million update operations, covering over 2 billion records and more than 2.5TB of data, in one day. 
In May 2010, China Mobile established a massive distributed system and a structured mass data management system on the cloud. Huawei analyzes data from mobile terminals and stores massive data in the cloud to obtain valuable information. Alibaba analyzes business transaction data with big data technology for credit approval.

3. Big data disaster

Big data is closely related to daily human life and has permeated all walks of life. Its quantity, size and complexity are all increasing sharply. A large amount of data has been stored in databases and warehouses in the form of text, graphics, images and multimedia[11]. Research from International Data Corporation (IDC) has shown that, as of 2003, humans had created a total of 5EB of data, while in 2011 the amount of data copied and produced exceeded 1.8ZB. It is expected that by 2020 global data usage will reach 35.2ZB, which would need 37.6 billion hard drives of 1TB capacity to store. On the one hand, these data broaden the scope of big data available for humans to gain wisdom. On the other hand, the value of a single unit of data is rapidly declining. Humans are submerged in an ocean of data but thirsty for knowledge.

3.1 Garbage

Big data is voluminous and grows quickly, but its value density is very low, which means there is a lot of junk data[12]. The electron-positron collider can capture 40 million pictures per second, but only a few thousand are useful. The Romanian Internet security company BitDefender pointed out that spam and phishing messages in social network games have increased by more than 50%. 
Compared with other online communication environments, social network users are more likely to accept and load garbage information unknowingly. Big data and applications are closely related, and professional labeling of the data is the objective basis for rational analysis and sound judgment. Both scientific experimental data and observational data need to be labeled by experts in the field. According to IDC statistics, in 2012 only 23% of all information was useful, only 3% of the potentially useful information had been labeled, and the proportion of data that had been analyzed was much smaller. With the development of modern measuring techniques and digital recording methods, traditional manual, experience-based screening and analysis methods have become powerless in the face of such huge volumes of information.

3.2 Contamination

Data collected from the real world is contaminated. As early as 1992, the Massachusetts Institute of Technology found that data contamination problems are not isolated: among the 50 organizations and agencies sampled for the survey, data accuracy was below 95% in most. However spatial data is acquired, some problems or errors are inevitable[13][14], such as incomplete content, precision errors, data redundancy, contradictory formats, differing types, structural uncertainty, differing scales, differing standards, outdated records, erroneous exceptions, dynamic change and local sparsity. Moreover, each issue has a number of causes. For example, noise can be periodic noise, stripe noise, isolated noise or random noise. Further, these data are often affected by gross errors, systematic errors and random errors, individually or in combination. The expected data accuracy is bound to suffer if these three kinds of error cannot be correctly found and eliminated in the adjustment.

3.3 Difficult to use

Data is not only contaminated, but also difficult to use. 
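
As a concrete illustration of the gross errors described in Section 3.2, the following minimal sketch flags suspicious readings with the modified z-score, which is built on the median and the median absolute deviation (MAD). This is one common robust screening rule and is not a method taken from this paper; the function name `flag_gross_errors`, the sample readings and the threshold of 3.5 are assumptions chosen only for illustration.

```python
import statistics


def flag_gross_errors(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds threshold.

    The score uses the median and the median absolute deviation (MAD),
    so a single extreme reading does not distort the reference level.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # More than half of the values are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical sensor readings; the sixth value imitates a gross error.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 55.0, 10.1, 10.0]
suspect = flag_gross_errors(readings)
print(suspect)
```

A rule of this kind only addresses gross errors. Systematic and random errors need other treatment, such as calibration or repeated measurement.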
The production, transmission, replication and accumulation of data have gone far beyond people's capacity to analyze, understand and apply it. Because of its sheer volume, “big data” is difficult to collect, store, search, share, analyze and put to use. Commercial image processing software (ERDAS IMAGINE, PCI, ENVI, etc.) has difficulty with tasks such as handling mixed pixels, automatic image matching, automatic target extraction and other automatic processing, because new theories and methods are lacking. One newspaper published the same article by the same author on two different pages, “legal community” and “youth topics”. Another newspaper published three articles on the same day, in its “home appliances”, “lifestyle” and “science and technology” editions, all comparing VCD, CVD and DVD, and reached three different conclusions without the editor even noticing.

Over time, all walks of life become submerged in contaminated data garbage, which could lead big data into “garbage in, garbage out” and turn “big data” into useless “big garbage”. Useful data is now buried, and the implied value of big data is hidden. In this predicament, the following problems are the bottlenecks that big data research must break through: how to understand spatial data, how to extract information from the data, how to turn data into usable knowledge, and finally how to realize the value of data.

4. The value of data

Big data is collected from numerous interconnected sources. Its greatest value lies in real usefulness. The generally accepted rule of big data is “decisions based on data”. The first prerequisite is to keep data useful and active at all times. The ultimate value of big data is to gain human intelligence.

4.1 Perceiving the whole picture

Big data provides an unprecedented opportunity to observe the real world in full view rather than through partial samples. 
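
The contrast between a full view and a partial sample can be sketched with a small simulation. The data below is synthetic, and the population size, sample size, number of injected anomalies and the helper name `count_anomalies` are all assumptions made for illustration. The sketch shows that a full scan finds every rare abnormal record, while a 1% random sample can find at most as many and often finds none.

```python
import random


def count_anomalies(records, limit=100.0):
    """Count the records whose value exceeds the given limit."""
    return sum(1 for r in records if r > limit)


rng = random.Random(42)  # fixed seed so that the run is repeatable

# Synthetic "normal" records centered on 50.
population = [rng.gauss(50.0, 5.0) for _ in range(100_000)]

# Inject five rare abnormal records at random positions.
for pos in rng.sample(range(len(population)), 5):
    population[pos] = 500.0

# Full view: every record is inspected.
full_count = count_anomalies(population)

# Partial view: a 1% simple random sample without replacement.
sample = rng.sample(population, 1000)
sample_count = count_anomalies(sample)

print("anomalies found by full scan:", full_count)
print("anomalies found in 1% sample:", sample_count)
```

With five anomalies among 100,000 records, the chance that a 1% sample contains at least one of them is only about 5%, so a sample of this size will usually miss the abnormal changes altogether.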
Without big data, probability statistics can only be produced from random samples of the real world, because spatial data is constrained by collection, storage, computing and transmission. Like the proverbial blind men touching an elephant, who can only take a part for the whole, such sampling offers only a limited view. Incomplete sampling and the dispersion of sample data make it difficult to understand overall trends or to notice abnormal changes.

4.2 Basic resources

McKinsey believes that data is a basic resource comparable to physical assets and human capital, and that it can create significant value for the world economy, improve the productivity and competitiveness of enterprises and the public sector, and create a large economic surplus for consumers. In 2011, the World Economic Forum called big data a new form of wealth. In 2012, the Davos Forum report “Big Data, Big Impact” treated data as an economic asset like currency or gold. In 2012, Gartner stated that “Big data is big money”. The U.S. government considers big data the “new oil”, related to the country's economic restructuring and industrial upgrading[3].

5. Data mining

Data mining refers to the basic technologies used to realize the value of big data, reposition data assets, and use them effectively. Spatial data mining can be used to extract information from data, mine knowledge from information, extract data intelligence from knowledge, improve the capacity for self-learning and self-feedback adaptation, and finally realize human-machine intelligence.

5.1 Basic big data technology

The basic techniques of big data include data collection, storage, processing, expression and quality evaluation.

Big data can be generated by mobile devices, tracking systems, radio frequency identification devices (RFIDs), sensor networks, social networking, Internet search, automatic recording systems, video archives and e-commerce, as well as in the process of analyzing those data.

Big data storage technology is the basis for data mining. 
It is designed to meet the growing need for data storage and aims to provide scalable, highly reliable, high-performance solutions for data storage, access and management, such as distributed data storage, multi-level caching, load balancing and fault-tolerance mechanisms. Conventional methods are not adequate for these tasks. A large data platform must be established through software to provide storage space and access interfaces.

Big data processing implements the transitions from data to information, from information to knowledge, and from knowledge to wisdom.

Big data expression technology is designed to represent the data in a clear and effective way that reveals meaningful information to the user or provides the user with a new perspective. It includes digital elevation models, digital terrain models, flat maps, three-dimensional maps and digital city maps.

Big data quality assessment technology aims to avoid the risks of big data collection and high-density measurement. It includes logical assessment methods, exception-value-based assessment methods and accounting-based assessment methods.

5.2 Discovering knowledge

Knowledge discovery is the technology that uses data mining methods to extract previously unknown, potentially useful and ultimately comprehensible rules. It is also a process of gradual sublimation, step by step, from data to information and then to knowledge. Data mining systems aim to summarize data gradually into knowledge. By integrating data, they can extract knowledge in depth. With such new knowledge, data can be processed in real time, so that the data can be understood and applied to make intelligent judgments and well-informed decisions. Knowledge can be self-learning, self-enhancing, universal and easily recognized. 
It can serve as a basis for decision support. If businesses take full advantage of knowledge, people will be able to learn, work and live in a more precise and dynamic way and to reach a state of wisdom. It will help to improve resource utilization and productivity. Moreover, it will help to respond to the economic crisis, the energy crisis, the deterioration of the environment and many other global issues.

5.3 Extracting data intelligence

Data intelligence is the ability to obtain more innovative, systematic and comprehensive knowledge to solve a particular problem through in-depth analysis of the collected data. It is the ability to understand and solve problems quickly, flexibly and correctly. Spatial data intelligence has three features: more thorough perception, more extensive interoperability and deeper intelligence. The three features aim to obtain larger and more comprehensive data, to share and jointly operate on data via the Internet, to analyze and mine data with a variety of advanced techniques, and to constitute a hierarchy of spatial data intelligences (Fig. 1).

Figure 1. The hierarchy of spatial data intelligences

Big data intelligence is not a simple overlay of different data mining techniques, but an industry-oriented system that is reasonably structured, runs well and has powerful wisdom. The more reasonable the industry structure becomes, the smaller the internal friction, the greater the effectiveness and the higher the wisdom of the system. Every time a person interacts with the data, he or she becomes more efficient and more productive, which means that a better way to analyze, summarize and calculate takes shape. Through the consolidation and analysis of trans-regional and trans-sector data, with knowledge applied to specific industries, specific scenes and specific solutions, big data intelligence can better support decision-making and action.

Deeper data intelligence creates new value from data. 
On the one hand, when spatial data knowledge is fully used in all walks of life, it can produce secondary knowledge. To form a mechanism for mining knowledge from knowledge, primary knowledge needs to be brought together into an intelligent form of expression. Ultimately, the target knowledge can be obtained. On the other hand, based on a general industrial or socio-ecological system, it can redefine the mode of interaction among government, companies and individuals, improving the clarity, efficiency, flexibility and response speed of the interaction. The interaction changes from a traditional single dimension, such as production and consumption, managing and being managed, or planning and execution, to a new multi-dimensional collaborative relationship. In this new relationship, both individuals and organizations can freely contribute and obtain information and expertise accurately and in a timely manner. The participants in this new relationship exert a positive influence on each other, producing the macro-effect of smart operation.

6. Effectiveness

When we possess the necessary knowledge and the ability to control it, data becomes a valuable asset that leads to market domination and huge economic returns. Big data technology providers use their technology to process structured, semi-structured and unstructured data for users. Big data applications are increasingly ubiquitous on the Internet, rich in interfaces and fragmented. The application industry is vertically integrated; therefore, a business that is closer to end users tends to have a larger influence in the industry chain. Morgan Stanley's report insists that “Big Data is soon to become Any Data”[15]. In order to win the future, the rational choice is “giving customers the technologies they need to store and analyze ‘any’ data set - any type of data, any size of data, for any type of user, and in any timeframe.”

7. Conclusion

The development of big data extends the scope of human activities. It demands proper attention from academia, industry and government. 
The world has been cooperating and integrating on a global scale. People are compelled to shift from a local to a global mode in their everyday life and work. Big data redefines the relationships among individuals, businesses, organizations, governments and societies through networked thinking, and it further improves the human living environment, enhances the quality of public services, and raises performance, efficiency and productivity through intelligent interactive operation. The technological progress and industrial upgrading of big data will create new markets, new business models and new industry rules, and, more importantly, it demonstrates the collective will of a country that is looking for strategic advantage. Although there is still a large gap before data intelligence approaches human wisdom, big data is a promising topic, and it certainly helps us to understand the world from an entirely new perspective.

References:
[1] What is big data: Bring big data to the enterprise. 2012./software/data/bigdata/
[2] United Nations Global Pulse. Big Data for Development: Challenges & Opportunities [R]. May 2012
[3] Office of Science and Technology Policy | Executive Office of the President. Fact Sheet: Big Data across the Federal Government [R], March /OSTP
[4] Viktor Mayer-Schönberger, Kenneth Cukier. Big data era: Life, work and thought of the big bang [M]. Hangzhou: Zhejiang People's Publishing, 2012
[5] BARABASI A.-L. Bursts: The Hidden Patterns Behind Everything We Do [M]. Plume Books, 2011
[6] BURSTEIN F., HOLSAPPLE C.W. Handbook of Decision Support System [M]. Berlin: Springer, 2008
[7] HAINING R. Spatial Data Analysis: Theory and Practice [M]. Cambridge: Cambridge University Press, 2003
[8] MILLS M.P., OTTINO J.M. The Coming Tech-led Boom [N],
[9] LAPKIN A. Hype Cycle for Big Data [R], 31 July 2012, Gartner, Inc. | G0*******, 2012
[10] Zhijun Zhu, Congguo Yu, Lei Yan. Big data: Big value, Big chance, Big Change [M]. Electronics Industry Press, 2012
[11] RAJARAMAN A., ULLMAN J.D. 
Mining of Massive Datasets [M]. Cambridge University Press, 2011
[12] McKinsey Global Institute. Big Data: the Next Frontier for Innovation, Competition, and Productivity [R]. May 2011
[13] Kim, W. et al. A taxonomy of dirty data [J]. Data Mining and Knowledge Discovery, 2003, 7: 81-99
[14] Stolfo, S.J. Real-world data is dirty: data cleansing and the merge/purge problem [J]. Data Mining and Knowledge Discovery, 1998, 2: 1-31
[15] Morgan Stanley. Cloud Computing Takes Off: Market Set to Boom as Migration Accelerates [R]. May 23, 2011