Big Data. Does this sound familiar? Of course, you might have read about it in newspapers or tech journals, or heard of it at seminars. But most of us are still confused about what Big Data actually is. Is it just another buzzword, or does it mean something?
Wikipedia defines Big Data as “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”
In the past 15 years, the amount of data collected by companies through the Internet has increased manifold. Companies like Google, Microsoft, and Facebook were collecting data blindly, but they didn’t know what to do with all that information. The data sets were too massive to be processed by traditional analytical methods. If the companies were to gain anything from all this data, they first had to develop the means to handle it.
It all began with a research paper released by Google in 2004 titled “MapReduce: Simplified Data Processing on Large Clusters”. MapReduce is a programming model and an associated implementation for processing and generating large data sets. Its central idea is that many large-scale data processing tasks can be expressed using two basic functions: a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. This paper laid the foundation for the development of the Apache Hadoop framework. Hadoop is an open-source software framework for distributed storage and distributed processing of large data sets on computer clusters built from commodity hardware. In simple words, it helps us draw conclusions from large data sets.
Now, one might ask, why is all this even important? One must remember that many of these companies are service-based companies. The information they get from their customers, if processed, can be used to predict customer demand fairly accurately, thus increasing sales. In fact, you can experience this yourself. Go to Amazon and look up a product, say a Raspberry Pi. Once you are on the Raspberry Pi page, you can see a list of recommendations showing what other users bought along with the Raspberry Pi. And there is a good chance, I would guess around 80%, that you’re going to buy that Raspberry Pi case along with your Raspberry Pi.
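To make the map and reduce functions described above a little more concrete, here is a minimal sketch in plain Python. It is not Hadoop code, and it is not how Amazon actually builds its recommendations; it only imitates the map, group-by-key, and reduce steps on a single machine to count which items were bought together. The order data and all function names (`map_order`, `shuffle`, `reduce_counts`, `co_purchase_counts`, `bought_with`) are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical orders, invented purely for illustration.
ORDERS = [
    ["raspberry pi", "case", "sd card"],
    ["raspberry pi", "case"],
    ["raspberry pi", "power supply"],
    ["sd card", "case"],
]


def map_order(order):
    """Map step: emit an intermediate ((item_a, item_b), 1) pair for every
    pair of distinct items that appear in the same order."""
    for pair in combinations(sorted(set(order)), 2):
        yield pair, 1


def shuffle(mapped_pairs):
    """Group intermediate values by key. A MapReduce framework normally
    does this between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: merge all values sharing the same key into one total."""
    return key, sum(values)


def co_purchase_counts(orders):
    """Run map, shuffle, and reduce over all orders on one machine."""
    mapped = (kv for order in orders for kv in map_order(order))
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


def bought_with(item, counts):
    """Items bought together with `item`, most frequent first."""
    related = {}
    for (a, b), n in counts.items():
        if a == item:
            related[b] = n
        elif b == item:
            related[a] = n
    return sorted(related.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    counts = co_purchase_counts(ORDERS)
    print(bought_with("raspberry pi", counts))
```

With the made-up orders above, the case comes out as the item most often bought with the Raspberry Pi. In a real cluster, the map and reduce calls would run in parallel on many machines, and the framework would take care of the grouping and of moving data between them.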
But that’s not the only field where big data is useful. Big data, today, is being used to solve many real-world problems, to optimize processes, and to save money. It is being used all across the board, from banks and hospitals to logistics and even sports. In fact, the job title of my current mentor in the Alumni Mentorship Program of my college says ‘Data Engineer at San Francisco 49ers’! For those of you who don’t know, the San Francisco 49ers are a professional American football team based in the San Francisco Bay Area. So, apart from its original purpose, big data has found takers in many other unconventional fields.
Businesses in different sectors have just begun to realize the power of data science. Soon, any company that doesn’t incorporate data analysis into its work structure risks becoming obsolete. Hence, a lot of them are looking for professionals with data analysis skills who can make sense of all the data generated by the company and turn it into profitable ideas. The demand has increased so much that, according to research released by LinkedIn in 2015, ‘statistical analysis and data mining’ was the second most sought-after skill among employers. So, the next time you think about getting a certification, make sure that Hadoop is on the list.