How Google, Facebook and Yahoo laid the foundations for the Big Data revolution
We live in the era of ‘Big Data’ and it’s no longer a buzzword but a reality. Never before in human history have we produced such large volumes of data.
- More data was created in the past two years than in the entire previous history of the human race
- By 2020 digital data will boom to 44 trillion gigabytes
- Over 1.2 trillion searches are performed on Google each year
- Google has 30 trillion pages in its index and growing
- Every minute 300 hours of video are being uploaded to YouTube
This is just a snippet of a staggering list of data production milestones. But did you know that it was largely the giants of the digital industry who laid the foundations for this data revolution?
Google and Hadoop
No discussion about Big Data is complete without talking about Apache Hadoop. Hadoop is an open source software framework built for the distributed storage (Hadoop Distributed File System – HDFS) and processing (MapReduce engine) of large data sets on computer clusters built from commodity hardware. (Commodity hardware means the standard, off-the-shelf servers and storage racks that you’ve probably seen your IT department working on.)
Apache Hadoop was inspired by Google’s papers on MapReduce and on its proprietary file system. Indexing trillions of documents on the web is a complicated job, so Google had to come up with a scalable and reliable hardware set-up that did not depend on commissioning a storage manufacturer to build super storage machines. This set-up was the Google File System, which inspired HDFS.
The stored data then had to be processed to serve the millions of searches being made on the search engine each day (now up to 3.5 billion searches per day). Google soon realized that processing that much data with traditional methods would take days. So it came up with a system that breaks a job into smaller chunks that can be processed in parallel across the cluster. A Map() function filters and sorts each chunk of data based on certain conditions, and a summary function called Reduce() then combines the results of all these small jobs.
Google has moved on to more complex ways of processing information since launching its file system and parallel processing model. However, this initiative paved the way for other technology companies and marketers to come up with more data processing solutions.
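The Map() and Reduce() idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it runs on a single machine, counts words in two made-up documents, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical rather than part of Hadoop or Google's system.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)

def word_count(documents):
    # On a real cluster each document could be mapped on a different machine;
    # here we simply loop over them in one process.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

# Made-up sample input for illustration only.
documents = ["big data big clusters", "data on commodity hardware"]
counts = word_count(documents)
print(counts)
```

Because each call to `map_phase` depends only on its own document, the work can be spread across many machines, which is the property that makes the approach scale.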
How can you use Hadoop?
Data-driven marketing is a commonly used term in the industry. Today’s user has multiple interactions with a brand through a myriad of devices and touchpoints, each resulting in data that can be analysed.
The volume and velocity of this marketing data mean that marketers have no option but to look towards a distributed file system for storage and processing in order to derive insight.
The good news is that you don’t have to invest millions in setting up the infrastructure. Through cloud platforms, such as Google Cloud, you can rent Hadoop clusters on a pay-as-you-go basis.
Yahoo and Hadoop
While Google laid the groundwork for the Big Data revolution through its distributed file system and parallel processing system, it was Yahoo that introduced it to the world as an open source framework.
In fact, Hadoop owes its name to Yahoo: its software engineer Doug Cutting named it after his son’s toy elephant. Yahoo became one of the first significant users of the platform and, in June 2009, it made the source code of the version of Hadoop it ran in its data centres publicly available.
Facebook and Hive
Hive is one of the easiest Big Data analytics tools to understand. It’s a data warehousing infrastructure built on top of Hadoop. It allows data analysts to run queries and perform data mining using a query language similar to SQL, the query language traditionally used in the industry.
Facebook developed Hive because, just like Google, it had to deal with explosive growth in its data (rising from terabytes to petabytes) alongside its exponentially growing network. A traditional database simply wouldn’t have been enough to cope with such a volume and velocity of data.
While Hadoop and MapReduce were readily available to the Facebook engineers, they were still too complicated to use for simple queries. This motivated the engineers to come up with a simple query language that was easy to use on unstructured Big Data and resembled the syntax of commonly used query languages. Using Hive, Facebook was able to store data in tables and partitions that can be joined to derive insight using simple queries.
How can you use Hive?
Using Hive, you can write queries that run on Facebook Insights data and combine data from different tables.
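To give a feel for what combining tables with a simple query looks like, here is a minimal sketch in Python. It uses the built-in SQLite database as a stand-in for Hive, since Hive's query language is SQL-like; the query has not been run against Hive itself. The table names (`page_posts`, `post_engagement`), the columns and the numbers are all invented for illustration and are not the real Facebook Insights schema.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive warehouse (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_posts (post_id INTEGER, post_type TEXT);
    CREATE TABLE post_engagement (post_id INTEGER, likes INTEGER);
    INSERT INTO page_posts VALUES (1, 'video'), (2, 'photo'), (3, 'video');
    INSERT INTO post_engagement VALUES (1, 120), (2, 45), (3, 80);
""")

# Join the two hypothetical tables and total the likes per post type.
query = """
    SELECT p.post_type, SUM(e.likes) AS total_likes
    FROM page_posts p
    JOIN post_engagement e ON p.post_id = e.post_id
    GROUP BY p.post_type
    ORDER BY total_likes DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The point is the shape of the query: an analyst describes the result they want with SELECT, JOIN and GROUP BY instead of writing Map() and Reduce() functions by hand.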
Google and machine learning
Machine learning is a combination of artificial intelligence, statistics, and traditional data mining. It is typically split into two streams: supervised and unsupervised machine learning.
Although machine learning is not a direct Google invention, such is the influence of the search giant that the field has been brought to the forefront through one of its most significant algorithm updates – ‘RankBrain’. For Google, machine learning is a big deal, and it has been promoting its use through its Cloud Machine Learning platform.
How can you use machine learning?
Using machine learning, marketers can perform shopping basket analysis to reveal associations between product purchases, make time-series predictions of website traffic from historical data, and run sentiment analysis on the text of social media data.
The Big Data ecosystem is not limited to the applications developed by the internet giants. New platforms, software and apps emerge every fortnight, promising to get your company or organisation started with Big Data. It is the digital industry, however, with the volume and breadth of its user data, that has inspired the revolution and led its most innovative developments.
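As a rough illustration of shopping basket analysis, the sketch below counts how often pairs of products appear in the same basket and reports the most common pair. It is a deliberately simple stand-in for fuller association-rule methods, and the baskets are made-up sample data.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions: each set is the contents of one shopping basket.
baskets = [
    {"bread", "butter", "jam"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "jam"},
]

# Count every pair of products bought together in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: the share of all baskets that contain a given pair.
support = {pair: count / len(baskets) for pair, count in pair_counts.items()}

top_pair, top_support = max(support.items(), key=lambda item: item[1])
print(top_pair, top_support)
```

A pair with high support is a candidate for cross-promotion, although a real analysis would also check how often each product sells on its own before drawing conclusions.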