Tuesday, September 23, 2014

Big Data – An Introduction and a Look from In-Memory Data Grids

With the increasing number of sensors, computers, and smartphones, more and more data is stored digitally. Traditionally, the Internet followed an information provider – information consumer model, where websites produced information that was merely consumed by a number of users. Web 2.0, however, sees far more user engagement. The average Internet user nowadays is not a mere consumer. Rather, she produces content through social media and forums and engages with the existing data on the Internet. These user interactions can be mined to produce more interesting information. With the huge amount of data produced, it is apparent that data storage and manipulation should scale as well. Big data is a paradigm that attempts to handle data at a much larger scale than traditional means of data storage and access, such as relational databases.


Store, Process, and Access the 3 "V"s
[Figure: The 3 Vs of Big Data]
Information on the Internet keeps growing exponentially, as more and more information is made public over the wire, moving towards a paperless work environment. There is a definite paradigm shift in the view of historical data: from mere log files on backup disks to useful information for analytics and data mining, stored in data warehouses. For example, the Internet Archive Wayback Machine contains as much as 2 PB (petabytes) of data, and keeps growing at a rate of 20 TB per month. However, big data is not defined by its gigantic volume alone; it also depends on velocity and variety. Velocity describes how fast data flows in and out of the system. In a weather forecast system, temperature, relative humidity, wind speed, and other measurements flow in from the sensors every second, and they should be processed efficiently in real time. This involves a huge volume of data moving at high velocity. The third V, variety, indicates that the data can come in heterogeneous formats: it can be composed of raw images, numbers, video, or music files. Data in multiple formats should be mashed up and processed to find the interesting information. In the weather forecasting scenario, each sensor may feed its output to the server in a different format. A big data system differs from a relational database system in its ability to store, access, and process such a complicated data set.

Pattern Recognition, Data Mining, Machine Learning, and Big Data
A big data solution may not be huge in volume. Obama's big data campaign started with 10 TB of data in various forms. The data was processed at very high speed, as new data was made available frequently by volunteers and analysts. Hence the campaign had high variety and velocity, with a relatively low volume. It also required 66,000 simulations to be run every day, and the real-time processing of the simulation outcomes required parallel execution. Recently, a study by Facebook on users' emotions, examining how the updates users see affect their emotions and the posts they share, created some uneasiness among users. Nevertheless, mining user responses to optimize business outcomes is nothing new. The regular A/B testing carried out by almost all mainstream public websites is an example, where different layouts are tested against the same content, or different titles and captions are placed, to find which of them persuades readers to click a link, subscribe, spend more time, or even purchase an item. Patterns are recognized from millions of responses, and the analytics are carried out at scale on big data solutions. Machine learning analytics find recurring patterns and provide predictions based on the numbers. Big data solutions open up a whole new opportunity for the data mining domain in finding associations and patterns.

Scalability and Simplicity
[Figure: Big Data Solution over an In-Memory Data Grid]
The increasing volume and velocity of big data require larger and more powerful computers to scale up. Cloud storage and data-as-a-service solutions started to replace high-end computers with the abundant resources of many utility computers. NoSQL solutions were developed with simplicity, horizontal scalability, and the economics of big data in mind. NoSQL databases have flexible data models that enable storing the variety of data objects found in big data, unlike relational databases, which come with a strict schema. NoSQL solutions can be categorized according to their design and functionality: key–value stores, column-oriented stores, document-oriented stores, and graph databases can be considered the major categories of NoSQL databases, which store data persistently on disk or in memory.

In-Memory Data-Grids for Big Data
In-Memory Data-Grids (IMDGs) such as Infinispan, Hazelcast, GridGain, GigaSpaces XAP, VMware vFabric GemFire, IBM eXtreme Scale, and Oracle Coherence exploit the abundant storage, processing, and memory resources in computer clusters to provide a unified view of the nodes in the cluster. This model of shared storage, memory, and processing enables the execution of larger tasks that cannot be executed effectively on a single node. While persistent stores use disk as their storage medium, in-memory data grids use memory, adhering to the commonly stated phrase "memory is the new disk". Data grids share computing resources among the instances, providing the unified view of a supercomputer. The abundant availability of memory also enables efficient use of the CPU cache, speeding up processes even beyond linear speedups.
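As a minimal sketch of this unified view, the snippet below starts a single grid node with Hazelcast (assuming a Hazelcast 3.x dependency and its default cluster discovery; the map name "sensor-readings" is purely illustrative). Every JVM that runs the same code joins the same cluster, and the map's entries are partitioned across the memory of all members.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import java.util.Map;

public class GridNode {
    public static void main(String[] args) {
        // Starting an instance joins (or forms) the cluster; every JVM running
        // this code on the same network becomes another node of the grid.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Looks like an ordinary Map, but its entries live in the shared memory
        // of the cluster, partitioned (and backed up) across all members.
        Map<String, Double> readings = hz.getMap("sensor-readings");
        readings.put("station-42", 23.4);

        System.out.println("Entries visible from this node: " + readings.size());
    }
}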

While in-memory data grids have the ability to integrate with a persistent store when the available memory is insufficient to hold a very large object space, persistence of the objects stored in memory is generally ensured through backups. Data is replicated synchronously or asynchronously, based on the configuration. This ensures faster transactions than the cheaper disk accesses, which provide slower response times. Distributed execution framework implementations handle the "process" stage of big data seamlessly, as they execute the algorithms in a distributed manner over the data. MapReduce frameworks are hence implemented over in-memory key–value stores such as Infinispan and Hazelcast using these distributed execution frameworks, following the MapReduce model of Hadoop, as sketched below.
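As a rough illustration of such distributed execution (a sketch using Hazelcast's IExecutorService, not the actual MapReduce API of either product), the snippet below fans a task out to every cluster member and aggregates the partial results, similar in spirit to a map step followed by a reduce step. The service name "analytics" and the placeholder task are illustrative only.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IExecutorService;
import com.hazelcast.core.Member;
import java.io.Serializable;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.Future;

public class DistributedCount {

    // The task is serialized and shipped to every member, where it computes a
    // partial result over that member's local share of the data (the "map" step).
    static class LocalWork implements Callable<Long>, Serializable {
        @Override
        public Long call() {
            return 1L; // placeholder for real member-local processing
        }
    }

    public static void main(String[] args) throws Exception {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IExecutorService executor = hz.getExecutorService("analytics");

        // Fan the task out to all cluster members.
        Map<Member, Future<Long>> partials = executor.submitToAllMembers(new LocalWork());

        // Aggregate the partial results (the "reduce" step).
        long total = 0;
        for (Future<Long> partial : partials.values()) {
            total += partial.get();
        }
        System.out.println("Cluster-wide result: " + total);

        hz.shutdown();
    }
}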

Data storage vendors are focusing on positioning their products to withstand the big data storm. Oracle Big Data SQL is a recent product from Oracle that provides a single, optimized SQL query over distributed data, bringing back the simplicity of the unified view of a traditional database. As one might presume, Oracle Big Data SQL supports multiple data sources, including NoSQL solutions, and does not limit itself to the Oracle database. The interoperability of such platforms shows a favorable future for data mining and warehousing.

Conclusion
Data keeps growing, and what currently seems big may turn out to be tiny in the future. Data is measured in exabytes (EB) when it comes to large data-oriented companies such as Facebook and Google, a unit that was rarely used a decade ago. The paradigm shift towards big data is economic rather than technical. Though complex computations can be run on supercomputers given large enough resources, an in-memory data grid utilizes the existing resources of utility computers, aligning with the economics of big data. While more and more tools are developed for the sake of scalability and efficiency, research challenges such as security and privacy should not be taken lightly.
