Thursday, 3 September 2015

Big Data Series – Part 3: Six technology components for Data Acceleration


At least six key technology components are needed to build a supporting architecture: big data platforms, ingestion solutions, complex event processing, in-memory databases, cache clusters and appliances. Each component helps with data movement (from the source to where the data is needed), processing and interactivity (the usability of the data infrastructure).

Big Data platform (BDP)

A BDP is a distributed file system and compute engine. It contains a big data core: a computer cluster with distributed data storage and computing power. Sharding partitions very large databases into smaller, more easily managed parts, and replication keeps copies of those parts on several nodes, in order to accelerate data storage.
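To make sharding and replication a little more concrete, here is a minimal Python sketch. It assumes a simple hash-based scheme; the function names `shard_for` and `replica_nodes`, the node count and the "next node in the ring" placement rule are all hypothetical and not taken from any particular platform.

```python
import hashlib

def shard_for(key, num_shards):
    """Map a record key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def replica_nodes(shard, num_nodes, replication_factor):
    """Place copies of a shard on consecutive nodes (assumed placement rule)."""
    return [(shard + i) % num_nodes for i in range(replication_factor)]

NUM_NODES = 4           # assumed cluster size
REPLICATION_FACTOR = 2  # assumed number of copies per shard

# Decide where each record and its copy would live in the cluster.
placement = {
    key: replica_nodes(shard_for(key, NUM_NODES), NUM_NODES, REPLICATION_FACTOR)
    for key in ["customer-17", "customer-42", "order-9001"]
}
```

Each key always lands on the same shard, so reads and writes for different keys can be spread over different nodes, while the second copy protects against the loss of a single node.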

Newer additions enable more powerful use of core memory as a high-speed data store. These improvements allow for in-memory computing. Streaming technologies added to the core can enable real-time complex event processing. In-memory analytics support better data interactivity.

Further enhancements to the big data core create fast and familiar interfaces to the data on the cluster. The core stores structured and unstructured data, but reading it requires map/reduce functionality. Query engine software enables the creation of structured data tables in the core and common query functionality (SQL etc.).
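The map/reduce style of reading data can be sketched in a few lines of plain Python. This is only an illustration of the idea on a single machine, not the API of any real cluster framework; the function names are hypothetical.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (word, 1) pair for every word in one raw text record."""
    for word in record.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the cluster would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def word_count(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

A query engine hides these steps behind a familiar statement, roughly comparable to `SELECT word, COUNT(*) ... GROUP BY word` over a structured table.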

Ingestion

Collecting, capturing and moving data from its sources to underlying repositories was traditionally done through the extract, transform and load (ETL) method. Today the priority is not the structure of the data as it enters the system, but ensuring that all data is gathered, across a growing range of data types and sources, and quickly transported to areas where users can process it. Ingestion solutions cover both static and real-time data. The data is gathered by the publisher and then sent to a buffer or queue, from which the user can request it.
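The publisher, buffer and user roles can be sketched with Python's standard `queue` module. This is a single-process stand-in for a real ingestion layer; the `publish` and `consume` helpers and the record layout are assumptions made for illustration.

```python
import queue

def publish(source_name, records, buf):
    """Publisher: push raw records into the buffer without reshaping them."""
    count = 0
    for record in records:
        buf.put({"source": source_name, "payload": record})
        count += 1
    return count

def consume(buf, max_items):
    """User side: request up to max_items records from the buffer."""
    items = []
    while len(items) < max_items:
        try:
            items.append(buf.get_nowait())
        except queue.Empty:
            break  # nothing left to hand out right now
    return items

buffer = queue.Queue()
publish("web-logs", ["GET /", "GET /about"], buffer)
batch = consume(buffer, max_items=10)
```

Note that the publisher does not enforce any structure on the payload, which reflects the shift away from transforming data before it enters the system.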

Complex Event Processing (CEP)

After data ingestion, CEP is responsible for preprocessing, aggregation and triggering events. It tracks, analyzes and processes event data and derives conclusions. CEP draws data from multiple sources and combines historic and fresh data in order to infer patterns and understand complex circumstances. Its engines pre-process fresh data streams from their sources, expedite processing of future data batches, match data against predetermined patterns and trigger events based on detected patterns.

CEP offers immediate insight and enables fast action. In-memory computation allows data movement and processing to run in parallel, which increases speed. CEP solutions add computing power by processing the data before it is submitted to the data stores or file systems.
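Matching a stream against a predetermined pattern and triggering an event could look like the following Python sketch. The pattern used here (a number of readings above a threshold within a time window) and the class name `ThresholdPattern` are assumptions chosen for illustration; real CEP engines offer much richer pattern languages.

```python
from collections import deque

class ThresholdPattern:
    """Trigger an alert when min_count readings exceed threshold
    within window_seconds (a hypothetical example pattern)."""

    def __init__(self, threshold, min_count, window_seconds):
        self.threshold = threshold
        self.min_count = min_count
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps of recent readings above threshold

    def on_event(self, timestamp, value):
        # Forget hits that have fallen out of the time window.
        while self._hits and timestamp - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        if value > self.threshold:
            self._hits.append(timestamp)
        if len(self._hits) >= self.min_count:
            self._hits.clear()  # start fresh after firing
            return {"type": "alert", "at": timestamp}
        return None
```

Because the state is a small in-memory queue, each incoming event is evaluated before anything is written to a data store or file system.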

In-memory databases (IMDB)

IMDBs are faster than traditional databases because they use simpler internal algorithms and execute fewer central processing unit (CPU) instructions. The database is preloaded from disk into memory. Accessing data in memory eliminates the seek time involved in querying data on disk storage. Applications communicate with the database through SQL: a query triggers the query optimizer, and the records are retrieved from RAM.

IMDBs constrain the entire database to a single address space. Any data can be accessed within microseconds. The steadily falling RAM prices favor this solution.
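As a small stand-in for an IMDB, Python's built-in `sqlite3` module can hold a whole database in RAM when it is opened with the special name `:memory:`. The table layout and helper names below are assumptions for illustration; a production IMDB is a different class of product.

```python
import sqlite3

def load_into_memory(rows):
    """Create a database that lives entirely in RAM and preload it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()
    return conn

def average_by_sensor(conn):
    """Applications still talk to the in-memory store through SQL."""
    cursor = conn.execute(
        "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
    )
    return cursor.fetchall()

conn = load_into_memory([("a", 1.0), ("a", 3.0), ("b", 5.0)])
averages = average_by_sensor(conn)
```

The query never touches the disk, so there is no seek time involved in answering it.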

Cache Clusters

Cache clusters are clusters of servers in which memory is managed by central software designed to take load off upstream data sources (databases) on behalf of applications and users. They are typically maintained in memory and offer fast access to frequently accessed data. They sit between the data source and the user. Traditionally they accommodate simple operations such as reading and writing values. They are populated when a query is sent from a data user to the source. Prepopulating a cache cluster with frequently accessed data improves response time. Data grids take caching a step further by supporting more complex queries and using massively parallel processing (MPP) computations.
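The way a cache is populated on a query, or prepopulated ahead of time, can be sketched as a read-through cache in Python. This single-process class is only an illustration of the behaviour; `ReadThroughCache` and its `load_from_source` callback are hypothetical names, and a real cache cluster distributes the store over many servers.

```python
class ReadThroughCache:
    """Sits between the user and the data source (single-node sketch)."""

    def __init__(self, load_from_source):
        self._load = load_from_source  # function that queries the upstream source
        self._store = {}               # in-memory key/value store
        self.misses = 0

    def prepopulate(self, keys):
        """Warm the cache with data that is known to be requested often."""
        for key in keys:
            self._store[key] = self._load(key)

    def get(self, key):
        """Serve from memory; fall back to the source only on a miss."""
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._load(key)
        return self._store[key]
```

Only the first request for a key reaches the upstream database; every later request for it is answered from memory, and prepopulated keys never cause a miss at all.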

Appliance

Massively parallel processing sits between data access and data storage. An appliance here is a pre-configured set of hardware and software, including servers, memory, storage, input/output channels, operating systems, DBMS, admin software and support services.

It may have a common database for online transaction processing and analytical processing, which improves interactivity and speed. Appliances can perform complex processing on massive amounts of data.

Implementing and maintaining high-performance databases on clusters is challenging, and few companies have the necessary expertise to do so themselves.

Custom-silicon circuit boards enable companies to develop their own specific solutions. They enable development on devices for specific use cases and allow for network optimization (integrating embedded logic, memory, networking and processor cores). This plug-and-play functionality offers interesting possibilities.

Continued in part 4 of 5
