There are at least six key technology components for building a supporting architecture: Big Data platforms, ingestion solutions, complex event processing, in-memory databases, cache clusters and appliances. Each component helps with data movement (from source to wherever it is needed), processing and interactivity (the usability of the data infrastructure).
Big Data platform (BDP)
A BDP is a distributed file system and compute engine. It contains a big data core: a computer cluster with distributed data storage and computing power. Sharding partitions very large databases into smaller, easier-to-manage parts, and replication keeps copies of those parts on multiple nodes; together they accelerate data storage and access.
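To make the sharding idea concrete, here is a minimal Python sketch of hash-based shard assignment; the node names and record keys are invented purely for illustration:

```python
import hashlib

# Hypothetical 4-node cluster; the node names are illustrative only.
NODES = ["node-a", "node-b", "node-c", "node-d"]

def shard_for(key: str) -> str:
    """Map a record key to a node using a stable hash."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Every record lands on a deterministic node; a replication scheme
# could place copies on the next node(s) in the list.
for key in ["order-1001", "order-1002", "order-1003"]:
    print(key, "->", shard_for(key))
```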
Newer additions make more powerful use of cluster memory as a high-speed data store. These improvements allow for in-memory computing. Streaming technologies added to the core can enable real-time complex event processing, and in-memory analytics support better data interactivity.
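As a rough illustration of in-memory computing on such a cluster, the PySpark snippet below caches a dataset in memory so that repeated queries avoid re-reading from disk; the file path and column name are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# "events.json" stands in for a real dataset on the cluster.
events = spark.read.json("events.json")

# cache() pins the dataset in cluster memory after the first read.
events.cache()
events.filter(events.status == "error").count()  # first pass fills the cache
events.filter(events.status == "ok").count()     # served from memory
```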
Further enhancements to the big data core create fast and familiar interfaces to the data on the cluster. The core stores structured and unstructured data, but reading it natively requires map/reduce functionality. Query engine software enables the creation of structured data tables in the core and provides common query functionality (SQL, etc.).
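For example, a query engine lets an analyst declare a table over files in the core and query it with plain SQL instead of writing map/reduce jobs. A minimal PySpark sketch, with file and column names assumed for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-core").getOrCreate()

# A structured file already sitting in the big data core (placeholder path).
sales = spark.read.parquet("sales.parquet")

# Expose the file as a table and fall back to familiar SQL.
sales.createOrReplaceTempView("sales")
top = spark.sql("SELECT region, SUM(amount) AS total "
                "FROM sales GROUP BY region ORDER BY total DESC")
top.show()
```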
Ingestion
Collecting, capturing and moving data from its sources to the underlying repositories was traditionally done through the extract, transform and load (ETL) method. Today the priority is not the structure of the data as it enters the system, but ensuring that all data is gathered, across a growing number of data types and sources, and quickly transported to where users can process it. Ingestion solutions cover both static and real-time data: the data is gathered by a publisher and then sent to a buffer/queue, from which the user can request it.
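One possible sketch of this publisher/queue pattern uses the kafka-python client; the broker address, topic name and payload below are placeholders:

```python
from kafka import KafkaProducer, KafkaConsumer

# The publisher side: gather a reading and push it to the queue.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"sensor": 7, "temp": 21.4}')
producer.flush()  # make sure the buffered message reaches the broker

# The user side: pull from the same buffer/queue at its own pace.
consumer = KafkaConsumer("sensor-readings",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.value)
    break  # read a single message for the demo
```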
Complex Event Processing (CEP)
After data ingestion, CEP is responsible for preprocessing and aggregation (and triggering events). It tracks, analyzes and processes event data and derives conclusions from it. CEP draws data from multiple sources and combines historic as well as fresh data in order to infer patterns and to understand complex circumstances. Its engines pre-process fresh data streams from their sources, expedite processing of future data batches, match data against predetermined patterns and trigger events when those patterns are detected.
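A toy CEP rule in plain Python might look like the sketch below; the window size, threshold and event types are invented for the example:

```python
from collections import deque

WINDOW = 5       # consider only the last 5 events
THRESHOLD = 3    # trigger when 3 or more failures fall inside the window

recent = deque(maxlen=WINDOW)

def on_event(event: dict) -> None:
    """Match each incoming event against a predetermined pattern."""
    recent.append(event)
    failures = sum(1 for e in recent if e["type"] == "login_failed")
    if failures >= THRESHOLD:
        print("ALERT: possible brute-force attempt", list(recent))

# Fresh events streaming in from some ingestion source:
stream = [{"type": "login_failed"}] * 3 + [{"type": "login_ok"}]
for ev in stream:
    on_event(ev)
```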
CEP offers immediate insight and enables fast action. In-memory computation allows data movement and processing to run in parallel, increasing speed. CEP solutions add computing power by processing the data before it is submitted to the data stores or file systems.
In-memory databases (IMDB)
IMDBs are faster than traditional databases because they use simpler internal algorithms and execute fewer central processing unit instructions. The database is preloaded from disk into memory, and accessing data in memory eliminates the seek time involved in querying data on disk storage. Applications still communicate through SQL; the engine serves records from RAM and runs queries through the query optimizer.
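A small, self-contained illustration uses Python's built-in sqlite3 module, which can hold an entire database in RAM; the table and rows are made up:

```python
import sqlite3

# ":memory:" keeps the whole database in RAM, so no disk seeks occur.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE quotes (symbol TEXT, price REAL)")
db.executemany("INSERT INTO quotes VALUES (?, ?)",
               [("ABC", 101.5), ("XYZ", 47.2)])

# Applications still talk plain SQL; only the storage layer changed.
for row in db.execute("SELECT symbol, price FROM quotes WHERE price > 50"):
    print(row)
```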
IMDBs keep the entire database in a single address space, so any data can be accessed within microseconds. Steadily falling RAM prices favor this solution.
Cache Clusters
Cache clusters are clusters of servers whose memory is managed by central software designed to shift load from upstream data sources (databases) toward applications and users. They sit between the data source and the user, are typically maintained in-memory and offer fast access to frequently used data. Traditionally they accommodate simple operations such as reading and writing values, and they are populated when a data user sends a query to the source; prepopulating a cache cluster with frequently accessed data improves response time further. Data grids take caching a step further by supporting more complex queries and using massively parallel processing (MPP) computations.
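The cache-aside pattern behind this can be sketched with the redis-py client; the host, key scheme, TTL and the stand-in database call are all assumptions for the example:

```python
import json
import redis

# Placeholder address for one node of the cache cluster.
cache = redis.Redis(host="localhost", port=6379)

def load_from_database(user_id: int) -> dict:
    """Stand-in for the expensive upstream query."""
    return {"id": user_id, "name": "example"}

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:                    # cache hit: no database trip
        return json.loads(cached)
    user = load_from_database(user_id)        # cache miss: go upstream
    cache.set(key, json.dumps(user), ex=300)  # keep hot data for 5 minutes
    return user
```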
Appliance
Massively parallel processing sits between data access and data storage. An appliance here is a pre-configured set of hardware and software including servers, memory, storage, input/output channels, operating systems, DBMS, administration software and support services.
It may have a common database for online transaction and analytical processing, which improves interactivity and speed. Appliances can perform complex processing on massive amounts of data.
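As a rough analogy for the MPP idea (not the internals of any actual appliance), the Python sketch below scatters a dataset across worker processes and gathers the partial results:

```python
from multiprocessing import Pool

def partial_sum(chunk: list) -> int:
    """Each worker aggregates its own partition (the 'scatter' step)."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]  # 4 partitions for 4 workers
    with Pool(processes=4) as pool:
        partials = pool.map(partial_sum, chunks)
    print(sum(partials))                     # the 'gather' step
```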
Implementing and maintaining high-performance databases on clusters is challenging, and few companies have the necessary expertise to do so themselves.
Custom-silicon circuit boards enable vendors to develop their own specific solutions. They allow devices to be developed for specific use cases and allow for network optimization (integrating embedded logic, memory, networking and processing cores). This plug-and-play functionality offers interesting possibilities.
Continue to part 4 of 5.