Saturday 5 September 2015

Big Data & Analytics - the full view (upon request)

Upon request, I have combined the five parts of the previously published Big Data series into one document, so you can read everything in one place. Please share your thoughts and best practices with me. You can always email me directly at alexwsteinberg@gmail.com.
Big Data Series – Part 1 – Technical challenges
Big Data requires learning a great deal about data as an asset and about analytics. Data is the most precious asset in an organization, the currency of the enterprise.
Companies’ data ecosystems have become complex and littered with silos. A large majority of companies are still not able to take full advantage of Big Data.
There are many challenges with Big Data: lack of knowledge, varying definitions and expectations, different views about data sources and use cases, ignorance of valuable data sources and technologies, and so on.
Companies must understand data across the entire data supply chain and its individual stages: identifying and leveraging data sources, importing data, enhancing its value, combining it with other data, generating insight, and taking specific actions.
This means companies must mobilize data across the enterprise; deeply understand, analyze and determine the value of that data; and understand business use cases and data patterns to determine appropriate actions.
It requires companies to commit to continuous discovery, experimentation, testing, learning, adaptation and innovation.
Many approaches, solutions and technologies are currently offered in the Big Data domain, and they are evolving quickly. Companies need to be aware of the different options and their pros and cons in order to combine them into an overall solution.
Continue to part 2 of 5
Big Data Series – Part 2 – Traditional data approaches not enough anymore
Given the varying types, sources and sheer volume of data today, the traditional approach of collecting data in a staging area, transforming it into the desired format, loading it into a mainframe or data warehouse, and then delivering requested data to users query by query no longer works well.
Companies must perform calculations, run simulation models and compare statistics at high speed to generate insights. Real-time analytical tools that can pre-process streaming data and correlate data from internal and external sources offer interesting opportunities, but also complex challenges.
Data acceleration enables massive amounts of data to be ingested, processed, stored, queried and accessed much faster. It ensures multiple ways for data to come into the company’s data infrastructure and be referenced fast.
Data acceleration leverages hardware and software power through clustering and helps correlate different data sources, including localization. It improves interactivity by enabling users and applications to connect to the data infrastructure in universally accepted ways and ensuring that user queries are delivered as quickly as required.
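To make this concrete, here is a minimal single-machine sketch in Python. A real cluster spreads this work across many servers; the multiprocessing version below illustrates the same partition-and-parallelize principle across CPU cores (the data set and aggregation function are illustrative assumptions).

# Minimal sketch of the parallelism behind data acceleration: partition a
# data set and aggregate the parts in parallel. A real cluster distributes
# this across servers; here, worker processes stand in for cluster nodes.
from multiprocessing import Pool

def aggregate(chunk):
    # Stand-in for per-node processing (filtering, counting, scoring, ...).
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]     # partition the data set
    with Pool(processes=4) as pool:
        partials = pool.map(aggregate, chunks)  # process partitions in parallel
    print(sum(partials))                        # combine the partial results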
Continue to part 3 of 5
Big Data Series – Part 3 – Six technology components for Data Acceleration
There are at least six key technology components to build a supporting architecture: Big Data platforms, Ingestion solutions, Complex event processing, In-memory databases, Cache clusters and Appliances. Each component helps with data movement (from source to where needed), processing and interactivity (the usability of the data infrastructure).
Big Data platform (BDP)
A BDP is a distributed file system and compute engine. It contains a big data core: a computer cluster with distributed data storage and computing power. Replication and sharding partition very large databases into smaller, more easily managed parts in order to accelerate data storage.
Newer additions enable more powerful use of core memory as a high-speed data store. These improvements allow for in-memory computing. Streaming technologies added to the core can enable real-time complex event processing. In-memory analytics support better data interactivity.
Further enhancements to the big data core create fast and familiar interfaces to data on the cluster. The core stores structured and unstructured data, but requires map/reduce functionality to read it. Query engine software enables the creation of structured data tables in the core and common query functionality (SQL etc.).
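To illustrate the query-engine idea, here is a minimal sketch using Apache Spark, one widely used big data platform (the post does not prescribe a specific product; the file path and column names below are assumptions):

# Minimal sketch: exposing data stored on a cluster through a SQL query
# engine. Assumes Apache Spark; path and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-engine-sketch").getOrCreate()

# Read semi-structured data from the distributed file system.
events = spark.read.json("hdfs:///data/events/")  # hypothetical location

# Register it as a structured table so users can query it with familiar SQL.
events.createOrReplaceTempView("events")

# Standard SQL against data that physically lives on the cluster.
spark.sql("""
    SELECT source, COUNT(*) AS cnt
    FROM events
    GROUP BY source
    ORDER BY cnt DESC
    LIMIT 10
""").show()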
Ingestion
Collecting, capturing and moving data from its sources to underlying repositories was traditionally done through the extract, transform and load (ETL) method. Today the priority is not the structure of the data as it enters the system, but ensuring that all data is gathered, across a growing variety of data types and sources, and quickly transported to where users can process it. Ingestion solutions cover both static and real-time data. The data is gathered by a publisher and then sent to a buffer/queue, from which the user can request it.
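Here is a minimal sketch of this publish/buffer/consume pattern using only the Python standard library; a production ingestion layer would use a distributed broker (Kafka, for example), and all names below are illustrative:

# Minimal sketch of ingestion: a publisher gathers raw records and pushes
# them to a buffer/queue; the consumer requests data at its own pace.
import queue
import threading

buffer = queue.Queue(maxsize=1000)  # the buffer/queue between publisher and user

def publisher():
    """Gather raw records from a source and push them to the buffer as-is."""
    for i in range(5):
        buffer.put({"seq": i, "payload": f"raw event {i}"})  # structure not enforced
    buffer.put(None)  # sentinel: no more data

def consumer():
    """The user side requests data from the buffer when ready to process it."""
    while True:
        record = buffer.get()
        if record is None:
            break
        print("processing", record)

threading.Thread(target=publisher).start()
consumer()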
Complex Event Processing (CEP)
After data ingestion, CEP is responsible for pre-processing and aggregation (and for triggering events). It tracks, analyzes and processes event data and derives conclusions. CEP draws data from multiple sources and combines historic as well as fresh data in order to infer patterns and to understand complex circumstances. CEP engines pre-process fresh data streams from their sources, expedite processing of future data batches, match data against pre-determined patterns and trigger events when patterns are detected.
CEP offers immediate insight and enables fast action. In-memory computation allows data movement and processing to run in parallel, increasing speed. CEP solutions add computing power by processing the data before it is submitted to the data stores or file systems.
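As a toy illustration of matching a fresh stream against a pre-determined pattern and triggering an event, here is a minimal CEP-style sketch (the window size, threshold and event fields are assumptions):

# Minimal CEP-style sketch: correlate recent events in a sliding window and
# trigger an action the moment a pre-determined pattern is detected.
from collections import deque

WINDOW = 5      # number of recent events to correlate
THRESHOLD = 3   # pattern: three or more errors within the window

recent = deque(maxlen=WINDOW)

def trigger_alert(count):
    print(f"pattern detected: {count} errors in the last {len(recent)} events")

def on_event(event):
    recent.append(event)
    errors = sum(1 for e in recent if e["level"] == "error")
    if errors >= THRESHOLD:
        trigger_alert(errors)  # act before the data ever reaches the store

# Simulated incoming stream (in practice this arrives from the ingestion layer).
for ev in [{"level": l} for l in ["info", "error", "error", "info", "error"]]:
    on_event(ev)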
In-memory databases (IMDB)
IMDBs are faster than traditional databases because they use simpler internal algorithms and execute fewer central processing unit instructions. The database is preloaded from disk into memory. Accessing data in memory eliminates the seek time involved in querying data on disk storage. Applications communicate with the database through SQL; records are served from RAM and queries go through the query optimizer.
IMDBs constrain the entire database to a single address space. Any data can be accessed within microseconds. The steadily falling RAM prices favor this solution.
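An easy way to experience the IMDB idea is SQLite's in-memory mode, which keeps the entire database in a single process address space and accepts standard SQL (schema and data below are illustrative):

# Minimal in-memory database sketch: no disk seeks, queries served from RAM.
import sqlite3

db = sqlite3.connect(":memory:")  # the whole database lives in RAM
db.execute("CREATE TABLE trades (symbol TEXT, price REAL)")
db.executemany("INSERT INTO trades VALUES (?, ?)",
               [("ABC", 101.5), ("ABC", 102.0), ("XYZ", 55.2)])

# The query optimizer works over records that are already memory-resident.
for row in db.execute("SELECT symbol, AVG(price) FROM trades GROUP BY symbol"):
    print(row)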
Cache Clusters
Cache clusters are clusters of servers whose memory is managed by central software designed to shift load from upstream data sources (databases) to applications and users. They are typically maintained in-memory and offer fast access to frequently accessed data. They sit between the data source and the user. Traditionally they accommodate simple operations such as reading and writing values. They are populated when a query is sent from a data user to the source; prepopulating a cache cluster with frequently accessed data improves response time. Data grids take caching a step further by supporting more complex queries and using massively parallel processing (MPP) computations.
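Here is a minimal sketch of the read-through and prepopulation pattern described above; the upstream source is simulated, and all names are illustrative:

# Minimal caching sketch: the cache sits between users and the upstream data
# source, fills on first access, and can be prepopulated with hot keys.
import time

cache = {}  # in a cache cluster, this memory is spread across servers

def slow_source(key):
    time.sleep(0.1)  # simulate an expensive query to the upstream database
    return f"value-for-{key}"

def get(key):
    if key not in cache:           # cache miss: go upstream once...
        cache[key] = slow_source(key)
    return cache[key]              # ...later reads are served from memory

# Prepopulating frequently accessed data improves even the first response time.
for hot_key in ("home_page", "top_products"):
    cache[hot_key] = slow_source(hot_key)

print(get("home_page"))  # served from the cache
print(get("user:42"))    # populated on demand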
Appliance
An appliance here is a pre-configured set of hardware and software, including servers, memory, storage, input/output channels, operating systems, DBMS, administration software and support services. It sits between data access and data storage and uses massively parallel processing.
It may have a common database for online transaction and analytical processing, which improves interactivity and speed. Appliances can perform complex processing on massive amounts of data.
Implementing and maintaining high-performance databases on clusters is challenging, and few companies have the necessary expertise to do so themselves.
Custom-silicon circuit boards enable vendors to develop their own specific solutions. They enable development of devices for specific use cases and allow for network optimization (integrating embedded logic, memory, networking and processor cores). This plug-and-play functionality offers interesting possibilities.
Continue to part 4 of 5
Big Data Series – Part 4 – Creating a suitable Technology Stack/Solution
All of these components bring their individual technology features. Companies must wisely put together an overall solution from among those components, leveraging their complementary advantages and customizing those to their particular needs.
Four fundamental technology stacks (with their variations) offer possible solutions:
  1. Big data core only or with enhancements (with complex event processing, with in-memory database, with query engine or with complex event processing and query engine)
    • This technology is the de-facto standard for exceptional data movement, processing and interactivity.
    • Data usually enters the cluster through batch or streaming.
    • Events are not processed immediately, but in intervals. This enables parallel processing on large data sets, and thus advanced analytics.
    • Applications and services may access the core directly, delivering improved performance on large, unstructured data sets.
    • Adding CEP enhances the big data core with real-time detection of patterns in data and the triggering of events. It enables real-time animated dashboards. A machine learning program could be added to the CEP.
    • An IMDB can further increase computing power by placing key data in RAM.
    • Query engines can further open interfaces for applications to access big data even faster.
  2. In-memory database (IMDB) cluster only or with enhancements (with Big Data Platform, with complex event processing)
    • External data is streamed in or transferred in bulk to the IMDB.
    • Users and applications can directly query the IMDB, usually through SQL-like structures.
    • With a BDP, incoming data is first pre-processed by the BDP before it goes to the IMDB.
    • With CEP, the CEP first ingests the data; processing is then done in the IMDB and results are returned to the application for faster interactivity (see the sketch after this list).
  3. Distributed Cache only or with enhancement (with Application and Big Data platform)
    • A simple caching stack sits atop the data source repository. The application retrieves the data, and the most relevant data subset is placed in the cache.
    • Processing of the data falls to the application (which may result in slower processing speeds).
    • With a BDP, the BDP ingests the data from the source and does the bulk of the processing, then puts a data subset in the cache.
  4. Appliance only or with enhancement (with Big Data platform)
    • Data streams directly into the appliance; the application talks directly to the appliance.
    • With a BDP, the BDP ingests and processes the data. The application can talk directly to the appliance for queries.
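To show how such a stack composes, here is a minimal sketch of option 2 with CEP: a pre-processing step ingests the stream, cleaned records land in an in-memory database, and the application queries it directly through SQL (field names and the filtering rule are assumptions):

# Illustrative sketch of stack 2: CEP-style ingestion -> IMDB -> SQL queries.
import sqlite3

imdb = sqlite3.connect(":memory:")  # stand-in for an in-memory database cluster
imdb.execute("CREATE TABLE readings (sensor TEXT, value REAL)")

def cep_ingest(event):
    """Pre-process/filter each fresh event before it reaches the data store."""
    if event["value"] is None:  # drop malformed events at ingestion time
        return
    imdb.execute("INSERT INTO readings VALUES (?, ?)",
                 (event["sensor"], event["value"]))

# Externally streamed-in data.
for ev in [{"sensor": "a", "value": 1.2},
           {"sensor": "a", "value": None},
           {"sensor": "b", "value": 3.4}]:
    cep_ingest(ev)

# The application queries the IMDB directly, through SQL-like structures.
for row in imdb.execute("SELECT sensor, COUNT(*) FROM readings GROUP BY sensor"):
    print(row)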
Continue to part 5 of 5

Big Data Series – Part 5 – 12 Immediate suggestions to build a data supply chain
  • Consider data as perhaps the most important asset in your organization. Become data-driven. Some people call it “data religious”.
  • Research Big Data & Analytics best practices; this requires continuous learning. Refer to the different approaches offered in previous blogs (Data Acceleration Parts 1 and 2).
  • Do an inventory of existing data. Focus on the most frequently accessed and time-relevant data.
  • Identify, simplify and optimize inefficient data processes. Eliminate manual, time-consuming data curation processes (such as tagging and cleaning).
  • Identify currently unmet business needs and develop solutions.
  • Identify and overcome data silos.
  • Simplify and standardize data access through a robust data platform.
  • Build an effective technology stack using one of the four suggested options while leveraging some of the six described components (Data Acceleration Parts 1 and 2).
  • Further explore API management, traditional middleware, PaaS and other possibilities.
  • Analyze current internal data sources and look for still-hidden sources. Explore external sources to increase the quantity and quality of available data.
  • Identify and improve individual data supply chain streams.
  • Develop a systematic roadmap for building an effective overall data supply chain.

Special thanks to Accenture Technology Labs and Analytics Group, whose thought leadership, best practices and white papers have served as inspiration and knowledge source for this Big Data series.

+++
To share your own thoughts or other best practices about this topic, please email me directly at alexwsteinberg (@) gmail.com.

Alternatively, you may also connect with me and become part of my professional network of Business, Digital, Technology & Sustainability experts at

https://www.linkedin.com/in/alexwsteinberg   or
Xing at https://www.xing.com/profile/Alex_Steinberg   or
Google+ at  https://plus.google.com/u/0/+AlexWSteinberg/posts

