Continuing with our goal setting: as I mentioned in my
previous post, combining artificial intelligence and blockchain is still an
undiscovered area. We still don't really understand how to come up with a good
business application. By a good application I mean one that finds the most
effective and efficient way to use both technologies in the same environment.
As far as we know, both systems are data-driven in their own
way.
Looking forward, we are pretty much set on our goal for the future. Our challenge is to design and develop a Data Lake. The reason we wanted to take on this challenge is that, when we look at the components in our development environment, our foundation is architecturally similar to a data lake.
While developing this massive system, we expect to face the same challenges
that, with today's data warehouses, can impede usage
and prevent users from maximizing their analytics:
Timeliness. Introducing new content to the enterprise data
warehouse can be a time-consuming and cumbersome process. When users need
immediate access to data, even short processing delays can be frustrating and
cause users to bypass the proper processes in favor of getting the data quickly
themselves. Users also may waste valuable time and resources to pull the data
from operational systems, store and manage it themselves, and then analyze it.
Flexibility. Users not only lack on-demand access to any
data they may need at any time, but also the ability to use the tools of their
choice to analyze the data and derive critical insights. Additionally, current
data warehousing solutions often store one type of data, while today’s users
need to be able to analyze and aggregate data across many different formats.
Quality. Users may view the current data warehouse with
suspicion. If where the data originated and how it has been acted on are
unclear, users may not trust the data. Also, if users worry that the data in
the data warehouse is missing or inaccurate, they may circumvent the warehouse
in favor of getting the data themselves directly from other internal or
external sources, potentially leading to multiple, conflicting instances of the
same data.
Findability. With many current data warehousing solutions,
users do not have a function to rapidly and easily search for and find the data
they need when they need it. Inability to find data also limits the users’
ability to leverage and build on existing data analyses.
Advanced analytics users require a data storage solution
based on an IT “push” model (not driven by specific analytics projects). Unlike
existing solutions, which are specific to one or a small family of use cases,
what is needed is a storage solution that enables multiple, varied use cases
across the enterprise.
This new solution needs to support multiple reporting tools
in a self-serve capacity, to allow rapid ingestion of new datasets without
extensive modeling, and to scale large datasets while delivering performance.
It should support advanced analytics, like machine learning and text analytics,
and allow users to cleanse and process the data iteratively and to track
lineage of data for compliance. Users should be able to easily search and explore
structured, unstructured, internal, and external data from multiple sources in
one secure place.
Traditionally, a Data Lake is a data-centered architecture
featuring a repository capable of storing vast quantities of data in various
formats; in our case, however, we are using Blockchain, so the information
becomes decentralized. Data from web server logs, databases, social media, and
third-party sources is ingested into the Data Lake; in our challenge we are
streaming data from stock market sources such as NYSE, Nasdaq, S&P and Dow Jones.
Curation takes place through capturing metadata and lineage and making it
available in the data catalog (Datapedia). Security policies, including
entitlements, also are applied.
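To make the curation step more concrete, here is a minimal sketch of what a catalog entry with metadata, lineage and entitlements could look like. The class names (`CatalogEntry`, `DataCatalog`), the dataset name and the roles are all hypothetical, made up for illustration; this is not the actual Datapedia design.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Metadata for one dataset in a hypothetical data catalog."""
    name: str
    source: str                      # where the data came from
    fmt: str                         # storage format, e.g. "json" or "csv"
    entitled_roles: set              # roles allowed to read the dataset
    lineage: list = field(default_factory=list)  # ordered processing steps

    def record_step(self, description: str) -> None:
        # Append a timestamped step so users can see how the data was acted on.
        stamp = datetime.now(timezone.utc).isoformat()
        self.lineage.append((stamp, description))


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str, role: str) -> list:
        # Return names of datasets matching the keyword that the role may read.
        keyword = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if role in e.entitled_roles
            and (keyword in e.name.lower() or keyword in e.source.lower())
        )

    def lineage_of(self, name: str) -> list:
        # Return only the step descriptions, in the order they were recorded.
        return [step for _, step in self._entries[name].lineage]


catalog = DataCatalog()
quotes = CatalogEntry("nyse_quotes_raw", "NYSE stream", "json",
                      {"analyst", "data_scientist"})
quotes.record_step("ingested from stream")
quotes.record_step("deduplicated by trade id")
catalog.register(quotes)

print(catalog.search("nyse", "analyst"))
print(catalog.lineage_of("nyse_quotes_raw"))
```

Keeping lineage and entitlements on the same entry means that one search both answers "where did this data come from" and respects who is allowed to see it.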
Data can flow into the Data Lake by either batch processing
or real-time processing of streaming data. Additionally, data itself is no
longer restrained by initial schema decisions, and can be exploited more freely
by the enterprise. Rising above this repository is a set of capabilities that
allow IT to provide Data and Analytics as a Service (DAaaS), in a supply-demand
model. IT takes the role of the data provider (supplier), while business users
(data scientists, business analysts) are consumers.
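The two ingestion paths, and the idea that data is not tied to an initial schema, can be sketched roughly as follows. `RawZone`, `tick_stream` and the sample records are assumptions invented for this example; a real lake would sit on distributed storage and a real market feed.

```python
import json


class RawZone:
    """Hypothetical raw storage area: keeps records as received, no schema enforced."""

    def __init__(self):
        self.records = []

    def ingest_batch(self, lines):
        # Batch path: load a whole collection of records at once.
        self.records.extend(lines)

    def ingest_stream(self, source):
        # Streaming path: consume records one at a time as they arrive.
        for line in source:
            self.records.append(line)

    def read(self, fields):
        # Schema-on-read: the caller picks the fields at query time;
        # fields missing from a record come back as None.
        for line in self.records:
            row = json.loads(line)
            yield {name: row.get(name) for name in fields}


def tick_stream():
    # Stand-in for a live market feed (symbols and prices are made up).
    yield json.dumps({"symbol": "AAA", "price": 10.5, "venue": "NYSE"})
    yield json.dumps({"symbol": "BBB", "price": 20.0})


zone = RawZone()
zone.ingest_batch([json.dumps({"symbol": "CCC", "price": 5.0, "volume": 100})])
zone.ingest_stream(tick_stream())

rows = list(zone.read(["symbol", "venue"]))
print(rows)
```

Because the records are stored untouched, a different consumer could later call `read` with a different field list without anyone having to re-ingest the data.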
The DAaaS model enables users to self-serve their data and
analytic needs. Users browse the lake’s data catalog (a Datapedia) to find and
select the available data and fill a metaphorical “shopping cart” (effectively
an analytics sandbox) with data to work with. Once access is provisioned, users
can use the analytics tools of their choice to develop models and gain
insights. Subsequently, users can publish analytical models or push refined or
transformed data back into the Data Lake to share with the larger community.
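As a rough illustration of this self-serve flow (browse, fill the cart, work in the sandbox, publish back), here is a small sketch. The `DataLake` and `Sandbox` classes and the dataset names are hypothetical and only stand in for whatever tooling an actual lake would provide.

```python
class DataLake:
    """Hypothetical lake: maps a dataset name to a list of rows."""

    def __init__(self):
        self.datasets = {}

    def browse(self):
        # What a user sees when browsing the catalog.
        return sorted(self.datasets)

    def publish(self, name, rows):
        self.datasets[name] = list(rows)


class Sandbox:
    """A user's 'shopping cart' of datasets copied from the lake."""

    def __init__(self, lake):
        self.lake = lake
        self.cart = {}

    def add(self, name):
        # Copy the rows so that sandbox work does not mutate shared lake data.
        self.cart[name] = [dict(row) for row in self.lake.datasets[name]]

    def transform(self, name, fn):
        # Apply a user-supplied function to every row in the cart.
        return [fn(row) for row in self.cart[name]]

    def publish_back(self, new_name, rows):
        # Share refined data with the larger community.
        self.lake.publish(new_name, rows)


lake = DataLake()
lake.publish("prices_raw", [{"symbol": "AAA", "price": 10.0},
                            {"symbol": "BBB", "price": 20.0}])

box = Sandbox(lake)
box.add("prices_raw")
refined = box.transform(
    "prices_raw", lambda r: {**r, "price_cents": int(r["price"] * 100)}
)
box.publish_back("prices_refined", refined)

print(lake.browse())
```

Publishing under a new name, rather than overwriting `prices_raw`, keeps the raw data intact for other users who want to start from the original.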
Although provisioning an analytic sandbox is a primary use,
the Data Lake also has other applications. For example, the Data Lake can also
be used to ingest raw data, curate the data, and apply ETL. This data can then
be loaded to an Enterprise Data Warehouse. To take advantage of the flexibility
provided by the Data Lake, organizations need to customize and configure the
Data Lake to their specific requirements and domains.
The Data Lake can be an effective data management solution
for advanced analytics experts and business users alike. A Data Lake allows
users to analyze a large variety and volume of data when and how they want. Following a
Data and Analytics as a Service (DAaaS) model provides users with on-demand,
self-serve data.
However, to be successful, a Data Lake needs to leverage a
multitude of products while being tailored to the industry and providing users
with extensive, scalable customization. Knowledgent’s Informationists provide
the blend of technical expertise and business acumen to help organizations
design and implement their perfect Data Lake.
