AZ-900 Episode 15 | Azure Big Data & Analytics Services | Synapse, HDInsight, Databricks

Apr 26, 2024

Hey guys, welcome back. I'm Adam, and in this episode we're going to focus on what is considered big data and which services in Azure help us process and analyze those types of data sets. Stay tuned.

As part of episode 15, we will learn about three Azure services. This time it will be Azure Synapse Analytics, HDInsight, and Databricks. But before we move on to those services, let's back up a bit and talk about what is considered big data.

Big data is a field of technology that helps us solve typical challenges with the extraction, processing, and analysis of our data sets. But normally, to call something big data, certain characteristics must be met. The first is velocity. Velocity means how fast our data arrives and how fast or how often we need to process it: are we processing that data in batches, or maybe in real time? The second characteristic is volume, so we are talking about megabytes, gigabytes, terabytes, or even petabytes of data. And the third is variety. Variety means how structured our data is: are we talking about tables or databases, or maybe something very complex like video or information from social networks?

Based on these three characteristics, we can define whether data is considered big data or not. As soon as we hit a limit on one of those vectors, one of those characteristics, traditional software will not be able to process these types of data sets. And this is how big data technologies emerged: software designed specifically to help us with those types of challenges.

Which brings us to our first service, Azure Synapse Analytics. But to talk about Synapse Analytics and its benefits, we need to talk about what the typical process looks like when it comes to transformation and analysis of our data.

Most data engineers will start by identifying where their data is, whether it be flat files, some web services, or databases, and the typical development process begins from there. Developers first need to ingest their data from those sources into the cloud. Then they need to transform those data sets and store them somewhere, and after storing the data, expose it to other tools, such as reporting tools, so that business users can get insights from the data and make good business decisions. Azure Synapse Analytics helps with all of those steps, first by providing a feature called Synapse Pipelines.
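The four stages just described (ingest, transform, store, serve) can be sketched in a few lines of plain Python. This is only an illustration of the flow, not how Synapse Pipelines is implemented: the sample CSV, the `orders` table, and the function names are all made up for this sketch, and an in-memory SQLite database stands in for cloud storage.

```python
import csv
import io
import sqlite3

# Hypothetical sample source: a small CSV file held in memory.
RAW_CSV = """order_id,country,amount
1,PL,120.50
2,US,80.00
3,PL,39.50
4,DE,not_a_number
"""

def ingest(raw_text):
    """Ingest: read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def transform(rows):
    """Transform: convert types and drop rows that fail to parse."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["order_id"]), row["country"], float(row["amount"])))
        except ValueError:
            continue  # skip malformed records
    return clean

def store(rows):
    """Store: load the cleaned rows into a queryable table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (order_id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

def serve(conn):
    """Serve: expose an aggregate that a reporting tool could display."""
    query = "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
    return conn.execute(query).fetchall()

report = serve(store(transform(ingest(RAW_CSV))))
print(report)
```

Each stage only depends on the output of the previous one, which is the same property that lets a visual workflow tool chain the steps together.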

Synapse Pipelines helps developers ingest and transform their data using visual workflows. Additionally, Synapse Analytics comes with integrated Apache Spark, a leading technology for big data analysis and transformation. On top of that, there is Synapse SQL, a massively parallel processing database cluster based on the popular SQL Server. This feature helps with transformation using typical SQL queries, with storing your data, and also with serving it to your reporting clients. All of that is built into something called Synapse Studio, which is a unified experience for managing all of those tools and features and doing all the data transformation in one place. And all of that is nicely integrated with another Azure service called Data Lake, so that the ingestion, transformation, and storage of our data can also be done directly in the data lake. So let's summarize Azure Synapse Analytics.

First of all, it is a big data analytics platform. It is a platform as a service offering on Azure that allows data engineers and data scientists to perform data analysis and data transformation on very large data sets. It has multiple built-in tools: Apache Spark for efficient big data transformations, and Synapse SQL, which allows us to use familiar SQL Server tools with the power of a massively parallel processing design, with dedicated or on-demand (serverless) capacity. It also has Synapse Pipelines, which allow you to visually build your data ingestion and transformation workflows. And all of that is combined into a single studio experience.

It is a unified experience for your data transformation needs.

Next on our list is Azure HDInsight. Since we were talking about the typical development process, HDInsight can also support practically all stages of that process by providing so-called big data clusters. When it comes to HDInsight, there are many cluster types available, such as Hadoop, Machine Learning Services, Spark, Kafka, HBase, Hive, Apache Storm, and many others. In general, the idea of the service is to give you the open source big data technologies available on the market and let you provision them as clusters, so that Microsoft manages those clusters and you simply get the technology that you need to do the specific tasks that you need.

All these tools have a different purpose, but you can use them in combination to support the end-to-end development lifecycle of your application.

Azure HDInsight is a flexible, multipurpose big data platform in Azure. It is another platform as a service offering, and it allows you to choose from multiple open source technologies on the market, be it Hadoop, Spark, Kafka, or many of the other technologies available.

Lastly, we have Azure Databricks. Azure Databricks is quite similar to HDInsight, except that the clusters we create are based on Apache Spark and Apache Spark only. The main purpose of this service is to help you with large-scale data transformation, because Apache Spark is one of the leaders when it comes to the performance of data transformations for big data. But in addition to data transformation, the creators of Databricks also wanted to provide a collaboration platform for data engineers and analysts, so that they have a single place where they can manage their clusters and collaborate on their data solutions.

In the Azure portal, I won't go through creating the service again; we've seen enough of the Azure Marketplace. Instead, I'll go to the Azure Databricks service that I created earlier. Here I can use the Launch Workspace button, which takes me out of the Azure portal to a separate portal designed for collaboration on Azure Databricks solutions, a so-called workspace.

The first thing we need to do within the new workspace is to create a new cluster. By opening the new cluster panel, we can specify a cluster name; in my case, this will be demo. If I want, I can modify some options, such as changing the cluster type or runtime version. I can also modify the autoscaling and auto-termination options, which is amazing from a cost perspective. If I'm happy with all my selections, I can just hit Create Cluster and wait. Cluster creation takes 4-5 minutes.

Once the cluster is created and in the running state, we can start working. But look how easy it was: I just clicked a few buttons, and right now I have an Apache Spark based big data cluster running in the cloud and ready to go. Now we can create some scripts by going to the workspace section on the left side, where I have my personal workspace in the Users section, or a shared workspace where I can share and collaborate with other users.
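The cluster options picked in the panel (name, runtime version, autoscaling, auto-termination) can also be written down as a cluster definition. The sketch below only builds and prints such a definition; it does not call any service. The field names follow the Databricks Clusters REST API as I understand it, while the helper `build_cluster_spec`, the runtime version, the VM size, and the worker and idle-time numbers are illustrative assumptions rather than the values used in the demo.

```python
import json

def build_cluster_spec(name, min_workers, max_workers, idle_minutes):
    """Build a cluster definition as a plain dictionary (hypothetical helper)."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "cluster_name": name,
        "spark_version": "13.3.x-scala2.12",  # example runtime version, an assumption
        "node_type_id": "Standard_DS3_v2",    # example VM size, an assumption
        # Autoscaling: the cluster grows and shrinks between these bounds.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        # Auto-termination: shut the cluster down after this many idle minutes.
        "autotermination_minutes": idle_minutes,
    }

spec = build_cluster_spec("demo", min_workers=2, max_workers=8, idle_minutes=30)
print(json.dumps(spec, indent=2))
```

Autoscaling and auto-termination are the two settings that matter for cost: the first limits how many workers run under load, the second stops paying for an idle cluster.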

I'll go to my personal workspace, open my folder, and create a new notebook. Notebooks are simply scripts in Azure Databricks. I can select a language like Python, Scala, SQL, or R. In my case, I have a demo using Python, and I will call this notebook demo.

Inside my notebook, you may notice that there are some small text blocks, and I can use those text blocks to write my scripts. For now, I will copy and paste a script that I prepared earlier. We don't need to focus on the details of the script. What the script does is connect to Microsoft's open data sets to get some sample data, and it literally takes seven lines of code to do it. It's very simple and straightforward. Once you have the data, you can use the familiar SQL language.
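To give a feel for that query step, here is a minimal sketch that runs an aggregate SQL query over a few sample rows and exports the result as CSV. It is not the script from the demo: Python's built-in `sqlite3` is used as a stand-in SQL engine instead of Spark, and the `trips` table and its rows are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical sample rows; the real demo loads a public data set instead.
trips = [
    ("2024-01-01", 1, 12.5),
    ("2024-01-01", 2, 7.0),
    ("2024-01-02", 1, 20.0),
    ("2024-01-02", 3, 9.5),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (trip_date TEXT, passengers INTEGER, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", trips)

# An aggregate query of the sort you might type into a SQL notebook cell.
daily = conn.execute(
    "SELECT trip_date, COUNT(*) AS trip_count, SUM(fare) AS total_fare "
    "FROM trips GROUP BY trip_date ORDER BY trip_date"
).fetchall()

# Export the result as CSV text, similar to a download option in a notebook.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["trip_date", "trip_count", "total_fare"])
writer.writerows(daily)
csv_text = buffer.getvalue()
print(csv_text)
```

The same `GROUP BY` query would read almost identically in a SQL notebook cell; what changes is the engine running it and the size of data it can handle.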

In this case, it's Spark SQL, so you can use a familiar SQL language to review your data and analyze it. Here you can also download this data as a CSV, change the chart type to a bar chart, and do all kinds of data transformation and analysis as per your needs.

To summarize, Databricks is a big data collaboration platform. It is another offering from Azure in the platform as a service category, but it's really about providing a unified workspace where users can manage their data, notebooks, and clusters, manage access for other users, and collaborate with them, so that users can focus on their data solutions instead of managing their big data platforms. It is based on Apache Spark, a leader when it comes to big data transformations in the market.

It integrates very well with common Azure data services by having connectors out of the box, making it very easy to pull data from Azure services and to output data after doing our transformations.

So let's summarize this episode. Today we learned about Azure Synapse Analytics, a modern end-to-end approach to data storage and analysis of large data sets. We also learned about HDInsight, a fully managed open source analytics service with a large number of supported frameworks and tools that are currently regarded as leaders when it comes to processing large data sets. And lastly, we learned about Azure Databricks, a cloud-based collaboration platform based on Apache Spark, which allows us to very easily process big data sets by abstracting away the difficult topics of big data platform management.

The materials and the cheat sheets are available under episode 15 on my website, so check them out. For this episode, that's it. The next one is about AI, so definitely stay tuned. If you like my work, support the channel by subscribing, liking, and commenting, and see you in the next episode.
