DP-900 Azure Data Fundamentals Exam Cram Whiteboard Video

Jun 08, 2021
hello everyone, welcome to this DP-900 exam cram session. I actually took the exam Tuesday morning, and just like my AZ-900, I deliberately didn't look at what was on the exam beforehand. I hadn't really touched the material much for nine months. I wanted to get an idea of what exactly was covered, and then after taking the exam I went back and looked at what's actually on it so I could create this kind of prep session.

For me there were 49 questions, and I think you get 60 minutes for the actual answers — it's 90 minutes total, but 60 minutes to answer the questions. They are very simple question formats: no case studies, no labs. It's literally "here is a little question, choose one," or "here are some features, drag the box to what it maps to," or "select all applicable options." That's all they are. You should attempt every question; there are no negative marks if you're wrong, so it's best to guess. Often you can look at a few options, know they're obviously not correct, and at least narrow down what your possible answers might be.

Most of us take the exam at home right now: you just need a laptop, one screen, nothing else in the room. You need a camera to take a picture of your ID and a bit of the front and back of the room. Relax — if you take the test and don't pass, you really haven't lost anything. Sometimes you win, sometimes you learn. If you take it and fail, it helps you identify where you are weak. I passed — I got 890-something — and it shows you afterwards, "here are your strong areas, and this is where you are weak." As you take the test, if you are unsure about something, try to keep a mental note so you can go back later and look it up to help you improve.
Anyway, let's dive in and think about studying for the test. The first point, when I'm thinking about Azure data services: obviously we have Azure, and when we're interacting with it, everything goes through the Azure Resource Manager. If I want to interact with that, I can use the portal, there's PowerShell, there's the CLI, there's REST, but more commonly, if I'm provisioning resources, I can create a JSON template. This is declarative: I describe the final state. "Hey, I want this Azure SQL Database, I want this storage account with these attributes — make it so." So if you see questions like "what is the correct way to provision resources in Azure declaratively?" — a JSON (ARM) template is the way to go to build things in Azure.

Now with Azure, like most things, there are really two planes to think about: there's a control plane, also called management, and below that is the actual data plane. When I'm doing things with the Azure Resource Manager, what I'm primarily accessing is the management/control plane. So we have certain roles via role-based access control: maybe I'm a contributor, maybe I'm a reader — there are many kinds of options here — and all those roles have to do with the management/control plane.
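As a rough illustration of what "declarative" means here, this is the general shape of an ARM JSON template, sketched in Python so it's easy to inspect. The resource name, location, and API version are illustrative placeholders, not values from this session:

```python
import json

# Minimal sketch of an ARM template's shape -- declare the desired end state
# rather than the steps to get there. Names/values below are hypothetical.
template = {
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            # "I want this storage account with these attributes"
            "type": "Microsoft.Storage/storageAccounts",
            "apiVersion": "2021-04-01",       # example version, check current docs
            "name": "examplestorage123",       # hypothetical, must be globally unique
            "location": "eastus",
            "sku": {"name": "Standard_LRS"},
            "kind": "StorageV2",
        }
    ],
}

print(json.dumps(template, indent=2))
```

You would hand a template like this to Azure (portal, CLI, or PowerShell) and Resource Manager figures out how to reach that end state.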

Those roles control what I can do with the resources at the resource level; they usually don't give me access to the data directly. If I have one of those roles I can go and get a key, I can go and change a setting, but on its own it doesn't give me direct access to the data. Now, the way I access the data varies depending on the type of service. Some of them may have separate data-plane roles: in an Azure SQL database I can give Azure AD users rights to the data from within SQL — grant them SELECT, read, whatever — but that's not an Azure ARM role, it's a separate role. There could be some kind of shared access signature, or there could be a completely separate set of users; in SQL, for example, I can create local users and give them rights.

So if you see questions like "I want to give someone access to read this data," it probably won't directly be a permission from Azure Resource Manager. It might be that you need to create a local user in a SQL database and give them permission, or maybe you need to use a key. Notice you have the control plane and the data plane, and the control plane doesn't normally give you access to the data — it's separate. If you see something like "I need owner permissions on the resource," that's probably a trap; look around and see what the other options are.

Now, as we think about the types of service that we have in Azure, we often see these ideas around layers. If we think about traditional infrastructure, there's compute, there's networking, and obviously there's storage — those are the core fabric layers. Then we have a hypervisor sitting on top of that, then an operating system, then we could have runtimes, we could have
middleware, and then we have our application and our data. Why am I drawing this? In the cloud we never care about the fabric or the hypervisor; we never touch those things. However, in an IaaS world — infrastructure as a service — I am responsible for everything above that: I patch the operating system, I worry about the firewall, I worry about the antivirus, I update the middleware. When I say "application" in this sense, I really mean the commercial application we're focusing on. In a PaaS world, I only care about the top layers. If I was thinking of a database, for example: in IaaS I'd be installing SQL Server, which means I'm patching SQL Server, installing SQL Server, upgrading SQL Server — or installing Postgres or MySQL or Maria, whatever that might be. So in an IaaS world I'm installing and upgrading the database, I'm doing all of that; all of those operational tasks fall to me. There's work involved — a lot I have to keep doing if I go down that route.

So let's expand this to a database service, and this could be things like Azure SQL Database, this could be things like Cosmos DB, this could be things like Azure Database. Azure Database is the open-source offerings: we get PostgreSQL, we get MySQL, and we get MariaDB, which is a fork of MySQL. In these managed models I'm not installing the database, I'm not updating the database — I'm getting great things here. These are evergreen: I'm constantly getting the latest capabilities, it updates for me. I don't have access to the OS; I'm not patching Windows or Linux or worrying about any of those things — they are done for me. They have native availability, and I can have the option to add replicas in other regions, depending on the exact service you choose. I have massive scaling capabilities — I can grow them to very large amounts — and there's less to manage because, again, they're PaaS. I'm really focused on just using the database: creating tables, mostly, and even things like indexing.
There are a lot of automated capabilities to help build the indexes for me, tune the database, and give me recommendations, so there's a lot less for me to do. In the IaaS world I would be installing SQL Server and patching the OS. So if you see questions like "I want an installation with the minimum amount of maintenance," it will be a PaaS solution; if you see anything like "you have to have access to the OS," then it will be an IaaS solution. Realize the difference — think about the layers.

Backup is another good one: native backup is just a part of these PaaS services. There are things that help me in the IaaS world, but I'm still responsible for actually doing the backups. As soon as I go to PaaS databases or service offerings, backup is included for me. All I'm focusing on is using the database my app connects to; there's a lot of functionality built in natively for me, and that's really a key point.

Now one thing I would say: with most of these services, I can't pause compute. There is storage and there is compute, and we pay for those mostly standalone (although in some services they're combined), and usually I can't pause the compute. Now, there is a SQL serverless option that allows me to pause the compute, and Azure Synapse allows me to pause the compute — Synapse is the new name for SQL Data Warehouse, and it adds a lot of additional functionality. But with a regular Azure SQL database and most of the others, I can't pause the compute and stop paying for it, so that's an important point of what I can and can't do. And again, a good contrast here would be SQL Server installed in an OS on a VM, where I control everything. Now, I drew this picture of access and the data — what about actually getting access to things?
So if you think about it, there's some data service, there's some storage with all my ones and zeros in it, and there's me on my machine wanting to get to that data. The exact way that happens: that service will have an IP address. If it's public — which for a lot of our data services it will be — there will be some firewall, some set of permitted ranges, in front of it. If it's not public, my machine has to be able to reach that network. If it's a SQL Managed Instance, the managed instance is deployed on my VNet; if it's another kind of service using a private endpoint, again that's an IP on my network. So I have to make sure the network I'm on has connectivity to the virtual network it's on — that could be accomplished by a site-to-site VPN, it could be ExpressRoute — I have to be able to reach the IP address that offers the service.

On top of that, there will probably be a DNS name that resolves to that IP address, so I have to be able to resolve the DNS name — maybe there's some kind of shared DNS. If you see something that says "I've changed the IP address of my service but I can't access it," well, has the DNS record been updated? I have to be able to reach DNS and resolve the name correctly to get the IP to access the service. And then of course, to actually connect and access the data, I have to have rights — this is where authorization comes in, as a guard beyond simply being able to reach it. So depending on what kind of service we're using: can I reach the network? If there is a DNS name, can I resolve it correctly?
All of those things need to be in place to access and use the data, okay? Now, how do we store the data? There are different key types of data, and the first one is structured. With structured data we have a well-defined set of attributes — a schema — around our data, and often, from a database perspective, we talk about the relational model. With a relational database, if we look at the data, we essentially have rows and columns. This could be a table, and for the table we have a schema: the schema defines the columns, the data attributes, and their types — text or date or numeric, whatever that might be. We also normally normalize the data: we rearrange it, we move it to different tables based on its type, to eliminate waste and eliminate duplication; we put it into a very strict format. So we have our records — these rows — and then we have the columns going across. You could have, for example, first name, last name, age, height, ID — all these various attributes. And we'll have some kind of key: one of these columns is the key, and it will be unique for each record. It can't be duplicated; it uniquely identifies that record. So in a regular relational database I'm talking about records — the rows, each one a particular entity, a particular person — and the columns, the attributes, each some aspect, some data about that entity.

What we do is write the data that way: we actually write this to disk record by record, and a record is a row. So there could be a key — say it's ID 1 — then first name, last name, age, height, weight, all the different attributes, and that's one record; there would be other records for all the other people. That's how I'm writing it, that's how I'm storing it, and that's great if that's how I'm interacting with the data.
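The ideas above — a strongly typed schema, a unique primary key, one record per row — can be sketched with runnable standard SQL. This uses Python's built-in sqlite3 purely for illustration; T-SQL in Azure SQL Database is similar in spirit, and the table and column names are made up:

```python
import sqlite3

# In-memory database just for demonstration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The schema defines the columns (attributes) and their types;
# the primary key must be unique for every record.
cur.execute("""
    CREATE TABLE person (
        id        INTEGER PRIMARY KEY,  -- unique key, identifies the record
        firstname TEXT NOT NULL,
        lastname  TEXT NOT NULL,
        age       INTEGER
    )
""")

# Each INSERT writes one record (one row).
cur.executemany(
    "INSERT INTO person (id, firstname, lastname, age) VALUES (?, ?, ?, ?)",
    [(1, "John", "Smith", 45), (2, "Bob", "Jones", 33), (3, "Fred", "Brown", 29)],
)
conn.commit()

rows = cur.execute("SELECT firstname, age FROM person ORDER BY id").fetchall()
print(rows)  # [('John', 45), ('Bob', 33), ('Fred', 29)]
```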
"Hey, I want to query certain things and I want to see all the data" — this row layout makes sense for that. I read the records at once, and I'm interested in all of that data, so writing one record per row is very logical. But consider other use cases: imagine I'm really only operating on, let's say, age — I just want to get the average age, or sum the ages, or maybe sales figures, whatever. If I'm only looking at one column at a time, row storage is very inefficient, because I have to go and fetch the entire record just to get the one attribute I care about.

So another common format is columnar. It looks the same — I still have the same records and the same columns of data — but the way they are written to disk, the way the records are stored, is different, because we're going to store them as columns. Now one "record" on disk would be the names — John, Bob, Fred, three people — and another record would be the ages — 45, 33, 29, and so on.

So I'm storing the same data, but completely differently: instead of storing records as rows, the columns make up each stored record, which is super efficient if all I care about is one particular column at a time. Now I just read that one record and I have all the ages. This column-based storage can be very useful, and the common format is Parquet — Parquet is a common format for storing columnar data on disk. So you can see both of these are highly structured: both have tightly defined schemas — these are the columns we must have — and again there's a unique key that identifies each record.

Next we have semi-structured. With semi-structured I can think of a loose schema or no schema — there's no strict definition. Very commonly nowadays we think of JSON; it could be XML, maybe even CSV could count. It's much more flexible: there is still a format for JSON and XML, but I can choose what attributes I want in a given document. So what we think about here is a document store. Some solutions, even though I'm just writing a JSON document, will still go and parse the document — maybe index the fields so they can look up particular attributes of the JSON document; Cosmos DB can do this — but fundamentally there's no set schema that you have to follow.
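Here's a toy model of the row-versus-column trade-off just described, with hypothetical data. Computing an average age from the column store touches one stored "record", while the row store forces a scan of every full record:

```python
# Row store: one record per entity -- great when you want whole records.
row_store = [
    {"id": 1, "name": "John", "age": 45},
    {"id": 2, "name": "Bob",  "age": 33},
    {"id": 3, "name": "Fred", "age": 29},
]

# Column store: each stored "record" is one column -- great when a query
# touches a single attribute, the pattern Parquet files are built for.
column_store = {
    "id":   [1, 2, 3],
    "name": ["John", "Bob", "Fred"],
    "age":  [45, 33, 29],
}

# Averaging ages from the row store means reading every full record:
avg_from_rows = sum(r["age"] for r in row_store) / len(row_store)

# From the column store we read just the single "age" record:
avg_from_cols = sum(column_store["age"]) / len(column_store["age"])

assert avg_from_rows == avg_from_cols
print(avg_from_cols)
```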
I'm just writing a JSON document, or I'm writing an XML document. Now, with these document stores, each document has a key, and I also have a partition key. Why do we have a partition key? Any resource is finite: I have a finite amount of storage and a finite amount of compute that can operate on that storage, so with large amounts of data I can't treat it all as one thing. What a partition key does is let me pick a particular attribute that will be used to separate my data into logical partitions. I want to choose a good partition key with an even distribution; I don't want a key where ninety percent of the data lands in one logical partition and all the rest elsewhere — that's a terrible distribution, and it would hurt my performance, because those n logical partitions are actually stored on physical partitions. I could have multiple logical partitions on one physical partition, but fundamentally I want a good layout, because behind the scenes I'm going to have compute blocks connecting to storage blocks.
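A minimal sketch of how a partition key maps documents onto logical partitions — the hash function and partition count here are invented for illustration and are not Cosmos DB's actual scheme:

```python
from collections import Counter

NUM_PARTITIONS = 4  # illustrative; real services manage this internally

def logical_partition(partition_key_value: str) -> int:
    # Stable toy hash: the same key value always lands in the same partition.
    return sum(partition_key_value.encode()) % NUM_PARTITIONS

# Hypothetical documents; "city" is our chosen partition key.
docs = [{"id": i, "city": city}
        for i, city in enumerate(["dallas", "seattle", "austin", "boston"] * 25)]

# Count how many documents land in each logical partition.
spread = Counter(logical_partition(d["city"]) for d in docs)
print(spread)
```

A good partition key produces a roughly even spread; a skewed key (e.g. "current month" for hot write traffic) concentrates load on one partition.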
I'm going to partition my data. So in something like Cosmos DB, if I have a partition key, as I put in all these documents it will actually distribute the data — it will shard it. You might hear the term "sharding" when we talk about this distribution: it's a way of dividing data by some aspect of the data. So I pick a certain partition key; maybe it's the month of the year. Even that could be dangerous depending on how I'm using the data, because the current month is going to be hot and get all the writes, which may not be good — but just as an example, if I choose the month of the year I'd end up with 12 logical partitions and the data would be spread across them. So we can chunk the data on a good partition key, and these document stores really take advantage of that. That's semi-structured.

Then we get to unstructured. With unstructured we often think of a blob — a binary large object. There are different Azure blob types — I'll talk about those in a second — but essentially I can store anything I want in a blob: I could have images, I could have videos, I could have documents, everything I want, I can store in blob. There is no formatting, there is no structure at all; it's just storing it for me. And what I really see in the data world is sometimes I want a hierarchy, and one of the nice things we have in Azure is Azure Data Lake Storage Gen2, which actually sits on top of blob. It uses that big, cheap storage but adds a hierarchical namespace, Hadoop compatibility, and POSIX ACLs, so now I can organize all this stuff. It's a nice landing zone when I just want to put stuff in the cloud — so if I have unstructured data, I probably think blob.

We have these different types, and there are a few others you'll hear about. A common one is graph, and graph is all about nodes and edges. A node is an object: imagine I have a "person" node type, and that person is John; I could have another person node, and that's Bob; and then I could have an edge — a relationship — that says "works for." I could have a completely different node type — still a node, but this time it's an office, and maybe it's Dallas — and once again I can have edges, these relationships, maybe "works in," and many other nodes: many people report to Bob, many people work in Dallas. The cool thing about graph is that if I care about the relationships between these different types of objects, I can super fast say "hey, who works in Dallas?" — look, there are edges here, those people work in Dallas — or "who reports to Bob?" Graph is very powerful when I care about the relationships between objects, so that's important to know.

The other one you'll see very commonly is key-value, and for example Azure Storage tables is a key-value store. As the name suggests, I have a key and then n number of name-value pairs. So I can have a key — maybe it's ID 01 — and then the values could be anything I want: name is John, age is 45, etc. There are no set values it must have; it's completely arbitrary. I can write entries to the same key-value store where each entry has different name-value pairs.
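The node-and-edge idea can be sketched in a few lines. The people and office here are the John/Bob/Dallas examples from the discussion, and the data layout is purely illustrative — not a real graph database API:

```python
# Nodes are objects with a label and properties.
nodes = {
    "john":   {"label": "person", "name": "John"},
    "bob":    {"label": "person", "name": "Bob"},
    "dallas": {"label": "office", "city": "Dallas"},
}

# Edges are relationships: (from_node, relationship, to_node).
edges = [
    ("john", "works_for", "bob"),
    ("john", "works_in",  "dallas"),
    ("bob",  "works_in",  "dallas"),
]

# "Who works in Dallas?" -- just follow the edges pointing at the dallas node.
in_dallas = [nodes[src]["name"] for src, rel, dst in edges
             if rel == "works_in" and dst == "dallas"]
print(in_dallas)  # ['John', 'Bob']
```

A real graph database (e.g. Cosmos DB's Gremlin API) answers this kind of relationship query natively and efficiently, which is the whole point of the model.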
I can have a lot of these stored, so if I just want to store some data, key-value is useful for that — again, Azure Storage tables, and Cosmos DB also has a key-value store, so there are a lot of options. Another one — it's not that common — is time series. As the name suggests, a time series database stores a time and some piece of data. It's separate from key-value; these are different things. But if I had data where all I really cared about was the time it arrived and the value — think of all these IoT devices constantly streaming data, where all I care about is capturing that sensor value — then a time series database would be fantastic: "I just want to capture whatever the current value is."

Okay, so now let's look at some specific examples — those are the kinds of types we have; what do we have in Azure? The most basic, which is the starting block of most of this, is a storage account. I create an Azure storage account in a certain region, and it supports several types of data. For example, it supports blob — remember, unstructured data — and there are block blobs; page blobs, which are very good for random access inside the blob (managed disks behind the scenes use page blobs); and append blobs — "I just want to keep writing to the end of this blob," imagine it's some kind of log. As I mentioned before, Data Lake Storage Gen2 actually sits on top of block blob but adds an additional API, the hierarchical namespace, and those other native capabilities. Then we have the queue type: this is where I'm just storing messages, first in, first out — something comes in, something is read. And then I have files: Azure Files provides me an SMB 2.1 or 3.0 file share, so that's important to understand.
If you see a question like "hey, I want to set up a file share in Azure, what should I use?" — Azure Files. "I want a key-value store in Azure" — tables. "I want to store unstructured data" — blob. So we have all these native capabilities. Now, there are some changes coming — some new things I don't think the exam will bring up — but NFS 4.1 is coming for Azure Files (it would be a different share type; I create shares in Files), and in blob we're going to have NFS 3, aimed at very large sequential interactions. So those are the different types we have in our storage account.

Now remember, we deploy the storage account in a particular region. Azure has many regions, and a region is defined as roughly a two-millisecond latency envelope. There are always three copies of the data — I don't see them, I just write my data, and behind the scenes it makes sure there are always three copies. There are some configurations I can do here, and it's important to understand the redundancy options. I can do locally redundant storage, LRS: those three copies are in a particular data center within the region. I can do ZRS: those three copies are spread over three AZs. What does that mean? If I draw a region — again, that two-millisecond latency envelope — the region is actually made up of multiple data centers, and where exposed, they're called availability zones: AZ1, AZ2, and AZ3. If I do locally redundant storage, all three copies of the data are in the same data center. If I do zone-redundant storage, ZRS, those three copies are spread across three different AZs within the region — still within the region, but now I can survive the blast radius of a data center going down. So if you see a question like "I want to store my data within a region, but I need to maintain access even if a particular data center fails,"
I want ZRS — ZRS spreads the copies across three zones. Then we have GRS. With GRS there are three copies locally, and it replicates asynchronously to the paired region, where there are also three copies. So picture this: there's a second region hundreds of miles away; I have those three copies within my data center, and they're replicated asynchronously to a data center in the paired region, where the data is also stored three times. That's GRS, and there's an RA-GRS variant that allows me to read from the replica. So if you see a question that says "I want to store my data in Azure Storage and make sure it's available even if there's a region outage," it's going to be GRS or RA-GRS — geo-redundant, replicated to another region. There's also a mix of ZRS and GRS: GZRS, plus an RA variant of it, which just means the primary is ZRS and it's asynchronously replicated to the paired region, where there are also three copies. That GZRS variation gives me resiliency against a data center failure within the region, and I still have resilience against a region-level outage. So understand the replication options — it's all about the redundancy of my data — and again, remember, geo-redundancy is what survives a region-level outage.

We also have access tiers — basically hot, cool, and archive — which have to do with how much it costs to store the data, how much it costs to transact against it, and how readily available it is. The default, hot, means the data is always available: I pay a certain amount for storage and a certain amount for transactions. Then there's cool: the data is still available right away, I pay less for storage but more for transactions — the idea is I'm going to access it less often. And then beneath that there's archive. With archive, the data isn't available right away — it's essentially offline. I can still see it listed, but I can't read it immediately; it needs to be rehydrated back to hot or cool first. Obviously that's the cheapest, so the storage price goes down as I move down those tiers. So if I see a question like "I want to store the data as cheaply as possible and I can tolerate it being offline," that's archive.
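The hot/cool/archive trade-off can be summarized as a small decision helper. The rules of thumb here are simplified from the description above, and real pricing varies by region and configuration:

```python
# Hedged sketch of the access-tier trade-off -- simplified rules of thumb only.
def suggest_tier(needs_instant_access: bool, accessed_often: bool) -> str:
    if not needs_instant_access:
        return "archive"   # cheapest storage, but offline until rehydrated
    if accessed_often:
        return "hot"       # cheapest transactions, priciest storage
    return "cool"          # immediately available, cheaper storage, pricier transactions

print(suggest_tier(needs_instant_access=False, accessed_often=False))  # archive
print(suggest_tier(needs_instant_access=True,  accessed_often=False))  # cool
print(suggest_tier(needs_instant_access=True,  accessed_often=True))   # hot
```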
If I don't need instant access, that's archive. If I see a question like "I want to store my data with reduced access, but it must be immediately available, and I can accept higher transaction costs," that's going to be cool. Now, at the blob level I can move objects between tiers individually, but what you'll probably want is something called lifecycle management. Here I can create rules, and I can say: look, if something hasn't been accessed for seven days, move it down a tier; if it hasn't been accessed for 30 days, move it to archive. So lifecycle management lets me automate the movement of data between tiers to really optimize my spend — the proper name might be "storage lifecycle management," make sure you check that out. I have these rules, these policies, that do it automatically, and because Azure Data Lake Storage Gen2 is built on blob, it can use this as well, which is a super cool capability.

So that's basic Azure Storage — we have all these different types, and Azure Storage is phenomenally powerful. Don't jump straight to "oh, I wouldn't use an Azure storage account"; you can do a ton of things, especially now with things like Data Lake Storage Gen2 sitting on top of blob — a super cheap, almost infinite-scale service — so don't overlook it. There's actually another performance tier available as well, a premium tier around IOPS. Overall I don't think you'll be interrogated on that, but if a question says "we need a higher level of IOPS than is available with the standard tier," well, there's a premium tier I can use.

Okay, so let's talk about other data workloads. As we go along you'll hear terms like OLTP and OLAP. OLTP is online transaction processing; this is what we commonly think of for normal relational databases. Neither one is "the right" database in general — they're both types of databases, used for different jobs. With OLTP, think higher volume: I'm constantly receiving high volumes of probably small transactions — small things, but constantly coming in. These are activities for the app to store as they occur, and I want fast access for queries. I normally normalize the data — splitting it into different tables, deduplicating redundancy — which makes querying very efficient. I also think of atomic transactions: I can make sure a transaction happens in full or not at all — it won't process half the transaction; it all succeeds or none of it succeeds. This is where we think of things like SQL — SQL is huge around this.

Then with OLAP — online analytical processing — this is more like big data in terms of size: I want to capture the raw data, and I'm going to run analytics against it to get insights into what's going on. Traditionally we think of data warehouse type solutions here; there are other kinds of solutions around this, but I want to focus on OLTP initially.

So again, there are different types of service available for this. SQL Server is the obvious one; there's also Azure Database, where we have Postgres, MySQL, and MariaDB — again, the community editions of the open-source databases, but managed: deployed for me, patched for me, I'm not taking care of the OS, and it provides inherent backups — it's just there for me. And on the OLAP side there's Synapse, although Synapse is evolving and doing a lot more than just OLAP. These are the common things we'll be grappling with.

So let's focus on SQL Server for a second. SQL Server can run in different ways: I can run it in a VM — IaaS — where I'm managing all those things and patching the OS, or in a PaaS, database-as-a-service world, which would be things like Azure SQL Database. There are multiple flavors of that: there's an elastic pool where multiple databases can share a resource pool, there's Hyperscale, there's serverless, there are different SKUs, but essentially it's a SQL database. There's also SQL Managed Instance, which is actually deployed into a virtual network if we need those capabilities. But these are relational, and I'll really focus on this PaaS
world so we have these relational databases and so the key point about this so if we are relational we have the database and we create tables so this table remembers that we define the columns that we have here so that we have a strongly typed schema where we define these columns that we can have that the attributes that we can have there we have a primary key that has to be unique for each record within that table so we have the rows these are the records that we're going to leave and then we have the columns the various attributes of the data so it's fundamental about the table and the schema now you could have other tables I have another table here again you would have your own primary key type for your records y m It could be that this column right here actually references this column, so it would be a foreign key, it could be an id that I then go and look up again normalization, d We split the data into different tables, so now it opens up to various types of architecture models. for the schema because I can use these various relationships with foreign keys and when I interact with the data, it doesn't matter if the data is in different tables, I can do various types of joins, I can see them now, there are potential performance implications. but I can absolutely access this, I can create views that I have in a certain way, I'm interacting with data that's actually made up of data from different places, so I can join these things together, but this ability to have these kinds of relationships are the foreign keys open up to various types of database schemas these patterns we'll actually see commonly so the first model is what we call a star schema so what we have in the middle is kind of t The main table over our data and we call this kind of a fact table so this is the main data and then what we have around it is these other tables it looks a bit like a star and they have some aspect of detail about one. 
of the various dimensions of the fact, so they are called dimension tables: each has information about one dimension of the main fact table. The fact table holds foreign keys, so this foreign key binds to this dimension table, that one binds to another, and I troll out and get that information. That's the star schema; remember it because it looks like a star. It could normalize further: maybe those dimension tables have information about something else, so I could have further tables that a dimension column references, with their own foreign keys to other things. It fans out like a snowflake, so this is called a snowflake schema, and those outer tables are all dimension tables too. So if you see a pattern that's just a core table with entities relating to it, that's a star; if it fans out further, with dimension tables having relationships to other dimension tables, that's a snowflake. Just remember those two. There is also another pattern. Once again I have my fact table, and once again I have dimension tables around it, so this is essentially a star; remember, the one in the middle is the fact table with those relationships. But now I have another fact table, and that fact table uses some of the same dimension tables, and maybe has some dimension tables of its own, so essentially I have multiple stars. And what are multiple stars?
Well, that's a galaxy, so there's also a galaxy schema. Again, the terminology: the one in the middle, the one at the center, whose attributes reference other things, is the fact table in all of these models. The tables that contain the extra information I look up based on some id in the fact table are dimension tables, and if it keeps fanning out, it's a snowflake. Those are the core models we really think about, and those foreign keys are what allow the relationships. Now, how do we integrate and do things with SQL? You've probably heard of T-SQL, Transact-SQL; this is how we actually do things. If I'm dealing with objects, there are commands like CREATE TABLE and DROP TABLE. If I want to manipulate data, there's the data manipulation language, DML. These are the things that actually change the data; that's the key point, data manipulation is changing data. Focus on that word, change: sometimes the SELECT statement gets included here, although it's really more of a data query language; sometimes you'll see it dragged in, but DML proper actually changes data. So there's INSERT, for inserting a record; UPDATE, for updating a record; and DELETE, for deleting a record. Then obviously there's SELECT, which I'll draw in a slightly different color. With SELECT I can say SELECT *, or SELECT these particular columns, FROM a given table, and optionally add something like WHERE a given column equals 'John'. You get the idea: select these attributes, or all of them, from this table where something equals something.
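To make the star schema and join ideas above concrete, here is a minimal sketch in plain SQL, run through Python's built-in sqlite3 just so it's executable; the table and column names (dim_product, dim_store, fact_sales) are invented for the example, not from the exam:

```python
import sqlite3

# In-memory database for the sketch
conn = sqlite3.connect(":memory:")

# Dimension tables: descriptive attributes, one row per entity
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT)")

# Fact table: the central table, holding the measures plus foreign keys
# that point out at the dimensions (the 'points' of the star)
conn.execute("""CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    amount     REAL)""")

conn.execute("INSERT INTO dim_product VALUES (1, 'Widget')")
conn.execute("INSERT INTO dim_store VALUES (10, 'Seattle')")
conn.execute("INSERT INTO fact_sales VALUES (100, 1, 10, 9.99)")

# A query joins the fact table out to its dimensions via the foreign keys,
# even though the data lives in three separate tables
row = conn.execute("""
    SELECT p.name, s.city, f.amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_store   s ON s.store_id   = f.store_id
""").fetchone()
print(row)  # ('Widget', 'Seattle', 9.99)
```

If a second fact table referenced dim_product and dim_store as well, you'd have multiple stars sharing dimensions, which is the galaxy pattern.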
That's the formatting, and it's important to understand the basics of how these commands are formatted, so go and look them up. If you were inserting, it's INSERT INTO a particular table name, and you'll often see dbo.TableName; dbo is just the default database schema. After the table name come brackets listing which columns I'm inserting into, so maybe column one is first_name, then columns two, three, and so on, followed by the keyword VALUES and then the actual values: John, 45, etc.
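The DML statement shapes just described can be seen end to end in this small sketch (again run through sqlite3 so it's executable; SQLite doesn't use the dbo. schema prefix, and the customers table and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, age INTEGER)")

# INSERT: name the target columns, then supply matching VALUES
conn.execute("INSERT INTO customers (first_name, age) VALUES ('John', 45)")
conn.execute("INSERT INTO customers (first_name, age) VALUES ('Mary', 30)")

# UPDATE and DELETE also change existing records, so they are DML too
conn.execute("UPDATE customers SET age = 46 WHERE first_name = 'John'")
conn.execute("DELETE FROM customers WHERE first_name = 'Mary'")

# SELECT: pick columns (or *) FROM a table, optionally filtered by WHERE
rows = conn.execute(
    "SELECT first_name, age FROM customers WHERE first_name = 'John'"
).fetchall()
print(rows)  # [('John', 46)]
```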
That's what I'm putting in there, so make sure you understand the basic structure of that data manipulation language. Now think about how those records are stored. Imagine you were doing a select on age: the engine is going to read each record to go and find the age. So one of the things we can do, if there's a certain column we constantly want to search on or run some operation against, is add an index. Just like with a book, I can quickly jump into the index and it will find me the record, so I don't have to go through every record; it speeds up interactions. But remember there's always a flip side. Adding an index means more to store, because now I have to store the index, and writes become more expensive, because every time I write data it has to go and update the index with the new entry. If I had lots of different indexes for a book, every time I added a page I'd have to go and update each index for whatever is on that new page. So indexes will speed up querying, but be aware there's a penalty in storage and a performance penalty on inserts and updates, because the index has to be maintained. It's important that they're there, but realize that nothing is completely free. So those are the basics of relational. Then we have NoSQL, and what that really means varies: some people will say it means 'not SQL', some people will say it means 'not only SQL', and neither is absolutely right or wrong. It's really talking about non-relational, or not only relational. There are different options for what we're going to do, and there are various solutions on Azure around NoSQL.
I'm going to focus on Cosmos DB. Cosmos DB is a managed offering in Azure, and one of the crazy things about it is that it actually allows me to do multi-region writes properly. With a relational database there's a single primary; I can only write to one of them. For Azure SQL Database and many of the database offerings, I can add additional replicas in other regions for resiliency and even read performance, but I can't write to them; there's that single primary where you can make changes, and it replicates out, because it's focused on consistency, on making sure I have that data. With Cosmos DB I can actually write to multiple copies: I can have many replicas of my data, and I can enable this multi-write capability at creation time or afterwards. Now, if you know databases you might say this is impossible, you can't do it: that's the CAP theorem, the idea that of consistency, availability, and partition tolerance, you always have to give something up. The point of Cosmos DB is that I can have variable consistency. It runs from strong consistency, which is what we see in a relational database, through a set of intermediate markers, bounded staleness, session, and consistent prefix, all the way down to eventual, where I can write to multiple copies and they will eventually get in sync. Session is a common one: within my session I go to a particular copy and I see reads and writes in a guaranteed order, while the copies become consistent over time. The reason for this spectrum is that I usually can't replicate synchronously across regions; the performance would be terrible. That's what strong means: I'm not going to acknowledge the write until it's also written to all the replicas, which is why I can't combine multi-region writes with strong consistency; that combination wouldn't make any sense.
Hey, I want to be able to write to multiple regions, but I need strong consistency, always seeing the same reads and writes from any copy? If it's strong, the write has to land on all of them at the same time, so why would I bother having multiple regions? I wouldn't gain anything, so I can't have multi-region writes and strong consistency together. But I have a lot of flexibility in how I configure this based on my workload; session, again, is super common, since certain processes will share a session and see the same read-write order. Also, when I create the account I can mark it as production versus non-production, and that has no impact on performance or features; it's just my experience in the portal, the things it shows me by default, because for a production workload versus dev work I want to see different things. Then we choose an API, and this is the crazy thing about Cosmos DB: it has five different data models, five different APIs it can support. It has the SQL API and MongoDB API support, and when you use those you're using the document model. Remember documents, for example JSON, are super common, and the service will parse that data, so
I can still index and query based on it. It has the Cassandra API, which is that column-family storage; it has a Table API, which as you probably guessed is key-value; and lastly there's Gremlin, which is graph. The API is set when I create the account; I can't have multiple APIs within one account, so I create the account, choose the API, and with it the type of data I actually want to use. The way performance works is something called request units. Whatever you do against Cosmos has a certain cost in RUs: one read against a partition might cost me one RU; an update that has to go and touch several things might cost five. I can do provisioned throughput, hey, I want 1,000 RUs, and it will actually throttle me at a thousand, so if for some reason I didn't have enough and I was doing too much, I'll get errors saying 'hey, we throttled you', I ran out of RUs. Or I can use autoscale: I set a maximum number of RUs, but instead of always billing me for the full amount whether I'm using it or not, it bills me for what I'm using, up to that limit, and then throttles me there. And I can assign throughput at different levels: I have a database, and under databases I have containers, or tables, or graphs, so I can assign RUs at the database level or at the levels below it. So I have these Cosmos DB capabilities, and they're super, super powerful. As I say, I can set up replicas of my Cosmos DB data across a huge number of regions; whereas with Azure SQL Database I might have four read-only replicas in different regions, with Cosmos I can have it in almost every region. And remember, the replicas are not just for resiliency and DR; maybe I have app instances in those regions that want to be able to read the data locally. It's super powerful for that.
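The request-unit behavior described above can be pictured with a toy sketch. To be clear, this is not the Cosmos DB SDK; the class name, RU costs, and the 10 RU budget are all made up purely to illustrate the idea that every operation has an RU cost and that exceeding provisioned throughput gets you throttled:

```python
class RuBudget:
    """Toy model of provisioned request units in a billing window."""

    def __init__(self, provisioned_rus):
        self.provisioned = provisioned_rus
        self.used = 0

    def charge(self, cost):
        # If this operation would exceed the provisioned throughput,
        # refuse it (Cosmos DB signals this with an HTTP 429 error)
        if self.used + cost > self.provisioned:
            return "throttled"
        self.used += cost
        return "ok"

budget = RuBudget(provisioned_rus=10)
results = [budget.charge(1) for _ in range(8)]  # cheap point reads: 1 RU each
results.append(budget.charge(5))                # a pricier operation: 5 RUs
print(results[-1])  # 'throttled' -- 8 + 5 would exceed the 10 RU budget
```

Autoscale changes the billing side of this picture (you pay for what you used up to the maximum) but the throttling-at-the-limit behavior is the same.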
Finally, before we get into actually getting data into the solution, there's obviously the big analytics data store: hey, I just have a huge amount of data and I want to run analytics against it. This is Synapse. Synapse was previously SQL Data Warehouse; now, as Synapse, not only do you have that huge-scale storage, and you can use things like Azure Data Lake Storage Gen2 for the storage, but it also builds in capabilities like the Data Factory pipelines I'm going to talk about and the Spark processing of Databricks, which I'll also talk about. Synapse is trying to put all these things together as my one place for analytics. One of the nice things they've done with Synapse relates to compute and storage: I can scale them and pay for them separately, and I can pause the compute, so I can actually stop paying for that part of the service. With Azure SQL Database serverless I can do something similar, but not with the other SQL flavors; Synapse actually allows that. So if I see a question like, I want to store large amounts of data and then perform analytics against it, where should I store it?
It's going to be Synapse. I want a service where I can pause the compute within my data warehouse? Synapse. Then we can finally think about the tooling for these SQL-based workloads, and there are two main tools we'll look at. There's SQL Server Management Studio, SSMS. This is the traditional tool, for real deep administration: I can do things like look at the Query Store, manage the instance, and even have it generate the SQL statements for administrative tasks for me. Very deep, very complex admin. So for that kind of deep administration it's SSMS, but there's also something newer.
There's the very nice Azure Data Studio. While SSMS is really Windows-only, Azure Data Studio is cross-platform and open source: there's Windows, there's Linux, there's macOS. This is more for working with the data: I can create queries and put the results on a dashboard. There are extensions, so yes, I can use it with Azure SQL Database, with the data warehouse, with SQL big data workloads, and there's an extension for PostgreSQL, so I can manage that through Azure Data Studio too. So between those two tools: if I'm doing deep admin, it's SQL Server Management Studio, which isn't going anywhere; if I just want to query and interact with the data, that's probably Azure Data Studio. And while I'm talking about tools, obviously there are things like Power BI Desktop, so if you see anything about visualization, that will be Power BI; visualization is what it does. So those are the tools available. Okay, we've covered a lot so far,
and it's all been about storing the data, which is great, we definitely want to do that, but how do we get the data in, and into a useful format? There are two ways to think about data coming in. The first is batch: the data comes in and is collected, just stored somewhere, maybe that Azure Data Lake Storage Gen2, and then it's processed on an interval, maybe once an hour, or once there are 500 records. On that trigger, something takes the batch and processes it, so it handles large volumes of data at once, and there will be latency, a delay, because it's not real time. The nice thing is I can throw massive amounts of processing at the batch, on these big MPP systems, to go and process it. The other option is stream. With stream, the data is processed as it arrives. Batch could be the example of reports: hey, I have all these reports in a folder, and I have to process them all together once a day. Stream is: hey, this IoT device is constantly sending little bits of data, and I have to process each one as it arrives. So stream is near real time, very low latency, but it means I have to make sure I scale correctly: if I'm processing in real time I can't run out of resources, I can't fail to process an incoming event, because I would lose it, and that would be a big problem. So think about the data coming in: batch, I collect it, take a bunch, process it, then take the next batch maybe ten minutes later; streaming, things are entering constantly and get processed as they arrive.
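The batch-versus-stream distinction above can be sketched in a few lines of illustrative Python (a toy model, not a real ingestion service; the records and the sum aggregation are invented): batch collects records and processes the whole set on a trigger, while stream handles each record the moment it arrives.

```python
# Batch: records accumulate somewhere, then get processed together on a trigger
def process_batch(records):
    return sum(records)          # e.g. aggregate the whole collected set at once

collected = [5, 3, 7]            # everything that landed since the last run
batch_result = process_batch(collected)

# Stream: each record is handled the moment it arrives (near real time)
running_total = 0

def on_arrival(record):
    global running_total
    running_total += record      # must keep up -- a dropped event is lost

for event in [5, 3, 7]:          # events trickling in one at a time
    on_arrival(event)

print(batch_result, running_total)  # 15 15 -- same answer, different latency profile
```

Both paths produce the same aggregate here; the difference is when the work happens and how the system must be scaled to keep up.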
Now, often the data coming in is not in the format we want, so there are multiple phases to this. You'll often hear ETL: we extract the data from something, we transform it into the right format, maybe clean it up, because there's bad data, missing fields, dates in different formats, so we clean the data into a standard shape, and then we load the data into something. This ETL is super, super common, but it has challenges. It was very popular in the old days, when storage was expensive: with all this data coming in, we wanted to transform it to be as small as possible before storing it, because storage cost a lot of money. That's not really the case anymore, and one of the disadvantages of transforming before we load and store is that I may not know the question I want to ask tomorrow; maybe there's some aspect of the incoming data that the transform removed, and I've lost it forever. So what's more common today is ELT: extract, load, and then transform. The idea is that we land the data first, capturing it in its raw state, and then transform it and write the result somewhere else, in a format we can use for analysis or whatever. But if tomorrow I say, oh wait a minute, what about this?
I want to ask this question. Well, now I can go back to that data, because I saved it in its initial format, and transform it again into a new dataset that I can run a different type of analysis against. The one benefit of ETL is obviously that I can prune the data first, since maybe there's data I shouldn't store at all, but really ELT is becoming the standard, because I keep the raw data; if there are questions I don't even know I have yet, I can always go back and transform it into new shapes. Now, when I think about this transformation, there are many dimensions to it. It could be cleaning up the data or standardizing a format; it could be very simple, like mapping, hey, this field goes to this field, or a little string manipulation; or it could be super complex parsing. For the really complex work, that's where you're talking about things like HDInsight and Databricks. Databricks is essentially managed Spark, and HDInsight offers lots of different components, Kafka, Spark, and others, so both have the ability to go and do that complex processing. If it's just simple mapping, there are lots of things that can do it: even Data Factory can do basic mapping, and Azure Functions can too, so there are different ways to do the simpler work. The other thing that makes extract-load so attractive in the cloud is that cloud storage is so cheap, and it has essentially infinite scale, so why shouldn't I just store everything in the original format?
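The ELT pattern above, land the raw data first and transform later, might look like this sketch (a toy model: the zones are plain dictionaries standing in for storage, and the name-standardizing transform is invented for illustration):

```python
import json

raw_zone = {}        # stands in for cheap landing storage such as a data lake
curated_zone = {}    # stands in for the analytics-ready store

def extract_and_load(name, records):
    # Load the data untouched: nothing is thrown away, so tomorrow's
    # unanticipated question can still be answered from the raw copy
    raw_zone[name] = json.dumps(records)

def transform(name):
    # Transform later, reading from the preserved raw copy:
    # standardize names and drop records with missing ages
    records = json.loads(raw_zone[name])
    curated_zone[name] = [
        {"name": r["name"].title(), "age": r["age"]}
        for r in records if r.get("age") is not None
    ]

extract_and_load("people", [{"name": "john", "age": 45},
                            {"name": "mary", "age": None}])
transform("people")
print(curated_zone["people"])  # [{'name': 'John', 'age': 45}]
```

Note the raw copy still contains Mary's record; a different transform tomorrow could recover it, which is exactly the advantage over transform-before-load ETL.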
Remember, Azure Data Lake Storage Gen2 is a fantastic landing zone, very, very common; if a question asks where to land the raw data, the answer is ADLS Gen2. I can use lifecycle policies to archive the older data, but usually that's where I initially put my data, and then I go and grab it with something else: I'll grab it with Databricks, or with HDInsight, do the transformations, and write it out to Cosmos DB, or a SQL database, or Synapse. Lots and lots of options for doing different things with it. Now let's pose a question. Okay, I'm extracting, transforming, loading, in whatever order; what is actually driving all of that? There has to be some orchestration engine that runs the various activities for the ELT or ETL, whatever we're doing. The answer here is Azure Data Factory. Azure Data Factory is an orchestration solution: its job is basically to get the data from the source to the sink, where you're writing. And how does it do that? You have a pipeline.
So we have pipelines, and a pipeline has multiple activities: activity one, activity two, activity three, whatever. There are sources you read from, and I can have multiple sources, and I can have multiple sinks. Data Factory itself really isn't doing much with the data; it calls activities, which could be a Databricks job, could be HDInsight, could be some basic data mapping, but fundamentally the pipeline is orchestrating. It's the control plane; that's its job. Those pipelines can be triggered, and there are multiple trigger types: a trigger could be a schedule, it could be manual, it could be a tumbling window, hey, go get this 20 minutes of data, then the next 20 minutes of data, and it could be an event.
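The pipeline idea above, an ordered set of activities kicked off by a trigger, moving data from a source to a sink, reduces to something like this sketch. All the names here are invented, and real Data Factory pipelines are authored as JSON definitions rather than Python; this just shows the control-plane shape:

```python
def copy_from_source(ctx):
    ctx["raw"] = ["record1", "record2"]             # activity 1: read the source

def transform_activity(ctx):
    ctx["clean"] = [r.upper() for r in ctx["raw"]]  # activity 2: e.g. hand off to Databricks

def write_to_sink(ctx):
    ctx["sink"] = ctx["clean"]                      # activity 3: write to the sink

# A pipeline is just an ordered list of activities; the orchestrator's
# job is control flow -- it runs them in order, it doesn't crunch the data
pipeline = [copy_from_source, transform_activity, write_to_sink]

def run_pipeline(pipeline, trigger):
    ctx = {"trigger": trigger}                      # schedule / tumbling window / event
    for activity in pipeline:
        activity(ctx)
    return ctx

result = run_pipeline(pipeline, trigger="event: blob landed in the data lake")
print(result["sink"])  # ['RECORD1', 'RECORD2']
```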
An event trigger might be: something is written to the data lake, so call this pipeline, which pushes the data on to Cosmos DB or Synapse, whatever it might be. Again, Data Factory can do basic data mapping itself, but fundamentally it is the orchestrator that makes these things happen: a pipeline is a set of activities, and multiple types of trigger can cause the pipeline to run and do the work. That's the point of Data Factory. So imagine a source, say a CRM system; the data lands in Azure Data Lake Storage Gen2, then a Databricks job is triggered, and maybe that writes to Cosmos DB, or it could be a warehouse, and then maybe it goes on to Power BI for the actual analysis, to get all the insight from that data. That's the full flow. We're almost there, so let's finish the thought with the analysis part. Basically we do all of this because we want insights from our data; the rest has to be done, but the goal is to get to this point, because I want to do analysis against the data. What kinds of analysis are there?
There's descriptive: descriptive is 'tell me what happened'; if I have a chart showing what happened, that's descriptive. There's diagnostic: why did it happen, what did we do to make that happen? There's predictive: where is this probably going, what will happen next? There's prescriptive: if I want this outcome, what should I do to get there? And finally there's cognitive, which generates conclusions based on existing knowledge: it has some body of knowledge, and it draws conclusions from it. And again, for all of this I can think of Power BI: all these kinds of analytics are going to end up in some kind of Power BI visualization that gives me the insights. Okay, wow, and I didn't think this was going to last an hour and a half.
We covered a lot. This is not designed to be a complete training course on everything; you need to go and look at the syllabus and study. Think of this as an exam cram: maybe a few hours before the exam, go and watch it, and it might remind you of a few things and help you put some things together. Watch it at the beginning of your study, then watch it again at the end. I really hope this was helpful; there was a lot of research and planning work in this, so a like, subscribe, comment, and share is appreciated, and it helps other people find it. There are no ads; I make no money from this. It's really just to help, and the more people see it, the more people it helps. So good luck, don't panic, relax, and if you take it and don't pass,
it just means you've learned the things you're weak on; go and study those, and you'll get it next time. So good luck, and take care.
