
PySpark Full Course [2024] | Learn PySpark | PySpark Tutorial | Edureka

Apr 25, 2024
PySpark is a Python API for Apache Spark, an open source distributed computing framework for big data processing. PySpark provides a simple and efficient way for developers to perform complex data processing and analysis tasks using the powerful Spark engine. Hello everyone, we welcome you all to this PySpark complete course. Today we have an interesting agenda prepared for you, but before we start, if you like our videos, don't forget to subscribe to Edureka's YouTube channel and press the bell icon to stay updated with the latest technologies and trends. Also, if you are interested in our PySpark Certification Training, check out the link given in the description box. Now, without any delay, let's go through the agenda first.
We will start by understanding what Apache Spark is. Next, we will explore everything about Apache Spark's architecture, followed by some strategies on how to become a Spark developer. Then we will see the introduction to Apache Spark with Python, and then we will get some hands-on experience: we will start by installing PySpark, then move on to PySpark RDDs, and also learn about PySpark DataFrames. After that, we'll dive into some of the popular PySpark topics, covering PySpark SQL and more. Next we will look at PySpark MLlib, then we have the PySpark training, and finally we will end this session by discussing Spark interview questions and answers. At the end of this complete course you will have had plenty of opportunities for hands-on practice, you will have a solid knowledge of PySpark, and you will be well prepared to work with Spark in a professional environment. So let's start with our first topic, which is what Apache Spark is.

Spark is an open source, scalable, massively parallel, in-memory execution environment for running analytics applications. You can think of it as an in-memory layer that sits on top of multiple data stores, where data can be loaded into memory and analyzed in parallel across a cluster. When it comes to big data processing, much like MapReduce, Spark works by distributing the data across the cluster and then processing that data in parallel. The difference here is that, unlike MapReduce, which shuffles files around on disk, Spark works in memory, and that makes it much faster at processing data than MapReduce. It is also said to be the unified, lightning-fast analytics engine for big data and machine learning. So now let's look at the cool features of Apache Spark that are helping it pick up speed.
You can call Spark a fast processing framework. Why? Because it is 100 times faster in memory and 10 times faster on disk when compared with Hadoop. Not only that, it also provides high data processing speed and powerful caching: it has a simple programming layer that provides powerful caching and disk persistence capabilities. Spark can be deployed through Mesos, Hadoop YARN, or Spark's own cluster manager. As everyone knows, Spark was designed and developed for real-time data processing, so it is an obvious fact that it offers real-time computation and low latency because of its in-memory computation. Next, polyglot: Spark provides high-level APIs in Java, Scala, Python and R, and Spark code can be written in any of these four languages. Not only that, it also provides a shell in Scala and Python.
These are the various features of Spark. Now let's look at the various components of the Spark ecosystem. Let me first tell you about the Spark Core component. It is the most vital component of the Spark ecosystem, responsible for scheduling, monitoring, basic I/O functions and so on. The entire Apache Spark ecosystem is built on this core execution engine, which has extensible APIs in different languages like Scala, Python, R and Java. As I already mentioned, Spark can be deployed through Mesos, Hadoop YARN, or Spark's own cluster manager. The Spark ecosystem library is composed of several components, such as Spark SQL, Spark Streaming and the machine learning library.
Now let me explain each of them. The Spark SQL component is used to harness the power of declarative queries and optimized storage by executing SQL-like queries on Spark data that is present in RDDs and other external sources. The Spark Streaming component allows developers to perform batch processing and streaming of data in the same application. Coming to the machine learning library, it eases the development and deployment of scalable machine learning pipelines, offering summary statistics, correlations, feature extraction, transformation functions, optimization algorithms and so on. The GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. Now, talking about programming languages, Spark supports Scala.
Scala is the functional programming language in which Spark is written, so Spark supports Scala as an interface. Spark also supports a Python interface: you can write the program in Python and run it on top of Spark, and if you look at code in Python and Scala, both are very similar. R is very famous for data analysis and machine learning, so Spark also added support for R, and it supports Java as well, so you can go ahead and write the code in Java and run it on top of Spark. The data, in turn, can be stored in
HDFS, the local file system, or Amazon S3 in the cloud, and Spark also supports SQL and NoSQL databases. So these are the various components of the Spark ecosystem. Now let's see what comes next. When it comes to iterative distributed computing, that is, processing data across multiple jobs and computations, we need to reuse or share data between multiple jobs. In earlier frameworks like Hadoop, there were problems when dealing with multiple operations or jobs: you had to store the data in some stable intermediate distributed storage such as HDFS, and multiple I/O operations made the overall computation of jobs much slower, while replication and serialization in turn made the process even slower. Our goal here is to reduce the number of I/O operations to and from HDFS, and this can only be achieved through in-memory data sharing. In-memory data sharing is 10 to 100 times faster than network and disk sharing, and RDDs try to solve all these problems by enabling fault-tolerant, distributed, in-memory computation. So now let's understand what RDDs are. RDD stands for Resilient Distributed Dataset. RDDs are considered the backbone of Spark and are one of its fundamental data structures. They are also known as schema-less structures that can handle both structured and unstructured data. In Spark, everything you do revolves around RDDs: you read data into Spark and it is read into an RDD; when you transform the data, you perform transformations on an old RDD and create a new one; and lastly, you perform some actions on the RDD and store the data present in that RDD to persistent storage. A Resilient Distributed Dataset is an immutable distributed collection of objects; the objects can be anything like strings, lines, rows, collections of objects and so on. RDDs can contain any type of Python, Java or Scala objects, even user-defined classes. Talking about the distributed aspect, each dataset in an RDD is divided into logical partitions that can be computed on different nodes of the cluster; because of this, you can perform transformations or actions on the entire dataset in parallel and do not have to worry about the distribution, because Spark takes care of it. RDDs are also highly resilient.
They are able to quickly recover from any problem, as the same chunks of data are replicated across multiple executor nodes, so even if one executor fails, another will continue processing the data. This allows you to perform functional computations on your dataset very quickly by harnessing the power of multiple nodes. So this is all about what RDDs are. Now let's take a look at some of the important features of RDDs. RDDs have a provision for in-memory computation, and all transformations are lazy, that is, they do not compute their results immediately until an action is applied, so RDDs support in-memory computation and lazy evaluation. Besides that, RDDs are fault tolerant: they track data lineage information to rebuild lost data automatically, and this is how they provide fault tolerance to the system. Next, immutability: data can be created or retrieved at any time, but once defined, its value cannot be changed, and that is the reason I said that RDDs are immutable.
Next, partitioning: partitions are the fundamental unit of parallelism in a Spark RDD, and all chunks of data are partitioned within the RDD. Next, persistence: users can reuse RDDs and choose a storage strategy for them. Finally, coarse-grained operations: operations are applied to all elements in the dataset through map, filter or group-by operations. So these are the various features of RDDs. Now let's look at the ways to create RDDs. There are three ways: RDDs can be created from parallelized collections, from an existing RDD or other RDDs, and from external data sources such as HDFS, Amazon S3, HBase and so on. A PySpark sketch of the same three approaches is shown below; after that, let me show you how to create RDDs in the Scala shell.
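A minimal sketch, assuming the `sc` SparkContext that the PySpark shell provides; the HDFS path is hypothetical:

    # 1. From a parallelized collection (five partitions, as in the demo that follows)
    nums = sc.parallelize(range(1, 101), 5)
    print(nums.collect()[:10])

    # 2. From an existing RDD: transformations always return a brand-new RDD
    doubled = nums.map(lambda x: x * 2)
    print(doubled.take(5))

    # 3. From an external data source (this HDFS path is only an example)
    lines = sc.textFile("hdfs://localhost:9000/example/sample")
    print(lines.take(5))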
I will open my terminal and first check whether my daemons are running or not. Here I can see that the Hadoop and Spark daemons are running, so now let's start the Spark shell. It will take a little time for the shell to come up. Now the Spark shell has started, I can see the Spark version is 2.1.1, and we have a Scala shell here. Now I will show you how to create RDDs in three different ways using the Scala language. First, let's see how to create an RDD from a parallelized collection. sc.parallelize is the method used to create a parallelized collection of RDDs; it is a SparkContext method, so I will type sc.parallelize(1 to 100, 5) to parallelize the numbers 1 to 100 into five different partitions, and I will apply collect as an action to start the process. Here in the result you can see an array of the numbers 1 to 100.
Okay, now let me show you how the partitions appear in the Spark web UI. The web UI port for Spark is localhost:4040. Here you can see the job that just completed, the sc.parallelize call, which finished successfully, and you can see the five tasks that completed successfully because we divided the job into five partitions. So let me show you the partitions. This is the DAG visualization, that is, the directed acyclic graph visualization; since you have applied only parallelize as the method, you can see only one stage here. Here you can see the RDD that has been created, and coming to the event timeline, you can see the tasks that were executed on the five different partitions; the different colors indicate scheduler delay, task deserialization time, shuffle read and write time, executor computing time and so on. Here you can see the summary metrics for the created RDD: the maximum time it took to run the tasks on five partitions in parallel is only 45 milliseconds.
You can also see the executor ID, host ID, success status, duration, launch time and so on. So this is one way to create an RDD, from parallelized collections. Now let me show you how to create an RDD from an existing RDD. OK, I will create an array called a1 and assign the numbers from 1 to 10: one, two, three, four, five, six, seven... OK, I got the result here, that is, I have created an array of integers from 1 to 10, and now I will parallelize this array a1. Sorry, I got an error; it should be sc.parallelize(a1). Okay, so I have created an RDD called parallel collection.
Great, now I will create a new RDD from the existing RDD, which is val newRDD = a1.map(x => x * 2). To create a new RDD from the data present in an existing RDD, I take a1 as the reference, map over the data and multiply it by two. So what should the result be? If I map the data present in one RDD and multiply it by two, it should be 2, 4, 6, 8, up to 20. Right, let's see how it works. Yes, we got the result, which is the multiples of 1 to 10, that is, 2, 4, 6, 8, up to 20.
This is one of the methods of creating a new RDD from an old RDD, and I have one more method, from external file sources. What I will do here is write val test = sc.textFile and give the HDFS file location, linking the path hdfs://localhost:9000; I have a folder called example, and in it I have a file called sample. Great, so I created one more RDD here. Now let me show you this file, which I have already stored in the HDFS directory. I will browse the file system and show you the /example directory I created; here you can see example as the directory, and inside it I have sample as the input file. You can see the same path location here. This is how I can create an RDD from external file sources.
In this case I have used HDFS as the external file source. So this is how we can create RDDs in three different ways: from parallelized collections, from external sources and from existing RDDs. Now let's move further and see the various operations on RDDs. RDDs support two main operations, namely transformations and actions. As I already said, RDDs are immutable, so once you create an RDD you cannot change any content in it. So you might be wondering how an RDD applies those transformations: when you run any transformation, it runs the transformation on the old RDD and creates a new RDD.
This is basically done for optimization reasons. Transformations are the operations that are applied on an RDD to create a new RDD. These transformations work on the principle of lazy evaluation. So what does that mean? It means that when we call some operation on an RDD, it is not executed immediately; Spark keeps a record of the operation being called. Since transformations are lazy in nature, the operation can be executed at any time by calling an action on the data; therefore, in lazy evaluation, data is not loaded until it is needed. Actions, in turn, analyze the RDD and produce a result.
Count is a simple action that counts the rows in an RDD and produces a result, so I can say that transformations produce new RDDs and actions produce results. Before continuing with the discussion, let me tell you about the three different workloads that Spark caters to: batch mode, interactive mode and streaming mode. In the case of batch mode, you write a job and then schedule it; it works through a queue or batch of separate jobs without manual intervention. Then, in the case of interactive mode, it is an interactive shell where you go and execute commands one by one, so you execute a command, check the result, and then execute another command based on the output, and so on. It works similarly to a SQL shell, so the shell is the one that executes the driver program, and in shell mode you can run against a cluster; it is usually used for development work or for ad hoc
queries. Then comes streaming mode, in which the program runs continuously: as the data arrives, it takes the data, performs some transformations and actions on it, and produces results. So these are the three different workloads that Spark caters to. Now let's look at a real-time use case; here I am considering Yahoo as an example. So what were Yahoo's problems? Yahoo's properties are highly personalized to maximize relevance. The algorithms used to provide personalization, that is, targeted advertising and personalized content, are highly sophisticated, and the relevance model must be updated frequently because stories, news and ads change over time. Yahoo has more than 150 petabytes of data stored on a 35,000-node Hadoop cluster, which must be accessed efficiently to avoid latency caused by data movement and to gain insights from the data in a cost-effective way. To overcome these problems, Yahoo looked to Spark to improve the performance of its iterative model training. The machine learning algorithm for news personalization had required 15,000 lines of C++ code; on the other hand, the Spark machine learning algorithm took only 120 lines of Scala code, and that is the advantage of Spark. This algorithm was ready for production use with just 30 minutes of training on 100 million samples. Spark's rich APIs are available in multiple programming languages, it has robust in-memory storage options, and it supports Hadoop through YARN. The Spark-on-YARN project uses Apache Spark to personalize Yahoo's news web pages and for targeted advertising. Yahoo also uses machine learning algorithms running on Apache Spark to find out what kind of news users are interested in reading, and to categorize news stories to find out what type of users would be interested in reading each category of news. Spark runs on Hadoop YARN to use existing data and clusters, Spark's extensive API and machine learning library ease the development of machine learning algorithms, and Spark reduces the latency of model training through in-memory RDDs. This is how Spark has helped Yahoo improve performance and achieve its goals.
Now let's look at the Spark architecture. On your master node, you have the driver program that drives your application, so the code you write behaves as the driver program, or, if you are using the interactive shell, the shell acts as the driver program. Inside the driver program, the first thing you need to do is create a Spark context. Think of the Spark context as a gateway to all Spark functionality. It is similar to your database connection: any command you run on your database goes through the database connection; similarly, anything you do in Spark goes through the Spark context. This Spark context works with the cluster manager to manage multiple jobs, and the driver program and the Spark context take care of it.
To run a job across the entire cluster, the job is divided into tasks, and these tasks are distributed over the worker nodes. Whenever you create an RDD in the Spark context, that RDD can be distributed across multiple nodes and cached there, so the RDD is said to be partitioned and distributed across multiple nodes. The worker nodes are the slave nodes whose job is basically to execute the tasks. The tasks are executed on the partitioned RDDs in the worker nodes, which then return the result to the Spark context. The Spark context takes the job, breaks it into tasks and distributes them to the worker nodes; these tasks work on the partitioned RDDs, perform whatever operations you want to perform, and then collect the results and return them to the main Spark context. If you increase the number of workers, you can divide jobs into more partitions and execute them in parallel over multiple systems.
This will actually be much faster. Additionally, if you increase the number of workers, you also increase the available memory and can cache the data so that jobs run much faster. So this is all about the Spark architecture. Now let me give you an infographic view of how the Spark architecture works internally. Spark follows a master-slave architecture. The client submits the Spark user application code; when an application code is submitted, the driver implicitly converts the user code that contains transformations and actions into a logical directed acyclic graph called a DAG, and at this stage it also performs optimizations such as pipelining transformations. It then converts the logical DAG into a physical execution plan with many stages, and after converting it to a physical execution plan it creates physical execution units called tasks.
At each stage, these tasks are bundled and submitted to the cluster. Now the driver talks to the cluster manager and negotiates the resources, and the cluster manager launches the required executors. At this point, the driver also sends the tasks to the executors based on data placement. When the executors start, they register themselves with the driver, so the driver has a complete view of the executors, and the executors then start executing the tasks assigned by the driver program. At any point while the application is running, the driver program monitors the set of executors that are running, and the driver node also schedules future tasks
based on the location of the data. This is how the internal working of the Spark architecture happens. There are three different types of workloads that Spark can serve. First, batch mode: in batch mode you write a job and then schedule it, and it works through a queue or batch of separate jobs without manual intervention. Next, interactive mode: this is an interactive shell where you go and execute the commands one by one, so you execute a command, check the result, and then execute the next command based on that output, and so on.
It works similar to a SQL shell, so the shell is the one that executes the driver program, and it is usually used for development work or for ad hoc queries. Then there is streaming mode, where the program runs continuously: as the data arrives, it takes the data, performs some transformations and actions on it, and produces output results. So these are the three different types of workloads that Spark really caters to. Now let's move on and see a simple demo: let's understand how to create a Spark application in the Spark shell using Scala.
Suppose we have a text file in an HDFS directory and we want to count the number of words in that text file. Let's see how to do it. Before starting, let me first check whether all my daemons are running or not, so I will type sudo jps. All my Spark and Hadoop daemons are running: I have Master and Worker as Spark daemons, and NameNode, DataNode, ResourceManager, NodeManager and everything else as Hadoop daemons. So the first thing I do here is run the Spark shell; it takes a little time to start. Meanwhile, let me tell you that the web UI port for the Spark shell is localhost:4040, so this is the web UI for Spark. If you click on jobs right now, we haven't run anything, so there are no details here. It has jobs and stages tabs, so once you run some tasks you will have the records of the tasks you have run; here you can see the status of the various jobs and tasks. So now let's check whether our Spark shell has started or not.
Yes, the Spark version is 2.1.1 and we have a Scala shell ready here. Before starting the code, let's check the content present in the input text file by running a command, so I will write val test = sc.textFile, because I have stored a text file there, and I will provide the HDFS path; I have stored my text file in this location and sample is the name of the text file. Now let me run test.collect to collect the data and display what is present in the text file. In my text file I have the words data, science, Hadoop, research, analyst and science, so this is my input data. Now let me apply the transformations and actions: I will write val map = sc.textFile and specify my input path location, then apply the flatMap transformation to split the data separated by spaces, and then map each word to a count of one, so that it becomes (word, 1).
Now this will be executed. Yes. Next, let me apply the action, which will start the execution of the job. Let me tell you one thing here: until an action is applied, Spark will not start the execution process. Here I have applied reduceByKey to count the number of occurrences of each word in the text file. Now we are done with applying transformations and actions, so the next step is to specify the output location to store the output file. I will write counts.saveAsTextFile and then specify the location for my output: I will store it in the same location where I have my input file, and I will name the output directory output9.
I forgot to put double quotes; I will run this again, and it is complete now. So now let's see the output. I will open my Hadoop web UI at localhost:50070 and browse the file system to check the output. As I said, I have example as the directory I created, and in it output9 has been written as my output, so two part files have been created. Let's go through them one by one: data counts as one, analyst counts as one and science counts as two, so this is the first part file.
Now let me open the second part file for you. This is the second part file, where you have the Hadoop count as one and the research count as one. Now let me show you the text file that we specified as input: as I told you, Hadoop is counted once, research once, analyst once, data once, and science twice. You might be thinking that "data science" is one term, but in the program code we asked it to count words separated by a space; that is why science is counted as two.
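For comparison, a PySpark version of the same word count is sketched below; it assumes the `sc` SparkContext of the PySpark shell and the same hypothetical HDFS paths used in the Scala demo.

    # Read the input file, split each line on spaces, pair every word with 1, then sum per word
    text = sc.textFile("hdfs://localhost:9000/example/sample")
    counts = (text.flatMap(lambda line: line.split(" "))
                  .map(lambda word: (word, 1))
                  .reduceByKey(lambda a, b: a + b))

    # saveAsTextFile is the action that actually triggers the execution
    counts.saveAsTextFile("hdfs://localhost:9000/example/output9")
    print(counts.collect())   # e.g. [('science', 2), ('data', 1), ...]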
I hope you now have an idea of how word count works. In a similar way, I will now parallelize the numbers 1 to 100 and divide the job into five partitions to show you the partitions of the job, so I will type sc.parallelize with the numbers 1 to 100, divide them into five partitions, and apply the collect action to collect the numbers and start the execution, which shows an array of the numbers 1 to 100. Now let me explain the job stages, the partitions, the event timeline visualization and everything else. Let me go to the Spark web UI and click on jobs; these are the jobs that have been submitted. Coming to the word count example, this is the DAG visualization.
I hope you can see it clearly: first it read the text file, then applied the flatMap transformation and mapped it to count the number of words, then applied reduceByKey, and then saved the output as a text file. So this is the full DAG visualization of the steps we covered in the program. Here it shows the completed stages, that is, two stages, and it also shows the duration, 2 seconds. If you click on the event timeline, it only shows the executor being added, and in this case you cannot see any partitions because we did not split the job into multiple partitions. This is how you can see the event timeline and the DAG visualization. Here you can also see the stage IDs and descriptions, when they were submitted, which was just now, and it also shows the duration it took to execute the task, the output bytes, the shuffle read, the shuffle write and much more. Now, to show you the partitions:
look at the job where you just applied sc.parallelize, so it shows only one stage where the parallelize operation was applied. Here it shows the successfully completed tasks as 5/5, that is, you split the job into five partitions and all five tasks executed successfully. Here you can see the five different tasks running in parallel, and based on the colors it shows the scheduler delay, shuffle read time, executor computing time, result serialization time, getting result time and much more. You can see that the duration it took to run the five tasks in parallel was at most a millisecond or so, because in-memory computation in Spark is much faster. You can see the IDs of the five tasks, all successful; you can see the locality level, the executor and host IP, the launch time and how long everything took. You can also see the RDD we created and parallelized; similarly, for the word count example, you can see the RDD that was created, the operations applied to run the job, and the duration they took, which again is just a few milliseconds for the entire word count example. You can also see the executor ID and locality level; in this case we executed the job in two stages, so it shows only those two stages. So that is what the web UI looks like and the features and information you can see in Spark's web UI after running a program. In this program, we first gave the input path and checked the data present in the input file, then applied the flatMap transformation, created RDDs, applied an action to start the execution, and saved the output file to a location. I hope you now have a clear idea of how to run a word count example and check the various features in the Spark web UI, such as partitions and the DAG visualization. Next, here are some reasons why Spark is considered the most powerful big
data tool in today's era. First of all, its ability to integrate with Hadoop: Spark can integrate well with Hadoop, and that is a big advantage for those who are already familiar with it. Though technically a standalone project, Spark was designed to run on the Hadoop Distributed File System, or HDFS; it can work with MapR, it can run on HDFS inside MapReduce, and with a YARN deployment it can even run on the same cluster alongside MapReduce jobs. Following the first reason, the second reason says it can meet global standards: according to technology forecasts, Spark is the future of big data processing around the world, and big data analytics standards are rising immensely, driven by high-speed data processing and real-time results. By learning Spark now, one can meet global standards and ensure compatibility between the next generation of Spark applications and distributions by being part of the Spark developer community.
If you love technology, contributing to a growing technology in its growth stage can give your career a boost; you can also keep up to date with the latest advancements taking place in Spark and be among the first to build the next generation of big data applications. The third reason says that it is much faster than MapReduce: because Spark is an in-memory data processing framework, it is all set to take over the primary processing for Hadoop workloads in the future. Being much faster and easier to program than MapReduce, Spark is now among the top-level Apache projects and has acquired a large community of users as well as contributors. Databricks, the company founded by the brains behind the Apache Spark project, has positioned Spark as a multi-purpose query tool that could help democratize the use of big data, and has also projected the possible end of the MapReduce era with the growth of Apache Spark. Following the third reason, we have the fourth reason,
He also projected the possibility of the end of Mapreduce ERA with the growth of Apache Spark, followed by the third reason why we have the fourth reason. which says that Spark is capable of working in a production environment. The number of companies using Spark or planning the same tool has skyrocketed over the last year. There is a massive search in the popularity of Spark, the reason is because of its mature open source components and an expanding user community, the reasons why Spark has become one of the most popular projects in Big Data are the tools Entrenched high-performance solutions that handle diverse problems and workloads, a fast and simple programming interface, and high-end programming languages ​​like Scala Java and Python.
There are several reasons why enterprises are increasingly adopting Spark, ranging from speed and efficiency to the usefulness of a single integrated system for all data pipelines, and many more. Spark is the most active big data project, and all the major Hadoop as well as non-Hadoop vendors have deployed it in production across multiple sectors, including financial services, retail, media, telecommunications and the public sector. Now, the last important reason why Spark is so powerful is the increasing demand for Spark developers. Spark is relatively new, and yet adoption is spreading fast in the big data market: Spark usage is increasing at a very fast pace among many top-tier companies like NASA, Yahoo and Adobe, but there are only a handful of professionals who have learned Spark and can work with it, and this has created a growing demand for Spark developers. In such a scenario, learning Spark can give you a huge competitive advantage: by learning Spark right now, you can demonstrate recognized validation of your expertise. This is what John Tripier, Alliances and Ecosystem Lead at Databricks, has
to say: the adoption of Apache Spark by businesses large and small is growing at an incredible rate across a wide range of industries, and the demand for developers with certified expertise is quickly following suit. So these were the few major reasons why Apache Spark is considered the most powerful tool in the IT industry today. Now let's move forward and understand the roadmap to becoming an Apache Spark developer. There is always a fine line between becoming a certified Apache Spark developer and being a real Spark developer capable enough to perform in a real-time application, so how can we become a certified Apache Spark developer who is capable of performing in real time?
A step-by-step approach is given below. To become an expert-level Spark developer, you need to follow the right learning path and get guidance from certified, real-time industry experts. For a beginner, it is the best time to take a training and certification exam. Once the training has started, you should begin with your own projects to understand how Apache Spark works and its terminology. The main building blocks of Spark are RDDs, or Resilient Distributed Datasets, and DataFrames; Spark also has the ability to integrate with high-performance programming languages such as Python, Scala and Java. PySpark RDDs are the best example of the combination of Python and Apache Spark, and you can also understand how to integrate Java with Apache Spark through an amazing Spark Java tutorial article by Edureka, which is linked in the description box below. Once you have a better understanding of the main building blocks of Spark, which are RDDs and DataFrames, you can move forward to learn some of the core components of Apache Spark mentioned below, which are Spark SQL, Spark MLlib, Spark GraphX, SparkR, Spark Streaming and much more. Once you get the necessary training and certification,
it's time for you to take the most important step, the CCA 175 certification. You can start solving some CCA 175 Spark certification sample exams, which are linked in the description box below, and once you feel prepared and confident, you can register for CCA 175 and excel in the CCA Spark and Hadoop Developer certification. So this is the roadmap to becoming a true, certified Apache Spark developer. Now that we have discussed the roadmap, let's move ahead and talk about Apache Spark developer salaries. Apache Spark developer is one of the most highly rewarded professions, with attractive salary packages compared to others.
Now we will discuss the salary trends of Apache Spark developers in different countries. First, in India: the average salary offered to an entry-level Spark developer is between 6 lakhs and 10 lakhs per year, and on the other hand, for an experienced Spark developer, the salary trend is between 25 lakhs and 40 lakhs per year. In the United States of America, the salary offered to an entry-level Spark developer is between 75,000 and 100,000 US dollars per year; similarly, for an experienced Spark developer, the salary trend is between 145,000 and 175,000 dollars per year. Now, with this, let's go ahead and discuss the skills of a Spark developer.
To become an excellent Spark developer, you must be capable enough to load data from different platforms into the Hadoop platform using various ETL tools; decide on an effective file format for a specific task based on business requirements; clean data via streaming APIs or user-defined functions; effectively schedule Hadoop jobs; work with Hive and HBase for schema operations. Following that, you need the ability to work on Hive tables and map schemas; deploy HBase clusters and manage them continuously; run Pig scripts and Hive queries to perform various joins on datasets; apply different HDFS file formats and structures to speed up analytics; maintain the privacy and security of Hadoop clusters; fine-tune Hadoop applications; troubleshoot and debug any Hadoop ecosystem component at runtime; and finally install, configure and maintain an enterprise Hadoop environment if required.
Now let's move ahead and understand the roles and responsibilities of Apache Spark developers. The roles and responsibilities of a Spark developer are: to be capable enough to write executable code for analytics services and Spark components in high-performance programming languages such as Java, Python and Scala; to be well versed in related technologies like Apache Kafka, Storm, Hadoop and ZooKeeper; to be responsible for system analysis, including design, coding, unit testing and other software development life cycle activities; to gather user requirements, convert them into sound technical tasks and provide economic estimates for them; to be a team player aligned with global standards and able to understand project delivery risks; to ensure the quality of technical analysis and expertise in problem resolution; and to review code use cases and ensure they meet the requirements. So these are the few roles and responsibilities of a Spark developer.
Now let's go ahead and find out which companies are using Apache Spark. Apache Spark is one of the most widespread technologies; it has changed the face of many IT industries and helped them achieve their current accomplishments. Let's look at some of the tech companies and major players in the IT industry that use Spark: some of them are Oracle, Dell, Yahoo, CGI, Facebook, Cognizant, Capgemini, Amazon, IBM, LinkedIn, Accenture and many more. Now, moving on to PySpark, let me first tell you about the PySpark ecosystem. As you can see from the diagram, the Spark ecosystem is made up of various components like Spark SQL, Spark Streaming, MLlib, GraphX and the core API component. The Spark SQL component is used to harness the power of declarative queries and optimized storage by executing SQL-like queries on Spark data present in RDDs and other external sources.
The Spark Streaming component allows developers to easily perform batch processing and streaming of data in the same application. The machine learning library makes it easy to develop and deploy scalable machine learning pipelines. The GraphX component lets data scientists work with graph and non-graph sources to achieve flexibility and resilience in graph construction and transformation. And finally, the Spark Core component is the most vital component of the Spark ecosystem, responsible for basic I/O functions, scheduling and monitoring; the entire Spark ecosystem is built on this core execution engine, which has extensible APIs in different languages like Scala, Python, R and Java. In today's session I will talk specifically about the Spark API for the Python programming language, which is more popularly known as PySpark.
Now you might be wondering why PySpark. To get a better understanding, let me give you a brief overview of it. As we already know, PySpark is the collaboration of two powerful technologies: Spark, an open source cluster computing framework built around speed, ease of use and streaming analytics, and Python, of course, a high-level, general-purpose programming language that provides a wide range of libraries and is now primarily used for machine learning and real-time analytics. Together they give us PySpark, a Python API for Spark that lets you harness the simplicity of Python and the power
of Apache Spark to tame big data. PySpark also lets you use RDDs and comes with default integration of the Py4J library; we will learn about RDDs later in this video. Now that you know what PySpark is, let's look at the advantages of using Spark with Python. As we all know, Python itself is very simple and easy, so when Spark is written in Python, it makes PySpark quite easy to learn and use. Furthermore, Python is a dynamically typed language, which means RDDs can hold objects of multiple data types. Not only this, it also makes the API simple and comprehensive, and talking about readability of code, maintenance and familiarity, the Python API for Apache Spark is much better than other programming languages.
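As a tiny illustration of that dynamic typing point, here is a minimal sketch, assuming the `sc` SparkContext that the PySpark shell provides:

    mixed = sc.parallelize([42, "spark", 3.14, ("a", 1), {"key": "value"}])
    print(mixed.collect())   # one RDD holding ints, strings, floats, tuples and dicts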
Python also provides several visualization options, which is not as easy with Scala or Java, and it is also comfortable to call into R from Python. Besides this, Python comes with a wide range of libraries like NumPy, Pandas, scikit-learn, Seaborn and Matplotlib, and these libraries help in data analysis and also provide mature, time-tested statistics. With all these features, you can program effortlessly in PySpark, and in case you get stuck somewhere or have a doubt, there is a huge and very active PySpark community that you can reach out to with your query. So I will take this opportunity to show you how to install PySpark on your system. Here I am using a Red Hat Linux based system; the same steps can be applied to other Linux systems as well. To install PySpark, first make sure you have Hadoop installed on your system; if you want to know more about how to install Hadoop, check out our Hadoop playlist on YouTube or our blog on the Edureka website.
First of all, you need to go to the official Apache Spark website, which is spark.apache.org, and in the download section download the latest Spark release that supports the latest version of Hadoop, or Hadoop version 2.7 or higher. Once you have downloaded it, all you need to do is extract it, or rather untar the file contents, and then put the path where Spark is installed into the .bashrc file. You also need to install pip and Jupyter notebook using the pip command, and make sure the pip version is 10 or higher. As you can see here, this is what our .bashrc file looks like: we have put in the path for Hadoop, the path for Spark, and also the PYSPARK_DRIVER_PYTHON variable, which is set to Jupyter.
What it will do is the moment you run the Pi Spark Shell, it will do it automatically. Open a Jupyter notebook for yourself now. I find it very easy to work with the Jupiter notebook instead of the shell. It's a personal choice now that we are done with the installation path. Let's now delve into Pi Spark and learn some of its fundamentals that you need to know to work with Pi Spark. Now this timeline shows the various topics we will cover in Pi Spark Fundamentals, so let's start with the first one. The topic of our list is the Spark context.
The Spark context is the heart of any Spark application. It sets up internal services and establishes a connection to a Spark execution environment. Through a Spark context object you can create RDDs, accumulators and broadcast variables, access Spark services, run jobs and much more. The Spark context allows the Spark driver application to access the cluster through a resource manager, which can be YARN or Spark's own cluster manager. The driver program then runs the operations inside the executors on the worker nodes, and the Spark context uses Py4J to launch a JVM, which in turn creates a JavaSparkContext.
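As a minimal sketch of that gateway idea, this is roughly how a SparkConf and SparkContext are created in a standalone PySpark script; the master URL and application name here are just placeholders:

    from pyspark import SparkConf, SparkContext

    # Build a configuration and open the "gateway" to the Spark execution environment
    conf = SparkConf().setMaster("local[2]").setAppName("edureka-demo")
    sc = SparkContext(conf=conf)

    print(sc.version)   # everything else (RDDs, broadcasts, jobs) goes through sc
    sc.stop()           # release the context when the application is done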
There are several parameters that can be used with the Spark context object, such as master, appName, sparkHome, pyFiles, the environment, the batch size, the serializer, the configuration, the gateway and more. Among these parameters, master and appName are the most commonly used. Now, to give you a basic idea of how a Spark program works, I have listed the basic phases of its life cycle. The typical life cycle of a Spark program includes creating RDDs from external data sources or by parallelizing a collection in your driver program; then lazily transforming the base RDDs into new RDDs using transformations; caching some of those RDDs for future reuse; and finally performing actions to launch parallel computations and produce results. The next topic on our list is RDDs, and I am sure people who have already worked with Spark are familiar with this term, but for people who are new to it, let me explain. RDD stands for Resilient Distributed Dataset, and RDDs are considered the building blocks of any Spark application. The reason behind this is that these elements run and operate on multiple nodes to perform parallel processing on a cluster, and once you create an RDD it becomes immutable; by immutable I mean an object whose state cannot be modified after it is created, but whose values can be transformed by applying certain transformations. RDDs have good fault tolerance and can automatically recover from almost any failure, which adds an extra advantage. To achieve a certain task, multiple operations can be applied on these RDDs, and they are classified in two ways: the first are transformations and the second are actions. Transformations are the operations that are applied on an RDD to create a new RDD; these transformations work on the principle of lazy evaluation. Transformations are lazy in nature, which means that when we call some operation on an RDD it is not executed immediately; Spark maintains a record of the operations being called with the help of a directed acyclic graph, also known as a DAG. Since transformations are lazy, data is not loaded until it is needed, and the moment we call an action on the data, all the computations are carried out in parallel to give you the desired result; a small sketch of this behaviour follows below.
Now some important transformations are flat map filter, distinct, reduced by key. map partition sort actions are the operations that are applied on an rdt to tell Apache Spark to apply the calculation and pass the result to the controller. Some of these actions include collecting the collection as a map reduction. Take first now let me implement some of these for To make you understand it better, first of all let me show you the IC bash file that I was talking about so here you can see that in the RC bash file we provide the path for all the Frameworks that we have installed in the system, so, for example, you can Look here, we have installed Hadoop at the moment we installed it and unzipped it, or rather, enter it.
I have moved all my Frameworks to a particular location, as you can see, it is the USR the user and inside this we have the library and inside that I have installed. Hadoop and also Spark now, as you can see here, we have two lines. I will highlight this one for you, the python driver pi spark, which is Jupiter and we have given it the option available as a laptop. What I do is that the moment I start Spark, it will automatically redirect me to The jupyter Notebook, so let me rename this notebook as rdd

tutorial

, so let's start, here to upload any file to an rdd, suppose I am uploading a text file that needs. to use the test, it is a spark context SC points text file and you must provide the path of the data you are going to load, so one thing to keep in mind is that the default path that artery or notebook takes of Jupiter is the sdfs path so to use the local file system you need to mention the file colon and double forward slashes now once our sample data is inside the rad now to take a look at it we need to invoke its action, so let's go ahead and take a look at the first five objects or rather say the first five elements of this particular rdd.
Now, the sample data that I have taken here is about blockchain; as you can see, we have one, two, three, four and five elements here. Suppose I need to convert all the data to lowercase and split it word by word; for that I will create a function, and to that function I will pass this RDD. So I am creating, as you can see here, rdd1, which is a new RDD; it uses the map transformation and passes in the function I just created to lowercase the text and split it. So if we take a look at the output of rdd1, as you can see here, all the words
are in lowercase and all of them are split on spaces. Now there is another transformation, known as flatMap, which gives you a flattened output, and I am passing it the same function I created earlier, so let's go ahead and take a look at the result: as you can see, we have the first five elements, which are the same as before, words like contracts, transactions and records. One thing to note is that flatMap is a transformation, whereas take is an action. Now, the content of the sample data contains stop words, so if I want to remove them, all I need to do is create a list of stop words, which I have mentioned here; these are not all the stop words, I have chosen only a few of them just to show you what the result will be. Here we are using the filter transformation with the help of a lambda function, checking that x is not in the stop words list; if you look at the result, words like contracts and transactions remain, while stop words such as a, and, for and in are no longer in the list. Now suppose I want to group the data according to the first three characters of each element; for that I will use groupBy with a lambda function again. Let's take a look at the result: you can see that we have edg and edges, since the first three letters of both words are the same. In a similar way, we can also group using the first two letters.
Let me change it to two, and you can see that we have gu and guid, which is from guide. These are the basic transformations and actions, but suppose I want to know the sum of the first ten thousand numbers: all I need to do is initialize another RDD, num_rdd, using sc.parallelize with the range 1 to 10,000, and then use the reduce action to see the result; you can see here we have the sum of the numbers ranging from 1 to 10,000. Now, this is all about RDDs. The next topic we have on our list is broadcast variables and accumulators. In Spark, we do parallel processing with the help of shared variables: when the driver sends a task to the executors on the cluster, a copy of the shared variable is sent to each node of the cluster, thus maintaining high availability and fault tolerance.
This is done to accomplish the tasks in Apache Spark. Spark supports two types of shared variables: one of them is broadcast variables and the other is accumulators. Broadcast variables are used to store a copy of the data on all the nodes of a cluster, whereas an accumulator is a variable that is used to aggregate incoming information via associative and commutative operations. Moving on to our next topic, the Spark configuration: the SparkConf class provides the set of configurations and parameters needed to run a Spark application on the local system or on any cluster, and when you use a SparkConf object to set values, those values automatically take priority over system properties. This class contains various getter and setter methods, some of which are: set(), used to set a configuration property; setMaster(), used to set the master URL; setAppName(), used to set an application name; get(), used to retrieve the configuration value of a key; and setSparkHome(), used to set the Spark installation path on the worker nodes. Moving on to the next topic on our list, SparkFiles: the SparkFiles class contains only class methods, so the user cannot create SparkFiles instances. It helps resolve the paths of files that are added using SparkContext's addFile method. SparkFiles contains two class methods, get() and getRootDirectory(): get() is used to retrieve the absolute path of a file added via SparkContext.addFile, and getRootDirectory() is used to retrieve the root directory that contains the files added through SparkContext.addFile. These are small topics; a short combined sketch of the shared variables and these utility classes follows below, and after that the next topic we will cover on our list is DataFrames.
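Here is the combined sketch: it assumes an existing SparkContext `sc`, the file name is hypothetical, and it only illustrates the calls described above.

    from pyspark import SparkFiles

    lookup = sc.broadcast({"a": 1, "b": 2})   # broadcast: read-only copy on every node
    bad_records = sc.accumulator(0)           # accumulator: aggregates values from the executors

    def score(word):
        if word not in lookup.value:          # executors read the broadcast copy
            bad_records.add(1)                # executors add to the accumulator
            return 0
        return lookup.value[word]

    total = sc.parallelize(["a", "b", "c", "a"]).map(score).sum()
    print(total, bad_records.value)           # the driver reads the accumulator value

    sc.addFile("lookup.txt")                  # ship a (hypothetical) file to every node
    print(SparkFiles.get("lookup.txt"))       # absolute path of the shipped file
    print(SparkFiles.getRootDirectory())      # root directory for added files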
Now, DataFrames in Apache Spark are a distributed collection of rows under named columns, similar to relational database tables or Excel sheets, and they also share some common attributes with RDDs. Some characteristics of DataFrames: they are immutable in nature, meaning you can create a DataFrame but you cannot change it; they allow lazy evaluation, that is, the task is not executed unless and until an action is triggered; and DataFrames are distributed in nature and designed for processing large collections of structured or semi-structured data. They can be created from different data sources: you can load data from files like JSON or CSV, from an existing RDD, from databases like Hive or Cassandra, or from Parquet, CSV and XML files; there are many sources through which you can create a particular DataFrame. Now let me show you how to create a DataFrame in PySpark and perform various actions and transformations on it, so let's continue in the same notebook we have here.
We have taken the New York flights data, and I am creating a DataFrame called nyc_flights_df. To load the data we are using the spark.read.csv method; I need to provide the path, which here is a local path, since by default it takes the HDFS path, same as for RDDs. One thing to note here is that I have provided two extra parameters, inferSchema and header: if we do not set these to true, or omit them, then if your dataset contains the column names in the first row it will take them as data too, and it will not infer the schema. Now, once we have loaded the data into our DataFrame, we use the show action to see the result; as you can see here, it gives us exactly the top 20 rows of the dataset: we have the year, month, day, departure time, departure delay, arrival time, arrival delay and many more attributes.
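The read step looks roughly like the sketch below, assuming an existing SparkSession `spark`; the local CSV path is hypothetical, and the schema and count calls narrated next are included as well:

    # header=True keeps the first row as column names; inferSchema=True detects column types
    nyc_flights_df = spark.read.csv("file:///home/edureka/nyc_flights.csv",
                                    inferSchema=True, header=True)

    nyc_flights_df.show()            # top 20 rows
    nyc_flights_df.printSchema()     # column names and inferred types
    print(nyc_flights_df.count())    # total number of records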
Now, to print the schema of the DataFrame, you use printSchema. Let's take a look at the schema: as you can see here, we have year, which is an integer, month, and in fact almost half of the columns are integers; we have the carrier as a string, the tail number as a string, the origin as a string, the destination as a string and so on. Now suppose I want to know how many records there are in my DataFrame; you need the count function for this, and it will give you the result.
As you can see here, we have 336,776 records, to be exact. Now suppose I want to see just the flight carrier, origin and destination, these three columns of the DataFrame: we use the select option, and as you can see here we get the top 20 rows. What we just saw was a select query on this DataFrame, but if I want to check the summary of a particular column, say the lowest or highest value in the distance column, I need to use the describe function. I will show you what the summary looks like: for distance, count is the total number of rows, and we have the mean, the standard deviation, the minimum value, which is 17, and the maximum value, which is 4983.
That gives you a summary of the column. Now that we know the minimum distance is 17, let's filter our data using the filter function where the distance is 17; you can see we get the rows from the year 2013 with that minimum distance. Similarly, if I want to look at flights originating from EWR, I use the filter function again. There is also a where clause, which filters in the same way: suppose I want to filter the flight data to the flights that departed on the second of any month; instead of filter we can use where, and it gives us the same result.
We can also pass multiple conditions: say I want the flight day to be the seventh, the origin to be JFK, and the arrival delay to be less than zero, meaning none of the delayed flights. To look at those rows we use the where clause and separate the conditions with the & symbol, and as you can see all the resulting data has day seven, origin JFK and an arrival delay below zero. Those were the basic transformations and actions on a DataFrame; a small sketch of this multi-condition filter follows below.
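A small sketch of the multi-condition filter described above; the column names day, origin and arr_delay are assumptions about how the flights columns are spelled in this data set.

from pyspark.sql.functions import col

nyc_flights_df.filter(
    (col("day") == 7) &
    (col("origin") == "JFK") &
    (col("arr_delay") < 0)          # arr_delay assumed to be the arrival-delay column
).show()

# where() is an alias for filter(), so the same query can be written with where()
nyc_flights_df.where((col("day") == 7) & (col("origin") == "JFK") & (col("arr_delay") < 0)).show()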
Another thing you can do is create a temporary table for SQL queries. If someone is not familiar with all these transformations and aggregations and would rather run SQL on the data, you can use registerTempTable to create a table from your DataFrame. What we will do is register the nyc_flights_df DataFrame as a table called nyc_flights, which can be queried later with SQL. Remember that earlier we used nyc_flights_df.show(); now we can run SELECT * FROM nyc_flights to get the same result.
Suppose we want to see the minimum air time of any flight. We run SELECT MIN(air_time) FROM nyc_flights, passing the whole SQL query to the sqlContext.sql function, and the minimum air time comes back as 20. Now, to see the records whose air time is that minimum of 20, we can use a nested SQL query: checking which flights have the minimum air time cannot be done in a single flat query, so we select everything from nyc_flights where the air time is in the result of another query, SELECT MIN(air_time) FROM nyc_flights. Let's see whether this works; a sketch of these queries follows below.
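A sketch of the temporary-table and SQL steps, assuming an SQLContext named sqlContext and a column called air_time; both names are assumptions about this particular notebook.

nyc_flights_df.registerTempTable("nyc_flights")   # createOrReplaceTempView() in newer Spark versions

sqlContext.sql("SELECT * FROM nyc_flights").show()
sqlContext.sql("SELECT MIN(air_time) FROM nyc_flights").show()

# nested query: every flight whose air time equals the minimum
sqlContext.sql("""
    SELECT * FROM nyc_flights
    WHERE air_time IN (SELECT MIN(air_time) FROM nyc_flights)
""").show()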
As you can see, there are two flights with the minimum air time of 20. So that covers DataFrames; going back to our presentation and the list we were following, we have completed DataFrames, and next is storage levels. StorageLevel in PySpark is a class that helps decide how RDDs should be stored: based on it, RDDs are kept on disk, in memory, or both, and the storage level also decides whether the RDD should be serialized and whether its partitions should be replicated. A small sketch follows below, and the final topic on today's list is MLlib.
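Before moving on to MLlib, here is a minimal sketch of choosing a storage level explicitly; the RDD here is a throwaway example, not one from the demo.

from pyspark import StorageLevel

numbers = sc.parallelize(range(1000))
numbers.persist(StorageLevel.MEMORY_AND_DISK)   # keep partitions in memory, spill to disk if needed
print(numbers.getStorageLevel())                # shows serialization/replication settings
numbers.unpersist()                             # release the storage when done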
MLlib is the machine learning API provided by Spark, also available in Python, and this library is widely used for machine learning as well as real-time streaming analytics. Among the algorithms it supports, first there is collaborative filtering: spark.mllib supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries, and to learn these latent factors spark.mllib uses the alternating least squares (ALS) algorithm. Then there is mllib clustering, which is an unsupervised learning problem.
In clustering we try to group subsets of entities based on some notion of similarity. Then we have frequent pattern mining (FPM): mining frequent items, item sets, subsequences or other substructures is usually among the first steps in analyzing a large-scale data set, and it has been an active research topic in data mining for years. We have linear algebra, where PySpark MLlib provides utilities for linear algebra operations. We have collaborative filtering, and we have classification: methods for binary classification, multi-class classification and regression analysis are available in the spark.mllib package, and some of the most popular classification algorithms are decision trees and random forests. Finally we have linear regression, which belongs to the regression family of algorithms; finding relationships and dependencies between variables is the main goal of regression.
Additionally, the PySpark MLlib package covers other algorithm classes and functions. Now let's try to implement the concepts we have learned in this PySpark tutorial session: here we are going to build a heart disease prediction model using a decision tree, with both classification and regression, all of which come from the MLlib library. The first thing we need to do is initialize the Spark context; next we read the UCI heart disease data set and clean the data, so we import the pandas and numpy libraries. We create a DataFrame called heart_disease_df and, as mentioned above, use the CSV read method; since this file has no header we set header to none. The original data set contains 303 rows and 14 columns. For the heart disease diagnosis categories we are predicting, the value 0 stands for less than 50 percent diameter narrowing and the value 1 for more than 50 percent diameter narrowing, and here we use the numpy library for that recoding.
Some of these are older methods and show a deprecation warning, but that is no problem, they work fine. As you can see, the heart disease diagnosis categories we are predicting are value 0 for less than 50 percent narrowing and value 1 for more than 50 percent. What we did here was drop the rows with question marks and empty values; looking at the data set again you can now see zeros in many places where the question marks used to be. We then save it to a text file, and after removing the rows with empty values we are left with 297 rows and 14 columns. That is what the cleaned data set looks like. Next we import the MLlib regression module: what we are going to do is create a LabeledPoint, which is a local vector associated with a label or response, so we import pyspark.mllib.regression and work with the text file we just created, now without the missing values.
The next thing we do is pass the data line by line into MLlib LabeledPoint objects and convert the -1 labels to zero. Taking a look after parsing the training lines, we have labels of 0 and 1, which is what we want; a sketch of this parsing step follows below. After that we will perform classification using a decision tree, for which we need to import pyspark.mllib.tree.
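A hedged sketch of that parsing step; the file path and the assumption that the label is the last comma-separated field are mine, since the exact layout of the cleaned file is not shown.

from pyspark.mllib.regression import LabeledPoint

def parse_point(line):
    values = [float(x) for x in line.split(",")]
    label = 0.0 if values[-1] < 1 else 1.0     # collapse the diagnosis into the 0/1 labels
    return LabeledPoint(label, values[:-1])    # remaining fields become the feature vector

data = sc.textFile("file:///home/edureka/heart_disease_clean.txt").map(parse_point)
print(data.take(1))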
Next we split the data into training and test sets in a 70 to 30 ratio, which is a standard split: 70 percent is the training data and 30 percent is the test data. Then we train the model on the training set using DecisionTree.trainClassifier, passing the training data, the number of classes as 2, the categorical features we have, and a maximum depth of 3. The next step is to evaluate the model on the test data set and compute the error, so we create predictions using the test data.
We run the predictions through the model we created and compute the test error, which comes out as 0.2297, and we can print the learned classification decision tree model with its feature splits; as you can see, our model is quite good. Now we will use regression for the same purpose: we run DecisionTree.trainRegressor on the same training data, where previously we used classification and now we use regression. Similarly, we evaluate the model on the test data set and compute the error, which for regression is the mean squared error; it comes out as 0.168, which is good. Finally, if we look at the learned regression tree model, we see a regression tree of depth 3 with 15 nodes along with all its feature splits; a hedged sketch of both the classification and regression steps follows below. After that, let's look at the system requirements.
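Here is a hedged sketch of the classification and regression steps just described, using pyspark.mllib; the 70/30 split, numClasses=2 and maxDepth=3 mirror the values mentioned in the walkthrough, while the variable names are my own.

from pyspark.mllib.tree import DecisionTree

train_data, test_data = data.randomSplit([0.7, 0.3])

clf_model = DecisionTree.trainClassifier(
    train_data, numClasses=2, categoricalFeaturesInfo={}, maxDepth=3)

predictions = clf_model.predict(test_data.map(lambda lp: lp.features))
labels_and_preds = test_data.map(lambda lp: lp.label).zip(predictions)
test_error = labels_and_preds.filter(lambda lp: lp[0] != lp[1]).count() / float(test_data.count())
print("Classification test error:", test_error)

reg_model = DecisionTree.trainRegressor(train_data, categoricalFeaturesInfo={}, maxDepth=3)
reg_preds = reg_model.predict(test_data.map(lambda lp: lp.features))
mse = test_data.map(lambda lp: lp.label).zip(reg_preds) \
               .map(lambda lp: (lp[0] - lp[1]) ** 2).mean()
print("Regression mean squared error:", mse)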
Here I will explain the minimum system requirements: the minimum RAM required is around 4 GB, but an 8 GB RAM system is recommended, and the minimum free disk space should be at least 25 GB. The processor should be an i3 or higher for a smooth programming experience, and above all the system must run a 64-bit OS; in case you are using a virtual machine in VirtualBox, it should also support a 64-bit OS image. Those are the hardware requirements. Coming to the software requirements, we need Java 8 or above, Hadoop 2.7 or higher since Spark runs on top of Hadoop, and pip version 10 or higher; pip is a package management system used to install and manage software packages written in Python.
You can also use conda, and finally we need the Jupyter notebook. This step is optional, but the programming experience in the Jupyter notebook is much better than in the shell. So let's go ahead and see how we can install PySpark on our systems. I have a Windows system, and to install PySpark I am using VirtualBox and will create a virtual machine inside it, because most of the time PySpark is used in a Linux environment, so that is what I am going to use. Let's see how to install VirtualBox.
All you need to do is go to the official VirtualBox website, and in the download section you will find the latest version of VirtualBox. Click on the Windows hosts or Linux distribution link; if you are already on Linux you do not need VirtualBox at all. For Windows you click the link and install it. I have already installed VirtualBox and created my VM; this VM uses CentOS 7 as its base image, a Red Hat-based distribution, so everything shown here also works on a native Linux platform. First of all we need to check whether Hadoop and Java are installed, and for that we check the .bashrc file.
The .bashrc file contains the paths of all the frameworks being used; for example, you can see that we have Hadoop installed on our system with all its paths set, and we have Java installed. We can also check the Hadoop version we are running: as you can see, we have Hadoop 2.7.3, and to check the Java version we run java -version, which shows Java 8 running on the system. Now that Hadoop and Java are installed, we need to install Spark. To do that we go to the official Apache website at spark.apache.org and open the download page, where you can select which stable release of Spark you want; the latest version here is from June 8, 2018 and is pre-built for Apache Hadoop 2.7 and later, as we saw previously.
We have Hadoop 2.7.3 so okay now to download Apache Spark you have to click on this link and here you will get various mirror sites and links from where you can download the tar file so I have already downloaded it so let me. I show you guys, as you can see here, it is a TTC file which is a tar file. Now we need to extract this file and place it in our specific location where we want. Let me close this first. For that we must go first. for downloads as you can see we have the spark 2.3.1 Hadoop 2.7 tgz now we need to enter this file so we use the command tar hyphen xvcf and Spark 2.3.1 name so what we will do is extract it or I prefer to say unbind the file in the section downloads so now if we look at the list of items we can see that we have Spark 2.3.1 pin Hadoop 2.7 and we also have the beta file so what we need to do is move this to any specified location where we want our Frameworks, so what I normally do is I keep all my Frameworks like Hadoop Spark Kafka, we have Flume or Cassandra in my User Library section, so the USR library, as you can see, I have Cassandra Flume Hive Maven. storm and then I copied Spark, so now that we have copied Spark to a specific location, we also need to put its path into The Bash RC file, so let me open the bashasi fight again so as you can see here, put the path to Apache Spark, so there are two parts that you need to configure here, which is Spark Home and the path Spark Home has the path to where Spark has moved after it has been unbound or rather pulled from the top file and we.
We also need to add the bin folder inside the Spark directory to the PATH. After mentioning these paths in .bashrc, we need to run source .bashrc, because changes to the file do not take effect in the current shell until we source it. Now, to move into the Spark directory, we just use cd with a dollar sign and SPARK_HOME, and we are inside Spark.
If we take a look at the items inside Spark, we find a python folder. Going into it, you can see the different libraries and configuration files used to run PySpark, and inside there is also a pyspark folder containing the Python libraries and programs. Now that Spark is installed and its path is mentioned in the .bashrc file, it is time to install the Jupyter notebook as well. To install Jupyter we first need to install pip or conda; as I mentioned earlier, pip is a package management system used to install and manage software packages, and this is the command to install pip.
Make sure the pip version is 10 or higher before installing the Jupyter notebook. After installing pip we use the pip install jupyter command, which installs the Jupyter notebook on our system; once it is installed, we just type jupyter notebook on the command line and the notebook opens for us. As you can see, in the New section we have Python 2, which we will use while writing programs for PySpark. One thing to keep in mind is that at this point Jupyter and PySpark are not yet communicating; for that we go back to the .bashrc file, and after the Spark path we also provide the PySpark driver settings, which point to Jupyter. If you are using Spark with Scala, you also need to provide the path for Scala. To use the Jupyter notebook, all you need are these two lines: PYSPARK_DRIVER_PYTHON set to jupyter and PYSPARK_DRIVER_PYTHON_OPTS set to notebook. Now that Spark is installed, let's run it: we go to SPARK_HOME and run ./sbin/start-all.sh, which starts the master and the worker. If you want to start them separately you can use the start-master.sh and start-slave.sh scripts, but I usually use start-all.sh since it starts both. To check whether Spark is running we use the jps command; as you can see, we have Master and Worker running along with the Hadoop ResourceManager, NameNode, SecondaryNameNode,
NodeManager and DataNode. One more important thing: after making changes to the .bashrc file you again need to go to the command line and type source .bashrc, which applies the notebook settings as well as the PySpark ones. As you can see, the moment I type pyspark and press Enter, PySpark starts and I am redirected to the Jupyter notebook; the notebook is now communicating with the PySpark environment. We can go to Python 2, create a new notebook, and start writing our programs here as well. Ultimately it is up to you whether you do all the programming in the shell or in the notebook; personally I find the Jupyter notebook easy to work with since it has options to cut, copy, insert, stop the kernel and much more. To check that the notebook works, I am creating an RDD - resilient distributed datasets are a key concept in PySpark. The asterisk next to the cell shows that the process is running in the background, and if I look at the RDD I just created, you can see that the PySpark shell works absolutely fine.
In the shell you can see the notebook application running and printing some messages; the last message says it is saving the file as Untitled.ipynb, where .ipynb is the Jupyter notebook extension. So that's it, guys: I hope you understood how to install PySpark on your system and what the required dependencies and hardware and software requirements are. Keep in mind that Jupyter is optional; if you want to do all the programming in the shell itself you don't need it, but Jupyter does add a certain level of sophistication. Now, coming to iterative and distributed computing that processes data across multiple jobs, we often need to reuse or share data between those jobs. In earlier frameworks like Hadoop MapReduce there were many problems in dealing with multi-stage jobs: we had to write the data to some stable intermediate distributed storage like HDFS, the multiple input and output operations made the overall computation slower, and the replication and serialization involved made the process slower still.
The goal, then, was to reduce the number of input and output operations against HDFS, and this can only be achieved by sharing data in memory; in-memory data sharing is 10 to 100 times faster than sharing over the network or disk. RDDs try to solve all these problems by enabling fault-tolerant, distributed, in-memory computation. So let's understand what RDDs are. RDD stands for Resilient Distributed Dataset, and RDDs are considered the backbone of Apache Spark; as I mentioned earlier, they are one of its fundamental data structures. They are schema-less structures that can handle structured and unstructured data, and the data in an RDD is split into chunks based on a key and then scattered across all the executor nodes.
RDDs are highly resilient, meaning they can recover quickly from failures, because the same data chunks are replicated across multiple executor nodes, so even if one executor fails another can still process the data. This lets you run functional computations over your data set very quickly by leveraging the power of multiple nodes. RDDs support two types of operations, namely transformations and actions. Transformations are operations applied to an RDD to create a new RDD, and they work according to the principle of lazy evaluation. So what does lazy evaluation mean?
It means that when you call an operation on an RDD it is not executed immediately; Spark keeps track of the operations called through a directed acyclic graph known as the DAG, and since transformations are lazy we can execute them at any moment by calling an action on the data. With lazy evaluation the data is not loaded until it is needed, which helps optimize the required computation and the recovery of lost data partitions. Actions, on the other hand, are operations applied to an RDD that tell Apache Spark to perform the computation and pass the result back to the driver; the moment an action is invoked, all the pending transformations are computed, and the result is returned or stored in a distributed file system. Let's look at some of the important transformations and actions: among transformations we have map, flatMap, filter, distinct, reduceByKey, mapPartitions and sortBy - don't worry, I will show you exactly how they work and what each is used for - and among actions we have collect, collectAsMap, reduce, countByKey, countByValue, take and many more. Now let's look at some of the important features of PySpark RDDs.
First of all, in-memory computation: RDDs compute in memory, which makes processing much faster. All transformations are lazy, as I mentioned above, so results are not calculated immediately. An RDD tracks its lineage information so lost data can be regenerated automatically, which makes it fault tolerant. Data can be recreated or recovered at any time, and once defined an RDD's contents cannot be changed, which refers to immutability. Partitions are the fundamental unit of parallelism in a PySpark RDD, and users can reuse RDDs and choose a storage strategy for them, which is persistence. Finally, operations such as map, filter and groupBy apply to all the elements of the data set, which means coarse-grained operations are handled well.
There are three ways to create RDDs: from a parallelized collection, from another RDD, or from external data sources such as HDFS, Amazon S3, HBase or any such database. So let's create some RDDs and work on them; I'll run all my exercises in the Jupyter notebook, but you can also run them in the shell. To create an RDD from a parallelized collection we use sc.parallelize, where sc is the Spark context. Since Spark 2.0 the SparkSession bundles the Spark context, the streaming context and the SQL context; before that, the Spark context, SQL context and streaming context had to be loaded separately. The sc.parallelize method creates a parallelized collection, which lets Spark distribute the data across multiple nodes instead of relying on a single node to process it. As you can see, I am creating a list here with Ross 19, Joey 18, Rachel 16, Phoebe 17 and Monica 20. Now that we have created our RDD, myRDD, we will use the take method to return the values to the console, which is our notebook, and take is itself an action. If you remember, as I said before, when an action is invoked all the transformations lined up in the lineage graph of the RDD are computed at once. A common approach in PySpark is the collect action, which returns all the values of your RDD from the Spark worker nodes to the driver. There are performance implications when working with a large amount of data, since it transfers a large volume of data from the worker nodes to the driver; for small amounts of data this is perfectly fine, but as a habit you should almost always prefer the take method.
Take returns the first n elements, passed as an argument to the action, instead of the entire data set; it is more efficient because it scans one partition first and uses those statistics to determine how many partitions are needed to return the result. I have six elements in my RDD, so I call myRDD.take(6), and you can see the output of the RDD. There is another way to take text input: the sc.textFile method, where you need to provide the absolute path of the file you are going to use. I create a new RDD here, and to look at it we use the take method again to view the first five elements; as you can see, we get the first, second, third, fourth and fifth elements. A sketch of parallelize, take and collect follows below.
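A minimal sketch of parallelize, take and collect, using made-up example values along the lines described above; the text file path is hypothetical.

ages = sc.parallelize([("Ross", 19), ("Joey", 18), ("Rachel", 16),
                       ("Phoebe", 17), ("Monica", 20)])

print(ages.take(3))      # only the first three elements - cheap even on large RDDs
print(ages.collect())    # everything back to the driver - fine only for small RDDs

lines = sc.textFile("file:///home/edureka/sample.txt")   # hypothetical text file path
print(lines.take(5))                                     # first five lines of the file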
We can also load a CSV file through sc.textFile; let me show you how. Here we use sc.textFile with the absolute path, loading a FIFA players CSV file. There is another argument I passed, minPartitions, which indicates the minimum number of partitions that make up the RDD. The Spark engine can often determine a good number of partitions based on file size, but you may want to change it for performance reasons, hence the ability to specify a minimum here. After that we use the map transformation to turn the data from a list of strings into a list of lists.
We use a lambda function: chaining sc.textFile with the map function lets us read the text file and split each line by the delimiter to produce an RDD that is a parallelized collection of lists. If we look at the first three elements of this RDD, we can see values such as 201, which is the round ID, along with the match ID and team initials. To see the number of partitions - Spark normally assigns partitions automatically, but in case we want to check it for a particular RDD - we use the getNumPartitions method; since we specified 4 here, the output should be 4. A sketch of this loading step follows below.
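A hedged sketch of loading a delimited file into an RDD of lists; the path and the comma delimiter are assumptions about the FIFA players file used in the demo.

fifa_rdd = sc.textFile("file:///home/edureka/fifa_players.csv", minPartitions=4) \
             .map(lambda line: line.split(","))

print(fifa_rdd.take(3))               # first three records, each already split into fields
print(fifa_rdd.getNumPartitions())    # at least the 4 partitions we asked for
print(fifa_rdd.count())               # total number of records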
And indeed the output is four. Now, if you want to see the number of rows or records in a particular RDD, use the count method; as you can see, we have about 37,000 rows. That was our RDD from the sample text file we took. Now suppose I want to convert all this data to lowercase and split the paragraphs into words; for that we create a user-defined function. As you can see, the function uses the lower and split string methods, and we create a new RDD, split_rdd, by passing the original RDD through this function with the map transformation.
Map runs our transformation on each and every element of the RDD, so if you look at the output of split_rdd you can see that every line has been separated into individual words and they are all lowercase. The next thing we do is use the flatMap transformation; it is similar to map, but the new RDD has all the elements flattened. If we take a look at that RDD, the output is flattened into one sequence rather than nested lists, which is easier to read.
Here I create a stopwords list that contains some of the common stop words - not all of them. My goal is to remove all the stop words from the RDD we have, so we use the filter transformation with a lambda function that keeps only the words that are not in the stopwords list. As you can see, the earlier output contains the stop words while the filtered output does not, because, as intended, the stop words are excluded from the new RDD by the filter transformation. Filter can be used in many ways, and in PySpark RDDs it is mainly used with lambda functions; next I create a filtered RDD that will contain all elements starting with c, and a sketch of map versus flatMap and these filters follows below.
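A small sketch of the map versus flatMap difference and the stop-word filter, with a toy two-line RDD standing in for the text file used in the demo:

rdd1 = sc.parallelize(["Spark is fast and general", "Spark uses RDDs and DataFrames"])

def to_words(line):
    return line.lower().split()

split_rdd = rdd1.map(to_words)        # one list of words per line: nested output
flat_rdd = rdd1.flatMap(to_words)     # one flat sequence of words

stopwords = ["is", "and", "the", "a"]                                  # a tiny illustrative list
filtered_rdd = flat_rdd.filter(lambda word: word not in stopwords)     # drop the stop words
c_words = filtered_rdd.filter(lambda word: word.startswith("c"))       # keep words starting with c

print(split_rdd.collect())
print(flat_rdd.collect())
print(filtered_rdd.collect())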
I use the lambda function lambda x: x.startswith('c'), and we can look at the output of this new filter. Another thing to note is that I also apply the distinct transformation, which returns a new RDD containing only the distinct elements of the source RDD; as you can see, we get words like control, claim, code, computer, connections, case, car - all starting with c. The next thing I am going to do is run a small word count program; as the name suggests, its result gives us the count of each particular word.
Here I take the top 10 words. First I use map with a lambda so that each word x becomes the pair (x, 1); then I group by key using the groupByKey method, which groups the data by the word. Next I create another RDD that adds up the frequencies: it takes the grouped RDD and maps the values with sum, producing the total for each key, which is the word. Then I use another map transformation to rearrange the pair and sortByKey with ascending set to false, so the output keeps the word together with its count, sorted from most to least frequent. Looking at the top 10 words, one word has 48 occurrences, another 38, and so on down to 7; a sketch of this word-count pipeline follows below.
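A sketch of the word-count pipeline, continuing from the toy filtered_rdd in the earlier sketch; here reduceByKey replaces the groupByKey-plus-sum step used in the demo, which produces the same totals and is usually more efficient because it combines values on each partition first.

word_counts = (filtered_rdd
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .map(lambda kv: (kv[1], kv[0]))        # swap to (count, word) so we can sort by count
               .sortByKey(ascending=False))

print(word_counts.take(10))    # the ten most frequent words with their counts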
The initial rdd1 we created after removing all the stop words contains 660 rows. Now I use the distinct method to get the distinct elements into rdd2, so the count of rdd2 should be less than or equal to the count of rdd1; it comes out as 440, which means 220 duplicates were removed from the list. Next, say I want to look at the elements of rdd2 and group together the words whose first three letters are the same.
I will show you the result so you can see what I mean: words such as edge and edges share the prefix edg, year and years share yea, algorithm-related words share alg, scale-related words share sca, and if you run it with a prefix length of two you see groups of words starting with g, such as guide, gradual, growth, graphs and gains. These are little tricks that help you explore your data faster with RDDs. Now I am going to use the sample method: I create a new RDD, sample_rdd, from rdd1, and as you can see I pass two arguments, False and 0.1.
What does this mean? False is the withReplacement parameter: I want the sample drawn without replacement, so I pass False, and 0.1 is the fraction of the data to sample, so I am taking roughly 10 percent of the original rdd1. Since rdd1 contains 660 rows, 10 percent should be around 66 or 67. If we count the new sample RDD it comes out as 51, which is in the right range, and because we used False the sample is drawn without replacement. Looking at the output, the sample RDD contains a random subset of the original RDD's words; a sketch of sampling follows below.
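A minimal sketch of the sample call; the 660-element range simply stands in for the 660-row word RDD from the demo.

big_rdd = sc.parallelize(range(660))
sample_rdd = big_rdd.sample(False, 0.1)   # withReplacement=False, fraction=0.1
print(sample_rdd.count())                 # roughly 66, but the exact count varies per run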
The next things we are going to learn are functions such as join, reduce, reduceByKey and sortByKey, starting with joins. For that I create two key-value RDDs: in a, key 'a' has the value 2 and key 'b' has 3, while in b, key 'a' has 9, key 'b' has 7 and key 'c' has 10. To join them I create another RDD, c, using a.join(b), and we can look at the output of c.
I use collect because c holds a very small amount of data, though take is generally recommended. Looking at c, key 'a' has the pair of values 2 and 9, and key 'b' has 3 and 7. This is an inner join; similarly you can perform a left outer join or right outer join. Next I create a num_rdd that holds all the numbers from 1 to 50,000, using the range method, and then I use the reduce action, which reduces the elements of an RDD with a given function; here the function is a lambda over x and y that returns x plus y.
This gives us the sum of the numbers from 1 to 50,000, which as you can see is quite large. Next I will use the reduceByKey method and show you why it matters: it works like reduce, but performs the reduction key by key. The key data assigns numbers to letters - pairs such as ('a', 4), ('a', 8), ('b', 3), ('b', 1), ('c', 2), ('d', 2) and ('d', 3) - and there is another parameter, 4, which is the number of parallel tasks to run while processing the key_data RDD. Applying reduceByKey with the same lambda function, you can see that 'a' ends up with 12, the sum of 4 and 8, 'd' has 5 from 2 and 3, 'c' has 2 and 'b' has 4.
To save your RDD to a file you use the saveAsTextFile method; here we save the rdd3 values to a file called data.txt, providing the absolute path where you want to store the data. Checking where the data landed, on the desktop we have a folder called data.txt containing the part files and a _SUCCESS file, and the part files contain all the elements. Next I create another test RDD to show the sortByKey function: the sortByKey transformation sorts a key-value RDD by key and returns it in ascending or descending order - ascending by default, but you can also get descending output. Using sc.parallelize on the test data and calling sortByKey with True and 1, where True means ascending, the keys come back in ascending order: the numeric keys first, then a, b and d. Next we will learn how union works: the union transformation returns a new RDD that is the union of the source RDD and the RDD passed as an argument. Here we have two RDDs containing certain elements; to take their union into a new RDD we use the union method, and the syntax is pretty simple - as you can see it produced the union of the two RDDs. Intersection works in a similar way.
The next thing we learn is mapPartitionsWithIndex. It is similar to map, but it runs the function f separately on each partition and also provides the index of the partition, which is useful for examining the skew of data across partitions. For example, I have a function f here that takes split_index and an iterator and yields the split_index; I create an RDD called mpwi - that is just a name, you do not have to call yours that - and if we sum the results we get 6. Next, we have two RDDs here, a and b, and I will show you how intersection works; it is used in the same way as union, but it gives the intersection of the two sets, producing an RDD that contains only the elements found in both. Ideally it should give us 3 and 4, which are common to both RDDs, and indeed the output is [4, 3], which is what we expected.
If we are going to use the subtraction method, what it will do is subtract the set b r DB from rdd a, okay, so the output should be 1 and 2. I guess it would rather say 2 and 1. Now, another interesting thing to consider It is the Cartesian product that we use. method here to get the product of rad A and B so you can see that each element of rdta has been mapped with each element of rddb, so I hope you all understood a lot about the transformations, actions and operations that can be performed in rdd to get a certain result so guys let's do something interesting now so what we are going to do here is to find the page rank of certain web pages and we will use rdd to solve this problem so let's understand what it is.
PageRank is basically the rank Google assigns to a web page, computed by the PageRank algorithm, which, as you can see here, follows a certain formula. The algorithm was developed by Sergey Brin and Larry Page - Larry Page is the person the algorithm is named after - and its developers later founded Google, which is why Google uses the PageRank algorithm and is one of the best search engines in the world. The page rank of a particular web page indicates its relative importance within a group of web pages: the higher the page rank, the higher it will appear in the search results.
The importance of a page is defined by the importance of all the web pages that provide an inbound link to the page under consideration: for example, if a web page X with very high relative importance links to page Y, then Y's importance rises as well. Looking at the formula, the page rank of a page at a given iteration is the sum, over all its inbound links, of each linking page's previous rank divided by the number of outbound links on that linking page. To get a clear idea of how it works, take an example with four pages: Netflix, Amazon, Wikipedia and Google. In iteration zero we split the rank equally across all the pages in the network, so the page rank for Netflix is 1/4, for Amazon 1/4, for Wikipedia 1/4 and for Google 1/4; this is just the initialization. Starting from iteration one, consider Netflix: the formula says page rank of each incoming link divided by the number of links on that page, and the only incoming link to Netflix comes from Wikipedia, so we take Wikipedia's previous rank of 1/4 and divide by the number of outbound links from Wikipedia - one to Netflix, one to Amazon and one to Google - which gives 1/4 divided by 3, or 1/12.
Now look at Amazon: it has inbound links from both Netflix and Wikipedia, so its new rank is the sum of their contributions. For Netflix we take its previous rank of 1/4 divided by the number of outbound links from Netflix, which is 2, giving 1/8; for Wikipedia we take 1/4 divided by its 3 outbound links, giving 1/12. So Amazon's rank is 1/8 + 1/12, which is 2.5/12. If you do the math for every page, that is how the PageRank algorithm works. Now we will implement the same thing using RDDs, so let's get started. In our example we have four pages, a, b, c and d. Consider that d has two inbound links, one from a and one from b, so its rank is the sum of a's previous rank divided by a's three outbound links and b's previous rank divided by b's two outbound links. Let's see how this can be implemented.
The first step is to create a nested list of web pages with their outbound links and initialize the ranks. Here I have created an RDD of page links: page a links to b, c and d, page c links to b, page b links to d and c, and page d links to a and c. Initially we assign every page a rank of one, as I mentioned above. After creating the ranks for the nested list, we define the number of iterations to run PageRank for, and then we write a function that takes two arguments: the first is the list of web pages that a given page links out to, and the second is the current rank of that page, which is shared out through those outgoing links.
The function returns the rank contribution to every web page in the first argument. In our function the first argument is the list of URIs and the second is the rank; the code is self-explanatory: rank_contribution first computes the number of elements in the URI list, then computes the contribution to each given URI, and finally returns the contributed rank for each URI. Next we create a pair RDD from the link data and another pair RDD for the rank data, and you can see that every rank starts at one.
Next we set the number of iterations to 20 and define s as 0.85; s is known as the damping factor. With those values assigned, it is time to write the final loop that updates the page rank of each page. Since the number of iterations is 20 and the damping factor is 0.85, let's run the computation; if you look at the notebook, in block 49 no action has been triggered yet.
The action happens in block 50, where the loop runs for 20 iterations; to confirm, you can go to the console and see the processes running. This takes a certain amount of time depending on the number of iterations, and it gives us the final ranks. Let's walk through the loop: first we join the page-links RDD with the page-rank RDD via an inner join, then the second line of the for block calculates each page's rank contributions using the rank_contribution function we defined above, the next line sums all the contributions per page, and the last line updates the rank of each web page using the map function. As you can see, the output gives b a rank of 0.22, d 0.24 and a 0.25; the damping factor plays an important role here. The value of c must be even lower than a, b and d, so a has the highest page rank, followed by d, then b, with c lowest. A sketch of this loop follows below.
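Here is a hedged sketch of that PageRank loop; the link structure and the 0.85 damping factor follow the walkthrough, but the variable names are mine and the ranks are initialised at 1/4 so that they always sum to 1, matching the final numbers discussed above.

page_links = sc.parallelize([("a", ["b", "c", "d"]),
                             ("c", ["b"]),
                             ("b", ["d", "c"]),
                             ("d", ["a", "c"])])
page_ranks = sc.parallelize([("a", 0.25), ("b", 0.25), ("c", 0.25), ("d", 0.25)])

def rank_contribution(uris, rank):
    num_uris = len(uris)                                 # number of outbound links
    return [(uri, rank / num_uris) for uri in uris]      # share the rank across them

s = 0.85                                                 # damping factor
for _ in range(20):
    contributions = page_links.join(page_ranks) \
                              .flatMap(lambda kv: rank_contribution(kv[1][0], kv[1][1]))
    summed = contributions.reduceByKey(lambda x, y: x + y)
    page_ranks = summed.map(lambda kv: (kv[0], (1 - s) / 4 + s * kv[1]))

print(page_ranks.collect())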
One important thing to note is that the sum of all page ranks always equals 1. If we go back to our earlier example, in iteration zero 1/4 + 1/4 + 1/4 + 1/4 equals 1, in iteration one 1/12 + 2.5/12 + 4.5/12 + 4/12 again gives 1, and similarly for iteration two. The higher the number of iterations, the clearer and more accurate the result becomes. Now I am sure you are wondering why exactly we need DataFrames.
The concept of DataFrames comes from the world of statistical software used in empirical research. DataFrames are designed to process a large collection of structured or semi-structured data; the observations in a Spark DataFrame are organized into named columns, which helps Apache Spark understand the schema of the DataFrame and lets it optimize the execution plan for queries. DataFrames in Apache Spark can handle petabytes of data, which is typically big data scale, they support a wide range of data formats and sources - we will learn about these later in this video - and, last but not least, they have API support for different languages such as Python, R, Scala and Java, which makes them easier to use for people with different programming backgrounds.
Another important feature is that the DataFrame API supports elaborate methods for slicing and dicing the data, including operations like selecting rows and columns by name or by number, filtering rows, and many others. Statistical data is often messy and contains many missing or incorrect values, so a critically important feature of the DataFrame is the explicit management of missing data. Now let's understand what DataFrames are. DataFrames generally refer to tabular data: a data structure representing rows, each of which consists of a number of observations or measurements known as columns.
Alternatively, each row can be treated as a single observation of multiple variables. In either case, a row - that is, a record - can be heterogeneous in its data types, while a column must be homogeneous. DataFrames usually carry some metadata in addition to the data, for example column and row names. A DataFrame is essentially a 2D data structure, similar to a SQL table or a spreadsheet. There are some important features of DataFrames we need to know: first of all, they are distributed in nature, which makes them highly available and fault tolerant.
The second feature is lazy evaluation. Lazy evaluation is an evaluation strategy that holds off evaluating an expression until its value is needed and avoids repeated evaluation. In Spark it means that execution does not start until an action is triggered, and it comes into play whenever a Spark transformation occurs. Transformations are lazy in nature: when we invoke an operation it is not executed immediately; Spark keeps track of the operations called via the DAG, the directed acyclic graph, and since transformations are lazy we can execute them at any time by calling an action on the data. Therefore, with lazy evaluation the data is not loaded until it is required.
Lazy evaluation has many advantages: it increases manageability, saves computation, increases speed, reduces complexity, and provides optimization by reducing the number of queries. Finally, DataFrames are immutable: an immutable object is one whose state cannot be modified after creation, but we can transform its values by applying transformations, just as with RDDs. A DataFrame in Apache Spark can be created in multiple ways: from different data formats, for example by loading data from JSON, CSV, XML or Parquet files, from an existing RDD, from databases such as Hive or Cassandra, and from files residing in the local file system as well as in HDFS. Now let's look at the important classes for DataFrames and SQL: pyspark.sql.SQLContext is the main entry point for DataFrame and SQL functionality, pyspark.sql.DataFrame is a distributed collection of data grouped into named columns, pyspark.sql.Column is for column expressions in a DataFrame, pyspark.sql.Row is the equivalent for rows, and pyspark.sql.GroupedData provides the aggregation methods returned by DataFrame.groupBy.
We will learn more about these important classes later in the video, so let's go ahead and create some DataFrames. I am using the Jupyter notebook instead of the PySpark shell, as I personally find it easier to work with, but ultimately it comes down to your choice. To start the PySpark shell, all you need to type is pyspark once Spark is configured on your virtual machine or personal computer, and if you want the Jupyter notebook to integrate with PySpark you need to add those few lines to the .bashrc file, which I have already done.
As soon as I run pyspark, instead of dropping into the PySpark shell I am redirected to the Jupyter notebook, which opens for me, and I select a new Python 2 notebook. First we will create sample department and employee data, so let me show you how it's done. The first and most important step for DataFrames is to import pyspark.sql. Now that the import is successful, I create an Employee row type with the Row function, so each row will have the columns first name, last name, email address and salary.
Next I create some employee data: as you can see, I have created five employees in the same format - first name, last name, email and salary - and employees one through four are all Employee rows. Now that we have employees, let's create the departments. As I was saying, Python is very easy and DataFrames are easy to use; to show you, say I want to see the values for employee 3, all I need to do is run print on employee3, and we get a Row in which the first name is Muriel, the last name is not specified, and we have the email ID and the salary.
Suppose we want to see a particular field of the employee row we created earlier: if I print the employee with 0 in brackets, I get the first name field. So I am using the Row function to create the employees and departments, and now I create DepartmentWithEmployees instances that combine departments and employees. As you can see, I do not even need indexing here; all I have to do is print department4, and whatever is in department 4 is shown - the ID and the name - where the ID is 901234 and the department name is the development department.
Next we create a DataFrame from a list of rows, and this is where things get a little more involved. As you can see, we use spark.createDataFrame; it creates a DataFrame, d_frame, that contains the departmentWithEmployees1 and departmentWithEmployees2 instances, which in turn hold department one and department two, with employees one, two and five in department one and employees three and four in department two. If we show d_frame we see the department, with its ID string and name, and then the employees, which is an array of the structure we defined with first name, last name, email and salary. This is how we create a DataFrame; a minimal sketch follows below.
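A minimal sketch of building a DataFrame from Row objects along these lines; the field values are made-up examples and spark is assumed to be the active SparkSession.

from pyspark.sql import Row

Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee("Basher", "Armbrust", "bash@example.com", 100000)   # made-up values
employee2 = Employee("Daniel", "Meng", "daniel@example.com", 120000)

Department = Row("id", "name")
department1 = Department("901234", "Development")

DepartmentWithEmployees = Row("department", "employees")
dwe1 = DepartmentWithEmployees(department1, [employee1, employee2])

d_frame = spark.createDataFrame([dwe1])
d_frame.show()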
Now let's take a look at a FIFA World Cup use case - I hope Argentina recovers from the Croatia game. Here we have the data of all the players of the FIFA World Cup, in CSV format, so let me show how to load it. I select a new notebook, create fifa_df, and use the spark.read.csv function, which reads the CSV file at the given path, with inferSchema set to true and header set to true. If you do not set these to true, then since the first row of the file usually holds the column names, Spark will not infer the schema and will treat the first row as data.
The DataFrame looks like this: we have the round ID, the match ID, the team initials, the coach name and the team they belong to, the line-up, the player name, the position played and the event. To print the schema of the DataFrame, all we have to do is use printSchema; as you can see, the round ID is an integer, the match ID is an integer, and each column has the nullable option set to true, so even if no value is specified it will be taken as null.
Suppose I want to see all the columns in my DataFrame; we saw them earlier with the show command, but another way is to use .columns. If you want to know the number of rows or records, just call the count function on the DataFrame; we have almost 37,000 rows. To see the number of columns you could count them manually, but when that is not practical you can take the length of the columns list, and here we have 8 columns and 37,784 rows. Now suppose we want a summary of a particular column: we use the describe function, which gives the count of about 37,000 rows, the mean and standard deviation - null in our case, since the column is a string - and the minimum and maximum, which for strings follow alphabetical order, so we get the first and last coach names alphabetically. Similarly we can describe the position column: it has a count of only 4,143, the mean and standard deviation are null, the minimum value is C for captain and the maximum is GKC for goalkeeper captain.
Now suppose you want to see just the player name and coach name: we use the select option and specify those column names, and we get the player name and coach name. There is another function on DataFrames, the filter function. Suppose we want to filter all the player data by match ID: we use filter and then show to get the result, and as you can see all the data shown is for match ID 1096, with only the top 20 rows displayed. Similarly you can filter based on the Captain position, and if I want to check how many records in this DataFrame have the position given as Captain, I count them and get 1,510 rows.
Or rather, 1,510 records where the position is defined as Captain; it is very similar to SQL, and we will see how to run SQL queries shortly. In the same way, if we filter on the position GK, the goalkeeper, we get all those records. We can also filter on more than one parameter: here I take the position as Captain and a specific event code, so we get all the players who played as Captain in that event, and there are only two results. Now that we have seen filter and select, let's see how we can sort or group: we use the orderBy function, sorting by match ID, and as you can see the lowest match ID is 25.
Now that we have seen filter and select, let's see how we can sort, or order, the data. We use the orderBy function, sorting by match ID; the lowest match ID is 25, and the result comes back in increasing order by default. You can set the ascending parameter to false to get the result in descending order instead. Next, I register a temporary table called fifa_table for the same fifa_df data frame, and now I can use SQL queries on it: through the SQL context I pass the query select * from fifa_table, which gives us the same result as fifa_df.show, because the same data is now exposed as a table in a SQL context. Within this SQL context we can pass any kind of SQL query we want, and we will see more of that in a different use case. So I hope you have understood the basic functions, filter, orderBy, select, show and groupBy, that can be used on data frames, and you have also seen how to pass SQL queries: you just invoke sqlContext.sql and give it the query to execute against the table that the data frame has been registered as.
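A sketch of the ordering and SQL steps; createOrReplaceTempView is the current API, while very old Spark versions use registerTempTable for the same thing:

```python
# Sort by match ID (ascending by default; flip it with ascending=False)
fifa_df.orderBy(fifa_df["MatchID"]).show(5)
fifa_df.orderBy(fifa_df["MatchID"], ascending=False).show(5)

# Expose the DataFrame as a temporary table so SQL can be run against it
fifa_df.createOrReplaceTempView("fifa_table")

spark.sql("SELECT * FROM fifa_table").show(5)
```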
Now that we have seen how to create a data frame from a CSV file, apply functions like select, filter and orderBy, and pass SQL queries against it, let's move on to another use case: the superhero use case. Here we have data on all the superheroes, again in CSV format (superheroes.csv), so we load it the same way. Whenever the first row of the data set contains the column names, make sure you set inferSchema and header to true. As I mentioned earlier, you can use the show command to see the content of the data frame, and you can also specify the number of rows: here I have used show(10), which displays the top 10 rows, with serial numbers running from 0 to 9. Looking at the columns, we have the serial number, the name, the gender, the eye color, the race the hero belongs to (alien, human, cosmic entity and so on), the hair color, the height, the publisher (there are several publishers such as Marvel, DC, Dark Horse and NBC), the skin color, the alignment and the weight. If we use printSchema we can see the data type of each column, for example race is a string, and the nullable flag can again be true or false, as I mentioned above.
Now let's filter this using the filter function and see how many male superheroes we have: 505. And how many female superheroes: 200. Next I create another data frame, race_df, which takes the superhero data frame, groups it by race and then gives the count for each race. As you can see we get each race with its associated count; show displays only the top 20 rows, and there are more superhero races than you might expect from the comics. Then I create another data frame, skin_df, which takes superhero_df, groups it by skin color and counts how many superheroes have each skin color. We have 21 green superheroes and a handful with colors such as red and gold, but for most of them, 662, the skin color is not specified. This is one of the benefits of data frames: the data set can contain null values and it does not throw an error.
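A hedged sketch of the superhero steps described above; the local path and the exact column names (Gender, Race, Skin color) are assumptions based on the transcript:

```python
superhero_df = spark.read.csv(
    "file:///home/edureka/datasets/superheroes.csv",  # hypothetical path
    inferSchema=True,
    header=True,
)
superhero_df.show(10)
superhero_df.printSchema()

# Count male and female superheroes
superhero_df.filter(superhero_df["Gender"] == "Male").count()    # 505
superhero_df.filter(superhero_df["Gender"] == "Female").count()  # 200

# Group by race / skin colour and count the members of each group
race_df = superhero_df.groupBy("Race").count()
race_df.show()

skin_df = superhero_df.groupBy("Skin color").count()
skin_df.show()
```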
So far we have seen groupBy and filtering; now let's look at the sort function. I have created weight_df, another data frame, which sorts the superheroes by the weight column, and here I am using the desc function, which means descending. As you can see, the heaviest entry is Sasquatch, a male hero with red eyes, and if you look at the weight column the value is 900, the highest in whatever unit the weight is recorded in. Next I create hero_dc_df to filter all the heroes published by DC Comics, and if I want to know how many there are I simply count that filtered data frame. Similarly, to look at the heroes from Marvel Comics, all we have to do is use the filter function with the publisher set to Marvel Comics and then call count on hero_marvel_df: we have 215 heroes from DC and 388 from Marvel. If you want to see every publisher along with how many heroes it has, you use the groupBy function together with count, and you can see that Marvel Comics has 388 and DC Comics has 215; this is another way of seeing how the superheroes are distributed across publishers.
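A short sketch of the sorting and publisher queries, continuing with the same superhero_df:

```python
from pyspark.sql.functions import desc

# Heaviest superheroes first
weight_df = superhero_df.sort(desc("Weight"))
weight_df.show(5)

# Filter by publisher, then count
hero_dc_df = superhero_df.filter(superhero_df["Publisher"] == "DC Comics")
hero_marvel_df = superhero_df.filter(superhero_df["Publisher"] == "Marvel Comics")
print(hero_dc_df.count(), hero_marvel_df.count())   # 215 and 388 in the transcript

# Hero count per publisher in one shot
superhero_df.groupBy("Publisher").count().show()
```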
Personally, I think DC has a much better story than Marvel, but as you can see Image Comics has 14 heroes and there are 15 with a null publisher, so we can say Marvel and DC are the two main contenders in the superhero comics market. Now I am going to register a superhero table from the superhero_df data frame and, as before, show you how to pass SQL queries. Running select * on the superhero table is equivalent to superhero_df.show, although the result is not presented quite as neatly as it is for the data frame. If we pass a SQL query to select the distinct eye colors from the superhero table, we get the list of all the different eye colors: yellow, violet, grey, green, brown, silver, purple and so on. And suppose I want to know how many distinct eye colors there really are: the count comes to 23. We saw earlier that the maximum weight was 900 units, for Sasquatch. We can get the same result with SQL as well: if we select the maximum of the Weight column from the superhero table, it returns only the maximum value, and as you can see it is again 900.
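The SQL queries on the superhero table, sketched out; backticks are Spark SQL's way of quoting a column name that contains a space:

```python
superhero_df.createOrReplaceTempView("superhero_table")

spark.sql("SELECT * FROM superhero_table").show(5)

# Distinct eye colours and how many there are
spark.sql("SELECT DISTINCT `Eye color` FROM superhero_table").show()
spark.sql("SELECT DISTINCT `Eye color` FROM superhero_table").count()   # 23

# Maximum weight across all heroes
spark.sql("SELECT MAX(Weight) FROM superhero_table").show()             # 900
```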
So guys, these are some of the functions and features you can use on data frames, and as I mentioned earlier data frames matter because they handle structured and semi-structured data, they scale to very large data sets, which is useful for big data computation, and above all they are used for slicing and dicing, that is, cleaning and preparing the data set. The PySpark SQL module is a higher-level abstraction over the PySpark core for processing structured and semi-structured data sets. With PySpark you can process the data using SQL as well as HiveQL, which is why PySpark SQL has become popular among database programmers and Apache Hive users. It also provides an optimized API that can read data from various sources and file formats, such as CSV and JSON, or from databases. Now let me show you how to apply SQL queries on our data frames. First we import SparkSession and also the SQL context from Spark SQL. Earlier we loaded data with spark.read.csv; here we are going to use sqlContext.read.load, again with inferSchema and header set to true.
Now let's load the data into the df data frame and take a look at its schema. This is the New York flights data set: we have the year, month and day, the departure time, departure delay, arrival time, arrival delay, the tail number, the flight number, the origin, the air time, the distance and more. Suppose I want to rename a particular column, say rename the dest column to destination; for that we use the withColumnRenamed function, and if you look at the schema of df1 you can see that dest has been changed to destination. You can rename other columns the same way. Now, if you want the basic statistical information for the whole data frame, you use the describe command; as I mentioned, for a single column you pass the column name to describe, but for the whole data frame you just call describe with no column name. The raw output is a bit messy, but it can be presented more cleanly with the help of supporting libraries.
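A hedged sketch of these steps; SQLContext is the older entry point the transcript uses, the path is hypothetical, and the column name dest follows the usual NYC flights data:

```python
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName("nyc_flights").getOrCreate()
sqlContext = SQLContext(spark.sparkContext)   # older entry point used in the transcript

# load() with format="csv" behaves like spark.read.csv()
df = sqlContext.read.load(
    "file:///home/edureka/datasets/nyc_flights.csv",   # hypothetical path
    format="csv",
    inferSchema=True,
    header=True,
)
df.printSchema()

# Rename a column and confirm it in the new schema
df1 = df.withColumnRenamed("dest", "destination")
df1.printSchema()

# Summary statistics for the whole DataFrame
df.describe().show()
```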
Now, these libraries are the main reason people opt for Python for Spark programming instead of Scala or Java; their availability makes visualization and machine learning considerably easier. If we want to select particular columns from the data set, say the flight origin, destination and distance, I create a new data frame, df2, using a select query, and I also apply the distinct function so we get only the unique origin, destination and distance combinations. Next I import asc, which stands for ascending and is used for sorting, and create another data frame, sorted_df, by passing the data frame to a sort function ordered by the flight column, which holds the flight number. If we call show on sorted_df, the flight numbers start from one. The data set is very large and has many columns, so to look at just the few columns we saw earlier (origin, destination and distance), let me show you a cleaner output: I run a select query on this sorted data frame, and you can see all flights starting from flight number one. I store the result of that select and sort in a new data frame, sorted_df2. As you saw before, there are many duplicate values because the data set is so big: there are many flights originating from JFK with destination LAX and a fixed distance of 2,475, so to get the unique values we apply distinct. Looking at the distinct output of sorted_df2, the flight number is 1 but the tail number, the carrier and the time are all different. Counting the rows in sorted_df2 gives a little over three million. The next thing we do is drop the duplicates on the carrier and tail number columns, removing every row that repeats that combination; we can do this for as many columns as we want. Looking at the count of this deduplicated data frame, the output is only about 4,000 rows, so after dropping duplicates on just those two columns we are down to four thousand.
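The same select, sort, distinct and dedup steps as a sketch, continuing from the previous block; the column names (origin, dest, distance, flight, carrier, tailnum) are assumptions based on the NYC flights data:

```python
from pyspark.sql.functions import asc

# Unique origin / destination / distance combinations
df2 = df.select("origin", "dest", "distance").distinct()
df2.show(5)

# Sort by flight number in ascending order, then keep the interesting columns
sorted_df = df.sort(asc("flight"))
sorted_df2 = sorted_df.select("flight", "origin", "dest", "distance").distinct()
sorted_df2.show(5)
print(sorted_df2.count())

# Drop rows that repeat the same carrier / tail-number combination
deduped_df = df.dropDuplicates(["carrier", "tailnum"])
print(deduped_df.count())
```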
Now, if you want to join two data frames, we have the join function. I create another data frame, joined_df, passing the two data frames we just built, sorted_df2 and sorted_df, and the condition we give is the flight column, so the join is performed on flight. Taking a look at the result, we have the joined output of the two data frames. Next, suppose I want the rows whose origin is JFK, MCO or EWR: besides filter and select there is another function, where, which performs the same action, and the output shows only flights whose origin is one of those three airports. Finally, to calculate an average we first import the avg function; I create another data frame, df3, using the aggregate function with the average of the distance column, and as you can see the average distance comes out to about 1,039.
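The join, where and aggregate calls just described, consolidated into one sketch that builds on the DataFrames from the previous block:

```python
from pyspark.sql.functions import avg

# Join the two DataFrames on the flight column
joined_df = sorted_df2.join(sorted_df, on="flight")
joined_df.show(5)

# where() behaves like filter(): keep only flights leaving JFK, MCO or EWR
df.where(df["origin"].isin("JFK", "MCO", "EWR")).show(5)

# Average flight distance across the whole dataset
df3 = df.agg(avg("distance"))
df3.show()   # roughly 1,039 in the transcript
```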
So guys, that's it for PySpark SQL programming. The next topic of our discussion is Spark Streaming. Spark Streaming is a scalable, fault-tolerant streaming system that follows the RDD batch paradigm: it processes data in small batches, which ultimately speeds up the whole task. Spark Streaming receives an input data stream that is internally split into several smaller batches, and the size of these batches is determined by the batch interval. The Spark engine then processes those batches of input data to produce a stream of processed batches. The key abstraction in Spark Streaming is the DStream, which represents the small batches that make up the data stream. DStreams are built on top of RDDs, allowing Spark developers to work within the same context of RDDs and batches, now applied to streaming problems. Spark Streaming also integrates with MLlib, the machine learning library (we will learn about MLlib programming later in this video), as well as with Spark SQL, data frames and GraphX, which widens your horizon of functionality. Being a high-level API, it provides fault tolerance with exactly-once semantics for stateful operators. Spark Streaming has built-in receivers that can ingest from many sources; these are the building blocks of Spark Streaming: data can come from Kafka, Flume, Twitter, Kinesis or TCP sockets, among others, it is processed using complex algorithms expressed with high-level functions like map, reduce, join and window, and finally the processed data is pushed out to file systems, databases and live dashboards.
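As a small illustration of the DStream API described above, here is a sketch of the classic word count over a TCP socket; the host and port are placeholders, and it assumes the Spark session created earlier (this DStream-based API lives in pyspark.streaming on Spark 3.x and earlier):

```python
from pyspark.streaming import StreamingContext

# Micro-batches every 10 seconds, built on the existing SparkContext
ssc = StreamingContext(spark.sparkContext, 10)

lines = ssc.socketTextStream("localhost", 9999)       # the input DStream
words = lines.flatMap(lambda line: line.split(" "))   # transformation on each micro-batch
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.pprint()                                       # print each batch's result

ssc.start()             # start receiving and processing
ssc.awaitTermination()  # run until stopped
```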
Now, what exactly is machine learning? Machine learning is a data analysis method that automates the building of analytical models, using algorithms that iteratively learn from data. Machine learning allows computers to find hidden insights without being explicitly programmed where to look. It focuses on developing computer programs that can learn, grow and change by themselves when exposed to new data. Machine learning uses data to detect patterns in a data set and adjust the program's actions accordingly. Most industries working with large amounts of data have recognized the value of machine learning technology for gleaning insights from this data, often in real time, so that organizations can work more efficiently or gain an advantage over the competition. Now let's look at the various industries where machine learning is used. Government agencies, such as public safety and utilities, have a particular need for machine learning; they use it for things like face detection, security and fraud detection. In marketing and sales, websites recommend items you might like based on previous purchases; they use machine learning to analyze your purchase history and promote other items you might be interested in. Analyzing data to identify patterns and trends is key to the transportation industry, which depends on making routes more efficient and predicting potential problems to increase profitability. In financial services, banks and other companies use machine learning technology for two key purposes: identifying important insights in the data and preventing fraud. In healthcare, machine learning is a rapidly growing trend thanks to wearable devices and sensors that can use data to assess a patient's health in real time. And in biometrics, the science of establishing a person's identity based on physical, chemical or behavioral attributes, machine learning offers one of its main advantages. Now let's take a look at the typical machine learning life cycle. Any machine learning life cycle is divided into two phases: training and testing.
For training we use 70 to 80 percent of the data, and the remaining data is used for testing. First you train the data with a particular algorithm, and using that algorithm you produce a model; once the model is produced, the remaining 20 to 30 percent of the data is passed to the model to measure its accuracy with certain tests. That is what a typical machine learning life cycle looks like. There are three main categories of machine learning, as I mentioned earlier: supervised, reinforcement and unsupervised learning, so let's understand these terms in detail, starting with supervised learning. Supervised learning algorithms are trained using labeled examples, inputs for which the desired output is known. The learning algorithm receives a set of inputs along with the corresponding correct outputs, and it learns by comparing its actual output with the correct output to find errors, then modifies the model accordingly, through methods such as classification, regression, prediction and gradient boosting. Supervised learning then uses those patterns to predict the value of the label on additional unlabeled data. It is called supervised learning because the process of the algorithm learning from the training data set can be thought of as a teacher supervising the learning process. Supervised learning is mainly divided into two categories, classification and regression. Regression is the problem of estimating or predicting a continuous quantity: what will the value of the S&P 500 be one month from today, how tall will a child be as an adult, how many of your customers will go to a competitor this year. These are examples of questions that fall under the umbrella of regression. Classification, on the other hand, is concerned with assigning observations to discrete categories rather than estimating continuous quantities. In the simplest case there are two possible categories, known as binary classification, and many important questions can be framed this way: will the customer leave us for a competitor, does a given patient have cancer, does a given image contain a dog or not. Classification mainly relies on classification trees, support vector machines and random forest algorithms, while regression includes linear regression, decision trees, Bayesian networks and fuzzy classification. There are other algorithms, such as artificial neural networks and gradient boosting, that also fall under supervised learning. Next we have reinforcement learning. Reinforcement learning is learning how to map situations to actions so as to maximize a reward, and it is often used in robotics, gaming and navigation. With reinforcement learning the algorithm discovers through trial and error which actions yield the greatest rewards; the feedback tells it whether an answer was correct or not, but not how to improve it. The agent is the learner or decision maker, whose job is to choose actions that maximize the expected reward over a given period of time.
Actions are what the agent can do, and the environment is everything the agent interacts with. The algorithm's ultimate goal is to acquire the largest possible numerical reward: it is penalized every time its opponent scores a point and rewarded every time it manages to score against the opponent, and it uses this feedback to update its policy, gradually filtering out the actions that lead to penalties. Reinforcement learning is useful in cases where the solution space is huge or infinite, and it is generally applied where the system can be treated as an agent interacting with its environment. There are many reinforcement learning algorithms, among them Q-learning, SARSA (state, action, reward, state, action), deep Q networks, DDPG (deep deterministic policy gradient) and TRPO (trust region policy optimization). The last category of machine learning is unsupervised learning. As I mentioned, supervised learning finds patterns where we have a data set of correct answers to learn from, while unsupervised learning finds patterns where we have none. This may be because the correct answers are unknown or infeasible to obtain, or because for a given problem there is not even a correct answer per se. A large subclass of unsupervised tasks is the clustering problem. Clustering refers to grouping observations in such a way that members of a common group are similar to each other and different from members of other groups; a common application is in marketing, where we wish to identify segments of customers or prospects with similar preferences or buying habits. A major challenge in clustering is that it is often difficult or impossible to know how many clusters there should be or what they should look like. Unsupervised learning is used with data that has no historical labels: the system is not told the correct answer, and the algorithm must figure out what is being shown. The goal is to explore the data and find some structure within it. Unsupervised learning works well with transactional data, and these algorithms are also used to segment text topics, recommend items and identify data outliers. There are mainly two classes of unsupervised learning: one is clustering, as mentioned above, and the other is dimensionality reduction, which includes topics such as principal component analysis, tensor decomposition, multidimensional statistics and random projection.
Now that we have understood what machine learning is and its various types, let's take a look at the different components of the Spark ecosystem and how machine learning fits in. As you can see, there is a component called MLlib. PySpark MLlib is a machine learning library: a wrapper on top of the PySpark core that lets you perform analysis using machine learning algorithms; it works on distributed systems and is scalable, and you can find implementations of classification, clustering, linear regression and other machine learning algorithms in it. We know PySpark is good for iterative algorithms, and since many machine learning algorithms are iterative, many of them have been implemented in PySpark MLlib. Apart from PySpark's efficiency and scalability, the MLlib APIs are also very easy to use. Software libraries designed to provide solutions to particular classes of problems usually come with their own data structures.
These data structures are provided to solve a specific set of problems efficiently, and PySpark MLlib comes with several of them, including dense vectors, sparse vectors, and local and distributed matrices. The main algorithm families in MLlib include clustering, frequent pattern mining, linear algebra, collaborative filtering, classification and linear regression. Let's see how we can leverage MLlib to solve a problem, so let me explain this use case. A system was hacked, but the metadata of each session the hackers used to connect to the servers was found. It includes features such as the session connection time, the bytes transferred, whether a Kali trace was used, the number of servers corrupted, the number of pages corrupted, the location, and the typing speed in words per minute. There are three potential hackers, two confirmed and one not yet confirmed. The forensic engineers know that the hackers trade off attacks, meaning each should have about the same number of attacks: for example, if there were 100 attacks, then with two hackers each would have about 50, and with three hackers each would have roughly 33. So we are going to use clustering to find out how many hackers were involved, and today I am going to use a Jupyter notebook to do all the programming.
Let me open a new Python notebook in Jupyter. First of all, we import all the required libraries and start the Spark session. The next thing we do is read the data using spark.read.csv, since the data set is in CSV format, with header and inferSchema set to true. As before, the default location is HDFS, so to read from the local file system you prefix the path with file:// and then give the absolute path of the data file. Then we take a look at the first record of the data frame and at a summary of the data set. To get that summary we use the describe function, and the output is again a bit messy. If we want the column names we just use dataset.columns: we have the session connection time, the bytes transferred, Kali trace used, servers corrupted, pages corrupted, location, and the WPM typing speed, where WPM means words per minute. The next thing we do is import Vectors and the VectorAssembler library; these are the machine learning utilities we are going to use. What the VectorAssembler does is take a set of columns and combine them into a single features column; our features consist of the session connection time, the bytes transferred, the Kali trace used, the servers corrupted, the pages corrupted and the WPM typing speed. One thing to keep in mind is that the choice of features is up to us: if the model does not produce the output we expect, we can change the chosen features accordingly. I have created vec_assembler, which takes all the attributes defined above and produces the features column, and then we create our final data by transforming the data set with the vector assembler.
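A sketch of the feature-assembly step, assuming the Spark session created earlier; the file name hack_data.csv and the exact column names are assumptions based on the transcript:

```python
from pyspark.ml.feature import VectorAssembler

dataset = spark.read.csv(
    "file:///home/edureka/datasets/hack_data.csv",   # hypothetical path
    header=True,
    inferSchema=True,
)

# Combine the numeric columns into a single 'features' vector column
feat_cols = [
    "Session_Connection_Time", "Bytes Transferred", "Kali_Trace_Used",
    "Servers_Corrupted", "Pages_Corrupted", "WPM_Typing_Speed",
]
vec_assembler = VectorAssembler(inputCols=feat_cols, outputCol="features")
final_data = vec_assembler.transform(dataset)
final_data.select("features").show(3, truncate=False)
```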
Next we import the StandardScaler library. Centering and scaling are done independently on each feature by computing the relevant statistics on the training samples; the mean and standard deviation are then stored so they can be applied to subsequent data with the transform method. Standardization of a data set is a common requirement for many machine learning estimators: they may misbehave if the individual features do not more or less look like standard normally distributed data. So we compute the summary statistics by fitting the standard scaler and then normalize each feature to unit standard deviation. Now that we finally have the scaled cluster data, it is time to find out whether there were two or three hackers, and for that we use k-means. I have created kmeans3 and kmeans2: kmeans3 uses the scaled features column with a K value of 3, and kmeans2 uses the same features with a K value of 2. We then build models for these two and fit them on the final cluster data. WSSSE stands for Within Set Sum of Squared Errors; let's look at it for the model with k equal to 3, that is three clusters, and for the model with k equal to 2. For k equal to 3 the WSSSE is 434 and for k equal to 2 it is 601. If we look at values of K from 2 up to 9, the WSSSE keeps dropping steadily (for k equal to 8 it is 198), so the WSSSE on its own does not clearly settle how many hackers there were. The last key fact the engineers mentioned was that the attacks should be split evenly among the hackers, so let's check that: we transform the data, group by the prediction column and count. With the three-cluster model the counts are 167, 79 and 88, which is not evenly distributed, but if we look at the predictions from the k equal to 2 model, the counts are evenly distributed, which means only two hackers were involved.
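A consolidated sketch of the scaling and clustering steps; computeCost is the Spark 2.x API used in material of this vintage, while newer Spark versions expose the same idea through model.summary.trainingCost or a ClusteringEvaluator:

```python
from pyspark.ml.feature import StandardScaler
from pyspark.ml.clustering import KMeans

# Scale every feature to unit standard deviation
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
cluster_final_data = scaler.fit(final_data).transform(final_data)

# Compare the Within Set Sum of Squared Errors for a range of K values
for k in range(2, 10):
    model = KMeans(featuresCol="scaledFeatures", k=k).fit(cluster_final_data)
    print("k =", k, "WSSSE =", model.computeCost(cluster_final_data))

# The deciding check: are the attacks split evenly between the clusters?
model_k3 = KMeans(featuresCol="scaledFeatures", k=3).fit(cluster_final_data)
model_k2 = KMeans(featuresCol="scaledFeatures", k=2).fit(cluster_final_data)
model_k3.transform(cluster_final_data).groupBy("prediction").count().show()
model_k2.transform(cluster_final_data).groupBy("prediction").count().show()
```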
The clustering algorithm created two clusters of equal size with k equal to 2, with a count of 167 each, so that is one way to find out how many hackers were involved using k-means clustering. Let's move on to our second use case, which is customer churn prediction. Predicting customer churn is big business: it minimizes churn by predicting which customers are likely to cancel a subscription to a service. Although originally used in the telecom industry, it has become common practice in banks, ISPs, insurance companies and other verticals. The prediction process is heavily data driven and often uses advanced machine learning techniques. In this part we will look at what kind of customer data is typically used, perform some preliminary data analysis, and train a churn prediction model, all with PySpark and its machine learning framework. Here is the story behind the use case: a marketing agency has many clients that use its service to produce ads for their customers, and it has noticed quite a bit of churn among those clients. At the moment the agency assigns account managers at random, but it wants a machine learning model that predicts which customers will churn, so that the customers most at risk of churning can be assigned a dedicated account manager.
Fortunately, they have some historical data, so can we help them? Don't worry, I'll show you how. We need to build a classification model that predicts whether a customer will churn or not; the company can then run it on incoming data for future customers and assign account managers accordingly. So first we import the libraries we need; here we will use logistic regression to solve this problem. The data is saved as customer_churn.csv, so we use spark.read to load the historical data and then look at its schema, using the printSchema method, to understand what we are dealing with. As you can see, we have the name, age, total purchase, account manager flag, years, number of sites, onboard date, location, company and the churn label. Looking at the data, we have records for 900 customers; I used the count method to get the exact number of rows. Let's also load the new test data and check its schema: it is in the same format as the training data. Next we need the VectorAssembler; since I already imported it above, there is no need to import it again. We must first transform the data using the VectorAssembler so we end up with a single column in which each row of the data frame contains a feature vector; this is a requirement for the regression API in MLlib. Here I am using age, total purchase, account manager, years and number of sites, but the choice of features is up to whoever builds the model: if the model does not give the output we want, we can change the input columns. The assembler produces output_data, a data frame containing the original input columns plus a single output column named features. Looking at the schema of this output data, all the columns are the same except for the additional features column at the end, which is a vector. To see what we are dealing with, look at the first element: the features field is a dense vector containing the five column values, the age of 42, the total purchase, the account manager flag of 0.0, the years, and the number of sites, which is 8. Now we create our final data from output_data by selecting just features and churn, so the final data frame has only those two columns. Then we split it into training and testing sets using the randomSplit method in a 70 to 30 ratio, create our logistic regression model with churn as the label column, and train it on the training data we just created. Looking at the summary of the trained model, we can see the churn label and the prediction along with their mean, standard deviation, minimum and maximum values.
Now that we have created our model, let's evaluate it. For that we first import the BinaryClassificationEvaluator. We create a predictions data frame by running the test data through the model and then evaluating it with the binary classification evaluator. Looking at the output, on the left we have the features, then the churn label, then the raw prediction according to our model, then the probability and finally the prediction. The evaluator takes the prediction column and the label column and tells us how accurate the model is: roughly 77 percent.
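The evaluation step, continuing the previous sketch; feeding the prediction column into the evaluator follows the transcript, and the number it reports is the area under the ROC curve:

```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Score the held-out test data with the trained model
pred_and_labels = fitted_model.evaluate(test_data)
pred_and_labels.predictions.show(5)

# Area under the ROC curve for the prediction vs. the churn label
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction",
                                          labelCol="churn")
print(evaluator.evaluate(pred_and_labels.predictions))   # roughly 0.77 here
```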
Now again we will use the assembler, the vector assembler. Here, we have now created a data frame with results in which We will take the logistic regression model and transform it using the new test data and this new test data also contains the features column because we used the vector assembler just before, so if you look at the results, it's kind of sad. I'm going to show you another format, just wait a second, so what we're going to do is select the company and the predictions just to see how it works so you can see that Karen Benson's prediction is true and Robertson's prediction is true .
So guys, our model reached about 77 percent accuracy. We can play with the features column to see whether the model produces a more accurate result; if we are happy with 77 percent, that's fine, but we can tune it according to our preferences. In a world where data is generated at an alarming rate, the correct analysis of that data at the right time is very useful, and one of the best frameworks for handling big data in real time and performing analytics is Apache Spark. And if we talk about the programming languages in use today, Python surely tops the chart, as it is used almost everywhere. Talking about the features of Apache Spark, the most important one is speed: it is almost 100 times faster than traditional data processing tools and frameworks. It has powerful caching: its simple programming layer provides strong caching and disk persistence capabilities. Apache Spark can be deployed through Mesos, Hadoop YARN or Spark's own cluster manager. What helps Spark achieve its speed in real-time computation with low latency is the use of in-memory computation, lazy evaluation of transformations, directed acyclic graphs and more. Spark is polyglot, meaning it can be programmed in multiple languages such as Python, Scala, Java and R, and that is one of the reasons it has taken over machine learning and exploratory analytics. Now let's look at some of the companies that use Spark: Yahoo, Alibaba, Nokia, Netflix, NASA, Databricks (the company behind the enterprise distribution of Apache Spark), TripAdvisor and eBay, so you can see Spark is used heavily across the industry.
Now let's take a look at various industry use cases, starting with healthcare. As healthcare providers look for novel ways to improve the quality of care, Apache Spark is slowly becoming the heartbeat of many healthcare applications. Many providers use Apache Spark to analyze patient records along with past clinical data to identify which patients are likely to face health problems after being discharged; this helps hospitals prevent readmissions, since they can arrange home healthcare services for the identified patients, saving costs for both the hospital and the patient. Spark is also used in genome sequencing to reduce the time needed to process genome data: it used to take several weeks to organize all the chemical compounds with genes, but with Apache Spark on Hadoop it takes only a few hours. In finance, banks use Apache Spark to access and analyze social media profiles, call recordings, complaint logs, emails and forum discussions to gain insights that help them make the right business decisions for credit risk assessment, targeted advertising and customer segmentation. One financial institution with retail banking and brokerage operations is using Apache Spark to reduce customer churn by 25 percent. It has split its platforms between retail banking, trading and investment, but it wants a 360-degree, consolidated view of each customer, whether a business or an individual, so it uses Apache Spark as the unifying layer. Spark helps the bank automate analytics with machine learning by accessing customer data from every repository. Moving to media, Apache Spark is used in the gaming industry to identify patterns from real-time in-game events and respond to them, enabling lucrative opportunities such as targeted advertising, auto-adjustment of game levels based on complexity, player retention and more. Conviva, a company handling an average of around 4 million video feeds per month, uses Apache Spark to reduce customer churn by optimizing video streams and managing live video traffic, maintaining a consistently smooth, high-quality viewing experience. Everyone has heard of Netflix: Netflix uses Apache Spark for real-time stream processing to deliver online recommendations to its customers' streaming devices. It captures all member activities, which plays a vital role in personalization, and processes 450 billion events per day that flow to the server-side application and then on to Apache Kafka. Coming to retail and e-commerce, Alibaba, one of the largest e-commerce platforms, runs some of the world's largest Apache Spark jobs to analyze hundreds of petabytes of data on its platform; some of these jobs, such as feature extraction on image data, run for several weeks. Millions of merchants and users interact with the Alibaba platform, each interaction is represented as a large, complicated graph, and Apache Spark is used for fast, sophisticated machine learning on this data. eBay also uses Apache Spark to deliver targeted offers, improve customer experience and optimize overall performance.
Apache Spark at eBay is leveraged through Hadoop YARN: YARN manages all the cluster resources to run the generic tasks, and eBay's users work with a Hadoop cluster in the range of 2,000 nodes, 20,000 cores and 100 TB of RAM through YARN. Finally, coming to the travel industry, TripAdvisor, a leading travel website that helps users plan a perfect trip, is using Apache Spark to speed up personalized recommendations for its customers. TripAdvisor provides advice to millions of travelers by comparing hundreds of websites to find the best hotel prices, and the time required to read and process hotel reviews into a readable format is cut down with the help of Apache Spark. We all know Uber: every day this multinational online taxi dispatch company collects terabytes of event data from its mobile users, and by using Kafka, Spark Streaming and HDFS it builds a continuous ETL pipeline. Uber can convert unstructured event data into structured data as it is collected and then use it for further, more complex analytics. As I mentioned earlier, Spark is polyglot, meaning Spark programs can be written in multiple languages such as Scala, Python, Java and R. Now you may ask which one you should choose to start with. Spark was developed in Scala, so you should know that Scala is the default language in which Spark was written.
Scala is very similar to Java, but recent developments in data analysis and machine learning have made it hard for Scala to keep up, so Spark introduced the Python API. Let's look at the reasons to choose Python to start with. First, it is easy for programmers to learn: Python is comparatively easy to pick up thanks to its clean syntax and standard libraries. It is also a dynamically typed language, which means RDDs can hold objects of multiple types (we will discuss RDDs in more detail later in this video). It is portable and runs on various operating systems such as Windows, Solaris, Linux and macOS. Lastly, Scala does not have as many data science tools and libraries as Python for machine learning and natural language processing, and Spark MLlib, the machine learning library, has fewer ML algorithms, though the ones it has are well suited to big data. In summary, Scala lacks good local data transformation and visualization tools along with widely used machine learning libraries. Now, Edureka provides detailed and comprehensive training on Apache Spark with Python, the PySpark Developer Certification training; this course is designed to provide the knowledge and skills to become a successful Spark developer using Python.
You will gain in-depth knowledge of concepts like the Hadoop Distributed File System, the Hadoop cluster, Flume and Apache Kafka, and you will learn about the APIs and libraries that Spark offers, such as Spark Streaming, MLlib and Spark SQL. This PySpark developer course is an integral part of a big data developer's career path; it is designed to give you the knowledge and skills to become a successful Hadoop and Spark developer and will help you clear the CCA175 Spark and Hadoop Developer exam. The course has a total of 12 modules plus one additional module and focuses on the Cloudera Hadoop and Spark Developer certification training. Module 1 is the introduction to Big Data, Hadoop and Spark: you will understand big data, the limitations of existing solutions for big data problems, how Hadoop solves the big data problem, the Hadoop ecosystem and architecture, HDFS, rack awareness and replication; you will learn about the Hadoop cluster architecture and its important configuration files, get an introduction to Spark and why it is used, and understand the difference between batch processing and real-time processing. Module 2 is the introduction to Python for Apache Spark: by the end of it you will be able to define Python, understand operands and expressions, write your first Python program, work with command-line parameters and control flow, take input from the user and operate on it, and learn about numbers, strings, tuples, lists, dictionaries and sets. Module 3 covers functions, OOPs, modules, errors and exceptions in Python: you will learn how to create generic Python scripts, how to handle errors and exceptions in code, and how to extract and filter content using regular expressions. Module 4 is a deep dive into the Apache Spark framework: you will understand Apache Spark in depth, learn about the various Spark components, build and run Spark applications, and at the end learn how to perform data ingestion using Sqoop. Module 5 is about playing with Spark RDDs: you will learn about resilient distributed datasets and the RDD operations used to implement business logic, such as transformations, actions and functions performed on RDDs. Moving on to Module 6, DataFrames and Spark SQL: in this module you will learn about Spark SQL, which is used to process structured data with SQL queries.
You will learn about data frames and datasets in Spark SQL along with the different types of SQL operations performed on data frames, and you will also learn about Spark and Hive integration. Module 7 is machine learning using Spark MLlib: you will learn why machine learning is needed, the different machine learning techniques and algorithms, and their implementation using Spark MLlib. The next module delves deeper into Spark MLlib: you will implement various algorithms supported by the machine learning library, such as linear regression, decision trees, random forests and many more. Then comes Module 9, which covers Apache Kafka and Apache Flume: you will understand Kafka and its architecture, go over the details of a Kafka cluster, and learn how to configure different types of Kafka clusters. Finally, you will see how messages are produced and consumed using the Kafka APIs in Java, and you will get an introduction to Apache Flume, its basic architecture and how it integrates with Apache Kafka for event processing. Module 10 is Apache Spark Streaming: you will work with Spark Streaming, which is used to build scalable, fault-tolerant streaming applications; you will learn about DStreams, the various transformations performed on streaming data, and commonly used streaming operators such as sliding-window and stateful operators. Module 11 covers the streaming data sources for Apache Spark, such as Kafka and Flume, and by the end of it you will be able to create a complete Spark Streaming application. Module 12 is the class project, which brings together everything learned so far: Hadoop, Spark, Kafka, Flume and much more. As an add-on there is one more module on GraphX, where you will learn the key concepts of Spark graph programming, graph concepts and operations, along with different graph algorithms and their implementations. Now that we have seen the training structure offered by Edureka, let us understand what exactly PySpark is. Apache Spark is an open-source cluster computing framework for real-time processing developed at the Apache Software Foundation, and PySpark is nothing more than the Python API for Apache Spark. Let's take a look at the various components of the Spark ecosystem: the core engine of the entire Spark framework provides the utilities and architecture for the other components; Spark Streaming enables analytical and interactive applications on live streaming data; MLlib is the Spark machine learning library, built on top of Spark to support the various machine learning algorithms.
GraphX is the graph computation engine, similar to Giraph, which combines data-parallel and graph-parallel concepts. SparkR is the package for the R language that lets R users take advantage of the power of Spark from the R shell. And finally we have PySpark, the API developed to support Python as a programming language for Spark. The PySpark shell links the Python API to the Spark core and initializes the Spark context. The Spark context is the heart of any Spark application: it configures internal services and establishes a connection to a Spark execution environment, and the Spark context object in the driver program coordinates all the distributed processes and enables resource allocation.
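A minimal sketch of creating the SparkSession and the SparkContext just described; the app name and master URL are placeholders:

```python
from pyspark.sql import SparkSession

# Building a SparkSession also gives you the underlying SparkContext,
# the object that coordinates executors and resource allocation.
spark = (SparkSession.builder
         .master("local[*]")          # or a YARN / standalone cluster URL
         .appName("pyspark_demo")
         .getOrCreate())
sc = spark.sparkContext

# The older, explicit way of building just a SparkContext:
#   from pyspark import SparkConf, SparkContext
#   conf = SparkConf().setMaster("local[*]").setAppName("pyspark_demo")
#   sc = SparkContext(conf=conf)
```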
Cluster managers provide executors, which are JVM processes; the Spark context object submits the application to the executors and then sends the tasks to run on each executor. Once you have installed Spark on your system, just typing pyspark takes you into the Spark shell, which will look like this; just make sure all the Spark daemons are running in the background. Moving on to the interview questions: what is Apache Spark? Apache Spark is an open-source cluster computing framework for real-time processing, so there are three main keywords here: it is an open-source project, it is used for cluster computing, and it supports in-memory processing along with real-time processing. There are many projects that support cluster computing, but Spark differentiates itself by doing in-memory computing. It has a very active community, and among Apache projects it is a very active one, with multiple releases last year. Basically, it is a framework supporting in-memory computing and cluster computing. Now, you may well face this specific question: how is Spark different from MapReduce, or how can it be compared with MapReduce? MapReduce is the processing methodology within the Hadoop ecosystem, where HDFS, the Hadoop Distributed File System, provides distributed storage and MapReduce supports distributed computing. Strictly speaking we cannot equate the two, since Spark and MapReduce are different methodologies, but the comparison helps us understand the technology better.
Spark is very simple to program, whereas MapReduce provides no abstraction, so every implementation has to be written by hand. Spark has an interactive mode for working with data; MapReduce has no interactive mode, although components like Apache Pig and Hive make interactive programming on top of it easier. Spark also supports real-time stream processing; to say it precisely, stream processing within Spark is near real-time processing, since nothing in the world is truly real time: the processing is done in micro-batches, which I will cover in detail when we get to the streaming concept. MapReduce, in contrast, does batch processing of historical data. With stream processing, the data arriving in real time is processed and the result is delivered, either by storing it or by publishing it to downstream consumers. In terms of latency, MapReduce has very high latency because it has to read data from the hard disk, while Spark has very low latency because it can reprocess or reuse data that is already held in memory. But there is a small catch here in Spark.
The first time the data is loaded, Spark has to read it from the hard disk, just like MapReduce; once it has been read, it stays in memory. So Spark is at its best whenever we need iterative computing, processing the same data over and over again, especially in machine learning and deep learning, where you will see performance improvements of up to 100x over MapReduce. But if you do one-time, fire-and-forget processing, Spark gives roughly the same latency as MapReduce, perhaps with a small extra advantage because of Spark's building block, the RDD. That is the key comparison factor between Spark and MapReduce. Now let's move on to the next question: explain the key features of Spark. Regarding speed and performance, in-memory computing is used, so speed and performance are much better. Spark is a polyglot, in the sense that the programming language used with it can be any of Python, Java, R or Scala. As for data formats, we can provide input in any format, such as JSON or Parquet. And the key selling point of Spark is its lazy evaluation: Spark computes the DAG, a directed acyclic graph, of all the steps needed to achieve the final result; we give it the steps and the final result we want, and it works out the optimal computation, executing only the steps that are actually required, as the small sketch below illustrates.
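A minimal illustration of lazy evaluation with RDD transformations and a single action, assuming the SparkContext sc created earlier:

```python
# Transformations only build up the DAG; nothing runs until an action is called.
rdd = sc.parallelize(range(1, 1000001))        # no job submitted yet
squares = rdd.map(lambda x: x * x)             # still no job, just a new DAG node
evens = squares.filter(lambda x: x % 2 == 0)   # still lazy

# The action below makes Spark compute the execution plan
# and run only the steps needed for this one result.
print(evens.count())
```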
Spark supports real-time computing through a component called Spark Streaming, and it integrates very well with the Hadoop ecosystem: it can run on top of Hadoop and leverage HDFS to do the processing. When you leverage HDFS, the Hadoop cluster's containers can be used to do the distributed computing and the resource manager can be used to manage resources, so Spark gels with HDFS very well. In addition, you can take advantage of the resource manager to share resources, and of data locality: processing can be performed close to the data, on the nodes where it is located within HDFS. Spark also has a fleet of machine learning algorithms already implemented, from clustering to classification to regression, all of that logic achieved using MLlib inside Spark, and there is a component called GraphX that supports graph theory, so we can solve problems using graph-based approaches. These are the things we can consider the key features of Spark. Now, when you talk about installing Spark you may come across this question about YARN: what is YARN, and do you need to install Spark on all the nodes of the YARN cluster? YARN is nothing but Yet Another Resource Negotiator, the resource manager within the Hadoop ecosystem. It provides the resource management platform across all the clusters, and Spark provides the data processing: wherever resources are allocated, Spark is used in that location to do the data processing. And yes, we do need Spark installed on all the nodes of the YARN cluster, because its libraries have to be present there; in addition to installing Spark on all the worker nodes, we also need to increase the RAM capacity on the worker machines, since Spark consumes an enormous amount of memory to do the processing. Internally it does not work the way MapReduce does: it generates the DAG and performs the processing on top of YARN. At a high level, YARN is like a resource manager, or a distributed operating system, coordinating resource management across the entire fleet of servers. On top of it I can have multiple components, such as Spark, Tez or Giraph, and Spark in particular helps us achieve in-memory computing. So YARN is nothing more than a resource manager for the whole cluster; on top of it we can run Spark, and yes, Spark needs to be installed on all the nodes of the YARN cluster and the memory increased on all the worker nodes. The next question is: which file systems does Spark support? When we work on an individual machine we have a file system within that operating system, but in a distributed cluster architecture we need a file system that can store the data in a distributed manner.
Hadoop comes with file system called hdfs, it is called Hadoop distributed financial system where the data is distributed among multiple systems and will be coordinated by two different types of components. called nanode and datanode and Spark, you can use this hdfs directly so you can have any file in hdfs and start using it within the Spark ecosystem and it provides another advantage of data locality when you do distributed processing wherever the data are distributed. data. it can be done locally on that particular machine where the data is located and to start with as standalone mode you can also use local file system so this could be used especially when we are doing development or any POC we can use the local file. system and Amazon Cloud provides another file system called S3 simple storage service.
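To make the file system support concrete, here is a minimal sketch of reading text data from the local file system, HDFS and S3. The paths, the NameNode host and port, and the bucket name are purely hypothetical, and the S3 line assumes the hadoop-aws connector has been configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("filesystem-demo").getOrCreate()
sc = spark.sparkContext

# Local file system (fine for standalone mode, development or a POC)
local_rdd = sc.textFile("file:///tmp/sample.txt")

# HDFS -- NameNode host/port and path are hypothetical
hdfs_rdd = sc.textFile("hdfs://namenode:8020/data/sample.txt")

# Amazon S3 -- bucket name is hypothetical; needs the hadoop-aws package configured
s3_df = spark.read.text("s3a://my-bucket/raw/sample.txt")

print(local_rdd.count(), hdfs_rdd.count(), s3_df.count())
```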
So we have seen the features and the functionality available within Spark; now let's look at its limitations, because every component, along with its power and advantages, has its own limitations. The following question lists some limitations of using Spark. It consumes more storage space compared to plain Hadoop MapReduce processing, but in the big data world that is not a major restriction because storage is relatively cheap. A developer must be careful while running applications on Spark, the reason being that it uses in-memory computing; it handles memory very well, but if you load a very large amount of data in the distributed environment and then simply do a join, the data gets transferred over the network, and the network is a really expensive resource. The plan or design therefore has to reduce or minimize data transfer over the network and, by every possible means, spread the data across multiple machines; the more we distribute, the more parallelism and the better the results we can achieve. On cost-effectiveness, if you compare how much it costs to do a particular piece of processing, say 1 GB of data with five iterative passes, in-memory computing is always costlier because memory is more expensive than storage, so memory can act as a bottleneck; we cannot increase the memory capacity of a single machine beyond some limit, so we have to grow horizontally, and once the data is distributed in memory across the cluster, network transfer and all those bottlenecks come back into the picture. We have to find the right balance that gives us the in-memory computing we need. Also remember that Spark, just like the other technologies, has to read the data for the first time from the hard disk or another data source; its performance is really better when the data is already available in the cache, where the DAG and in-memory computing give us a lot of leverage.
The next question lists some use cases where Spark outperforms Hadoop in processing. The first is real-time processing: Hadoop MapReduce cannot handle it, but Spark can. Most big data projects follow the Lambda architecture, which has three layers, a batch layer, a speed layer and a serving layer, and for the speed layer, where the data has to be processed, stored and handled as it arrives, Spark is the best option. Of course, within the Hadoop ecosystem there are other components that do real-time processing, such as Storm, but when you want to leverage machine learning along with stream processing, Spark is the better choice, because the speed layer and the serving layer gel together and provide better performance. And whenever you do batch processing that is iterative, especially machine learning, Spark leverages in-memory computing and can run up to 100 times faster than Hadoop; the more iterative processing we do, the more data is read from memory and the faster it gets compared to Hadoop MapReduce. Again, remember that if you are going to process the data only once, read, process and return the result, Spark may not be a better option than MapReduce.
There is another component called Akka, a messaging and coordination system. Spark uses Akka internally for scheduling: whenever the master needs to assign a task to a worker and track that task, this asynchronous coordination is achieved using Akka. As developers we do not need to worry about it; of course we can leverage Akka ourselves, but Spark uses it internally for scheduling and for coordination between the master and the workers.
Inside Spark we have some core components, so let's see what the core components of Apache Spark are, the components of the Spark ecosystem.
Spark comes with a core engine, Spark Core, which carries the core functionality Spark requires; RDDs are the building blocks of the Spark Core engine, and the basic file system interaction and coordination are all handled by it. On top of Spark Core we have a number of other offerings, to do machine learning, graph computation and streaming, and the most used of these components are Spark SQL, Spark Streaming, MLlib, GraphX and SparkR. At a high level, let's see what these components are. Spark SQL is designed to do processing against structured data, so we can write SQL queries and handle the data through that interface; the language we use is almost identical to SQL, I would say 99 percent the same, and most of the commonly used SQL functionality has been implemented within Spark SQL. Spark Streaming is the offering available to handle stream processing. MLlib is the offering for machine learning; it comes with a list of machine learning algorithms already implemented, and we can leverage any of them. GraphX is the graph processing offering within Spark that helps us do graph computation against our data, things like PageRank calculation, connected components and triangle counting, all of which give meaning to that data. And SparkR is the component that lets us leverage the R language within the Spark environment; R is a statistical programming language, and using SparkR we can run R code inside Spark. In addition there are other components, such as BlinkDB, an approximate query engine which is still at a beta-like stage, but the ones above are the most used components within Spark.
The next question is: how can Spark be used together with Hadoop? Even though Spark often performs much better, it is not a replacement for Hadoop; it coexists with Hadoop, and leveraging Spark and Hadoop together helps us achieve the best result. Spark handles in-memory computing and the speed layer, while Hadoop comes with the resource manager, so we can leverage Hadoop's resource manager to run Spark, and for one-time, fire-and-forget processing, where we just process once and store the result and don't need in-memory computing, we can use MapReduce, whose computing cost is much lower than Spark's. So we can merge the two and strike the right balance between batch processing and stream processing when we have Spark alongside Hadoop.
Now let's take some detailed questions related to Spark Core. As I mentioned earlier, the core building block of Spark Core is the RDD, the resilient distributed dataset. It is a logical entity, not a physical one; you will not see it sitting on disk. The RDD is what Spark references to build the DAG, the plan for how the dataset should be transformed from one structure to another, and the resulting data comes into existence only when you perform an action against an RDD; that result can then be stored on any file system, whether HDFS, S3 or anything else. RDDs exist in partitioned form, in the sense that they can be distributed across multiple systems, and they are fault tolerant: if any partition of the RDD is lost, Spark can regenerate just that specific partition. If someone asks what the big advantage of the RDD is, it is exactly this fault tolerance together with its distributed nature, and RDDs are immutable, so once an RDD is defined or created it cannot be changed.
The next question is how we create RDDs in Spark. There are two ways. One is to use the SparkContext and parallelize an existing collection from the driver program; Spark will distribute those elements across multiple systems, following the distribution mechanism of the underlying file system where applicable. The other is to create the RDD by loading data from external sources such as HBase, HDFS, Parquet files or S3; strictly speaking, HDFS may not count as an external source, because when Spark runs with Hadoop, HDFS is the main file system we use, but in all of these cases we read the data and create the RDD from it.
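As a quick sketch of the two creation paths just described, assuming a running SparkSession and a purely hypothetical HDFS path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-creation-demo").getOrCreate()
sc = spark.sparkContext

# 1) Parallelize an in-memory collection from the driver
numbers = sc.parallelize([1, 2, 3, 4, 5])

# 2) Load from an external source such as HDFS (path is hypothetical)
lines = sc.textFile("hdfs:///data/movies.tsv")

print(numbers.count(), lines.count())
```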
Well, the next question is: what is executor memory in a Spark application? Every Spark application has a fixed heap size and a fixed number of cores for each Spark executor. The executor is nothing but the execution unit available on each worker machine; it is what actually runs the tasks on that worker, regardless of whether you use YARN, Mesos or another cluster manager as the resource manager. The amount of memory allocated to that executor is what we define as the heap size, and we can control how much memory should be used by each executor on the worker machine, as well as how many cores the executor can use within the Spark application; both of these are set through the Spark configuration files.
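A minimal sketch of setting the executor heap size and core count from code; the same keys can equally be set in spark-defaults.conf or via spark-submit flags, and the values below are only illustrative.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("executor-config-demo")
         .config("spark.executor.memory", "4g")     # heap size per executor
         .config("spark.executor.cores", "2")       # cores per executor
         .config("spark.executor.instances", "5")   # number of executors (e.g. on YARN)
         .getOrCreate())

print(spark.sparkContext.getConf().get("spark.executor.memory"))
```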
The next question is: define partitions in Apache Spark. Any dataset, whether small or large, can be divided across multiple systems; the process of dividing the data into multiple parts and storing those parts on multiple systems as different logical units is called partitioning, so in simple terms a partition is one logical chunk of the data. By default, the computation on an RDD partition happens on the system where that partition exists, and the more partitions you have, the more parallelism you get.
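A small sketch of controlling the number of partitions, and therefore the degree of parallelism; the numbers here are arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1_000_000), numSlices=8)   # ask for 8 partitions up front
print(rdd.getNumPartitions())                          # -> 8

more_parallel = rdd.repartition(16)   # more partitions, but triggers a full shuffle
fewer = more_parallel.coalesce(4)     # shrink partition count without a full shuffle
print(more_parallel.getNumPartitions(), fewer.getNumPartitions())
```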
At the same time, we have to be careful not to trigger a lot of data transfer over the network. Every RDD within Spark can be partitioned, and partitioning is what helps us achieve parallelism: the more we distribute, the more parallel processing can be done. The key to the success of a Spark program is minimizing network traffic while performing parallel processing, keeping data transfer between the Spark systems as low as possible.
What operations does an RDD support? There are two kinds of operations we can perform. One is transformations: with a transformation the RDD is converted from one form to another, for example filter, map, groupByKey or reduceByKey, and the result of a transformation is always another RDD. The other is actions, which we take against an RDD to get a final result: for example, count how many records there are, or store the result in HDFS. Multiple actions can be taken against an RDD, and the data actually comes into existence only when an action is taken.
The next question: what do you understand by transformations in Spark? Transformations are nothing but functions, mostly higher-order functions like the ones we have in Scala, that are applied to the list of elements inside the RDD. In this particular example, I read a file into an RDD called rawData and then apply a transformation using map: inside the map I pass a function that splits each record on the tab character, so the split is applied to every record of rawData, and the resulting movieData is again another RDD. Of course this is a lazy operation; movieData will actually be computed only when I take some action against it, such as count, print or save, because only actions generate data.
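A sketch of that exact flow, with a hypothetical tab-separated movie file in HDFS; nothing is read until the actions at the end run.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformation-demo").getOrCreate()
sc = spark.sparkContext

raw_data = sc.textFile("hdfs:///data/movies.tsv")            # transformation plan only, lazy
movie_data = raw_data.map(lambda line: line.split("\t"))     # another RDD, still lazy

print(movie_data.count())                                    # action: triggers the real work
movie_data.saveAsTextFile("hdfs:///output/movies_split")     # action: persists the result
```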
On top of the Spark Core we can have several components and there are different types of rtd available, one of those special ones. The type is rdd pair, so the next question what do you understand by rdd pair? So it will exist on the page as keys and values ​​so you can perform some special functions within the rdds pair or special transformations like collecting all values ​​corresponding to the same key. like a kind of Shuffle, what happens inside Hadoop Short Shuffle, that kind of operations, like you want to consolidate or group all the values ​​corresponding to the same key or apply some functions against all the values ​​corresponding to the same key, like I want get the sum of value of all keys, we can use rdd pair and achieve it, so that the data inside the rdd exists in base keys and values, okay, a Json question, what are our vector rgds in machine learning. will have a lot of processing and using vectors and matrices and we do a lot of operations Vector operations, like Effective Vector, transform any data into a vector form, so vectors, as normal, will have a direction and a magnitude, so we can do some operations like adding two vectors and what is the difference between vector A and B as well as between a and C, if the difference between vector A and B is less compared to a and C, we can say that the vector A and B are somewhat similar, okay? terms of features so rdd vector will be used to represent the vector directly and will be used extensions will be used while doing machine learning and Json thanks and there is another question what is rdd lineage so here any data processing, any transformation we do.
Then there is a related question: what are vector RDDs used for in machine learning? Machine learning involves a lot of processing with vectors and matrices, and a lot of vector operations: turning the features of any data into vector form, adding two vectors, measuring the difference between vectors. A vector, as usual, has a direction and a magnitude, and if the difference between vector A and vector B is smaller than the difference between A and C, we can say A and B are somewhat similar in terms of their features. A vector RDD is simply an RDD used to represent this vector data directly, and it is used extensively while doing machine learning.
The next question: what is RDD lineage? Whatever data processing or transformation we do, Spark maintains something called lineage: a record of how the data was transformed, step by step, while it sits in partitioned form across multiple systems. In the distributed world it is very common for machines to fail or drop off the network, and the framework has to be in a position to handle that; Spark handles it through RDD lineage, so it can restore just the lost partition. Suppose the data is spread across five machines out of ten and one of those machines is lost: whatever the last transformation was for the partitions on that machine, only those partitions need to be regenerated, and Spark knows how to regenerate that data using the RDD lineage, that is, from which data source it was generated and what every prior step was. The full lineage is maintained internally by the Spark framework, and that is what we call RDD lineage.
Next, what is the Spark driver? To put it simply, for those with a Hadoop background, we can compare it to the Application Master. Every application has a Spark driver, which holds the SparkContext; it orchestrates the whole execution of the job, connects to the Spark master, and delivers the RDD graph, the lineage, to the master while coordinating the tasks. Whatever tasks are running in the distributed environment, doing the parallel processing and performing the transformations and actions against the RDDs, the driver is the single connection point for that specific application. The Spark driver is short-lived, the SparkContext inside it is the coordinator between the master and the running tasks, and the driver can be started on any of the nodes in the cluster.
Name the types of cluster managers in Spark. Whenever we have a cluster, we need a manager to manage the resources, and there are different types. We have already seen YARN, Yet Another Resource Negotiator, which manages resources within Hadoop, and Spark can run on top of it. Sometimes you may want to run only Spark in your organization, without Hadoop or any other technology; in that case you can go for the standalone cluster manager that comes built into Spark, and Spark alone can run across multiple systems. But in general, if we have a shared cluster, we will try to leverage other computing frameworks as well, graph processing with Giraph for example, and in that case we use YARN or a more generalized resource manager like Mesos. YARN is very Hadoop-specific and comes along with Hadoop; Mesos is a cluster-level resource manager, a separate top-level project within Apache, which we can use when we have multiple clusters or frameworks within the organization.
The next question: what do you understand by a worker node? In a distributed cluster we have a number of workers, called worker nodes or slave nodes, that do the actual processing: they fetch the data, run the computation and give us the result, while the master node decides what should run on which worker. Usually a task is assigned to a worker node where the data is located; in the big data space, and especially with Hadoop, we always try to achieve data locality, but resource availability in terms of CPU and memory is also considered. Suppose some data is replicated across three machines and all three are busy, with no CPU or memory free to start another task; rather than waiting for those machines to finish and free up resources, the scheduler will start the processing on some other machine close to the machines that hold the data and read the data over the network. The worker nodes also report back to the master about their resource utilization and the tasks running on them, but it is the workers that do the real work.
The next question: what is a sparse vector? Just a few minutes ago we answered the question of what a vector is: a vector is nothing but a way of representing data in multidimensional form. To represent a point in space I need three dimensions, x, y and z; to represent a line I need two points, the start point and the end point, so the vector needs two such entries; in the same way, to represent a plane I need yet more dimensions to hold two lines. Following this idea, I can represent almost any data in vector form. Take a simple example: Amazon has millions of products, and no single user would have used more than a tiny fraction of them, maybe 0.1 percent or even less, perhaps a few hundred products rated over a whole lifetime. If I have to represent all of one user's product ratings with a vector, the first position refers to the product with id 1, the second position to the product with id 2, and so on, so the vector has millions of entries, but only the hundred or so products that were actually rated have values, ranging from one to five; every other position is zero. Sparse means thinly distributed, and a sparse vector represents such data by storing only the non-zero entries, each as a position and its corresponding value, with the understanding that every other position is zero. That way we don't waste extra space storing all the zeros.
Now let's discuss some questions about Spark Streaming. How is streaming implemented in Spark? Explain with examples. Spark Streaming is used to process streaming data in near real time; to say it precisely, it is micro-batch processing. The data is collected over every small interval, say every 0.5 seconds or every second, and processed; internally this creates micro batches, and the data of one micro batch is called a DStream. A DStream behaves like an RDD, so whatever transformations and actions I can do with an RDD, I can also do with a DStream. Spark Streaming can read data from Flume, HDFS and other streaming sources, and it can store the results into dashboards or any other database; it provides very high throughput because the processing is distributed across several systems. The DStream is split into partitions internally and has fault tolerance built in: even if some transformed RDD is lost, it can be regenerated from the existing or source data, so the DStream, the basic building block of streaming, carries the same fault tolerance mechanism that we have with RDDs. In short, a DStream is a specialized form of RDD made for use inside Spark Streaming.
Well, the next question is: what is the significance of the sliding window operation? This is very interesting. In stream processing, the density of the data and its business implications can fluctuate a lot. For example, Twitter marks a hashtag as trending because it is very popular, but someone could game the system so that a particular hashtag appears millions of times within a two or three minute burst; that alone should not make it the trending hashtag for that whole day or month. So what we do is aggregate over a window: we consider the current time period together with T minus 1, T minus 2 and so on, and compute the average or the sum over that whole window, applying the business logic against the window rather than a single interval. Any drastic spike or drop in the pattern of the data gets normalized that way, and that is the biggest significance of the sliding window operation in Spark Streaming. Spark can handle the sliding window automatically: it keeps the data of the previous intervals for us, and how large the window should be is simply declared in the program; the framework handles it at an abstract level.
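A sketch of a windowed hashtag count with the DStream API; the socket source, port and durations are hypothetical, and the inverse-reduce form used here requires checkpointing to be enabled.

```python
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

spark = SparkSession.builder.appName("window-demo").getOrCreate()
ssc = StreamingContext(spark.sparkContext, batchDuration=1)   # 1-second micro batches
ssc.checkpoint("hdfs:///checkpoints/window-demo")              # needed for windowed state

lines = ssc.socketTextStream("localhost", 9999)                # hypothetical text source
hashtags = (lines.flatMap(lambda line: line.split())
                 .filter(lambda word: word.startswith("#"))
                 .map(lambda tag: (tag, 1)))

# Counts over the last 10 minutes, recomputed every 30 seconds
trending = hashtags.reduceByKeyAndWindow(lambda a, b: a + b,   # add new batches
                                         lambda a, b: a - b,   # subtract batches leaving the window
                                         windowDuration=600,
                                         slideDuration=30)
trending.pprint()

ssc.start()
ssc.awaitTermination()
```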
The next question: what is a DStream? The expansion is discretized stream, and it is the abstract representation of the data in Spark Streaming. Just as a computation is expressed as a series of RDD transformations, a DStream is nothing but a continuous series of RDDs grouped together, and I can apply against a DStream any of the transformations and actions that are available in Spark Streaming. For each micro batch I define at what interval the data should be collected and processed; that batch interval could be every 100 milliseconds, every second or every five seconds, and all the data received within that duration is treated as one batch of the DStream.
The next question: explain caching in Spark Streaming. Spark uses in-memory computing, so while the computation is running, the data of the DStream will be in memory; but as more jobs run and more memory is needed, the least recently used RDDs, the data not needed by any pending action, will be evicted from memory. Sometimes I need certain data to stay in memory permanently. A very simple example is a dictionary: I want the dictionary words always available in memory so that I can spell-check every tweet or comment that arrives. What can I do? I can cache or persist that incoming data, so that even when other applications need memory, this specific data will not be evicted and can be reused for further processing. While caching we can also define whether it should live in memory only, or in memory and on disk.
Let's discuss some questions about Spark graphs. The next question is: is there an APA for implementing graphs in Spark? In theory everything will be rendered as a graph when you say graph it will have notes and borders so everything will be rendered using the rtds so the rdd will be extended and there is a component called graphics which exposes the functionalities to render a graph. we can have a hrdd Vertex rdd creating the edges and the vertex. I can create a graph and this graph can exist in a distributed environment, in the same way we will be in operation to do parallel processing as well, so Graphs is just a way to represent. the data that is graphed with edges and vertices and of course yes, it provides the API to implement or create the graph.
Perform processing on the graph. APIs are provided. What is the page rank on charts within chart facts? Once the graph is created we can calculate the page rank for a particular node, so it is very similar to how we have the page rank for websites within Google, the higher the page rank, that means it is most important within that particular graph it will show the importance of that particular node or edge within that particular graph when I say graph is a connected set of data all the data will be connected using the property and how important is that property, we will have a value associated with it.
So within the page range we can calculate as a static page range that will be executed. a number of iterations or there is another page line called Dynamic Page Sorting which will run until we reach a particular saturation level and the sci-fi level can be defined with multiple criteria and the APIs we call as well as graph operations can be executed directly. against those graphs and they are all available as API within graphs, what is lineage graph, so it is very similar to graphs, how the graph representation of each rdd internally will have the relationship that says how that rdd was created in particular and, from there, how it was created. transformed is largely how it was transformed, so the entire lineage or the entire history or the entire path will be recorded within the lineage which will be used in case any particular partition of the target is lost, it can be regenerated even if the full rarities Finally, we can regenerate so that you have the complete information about all the partitions where there are water transformations that it has undergone.
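GraphX itself exposes a Scala/Java API; from PySpark a common route (an assumption here, since the course does not cover it) is the separate graphframes package, which has to be added with --packages when launching Spark. The toy vertices and edges below are invented.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame   # external package: graphframes

spark = SparkSession.builder.appName("pagerank-demo").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

static_ranks = g.pageRank(resetProbability=0.15, maxIter=10)   # fixed number of iterations
dynamic_ranks = g.pageRank(resetProbability=0.15, tol=0.01)    # run until convergence

static_ranks.vertices.select("id", "pagerank").show()
```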
The next question: what is a lineage graph? It is similar in spirit: the graph representation of how each RDD relates to the others, recording internally how a particular RDD was created and from what it was transformed. The whole history, the whole path, is recorded in the lineage, and it is used in case any partition of an RDD is lost; even if entire RDDs are lost we can regenerate them, because we have complete information about every partition, what transformations it went through and what the resulting values should be. If something is lost in between, Spark knows from where to recompute and which pieces actually need recomputing; that saves a lot of time, and anything that is never used again is never recomputed. The recomputation is triggered only by an action, only when necessary, which is why memory is used so optimally.
Does Apache Spark provide checkpointing? Take streaming as the example again: if any data is lost within a sliding window, we cannot otherwise recover it. Say I am keeping a 24-hour sliding window to compute an average; every 24 hours it keeps sliding, and if there is a complete cluster failure I may lose the data because it all lives in memory. How do we recover that state? There is a feature called checkpointing, provided directly by the Spark API: we just have to provide the location where the state should be checkpointed, and when the system starts again it can re-read that data and regenerate whatever state it was in. So yes, to answer the question directly, Apache Spark provides checkpointing, and it helps us regenerate the state the application was in before the failure.
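A sketch of driver-side checkpointing with the streaming API; the checkpoint directory and socket source are hypothetical. On a clean start the setup function runs; after a failure the state is rebuilt from the checkpoint directory instead.

```python
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

CHECKPOINT_DIR = "hdfs:///checkpoints/streaming-app"   # hypothetical location

def create_context():
    spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, batchDuration=1)
    ssc.checkpoint(CHECKPOINT_DIR)
    lines = ssc.socketTextStream("localhost", 9999)    # hypothetical source
    # count events over a 24-hour window, sliding every minute
    lines.countByWindow(windowDuration=86400, slideDuration=60).pprint()
    return ssc

# Rebuilds the previous state from the checkpoint if it exists, else calls create_context()
ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, create_context)
ssc.start()
ssc.awaitTermination()
```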
Let's move on to the next section, Spark MLlib. How is machine learning implemented in Spark? Machine learning is a very big ocean in itself and is not specific to Spark; it is a subset of the data science world with different categories of algorithms, regression, clustering, dimensionality reduction and so on, and most of these algorithms have been implemented in Spark. Spark has become a preferred framework for machine learning because most machine learning algorithms need to run iteratively, multiple times, maybe 25 or 50 iterations, until we reach the optimal result or a target accuracy, and Spark suits that very well: the data stays in memory, so it can be read faster, the result stored back into memory and read again. All of this is provided in a separate component called MLlib, and inside MLlib there are further pieces, such as featurization to extract features; you may wonder how we can process images, and the main thing about processing an image, audio or video is to extract the features and compare to what extent they are related, which is where vectors and matrices come into the picture. There is also a pipeline abstraction to chain the processing steps, and persistence, so that a generated model can be saved and reloaded back into the system to continue the processing from that particular point onwards.
The next question: what are the categories of machine learning? There are different categories available: supervised, unsupervised and reinforcement learning, with supervised and unsupervised being the most popular. In supervised learning I know the labels in advance: for character recognition, while training I can give the information that this particular image belongs to this particular character or digit. In unsupervised learning I don't know much in advance: I may have different types of images, cars, bikes, cats, dogs, and I won't know beforehand how many categories there are, so I let the system group them, then look at each group, identify the pattern within it and give it a category name. Whether or not we provide those labels to the system is what distinguishes these types of machine learning, and again, machine learning as such is not specific to Spark; Spark just helps us run these algorithms at scale.
What are the main tools of Spark MLlib? MLlib is nothing but the machine learning library or offering within Spark. It has a large number of algorithms already implemented, and it provides a very nice facility to persist the result: in machine learning we generally produce a model, the learned pattern of the data, and that model can be persisted in different formats and reloaded later. It has methods to extract the features of a dataset; I could have a million images and want to extract the common features available within those million images. There are other utilities as well, for example for setting seeds and randomizing data, and there are pipelines, which are quite specific to Spark, where I can organize the sequence of steps the machine learning workflow has to follow, feeding the result of one stage into the next, so we have a defined execution sequence.
Generally, in machine learning we will generate a model, the pattern of the data, because that is a model, the model will persist in different forms, such as leaving Avro, different forms, it can be stored or persisted and it has methodologies to extract the features of a data set. I can have a million images. I want to extract the common features available within those. millions of images and there are other utilities available to process to define or define the seed, randomize it so there are different utilities available, as well as pipelines that are very specific to Spark, where I can pipeline or organize the sequence of steps that it needs to follow. machine learning, so machine learning is an algorithm first and then the result will be fed into the machine learning algorithm as well, so we can have an execution sequence and that will be defined using Pipelines.
These are all enabled features of Spark MLB. some popular algorithms and utilities in Spark MLA, so all these are some popular algorithms like regression classification, basic statistics recommendation systems, a complete system is well implemented, all we have to provide is to provide the data if you provide the ratings and products within an organization if you have a complete humidity we can build the recommendation system in a short time and if you give any user you can give him a recommendation these are the products that the user may like and those products can be displayed in the search result, it works based on the feedback we are providing for the previous products we have purchased dimensionality reduction clustering every time we make the transition with the large amount of data, it is very compute intensive and we may have to reduce the dimensions, especially the dimensions of the matrix within the MLA without losing the features, whatever the features are available without losing them, we need to reduce the dimensionality and there are some algorithms available to do that dimensionality introduction and extraction of features, so what are all the common features or features available within that particular image and can I compare what are all the common features? available within those images, this is how we will group those images, so please let me know if this particular image, the person who looks like this image is available in the database or not, for example, assume that the organization or the department Police crime department maintains a list of people who committed crimes and if we get a new photo when they do a search, they may not have the exact photo bit by bit, the photo may have been taken with a different background, different lighting , different locations, different time, so 100 the data will be different or the bits and bytes will be different. but look wise, yes, you will see it, so I will look for the photo that looks similar to this particular photograph as input that I will provide to achieve that, we will extract the features in each of those photos, we will extract the features. and we will try to match the feature instead of the bits and bytes and the optimization as well as in terms of processing we are doing the pipeline there are a number of algorithms to do the optimization let's move on to Spark SQL is there any module to implement SQL?
Now let's move on to Spark SQL. Is there a module to implement SQL in Spark, and how does it work? Yes: Spark SQL is very similar in spirit to Hive; whatever structured data we have, we can read it and extract meaning from it using SQL. It exposes APIs that we can use to read the data and create DataFrames, and Spark SQL has four main parts: the data source API, the DataFrame API, the interpreter and optimizer, and the SQL service. Any query it receives is interpreted, optimized and executed through the SQL service, fetching the data from DataFrames or reading it from the data sources and doing the processing.
What is a Parquet file? Parquet is a file format in which structured data, very often the result of Spark SQL, can be stored and persisted. Parquet is again an Apache open-source project and a data serialization technique, and to say it precisely it is a columnar storage format: it consumes less space and lets us access specific columns of the data efficiently, so both storing the data and retrieving it for queries become cheaper.
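A short sketch of writing a DataFrame out as Parquet and querying it back with Spark SQL; the input path, column names and query are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.read.json("hdfs:///data/movies.json")                     # hypothetical structured input
df.write.mode("overwrite").parquet("hdfs:///data/movies_parquet")    # columnar, compressed storage

movies = spark.read.parquet("hdfs:///data/movies_parquet")
movies.createOrReplaceTempView("movies")

spark.sql("""
    SELECT genre, COUNT(*) AS cnt
    FROM movies
    GROUP BY genre
    ORDER BY cnt DESC
""").show()
```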
List some functions of Spark SQL. Spark SQL works only with structured data, of course; it can be used to load a variety of structured data formats, to query that data using SQL-like statements from within a program, and to connect external tools to Spark through its SQL interface, so it integrates very well with anything that speaks SQL. Using Python, Java or Scala code we can also create an RDD or DataFrame from the available structured data directly through Spark SQL, which makes it much easier and faster for people with a database background to become productive.
The next question: what do you understand by lazy evaluation? Whenever you perform any operation within the Spark world, Spark does not do the processing immediately; it looks at the final result you are asking for, and if you have not asked for a final result, it does not need to process anything. Until we perform an action, there is no actual processing; Spark just keeps track of which transformations it will eventually have to apply. When we finally ask for an action, it completes the data processing in an optimized way and gives us the final result. So to answer directly, lazy evaluation means the processing happens only when the resulting data is needed; if the result is not needed, no processing is performed.
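A small sketch of lazy evaluation with a hypothetical log file: the first three lines only record the plan, and nothing touches the data until the action on the last line.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
sc = spark.sparkContext

raw = sc.textFile("hdfs:///data/access.log")              # nothing is read yet
errors = raw.filter(lambda line: "ERROR" in line)          # still nothing executed
messages = errors.map(lambda line: line.split("\t")[-1])   # still just building the DAG

print(messages.count())                                    # action: the whole chain runs now
```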
Can you use Spark to access and analyze data stored in a Cassandra database? Yes, it is possible, and not only Cassandra, any of the NoSQL databases. Cassandra also runs on a distributed architecture, so Spark can take advantage of data locality: the Spark executors will try to start on the machines where the Cassandra nodes, and therefore the data, are available, the queries can be executed locally where the data lives, and that makes the query execution faster and reduces the network load.
The next question: how can you minimize data transfers when working with Spark? As with the code design discussed earlier, the success of a Spark program depends largely on how much you reduce network transfer, because network transfer is a very costly operation and cannot be parallelized. Spark offers two mechanisms in particular for this: broadcast variables and accumulators. A broadcast variable helps us transfer any static data or lookup information that would otherwise be shipped repeatedly to multiple executors; if some data is going to be used in common by tasks on many executors, we can broadcast it once. An accumulator, on the other hand, lets us consolidate values produced on multiple workers into a single centralized location. Both work at the API level, at an abstract level, so we don't need to do the heavy lifting ourselves; Spark takes care of it.
What are broadcast variables? As just discussed, a common value may need to be available on multiple executors and workers. A simple example: you want to spell-check the comments in tweets, and the dictionary holds the complete list of correct words. I want to make that dictionary available on each executor, so that when a task runs locally on an executor it can look up that dictionary and do the processing without pulling the data over the network each time. The process of distributing this data from the SparkContext in the driver to the executors where the tasks will run is achieved using broadcast variables, and it is integrated into the Spark API: we create the broadcast variable through the API, and the Spark framework takes care of shipping it to all the executors.
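A sketch of that dictionary spell-check idea with a broadcast variable; the word list and the tweets path are made up.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

dictionary = {"spark", "streaming", "cluster", "hadoop", "great"}   # hypothetical word list
bcast_dict = sc.broadcast(dictionary)      # shipped once to every executor and cached there

tweets = sc.textFile("hdfs:///data/tweets.txt")
unknown_words = (tweets.flatMap(lambda t: t.lower().split())
                       .filter(lambda w: w not in bcast_dict.value))  # read-only lookup on workers

print(unknown_words.take(10))
```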
Explain accumulators in Spark. Just as we have broadcast variables, we also have accumulators. A simple example: I want to count how many error records exist across the distributed environment, where the data is spread over multiple systems and executors; each executor counts the records it processes, but I want the total count in one place. So I ask Spark to maintain an accumulator; it is kept with the SparkContext in the driver program, since there is one driver per application, and the executors keep adding to it, so whenever I want I can read the accumulated value and take the appropriate action. Accumulators and broadcast variables may look like opposites of each other, but their purposes are completely different.
Why are broadcast variables needed when working with Apache Spark? A broadcast variable is a read-only variable, cached in memory on the executors in a distributed manner; it eliminates the work of moving the same data from a centralized location, the Spark driver, to every executor each time a task runs, wherever in the cluster that task happens to be scheduled. Compared with accumulators, broadcast variables are read-only: the executors can only read the value, not update it, so they are mainly used as a cache of reference data alongside the RDDs.
How can automatic cleanups be triggered in Spark to handle accumulated metadata? There is a cleaner TTL parameter we can configure (spark.cleaner.ttl in older Spark versions): while jobs are running it will write intermediate results to disk where needed and clean up the unnecessary metadata and the RDDs that are no longer being used, evicting the least recently used data and keeping memory tidy.
What are the different levels of persistence in Apache Spark? When we say data should be stored, it can be persisted at different levels: in memory only, in memory and on disk, or on disk only, and while storing it we can also ask for it to be kept in serialized form. The reason to persist is that I want to reuse a particular RDD later; I may not need it right away and don't want it occupying memory in the meantime, so I can write it to disk and read it back when necessary.
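A combined sketch of an accumulator counting error lines and of an explicit persistence level; the log path is hypothetical, and note that accumulator updates made inside a transformation are only reflected once the action runs (and may be re-applied if a stage is retried).

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
sc = spark.sparkContext

error_count = sc.accumulator(0)            # lives with the driver; executors only add to it

def tag_errors(line):
    if "ERROR" in line:
        error_count.add(1)
    return line

logs = sc.textFile("hdfs:///data/app.log").map(tag_errors)
logs.persist(StorageLevel.MEMORY_AND_DISK)  # other levels include MEMORY_ONLY, DISK_ONLY, ...

logs.count()                                # action triggers processing and accumulator updates
print("error lines seen:", error_count.value)   # read back on the driver
```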
I'll write it to disk and read it back when necessary. I will read it again in the next question, what do you understand by rdd schema then schema? rdd will be used especially within Spark SQL, so rdd will have the meta information built in, it will have the schema also very similar to what we have, the database schema, the structure of the particular data and when you have the structure it will be easy. for me to handle the data, so the data and the structure will exist together and this rdd scheme is now called dataframe with its brand name and the term dataframe is very popular in languages ​​like R, as in other languages, it is very popular, so it is We will have the data and the meta information about the data, indicating in which column, in which structure it is located, explain the scenario in which you will use Spark Streaming, suppose you want to do a sentiment analysis of the tweeters to for the data to be transmitted, so we will use a Flume is a kind of tool to collect the information from Twitter and feed it into Spark Streaming.
It will extract or identify the sentiment of each and every tweet and market whether it is positive or negative and accordingly the data will be the structured data, the ID of the tweet, whether it is positive or negative, maybe the sentiment percentage positive and negative sentiment percentage store it in some structured form, then you can leverage Spark SQL and group or filter based on sentiment and maybe you can use a machine learning algorithm that boosts that particular tweet. To be on the negative side, is there any similarity between all those negative and negative tweets, perhaps specific to a product at a specific time the Tweet was about or a specific region it was tweeted in, so Could these analyzes be carried out taking advantage of the MLA? buff spark, so the MLA streaming core will work together, all these are like different offerings available to solve different problems.
I hope you all enjoyed it; thanks, friends. If you enjoyed listening to this video, please like it and leave any of your doubts and queries in the comments, and we will answer them as soon as possible. Find more videos in our playlist and subscribe to the Edureka channel to learn more. Happy learning.
