Introduction to NoSQL databases

Jun 04, 2021

Hello everyone. Today we will talk about where NoSQL databases can be used and why they are so popular. Yes, we will talk about NoSQL databases and why they are the best in the world. And we are very excited to talk about this topic. Yes, there are many scenarios where NoSQL databases are used. There are an equally large number of scenarios where NoSQL databases are not used. That's why it's important to know when to use them and when not to use them. Well, yes, whenever you're building toy applications, you'll need your RDBMS, but whenever you need to scale, it's NoSQL databases.

Alright? Actually, that's not entirely true. It's not that scalability demands a NoSQL database. There are certain scenarios where these databases tend to work well, and we'll cover those in this video. Could you give us an example? Well, YouTube does not use NoSQL databases. Stack Overflow does not use NoSQL databases. Instagram does not use NoSQL databases. WhatsApp does not use NoSQL databases. WhatsApp does not have a database. Let's start with the video. So what is the difference between SQL and NoSQL? Well, look at the database schema that we have for the example of a person: they have an ID, which is a user ID, and we have the name, the address, the age, and the role.

Now the address is a complicated object. So the way I'm going to store it is in a separate table. This row corresponds to address ID 23, which means that the address is Munich, in Germany, and the district is blank. So you're seeing that there's some kind of foreign key mapping here, and that's how we store data in SQL. This is how you store it in NoSQL. You have ID 123 and you just have this big blob of data, right? This is JSON, and the way it stores it is the column name mapped to a value, so name maps to John Doe.

Same here. The address is no longer a foreign key. The address is another object inside this object. That's just JSON, you know, nesting: inside the address we have the city and the country. And since district is a null value, we don't actually even store it. We also have age and role defined, and you're seeing that there's one big blob of data here. So what makes NoSQL so efficient? The key is to think about how we store and retrieve data. When we store data, it's usually never that a user registers and submits her age later, or submits the role later, or submits the address later.
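To make the two layouts concrete, here is a minimal Python sketch. The table names, field names, the `to_document` helper, and the role value are made up for illustration; only the ID 123, John Doe, address ID 23, and Munich in Germany come from the example above.

```python
# Relational layout: two tables linked by a foreign key (address_id).
users = {123: {"name": "John Doe", "address_id": 23, "age": 30, "role": "engineer"}}
addresses = {23: {"city": "Munich", "country": "Germany", "district": None}}


def to_document(user_id):
    """Build the nested, document-style record for one user (hypothetical helper)."""
    user = dict(users[user_id])
    address = addresses[user.pop("address_id")]
    # Null fields such as district are simply left out of the document.
    user["address"] = {k: v for k, v in address.items() if v is not None}
    return {"id": user_id, **user}


document = to_document(123)
print(document)
```

The document carries the address inside it, so one lookup by ID returns everything about the user without following a foreign key.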

It's all together. So when there is an insert, usually all the fields are inserted together, which means that this whole JSON blob could have been passed to the API. That is, when the request came in, all this information was there, and this could have been done in a single insert. And whenever you fetch information about any user, you will usually need all the information about that user, right? Select star is so common that people don't even think about listing column names nowadays, unless of course it's a very large table. Or if you have a column that is quite large and you want to avoid it, that is a separate scenario.

But normally select star is very, very common. So because select star is so common, and because you need all the data relevant to a user all the time, this entire blob will also be pulled all the time. That means inserts and fetches both work on the entire blob. So why not keep it together? You see, when your query runs in the SQL database, usually the pointer comes to, let's say, this ID, this row, and then you have to sequentially read all these columns. Not only that, you also don't have a clean way to denormalize things. This address column could store the string, but this database is not built to denormalize things.

Therefore, you may need to join, which is quite expensive considering that most of the time you will need both pieces of data. It's cheaper here. And that's the first benefit of using NoSQL, right? All your relevant data is kept together in one block, making it a little easier to insert and retrieve. The second thing is that the schema is flexible. We saw that district was null, and in the SQL approach we still had to have a column for it even though we don't need it. Here, if the address is null, if the address is completely blank, that's fine, because this blob doesn't care about the schema.

The only thing it cares about is that it's a JSON document. So it will be name John Doe, comma, immediately age 30, and then the role. So what you are seeing is that the schema is very flexible in this case and not so much here. In fact, every time you add a new attribute, let's say we have some new attribute added here, which is salary, we have to add a new column to this SQL database, which is a very expensive operation because some kind of lock is needed on the table.

And it's also risky for consistency. I mean, if you want to stay consistent, then you need the locks, and that is the reason why it is expensive. Whereas here, if you're adding something that you don't need for all the older users, you can just start adding it right away. Because like I said, the schema doesn't care. The older records do not know that there is something called salary. Okay? So the second advantage is that the schema can be changed easily. The third advantage of NoSQL databases is that they have built-in horizontal partitioning.
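The schema flexibility described above can be sketched as follows. SQLite is used only because it ships with Python; how costly a schema change is and how it locks the table depend on the database engine, so this shows the shape of the change and not its cost. The second user and the salary value are invented for the example.

```python
import sqlite3

# SQL side: adding "salary" means changing the table definition for every row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("INSERT INTO users VALUES (123, 'John Doe', 30)")
conn.execute("ALTER TABLE users ADD COLUMN salary INTEGER")  # schema change
old_row = conn.execute("SELECT * FROM users WHERE id = 123").fetchone()

# Document side: new records just carry the extra field, old ones are untouched.
documents = {123: {"name": "John Doe", "age": 30}}
documents[124] = {"name": "Jane Roe", "age": 41, "salary": 5000}

print(old_row)         # the old row now has a NULL salary column
print(documents[123])  # the old document has no salary key at all
```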

Most of the time they expect a lot of scale to come in. I mean, users of these NoSQL databases expect a lot of scale. So what they tend to do is split this data horizontally. You can watch the sharding video to better understand horizontal partitioning. And of course, when this type of partitioning is built in, the database is more focused on availability, which is good. In reality, many systems need availability more than consistency. So that's the good thing: it's built for scale. The fourth and final big advantage of NoSQL databases is that they are designed for aggregations. When a person stores data in a NoSQL database, they generally expect to get important insights from that data.

For example, what is the average age? What is the total salary? These types of databases are designed to compute metrics and get intelligence out of the data. So that's aggregations. Well, these are the advantages we have with NoSQL databases. What are the disadvantages? Too many updates are not supported well. So if you have a lot of updates, this is not really good. What are the possible problems here? Well, the data may not be consistent, which means two nodes may have different data for the same ID. Whereas SQL databases usually give you something called ACID properties, by which you can contain this problem.

So that's a problem. So I'll write it down. Consistency is an issue, which basically means that ACID is not guaranteed. If ACID is not guaranteed, you cannot perform transactions using NoSQL databases. At least they cannot have the same transaction properties as ACID. Okay? That's one of the main reasons why financial systems don't use NoSQL databases for their transactions, because it doesn't make much sense. The second problem is that these databases are not optimized for reads. If I ask you to find me the ages of all the employees that we have in the company, what is going to happen is that it will go to these blocks, and each time it will read the entire block, then filter out the age, do that for each row, and then return the result to you.

Whereas in a SQL database, all you need to do is access this column. I mean, it won't be quite that easy. The reader has to go to that column and then read it. But this is more efficient than that. So NoSQL databases are not read optimized, and read times are comparatively slower. The last two problems I can see here: first, there is no implicit information about relationships. In an RDBMS, the R stands for relational. Now 23, the address ID, maps to this row, and what that tells you is that this row is somehow related to that row across the two tables, right?
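The read-cost argument above can be illustrated with a toy cost model that counts how many stored values are touched to collect all ages. It follows the speaker's comparison and is not a measurement of any real engine; the sample records and both function names are assumptions.

```python
documents = [
    {"id": 1, "name": "John Doe", "age": 30, "role": "engineer"},
    {"id": 2, "name": "Jane Roe", "age": 41, "role": "manager"},
    {"id": 3, "name": "Sam Poe", "age": 25, "role": "analyst"},
]
# The same data laid out column by column.
columns = {field: [doc[field] for doc in documents] for field in documents[0]}


def ages_from_documents(docs):
    """Read every whole block, then filter out the age."""
    values_read, ages = 0, []
    for doc in docs:
        values_read += len(doc)  # the entire block is loaded first
        ages.append(doc["age"])
    return ages, values_read


def ages_from_columns(cols):
    """Go straight to the age column and read only that."""
    ages = list(cols["age"])
    return ages, len(ages)


doc_ages, doc_cost = ages_from_documents(documents)
col_ages, col_cost = ages_from_columns(columns)
average_age = sum(doc_ages) / len(doc_ages)  # the kind of aggregation mentioned earlier
print(doc_cost, col_cost, average_age)
```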

Whereas in NoSQL there is no easy way to do this. If you had a separate table for, say, all address values, then the relationship would not be implied. You couldn't enforce a constraint like a foreign key constraint, which would say that this value 23 can only exist if there is a corresponding row in the other table. So the relationships are not implicit. And the fourth and final problem, which is a major problem, is that joins are hard. If you have two NoSQL tables, let's say, then when you join those two tables, what you do is go through each block of data here and find the relevant column you're joining on.

Then you have to find the relevant column again in the other table, of course, and then you have to merge them. Actually, the joins are all manual, so to speak, in a NoSQL database. There is no intelligence behind this type of join. You can try to improve them, but there is only so much you can do. Whereas SQL databases are designed to some extent for joins: yes, inner join, outer join, all the joins we didn't read about in college. Those kinds of things are very common when it comes to SQL databases because they have inherent relationships.
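A manual join of the kind described above might look like this in application code. The collection contents and the `manual_join` helper are hypothetical; the point is that the application does the matching itself, and nothing stops a record from pointing at an address that does not exist.

```python
employees = [
    {"id": 123, "name": "John Doe", "address_id": 23},
    {"id": 124, "name": "Jane Roe", "address_id": 99},  # dangling reference, never rejected
]
addresses = [{"address_id": 23, "city": "Munich", "country": "Germany"}]


def manual_join(left, right, key):
    """Inner join done by hand: index one side, then scan the other."""
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


joined = manual_join(employees, addresses, "address_id")
print(joined)
```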

These are the advantages and disadvantages of NoSQL. When do we actually use NoSQL? Well, it depends on these things, of course. It depends on whether your data is a block, and whether you are making few updates and want to keep everything together. For example, if you have something that needs to be write optimized, where a lot of writes come in, maybe NoSQL is the way to go. There are also scenarios where you might want built-in redundancy or aggregations on the data, in which case NoSQL provides them for you in a really nice way.

Of course, you can see all the disadvantages, and that is one of the reasons why applications like YouTube or Stack Overflow still do not use NoSQL databases. But it is really good, so let's take the example of Cassandra to understand these databases in detail. This is the Cassandra architecture we will talk about. The requests will come to this Cassandra cluster, which will have five nodes, and it is quite expensive to host a Cassandra cluster. The request IDs will be distributed across this cluster. So any request between zero and 100 will fall on node one, between 100 and 200 will fall on node two, and so on.

So there are five nodes, and you can see that there is a request with ID 123. So it should fall somewhere around here. However, request IDs may not always be numeric. It could be a UUID, or it could be a person's name, or something like that. So instead of thinking about IDs, in NoSQL databases these things are usually treated as keys. Take a look at the sharding video to better understand how these keys are assigned, but basically we just take the hash of 123. So it's passed through a hash function.

The key could be a string, it could be anything you want, and we'll get a value. So I will take it as 256. This hash is used to assign the request to a particular node in the cluster. So 256 falls between 200 and 300, so it falls here, right? Or rather, wherever it falls, I'm going to take the next node clockwise, so I'm going to pick four. So if the hash function is good, that is, it is uniformly distributed, what we can assume is that if many requests come in, they will fall with the same probability on any of the nodes.
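Here is a small sketch of the ring lookup just described. The ring size of 500 and the node positions (node N sitting at (N - 1) * 100) are assumptions chosen so that position 256 lands on node four, as in the example; MD5 stands in for whatever hash function a real cluster uses, so the key 123 will not actually hash to 256 here.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 500
TOKENS = [0, 100, 200, 300, 400]  # assumed ring positions of the five nodes
NODES = ["node1", "node2", "node3", "node4", "node5"]


def position_for_key(key):
    """Hash any key (number, UUID, name) to a position on the ring."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def owner_of_position(position):
    """Return the first node clockwise from the given ring position."""
    index = bisect.bisect_left(TOKENS, position) % len(TOKENS)
    return NODES[index]


print(owner_of_position(256))  # next node clockwise from 256
load = Counter(owner_of_position(position_for_key(k)) for k in range(10000))
print(load)  # with a uniform hash, each node holds roughly a fifth of the keys
```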

Therefore, each node should get a roughly equal share, about 20% of the load. The advantage of this is that if you get a lot of requests, you can make sure all the nodes are used up to their full capacity. Because the distribution is random, every node will have about the same load and will be able to reach its maximum capacity, rather than one node having too much pressure. When can a node have too much pressure? When your hash function is not really nice. So let's say your hash function is that any value less than 100 maps to zero and any value greater than 100 maps to one.

The requests range from zero to 500, we said. So about 400 of the requests will get hash value one. Let's say those are assigned to node two, the other requests are assigned to node one, and the rest of the nodes are not even touched. In this case, by the time you reach the load capacity of node two, your entire cluster is effectively fully loaded, right? Because two is going to crash. So for this you need a good hash function. Or, if your hash function is bad and you can't change it for some reason, maybe because you can't hot-swap the hash function, there is another option.

What you can do is something like a two-tier cluster, where when the request hits node two, you don't actually store it there. Instead, two sends it to another cluster that has five nodes, and you run the request through a different hash function, call it h dash, which gives it a different value. So the first hash function is bad, but the second hash function can have a really nice uniform distribution, and therefore these five nodes will have a roughly equal load. Using this multi-level sharding technique, so to speak (just watch the sharding video on multi-level sharding), you should be able to survive.
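The skew from a bad hash function, and the second-level rehash that spreads the hot bucket, can be sketched like this. The function names are hypothetical and SHA-256 is just a convenient stand-in for a well-distributed second hash.

```python
import hashlib
from collections import Counter


def bad_hash(request_id):
    """The skewed first-level hash from the example: only two possible values."""
    return 0 if request_id < 100 else 1


def second_level_hash(request_id, nodes=5):
    """A different, well-distributed hash used inside the second cluster."""
    digest = hashlib.sha256(str(request_id).encode("utf-8")).hexdigest()
    return int(digest, 16) % nodes


request_ids = range(500)
first_level = Counter(bad_hash(r) for r in request_ids)

# Everything that landed in the hot bucket is forwarded and rehashed.
hot_requests = [r for r in request_ids if bad_hash(r) == 1]
second_level = Counter(second_level_hash(r) for r in hot_requests)

print(first_level)   # one bucket holds 400 of the 500 requests
print(second_level)  # the 400 hot requests spread over five nodes
```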

But of course this may not look like a very good idea. Why have multiple levels of hash functions? Well, why not? If you are a user of, say, Google Maps and you are in India, then maybe the hash function here is based on country. So the country ID is the only thing you're looking at. And if you route by country ID to one place, one of the countries may have a huge amount of load during certain festivals. Let's say on Diwali, everyone uses Google Maps, everyone goes somewhere. What will end up happening is that this node will have too much pressure.

And in those cases, what you can do is go for multi-level sharding. Alright? These are the main advantages of using a hash function. What's also very intuitive with the hash function is replication. You have one node that you're going to send the request to, let's say node two, and you also want to make sure that this data persists, so that you don't lose it if two fails. Because this is important data, if two fails, you don't want the data to be lost from the entire cluster. So you want to make copies, you want to make replicas of that data.

Which nodes do you choose to hold those replicas? Because of this hashing concept, you can simply ask the next node to also keep a copy. If the request falls on two, the node after it, three, should have the copy. If the request falls on five, one must have a copy. Now two nodes store the data, which means that the probability of losing data is lower. And also, when a person makes a query, what you have to do is hash the request, determine where it falls, and any of the replicas can respond.

So if you're keeping two replicas, five or one can respond. If you are keeping three replicas, then five or one or two can respond, and so on. So your read queries are optimized, and your writes are also more reliable, and they could also be optimized, because if five fails then you can write to one and still work. Through this, Cassandra provides us with two features. The first is load balancing. You can take a look at the description for a good link on this; it's a video in the system design playlist. And the second is redundancy. Redundancy and replication are slightly different, but this gives you a data guarantee and it gives you read speed.
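Replica placement as described above, with copies going to the next nodes clockwise, can be written in a few lines. The helper names are made up, and nodes are numbered one to five as in the example.

```python
NODES = [1, 2, 3, 4, 5]


def replicas_for(primary, replication_factor):
    """The primary node plus the next nodes clockwise on the ring."""
    start = NODES.index(primary)
    return [NODES[(start + i) % len(NODES)] for i in range(replication_factor)]


def readable_replicas(primary, replication_factor, failed=()):
    """Replicas that can still answer a read when some nodes are down."""
    return [n for n in replicas_for(primary, replication_factor) if n not in failed]


print(replicas_for(5, 2))                    # two replicas for a key owned by node five
print(replicas_for(5, 3))                    # three replicas for the same key
print(readable_replicas(5, 3, failed={5}))   # who can answer if five is down
```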

So as we said, we distribute the reads and make sure that the writes are done reliably; we have both features in the Cassandra cluster. One of the most important concepts when it comes to NoSQL databases is the idea of distributed consensus. What I mean by this is that there are five nodes, and let's say the replication factor is three. So if a request falls on five, then five, one, and two will each store a copy of that data. They need some kind of mechanism to agree on a particular value to return to the user. Why is that the case?

Well, let's say I write to five. So there is a write here, and at the same time I'm going to write to one and two as well. However, let's say one and two are a little slow, so they don't actually have the write yet. If that's the case, and I do a read operation now, here is what happens. Let's say I created my profile on five, and I expect one and two to have it too. I do a read operation on my profile, and five fails. Nothing to worry about, because one and two should have all the data that five has.

So I go to one and ask for my profile. I see that it doesn't exist. An error is returned, and the application now assumes that this profile does not exist. It then returns a user not found error. So I'm going to be confused, because I just created my profile, so why is it not showing up in the database? For these types of issues, what Cassandra should do is return a database error, so that the application knows that there is something wrong with the database and tells the user: there is something wrong with our database, wait a while.

Okay? So to do that, what we need is some kind of distributed consensus. And one of the ways to achieve this is quorum. Quorum is a way in which the multiple nodes that are relevant to a particular query agree on, decide on, or vote on a particular value. What I mean by that: let's say five failed, we went to one, and one said, I don't have this data, while two said, I have this data. Let's say a concurrent write happened. If that is the case, I will pick the data with the latest timestamp, okay?

The version ID, the timestamp, whatever you want to call it, and return that to the user. In this way, the user is happy that their profile is created. However, let's assume that even two does not have the profile yet. In this case, they will both agree that there is no profile, and unfortunately the user will be given a user profile not found error. If the quorum value is equal to two and the replication factor is equal to three, that means that if two of the three nodes, meaning a majority of the nodes, agree on a particular value, then we consider it true.

So in this case, unfortunately, if one and two do not have the replicated writes, an incorrect error will be sent to the user. So do we care about this? A little. But this is really rare. The possibility of five crashing and one and two not having the writes before a read operation comes in is really low. So this is a risk we are willing to take when we pick a NoSQL database and go with availability over consistency. But what are the other scenarios? I mean, the good scenarios are that one has the data and its timestamp is more recent, so two's data won't be taken.

Or finally, maybe they both have it. Why don't we be optimistic as engineers? Then that would also result in the correct data being returned. And that's why quorum is an important concept. What it does is let you take a risk, but in most cases the result is correct, right? A quorum of two is very unlikely to fail. What if I make it a quorum of three, as in three nodes have to agree with a replication factor of three? In this case, this query will fail because five failed. You need three nodes to agree on a particular value. One and two alone are not enough.

I mean that one and two will return a particular value, but five does not return a value, and therefore the query fails. I'm also taking a special case where I pick the latest timestamp. If the quorum factor is equal to two, it is very likely that unless both nodes agree on some value, the query will fail. Okay? I have used the timestamp because in this case it clearly shows that you can still work with one against one. Basically there is no majority there. But most of the time, if the nodes do not agree on a value, you simply fail the database query and tell the user that we will not be available for some time for their particular search.

Now, if you want the details of how a quorum works, I'll record a video on this a little later. It is distributed consensus. So there will be Paxos, there will be a gossip protocol, but in general you can assume that the nodes send all their information to a central server. Let's say three is the node they send information to, and then three counts the votes, picks a value, and returns it to the user.
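The quorum read walked through above can be modelled with a toy coordinator. This is a simplification of the scenario in the video, not Cassandra's actual read path: `quorum_read`, `DatabaseError`, and the response format are all assumptions. A response of `None` means the replica is down; otherwise it is a `(timestamp, value)` pair where a value of `None` means the replica has no data for the key.

```python
class DatabaseError(Exception):
    """Raised when too few replicas answered to satisfy the quorum."""


def quorum_read(responses, quorum):
    """Pick the value to return from the replicas' responses."""
    answers = [r for r in responses if r is not None]
    if len(answers) < quorum:
        raise DatabaseError("not enough replicas responded")
    # Among the replicas that answered, the latest timestamp wins.
    timestamp, value = max(answers, key=lambda r: r[0])
    return value


# Five is down, one has nothing yet, two already has the profile.
print(quorum_read([None, (0, None), (10, "profile")], quorum=2))

# Five is down and neither one nor two has the write: "not found" is returned.
print(quorum_read([None, (0, None), (0, None)], quorum=2))

# With a quorum of three, the same failure becomes a database error.
try:
    quorum_read([None, (0, None), (10, "profile")], quorum=3)
except DatabaseError as error:
    print("database error:", error)
```

The third case is the behaviour the speaker asks for: the application can tell the user to wait instead of wrongly reporting that the profile does not exist.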

Now, what happens if three fails? That's the master, so to speak. In this cluster there is again a consensus problem, which we will cover in a future video. So that should answer that. The last way Cassandra stands out as a NoSQL database, and even Elasticsearch has this feature, is the way it stores data and the way it writes data. So if you have an incoming request in Cassandra and you have this key-value pair, assume this table exists in memory, right? Because you need to write it somewhere. So Cassandra will store all these writes in memory as a log, okay?

The reason I call it a log is that whenever there is a write request, it is written sequentially. So if a new request comes in, it goes to the next position. Another new request, the next position. In this way it is actually storing all the data as a log. This is efficient because all you need to do is go to where the current pointer is and simply write the data, instead of searching for anything. Okay? So this is fast, and periodically this memory is dumped into something called the SSTable. Sorted string table.

Why is it a sorted string table? Because the keys are sorted in this table. So if I have some data here, the keys will be sorted and the values will be stored by key. This is persistent storage, meaning it will be stored on one of these nodes in the cluster. This concept comes from a very famous Google paper on Bigtable, the data structure that Google created. You can take a look at the description below, but the special thing about the sorted string table is that it is immutable, right? So this data will not change. It is immutable.
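The write path described above, an in-memory log that is periodically dumped into an immutable sorted table, can be sketched as follows. This follows the simplified description in the video and not Cassandra's real implementation; the `MemTable` class and the record layout `(key, timestamp, value)` are assumptions, and a Python tuple stands in for the immutable table.

```python
class MemTable:
    """In-memory log of writes, flushed into sorted immutable tables."""

    def __init__(self):
        self.log = []  # (key, timestamp, value) in arrival order

    def write(self, key, timestamp, value):
        self.log.append((key, timestamp, value))  # sequential append, no search

    def flush(self):
        """Dump memory into a key-sorted, immutable table and reset the log."""
        latest = {}
        for key, timestamp, value in self.log:
            if key not in latest or timestamp >= latest[key][0]:
                latest[key] = (timestamp, value)
        self.log = []
        return tuple((key, ts, val) for key, (ts, val) in sorted(latest.items()))


memtable = MemTable()
memtable.write("124", 1, "Jane Roe")
memtable.write("123", 2, "John Doe")
sstable = memtable.flush()
print(sstable)  # records come out ordered by key, not by arrival time
```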

So whenever Cassandra has to dump some data from its memory, it puts it into a new sorted string table. Now you can imagine that as these requests keep arriving, after a few days you will have many sorted string tables throughout your cluster, and these will take up a lot of space. Why? Because of updates. Let's say the key is 123, and two days later you receive an update to that key 123. So some data for it has changed. Maybe the name has changed from John Doe because a middle name was added.

So in this case, what happens is you have an update to that key. The latest record is this record. It's in some other sorted string table, because it was created later when memory was dumped to an SSTable. And sure enough, what happened is that you have multiple records for the same key, right? If you have multiple records for the same key, it's not a correctness problem. You can always use a timestamp. Each record will have a timestamp, and you can use the latest timestamp to get the data. The problem is not consistency. The problem is storage usage.

You're going to use up a lot of storage with these duplicate keys. So if you have 10 records for the same key, then you are using 10 times the required storage. So Cassandra, and Elasticsearch too, provide a feature called compaction. What we do is take different sorted string tables and merge them together. So you can imagine this as some kind of merge. Yes, you have two sorted arrays and you are simply merging them. So this is an O(n) operation, and the space complexity is the minimum of m and n, where m and n are the sizes of the two arrays.
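Compaction as a merge of two sorted tables can be sketched like this. The record layout `(key, timestamp, value)`, the `compact` function, and the sample names are assumptions; for clarity this version builds a new output table and does not attempt the space-saving in-place merge mentioned above.

```python
def compact(older, newer):
    """Merge two key-sorted tables, keeping the latest record per key."""
    merged, i, j = [], 0, 0
    while i < len(older) and j < len(newer):
        if older[i][0] < newer[j][0]:
            merged.append(older[i])
            i += 1
        elif older[i][0] > newer[j][0]:
            merged.append(newer[j])
            j += 1
        else:
            # Same key in both tables: only the latest timestamp survives.
            merged.append(max(older[i], newer[j], key=lambda record: record[1]))
            i += 1
            j += 1
    merged.extend(older[i:])
    merged.extend(newer[j:])
    return tuple(merged)


older = (("122", 1, "Sam Poe"), ("123", 1, "John Doe"))
newer = (("123", 5, "John Michael Doe"), ("124", 5, "Jane Roe"))
compacted = compact(older, newer)
print(compacted)  # key 123 now appears once, with the updated name
```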

I actually covered this at length in the Timsort video, which no one watched.

How do you get rid of deleted records? Well, you go to the deleted record and place what Cassandra calls a tombstone. So you put up a tombstone, probably a flag, and the tombstone says this record is dead. For any read operation on that key, if there are three or four records and there is a tombstone, then you see the tombstone at the latest timestamp, treat the record as dead, and all of the older records are dead too. If there is an update on that key, again, if you see a tombstone, then you know that an update is impossible, and therefore you raise an exception as if the record did not exist.
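The tombstone behaviour described above can be modelled as follows. This is a toy model of the video's description; `TOMBSTONE`, `KeyNotFound`, and the three functions are hypothetical names, and `records` holds every `(timestamp, value)` pair stored for a single key.

```python
TOMBSTONE = object()  # marker value meaning "this record is dead"


class KeyNotFound(Exception):
    """Raised when the key has no live record."""


def read(records):
    """Return the latest value, unless the latest record is a tombstone."""
    if not records:
        raise KeyNotFound("no record for this key")
    timestamp, value = max(records, key=lambda record: record[0])
    if value is TOMBSTONE:
        raise KeyNotFound("record was deleted")
    return value


def delete(records, timestamp):
    """Deleting appends a tombstone instead of erasing older records."""
    records.append((timestamp, TOMBSTONE))


def update(records, timestamp, value):
    """Updating a dead key fails as if the record did not exist."""
    read(records)  # raises KeyNotFound if the latest record is a tombstone
    records.append((timestamp, value))


records = [(1, "John Doe"), (2, "John M. Doe"), (3, "John Michael Doe")]
print(read(records))
delete(records, 4)
try:
    update(records, 5, "Johnny Doe")
except KeyNotFound as error:
    print("update rejected:", error)
```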

In general, this is how NoSQL databases work. We've picked Cassandra as the example specifically, but many of these concepts actually extend to Elasticsearch, to Amazon DynamoDB, and so on. There's a lot to digest here, and if you have any questions or suggestions, feel free to leave them in the comments below. If you like this video, hit the like button. And if you want to receive notifications for more such videos, press the subscribe button. See you next time.
