GOTO 2012 • Introduction to NoSQL • Martin Fowler

May 11, 2020

So welcome to the NoSQL database track, although it seems like we've already had a couple of NoSQL database tracks. My name is Martin Fowler. Stenosi, who has hosted the track, asked me to lead it. Most of this track will be about the practical experience of people using NoSQL databases, but this talk is the exception, because it is actually an introduction to what NoSQL databases are. I'm going to do my best to fit as much useful information into 50 minutes as I can, which will help give you some context for understanding a lot of what's going on in the later talks. In the first part I'm going to talk a little bit about the history of NoSQL databases because, as with a lot of things, to understand why something is the way it is, it helps to know how it got there in the first place.

When I started in the computer industry in the mid-80s, it was right around the time that relational databases were really beginning their rise. It's a little hard to imagine that there was a time without relational databases, but I remember when they were the new thing and people were arguing about whether they would be any good or not. Obviously they have brought us a lot of benefits. They handle the persistence of our data, and they are also very important because they manage concurrency through transactions.


SQL has become a de facto standard language for talking to these databases. It's not quite standard, but it's standard enough that once you know SQL you can talk to these different tools. They have also become very important for many organizations for integration and reporting, which, as we will see, has its advantages and disadvantages. So SQL databases are a really good thing, but they also have some problems. The most obvious problem is one that most application developers run into: we assemble object structures in memory, often as a sort of cohesive whole, and then to store them in a database we have to split them into bits so that they fit into individual rows in individual tables.

A single logical structure for our user interface and for our in-memory processing ends up scattered across many, many tables. This is known as the impedance mismatch problem: we have two different models of how to look at things, and having to match them up causes difficulties. This is what leads to object-relational mapping frameworks and all that kind of stuff. Now, the impedance mismatch problem is such an uncomfortable one that in the mid-90s people said: well, we think relational databases are going away. Object databases are going to appear instead.

We can then take our in-memory structures and save them directly to disk without any mapping between the two. But we know what happened there: we didn't see object databases take off. People like me, who thought they were going to be dominant in the future, were wrong (and you're still listening to me, but hey, I guess you're easily taken in). We have argued endlessly about why object databases didn't fulfill that potential, and I think at the heart of it is the fact that SQL databases had become an integration mechanism: many people integrated different applications through SQL databases. As a result it was very difficult for any other type of technology to emerge, and that led to relational remaining dominant into the 2000s. So relational had 20 years of complete dominance, certainly in the enterprise data space and in many others.

I would also say, as we saw with the scientific work on the Large Hadron Collider, that they didn't really want to use relational databases but they had to, at least to some extent. What really changed things was the rise of the Internet, and particularly sites that have a lot of traffic: the big internet sites like Amazon or Google or Betfair or something like that. As you get huge amounts of traffic and data, what do you do? Well, you need to scale things, and the obvious route is to scale up and buy bigger boxes. But that approach has problems: it costs a lot, and there are real limits to how far you can go. So, as I hope everyone knows, many organizations, most notably Google, use a completely different approach.

That approach is many, many small boxes: basically commodity hardware (CPUs, boards, disks) all put together in these massive networks. But there is a problem here for data storage. SQL was designed to run on big boxes, as a single-node system. It doesn't work very well with large clusters of small boxes, and several of the big data players discovered this. They tried. I've talked to several people who have tried to spread relational databases out and run them on clusters, and the usual term that comes up when they describe it is "unnatural acts". It's very difficult to do. So a couple of organizations said: we've had enough of this, we have to do something different. They developed their own data storage systems that were really quite different from relational databases, and they started talking a little bit about them: they published papers describing what they were doing. This is what really inspired a whole new database movement, which is the NoSQL movement. Now, I want to talk a little bit about where this term NoSQL comes from.

A lot of people complain, quite reasonably, that it's a really strange term, since it tries to define a movement by something it's not. The origin is actually very simple. There was this guy in London, Johan Oskarsson, who had worked a lot with Hadoop and things like that. He had to go to a conference in California, and he wanted to take a look at all these interesting databases that people were exploring at the time. So he proposed a meetup, a little gathering where people could discuss ideas. And of course, if you're going to do that in the late 2000s, you absolutely need one really important thing: a Twitter hashtag. So he asked: what would be a good hashtag? It has to be short and it has to be unique, so we can find it easily. And some guy came up with the hashtag #nosql.

That's it. NoSQL was only ever intended as a Twitter hashtag to advertise a single meeting at one point in time. The fact that it has now become the name of the entire movement was completely accidental; no one thought that was going to be the case. But you know, that's how language usually works; these things are very unpredictable. There were a lot of people at that meeting. By the way, this is the list of who was there, which is not what we would call the full set of NoSQL databases, as a lot of databases that were not at that meeting are now considered part of the NoSQL umbrella.

This inevitably leads you to the question of what the definition of NoSQL is, and this is something I had to think about when writing a book on the subject; if you are going to write a book about something, it is important to define what you are writing about. My conclusion is that we cannot define NoSQL databases, because of this strange history.

What we can do is identify some common characteristics of NoSQL databases. Obviously NoSQL databases are non-relational; it's actually more about non-relational than about non-SQL. There is also a strong leaning towards cluster support, the ability to run on large clusters, because that's where the original spark came from via Google and Amazon. But that's not an absolute feature; there are some NoSQL databases that are not really focused on running on clusters. Most of these databases, interestingly, are open source. There are commercial tools that like to call themselves NoSQL databases, and maybe over time they will become part of the movement and open source would no longer be a common feature, but it still is one at this point. Perhaps most importantly, they are all things that have emerged from the culture of 21st-century websites.

There are many databases that go way back before relational databases and that don't use SQL or the relational model, but we don't call things like IMS or MUMPS (for those who have heard of either of those) NoSQL databases. So those are what I see as the common features. I'll mention the last one in a moment.

One of the interesting things about NoSQL databases is that they use different data models from the relational model, obviously, since the name says so. If we draw a picture of the most commonly mentioned NoSQL databases, typically what we see is that they divide into four broad groups based on their data model. Let's dig a little deeper into these data models.

The simplest data model to talk about is the key-value store. The basic idea is that you have a key, and you go to the database and say: give me the value for this key. The database knows absolutely nothing about what is in that value. It could be a single number, it could be a complex document, it could be an image; the database doesn't know and doesn't care. You can think of this as basically just a hash map, but persistent on disk. It's that simple. Another data model that is very common is the document data model.

The document data model thinks of the database as storage for a large number of different documents, where each document is a complex data structure. Usually that data structure is represented in JSON, because JSON is what is fashionable these days. I mean, you could do it in XML, but no one wants to be seen using XML in public. So we have these different documents, and the usual document databases will allow you to say "give me the documents that have these fields", so you can query into the document structure, and you can usually retrieve parts of a document or update parts of a document. So there is a big difference from the key-value store: there the value is a very opaque structure, while the document is much more transparent.
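To make the opaque-versus-transparent distinction concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: plain in-memory containers stand in for the database (so the "persistent on disk" part is left out), and the names `kv_put`, `kv_get`, and `find` are hypothetical helpers, not the API of any real product.

```python
import json

# Key-value store: the database sees only opaque bytes behind each key.
kv_store: dict[str, bytes] = {}

def kv_put(key: str, value: bytes) -> None:
    kv_store[key] = value

def kv_get(key: str) -> bytes:
    return kv_store[key]

# Document store: the database can look inside each document.
documents: list[dict] = []

def find(criteria: dict) -> list[dict]:
    """Return documents whose fields match every key/value in criteria."""
    return [d for d in documents
            if all(d.get(k) == v for k, v in criteria.items())]

order = {"id": 1001, "customer": "ann", "status": "shipped"}
kv_put("order:1001", json.dumps(order).encode())  # stored as an opaque blob
documents.append(order)                           # stored as a queryable structure
documents.append({"id": 1002, "customer": "bob", "status": "open"})

shipped = find({"status": "shipped"})
```

With the key-value version, the only question you can ask is "what is stored under `order:1001`?", and it is the application that decodes the bytes. With the document version, the store itself can answer a question about the fields.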

One thing that should be noted right away about document databases, and indeed about all NoSQL databases, is that they tend not to have a set schema. With a relational database, you can only put data into the database as long as it conforms to the schema you have defined for that database. With NoSQL databases, basically you can store anything you like; you just put it in there. NoSQL advocates will talk endlessly about how this increases your flexibility and makes it easier to migrate data over time, and how it's all absolutely wonderful. As always, that's not the whole truth.

I mean, normally when you talk to a database you want to get some specific data from it. You're going to say: I'd like the price, I'd like the quantity, I'd like the customer. As soon as you're doing that, you're setting up an implicit schema. You're assuming that an order has a price field, you're assuming that an order has a quantity field, and you're assuming it's called price and not cost or customer price or whatever else you can think of. That implicit schema is still present, and you have to manage it in much the same way as you manage a stricter relational schema. So "schemaless" is really a bit of a tricky term here. Not having a fixed storage schema does give you some options that you don't get with relational databases, and there are advantages in terms of flexibility, but you cannot ignore the fact that there is always an implicit schema.
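A small Python sketch of what an implicit schema looks like in application code. The field names `price`, `quantity`, and `cost` follow the example above; the function `order_total` is a hypothetical illustration.

```python
def order_total(order: dict) -> float:
    # The code assumes fields named "price" and "quantity" exist.
    # That assumption is the implicit schema, even though the
    # database itself enforces nothing.
    return order["price"] * order["quantity"]

good = {"price": 9.5, "quantity": 2}
renamed = {"cost": 9.5, "quantity": 2}  # same data, different field name

total = order_total(good)

try:
    order_total(renamed)
    schema_violated = False
except KeyError:
    # The store accepted the document, but the reader cannot use it.
    schema_violated = True
```

The database would happily store both documents; the mismatch only shows up when code that expects `price` reads a document that says `cost`.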

The only time you don't have to worry about an implicit schema is if you do something like "give me all the fields in this record" and dump the field names and values to the screen. Occasionally you want to do that, but most of the time you want to do something more interesting. So I've talked about two data models, the key-value and document data models, and I have presented them as two quite different things, but in reality the line between the two is much fuzzier than that. Many key-value stores allow you to store metadata about the value.

That metadata allows you to create more complicated indexes. I mean, if you want to get all the orders for a particular customer, you don't want to search every order in the database, the moral equivalent of a table scan; you want an index. So key-value databases allow you to store various pieces of metadata, which makes them feel a bit like document databases. And then in a document database, yes, you can do all kinds of queries, but there is often an ID, and often when you look something up you do it by saying "give me the thing with that particular ID", and that ID is effectively the same as the key in a key-value store. So the boundary between a key-value database and a document database is, as I said, somewhat fuzzy, and I have often heard a particular database described sometimes as key-value and sometimes as document.

I wouldn't worry too much about the difference between them. I think it's a useful first approximation to work with, but it isn't that important as you go on. The important thing is that both key-value and document databases share the notion that you take a complex structure and store it as a single unit in the database. Whether it is a relatively transparent document or a completely opaque value, that notion still holds. Those commonalities made me think that we really need some term to describe databases that work that way, which is why, for the book, I came up with one.

The term is aggregate-oriented database: a database that lets you store these large complex structures. Where does the term aggregate come from? It comes from this book here, Domain-Driven Design, written by Eric Evans. How many people have read Domain-Driven Design? Hopefully some of you; it's a great book. It really talks about how to think about domain modeling, and one of the key concepts in the first part of it is that often, when we want to model things, we have to group them into natural aggregates, because when we're talking to a database, even a relational database, it makes sense to think about those aggregates when we're storing and retrieving data. If we're modeling orders, for example, we'll usually have separate classes for orders and order lines.

That's a pretty standard object modeling 101 model, but we think of the order as a whole, a single unit. So an aggregate can be many objects of many classes, it can be a pretty complex structure, but when we talk about persisting it or transferring it into memory we think of it as one thing to move back and forth. Now, in a relational database we have to splatter that aggregate across a bunch of tables, but the nice thing about an aggregate-oriented database is that we can store that aggregate as a single unit in the database itself. So for a key-value database the aggregate is the value, in a document database the aggregate is the document, and that becomes the single unit that we move back and forth. I certainly find this a much easier way to think about the commonalities of these kinds of databases.
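Here is a minimal Python sketch of the same order stored both ways. The class and field names (`Order`, `OrderLine`, `order_id`, and so on) are assumptions made for illustration, and plain dictionaries and lists stand in for the two kinds of database.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class OrderLine:
    product: str
    quantity: int
    price: float

@dataclass
class Order:
    order_id: int
    customer: str
    lines: list[OrderLine] = field(default_factory=list)

order = Order(1001, "ann", [OrderLine("book", 2, 12.0),
                            OrderLine("pen", 5, 1.5)])

# Aggregate-oriented storage: one key, one unit holding the whole order.
aggregate_store = {f"order:{order.order_id}": json.dumps(asdict(order))}

# Relational-style storage: the same order split into rows in two tables.
orders_table = [(order.order_id, order.customer)]
order_lines_table = [(order.order_id, line.product, line.quantity, line.price)
                     for line in order.lines]
```

The aggregate version moves back and forth as one entry, while the relational version needs three rows across two tables, plus a join on the order ID to put the order back together.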

Now the third data model that I am going to briefly describe is that of column-family databases. This is the most complicated of these data models, and it is another aggregate-oriented one. A column-family database basically says that we have a unique key, which they call the row key, and within that we can store multiple column families, where each column family is a combination of columns that fit together. The column family here is effectively your aggregate, and you address it by a combination of the row key and the column family name. Now, the column families can be a little different too.

Look at the bottom one here, which is effectively a list of items from a customer's various orders. That's not much like a typical record structure that you might know, but of course it's the same as storing an array in a document or something like that, so again you get that kind of rich structure. Column-family databases give you a slightly more complex data model to work with, but the benefit you get is in terms of retrieval: you can more easily pull out individual columns and things of that kind. But again, the broad data model is an aggregate-oriented one. So the great thing about this is that now, when you take your aggregate in memory, instead of spreading it across many individual records, you can store everything in the database in one go, and the database knows what your aggregate boundaries are.
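The addressing scheme described above (row key, then column family, then column) can be sketched as nested dictionaries in Python. The row key, family names, and column contents here are made up for illustration, and `get_family` and `get_column` are hypothetical helpers rather than any real database's API.

```python
# row key -> column family name -> columns
rows = {
    "customer:1234": {
        # A record-like family: a fixed handful of named columns.
        "profile": {"name": "Ann", "billing_address": "1 Main St"},
        # A list-like family: one column per order, so it grows over time.
        "orders": {"order:1001": "book x2", "order:1002": "pen x5"},
    }
}

def get_family(row_key: str, family: str) -> dict:
    """Fetch a whole column family, which acts as the aggregate."""
    return rows[row_key][family]

def get_column(row_key: str, family: str, column: str) -> str:
    """Fetch a single column without reading the rest of the family."""
    return rows[row_key][family][column]
```

The second family plays the role of the list-like one in the talk's diagram: its columns are not fixed fields but entries that accumulate.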

Where this becomes really useful is when we talk about running the system on clusters, because if you are going to distribute data, what you want to do is distribute data that tends to be accessed together. The aggregate tells you what data will be accessed together, so by putting different aggregates on different nodes in your cluster, you know that when someone says "give me the details of this particular order", you only need to go to one node in the cluster instead of shooting around, God knows how many nodes, picking up different rows from different tables. So aggregate orientation naturally fits very well with storing data on large clusters, and that of course is part of the whole story with Bigtable and Dynamo. Both effectively opted for an aggregate-oriented approach, Bigtable very much with a column-family style.

Dynamo is much more of a key-value store. Either way, aggregate orientation makes running on clusters efficiently much easier and, like I said, that has really been the driving factor here. Nevertheless, nothing is perfect, and aggregate orientation isn't always a good idea. Let's imagine we have our ordering system and we want to look at the data like this: given a particular product, tell me the revenue, tell me the past revenue. Now we don't care about orders at all; we only care about what happens to the individual order lines of many orders, grouped by product. What we are doing is changing the aggregation structure from one where orders aggregate order lines to one where products aggregate order lines. The product becomes the root of the aggregate.

In a relational database this is simple: we just query the data differently. It is very easy to reorganize data into the structures we might want in different cases. With an aggregate-oriented database it's a pain. You can do it, and what people usually do is run map-reduce jobs to reorganize the data into different aggregate forms, probably keeping those persistent and maybe even doing incremental updates, but it will always be more complicated. So aggregate orientation is an advantage if most of the time you use the same aggregate to push data back and forth to persistence, and a disadvantage if you want to slice and dice your data in different ways. So that's what I've done so far: I've covered some of these models.
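The regrouping from order-rooted to product-rooted aggregates mentioned above can be sketched as a tiny map and reduce pair in Python. This is a single-process illustration, not a real map-reduce framework; the sample orders and the names `map_order` and `reduce_revenue` are assumptions.

```python
from collections import defaultdict

# Order-rooted aggregates, as they would sit in the store.
orders = [
    {"id": 1, "lines": [{"product": "book", "quantity": 2, "price": 12.0},
                        {"product": "pen", "quantity": 5, "price": 1.5}]},
    {"id": 2, "lines": [{"product": "book", "quantity": 1, "price": 12.0}]},
]

def map_order(order):
    """Map step: emit one (product, revenue) pair per order line."""
    for line in order["lines"]:
        yield line["product"], line["quantity"] * line["price"]

def reduce_revenue(pairs):
    """Reduce step: sum the emitted revenue for each product."""
    totals = defaultdict(float)
    for product, revenue in pairs:
        totals[product] += revenue
    return dict(totals)

# Product-rooted view, rebuilt from the order-rooted aggregates.
revenue_by_product = reduce_revenue(
    pair for order in orders for pair in map_order(order))
```

Every order has to be opened and walked to build the product view, which is the extra work that a relational query would not need.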

Basically I took the document, column-family, and key-value models and grouped them into this aggregate-oriented category, and I think that's a useful abstraction, at least at the level of what I can cover in 50 minutes. There is one very notable exception, and that is graph databases. Graph databases are not aggregate-oriented at all; they use a completely different data model. The graph database data model is basically that of a node-and-arc graph structure: not a bar chart or anything, just nodes and arcs, something we're hopefully familiar with, at least from some boring computer science classes.

The nice thing about a graph database is that it's very good at handling movement across relationships between things. With relational databases, you might think, with the word "relation" in there, that they're good at handling relationships. But of course "relation" doesn't mean relationship; it means something in set theory, and in reality relational databases are not awfully good at jumping across relationships. You have to set up foreign keys, you have to do joins, and if you do too many joins you can get into trouble. If you've modeled a graph structure, or a hierarchy, which is a special form of graph structure, in a relational database, you will have had this experience. It is not easy; relational databases are not good at this. So graph databases come in and say: yes, we can handle it, jumping across relationships left, right, and center. We make it easy, and we optimize for it.

To go with making that kind of thing fast, they can also provide an interesting query language designed to let you query graph structures. The query here is in Cypher, for Neo4j. It is about saying: given a certain graph structure, let me use that graph structure to express a more complex query. You can do some very interesting graph-oriented queries on graph databases, things that would be very, very difficult to write in SQL, as well as being a pig in terms of performance. In many ways you can think of graph databases and aggregate-oriented databases as having gone in opposite directions.
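To give a feel for what "jumping across relationships" means, here is a small Python sketch of a nodes-and-arcs structure with a breadth-first walk. It is not Cypher and not how any graph database is implemented; the graph contents and the function `within_hops` are hypothetical.

```python
from collections import deque

# Nodes and arcs: each node maps to the nodes its arcs point at.
friends = {
    "anna": ["barbara", "carol"],
    "barbara": ["dawn"],
    "carol": ["dawn", "elizabeth"],
    "dawn": [],
    "elizabeth": [],
}

def within_hops(graph, start, max_hops):
    """Breadth-first walk: nodes reachable from start in at most max_hops arcs."""
    seen = {start}
    queue = deque([(start, 0)])
    reached = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow arcs beyond the hop limit
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                reached.append(neighbour)
                queue.append((neighbour, hops + 1))
    return reached
```

Each extra hop here is one more dictionary lookup. In a relational schema the same question would typically need one more self-join per hop, which is where the difficulty described above comes from.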

Aggregate-oriented databases take a lot of things that are scattered around and put them into larger lumps, while graph databases break things down into even smaller units and let you play with those smaller units more carefully. I mean, you can still model relationships in aggregate-oriented databases, just as you can in relational databases; you basically put IDs in different documents, but it's a lot more awkward. So part of your decision about whether a NoSQL database will be interesting to you is: how do you work with your data? Do you tend to work with the same aggregates all the time, which would lead to an aggregate-oriented approach?

Do you really break things up and jump across many, many relationships in a complex structure? That would lead you to a graph approach. Or does the tabular structure work well for you, in which case you'd want to stay with a relational approach? So NoSQL breaks into those two categories. These are all schemaless, so graph databases also allow you to add any bit of data to any node. They have all that flexibility, but with the same caution about implicit schemas. So that's pretty much the data-model half of the picture. Now I'm going to move on to another topic, which has to do with consistency: dealing effectively with many people trying to modify the same data at the same time.

You've probably heard something like this: relational databases are ACID. They do the familiar ACID transactions that we all know and love: atomic, consistent, isolated, durable. NoSQL databases don't do any of that kind of stuff. And of course NoSQL people will say: well, we do BASE, which is an even more contrived and meaningless acronym than ACID is, and I won't even try to tell you what it stands for because I can only remember it on Tuesdays.

What ACID basically boils down to is this. If you have a single unit of information and you have to split it across multiple tables, what you don't want is to get stuck in a position where you have written half the data and someone else reads it, or you have written half the data and someone takes the same order and writes a different half. It gets really messy in that kind of situation. You need a control mechanism that effectively gives you atomic updates, and that is what transactions are all about: you either succeed or fail, and no one comes in between and messes things up.

Now, when it comes to our well-organized set of NoSQL databases, the first thing to point out is that graph databases tend to do ACID updates. That makes sense: they break down the data even more than relational databases do, so they have even more need to use transactions to hold things together. So if someone tells you that NoSQL databases don't do ACID, you now have an immediate reply: ah, but graph databases do. Now for aggregate-oriented databases.

Aggregate-oriented databases don't need transactions as much, because the aggregate is a bigger, richer structure. In fact, if you read the Domain-Driven Design book, one of the things it points out is that aggregates are transaction boundaries: you shouldn't allow transactions to cross aggregate boundaries, because if you do, it gets complicated to manage the concurrency of your system. So the domain-driven design community said from the beginning, even before NoSQL: keep your transactions within a single aggregate. That is effectively what happens in the world of aggregate-oriented databases. Any update to a single aggregate will be atomic, isolated, and consistent within itself. Only when you update multiple documents, say in a document database, do you have to worry about the fact that you haven't got ACID transactions, and that problem occurs much more rarely than you might think.

So that's the first point about ACID versus BASE: some NoSQL databases are completely ACID anyway, and the aggregate-oriented databases that aren't are ACID within their aggregates, which is kind of what really matters. But you also have to think a little bit more about consistency than that, because even in a relational world, ACID transactions don't mean that we're completely consistent and don't have to worry about update anomalies. I'll walk you through what I hope is a very familiar scenario to point this out, and also to illustrate how some of this is handled.

Imagine we have a typical multi-layered system. A person talks to a browser, the browser talks to a server, and the server talks to a single database. We are going to have two people talking to the same data in the same database at the same time, although through different browsers and servers. Here is the basic little scenario. The people on the left and the right both fetch the same chunk of data with a GET request and show it on the browser screen. Each human says: I need to make some changes to this. Eventually the guy on the left (I always confuse my left and my right) says: okay, I've got my data updated, let's post some changes. Shortly after, the guy on the right does the same.

He says: I've updated my data now, let's post some changes. Now of course, if we just let that happen, we get a write-write conflict: two people have updated the same information without being aware of each other's update, and we have gotten into trouble. So, transactions to the rescue. What do we have to do to avoid this conflict? We wrap the whole interaction, from getting the data on the screen to posting it back, in a transaction. That way the database ensures that we don't have a conflict; effectively one of them will be told: no, you need to do this again, you need to retrieve your data again.

That deals with the conflict. The problem is: how many people actually do this in their production systems? Occasionally you can get away with it; most of the time you cannot. Why? Because keeping a transaction open for that length of time, while a user is looking at and updating the data through the UI, will really eat into the performance of your system. I do want to stress that you can do this in some circumstances. If your performance needs are really minor, say you only have a handful of people using the system at a time, you may be able to get away with this approach, and it is advantageous to do so because many problems go away if you do.

For most systems, though, you can't afford to keep transactions open for that long, and in fact most people who write about building transactional systems like this will tell you never to do it: don't keep transactions open across user interaction. Instead, they say, just wrap the transaction around the last part, the update to the database. That's good because it stops a collision where one half-done update gets mixed up with another half-done update, with some tables updated here and some different tables updated differently there, and the result an inconsistent mess. But you still get a conflict, because the two people updated the same information without knowing that the other person did.

The same thing can happen even in an aggregate-oriented database if you have to modify more than one aggregate: one person modifies the first aggregate and then moves on to the second, the other person does it the other way around, and the result could be an inconsistency between aggregates. If you've come across this, and you probably have, you probably also know how to solve it: you use a technique that in one of my previous books I referred to as an Optimistic Offline Lock.

The usual way to implement this is that you give each data record, or at least each aggregate, a version stamp. When you retrieve it, you retrieve the version stamp with the aggregate data. When you post, you provide the version stamp you read. For the first guy everything works fine, and the version stamp is incremented. Then, when the second person tries to post, he still has the old version stamp, so you know something is up, and you can take whatever conflict resolution approach you choose.
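The version stamp technique just described can be sketched in a few lines of Python. This is a minimal single-process illustration with an in-memory dictionary standing in for the database; `read`, `write`, and `VersionConflict` are hypothetical names, and in a real system the compare-and-update inside `write` would have to be performed atomically by the database.

```python
class VersionConflict(Exception):
    """Raised when a write carries a stale version stamp."""

# key -> (version stamp, aggregate data)
store = {"order:1001": (1, {"quantity": 2})}

def read(key):
    version, data = store[key]
    return version, dict(data)  # copy, so each caller edits its own view

def write(key, expected_version, data):
    current_version, _ = store[key]
    if current_version != expected_version:
        raise VersionConflict(f"{key} changed since version {expected_version}")
    store[key] = (current_version + 1, data)

# Both users read the same version of the aggregate.
v_left, left = read("order:1001")
v_right, right = read("order:1001")

left["quantity"] = 3
write("order:1001", v_left, left)        # succeeds; the stamp becomes 2

right["quantity"] = 5
try:
    write("order:1001", v_right, right)  # still carries stamp 1
    conflict_detected = False
except VersionConflict:
    conflict_detected = True             # the application decides what to do next
```

What happens after the conflict is detected (reject, re-read and retry, or merge) is the conflict resolution choice the talk leaves open.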

You use the same basic techniques when working with a NoSQL database. The good thing is that you don't have to worry as much about transactions here, because the aggregate gives you that natural unit of update; it's your transaction boundary. Once you cross aggregates, you have to think about juggling version stamps and the like, but that's really not much different from what you have to do with a relational database, because offline locks force you to do this juggling with version stamps anyway. So yes, you don't get ACID transactions to the same degree as with a relational database, but the impact isn't as big as some people think, because we have to deal with these things all the time anyway.

I find it useful to think of two kinds of consistency. The consistency I have been talking about so far is what I call logical consistency. These consistency issues occur whether you are running on a cluster of machines or on a single machine; you always have to worry about them. When you start distributing data across multiple machines, this can introduce more problems. Broadly, you can distribute data in two different ways. One is sharding: taking one copy of the data and splitting it across different machines, so that each piece of data lives in one place but you're using many machines. Sharding doesn't really change the picture much. You still have the same logical consistency issues as on a single machine; they are exacerbated to some extent, but the basic problems remain the same.
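As a rough illustration of sharding by aggregate, here is a Python sketch that picks a node from a hash of the aggregate's key, so that each aggregate lives on exactly one node. The node names and the simple modulo placement are assumptions for illustration; real systems use more elaborate placement schemes.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]

def node_for(aggregate_key: str) -> str:
    """Choose the single node that owns this aggregate, based on a stable hash."""
    digest = hashlib.sha256(aggregate_key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# One dictionary per node stands in for that node's local storage.
shards = {node: {} for node in NODES}

def put(key: str, aggregate: dict) -> None:
    shards[node_for(key)][key] = aggregate

def get(key: str) -> dict:
    # Only the owning node is consulted; no other node is touched.
    return shards[node_for(key)][key]

put("order:1001", {"customer": "ann", "lines": ["book", "pen"]})
```

Because the whole order is one aggregate under one key, fetching it is a single lookup on a single node, which is the fit between aggregates and clusters described earlier.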

The other thing that is common to do with clusters of machines is to replicate data, to put the same data in many places. That can be advantageous in terms of performance, because now you have more nodes handling the same set of requests. It can also be very valuable in terms of resilience: if one of your nodes fails, the other replicas can carry on, so people talk a lot about availability and resilience with these cluster-oriented approaches. However, as soon as data is replicated, a new class of consistency problems starts to appear, and again I'll illustrate with a simple example. Here we have two people, me and my co-author Pramod, and we both want to book a particular hotel room. We send our booking requests from different continents: Pramod is in India, I am in the US, and each of us sends a request to a local processing node. Now the processing nodes need to communicate. They have to go: oh wait, what's going on here? And the system as a whole needs to make some kind of decision, essentially ensuring that one of us has to sleep on the street; in this case, me.

That is what happens 99.99 percent of the time. However, let's take a variation on this example. Again, we both want to book our hotel room, but now the communication line has gone down and the two nodes cannot communicate. We send our requests. What happens? Really, there are two broad alternatives. One is that the system says, "Our communication lines are down. We are sorry, we are unable to accept your hotel bookings at this time. Please try again later." The alternative is that the system says, "Yes, we will accept your booking, thank you very much, because we are really reliable and on top of everything," and proceeds to double-book the hotel room.

Now, I'm not that friendly with Pramod. We are good friends, but there are limits; maybe we don't want to share that hotel room. So basically what we are looking at is a choice. It is a choice between consistency, which says I'm not going to do anything while my lines of communication are down, and availability, which says yes, I'm going to go ahead, but at the risk of introducing inconsistent behavior. The most important thing is to realize that this is a choice, and it is a choice that can only be made by knowing the business rules, the domain rules, that you are working with.
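The two alternatives in the hotel example can be sketched in a few lines of Python. This is a toy model under stated assumptions: the class `BookingNode`, the `prefer` setting, and the `link_up` flag are invented for illustration, and a real system would detect partitions and replicate bookings rather than receive a flag.

```python
class BookingNode:
    def __init__(self, name, prefer):
        # prefer is "consistency" or "availability" (labels invented for this sketch)
        self.name = name
        self.prefer = prefer
        self.bookings = {}  # room -> guest, as known to this node only

    def book(self, room, guest, peer, link_up):
        if link_up:
            # Normal case: the nodes can talk, so a room is given out at most once.
            if room in self.bookings or room in peer.bookings:
                return "rejected: room already taken"
            self.bookings[room] = guest
            return "accepted"
        if self.prefer == "consistency":
            # Partitioned: refuse rather than risk a double booking.
            return "unavailable: try again later"
        # Partitioned, availability preferred: decide on local knowledge only.
        if room in self.bookings:
            return "rejected: room already taken"
        self.bookings[room] = guest
        return "accepted"


def double_booked(a, b, room):
    """True when both nodes have given the same room to different guests."""
    return (room in a.bookings and room in b.bookings
            and a.bookings[room] != b.bookings[room])


# Both nodes prefer availability and the line between them is down.
us = BookingNode("us", prefer="availability")
india = BookingNode("india", prefer="availability")
first = us.book("room-1", "guest in the US", india, link_up=False)
second = india.book("room-1", "guest in India", us, link_up=False)
conflict = double_booked(us, india, "room-1")
```

With `prefer="consistency"` the same two requests would both come back as unavailable, and no conflict could arise. Which behavior is right depends on the business rules, not on the code.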

I mean, it may sound really horrible to say, "We're going to double-book a hotel room, possibly with complete strangers." That would be bad. But actually, maybe hotels have ways of dealing with this. Maybe they have a block of rooms that they always keep available until the last minute for emergencies, and they can use one of those. Or maybe they just send me an apology letter and some frequent-sleeper points to try to make me happy. There are several ways in which businesses deal with inconsistencies as they arise. I'm not saying you should always opt for availability over consistency, but what is true is that it is always a domain choice. It is the business owners who will have to decide what is more important: the risk of double-booking the last hotel room, or having to close the site and say, "Sorry, we cannot accept any orders at the moment," which is a bad thing for business.

This is one of the things that drove Dynamo: they wanted to make sure the shopping cart was always available. You can always put things in the shopping cart. Why is this? Because it's America. What is the most important thing to do in America? Shop. We must maintain our retail destiny. We must always be able to shop. And what happens? Well, you come to check out and you go, "Why is this item in here twice?" or "I'm sure I put that order in here. Ah, computers make mistakes. Let me fix it." What is the worst that could happen?

You actually send the order through. You receive duplicate things. You call Amazon. They're sorry, sorry, sorry, and they take it all back. That is a lot better than someone not being able to shop for a few seconds. So the point is that it is a business choice. This relates to something you will hear endlessly every time someone talks about these things, which is the CAP theorem. Has everyone heard of the CAP theorem? How many people understand the CAP theorem? Some of you. It is actually quite simple. It is just very badly described. Well, no, not very badly, but not, I think, very helpfully. They say that there are these three concepts and you get to pick two. This is true, but I think it is easier to rephrase it.

It's a little bit clearer if you put it this way. You have a system that can get a network partition, which basically means communication between different nodes in a cluster breaks down. And by the way, if you have a distributed system, you will get network partitions. If you get a network partition, you have a choice: do you want to be consistent, or do you want to be available? That is really what the CAP theorem boils down to. If you have a single database running on a single server, it won't partition, so you don't have to worry: you can be as available as that node is, and you will be consistent. But as soon as you have a distributed system, you have to make that choice. It is not a single binary choice across your entire system, though. You actually have a spectrum, and you can trade off levels of consistency and availability.

I'm not going to get into how; trust me. Additionally, it may vary depending on the particular operation you want to perform. Certain operations can be highly consistent; other operations can be highly available. Any of the databases that do this kind of thing will give you the knobs and settings to do this, and you will learn how to trade them off. In fact, most of the time you are not trading off consistency against availability. It isn't availability that is the issue, and it isn't even dealing with network partitions that is the issue. A lot of the time, what you are doing is trading off consistency against response time, because the more consistency you want across a group of nodes, the more nodes have to get involved in the conversation.

Think again about the hotel case, where the two nodes had to communicate: that slows down the response time. So you could say, even if the network is up, I'm going to let each node take its own hotel bookings and fix it later. Even with the network up, I would get a faster response time than if I did all the communication needed to be consistent. And again, that is a business decision. Another thing Amazon said was: we always want people to be able to shop fast, because the most important thing in America is to shop, so we want really fast response times. Even if all the nodes are available and we could give you a completely consistent answer, we want to be fast. It also helps that merging shopping carts, dealing with shopping cart inconsistency, is relatively easy. They ask for this here, they ask for that there; well, clearly they want both, because this is America and everyone wants everything. God forbid they take things out of shopping carts; why would we want to encourage that? And in fact this is a broader trade-off in computing. It is really just another aspect of the general concurrency trade-off between safety and liveness.
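The "clearly they want both" rule for shopping carts can be written down as a tiny merge function. This is only a sketch of that idea; the function name `merge_carts` and the choice to keep the larger quantity are assumptions for illustration, not a description of how Amazon resolves cart conflicts.

```python
def merge_carts(cart_a, cart_b):
    """Merge two divergent copies of a cart: keep every item, with the larger quantity."""
    merged = dict(cart_a)
    for item, qty in cart_b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

# Two nodes each accepted updates to the same cart without talking to each other.
cart_seen_by_node_1 = {"book": 1, "kettle": 1}
cart_seen_by_node_2 = {"book": 1, "headphones": 2}
merged_cart = merge_carts(cart_seen_by_node_1, cart_seen_by_node_2)
```

Note the consequence of this rule: an item removed on one node but still present on the other comes back after the merge. That is the price of staying fast and available, and it is acceptable only because the business says so.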

If you have been to the concurrency talks and heard people talk about safety and liveness, this should sound pretty familiar. Now, what I really wanted to do with this little segment on consistency was to give you an idea of how consistency is different in the NoSQL world, particularly the aggregate-oriented one, compared with how you may have thought about consistency until now. There are many topics I could have talked about here that I simply don't have time for. The important thing is to realize that you have to think about consistency issues differently, because you have this different data model and the possibility of replicated data. In particular, you have to think in terms of this consistency-availability trade-off, and it is not just up to us as technologists to make that decision.

It actually depends on the way the business works where we make these trade-offs. And if you want more, well, I'll tell you to buy my book anyway, so you know what to do. For the last little segment, I'm going to talk a bit about when and why you might want to use a NoSQL database. The way I see it, there are two drivers that push us towards a NoSQL database. The first is the one I have already talked about as the real driver of the whole NoSQL movement itself: you have to deal with large amounts of data. If you have more data than you can comfortably or economically fit on a single database server, you are going to have to deal with some hassles.

You can go through the hassle of trying to run a relational database on a cluster, or you can get into this new NoSQL stuff. Right now, most of the time, I think I'd rather go for the NoSQL stuff, because running relational databases on clusters is still kind of a black art. So large amounts of data is one big issue. Now, some people have said, and one of the reviews of my book commented, "Yes, but only very few organizations have to worry about these things. If you are Google or Amazon, yes; almost everyone else, no." As I read that, what I heard in my head was "640K is enough for almost everyone." The reality is that there are tons of data coming at us, and every organization is going to be capturing and processing more and more data, so this large-scale data problem is only going to grow. That is one driver, but it is not actually the main reason.

Most people, I think, are looking at NoSQL for a different reason. There was a survey I saw in Monday's track that pointed out that most people interested in NoSQL databases are not really after large amounts of data; what they want is to be able to develop more easily. A good example of this: I have some friends who work on the Guardian newspaper and website. How many people have heard of The Guardian, a good English newspaper? Many of you; good. They are dealing with articles: saving articles, updating articles, pushing articles back and forth.

For them, the article is a natural aggregate. Spreading the data and metadata of an article across relational tables is a headache; it is awkward. Taking it as a single thing, a single article, and pushing it into the database is much simpler. The mapping problem, the impedance mismatch problem, is drastically reduced if you have a natural aggregate. A lot of the projects I've talked to at ThoughtWorks that have used a NoSQL database have gone that route. They have said: our data model doesn't fit very well with the relational model, and one of these NoSQL options fits better. It could be a natural aggregate, where we go the aggregate-oriented route, or it could be that we have something that looks a lot like a graph structure, so we go the graph database route.
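To make the idea of an article as a natural aggregate concrete, here is a small Python sketch that saves and loads an article as one unit. The field names, the `store` dictionary standing in for an aggregate-oriented database, and the helper functions are all hypothetical; they are not taken from The Guardian's system.

```python
import json

# One article as a single aggregate: body and metadata travel together.
article = {
    "id": "article-42",
    "headline": "Example headline",
    "body": "Example body text.",
    "authors": ["A. Writer"],
    "tags": ["databases", "nosql"],
    "published": "2012-10-01",
}

# Stand-in for an aggregate-oriented store: key -> serialized document.
store = {}

def save_article(doc):
    # The whole aggregate is written under one key, in one operation.
    store[doc["id"]] = json.dumps(doc)

def load_article(article_id):
    # One lookup brings back the full article, with no joins to reassemble it.
    return json.loads(store[article_id])

save_article(article)
loaded = load_article("article-42")
```

In a relational design the same article would typically be split across several tables (articles, authors, tags) and reassembled on every read, which is the mapping work the aggregate approach avoids.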

That, I think, is the most common reason right now that people are using NoSQL databases: you effectively get rid of that impedance mismatch problem. Now, of course, that raises a question. That was also the promise of object databases: we were going to get rid of the impedance mismatch problem. But they got hit because databases were being used for integration. Why isn't that same problem hitting us now? Well, it is hitting us, but much less, because more and more people are saying: no, we don't want to integrate that way. We want to hide our databases inside an application or service, and then use some kind of service-oriented interaction between them. That can be web services; it can even be something really nasty like SOAP over ESBs with God knows what added. But the point is that applications now control access to the data. If you are in a scenario where you can do that and effectively encapsulate your database, then the integration problem becomes much less serious, and that, I think, is a very important factor in making it possible for NoSQL databases to thrive.

This is good practice anyway, even if you have relational databases. You don't want to integrate through integration databases; they cause endless problems, believe me, if you haven't experienced it yourself. It is much better to encapsulate like that, and if you do, you have a lot more freedom over which database to use. I think that will be a big push towards this. Another thing that is encouraging people to use these databases is analytics. We all know about data warehousing. The usual data warehousing project, as far as I can tell, is that a vendor from one of the big companies comes in and says, "Oh, if you want to do a data warehouse, here's this project plan whereby every piece of data you might have in your organization is put in one place so that everyone can access it easily." It is a multi-year project with many, many very diverse stakeholders. We know that story. I mean, are there people here who have come across these large data warehouse projects and felt they were successful?

There's usually one or two, and nobody is prepared to admit it. Oh, you're prepared to admit it. But most of them go wrong. What we are seeing instead is a different approach that says: let's focus on one particular problem and see how we pull together the data for that. And by the way, the data might not be in a nice relational format, or even in SQL stores at all. It may be scattered across log files, or in what most companies actually run on, which is Excel spreadsheets. Let's get that data and put it together, and NoSQL databases play a big role in this.

Graph databases allow you to easily do graph-type analysis on the data, which is really good. Aggregate-oriented databases are generally less good at this because they can't slice and dice the data as well, but what they can do is store large amounts of data, so if you are pulling things from devices or log files or the like, they become very attractive. And of course that is what has given Amazon a big advantage, because they can pull in all this information. So does all this mean that NoSQL is the future of databases, that relational databases are going to disappear and we are all going to be doing NoSQL?

I don't believe so. I think the future is something I refer to as polyglot persistence. What that means is that there will be room for many different kinds of databases, and relational databases will continue to play an important role. If you are building an application, you may use several different databases as part of it; certainly across your organization you will use many databases. What you will do is choose the right database for the nature of the problem you are working with, and because there are different kinds of problems, there are different data stores. What goes away is the idea that whatever your problem is, the answer is a relational database. Now, this is great; it gives us many opportunities for the future. But as every cynic knows, every opportunity is really a problem, and there are many things you now need to think about.

You have to decide which is the appropriate NoSQL database for a problem. You have to deal with organizational issues: the relational database people aren't going to like this. In fact, for some people that's a big advantage, but let's not go there. NoSQL databases are immature. They don't have the tools, the experience, and the knowledge of how to work well with them that we have from 20 years of relational databases. And all these consistency problems can still end up biting you. So when it comes to what kind of project to use them on, I start with the drivers. If you want fast time to market and fast cycle time, then faster and easier development is really important, so if NoSQL databases can give you that, that is one reason to use them. Similarly, if I have a data-intensive project, obviously NoSQL's ability to handle a lot of data is very important. But I think there is also another overriding point: is your project really important to the competitive advantage of your business, what I refer to as a strategic project? Because if it is a strategic project, then it is worth taking the additional risk of dealing with an immature and not so well-known technology, which is what NoSQL is.

If, on the other hand, you have what I call a utility project, a simple thing that isn't really vital to the operation of the business, then that may not be the best place to bring in an unfamiliar technology like this. In that kind of situation, you are probably better off with the familiar, at least for a few years. But there are a lot of strategic projects out there, and certainly our experience over the last two or three years at ThoughtWorks with NoSQL databases has been very positive. I have heard very few complaints, and ThoughtWorkers always complain about what they are working with. So I am certainly very convinced that NoSQL databases have an important role to play in the spectrum of future development, and the rest of the talks in this track will explore different ways in which they have been used. So I hope you found this useful. If you want more depth, the book is very thin: my goal was 150 pages and I only missed it by two, so it's a quick 152-page overview, a little longer than what I just gave you.

I hope you find it useful. If you go to that page on my website, I collect several other things that I've done or talked about on NoSQL. Thanks for listening.
