Call me old-fashioned, but I do not grok NoSQL databases.
The "No" in "NoSQL" supposedly stands for "Not only." But I think people read it as "NO SQL", like "look! No SQL in my codebase! Hah hah hah hah, I got away from SQL! Yay!" Like... SQL is a bad thing?
OK, I know, SQL is terribly dorky. And there's the infamous object-relational impedance mismatch. Object Relational Mapping (ORM) frameworks try to overcome that, but they are tricky to get right.
But... when the ORM runs into difficulty, it's because there's a really real issue to grapple with. Here are some really real issues:
* How do you deal, in the most general way, with foreign key references? Let's say you have a course catalog database with tables named Course and Professor. Let's keep this simple and say that each Course has one Professor. That's a Many-to-One relationship from Course to Professor, which, in a relational database, we handle by putting a professorID column in the Course table. OK. When we fetch a list of Courses into our application, do we just select from the Course table? Then we have only the professorID for each Course, not the names of the Professors. If we're showing a listing on which we want to display the professor names, one SELECT from Course could get followed up by 1,000 queries on the Professor table as each professorID is resolved. Or do we do a join? That's the smart way to use a relational database. But how does the ORM know that in one context, when we fetch a bunch of Course objects, we will be looking at their Professors and should do a join, but in another context not?
* How do we cope with changes to the database schema? Maybe records created before some date are different, in terms of what data they contain.
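The first issue above can be sketched in plain JavaScript (invented in-memory data standing in for a real database): count how many "queries" it takes to resolve professor names one at a time, versus building a lookup table once, which is roughly what a join does for you.

```javascript
// Toy sketch of the N+1 query problem, simulated with in-memory arrays.
const professors = [
  { professorID: 1, name: 'Eric Bishop' },
  { professorID: 2, name: 'Suresh Kalathur' },
];
const courses = [
  { courseNumber: 'cs601', courseName: 'Web Application Development', professorID: 1 },
  { courseNumber: 'cs602', courseName: 'Server Side Web Development', professorID: 2 },
];

let queryCount = 0;
// Stands in for "SELECT * FROM Professor WHERE professorID = ?" -- one query per course.
function findProfessor(id) {
  queryCount += 1;
  return professors.find((p) => p.professorID === id);
}

// The N+1 way: 1 query for the courses, plus N queries for the professors.
const nPlusOne = courses.map((c) => ({
  courseName: c.courseName,
  professorName: findProfessor(c.professorID).name,
}));
console.log('queries issued:', 1 + queryCount); // 1 + N

// The join way: build the lookup once, then a single pass over the courses.
const byId = new Map(professors.map((p) => [p.professorID, p]));
const joined = courses.map((c) => ({
  courseName: c.courseName,
  professorName: byId.get(c.professorID).name,
}));
console.log(joined[0].professorName); // 'Eric Bishop'
```

With 1,000 courses the first approach issues 1,001 queries; the second stays at two table scans, which is the whole argument for letting the database do the join.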
When folks announce that they have developed some tremendous magical new tool that does away with all the headaches of SQL and ORMs, often they've just swept the issues under the rug. They show off some demo in which they've CRUDed some data with nothing that looks like SQL involved, like wheee!!! But the demo is so small it just doesn't run into any foreign-key or versioning issues. Or the issues are so small they can be ignored.
For example, here's some sample code from my professor, who is teaching us the minimal basics of getting started with MongoDB and the mongoose module:
var mongoose = require('mongoose');
var Schema = mongoose.Schema;
var courseSchema = new Schema({
  courseNumber: String,
  courseName: String,
  courseDevelopers: [
    { firstName: String, lastName: String }
  ]
});
var Course = mongoose.model('Course', courseSchema); // model definition, implied but not shown in class
// skipping buncha connection etc. code here...
course = new Course({
  courseNumber: 'cs601',
  courseName: 'Web Application Development',
  courseDevelopers: [
    { firstName: 'Eric', lastName: 'Bishop' }
  ]
});
course.save();
Already I am screaming. As someone trained to think in terms of normalizing relational databases, I see a glaring hideous problem here. In case it's not obvious yet, let's go on:
course = new Course({
  courseNumber: 'cs602',
  courseName: 'Server Side Web Development',
  courseDevelopers: [
    { firstName: 'Eric', lastName: 'Bishop' },
    { firstName: 'Suresh', lastName: 'Kalathur' }
  ]
});
course.save();
AAAAAAAAaaaaaa!!! Seriously, you are going to have someone enter the name 'Eric Bishop' twice, for two different Courses? No no no no. Not to mention that this might not make for the most efficient data-entry UI: what if there's a discrepancy in spelling, or hyphenation, or whether one uses the middle name? Then we can't reliably do a query for "what are all the classes taught by Eric Bishop?" What if there's more information we should keep on each Professor? Seriously, do we have to explain the benefits of database normalization all over again?
Looking at the MongoDB documentation: surely there's a way to attach existing Professor documents from a Professor collection. But does that create a linkage in the database, or does it duplicate the data? If 'Eric' 'Bishop' is updated to 'Erin' 'Bishop' in the Professor collection, is that reflected in the course listing when Course documents are fetched? In other words, did they re-invent the join?
Well... sort of. There's the $lookup feature. Not sure how scalable that is for the full range of uses of the join concept.
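For what it's worth, here's my understanding of what $lookup does, simulated in plain JavaScript (a sketch of the semantics, not actual MongoDB code): it's essentially a left outer join, attaching to each local document an array of the matching documents from the foreign collection.

```javascript
// Semantic sketch of MongoDB's $lookup aggregation stage, in plain JavaScript.
// Invented data; field names mirror the course-catalog example above.
const professors = [
  { _id: 'p1', firstName: 'Eric', lastName: 'Bishop' },
  { _id: 'p2', firstName: 'Suresh', lastName: 'Kalathur' },
];
const courses = [
  { courseNumber: 'cs601', professorID: 'p1' },
  { courseNumber: 'cs602', professorID: 'p2' },
];

// Roughly what { $lookup: { from: 'professors', localField: 'professorID',
//                           foreignField: '_id', as: 'professor' } } produces:
function lookup(localDocs, foreignDocs, localField, foreignField, as) {
  return localDocs.map((doc) => ({
    ...doc,
    // Left outer join: an unmatched document gets an empty array, not dropped.
    [as]: foreignDocs.filter((f) => f[foreignField] === doc[localField]),
  }));
}

const joined = lookup(courses, professors, 'professorID', '_id', 'professor');
console.log(joined[0].professor[0].lastName); // 'Bishop'
```

Note that this scans the foreign collection per local document unless the server can use an index on the foreign field, which is part of why I wonder how it scales.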
Here's an article that strongly suggests to me that the NoSQL fad is a cooler of Kool-Aid: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
no subject
Date: 2020-07-21 02:44 pm (UTC)
The problem with SQL is that it obscures the performance impact of your ORM, particularly as it relates to locking. The promise of SQL is the promise of ACID: that your program will behave as if every SQL statement is run with a global write lock held on the entire database. Actual SQL implementations have read-write locks and fine-grained locking and all kinds of cleverness to avoid livelock, but if you're doing anything at all complicated you have to put a whole lot of effort into performance.
I'll give two examples. First: Dreamwidth runs on MySQL, because it inherited a codebase from LiveJournal which was written when that was the only game in town. It's a relatively small site as social media sites go; as a long-form blogging site its entries are updated relatively slowly, and the partitioning is easy. Still, friendships are arbitrary, and a flist display will cross partition boundaries. Most page displays will have to get a read lock on large numbers of partitions of the entries table.

Dreamwidth solves this by not growing to the point where they have to think about it, and by not displaying comments inline in your reading page. In practice the app does not care whether the version of your entry being displayed is consistent with the version of the next entry in my reading page, so it could get those read locks sequentially instead of having to hold them all at once. In other words, tell MySQL to skip the C in ACID, by mostly abandoning the relational features of the database and moving them into the app. But now you are doing a lot of work to split up the SQL query into little bits. Dreamwidth does not have engineering time for this. And more to the point, now you need engineers, and probably operational database tuners, who understand when this is necessary, which means they are thinking about what's really going on inside the database rather than the nice clean API the database presents.

If the LiveJournal code had been written against a NoSQL database, they'd have an API that aligns with the underlying database implementation in a way that makes the most performant solution more visible. This would be easier, and in fact it'd probably already have been done by the LiveJournal engineers back in the aughties, when they had the resources to do it.
My second example is in the same space: Facebook. Amazingly enough, the last time I looked (which admittedly was most of a decade ago), Facebook was *also* running MySQL. They were sharding across tens of thousands of MySQL instances. And their reading page displays not only entries but also comments, so there's another relational layer in what you'd think was the code for a page display. Plus FB updates are little and frequent, and comments are a rapid-fire back-and-forth conversation, so consistency matters.

Locking in this space is impossible and Facebook doesn't even try. They don't display a consistent view of your reading page because they *can't*. And they employ entire teams of engineers to constantly change up the display algorithms to hide this fact, and to destroy any expectation of a consistent rendering algorithm. This means they are using nearly none of the features of MySQL beyond replication (but wow, are they using replication).
These two examples represent the very small and the very large to illustrate that it's pretty hard to be a medium sized (like, 50,000-100,000 peak transactions per second) user of MySQL. And small commercial users typically want to grow into medium sized users, or at least their shareholders want them to. That's really what's driving most of the use of NoSQL.
no subject
Date: 2020-07-21 06:18 pm (UTC)
So, do you think that the fact that NoSQL databases seem not to foster or encourage normalizing the database schema is a feature rather than a bug?
Here's the way I was trained to think as a relational database SQL user. Let's say that we want to display the text of each comment with the name of the commenter. I'd write a query like:
select c.text, u.name from Comment c, User u where c.posterID = u.userID and c.blogID = ?
A particular User's name is stored in only one place in the database, so if "Eric Bishop" becomes "Erin Bishop", the new name appears everywhere, in every page rendered after the update was written to the User table. Whereas the way my professor seems to be suggesting one should use Mongo, one hands the object { commentText: "Hey great point!", poster: "Eric Bishop" } to Mongo, and the user's name at that moment is immortalized in that document. Nobody is ever going to slog through all the leaves of the data structure and update them. Nor can you easily do a query for "all comments posted by this user ever", which might be useful in various ways.
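Here's a toy in-memory JavaScript sketch (invented data, no actual database) of what I mean: with the normalized shape, a rename shows up in every subsequent render, while the embedded copy of the name is frozen at write time.

```javascript
// Toy sketch: normalized vs. embedded storage of a commenter's name.
const users = new Map([['u1', { name: 'Eric Bishop' }]]);

// Normalized: the comment stores only a reference to the user.
const normalizedComment = { text: 'Hey great point!', posterID: 'u1' };
// Embedded: the comment stores a copy of the name as of write time.
const embeddedComment = { text: 'Hey great point!', poster: users.get('u1').name };

// Rendering resolves the reference at read time -- a tiny "join".
const render = (c) => `${users.get(c.posterID).name}: ${c.text}`;

users.get('u1').name = 'Erin Bishop'; // the user renames themselves

console.log(render(normalizedComment)); // 'Erin Bishop: Hey great point!'
console.log(embeddedComment.poster);    // still 'Eric Bishop'
```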
Would you say that the performance price for the consistency that normalization brings is too high? I can see your point, if doing all these joins across, say, Blog, Comment, and User is going to lock the User table so much that nothing ever gets done.
Or do you think that my professor is leading the class down the wrong path by not normalizing the schema? Should he have added a Person document to a Person collection, and then a professorID in the Course object? But then to get the professor's name associated with a Course there has to be a $lookup, within an aggregation... and this tutorial suggests that if you're doing that a lot, you're doing it wrong. I haven't read very much about NoSQL databases yet, but what I've read so far seems to suggest that one should duplicate data as needed so there's no temptation to do anything like a join ever.
I'm just bewildered. I see relational data everywhere. The world is full of things that participate in multiple relationships with other things. I'm not getting how MongoDB accommodates that fact.
no subject
Date: 2020-07-21 07:08 pm (UTC)

no subject
Date: 2020-07-22 10:38 am (UTC)
But. I'm trying to figure out what is the right way to use MongoDB. (Maybe it's different for the NoSQL database you're most familiar with.) The documentation here https://docs.mongodb.com/manual/core/data-model-design/ basically says there are two options for encoding a relationship: either an Embedded Data Model or a Normalized Data Model. With the Embedded Data Model, normalization is abandoned completely; I don't see a mechanism to propagate updates. This is the case where we don't get even eventual consistency, I guess. On the other hand, with a Normalized Data Model, $lookup seems to impose even more of a performance concern than a good old-fashioned SQL join. (I haven't looked into the $graphLookup feature yet.)
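As I read that page, the two encodings look roughly like this (my own invented field names, written as plain JavaScript object literals rather than real MongoDB documents):

```javascript
// Embedded Data Model: the professor's name is copied into the course document.
const embeddedCourse = {
  courseNumber: 'cs601',
  professor: { firstName: 'Eric', lastName: 'Bishop' },
};

// Normalized Data Model: the course holds only a reference; getting the name
// back at read time requires a $lookup (or a second query) against the
// professors collection.
const professorDoc = { _id: 'p1', firstName: 'Eric', lastName: 'Bishop' };
const normalizedCourse = { courseNumber: 'cs601', professorID: 'p1' };
```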
So the trade-off there still seems to be consistency versus performance. (This trade-off exists with relational databases too. One is supposed to, as a first step, think through how to normalize the heck out of the data model, but there's nothing stopping one from saying "eh, it's going to be too much of a performance hit to join that many tables, let's denormalize that relationship.") What you seem to be saying is that there's a mechanism to achieve performance plus eventual consistency. I'm going to look for that in the MongoDB documentation, but do you have some idea what it would be shaped like?