chhotii: (apple)
[personal profile] chhotii

Call me old-fashioned, but I do not grok NoSQL databases.

The "No" in "NoSQL" is supposed to stand for "Not only", supposedly. But I think people think of it as "NO SQL" like "look! No SQL in my codebase! Hah hah hah hah I got away from SQL! Yay!" Like... SQL is a bad thing?

OK, I know, SQL is terribly dorky. And there's the infamous object-relational impedance mismatch. Object Relational Mapping (ORM) frameworks try to overcome that, but they are tricky to get right.

But... when the ORM runs into difficulty, it's because there's a really real issue to grapple with. Here are some really real issues:

* How do you deal, in the most general way, with foreign key references? Let's say you have a course catalog database with tables named Course and Professor. Let's keep this simple and say that each Course has one Professor. Thus a Many-to-One relationship from Course to Professor, which, in a relational database, we deal with by having a professorID column in the Course table. OK. When we fetch a list of Courses into our application, do we just select from the Professor table? Then we just have the professorID for each Course, not the names of the Professors. If we're showing a listing, on which we want to show the professor names, a SELECT from Course could get followed up by 1,000 queries on the Professor table as each professorID is resolved. Or, do we do a join? That's the smart way to use the relational database. But how does the ORM know that in one context, when we get a bunch of Professor objects, we will be looking at Professor objects and should do a join, but in another context not?

* How do we cope with changes to the database schema? Maybe records created before some date are different, in terms of what data they contain.

When folks announce that they have developed some tremendous magical new tool that does away with all the headaches of SQL and ORM, often they've just swept the issues under the rug. They show off some demo that they've CRUD some data with nothing that looks like SQL involved like wheee!!! But the demo is so small it just doesn't run into any foreign key or versioning issues. Or the issues are so small they can be ignored.

Example, here's some sample code from my professor, who is teaching us the minimal basic getting started in MongoDB and the mongoose module:

var mongoose = require('mongoose')
var Schema = mongoose.Schema;
var courseSchema = new Schema({
    courseNumber: String,
    courseName: String,
    courseDevelopers: [
        {firstName:String, lastName: String}
    ]
});
// skipping buncha connection etc. code here...
course = new Course({
    courseNumber: 'cs601',
    courseName: 'Web Application Development', 
    courseDevelopers: [
        { firstName:'Eric', lastName:'Bishop'}
    ]
});
course.save();


Already I am screaming. As someone trained to think in terms of normalizing relational databases, I see a glaring hideous problem here. In case it's not obvious yet, let's go on:

course = new Course({
    courseNumber: 'cs602',
    courseName: 'Server Side Web Development',
    courseDevelopers: [
        { firstName: 'Eric', lastName:'Bishop'},
        { firstName:'Suresh', lastName:'Kalathur'}
    ]
});
course.save();


AAAAAAAAaaaaaa!!! Seriously, you are going to have someone enter the name 'Eric Bishop' twice, for the two different Courses? No no no no. Not to mention the fact that that might not make for the most efficient data-entry UI; what if there's a spelling or hyphanization or whether one uses the midddle name discrepancy? Then we can't do a query relably for "what are all the classes taught by Eric Bishop?" What if there's more information we should have on each Professor? Seriously, do we have to explain the benefits of database normalization all over again?

Looking at the MongoDB documentation. Surely there's a way to attach existing ProfessorSchema documents from the Professor collection. But does that create a linkage in the database, or does it duplicate data? If 'Eric' 'Bishop' is updated to be 'Erin' 'Bishop' now in the Professor collection, is that reflected in the course listing when Course documents are fetched? In other words did they re-invent the join?

Well... sort of. There's the $lookup feature. Not sure how scalable that is for the full range of uses of the join concept.

Here's an article that stronly suggests to me that the NoSQL fad is a cooler of Kool-Aid: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

Date: 2020-07-21 02:44 pm (UTC)
totient: (Default)
From: [personal profile] totient
Speaking as a stockholder in a NoSQL company:

The problem with SQL is that it obscures the performance impact of your ORM, particularly as it relates to locking. The promise of SQL is the promise of ACID: that your program will behave as if every SQL statement is run with a global write lock held on the entire database. Actual SQL implementations have read-write locks and fine grained locking and all kinds of cleverness to avoid livelock, but if you're doing anything at all complicated you have to put a whole lot of effort into performance.

I'll give two examples. First: Dreamwidth runs on MySQL, because it inherited a codebase from LiveJournal which was written when that was the only game in town. It's a relatively small site as social media sites go, and as a long form blogging site entries are updated relatively slowly, and the partitioning is easy. Still, friendships are arbitrary and a flist display will cross partition boundaries. Most page displays will have to get a read lock on large numbers of partitions of the entries table. Dreamwidth solves this by not growing to the point where they have to think about it, and by not displaying comments inline in your reading page. In practice the app does not care whether the version of your entry being displayed is consistent with the version of the next entry in my reading page, so it could get those read locks sequentially instead of having to hold them all at once. In other words, tell MySQL to skip the C in ACID, by mostly abandoning the relational features of the database and moving them into the app. But now you are doing a lot of work to split up the SQL query into little bits. Dreamwidth does not have engineering time for this. And more to the point, now you need engineers, and probably operational database tuners, who understand when this is necessary, which means they are thinking about what's really going on inside the database rather than the nice clean API the database presents. If the LiveJournal code had been written against a NoSQL database they'd have an API that aligns with the underlying database implementation in a way that makes the most performant solution more visible, and this would be easier, and in fact it'd probably already have been done by the LiveJournal engineers back in the aughties when they had the resources to do it.

My second example is in the same space: Facebook. Amazingly enough, the last time I looked (which admittedly was most of a decade ago), Facebook was *also* running MySQL. They were sharding across tens of thousands of MySQL instances. And their reading page displays not only entries but also comments, so there's another relational layer in what you'd think was the code for a page display, plus FB updates are little and frequent, and comments are a rapid fire back and forth conversation so consistency matters. Locking in this space is impossible and Facebook doesn't even try. They don't display a consistent view of your reading page because they *can't*. And they employ entire teams of engineers to constantly change up the display algorithms to hide this fact, and to destroy any expectation of a consistent rendering algorithm. This means they are using nearly none of the features of MySQL beyond replication (but wow, are they using replication).

These two examples represent the very small and the very large to illustrate that it's pretty hard to be a medium sized (like, 50,000-100,000 peak transactions per second) user of MySQL. And small commercial users typically want to grow into medium sized users, or at least their shareholders want them to. That's really what's driving most of the use of NoSQL.
Edited Date: 2020-07-21 02:59 pm (UTC)

Date: 2020-07-21 07:08 pm (UTC)
totient: (Default)
From: [personal profile] totient
I still think normalization is important. Where NoSQL is winning is by abandoning immediate and in some cases even eventual consistency. That is, changes to someone's display name may be partially propagated during what a SQL programmer would think of as a query so that you'd see a thread with a mix of the two names, or comments on a post that was deleted more or less simultaneously with the posting of the comment might show up as successful (and generate notifications) to the commenter but not the OP, or vice versa. If such a comment winds up forevermore included in "all comments made by $commenter" but not "all comments made on $poster's posts", even if that latter category otherwise does include comments made on deleted posts... well who really cares? Probably no one. Why pay the performance price?

Profile

chhotii: (Default)
chhotii

July 2023

S M T W T F S
      1
2345678
9101112131415
16 171819202122
23 242526272829
3031     

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Feb. 27th, 2026 05:16 pm
Powered by Dreamwidth Studios