There is a lot of truth to this. Some ORMs, like Django's, perform joins in surprising ways.
A simple example is, say, foreign keys. Trying to access the foreign key of an object by doing `book.user.id` does an additional query for the user table to get the ID. It's less known that the id is immediately available by just doing `book.user_id` instead.
I've spent time optimising things like text searches down from 2000+ queries to about 4, and one of the more noticeable things to me isn't actually the number of joins, but rather the SELECTs that take place. Many of these ORMs do a `SELECT *` unless you explicitly tell them otherwise, and when dealing with largish datasets, or on models that have large text fields, this translates into significant time spent serialising those attributes. So you can optimise the query and still have it take a long time, until you realise that limiting the columns in the initial `SELECT` is probably more effective than limiting the number of joins.
The most insidious part about misusing ORMs is that it's often not visible for a while. Modern DBMSs on modern hardware are crazy fast, so when you have only a few tens or hundreds of thousands of rows in your table, those inefficient and pointless ORM queries are just not noticeable because you still get sub-second response times. As your database grows, the site begins to get slower and slower, but it's hard to distinguish between the real problem and "I guess we're just handling more requests per second".
I personally love tools like Miniprofiler [1] for this (though maybe there's something better today, it's been a while since I've worked on that type of thing). It's a constant and accessible way to keep an eye on what goes into each request, and I've caught many of those bad queries before they were problems by using it (eg: "WTF, why did it take 9 queries and 250ms to grab what looks to be a single row from a single table?!").
To be fair, this is a problem inherent to databases in general. Hand-written queries can also perform badly due to structure or query frequency, in ways that aren't apparent until the dataset grows. The ORM should make it easier to rectify such situations (e.g. drop in an eager-loading directive) versus having to restructure hand-written routines for similar effect.
Indeed, even with a query analyzer you might see, say, a table scan instead of an index scan, simply because the DB realizes that scanning the 100 rows you've got is faster than trying to use an index.
So without a large number of rows it can be hard to know what it will actually do.
I encountered this a few times and started adding tests that assert each handler only executes the expected number of queries (and no more). If the application code is modified such that this N+1 query pattern occurs, the test will immediately fail, and you go optimise the query. Problem solved.
> A simple example is, say, foreign keys. Trying to access the foreign key of an object by doing `book.user.id` does an additional query for the user table to get the ID. It's less known that the id is immediately available by just doing `book.user_id` instead.
Hibernate (on Java) at least optimizes this specific use-case. At first, accessing a lazy-loaded property-object will give you a "proxy" and you can access the ID without incurring a database load (since it knows that anyway). And when doing a query, the object won't be joined when requesting book.user.id unless it needs to be (like you have some other WHERE clause that requires an actual join on that row).
> Trying to access the foreign key of an object by doing `book.user.id` does an additional query for the user table to get the ID. It's less known that the id is immediately available by just doing `book.user_id` instead.
But that's not really unsuspecting. `book.user` is asking for the user table, `book.user_id` is not. Those two things are not identical even though they return the same value.
Depends on whether you're doing Active Record (which I don't consider a typical ORM) or a data mapper like Hibernate, where your entity is a POJO with an id and annotations but typically no foreign keys (just references to other model classes).
Hibernate is then the magic environment where "it just works".
I realize OO is out of fashion now, but it's still true, it still works, and I've been on lots of projects where an ORM was useful.
> A simple example is, say, foreign keys. Trying to access the foreign key of an object by doing `book.user.id` does an additional query for the user table to get the ID. It's less known that the id is immediately available by just doing `book.user_id` instead.
Hmm.. Sounds like a bug. Why is this not the same value for a foreign key?
Well, it is the same value, but the ORM doesn't handle book.user.id any differently than book.user.name, where it isn't the same value, and so the only option is to fetch the second table. So it's not a bug; the ORM is just being consistent in how it handles queries, and misses a possible optimization in this special case where a simpler query could have given the same result.
No, that is clearly a bug. The ORM already has the value of book.id, that's how it knows how to fetch the right book. Performing extra queries is just poor implementation.
The programmer also already had the value in book.user_id, but still chose to ask the ORM to fetch all of .user so they could get .id from there instead. They might then call .name on it afterward, and there would be no further queries, because the ORM has already fetched every field of .user - so fetching the whole object may in fact have been the sensible thing to do. The query builder cannot know in advance whether the rest will be needed: Python is not a compiled language, so when executing the book.user.id query there's no way to tell that no further fields of .user will be accessed, and hence no way to know it should skip doing what it was asked (fetch the entire object) and instead return only the .id that's already available another way. So yes, this is suboptimal usage, but only the programmer can know that, so it falls to them to optimize if they want to.
Perhaps I'm a bit odd, but when I'm going to lean on an ORM to do things, I expect it to actually do them. I expect that foo.user_id does not exist, because that representation has been transformed into an object. foo.user.id should be the only viable reference to the id. foo.user.id should return the value it already knows; any other property access I would expect to do the equivalent of `select * from ...` if the object has not previously been populated.
Now perhaps some ORMs prefer to be thinner, to provide more footguns via a leaky abstraction that mixes implementation details with the object mapping. I don't think those are good implementations.
> Now perhaps some ORMs prefer to be thinner, to provide more footguns via a leaky abstraction that mixes implementation details with the object mapping. I don't think those are good implementations.
To me, it seems like the (hypothetical?) implementation you're talking about is much more leaky and footgun-y than the more straightforward ("thinner", in your words) version. In order for foo.user.id to not execute a new query, foo.user would have to return some sort of proxy object that only fetched the user row when you tried to access a field that hasn't been loaded. That's way more magic than the more obvious solution—which is to load the row when you access the related object—and could easily cause more problems than it solves in the long run when you need to debug very specific queries.
Furthermore, how is going out of your way to hide a field that exists in the database (user_id) not the definition of a leaky abstraction? What purpose does it serve to direct you through an unnecessary layer if all you need is the ID?
> I expect that foo.user_id does not exist, because that representation has been transformed into an object.
Or you can just consider user_id to be a reference pointer that is part of foo, while user.id is an attribute of user. Totally different things and I am glad that the distinction is there.
I’m sure there are valid engineering reasons to do it this way. One that comes to mind is the memory footprint of allocating the objects associated with foreign key references.
Fancier ORMs will return a proxy for foo.user that doesn’t issue a query until the user asks for a property on it. And it can return user.id without querying.
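A toy, ORM-free sketch of that proxy idea: the handle already knows the id from the foreign-key column and only triggers a load when some other field is read (all names here are invented):

```python
class LazyUser:
    """Stand-in for an ORM proxy around a not-yet-loaded row."""

    def __init__(self, user_id, loader):
        self.id = user_id        # known from the foreign-key column, no query
        self._loader = loader
        self._row = None

    def __getattr__(self, name):
        # Only called for attributes not set in __init__, i.e. real columns.
        if self._row is None:
            self._row = self._loader(self.id)  # the one deferred fetch
        return self._row[name]


issued = []


def fetch_user(user_id):
    issued.append(user_id)       # stand-in for an actual SELECT
    return {"name": "Ada", "email": "ada@example.com"}


user = LazyUser(7, fetch_user)
assert user.id == 7 and issued == []           # reading the id costs nothing
assert user.name == "Ada" and issued == [7]    # first other field: one fetch
assert user.email == "ada@example.com" and issued == [7]  # still just one
```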
But it doesn't have book.user itself. Unless you want Django to construct some empty proxy object representing book.user for this one particular optimization.
It's not a bug, it is the same value. Only instead of using `first_table.foreign_id` to fetch the entire record from `second_table` only to use `second_table.id`, if you only need the identifier itself you already have it in `first_table`.
A similar concept, the covered query, exists whereby you index a table by foreign key plus a few additional columns that you don't expect to use in join conditions but do expect to retrieve frequently. Depending on your database, a request that touches only columns in the index (some being the join condition, the rest being those popular additional columns) can be answered from the index alone, meaning faster access to those columns. In the context of ORMs, you would need to do something to avoid the default `SELECT *` behavior in order to exploit this index.
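This can be seen with nothing but the standard library's sqlite3 (table and column names invented): a composite index whose trailing column lets one query be answered from the index alone, while selecting a non-indexed column forces a table lookup. SQLite's plan output marks the difference:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER,"
            " title TEXT, body TEXT)")
# author_id is the join column; title is the frequently-read extra column.
con.execute("CREATE INDEX book_author_title ON book (author_id, title)")


def plan(sql):
    # The last field of each EXPLAIN QUERY PLAN row is the human-readable detail.
    rows = con.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


covered = plan("SELECT title FROM book WHERE author_id = 1")
not_covered = plan("SELECT body FROM book WHERE author_id = 1")

assert "COVERING INDEX" in covered          # answered from the index alone
assert "COVERING INDEX" not in not_covered  # body forces a table lookup
```

Hence the point about ORMs: a default `SELECT *` drags `body` into the column list and forfeits the covering behaviour.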
Not necessarily, it can be overridden: "id" is only the default for models that haven't explicitly been given a field with the "primary_key" kwarg (common on legacy tables where the primary key column might for example be "user_id").
The alias guaranteed to be the same value is ".pk", and I'm not sure what django does if you try to create a column named "pk" that isn't the primary key.
If your tables don't use id, or are doing something strange, YOU wrote that code. You should know it's going to be a thing you need to deal with, because YOU did something unusual.
Note that I gave the example of "legacy" tables. On long-lasting codebases there's a high likelihood that "you" didn't write that code and aren't aware of everything Django's doing under the hood.
`QuerySet.select_related()` and `QuerySet.prefetch_related()` are the bread and butter of Django query optimisation. I think most of the time that I've noticed a performance issue in our code, it's been easily fixed with one of those.
Django's ORM gets a lot of flak, but I don't remember the last time I had complex queries that I could not do with it.
You still need to understand a minimum of SQL and databases, and usually those who complain about the ORM are the ones who expect it to be a "sufficiently advanced compiler". But it has matured so much that nowadays the developers consider it a *bug* whenever the answer to "how do I do query X?" involves something along the lines of "use .extra or raw SQL".
This is true, though to be fair to the critics, the syntax through which you express these complex queries is often clunky and unintuitive. For example, I need to re-read the documentation every time I use the annotation API because it's generally not obvious how to use it, and I've run into a few edge cases where you need extra code/syntax just to deal with its nuances and ambiguities.
Even though Django has come a long way, I greatly prefer ORMs like SQLAlchemy and Ecto that map more closely to the SQL query I'm trying to write.
It would return the same value, but the approach to obtain it would be different.
`user` would be a property defined as a User object on the Book model, so accessing `book.user` will cause the framework to fetch the entire user model (even if we then only fetch the id).
On the other hand, `book.user_id` is the auto-generated database column, generated to make the above property definition possible. But since this `user_id` is directly defined on the book object, there is no need to query the user table.
If you're selecting tons of data when you SELECT * you might also have a god object. I prefer to have my model be a bit more split up by use rather than being full of random stuff. E.g. a customer_address table rather than stuffing all that data into customer, even if they only have a single address.
I'm pretty sure Django has an active record ORM, which are generally a bit rubbish in terms of performance. A unit of work ORM such as SQLAlchemy seems to generate much better queries.