So, you are not expecting that your co-workers have good reasons for what they are doing? Maybe the hiring bar at your place is too low then.
I prefer to work at places where my default assumption is that everybody around me is smart and responsible. Lifts lots of worries off my shoulders (and tends to benefit the stock price over time too and thereby my income).
My coworkers have called out gaps in my thinking thousands of times when I have explained perceived needs to them; that's one of the main value-adds one gets from working in a team.
If I wanted unquestioned control, I'd run my own shop. If I want the best product, then I hope that people question my assumptions.
We are not in disagreement here. Bouncing ideas and thoughts off each other is a good thing.
The way this was phrased was more from the angle "who knows what these guys were thinking; if they can't give me a good reason, no way they will get storage space as I don't trust that they make good decisions on their own".
Generally? No. Not because they are not smart, but because in a large company, each individual has different goals and priorities - that's why we have e.g. SREs as dedicated roles - and it takes a bit of effort to find the intersection between all these.
Let's say I work in DevOps and want to optimize cloud costs. In that case, I would challenge the size of everything, the use of higher-cost services, the number of regions, all that - but the team might want more regions and bigger resources to improve latency and performance, and use more high-cost services for developer experience, and ship features without having to think about utilization.
It's a tug of war, and only works when you have forces on both sides to balance out. Being too conservative might stall innovation or make things too slow to save a buck, not being conservative enough might drain funds or make things impossible to scale.
I believe you are intentionally misunderstanding. The term "tug of war" is not used to indicate armed conflict or even a problem. It indicates balancing forces that you want to maintain - pull the rope too far to one side, and you end up in a suboptimal extreme.
Unless you work with clones of yourself, there will always be differences in opinions and priorities, and not every feature and bug fix can be a company-wide stakeholder meeting, and you certainly will not get any social points for trying to micro-manage other teams.
Of course there will be differences. That's why you sit down and plan things together, pulling in and coordinating with all _relevant_ stakeholders. Of course not the whole company.
But the attitude needs to be "let's put the requirements on the table and see what we can do" instead of "you don't get what you want unless you give me a good reason". The latter comes from an angle of distrust which I'm arguing against. The former comes from an angle of collaborative problem solving.
In a company in which I go to a team relevant to a project and like to engage in a discussion and am met with an attitude of "unless you give us a good reason we'll stop talking to you", the atmosphere is not one that will keep me personally for long. YMMV.
> I believe you are intentionally misunderstanding.
You are free to believe what you like. Opening a reply with such a sentence is pretty sad though. It does not foster a healthy atmosphere, nor does it match reality, I might add.
> Opening a reply with such a sentence is pretty sad though. It does not foster a healthy atmosphere, nor does it match reality, I might add.
Your response got hung up on a single word ("war") within a common phrase ("tug of war", a game). While it might have been accidental, such answers distract from the actual discussion (and tend to be used as distractions when no good answer is present).
> Of course there will be differences. That's why you sit down and plan things together, pulling in and coordinating with all _relevant_ stakeholders.
When you discuss new architectures or large projects, this is a given, but this covers only a small portion of company operation - the rest is organic day-to-day work, which slowly but surely distorts initial assumptions. Slowly boiling the frog, so to speak. Think one team making changes that affect request patterns, another team making something that is accidentally quadratic, and a third team suddenly asking for a large number of cloud resources to carry all this, a request that should absolutely be challenged.
And at the same time, teams are under different organization units with different budgets, schedules, leaderships and priorities - and most certainly don't care about daily scrum work of other teams.
> In a company in which I go to a team relevant to a project and like to engage in a discussion and am met with an attitude of "unless you give us a good reason we'll stop talking to you", the atmosphere is not one that will keep me personally for long. YMMV.
No one said "we'll stop talking to you", but "you get what can be justified". If you take offense at being challenged and would rather work somewhere else, you do you, but if you can't justify your request I'd argue that you are not doing your job properly in the first place.
Related to that, last year Uber's engineering blog mentioned very interesting results with their internal log service [1].
I wonder if there's anything as good in the open-source world. The closest thing I can think of is ClickHouse's "new" JSON type, which is backed by columnar storage with dynamic columns [2].
The design described there is what Uber should be logging in the first place. Instead they are logging the fully resolved message and then compressing it back into the templated form.
However, compressing back into the templated form is a good idea if you have third-party logs that you want to store and you cannot rewrite the logging to generate the correct form in the first place.
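To make the "compress back into the templated form" idea concrete, here is a minimal sketch. It is not Uber's implementation: it assumes, purely for illustration, that every number in a message is a variable and everything else is the template. The names `to_template`, `from_template` and `compress` and the `<*>` placeholder are hypothetical.

```python
import re

# Assumption for this sketch: any integer or decimal number is a variable.
# Real systems use far richer rules for what counts as a variable.
VARIABLE = re.compile(r"\b\d+(?:\.\d+)?\b")
PLACEHOLDER = "<*>"


def to_template(message):
    """Split a resolved log line into (template, list of variable values)."""
    variables = VARIABLE.findall(message)
    template = VARIABLE.sub(PLACEHOLDER, message)
    return template, variables


def from_template(template, variables):
    """Rebuild the original line by filling placeholders left to right."""
    values = iter(variables)
    return re.sub(re.escape(PLACEHOLDER), lambda _match: next(values), template)


def compress(messages):
    """Store each distinct template once and keep only (template id, variables) per line."""
    templates = {}  # template text -> integer id
    rows = []
    for message in messages:
        template, variables = to_template(message)
        template_id = templates.setdefault(template, len(templates))
        rows.append((template_id, variables))
    return templates, rows


if __name__ == "__main__":
    logs = [
        "Task 1 finished in 2 seconds",
        "Task 7 finished in 10 seconds",
    ]
    templates, rows = compress(logs)
    print(templates)
    print(rows)
```

The sketch breaks if a message already contains the literal placeholder text, which is one reason logging the template and variables separately at the source is the cleaner design when you control the logging code.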
Neat! The only downside of this approach is having to force the developers to use the library, which can work in some companies. On the other hand, other approaches discussed previously like Uber's don't require any change in the application code, which should make adoption way simpler.
I'd guess the worry is that once you increase the storage, you never decrease it again. Ever. It's a one-way street. So, once everything is 5x over-provisioned, then the services tend to fill that space anyway (cause why not be wasteful if it doesn't cost anything) and a year later you are in the same seat again.
I'm not saying this is real, but the worry certainly is.
That's certainly real and something to consider when provisioning systems. I'm fully on board with that. The problem is when the cost of the cost-savings solution vastly outweighs the cost of over-provisioning infrastructure. Like this Jenkins issue bubbling up ~2-4 times a month vs just giving the worker nodes more storage space. There have been times where it happened during the night and people got paged.
Or comparing the cost of one store not being able to open on time because the RDS database's space ran out. VPs and directors start yelling and there's suddenly like 20+ people involved in figuring out why this one store didn't open on time. What's the cost of that compared to just giving the DB 250GB of space so this never comes up again?
But you are also 100% correct and I've seen that happen here, too. There are some instances I'm responsible for that were using EFS for their local storage. Costing thousands of dollars every month for absolutely no reason. I switched those to reasonably-sized EBS volumes and that alone was half of my annual savings goal.
I was completely flabbergasted seeing these instances using EFS while others were stuck on 8GB EBS volumes. Backups on the EFS drives had ballooned to the many TBs. And the backups were worthless! Instances themselves are ephemeral. They use S3 for long-term storage & metadata is on a database. Those are the things that should be backed up & their cost compared to EFS is minuscule.
> compared to just giving the DB 250GB of space so this never comes up again?
As long as there is reasonable confidence that this is actually the case, then just provision the space and be done with it. That requires a certain understanding of future space requirements/expectations, and anything even just so slightly running away / leaking space will hit any limit given enough time. So, due diligence requires looking at whether it's actually needed.
Yup, I implemented a bunch of graphs and alerts. Right now it's at 100GB of usage so it's still growing but at a fairly predictable rate. Another nice thing to know is if it's possible to reduce that usage. I haven't been able to look into that but I know one of the causes of the usage increase. The service uses the DB to store some indexing data. There's a team forcing it to re-index and I can tell when they deploy because the storage spikes a little bit every time they do a deployment. Nothing I can do about that, sadly.
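A rough sketch of the kind of projection such graphs allow: fit a straight line to usage samples and estimate how long until the provisioned size is reached. The 100 GB current usage and the 250 GB size come from the comments above; the individual samples and the function name `days_until_full` are made up for illustration, and a linear fit only makes sense while growth really is steady.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until capacity is reached from (day, used_gb) samples.

    Uses an ordinary least-squares line through the samples.
    Returns None if usage is flat or shrinking.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    spread = sum((day - mean_day) ** 2 for day, _ in samples)
    if spread == 0:
        raise ValueError("samples must cover more than one day")
    slope = sum((day - mean_day) * (used - mean_used) for day, used in samples) / spread
    if slope <= 0:
        return None
    _last_day, last_used = samples[-1]
    return (capacity_gb - last_used) / slope


if __name__ == "__main__":
    # Hypothetical weekly samples ending at the 100 GB mentioned above.
    weekly = [(0, 94.0), (7, 96.0), (14, 98.0), (21, 100.0)]
    print(days_until_full(weekly, capacity_gb=250))
```

The deployment spikes from re-indexing would show up as noise around the line, so an alert on the projected date is probably more useful than one on any single sample.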
How a Hollywood star lobbies the EU for more surveillance
The European Union debates a new law that could force platforms to scan all private messages for signs of child abuse. Its most prominent advocate is the actor Ashton Kutcher.
To put more context on this: movies made with military equipment need special approval from the US military, which will not generally approve “unpatriotic” movies and blacklists studios that make movies it disapproves of… from Defense dot gov:
“The Defense Department has a long-standing relationship with Hollywood. In fact, it’s been working with filmmakers for nearly 100 years with a goal that’s two-fold: to accurately depict military stories and make sure sensitive information isn’t disclosed.”
If this is information anyone can get by flying a drone around military equipment or buying old equipment offered for sale, this seems like a startling suppression of First Amendment rights.
How so? Hollywood wants to make money. And most Hollywood decision makers want to brag about having Important Friends in High Places (like the DoD). Producing anti-war movies (or otherwise making the DoD look bad) would get them ~nothing that they really want.
That goes down the same conspiracy theory rabbit hole as claiming that all state employee bureaucrats just try to bloat their dept. to have more power, which ultimately wastes tax money.
Some people have standards and want to do good. Some of them work in Hollywood. And some of them in your city's administration. Not everybody is as selfish and unethical as portrayed here.
Maybe it reflects on the person expressing such theories though.
The theory is actually that some people have standards, some people just want to advance, and in the long run the latter will inevitably dominate the organization.
Yep. About as "conspiracy" as "managers in the XYZ Corp. Sales Dept. are all eager to make XYZ Corp's products look good".
Meanwhile... the US Army's Public Affairs Dept. is headed by a 2-star general (same rank as the top general in command of an entire Army Division), and its mission is:
"Public affairs fulfills the Army's obligation to keep the American people and the Army informed, and helps to establish the conditions that lead to confidence in America's Army and its readiness to conduct operations in peacetime, conflict and war."
It seems like the article is written by someone just starting to get into the data engineering subfield. They thought they were going to be writing Python (PySpark is my guess) to support some kind of ML effort, but they got saddled with a bunch of SQL/data warehousing stuff to support business intelligence/analytics instead.
I'd say normally what you say makes sense especially when you're pulling in abbreviations that are not related to the topic at hand or you're introducing new people to the field, but ETL is a pretty basic concept in data engineering and it's a web search away (should be the top result), so I'm not sure if it would really add all that much to their article to start with definitions.
It sounds to me like the author got thrown to the wolves in an environment of what data engineering looked like before "big data" and ML took off (and before it was even really called data engineering). There are a lot of enterprises that are still working in this mode because they are not Google and they don't have the same level of sophistication and automation when it comes to this stuff.
There is some bad information no doubt in the article, but if we're being charitable, it feels like it's someone who took a wrong turn somewhere and is struggling to find their feet in an unfamiliar place without the proper guidance and mentorship and that's a bit admirable at least that they're trying on their own.
The article has no direct bearing on ETL, although the focus on SQL queries and data validation hints that they might be talking about ELT (Extract-Load-Transform) as the level beyond ETL; it's just not clearly explained. It's clear to me that they are at the start of their journey and they are gonna learn things the hard way without guidance from someone more experienced.
> There is some bad information no doubt in the article
Could you share more specific details? Happy to look over / revise where needed.
More broadly, there is the issue of the gap between what you think the role is and what the role actually is when you join. There are definitely cases where this is accidental. The best way I can think of to close the gap is to maybe do a short-term contract, but that may be challenging to do under time constraints etc.
> Could you share more specific details? Happy to look over / revise where needed.
Sure thing! I'd say first off, the solutions may look different for a small company/startup vs. a large enterprise. It can help if you explain the scale at which you are solving for.
On the enterprise side of things, they tend to buy solutions rather than build them in-house. Things like Informatica, Talend, etc. are common for large enterprises whose primary products are not data or software related. They just don't have the will, expertise, or the capital to invest in building and maintaining these solutions in-house so they just buy them off the shelf. On the surface, these are very expensive products, but even in the face of that it can still make sense for large enterprises in terms of the bottom line to buy rather than build.
For startups and smaller companies, have you looked at something like `dbt` (https://github.com/dbt-labs/dbt-core) ? I understand the desire to write some code, but often times there are already existing solutions for the problems you might be encountering.
ORMs should typically only exist on the consumer side of the equation, if at all. A lot of business intelligence people / business analysts are just going to use tools like Tableau and hook up to the data warehouse via a connector to visualize their data. You might have some consumers that are more sophisticated and may want to write some custom post-processing or aggregation code, and they could certainly use ORMs if they choose. But it isn't something you should enforce on them, because it's a poor place to validate data: as mentioned, there are different ways/tools to access the data, and not all of them are going to go through your Python SDK.
Indeed, in a large enough company you are going to have producers and consumers that use different tools and programming languages, so it's a little bit presumptuous to write an SDK in Python there.
Another thing to talk about, and this probably mostly applies to larger companies - have you looked at an architecture like a distributed data mesh (https://martinfowler.com/articles/data-mesh-principles.html)? This might be something to bring to the CTO more than try to push for yourself, but it can completely change the landscape of what you are doing.
> More broadly, there is the issue of the gap between what you think the role is and what the role actually is when you join. There are definitely cases where this is accidental. The best way I can think of to close the gap is to maybe do a short-term contract, but that may be challenging to do under time constraints etc.
Yeah this definitely sucks and it's not an enviable position to be in. I guess you have a choice to look for another job or try to stick it out with the company that did this to you. It's possible there is a genuine existential crisis for the company and a good reason why they did the bait-and-switch. Maybe it pays to stay, especially if you have equity in the company. On the other hand, it could also be the case that it is the result of questionable practices at the company. It's hard to make that call.
Perhaps the first thing I’d clarify is not all the ‘bad’ things described happened to me personally, and out of the ones that did, I employed artistic licence in the recollection.
We did start integrating dbt towards the end of my time in the role. Our data stack was built in 2018, so a fair bit of time before data infra-as-a-service became a thing. The idea is dbt would help our internal consumers to more easily self serve. That said I did see complaints about dbt pricing recently; as they say there’s no free lunch.
Re: ORMs, I respectfully disagree. I’ve come across many teams that treat their Python/Rust/Go codebase with ownership and craft; I can't say the same about SQL queries. It’s almost like a ‘tragedy of the commons’ problem - columns keep getting added, logic gets patched, more CTEs get added to abstract things out, but in the end it all adds to the obfuscation.
ORMs don’t fix everything, but they do help constrain the ‘degrees of freedom’ and help keep logic repeatable and consistent, and they are generally better than writing your own string-manipulation functions. An idea I had, had I continued (I wrote the post early last year), was to use static analysis tools like Meta’s UPM to allow refactoring of tables / DAGs (keep interfaces the same but ‘flatter’ DAGs, fewer duplicate transforms).
Interestingly enough, I currently work on ML and am impressed to see how much modeling can be done in the cloud compared to my earlier stint in the space (which had a dedicated engineering team focused on features and inference). On the flipside I similarly see an explosion of SQL strings, some parts handled with more care than others.
I’ve not looked into a data mesh but a friend did mention pushing his org to embrace it - self note to follow up to see how that's going. Looks like there are a couple of ‘dimensions’ to it; my broader take is that keeping things sensible is both a technical and organizational challenge.
I look forward to future blog posts on ‘how we refactored our SQL queries’, maybe there’s a startup idea there somewhere.
> Re: ORMs, I respectfully disagree. I’ve come across many teams that treat their Python/Rust/Go codebase with ownership and craft; I can't say the same about SQL queries. It’s almost like a ‘tragedy of the commons’ problem - columns keep getting added, logic gets patched, more CTEs get added to abstract things out, but in the end it all adds to the obfuscation.
> ORMs don’t fix everything, but they do help constrain the ‘degrees of freedom’ and help keep logic repeatable and consistent, and they are generally better than writing your own string-manipulation functions. An idea I had, had I continued (I wrote the post early last year), was to use static analysis tools like Meta’s UPM to allow refactoring of tables / DAGs (keep interfaces the same but ‘flatter’ DAGs, fewer duplicate transforms).
I get what you're saying, but think about a large org with a lot of different teams and heterogeneous data stores - it's gonna be pretty hard to implement a top-down directive to tell everyone to use such and such ORM library, or to ensure a common level of ownership and craft. This is where SQL is the lingua franca: it is usually the native language of the data stores themselves and a common factor between most/all of them. This is also where tools like Trino / PrestoSQL can come in and provide a compatibility layer at the SQL level while also providing really nice features such as being able to do joins across different kinds of data stores / query optimization / caching / access control / compute resource allocation.
In general it's hard to get things to flow "top down" in larger orgs, so it's better to address as much as you can from the bottom up. This includes things like domain models - it's gonna be tough to get everyone to accept a single domain model because different teams have different levels of focus and granularity as they zoom into specific subsets so they will tend to interpret the data in their own ways. That's not to say any of them are wrong, there's a reason why that whole data lake concept of "store raw unstructured data" came in where the consumer enforces a schema on read. This gives them the power to look at the data from their own perspective and interpretation. The more interpretation and assumptions you bake into the data before it reaches the consumers, the more problems you tend to run into.
That's not to say that you can't have a shared domain model between different teams. There are unsurprisingly also products out there that provide the enterprise the capability to collaboratively define and refine shared domain models, which can then be used as a lens/schema to look at the data. Crucially the domain model may shift over time, so this decoupling of the domain model from the actual schema of the stored data allows for the domain model to evolve over time without having to go back and fix the stored data because we have not baked in any assumptions or interpretations into the stored data itself.
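A toy sketch of schema-on-read, to illustrate the decoupling described above: the stored data stays raw, and each consumer applies its own view when reading. The records, the field names, and the names `read_with_schema`, `BILLING_VIEW` and `GEO_VIEW` are all hypothetical, and real data lakes do this with query engines rather than a few lines of Python.

```python
import json

# Raw events are stored as-is; no interpretation is baked in at write time.
RAW_EVENTS = [
    '{"user": "a1", "amount": "12.50", "region": "eu"}',
    '{"user": "b2", "amount": "3", "extra": "ignored by these consumers"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a consumer-defined schema (field name -> cast function) at read time.

    Fields missing from a record become None; fields outside the schema are dropped.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two teams look at the same stored data through different lenses.
BILLING_VIEW = {"user": str, "amount": float}
GEO_VIEW = {"user": str, "region": str}

if __name__ == "__main__":
    print(list(read_with_schema(RAW_EVENTS, BILLING_VIEW)))
    print(list(read_with_schema(RAW_EVENTS, GEO_VIEW)))
```

If the billing team later decides amounts should be decimals rather than floats, only `BILLING_VIEW` changes; the stored events and the other team's view are untouched.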
This is correct from my understanding, and from skimming the article and links, it is precisely what they're talking about.
It's a pipeline that takes data from one system, transforms it, and loads it into another. There is a whole industry full of software products that facilitate this, such as MuleSoft. It is referred to as ETL (Extract, Transform, Load).
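For anyone who was lost, a minimal sketch of the three steps using two in-memory SQLite databases as stand-ins for the source and target systems. The table names, columns, and cleaning rules are invented for illustration; commercial ETL products add scheduling, connectors, monitoring and much more on top of this basic shape.

```python
import sqlite3


def extract(source):
    """Extract: pull raw rows out of the source system."""
    return source.execute("SELECT id, name, amount_cents FROM orders").fetchall()


def transform(rows):
    """Transform: drop rows without an amount, tidy names, convert cents to units."""
    return [
        (order_id, name.strip().title(), cents / 100)
        for order_id, name, cents in rows
        if cents is not None
    ]


def load(target, rows):
    """Load: write the cleaned rows into the target system."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS orders_clean "
        "(id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    target.executemany("INSERT INTO orders_clean VALUES (?, ?, ?)", rows)
    target.commit()


def run_demo():
    """Build a hypothetical source, run the pipeline, and return the loaded rows."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE orders (id INTEGER, name TEXT, amount_cents INTEGER)")
    source.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "  alice ", 1250), (2, "BOB", None), (3, "carol", 300)],
    )
    target = sqlite3.connect(":memory:")
    load(target, transform(extract(source)))
    return target.execute(
        "SELECT id, customer, amount FROM orders_clean ORDER BY id"
    ).fetchall()


if __name__ == "__main__":
    print(run_demo())
```

ELT, mentioned elsewhere in the thread, reorders the last two steps: the raw rows are loaded into the target first and the transformation runs there, typically as SQL.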
I was surprised to see your succinct correct answer greyed out.
Right, though neither the OP article nor the article linked inside it defined or described ETL. If it's in your title, please explain it; I was lost.