I'm not surprised that Twitter has those teams, I'm always surprised that more places don't. About 22 years ago I was the only person at the company I worked for at the time who could analyze Solaris core dumps, and understood enough about the JVM to diagnose deep problems. In the 5 years until the rest of the engineering staff caught up to where they stopped needing me for every incident, I probably saved enough money to pay my salary 100x over.
Never saw a dime of that. The one time someone offered me a spot bonus to come solve a problem, they reneged on it: A manager in charge of a project came to me for help. We're crashing all the time, he says. IBM and the team can't figure it out, he says. If you can solve this, I'll give you a $5000 spot bonus, he says.
I would have done it anyway, because it's my, you know, job? But whatever, I won't turn down free money.
So I wander over to the team that's been looking at this and get the lowdown. They keep getting out of memory errors.
Me: So what does the heapanalyzer output look like?
Team: Huh?
Me: You...you've been having out of memory errors and haven't looked at the heap?
Team: Buh?
So I get the heapdump and look at it. Immediately it's clear that the system is overflowing with http session objects.
Me: Anything in the log files related to sessions?
Team: Just these messages about null pointer exceptions during session cleanup...do you think they're related somehow?
Me: <Bangs head on desk>
A little more research reveals that there were two issues at play. The first was that we had a custom HttpSessionListener that did some cleanup when sessions were unbound, and it would sometimes throw an exception. The second was that we were using IBM WAS, and it turned out that when a sessionDestroyed method threw an exception, WAS would abort all session cleanup. So we'd wind up in a cycle: the session cleanup thread would start, process a few sessions, hit one that threw an exception on cleanup, and abort without cleaning up any of the remaining sessions.
We did a quick fix of wrapping all the code in the sessionDestroyed method with a blanket try/catch and logging the exception for later fixing, and IBM later released a patch for WAS that fixed the session cleanup code to continue even if sessionDestroyed threw an exception.
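For anyone who wants to see the failure mode and the quick fix side by side, here is a minimal sketch. It is not the original code: `SessionListener`, `sweep`, `guarded`, and `FLAKY` are all hypothetical names, the interface is a stand-in for the servlet `HttpSessionListener` so the example runs without a container, and `sweep` only models the abort-on-exception behavior described above, not how WAS actually implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the servlet HttpSessionListener, so this sketch
// runs without a servlet container on the classpath.
interface SessionListener {
    void sessionDestroyed(String sessionId);
}

class SessionCleanupSketch {

    // Models the container behavior described above (an assumption, not WAS code):
    // the first exception from the listener ends the whole sweep.
    // Returns how many sessions were actually removed.
    static int sweep(List<String> expired, SessionListener listener) {
        int removed = 0;
        try {
            while (!expired.isEmpty()) {
                listener.sessionDestroyed(expired.get(0)); // may throw
                expired.remove(0);
                removed++;
            }
        } catch (RuntimeException e) {
            // Sweep aborted: everything still in the list stays in memory.
        }
        return removed;
    }

    // The quick fix: a blanket try/catch around the listener body that records
    // the exception for later fixing instead of letting it escape.
    static SessionListener guarded(SessionListener delegate, List<String> errorLog) {
        return sessionId -> {
            try {
                delegate.sessionDestroyed(sessionId);
            } catch (RuntimeException e) {
                errorLog.add(sessionId + ": " + e);
            }
        };
    }

    // A listener whose cleanup fails for one particular session.
    static final SessionListener FLAKY = id -> {
        if (id.equals("s3")) {
            throw new NullPointerException("cleanup failed for " + id);
        }
    };

    static List<String> sessions() {
        return new ArrayList<>(Arrays.asList("s1", "s2", "s3", "s4", "s5"));
    }

    public static void main(String[] args) {
        List<String> unguarded = sessions();
        int removedUnguarded = sweep(unguarded, FLAKY);
        System.out.println("unguarded: removed " + removedUnguarded + ", left " + unguarded);

        List<String> errorLog = new ArrayList<>();
        List<String> guardedSessions = sessions();
        int removedGuarded = sweep(guardedSessions, guarded(FLAKY, errorLog));
        System.out.println("guarded: removed " + removedGuarded + ", left " + guardedSessions);
        System.out.println("logged: " + errorLog);
    }
}
```

In this model the unguarded sweep removes two sessions and then stops at the one that throws, so the rest pile up. The guarded listener lets the sweep finish and leaves one entry in the error log to chase down later.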
So, I very quickly solved this problem and waited for my $5000 spot bonus. And waited. And waited...
I went back to the manager and asked him about it. Over the next few weeks, he proceeded to tell me the following series of stories:
* It was in the works, and I'd have it soon.
* He had to get approval from his superiors.
* Because so many people had worked on the problem, it was decided that it should be split among the group, and that I'd have to share it with the people that couldn't fix it.
* No bonus.
So even though it was his idea to try to bribe me to fix a problem, they still failed to follow through on it.
Another story: We had an issue once where they finally brought me in after a year of problems. One of our Java systems was failing intermittently, and the development team had given up and couldn't figure out what was wrong. The boss told me it was now my problem, that I was to dedicate myself 100% of the time to solving the problem, and I could rewrite as much of the system as needed, basically total freedom (and responsibility). About halfway through the spiel where they were talking about the architecture and implementation, someone mentioned that the system was dumping core. I immediately stopped them right there.
Me: You realize that if it's a coredump, it's not our fault, right?
Boss: Huh?
Me: If a Java program coredumps, it's either a bug in a 3rd party JNI library, a bug in the JVM, or a bug in the OS. What did the coredump show?
Boss: Wha?
Me: You guys have had this problem for a year and haven't looked at the coredumps?
Boss: Blurgh?
So I fire up dbx and take a look at the last few coredumps. Pretty much instantly I can see the problem is in a JDBC type 2 (JNI native code) driver for DB2. We contact IBM, and after a bunch of hemming and hawing they admit there's a problem that's fixed in the latest driver patch. We upgrade the driver and poof! the problem is gone.
We had a year of failures, causing problems for customers, as well as all the wasted man hours trying to fix something in our code that simply could not have been fixed that way, all because the main dev team for this product had no idea how to debug it. I had an answer within 30 minutes of being brought in to the problem, and the solution was deployed within days.
Please tell me you found better pastures! A place where you were compensated and appreciated. :)
I've been at the current job for 10 years, receiving the highest level of performance review for all but the last year (the global pandemic was one of four life upheavals... I'm glad I pulled off a 'great' review vs. 'exceptional'), doing similar style work of solving problems people can't seem to comprehend... yet I just can't seem to break into the next job title tier because "We don't need a principal."
...
I just found out my lead was just promoted to principal. Once the divorce is off my plate, I plan on taking more risk and jumping ship.
I didn't mind so much, as I was well compensated in general. It was only in the last few years that Agile cultists took over and managed to ruin the place completely. After a long, painful decline, my entire group was laid off, from senior managers down to new hires. Fortunately, I immediately found a better job.
What amazing stories you have shared! Glad you changed jobs. Totally agree on Agile cultists. I am seeing a similar decline here at work.
Obsession with hour tracking takes priority over everything else. Things that one good engineer would do in 4-6 weeks on a single ticket are now broken into 100 little JIRA shitlets for 6 person-months divided among 5 developers. At the end of it no one really has a clue what has actually been achieved. But since 100 JIRA stories have been completed, it must be a great milestone.
I wonder if that's a situation where the bosses above your boss think that there should be a proper "shape" to a team's rankings (e.g., can't have a team with too many principals, or all SWE 3s, etc). Even when one's manager is willing to go to bat for you, sometimes they get told they can only promote one person.
Oh, I'm sure that's the case. That, and they know I love the product and would (usually) be willing to put up with lower pay to keep working on it. But, losing assets in the divorce, I'm going to need to take more risk to get back on my feet. Only have a few decades to retirement, and need to rebuild that nest egg. So my priorities are changing, and I'm less willing to put up with people that don't appreciate me.
Yup! I'd told them last year as the divorce was starting that I'd need to be promoted within 18 months, or I'd start to become bitter and perform less well... and I'd leave before that happened.
Just put in for a GDC talk, hopefully that gets approved. Otherwise I'll stick around to give a talk at the Game Analytics Summit (big networking event for my niche of a field), and spin up some conversations. Also have contacts in most major studios... or I may just strike out and start doing consulting work. It was always a dream of mine to get back into global consulting work once the kids were grown up. They have been, but the ex was wanting to stick around for the grandkids.
Me, I'll be a free woman to make her mark on the world (once I can extract myself from the leech).
I hear you. Tracing and core dump analysis is just part of being a programmer. At least it should be.
Try to have a conversation about this with anyone for whom modern architecture means not only that there is nothing to analyze, but that all software everywhere should just be restarted on every failure and all state thrown away because "stateless" and "cloud". The best environment is when no one can log in anywhere.
It's not a problem that no one can analyze anything because nothing is ever analyzed anyway. Software simply shouldn't be fixed, it should be built upon.
It seems to be pretty much everyone's state of mind these days, and I feel completely powerless about it.
Sometimes the root cause of a problem is buried one or two abstraction layers deeper than the "responsible team" is comfortable working at.
This is where the "lower level" expert comes into play.
At my place the abstractions start at high level user facing code, reach down into the hardware interface (driver) code and down to the actual electronic circuit design.
EVERYWHERE something can go wrong. If the circuit deteriorates too fast, you need to get the material analysts to "debug" it.
Rule of thumb: there is always one layer of abstraction below you where stuff can break, which feels like magic to you but gets fixed at a glance by the lower-level guy.
I've just always considered understanding and being able to debug your environment to be a standard part of the job. As a professional software developer working in a Solaris environment, knowing how Solaris works, how to use the shell, tools, and other stuff is just basic.
Back in the 90's I worked at a company doing development on HP Apollo workstations. They were X based and used CDE for the desktop environment. When I started there, I invested some time to learn how it all hung together, how to leverage ToolTalk, and customized my system to do cool stuff.
There was another developer who had started there a month before me, and had the same workstation. They had a problem one day that they wanted my help with, so I went to their desk for the first time. I found their screen had the default X stipple pattern, and a single terminal window 80 characters wide, maybe 2/3rds of the screen tall, and placed somewhat off center in the screen.
I thought it slightly odd, but was distracted by the problem at hand. So at one point when they had some code up in the single window in vi, I asked them to open another terminal so we could do some other stuff. They just sort of sat there, so I asked again. They got agitated and snapped that they didn't know how to do that.
This was a professionally employed C programmer, with several years of experience prior to this, who had been working with this equipment for several months, and _didn't know how to open a second terminal window_. They didn't have CDE running, because when you log in you can select your desktop, and they kept picking just a plain, raw X session with the default xterm. They were completely and utterly uninterested in the capabilities of the hardware, OS, and desktop environment.
The same goes for the issue with the crashing JNI driver in my other comment. Maybe you don't know how to write your own JNI stuff, but I just expected that any developer who'd been using Java for any length of time would at least _know_ about how the JVM works in general, and what a coredump means in the context of a Java program. Specifically, that it's not your software that's causing it, and that trying to fix it by rewriting and debugging your Java code is a waste of time.
Sadly the vast majority of developers might know that there is a tool called debugger and might also know how to set a breakpoint.
But that’s where it stops.
No drive to see what your debugger offers you.
Call stack? What’s that?
Conditional Breakpoints? Huh?
I hope that, rather than aiming further down the abstraction layers, they aim up and try to get the hang of the macro-level stuff (metaprogramming, abstractions) that I most often put off as "that’s just a fad".
If something like containerization or container orchestration sticks around for 5 or 10 years I'll maybe take a proper look at it.
Until then it is 'just a trend' and takes up mental space that I feel is better spent digging down. Because that’s where software is interacting with hardware (the real world).
That’s the abstraction layer where you can 'actually move the world', where goods are created.
Not everybody is capable or experienced enough to move across different abstraction layers and have a proper mental model for each. SOMEWHERE you will probably have to draw a line.
Some people draw the line at 'writing software'.
The execution target (understanding the hardware / runtime) is out of focus for effectively all programmers.
I think that's partly because languages, tutorials, and bootcamps don't bother to dig deep there, because 'the compiler will take care of it'.
It just isn't taught that way. What do you mean by Solaris? Sun? PDP? It is just a computer. Java is cross-platform. How does it do that? That's the JVM's job, not my job. I just write the software.
> I've just always considered understanding and being able to debug your environment to be a standard part of the job. As a professional software developer working in a Solaris environment, knowing how Solaris works, how to use the shell, tools, and other stuff is just basic.
If this was a standard part of the job, it wouldn't be my superpower. Working in a team where everyone (or almost everyone) can debug is amazing!
At a smaller scale, you don't necessarily need a dedicated team, you can just have a couple people who know to look for core dumps and know how to look at them (although gdb exe exe.dump ... bt gets you 90% of the way there 90% of the time), and whatever their real job is, it can be deprioritized to deal with urgent issues elsewhere without much fuss.
If you get to the point where the fixers are always mostly busy fixing things, you can give their real jobs away.