I spent a little over 3 years in USDS and it changed my life. If you are interested in working on high-impact projects that affect hundreds of millions of Americans, I would strongly recommend that you give it a try.
Our team is very careful about choosing technology that is stable and well proven. In fact, it took a lot of debate for us to move from Rails static pages to a React SPA.
As for government being slow and only changing once a decade, you might be underestimating engineers in the government.
The team I worked with managed to pull off a React/Node.js/Redis/Zookeeper microservice framework on AWS, using Jenkins as the CI/CD pipeline. Most of the team came from an older enterprise stack using Java and IBM WebSphere, and they were able to adapt to the new framework.
Be conservative, yes. But being overly conservative is part of the reason why government is so far behind today.
Interesting. When I said once a decade, I didn't particularly mean changing anything; I meant that the system as a whole has to last a decade.
I have stuff in production that I wrote in 2007. It's the same system, but lots of the parts have been replaced over time (a bit of a Theseus/Trigger's Broom philosophical point).
I agree on the overly conservative point. What I find helps there is to consider the migration away from anything new you are considering. I usually ask myself the following question: "If the entire dev team for foobar disappeared tomorrow, how much would it hurt?" If the answer is "not much, because I can maintain it myself while migrating away", then that's very different from "a lot, because I don't understand it that well and/or it's massively complex".
Enterprise is a different world (even for a medium-sized one like the one I work for) because, frankly, a lot of the programmers are doing it purely for the money and/or simply aren't that competent.
Not all of them by any means, but I run into a lot of code that is just plain horrible: not "not the way I'd have done it" so much as "how does this thing even work?" and "who possibly thought this was the right approach?".
I'm weird: I like LoB software development. I find that, done well, the feedback loop is very gratifying. You get to have an immediate effect on the company's bottom line and you have happier employees. The problems also have a lot of hidden complexity, which is intellectually stimulating.
Of course, setting priorities... I was just wondering how different OSes would behave under these circumstances. For instance, AWS S3 also doesn't support TCP timestamps, and this had a rather big impact on, e.g., FreeBSD's TCP performance until recently.
For engineering, USDS predominantly hires senior engineers with years of experience, because we help troubleshoot some of the biggest crises in the government.
Sure, but isn't 161k the max, with no room to grow beyond that? For many people who work in the SF Bay Area (for example) and have 10+ years of experience, even 161k may represent a pay cut. DC is certainly cheaper cost-of-living-wise, but not by a lot. And if 161k is the max, where do you go from there, especially if you won't have a job after a few years because of how their "tour of duty" thing works? Spend even more of your own money to move back to the Bay Area, and try to reestablish your old salary from before you left?
If you're counting the money, it will never be worth it. It makes no financial sense to go work for the USDS: you will get paid less, and you will work in a much worse environment, with more frustration than you can imagine. However, you will directly impact the lives of millions of Americans. People who would have died might live. Educations can be obtained by those who might never have had one. Doctors will get paid more efficiently and will take on more Medicare and Medicaid patients.
You might not have as much disposable income, but you can live a good life in a nice area for 161k. That's more than most people in the DC area make.
I actually did look at the TCP layer early on. However, I didn't pay close attention to the TS Val. From the packet dumps, it just appeared that the TCP window had stopped sliding. I couldn't conclude that NSOC's router was at fault.
Getting NSOC on board is a big deal. After all, they deal with the entire VA network with 100,000+ employees. If you think about it from their perspective, why are USDS' TCP connections so special?
Network-level troubleshooting is incredibly difficult, especially for individuals who don't have a networking background. Even showing someone how to read a Wireshark capture often isn't enough.
I just wanted to politely point out, though, that in this case I think there should have been an indication of a network failure early on in this analysis, from the standpoint that TCP frames were sent to the server which were not acknowledged. This would naturally depend on the point where you capture the traffic, but the lack of acknowledgement would be a strong indicator that traffic is not reaching the server, or that replies are not reaching your capture point.
So while the TS Val may be the cause of the drops, I think the packet drops should have stood out when seeing the traffic being black holed, and likely the same segments getting re-transmitted continuously.
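To make the "unacknowledged, continuously re-transmitted segments" symptom concrete, here is a rough Python sketch. It is not how Wireshark does it (its TCP analysis is far more thorough); the `Segment` and `Ack` records, their field names, and the sample addresses are all made up for illustration, and sequence-number wraparound is ignored.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple


class Segment(NamedTuple):
    """A data-carrying TCP segment seen at the capture point (hypothetical record)."""
    src: str
    dst: str
    seq: int
    length: int


class Ack(NamedTuple):
    """An acknowledgement seen at the capture point (hypothetical record)."""
    src: str
    dst: str
    ack: int


def find_retransmits(segments: Iterable[Segment], threshold: int = 2) -> Dict[Segment, int]:
    """Return segments whose exact (src, dst, seq, length) shows up `threshold`+ times."""
    counts = Counter(s for s in segments if s.length > 0)
    return {seg: n for seg, n in counts.items() if n >= threshold}


def unacknowledged(segments: Iterable[Segment], acks: Iterable[Ack]) -> List[Segment]:
    """Return distinct data segments never covered by an ACK from the other side."""
    highest_ack: Dict[tuple, int] = {}
    for a in acks:
        key = (a.src, a.dst)
        highest_ack[key] = max(highest_ack.get(key, 0), a.ack)

    pending = []
    for s in dict.fromkeys(segments):  # de-duplicate, keep capture order
        # The ACK for this segment travels in the reverse direction (dst -> src).
        if s.length > 0 and highest_ack.get((s.dst, s.src), 0) < s.seq + s.length:
            pending.append(s)
    return pending


# Made-up capture: the first segment is acknowledged, the second is sent
# three times and never acknowledged (the black-hole pattern).
CLIENT, SERVER = "10.0.0.1:50000", "10.0.0.2:1521"
capture = [
    Segment(CLIENT, SERVER, 1000, 100),
    Segment(CLIENT, SERVER, 1100, 100),
    Segment(CLIENT, SERVER, 1100, 100),
    Segment(CLIENT, SERVER, 1100, 100),
]
acks = [Ack(SERVER, CLIENT, 1100)]

print(find_retransmits(capture))
print(unacknowledged(capture, acks))
```

A segment that shows up in both results is the kind of thing that should prompt a closer look, whatever the underlying cause of the drops turns out to be.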
And for anyone out there who thinks this is easy to catch, I'd say this is very easy to miss, because you need to have a good understanding of how TCP works in the first place to know what "not working" looks like.
True, but Wireshark will highlight dodgy TCP frames (retransmits, dups, etc) which should give a small clue to look further. I agree that it is necessary to understand how TCP works (or have access to someone who does) in order to run Internet services.
However, there is actually another 50% of the story that I never posted. VACOLS is a really old Oracle DB (from the 80s) that is out of our control. Somehow, it has a "feature" where you can only make one TCP connection to it every 2-3 seconds. So if we lose our connections to the database, it will take many seconds to recover. By that point, our ELB health check would have fired and restarted our EC2 instances. This is why recoverability of the database connection is not an immediate priority.
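Some back-of-the-envelope arithmetic shows why reconnecting loses the race. This is only a sketch: the pool size and the health-check interval and threshold below are made-up numbers, not the real VACOLS or ELB settings; only the 2-3 second gap between connections comes from the description above.

```python
def pool_recovery_seconds(pool_size: int, min_gap_seconds: float) -> float:
    """Time to re-open `pool_size` connections when only one may be opened
    per `min_gap_seconds`, assuming the first one opens immediately."""
    if pool_size <= 0:
        return 0.0
    return (pool_size - 1) * min_gap_seconds


def health_check_deadline_seconds(interval_seconds: float, unhealthy_threshold: int) -> float:
    """Rough time until a load balancer marks an instance unhealthy:
    `unhealthy_threshold` consecutive failed checks, one per interval."""
    return interval_seconds * unhealthy_threshold


# Hypothetical values for illustration only.
POOL_SIZE = 10              # assumed number of pooled DB connections
MIN_GAP_SECONDS = 3.0       # upper end of the 2-3 second limit
CHECK_INTERVAL_SECONDS = 10.0
UNHEALTHY_THRESHOLD = 2

recovery = pool_recovery_seconds(POOL_SIZE, MIN_GAP_SECONDS)
deadline = health_check_deadline_seconds(CHECK_INTERVAL_SECONDS, UNHEALTHY_THRESHOLD)
recovers_before_restart = recovery < deadline

print(f"pool recovery: {recovery:.0f}s, health-check deadline: {deadline:.0f}s")
print("recovers before restart:", recovers_before_restart)
```

With these assumed numbers the instance gets replaced before the pool is back, so the restart is what actually restores service.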
The infrastructure we operate in is very challenging (and interesting) because of legacy systems. That's why common-sense engineering often may not apply in USDS.
I'd also bet that those challenging legacy systems are, in many cases, way better built than what "modern" systems would provide. Sure, there will be wacky things to work around, but I've seen my share of wacky engineering in brand-new systems too. Common-sense engineering seems to be few and far between these days.
The government's current IT infrastructure crisis was not caused by any one administration. The root cause goes back decades. Things like "Improving Veterans' lives so they don't have to wait 5-10 years for an Appeals decision" shouldn't be political.
I can honestly say that the projects I've been involved with in USDS are the most impactful and meaningful projects I've worked on in my entire life.
Similarly, Unreal Engine 4 offsets platform time (a double) by some large value so if it's stored in a float, accuracy errors will be exposed almost immediately. Looking it up, the offset starts out large enough that the epsilon is two seconds.
Sorry, no. It's not something documented other than a cryptic comment in the source code ( FPlatformTime::Seconds() ) assuming some knowledge of floating point number gotchas.
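For anyone who hasn't hit this floating-point gotcha before, here is a small Python sketch of the effect. It assumes an offset of 2**24, picked only because that is the smallest value at which neighbouring 32-bit floats are 2.0 apart, which matches the "two seconds" figure above; treat it as an illustrative stand-in rather than the engine's documented constant. The helper names are made up.

```python
import math
import struct

# Assumption for illustration: 2**24 is the smallest value where adjacent
# 32-bit floats are 2.0 apart. Not taken from any engine documentation.
ASSUMED_OFFSET = 16_777_216.0


def to_float32(x: float) -> float:
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap between adjacent 32-bit floats near x (normal, non-zero x only)."""
    _, exponent = math.frexp(abs(x))  # abs(x) = m * 2**exponent, 0.5 <= m < 1
    return 2.0 ** (exponent - 24)     # a 32-bit float has a 24-bit significand


def elapsed_as_seen(offset: float, seconds: float) -> float:
    """Elapsed time computed from two timestamps that were stored as float32."""
    start = to_float32(offset)
    now = to_float32(offset + seconds)
    return now - start


print("spacing near the offset:", float32_spacing(ASSUMED_OFFSET))
print("half a second, no offset:  ", elapsed_as_seen(0.0, 0.5))
print("half a second, with offset:", elapsed_as_seen(ASSUMED_OFFSET, 0.5))
```

Without the offset, a timestamp squeezed into a float looks fine for a long while after startup. With it, half a second of elapsed time rounds away to nothing on the very first measurement, so the bug is hard to miss.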