askldjd's comments

askldjd · on Feb 17, 2020

I spent a little over 3 years in USDS and it changed my life. If you are interested in working on high-impact projects that affect hundreds of millions of Americans, I would strongly recommend that you give it a try.

askldjd · on July 1, 2017

Our team is very careful about choosing technology that is stable and well proven. In fact, it took a lot of debate for us to move from Rails static pages to a React SPA.

As for government being slow and only changing once a decade, you might be underestimating engineers in the government.

Here's the disability claim processing project I worked on when I was on SSA Digital Service. https://federalnewsradio.com/ask-the-cio/2017/01/ssa-turns-c...

The team I worked with managed to pull off a React/Node.js/Redis/Zookeeper microservice framework on AWS, using Jenkins as the CI/CD pipeline. Most of the team came from an older enterprise stack using Java and IBM WebSphere, and they were able to adapt to the new framework.

Be conservative, yes. But being overly conservative is part of the reason why government is so far behind today.

noir_lord · on July 1, 2017

Interesting. When I said once a decade I didn't particularly mean changing anything; I meant that the system as a whole has to last a decade.

I've got stuff in production that I wrote in 2007. It's the same system, but lots of the parts have been replaced over time (a bit of a Theseus/Trigger's Broom philosophical point).

I agree on the overly conservative point. What I find helps there is to consider the migration away from anything new you are considering. I usually ask myself the following question: "If the entire dev team for foobar disappeared tomorrow, how much would it hurt?" If the answer is "not much, because I can maintain it myself while migrating away", then that's very different from "a lot, because I don't understand it that well and/or it's massively complex".

Enterprise is a different world (even for a medium sized one like I work for) because (frankly) a lot of the programmers are doing it purely for the money and/or simply aren't that competent.

Not all of them by any means but I run into a lot of code that is just plain horrible, not "not the way I'd have done it" so much as "how does this thing even work?" and "Who possibly thought this was the right approach?".

I'm weird, I like LoB software development. I find that, done well, the feedback loop is very gratifying: you get to have an immediate effect on the company's bottom line and you have happier employees. The problems also have a lot of hidden complexity, which is intellectually stimulating.

askldjd · on June 30, 2017

Yup, it would. Disabling the TS option was just a stopgap measure to make our deployments stable for the time being.

sydney6 · on June 30, 2017

Of course, setting priorities. I was just wondering how different OSes would behave under these circumstances. For instance, AWS S3 also doesn't support TCP timestamps, and this had a rather big impact on e.g. FreeBSD's TCP performance until recently.

askldjd · on June 30, 2017

I am turning 34 this coming September. Not much of a kid anymore (even though I act like one).

Not true. GS15 caps at 161k in the DC area. It is a very respectable salary. http://www.fedweek.com/pay-tables/2017-gs-pay-table-washingt...

For engineering, USDS predominantly hires senior engineers with years of experience. The reason is that we help troubleshoot some of the biggest crises in the government.

kelnos · on June 30, 2017

Sure, but isn't 161k the max, and there's no room to grow beyond that? For many people who work in the SF bay area (for example) and have 10+ years of experience, even 161k may represent a pay cut. DC is certainly cheaper cost-of-living-wise, but not by a lot. And if 161k is the max, where do you go from there, especially if you won't have a job after a few years due to their "tour of duty" thing? Spend even more of your own money to move back to the bay area, and try to reestablish your old salary from before you left?

dkhenry · on July 1, 2017

If you're counting the money it will never be worth it. It makes no financial sense to go work for the USDS: you will get paid less, and you will work in a much worse environment, with more frustration than you can imagine. However, you will directly impact the lives of millions of Americans. People who would have died might live. Educations can be obtained by those who might never have had one. Doctors will get paid more efficiently and will take on more Medicare and Medicaid patients.

You might not have as much disposable income, but you can live a good life in a nice area for 161k. That's more than most people in the DC area make.

askldjd · on June 30, 2017

I actually did look at the TCP layer early on. However, I didn't pay close attention to the TS Val. From the packet dumps, it just appeared that the TCP window had stopped sliding. I couldn't conclude that NSOC's router was at fault.

Getting NSOC on board is a big deal. After all, they deal with the entire VA network with 100,000+ employees. If you think about it from their perspective, why are USDS' TCP connections so special?

kevin_nisbet · on June 30, 2017

Network-level troubleshooting is incredibly difficult, especially for individuals who don't have a networking background. Even showing someone how to read Wireshark often isn't enough.

I just wanted to politely point out, though, that in this case I think there should have been an indication of a network failure early on in this analysis, from the standpoint that TCP frames were sent to the server which were not acknowledged. This would naturally depend on the point where you capture the traffic, but the lack of acknowledgement would be a strong indicator that traffic is not reaching the server, or that replies are not reaching your capture point.

So while the TS Val may be the cause of the drops, I think the packet drops should have stood out when seeing the traffic being black holed, and likely the same segments getting re-transmitted continuously.

And for anyone out there who thinks this is easy to catch, I'd say this is very easy to miss, because you need to have a good understanding of how TCP works in the first place to know what not working looks like.
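To make the "same segments getting re-transmitted with no acknowledgement" signal concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Segment` type, the `summarize` helper, and the sample numbers are made up for illustration and are not taken from the actual capture. It also ignores sequence-number wraparound, which a real tool would have to handle.

```python
from collections import Counter
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """One client-to-server TCP segment from a capture (hypothetical type)."""
    time: float   # capture timestamp in seconds
    seq: int      # sequence number of the first payload byte
    length: int   # payload length in bytes


def summarize(sent: List[Segment], highest_ack: int) -> Tuple[List[int], List[int]]:
    """Return (retransmitted, unacked) sequence numbers.

    retransmitted: payload-carrying sequence numbers seen more than once.
    unacked: payload-carrying segments whose last byte is not covered by the
             highest ACK number seen coming back from the server.
    Assumption: sequence numbers do not wrap during the capture.
    """
    data = [s for s in sent if s.length > 0]
    counts = Counter(s.seq for s in data)
    retransmitted = sorted(seq for seq, n in counts.items() if n > 1)
    unacked = sorted({s.seq for s in data if s.seq + s.length > highest_ack})
    return retransmitted, unacked


# Made-up capture: the same 100-byte segment is sent four times with growing
# gaps (retransmission backoff), and the server never ACKs past byte 1000.
capture = [
    Segment(0.0, 1000, 100),
    Segment(0.2, 1000, 100),
    Segment(0.6, 1000, 100),
    Segment(1.4, 1000, 100),
]
retransmitted, unacked = summarize(capture, highest_ack=1000)
print(retransmitted, unacked)
```

A segment that shows up in both lists is the pattern described above: it keeps being resent and nothing comes back, so either the traffic is not reaching the server or the replies are not reaching the capture point.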

dboreham · on June 30, 2017

True, but Wireshark will highlight dodgy TCP frames (retransmits, dups, etc) which should give a small clue to look further. I agree that it is necessary to understand how TCP works (or have access to someone who does) in order to run Internet services.

js2 · on June 30, 2017

I applaud you for not putting the obvious work-around in place:

- Inserting "sleep 300" into the startup process.

- Adding a cronjob to reboot the servers once a week.

I kid, but I'm sure we've all seen hacks like this.

askldjd · on June 30, 2017

You are dead on. We do have a bug where we are not recovering the Oracle connectivity correctly. It is on our radar to address the issue. https://github.com/department-of-veterans-affairs/caseflow-m...

However, there is actually another 50% of the story that I never posted. VACOLS is a really old Oracle DB (from the 80s) that is out of our control. Somehow, it has a "feature" where you can only make one TCP connection to it every 2-3 seconds. So if we lose connection to the database, it will take many seconds to recover. At that point, our ELB health-check would've fired and restarted our EC2 instances. This is why recoverability of the database connection is not an immediate priority.

Here's how we preallocate the VACOLS connection pool to work around this throttling feature. https://github.com/department-of-veterans-affairs/caseflow/b...

The infrastructure we operate in is very challenging (and interesting) because of legacy systems. That's why common sense engineering often may not apply in USDS.
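For readers who don't want to dig through the repo, here is a rough Python sketch of the general idea of preallocating a pool against a server that throttles new connections. This is not the Caseflow code (which is Ruby); `preallocate_pool`, its parameters, and the 3-second default are all assumptions made for illustration.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def preallocate_pool(
    connect: Callable[[], T],
    size: int,
    min_interval: float = 3.0,  # assumed spacing; the thread says "2-3 seconds"
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[T]:
    """Open `size` connections at startup, at most one per `min_interval` seconds.

    `connect` is a caller-supplied function that opens one connection.
    `sleep` and `clock` are injectable so the pacing can be tested without
    actually waiting.
    """
    pool: List[T] = []
    last_attempt: Optional[float] = None
    for _ in range(size):
        if last_attempt is not None:
            # Wait out whatever remains of the throttle window.
            remaining = min_interval - (clock() - last_attempt)
            if remaining > 0:
                sleep(remaining)
        last_attempt = clock()
        pool.append(connect())
    return pool
```

The point of paying the cost up front is that a pool of N connections takes roughly (N - 1) * min_interval seconds to fill, which is much easier to absorb during startup than while requests are waiting and a health check is counting down.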

kevin_nisbet · on June 30, 2017

I also bet those challenging legacy systems are in many cases way better built than what "modern" systems would provide. Sure, there will be wacky things to work around, but I've seen my share of wacky engineering in brand new systems too. Common sense engineering seems to be few and far between these days.

Kudos on having something interesting to work on.

askldjd · on June 30, 2017

Thanks. I think the fact that Cisco routers fail to route TCP packets bothers me even more.

/you had one job

fartbagxp · on June 30, 2017

But it did route those TCP packets over.

For exactly 5 minutes.

It just means you need to route everything faster, and then kill your connection, and restart it. :)

askldjd · on June 30, 2017

The government's current IT infrastructure crisis is not caused by any one administration. The root cause goes back decades. Things like "Improving Veterans' lives so they don't have to wait 5-10 years for an Appeals decision" shouldn't be political.

I can honestly say that the projects I've been involved with in USDS are the most impactful and meaningful projects I've worked on in my entire life.

askldjd · on June 30, 2017

Author here. Completely agreed. My jaw dropped when I saw INITIAL_JIFFIES. The kernel developers really saved our butts.

I could not imagine debugging this problem if INITIAL_JIFFIES was randomized. It might take days/weeks/months for this bug to appear.
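For anyone who hasn't run into it: as I understand the kernel source, INITIAL_JIFFIES is defined so that the 32-bit jiffies value starts 300 seconds' worth of ticks before it wraps, which forces wraparound bugs to show up five minutes after boot. Here is a small Python sketch of that arithmetic. The `HZ` value and the helper names are assumptions for illustration; the real definition lives in the kernel's C headers.

```python
MASK32 = 0xFFFFFFFF
HZ = 250  # assumed tick rate; the real value depends on the kernel config


def initial_jiffies(hz: int) -> int:
    """32-bit equivalent of (unsigned int)(-300 * HZ)."""
    return (-300 * hz) & MASK32


def seconds_until_wrap(hz: int) -> float:
    """Seconds after boot at which the 32-bit counter rolls over to 0."""
    return ((MASK32 + 1) - initial_jiffies(hz)) / hz


def time_after(a: int, b: int) -> bool:
    """Wrap-safe 'a is later than b', mirroring the signed-difference trick:
    (b - a) interpreted as a signed 32-bit value is negative."""
    return ((b - a) & MASK32) >= 0x80000000


before = 0xFFFFFFF0                  # just before the wrap
after = (before + 0x20) & MASK32     # 32 ticks later, now a small number

print(seconds_until_wrap(HZ))        # 300.0
print(after > before)                # False: naive comparison breaks at the wrap
print(time_after(after, before))     # True: wrap-safe comparison still works
```

Anything that compares tick-derived values naively works fine until the counter wraps, then misbehaves. Starting the counter right before the wrap turns a bug that could hide for weeks into one that shows up on every boot.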

lfowles · on June 30, 2017

Similarly, Unreal Engine 4 offsets platform time (a double) by some large value so if it's stored in a float, accuracy errors will be exposed almost immediately. Looking it up, the offset starts out large enough that the epsilon is two seconds.
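Here is a quick Python sketch of why that works. The offset of 2**24 seconds is an assumption: it is the smallest value at which adjacent float32 values are two seconds apart, which matches the "epsilon is two seconds" observation, but the engine's actual constant isn't quoted here. The helper names are made up, and `float32_spacing` is only meant for positive finite inputs.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_spacing(x: float) -> float:
    """Gap between x (rounded to float32) and the next larger float32.
    Only valid for positive finite x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    next_up = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return next_up - to_float32(x)


OFFSET = 16_777_216.0  # 2**24, assumed offset for illustration

start = OFFSET + 10.0   # "platform time" at the start of a measurement
end = OFFSET + 10.75    # 0.75 seconds later

elapsed_double = end - start                         # doubles keep the 0.75
elapsed_float = to_float32(end) - to_float32(start)  # float32 loses it

print(float32_spacing(OFFSET))  # 2.0
print(elapsed_double)           # 0.75
print(elapsed_float)            # 0.0
```

Without the offset, a float holding seconds-since-start would look fine for a long time and only slowly lose precision. With it, any code that stores the value in a float gets a timer that visibly jumps in two-second steps right away.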

vageli · on July 1, 2017

Do you have a link with more info? I'd love to read more about this.

lfowles · on July 1, 2017

Sorry, no. It's not documented anywhere other than a cryptic comment in the source code ( FPlatformTime::Seconds() ) that assumes some knowledge of floating point number gotchas.

Edit: Here's a more detailed post I made about the specific gotcha if you're interested: https://community.gamedev.tv/t/why-is-fplatformtime-seconds-...

aaronchall · on June 30, 2017

Here's the patch that applies this: https://www.kernel.org/pub/linux/kernel/people/akpm/patches/...

askldjd · on July 1, 2017

Nice! I was looking for it, and it predates git. Thank you!!