I’ve been building this out too, and your comment made me realize the missing piece for me. I’ve given the agents tools to validate their own work, but I haven’t improved the experience of humans verifying the agents’ work.
I played again and found issues. After level 40ish, basically nothing can hurt you. You heal too fast. And once I got past level 80ish, I was running so fast, I ran through walls. And the saddest part, I got to level 200+ and couldn’t die to lock in my high score! It was a fun time though, thanks for sharing this game!
Ahh - sorry that it didn't get to the leaderboard. I pushed an update recently that should make the game significantly more challenging as time progresses.
I’m very surprised this isn’t getting more attention. Am I missing something?
It seems at or above SOTA on the given benchmarks, doesn’t have context rot, is orders of magnitude faster, and uses less compute than current transformer models. I suppose it’s just an announcement and we can’t test it ourselves yet.
We are SOTA in some ways and not in others, continuously working to make it better! We need a little more time to scale, as we are working on things like disaggregated prefill, etc., the norms of large-scale model infra.
This seems super cool if as described, but I'm sure you can understand the skepticism.
Do you anticipate having any kind of publicly accessible chat interface for testing in the near future?
Also, what, if any, benefits are there for smaller context windows? Is there still a material improvement in cost to serve under say 256k? I'm curious about the broader implications for the space beyond improvements for very large context windows.
I do, for sure! Yes, we have a few product rollouts lined up. The differentials for latency are posted in our blog post, so that should provide an idea of where the scaling law differentials kick in.
In this new knowledge economy, there is no benefit to publishing your secret sauce.
If I came up with a novel thing I'd monetise it first, because publishing it makes it part of the training that adds value to billion dollar corps with zero credit to me.
In the old knowledge economy I benefited from the credit assigned to me.
I'm not GP, but I would want a benchmark that actually tests the entire context window. A benchmark that only tests the first 128K tokens effectively tells us nothing about how well it works at its full capacity.
The proof is in the pudding. At this point, there have been plenty of models that overperformed on benchmarks and underperformed on real work. So my stance is that I'm curious, I'm excited to see where it goes, and I don't believe it until I can try it.
At a minimum, you increase top-k to cast a wider net, then after reranking, take the N you really want. You have to play around with it a bit, but that’s the idea.
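A rough sketch of that two-stage idea, in case it helps. Everything here is a toy assumption: `retrieve_then_rerank`, `word_overlap`, and `phrase_bonus` are hypothetical names, and the scoring functions are stand-ins for whatever retriever and reranker you actually use.

```python
from typing import Callable, List


def retrieve_then_rerank(
    query: str,
    docs: List[str],
    first_stage: Callable[[str, str], float],
    reranker: Callable[[str, str], float],
    top_k: int = 50,
    final_n: int = 5,
) -> List[str]:
    """Cast a wide net with a cheap scorer, then keep the best final_n after reranking."""
    # Stage 1: cheap score over every document; keep a deliberately wide top_k.
    candidates = sorted(docs, key=lambda d: first_stage(query, d), reverse=True)[:top_k]
    # Stage 2: more expensive score over the candidates only; keep the N you want.
    reranked = sorted(candidates, key=lambda d: reranker(query, d), reverse=True)
    return reranked[:final_n]


def word_overlap(query: str, doc: str) -> float:
    # Toy first-stage scorer (assumption): number of shared lowercase words.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def phrase_bonus(query: str, doc: str) -> float:
    # Toy reranker (assumption): word overlap plus a bonus for an exact phrase match.
    bonus = 10.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


docs = [
    "compound interest calculator online",
    "interest in compound words",
    "calculator for compound interest rates",
    "cooking recipes",
]
best = retrieve_then_rerank(
    "compound interest", docs, word_overlap, phrase_bonus, top_k=3, final_n=2
)
```

The knobs to play with are `top_k` and `final_n`: too small a `top_k` and the reranker never sees the good candidates, too large and the second stage gets slow.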
I built one of the top 3 results on Google when you search “compound interest calculator” and a dozen other similarly popular calculator pages.
The value isn’t the interface, it’s the trust that its calculations are accurate. I can’t tell you how many meetings I had with accountants and finance people to validate all the calculations.
Being flexible is important, but the latest research shows that heavy lifting improves flexibility about as much as a dedicated stretching program: https://pmc.ncbi.nlm.nih.gov/articles/PMC9935664/
On the phone so can't see study details - what do they consider strength training? I used to lift heavy (well, by regular people standards) and, for example, my squat mobility was shit. Trying to improve that with significant load would have ripped my tendons, I feel. Decreasing the load to the point where it wouldn't might as well be called stretching, because it wouldn't qualify as heavy lifting. Also, most powerlifters I know have shit mobility.
Yeah, I noticed that after a certain age, if you want to retain the ability to do a movement, you need to do that movement. Doing 20 others won't help with that one.
Less than 10% of my projects ever made anything. Check out indiehackers.com - tons of old posts of people starting things, but the domain is dead when you click through.
I’ve worked in e-commerce for years and the thing that always slows down the sites the most is 3rd party scripts. Are you addressing this? I couldn’t find anything in the repo.
I’ve had websites slow down 10x just by introducing the Facebook re-targeting script, for instance.
Yes, we've had similar experiences and plan to address them to some extent. In certain cases, like with Facebook, it might not be straightforward, but we want to provide built-in alternatives, e.g. analytics specifically tailored for e-commerce that cover ~80% of what you need, with the rest being a trade-off.
I also think Facebook, Google, Hotjar et al. will eventually get better with those scripts.
Fyi regarding cover images: I have built and run a handful of book related websites and Amazon is the easiest place I found to get book covers. You just need the Amazon id and every image is a standard url.