Hacker News | pstorm's comments

I’ve been building this out too, and your comment made me realize the missing piece for me. I’ve given the agents tools to validate their own work, but I haven’t improved the experience of humans verifying the agents’ work.

For video/image stuff I found the ability for the LLMs to use ffmpeg and imagemagick to be quite fun.

This was more fun than I was expecting. Nice way to waste 20 min!

Awesome! Thanks for checking it out and glad you enjoyed it!

I played again and found issues. After level 40ish, basically nothing can hurt you. You heal too fast. And once I got past level 80ish, I was running so fast, I ran through walls. And the saddest part, I got to level 200+ and couldn’t die to lock in my high score! It was a fun time though, thanks for sharing this game!

Ahh - sorry that it didn't get to the leaderboard. I pushed an update recently that should make the game significantly more challenging as time progresses.

I’m very surprised this isn’t getting more attention. Am I missing something?

It seems at or above SOTA on the given benchmarks, doesn’t have context rot, is orders of magnitude faster, and uses less compute than current transformer models. I suppose it’s just an announcement and we can’t test it ourselves yet.


We are SOTA in some ways and not in others, continuously working to make it better! We need a little more time to scale, as we are working on things like disaggregated prefill, etc., the norms of large-scale model infra.

I am happy to answer any questions!


This seems super cool if as described, but I'm sure you can understand the skepticism.

Do you anticipate having any kind of public accessible chat interface for testing in the near future?

Also, what, if any, benefits are there for smaller context windows? Is there still a material improvement in cost to serve under say 256k? I'm curious about the broader implications for the space beyond improvements for very large context windows.


I do, for sure! Yes, we have a few product rollouts lined up. The differentials for latency are posted in our blog post, so that should provide an idea of where the scaling law differentials kick in.


> I do, for sure! Yes, we have a few product rollouts lined up.

When, more or less?


We will have a few rollouts in the next two months.


I have questions.

Can you back up your claims?

Why did you not release the white paper in parallel with the product?

Feels really fishy.


In this new knowledge economy, there is no benefit to publishing your secret sauce.

If I came up with a novel thing I'd monetise it first, because publishing it makes it part of the training that adds value to billion dollar corps with zero credit to me.

In the old knowledge economy I benefited from the credit assigned to me.

So, to me, nothing fishy at all.


What do you want in a whitepaper that was not in our blog post? There is time to add more before the whitepaper is released.


I'm not GP, but I would want a benchmark that actually tests the entire context window. A benchmark that only tests the first 128K tokens effectively tells us nothing about how well it works at its full capacity.


That makes sense! We are working on that.


The proof is in the pudding. At this point, there have been plenty of models that overperformed on benchmarks and underperformed on real work. So my stance is that I'm curious, I'm excited to see where it goes, and I don't believe it until I can try it.


> Am I missing something?

Yes, this product doesn't exist.

And the last time a company claimed something similar it disappeared after taking money from investors.


Yes you're missing something: the snake oil.


no one has access to it yet

no published benchmarks

no paper

no demonstrations of capabilities


I agree, it's a real architectural breakthrough if true


Just FYI, for RAG/similarity search, adding a reranker was a much bigger payoff than switching embedding models.


What top K do you use for vector search before passing into the reranker?


At a minimum, you increase top-k to cast a wider net, then after reranking, take the N you really want. You have to play around with it a bit, but that’s the idea.
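
To make the "wide net, then narrow" idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the function names, the default `top_k=50` and `final_n=5`, and especially `toy_rerank_score`, which is just a word-overlap stand-in for whatever real reranker model you would call. It is not any particular library's API.

```python
import math
from typing import Callable, List, Sequence, Tuple

Doc = Tuple[str, Sequence[float]]  # (text, embedding vector)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_rerank_score(query: str, doc: str) -> float:
    """Hypothetical stand-in for a reranker: fraction of query words in the doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


def retrieve_then_rerank(
    query_vec: Sequence[float],
    query_text: str,
    docs: List[Doc],
    rerank_score: Callable[[str, str], float] = toy_rerank_score,
    top_k: int = 50,   # assumed default: the wide net from vector search
    final_n: int = 5,  # assumed default: the N you actually want
) -> List[str]:
    # Stage 1: cheap vector similarity, keep the top_k candidates.
    candidates = sorted(
        docs, key=lambda d: cosine(query_vec, d[1]), reverse=True
    )[:top_k]
    # Stage 2: rescore only those candidates with the (more expensive) reranker.
    reranked = sorted(
        candidates, key=lambda d: rerank_score(query_text, d[0]), reverse=True
    )
    # Keep the final_n best after reranking.
    return [text for text, _ in reranked[:final_n]]
```

The knob to play with is the ratio between `top_k` and `final_n`: if `top_k` is too small, the reranker never sees the documents it would have promoted.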


I built one of the top 3 results on Google when you search “compound interest calculator” and a dozen other similarly popular calculator pages.

The value isn’t the interface, it’s the trust that its calculations are accurate. I can’t tell you how many meetings I had with accountants and finance people to validate all the calculations.


Being flexible is important, but the latest research shows that heavy lifting improves flexibility about as much as a dedicated stretching program: https://pmc.ncbi.nlm.nih.gov/articles/PMC9935664/


On the phone so can't see study details - what do they consider strength training? I used to lift heavy (well, by regular people standards) and, for example, my squat mobility was shit. Trying to improve that with significant load would have ripped my tendons, I feel. Decreasing the load to the point where it wouldn't might as well be called stretching, because it wouldn't qualify as heavy lifting. Also, most powerlifters I know have shit mobility.


For the movements that you happen to make with the barbell.


Yeah, I noticed that after a certain age, if you want to retain the ability to do a movement, you need to do that movement. Doing 20 others won't help with that one.


That’s a new fire as of a couple hours ago - Sunset Fire. There are 5 going on in LA at the moment.


Less than 10% of my projects ever made anything. Check out indiehackers.com - tons of old posts of people starting things, but the domain is dead when you click through.


I’ve worked in e-commerce for years and the thing that always slows down the sites the most is 3rd party scripts. Are you addressing this? I couldn’t find anything in the repo.

I've had websites slow down 10x just by introducing the Facebook re-targeting script, for instance.


Yes, we've had similar experiences and plan to address them to some extent. In certain cases, like with Facebook, it might not be straightforward, but we want to provide built-in alternatives e.g. analytics specifically tailored for e-commerce that cover ~80% of what you need, with the rest being a trade-off.

I also think Facebook, Google, Hotjar et al. will eventually get better with those scripts.


This really isn't a problem a web platform can solve for itself.

Just use server side tag manager.


FYI regarding cover images: I have built and run a handful of book-related websites, and Amazon is the easiest place I found to get book covers. You just need the Amazon ID, and every image is a standard URL.


Fully agree with this. We built this LLM based book summary app: https://www.booksummary.pro/

The links here directly refer to images on Amazon (e.g. https://m.media-amazon.com/images/I/81YkqyaFVEL._SL1500_.jpg)

