
Surfer deserves to hit the front page also. Much better than GTKWave. Nice work Spade & Surfer!


PAX Markets (YC W25) | Founding SW or HW Engineer | Menlo Park, CA | Full-time

https://pax.markets

We're building a crypto exchange, on a chip. We sell ultra-low latency market access to HFT participants through our on-chip λ API. For all other participants, we offer free trading with cash rebates for making and taking.

PAX is the fastest exchange from crypto to Wall Street.

Apply here: https://app.dover.com/jobs/pax


I'm curious about the results under the section titled "Demonstration."

The author claims that the ground truth is that each goroutine uses 10% of the CPU time (stipulated that this should be the case). But what if the results shown are accurate, i.e. they reflect the actual CPU time, because of idiosyncrasies of scheduling between the OS, the Go runtime, and anything else happening on that system?

Does running the new profiler show less variance in the results from that initial experiment? Showing this result would strengthen the claim that the "out of the box" solution is inaccurate.


I have a theory that news occurs with a far lower frequency than people think, far lower than even when people try to account for their knowledge of the 24-hour news cycle. In my theory, most of what counts as news is either a story update (737 MAX stories) or not news at all (say, Trump "news").

My hypothetical news organization would strive to identify the unifying element of a given news item and then keep one article about it. The article would include the following elements: a summary, a timeline, a fact set, and a commentary or critique. All subsections would be allowed to evolve, but the "story" would be one thing.

I've considered trying to self fund this somehow, but I've never convinced myself that it would really get traction.


Have you seen Wikipedia's current events portal? [1] It's pretty much what you describe there, sans critique (level 1, factual reporting).

In trying to escape the aggregators and filter bubbles, I have blocked all news sources on my phone (including HN!) save for Wikipedia. Since the stories are only high-priority and slowly evolving, I spend a lot less time on reading news, and am happier for it.

[1]: (For convenience) https://en.wikipedia.org/wiki/Portal:Current_events


Wasn't aware of it, will try it - thank you!


I'd pay an approximately $10 monthly fee for this if it was minimally biased and had wide coverage of all news stories. I'm unsure whether enough people would for it to become profitable, but some market exists.

For some long-running stories that I only tune into occasionally, such as the Mueller investigation, I do not have enough context to fully understand the newest headlines. The only place comprehensive coverage ends up in one spot is a book, and while I often buy books of that genre, one won't be released until after the topic is concluded. This seems like it would be a one-stop shop for everything about a news topic.

I think you'll have more success if you find a niche to start with. Maybe US National Politics & Economics, which covers your two examples. Finding your zone and sticking to it can separate you from mainstream news, which can chase hype much better than a startup organization can.

I think this type of project is a great candidate for crowdfunding. Crowdfunding can provide positive reinforcement of product/market fit and help attract journalism talent.


I would pay for something that let me keep up with actual news, with a time lag that allowed me to get a more stable picture after the first flurry of opinions and overselling had exhausted most of its energy.


"I have a theory that news occurs with a far lower frequency than people think, far lower than even when people try to account for their knowledge of the 24 hr. news cycle."

I gave up Reddit for Lent, and I've been shocked at how little news actually happens in a day. I compulsively check Apple News (I know, baby steps), but without all of the Reddit commentary, wisecracks, trolling, and mutually reinforcing outrage, it all just seems very repetitive.

"Yep, Barr still hasn't released the Mueller report."

"No, the House still doesn't have Trump's tax returns."

"Candidate X is too progressive/too moderate/too male/too old/too young to be the Democratic nominee, even though it's still almost a year before any Democrats can vote on them."

"Still no trade agreement with China."

"Trump makes hyperbolic claims about an action he might take, but doesn't actually take any action."

That seems to be the daily news summary every day for the past few weeks.


It's unfortunate that so many writers parrot that the solution is "more pilot training." Under the hypothesis that there needs to be a solution (a hypothesis I think is correct), the answer of more training amounts to mere hope. One hopes that the pilot a) knows and b) remembers to turn off the autopilot when the emergency starts.


Pilots can fly planes with unusual characteristics when properly trained. That is the point of the type rating. I feel the 737 MAX needs its own type certificate distinct from the rest of the 737s and that should solve the problem, since the MAX has very different flight characteristics in certain situations.

Either that or ground the plane permanently for being poorly designed and unsafe with any amount of training.

If the FAA does anything else, it will harm my faith in the organization, which is already at a recent low.

NB: Jet pilot, not 737 pilot.


737 pilots have said that they are already trained in the tendency of a 737 to pitch up when full power is applied at low speeds, which apparently is present in all 737 models. Stall recovery procedures for a 737 apparently specifically detail that you must lower the nose first and then apply power (and not too much power), since applying full or higher power near stall speed will overwhelm the pitch authority.

What's not clear to me is how much worse the characteristics of the MAX are compared to the NG or earlier 737s. Would it really be unsafe to fly without the MCAS, or was this just the route Boeing took to avoid having to retrain pilots?


Money's on a last-minute fix to the problem, and engineering being told the launch couldn't be delayed.


We're told that MCAS only activates with the autopilot already off.

I agree with you that pilot training doesn't seem to be a plausible answer given that the Lion Air crash was world news, resulting in retraining, and yet that wasn't enough.


Thank you for the insight! Just for clarity, I'm using "autopilot" as a catch-all term that refers to the automated systems that plausibly seem to have caused these crashes.


...which is a pretty confusing thing to do, given that there are multiple such systems, one of which is conventionally called "autopilot".


The solution isn't so much "more pilot training".

It's that pilot training that absolutely needed to happen wasn't done.

A three-page (if even that) description of the system and its inputs/outputs/controls should have been sufficient for a pilot to build up a mental model, deal with a faulty AoA sensor, and modify their piloting technique to stay within a more conservative maneuvering envelope while operating the plane.

Whether that would have been sufficient to prevent both disasters I don't know, but it sure seems like a small trade-off that, even in the absence of any other system hardening, could have made the difference between no survivors and an emergency landing.


Pilot training is a major reason air travel is so safe.

> One hopes that the pilot a) knows and b) remembers to turn off the autopilot when the emergency starts.

This is where training kicks in. After hitting the scenario 10 times in the simulator, the pilot doesn't remember; they _do_.



Please see my other comment as to what this map is useful for. Furthermore, on a global scale, light pollution is not equivalent to population; consider the light levels of Indonesia, the world's fourth most populous country.


It should be the case that if the same amount of work is done, then the energy used will be the same. If it takes less work to compile the WebAssembly, then less energy is used (holding all other parameters the same). If you have to idle a CPU, then you probably use more energy (holding all other parameters the same), because you will spend more time to accomplish the same amount of real work while wasting energy on the idled core (albeit very little, on extra but non-productive work). You cannot let any other CPU parameter change as a result of cores being idled (e.g. frequency getting boosted on non-idle cores by dynamic frequency scaling) if you want to run this experiment. Thinking about CPU energy use is interesting :)


I don't think that's true; isn't the relationship between clock speed and power non-linear?


The real killer with energy usage in browsers is idle wake ups per second. If you've got a lot of tabs open and they're all running timers, waking up, hitting the network, etc. then they're keeping the CPU from going into low power state and thus wasting a lot of energy.

Even though I'd prefer to use Firefox I tend to stick with Safari due to the battery life advantage which really shows when you open a lot of tabs.


Indeed the most dramatic improvement I saw in battery life was when macOS and Safari started cooperating around timer coalescing using App Nap.


That's for max frequency on a given process node. Power scales with voltage squared. But that doesn't say anything about wasted power. And dynamic scaling screws that up in modern chips.

I believe I could summarize things by saying the only way you can really save energy* doing the same work+ is by using a different semiconductor process (either power/leakage-reduction-focused or smaller).

* For serious values of "energy"

+ Where the same work is not always true for a given task, if one optimizes an algorithm


Can you explain the voltage squared thing? To me, power = voltage * current.


Even for purely resistive loads, power is proportional to the square of the voltage:

  P = V * I
But,

  I = V / R
So,

  P = V * (V / R)
    = V^2 / R


For chips, power scales with voltage squared. It is also true that P = IV (since both are true, these observations cannot be in contradiction); for chips, the current must be roughly proportional to voltage as well. Glossing over some details, turning on (off) a transistor is the same as charging (discharging) a capacitor. The energy stored on a capacitor is 1/2 C V^2. If you turn the transistor on and off periodically (say with frequency f), you use 1/2 C V^2 of energy f times per second (and energy per unit time is power). Normally the capacitance is ignored when discussing how power changes, because for a given design the capacitance is a fixed quantity.
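
As a rough worked illustration of that formula (the capacitance, voltage, and frequency below are made-up, illustrative numbers, not from any particular chip):

  # Dynamic (switching) power, per the comment above: P ~ 1/2 * C * V^2 * f
  C = 1e-9        # switched capacitance in farads (assumed)
  V = 1.0         # supply voltage in volts (assumed)
  f = 3e9         # switching frequency in Hz (assumed)

  P = 0.5 * C * V**2 * f                # ~1.5 W
  # Raising the clock usually means raising the voltage too, and the V^2 term
  # is why power then grows much faster than linearly:
  P_boost = 0.5 * C * 1.2**2 * 3.6e9    # ~2.6 W for +20% voltage and +20% frequency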


Running a processor at a higher frequency also requires increasing the voltage, which raises the energy of each switch through that same V^2 term.

That's on top of the linear factor from frequency itself, which is why power grows faster than linearly with clock speed.


I think it is linear in frequency and non-linear in voltage, i.e. P ~ f C V^2. But in many current CPUs, the feature that adjusts frequency also adjusts voltage. That's why I stipulated that, for my comments to be true, such shenanigans as dynamic frequency (and voltage) scaling must be "turned off." I think the OP was asking what happens to CPU energy if you load the web page with and without the optimized compilation. The OP was interested in core sleep states, but I think that dynamic frequency scaling is a confounding factor. It would be interesting to see the measurements with and without that feature, perhaps.
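
If someone wanted to try that, here is a minimal sketch of one way to do it on Linux, assuming an Intel CPU whose RAPL counters are exposed through the powercap sysfs interface (paths vary by kernel and platform, and run_workload is a hypothetical stand-in for whatever is being measured):

  RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package energy counter, microjoules

  def read_uj():
      with open(RAPL) as f:
          return int(f.read())

  # To control for dynamic frequency/voltage scaling, pin the cpufreq governor first,
  # e.g. echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
  before = read_uj()
  run_workload()                          # hypothetical: load the page with/without the optimization
  joules = (read_uj() - before) / 1e6     # ignores counter wraparound for brevity
  print("package energy: %.3f J" % joules)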


To increase the frequency, you also have to increase the voltage so that the transistors charge faster, otherwise they won't be able to switch in the shorter time.


If Bezos gave all his money to everyone in the world we would all get maybe $13 or so. If all the billionaires did the same we would all get $1000 or so.


So you are saying the richest still have potential? I mean, we can expect the average person around the world to be able to "yield" $500 for the richest person in this global era. That's like trillions for the world's richest.


I don't speak for your comment's parent, but anytime I see these types of figures, it brings to mind Iain Banks' comment "Money is a sign of poverty" [1]. You'll see rants [2] against this sentiment, but I have yet to find a rant that doesn't essentially deify money. It's just a tool, and if it doesn't help us sufficiently move the needle towards greater human satisfaction (which I believe is very congruent with moving us towards post-scarcity factors), then it is upon us to find another, better tool.

[1] https://en.wikipedia.org/wiki/The_Culture#Economy

[2] http://angry-economist.russnelson.com/money-is-not-a-sign-of...


Ah, the grand benefit of value - no matter what you call it.

https://en.wikipedia.org/wiki/The_Unincorporated_Man


I don't want their money. I want their companies.

Democratize their companies. Control them through their employees, or by a government committee if they are too big. Make the big corporations work for the benefit of everyone, rather than shareholders and owners.


Run Amazon by government committee and in a couple of years it will be requesting a government bailout.


This tends to realign the system around serving the interests of that one large company, and historically it has been a mixed bag. Finland's reliance on Nokia, Russia's on Gazprom, and Saudi Arabia's on Aramco were good while they lasted.

For what it's worth, the US already has large ownership stakes in such corporations as AIG, Amtrak, USPS, Freddie Mac, etc.


I would argue - especially for Jeff & Bill - that the majority of employees are both the shareholders & owners of said companies.


Most Amazon employees are non-shareholding warehouse workers (even if technically they go through a contractor)


Hi stjepang,

If the following criteria are met, then perhaps the branch-mispredict penalty is less of a problem:

1. You are sorting a large amount of data, much bigger than the CPU's LLC.

2. You can effectively utilize all cores, i.e. your sort algorithm parallelizes well.

In that case you are probably memory-bandwidth limited. If so, you are probably spending more time waiting on data than waiting on pipe flushes (i.e. the consequence of mispredicts).
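
A back-of-the-envelope sketch of that argument (all numbers below are assumptions for illustration, not measurements):

  # Merge-sorting N random 64-bit keys takes ~log2(N) passes over the data.
  N         = 10**9            # elements (assumed)
  passes    = 30               # ~log2(N)
  traffic   = 16 * N * passes  # bytes moved: 8 B read + 8 B written per element per pass
  bandwidth = 50e9             # shared DRAM bandwidth in bytes/s (assumed)
  cores     = 32               # (assumed)
  clock     = 3e9              # Hz (assumed)
  penalty   = 15               # cycles per branch mispredict (assumed)
  miss_rate = 0.5              # mispredicts per comparison on random keys (assumed)

  t_memory = traffic / bandwidth                               # ~9.6 s, shared across all cores
  t_branch = N * passes * miss_rate * penalty / clock / cores  # ~2.3 s, divides across cores
  # With enough cores the mispredict cost parallelizes away, but the shared
  # memory bandwidth does not -- so the sort ends up bandwidth-bound.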


Absolutely - there are cases when branch misprediction is not the bottleneck. It depends on a lot of factors.

Another such case is when sorting strings because every comparison causes a potential cache miss and introduces even more branching, and all that would dwarf that one misprediction.


I had wondered why the L1 caches are not growing while L2 and L3 capacities continue to grow: a significant limitation on L1 cache size is actually the fixed tradeoff among associativity, page size (i.e. the 4 KB pages allocated by the OS to processes), and capacity.

Because a 4 KB page holds 64 cache lines, you can have at most 64 cache sets. With an 8-way associative cache this works out to 32 KB. Using 128 sets would cause aliasing, but with 64 sets the cache index is built from the LSBs that just index into the page (i.e. bits not used in the TLB lookup). Thus, the only ways to increase L1 capacity are to:

- totally abandon 4 KB pages in favor of (e.g.) 2 MB pages (not likely)

- increase cache associativity (likely imo)

- stop using virtual index + physical tag (not likely imo)
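
A quick sanity check of that arithmetic (just a sketch; the 64-byte line size and 8 ways are the usual x86 values, assumed here):

  page_size = 4096    # bytes, the 4 KB OS page
  line_size = 64      # bytes per cache line (typical x86)
  ways      = 8       # associativity

  sets     = page_size // line_size     # 64 sets: the most you can index with in-page bits
  capacity = sets * ways * line_size    # 64 * 8 * 64 = 32768 bytes = 32 KB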


I think a simpler argument is that for L1 you want fast, not big. Same thing with registers (a form of cache at a lower level). Why did MIPS only have 32 registers?

Design Principle 2: Smaller is faster. [1]

BTW, if you look at Agner Fog's latency tables [2], mov mem,r (load) went from 3 cycles in Haswell to 2 cycles in Skylake. So Intel has been concentrating on faster which is nice.

And by way of comparison, AMD increased their μop cache size in Ryzen but then only slightly. Way size went from 6 μops to 8. This matches their increase in EUs.

[1] Patterson and Hennessy. Computer Organization and Design, 5th edition, p. 67.

[2] http://www.agner.org/optimize/instruction_tables.pdf


At a high level it's true that smaller is faster, but it's also true that those L1s could have grown by adding sets (not ways) and achieved the same latency. L2 has grown, but stayed iso-latency. This seems to say that "smaller is faster" does not always hold.

Always impressed that Agner Fog takes the time to publish his results. Pretty amazing. But I think focusing your thinking on the register count in MIPS or the uarch details for some random opcode does not get into the real constraints on L1 cache design at all. One could say that x86 should be even faster, because hey, it has far fewer than 32 registers (or historically at least).

My response is like this: yes, the L1 has to be small to be fast, but it has been stuck at 32KB forever now. It could have grown! So it's not as simple as small is fast.


L1 size is probably constrained by a trade-off and competition for area between different CPU parts. If it can be increased with an overall positive effect on performance (while still being economically competitive to build), then I have no doubt Intel will do it... It is probably not crucial to have a big L1 on modern x86 arch because of very deep OOO queues, HT, speculative exec, prefetching, and all the other improvements on IPC and overall package perf that need to keep some efficiency even when L1 can't keep up anyway.

I also vaguely remember the Mill CPU guy talking about cache size constraints just because of the speed of light, but given that node size has continued to decrease during the last decade while frequency has nearly stopped increasing, this might be less of an issue than basic area optimizations. Or it might be an interesting consideration on the Mill only because it is a radically different architecture and needs different area ratios.

Only wild guesses though; I haven't even tried to confirm any of that with any kind of research or back-of-the-envelope calculations.


x86_64 has 16 integer registers but Haswell has a 192 entry ROB. Skylake has 224. So Intel does increase these numbers. It's just that there has to be a good reason. In the 90s maybe something like clock speed could win a marketing spec battle. Not today.

I think at 6 transistors per bit we really aren't talking about a lot of die area. Still I'm stone cold certain the Intel architects would increase L1 cache size if that was beneficial, if it modeled out. (However they may want to keep performance similar+predictable unless there's a solid win.)

Agner is showing they've reduced L1 latency. So this smaller is faster seems to have gotten them something.

So you really have to work backwards and ask why they didn't/don't. There may be more than one reason; but they don't, and haven't in quite some time.

I'm old school assembly/compiler hack. I read Agner and the Intel Optimization Manual a lot. VTune, IACA and the PMCs. Someone has to do it.


I think maybe we were talking past each other. Yes there is more than one reason.

It's far easier to add capacity by adding sets, as opposed to ways. But they can't add sets in the L1 because of the aliasing problem. When they do increase L1 capacity, if nothing else has changed, then it will be by adding ways.


Increasing the register count spends opcode space. That leads to fewer available instructions, or at a minimum constrains opcode optimization.


As we saw with AMD64, x86 is a variable-length ISA, up to 15 bytes long, allowing for quite a flexible (and complex) encoding. With a fixed-width RISC, yeah, registers are going to eat into opcode space. And in both cases, register renaming will allow more renamed registers (180) than architectural registers.

BTW, renamed != ROB. I got that wrong above.


In variable-length architectures it will constrain opcode optimization, making your binaries larger (requiring more cache). It's not as big a problem as in fixed-length-instruction machines, but adding named registers is never free.


It seems you know this subject very well :) https://github.com/etep/resume

On an unrelated subject, do you know if the next desktop generation (coffeelake?) will support AVX512 or should I just buy a skylake-x?


Yes (partially). See the table in this article:

https://www.kitguru.net/components/cpu/anton-shilov/intel-sk...


Coffee Lake is not the same as Cannon Lake. This chart is from 2015 and suggests Cannon Lake would be released in early 2017.

I doubt we'll see Cannon Lake this year, since they released Kaby Lake in early 2017...

https://en.wikipedia.org/wiki/Coffee_Lake https://en.wikipedia.org/wiki/Cannonlake


Since I don't see any flaw in your reasoning here, the obvious question becomes - why haven't they just moved to 16-way associative L1 yet? What are the hurdles?


The reason he gave is not the main reason L1s don't grow. The main reason is latency.

Increasing cache size grows latency in two ways: every doubling of the cache size adds one mux to the select path, and every doubling increases the wire delay from the most distant element by ~sqrt(2). Both of these are additively on the critical path, and spending more time on them would require increasing the cache latency.

The size of a cache is always a tradeoff against the latency of the cache. If this was not true, there would only be a single cache level that was both large and fast. However, making something both large and fast is impossible, so instead we have stacked cache levels starting with a very fast but small cache followed by increasingly slower and larger ones.


Hi, it is the main reason L1 hasn't grown.

By your reasoning, no cache should be able to grow, because then its latency would increase too much. But instead, all the other CPU caches are growing basically at iso-latency. The reason this is possible is technology scaling. Anyway...

But yes, the L1 does have to be small to be fast, but it doesn't have to be this small to be this fast. It is this small because of virtual indexing, combined with the cost of adding ways breaking other design constraints (possibly a latency constraint, fine). Purely from a latency standpoint, you could grow the L1 by adding sets and still meet the required latency; it's the aliasing that rules that out.


> By your reasoning, no cache should be able to grow, because then their latency would increase too much. But instead, all other CPU caches are growing basically with iso-latency. The reason this is possible is technology scaling. Anyway...

The problem is that technology scaling only gives you roughly enough to keep up with the speed of the CPU. Cache latencies measured as nanoseconds keep going down, but cache latencies measured in clock cycles are pretty stagnant at the same sizes. And when Intel added some more L3, they also relaxed latency to it, and when they recently cut the latency to it a little, they did so by cutting the amount.

> But yes, the L1 does have to be small and fast, but it doesn't have to be that small to be that fast. It has to be that small because of virtual indexing combined with the cost of adding ways breaking other design constraints (possibly a latency constraint, fine). But you could grow the L1 by adding sets and get your required latency.

No, you couldn't. The added latency of increasing the size would be simply too much. I know for a fact that the L1 load latency is currently one of the most important critical paths in Intel CPUs -- any increase in L1 size would mean that you have to reduce clock speeds.


If it was easy to build high performance caches with high associativity, we would certainly see higher associativity. Ideally, you want a fully associative cache, but it's too expensive. In CPU caches, once a set is selected, all N associative ways are compared simultaneously. So growing associativity costs area and power for extra comparators. This growth could cause timing issues, i.e. add latency to the memory access or cause CPU freq. to be lowered.


You can also build a memory system that is able to deal with the aliasing. Then you don't have the dependence on page size.

