
> the DFA for an extended RE (including a lazy DFA implemented using derivatives, as here) is worst-case doubly exponential in the length of the expression

The authors seem to claim linear complexity:

> the result is RE#, the first general-purpose regex engine to support intersection and complement with linear-time guarantees, and also the overall fastest regex engine on a large set of benchmarks


We refer to this in the paper as well.

The standard way to do intersection / complementation of regexes with NFAs requires determinization, which causes a huge blowup, whereas for us this is the cost of a derivative.

It is true that we cannot avoid enormous DFA sizes in the worst case. A simple example is (.*a.*)&(.*b.*)&(.*c.*)&(.*d.*), which has 2^4 states, and every additional intersection adds +1 to the exponent.

How we get around this in the real world is that we create at most one state per input character, so even if the full DFA has 1 million states, you need an input at least 1 million characters long to ever reach them all.

The real question for complexity is how expensive taking a lazy derivative can get. The first time you run the engine on an input that exercises new states, matching is not linear: the worst case creates a new state for each character. The second time the same (or similar) input is used, those states already exist and matching is linear. So, as the article says, it is a bit foggy: lazy DFAs are not strictly linear, but behave as if they were in practical cases.
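To make the derivative angle concrete, here is a hedged toy sketch (not RE#'s actual code; all names are illustrative) of derivative-based matching. Note how intersection and complement each cost exactly one structural case in the derivative, and how memoization plays the role of the lazy DFA's state cache. A real engine also algebraically normalizes the derivative trees so the set of distinct states stays small:

```python
# Toy Brzozowski-derivative matcher. Regexes are hashable tuples:
# ("empty",), ("eps",), ("chr", c), ("cat", r, s), ("alt", r, s),
# ("and", r, s), ("not", r), ("star", r)
from functools import lru_cache

EMPTY, EPS = ("empty",), ("eps",)

def nullable(r):
    """Does r accept the empty string?"""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("chr", "empty"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    if tag == "and":
        return nullable(r[1]) and nullable(r[2])
    if tag == "not":
        return not nullable(r[1])
    return nullable(r[1]) and nullable(r[2])  # cat

@lru_cache(maxsize=None)  # memoization = the lazy DFA's state cache
def deriv(r, c):
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "chr":
        return EPS if r[1] == c else EMPTY
    if tag == "alt":
        return ("alt", deriv(r[1], c), deriv(r[2], c))
    if tag == "and":  # intersection is just the derivative of both sides
        return ("and", deriv(r[1], c), deriv(r[2], c))
    if tag == "not":  # complement is just the derivative of the body
        return ("not", deriv(r[1], c))
    if tag == "star":
        return ("cat", deriv(r[1], c), r)
    # cat: d(rs) = d(r)s | d(s) if r is nullable, else d(r)s
    d = ("cat", deriv(r[1], c), r[2])
    return ("alt", d, deriv(r[2], c)) if nullable(r[1]) else d

def matches(r, s):
    for c in s:
        r = deriv(r, c)  # at most one new state per input character
    return nullable(r)
```

For example, `matches(("and", ("star", ("chr", "a")), ("chr", "a")), "a")` is True while the same pattern rejects "aa", i.e. a* & a matches exactly "a", with no determinization step anywhere.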


> The second time the same (or similar) input is used these states are already created and it is linear.

Does this imply that the DFA for a regex, as an internal cache, is mutable and persisted between inputs? Could this lead to subtle denial-of-service attacks, where inputs are chosen by an attacker to steadily increase the cached complexity - are there eviction techniques to guard against this? And how might this work in a multi-threaded environment?


Yes, most (I think all) lazy DFA engines have a mutable DFA behind a lock internally that grows during matching.

Multithreading is generally a non-issue: you just wrap the function that creates a state behind a lock/mutex. This is usually the default.

The subtle denial-of-service part is interesting; I hadn't thought of it before. Yes, this is possible. For security-critical uses I would compile the full DFA ahead of time: the memory cost may be painful, but this completely removes the chance of anything going wrong.

There are valid arguments to switch from DFA to NFA with large state spaces, but RE# intentionally does not switch to an NFA and instead capitalizes on reducing the DFA memory costs (e.g. minterm compression in the post, algebraic simplifications in the paper).

The problem with going from DFA to NFA for large state spaces is that match-time performance falls off a cliff: something like going from 1 GB/s to 1 KB/s, as we also show in the benchmarks in the paper.

As for eviction techniques, I have not researched this; the simplest thing to do is to completely reset the instance and rebuild past a certain size, but there is likely a better way.


> Multithreading is generally a non-issue, you just wrap the function that creates the state behind a lock/mutex, this is usually the default.

But you also have to lock when reading the state, not just when writing/creating it. Wouldn’t that cause lock contention with sufficiently concurrent use?


No, we do not lock when reading the state; we only lock the creation side, and the transition table reference stays valid during matching even if it is outdated.

Only when a nonexistent state is encountered during matching does the engine enter the locked region.
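A hedged sketch of the pattern described here (illustrative names, not any engine's actual code): readers consult the transition table without locking, and only a cache miss takes the lock. In Python the GIL makes the racy dict read safe; a native engine would use atomics instead:

```python
import threading

class LazyDfa:
    def __init__(self):
        self._lock = threading.Lock()
        self._transitions = {}  # (state, char) -> next state

    def step(self, state, char):
        # Fast path: racy, lock-free read. A stale miss is harmless --
        # we just fall through to the locked slow path.
        nxt = self._transitions.get((state, char))
        if nxt is None:
            with self._lock:
                # Re-check inside the lock: another thread may have
                # built this state while we waited.
                nxt = self._transitions.get((state, char))
                if nxt is None:
                    nxt = self._build_state(state, char)
                    self._transitions[(state, char)] = nxt
        return nxt

    def _build_state(self, state, char):
        # Placeholder for the expensive derivative computation.
        return (state, char)
```

The lock is only ever contended while the DFA is still growing; once the working set of states exists, every `step` stays on the lock-free fast path.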


Ah, I see, so it’s basically the Racy Single-Check Idiom.


> are there eviction techniques to guard against this?

RE2 resets the cache when it reaches a (configurable) size limit. I found this out the hard way when I had to debug almost-periodic latency spikes in a service I managed: a very inefficient regex caused linear growth in the lazy DFA until it hit the limit, then all threads had to wait a few hundred milliseconds for the reset, and then it all started again.

I'm not sure if dropping the whole cache is the only feasible mitigation, or some gradual pruning would also be possible.

Either way, if you cannot assume that your cache grows monotonically, synchronization becomes more complicated: the trick mentioned in the other comment about only locking the slow path may not be applicable anymore. RE2 uses RW-locking for this.


I have experienced this as well, the performance degradation of DFA to NFA is enormous and while not as bad as exponential backtracking, it's close to ReDoS territory.

The Rust version of the engine (https://github.com/ieviev/resharp) just returns an Error instead of falling back to an NFA. I think that is a reasonable approach, but the library is still new, so I'm waiting to see how it turns out and whether I had any oversights here.


Here RE2 does not fall back to the NFA; it just resets the lazy DFA cache and starts growing it again. The latency spikes I was mentioning are due to the cost of destroying the cache (deallocations, pointer chasing, ...).

Ah, sorry, then I misunderstood the comment.

I'm not sure if this applies to both RE2 and Rust, but some of Rust's internal engines appear to allocate a fixed buffer that they constantly re-create states into.

I'm not really familiar with RE2's eviction technique, but I've done a lot of benchmark comparisons. A good way to really stress-test RE2 is large Unicode classes: \w and \d in RE2 are ASCII-only, and I've noticed that Unicode classes (\p{class}) very drastically change the engine's throughput.


These claims are compatible. For instance, lex, re2c, ragel, etc. are exponential in the length of the automaton description, but the resulting lexers work in linear time[1] in the length of the string. Here the situation is even better, because the DFA is constructed lazily, at most one state per input character, so to observe its full enormity relative to the size of the needle you need an equally enormous haystack. “One state per input character” somewhat understates things, because producing each state requires a non-constant[2] amount of work, and storing the new derivative’s syntax tree a non-constant amount of space; but with hash-consing and memoization it’s not bad. Either way, derivative-based matching with a lazy DFA takes something like O(needle + f(needle) × haystack) time where I’m guessing f(n) has to be O(n log n) at least (to bring large ORs into normal form by sorting) but in practice is closer to constant. Space consumption is less of an issue because if at any point your cache of derivatives (= DFA) gets too bloated you can flush it and restart from scratch.

[1] Kind of, unless you hit ambiguities that need to be resolved with the maximal munch rule; anyways that’s irrelevant to a single-RE matcher.

[2] In particular, introductions to Brzozowski’s approach usually omit—but his original paper does mention—that you need to do some degree of syntax-tree simplification for the derivatives to stay bounded in size (thus finite in number) and the matcher to stay linear in the haystack.
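On footnote [2], a minimal sketch of the kind of simplification that keeps derivatives bounded: "smart constructors" that normalize as they build, instead of blindly nesting. These rewrite rules are illustrative, a real engine applies a larger set (plus sorting/flattening of nested ORs, which is where the O(n log n) guess above comes from):

```python
EMPTY, EPS = ("empty",), ("eps",)

def alt(r, s):
    """Build r|s, applying algebraic identities so equal derivatives
    actually compare (and hash) equal."""
    if r == EMPTY:
        return s          # 0 | s = s
    if s == EMPTY:
        return r          # r | 0 = r
    if r == s:
        return r          # r | r = r  (idempotence)
    return ("alt", r, s)

def cat(r, s):
    """Build rs with the analogous identities."""
    if r == EMPTY or s == EMPTY:
        return EMPTY      # 0s = r0 = 0
    if r == EPS:
        return s          # eps . s = s
    if s == EPS:
        return r          # r . eps = r
    return ("cat", r, s)
```

Without these rules the derivative of r|r|r|... keeps growing syntactically even though it denotes the same language, so the "DFA" never closes; with them, structurally equal states collapse and the state set stays finite.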


> I still see AI making stupid silly mistakes.

In contrast with humans, who are famously known for never making stupid silly mistakes...


> Okay, that's not fair. There's a big advantage to having an external compressor and reference file whose bytes aren't counted, whether or not your compressor models knowledge.

The benchmark in question (Hutter prize) does count the size of the decompressor/reference file (as per the rules, the compressor is supposed to produce a self-decompressing file).

The article mentions Bellard's work but I don't see his name in the top contenders of the prize, so I'm guessing his attempt was not competitive enough if you take into account the LLM size, as per the rules.


The benchmark counts it but the LLM compressor that was linked in that sentence clearly doesn't count the size.


All the more reason for asking the question?


> using the appropriate unicode characters might make it easier to read

It's probably also a great way to introduce almost undetectable security vulnerabilities by using Unicode characters that look similar to each other but in fact are different.
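A concrete illustration of the risk (hypothetical identifiers, chosen only to show the effect): these two strings render almost identically in most fonts, but the second starts with the Cyrillic letter a rather than the Latin one, so to a compiler they are entirely different names:

```python
latin = "admin"         # all ASCII
mixed = "\u0430dmin"    # U+0430 CYRILLIC SMALL LETTER A, then "dmin"

print(latin == mixed)   # False: different code points
print(mixed.isascii())  # False: a cheap first-pass lint check
```

This is why linters and code review tools increasingly flag mixed-script identifiers outright instead of trusting human eyes to spot the difference.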


This would cause your compilation to fail, unless you were deliberately declaring and using near identical symbols. Which would violate the whole "Code is meant to be easily read by humans" thing.


> unless you were deliberately declaring and using near identical symbols.

Yes, that would probably be one way to do it.

> Which would violate the whole "Code is meant to be easily read by humans" thing.

I'd think someone who's deliberately and sneakily introducing a security vulnerability would want it to be undetectable, rather than easily readable.


Recently I tried to reinstall an eSIM on my Android phone while overseas but was told by my carrier that the eSIM can only be activated while connected to antennas located in the carrier's country, i.e. it can't be activated overseas, despite my plan supporting call roaming and both countries being in the EU.

I don't know whether this is carrier-specific or the same for all carriers.


This worked for me with the French carrier "Free": I installed a new eSIM while in Spain.

But now I have doubts, especially outside the EU: if it doesn't work, that would lose one of the advantages I'd sort of expected eSIM to have, namely that if your phone gets lost / stolen while abroad, you could just get a new eSIM from your carrier immediately and set it up on your replacement phone.

In my case, my bank uses mandatory SMS 2FA for setting up their app on a new phone, thus making it impossible to make purchases with my card without being able to set up the app.

So I'd be back to the old-school method of having a friend back at home set up the new eSIM and receive the 2FA code...


I think almost all carriers require this. I've seen mentions that the Google Fi eSIM requires US towers to activate, but can be moved / reinstalled later without them (didn't test it though).


Just an end user, so don't quote me on this, but I think that requirement was largely a legacy Sprint requirement.

I've purchased newer Pixel devices from my local shop and activated Google Fi just fine overseas. (with the caveat that I might not have all of T-Mobile's bands if I'm back in the US).


Even the EU's own official web portal [1] has a cookie pop-up that covers half the screen of my mobile phone when I visit it.

[1] https://europa.eu/


Probably built by a web agency who added tracking, perhaps even GA, so there was a need for a cookie pop-up banner. Why that website would need tracking and profiling is beyond me.


I think every website should understand how and by whom their website is used. I don't consider this "spying." If you walk into a brick and mortar store, the shopkeeper has every right to count that you came in, and watch where you go in the store to optimize it. The web should be no different.

Fortunately, there are in fact cookieless analytics systems that people can use to get this information without being required to show the stupid cookie popup.


"I think every website should understand how and by whom their website is used"

1. You don't need cookies or profiling for that: use Simple Analytics et al.

2. You can ask for my consent, but you can't profile me against my will

3. A brick and mortar store does not profile me without my consent.


Yes, a brick and mortar store can absolutely profile you without consent if they wished, and so can a website. The only condition is not collecting PII.


Difficult; they try from time to time, then they get fake email addresses and fake zip codes in their database.

(Not using loyalty cards or CCs)


> The asteroid in spring hypothesis suggests that the asteroid impact that led to the K-Pg extinction event, occurred during spring season in the Northern Hemisphere.

Why is that significant?


The significance is likely more about paleontological methods of time determination than about what difference the impact would have made had it struck during another season, though that could also turn out to matter.

A massive asteroid impact is a singular event, especially in geology and paleontology, where the minimum timespan of concern is frequently one million years. The notion that specific fossil finds can reasonably be pinned to the day of the event (there's a T. rex find that seems to date to that day), and that we can identify when in the year that day fell (based on specimens of fish and fauna found in a deposit timed to the impact), provides methodologies that might be used to confirm (or refute) the timing both of other K-Pg boundary finds and of paleontological / paleobiological finds more generally.

What we don't have, mind, is a specific timing of the Alvarez impact. The current estimate, based on argon dating, is "66,038,000 years ago, plus or minus 11,000 years" (<https://en.wikipedia.org/wiki/Alvarez_hypothesis#Evidence>). Though that itself is pretty remarkably precise.


Awesome, thanks!


Because there has been significant academic infighting over whose theory it is, and whether the evidence comes from a faked source.


What I meant is, why is it significant whether the asteroid impacted in the spring or not?


That's really cool! I wonder how soon I can buy one of these ESauron thingies.


I found this part of the code quite funny:

    # If there are < 5 items, just return the median
    if len(l) < 5:
        # In this case, we fall back on the first median function we wrote.
        # Since we only run this on a list of 5 or fewer items, it doesn't
        # depend on the length of the input and can be considered constant
        # time.
        return nlogn_median(l)
Hell, why not just use 2^140 instead of 5 as the cut-off point, then? This way you'd have constant time median finding for all arrays that can be represented in any real-world computer! :) [1]

[1] According to https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/


Big-O notation and "real-world computer" don't belong in the same sentence. The whole point of big-O notation is to abstract the algorithm away from real-world limitations so we can talk about arbitrarily large inputs.

Any halting program that runs on a real world computer is O(1), by definition.


> The whole point of big-O notation is to abstract the algorithm out of real-world limitations so we can talk about arbitrarily large input.

Except that there is no such thing as "arbitrarily large storage", as my link in the parent comment explained: https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

So why would you want to talk about arbitrarily large input (where the input is an array that is stored in memory)?

As I understand it, big-O notation is intended to have some real-world usefulness, is it not? Care to elaborate on what that usefulness is, exactly? Or is it just a purely fictional notion in the realm of ideas with no real-world application?

And if so, why bother studying it at all, except as a mathematical curiosity written in some mathematical pseudo-code rather than a programming or engineering challenge written in a real-world programming language?

Edit: s/pretending/intended/


Big-O analysis is about scaling behavior - its real-world implications lie in what it tells you about relative sizes, not absolute sizes.

E.g., if you need to run a task on 10M inputs, then knowing that your algorithm is O(N) doesn't tell you anything at all about how long your task will take. It also doesn't tell you whether that algorithm will be faster than some other algorithm that's O(N^2).

But it does tell you that if your task size doubles to 20M inputs, you can expect the time required for the first algorithm to double, and the second to quadruple. And that knowledge isn't arcane or theoretical, it works on real-world hardware and is really useful for modeling how your code will run as inputs scale up.
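The doubling argument can be illustrated without wall-clock timing (which is noisy) by counting basic operations directly. This is a hypothetical demonstration, not anything from the article:

```python
def ops_linear(n):
    """Count the basic operations of a stand-in O(N) algorithm."""
    count = 0
    for _ in range(n):
        count += 1
    return count

def ops_quadratic(n):
    """Count the basic operations of a stand-in O(N^2) algorithm."""
    count = 0
    for _ in range(n):
        for _ in range(n):
            count += 1
    return count

n = 500
print(ops_linear(2 * n) / ops_linear(n))        # 2.0: doubling doubles
print(ops_quadratic(2 * n) / ops_quadratic(n))  # 4.0: doubling quadruples
```

Neither number tells you which algorithm is faster at n = 500; the constant factors hiding inside each "basic operation" decide that. The ratios are what big-O actually predicts.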


Thank you for this explanation! To me it looks like the algorithm sorts the whole array, just in groups of 5; the number of chunks should scale as O(N/5) = O(N), no? So how can you claim that just by chunking you can ignore the fact that you still sorted N elements? E.g. a selection sort would still perform N^2 comparisons in total.


Ah, so the issue here is the difference between quickSort and quickSelect. In both cases you pick a pivot and divide the data into two partitions - but in quickSort you then recurse on both partitions, and in quickSelect you determine which partition your target is in and recurse only on that one.

So you're right that dividing into chunks of 5 and iterating over them doesn't affect the scaling: you wind up with an array of N/5 medians, and it's still O(N) to iterate over that array. But the next step isn't to sort that array; you only need to quickSelect on it, and that's why the final step is O(N) rather than O(N log N).
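A hedged sketch of quickSelect as described above (with a random pivot rather than the article's median-of-medians pivot, so this version is only expected O(N), not worst-case O(N)): partition once, then recurse into only the side containing the target rank.

```python
import random

def quickselect(l, k):
    """Return the k-th smallest element (0-indexed) of l."""
    pivot = random.choice(l)
    lo = [x for x in l if x < pivot]
    eq = [x for x in l if x == pivot]
    hi = [x for x in l if x > pivot]
    if k < len(lo):
        return quickselect(lo, k)          # target is left of the pivot
    if k < len(lo) + len(eq):
        return pivot                       # target is the pivot itself
    return quickselect(hi, k - len(lo) - len(eq))  # target is right of it

def median(l):
    return quickselect(l, (len(l) - 1) // 2)
```

The contrast with quickSort is exactly the one described: quickSort recurses into both partitions (giving N log N total work), while quickSelect follows a single chain of shrinking partitions, which sums to O(N) in expectation.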


> (where the input is an array that is stored in memory)?

If the input is an array that is stored in a piece of real-world memory, then the only possible complexity is O(1). It's just how big-O works. Big-O notation is an abstraction that is much much closer to mathematics than to engineering.

> this big-O notation is pretending to have some real-world usefulness...

Big-O notation is not a person so I'm not sure whether it can pretend something. CS professors might exaggerate big-O notation's real-world usefulness so their students don't fall asleep too fast.

> fictional

Theoretical. Just like the other theoretical ideas, at best big-O notation gives some vague insights that help people solve real problems. It gives a very rough feeling about whether an algorithm is fast or not.

By the way, Turing machine is in this category as well.


Ultimately, an algorithm with worse complexity than another might still be faster up to a certain point. In this case 5 is likely under that point, though I doubt 2^256 is.

In practice you might also want to use an O(n^2) algorithm like insertion sort under some threshold.


> Ultimately when an algorithm has worse complexity than another it might still be faster up to a certain point.

Sure, but the author didn't argue that the simpler algorithm would be faster for 5 items, which would indeed make sense.

Instead, the author argued that it's OK to use the simpler algorithm for fewer than 5 items because 5 is a constant and therefore the simpler algorithm runs in constant time. My point is that by the same argument you could use 2^140 (or 2^256) as the cut-off point and claim that the simpler algorithm runs in constant time for all arrays that can be represented on a real-world computer, thereby obviating the need for the more complex algorithm (which obviously makes no sense).


If you set n=2^140, then sure, it’s constant. If instead you only have n<=2^140, then n varies across a large set and is practically indistinguishable from n<=infinity (since we get into the territory of the number of atoms in the universe), therefore you can perform limit calculations on it, in particular big O stuff.

In the article n was set to 5. All of those arrays (except maybe 1) have exactly 5 elements. There is no variance (and even if there was, it would be tiny, there is no point in talking about limits of 5-element sequences).


> In the article n was set to 5. All of those arrays (except maybe 1) have exactly 5 elements. There is no variance

No, the code was:

    # If there are < 5 items, just return the median
    if len(l) < 5:
        return nlogn_median(l)
> and even if there was, it would be tiny, there is no point in talking about limits of 5-element sequences

So your point is: not all constants are created equal. Which circles all the way back to my original point that this argument is pretty funny :)

