JS: Flag more results in js/redos#4823
Conversation
|
An evaluation came back looking just fine. No performance degradation, lots of new results, and no lost results. |
| or | ||
| r1 = r2 and | ||
| q1 = q2 and | ||
| epsilonSucc+(q) = q and |
There was a problem hiding this comment.
There should be a comment here explaining why this is a fork. It's not entirely trivial and it's part of what we discussed in the last PR, so let's continue that discussion here.
I've added my understanding of the problem below, feel free to use that or discuss here if you don't think it's accurate.
If
qcan reach itself by epsilon transitions, then there are two distinct paths to the q1/q2 state: one that uses the loop and one that doesn't. The engine will separately attempt to match with each path, despite ending in the same state. The "fork" thus arises from the choice of whether to use the loop or not.To avoid every state in the loop becoming a fork state, we arbitrarily pick the
InfiniteRepetitionQuantifierstate as the canonical fork state for the loop (every epsilon-loop must contain such a state).
The missing piece here is of course an explanation of why we check for a "mid" state being an infinite repetition quantifier as well. It seems that check is the reason we don't flag (a?b?)*c so I'm immediately a little suspicious of it.
I believe the only correct fork state in (a?b?)*c is immediately after the a since there are two ways to read b afterwards: with or without the loop. The infinite repetition quantifier state is not actually a valid fork due to the restriction that a repetition body can't match the empty string, which means we can't use the epsilon-loop from that starting point.
It's possible that picking all the correct fork states while avoiding redundant fork states is too hard, and in either case, it's fine to be out of scope for this PR. But I'd still like to see some justification for the "mid" criterion, if only to explain that it's a heuristic.
There was a problem hiding this comment.
I like your explanation - I'm going to put it in.
The missing piece here is of course an explanation of why we check for a "mid" state being an infinite repetition quantifier as well. It seems that check is the reason we don't flag
(a?b?)*cso I'm immediately a little suspicious of it.
That is correct. Without the mid check we would flag (a?b?)*c.
But we would also flag (a?)*, which is a false positive (that only blows up quadraticly).
So I want to keep it.
I believe the only correct fork state in (a?b?)*c is immediately after the a since there are two ways to read b afterwards: with or without the loop. The infinite repetition quantifier state is not actually a valid fork due to the restriction that a repetition body can't match the empty string, which means we can't use the epsilon-loop from that starting point.
I'm not sure what the right fork state is, but your proposal definitely sounds plausible.
The problem is that there are lambda transitions from everything to everything, and it is therefore difficult to recognize the pattern based on the NFA.
As I said in the previous PR. I think the best way to go is a more syntactic special case for the (a?b?)*c pattern.
This would also allow make it easier to get the correct pump string.
(Adding that special case is a thing for a later PR).
There was a problem hiding this comment.
A problem is that the fork for (a?b?)*c requires multiple transitions, and our isFork predicate only a single transition.
The fork (as I see it) is between:
a-> [skipband go back to the start of the loop] -> [skipa] ->ba->b
Both of these match the string ab, but they took 2 different routes in the NFA.
I'm not sure it's feasible to expand isFork to support the multiple transition steps required for that fork.
| or | ||
| step(_, _, _, q1, q2) and | ||
| q1.toString() <= q2.toString() | ||
| step(_, _, _, q1, q2) |
There was a problem hiding this comment.
Nice catch, that can't have been fun to debug. 😬
Merging symmetric states in a product automaton is a legit optimization, though, so removing that doesn't seem right.
I think the real bug was that we forgot to check for a transition to (q2,q1) which is equal to (q1,q2) after merging symmetric states. Could you try this instead?
(step(_, _, _, q1, q2) or step(_, _, _, q2, q1)) and
q1.toString() <= q2.toString()There was a problem hiding this comment.
Nice catch, that can't have been fun to debug. grimacing
Oh yeah! Debugging predicate delta2(StatePair q, StatePair r) was no fun.
I made graphs of state-pairs, those were not fun!
Could you try this instead?
That worked! And it slightly improves performance.
I'll try to it generalizes it to 3-tuples (for super-linear ReDoS).
Edit: The order matters in my 3-tuples, so it didn't generalize.
d62f957 to
28cffa1
Compare
|
I've rebased on |
esbena
left a comment
There was a problem hiding this comment.
Diff and comments LGTM, and I believe you have addressed Asger's comments sufficiently.
All of my comments are optional.
08a0b78 to
7e21081
Compare
Gets a TP/TN for CVE-2020-26289 and CVE-2017-16111
There is a few things happening in this PR
Removing unsound optimization:
The optimization removed around half the state-pairs.
But in some cases, that would cause other parts of the state-pair-graph to become disconnected.
What essentially happened was that
EandCwas removed from the below graph, which caused FNs.Support nested stars.
Consider the below two regular expressions:
(they link to their NFA, as generated by the
deltapredicate).var tst2 = /(d*)*e/var tst3 = /(f*)+g/;I added support for this pattern by detecting the epsilon-transition-loop to the inner repetition.
I have not added support for
/(a?b?)*c/.My previous PR at one point included support for detecting
/(a?b?)*c/, but that support was way to ad-hoc.I think I understand why this regexp causes worst-case
O(2^n)runtime:a?b?can match 3 possible string:a,b, orab.ababab, then all combinations of the above 3 will be attempted.aorbwon't work, as that will remove 2 of the possible 3 matches.Here is the NFA for
/(a?b?)*c/(as generated by thedeltapredicate).Here is a simplified version of the above NFA.
I've only found a few TPs that are caused by this pattern, examples:
/^-?\d+((\.\d+)?[A-Za-z%$]?)+$//javascript((s)?( )?)*://^([^.]?\.?)*__v$/,I think the way to go is to directly detect the syntax (sequence of optionals inside a repetition), and then add that as a new case inside
isReDoSCandidate.I'll look at that later, but I would appreciate early feedback.