Skip to content

JS: Flag more results in js/redos#4823

Merged
codeql-ci merged 7 commits intogithub:mainfrom
erik-krogh:furtherReDoS
Jan 7, 2021
Merged

JS: Flag more results in js/redos#4823
codeql-ci merged 7 commits intogithub:mainfrom
erik-krogh:furtherReDoS

Conversation

@erik-krogh
Copy link
Contributor

@erik-krogh erik-krogh commented Dec 14, 2020

Gets a TP/TN for CVE-2020-26289 and CVE-2017-16111

There is a few things happening in this PR

Removing unsound optimization:

The optimization removed around half the state-pairs.
But in some cases, that would cause other parts of the state-pair-graph to become disconnected.

What essentially happened was that E and C was removed from the below graph, which caused FNs.

image

Support nested stars.

Consider the below two regular expressions:
(they link to their NFA, as generated by the delta predicate).
var tst2 = /(d*)*e/
var tst3 = /(f*)+g/;

I added support for this pattern by detecting the epsilon-transition-loop to the inner repetition.

I have not added support for /(a?b?)*c/.

My previous PR at one point included support for detecting /(a?b?)*c/, but that support was way to ad-hoc.
I think I understand why this regexp causes worst-case O(2^n) runtime:

  • The inner part a?b? can match 3 possible string: a, b, or ab.
  • When the input string is repetitions of ababab, then all combinations of the above 3 will be attempted.
  • Just repeating a or b won't work, as that will remove 2 of the possible 3 matches.

Here is the NFA for /(a?b?)*c/ (as generated by the delta predicate).

Here is a simplified version of the above NFA.

I've only found a few TPs that are caused by this pattern, examples:

  • /^-?\d+((\.\d+)?[A-Za-z%$]?)+$/
  • /javascript((s)?( )?)*:/
  • /^([^.]?\.?)*__v$/,

I think the way to go is to directly detect the syntax (sequence of optionals inside a repetition), and then add that as a new case inside isReDoSCandidate.
I'll look at that later, but I would appreciate early feedback.

@erik-krogh erik-krogh added JS Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish labels Dec 14, 2020
@erik-krogh erik-krogh requested a review from a team as a code owner December 14, 2020 13:07
@erik-krogh
Copy link
Contributor Author

An evaluation came back looking just fine.

No performance degradation, lots of new results, and no lost results.
(The dist-compare thinks some results were lost, but the same regexps are still reported, just slightly differently).

@erik-krogh erik-krogh removed the Awaiting evaluation Do not merge yet, this PR is waiting for an evaluation to finish label Dec 16, 2020
or
r1 = r2 and
q1 = q2 and
epsilonSucc+(q) = q and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a comment here explaining why this is a fork. It's not entirely trivial and it's part of what we discussed in the last PR, so let's continue that discussion here.

I've added my understanding of the problem below, feel free to use that or discuss here if you don't think it's accurate.

If q can reach itself by epsilon transitions, then there are two distinct paths to the q1/q2 state: one that uses the loop and one that doesn't. The engine will separately attempt to match with each path, despite ending in the same state. The "fork" thus arises from the choice of whether to use the loop or not.

To avoid every state in the loop becoming a fork state, we arbitrarily pick the InfiniteRepetitionQuantifier state as the canonical fork state for the loop (every epsilon-loop must contain such a state).

The missing piece here is of course an explanation of why we check for a "mid" state being an infinite repetition quantifier as well. It seems that check is the reason we don't flag (a?b?)*c so I'm immediately a little suspicious of it.

I believe the only correct fork state in (a?b?)*c is immediately after the a since there are two ways to read b afterwards: with or without the loop. The infinite repetition quantifier state is not actually a valid fork due to the restriction that a repetition body can't match the empty string, which means we can't use the epsilon-loop from that starting point.

It's possible that picking all the correct fork states while avoiding redundant fork states is too hard, and in either case, it's fine to be out of scope for this PR. But I'd still like to see some justification for the "mid" criterion, if only to explain that it's a heuristic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like your explanation - I'm going to put it in.

The missing piece here is of course an explanation of why we check for a "mid" state being an infinite repetition quantifier as well. It seems that check is the reason we don't flag (a?b?)*c so I'm immediately a little suspicious of it.

That is correct. Without the mid check we would flag (a?b?)*c.
But we would also flag (a?)*, which is a false positive (that only blows up quadraticly).
So I want to keep it.

I believe the only correct fork state in (a?b?)*c is immediately after the a since there are two ways to read b afterwards: with or without the loop. The infinite repetition quantifier state is not actually a valid fork due to the restriction that a repetition body can't match the empty string, which means we can't use the epsilon-loop from that starting point.

I'm not sure what the right fork state is, but your proposal definitely sounds plausible.
The problem is that there are lambda transitions from everything to everything, and it is therefore difficult to recognize the pattern based on the NFA.

As I said in the previous PR. I think the best way to go is a more syntactic special case for the (a?b?)*c pattern.
This would also allow make it easier to get the correct pump string.
(Adding that special case is a thing for a later PR).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A problem is that the fork for (a?b?)*c requires multiple transitions, and our isFork predicate only a single transition.

The fork (as I see it) is between:

  1. a -> [skip b and go back to the start of the loop] -> [skip a] -> b
  2. a -> b

Both of these match the string ab, but they took 2 different routes in the NFA.

I'm not sure it's feasible to expand isFork to support the multiple transition steps required for that fork.

or
step(_, _, _, q1, q2) and
q1.toString() <= q2.toString()
step(_, _, _, q1, q2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, that can't have been fun to debug. 😬

Merging symmetric states in a product automaton is a legit optimization, though, so removing that doesn't seem right.

I think the real bug was that we forgot to check for a transition to (q2,q1) which is equal to (q1,q2) after merging symmetric states. Could you try this instead?

(step(_, _, _, q1, q2) or step(_, _, _, q2, q1)) and
q1.toString() <= q2.toString()

Copy link
Contributor Author

@erik-krogh erik-krogh Dec 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, that can't have been fun to debug. grimacing

Oh yeah! Debugging predicate delta2(StatePair q, StatePair r) was no fun.
I made graphs of state-pairs, those were not fun!

Could you try this instead?

That worked! And it slightly improves performance.

I'll try to it generalizes it to 3-tuples (for super-linear ReDoS).
Edit: The order matters in my 3-tuples, so it didn't generalize.

@erik-krogh
Copy link
Contributor Author

I've rebased on main to resolve conflicts.

esbena
esbena previously approved these changes Jan 7, 2021
Copy link
Contributor

@esbena esbena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Diff and comments LGTM, and I believe you have addressed Asger's comments sufficiently.
All of my comments are optional.

@codeql-ci codeql-ci merged commit c193d9f into github:main Jan 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants