New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
C++: Improve simple range analysis #4294
Comments
|
Thanks for the detailed write-up. I'll try to address the individual items as well as the overall direction. With a few exceptions that rarely come up in practice, the While I'd like to accept most of your contributions into Overall, we'd like to add support for languages features when the performance overhead of that support is proportional to the use of that language feature in a code base. Most of what you're proposing falls into that category: binary operators on unsigned values, Some of the other features you're proposing will inevitably make range analysis too slow to be feasible on certain codebases. That includes signed multiplication, signed division, and interprocedural analysis. The Achilles heel of In the near term, I propose that I'll create a In the long term, I hope we can improve the IR-based range analysis so much that it can become the default range analysis. There are still many unknowns in this endeavour, but the IR-based range analysis has two core advantages: loops don't lead to an explosion of (inexact) bounds, and ranges can be determined relative to other variables ( |
That'll come for free in the IR range analysis and all other IR-based libraries. In the IR, all of |
|
Thanks for the response! I think creating an |
|
I've opened #4332 to create |
|
I'll close this issue and let further work happen in separate per-feature issues and PRs. @gsingh93, I invite you to submit PRs that create extension modules and import them from |
gsingh93 commentedSep 17, 2020
•
edited
I've implemented a number of
SimpleRangeAnalysisExprs and I'd like to discuss if any of them could be merged upstream into the main library, or if they could be included as optional "extensions" that could be included in individual queries if needed. We've been running queries with these extensions on a number of big (internal) projects and they've proven useful.The files below may be missing some included files or helper functions (some are here), they're just for reference.
Supporting more operators in more cases
I've improved what is supported for
&,&=,<<,<<=,>>,>>=,\, and\=. (Sidenote: I'm wondering if we could support implementing only the operator and having the assignment version of that operator share that same implementation. I did this myself using this class, but it could be part of the library itself). I'm not going to discuss all the differences here, but as an example, the built in&operator only supports cases where both arguments are either unsigned or a non-negative constant. My changes support "constants" that come from variables, i.e.unsigned a = 7; 42 & a, as well as signed operands where the bounds are non-negative. The other operands "expand" the cases where we compute bounds similar to this.Shift operators
For the shift operators, there are some interesting considerations around what to do in certain cases of undefined or implementation-defined behavior. I've noted some of these cases in the code, and it seems for the cases CodeQL supports we don't treat this as undefined behavior but do what most common compilers do. This is fine, but just worth calling out.
Division
For
\, I chose to not consider division by zero in the bounds, as CodeQL only supports continuous ranges, so even if the code verifies zero isn't a possible value, it's quite common for zero to be in the range. And since this case isn't important unless you're looking for DoS (which we're not), we assume division by zero doesn't happen. This could be make configurable depending on the usecase.Signed multiplication
I've also implemented support for signed multiplication for
*and*=. This does have a significant performance hit, but that hit is acceptable in our use cases. It would be nice if an extension like this was included in the library and then could be imported into queries that were willing to take that performance hit. (Sidenote: I'm unsure why there's a performance hit here. Taking the max/min of four different values doesn't seem like something that should be slow).std::minandstd::maxWe found implementing the resultant range for
min/maxwas useful for reducing some FPs and not a huge performance hit: https://github.com/gsingh93/codeql/blob/b7d39f7aa41258927c1d7259f9ddd283554e8123/cpp/ql/src/experimental/semmle/code/cpp/rangeanalysis/rangeextensions/ContainerMethodRange.qll#L217-L261Container ranges
Probably not something we can merge as-is, but might be worth a discussion. A common FP we've seen is
str.size() + 1, or something like that with any method that returns a value relating to the container's size. In practice, this will never overflow on a 64-bit system, as no 64-bit system can support a container with this many elements. Since we run on 64-bit systems, it was helpful to limit the max size for many of these methods to some arbitrary number (we choseINT_MAX - 1). You can see these changes here. Maybe it's worth allowing users to customize this somehow?I will point out though that
std::arraycan be properly modeled since the size is fixed, the implementation from the above file could be merged upstream.Handling basic guards
I haven't tested this out yet, but it would reduce real world FP's we've seen and @lcartey has provided some code to test out. The basic goal is to filter out issues like this:
The untested implementation to filter this out is here, but we can implement a proper range analysis expression if #4273 is merged.
Interprocedural analysis
I have some extensions here that do some rudimentary interprocedural analysis. These range analysis expressions propagate from callers to the function parameters, and also support assigning to reference parameters (thanks @lcartey). This extension propagates the return value of a function to the caller only if the return value is tainted by user controlled data. The restriction of the data being user controlled is to improve performance, but I don't think this should be hardcoded in a general implementation. An
Extensionclass could be provided, and users could override methods there to customize when to propagate arguments or return values.There's a significant performance hit from these, but we're running them on a number of large internal projects, they're very helpful, and we haven't seen any timeouts. This would be another example of something that could be not included by default, but enabled after including a particular file.
Displaying Ranges in Paths
Triaging path queries that rely on range analysis is much easier when you can see the ranges at each step of the path. Here I've extended
DataFlow::ExprNode,DataFlow::ExplicitParameterNode, andDataFlow::DefinitionByReferenceNodeto provide range information when viewing the path in vscode. This has made triaging issues much faster. This is something that wouldn't be enabled by default, but would be enabled after including a file.The text was updated successfully, but these errors were encountered: