Hackerman's Hacking Tutorials

The knowledge of anything, since all things have causes, is not acquired or complete unless it is known by its causes. - Avicenna

Oct 28, 2023 - 8 minute read - semgrep, Static Analysis

Semgrep's Experimental Rule Syntax

Semgrep has an experimental and (IMO) more readable rule syntax. I am converting my own reference into a tutorial.

Disclaimer: Semgrep (binary, playground, cloud, etc.) supports the experimental syntax, but it's not released. If you're from the future and things have changed, let me know somehow. E.g., make an issue in the blog's source at parsiya/parsiya.net or create a pull request.

TL;DR

Use these tables:

Old                     Experimental
patterns (top-level)    match and all
patterns (other)        all
pattern                 [can be removed]
pattern-not             - not
pattern-either          any
pattern-inside          inside
pattern-not-inside      inside under not

These items go inside a where clause:

Old                       Experimental
metavariable-pattern      metavariable and pattern
metavariable-regex        metavariable and regex
metavariable-comparison   metavariable and comparison
metavariable-analysis     metavariable and analyzer
focus-metavariable        focus

Taint mode changes

Old                   Experimental
mode: taint           removed
match (taint mode)    taint
pattern-sources       sources
pattern-sinks         sinks
pattern-propagators   propagators
pattern-sanitizers    sanitizers
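
The examples later in this post don't cover taint mode, so here is a minimal sketch of how the taint mappings might fit together. The rule id and the function names (get_user_input, run_query, escape) are made up for illustration. I'm also assuming that sources, sinks, and sanitizers accept bare patterns the same way all does. Try it in the playground before relying on it.

```yaml
rules:
- id: hypothetical-taint-new-syntax
  # no `mode: taint`; `taint` takes the place of `match`
  taint:
    # was pattern-sources
    sources:
      - get_user_input(...)
    # was pattern-sinks
    sinks:
      - run_query(...)
    # was pattern-sanitizers
    sanitizers:
      - escape(...)
  message: _removed_
  languages: [python]
  severity: ERROR
```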

Official References

I've only been able to find two references so far:

Example 1

Modified version of the first example in the Advanced Rule Tutorials, practice playground link.

rules:
- id: blog-2023-10-use-decimalfield-for-money-old
  patterns:
  # I know this `patterns` can be replaced by one `pattern`
  # but it's modified for the tutorial.
  - patterns:
    - pattern: $F = django.db.models.FloatField(...)
    - pattern: $F = django.db.models.FloatField(...)
  - pattern-inside: |
      class $M(...):
        ...      
  - metavariable-regex:
      metavariable: '$F'
      regex: '.*(price|fee|salary).*'
  message: _removed_
  languages: [python]
  severity: ERROR

Top-Level pattern(s) -> match and all

The top-level pattern or patterns becomes match. It's almost always followed by all or any.

rules:
- id: use-decimalfield-for-money-new-syntax
  # top-level patterns replaced by match and all.
  match:
      # the rest of the patterns
      # # I know this `patterns` can be replaced by one `pattern`
      # # but it's modified for the tutorial.
      # - patterns:
      #   - pattern: $F = django.db.models.FloatField(...)
      #   - pattern: $F = django.db.models.FloatField(...)
      # - pattern-inside: |
      #     class $M(...):
      #       ...
      # - metavariable-regex:
      #     metavariable: '$F'
      #     regex: '.*(price|fee|salary).*'
  message: _removed_
  languages: [python]
  severity: ERROR
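
The "almost always" leaves room for a rule that is just a single pattern. As far as I can tell, match can then take the pattern directly, with no all or any wrapper. Treat this as an assumption to verify in the playground; the rule id is made up.

```yaml
rules:
- id: hypothetical-single-pattern
  # a single pattern, so no `all` or `any` wrapper
  match: $F = django.db.models.FloatField(...)
  message: _removed_
  languages: [python]
  severity: ERROR
```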

Other patterns -> all

Any other patterns key nested under the top-level one is replaced by all. Our example has a redundant nested patterns with two identical children to show how it is converted.

Note that if we had a pattern-either here we would use any.

rules:
- id: use-decimalfield-for-money-new-syntax
  # top-level patterns replaced by match and all.
  match:
    all:
      # rest of the patterns
      - pattern: $F = django.db.models.FloatField(...)
      - pattern: $F = django.db.models.FloatField(...)
      # - pattern-inside: |
      #     class $M(...):
      #       ...
      # - metavariable-regex:
      #     metavariable: '$F'
      #     regex: '.*(price|fee|salary).*'
  message: _removed_
  languages: [python]
  severity: ERROR
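
To illustrate the note about pattern-either: if the two children were alternatives instead of both being required, the container would be any. A sketch with a made-up rule id and a second field type (IntegerField) that is not part of the original rule:

```yaml
rules:
- id: hypothetical-float-or-integer-field
  # old syntax: a top-level `pattern-either`
  match:
    any:
      - pattern: $F = django.db.models.FloatField(...)
      - pattern: $F = django.db.models.IntegerField(...)
  message: _removed_
  languages: [python]
  severity: ERROR
```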

pattern can be removed

The pattern keyword can be omitted. Replace pattern: [something] with just - [something].

- pattern: [something]  ---> - [something]

- pattern: |            ---> - |
    [something]                   [something]
    [more lines]                  [more lines]

More changes:

rules:
- id: use-decimalfield-for-money-new-syntax
  # top-level patterns replaced by match and all.
  match:
    all:
      # the rest of the patterns
      # I know this `patterns` can be replaced by one `pattern`
      # but it's modified for the tutorial.
      - $F = django.db.models.FloatField(...)
      - |
        $F = django.db.models.FloatField(...)        
      # - pattern-inside: |
      #     class $M(...):
      #       ...
      # - metavariable-regex:
      #     metavariable: '$F'
      #     regex: '.*(price|fee|salary).*'
  message: _removed_
  languages: [python]
  severity: ERROR

There's one catch: if your pattern contains a colon (:), it might break the YAML parsing. Either use a bar (|) to move the pattern to the next line or enclose it in double quotes ("). The explanation is at 1:26 in the reference video.
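
A sketch of the colon problem and both workarounds. The dictionary pattern is hypothetical and only there because it contains a colon followed by a space. This is a fragment of a rule, and the two working items are the same pattern written two ways.

```yaml
match:
  all:
    # Broken: YAML reads the `: ` as a key/value separator,
    # so this item becomes a map instead of a pattern string.
    # - $D = {..., 'password': $V, ...}

    # Option 1: use a bar and put the pattern on the next line
    - |
      $D = {..., 'password': $V, ...}
    # Option 2: enclose the pattern in double quotes
    - "$D = {..., 'password': $V, ...}"
```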

pattern-not -> - not

We don't have it in our current example, but it's similar to pattern.

- pattern-not: [something]  ---> - not: [something]

- pattern-not: |            ---> - not: |
    [something]                       [something]
    [more lines]                      [more lines]    
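
Since the Django example has no pattern-not, here is a sketch of what one could look like after conversion. The null=True exclusion and the rule id are made up for illustration and are not part of the original rule.

```yaml
rules:
- id: hypothetical-floatfield-not-nullable
  match:
    all:
      - $F = django.db.models.FloatField(...)
      # was: - pattern-not: $F = django.db.models.FloatField(..., null=True, ...)
      - not: $F = django.db.models.FloatField(..., null=True, ...)
  message: _removed_
  languages: [python]
  severity: ERROR
```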

pattern-inside -> inside

Easy, peasy.

rules:
- id: use-decimalfield-for-money-new-syntax
  # top-level patterns replaced by match and all.
  match:
    all:
      # the rest of the patterns
      # I know this `patterns` can be replaced by one `pattern`
      # but it's modified for the tutorial.
      - $F = django.db.models.FloatField(...)
      - |
        $F = django.db.models.FloatField(...)        
      # pattern-inside
      - inside: |
          class $M(...):
            ...          
      # - metavariable-regex:
      #     metavariable: '$F'
      #     regex: '.*(price|fee|salary).*'
  message: _removed_
  languages: [python]
  severity: ERROR

where

where acts as a container for elements that add conditions to metavariables. We will use metavariable-regex as an example:

  1. Add a where clause at the same level as all.
  2. Replace metavariable-regex with metavariable and regex.

rules:
- id: use-decimalfield-for-money-new-syntax
  # top-level patterns replaced by match and all.
  match:
    all:
      # I know this `patterns` can be replaced by one `pattern`
      # but it's modified for the tutorial.
      - $F = django.db.models.FloatField(...)
      - |
        $F = django.db.models.FloatField(...)        
      # pattern-inside
      - inside: |
          class $M(...):
            ...          
    where:
      # metavariable-regex
      - metavariable: $F
        regex: '.*(price|fee|salary).*'
  message: _removed_
  languages: [python]
  severity: ERROR

See the final rule in the playground.

Other elements that appear under where have also been modified:

  • metavariable-pattern
  • metavariable-analysis
  • metavariable-comparison
  • focus-metavariable

We can use them like this:

rules:
- id: sample-rule
  match:
    all:
      # removed
    where:
      # metavariable-regex
      - metavariable: $F
        regex: '.*(price|fee|salary).*'
      # metavariable-analysis
      - metavariable: $F
        analyzer: redos
      # focus-metavariable becomes `focus`
      - focus: $F
  message: _removed_
  languages: [python]
  severity: ERROR

metavariable-pattern is tricky because it can contain multiple patterns, but it's similar to the patterns we've seen before.

    where:
      # metavariable-pattern
      - metavariable: $F
        pattern: "some pattern"
      # if it had multiple patterns
      - metavariable: $F
        all:
          - "pattern1"
          - "pattern2"

Example 2

This one is a C++ Hotspot rule that tracks when arrays are passed to functions. The complete rule is on GitHub and has a handy triage guide.

I will be using a partial version of the rule, playground link.

rules:
- id: arrays-passed-to-functions-partial
  patterns:
    # a lot of ways to create an array
    - pattern-either:
      - pattern-inside: |
          $TYPE $BUF[$SIZE] = $EXPR;
          ...          
      - pattern-inside: |
          $TYPE $BUF[$SIZE];
          ...          
    # we don't want to flag these usages again
    - pattern-not-inside: free($BUF);
    - pattern-not-inside: delete($BUF);
    # exclude uppercase variables, these are usually constants
    - metavariable-regex:
        metavariable: $BUF
        regex: (?![A-Z0-9_]+\b)
    # flag if it's passed to a function
    - pattern: $FUNC(..., $BUF, ...);
  message: _removed_
  languages:
    - cpp
  severity: WARNING

pattern-not-inside -> inside under not

The only new item here is pattern-not-inside.

rules:
- id: arrays-passed-to-functions-partial
  match:
    # removed everything else
    - pattern-not-inside: free($BUF);
    - pattern-not-inside: delete($BUF);

First, we create a not and then add an inside under it. Also note how the inside is indented unlike - not: [pattern] (from pattern-not).

rules:
- id: arrays-passed-to-functions-partial
  match:
    all:
      # removed everything else
      - not:
          inside: free($BUF);
      - not:
          inside: delete($BUF);

I thought I could merge the two nots. You cannot. It's a map and if you add two inside, you will get an error that keys must be unique.
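
Side by side, the two shapes of not and the merge that fails look like this. It's a fragment for illustration, not a complete rule, and I'm assuming the list sits under all as in Example 1.

```yaml
match:
  all:
    # converted pattern-not: the pattern sits on the same line as `not`
    - not: free($BUF);
    # converted pattern-not-inside: `inside` is a key nested under `not`
    - not:
        inside: free($BUF);
    # Does not work: `inside` appears twice in the same map,
    # which triggers the "keys must be unique" error.
    # - not:
    #     inside: free($BUF);
    #     inside: delete($BUF);
```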

pattern-either -> any

any will act as OR.

- pattern-either:
  - pattern-inside: |
      $TYPE $BUF[$SIZE] = $EXPR;
      ...      
  - pattern-inside: |
      $TYPE $BUF[$SIZE];
      ...      

becomes:

- any:
  - inside: |
      $TYPE $BUF[$SIZE] = $EXPR;
      ...      
  - inside: |
      $TYPE $BUF[$SIZE];
      ...      

Final Results

The rest is routine:

  1. Top-level patterns -> match and all.
  2. pattern-either -> any.
  3. pattern-not-inside -> not and inside.
  4. metavariable-regex -> metavariable and regex, moved under where.
  5. pattern (the word) is just removed.

rules:
- id: arrays-passed-to-functions-partial
  match:
    all:
      # a lot of ways to create an array
      - any:
        - inside: |
            $TYPE $BUF[$SIZE] = $EXPR;
            ...
        - inside: |
            $TYPE $BUF[$SIZE];
            ...
      # we don't want to flag these usages again
      - not:
          inside: free($BUF);
      - not:
          inside: delete($BUF);
      # flag if it's passed to a function
      - $FUNC(..., $BUF, ...);
    where:
      # exclude uppercase variables, these are usually constants
      - metavariable: $BUF
        regex: (?![A-Z0-9_]+\b)
  message: _removed_
  languages:
    - cpp
  severity: WARNING

Implied Constant Propagation

The original rule and the one we created do not have the same matches. The original rule has three matches, playground link.

[Image: old rule]

The modified rule only returns one match, playground link.

[Image: new rule]

The reason is that constant propagation is on by default in the experimental syntax (at least for now). Credit: Cooper Pierce, Semgrep.

Adding an options key that turns constant propagation off gives the same matches as the original rule, playground link.

rules:
  - id: blah-blah
    options:
      constant_propagation: false
    match:
      all:
      # the rest of the rule

Example 3

In the last example we will look at a complex metavariable-pattern rule from Semgrep examples, playground link for practice.

rules:
- id: blog-2023-10-open-redirect-old
  languages:
    - python
  message: Match found
  severity: WARNING
  patterns:
    - pattern-inside: |
        def $FUNC(...):
          ...
          return django.http.HttpResponseRedirect(..., $DATA, ...)        
    - metavariable-pattern:
        metavariable: $DATA
        patterns:
          # patterns

Converting the outside patterns is easy.

rules:
  - id: blog-2023-10-open-redirect-new
    languages:
      - python
    message: Match found
    severity: WARNING
    match:
      all:
        - inside: |
            def $FUNC(...):
              ...
              return django.http.HttpResponseRedirect(..., $DATA, ...)            
      where:
        - metavariable: $DATA
          patterns:
            # patterns

Now we do the same process for the inner patterns and replace patterns with all (we don't need a match). Things can get complicated quickly. We have three nested where clauses. One for the top metavariable-pattern, another for the 2nd one, and the last one is for metavariable-regex.

The result is in this playground link.

rules:
- id: blog-2023-10-open-redirect-new
  languages:
    - python
  message: Match found
  severity: WARNING
  match:
    all:
      - inside: |
          def $FUNC(...):
            ...
            return django.http.HttpResponseRedirect(..., $DATA, ...)          
    where:
      - metavariable: $DATA
        all:
          - any:
              - $REQUEST
              - $STR.format(..., $REQUEST, ...)
              - $STR % $REQUEST
              - $STR + $REQUEST
              - f"...{$REQUEST}..."
        where:
          - metavariable: $REQUEST
            all:
              - any:
                  - request.$W
                  - request.$W.get(...)
                  - request.$W(...)
                  - request.$W[...]
            where:
              - metavariable: $W
                regex: (?!get_full_path)

What Did We Learn Here Today?

We learned how to convert rules from the old Semgrep syntax to the experimental one. IMO, the experimental syntax is more readable. There are some inconsistencies, such as constant propagation being on by default (and probably more), but they're not a big issue.