Stop writing regexes to lint your code

Written on 2021-04-04

The plural of regex is regrets.

The problem with regex-based linting tools is fundamentally you are reimplementing language's grammar in hacky manner and you'll inevitably find out edge cases.

Take this simple python code as an example.

def add(a, b):
    return a + b

How would you match this code using regexes? Firstly regexes are a pain to work with multiple lines, secondly, python is not a regular language so you'll never be able to parse it with 100% guarantee using just regexes.

The most important thing to remember is: code is not a piece of text, it's a tree.

    function_definition
         /        \
      args        return
      /  \           |
     a    b       binary_op
                  /   |   \
                (+)   a    b

Mmmm, LISPy. λ

This code, when represented as tree removes lots of noise in the text that's irrelevant e.g. whitespace, newlines.

Enter Semgrep - "Semantic Grep"

Semgrep patterns

Two most basic primitive operators that will get you quite far using Semgrep are:

Examples:

print(...) # match any print statement

$A == $A   # match places where the same thing is compared with itself

Finding similar bugs

Recently I came across a bug that was modifying objects after committing to the database. This change would not get committed to the database.

def after_submit(self):
    # some irrelevant stuff that can be ignored
    self.status = "Submitted"
    # some more  irrelevant stuff

Naturally, you would want to check if the same issue is occurring in other places too. But there are thousands of instances where this method is used and writing code for one-off tasks is probably not worth the efforts.

Semgrep makes this task easy as well. You can copy-paste original buggy code, replace things to ignore with an ellipsis operator.

def after_submit(self)
    ...
    self.$ATTR = ...

After you have written this rule, you might as well add it to CI so it's never missed in code reviews.

Finally, here's how you'd match the original example shown in first codeblock:

def $FUNC($X, $Y):
  return $X + $Y

You can see this pattern in action on online editor: https://semgrep.dev/s/O1ew/

Conclusion

The examples I shared are trivial but you can use them to build far complex patterns by composing individual patterns and combining them with boolean logic e.g. "match this pattern but not that pattern" and so on. The documentation is excellent for beginners, so I will just point to you to their getting started page to begin using it.