Discovering Leaked API Keys in Web Applications with Modern Entropy Analysis

Whether by accident or through bad practice, sensitive data such as API keys is regularly leaked by developers who push hard-coded credentials into production environments.

Many solutions have been offered, both paid and free, and while these tools have proven their ability to spot potential tokens in GitHub repositories, their weak false positive reduction makes it troublesome to scan a large volume of files elsewhere. JavaScript files, for example, may contain API keys, but current solutions generate so much noise that security researchers find themselves sorting through false positives by hand.

The crux of key discovery is entropy analysis. Tokens are designed to have high randomness, or entropy, to make them very difficult for an attacker to brute force. Researchers use this to their advantage by parsing files and looking for high-entropy strings. By far the most common way of doing so is with Shannon entropy. The mathematician Claude Shannon formalized the idea of quantifying entropy in 1948 in his paper “A Mathematical Theory of Communication”. Researchers have applied this math to discovering leaked secrets by setting an entropy threshold, commonly 4.3. Take for example the two strings “qUMOImqy7XeJn4HB96RGLPTYp67wGm39” and “TotallyNotASecret183”. Using Shannon’s formula, “qUMOImqy7XeJn4HB96RGLPTYp67wGm39” has an entropy score of 4.625 and passes the test, while “TotallyNotASecret183” scores 3.784 and does not meet the threshold expected of an API token.
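For reference, here is a minimal Python sketch of that per-character entropy calculation, applied to the two example strings above (the function name is purely illustrative):

    import math
    from collections import Counter

    def shannon_entropy(s: str) -> float:
        """Shannon entropy, in bits per character, of the character distribution of s."""
        if not s:
            return 0.0
        length = len(s)
        return -sum((count / length) * math.log2(count / length)
                    for count in Counter(s).values())

    print(shannon_entropy("qUMOImqy7XeJn4HB96RGLPTYp67wGm39"))  # 4.625 -> above the 4.3 threshold
    print(shannon_entropy("TotallyNotASecret183"))              # ~3.784 -> below the threshold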

While Shannon’s method of quantifying entropy works at a basic level, a closer look at the math and its modern application reveals something curious.
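For reference, the quantity being computed is Shannon’s entropy of the character distribution:

    H = -\sum_{i=1}^{n} p_i \log_2(p_i)

where p_i is the relative frequency of the i-th distinct character in the string, and the result is measured in bits per character.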



When applied to a string, this math quantifies entropy from the probability of each character appearing in the sequence. We challenge this approach: applied to strings, it measures redundancy, NOT randomness. The character probabilities are unchanged regardless of the order of the sequence. This can be seen by comparing the entropy of “AABBCCDDEEFF” and “BECFDFBDEACA” – both strings contain the same characters with the same frequencies, so both share a score of log2(6) ≈ 2.585. Shannon’s entropy calculation does not take character relationships into account and therefore misses the mark.
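Reusing the shannon_entropy sketch from earlier makes the point concrete – shuffling a string never changes its score:

    # Shannon entropy depends only on character frequencies, never on character order,
    # so a structured string and its shuffled counterpart receive the same score.
    print(shannon_entropy("AABBCCDDEEFF"))  # log2(6) ≈ 2.585
    print(shannon_entropy("BECFDFBDEACA"))  # identical score, despite the shuffled order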

The goal is to discover and remediate as many leaked tokens in the wild as possible. To accomplish this, several false positive reduction techniques have been developed. 

To attain maximum randomness, tokens are conventionally mixed-case. Thinking simply, a randomly generated ASCII letter has a 50% chance of being uppercase (it either is or it is not). This is easily verified by analyzing the uppercase percentage of 1,000,000 freshly generated JWT tokens as well as 1,216 JavaScript files:

  • On average, ~48.18% of characters in a JWT token are uppercase.

  • On average, ~4%–8% of characters in a JavaScript file are uppercase.

Tokens are random by nature, so it is futile to set a hard rule for exactly how many characters should be uppercase. However, we can set guidelines for what we should reasonably expect. If ~48.18% of the characters in a JWT token are uppercase on average, it is reasonable to expect at least 15% of a real token’s characters to be uppercase. To emphasize this point, we tested 1,000,000 freshly generated JWT tokens – 0.0% had less than 15% uppercase characters.

Here is an example of the false positive reduction in action.

The string “abcdefghijklmnopqrst” has an entropy score of 4.32, which passes the bare entropy test that many tools use. However, it contains no uppercase characters, falling short of the 15% minimum, so it is disregarded. The same reasoning sets a maximum of 75% uppercase characters for a string to be recognized as a valid token.
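A minimal sketch of that case filter, assuming the uppercase share is measured over all characters of the candidate (the function and parameter names are illustrative, not taken from any particular tool):

    def passes_case_filter(candidate: str, min_upper: float = 0.15, max_upper: float = 0.75) -> bool:
        """Keep candidates whose share of uppercase characters falls inside the expected range."""
        if not candidate:
            return False
        upper_ratio = sum(c.isupper() for c in candidate) / len(candidate)
        return min_upper <= upper_ratio <= max_upper

    print(passes_case_filter("abcdefghijklmnopqrst"))              # False: 0% uppercase
    print(passes_case_filter("qUMOImqy7XeJn4HB96RGLPTYp67wGm39"))  # True: ~47% uppercase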

It is possible to reduce false positives even further by performing lexical analysis on the potential token. Lexico published a ranking of the letters of the English language by how often they occur in written text. Since programming languages are rooted in English, source code follows similar letter-frequency patterns. This is easily shown by analyzing over 1,000 JavaScript files and counting the occurrences of each letter.

Based on the graph, the letters “w”, “k”, “x”, “j”, “q”, and “z” are the 6 least common letters in our sample. Keeping this in mind, it is important to note that API tokens are designed to be random, so they do not adhere to these lexical patterns. Analyzing JWT tokens makes it clear that letters which appear rarely in English appear statistically more often in strings designed for high entropy: in a 32-character alphanumeric-plus-symbol sequence, on average 5.8 of the 6 least common letters appear.
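One way to turn this observation into a filter is to count how many distinct rare letters appear in a candidate, in either case, and require a minimum number of them. The sketch below is illustrative; the minimum of 2 is an assumed threshold, not a value measured in this post:

    RARE_LETTERS = set("wkxjqz")  # the six least common letters identified above

    def rare_letter_count(candidate: str) -> int:
        """Number of distinct rare English letters present in the candidate, case-insensitively."""
        return len({c for c in candidate.lower() if c in RARE_LETTERS})

    def passes_lexical_filter(candidate: str, minimum: int = 2) -> bool:
        # The minimum of 2 distinct rare letters is an illustrative threshold,
        # not a value taken from the measurements in this post.
        return rare_letter_count(candidate) >= minimum

    print(rare_letter_count("qUMOImqy7XeJn4HB96RGLPTYp67wGm39"))  # 4 (q, x, j, w)
    print(rare_letter_count("TotallyNotASecret183"))              # 0 -> reads like English, not like a token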

By combining all of these techniques, the number of potential false positives is low enough that a Discord bot was built to automate the process to a great degree.

The EAN Discord bot takes a user-supplied URL and crawls it for linked JavaScript files. Those files are then scanned for high-entropy strings, run through the false positive reduction described above, and the surviving candidates are returned by the bot.
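For completeness, the individual checks above can be composed into a single candidate filter. The sketch below illustrates the overall approach under simplified assumptions (whitespace tokenization, a minimum candidate length of 20); it is not the EAN bot’s actual implementation, which lives in the repository linked below:

    ENTROPY_THRESHOLD = 4.3  # the baseline entropy threshold discussed earlier

    def looks_like_token(candidate: str) -> bool:
        """Combine the entropy, case-ratio, and lexical checks into a single verdict."""
        return (
            shannon_entropy(candidate) >= ENTROPY_THRESHOLD
            and passes_case_filter(candidate)
            and passes_lexical_filter(candidate)
        )

    def find_candidates(js_source: str, min_length: int = 20) -> list:
        # Naive tokenization: split on whitespace only. A real crawler would extract
        # string literals from the JavaScript; the minimum length of 20 is an assumption.
        return [chunk for chunk in js_source.split()
                if len(chunk) >= min_length and looks_like_token(chunk)]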

The link to the bot can be found here:
https://github.com/Goon-Security/EAN-Discord

The next goal is to completely automate this process and scan the Alexa top 1,000 domains. Goon Security’s next release, on July 14th, 2020, will explore the success of this approach as well as the vulnerabilities discovered through full automation.

Team:

  • Jaggar Henry
  • Antero Nevarez-Lira
  • Donald Connors
  • Evelyn Griffin

Sources:

Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press.

Lexico. (n.d.). Which letters are used most? Retrieved from https://www.lexico.com/explore/which-letters-are-used-most