Whether by accident or through bad practice, sensitive data such as API keys is being leaked by developers who push hard-coded credentials into production environments.
The crux of key discovery is entropy analysis. Tokens are designed to have high randomness, or entropy, to make them very difficult for an attacker to brute force. Researchers use this to their advantage by parsing files and looking for high-entropy strings. By far the most common way of doing so is Shannon’s entropy calculation. The mathematician Claude Shannon introduced a way of quantifying entropy in 1948 in the paper “A Mathematical Theory of Communication”. Researchers have applied this math to the discovery of leaked secrets by setting a minimum entropy threshold of 4.3. Take for example the two strings “qUMOImqy7XeJn4HB96RGLPTYp67wGm39” and “TotallyNotASecret183”. Using Shannon’s math, “qUMOImqy7XeJn4HB96RGLPTYp67wGm39” has an entropy score of 4.625 and passes the test, while “TotallyNotASecret183” has a score of 3.784 and falls short of the threshold for being flagged as an API token.
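The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular scanning tool: the names shannon_entropy, passes_entropy_check, and ENTROPY_THRESHOLD are chosen here for the example, and treating a score equal to the threshold as a pass is an assumption.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 4.3  # threshold quoted in the text; the name is illustrative


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of text in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    length = len(text)
    # H = -sum(p * log2(p)), where p is the relative frequency of each distinct character
    return -sum((n / length) * math.log2(n / length) for n in counts.values())


def passes_entropy_check(text: str, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """True when the score meets the threshold (inclusive comparison is an assumption)."""
    return shannon_entropy(text) >= threshold


for candidate in ("qUMOImqy7XeJn4HB96RGLPTYp67wGm39", "TotallyNotASecret183"):
    score = shannon_entropy(candidate)
    print(candidate, round(score, 3), passes_entropy_check(candidate))
```

Run against the two strings above, the first scores 4.625 and passes, while the second scores 3.784 and is rejected.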
While Shannon’s method of quantifying entropy works on a basic level, taking a look at the math and its modern application shows something curious.
When applied, this math uses the probability of a character being chosen from a sequence to quantify entropy. We challenge this thought. Applied to strings, this measures redundancy, NOT randomness. The probability of each character is unchanged regardless of the order of the sequence. This is demonstrated by the entropy of “AABBCCDDEEFF” and “BECFDFBDEACA”: both strings contain the same characters the same number of times, and both share a score of approximately 2.585. Shannon’s entropy calculation does not take into account the relationships between characters and therefore misses the mark.
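The order-independence is easy to check directly. The sketch below (function and variable names are illustrative) scores both strings and confirms that their character counts, and therefore their entropy, are identical. With six distinct characters each appearing twice, the score is simply log2(6).

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character, from character frequencies only."""
    counts = Counter(text)
    return -sum((n / len(text)) * math.log2(n / len(text)) for n in counts.values())


ordered = "AABBCCDDEEFF"
shuffled = "BECFDFBDEACA"

# Both strings have the same frequency table, so the formula cannot tell them apart.
same_counts = Counter(ordered) == Counter(shuffled)
same_score = math.isclose(shannon_entropy(ordered), shannon_entropy(shuffled))

print(same_counts, same_score)
print(math.isclose(shannon_entropy(ordered), math.log2(6)))
```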
The goal is to discover and remediate as many leaked tokens in the wild as possible. To accomplish this, several false positive reduction techniques have been developed.
On average, ~48.18% of characters in a JWT token are uppercase
Tokens are random by nature, so it is futile to set a rule for exactly how many characters should be uppercase. However, we can set guidelines for what we should reasonably expect. If ~48.18% of the characters in a JWT are uppercase on average, it is reasonable to expect at least 15% of the characters in any given token to be uppercase. To emphasize this point, we tested 1,000,000 freshly generated JWTs: 0.0% had fewer than 15% uppercase characters.
Here is an example of the false positive reduction in action.
The string “abcdefghijklmnopqrst” has an entropy score of 4.32, which passes the original test that many tools use. However, it contains no uppercase characters and so falls short of the 15% minimum, which means it is disregarded. The same concept sets an upper bound: a string with more than 75% uppercase characters is not recognized as a valid token.
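A minimal sketch of this filter follows. The 15% and 75% bounds come from the text. The function names are illustrative, and measuring the ratio over every character in the string (digits and symbols included) is an assumption, as is treating the bounds as inclusive.

```python
MIN_UPPER_RATIO = 0.15  # lower bound from the text
MAX_UPPER_RATIO = 0.75  # upper bound from the text


def uppercase_ratio(text: str) -> float:
    """Fraction of all characters in text that are uppercase letters."""
    if not text:
        return 0.0
    return sum(ch.isupper() for ch in text) / len(text)


def plausible_case_mix(text: str) -> bool:
    """True when the uppercase share falls inside the expected band (inclusive)."""
    return MIN_UPPER_RATIO <= uppercase_ratio(text) <= MAX_UPPER_RATIO


for candidate in ("abcdefghijklmnopqrst", "qUMOImqy7XeJn4HB96RGLPTYp67wGm39"):
    print(candidate, round(uppercase_ratio(candidate), 3), plausible_case_mix(candidate))
```

The all-lowercase string is rejected despite its high entropy score, while the sample token from earlier, with 15 uppercase characters out of 32, stays in the candidate set.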
Based on the graph, the letters “w”, “k”, “x”, “j”, “q”, and “z” are the 6 least common letters in English. Keeping this in mind, it is important to note that API tokens are designed to be random, so they do not adhere to these lexical patterns. Analysis of JWTs makes it clear that letters appearing less often in the English language appear statistically more often in strings designed for high entropy. In a 32-bit sequence of alphanumeric characters and symbols, 5.8 of the 6 least common letters appear on average.
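One way to turn this observation into a measurable signal is to count how many of the six rare letters occur in a candidate string. The sketch below is only an illustration: the names are hypothetical, matching letters case-insensitively is an assumption, and no cutoff is applied because the text does not state one.

```python
RARE_LETTERS = frozenset("wkxjqz")  # the six least common letters named in the text


def rare_letters_present(text: str) -> set:
    """Return which of the rare letters occur in text (case-insensitive, an assumption)."""
    return set(text.lower()) & RARE_LETTERS


def rare_letter_count(text: str) -> int:
    """Number of distinct rare letters found in text, from 0 to 6."""
    return len(rare_letters_present(text))


for candidate in ("qUMOImqy7XeJn4HB96RGLPTYp67wGm39", "TotallyNotASecret183"):
    print(candidate, sorted(rare_letters_present(candidate)), rare_letter_count(candidate))
```

On the two sample strings from earlier, the random token contains four of the six rare letters, while the English-like string contains none.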
With all of these techniques combined, the number of potential false positives is low enough that a Discord bot was built to automate much of this process.
The link to the bot can be found here:
The next goal is to completely automate this process and scan the Alexa top 1,000 domains. Goon Security’s next release on July 14th, 2020 will explore the success of this process as well as vulnerabilities discovered from complete automation.
- Jaggar Henry
- Antero Nevarez-Lira
- Donald Connors
- Evelyn Griffin
Shannon, C. E., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press.
Lexico. (n.d.). Retrieved from https://www.lexico.com/explore/which-letters-are-used-most