Matt Daniels is a designer, coder, and data scientist with a love of hip-hop and numbers.
Literary elites love to rep Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 words, suggesting he knew over 100,000 words and arguably had the largest vocabulary, ever.
I decided to compare this data point against the most famous artists in hip hop. I used each artist’s first 35,000 lyrics. That way, prolific artists, such as Jay-Z, could be compared to newer artists, such as Drake.
35,000 words covers 3-5 studio albums and EPs. I included mixtapes if the artist was just short of the 35,000 words. Quite a few rappers don’t have enough official material to be included (e.g., Biggie, Kendrick Lamar). As a benchmark, I included data points for Shakespeare and Herman Melville, using the same approach (35,000 words across several plays for Shakespeare, first 35,000 of Moby Dick).
I used a research methodology called token analysis to determine each artist’s vocabulary. Each word is counted once, so pimps, pimp, pimping, and pimpin are four unique words. To avoid issues with apostrophes (e.g., pimpin’ vs. pimpin), they’re removed from the dataset. It still isn’t perfect. Hip hop is full of slang that is hard to transcribe (e.g., shorty vs. shawty), compound words (e.g., king shit), featured vocalists, and repetitive choruses.
It’s still directionally interesting. Of the 85 artists in the dataset, let’s take a look at who is on top.