What is a Spectrogram?

Have you ever wondered what it would be like to see sounds?

I once took a Speech Recognition AI course, and one of the concepts that fascinated me was that of spectrograms. Here's the spectrogram for the sound of a person speaking the words "nineteenth century"

A spectrogram helps us visualize sounds by decomposing them into their basic frequencies. In this type of visualization, the x axis is time (the progress through the audio clip), the y axis is frequency (low or high pitched), and the color represents loudness.

Something cool about these plots is that trained professionals can actually deduce what words are being uttered in an audio clip just by looking at the corresponding spectrogram. The representation is so useful for this purpose that many speech recognition software systems create a spectrogram as an initial step in the process of transcribing speech to text.

This is all possible because in speech, each sound has a characteristic look in the spectrogram. For example, different vowels can be distinguished by something called "formants": the position of a series of bands that show up near the bottom of a spectrogram. More specifically, the first 3 formants, F1, F2, and F3:

The image above shows the histogram for the words "bee" and "baa", showing the difference in the frequencies of the formants for these two vowels.

Here's a chart that shows the frequency of the formants for several English vowels:

And here's a cool fact from Encyclopædia Britannica: "Most people cannot hear the pitches of the individual formants in normal speech. In whispered speech, however, there are no regular variations in air pressure produced by the vocal cords, and the higher resonances of the vocal tract are more clearly audible. It is quite easy to hear the falling pitch of the second formant when whispering the series of words heed, hid, head, had, hod, hawed, hood, who’d." (Just don't try this too much, or you'll get dizzy from exhaling so much air.)

Another characteristic of vowels is that they have "overtones". These manifest themselves as equally-spaced horizontal lines that appear in a histogram when we see it in high resolution. In the following chart, pay attention to the very fine evenly-spaced horizontal lines (not the broad yellow blobs):

(Note that formants can span across several overtones.)

On a piano or a guitar, whenever you play a middle C, you're not only producing a pure 262Hz (middle C) sound. The instrument actually also produces at the same time a tone at twice the frequency, three times the frequency, etc. at integer multiples of that C note's frequency (aka the "fundamental frequency"). These are called "overtones" and are what gives a piano or a guitar its characteristic sound (aka its "timbre"), as opposed to sounding like a computer-generated beep. A similar phenomenon happens when a person pronounces a vowel (or any sound that uses the vocal chords). This is why we saw the equally-spaced parallel lines in the high-res spectrogram of vowel sounds.

While vowels can simply be identified by their formants, consonants have a wide range of looks and durations on a spectrogram. A "b" consonant and an "m" consonant look very different in the plot. Some easy to spot consonants are sounds such as "shhh", "chhh", "zzz", and "sss", since they have a very characteristic high-pitched component, so you will see a band of high frequencies light up at the top of the spectrogram. For example, here is a "sss" sound sandwiched between two vowels:

Just for completeness, I should mention that it's not always so clear-cut how to map a slice of a spectrogram to its corresponding phoneme. Different speakers pronounce words in slightly different ways and have different vocal ranges. And even when only considering a single speaker, individual sounds can change depending on the surrounding vowels or consonants. Also, let's not forget that a spoken sentence would look very different from normal on a spectrogram when it's whispered or spoken quickly.

There are smartphone apps that generate a spectrogram in real time. My favorite one so far is SpectrumView by Oxford Research (iOS only), but there are a few others out there. You can also try this cool-looking web app or this more sober-looking web app (make sure to press the Mic checkbox). Some fun things to try are: vowel sounds (notice the overtones? the formants? can you determine your vocal range?), consonant sounds ("sss", "zzz", "mmm", "rrr", "thh", "tee", "dee"), whispering, playing a note on a piano or another instrument (notice the overtones?), whistling (notice the lack of strong overtones?), a waterfall (white noise), and that high-pitched sound coming from the TV that you hear but your parents don't.

So there you have it. Now you're able to see sounds.

P.S. If you'd like to learn more, check out the following links:


The images are from:

Video Games for Language Learning

Playing video games is a surprisingly underrated way of developing vocabulary and reading skills in new languages.

I'm not even talking about educational video games or "gamified" rote-learning language apps. I'm talking about regular video games that let you change the default language.

Video games allow you to immerse yourself in the language. You're forced to actively improve your language skills if you want to progress through the game. You'll need to learn a new word to understand what specific object a game character just asked you to find. For some words, maybe you don't bother to look them up, but you constantly read them in certain contexts and, that way, realize what they mean. In the game, your character's tool inventory shows you pictures of objects with their corresponding labels, so they basically act as vocab flashcards. Finally, regardless of how you learn any new words, you're sometimes even able to associate these words with the specific part of the game where you encountered them: a village, a beach, a castle, a dream—the game world acts as a memory palace, helping you remember the vocabulary more easily.

Of course, the game needs to be at the right level of language learning difficulty. I wouldn't recommend this for someone who doesn't know anything about the language at hand. Yet, after learning the basic grammar, vocabulary, and phonetics of the language, it seems like a great technique to expand your knowledge.

For beginners, it seems best to play a game where text isn't too frequent and the grammatical structures are simple. For more advanced language learners, text-heavy games with complex grammatical structures, more advanced vocabulary, and even some wordplay provide more learning opportunities.

Last year I finished playing Breath of the Wild in Italian and now I'm playing some less text-heavy games in French. I wonder how much playing Ocarina of Time and other games helped me learn English in my youth.

The Secrets of the Mechanical Turk: Artificial Intelligence for Board Games

[Note: I wrote this piece on 29 December 2011, but I'm adding it to my blog today, 27 January 2016. Today was the release of the first Go playing program that can defeat a grandmaster]

In the year 1770, two centuries before the creation of the first general-purpose electronic computer, a man named Wolfgang von Kempelen constructed a seemingly impossible machine. The machine was known as The Turk, and it was an automaton with the ability to play chess. The robot was shaped like a human with Turkish clothes sitting in front of a desk, and was known to have won in chess against people such as Napoleon Bonaparte and Benjamin Franklin. The Turk, however, after 50 years of exhibition, was revealed to be a hoax. It contained an intricate system that allowed a human player hiding inside the desk to decide the next move (Schaffer, pp. 126-165). Artificial intelligence is a topic that has fascinated people even before computers existed, but now that they exist, how does a computer “think” in order to play a board game like chess?


The Turk

Defining AI

Artificial Intelligence is a very broad field of Computer Science. In the general sense, we can define AI as the study of rational agents. A rational agent is any program that “reads” an environment with sensors and performs rational actions on this environment through what we call actuators. A self-driving car, for example, might use a camera as a sensor, and steering, acceleration, and breaking as its actuators. This car would act rationally on its environment, the road, so as to avoid obstacles and reach its destination. An AI environment can be either discrete or continuous, like the board for a chess-playing agent or the road for the mentioned self-driving car, respectively. The environment can also be either deterministic, if the next state of the environment is completely determined by the current state and the action performed by the rational agent, like in checkers, or stochastic, where there is an element of chance that affects the outcome of the action of the agent, like in the card game Blackjack. An environment can also be either fully observable like in solving a Rubik’s cube, where the entire state of the game is known, or partially observable, like in a labyrinth, where the agent does not know everything about the state of the environment. An environment can have the property of being dynamic, where the environment can change while the agent is making a decision, like in timed chess or a self-driving car, or static, where the environment remains constant until the agent deliberates a decision. It can be episodic or sequential, where in the former what the agent experiences is divided into independent, single-action episodes, like a robot that classifies candy by color, and in the latter the actions of the agent affect further actions. Finally, an environment can be single-agent, where only one agent affects the environment, or multi-agent, where more than one agent compete against or collaborate with each other. (Russel, pp. 34-44)

As we can see, AI encompasses a lot of different types of environments. In this chapter, we will be specifically dealing with techniques for solving discrete, fully observable, static, sequential, two-player, turn-based, adversarial board games.


In tic-tac-toe, before a human player makes a move, he or she usually considers the consequences of such move, evaluating if it is a winning move and predicting what the opponent will do on the next turn. Similarly, before choosing a move, we can make a program that plays tic-tac-toe by testing every legal move it can make, then considering every possibility the opponent has after this move, then every possibility the computer has after that, and so on. For this, we create a game tree, a tree where each node represents a state of the game (the location of each piece and the player whose turn it is), each edge represents a move, and the root node represents the start state, in this case the empty 3x3 board, where it is X’s turn to play.  In order to choose the next move to make, our program can do a complete traversal on this tree in order to determine which move leads to the maximum number of possible wins assuming the opponent always plays its best move (this is called the minimax algorithm) (Russel, pp. 161-163). The game of tic-tac-toe is considered solved in the field of Artificial Intelligence since a computer can always know what the outcome of a given game will be if both players play optimally.

Tic-tac-toe Game Tree. Image courtesy of Wikipedia

A partial tic-tac-toe game tree


Chess is another game that can be played by a computer. One would think that a computer would be able to traverse the entire game tree in chess, as with tic-tac-toe, and be able to determine the most favorable strategy. However, chess has a bigger board (8 by 8) than tic-tac-toe, more types of pieces, more rules, and longer games in terms the amount of moves. This causes the game tree for chess to have a higher branching factor and a greater depth, greatly increasing the number of nodes to traverse as compared to tic-tac-toe. According to information theorist Claude Shannon, in his 1950 paper "Programming a Computer for Playing Chess," “A machine operating at the rate of one variation per microsecond would require over 1090 years to calculate the first move!” A modern computer running at one variation per nanosecond would take over 1087 years, a significant improvement in theory, but not in practice!


A partial chess game tree

If a game tree traversal is not feasible for chess, how is it that computers are nowadays able to defeat chess grandmasters? Computers use additional optimization techniques. The basic idea is to be able to stop the search on the tree at a certain point and be able to evaluate the current position without additional search. We evaluate the current position using an evaluation function, which is hard-coded into the program and given a position returns an integer indicating how convenient the position is. The this evaluation function varies from program to program, but it might, for example, give a higher score for having a central position, or open files, or more pieces than the opponent, or any combination of these. The only condition is that the function is based solely on the current position.

Another optimization technique that is used is retrograde analysis. This technique consists of working backwards though the game tree from common, known checkmates, repeatedly calculating what positions could have been the previous ones. By doing this, we can create an endgame tablebase, which is a database with the optimal move given a position during endgame. This technique is only feasible for late game when not many pieces remain on the board, but if used right, it is a powerful technique, since forwards decision tree traversal is weak in these situations.

In Computer Science, chess is considered partially solved, since it is not currently possible for a computer to know the outcome of any given game if both players play optimally; this is only possible for endgames, as seen above.


Go is a very interesting game in the field of AI; no computer program can currently play Go to a level above the one of human players. The game board is a 19x19 grid and the pieces can be placed without many restrictions: on any open position. This gives the game tree an incredibly big branching factor, much bigger than the one in chess.  Additionally, unlike in chess where the game becomes simpler as the game progresses and players capture pieces, Go usually becomes more complex as the game advances, and it is not practical to perform a retrograde analysis. Additionally, it is difficult to create a good evaluation function for this game, by the nature of the trade-offs existent in this game. Our traditional search seems to be practically worthless here, so we need other techniques along with this search.

One of the additional techniques that are used is Monte Carlo tree search. In general, a Monte Carlo method involves using a great amount of random samplings in order to compute something. In a Monte Carlo tree search for Go, for each edge branching out of the current state of the game, we simulate thousands of random games starting with that move, and then choose the move with the best set of resulting games. The advantage of this technique is that we do not need to device an evaluation function, since we simulate each game completely, reaching a leaf node each time. Even though a Monte Carlo tree search provides a strong overall strategy, it might overlook one particularly strong response from the opponent if all other responses are weak. This shortfall can be ameliorated by doing a limited depth normal tree search before generating moves and by adding some knowledge to the move selection rather than selecting moves completely randomly.

Another additional method used is machine learning, where a program previously plays hundreds of games, and modifies the values of its given parameters to improve its performance. The advantage of this method, like with the Monte Carlo method, is that the program develops its own rules and strategies instead of depending on hard-coded ones.

Like with chess, Go is considered partially solved. The game is solved on the smaller 4x4 and 5x5 boards, but, as we have seen, computers show a very weak play for matches on the regular 19x19 board.


The games we have seen so far have all been deterministic, so now we will consider the stochastic game of backgammon. In this game, dice rolls add the element of randomness to the game. Many strong backgammon programs use a form of machine learning called neural networks. This simulates a network of neurons to gain experience from previous games, but their explanation is beyond the scope of this essay. This technique allows us to let a backgammon program play against itself hundreds or thousands of times beforehand, developing a playing strategy that, in the case of backgammon, has been able to win against most human players.


Artificial Intelligence is a very broad and fascinating topic that has become increasingly popular with time. The known methods of solving AI problems might seem cryptic at first, but they become intuitive when we understanding the data structures they need and their underlying motivation. Even though the technologies in Artificial Intelligence are very advanced, like we have seen with chess-playing computers that can beat grandmasters, there is still much progress to be made. Perhaps some day computers will be able to play Go at the level of a grandmaster. Perhaps users will be able to play chess during a road trip while their car drives them to their destination.

Works Cited:

Russel, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. Upper Saddle River, NJ:Prentice Hall, 2010. Print.

Schaffer, Simon, The Sciences in Enlightened Europe, Chicago: The University of Chicago Press, 1999. Print.

Shannon, Claude E., “Programming a Computer for Playing Chess”, Philosophical Magazine  Murray Hill, N.J., 1950. Print.

Code: DERIBALL for TI-83/84 Plus

As an appendix to the previous tutorial, and in honor of my first programming language TI BASIC, here's a similar physics simulation but for the TI-83 Plus or TI-84 Plus graphing calculator. I wrote this calculator program almost five years ago after learning about dot products, so you might say the html5 tutorial I wrote recently is kind of *puts sunglasses on* derivative.

: ClrDraw
: Input
: Pt-On(X,Y)
: X->O:Y->P
: Input
: O-X->V:P-Y->W
: Xmin->A
: Xmax-Xmin->B
: Ymin->C
: Ymax-Ymin->D
: -D/700->G
: 0.9->F:0.8->E
: While 1
: If (Y1(X)≥Y)≠(Y1(X+V)>(Y+W))
: Then
: nDeriv(Y1,X,X)->J
: (V+WJ)/(1+J^2)->K
: (F+E)K-EV->V
: (F+E)KJ-EW->W
: End
: X+V->X
: Y+W->Y
: Pt-On(X,Y)
: W+G->W
: End

To use the program (after having entered the code into your calculator as a new program), edit Y1 under the Y= menu to be any function you like. Try Y1=X^2, for example. Now plot your preferred function and set the window any way you prefer. I've set Xmin and Xmax to -4 and 4, respectively, and Ymin and Ymax to 0 and 10, respectively.

Now run DERIBALL under the PRGM menu. This will show a moveable crosshair next to your plot; select the position for the hypothetical slingshot from which the ball will launch and press ENTER. Then select the point from which you'll release the ball from the slingshot. Voila! You'll now see the ball move around with gravity and bounce against the curve with friction.

Project: Ecua Cines



  • Displays up-to-date movie descriptions, trailers, and showtimes for all major movie theaters in Quito, Ecuador
  • Allows users to quickly compare movie times instead of loading each theater's web page
  • Works on mobile devices
  • Loads faster than any of the corresponding movie theaters' websites
  • Has at least one Easter egg


  • Zurb Foundation, jQuery, Font Awesome, and Box2DWeb for the front-end
  • PHP with Simple HTML DOM and MySQL for the back-end, with cron jobs and Google Page Speed optimizations
  • Photoshop for the logo design


The website automatically obtains all movie showtime information from each cinema's official website, and then it has to be able to show these showtimes grouped by movie or by cinema. While this may seem like a trivial task, it turned out to be an interesting algorithmic challenge.

Human Steps:

  1. Open each movie theater's website
  2. Recognize that "Superman: El hombre de acero", " EL HOMBRE DE ACERO", and "Hombre de Acero" all refer to the same movie.
  3. Copy and paste the movie title (pick one of the three variations) and description along with the showtimes for each theater into Ecua Cines's database

Robot Steps:

  1. Open each movie theater's website
  2.  Find a piece of text that represents the movie's title by traversing each website's html tags according to hard-coded directions.
  3. Recognize that "Superman: El hombre de acero", " EL HOMBRE DE ACERO", and "Hombre de Acero" all refer to the same movie.
  4. Traverse cinema website according to hard-coded rules to find the movie summary text. (we only need to do this once for each movie)
  5. Traverse  cinema website according to hard-coded rules to find the movie times. Get each of the times by matching pieces of text with one or two digits followed by a ":" or an "h" and followed by two more digits. Assume a 24-hour time format.
  6. Save the title, description, and showtimes for each movie as obtained in steps 3, 4, and 5 into Ecua Cines's database

How can we teach a robot to ignore the differences between the three strings of characters from step 3, but still differentiate these from other movie titles? This is what I ended up making the program do, and that has worked in practice:

  • Trim whitespace from the start and end of movie titles
  • Remove all accents from letters, e.g. turn all 'á's into 'a's (this accounts for the fact that many people prefer to capitalize "áéíóú" as "AEIOU" instead of "ÁÉÍÓÚ")
  • TURN THE MOVIE TITLES INTO ALL CAPS (this accounts for variations in capitalization)
  • Remove commonplace Spanish words like "EL," "LA," and "Y" (only keep 'important' words)
  • Match each movie title with the title from another theater with the shortest Levenshtein distance to this title, but requiring a maximum threshold to avoid false positives when there actually is no valid match. (this accounts for small misspellings and singular-plural variation e.g. "MONSTERS UNIVERSITY" vs. "MONSTER UNIVERSITY")
  • If the thresholded Levenshtein method finds no matches, match together movie titles that have a 8-or-more-letter common substring (there's a really cool dynamic programming algorithm for efficiently finding the greatest common substring of two strings)
  • If everything else fails, conclude that this movie is unique to this theater