Friday, November 20, 2015

Tap-to-Jump Game Solver using Computer Vision

Ever been frustrated playing Flappy Bird? How about outsourcing that job to someone, or in this case, something that’s far more competent than us, mere humans. Tap-2-Jump (T2J) games rely on hand-eye-coordination, i.e. to see the game, and react accordingly, at the right time. Looks, like a problem that can be solved with a webcam and a computer, and a pair of ECE-5554 Computer Vision Grad Students at Virginia Tech. Hence, we present an approach to solving tap-to-jump games using computer vision.

Here, we attempt to solve two T2J games, Flappy Bird and TRex Runner (that game you get to play in Google Chrome when there’s no internet). The games are played on a phone and a capacitive actuator is used to register the taps. The computer looks at the phone via a camera and processes the game state and decides on a resultant action. We use MATLAB for image acquisition, processing and interfacing. An Arduino-Uno is used as interface between MATLAB and the capacitive actuator. The results were entertaining, to say the least. The computer vision and response systems work well, but are limited by the physical attributes of the hardware we employed.

enter image description here

enter image description here


T2J games are often fast paced, and require precise timed action. Using a Computer Vision (CV) system to solve this would indeed require the system to operate in real-time. That would mean real-time processing of image data and responding preemptively based on prediction model. The algorithms presented here harness every advantage and assumption to reduce the processing time without a loss in accuracy. At the same time, we maintain the environment to be the same as it would have been if played by a human. The algorithms are perspective invariant (say, the phone is tilted or rotated).

Approach & Experimentation

System Model

enter image description here

A simple off-the-shelf consumer grade webcam, Logitech C310 is used as the image acquisition device. Acquisition caps at 30 FPS at 1280x720 resolution. By default the webcam is focused at face-to-laptop typical distance. Since, this webcam does not feature auto-focus, we had to dismantle the device and readjust the focus control on the sensor to suit our application.

enter image description here

Emulating touch on a capacitive screen is an interesting problem of it’s own [1]. In order to register a touch, a part of the screen needs to be discharged. We use a 5V relay and a folded strip of aluminium foil to switch between GND (touch) and float (no-touch), controlled by an Arduino. MATLAB interfaces with the Arduino over Serial, and simply informs the microcontroller to initiate a tap response ( response time varies around 100ms).

Flappy Bird

Flappy Bird is a straightforward game where we have to navigate a bird through through obstacles. There game world is subject to gravity which effect the flight-path of the bird. As with most T2J games, a collision with an obstacle results in a “game over”.

Algorithms and methods are built using webcam recorded videos of the game being played on the phone (which itself was recorded by screen-capture). This would emulate the actual environment, i.e. the webcam having to see the phone.

Game Space

To begin with, the webcam is polled for a frame. Note that the phone itself might be oriented or tilted. Therefore, we first extract a perspective corrected image from the frame acquired. This is done by detecting the four corners of the phone screen and computing a inverse warp of the scene. From the center of the screen, patches are sought out, such that they have a strong response to the Shi-Tomasi corner detection function [2]. For this purpose, in the HSV color-space, we observe the Intensity channel. From the center of the frame, a patch is shifted upwards until it strikes a strong Edge (the phone’s boundary against the display area). Once, the patch is on the edge, the patch shifted along the edge, until all four corners are found. Below show’s the response of a patch that contains a strong response to a corner.

shi-tomasi reponse

We map pixels of the required space to pixels of the acquired frame. Secondly, to speed up sequential warps, we store the pixel mappings (some coordinates from the acquired image to every coordinate of the game space ), and hence, in subsequent frames, the pixels are extracted without the overhead of warping. The inverse warp pixel map is recomputed at set periods, to adjust for shifts in the environment. The result is a game space of dimension 160x90px, which is a downsampled version of the frame space. Experimentation has proven than these dimensions are sufficient to predict actions.

enter image description here

Detection of Obstacles

To evaluate the control, we then, proceed to detect the obstacles. Observation of the game tells us that the tubes are always green, and have a fixed width. The tubes are detected by processing the green channel of the game space. The bird only needs to avoid these tubes, by passing through the gaps. The information that we truly need to derive is the position of the gaps themselves, which can then be passed to the control model.

Two techniques which utilize statistical information in the RGB color space about image are described below.

Method 1: Intensity Peaks

We threshold a sub-region of the green channel where we most expect the tubes to appear within, and be of concern (close to the bird). The threshold mask is built using the red channel, because the background (sky) has the least contribution in that channel.

enter image description here

The Horizontal Normalized Intensity Sum and Median is then calculated to find the peak spots along the x-axis. The tube graphic has a gradient it’s width which helps obtain a singular peak per tube.

enter image description here

Similarly, once the x-axis position of the tube is detected, the y-axis position of the gap is derived from a smaller sub-region restricted to an area around the detected peaks. To detect the gap, we need to find the fist black edge from the bottom, that forms the graphic of the tube. Therefore we vertically flip the region before building a histogram. For the black horizontal edge, the median intensity would be very low. Hence, to detect the peaks, we invert the normalized intensity map, and select the first peak that lies above a certain threshold (i.e. 0.3).

enter image description here

Method 2: Edge Detection

Laplacian edges are detected from a smoothened sub-region of the green channel. Once, again, normalized statistics are evaluated for the sub region, i.e. the Horizontal Normalized Intensity Sum (HIS) and Median (HIM).

enter image description here

The x-axis point for the tubes are detected as a logical function of the HIS and HIM. The peaks are then filtered to obtain only the first peak among a cluster of relatively spatial adjacent peaks.

response = (HIS>0.3) & HIM
peaks = find(response)

Once the x-axis locations of the tubes are detected, the gap y-axis coordinate is calculated using the same technique described in Method 1.

Method 2 is selected as the technique for obstacle detection as it requires only channel, thereby reducing the time required to generate the game space using the pixel map by half. Secondly, Method 2 has a better response to partially available tubes, as they scroll across the game space. This can be explained, as this method depends on the presence of the edge of the tube, while Method 1 relies on the tube’s gradient across its graphic sprite.

We need to note that the tube detection response is only valid during the period while the control of the bird is with the user. Hence, the responses that occur during game initialization and ending can be ignored.

Detection of the Bird

The next task after detection of the obstacles would be find the protagonist of this game, the Flappy Bird. Examination of the game tells us that the x-axis of the bird is constant, and the detection need only be restricted to the y-axis. The sprite of the bird is not constant, as it rotates based on the game world states and actions. Hence, the detector would need to be sprite rotation invariant. To this effort, two techniques are investigated below.

Method 1: Hough Transform

The Flappy Bird sprite is almost circular, and could be detected by transform the region to the Hough Space for circles. The response is the brightest using the red-channel, as noise from the background (sky) would be the least. Firstly, the vertical region centered around bird is extracted for parsing. Secondly, Edges are detected and are transformed to obtain the Hough Space Response. The response is sought for a radius of 5.2px. Lastly, the response is normalized to the maximum across the horizontal (convert from 2-D sub-region response to 1-D max filter response). The peak indicates the location of the bird.

enter image description here

It is important to account for the tube locations while searching for the bird. Because the gradient along the tube, adds to noise to both the edge detection and the Hough Space. Therefore, we mask the region within which the tubes lie, whenever the tubes intersect the region within which the bird is supposed to fly.

enter image description here

Method 2: Edge Detection

Same as technique for the detection of the tubes, a LoG filter is applied on the sub-region around the bird. We use the Green channel for detection. A good edge response is acquired around the beak, eyes and wings of the sprite. The sub-region is then flatten to obtain the Intensity Sum along the Vertical. The first peak on long the vector would indicate the location of the bird.

enter image description here

Using edge detection for locating the bird is faster than using the Hough Transform. Moreover, since it relies on only green channel, it requires no addition overhead during game space generation. Most importantly, Hough Transform may indicate a false location when the bird is in city-scape region of the game space. The clouds in the background being circular themselves, weigh in on the Hough Response. Edge Detection does not suffer from this.

enter image description here

Additional Signals

The control system would require addition information from the data collected, i.e. location of the tubes and the bird.

Next Valid Tube
Relative to the bird position, the next valid tube is the obstacle that the bird will encounter in the immediate future. This signal is evaluated as a geometric function using distance between the tubes and the bird. We must also ignore the tubes, that the bird is at least half way through.
Game Over Detection
It is very important to detect the end of the game, as this signal would allow us to restart and reattempt playing. This is detected as a function of the overall intensity across the game space. During a game over animation, the the game space flashes bright, causing the average intensity to be almost perfectly white.

In the below demonstration, the red cross marks the bird, the green line marks the current valid obstacle, and the blue lines mark the other tube locations.

Control System

enter image description here

Initially, we studied using a Reinforced Learning [3] algorithm to train the system to control the bird. Software simulation of the system looked promising (shown below). The method evolved itself over time and trained itself to recognize bad actions such as jumping at the obstacles or not jumping when collision was imminent. The system relies on the current state to make a decision, and based on whether the bird is alive, appropriate reward is distributed to the previous states. This machine relies on training a single previous state.

Physical deployment ended in failure, as the machine couldn’t converge on a solution, despite increasing the number of states in the past to consider for reward distribution and policy evaluation. This was due to the low FPS and latency of capture (webcam processed and acquired frame lags behind actual frame). Furthermore there exists an inherent latency in the operation of relays (or for that matter, most mechanical parts), which seemed to range between 100~150ms. Therefore the machine would often attribute rewards to the wrong state.

Finally, we decided to use an Open-Loop Model that took into account the physical shortcoming of the relay and the webcam. Here we predict the apparent path the bird would take relative to the current tap, and decide when the next tap should occur. This requires prior knowledge of the game world system, which we acquired with experimentation (eg: world gravity for the current game space is 280px/s^2). The open-loop model performed significantly better than the RL model. At the current stage of the project the model is quite naive, and prone to error, and requires refinement.

At gameover, we send over a tap signal through the ADB/USB to instruct a tap on the restart symbol.

TRex Runner

enter image description here

TRex Runner (assumed name as there is no official title for the game), is the javascript snippet that we get play in Google Chrome when the internet is down. The protagonist, TRex (could be Stan or Binky), has to evade cacti of various sizes. As with most T2J games, collision results in a gameover. The speed of the game increases with progress. The color-scheme of the game is monochrome. The algorithms used here are derivatives of the ones used for Flappy Bird.

Game Space

The game occupies a very small portion of the screen and is the only region of concern. Secondly, since the game is monochrome, we need only extract any single channel, as the intensity would be the same across all channels. Using inverse warping we map pixels from the frame to a smaller 90x180px game space.

enter image description here

Review of the game gives us a few important considerations.

1. The TRex air-time (the period after a jump is initiated) is constant regardless of game speed and progress.

2. Once a jump in started, another jump cannot be initiated unless the first one completes.

3. The horizontal location of the TRex is constant.

4. All obstacles lie along the same axis (horizontal).

The above observations, hints that knowledge of the obstacles alone and the previous tap would be enough to model a solution for this game. The image processing part of this segment is fairly straightforward and simple.

Detection of Cacti

Cacti are the primary obstacles in this game, and hence, we first need to detect their location. Since all obstacles always appear within a certain horizontal region, a sub-region is extracted. This is then thresholded to detect the obstacles themselves, as the background is white and the obstacles are black. The Median Distribution is evaluated from this response. The response is then filtered to suppress closely spaced peaks, as we only require the last peak in a cluster. This would indicate the x-axis location of the end of each cacti cluster.

enter image description here

Additional Signals

As with Flappy Bird, we need to detect the next valid obstacle and game over state. Valid obstacle detection is again a function of distance relative to TRex. Gameover detection requires analyzing the sub-region within which the redo symbol appears. The symbol being black, offers a different average intensity response from the usual white background.

Control System

A simple open-loop model is deployed to evaluate the tap response timing. Based on how much time has passed since the game began (which gives us an estimate of the speed), and the distance of the next valid cacti cluster from the TRex, a jump is conditioned to occur. A timer is put in place such that, another jump cannot occur unless, this jump has completed. At gameover, we send over a tap signal through the ADB/USB to instruct a tap on the redo symbol.



Naive open-loop control model.


Quite often a score of 400+ is achieved, beyond which the latency of the camera and the relay begins to distort predictions.

Conclusion and Future Work

The obstacle and protagonist detection requires minimal processing effort once the T2J game is studied for consistent patterns. Say, the tubes being always of fixed width or that the TRex has a constant air-time. The computer vision algorithms were robust enough to function at low game space resolution, and yet accurately extract game parameters. Nonetheless, the physical constraints placed upon by the hardware often limited the performance of the overall system. Hence, we had to employ open-loop models to design the control system in an attempt to compensate for latency of acquisition and actuation.

Future experimentation would involve investigation of means of reducing latency in the system, in order to allow a reinforced learning based control system to evolve and play the game. This would involve cameras with higher FPS for acquisition. Faster actuation techniques would need to be studied, say, for example, using a MOSFET as a transmission gate to ground the foil.


Sonal Pinto
Graduate Student
Department of Electrical and Computer Engineering, Virginia Tech
Aravind Preshant Premkumar
Graduate Student
Department of Electrical and Computer Engineering, Virginia Tech


  2. J. Shi and C. Tomasi. Good features to track. CVPR, 593-600, 1994.