Getting traffic data from Google Maps

Goal of the project

We will scrape Google Maps in order to find the travel time from a grid of points to a couple of destinations. This way, we will find the points that best minimize both journeys. This code can be used to pinpoint the best locations to pick a home when two people work at different locations. By scraping Google Maps, we can take into account how traffic impacts the travel time.

You can download the project by going to the GitHub repository

Scraping Google Maps

Since Google Maps is a dynamic website, we cannot use simple tools such as wget or curl. Even scraping frameworks such as Scrapy do not render the DOM, so they cannot work in this situation. The easiest way to scrape data from such websites is to take control of a browser with an automation tool. In this case we will use Selenium to drive Google Chrome through chromedriver.

You have to install selenium with

conda install -c conda-forge selenium

or

pip install selenium

You also need to download chromedriver.exe. BeautifulSoup is the package we will use to parse the HTML of the page opened in Chrome.

In order to extract the estimated travel time, we need to inspect the source code of the page to find the <div> element we are interested in. In our case it is the one with the class section-directions-trip-numbers. Inside this <div> element, we then read the estimated value contained in a <span> element.

Extract time from Google Maps

The code

First, let's import selenium, beautiful soup and some other libraries

# Selenium lets us control Chrome programmatically
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options

# BeautifulSoup is used to parse the DOM of the HTML page
import bs4 as BeautifulSoup

import numpy as np
import pandas as pd
import os

We will also need some extra libraries for plotting the results

import matplotlib.pyplot as plt
from matplotlib.transforms import offset_copy

import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt

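Let's define the GPS coordinates of the two destinations we are interested in. The coordinates can be found in the URL of a Google Maps search.

# Caution: names are swapped. The "longitude" variables hold latitudes (about 48.9)
# and the "latitude" variables hold longitudes (about 2.2); the later code relies on this.
longitudeDestination1 = 48.9361537
latitudeDestination1 = 2.2507129

longitudeDestination2 = 48.7783875
latitudeDestination2 = 2.1803534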

We will search on an equally spaced grid of points starting from (long_begin, lat_begin) and going to (long_end, lat_end). These four bounds are not defined in the code shown in this post; the results table further down is consistent with long_begin = 48.8, long_end = 48.9, lat_begin = 2.0 and lat_end = 2.26. For each point of the grid, we will:

* construct the URL from the GPS coordinates
* load the URL in Chrome with driver.get
* read the resulting HTML with driver.page_source
* parse the HTML with BeautifulSoup in order to find the first <div> element with the class section-directions-trip-numbers
* in this element, get the estimated travel time by reading the text of the third <span> element (index 2 in the code)

def get_travel_time(url, driver):
    """
    Get the estimated travel time from the Google Maps page given as url
    """
    resultats = None
    driver.get(url)

    # Poll the rendered page until the element appears.
    # Note: this loops forever if the class name is never found.
    while resultats is None:
        soupe = BeautifulSoup.BeautifulSoup(driver.page_source, "lxml")
        resultats = soupe.find('div', attrs={"class": "section-directions-trip-numbers"})
    # Index 2 is the third <span> inside the <div>
    return resultats.find_all("span")[2]
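
The polling loop in get_travel_time never gives up, so a changed class name or a failed page load would hang the whole run. Below is a minimal sketch of bounded polling using only the standard library. The helper poll_until and the stand-in fake_lookup are hypothetical names introduced for illustration; they are not part of the original project.

```python
import time as _time  # aliased so it cannot clash with a variable named `time`


def poll_until(fetch, timeout=30.0, interval=0.5):
    """Call fetch() repeatedly until it returns something other than None.

    Raises TimeoutError if nothing is returned within `timeout` seconds.
    """
    deadline = _time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if _time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        _time.sleep(interval)


# Stand-in for "parse driver.page_source and look for the <div>":
# it finds nothing on the first two calls, then returns a value.
attempts = {"count": 0}


def fake_lookup():
    attempts["count"] += 1
    return "26 min" if attempts["count"] >= 3 else None


found = poll_until(fake_lookup, timeout=5.0, interval=0.01)
print(found, attempts["count"])
```

In get_travel_time, fetch would be a small function that parses driver.page_source and returns the result of soupe.find(...). Selenium also ships its own explicit wait (WebDriverWait), which may be a better fit if you prefer to stay inside the library.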

Once the function is defined, we only need to call it in a loop to get all the points of the grid.

from selenium.webdriver.chrome.service import Service

chrome_options = Options()
#chrome_options.add_argument("--disable-extensions")
#chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--headless") # make Chrome headless. If you want to watch the automation, comment out this line
# Selenium 4 syntax; older releases used executable_path=... and chrome_options=... instead
driver = webdriver.Chrome(service=Service('.\\chromedriver.exe'), options=chrome_options)

nb = 10   # number of grid points along each axis
ctn = 0   # number of grid points processed so far
time = []
for coordX in np.linspace(long_begin, long_end, nb):
    for coordY in np.linspace(lat_begin, lat_end, nb):
        # The part after "@" only centres the map view. The destination itself
        # appears to be encoded in the data=... blob (!1d<longitude>!2d<latitude>),
        # so journey 1 ends near (48.778, 2.183) and journey 2 near (48.937, 2.257).
        # 8j1570521600 looks like a Unix timestamp (the departure time).
        url_journey1 = f"https://www.google.com/maps/dir/{coordX},{coordY}/@{longitudeDestination1},{latitudeDestination1},12z/data=!3m1!4b1!4m14!4m13!1m0!1m5!1m1!1s0x47e67bff078f6575:0x95df2619f9304bd7!2m2!1d2.1825421!2d48.778384!2m4!2b1!6e0!7e2!8j1570521600!3e0"
        url_journey2 = f'https://www.google.com/maps/dir/{coordX},{coordY}/@{longitudeDestination2},{latitudeDestination2},14z/data=!3m1!4b1!4m14!4m13!1m0!1m5!1m1!1s0x47e665df0cb0b919:0x5f513cdf2fe6d39d!2m2!1d2.2572779!2d48.9368666!2m4!2b1!6e0!7e2!8j1570521600!3e0'

        temps_user1 = get_travel_time(url_journey1, driver)
        temps_user2 = get_travel_time(url_journey2, driver)

        ctn += 1
        print(f'Downloaded : {ctn/(nb*nb)*100}%')
        time.append([coordX, coordY, temps_user1.text, temps_user2.text, f'https://www.google.com/maps/place/{coordX},{coordY}'])
Downloaded : 1.0%
Downloaded : 2.0%
Downloaded : 3.0%
Downloaded : 4.0%
[...]
Downloaded : 96.0%
Downloaded : 97.0%
Downloaded : 98.0%
Downloaded : 99.0%
Downloaded : 100.0%
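
The two URLs in the loop above were copied from the browser, so the destination is buried in an opaque data=... blob. As a sketch of a more readable alternative, the snippet below builds directions URLs with the query parameters of Google's Maps URLs scheme (api=1, origin, destination, travelmode). The helper names build_directions_url and linspace, the 3 x 3 grid and the bounds are illustrative assumptions, not part of the original project. As far as I know this scheme has no parameter for a departure time, and I have not checked that the resulting page exposes the same section-directions-trip-numbers element, so treat it as a starting point only.

```python
from urllib.parse import urlencode


def build_directions_url(origin, destination, travelmode="driving"):
    """Return a Google Maps directions URL for two (latitude, longitude) pairs."""
    params = {
        "api": 1,
        "origin": f"{origin[0]},{origin[1]}",
        "destination": f"{destination[0]},{destination[1]}",
        "travelmode": travelmode,
    }
    return "https://www.google.com/maps/dir/?" + urlencode(params)


def linspace(start, stop, count):
    """Pure-Python equivalent of np.linspace(start, stop, count) for count >= 2."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]


# (latitude, longitude) of one destination, in the usual order
destination = (48.9361537, 2.2507129)

# Small illustrative grid: 3 latitudes x 3 longitudes
grid = [(lat, lon)
        for lat in linspace(48.8, 48.9, 3)
        for lon in linspace(2.0, 2.26, 3)]

urls = [build_directions_url(point, destination) for point in grid]
print(len(urls))
print(urls[0])
```

urlencode percent-encodes the comma between latitude and longitude, which keeps the query string valid.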

After gathering the results, the values stored in the time list are strings and cannot be interpreted as numerical values without post-processing. This is why I've written the function analyse_time, which splits the text and converts it to a number of minutes.

def analyse_time(time):
    """
    Analyse the time given by google maps, splits the lower and higher estimate and converts them to minutes
    """
    tlow = time.split(" - ")[0].replace("\xa0", " ")
    thigh = time.split(" - ")[1].replace("\xa0", " ")

    if ("min" not in tlow) and ("h" not in tlow):
        #example : 26 
        tlow = int(tlow.replace(" ", ""))
    elif "h" not in tlow :
        # example 26 min 
        tlow = tlow.replace("min", "")
        tlow = int(tlow.replace(" ", ""))
    else :
        if "min" in tlow:
            #example 1h 26min
            tlow = tlow.split("h")
            tlow = 60*int(tlow[0].replace(" ", "")) + int(tlow[1].replace("min", "").replace(" ", ""))
        else:
            #example 1h
            tlow = 60*int(tlow.replace("h", ""))

    if "h" not in thigh :
        thigh = thigh.replace("min", "")
        thigh = int(thigh.replace(" ", ""))
    else :
        if "min" in thigh:
            thigh = thigh.split("h")
            thigh = 60*int(thigh[0].replace(" ", "")) + int(thigh[1].replace("min", "").replace(" ", ""))
        else:
            thigh = 60*int(thigh.replace("h", ""))

    return (tlow, thigh)
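
The same conversion can also be written with a regular expression, which avoids repeating the branches for the low and high estimates. This is only a sketch under the assumption that the text looks like the examples in the comments above ("26", "26 min", "1 h 26 min", "1 h") and that the two estimates are separated by " - ". The names to_minutes and parse_range are hypothetical.

```python
import re

# Optional "<n> h" part followed by an optional "<n>" or "<n> min" part.
_DURATION = re.compile(r"^\s*(?:(?P<hours>\d+)\s*h)?\s*(?:(?P<minutes>\d+)\s*(?:min)?)?\s*$")


def to_minutes(text):
    """Convert a single duration such as '1 h 26 min' to minutes."""
    cleaned = text.replace("\xa0", " ")  # Google Maps uses non-breaking spaces
    match = _DURATION.match(cleaned)
    if match is None or (match.group("hours") is None and match.group("minutes") is None):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours = int(match.group("hours") or 0)
    minutes = int(match.group("minutes") or 0)
    return 60 * hours + minutes


def parse_range(text):
    """Split 'low - high' and convert both sides to minutes.

    A single estimate (no ' - ') is returned as (value, value).
    """
    parts = text.replace("\xa0", " ").split(" - ")
    return (to_minutes(parts[0]), to_minutes(parts[-1]))


print(parse_range("10 - 20\xa0min"))
print(parse_range("55 min - 1 h 20 min"))
```

Unlike analyse_time, this version raises a ValueError on text it does not recognise and accepts a single estimate without a range.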

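For every result previously gathered, let's apply the function analyse_time and put the output in a pandas DataFrame. While we are at it, I also compute the geometric mean of the minimum time estimated for both users, as well as of the maximum time. A combined score is useful here because we want to take both journeys into account rather than optimise only one of them. Keep in mind, however, that for a given total travel time the geometric mean is lowest when the two times differ the most, so it does not by itself rule out one user doing a long journey while the other does a short one; the per-user columns are worth checking too.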

df = []
for t in time:
    lat = t[0]
    lon = t[1]
    t1 = analyse_time(t[2])
    t2 = analyse_time(t[3])
    geomlow = np.sqrt(t1[0]*t2[0]) #geometric mean 
    geomhigh = np.sqrt(t1[1]*t2[1]) #geometric mean 
    df.append([lat, lon, geomlow, geomhigh, t1[0], t1[1], t2[0], t2[1]])


traveltime = pd.DataFrame(df, columns = ["latitude", "longitude", "geometric mean low", "geometric mean high", "time low 1", "time high 1", "time low 2", "time high 2"])
traveltime = traveltime.sort_values("geometric mean low")
traveltime = traveltime.reset_index()
traveltime.to_csv("extraction.csv", index=False) #save it to csv
traveltime.head(10) #print the 10 first rows
   index   latitude  longitude  geometric mean low  geometric mean high  time low 1  time high 1  time low 2  time high 2
0     38  48.833333   2.231111           16.124515            33.166248          10           20          26           55
1     89  48.888889   2.260000           16.248077            36.742346          22           45          12           30
2     97  48.900000   2.202222           16.733201            40.987803          35           70           8           24
3      6  48.800000   2.173333           16.733201            35.777088           8           16          35           80
4     98  48.900000   2.231111           17.320508            41.109610          30           65          10           26
5     99  48.900000   2.260000           17.663522            37.815341          26           55          12           26
6     88  48.888889   2.231111           17.663522            40.620192          26           55          12           30
7      7  48.800000   2.202222           17.748239            34.641016           9           16          35           75
8     68  48.866667   2.231111           18.000000            37.416574          18           35          18           40
9     87  48.888889   2.202222           18.330303            43.874822          28           55          12           35
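
A quick numeric check helps to see how the geometric mean ranks two homes with the same total travel time. The two pairs of times below are made-up values chosen for illustration, and the "worst" column (the longer of the two journeys) is an extra indicator I add for comparison, not something the original code computes.

```python
import math


def arithmetic_mean(a, b):
    return (a + b) / 2


def geometric_mean(a, b):
    return math.sqrt(a * b)


# Two hypothetical homes with the same total travel time (60 minutes).
candidates = {
    "balanced": (30, 30),
    "unbalanced": (10, 50),
}

for name, (t1, t2) in candidates.items():
    print(f"{name:10s} arithmetic={arithmetic_mean(t1, t2):.1f} "
          f"geometric={geometric_mean(t1, t2):.1f} worst={max(t1, t2)}")
```

The unbalanced pair gets the lower geometric mean even though one journey takes 50 minutes, so sorting by the worst journey (or checking it next to the geometric mean) is one way to favour balanced locations.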

Now that the data is prepared, it's time to draw some maps. We will use cartopy to download Google Maps tiles. You might need to manually change the extent of the map.

%matplotlib inline
plt.rcParams['figure.figsize'] = 20, 12
# Create a Google Maps tile source.
tiles = cimgt.GoogleTiles()
fig = plt.figure()
# Create a GeoAxes in the tile's projection.
ax = fig.add_subplot(1, 1, 1, projection=tiles.crs)

# Limit the extent of the map to a small longitude/latitude range.
# set_extent expects [x0, x1, y0, y1]; lat_begin/lat_end hold the x (east-west) bounds here.
ax.set_extent([lat_begin*0.975, lat_end*1.02, long_begin*0.999, long_end*1.001], crs=ccrs.Geodetic())

# Add the tiles at zoom level 10.
ax.add_image(tiles, 10)

Now, we draw the 10 points that minimize the time for user 1, color them in red and make the size of each dot proportional to the travel time of the second user. We do the same for the 10 points that minimize the time for user 2: we color them in blue and make the size of each dot proportional to the travel time of the first user.

# .head(10) keeps the 10 rows with the smallest time. The index labels still
# follow the geometric-mean ranking, so testing "i < 10" would pick the wrong rows.
for _, point in traveltime.sort_values("time low 1").head(10).iterrows():
    ax.plot(point.longitude, point.latitude, marker='o',
            c='red', markersize=point["time low 2"],
            alpha=0.5, transform=ccrs.Geodetic())

for _, point in traveltime.sort_values("time low 2").head(10).iterrows():
    ax.plot(point.longitude, point.latitude, marker='o',
            c='blue', markersize=point["time low 1"],
            alpha=0.5, transform=ccrs.Geodetic())

To help with the visualization, we add two stars on the map to mark the locations of the two destinations.

# ax.plot expects (x, y) = (longitude, latitude). The variables named
# latitudeDestination* actually hold the longitudes, so the order below is correct.
# Add a marker for destination 1
ax.plot(latitudeDestination1, longitudeDestination1, marker='*',
        c='green', markersize=25,
        alpha=1, transform=ccrs.Geodetic())
# Add a marker for destination 2
ax.plot(latitudeDestination2, longitudeDestination2, marker='*',
        c='orange', markersize=25,
        alpha=1, transform=ccrs.Geodetic())

geodetic_transform = ccrs.Geodetic()._as_mpl_transform(ax)
text_transform = offset_copy(geodetic_transform, units='dots', x=-25)

Finally, we draw the map. The best points are where both the blue and the red dots are smallest.

plt.show()

Final output