Toronto and New York Segmentation

Segmentation of Neighborhoods in Toronto and New York Using the Foursquare API and Clustering

Suppose we want to segment the neighborhoods of two cities, Toronto and New York, by the kinds of venues they contain. We also want to explore the average income and population density of each city's boroughs.

The necessary data sets were location data for Toronto and New York neighborhoods, venue data, and population density and average income data for both cities. The data sets were imported from Wikipedia and IBM open data. Links to the data sets are listed at the end of this post.

We’ll discuss the process in two separate parts:

  • Segmentation of the neighborhoods
  • Exploring the Average income and population density

I’ll use Python and its various packages. So, let’s get started!

Segmentation of the neighborhoods

I started by importing the list of Toronto boroughs and neighborhoods from Wikipedia and converting it to a data frame with the pandas package. Then I imported a second data set containing the location data for those neighborhoods and boroughs. It was in .csv format and was also converted to a data frame. After cleaning, the two tables were merged to get the final Toronto neighborhood data set.
example code:
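import pandas as pd
import wikipedia as wp  # assumption: the truncated call below appears to use the `wikipedia` package

# location data (.csv); the file path is omitted in the original
gdf = pd.read_csv('')

# borough and neighborhood table from Wikipedia
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
df = pd.read_html(html)[0]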
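The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the post: the two small tables, the column name "Postal Code", and the sample rows are stand-ins I made up for the Wikipedia table and the .csv file.

```python
import pandas as pd

# Hypothetical stand-in for the Wikipedia borough/neighborhood table.
postal = pd.DataFrame({
    "Postal Code": ["M3A", "M4A", "M5A"],
    "Borough": ["North York", "North York", "Downtown Toronto"],
    "Neighborhood": ["Parkwoods", "Victoria Village", "Regent Park"],
})

# Hypothetical stand-in for the .csv file with coordinates (note the different row order).
coords = pd.DataFrame({
    "Postal Code": ["M5A", "M3A", "M4A"],
    "Latitude": [43.6543, 43.7533, 43.7259],
    "Longitude": [-79.3606, -79.3297, -79.3156],
})

# An inner join keeps only the postal codes that appear in both tables.
toronto = postal.merge(coords, on="Postal Code", how="inner")
print(toronto)
```

Joining on a shared key rather than relying on row order means the two sources do not need to be sorted the same way.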
Then the geolocation data for New York was imported. It was in .json format. The neighborhoods, boroughs and their corresponding latitudes and longitudes were extracted, and the filtered data was converted to a data frame.
example code:
!wget -q -O 'newyork_data.json'
import json
import pandas as pd

with open('newyork_data.json') as json_data:
    newyork_nigh = json.load(json_data)
ny_neigh = newyork_nigh['features']

# define the dataframe columns
ny_nighcolumn = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood (DataFrame.append was removed in pandas 2.0)
rows = []
for data in ny_neigh:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_latlon[1],
                 'Longitude': neighborhood_latlon[0]})

# build the dataframe from the collected rows
ny_neighborhoods = pd.DataFrame(rows, columns=ny_nighcolumn)
After obtaining the neighborhood location data, I used the Foursquare API to import the venue data for every neighborhood into two data frames, one for Toronto and one for New York.
example code:
CLIENT_ID = "Your Foursquare client ID"
CLIENT_SECRET = "Your Foursquare Client Secret"
VERSION = '20180605'

print('My credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

def getNearbyVenues(names, latitudes, longitudes, radius, LIMIT):
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = '{}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        ... # some code here
    return(nearby_venues)
For every neighborhood, I kept the 20 most common venue categories and discarded the rest.
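One way to rank venue categories per neighborhood is to one-hot encode the categories, average them by neighborhood, and keep the top entries. The sketch below assumes that approach, which the post does not spell out. The sample venues, the column names, and the helper `top_categories` are hypothetical, and it keeps the top 2 instead of 20 so the toy data stays small.

```python
import pandas as pd

# Hypothetical venue data: one row per venue returned for a neighborhood.
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Pizza Place", "Pizza Place", "Park"],
})

# One-hot encode the categories, then average per neighborhood to get frequencies.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
grouped = onehot.groupby("Neighborhood").mean().reset_index()

def top_categories(row, n):
    """Return the names of the n most frequent categories in one neighborhood row."""
    freqs = row.drop("Neighborhood").astype(float)
    return freqs.sort_values(ascending=False).index[:n].tolist()

top = {row["Neighborhood"]: top_categories(row, n=2) for _, row in grouped.iterrows()}
print(top)
```

The `grouped` frame of per-neighborhood frequencies is also the kind of table that can be passed to a clustering algorithm once the "Neighborhood" column is dropped.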
Then I used k-means clustering to cluster the neighborhoods, once for Toronto and once for New York, with the number of clusters set to 5 for both cities. I used the scikit-learn package for Python. The cluster labels were then assigned to each neighborhood.
example code:
from sklearn.cluster import KMeans

# set number of clusters
t_kclusters = 5

# drop the name column so only the venue-frequency features remain
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans_tor = KMeans(n_clusters=t_kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans_tor.labels_
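To show how the labels end up next to each neighborhood, here is a small self-contained sketch on toy data. The frequency table and the choice of 2 clusters are assumptions for illustration only; the post itself uses 5 clusters on the real venue frequencies.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical venue-frequency table: two cafe-heavy and two park-heavy neighborhoods.
grouped = pd.DataFrame({
    "Neighborhood": ["A", "B", "C", "D"],
    "Cafe": [0.9, 0.8, 0.1, 0.0],
    "Park": [0.1, 0.2, 0.9, 1.0],
})

# Cluster on the numeric features only.
features = grouped.drop(columns="Neighborhood")
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Attach the label of each row back to its neighborhood.
labelled = grouped.copy()
labelled.insert(0, "Cluster Labels", kmeans.labels_)
print(labelled[["Cluster Labels", "Neighborhood"]])
```

The label numbers themselves are arbitrary; only which neighborhoods share a label is meaningful.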

The labelled neighborhoods were then plotted on a map using the folium package.
example code:

import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
ny_map_clusters = folium.Map(location=[ny_latitude, ny_longitude], zoom_start=10.25)

# set color scheme for the clusters (one color per cluster)
colors_array = cm.rainbow(np.linspace(0, 1, ny_kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
for lats, lons, poi, cluster in zip(newyork_merged['Latitude'], newyork_merged['Longitude'], newyork_merged['Neighborhood'], newyork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    # the marker call is truncated in the original; CircleMarker and the
    # radius, popup and color arguments are an assumed reconstruction
    folium.CircleMarker(
        [lats, lons],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(ny_map_clusters)

# save the map as an HTML file
ny_map_clusters.save('ny_map_cluster.html')
Clusters of Toronto
Clusters of New York

The maps show that Toronto has one large cluster (83% of the neighborhoods) and a smaller one; the other three clusters are insignificant by comparison. New York has two large clusters (45% and 41% of the neighborhoods) and one mid-size cluster; the other two are insignificant. So Toronto seems to have a more uniform neighborhood type, while New York shows much more variety, and the two cities segment differently.
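Cluster shares like the percentages quoted above can be computed directly from the label column. The sketch below uses a made-up list of ten labels, not the real clustering results, and assumes the labels are stored in a column called "Cluster Labels".

```python
import pandas as pd

# Hypothetical cluster labels for ten neighborhoods.
labels = pd.Series([0, 0, 0, 0, 1, 1, 2, 0, 0, 0], name="Cluster Labels")

# Share of neighborhoods in each cluster, as a percentage.
shares = labels.value_counts(normalize=True).sort_index() * 100
print(shares.round(1))
```

With this toy input, cluster 0 holds 7 of the 10 neighborhoods, so it dominates in the same way the large Toronto cluster does.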

Average Income and Population Density

This part is easier than the first. Population density and average income data for all boroughs of New York and Toronto were imported from Wikipedia and converted to data frames using pandas. After some cleaning and modification, the data sets were ready for exploratory visualization.
example code:

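# assumption: `wp` is the `wikipedia` package (import wikipedia as wp);
# the page-loading calls are truncated in the original
html2 = wp.page("Demographics of Toronto neighbourhoods").html().encode("UTF-8")
dft = pd.read_html(html2)[1]

html1 = wp.page("Boroughs of New York City").html().encode("UTF-8")
dfn = pd.read_html(html1)[0]

# keep only the first five rows (New York City has five boroughs)
dfn.drop(dfn.index[5:], inplace=True)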


I created a bar chart of population density for New York and another for Toronto. After that, I applied the same method to the average income data.
example code:

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 15))
# the plotting call is truncated in the original; plt.bar on dft_2 is an assumed reconstruction
plt.bar(dft_2['Borough'], dft_2['Population Density'], color='tomato')
plt.title("Population Density by Borough in Toronto", fontsize=28)
plt.xlabel('Borough', fontsize=18)
plt.ylabel('Population Density', fontsize=18)
plt.xticks(fontsize=24, rotation=30)
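Since the same chart is drawn for both cities and for both measures, the plotting code could be wrapped in a small helper. The function `bar_by_borough` and the sample numbers below are hypothetical; the figures are placeholders, not real census data.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; this line can be dropped in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def bar_by_borough(df, column, title, color="tomato"):
    """Draw one bar per borough for the given column and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(8, 5))
    ax.bar(df["Borough"], df[column], color=color)
    ax.set_title(title)
    ax.set_xlabel("Borough")
    ax.set_ylabel(column)
    ax.tick_params(axis="x", labelrotation=30)
    fig.tight_layout()
    return fig, ax

# Placeholder values for illustration only.
sample = pd.DataFrame({
    "Borough": ["Borough A", "Borough B", "Borough C"],
    "Population Density": [1000, 2500, 4000],
})

fig, ax = bar_by_borough(sample, "Population Density", "Population Density by Borough (placeholder data)")
fig.savefig("density_by_borough.png")
```

Calling the helper again with an "Average Income" column would produce the second chart without repeating the styling code.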

These charts make it clear that in New York, Manhattan has by far the highest average income and population density of all the boroughs. The other four are poorer and less dense.
For Toronto, population density is slightly higher in Old Toronto and York than in the other boroughs. Old Toronto also has the highest average income, but it is not far ahead of the rest.

So, in terms of average income and population density, Toronto is quite different from New York. New York has a wide gap between boroughs, and everything seems centralized. Toronto, on the other hand, has a much more uniform distribution of population and income.

Data Sources:
