Suppose a company wants to segment the neighborhoods of two cities, in this case Toronto and New York, by the venues they contain. It also wants to explore the average income and population density of each city's boroughs.

The necessary data sets were: location data for Toronto and New York neighborhoods, venue data, and population density and average income data for both cities. The data sets were imported from Wikipedia and IBM open data. Links to the data sets are given at the end of this post.

We’ll discuss the process in two separate parts:

  • Segmentation of the neighborhoods
  • Exploring the Average income and population density

I’ll use Python and its various packages. So, let’s get started!

Segmentation of the neighborhoods

I started by importing the list of Toronto boroughs and neighborhoods from Wikipedia and converting it to a data frame with the pandas package. Then another data set, containing location data for the neighborhoods and boroughs, was imported; it was in .csv format and was also converted to a data frame. After cleaning the data, the two tables were merged to get the final Toronto neighborhood data set.
example code:
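import pandas as pd
import wikipedia as wp  # assumed: the "wikipedia" package, whose page(...).html() matches the calls below

# location (latitude/longitude) data; read_csv already returns a data frame
gdf = pd.read_csv('https://cocl.us/Geospatial_data')
# borough and neighborhood table: the first table on the Wikipedia page
html = wp.page("List of postal codes of Canada: M").html().encode("UTF-8")
df = pd.read_html(html)[0]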
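The cleaning and merging step itself is not shown. Here is a minimal sketch of how it might look, using two tiny hand-made tables in place of the real ones. The column names (`Postal Code`, `Borough`, `Neighborhood`, `Latitude`, `Longitude`), the `Not assigned` placeholder, and all row values are assumptions for illustration, not taken from the original notebook.

```python
import pandas as pd

# Toy stand-ins for the Wikipedia table and the geospatial CSV (made-up rows).
postal = pd.DataFrame({
    "Postal Code": ["X1A", "X2A", "X3A", "X4A"],
    "Borough": ["Not assigned", "Borough A", "Borough A", "Borough B"],
    "Neighborhood": ["Not assigned", "Hood 1", "Hood 2", "Hood 3"],
})
coords = pd.DataFrame({
    "Postal Code": ["X2A", "X3A", "X4A"],
    "Latitude": [43.1, 43.2, 43.3],
    "Longitude": [-79.1, -79.2, -79.3],
})

# Cleaning: drop rows that have no borough assigned.
postal_clean = postal[postal["Borough"] != "Not assigned"].reset_index(drop=True)

# Merging: attach coordinates to each remaining row via the shared key.
toronto = postal_clean.merge(coords, on="Postal Code", how="inner")
print(toronto)
```

An inner join silently drops rows that have no match in the other table, so it is worth comparing row counts before and after the merge.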
Then, geolocation data for New York was imported. It was in .json format. The neighborhoods, boroughs and their corresponding latitudes and longitudes were extracted and converted to a data frame.
example code:
import json
import pandas as pd

# notebook shell command: download the data set
!wget -q -O 'newyork_data.json' https://cocl.us/new_york_dataset
with open('newyork_data.json') as json_data:
    newyork_nigh = json.load(json_data)
ny_neigh = newyork_nigh['features']

# define the dataframe columns
ny_nighcolumn = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']

# collect one row per neighborhood
rows = []
for data in ny_neigh:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']

    # GeoJSON coordinates are ordered [longitude, latitude]
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]

    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})

# build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
ny_neighborhoods = pd.DataFrame(rows, columns=ny_nighcolumn)
After obtaining the neighborhood location data, I used the Foursquare API to import venue data for each neighborhood into two data frames, one for Toronto and one for New York.
example code:
CLIENT_ID = "Your Foursquare client ID"
CLIENT_SECRET = "Your Foursquare Client Secret"
VERSION = '20180605'

print('My credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

def getNearbyVenues(names, latitudes, longitudes, radius, LIMIT):

    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
        # ....
        # ......
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, CLIENT_SECRET, VERSION, lat, lng, radius, LIMIT)
        # ... some code here
    return(nearby_venues)
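The part of the function that turns each API response into rows is elided above. Here is a rough sketch of that flattening step, run on a hand-written payload rather than a live request. The nesting (`response` → `groups` → `items` → `venue`) is my assumption about the shape of the v2 explore response, and the helper name `rows_from_explore_response` and the output column names are hypothetical.

```python
import pandas as pd

def rows_from_explore_response(name, lat, lng, payload):
    """Flatten one (assumed) explore response into rows for a single neighborhood."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item["venue"]
        categories = venue.get("categories", [])
        rows.append({
            "Neighborhood": name,
            "Neighborhood Latitude": lat,
            "Neighborhood Longitude": lng,
            "Venue": venue["name"],
            "Venue Latitude": venue["location"]["lat"],
            "Venue Longitude": venue["location"]["lng"],
            # a venue may come back without any category
            "Venue Category": categories[0]["name"] if categories else None,
        })
    return rows

# Hand-written payload for illustration only (not real API output).
sample_payload = {
    "response": {"groups": [{"items": [
        {"venue": {"name": "Sample Cafe",
                   "location": {"lat": 43.001, "lng": -79.001},
                   "categories": [{"name": "Cafe"}]}},
        {"venue": {"name": "Sample Spot",
                   "location": {"lat": 43.002, "lng": -79.002},
                   "categories": []}},
    ]}]}
}

rows = rows_from_explore_response("Sample Hood", 43.0, -79.0, sample_payload)
nearby_venues = pd.DataFrame(rows)
print(nearby_venues)
```

Keeping the parsing separate from the HTTP call makes it possible to test the flattening logic without credentials or network access.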
For every neighborhood, only the 20 most common venue categories were taken into account; the rest were discarded.
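One common way to find the most frequent categories per neighborhood is to one-hot encode the category column, average per neighborhood, and sort each row. The sketch below does this on a toy table. The column names and the helper `top_categories` are assumptions, and `TOP_N` is set to 2 only because the toy data is small (the post uses 20).

```python
import pandas as pd

# Toy venue table (made-up rows).
venues = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "Venue Category": ["Cafe", "Cafe", "Park", "Bar", "Bar", "Bar"],
})

# One-hot encode the categories, then average per neighborhood
# to get the relative frequency of each category.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "Neighborhood", venues["Neighborhood"])
grouped = onehot.groupby("Neighborhood").mean().reset_index()

def top_categories(row, n):
    """Return the names of the n most frequent categories in one grouped row."""
    freqs = row.drop("Neighborhood").astype(float)
    return freqs.sort_values(ascending=False).head(n).index.tolist()

TOP_N = 2
top = {row["Neighborhood"]: top_categories(row, TOP_N)
       for _, row in grouped.iterrows()}
print(top)
```

Categories tied at the same frequency come back in no guaranteed order, so the tail of each list should not be over-interpreted.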
 
Then I used k-means clustering to cluster the neighborhoods, once for Toronto and once for New York, with 5 clusters for each city. The scikit-learn package was used. The cluster labels were then assigned to each neighborhood.
example code:
from sklearn.cluster import KMeans

# set number of clusters
t_kclusters = 5

# drop the name column so that only the numeric features remain
toronto_grouped_clustering = toronto_grouped.drop(columns='Neighborhood')

# run k-means clustering
kmeans_tor = KMeans(n_clusters=t_kclusters, random_state=0).fit(toronto_grouped_clustering)

# check cluster labels generated for each row in the dataframe
kmeans_tor.labels_[0:10]
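The last step, attaching the cluster labels to each neighborhood, can be sketched as follows. Everything here is synthetic: the feature table is random, the neighborhood names and coordinates are placeholders, and 3 clusters are used instead of 5 because the toy table is small. The names `grouped`, `locations` and `merged` are assumptions, not the notebook's own.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic frequency table: 12 neighborhoods x 4 venue categories.
grouped = pd.DataFrame(rng.random((12, 4)), columns=["Cafe", "Park", "Bar", "Gym"])
grouped.insert(0, "Neighborhood", [f"N{i}" for i in range(12)])

# Cluster on the numeric columns only.
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
kmeans.fit(grouped.drop(columns="Neighborhood"))

# labels_ follows the row order of the fitted table, so it can be attached directly.
labelled = grouped[["Neighborhood"]].copy()
labelled["Cluster Labels"] = kmeans.labels_

# Placeholder location table, joined on the neighborhood name.
locations = pd.DataFrame({
    "Neighborhood": [f"N{i}" for i in range(12)],
    "Latitude": np.linspace(43.0, 43.5, 12),
    "Longitude": np.linspace(-79.5, -79.0, 12),
})
merged = locations.merge(labelled, on="Neighborhood", how="left")
print(merged.head())
```

A left join keeps neighborhoods that returned no venues; their label will be missing and needs handling before the labels are used as integer indices.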

Next, the labelled neighborhood data was plotted on a map using the folium package.
example code:

import folium
import numpy as np
import matplotlib.cm as cm
import matplotlib.colors as colors

# create map
ny_map_clusters = folium.Map(location=[ny_latitude, ny_longitude], zoom_start=10.25)

# set color scheme for the clusters: one evenly spaced rainbow color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, ny_kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
markers_colors = []
for lats, lons, poi, cluster in zip(newyork_merged['Latitude'], newyork_merged['Longitude'], newyork_merged['Neighborhood'], newyork_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lats, lons],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)-1],
        fill=True,
        fill_color=rainbow[int(cluster)-1],
        fill_opacity=0.7).add_to(ny_map_clusters)
       

ny_map_clusters.save('ny_map_cluster.html')  # save the map as an interactive HTML file
ny_map_clusters
Clusters of Toronto
Clusters of New York

The maps show that Toronto has one big cluster (83% of the neighborhoods) and a smaller one; the other three clusters are insignificant by comparison. New York has two big clusters (45% and 41% of the neighborhoods) and one mid-size cluster; the other two are insignificant by comparison. So Toronto's neighborhoods appear more uniform in type, while New York's show much more variety, and the two cities segment differently.
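Cluster shares like the percentages quoted above can be computed directly from the label column. This is a small sketch with made-up labels; the series name `Cluster Labels` mirrors the column used in the map code, but the values are purely illustrative.

```python
import pandas as pd

# Made-up cluster labels for ten neighborhoods.
labels = pd.Series([0, 0, 0, 0, 1, 0, 2, 0, 1, 0], name="Cluster Labels")

# Share of neighborhoods in each cluster, as a percentage.
shares = labels.value_counts(normalize=True).sort_index() * 100
print(shares.round(1))
```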

Average Income and Population Density

This part is easier than the first. Population density and average income data for all boroughs of New York and Toronto were imported from Wikipedia and converted to data frames. After some cleaning and modification, the data sets were ready for exploratory visualization.
example code:

# Toronto: second table on the Wikipedia demographics page
html2 = wp.page("Demographics of Toronto neighbourhoods").html().encode("UTF-8")
dft = pd.read_html(html2)[1]

dft.head()

# New York: first table on the Wikipedia boroughs page
html1 = wp.page("Boroughs of New York City").html().encode("UTF-8")
dfn = pd.read_html(html1)[0]

# keep only the first five rows and drop everything after them
dfn.drop(dfn.index[5:], inplace=True)

dfn.head()
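Tables scraped from Wikipedia usually store numbers as text, with thousands separators, currency symbols or footnote markers. A minimal sketch of the kind of cleaning this step might involve follows. The borough names, column names and values are placeholders, and `to_number` is a hypothetical helper.

```python
import pandas as pd

# Placeholder table imitating scraped text columns (values are made up).
raw = pd.DataFrame({
    "Borough": ["Borough A", "Borough B", "Borough C"],
    "Population Density": ["10,500", "3,200[1]", "7,850"],
    "Average Income": ["$45,000", "$38,500", "$61,200"],
})

def to_number(col):
    """Strip footnote markers and non-numeric characters, then convert to numbers."""
    cleaned = (col.astype(str)
                  .str.replace(r"\[.*?\]", "", regex=True)      # footnote markers like [1]
                  .str.replace(r"[^0-9.\-]", "", regex=True))   # commas, currency symbols
    return pd.to_numeric(cleaned, errors="coerce")

clean = raw.copy()
for column in ["Population Density", "Average Income"]:
    clean[column] = to_number(raw[column])
print(clean.dtypes)
```

With `errors="coerce"`, any cell that still cannot be parsed becomes NaN instead of raising, so a quick check for missing values afterwards is a good idea.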

I created column charts of population density for New York and Toronto, one chart per city. After that, the same method was applied to the average income data.
example code:

import matplotlib.pyplot as plt

plt.figure(figsize=(20, 15))
plt.bar(dft_2['Borough'], dft_2['Population Density'], color='tomato')
plt.title("Population Density by Borough in Toronto", fontsize=28)
plt.xlabel('Borough', fontsize=18)
plt.ylabel('Population Density', fontsize=18)
plt.xticks(fontsize=24, rotation=30)  # sizes and rotation given as numbers, not strings
plt.yticks(fontsize=22)
plt.savefig('pdtor.jpg')
plt.show()
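Since the same chart is drawn for two cities and two measures, the plotting can be wrapped in a small helper. The sketch below is one possible way to do that, drawing two charts side by side. The helper name `bar_by_borough`, the city and borough names, and the numbers are all placeholders chosen for layout only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def bar_by_borough(ax, df, value_col, title, color):
    """Draw one bar chart of value_col per borough on the given axes."""
    ax.bar(df["Borough"], df[value_col], color=color)
    ax.set_title(title)
    ax.set_xlabel("Borough")
    ax.set_ylabel(value_col)
    ax.tick_params(axis="x", labelrotation=30)

# Placeholder numbers, for layout only (not real figures).
city_a = pd.DataFrame({"Borough": ["A1", "A2", "A3"],
                       "Population Density": [10.0, 4.0, 6.0]})
city_b = pd.DataFrame({"Borough": ["B1", "B2"],
                       "Population Density": [5.0, 5.5]})

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
bar_by_borough(axes[0], city_a, "Population Density", "City A", "tomato")
bar_by_borough(axes[1], city_b, "Population Density", "City B", "steelblue")
fig.tight_layout()
```

Passing `sharey=True` to `plt.subplots` would put both cities on the same vertical scale, which makes gaps between them easier to compare by eye.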

From these charts, it is clear that in New York, Manhattan has by far the highest average income and population density of all the boroughs. The other four are poorer.
For Toronto, population density is slightly higher in the Old City of Toronto and York than in the other three. The Old City of Toronto also has the highest average income, but it is not far ahead of the other four.

So, in terms of average income and population density, Toronto is quite different from New York. New York shows a wide gap between its boroughs, and everything seems centralized. Toronto, on the other hand, has a much more uniform distribution of population and income.

Data Sources:
