The Data Incubator Capstone Project
In Collaboration With Seam Social Labs

Predicting NYC Community Districts' social needs using K-Nearest Neighbors

May 2019

Introduction

Seam Social Labs is a group of social impact leaders dedicated to solving complex community problems. The problem we are currently focused on is the role of major developments within communities, e.g. gentrification. Our solution is a web-based platform designed to involve real estate developers and investors in community engagement ahead of construction.

Data and Model

New York City's Mayor's Office of Data Analytics (MODA) and the Department of Information Technology and Telecommunications (DoITT) partnered in 2015 to form the NYC Open Data team. The NYC Open Data Portal provides the spatial data for each community district, in the form of latitude and longitude information, stored as shapefiles.
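One straightforward way to load those boundaries is with GeoPandas. The snippet below is only a minimal sketch: the file name nycd.shp and the BoroCD identifier column are assumptions about the locally downloaded Community Districts shapefile, not part of our actual pipeline.

import geopandas as gpd

# Minimal sketch: file name and column name are assumptions about a locally
# downloaded copy of the Community Districts shapefile.
districts = gpd.read_file('nycd.shp')
print(districts[['BoroCD', 'geometry']].head())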
Besides the spatial data, we worked with a data set that includes demographic and income-related features for the districts, which can be found here. Finally, on the same page, we accessed the annual petitions, from 2010 through 2018, that each community district submits to the city listing its demands.
The Python code that mines the district petitions as PDF files and extracts the top 3 pressing issues with regular expressions and the pdfminer library can be found below.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import re
import os
import pandas as pd
import numpy as np
import dill
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor


def convert_pdf_to_txt(path):
    """Extract the full text of a PDF file with pdfminer."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                  password=password, caching=caching,
                                  check_extractable=True):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text


def pull_top_needs(text):
    """Pull the three issues listed right after 'three issues:' in a statement."""
    pattern = r'three issues:[\n•]+([A-Za-z ]+)[\n•]+([A-Za-z ]+)[\n•]+([A-Za-z ]+)'
    return re.findall(pattern, text)


# Walk the downloaded statement PDFs and collect each district's top 3 needs,
# starting the dictionary with 2017 and extending it with the later years.
needs = dict()
path = r'C:\Users\Gokmen\Capstoneproject\static\statements'
for boro_year in os.listdir(path):
    boro = boro_year[:2]
    year = int(boro_year[-4:])
    if year == 2017:
        for dist in os.listdir(path + '\\' + boro_year):
            needs[boro + dist[-6:-4]] = pull_top_needs(
                convert_pdf_to_txt(path + '\\' + boro_year + '\\' + dist))
    if year > 2017:
        for dist in os.listdir(path + '\\' + boro_year):
            needs[boro + dist[-6:-4]].extend(pull_top_needs(
                convert_pdf_to_txt(path + '\\' + boro_year + '\\' + dist)))
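
The scraper leaves us with a dictionary keyed by district code, where each value is a list of (issue, issue, issue) tuples. As an illustrative next step, that dictionary can be flattened into a tidy table to join against the demographic features later; the column names below are our own for the sake of the example, not the project's.

import pandas as pd

# Illustrative only: flatten `needs` from the scraping step above into a
# DataFrame; the column names are assumptions made for this example.
rows = []
for district, top_threes in needs.items():
    for first, second, third in top_threes:
        rows.append({'district': district,
                     'issue_1': first.strip(),
                     'issue_2': second.strip(),
                     'issue_3': third.strip()})
needs_df = pd.DataFrame(rows)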

Another data wrangling operation we need to carry out involves the spatial data, which is stored as shapefiles. The data points are either Polygon or MultiPolygon instances. Since we will be building our interactive map with Bokeh, which does not accept Polygon or MultiPolygon instances, we need to pull the coordinates out of every Polygon or MultiPolygon object. Here is the function that helps us do that:

from shapely.geometry import LineString, Point
from bokeh.models.glyphs import Line, MultiLine


def get_coords(poly):
    """Return the exterior x/y coordinates of a Polygon or MultiPolygon."""
    if poly.type == 'Polygon':
        x, y = poly.exterior.xy
        return [list(x), list(y)]
    else:
        # MultiPolygon: collect the coordinates of every member polygon.
        X = []
        Y = []
        for p in poly:
            x, y = p.exterior.xy
            X.append(list(x))
            Y.append(list(y))
        return [X, Y]
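
Those coordinate lists can then be handed to Bokeh's patches glyph to draw the districts. The sketch below is only an illustration of that wiring, assuming a hypothetical district_records iterable of (name, geometry, top issue) triples; MultiPolygon districts would additionally need their nested coordinate lists flattened.

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool

# `district_records` is a hypothetical iterable of (name, shapely geometry,
# top issue) triples assembled from the data prepared above.
xs, ys, names, issues = [], [], [], []
for name, poly, top_issue in district_records:
    if poly.type == 'Polygon':            # MultiPolygons would need flattening
        x, y = get_coords(poly)
        xs.append(x)
        ys.append(y)
        names.append(name)
        issues.append(top_issue)

source = ColumnDataSource(data=dict(xs=xs, ys=ys, name=names, issue=issues))
p = figure(title='Top pressing issue by community district')
p.patches('xs', 'ys', source=source, line_color='white', fill_alpha=0.7)
p.add_tools(HoverTool(tooltips=[('District', '@name'), ('Top issue', '@issue')]))
show(p)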

Results

We display our results on the interactive maps below. The first tab, named 'Current', shows the current top 3 pressing issues in each community district. Hover over the districts to see the results, and click on a specific district to see a comparison of 2019 and 2024. The second tab, named '2024', shows our predictions for 2024, which we produced using K-Nearest Neighbors. We set the number of neighbors to 5 because that value gave us the most accurate predictions, and we evaluated the model with cross-validation. Our historical data points go back to 2010, which makes the data set large enough to train the algorithm and carry out the validation.

However, evaluating this model is a bit tricky because the target variable is not binary: a district's social need can be any one of 26 possible issues. We therefore decided to ignore the order within the top 3 list. For instance, if the actual issues are affordable housing, schools and parks, and our predictions are schools, parks and affordable housing, we count it as an accurate prediction. With this adjustment, we reach an accuracy rate slightly above 90%.
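
The scoring logic is easiest to see in code. The sketch below is a simplified illustration rather than the exact project code: X and y are placeholder names for the feature matrix and the three-column issue labels, and it uses a single train/test split in place of the full cross-validation.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# X: demographic/income features per district-year (placeholder name).
# y: the corresponding top-3 issue labels, one column per slot (placeholder).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)    # k=5 gave the best accuracy
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Order-agnostic scoring: a row counts as correct when the predicted and
# actual top-3 issues match as sets, regardless of their order.
correct = sum(set(p) == set(t) for p, t in zip(predictions, y_test))
print(f'Order-agnostic accuracy: {correct / len(y_test):.2%}')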