Block Pruning Analysis

From DataKind
Jump to: navigation, search

Contents

Team Members

Use this as a template. Replace the elements in ALLCAPS with your information and append it to the end of the table ( Before the |})


Source(s): Children's Memorial Hospital


|-
| NAME || [mailto:YOUREMAIL YOUREMAIL] || ORGANIZATION || ROLE || SKILLSET

Name Email Organization Role Skillset
Eric Busboom Clarinova Data Wrangler ( SQL) Python, SQL
Mike Holland NYU CUSP
Nate McNamara nate@mcnamara.net Morgan Stanley Data Cleaning Programming & Databases
Joel Natividad joel.natividad@ontodia.com Ontodia Data Wrangler LAMP, Linked Data
Aaron Schumacher ajschumacher@gmail.com NYCDOE Analyst R, Python, SQL, web
Dan Blum dbblum@gmail.com CityEats Analyst R, SQL
Mira Bernstein mirabernstein@gmail.com None Data cleaning, modeling Stata
Kim Johnson johnson.kim00@gmail.com Fordham University Cleaning, recoding Stata
Ayeh Bandeh-Ahmadi heyayeh@gmail.com University of Maryland Cleaning, scripting R, SQL, Matlab, Stata
Craig Nowell craig.nowell@gmail.com Greenplum


Source(s): Children's Memorial Hospital


Info

MySQL Database Details

MySQL 5.5.25a in Amazon Cloud (RDS)

Host/Endpoint: nycdatadive2012.cu2ln1dfogze.us-east-1.rds.amazonaws.com

Port: 3306

Database: nycdatadive2012

User: datahacker password: HacK1ngD4ta (this user has select, create view, insert, update privs)

For people who require DBA access (create indices, restructure tables, create/drop tables), please contact Joel, so I can give you the DBA login.

User-Friendly Web Interface (phpMyAdmin): http://75.101.138.139/phpmyadmin/

You can do SQL queries (you can even do Query By Example if you don't know SQL), and download the results in various formats (CSV, XML, PDF, etc.) for further analysis.


Source(s): Children's Memorial Hospital


Approach

Ideas

This R code provides a function to get a block (segment) from an x, y coordinate pair. I don't know why but syntax highlighting is screwing it all up, so here it is plain. It isn't super efficient necessarily but I don't know that it can be sped up any more. It returns the complete row for the relevant segment. - Aaron


Source(s): Children's Memorial Hospital


bll <- read.csv('~/parks/pruning/Block_pruning_citywide_12A_lonlat.csv')

# zero is NA, 9999 means unlikely to ever by pruned and can be treated as NA
bll$Prune_year <- ifelse(bll$Prune_year %in% c(0,9999), NA, bll$Prune_year)

segmentFromXY <- function(x,y) {
  # pretty far down this page for the point-to-segment implementation
  # http://www.mathworks.com/matlabcentral/newsreader/view_thread/164048
  d <- with(bll,
            # test to see if the point is in range of segment
            ifelse((pmin(XFrom,XTo)<=x &
                    pmax(XFrom,XTo)>=x) |
                   (pmin(YFrom,YTo)<=y &
                    pmax(YFrom,YTo)>=y),
            # if in range, use distance to line
            abs( (XTo-XFrom)*(YFrom-y) - (XFrom-x)*(YTo-YFrom))
            /
            sqrt( (XTo-XFrom)^2 + (YTo-YFrom)^2 ),
            # else use min distance to endpoint
            pmin(sqrt((XTo-x)^2+(YTo-y)^2),
                 sqrt((XFrom-x)^2+(YFrom-y)^2))
            ))
  # return the record with lowest distance,
  # preferentially with prune year (most recent)
  # and shortest such segment
  return(head(bll[order(d,-bll$Prune_year,bll$SHAPE_Leng),], 1))
}

# you're likely interested in "Prune_year" and one of "SegmentID" or "OBJECTID"


Source(s): Children's Memorial Hospital


Organizing the Data in q/kdb+

To create a file containing the data for regression, we followed these steps:

1. We read in Alltrees_20120606.csv and Block_pruning_citywide_12A_lonlat.csv into tables called tree and blockPruning, respectively:

uselessColumns:{ // columns where every cell has the same value
  cols[x] where 1 = count each value each
    "exec distinct ",/:(string each cols x),\:" from ",string x}

removeUselessColumns:{ // save memory where possible
  value "delete ",(","sv string uselessColumns x)," from `",string x}

treeFile:`:Alltrees_20120606.csv;
treeFormat:"S*FSIIIISIBBBIIIFF";
tree:(treeFormat;(),",") 0: treeFile;
removeUselessColumns`tree;

bpFormat:"I*IISISISSIIIISIIIIIIIIIIISIIIIIIIIIISISISISISISISISISISIIIISIISSSSSSSSIIII",
         "IIIIIISSIIIIIIISSIIFFIIIIIIIIIIIJFIIFFFF";
blockPruningFile:`:Block_pruning_citywide_12A_lonlat.csv;
blockPruning:(bpFormat;(),",") 0: blockPruningFile;
removeUselessColumns`blockPruning;

2. We added the segmentID to the tree table using Mira's algorithm. First, we prepared a table called bp (based on blockPruning) to make computing the segment easier.

bp:select SegmentID,PhysicalID,Prune_year:(max;Prune_year)fby PhysicalID,XFrom,XTo,YFrom,YTo,
          dx:abs XFrom-XTo,dy:abs YFrom-YTo,invL:1%square[XFrom-XTo]+square YFrom-YTo
     from blockPruning;
update Prune_year:0N from `bp where Prune_year=-0W;
update ZDX:0=dx,dx2:square dx,dy2:square dy,dxdy:dx*dy from `bp;
`XFrom`YFrom xasc `bp; // for segmentFromXYQuick

square:{x*x}

calcScore:{[x;y;BP]
  p:update px:invL*(dx2*DX)+dxdy*DY,py:invL*(dxdy*DX)+dy2*DY
      from update DX:x-XFrom,DY:y-YFrom from BP;
  update score:?[((px%dx)within -.05 1.05)|ZDX&(py%dy)within -0.05 1.05;
                 square[DX-px]+square DY-py;
                 0w] from p}

segmentFromXY:{[x;y]
  $[any null x,y; 0N; exec first SegmentID from calcScore[x;y;bp] where score=min score]}

segmentFromXYQuick:{[x;y] // correct ~80% of the time by our measurements, but takes minutes instead of hours
  $[any null x,y;
      0N;
      exec first SegmentID
        from calcScore[x;y;bp(bp[`XFrom]bin x)+-5+til 10001]
	where score=min score]}

update X:`int$21468535+276815*lon,Y:`int$-14578574+362970*lat from `tree;

update segmentID:segmentFromXYQuick'[X;Y] from `tree;

3. We loaded workorder_extract.csv, which Eric created from :WorkOrders_NamesCleaned.csv, and computed segments for each work order and stored them in a table called workOrderIDsegmentID (if the results have been saved, we load them to save time):

workOrderFile:`:workorder_extract.csv;
workOrderFormat:"IIFFSFFFFFBBIIISFZZZZZZZZZZZZZZZZZZZ";
workOrder:1_(workOrderFormat;(),",") 0: workOrderFile;
removeUselessColumns`workOrder;

workOrderIDsegmentID:$[-11h=type key `:workOrderIDSegmentID.csv;
                         ("II";(),",")0:`:workOrderIDsegmentID.csv;
                         select workOrderID,segmentID:segmentFromXYQuick'[X;Y] from
                           select workOrderID:workorderid,X:`int$woxcoordinate,Y:`int$woycoordinate
                             from workOrder];

4. Then we count the number of work orders each year in each segment which we use to estimate the average cost to prune each segment:

ordersByYearSegment:select orderCount:count i by initiatedate.year,segmentID from
                      workOrderIDsegmentID lj
		        `workOrderID xkey
			  select workOrderID:workorderid,initiatedate
			    from workOrder;
costByYearSegment:update cost:cost*orderCount from ordersByYearSegment lj
                    select treeCount:count i,cost:avg (`s#5 12 20 27f!39 54.1 66.6 84.4)DBH
                      by segmentID from tree;

5. Lastly, we create our independent variable from bp and export the parameters so we can run regressions in R:

rack:flip `segmentID`year!flip(exec distinct SegmentID from bp)cross 1995+til 18;
indep:update pdiff:?[year<=Prune_year;50;year-Prune_year]
        from rack lj select first 2012^Prune_year by segmentID:SegmentID from bp;
regressionParams:delete Prune_year from
                   update 0^orderCount,0^cost,invPdiff:1%pdiff,unknownPruneYear:50=pdiff
		     from indep lj costByYearSegment;
`:regParamsSince2008 0:select from regressionParams where year>=2008;


Source(s): Children's Memorial Hospital


Findings, Results, Assorted Graphs and Such

PowerPoint Slides for Presentation

Put links to your individual power point slides here. We'll assemble them into the final presentation at the end.

  * Two not-very interesting slides on cleaning block pruning data
  * Slides with graphs of WO/tree (results)

Here are two plots that could be somewhat useful. There are two slides that show when the blocks in Brooklyn were last pruned, and the number of blocks pruned per year. https://www.dropbox.com/sh/botl8vigy177nhm/-T0cp_qsa4

Here are two plots on the effectiveness of pruning: http://imgur.com/a/KcYPX Does this mean we shouldn't be optimistic about finding that pruning is helpful? Maybe. But there are a lot of ways that this could be wrong. These are based on matching Brooklyn trees to block file segments and Brooklyn work orders to block file segments, then dividing number of work orders in 2010 or later for segments with a given pruning year, by the number of trees for segments with that pruning year. A lot of things about these results feel strange. Is it reasonable to have a number of work orders around 4% total number of trees, in a span of just 2 years? (Oh, actually yes, based on numbers from their presentation...) Is it really the case that pruning isn't helpful? Really? These estimates inspire further work, not necessarily drawing conclusions. - Aaron

Here are some powerpoint slides -- first shows some of the data; second explains the regression model and shows results from last night. I'm working on rerunning analysis on new data and explaining what it means that the regression (loglog version was most significant) has highly significant coefficients but a very low overall R squared (<0.2%) -- Ayeh https://www.dropbox.com/s/sb8gky3ed0ladug/pruning-v01.pptx


Source(s): Children's Memorial Hospital


Open Questions / Suggestions / Gripes

Data, Links, Other Info We Rounded up

Your Data Wrangers are:

  • Nate, for numerical and categorical analysis of data files. talk to Nate if you want statistics about the data.
  • Eric Busboom, for SQL based extracts. Talk to Eric if you need a CSV or SQL file with particular columns.

Here is a schema for an SQL import for Block_pruning_citywide_12A.csv and Block_pruning_citywide_12A_latlon.csv. Talk to Eric Busboom for the sqlite3 file.

Cleaned up block pruning data: https://www.dropbox.com/sh/vxmr3wssp4yp41g/nlCgRgym6h

  • Restricted to Brooklyn
  • Removed non-pedestrian streets and other weird types streets (based on variables FeatureType and NonPed; see data documentation: http://www.nyc.gov/html/dcp/html/bytes/meta_lion.shtml)
  • Often, several segments have the same PhysicalID (i.e. are part of same physical block -- the divisions between segments reflect some obscure need of some NYC department). Sometimes part of these segments are labeled as pruned in a given year and others are not labeled as pruned (prune_year=0 or missing). In these cases, we assume this was a mistake of data entry and all the segments of the physical block were actually pruned in that year. The exception is the "block" with PhysicalID=0, which has more than 1000 unrelated segments. (Probably a case of missing PhysicalID.) For these segments, we did not change the value of prune_year.
  • If the value of prune_year was 0, 9999, or missing, we converted it to missing. (Based on conversation with parks people about what these values mean; all of them mean we don't know when the block was last pruned, possibly never.)
  • We take out duplicates on SegmentID. (We checked that these duplicates do not differ on any other relevant variables either.)

Uncleaned, but complete package of data used in the project: [dd-nyc2012-pruning-datapkg.zip https://dl.dropbox.com/u/39949226/NYC%20Data%20Dive/dd-nyc2012-pruning-datapkg.zip]

Link to trees census info for Brooklyn. https://www.dropbox.com/s/jy8ocpcpsg07ms7/AllTrees_Brooklyn2.csv

Brooklyn Work Orders: http://dl.dropbox.com/u/53163928/Work%20Orders%20.csv

311 data (Brooklyn) 2004-2007: http://dl.dropbox.com/u/53163928/2004-2007.csv or http://dl.dropbox.com/u/53163928/2004-2007.xls

311 data (Brooklyn) 2008-present: http://dl.dropbox.com/u/53163928/2008-present

Brooklyn tree data with matched segmentID and pruneYear https://www.dropbox.com/s/6njlhus4aiqq8d6/bk_trees_segs_years-v1.csv

Tree census with segmentID according to Mira's algorithm (gzipped) https://www.dropbox.com/s/xt9mhwz3pb0snmo/tree.csv.gz

Two-column csv linking workOrderIDs to segmentIDs computed according to Mira's algorithm https://www.dropbox.com/s/70fsd6ttmvm4gwi/workOrderIDsegmentID.csv

(work orders * average cost per tree) by segment/year https://www.dropbox.com/s/chge00x3ul2atw5/costByYearSegment.csv

Count of work orders by category and years since pruned https://www.dropbox.com/s/ruyfzyyzhheo1hl/workOrdersByCategoryYearsSincePruned.csv

Count of work orders by category and years since pruned *using Mira's cleaned block data* https://www.dropbox.com/s/940vqww3iwsorlm/workOrdersByCategoryYearsSincePrunedCleaned.csv

A simple stacked column chart showing work orders by category versus years since pruned *using Mira's cleaned block data*

https://www.dropbox.com/s/kujl50vufnl7ybt/Work%20Order%20vs%20Years%20Since%20Pruned%20Chart.jpg

As you can see, this chart leads us to believe that work orders actually decrease as the number of years since pruning increase. This output is most likely flawed for the one or all of the following reasons: - Since we do not have a complete picture of all work orders (with last year pruned date for each), we could simply have more data points for work orders with more recent prune dates. - If we make a bold assumption that the data is in fact correct, it's possible that recent pruning drives awareness of 311 and the services the parks department provides. Therefore, residents within recently pruned blocks will call in more tree issues.

Independent Variables (file with year, segment id, pdiff=number of years since last prune year with 50 indicating no info/never, flag=1 indicating if last prune date is unknown) https://www.dropbox.com/s/5pfcnnsgqsq5u38/indep.csv r script that generated this using block_pruning.csv data file: https://www.dropbox.com/s/99esiiiorc4q20s/sumtrees.r

Regression inputs: https://www.dropbox.com/s/ezml8aldfpok076/regParamsSince2008.csv

File with data on number of trees (column c) per segment and number of trees falling into bins of <=12", >12" but <=20", >20" but <=27",>27" respectively in columns c1, c2, c3, c4. Note that columns c1-c4 are sometimes less than column c, so need to be careful using this file: https://www.dropbox.com/s/c6d0mza0cirj0ym/treecounts.csv r script that generated this file using bk_trees_segs_years-v1.csv as input: https://www.dropbox.com/s/t6to7orfi9glahv/convertblocks2.R


Source(s): Children's Memorial Hospital


Go Home

Back to NYC DataDive 2012 main page

Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox