Pre-reading: Installing CSVkit and Command Line Basics
We used CSVkit to whittle NYC property records down to manageable pieces. Take another stab (and think about how this might help with, say, 311 call data. Or DOB records.
I want to walk through a chunk of data that I helped someone in Tim’s class manipulate.
NYC’s Department of City Planning publishes incredibly useful property maps of NYC. Not for nothing, these are available on Socrata, but if you find them on NYC Open Data you’re way (way) better off going back to the agency that provides the data. Among other things, City Planning provides clear context for their data.
Today the link for the most up to date MapPLUTO data is http://www.nyc.gov/html/dcp/html/bytes/dwn_pluto_mappluto.shtml but that may change. Download the CSV format. You’ll see why in a moment.
controlc is the “kill” command — it will stop the current process. So if you lose your command prompt or you run something that is taking longer than it should, controlc will set you right. Keep in mind, however, that we chose the very smallest file to work on in class. The other four boroughs have more buildings, bigger data.
We all set up our computers so that we can open a terminal in any folder, just from the context (right click) menu, but you can also use
pwd to see exactly where you are and
cd ... to move up or down the folder tree. I recommend Zed Shaw if that’s sticky.
We also played with tab completion, and used
* as a wildcard.
And, we used
du -h ./* to check the sizes of our files.
wc -l to get wordcounts of a file.
View column names with
csvcut -n MN.csv
Search for a particular column by piping the output of that command to grep:
csvcut -n MN.csv | grep Own
Find the column numbers for these columns:
csvcut -c 2,3,4 MN.csv to print columns 2, 3 and 4 to stdout. Challenge yourself: write the command to produce all of the columns we need.
> to redirect
csvcut‘s output from
stdout to a new file:
csvcut -c 2,3,4 MN.csv > smaller_MN.csv
Remember: that isn’t a complete list of columns!
csvgrep to search for a specific value (11) in the land use column (in my example, it’s column 12:
csvgrep -c 12 -m 11 smaller_MN.csv > vacant_MN.csv
And then count the lines in your resulting file:
wc -l vacant_MN.csv