Category Archives: Hands-On

CSVkit Walkthrough

Pre-reading: Installing CSVkit and Command Line Basics

We used CSVkit to whittle NYC property records down to manageable pieces. Take another stab (and think about how this might help with, say, 311 call data. Or DOB records.

I want to walk through a chunk of data that I helped someone in Tim’s class manipulate.

MapPLUTO

NYC’s Department of City Planning publishes incredibly useful property maps of NYC. Not for nothing, these are available on Socrata, but if you find them on NYC Open Data you’re way (way) better off going back to the agency that provides the data. Among other things, City Planning provides clear context for their data.

Today the link for the most up to date MapPLUTO data is http://www.nyc.gov/html/dcp/html/bytes/dwn_pluto_mappluto.shtml but that may change. Download the CSV format. You’ll see why in a moment.

Getting unstuck

controlc is the “kill” command — it will stop the current process. So if you lose your command prompt or you run something that is taking longer than it should, controlc will set you right. Keep in mind, however, that we chose the very smallest file to work on in class. The other four boroughs have more buildings, bigger data.

Getting Around

We all set up our computers so that we can open a terminal in any folder, just from the context (right click) menu, but you can also use pwd to see exactly where you are and cd ... to move up or down the folder tree. I recommend Zed Shaw if that’s sticky.

We also played with tab completion, and used * as a wildcard.

And, we used du -h ./* to check the sizes of our files.

Use wc and wc -l to get wordcounts of a file.

Using CSVkit

View column names with csvcut -n MN.csv

Search for a particular column by piping the output of that command to grep: csvcut -n MN.csv | grep Own

Find the column numbers for these columns:

  • LandUse
  • OwnerType
  • ZoneDist1-4
  • AssessTot
  • ExemptTot
  • Council
  • ZipCode
  • Address
  • CD
  • Lot
  • Block
  • XCoord
  • YCoord

Use csvcut -c 2,3,4 MN.csv to print columns 2, 3 and 4 to stdout. Challenge yourself: write the command to produce all of the columns we need.

Use a > to redirect csvcut‘s output from stdout to a new file:

csvcut -c 2,3,4 MN.csv > smaller_MN.csv

Remember: that isn’t a complete list of columns!

Use csvgrep to search for a specific value (11) in the land use column (in my example, it’s column 12:

csvgrep -c 12 -m 11 smaller_MN.csv > vacant_MN.csv

And then count the lines in your resulting file: wc -l vacant_MN.csv

Command Line Basics

If you really want to master the command line, Zed Shaw’s Command Line Crash Course is probably indispensible. I’ve never actually taken it but I’ve taken other Zed Shaw classes and he’s good at what he does.

By default, OSX’s Terminal (and Ubuntu) both use Bash. Bash is actually just one of many command line interpreters out there, but comparing the relative merits of shell environments is well, well beyond the scope of our class. You probably won’t use the command line much in your career as a journalist but if you’re going to work with data, there are times when it can save you from tearing your hair out. If you go to Socrata you can download a 2.24 GB file that contains a record of every single call to 311 since 2010. You can’t open it in Excel, however — it is too big.

So if you want to get some kind of handle on what is in the file, you need some way to examine it without opening it. Enter the shell. Continue reading Command Line Basics

Troubleshooting Highcharts

When you’re ready to embed a chart, JSFiddle’s build in iframe tool is handy, but in the long run it isn’t what you want. Embedding Fiddles is fiddly, so instead of mastering JS Fiddle learn how to put a chart on your own page.

I put together some very useful Bootstrap templates including two Highcharts templates that are your best starting points. If your chart isn’t showing up where you think it ought to, try checking the following:

  • Did you place a <div..>…</div> on the page where you want the chart to appear?
  • Does your <div> have a descriptive, one word id? id="container" is not descriptive. id="time_to_leak" is.
  • Is your Highcharts function looking for the right id?
    $(function () {$('#oscar_night').highcharts({ is looking for a div with id="oscar_night"
  • Is your jquery call above your function? (By “jquery call” I mean the script tag that includes a minified copy of jQuery, probably from a content delivery network somewhere. Something like <script src="//ajax.googleapis.com/ajax/libs/jquery/1.11.0/jquery.min.js"></script>
  • Is your highcharts call below your function? (That’s the line that pulls in Highcharts’ scripts <script src="//code.highcharts.com/highcharts.js"></script>)

HTML Fundamentals

We’re using bootstrap to manage most page layout needs, but you still need a little bit of HTML to place images, link to other pages or add breaks between paragraphs.

I like ReText for Ubuntu Linux or Mou for OSX.

Both programs allow you to write your text in a very simplified language called “markdown” — and they’ll produce clean HTML from it. By default a blank Mou document has everything you need to know about markdown right inside it.

I like a lot of things about writing in Markdown: you can stay focused on the content because it is super readable. Markdown enforces structured, heirarchical HTML which will make the web team swoon when you move to a working newsroom.

You can’t do much formatting in Mou, but you shouldn’t be doing much formatting anyway: focus on your reporting.

If you really want to start to understand how HTML works, these readings might help:

Why can’t I just use Dreamweaver?

You might have figured out by now that you can save a Word document as HTML. You might even have a “What You See is What You Get” (or WYSIWYG, prounounced whizzy wig) editor like Dreamweaver Handy. The problem is that both of these will do their utmost to replicate exactly what you’ve got on the screen in the HTML they produce. You won’t be able to read the results and you’ll have a hard time wrestling them into a different template.

Trust me: if you want more fine grained control than you can get in Bootstrap, a WYSIWIG editor is not the path to follow.

Think of it this way: you can make spaghetti sauce from a diced onion, two cloves of garlic and a can of tomatoes. Someone says to add a bay leaf, some thyme and oregano. Then your overbearing gourmet friend says you must start with a soffritto and … at this point no one would blame you for eying the jarred sauce aisle. If you’re just starting out as a cook, you are far, far better off making a super simple sauce from scratch so you can build on it. (There is a Cosby Show episode about what happens if you heat up a jar of sauce and try to pass it off as homemade: You will not impress Clair Huxtable.)

Plus, the truth is that you can read the ingredients on a jar of Ragu, but you can’t read the ingredients on generated HTML and you definitely can’t get the sugar out and the time you’d spend trying to figure out how to make the sauce taste less tinny would be far better spent dicing an onion and getting on with it.

Instead, use a Markdown editor to compose your text and break it into paragraphs, lists and blockquotes, and then use Bootstrap to lay it out.

Bootstrap and SimpleHTTP

Launching a webserver

It seems like an extra hurdle, but getting a little web server running goes a long way towards being able to test out your code before you upload it to a real web server.

Step 0: Start to organize the files that you’re putting on Digital Storage into a single directory so you can keep track of them.
Step 1: Enable New Terminal at Folder
Step 2: Navigate to the folder that contains your HTML files. Right click on the folder in your Finder window and select “New Terminal at Folder” — it might be in a “Services” sub-menu.
Step 3: Check your directory with pwd (Just type pwd at the command prompt and hit return)
Step 4: Launch Python’s Simple HTTP Server with python -m SimpleHTTPServer

Now you can visit http://localhost:8000 and see your html files served over a little local web server.

Bonus: in another terminal window try running python --help — see if you can figure out what the -m flag is doing.

Troubleshooting

Read the error messages you see and think about what they’re saying. If you see Firefox can't establish a connection to the server at localhost:8000. that means there’s no server running on port 8000. You might have the port number wrong, or SimpleServer might have stopped running. Luckily, it is easy enough to restart. If you see :

Error response
Error code 404.
Message: File not found.
Error code explanation: 404 = Nothing matches the given URI.

that is a clue that either you tried to open a page that doesn’t exist (check the spelling?) or your web server is running from the wrong folder.

Bootstrap

The secret truth is that you can load Bootstrap’s CSS library without a web server, but getting used to testing things over a server will pay off in the long run. I promise.

The reason we’re doing it now is so we can include script and CSS files with protocol-relative URLs. Protoco-huh? For now, just trust me. Brocolli is better for you than brownies even though brownies won’t kill you. Drink lots of water, get plenty of sleep and at least a little vigorous exercise, even if you don’t really understand.

We walked through some basics of Bootsrap: you need to call their CSS in the page header, and if you’re using any javascript you need to call their javascript before you close the body of your page. You need a valid doctype definition and a few particular meta tags.

Get started with my template, and remember to keep an eye on the heirarchy. Everything but the doctype goes in either the head or the body (and head and body belong inside of html tags). The head includes a few meta tags, a title and any link tags that call style sheets. Sometimes you’ll put script tags in the head, sometimes you want them right before the closing body tag. Why? That’s complicated. Almost everything else will go inside the body tag.

If you’re using Bootstrap, start by putting everything inside a container div:

<div class="container">
    (stuff goes here)
</div>

You can start adding other elements — paragraphs, a jumbotron or a set of tabs — inside that container.

Quickfloats are handy for pullquotes. You can let them find their own size or use the grid to set their width.

Bootstrap does a lot of fancy footwork to customize layouts on tiny, small, medium and large screens. For now, I suggest that you stick to the col-md- classes. They’re the most universal. Our goal here is to tell good stories, not master the finer points of CSS layout, with or without Bootstrap.

We didn’t talk much about paragraphs, blockquotes or anchors, but you’ll need them if you don’t want your content to run together in a great heap. You can wrap paragaphs in <p> tags manually, but if you want some help with the basic pieces, I’m a huge fan of Mou. You can compose in Markdown, a heavily simplified syntax, and then copy the resulting HTML into a Bootstrap template. Markdown is a very basic language — you can’t do a lot with it, so you can focus on the basics.

CartoDB Walk Through

In 2011, the BLS published a map of fatal workplace injuries by state.

Are these colors continuous or categorical? Should they be? Is anyone surprised that CA and TX have a lot of workplace injuries? How can we improve on this?

So we’ll recreate it:
BLS Fatality Data (csv)
2011 Population Estimates (via census.gov)

Combine those two in a spreadsheet, first. Use “Save As” to create a new spreadsheet so you don’t clobber your data.

Use =find() to confirm that state names match. They will until ~New York

Use =G10/E10 to get fatalities per capita, talk about scientific notation, then =(G10/E10)*100000 for per 100,000

Upload to CartoDB

Upload the CSV to CartoDB. Now you need a shapefile. Get a US State file from NationalAtlas.gov or NOAA or this much, much smaller file from GeoCommons. (Use the KML, the shapefile doesn’t have a projection)

GoogleMaps will only take KML files, CartoDB will take ShapeFiles or KML, but don’t use Safari — a shapefile is a bundle of database files, you don’t want to unzip it.

Merging

CartoDB used to make merging harder but they’ve worked a few things out.

Keep both open: what columns can we use to combine these two?

CartoDB has a good explanation of the join process. It isn’t actually straightforward.

UPDATE fatalities_per_capita
SET the_geom =  us_states.the_geom
FROM us_states
WHERE us_states.state = fatalities_per_capita.state

This works for me because my BLS data is in a table called fatalities_per_capita, my state boundaries are us_states, and in both tables the state names are in a column called “state.”

Display

Use the wizard first. Bubbles for population, color for percapita.

What do methods mean? Learn This was a little buggy but might be fixed by now.

Put it on a web page

Where to Find Boundary Files

How to Geocode

If you need to transform addresses into lat/lon pairs, you have a couple of options:

  • Fusion Tables will do it, but their terms of service say you have to use that data on a Google Map.
  • Geocoder.us Will do one address at a time, or you can pay for a batch
  • CartoDB gives you a bunch free and you can pay for more.
  • More suggestions