Willens Data Sets

The musiXmatch dataset
http://labrosa.ee.columbia.edu/millionsong/musixmatch

Created in partnership with the Million Song Dataset, a dataset created by the Echo Nest for developers who are looking to create music-related digital tools and apps, this dataset compiles information about songs spread out across many genres and eras. It is maintained by Columbia University, while the Million Song Dataset is maintained by the Echo Nest and MIT.

Rather than full lists of lyrics for every song, the most important and common words from songs are included and grouped together, allowing researchers to identify broad trends in music.

Food Scrap Drop-Off sites
https://data.cityofnewyork.us/Environment/Food-Scrap-Drop-Off-Sites/rmmq-46n5

A list of sites (primarily greenmarkets) that accept food scrap drop-offs. The city does not maintain all of these sites, but it does house the map at its data portal, nyc.gov/data.

The distribution of these sites ought to be able to tell us a lot about how ideas about composting can travel in communities across New York.

Piracy Data
http://waxy.org/2008/02/pirating_the_20_2/

Data about the year’s Oscar-nominated films and how long they took to leak onto piracy networks. The data’s been compiled and hosted by a guy named Andy Baio, a developer and programmer who’s worked on a variety of projects, including the initial team that built Kickstarter.

I think this data’s interesting because it deals with a topic that’s of ongoing interest in the media industry, it’s got a good peg (the upcoming Academy Awards), and it’s a manageable set that people will understand quickly.

Smiley Data Sets

1) http://www.rtknet.org/db/erns/substance

This data was compiled by ERNS, the Emergency Response Notification System. It provides information on toxic chemical spills and other accidents for 2012, including substance, number of incidents, deaths, hospitalizations, injuries, evacuations, and property damage. It’s interesting because these incidents have been in the news recently: the chemical spill in West Virginia, the toxic ash spill into a North Carolina river, etc. Incidents like these affect everyone because they often affect drinking water. I think graphing this data could give more insight into these incidents and possibly lead to a deeper story.

2) http://www.health.ny.gov/statistics/vital_statistics/2011/table04.htm

This data was compiled by the Department of Health and includes birth summaries in New York State for 2011 broken down by race and ethnicity. A recent government study found that in 28 states and New York City, first-time C-sections declined to 21.5% in 2012, from 22.1% in 2009. Since this data includes the method of delivery, it would be interesting to map this out and find out if there is any correlation between method of delivery and race/ethnicity in New York State.

3) https://data.cityofnewyork.us/Social-Services/Dirty-Water/k2um-vsan

I found this data on NYC Open Data. It’s based on 311 Service Requests from 2010 until the present, so it’s changing every day. It includes the exact date and time of each complaint, the complaint type (water quality or water system, drinking water) and sometimes even a description of what is wrong with the water (tastes bitter/metallic, looks cloudy, etc.). I think this data would be interesting to look into mostly because it might show patterns (certain boroughs, neighborhoods or streets having more problems than others, etc.). Analyzing this data could also help when looking into other data about water in NYC. For example, if complaints from a particular area in Queens keep resurfacing over time, it may be worth looking at data about that area’s water system/quality.

NYC DoB Complaints, Healthcare Surveys & Costs, and Transportation Fatalities by Mode

1. The New York City Department of Buildings maintains a dataset that records all complaints made to the department. This particular set covers 2013. The complaints range from malfunctioning elevators, boilers and electrical wiring to unsafe working conditions at construction sites. I think that the most interesting visualization would be a map of complaints regarding vital building hardware.  It would work as a service piece to those that live in these buildings as well as prospective buyers and renters.  Highlighting those with repeated complaints that have been open for an extended period of time could expose negligent landlords. https://nycopendata.socrata.com/Housing-Development/DOB-Complaints-Received/eabe-havv?

2. This dataset is maintained by the federal Centers for Medicare and Medicaid Services. It contains the data from the Hospital Consumer Assessment of Healthcare Providers and Systems, a national, standardized survey of hospital patients about their experiences during inpatient hospital stays in 2013. This is a large dataset that would be difficult to visualize. The best approach would likely be a large map with all of the data and the option to narrow the data down to a specific area or zip code. I would also like to connect this in some way to the voluminous data present on social media sites like Google Plus and Yelp. It would be a fairly broad service piece that would be interesting to most readers, but especially the elderly and chronically ill. https://data.medicare.gov/Hospital-Compare/Survey-of-Patients-Hospital-Experiences-HCAHPS-/rj76-22dk

3. This dataset is also maintained by the Centers for Medicare and Medicaid Services. It contains data on hospital costs organized by hospital and specific operation throughout 2013. Costs are averaged out by charges paid by the patient and costs covered by insurance, with a separate cell that provides the number of patients. Like the hospital survey data, this would probably be best visualized as a large map with all of the data as well as several graphs and charts to highlight the major differences. This would also be a great service piece for those in constant contact with hospitals as well as the average reader. However, unlike the survey data, this data is going to be much harder to work with. How do you accurately break down averaged costs?
https://data.cms.gov/Medicare/Inpatient-Prospective-Payment-System-IPPS-Provider/97k6-zzx3

4. These datasets are maintained by the federal National Highway Traffic Safety Administration, the Federal Railroad Administration and the National Transportation Safety Board, respectively. This data would again be best demonstrated over a large map (they actually have location codes for all the accidents) with the option to filter by mode of transportation and see each individually as well as all at once. This data would also be complemented nicely by some comparative graphs and charts showing the differences in fatalities between modes of transportation.
http://www-fars.nhtsa.dot.gov/QueryTool/QuerySection/SelectYear.aspx
train data
http://catalog.data.gov/dataset/railroad-accident-data-on-demand
plane data
http://catalog.data.gov/dataset/ntsb-aviation-accident-database-queryextract-tool

Hartman Data Sets – All of the Olympics

Data Set 1: Olympic Countries

http://www.sports-reference.com/olympics/countries/

Beginning in 1998, a group of ten members of the International Society of Olympic Historians from around the world came together to create a database of Olympic-related information. Each member of the team is responsible for a certain data set and maintains it individually before sending it to the editors for final approval. The website is part of the Fox Sports Network. This data is interesting because it breaks down a multitude of Olympic information in one easy table. The data shows when each country began competing in the summer and winter games, their medal counts and number of athletes ever to compete. I think this data set could be used to highlight how some countries, for example a small country like Andorra, have sent more athletes to the Winter Games than other countries with far larger populations.

Data Set 2: LGBT Representation on TV

http://www.glaad.org/files/2013NRI.pdf

Each year GLAAD (Gay & Lesbian Alliance Against Defamation) evaluates LGBT representation on both network and cable television. GLAAD’s entertainment team, led by Associate Director of Entertainment Media Matt Kane, researches and monitors all network television as well as a number of cable channels throughout the year to track the progress of LGBT inclusion in both television and major motion pictures. The data can be found on their website for the past several years, and the included link, the 2013 report, references past years and shows how inclusion has risen or fallen, broken down by channel. This would be interesting data to display in a more visual manner because the long report makes it difficult to compare network to network, cable channel to cable channel, etc. The text-heavy document is full of information that would be far more digestible if presented in a series of graphics.

Data Set 3: Olympic Injuries

http://www.klokavskade.no/upload/Publication/Engebretsen_2010_BJSM_Sports%20injuries%20and%20illnesses%20during%20the%20Winter%20Olympic%20Games%202010.pdf

The British Journal of Sports Medicine published an article following the 2010 Winter Olympics detailing the injuries and illnesses incurred by athletes during the games. The data comes from the head physicians of the 82 National Olympic Committees, who were asked to report daily occurrences, as well as from medical centers in the Vancouver and Whistler clinics. I found this data interesting because of its timeliness, with the 2014 Winter Olympics beginning this week. Leading up to and during the Olympics, the winter version in particular, the media calls attention to the athletes who cannot return to the games due to injury as well as those who are injured during qualifying rounds, etc. If I delve further into this data set, I would compare it to the article published by this journal in 2012, following the Summer Olympics, and compare the number of injuries during the latest summer and winter games to see which is more dangerous, and specifically which sports produce the most injuries and to which body parts.

Homework Week 2 (Due Feb 14)

Pitches for your first story are due next week, and you have two spreadsheet exercises to power through. And everyone needs to sign up for a Festival of Data slot.

Answer a handful of specific questions using data that Slate published alongside “How Many People Have Been Killed by Guns Since Newtown?”, then download the CDC’s data on firearm deaths and find something to say about those numbers.

Pivot Tables

Spreadsheets are handy, but pivot tables are incredibly useful.

Wells by County

The Department of Environmental Conservation publishes data on gas wells in New York State. Download it: How many wells are there per county?

  1. Start with Data > Pivot Table Report — look at the cells Excel proposes to use. Does that include all of your data?
  2. Add Row — Use “COUNTY” for the rows. You should see a list of county names.
  3. Add Value — Use “API_WELLNO” for now.
  4. Check the formula — should Excel count values or sum them? Or find an average?

And there you have it. More things to play with:

  • Try adding “SLANT” as a Column — horizontal (as opposed to vertical) wells are particularly controversial. Are there any concentrations of horizontal wells?
  • How would you work out how much money each county is collecting in permit fees?
  • Can you see any trends in the average permit fee in each county?
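If it helps to see what the pivot table is doing under the hood, here is a rough Python sketch of the same idea: rows are “COUNTY,” the value is a count of “API_WELLNO,” and “SLANT” becomes the columns. The sample rows are made up for illustration, and the function names are my own; the real DEC download has many more rows and columns.

```python
from collections import Counter, defaultdict

# Made-up sample rows for illustration only; the real DEC export
# has many more columns and rows.
wells = [
    {"API_WELLNO": "0001", "COUNTY": "Allegany", "SLANT": "Vertical"},
    {"API_WELLNO": "0002", "COUNTY": "Allegany", "SLANT": "Horizontal"},
    {"API_WELLNO": "0003", "COUNTY": "Chautauqua", "SLANT": "Vertical"},
    {"API_WELLNO": "0004", "COUNTY": "Chautauqua", "SLANT": "Vertical"},
    {"API_WELLNO": "0005", "COUNTY": "Chautauqua", "SLANT": "Vertical"},
]


def wells_per_county(rows):
    """Rows = COUNTY, value = COUNT of API_WELLNO (a count, not a sum)."""
    return Counter(row["COUNTY"] for row in rows if row["API_WELLNO"])


def wells_by_county_and_slant(rows):
    """Rows = COUNTY, columns = SLANT, value = number of wells."""
    table = defaultdict(Counter)
    for row in rows:
        table[row["COUNTY"]][row["SLANT"]] += 1
    return table


per_county = wells_per_county(wells)
by_slant = wells_by_county_and_slant(wells)

for county in sorted(per_county):
    print(county, per_county[county], dict(by_slant[county]))
```

Note that the well number is counted, not summed: adding up ID numbers would give you a meaningless total, which is why step 4 asks you to check the formula.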

Coalition Casualties

Last semester, Matt Surrusco found iCasualties.org. The NYT has a nice profile of Michael White, who trawls through news sites and official releases to build out a database of coalition forces deaths. Start with http://icasualties.org/OEF/OEF_US_Fatalities.xls and pivot by “Country of Death” and “Place of Death.”

This data needs some cleanup — we’ll work on that next week.

We used a function, =YEAR(), to find the year of each death. We also had to do Format > Cells … and select General to correct wonky display issues. If you right-click in your pivot table, you’ll see a “Refresh Data” option, which you might need if your year column is not showing up.

Some questions you could answer:

  1. What’s the most common age of death?
  2. How many deaths, and at what age?
  3. What’s the most common age of death for members of the CIA? the Army?
  4. What rank and branch had the greatest number of casualties?
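For comparison, here is a minimal Python sketch of two of the steps above: pulling the year out of a date (what =YEAR() does) and finding the most common age, overall or for one branch. The rows and the column names “Date,” “Age” and “Branch” are assumptions made up for illustration; check the real spreadsheet’s headers and date format before reusing any of this.

```python
from collections import Counter
from datetime import datetime

# Made-up rows for illustration; the column names are assumptions,
# not the real iCasualties headers.
deaths = [
    {"Date": "2010-06-07", "Age": 21, "Branch": "Army"},
    {"Date": "2010-09-18", "Age": 24, "Branch": "Army"},
    {"Date": "2011-03-02", "Age": 21, "Branch": "Army"},
    {"Date": "2011-08-06", "Age": 30, "Branch": "CIA"},
    {"Date": "2012-01-15", "Age": 22, "Branch": "CIA"},
]


def year_of(date_text):
    """Equivalent of =YEAR(): parse the date, keep only the year."""
    return datetime.strptime(date_text, "%Y-%m-%d").year


def deaths_per_year(rows):
    """Pivot with the year as rows and a count as the value."""
    return Counter(year_of(row["Date"]) for row in rows)


def most_common_age(rows, branch=None):
    """Most frequent age, optionally limited to one branch."""
    ages = [r["Age"] for r in rows if branch is None or r["Branch"] == branch]
    return Counter(ages).most_common(1)[0][0]


by_year = deaths_per_year(deaths)
overall_age = most_common_age(deaths)
army_age = most_common_age(deaths, branch="Army")
print(dict(by_year), overall_age, army_age)
```

If two ages tie for most common, this sketch returns whichever appears first in the data, so with real data it is worth looking at the full counts rather than only the top one.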

Asking Good Questions

If you challenge yourself (and you should challenge yourself), you’re bound to get stuck. If you aren’t hitting walls and getting stuck, you aren’t trying hard enough. Technology is changing constantly, so learning how to ask the right questions and get help with new tools is probably more important than actually learning how to use any one tool well today.

Spreadsheet Walkthrough

Spreadsheet Skills

This is not quite what we did in class, but close.

Google tracks searches for flu-related terms. Start at http://www.google.org/flutrends/ — it is worth reading up on how they produce this data so you have a sense of the limitations of it, but we’re just going to play with it.

Using formulas

Pay attention to the screen. Look at what happens when you hover, etc.

Review of Spreadsheeting skills with Flu data
-sorting to find max and min
-data types (text, number, location, date, etc.)
-what is a formula and a function, what’s the difference? choosing cells

-use a function to find the mean, median and range: look at how mean and median differ.

-using functions, Max, Min, Average, Median, Unique, Countif, Match, If
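To see why it is worth looking at how mean and median differ, here is a small Python sketch using made-up weekly search counts (not real Flu Trends numbers). One unusually high week drags the mean up while the median barely notices.

```python
import statistics

# Made-up weekly counts for illustration; one week is an outlier.
weekly_searches = [100, 120, 130, 150, 900]

mean_value = statistics.mean(weekly_searches)      # like =AVERAGE()
median_value = statistics.median(weekly_searches)  # like =MEDIAN()
value_range = max(weekly_searches) - min(weekly_searches)  # =MAX() - =MIN()

print(mean_value, median_value, value_range)
```

Here the mean is more than double the median, which is a hint that a single spike is doing most of the work.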

Walk Through

  1. Download the world historical flu trends http://www.google.org/flutrends/data.txt
  2. What is this data? (comma separated)
  3. Paste into spreadsheet? Use Data > Text to Columns to separate data into columns according to a delimiter
  4. In which week did which country have the most flu searches?
    =Max()
    =Match(criterion, range, 0)
    =Indirect("A"&cell) to get date or re-order columns
  5. How much more did that country search for flu in that week than average?
  6. Order the countries by most flu searches (SUM…choose arbitrary 2012-13 to capture searches from all countries, Transpose countries-values to make a quick bar chart)
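The walk-through above can also be sketched in Python, which may help if you want to double-check your spreadsheet answers. This is only a rough sketch under some assumptions: the sample text is made up, it assumes a “Date” column followed by one column per country, and the real file may have explanatory lines above the header that you would need to skip first.

```python
import csv
import io

# Made-up sample in the assumed layout: a Date column, then countries.
sample = """Date,Argentina,Australia,Austria
2012-12-30,100,50,300
2013-01-06,120,60,450
2013-01-13,90,,400
"""


def read_rows(text):
    """Split comma-separated text into rows (like Text to Columns)."""
    return list(csv.DictReader(io.StringIO(text)))


def peak(rows):
    """Largest single value with its week and country (MAX plus MATCH)."""
    best = None
    for row in rows:
        for country, value in row.items():
            if country == "Date" or not value:
                continue  # skip the date column and empty cells
            if best is None or int(value) > best[0]:
                best = (int(value), row["Date"], country)
    return best


def country_values(rows, country):
    """All non-empty values for one country."""
    return [int(row[country]) for row in rows if row[country]]


def countries_by_total(rows):
    """Countries ordered from most to fewest searches (SUM, then sort)."""
    countries = [name for name in rows[0] if name != "Date"]
    totals = {c: sum(country_values(rows, c)) for c in countries}
    return sorted(totals, key=totals.get, reverse=True)


rows = read_rows(sample)
top = peak(rows)
top_values = country_values(rows, top[2])
above_average = top[0] - sum(top_values) / len(top_values)
ranking = countries_by_total(rows)
print(top, round(above_average, 1), ranking)
```

Empty cells are skipped rather than treated as zero, since a blank in this kind of data more likely means “no data yet” than “no searches.”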

Homework (Due Feb 7-14)

Homework Week 1 (Due Feb 7)

Since we got a late start and you don’t have a full week, I spread out the homework some.

Send to me by 5 PM on Thursday:

URLs for three data sets that interest you. Use the subject “Homework Week 1” and I’ll definitely see it.

By 9:30 AM Friday

Install Tabula. If you get an error like “Tableau is damaged and can’t be opened. You should move it to the Trash,” the solution is not at all intuitive: you have to change your Privacy and Security settings to allow applications downloaded from “Anywhere” (it’s on the “General” tab).

Read Cairo: The Functional Art, Reading part 1: pages 25-31, 36-44, on thinking through a visualization as a tool for the reader; what graphical form best serves the goal? On e-reserve (access details on the syllabus)

Skim http://perceptualedge.com/articles/ie/the_right_graph.pdf and http://www.jiscinfonet.ac.uk/infokits/data-visualisation/type-of-charts/

Due 5 PM Monday:

Write a short blog post that describes the provenance of each of your three data sets (who maintains it?) and where the data can be found (include a link), and in fewer than 200 words each, explain why the data is interesting.

Due Feb 14:

Register for a Magellan account on CartoDB (use http://cartodb.com/academic to get the discount)

Make sure Firefox is installed on your computer with the Web Developer Toolbar extension.

Begin a scrapbook on WordPress, Tumblr, Pinterest or some other aggregation service. Send me the URL.