UP504 • How to make good tables and graphs, and to make the most of computer softwarelast updated: Monday, February 4, 2008 0:00 AM |
Dates: February 6, 11 |
sections of this page: Conceptual level Graphing in the Digital Age Tufte's principles Practical issues Guidelines for good graphics Problems with percentages and growth rates |
Examples of excel files: tv show preferences world cities radio allocation chart (pdf) |
Readings:
Tufte, Edward R. 1997. Visual & Statistical Thinking : Displays of Evidence for Decision Making: Graphics Press.
Myers, Dowell. "Ch 5: Strategies of Presentation," in Analysis with Local Census Data. New York, NY: Academic Press, Inc. 1992, pp. 97-125. [electronic reserves]
see also:
The Economist, "Worth a thousand words: A good graphic can tell a story, bring a lump to the throat, even change policies. Here are three of history's best." Dec 19th 2007
From print edition
Tufte, Edward. 1983. The Visual Display of Quantitative Information. Cheshire, Conn: Graphics Press. (Note: the ideas on this page are particularly influenced by Tufte's writings.)
Schmid, Calvin F. 1983. Statistical Graphics. New York: Wiley.
Tufte, Edward. 1990. Envisioning Information. Cheshire, Conn: Graphics Press.
also: look through periodicals and journals (e.g., the Journal of the American Planning Association for examples of good and not so good charts).
Click here
to download an example (an MS Word document) of a data table (and common mistakes
made):
Excel 2003 (Windows): Charts | Top tips for Excel: Charts and graphics
Mac OS X: Using Excel X | Choose the best graphics format for the job (including an overview of different graphics formats)
Excel (either Mac or Windows): combines spreadsheets and graphing
(one can use other applications, such as SPSS, SAS, etc.)
2. This leads to a paradoxical
quality of graphic thinking: on the one hand, a graph should be transparent
enough so that the observer sees data and not design (-> data variation not
design variation). YET, the form of the graphic itself shapes the structure
of perception: the assumption that there is a relationship between time and
a variable, or between different variables, or between space and time, etc.
(relate to paradigms).
4. Inductive vs. deductive:
how much to demonstrate a specific point with the graphic vs having the reader/viewer
draw their own conclusions and see their own patterns. NOTE: pure inductive
data presentation seems impossible: all graphics involve choices over what data
to present and not to present.
5. the connection between
good writing and good graphics:
b. show the ideas not the ink (writing, design)
c. be honest; don't confuse with overly complicated design or prose.
d. how about causality? (relational graphs suggest a causal relationship between variables; passive voice avoids the issue of causality -- active voice addresses it, even if inconclusively)
6. Compare:
information (data) to communication (presentation).
The first is latent, the latter is actual. In
planning theory, there is a shift from the former to the latter (e.g., communicative-based
action).
100110100001010101011110101001000010101010010011010
110101000010110100101010101000100101010101001010101 010011101010101010101001010101010010101101010101010 101111110000101010011010101001001111111010011001001 111100101010101010101010100010101010101010101010011 001101000101100110100000010010110101001001010100101 |
In this age of digital, the issue of information content is often
seen as data storage.
That is, we emphasize the amount of digital space needed for data
storage: e.g., this Netscape Communicator file (consisting
mostly of text) file is about 48,000 bytes. (Had there been more visuals,
the file size would be MUCH larger.)
1 byte = 8 bits.
a byte
is a group of eight binary digits that can represent an alphanumeric character.
with 8 bits, one can represent 256 distinct combinations of eight ordered
bits (28
=256)
kilobytes (thousand bytes), megabytes (million), gigabytes (billion), terabytes (trillion), etc.
pixel = pix (plural for pic or picture)
+ element
the small discrete elements that make up an
image
"a picture tells a 1000 words."
But there is a difference between data storage vs. effective content
A photo in digital form (a 4*6 inch photo scanned on a scanner at 250 dpi -- dots per inch) may require 6 megabytes of storage, which is 6,000,000 bytes or 1,536,000,000 bits (that is, over 1 billion sets of 0/1 binary bits of data to represent a simple snapshot -- and still at a lower visual quality than the standard drugstore photo print.) Typical digital cameras (as of early 2002) record images 1-2 MB, while the better ones have 4-5 MB. Standard 35mm slide film is generally still more detailed (but digital is catching up).
Therefore: a picture may tell a thousand
words, but require 6 million bytes (6 megabytes) to be stored digitally.
1000 words may require just about 6,000 bytes to store.
In other words, one digital picture requires
as much storage space as 1000 words * 1000 = 1,000,000 words (which is equal
to about 10 books!)
Another example: a color pie chart, generated
by Excel, depicting the percentage of men vs. women in planning, contains just
a single data point. Yet the pie chart image itself, stored digitally,
might require 6,000 bytes, which is 1,536,000 bits (1 or 0 elements).
This explosion in memory has allowed for a far greater gap between data storage size and effective content. One might not worry about this, since memory is so cheap and abundant. But it has arguably led to a cluttered computer screen, a loss of the programmer's former elegant parsimonious use of memory, and an emphasis on facade more than on content and communication.
Why the discrepancy between data storage size and effective content?
So why digitize images if they are so data
intensive and of lower visual quality?
This is the digital age: allows for images
to be standardized, manipulated, and transmitted in ways traditional images
cannot. That text, data, graphs, photographs, drawings, sound, etc., can
all be stored and transmitted in a single, standardized format (e.g., CD-rom,
modem lines, etc.)
An Example:
An 8x10 inch color photograph made from a 35 mm negative (traditional silver-based film processed in a darkroom) | a digital image (e.g., taken with a digital camera; or a scanned photograph; or a scanned slide transparency) | |
Storage | image can be stored as a negative film strip or as
a print this "storage" is an inexpensive technology |
stored digitally. thus is treated the same as text, sound, etc. (e.g., ISDN). with high quality images, a high data storage requirement needed. |
Image quality | the image quality is potentially quite high (depending
on the quality of the camera optics, the film, the paper and the processing.)
easy to increase or decrease the size of the image (through magnification of the enlarger image) |
image quality is not as high, though getting better |
Modification of the single image | hard to modify the image (except through "dodging" and other darkroom techniques) | much easier and with far more possibilities (e.g., with Photoshop software). |
combination of multiple images | not easy: either through double exposure techniques or collage cut-and-paste. | much easier and with far more possibilities |
transference of image Copying Image |
the photo can be mailed the photo can be sent by wire or fax after first converted to dots. (with loss of quality) each subsequent copy leads to a reduction in quality from the original |
quite easy (as easy as any other form of digital information)
one can make an identical copy |
2. then perform basic arithmetic (sums, averages, etc.)
3. then to show univariate patterns in the data
4. then to reveal patterns between two or more variables (e.g., correlation) -- and to show that these relationships are statistically significant (that is, the patterns in the sample data reflect patterns in the population as a whole).
5. then to understand causal relationships
6. to recognize the difference between relationships that can be changed and those that can't (policy evaluation)
7. Finally, to relate to the larger context of the world outside the data set.
Relate to Kant time and space as categories of
the mind: the first way we classify sensation. (as paraphrased by Durant):
1. "show the data" [p. 13]
2. "induce the viewer think about substance rather than about methodology, graphic design, technology of graphic production, or something else" [p. 13] (i.e., transparent and revealing)
3. "avoid distorting" [p. 13]
4. "present many numbers in a small space." [p. 13] [data density]
5. "make large data sets coherent" [p. 13] [communication, not just information]
6. "encourage the eye to compare different pieces of data" [p. 13]
7. "reveal the data at different levels: from a broad overview to the fine structure" [p. 13]
8. "serve a reasonably clear function:
description, exploration, tabulation, or decoration." [p.
13]
a table?
a chart?
text?
a photo or slide? (and digital or traditional silver-based film?)
a map? (and a hand-drawn paper map, a vector-based (polygons) GIS map, or a raster-based (grid) GIS map?)
a drawing?
a site plan?
? All are forms
of representation, with advantages and drawbacks. Don't automatically
graph everything: a shortcoming of EXCEL and Lotus: the ease to graph. Create
a graph because it communicates something substantial and meaningful that the
other formats cannot.
GOAL: give the viewer the greatest number
of ideas, in the shortest time, with the least amount of ink, in the smallest
space.
graph: lots of data, to be compared, multivariate; little text/labels.
tables: small, non comparative, highly
labeled data sets, often univariate.
one rule of thumb: what is the Tufte information/ink
ratio for the two approaches? which one is less?
2. what kind of chart? pie,
bar, column, scatter, line, etc.
varies by the number of variables and cases, the
amount of labels, the continuity or discontinuity of data over time, etc.
3. dimensions of data
vs. dimensions of graphs (general rule: don't have
more visual dimensions than information dimensions. i.e., avoid 3-d graphs).
4. complexity vs. simplicity: how much information does the graph include? how much does the reader readily pick up? What is just chart-junk? (This is Tufte's INK/INFORMATION RATIO) or better:
0 -
20 % |
20 -
40 % |
40 -
60 % |
60 -
80 % |
80 -
100 % |
works better than ...
0 -
20 % |
20 -
40 % |
40 -
60 % |
60 -
80 % |
80 -
100 % |
or at least use brightness within a color
0 -
20 % |
20 -
40 % |
40 -
60 % |
60 -
80 % |
80 -
100 % |
Why? since brightness has an order, but color does not (or at least color has multiple dimensions, which can be confusing)
6. close and far: the first overall look and the
second in depth look (graphs should encourage both)
7. Data density: the eye can pick up fine details; most graphs waste this ability to process fine details. (because they often have so little information in them.) e.g., a bar chart of 3 cases; 1 variable. low density of data there. (and why have a chart as all? for decoration and emphasis?). TUFTE is interested more in representing complex, relational data). Remember: graphics can be shrunk way down in size, and the eye can still comprehend.
Low density: can be well less than 1 data entry/square inch. Or as high as 100- 1000s/square inch). Maps can handle higher density, since the reader can arguably (1) easily relate spatial data side-by-side, and (2) it requires little labeling, since one assumes that the reader can interpret a map without labels. (This may be a potential virtue of GIS: geo-coded and spatially displayed data.)
Compare the data density of the following map and this pie chart:
source: http://www.census.gov/geo/www/mapGallery/images/2k_night.jpg
8. Compare to photographs and drawings. Does a
picture tell a 1000 words? or video? when are these effective compared to text
and tables? (especially in this interchangeable world of ISDN?)
9. Know the difference between: unit of
analysis, case, variable, value (of a variable)
a unit of analysis, | a case | a variable, | a value (of a variable for a specific case) |
city | e.g., Los Angeles | e.g., the unemployment rate | e.g., 5.2% |
11. Finally, the current challenge to get computer
software to follow the rules of Tufte. Sometimes you may need to import your
half-finished graph into a paint or draw program. And: there is nothing wrong
with hand-drawn visuals!
(based on reading student assignments from past years)
1. Be sure to use a full title for the graphic
(variables, dates, locations, units of analysis). I.e., rather than "Crime and
Infant Mortality," use "Crime Rate per 100,000 Population (1991) and Infant
Death Rate per 1,000 Live Births (1988) in the Largest 40 U.S. Cities". If you
choose to use a shorter title, be sure that somewhere the variables are fully
defined.
2. List the source of the data (just as you would
for a data table.). Anticipate that some readers may simply photocopy your chart
rather than your whole article or dissertation; the graph should be somewhat
self-standing. (Include a descriptive caption at the bottom if useful).
3. Explain and label missing data. Be sure that
the reader knows the difference between a missing value and a zero-value (if
you are not careful, statistical software will treat these two as the same).
4. Order the chart in some useful way. And if the chart has an ordering to it, be sure to state this (e.g., cities ranked by population size).
alphabetical is not always the best:
try instead ordering based on some relevant variable (here simply the variable displayed):
5. If you use a subset of the cases, be sure to
explain the logic of the selection (e.g., among the 10 largest U.S. Cities).
6. Label the x and y axes.
7. Use a legend or labels to define variables
in a multivariate bar or column chart. You do not need a legend for a
univariate chart.
8. Often an x-y scatterplot is preferable to a bar chart (or column chart) with two variables. Scatterplots use less ink, and they usually reveal bivariate relationships (i.e., the relationship between x and y) far better than bar or column charts.
Here is the same bivariate data displayed two ways:
9. It is fine to do a regression analysis, but
be sure to explain your results.
10. Do not add the Hispanic population
with other racial categories (black, Asian, etc.), since the U.S. Census states
that "persons of Hispanic origin may be of any race."
11. Avoid 3-dimensional graphs unless the data itself is 3-dimensional. Even then, 3-d is hard to read.
|
13. Avoid column charts with too many data points: the columns become too narrow (and the labels too small or some not showing) to read easily. (This also applies to bar charts). This problem literally multiplies with multiple variables displayed on one chart. Above about 10-15 data points (e.g., columns), I would consider an alternative format (such as scatterplot, a table, grouping data, etc.). Or use several charts, side-by-side, with the same format (e.g., one for each variable). [see Tufte on the use of "small multiples"]. an example of a problematic chart below:
Note how it is really hard to see patterns in the data (with 3 variables and 16 cases). The gray background is distracting too. Best to avoid this type of chart. (Remember: just because Excel can create a chart from your data doesn't mean that it is necessarily a good format for the data.)
14. Overall, show the data; have the view think
the patterns in the data, not the graphic design; avoid distortion; encourage
the eye to compare data; clearly label the graph.
1. how to determine the denominator: think of a survey result: what to do with nonresponses, etc.
2. also: "the percentage effect": a percentage
may go down, when the absolute goes up. How do we interpret? (GIve an example).
Well, it depends on whether the actual theory of phenomenon is better explained
by absolute or percentage.
|
2,712,190
|
|
3,226,049
|
|
3,734,258
|
|
3,804,048
|
|
3,879,409
|
|
4,024,286
|
|
4,332,834
|
= +2.0% / year
= +1.6% / year
= +1.56% / year
-- e.g., bank interest, rabbits reproducing, etc.