When your dataset is represented as a table or a database, it’s
difficult to observe much about it beyond its size and the types of
variables it contains. In this tutorial, you’ll learn how to use
graphical and numerical techniques to begin uncovering the structure of
your data. Which variables suggest interesting relationships? Which
observations are unusual? By the end of the tutorial, you’ll be able to
answer these questions and more, while generating graphics that are both
insightful and beautiful.
- Visualize categorical and numerical data using appropriate
- Create graphical representations of multiple variables.
- Describe the structure revealed by graphics in the language of
- Use statistics to summarize important aspects of data.
- Identify unusual observations.
Create graphical and numerical summaries of one and two categorical
variables, primarily using two R packages: ggplot2 and dplyr.
- Graphical representation of two categorical variables
- Side-by-side bar charts
- Stacked bar charts
- To normalize or not to normalize
- Tabular representation of two categorical variables
- Computation of margins
- Counts vs proportions
- Law of total probability
- Graphical representation of one categorical variable
- Marginal vs conditional
- Bar chart
- Data integrity check for levels
- Pie chart
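
As a rough illustration of the topics above, the following sketch uses ggplot2 and dplyr to chart and tabulate two categorical variables. The choice of the `mpg` dataset (bundled with ggplot2) and of its `class` and `drv` columns is an assumption made for this example only; the tutorial itself may use different data.

```r
library(ggplot2)
library(dplyr)

# Assumption: mpg (shipped with ggplot2) stands in for the tutorial's data.
# class = vehicle class, drv = drive train; both are categorical.

# Side-by-side bar chart of counts
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

# Stacked bar chart of counts (the default position for geom_bar)
p_stack <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# Normalized stacked bar chart: proportion of each drv within each class
p_fill <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  ylab("proportion")

# Tabular representation: counts, margins, and conditional proportions
tab <- table(mpg$class, mpg$drv)
tab_with_margins <- addmargins(tab)         # adds row and column totals
cond_props <- prop.table(tab, margin = 1)   # P(drv | class); rows sum to 1

# Law of total probability: P(drv) = sum over class of P(drv | class) * P(class)
p_class <- prop.table(margin.table(tab, 1))
p_drv <- prop.table(margin.table(tab, 2))
p_drv_rebuilt <- colSums(cond_props * as.vector(p_class))

# The same counts with dplyr
counts <- mpg %>% count(class, drv)

# One categorical variable: check the levels, then draw a bar chart
class_levels <- levels(factor(mpg$class))
p_bar <- ggplot(mpg, aes(x = class)) + geom_bar()
```

Printing any of the plot objects (for example `p_fill`) draws the chart. Comparing `p_stack` with `p_fill` shows what normalizing changes: the first emphasizes counts, the second conditional proportions.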
Learn to summarize numerical data graphically, on its own and
conditioned on a categorical variable.
- Graphical representation of one categorical and one numerical
- Side-by-side boxplots
- Faceted histograms
- Colored density curves
- Graphical representation of one numerical variable
- Marginal vs conditional
- Density plot
- Outlier detection
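
The plot types listed above can be sketched with ggplot2 as follows. The `mpg` dataset and its `hwy` (highway mileage), `class`, and `drv` columns are assumptions chosen for illustration, and the 1.5 × IQR fences used to flag outliers are one common rule of thumb, not necessarily the method the tutorial teaches.

```r
library(ggplot2)

# Assumption: mpg (shipped with ggplot2) stands in for the tutorial's data.

# Side-by-side boxplots: one numerical variable split by a categorical one
p_box <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Faceted histograms: one panel per level of drv
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ drv)

# Colored density curves, overlaid with transparency
p_dens <- ggplot(mpg, aes(x = hwy, fill = drv)) +
  geom_density(alpha = 0.3)

# Marginal density of the single numerical variable
p_marginal <- ggplot(mpg, aes(x = hwy)) +
  geom_density()

# Outlier detection with the 1.5 * IQR rule of thumb
x <- mpg$hwy
q <- quantile(x, c(0.25, 0.75))
lower_fence <- q[[1]] - 1.5 * IQR(x)
upper_fence <- q[[2]] + 1.5 * IQR(x)
outliers <- mpg[x < lower_fence | x > upper_fence, ]
```

Inspecting `outliers` alongside `p_box` is a quick way to see which rows sit far from the bulk of the data.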
Statistics for describing distributions of data.
- Center: mean, median, mode
- Shape: skewness, modality
- Spread: range, IQR, SD, variance
- Unusual observations
- Transformations: logarithm and square root to reduce skew in graphics and
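
A minimal sketch of these summaries with dplyr is shown below. It assumes the `diamonds` dataset bundled with ggplot2, whose `price` column is strongly right-skewed; the dataset and column names are illustrative choices, not taken from the tutorial.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset and the plots below

# Assumption: diamonds stands in for the tutorial's data.

# Center and spread of price within each level of cut
stats_by_cut <- diamonds %>%
  group_by(cut) %>%
  summarize(
    mean_price   = mean(price),
    median_price = median(price),
    sd_price     = sd(price),
    var_price    = var(price),
    iqr_price    = IQR(price),
    range_price  = max(price) - min(price)
  )

# Right skew pulls the mean above the median
overall_mean <- mean(diamonds$price)
overall_median <- median(diamonds$price)

# A log transformation reduces the skew in the plotted distribution
p_raw <- ggplot(diamonds, aes(x = price)) + geom_density()
p_log <- ggplot(diamonds, aes(x = log(price))) + geom_density()
```

Comparing `p_raw` with `p_log` shows how the transformation spreads out the crowded low-price region and compresses the long right tail.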
Apply what you’ve learned to explore and summarize a real-world
dataset in this case study of email spam.
- Unwin, Anthony. Graphical Data Analysis with R.
- Velleman, Paul and Hoaglin, David. Exploratory Data
Andrew Bray is an Assistant Professor of Statistics at Reed College
and lover of all things statistics and R.