Data Visualization in RStudio with ggplot2

Data visualization is one of the most effective ways to explore data. R offers many tools for visualization, and among them the ggplot2 package is one of the most powerful and flexible. Based on the Grammar of Graphics, ggplot2 lets you build plots layer by layer.

In this blog, we’ll explore the iris dataset (built into R) and create different types of plots. For each, we’ll cover the description, the code explanation, and the interpretation.

Description of the Iris Dataset

The iris dataset is one of the most famous built-in datasets in R, often used for learning data analysis and visualization. It contains 150 observations of iris flowers, divided equally into three species: Setosa, Versicolor, and Virginica (50 flowers each).

The dataset has four numeric variables that measure flower dimensions:

Sepal.Length: Length of the sepal (in cm)

Sepal.Width: Width of the sepal (in cm)

Petal.Length: Length of the petal (in cm)

Petal.Width: Width of the petal (in cm)

Additionally, it has a categorical variable:

Species: Type of iris flower (Setosa, Versicolor, Virginica)
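
The structure described above can be checked directly in the R console. This is a minimal sketch using only base R functions; no extra packages are assumed.

```r
# iris ships with base R (datasets package), so nothing needs to be installed
data(iris)

# Structure: 150 rows, four numeric columns, and one factor column (Species)
str(iris)

# First few rows of the data
head(iris)

# Number of flowers recorded for each species
table(iris$Species)
```

Note that the species labels are stored in lowercase in the data (setosa, versicolor, virginica), so any filtering code needs to use that spelling.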

1. What is a Scatter Plot in R?

A scatter plot is a graph used to show the relationship between two continuous variables. In the iris dataset, we are using Sepal Length (x-axis) and Sepal Width (y-axis) to understand how these measurements vary across different flower species.

Code for Scatter Plot in R:
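A minimal sketch of the scatter plot code, put together from the code explanation in this section. It assumes the ggplot2 package is installed; the title and axis label text are illustrative choices, not taken from the original listing.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
library(ggplot2)

# Map sepal length and width to the axes and species to point color
scatter_plot <- ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
  geom_point(size = 3) +
  labs(
    title = "Scatter Plot of Sepal Length vs Sepal Width",  # illustrative title
    x = "Sepal Length (cm)",
    y = "Sepal Width (cm)"
  )

# Draw the plot (in RStudio it appears in the Plots pane)
print(scatter_plot)
```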
Code Explanation:

1. aes(x = Sepal.Length, y = Sepal.Width, color = Species) maps sepal length and width, with species as color.

2. geom_point(size = 3) plots the points.

3. labs() adds a title and axis labels.

Figure: Creating a Scatter Plot in RStudio (Sepal Length vs. Sepal Width, colored by Species)

How to Write an Interpretation of a Scatter plot:

From the above scatter plot, we can see that Setosa flowers generally have wider sepals compared to Versicolor and Virginica. The points for Setosa form a clear cluster, making this species easily distinguishable. On the other hand, Versicolor and Virginica overlap more in their sepal dimensions, which makes it harder to separate them based only on sepal size.

2. What is a Bar Plot in R?

A bar plot is a chart used to show the frequency or count of observations across different categories. In the iris dataset, we are using the Species variable on the x-axis to count how many flowers belong to each species group (Setosa, Versicolor, Virginica).

Code for Bar Plot in R:
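A minimal sketch of the bar plot code, based on the code explanation in this section. It assumes ggplot2 is installed; the title and label text are illustrative.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
library(ggplot2)

# geom_bar() counts the rows in each Species group by default
bar_plot <- ggplot(iris, aes(x = Species)) +
  geom_bar() +
  labs(
    title = "Number of Flowers per Species",  # illustrative title
    x = "Species",
    y = "Count"
  )

print(bar_plot)
```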
Code Explanation:

1. aes(x = Species) treats species as the categorical variable.

2. geom_bar() counts the number of flowers in each species group.

3. labs() adds a title and axis labels.

Figure: Creating a Bar Plot in RStudio (count of flowers per Species)

How to Write an Interpretation of a Bar plot:

From the above bar plot, we can see that the iris dataset contains an equal number of flowers from each species: Setosa, Versicolor, and Virginica. Each bar reaches the same height, indicating that there are 50 samples for every species. This balanced distribution makes the dataset well-suited for comparing measurements across species without bias toward any single group.

3. What is a Box Plot in R?

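A box plot (also called a box-and-whisker plot) is a graphical representation used to display the distribution of continuous data and identify patterns such as spread, central tendency, and potential outliers. It summarizes the data using five key values: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. Note that ggplot2 draws each whisker only as far as the most extreme point within 1.5 times the interquartile range of the box, and plots anything beyond that as a separate outlier point.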

In the iris dataset, we can create a box plot of Petal Length grouped by Species to compare how the lengths vary across Setosa, Versicolor, and Virginica.
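
The five summary values can also be computed numerically for each species. This is a small sketch in base R; the variable name five_num is just an illustrative choice.

```r
# Split Petal.Length by species and compute the quartiles of each group.
# quantile() returns the 0%, 25%, 50%, 75%, and 100% points by default,
# which correspond to minimum, Q1, median, Q3, and maximum.
five_num <- sapply(split(iris$Petal.Length, iris$Species), quantile)

# One column per species, one row per summary value
print(five_num)
```

Comparing these numbers with the plot helps when writing the interpretation, since the box edges and the median line correspond to the 25%, 50%, and 75% rows.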

Code for Box Plot in R:
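A minimal sketch of the box plot code, based on the code explanation in this section. It assumes ggplot2 is installed; the title and label text are illustrative.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
library(ggplot2)

# One box per species, summarizing the distribution of petal length
box_plot <- ggplot(iris, aes(x = Species, y = Petal.Length)) +
  geom_boxplot() +
  labs(
    title = "Box Plot of Petal Length by Species",  # illustrative title
    x = "Species",
    y = "Petal Length (cm)"
  )

print(box_plot)
```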

Code Explanation:

i. aes(x = Species, y = Petal.Length) compares petal length across species.

ii. geom_boxplot() draws the boxplots.

iii. labs() adds a title and axis labels.

Figure: Creating a Box Plot in RStudio (Petal Length by Species)

How to Write an Interpretation of a Box plot:

When interpreting a box plot, we focus on what the plot shows:

1. Median (the bold line in the box): tells us the central value of the data.

2. Interquartile Range (the box itself): shows where the middle 50% of the data lies.

3. Whiskers (lines extending from the box): show the spread of the rest of the data, excluding any outliers.

4. Outliers (points outside the whiskers): indicate unusual values.

5. Comparison across groups: here, the three species (Setosa, Versicolor, Virginica).

Based on the above box plot of Petal Length by Species, here’s how we interpret it:

Median: Setosa has the smallest median petal length, Versicolor is in the middle, and Virginica has the largest median. This shows a clear increasing trend across the three species.

Interquartile Range (IQR): Setosa has a very narrow box, meaning its petal lengths are very consistent with little variation. Versicolor and Virginica have wider boxes, indicating more variation within those species.

Whiskers: The whiskers for Versicolor and Virginica extend further, showing a greater spread in petal lengths compared to Setosa.

Outliers: A few outliers appear for Setosa and Versicolor, but overall, they don’t affect the clear separation between species.

Comparison across groups: Petal length is a strong distinguishing factor. Setosa stands out clearly due to its short and consistent petals, while Versicolor and Virginica have longer petals that overlap only slightly, making them largely separable.

4. What is a Histogram in R?

A histogram is a chart used to display the distribution of a continuous variable by dividing the data into intervals, called bins, and showing how many observations fall into each bin. In the iris dataset, we are using the Petal Width variable on the x-axis to see how the petal widths of flowers are distributed across all species.

Code for Faceted Histogram (separate histogram for each species) in R:
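A minimal sketch of the faceted histogram code, assembled from the four calls listed in the code explanation in this section. It assumes ggplot2 is installed.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
library(ggplot2)

# Histogram of petal width, one panel per species
hist_plot <- ggplot(iris, aes(x = Petal.Width, fill = Species)) +
  geom_histogram(binwidth = 0.2, color = "black") +  # 0.2-wide bins, black outlines
  facet_wrap(~Species) +                             # separate panel for each species
  labs(
    title = "Histogram of Petal Width for Each Species",
    x = "Petal Width",
    y = "Frequency"
  )

print(hist_plot)
```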
Code Explanation (Faceted Histogram):

1. aes(x = Petal.Width, fill = Species)
Plots Petal Width on the x-axis and uses Species to fill the bars with different colors for each flower type.

2. geom_histogram(binwidth = 0.2, color = "black")
Creates a histogram. binwidth = 0.2 means each bar represents a range of 0.2 units of Petal Width.
The outline of each bar is colored black for better visibility.

3. facet_wrap(~Species)
Splits the histogram into separate panels for each species (Setosa, Versicolor, Virginica), making it easier to compare distributions side by side.

4. labs(title = "Histogram of Petal Width for Each Species", x = "Petal Width", y = "Frequency")
Adds a descriptive title to the plot and labels for the x-axis and y-axis.

Figure: Creating a Separate Histogram for each Species in RStudio (Petal Width)

How to Write an Interpretation of a Histogram:

From the above histograms of Petal Width by species, we can see clear differences among the three flower groups.

Setosa: The petal widths are very small, clustered tightly between 0.1 and 0.6, showing very little variation.

Versicolor: The petal widths are more spread out, mostly between 1.0 and 1.8, indicating moderate variation.

Virginica: This species has the largest petal widths, ranging mostly from 1.8 to 2.5, with values shifted toward the higher end.

These distinct distributions make Petal Width a very strong feature for separating Setosa from the other two species. Versicolor and Virginica show some overlap, but Virginica generally has wider petals than Versicolor.
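
The ranges quoted in this interpretation can be checked numerically. This is a small sketch in base R; the variable name width_range is an illustrative choice.

```r
# Minimum and maximum petal width within each species
width_range <- sapply(split(iris$Petal.Width, iris$Species), range)
rownames(width_range) <- c("min", "max")

# One column per species; overlapping columns indicate overlapping distributions
print(width_range)
```

Where the minimum of one species falls below the maximum of another, the two histograms overlap, which is what limits separation between Versicolor and Virginica.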

Using the iris dataset, we explored how to build different plots with ggplot2 in RStudio: scatter, bar, box, and histogram. Each visualization highlights unique aspects of the dataset and makes the patterns across species clear.

For a line plot, we use the built-in Nile dataset in R, which records the annual flow of the Nile River from 1871 to 1970.

5. What is a Line Plot in R?

A line plot in R is used to show how a variable changes over time or in sequence. It connects individual data points with lines, making it easy to identify trends, patterns, or fluctuations. In the case of the Nile dataset, the variable being plotted is the annual flow of the Nile River (in 10⁸ m³) from 1871 to 1970, which helps us observe long-term changes in river flow over the years.

Code for Line Plot in R:
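A minimal sketch of the line plot code, based on the code explanation in this section. It assumes ggplot2 is installed. The use of time() to recover the years, the data frame name nile_df, and the title and label text are assumptions made for illustration.

```r
# Assumes ggplot2 is installed: install.packages("ggplot2")
library(ggplot2)

# Nile is a time series (ts) object, so convert it to a data frame first.
# time(Nile) gives the year of each observation (assumed approach).
nile_df <- data.frame(
  Year = as.numeric(time(Nile)),
  Flow = as.numeric(Nile)
)

# Connect the yearly observations with a dark blue line
line_plot <- ggplot(nile_df, aes(x = Year, y = Flow)) +
  geom_line(color = "darkblue") +
  labs(
    title = "Annual Flow of the Nile River",  # illustrative title
    x = "Year",
    y = "Flow (10^8 cubic meters)"
  )

print(line_plot)
```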
Code Explanation:

i. data.frame(Year, Flow) converts the Nile time series into a data frame with Year and Flow columns.

ii. aes(x = Year, y = Flow) maps Year to the x-axis and Flow to the y-axis.

iii. geom_line(color = "darkblue") draws a dark blue line to show the trend of river flow.

iv. labs() adds a title and axis labels.

Figure: Creating a Line Plot in RStudio (annual Nile flow, 1871 to 1970)

How to Write an Interpretation of a Line Plot:

From the above line plot, we can see the annual flow of the Nile River from 1871 to 1970. The flow fluctuates clearly from year to year, with some periods of high flow and others where it dropped significantly, most visibly a shift to a lower level around the turn of the twentieth century. Overall, the plot highlights both the year-to-year variability and the long-term trend in river flow.

Conclusion:

ggplot2 provides a powerful and flexible way to visualize data in R. Using the iris and Nile datasets, we created scatter, bar, box, histogram, and line plots. These visualizations reveal meaningful patterns and trends and make the data easier to understand and interpret.