Real-world data is rarely perfect—it’s often messy, inconsistent, and full of surprises. This post will show you how to perform data cleaning in Python on a dataset from the Museum of Modern Art (MoMA). The dataset, which catalogs hundreds of artworks, includes common data issues like missing values, inconsistent formats, and tricky date ranges.
About the MoMA Dataset
Art is a messy business. Over centuries, artists have created everything from simple paintings to complex sculptures, and art historians have been cataloging everything they can along the way. The Museum of Modern Art, or MoMA for short, is one of the most influential museums in the world and has made its artwork catalog publicly available. You can download the dataset from GitHub and follow along as we clean and prepare it for analysis.
The museum has put out a disclaimer, however, that the dataset is still a work in progress—an evolving artwork in its own right, perhaps. As with many real-world datasets, this means it contains quality issues that need addressing. We'll use Python to explore these quirks, identify patterns, and transform the dataset into a cleaner format for deeper insights.
Why Perform Data Cleaning in Python?
Python has become one of the most popular tools for working with data, thanks to its ease of use and powerful libraries. Organizations across industries rely on Python for everything from processing financial reports to analyzing scientific research. When it comes to data cleaning, Python’s pandas library offers efficient tools for handling messy datasets like the one we’re using from MoMA.
In a moment, we’ll use pandas to quickly load the dataset, explore its structure, and fix common issues. If you’re new to Python or want to brush up on the basics, check out our Introduction to Python Programming course, which provides a solid foundation for getting started.
Exploring the Data with Pandas
Before we start cleaning, let’s take a closer look at the MoMA dataset to understand its structure and identify potential issues. We’ll use pandas, a popular Python library for working with data, to load the first 100 rows into a DataFrame. This will let us explore the dataset and get a preview of what we’re working with.
import pandas as pd
# Load only the first 100 rows of the catalog into a DataFrame
artworks_100 = pd.read_csv("MOMA_Artworks.csv", nrows=100)
# Preview the first 10 rows
artworks_100.head(10)
 | Title | Artist | ArtistBio | Date | Medium | Dimensions | CreditLine | MoMANumber | Classification | Department | DateAcquired | CuratorApproved | ObjectID | URL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Ferdinandsbrücke Project, Vienna, Austria , El... | Otto Wagner | (Austrian, 1841–1918) | 1896 | Ink and cut-and-pasted painted pages on paper | 19 1/8 x 66 1/2" (48.6 x 168.9 cm) | Fractional and promised gift of Jo Carole and ... | 885.1996 | A&D Architectural Drawing | Architecture & Design | 1996-04-09 | Y | 2 | https://ift.tt/m5HZtx6 |
1 | City of Music, National Superior Conservatory ... | Christian de Portzamparc | (French, born 1944) | 1987 | Paint and colored pencil on print | 16 x 11 3/4" (40.6 x 29.8 cm) | Gift of the architect in honor of Lily Auchinc... | 1.1995 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 3 | https://ift.tt/QVey7PA |
2 | Villa near Vienna Project, Outside Vienna, Aus... | Emil Hoppe | (Austrian, 1876–1957) | 1903 | Graphite, pen, color pencil, ink, and gouache ... | 13 1/2 x 12 1/2" (34.3 x 31.8 cm) | Gift of Jo Carole and Ronald S. Lauder | 1.1997 | A&D Architectural Drawing | Architecture & Design | 1997-01-15 | Y | 4 | https://ift.tt/NJpE5kF |
3 | The Manhattan Transcripts Project, New York, N... | Bernard Tschumi | (French and Swiss, born Switzerland 1944) | 1980 | Photographic reproduction with colored synthet... | 20 x 20" (50.8 x 50.8 cm) | Purchase and partial gift of the architect in ... | 2.1995 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 5 | https://ift.tt/FSzi73q |
4 | Villa, project, outside Vienna, Austria, Exter... | Emil Hoppe | (Austrian, 1876–1957) | 1903 | Graphite, color pencil, ink, and gouache on tr... | 15 1/8 x 7 1/2" (38.4 x 19.1 cm) | Gift of Jo Carole and Ronald S. Lauder | 2.1997 | A&D Architectural Drawing | Architecture & Design | 1997-01-15 | Y | 6 | https://ift.tt/2s5GO0Q |
5 | The Manhattan Transcripts Project, New York, N... | Bernard Tschumi | (French and Swiss, born Switzerland 1944) | 1976-77 | Gelatin silver photograph | 14 x 18" (35.6 x 45.7 cm) | Purchase and partial gift of the architect in ... | 3.1995.1 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 7 | https://ift.tt/qQXV90a |
6 | The Manhattan Transcripts Project, New York, N... | Bernard Tschumi | (French and Swiss, born Switzerland 1944) | 1976-77 | Gelatin silver photographs | Each: 14 x 18" (35.6 x 45.7 cm) | Purchase and partial gift of the architect in ... | 3.1995.1-24 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 8 | https://ift.tt/Ca02grz |
7 | The Manhattan Transcripts Project, New York, N... | Bernard Tschumi | (French and Swiss, born Switzerland 1944) | 1976-77 | Gelatin silver photograph | 14 x 18" (35.6 x 45.7 cm) | Purchase and partial gift of the architect in ... | 3.1995.10 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 9 | https://ift.tt/bdg9DCp |
8 | The Manhattan Transcripts Project, New York, N... | Bernard Tschumi | (French and Swiss, born Switzerland 1944) | 1976-77 | Gelatin silver photograph | 14 x 18" (35.6 x 45.7 cm) | Purchase and partial gift of the architect in ... | 3.1995.11 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 10 | https://ift.tt/xFSyER1 |
9 | The Manhattan Transcripts Project, New York, N... | Bernard Tschumi | (French and Swiss, born Switzerland 1944) | 1976-77 | Gelatin silver photograph | 14 x 18" (35.6 x 45.7 cm) | Purchase and partial gift of the architect in ... | 3.1995.12 | A&D Architectural Drawing | Architecture & Design | 1995-01-17 | Y | 11 | https://ift.tt/SfVdG8c |
The `.head()` method displays the first few rows of the dataset, helping us spot any irregularities right away. At first glance, the dataset includes valuable metadata like the title, artist, year, and medium of each artwork. However, we can already spot some irregularities in the `Date` column, such as year ranges and non-standard entries like `Unknown`. We’ll address these issues as we move through the cleaning process.
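Beyond `.head()`, a quick `.dtypes` check hints at the underlying problem: because the `Date` column mixes plain years with ranges and text labels, pandas can only store it as generic `object` strings. Here’s a minimal sketch using a hypothetical two-row frame (not the real dataset):

```python
import pandas as pd

# Hypothetical mini-frame mimicking two of the MoMA columns
sample = pd.DataFrame({
    "Title": ["Villa Project", "The Manhattan Transcripts Project"],
    "Date": ["1903", "1976-77"],
})

# Mixed formats force pandas to store 'Date' as generic object dtype
print(sample.dtypes)
```

An `object` dtype here is a signal that the column needs cleaning before any year-based analysis.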
Working with Dates in Pandas
Dates are a critical part of many datasets, especially when analyzing trends or patterns over time. In the MoMA dataset, the `Date` column includes a mix of formats, such as year ranges (`1976-77`) and non-standard entries (`Unknown`). These inconsistencies make it difficult to use the data effectively, whether for plotting or filtering by year.
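To see concretely why these mixed formats get in the way, consider what happens if we coerce the raw values to numbers directly. This is only an illustrative sketch on a hypothetical sample, not the full dataset:

```python
import pandas as pd

# Hypothetical sample of raw formats seen in the 'Date' column
raw_dates = pd.Series(["1976-77", "c. 1917", "Unknown", "n.d.", "1979"])

# errors="coerce" turns anything that isn't a plain number into NaN,
# so every non-standard entry would be silently lost
years = pd.to_numeric(raw_dates, errors="coerce")
print(years)
```

Four of the five values come back as `NaN`, which is exactly the information loss our cleaning steps are designed to avoid.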
To start, let’s take a closer look at the unique values in the `Date` column using the `.value_counts()` method. This will give us an overview of the data and help us identify patterns or irregularities that need fixing.
# Check the distribution of values in the 'Date' column
artworks_100['Date'].value_counts()
Here’s what we find:
1976-77 25
1980-81 15
1979 12
Unknown 7
1980 5
1917 5
1978 5
1923 4
1935 3
1987 2
1903 2
1970 1
1896 1
1975 1
1984 1
1918 1
1986 1
n.d. 1
1906 1
1905 1
1930 1
1974 1
1936 1
1968 1
1900 1
c. 1917 1
dtype: int64
This output highlights four key patterns in the data:

- Pattern 1: Year ranges like `1976-77`
- Pattern 2: Approximate dates like `c. 1917`
- Pattern 3: Missing values labeled as `Unknown`
- Pattern 4: Missing values labeled as `n.d.`
Cleaning the Data
Now that we’ve identified irregularities in the `Date` column, let’s outline how we’ll address them as we clean the data in Python. Each of the four patterns requires a slightly different approach:

- Pattern 1: Replace year ranges (e.g., `1976-77`) with the first year in the range.
- Pattern 2: Standardize approximate dates (e.g., `c. 1917`) by removing the `c.` prefix.
- Pattern 3: Leave missing values labeled as `Unknown` unchanged.
- Pattern 4: Convert missing values labeled as `n.d.` to `Unknown` for consistency.
By applying these transformations, we’ll ensure that all values in the `Date` column are either standardized years or clearly labeled as missing information. This makes the dataset easier to analyze and visualize.
Next, we’ll write some Python code to implement these changes step by step.
Pattern 1: Year Ranges
Year ranges like `1976-77` make it difficult to analyze or visualize the data. To simplify things, we’ll replace these ranges with the first year in the range. For example, `1976-77` will become `1976`.

We can achieve this by splitting the value at the hyphen (`-`) and keeping only the first part. Here’s how:
def clean_split_dates(row):
# Convert the current 'Date' value to a string
initial_date = str(row['Date'])
# Split the string at the hyphen
split_date = initial_date.split('-')
# Return the first part if a hyphen is found; otherwise, return the original value
return split_date[0] if len(split_date) > 1 else initial_date
# Apply the function row by row to the 'Date' column
artworks_100['Date'] = artworks_100.apply(clean_split_dates, axis=1)
# Check the updated distribution of values in the 'Date' column
artworks_100['Date'].value_counts()
After applying this transformation, all year ranges have been replaced with single years, making the `Date` column more consistent:
1976 25
1980 20
1979 12
Unknown 7
1917 5
1978 5
1923 4
1935 3
1987 2
1903 2
c. 1917 1
1918 1
1975 1
1968 1
n.d. 1
1905 1
1896 1
1984 1
1930 1
1970 1
1974 1
1986 1
1936 1
1900 1
1906 1
dtype: int64
Next, we’ll address Pattern 2 by standardizing approximate dates.
Pattern 2: Approximate Dates
In the dataset, some dates are marked as approximate using the prefix `c.` (e.g., `c. 1917`). To standardize these entries, we’ll remove the `c.` prefix while leaving the year intact. For example, `c. 1917` will become `1917`.
Here’s how we can clean these entries:
def clean_approx_dates(row):
# Convert the current 'Date' value to a string
initial_date = str(row['Date'])
# Remove the 'c.' prefix if it exists
if initial_date.startswith('c. '):
return initial_date[3:] # Return the string starting from the 4th character
return initial_date
# Apply the function row by row to the 'Date' column
artworks_100['Date'] = artworks_100.apply(clean_approx_dates, axis=1)
# Check the updated distribution of values in the 'Date' column
artworks_100['Date'].value_counts()
After this step, approximate dates have been cleaned, leaving standardized years. Here’s the updated distribution of the `Date` column:
1976 25
1980 20
1979 12
Unknown 7
1917 6
1978 5
1923 4
1935 3
1987 2
1903 2
1918 1
1975 1
1968 1
n.d. 1
1905 1
1896 1
1984 1
1930 1
1970 1
1974 1
1986 1
1936 1
1900 1
1906 1
dtype: int64
We’ll skip Pattern 3 (`Unknown`) since we’re leaving those entries unchanged. Instead, let’s move directly to Pattern 4, where we’ll standardize the remaining missing values labeled as `n.d.`.
Pattern 4: Missing Values (`n.d.`)
In addition to entries labeled `Unknown`, the dataset uses `n.d.` to indicate missing date information. To make the data more consistent, we’ll replace all instances of `n.d.` with `Unknown`. This ensures that missing values are represented uniformly across the dataset.
Here’s how we can handle this:
def clean_missing_dates(row):
# Replace 'n.d.' with 'Unknown' if found in the 'Date' column
return 'Unknown' if row['Date'] == 'n.d.' else row['Date']
# Apply the function row by row to the 'Date' column
artworks_100['Date'] = artworks_100.apply(clean_missing_dates, axis=1)
# Check the updated distribution of values in the 'Date' column
artworks_100['Date'].value_counts()
1976 25
1980 20
1979 12
Unknown 8
1917 6
1978 5
1923 4
1935 3
1987 2
1903 2
1918 1
1975 1
1968 1
1905 1
1896 1
1984 1
1930 1
1970 1
1974 1
1986 1
1936 1
1900 1
1906 1
dtype: int64
With `n.d.` entries standardized, all missing values are now labeled as `Unknown`. This consistency makes it easier to analyze or filter the data when dealing with missing dates.
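For example, with a single uniform label, excluding undated works takes just one comparison. A minimal sketch on a hypothetical mini-frame:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the cleaned dataset
artworks = pd.DataFrame({
    "Title": ["Ferdinandsbrücke Project", "Untitled", "Improvisation"],
    "Date": ["1896", "Unknown", "1917"],
})

# One comparison now filters out every artwork with a missing date
dated = artworks[artworks["Date"] != "Unknown"]
print(dated)
```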
At this point, we’ve successfully addressed all patterns in the `Date` column. The dataset is now much cleaner and ready for further analysis.
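As a side note, the three row-by-row functions above can also be expressed with pandas’ vectorized string methods, which avoids `apply` entirely. Here’s a sketch on a hypothetical sample Series (it assumes pandas ≥ 1.4 for `Series.str.removeprefix`):

```python
import pandas as pd

# Hypothetical sample covering all four patterns
dates = pd.Series(["1976-77", "c. 1917", "Unknown", "n.d.", "1979"])

cleaned = (
    dates.str.split("-").str[0]   # Pattern 1: keep the first year of a range
    .str.removeprefix("c. ")      # Pattern 2: drop the approximate-date prefix
    .replace("n.d.", "Unknown")   # Pattern 4: unify missing-value labels
)

# For numeric analysis, the remaining 'Unknown' labels coerce to NaN
years = pd.to_numeric(cleaned, errors="coerce")
```

Vectorized string operations tend to be faster than `apply` on large frames, though the function-based approach above is easier to read and debug step by step.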
Next Steps
We’ve covered some key Python data cleaning techniques while working on the `Date` column in the MoMA dataset, but data cleaning often involves addressing a range of issues across multiple columns. If you’d like to go further, our Data Cleaning Project Walkthrough course is a great next step. This six-part course not only teaches you how to clean and transform data with Python and pandas, but also includes two guided projects that are perfect for building a portfolio.
As you work through these projects, you’ll gain hands-on experience that you can showcase to potential employers. For tips on how to present your projects effectively, check out our guide to creating a data science portfolio that stands out.