
Data Cleaning in Python on MoMA’s Art Collection


Real-world data is rarely perfect—it’s often messy, inconsistent, and full of surprises. This post will show you how to perform data cleaning in Python on a dataset from the Museum of Modern Art (MoMA). The dataset, which catalogs hundreds of artworks, includes common data issues like missing values, inconsistent formats, and tricky date ranges.

About the MoMA Dataset

Art is a messy business. Over centuries, artists have created everything from simple paintings to complex sculptures, and art historians have been cataloging everything they can along the way. The Museum of Modern Art, or MoMA for short, is one of the most influential museums in the world and has made its artwork catalog publicly available. You can download the dataset from GitHub and follow along as we clean and prepare it for analysis.

The museum has put out a disclaimer, however, that the dataset is still a work in progress—an evolving artwork in its own right, perhaps. As with many real-world datasets, this means it contains quality issues that need addressing. We'll use Python to explore these quirks, identify patterns, and transform the dataset into a cleaner format for deeper insights.

Why Perform Data Cleaning in Python?

Python has become one of the most popular tools for working with data, thanks to its ease of use and powerful libraries. Organizations across industries rely on Python for everything from processing financial reports to analyzing scientific research. When it comes to data cleaning, Python’s pandas library offers efficient tools for handling messy datasets like the one we’re using from MoMA.

In a moment, we’ll use pandas to quickly load the dataset, explore its structure, and fix common issues. If you’re new to Python or want to brush up on the basics, check out our Introduction to Python Programming course, which provides a solid foundation for getting started.

Exploring the Data with Pandas

Before we start cleaning, let’s take a closer look at the MoMA dataset to understand its structure and identify potential issues. We’ll use pandas, a popular Python library for working with data, to load the first 100 rows into a DataFrame. This will let us explore the dataset and get a preview of what we’re working with.


import pandas as pd

# Load only the first 100 rows of the catalog into a DataFrame
artworks_100 = pd.read_csv("MOMA_Artworks.csv", nrows=100)

# Preview the first 10 rows
artworks_100.head(10)
Title Artist ArtistBio Date Medium Dimensions CreditLine MoMANumber Classification Department DateAcquired CuratorApproved ObjectID URL
0 Ferdinandsbrücke Project, Vienna, Austria , El... Otto Wagner (Austrian, 1841–1918) 1896 Ink and cut-and-pasted painted pages on paper 19 1/8 x 66 1/2" (48.6 x 168.9 cm) Fractional and promised gift of Jo Carole and ... 885.1996 A&D Architectural Drawing Architecture & Design 1996-04-09 Y 2 https://ift.tt/m5HZtx6
1 City of Music, National Superior Conservatory ... Christian de Portzamparc (French, born 1944) 1987 Paint and colored pencil on print 16 x 11 3/4" (40.6 x 29.8 cm) Gift of the architect in honor of Lily Auchinc... 1.1995 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 3 https://ift.tt/QVey7PA
2 Villa near Vienna Project, Outside Vienna, Aus... Emil Hoppe (Austrian, 1876–1957) 1903 Graphite, pen, color pencil, ink, and gouache ... 13 1/2 x 12 1/2" (34.3 x 31.8 cm) Gift of Jo Carole and Ronald S. Lauder 1.1997 A&D Architectural Drawing Architecture & Design 1997-01-15 Y 4 https://ift.tt/NJpE5kF
3 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1980 Photographic reproduction with colored synthet... 20 x 20" (50.8 x 50.8 cm) Purchase and partial gift of the architect in ... 2.1995 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 5 https://ift.tt/FSzi73q
4 Villa, project, outside Vienna, Austria, Exter... Emil Hoppe (Austrian, 1876–1957) 1903 Graphite, color pencil, ink, and gouache on tr... 15 1/8 x 7 1/2" (38.4 x 19.1 cm) Gift of Jo Carole and Ronald S. Lauder 2.1997 A&D Architectural Drawing Architecture & Design 1997-01-15 Y 6 https://ift.tt/2s5GO0Q
5 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photograph 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.1 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 7 https://ift.tt/qQXV90a
6 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photographs Each: 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.1-24 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 8 https://ift.tt/Ca02grz
7 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photograph 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.10 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 9 https://ift.tt/bdg9DCp
8 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photograph 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.11 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 10 https://ift.tt/xFSyER1
9 The Manhattan Transcripts Project, New York, N... Bernard Tschumi (French and Swiss, born Switzerland 1944) 1976-77 Gelatin silver photograph 14 x 18" (35.6 x 45.7 cm) Purchase and partial gift of the architect in ... 3.1995.12 A&D Architectural Drawing Architecture & Design 1995-01-17 Y 11 https://ift.tt/SfVdG8c

The .head() method displays the first few rows of the dataset, giving us a quick look at its contents. At first glance, the dataset includes valuable metadata like the title, artist, year, and medium of each artwork. However, we can already spot irregularities in the Date column, such as year ranges and non-standard entries like Unknown. We'll address these issues as we move through the cleaning process.

Working with Dates in Pandas

Dates are a critical part of many datasets, especially when analyzing trends or patterns over time. In the MoMA dataset, the Date column includes a mix of formats, such as year ranges (1976-77) and non-standard entries (Unknown). These inconsistencies make it difficult to use the data effectively, whether for plotting or filtering by year.

To start, let’s take a closer look at the unique values in the Date column using the .value_counts() method. This will give us an overview of the data and help us identify patterns or irregularities that need fixing.


# Check the distribution of values in the 'Date' column
artworks_100['Date'].value_counts()

Here’s what we find:


1976-77    25
1980-81    15
1979       12
Unknown     7
1980        5
1917        5
1978        5
1923        4
1935        3
1987        2
1903        2
1970        1
1896        1
1975        1
1984        1
1918        1
1986        1
n.d.        1
1906        1
1905        1
1930        1
1974        1
1936        1
1968        1
1900        1
c. 1917     1
dtype: int64

This output highlights four key patterns in the data:

  • Pattern 1: Year ranges like 1976-77
  • Pattern 2: Approximate dates like c. 1917
  • Pattern 3: Missing values labeled as Unknown
  • Pattern 4: Missing values labeled as n.d.
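Before fixing anything, it can help to confirm which entries actually deviate from a plain four-digit year. Here's a quick sketch of that check using a small hypothetical sample that mirrors the four patterns above (the sample values are illustrative, not the full dataset):

```python
import pandas as pd

# Hypothetical sample mirroring the four patterns above
dates = pd.Series(["1976-77", "c. 1917", "Unknown", "n.d.", "1979"], name="Date")

# Flag anything that is not a plain four-digit year
needs_cleaning = dates[~dates.str.fullmatch(r"\d{4}")]
print(needs_cleaning.tolist())  # ['1976-77', 'c. 1917', 'Unknown', 'n.d.']
```

Running the same filter against the real Date column would surface every value the cleaning steps below need to handle.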

Cleaning the Data

Now that we’ve identified irregularities in the Date column, let’s outline how we’ll perform the data cleaning in Python to address them. Each of the four patterns requires a slightly different approach:

  • Pattern 1: Replace year ranges (e.g., 1976-77) with the first year in the range.
  • Pattern 2: Standardize approximate dates (e.g., c. 1917) by removing the c. prefix.
  • Pattern 3: Leave missing values labeled as Unknown unchanged.
  • Pattern 4: Convert missing values labeled as n.d. to Unknown for consistency.

By applying these transformations, we’ll ensure that all values in the Date column are either standardized years or clearly labeled as missing information. This makes the dataset easier to analyze and visualize.

Next, we’ll write some Python code to implement these changes step by step.

Pattern 1: Year Ranges

Year ranges like 1976-77 make it difficult to analyze or visualize the data. To simplify things, we’ll replace these ranges with the first year in the range. For example, 1976-77 will become 1976.

We can achieve this by splitting the value at the hyphen (-) and keeping only the first part. Here’s how:


def clean_split_dates(row):
    # Convert the current 'Date' value to a string
    initial_date = str(row['Date'])
    # Split the string at the hyphen
    split_date = initial_date.split('-')
    # Return the first part if a hyphen is found; otherwise, return the original value
    return split_date[0] if len(split_date) > 1 else initial_date

# Apply the function row by row to the 'Date' column
artworks_100['Date'] = artworks_100.apply(clean_split_dates, axis=1)

# Check the updated distribution of values in the 'Date' column
artworks_100['Date'].value_counts()

After applying this transformation, all year ranges have been replaced with single years, making the Date column more consistent:


1976       25
1980       20
1979       12
Unknown     7
1917        5
1978        5
1923        4
1935        3
1987        2
1903        2
c. 1917     1
1918        1
1975        1
1968        1
n.d.        1
1905        1
1896        1
1984        1
1930        1
1970        1
1974        1
1986        1
1936        1
1900        1
1906        1
dtype: int64
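As an aside, pandas' vectorized string methods can perform the same split without a row-by-row apply. This is an alternative sketch with hypothetical sample data, not the method the walkthrough uses:

```python
import pandas as pd

# Hypothetical sample: one range, one plain year, one missing value
dates = pd.Series(["1976-77", "1979", "Unknown"], name="Date")

# Keep only the text before the first hyphen; values without one pass through unchanged
cleaned = dates.str.split("-").str[0]
print(cleaned.tolist())  # ['1976', '1979', 'Unknown']
```

Vectorized string operations are generally faster than apply on large datasets, though for 100 rows the difference is negligible.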

Next, we’ll address Pattern 2 by standardizing approximate dates.

Pattern 2: Approximate Dates

In the dataset, some dates are marked as approximate using the prefix c. (e.g., c. 1917). To standardize these entries, we’ll remove the c. prefix while leaving the year intact. For example, c. 1917 will become 1917.

Here’s how we can clean these entries:


def clean_approx_dates(row):
    # Convert the current 'Date' value to a string
    initial_date = str(row['Date'])
    # Remove the 'c.' prefix if it exists
    if initial_date.startswith('c. '):
        return initial_date[3:]  # Return the string starting from the 4th character
    return initial_date

# Apply the function row by row to the 'Date' column
artworks_100['Date'] = artworks_100.apply(clean_approx_dates, axis=1)

# Check the updated distribution of values in the 'Date' column
artworks_100['Date'].value_counts()

After this step, approximate dates have been cleaned, leaving standardized years. Here’s an updated distribution of the Date column:


1976       25
1980       20
1979       12
Unknown     7
1917        6
1978        5
1923        4
1935        3
1987        2
1903        2
1918        1
1975        1
1968        1
n.d.        1
1905        1
1896        1
1984        1
1930        1
1970        1
1974        1
1986        1
1936        1
1900        1
1906        1
dtype: int64
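For reference, the prefix removal can also be done in one vectorized step with a regex replace. Again, this is an alternative sketch on hypothetical sample values, not the approach used above:

```python
import pandas as pd

# Hypothetical sample: one approximate date, one plain year, one missing value
dates = pd.Series(["c. 1917", "1903", "Unknown"], name="Date")

# Strip a leading 'c.' marker (and any following whitespace); other values are untouched
cleaned = dates.str.replace(r"^c\.\s*", "", regex=True)
print(cleaned.tolist())  # ['1917', '1903', 'Unknown']
```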

We'll skip Pattern 3 (Unknown) since we're leaving those entries unchanged. Instead, let's move directly to Pattern 4, where we'll standardize other missing values labeled as n.d.

Pattern 4: Missing Values (n.d.)

In addition to entries labeled Unknown, the dataset uses n.d. to indicate missing date information. To make the data more consistent, we’ll replace all instances of n.d. with Unknown. This ensures that missing values are represented uniformly across the dataset.

Here’s how we can handle this:


def clean_missing_dates(row):
    # Replace 'n.d.' with 'Unknown' if found in the 'Date' column
    return 'Unknown' if row['Date'] == 'n.d.' else row['Date']

# Apply the function row by row to the 'Date' column
artworks_100['Date'] = artworks_100.apply(clean_missing_dates, axis=1)

# Check the updated distribution of values in the 'Date' column
artworks_100['Date'].value_counts()

1976       25
1980       20
1979       12
Unknown     8
1917        6
1978        5
1923        4
1935        3
1987        2
1903        2
1918        1
1975        1
1968        1
1905        1
1896        1
1984        1
1930        1
1970        1
1974        1
1986        1
1936        1
1900        1
1906        1
dtype: int64

With n.d. entries standardized, all missing values are now labeled as Unknown. This consistency makes it easier to analyze or filter the data when dealing with missing dates.

At this point, we’ve successfully addressed all patterns in the Date column. The dataset is now much cleaner and ready for further analysis.
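A natural follow-up, not covered in the walkthrough above, is converting the cleaned strings to numbers so the column can be filtered or plotted. With errors="coerce", pd.to_numeric turns the remaining Unknown labels into NaN (sketched here on hypothetical sample values):

```python
import pandas as pd

# Hypothetical sample of cleaned Date values
dates = pd.Series(["1976", "1917", "Unknown", "1980"], name="Date")

# 'Unknown' can't be parsed as a number, so errors="coerce" maps it to NaN
years = pd.to_numeric(dates, errors="coerce")
print(years.dropna().astype(int).tolist())  # [1976, 1917, 1980]
```

From there, the NaN entries can be dropped or imputed depending on the analysis.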

Next Steps

We’ve covered some key techniques for data cleaning in Python while working through the Date column of the MoMA dataset, but data cleaning often involves addressing a range of issues across multiple columns. If you’d like to go further, our Data Cleaning Project Walkthrough course is a great next step. This six-part course not only teaches you how to clean and transform data with Python and pandas, but also includes two guided projects that are perfect for building a portfolio.

As you work through these projects, you’ll gain hands-on experience that you can showcase to potential employers. For tips on how to present your projects effectively, check out our guide to creating a data science portfolio that stands out.



from Dataquest https://ift.tt/wUOQTZB