Analyzing New York City High School Data


In this guide, I'll walk you through analyzing New York City's high school data to identify relationships between various factors and SAT scores. This guided project, Analyzing NYC High School Data, will help you develop hands-on skills in data analysis, visualization, and statistical correlation using Python.

We'll assume the role of a data analyst investigating the potential relationships between SAT scores and factors like school safety, demographic makeup, and academic programs in NYC high schools. The ultimate question we'll explore: Is the SAT a fair test, or do certain demographic factors correlate strongly with performance?

What You'll Learn:

  • How to combine and clean multiple datasets to create a comprehensive analysis
  • How to identify and visualize correlations between different variables
  • How to analyze demographic factors like race, gender, and socioeconomic status in relation to test scores
  • How to draw meaningful insights from data visualizations
  • How to organize an exploratory data analysis workflow in Python

Before getting into this project, you should be fairly fluent with basic Python skills like lists, dictionaries, loops, and conditional logic. You should also have some familiarity with pandas, matplotlib, and basic statistical concepts like correlation. If you need to brush up on these skills, check out our Python for Data Science learning path.

Now, let's dive into our analysis!

Step 1: Understanding the Data

For this project, we'll be working with several datasets related to New York City high schools:

  1. SAT scores: Contains average scores for each high school
  2. School demographics: Information about race, gender, and other demographic factors
  3. AP test results: Data on Advanced Placement test participation
  4. Graduation outcomes: Graduation rates for each school
  5. Class size: Information about class sizes
  6. School surveys: Results from surveys given to students, teachers, and parents
  7. School directory: Additional information about each school

Each dataset contains information about NYC high schools, but they're separated across different files. Our first task will be to combine these datasets into one comprehensive dataset for analysis.
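Before touching the real files, the merge plan can be sketched on toy data. The DBN and score values below are hypothetical; the point is just that every dataset shares the DBN column, so any two of them can be joined on it:

```python
import pandas as pd

# Hypothetical mini-versions of two of the datasets, keyed on DBN
sat = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1122, 1172]})
demo = pd.DataFrame({"DBN": ["01M292", "01M448"], "ell_percent": [22.0, 8.9]})

# A left join on DBN keeps one row per school and pulls in the new columns
merged = sat.merge(demo, on="DBN", how="left")
print(merged.shape)  # (2, 3)
```

We'll apply exactly this pattern, repeated once per dataset, in Step 3.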

Step 2: Setting Up the Environment

If you're working on this project within the Dataquest platform, your environment is already set up. If you're working locally, you'll need:

  1. Python environment: Ensure you have Python 3.x installed with the pandas, numpy, and matplotlib libraries (the re module for regular expressions ships with Python's standard library).
  2. Jupyter Notebook: Install Jupyter Notebook or JupyterLab to work with the provided .ipynb file.
  3. Data files: Download the dataset files from the project page.

Let's start by importing the necessary libraries and loading our datasets:

import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

The code above loads each dataset into a dictionary called data, where the keys are the file names (without the .csv extension) and the values are the pandas DataFrames containing the data.

The survey data is stored in tab-delimited ("\t") text files, so we need to specify the delimiter to load it correctly. For the survey data, we also need to create an uppercase DBN (District Borough Number) column to match the other datasets, since DBN is the unique identifier for each school:

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN",
    "rr_s",
    "rr_t",
    "rr_p",
    "N_s",
    "N_t",
    "N_p",
    "saf_p_11",
    "com_p_11",
    "eng_p_11",
    "aca_p_11",
    "saf_t_11",
    "com_t_11",
    "eng_t_11",
    "aca_t_11",
    "saf_s_11",
    "com_s_11",
    "eng_s_11",
    "aca_s_11",
    "saf_tot_11",
    "com_tot_11",
    "eng_tot_11",
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey

Step 3: Data Cleaning and Preparation

Before we can analyze the data, we need to clean and prepare it. This involves standardizing column names, converting data types, and extracting geographical information:

# Standardize DBN column name in hs_directory
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

# Helper function to pad CSD values
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation

# Create DBN column in class_size
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

# Convert SAT score columns to numeric and calculate total SAT score
cols = ["SAT Critical Reading Avg. Score", "SAT Math Avg. Score", "SAT Writing Avg. Score"]
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data["sat_results"]["sat_score"] = data["sat_results"][cols[0]] + data["sat_results"][cols[1]] + data["sat_results"][cols[2]]

This type of data cleaning is critical for real-world data analysis. If you want to learn more about data cleaning techniques, check out Dataquest's Data Cleaning in Python course.
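As an aside, the pad_csd helper above does the same job as Python's built-in str.zfill, which left-pads a string with zeros to a given width. A quick sketch confirming the two agree on typical CSD values:

```python
def pad_csd(num):
    # Same logic as the helper in the cleaning code above
    s = str(num)
    return s if len(s) > 1 else "0" + s

# str.zfill(2) produces identical results for one- and two-digit districts
for csd in [1, 6, 9, 10, 24, 32]:
    assert pad_csd(csd) == str(csd).zfill(2)

print(pad_csd(6), pad_csd(24))  # 06 24
```

Either form works; zfill is just a little more concise.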

We also need to extract geographic coordinates from the location field for mapping purposes:

# Extract latitude and longitude from Location 1 field
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = re.findall(r"[+-]?\d+\.\d+", coords[0])[0]
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = re.findall(r"[+-]?\d+\.\d+", coords[0])[1]
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

Now we can combine all the datasets into a single DataFrame. We'll use the DBN as the key to merge all the datasets:

# Combine the datasets
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
combined = combined.merge(data["class_size"], on="DBN", how="left")
combined = combined.merge(data["demographics"], on="DBN", how="left")
combined = combined.merge(data["survey"], on="DBN", how="left")
combined = combined.merge(data["hs_directory"], on="DBN", how="left")

# Fill missing values
combined = combined.fillna(combined.mean(numeric_only=True))
combined = combined.fillna(0)

# Add a school district column for mapping
def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

When you run this code, you'll have a single DataFrame called combined that contains all the data from the separate datasets. We've filled in missing values with the mean of each column, and if a column has no numeric values, we've filled it with 0.
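The two-stage fill can be sketched on a toy DataFrame (hypothetical values) to show what each stage does:

```python
import pandas as pd

# Toy frame: one numeric column and one non-numeric column, each with a gap
df = pd.DataFrame({"score": [1200.0, None, 1400.0], "name": ["A", None, "C"]})

# Stage 1: numeric NaNs get the column mean (numeric_only skips strings)
df = df.fillna(df.mean(numeric_only=True))
# Stage 2: anything still missing (non-numeric columns) becomes 0
df = df.fillna(0)

print(df["score"].tolist())  # [1200.0, 1300.0, 1400.0]
print(df["name"].tolist())   # ['A', 0, 'C']
```

Note that the missing score becomes 1300.0 (the mean of 1200 and 1400), while the missing name falls through to the second fill and becomes 0.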

If we examine the first few rows of our combined dataset using combined.head(), we’ll see that it has over 160 columns! This is a lot of information to process, so we'll need to be strategic about how we explore the data.

Step 4: Finding Relationships - Correlation Analysis

Since we have so many variables, a good place to start is by identifying which ones have the strongest relationships with SAT scores. We can calculate the correlation between each variable and the total SAT score:

correlations = combined.corr(numeric_only=True)["sat_score"]

# Display strongest negative correlations
print("Strongest negative correlations:")
correlations_ascending = correlations.sort_values()
print(correlations_ascending.head(10))

# Display strongest positive correlations
print("\nStrongest positive correlations:")
correlations_descending = correlations.sort_values(ascending=False)
print(correlations_descending.head(10))

Running this code produces the following output:

Strongest negative correlations:
frl_percent        -0.722225
ell_percent        -0.663531
sped_percent       -0.448065
hispanic_per       -0.394415
black_per          -0.363847
total_enrollment   -0.308963
female_per         -0.213514
male_per           -0.213514
asian_per          -0.152571
grade9_perc         0.034778
Name: sat_score, dtype: float64

Strongest positive correlations:
sat_score                         1.000000
sat_w                             0.986296
sat_m                             0.972415
sat_r                             0.968731
white_per                         0.661642
asian_per                         0.570730
total_exams_taken                 0.448799
AP Test Takers                    0.429749
high_score_percent                0.394919
selfselect_percent                0.369467
Name: sat_score, dtype: float64

This gives us a good starting point for our investigation. We can see that SAT scores have:

  • Strong negative correlations with:
    • frl_percent (percentage of students on free/reduced lunch, an indicator of poverty)
    • ell_percent (percentage of English Language Learners)
    • sped_percent (percentage of students in special education)
    • hispanic_per (percentage of Hispanic students)
    • black_per (percentage of Black students)
  • Strong positive correlations with:
    • The individual SAT section scores (which makes sense)
    • white_per (percentage of white students)
    • asian_per (percentage of Asian students)
    • Various AP test measures

These correlations suggest that demographic factors like race, socioeconomic status, and English language proficiency may have significant relationships with SAT scores. Let's explore these relationships in more detail.

Step 5: Analyzing Survey Results

As a former high school teacher myself, I've seen firsthand how factors like school safety and academic engagement can impact student performance. Let's look at how survey responses correlate with SAT scores. The survey includes questions about academics, safety, communication, and engagement, answered by students, teachers, and parents:

# Find correlations between survey fields and SAT scores
survey_correlations = combined.loc[:, survey_fields].corr(numeric_only=True)["sat_score"]
survey_correlations = survey_correlations.sort_values(ascending=False)

# Visualize correlations
survey_correlations.plot.barh(figsize=(10, 10))
plt.axvline(x=0.25, linestyle='--')
plt.title("Correlation between Survey Data and SAT Scores")
plt.xlabel("Correlation Coefficient")

When you run this code, you'll see a horizontal bar chart showing the correlation between each survey field and SAT scores. The dotted line at 0.25 helps us identify stronger correlations (those extending beyond the dotted line):

The strongest correlations appear to be with:

  • Number of student and parent survey responses (N_s and N_p) ― suggests that more engaged school communities have higher SAT scores
  • Student perceptions of academics (aca_s_11) ― when students view academics positively, SAT scores tend to be higher
  • Safety perceptions (saf_s_11, saf_t_11, saf_tot_11) ― schools perceived as safer tend to have higher SAT scores

Instructor Insight: The strong correlation between survey response rates and SAT scores isn't surprising ― schools where parents and students take the time to complete surveys are likely communities where education is highly valued. This engagement often translates to better academic preparation and support for standardized tests like the SAT.

Let's explore the relationship between safety perceptions and SAT scores in more detail:

combined.plot.scatter(x="saf_s_11", y="sat_score")
plt.title("Safety vs. SAT Scores")
plt.xlabel("Student Safety Perception")
plt.ylabel("Average SAT Score")

The resulting scatter plot shows a general upward trend - as safety perception increases, SAT scores tend to increase as well. But there's considerable variation, suggesting that safety isn't the only factor at play.
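To put a number on that trend, you can compute the Pearson correlation directly with Series.corr. A sketch on hypothetical safety/score pairs (not the real dataset):

```python
import pandas as pd

# Hypothetical mini-sample of safety perceptions and average SAT scores
safety = pd.Series([6.0, 6.5, 7.0, 7.5, 8.0])
sat = pd.Series([1100, 1150, 1250, 1300, 1400])

# Pearson correlation coefficient between the two series
r = safety.corr(sat)
print(round(r, 3))
```

On the real combined DataFrame, the equivalent call is combined["saf_s_11"].corr(combined["sat_score"]).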

Let's see if there are differences in safety perceptions across NYC boroughs:

borough_safety = combined.groupby("borough")[["sat_score", "saf_s_11", "saf_t_11", "saf_p_11", "saf_tot_11"]].mean()
borough_safety

This produces a table showing average safety perceptions and SAT scores by borough:

                 sat_score  saf_s_11  saf_t_11  saf_p_11  saf_tot_11
borough
Bronx          1126.692308  6.606087  6.539855  7.585507    6.837678
Brooklyn       1230.068966  6.836087  7.259420  7.942391    7.230797
Manhattan      1340.052632  7.454412  7.548529  8.603676    7.790441
Queens         1317.279412  7.149338  7.565476  8.193878    7.533921
Staten Island  1376.600000  6.450000  7.325000  7.966667    7.175000

Interestingly, while there are significant differences in SAT scores across boroughs (with the Bronx having the lowest average and Staten Island the highest), the differences in safety perceptions are less pronounced. This suggests that while safety may correlate with SAT scores overall, borough-level safety differences aren't necessarily driving borough-level SAT score differences.

Instructor Insight: One pattern I found particularly interesting when exploring this data is that parent perceptions of safety are consistently about a full point higher than student perceptions in every borough (the gap reaches roughly 1.5 points on Staten Island). From my experience in schools, this makes sense - parents aren't in the building every day and may not be aware of the day-to-day dynamics that students experience. Students have a much more immediate and granular understanding of safety issues in their schools.
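You can compute that parent-minus-student gap directly from the borough table. A sketch using the Bronx, Brooklyn, and Manhattan averages shown above:

```python
import pandas as pd

# Per-borough safety averages taken from the table above
borough_safety = pd.DataFrame(
    {
        "saf_s_11": [6.606087, 6.836087, 7.454412],
        "saf_p_11": [7.585507, 7.942391, 8.603676],
    },
    index=["Bronx", "Brooklyn", "Manhattan"],
)

# Subtracting aligned columns gives the parent-minus-student gap per borough
gap = borough_safety["saf_p_11"] - borough_safety["saf_s_11"]
print(gap.round(2))
```

On the full table, the same subtraction works after the groupby call above produces borough_safety.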

If you're interested in learning more about data aggregation techniques like the groupby() method we used above, check out Dataquest's Pandas Fundamentals course.

Step 6: Race Demographics and SAT Scores

The correlation analysis showed that race demographics have strong relationships with SAT scores. Let's visualize these correlations:

race_fields = [
    "white_per",
    "asian_per",
    "black_per",
    "hispanic_per"
]

# Correlation between race demographics and SAT scores
race_correlations = combined.loc[:, race_fields].corr(numeric_only=True)["sat_score"]
race_correlations.plot.barh()
plt.title("Correlation between Race Demographics and SAT Scores")
plt.xlabel("Correlation Coefficient")

When you run this code, you'll see a bar chart showing that white and Asian percentages have strong positive correlations with SAT scores, while Black and Hispanic percentages have negative correlations.

Let's take a closer look at the relationship between Hispanic percentage and SAT scores, which had one of the strongest negative correlations:

combined.plot.scatter(x="hispanic_per", y="sat_score")

# Add regression line
x = combined["hispanic_per"]
y = combined["sat_score"]
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, '-', color='red')

plt.title("Hispanic Population Percentage vs. SAT Scores")
plt.xlabel("Hispanic Population Percentage")
plt.ylabel("Average SAT Score")

The scatter plot with the regression line clearly shows the negative relationship - as the percentage of Hispanic students increases, the average SAT score tends to decrease.

Instructor Insight: When creating this visualization, I chose to add a regression line to make the trend clearer. This is a small but powerful enhancement to our scatter plot - while the naked eye can detect a general pattern in the data points, the regression line provides a more precise representation of the relationship.

To understand this relationship better, let's look at schools with very high Hispanic populations:

selected_columns = ["school_name", "num_of_sat_test_takers", "hispanic_per", "sat_score", "sped_per", "ell_per", "school_dist", "borough", "saf_s_11"]
high_hispanic = combined.sort_values("hispanic_per", ascending=False)[selected_columns].head(5)
high_hispanic

This outputs the 5 schools with the highest Hispanic percentages:

                                          school_name  num_of_sat_test_takers  hispanic_per    sat_score   sped_per    ell_per  school_dist    borough  saf_s_11
58   WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL                      66     93.918919  1126.515152  14.779270  12.018348           06  Manhattan  6.816092
62                       GREGORIO LUPERON HIGH SCHOOL                      53     96.164384  1021.245283  15.963855  51.807229           06  Manhattan  6.885291
66                          MULTICULTURAL HIGH SCHOOL                      23     97.297297  1023.217391  16.447368  34.210526           09   Brooklyn  7.656250
68             PAN AMERICAN INTERNATIONAL HIGH SCHOOL                      44     97.917847  1045.750000  10.363636  75.454545           24     Queens  7.976744
69          INTERNATIONAL HIGH SCHOOL AT UNION SQUARE                      49     94.871795  1172.653061  10.638298  84.042553           02  Manhattan  7.767857

A pattern emerges: these schools have very high percentages of English Language Learners (ELL), ranging from 12% to 84%. This suggests that language barriers might be a significant factor in the correlation between Hispanic percentage and SAT scores, as the SAT is administered in English and requires strong English language skills.

Instructor Insight: To better understand these schools, I did some additional research (as we would in a real data analysis project) by searching for information about them online. What I discovered was fascinating ― all of these schools are part of New York City's Outward Bound program, which specifically focuses on serving English language learners. This provides important context to our numerical findings. Even though these schools have lower SAT scores, they're actually specialized programs designed to support a specific student population with unique challenges.

For comparison, let's look at schools with very low Hispanic percentages and high SAT scores:

low_hispanic_high_sat = combined[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)].sort_values("sat_score", ascending=False)[selected_columns].head(5)
low_hispanic_high_sat

This shows us the top-performing schools with low Hispanic populations:

                                          school_name  num_of_sat_test_takers     hispanic_per      sat_score   sped_per    ell_per   school_dist       borough  saf_s_11
37                             STUYVESANT HIGH SCHOOL                     832         2.699055    2096.105769   0.321812   0.000000           02      Manhattan  8.092486
86                  BRONX HIGH SCHOOL OF SCIENCE, THE                     731         7.132868    1941.606018   0.395778   0.197889           10          Bronx  7.058323
83                STATEN ISLAND TECHNICAL HIGH SCHOOL                     226         7.563025    1933.893805   0.883392   0.000000           31  Staten Island  8.333333
51  HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE                     114         9.322034    1925.657895   0.855856   0.000000           10          Bronx  8.545455
32                QUEENS HIGH SCHOOL FOR THE SCIENCES                      78         9.230769    1877.307692   0.000000   0.000000           28         Queens  8.592593

These schools have very different profiles ― they have virtually no English Language Learners and very low percentages of special education students.

Instructor Insight: Upon further investigation, I discovered that all five of these are specialized high schools in NYC that select students based on academic performance through competitive admissions processes. What's particularly noteworthy about these schools is that despite New York City having a diversity initiative for specialized high schools, none of these top-performing schools have elected to participate in it. This raises important questions about educational equity and access to high-quality education across different demographic groups.

This analysis reveals that the SAT may disadvantage English Language Learners, who may struggle with the test's language demands regardless of their academic abilities.

Step 7: Gender and SAT Scores

Let's examine how gender demographics relate to SAT scores:

gender_fields = ["male_per", "female_per"]
gender_correlations = combined.loc[:, gender_fields].corr(numeric_only=True)["sat_score"]
gender_correlations.plot.barh()
plt.title("Correlation between Gender Demographics and SAT Scores")
plt.xlabel("Correlation Coefficient")

The correlation between gender and SAT scores is relatively weak but interesting - there's a positive correlation with female percentage and a negative correlation with male percentage.

Let's visualize this relationship:

combined.plot.scatter(x="female_per", y="sat_score")
plt.axvspan(40, 60, alpha=0.2, color='red')
plt.title("Female Population Percentage vs. SAT Scores")
plt.xlabel("Female Population Percentage")
plt.ylabel("Average SAT Score")

The highlighted area represents what might be considered a "normal" gender balance (40-60% female). We can see that there are schools with high SAT scores across various gender compositions, but let's check if there's a difference in average SAT scores for schools with very high or low female percentages:

print("Average SAT score for schools with >60% female students:", combined[combined["female_per"] > 60]["sat_score"].mean())
print("Average SAT score for schools with <40% female students:", combined[combined["female_per"] < 40]["sat_score"].mean())

This gives us:

Average SAT score for schools with >60% female students: 1301.8308823529412
Average SAT score for schools with <40% female students: 1204.4375

There's a gap of nearly 100 points! Schools with predominantly female populations have higher average SAT scores than schools with predominantly male populations.

Let's look at some of these high-female, high-SAT schools:

high_female_columns = ["school_name", "female_per", "sat_score", "sat_m", "sat_r", "sat_w"]
combined.sort_values(["female_per", "sat_score"], ascending=False)[high_female_columns].head(5)

The result shows specialized schools focused on the arts, which have historically enrolled predominantly female student bodies. As with the race analysis, we're seeing that specialized schools play a role in these demographic patterns.

Instructor Insight: This finding particularly resonates with my personal experience. Before becoming a data analyst, I attended a performing arts-focused high school that had a similar gender distribution ― over 60% female. These specialized schools often have strong academic programs alongside their focus areas, which may help explain the correlation between female percentage and SAT scores. It's a reminder that behind every data point is a complex institutional story that pure numbers can't fully capture.

Step 8: AP Test Participation and SAT Scores

Finally, let's look at Advanced Placement (AP) test participation and its relationship with SAT scores. AP courses are college-level classes that high school students can take, potentially earning college credit:

# Create percentage of AP test takers relative to student population
combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]

# Visualize relationship between AP test percentage and SAT scores
combined.plot.scatter(x="ap_per", y="sat_score")
plt.title("AP Test Takers Percentage vs. SAT Scores")
plt.xlabel("AP Test Takers Percentage")
plt.ylabel("Average SAT Score")

The scatter plot doesn't show a very clear trend. There appears to be a vertical line of points around the 1200 mark ― this is actually an artifact of our data processing. When we filled missing values with the mean, it created this vertical pattern.

Instructor Insight: When analyzing real-world data, it's crucial to recognize these kinds of patterns that result from data cleaning decisions. I always recommend adding notes about such artifacts in your analysis to prevent misinterpretation. If this were a professional report, I would explicitly mention that this vertical line doesn't represent a natural clustering in the data.
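Here's a minimal sketch (with hypothetical values) of how mean-imputation produces that kind of vertical band: every filled row lands on exactly one value, the column mean.

```python
import pandas as pd

# Toy column with three missing values
s = pd.Series([1.0, 2.0, None, None, None, 6.0])

# Mean-filling stacks all three missing rows at the mean (3.0)
filled = s.fillna(s.mean())
print(filled.tolist())  # [1.0, 2.0, 3.0, 3.0, 3.0, 6.0]
```

Plotted against another variable, those identical imputed values line up as a vertical stripe, just like the one we see near the 1200 mark.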

Let's calculate the correlation to get a more precise measure:

correlation = combined["ap_per"].corr(combined["sat_score"])
print(f"Correlation between AP test percentage and SAT scores: {correlation:.4f}")

This gives:

Correlation between AP test percentage and SAT scores: 0.0566

The correlation is very weak (less than 0.06), suggesting that the percentage of students taking AP tests at a school is not strongly related to the school's average SAT score. This is somewhat surprising, as we might expect schools with more AP participants to have higher SAT scores.

Results

Looking beyond the code and the outputs, this project demonstrates the power of exploratory data analysis to uncover social patterns and potential inequities in educational systems. As someone who has worked in education before transitioning to data analysis, I find these kinds of investigations particularly meaningful.

Our analysis reveals several important insights about NYC high schools and SAT scores:

  1. Socioeconomic factors have some of the strongest correlations with SAT scores. Schools with higher percentages of students on free/reduced lunch tend to have lower SAT scores.
  2. English Language Learning appears to be a significant factor. Schools with high percentages of ELL students (many of which have high Hispanic populations) tend to have lower SAT scores, which raises questions about the fairness of using the SAT for college admissions.
  3. Specialized schools show a different pattern. NYC's specialized high schools, which select students based on academic performance, have very high SAT scores and different demographic compositions than other schools.
  4. Safety and academic engagement correlate with higher SAT scores, though the relationship is complex and varies across boroughs.
  5. Gender composition shows interesting patterns, with predominantly female schools having higher average SAT scores than predominantly male schools.

These findings suggest that the SAT may not be a completely fair measure of student aptitude, as scores correlate strongly with factors like English language proficiency and socioeconomic status. This aligns with broader criticisms of standardized testing, which have led many colleges to move away from requiring SAT scores in recent years.

Next Steps

This analysis barely scratches the surface of what could be done with this rich dataset. With over 160 columns of information, there are many more relationships to explore. Based on my experience with educational data, here are some compelling directions for further analysis:

  1. Analyze class size:

    As a former teacher, I've seen how class size impacts learning. I'd be curious to see if smaller classes correlate with higher SAT scores, particularly in schools serving disadvantaged populations.

  2. Explore school districts rather than boroughs:

    In our analysis, we looked at borough-level patterns, but NYC has numerous school districts that might show more nuanced patterns. Grouping by district instead of borough could reveal more localized relationships between demographics, safety, and test scores.

  3. Consider property values:

    Education and housing are closely linked. An analysis combining this dataset with property value information could reveal interesting patterns about economic segregation and educational outcomes.

  4. Create a more equitable school ranking system:

    Most ranking systems heavily prioritize test scores, which our analysis suggests may disadvantage certain populations. Developing a more holistic ranking system that accounts for factors like English language learner support, improvement over time, and socioeconomic context could provide a more fair assessment of school quality.
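As a starting point for idea 2, district-level grouping reuses the school_dist column we built earlier (the first two characters of the DBN). A sketch on a hypothetical mini-dataset:

```python
import pandas as pd

# Hypothetical schools: DBNs and average SAT scores
schools = pd.DataFrame({
    "DBN": ["01M292", "02M047", "02M298", "10X440"],
    "sat_score": [1122, 1400, 1350, 1500],
})

# The first two characters of the DBN identify the school district
schools["school_dist"] = schools["DBN"].str[:2]

# Average SAT score per district
district_means = schools.groupby("school_dist")["sat_score"].mean()
print(district_means.to_dict())  # {'01': 1122.0, '02': 1375.0, '10': 1500.0}
```

On the real combined DataFrame, combined.groupby("school_dist") gives the same district-level view for any column of interest.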

To take your data analysis skills further, consider exploring Dataquest's other courses and guided projects in the Data Scientist in Python career path.

I personally made the transition from teaching to data analysis through platforms like Dataquest, and the process of working on projects like this one was instrumental in building my skills and portfolio. Don't be afraid to spend several hours on a project like this one - the deeper you dig, the more insights you'll uncover, and the more you'll learn.

If you have questions or want to share your work on this project, feel free to join the discussion on our Community forums. You can even tag me (@Anna_Strahl) if you'd like specific feedback on your approach!



from Dataquest https://ift.tt/cZmYui0