https://ift.tt/MPDrnWu In this guide, I'll walk you through analyzing New York City's high school data to identify relationships b...
In this guide, I'll walk you through analyzing New York City's high school data to identify relationships between various factors and SAT scores. This guided project, Analyzing NYC High School Data, will help you develop hands-on skills in data analysis, visualization, and statistical correlation using Python.
We'll assume the role of a data analyst investigating the potential relationships between SAT scores and factors like school safety, demographic makeup, and academic programs in NYC high schools. The ultimate question we'll explore: Is the SAT a fair test, or do certain demographic factors correlate strongly with performance?
What You'll Learn:
- How to combine and clean multiple datasets to create a comprehensive analysis
- How to identify and visualize correlations between different variables
- How to analyze demographic factors like race, gender, and socioeconomic status in relation to test scores
- How to draw meaningful insights from data visualizations
- How to organize an exploratory data analysis workflow in Python
Before getting into this project, you should be fairly fluent with basic Python skills like lists, dictionaries, loops, and conditional logic. You should also have some familiarity with pandas
, matplotlib
, and basic statistical concepts like correlation. If you need to brush up on these skills, check out our Python for Data Science learning path.
Now, let's dive into our analysis!
Step 1: Understanding the Data
For this project, we'll be working with several datasets related to New York City high schools:
- SAT scores: Contains average scores for each high school
- School demographics: Information about race, gender, and other demographic factors
- AP test results: Data on Advanced Placement test participation
- Graduation outcomes: Graduation rates for each school
- Class size: Information about class sizes
- School surveys: Results from surveys given to students, teachers, and parents
- School directory: Additional information about each school
Each dataset contains information about NYC high schools, but they're separated across different files. Our first task will be to combine these datasets into one comprehensive dataset for analysis.
Step 2: Setting Up the Environment
If you're working on this project within the Dataquest platform, your environment is already set up. If you're working locally, you'll need:
- Python environment: Ensure you have Python 3.x installed with
pandas
,numpy
,matplotlib
, andre
(regex) libraries. - Jupyter Notebook: Install Jupyter Notebook or JupyterLab to work with the provided
.ipynb
file. - Data files: Download the dataset files from the project page.
Let's start by importing the necessary libraries and loading our datasets:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
d = pd.read_csv("schools/{0}".format(f))
data[f.replace(".csv", "")] = d
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
The code above loads each dataset into a dictionary called data
, where the keys are the file names (without the .csv
extension) and the values are the pandas DataFrames containing the data.
The survey data is stored in tab-delimited (”\t”
) text files, so we need to specify the delimiter
to load it correctly. For the survey
data, we also need to standardize the dbn
(District Borough Number) column, which is the unique identifier for each school:
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
Step 3: Data Cleaning and Preparation
Before we can analyze the data, we need to clean and prepare it. This involves standardizing column names, converting data types, and extracting geographical information:
# Standardize DBN column name in hs_directory
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
# Helper function to pad CSD values
def pad_csd(num):
string_representation = str(num)
if len(string_representation) > 1:
return string_representation
else:
return "0" + string_representation
# Create DBN column in class_size
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
# Convert SAT score columns to numeric and calculate total SAT score
cols = ["SAT Critical Reading Avg. Score", "SAT Math Avg. Score", "SAT Writing Avg. Score"]
for c in cols:
data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data["sat_results"]["sat_score"] = data["sat_results"][cols[0]] + data["sat_results"][cols[1]] + data["sat_results"][cols[2]]
This type of data cleaning is critical for real-world data analysis. If you want to learn more about data cleaning techniques, check out Dataquest's Data Cleaning in Python course.
We also need to extract geographic coordinates from the location field for mapping purposes:
# Extract latitude and longitude from Location 1 field
def find_lat(loc):
coords = re.findall("\(.+, .+\)", loc)
lat = re.findall("[+-]?\d+\.\d+", coords[0])[0]
return lat
def find_lon(loc):
coords = re.findall("\(.+, .+\)", loc)
lon = re.findall("[+-]?\d+\.\d+", coords[0])[1]
return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
Now we can combine all the datasets into a single DataFrame. We'll use the DBN
as the key to merge all the datasets:
# Combine the datasets
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
combined = combined.merge(data["class_size"], on="DBN", how="left")
combined = combined.merge(data["demographics"], on="DBN", how="left")
combined = combined.merge(data["survey"], on="DBN", how="left")
combined = combined.merge(data["hs_directory"], on="DBN", how="left")
# Fill missing values
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
# Add a school district column for mapping
def get_first_two_chars(dbn):
return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
When you run this code, you'll have a single DataFrame called combined
that contains all the data from the separate datasets. We've filled in missing values with the mean of each column, and if a column has no numeric values, we've filled it with 0.
If we examine the first few rows of our combined
dataset using combined.head()
, we’ll see that it has over 160 columns! This is a lot of information to process, so we'll need to be strategic about how we explore the data.
Step 4: Finding Relationships - Correlation Analysis
Since we have so many variables, a good place to start is by identifying which ones have the strongest relationships with SAT scores. We can calculate the correlation between each variable and the total SAT score:
correlations = combined.corr(numeric_only=True)["sat_score"]
# Display strongest negative correlations
print("Strongest negative correlations:")
correlations_ascending = correlations.sort_values()
print(correlations_ascending.head(10))
# Display strongest positive correlations
print("\nStrongest positive correlations:")
correlations_descending = correlations.sort_values(ascending=False)
print(correlations_descending.head(10))
Running this code produces the following output:
Strongest negative correlations:
frl_percent -0.722225
ell_percent -0.663531
sped_percent -0.448065
hispanic_per -0.394415
black_per -0.363847
total_enrollment -0.308963
female_per -0.213514
male_per -0.213514
Asian_per -0.152571
grade9_perc 0.034778
Name: sat_score, dtype: float64
Strongest positive correlations:
sat_score 1.000000
sat_w 0.986296
sat_m 0.972415
sat_r 0.968731
white_per 0.661642
asian_per 0.570730
total_exams_taken 0.448799
AP Test Takers 0.429749
high_score_percent 0.394919
selfselect_percent 0.369467
Name: sat_score, dtype: float64
This gives us a good starting point for our investigation. We can see that SAT scores have:
- Strong negative correlations with:
frl_percent
(percentage of students on free/reduced lunch, an indicator of poverty)ell_percent
(percentage of English Language Learners)sped_percent
(percentage of students in special education)hispanic_per
(percentage of Hispanic students)black_per
(percentage of Black students)
- Strong positive correlations with:
- The individual SAT section scores (which makes sense)
white_per
(percentage of white students)asian_per
(percentage of Asian students)- Various AP test measures
These correlations suggest that demographic factors like race, socioeconomic status, and English language proficiency may have significant relationships with SAT scores. Let's explore these relationships in more detail.
Step 5: Analyzing Survey Results
As a former high school teacher myself, I've seen firsthand how factors like school safety and academic engagement can impact student performance. Let's look at how survey responses correlate with SAT scores. The survey includes questions about academics, safety, communication, and engagement, answered by students, teachers, and parents:
# Find correlations between survey fields and SAT scores
survey_correlations = combined.loc[:, survey_fields].corr(numeric_only=True)["sat_score"]
survey_correlations = survey_correlations.sort_values(ascending=False)
# Visualize correlations
survey_correlations.plot.barh(figsize=(10, 10))
plt.axvline(x=0.25, linestyle='--')
plt.title("Correlation between Survey Data and SAT Scores")
plt.xlabel("Correlation Coefficient")
When you run this code, you'll see a horizontal bar chart showing the correlation between each survey field and SAT scores. The dotted line at 0.25
helps us identify stronger correlations (those extending beyond the dotted line):
The strongest correlations appear to be with:
- Number of student and parent survey responses (
N_s
andN_p
) ― suggests that more engaged school communities have higher SAT scores - Student perceptions of academics (
aca_s_11
) ― when students view academics positively, SAT scores tend to be higher - Safety perceptions (
saf_s_11
,saf_t_11
,saf_tot_11
) ― schools perceived as safer tend to have higher SAT scores
Instructor Insight: The strong correlation between survey response rates and SAT scores isn't surprising ― schools where parents and students take the time to complete surveys are likely communities where education is highly valued. This engagement often translates to better academic preparation and support for standardized tests like the SAT.
Let's explore the relationship between safety perceptions and SAT scores in more detail:
combined.plot.scatter(x="saf_s_11", y="sat_score")
plt.title("Safety vs. SAT Scores")
plt.xlabel("Student Safety Perception")
plt.ylabel("Average SAT Score")
The resulting scatter plot shows a general upward trend - as safety perception increases, SAT scores tend to increase as well. But there's considerable variation, suggesting that safety isn't the only factor at play.
Let's see if there are differences in safety perceptions across NYC boroughs:
borough_safety = combined.groupby("borough")["sat_score", "saf_s_11", "saf_t_11", "saf_p_11", "saf_tot_11"].mean()
borough_safety
This produces a table showing average safety perceptions and SAT scores by borough:
sat_score saf_s_11 saf_t_11 saf_p_11 saf_tot_11
borough
Bronx 1126.692308 6.606087 6.539855 7.585507 6.837678
Brooklyn 1230.068966 6.836087 7.259420 7.942391 7.230797
Manhattan 1340.052632 7.454412 7.548529 8.603676 7.790441
Queens 1317.279412 7.149338 7.565476 8.193878 7.533921
Staten Island 1376.600000 6.450000 7.325000 7.966667 7.175000
Interestingly, while there are significant differences in SAT scores across boroughs (with the Bronx having the lowest average and Staten Island the highest), the differences in safety perceptions are less pronounced. This suggests that while safety may correlate with SAT scores overall, borough-level safety differences aren't necessarily driving borough-level SAT score differences.
Instructor Insight: One pattern I found particularly interesting when exploring this data is that parent perceptions of safety are consistently about 1.5 points higher than student perceptions across all boroughs. From my experience in schools, this makes sense - parents aren't in the building every day and may not be aware of the day-to-day dynamics that students experience. Students have a much more immediate and granular understanding of safety issues in their schools.
If you're interested in learning more about data aggregation techniques like the groupby()
method we used above, check out Dataquest's Pandas Fundamentals course.
Step 6: Race Demographics and SAT Scores
The correlation analysis showed that race demographics have strong relationships with SAT scores. Let's visualize these correlations:
race_fields = [
"white_per",
"asian_per",
"black_per",
"hispanic_per"
]
# Correlation between race demographics and SAT scores
race_correlations = combined.loc[:, race_fields].corr(numeric_only=True)["sat_score"]
race_correlations.plot.barh()
plt.title("Correlation between Race Demographics and SAT Scores")
plt.xlabel("Correlation Coefficient")
When you run this code, you'll see a bar chart showing that white and Asian percentages have strong positive correlations with SAT scores, while Black and Hispanic percentages have negative correlations.
Let's take a closer look at the relationship between Hispanic percentage and SAT scores, which had one of the strongest negative correlations:
import numpy as np
combined.plot.scatter(x="hispanic_per", y="sat_score")
# Add regression line
x = combined["hispanic_per"]
y = combined["sat_score"]
m, b = np.polyfit(x, y, 1)
plt.plot(x, m*x + b, '-', color='red')
plt.title("Hispanic Population Percentage vs. SAT Scores")
plt.xlabel("Hispanic Population Percentage")
plt.ylabel("Average SAT Score")
The scatter plot with the regression line clearly shows the negative relationship - as the percentage of Hispanic students increases, the average SAT score tends to decrease.
Instructor Insight: When creating this visualization, I chose to add a regression line to make the trend clearer. This is a small but powerful enhancement to our scatter plot - while the naked eye can detect a general pattern in the data points, the regression line provides a more precise representation of the relationship.
To understand this relationship better, let's look at schools with very high Hispanic populations:
selected_columns = ["school_name", "num_of_sat_test_takers", "hispanic_per", "sat_score", "sped_per", "ell_per", "school_dist", "borough", "saf_s_11"]
high_hispanic = combined.sort_values("hispanic_per", ascending=False)[selected_columns].head(5)
high_hispanic
This outputs the 5 schools with the highest Hispanic percentages:
school_name num_of_sat_test_takers hispanic_per sat_score sped_per ell_per school_dist borough saf_s_11
58 WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL 66 93.918919 1126.515152 14.779270 12.018348 06 Manhattan 6.816092
62 GREGORIO LUPERON HIGH SCHOOL 53 96.164384 1021.245283 15.963855 51.807229 06 Manhattan 6.885291
66 MULTICULTURAL HIGH SCHOOL 23 97.297297 1023.217391 16.447368 34.210526 09 Brooklyn 7.656250
68 PAN AMERICAN INTERNATIONAL HIGH SCHOOL 44 97.917847 1045.750000 10.363636 75.454545 24 Queens 7.976744
69 INTERNATIONAL HIGH SCHOOL AT UNION SQUARE 49 94.871795 1172.653061 10.638298 84.042553 02 Manhattan 7.767857
A pattern emerges: these schools have very high percentages of English Language Learners (ELL), ranging from 12% to 84%. This suggests that language barriers might be a significant factor in the correlation between Hispanic percentage and SAT scores, as the SAT is administered in English and requires strong English language skills.
Instructor Insight: To better understand these schools, I did some additional research (as we would in a real data analysis project) by searching for information about them online. What I discovered was fascinating ― all of these schools are part of New York City's Outward Bound program, which specifically focuses on serving English language learners. This provides important context to our numerical findings. Even though these schools have lower SAT scores, they're actually specialized programs designed to support a specific student population with unique challenges.
For comparison, let's look at schools with very low Hispanic percentages and high SAT scores:
low_hispanic_high_sat = combined[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)].sort_values("sat_score", ascending=False)[selected_columns].head(5)
low_hispanic_high_sat
This shows us the top-performing schools with low Hispanic populations:
school_name num_of_sat_test_takers hispanic_per sat_score sped_per ell_per school_dist borough saf_s_11
37 STUYVESANT HIGH SCHOOL 832 2.699055 2096.105769 0.321812 0.000000 02 Manhattan 8.092486
86 BRONX HIGH SCHOOL OF SCIENCE, THE 731 7.132868 1941.606018 0.395778 0.197889 10 Bronx 7.058323
83 STATEN ISLAND TECHNICAL HIGH SCHOOL 226 7.563025 1933.893805 0.883392 0.000000 31 Staten Island 8.333333
51 HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE 114 9.322034 1925.657895 0.855856 0.000000 10 Bronx 8.545455
32 QUEENS HIGH SCHOOL FOR THE SCIENCES 78 9.230769 1877.307692 0.000000 0.000000 28 Queens 8.592593
These schools have very different profiles ― they have virtually no English Language Learners and very low percentages of special education students.
Instructor Insight: Upon further investigation, I discovered that all five of these are specialized high schools in NYC that select students based on academic performance through competitive admissions processes. What's particularly noteworthy about these schools is that despite New York City having a diversity initiative for specialized high schools, none of these top-performing schools have elected to participate in it. This raises important questions about educational equity and access to high-quality education across different demographic groups.
This analysis reveals that the SAT may disadvantage English Language Learners, who may struggle with the test's language demands regardless of their academic abilities.
Step 7: Gender and SAT Scores
Let's examine how gender demographics relate to SAT scores:
gender_fields = ["male_per", "female_per"]
gender_correlations = combined.loc[:, gender_fields].corr(numeric_only=True)["sat_score"]
gender_correlations.plot.barh()
plt.title("Correlation between Gender Demographics and SAT Scores")
plt.xlabel("Correlation Coefficient")
The correlation between gender and SAT scores is relatively weak but interesting - there's a positive correlation with female percentage and a negative correlation with male percentage.
Let's visualize this relationship:
combined.plot.scatter(x="female_per", y="sat_score")
plt.axhspan(40, 60, alpha=0.2, color='red')
plt.title("Female Population Percentage vs. SAT Scores")
plt.xlabel("Female Population Percentage")
plt.ylabel("Average SAT Score")
The highlighted area represents what might be considered a "normal" gender balance (40-60% female). We can see that there are schools with high SAT scores across various gender compositions, but let's check if there's a difference in average SAT scores for schools with very high or low female percentages:
print("Average SAT score for schools with >60% female students:", combined[combined["female_per"] > 60]["sat_score"].mean())
print("Average SAT score for schools with <40% female students:", combined[combined["female_per"] < 40]["sat_score"].mean())
This gives us:
Average SAT score for schools with >60% female students: 1301.8308823529412
Average SAT score for schools with <40% female students: 1204.4375
There's a gap of nearly 100 points! Schools with predominantly female populations have higher average SAT scores than schools with predominantly male populations.
Let's look at some of these high-female, high-SAT schools:
high_female_columns = ["school_name", "female_per", "sat_score", "sat_m", "sat_r", "sat_w"]
combined.sort_values(["female_per", "sat_score"], ascending=False)[high_female_columns].head(5)
The result shows specialized schools focusing on arts, which is historically popular for female students. As with the race analysis, we're seeing that specialized schools play a role in these demographic patterns.
Instructor Insight: This finding particularly resonates with my personal experience. Before becoming a data analyst, I attended a performing arts-focused high school that had a similar gender distribution ― over 60% female. These specialized schools often have strong academic programs alongside their focus areas, which may help explain the correlation between female percentage and SAT scores. It's a reminder that behind every data point is a complex institutional story that pure numbers can't fully capture.
Step 8: AP Test Participation and SAT Scores
Finally, let's look at Advanced Placement (AP) test participation and its relationship with SAT scores. AP courses are college-level classes that high school students can take, potentially earning college credit:
# Create percentage of AP test takers relative to student population
combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]
# Visualize relationship between AP test percentage and SAT scores
combined.plot.scatter(x="ap_per", y="sat_score")
plt.title("AP Test Takers Percentage vs. SAT Scores")
plt.xlabel("AP Test Takers Percentage")
plt.ylabel("Average SAT Score")
The scatter plot doesn't show a very clear trend. There appears to be a vertical line of points around the 1200 mark ― this is actually an artifact of our data processing. When we filled missing values with the mean, it created this vertical pattern.
Instructor Insight: When analyzing real-world data, it's crucial to recognize these kinds of patterns that result from data cleaning decisions. I always recommend adding notes about such artifacts in your analysis to prevent misinterpretation. If this were a professional report, I would explicitly mention that this vertical line doesn't represent a natural clustering in the data.
Let's calculate the correlation to get a more precise measure:
correlation = combined["ap_per"].corr(combined["sat_score"])
print(f"Correlation between AP test percentage and SAT scores: {correlation:.4f}")
This gives:
Correlation between AP test percentage and SAT scores: 0.0566
The correlation is very weak (less than 0.06), suggesting that the percentage of students taking AP tests at a school is not strongly related to the school's average SAT score. This is somewhat surprising, as we might expect schools with more AP participants to have higher SAT scores.
Results
Looking beyond the code and the outputs, this project demonstrates the power of exploratory data analysis to uncover social patterns and potential inequities in educational systems. As someone who has worked in education before transitioning to data analysis, I find these kinds of investigations particularly meaningful.
Our analysis reveals several important insights about NYC high schools and SAT scores:
- Socioeconomic factors have some of the strongest correlations with SAT scores. Schools with higher percentages of students on free/reduced lunch tend to have lower SAT scores.
- English Language Learning appears to be a significant factor. Schools with high percentages of ELL students (many of which have high Hispanic populations) tend to have lower SAT scores, which raises questions about the fairness of using the SAT for college admissions.
- Specialized schools show a different pattern. NYC's specialized high schools, which select students based on academic performance, have very high SAT scores and different demographic compositions than other schools.
- Safety and academic engagement correlate with higher SAT scores, though the relationship is complex and varies across boroughs.
- Gender composition shows interesting patterns, with predominantly female schools having higher average SAT scores than predominantly male schools.
These findings suggest that the SAT may not be a completely fair measure of student aptitude, as scores correlate strongly with factors like English language proficiency and socioeconomic status. This aligns with broader criticisms of standardized testing, which have led many colleges to move away from requiring SAT scores in recent years.
Next Steps
This analysis barely scratches the surface of what could be done with this rich dataset. With over 160 columns of information, there are many more relationships to explore. Based on my experience with educational data, here are some compelling directions for further analysis:
-
Analyze class size:
As a former teacher, I've seen how class size impacts learning. I'd be curious to see if smaller classes correlate with higher SAT scores, particularly in schools serving disadvantaged populations.
-
Explore school districts rather than boroughs:
In our analysis, we looked at borough-level patterns, but NYC has numerous school districts that might show more nuanced patterns. Grouping by district instead of borough could reveal more localized relationships between demographics, safety, and test scores.
-
Consider property values:
Education and housing are closely linked. An analysis combining this dataset with property value information could reveal interesting patterns about economic segregation and educational outcomes.
-
Create a more equitable school ranking system:
Most ranking systems heavily prioritize test scores, which our analysis suggests may disadvantage certain populations. Developing a more holistic ranking system that accounts for factors like English language learner support, improvement over time, and socioeconomic context could provide a more fair assessment of school quality.
To take your data analysis skills further, consider exploring Dataquest's other courses and guided projects in the Data Scientist in Python career path.
I personally made the transition from teaching to data analysis through platforms like Dataquest, and the process of working on projects like this one was instrumental in building my skills and portfolio. Don't be afraid to spend several hours on a project like this one - the deeper you dig, the more insights you'll uncover, and the more you'll learn.
If you have questions or want to share your work on this project, feel free to join the discussion on our Community forums. You can even tag me (@Anna_Strahl) if you'd like specific feedback on your approach!
from Dataquest https://ift.tt/cZmYui0
via RiYo Analytics
ليست هناك تعليقات