Project Tutorial: Predicting Tech Salaries with Machine Learning Using the 2023 Stack Overflow Developer Survey (Part 1 of 2)

If you've ever wondered whether learning Rust is worth the effort, or whether working remotely actually pays more, you're not alone. These are the kinds of questions that sound impossible to answer objectively, but with the right dataset and the right tools, they become exactly the kind of problem machine learning was built for.

In this project walkthrough, we're going to use the 2023 Stack Overflow Developer Survey to build a machine learning model that predicts tech salaries. With nearly 90,000 developer responses from around the world, it's one of the most comprehensive datasets on what people in our industry actually earn, and what might be driving those earnings.

Part 1 focuses entirely on the data preparation side of the pipeline: cleaning messy survey responses, handling missing values, encoding categorical variables, and engineering features that a model can actually work with. It's a lot of work, but it's also where the most important decisions in any ML project get made.

What You'll Learn

By the end of Part 1, you'll know how to:

  • Clean and prepare large-scale survey data for ML modeling
  • Handle missing data and make informed decisions about imputation vs. dropping
  • Filter outliers and assess data quality in a real-world dataset
  • Encode survey responses, including multi-select skills, ordinal rankings, and experience levels
  • Avoid data leakage during preprocessing
  • Apply real-world tradeoffs in preprocessing decisions

Before You Start

To follow along effectively, make sure you have:

  • Intermediate Python and pandas skills
  • Some familiarity with the machine learning workflow
  • Basic experience with exploratory data analysis (EDA)

You'll also want these libraries installed: pandas, scikit-learn, matplotlib, and seaborn.

You can access the project in the Dataquest app here, and find the dataset on the Stack Overflow 2023 Developer Survey Kaggle page. The full solution notebook is available on GitHub.

The Dataset

Stack Overflow runs an annual survey of the people who use their platform, and the result is a fascinating snapshot of the global developer community. We're using the 2023 edition, which includes responses from nearly 90,000 developers covering everything from what languages they use to how they feel about AI.

Our goal is to predict ConvertedCompYearly, which is each respondent's annual compensation normalized to U.S. dollars. This is our target variable, and it's what we'll be trying to predict in Part 2.

One important note: we're working with a slimmed-down version of the dataset. The original includes almost 90,000 rows and many more columns, but we've removed rows where the target variable was null, and dropped columns unrelated to salary prediction (like how often someone uses Stack Overflow, or what technologies they want to work with but haven't yet). If you download the full dataset directly from Kaggle, you'll need to do some additional cleaning before following along.

Let's start by loading our libraries and reading in the data.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

df = pd.read_csv('survey_results_slim.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48019 entries, 0 to 48018
Data columns (total 33 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   MainBranch                      48019 non-null  object
 1   Age                             48019 non-null  object
 2   Employment                      48007 non-null  object
 3   RemoteWork                      47940 non-null  object
 4   EdLevel                         48019 non-null  object
 5   LearnCode                       47935 non-null  object
 6   LearnCodeOnline                 38414 non-null  object
 7   YearsCode                       47950 non-null  object
 8   YearsCodePro                    47825 non-null  object
 9   DevType                         47904 non-null  object
 10  OrgSize                         47982 non-null  object
 11  Country                         48019 non-null  object
 12  LanguageHaveWorkedWith          47883 non-null  object
 13  DatabaseHaveWorkedWith          41764 non-null  object
 14  PlatformHaveWorkedWith          37375 non-null  object
 15  WebframeHaveWorkedWith          37865 non-null  object
 16  MiscTechHaveWorkedWith          31769 non-null  object
 17  ToolsTechHaveWorkedWith         44055 non-null  object
 18  NEWCollabToolsHaveWorkedWith    47435 non-null  object
 19  OpSysPersonal use               47534 non-null  object
 20  OpSysProfessional use           45030 non-null  object
 21  OfficeStackAsyncHaveWorkedWith  41999 non-null  object
 22  OfficeStackSyncHaveWorkedWith   47015 non-null  object
 23  AISearchHaveWorkedWith          29485 non-null  object
 24  AIDevHaveWorkedWith             13940 non-null  object
 25  AISelect                        48019 non-null  object
 26  AIToolCurrently Using           19550 non-null  object
 27  TBranch                         46156 non-null  object
 28  ICorPM                          32639 non-null  object
 29  WorkExp                         32638 non-null  float64
 30  ProfessionalTech                31726 non-null  object
 31  Industry                        27747 non-null  object
 32  ConvertedCompYearly             48019 non-null  float64
dtypes: float64(2), object(31)

We've got 48,019 rows and 33 columns, and right away we can see something that's going to shape a lot of our decisions: almost every column is an object type. Machine learning models need numbers to work with, so a significant portion of what we're about to do is figuring out how to convert all of these string values into something numeric and meaningful.

Step 1: Dropping Columns with Too Much Missing Data

Before we tackle data types, let's see which columns are missing so much data that they're not worth keeping.

missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print("Missing data percentage:\n", missing_pct[missing_pct > 50])
print("\nTotal columns with more than 50% missing data:", (missing_pct > 50).sum())
Missing data percentage:
 AIDevHaveWorkedWith      70.969824
AIToolCurrently Using    59.286949
dtype: float64

Total columns with more than 50% missing data: 2

Two columns are missing more than half their values: AIDevHaveWorkedWith (71% missing) and AIToolCurrently Using (59% missing). Imputing data when over half of it is missing is a bad idea since you'd essentially be making up most of the column. These go.

Beyond those two, we're also dropping several other columns for practical reasons:

  • LearnCodeOnline: Too many missing values (about 38k non-null out of 48k rows)
  • OpSysPersonal use: Personal OS isn't a good predictor of professional salary
  • Metadata columns (ICorPM, ProfessionalTech, TBranch): Not salary-relevant
  • High-missingness tech columns (AISearchHaveWorkedWith, MiscTechHaveWorkedWith): Too sparse to be useful
  • Office tools (OfficeStackAsyncHaveWorkedWith, OfficeStackSyncHaveWorkedWith, NEWCollabToolsHaveWorkedWith): Knowing someone uses Microsoft Word doesn't tell us much about their salary
cols_to_drop = missing_pct[missing_pct > 50].index.tolist()
cols_to_drop.extend([
    'LearnCodeOnline', 'OpSysPersonal use',
    'ICorPM', 'ProfessionalTech', 'TBranch',
    'AISearchHaveWorkedWith',
    'MiscTechHaveWorkedWith',
    'OfficeStackAsyncHaveWorkedWith',
    'OfficeStackSyncHaveWorkedWith',
    'NEWCollabToolsHaveWorkedWith'
])

cols_to_drop = list(set(cols_to_drop))
df_cleaned = df.drop(columns=cols_to_drop)
print(f"Dropped {len(cols_to_drop)} columns")
print(f"Remaining columns: {df_cleaned.shape[1]}")
Dropped 12 columns
Remaining columns: 21

Good. We're down to 21 columns, all of which have a plausible relationship to salary.

Learning Insight: The decision of what to drop is one of the most judgment-heavy parts of any data project. There's no single right answer. When I was building this project, I leaned toward removing columns aggressively because we had 48,000 rows to work with and wanted to keep the pipeline manageable. If you're extending this project, consider whether any of the dropped columns might be worth revisiting. MiscTechHaveWorkedWith, for instance, could contain interesting signals.

Step 2: Handling the Remaining Missing Values

After our initial column drops, let's check what missingness looks like in what's left.

print("Missing data in remaining columns:")
missing_after = (df_cleaned.isnull().sum() / len(df_cleaned) * 100).sort_values(ascending=False)
print(missing_after[missing_after > 0])

print("\nData types:")
print(df_cleaned.dtypes.value_counts())
Missing data in remaining columns:
Industry                   42.216623
WorkExp                    32.031071
PlatformHaveWorkedWith     22.166226
WebframeHaveWorkedWith     21.145796
DatabaseHaveWorkedWith     13.026094
ToolsTechHaveWorkedWith     8.255066
OpSysProfessional use       6.224619
YearsCodePro                0.404007
LanguageHaveWorkedWith      0.283221
DevType                     0.239489
LearnCode                   0.174931
RemoteWork                  0.164518
YearsCode                   0.143693
OrgSize                     0.077053
Employment                  0.024990
dtype: float64

Data types:
object     19
float64     2

We still have a lot of object columns and some substantial missingness in Industry (42%) and WorkExp (32%). For this project, we're going to take a straightforward approach and drop all rows with any remaining null values.

df_model = df_cleaned.dropna().copy()
print(f"Final dataset shape: {df_model.shape}")
print(f"Rows retained: {len(df_model)} ({len(df_model)/48019*100:.1f}% of filtered data)")
Final dataset shape: (16460, 21)
Rows retained: 16460 (34.3% of filtered data)

We're down to 16,460 rows, which is 34.3% of our starting data. That's a significant reduction, and it's worth being honest about the tradeoff here.

Learning Insight: Dropping all null rows is a reasonable choice when you have enough data remaining, but it can introduce selection bias. In this dataset, someone who completes every field in a long survey might be a different type of person than someone who skips questions. Their salaries could systematically differ. For a next step, consider experimenting with imputation strategies for columns like WorkExp or Industry to see whether they affect your model's results.
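If you want to try that experiment, here's a minimal sketch of one imputation approach using scikit-learn's SimpleImputer. The column names come from this dataset, but the tiny DataFrame and the strategy choices (median for WorkExp, an "Unknown" category for Industry) are assumptions for illustration, not what the tutorial does:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny stand-in for df_cleaned: real column names, made-up values
demo = pd.DataFrame({
    'WorkExp': [5.0, np.nan, 12.0, np.nan],
    'Industry': ['Fintech', None, 'Healthcare', None],
})

# Numeric column: fill missing values with the median (8.5 here)
demo['WorkExp'] = SimpleImputer(strategy='median').fit_transform(
    demo[['WorkExp']]
).ravel()

# Categorical column: treat "missing" as its own category
demo['Industry'] = demo['Industry'].fillna('Unknown')

print(demo.isnull().sum().sum())  # 0
```

You could then compare a model trained on the imputed data against one trained on the dropped-rows version to see whether the extra rows are worth the noise imputation introduces.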

Step 3: Removing Salary Outliers

Before we go any further, let's look at what our target variable actually looks like.

print("\nSalary statistics:")
print(df_model['ConvertedCompYearly'].describe())

print(f"\nSalaries > $500k: {(df_model['ConvertedCompYearly'] > 500000).sum()}")
print(f"Salaries < $10k: {(df_model['ConvertedCompYearly'] < 10000).sum()}")
Salary statistics:
count    1.646000e+04
mean     9.412425e+04
std      1.309772e+05
min      1.000000e+00
25%      4.362100e+04
50%      7.496300e+04
75%      1.231530e+05
max      1.031937e+07

Salaries > $500k: 51
Salaries < $10k: 1100

A minimum annual salary of $1 and a maximum of $10.3 million. Both of those numbers are suspicious. Let's look at the extremes more closely.

print("Top 10 salaries:")
display(df_model.nlargest(10, 'ConvertedCompYearly')[['Country', 'DevType', 'YearsCodePro', 'ConvertedCompYearly']])

print("\nBottom 10 salaries:")
display(df_model.nsmallest(10, 'ConvertedCompYearly')[['Country', 'DevType', 'YearsCodePro', 'ConvertedCompYearly']])
Top 10 salaries:
      Country                                      DevType YearsCodePro  ConvertedCompYearly
45679 New Zealand              Engineering manager               40       10319366.0
20505 Canada                   Developer, full-stack              2        7435143.0
44175 Brazil                   Developer, full-stack             22        4451577.0
...

Bottom 10 salaries:
     Country       DevType                                    YearsCodePro  ConvertedCompYearly
1192 India         Developer, full-stack                              11    1.0
4861 Viet Nam      Developer, full-stack                               3    1.0
...

The top salaries are self-reported and might be real, but $10 million from a New Zealand engineering manager and $1/year from an experienced full-stack developer in India both look like data quality issues. We'll filter those out.

df_model = df_model[
    (df_model['ConvertedCompYearly'] >= 10000) &
    (df_model['ConvertedCompYearly'] <= 500000)
]
print(f"\nAfter salary filtering:")
print(f"Rows: {len(df_model)}")
print(f"Salary range: ${df_model['ConvertedCompYearly'].min():,.0f} - ${df_model['ConvertedCompYearly'].max():,.0f}")
print(f"\nNew salary statistics:")
print(df_model['ConvertedCompYearly'].describe())
After salary filtering:
Rows: 15309
Salary range: $10,000 - $500,000

New salary statistics:
count     15309.000000
mean      97043.815925
std       67251.276938
min       10000.000000
25%       50332.000000
50%       80317.000000
75%      128507.000000
max      500000.000000

Much cleaner. With the extreme outliers removed, our mean ($97k) and median ($80k) are meaningfully close together, and the salary range ($10k to $500k) feels representative of the real developer market.

We can also visualize the distribution to better understand what we're working with.

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(df_model['ConvertedCompYearly'], bins=50, edgecolor='black')
plt.xlabel('Salary (USD)')
plt.ylabel('Count')
plt.title('Salary Distribution After Cleaning')

plt.subplot(1, 2, 2)
plt.hist(np.log10(df_model['ConvertedCompYearly']), bins=50, edgecolor='black')
plt.xlabel('Log10(Salary)')
plt.ylabel('Count')
plt.title('Log-Transformed Salary Distribution')

plt.tight_layout()
plt.show()

Salary Distribution (After Cleaning and Log-Transformed)

The raw distribution is right-skewed, which is typical for income data. The log-transformed version on the right shows how salaries cluster more naturally in the center when we account for the scale. For linear regression, log-transforming your target variable is worth experimenting with in a next step, since it can help the model perform better on skewed data.
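If you run that experiment, the usual pattern is to fit the model on the log-transformed target and then invert the predictions back to dollars. A minimal sketch of the round trip, with made-up salary values:

```python
import numpy as np

# Sketch: train on a log-transformed target, then convert predictions
# back to dollars. Values below are made up for illustration.
salaries = np.array([45000.0, 80000.0, 150000.0])

log_target = np.log1p(salaries)   # log(1 + y); compresses the right tail
recovered = np.expm1(log_target)  # exact inverse of log1p

print(np.allclose(recovered, salaries))  # True
```

Using log1p/expm1 rather than plain log/exp avoids problems if any target value is zero.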

Learning Insight: The $10k minimum cutoff may be a bit aggressive. Looking at the bottom salary data, some of those $1/year entries come from countries where cost of living means compensation is often reported differently, or where there may have been a survey misunderstanding. There's room to experiment with different thresholds here. If you want to extend this project, try adjusting the cutoffs and see how the model results change.

Step 4: Engineering Features from Multi-Select Columns

Here's where things get interesting. Several columns in this dataset aren't just single values; they're semicolon-separated lists of everything a developer has worked with. For example, LanguageHaveWorkedWith might look like Python;SQL;JavaScript;TypeScript for a given row.

Machine learning models can't work with strings like that directly, so we need a strategy.
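Before walking through the strategies below, it's worth knowing that pandas can expand a semicolon-separated column into one binary column per value in a single call. A quick sketch on made-up responses (this is an alternative worth exploring, not the approach the tutorial takes, since it creates a column for every distinct value rather than a curated subset):

```python
import pandas as pd

# Made-up multi-select responses, one string per respondent
langs = pd.Series(['Python;SQL;JavaScript', 'Rust;Go', 'Python'])

# One 0/1 column per distinct value, split on the semicolon
dummies = langs.str.get_dummies(sep=';')
print(sorted(dummies.columns))     # ['Go', 'JavaScript', 'Python', 'Rust', 'SQL']
print(dummies['Python'].tolist())  # [1, 0, 1]
```

On the real dataset you'd likely still filter the result down to the most common or most predictive skills, which is essentially what the binary-feature approach below does by hand.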

First attempt: count the skills

Our initial hypothesis is that the number of tools someone knows might correlate with their salary.

df_model['Num_Languages'] = df_model['LanguageHaveWorkedWith'].str.count(';') + 1
df_model['Num_Databases'] = df_model['DatabaseHaveWorkedWith'].str.count(';') + 1
df_model['Num_Platforms'] = df_model['PlatformHaveWorkedWith'].str.count(';') + 1
df_model['Num_Webframes'] = df_model['WebframeHaveWorkedWith'].str.count(';') + 1
df_model['Num_Tools'] = df_model['ToolsTechHaveWorkedWith'].str.count(';') + 1

df_model['Total_Skills'] = (
    df_model['Num_Languages'] + df_model['Num_Databases'] +
    df_model['Num_Platforms'] + df_model['Num_Webframes'] + df_model['Num_Tools']
)

print("\nCorrelation with salary:")
count_cols = ['Num_Languages', 'Num_Databases', 'Num_Platforms', 'Num_Webframes', 'Num_Tools', 'Total_Skills']
correlation_data = df_model[count_cols + ['ConvertedCompYearly']].corr()['ConvertedCompYearly'].sort_values(ascending=False)
print(correlation_data)
Correlation with salary:
ConvertedCompYearly    1.000000
Num_Tools              0.114356
Num_Languages          0.074025
Total_Skills           0.044754
Num_Platforms          0.006805
Num_Databases         -0.003100
Num_Webframes         -0.072698

Not great. The number of tools has the highest correlation at just 0.11, which is a very weak signal. And by reducing each column to a single count, we lose all the information about which tools someone actually knows. Knowing that someone uses Terraform is much more telling than knowing they use seven tools.

Better approach: binary features

Instead, we'll explore what the most common individual skills are, then create a binary column for each one indicating whether that developer uses it.

print("Top 20 languages:")
all_langs = df_model['LanguageHaveWorkedWith'].str.split(';').explode()
print(all_langs.value_counts().head(20))

print("\nTop 15 databases:")
all_dbs = df_model['DatabaseHaveWorkedWith'].str.split(';').explode()
print(all_dbs.value_counts().head(15))

print("\nTop 15 platforms:")
all_platforms = df_model['PlatformHaveWorkedWith'].str.split(';').explode()
print(all_platforms.value_counts().head(15))
Top 20 languages:
JavaScript                 11851
SQL                         9565
HTML/CSS                    9423
TypeScript                  8868
Python                      7193
Bash/Shell (all shells)     6042
C#                          5108
Java                        4645
PHP                         3059
Go                          2840
...

Top 15 databases:
PostgreSQL     8827
MySQL          6112
Redis          4865
...

Top 15 platforms:
Amazon Web Services (AWS)    9556
Microsoft Azure              5162
Google Cloud                 4082
...

Now we can see clearly what the most common tools are. Based on this, we'll create binary (0/1) columns for specific high-value and popular skills across languages, databases, cloud platforms, web frameworks, and DevOps tools.

# Languages: High-paying/specialized
df_model['Has_Rust'] = df_model['LanguageHaveWorkedWith'].str.contains('Rust', na=False).astype(int)
df_model['Has_Go'] = df_model['LanguageHaveWorkedWith'].str.contains('Go', na=False).astype(int)
df_model['Has_Scala'] = df_model['LanguageHaveWorkedWith'].str.contains('Scala', na=False).astype(int)
df_model['Has_Kotlin'] = df_model['LanguageHaveWorkedWith'].str.contains('Kotlin', na=False).astype(int)
df_model['Has_Swift'] = df_model['LanguageHaveWorkedWith'].str.contains('Swift', na=False).astype(int)

# Languages: Popular/versatile
df_model['Has_Python'] = df_model['LanguageHaveWorkedWith'].str.contains('Python', na=False).astype(int)
df_model['Has_JavaScript'] = df_model['LanguageHaveWorkedWith'].str.contains('JavaScript', na=False).astype(int)
df_model['Has_TypeScript'] = df_model['LanguageHaveWorkedWith'].str.contains('TypeScript', na=False).astype(int)
df_model['Has_Java'] = df_model['LanguageHaveWorkedWith'].str.contains(r'\bJava\b', na=False).astype(int)  # word boundary so 'JavaScript' doesn't match
df_model['Has_CSharp'] = df_model['LanguageHaveWorkedWith'].str.contains('C#', na=False).astype(int)
df_model['Has_SQL'] = df_model['LanguageHaveWorkedWith'].str.contains('SQL', na=False).astype(int)

# Databases
df_model['Has_PostgreSQL'] = df_model['DatabaseHaveWorkedWith'].str.contains('PostgreSQL', na=False).astype(int)
df_model['Has_MongoDB'] = df_model['DatabaseHaveWorkedWith'].str.contains('MongoDB', na=False).astype(int)
df_model['Has_Redis'] = df_model['DatabaseHaveWorkedWith'].str.contains('Redis', na=False).astype(int)
df_model['Has_Elasticsearch'] = df_model['DatabaseHaveWorkedWith'].str.contains('Elasticsearch', na=False).astype(int)

# Cloud platforms
df_model['Has_AWS'] = df_model['PlatformHaveWorkedWith'].str.contains('Amazon Web Services|AWS', na=False).astype(int)
df_model['Has_Azure'] = df_model['PlatformHaveWorkedWith'].str.contains('Azure', na=False).astype(int)
df_model['Has_GCP'] = df_model['PlatformHaveWorkedWith'].str.contains('Google Cloud', na=False).astype(int)

# Web frameworks
df_model['Has_React'] = df_model['WebframeHaveWorkedWith'].str.contains('React', na=False).astype(int)
df_model['Has_NextJS'] = df_model['WebframeHaveWorkedWith'].str.contains('Next.js', na=False).astype(int)
df_model['Has_NodeJS'] = df_model['WebframeHaveWorkedWith'].str.contains('Node.js', na=False).astype(int)
df_model['Has_SpringBoot'] = df_model['WebframeHaveWorkedWith'].str.contains('Spring Boot', na=False).astype(int)

# DevOps/infrastructure tools
df_model['Has_Docker'] = df_model['ToolsTechHaveWorkedWith'].str.contains('Docker', na=False).astype(int)
df_model['Has_Kubernetes'] = df_model['ToolsTechHaveWorkedWith'].str.contains('Kubernetes', na=False).astype(int)
df_model['Has_Terraform'] = df_model['ToolsTechHaveWorkedWith'].str.contains('Terraform', na=False).astype(int)

Now let's check how well these binary features correlate with salary compared to our count approach.

skill_features = [col for col in df_model.columns if col.startswith('Has_')]
skill_corr = df_model[skill_features + ['ConvertedCompYearly']].corr()['ConvertedCompYearly'].sort_values(ascending=False)
print("=== ALL Skill Correlations (sorted) ===")
print(skill_corr[skill_corr.index != 'ConvertedCompYearly'])
=== ALL Skill Correlations (sorted) ===
Has_Terraform        0.177524
Has_AWS              0.142293
Has_Kubernetes       0.128906
Has_Go               0.123166
Has_Rust             0.082603
Has_Python           0.079271
Has_Docker           0.064240
Has_Redis            0.061930
Has_React            0.061303
Has_Elasticsearch    0.053530
Has_PostgreSQL       0.049394
Has_Scala            0.046563
Has_TypeScript       0.037359
Has_GCP              0.035238
Has_Swift            0.023807
Has_Kotlin           0.014656
Has_SQL              0.005919
Has_Azure           -0.014817
Has_JavaScript      -0.015086
Has_NextJS          -0.017373
Has_Java            -0.017985
Has_CSharp          -0.020396
Has_NodeJS          -0.024481
Has_SpringBoot      -0.035468
Has_MongoDB         -0.095041

Terraform, AWS, and Kubernetes are the strongest positive signals, while MongoDB and Spring Boot trend negative. None of these are strong correlations on their own, but together they'll give our model meaningful information to work with.

Learning Insight: It might seem strange that some popular skills like JavaScript and Java have slightly negative correlations with salary. This likely reflects the distribution of who uses them: JavaScript developers include a lot of front-end engineers and early-career learners, while Terraform and Kubernetes users tend to be more senior infrastructure specialists. Correlation doesn't tell the whole story, which is exactly why we're building a model rather than just sorting by correlation values.
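One quick way to sanity-check a surprising correlation like this is to compare group medians directly. A sketch with made-up numbers (not values from this dataset) showing how a skill's users can span very different salary bands:

```python
import pandas as pd

# Made-up illustration: a skill can trend negative overall even though
# some of its users earn a lot
demo = pd.DataFrame({
    'Has_JavaScript':      [1, 1, 1, 0, 0, 0],
    'ConvertedCompYearly': [55000, 60000, 150000, 90000, 95000, 100000],
})

# Compare group medians instead of relying on one correlation number
medians = demo.groupby('Has_JavaScript')['ConvertedCompYearly'].median()
print(medians.to_dict())  # {0: 95000.0, 1: 60000.0}
```

Running the same groupby on the real df_model for any Has_ column is a cheap way to see whether a weak correlation hides subgroup structure.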

Now we drop the original multi-select string columns and our count features, since the binary features replace them.

cols_to_drop_final = [
    'LanguageHaveWorkedWith', 'DatabaseHaveWorkedWith',
    'PlatformHaveWorkedWith', 'WebframeHaveWorkedWith',
    'ToolsTechHaveWorkedWith',
    'Num_Languages', 'Num_Databases', 'Num_Platforms',
    'Num_Webframes', 'Num_Tools', 'Total_Skills'
]
df_model = df_model.drop(columns=cols_to_drop_final)
print(f"Final shape with binary skill features: {df_model.shape}")
print(f"\nFeature types:")
print(df_model.dtypes.value_counts())
Final shape with binary skill features: (15309, 41)

Feature types:
int64      25
object     14
float64     2

Good progress. We've got 25 clean integer columns from our binary features, but we still have 14 object columns to deal with. Let's tackle those now.

Step 5: Encoding the Remaining Categorical Columns

Let's get a full picture of what's left.

categorical_cols = df_model.select_dtypes(include='object').columns.tolist()
for col in categorical_cols:
    n_unique = df_model[col].nunique()
    print(f"\n{col}: {n_unique} unique values")
    if n_unique <= 20:
        print(df_model[col].value_counts())

After reviewing the output, here's our encoding plan:

Column                 Unique Values   Strategy
MainBranch             2               Boolean (1/0)
Age                    8               Ordinal encoding
EdLevel                8               Ordinal encoding
OrgSize                10              Ordinal encoding
Employment             10              Simplify, then one-hot
RemoteWork             3               One-hot encoding
AISelect               3               One-hot encoding
Industry               12              One-hot encoding
DevType                33              Keep top 10, group rest as "Other", one-hot
Country                140             Keep top 15, group rest as "Other", one-hot
YearsCode              52              Coerce to numeric
YearsCodePro           51              Coerce to numeric
LearnCode              554             Drop (too fragmented)
OpSysProfessional use  947             Drop (too fragmented)

Let's walk through each of these.

Drop the unfixable columns

LearnCode has 554 unique combinations of responses, and OpSysProfessional use has 947. There's no clean way to encode either of these without adding an enormous amount of noise to the model.

df_model = df_model.drop(columns=['LearnCode', 'OpSysProfessional use'])

Ordinal encoding for age, education, and organization size

These columns have a natural order to them: a bachelor's degree ranks higher than a high school diploma, and a 10,000-person company ranks larger than a 10-person startup. Ordinal encoding captures that relationship by assigning a number to each category.

age_order = {
    'Under 18 years old': 0, '18-24 years old': 1,
    '25-34 years old': 2, '35-44 years old': 3,
    '45-54 years old': 4, '55-64 years old': 5,
    '65 years or older': 6, 'Prefer not to say': 2
}
df_model['Age_Ordinal'] = df_model['Age'].map(age_order)
df_model = df_model.drop(columns=['Age'])

edlevel_order = {
    'Primary/elementary school': 0,
    'Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)': 1,
    'Some college/university study without earning a degree': 2,
    'Associate degree (A.A., A.S., etc.)': 3,
    "Bachelor's degree (B.A., B.S., B.Eng., etc.)": 4,
    "Master's degree (M.A., M.S., M.Eng., MBA, etc.)": 5,
    'Professional degree (JD, MD, Ph.D, Ed.D, etc.)': 6,
    'Something else': 2
}
df_model['EdLevel_Ordinal'] = df_model['EdLevel'].map(edlevel_order)
df_model = df_model.drop(columns=['EdLevel'])

orgsize_order = {
    'Just me - I am a freelancer, sole proprietor, etc.': 0,
    '2 to 9 employees': 1, '10 to 19 employees': 2,
    '20 to 99 employees': 3, '100 to 499 employees': 4,
    '500 to 999 employees': 5, '1,000 to 4,999 employees': 6,
    '5,000 to 9,999 employees': 7, '10,000 or more employees': 8,
    "I don't know": 4
}
df_model['OrgSize_Ordinal'] = df_model['OrgSize'].map(orgsize_order)
df_model = df_model.drop(columns=['OrgSize'])

Simplifying employment status

Employment has 10 unique combinations, many of them overlapping. We simplify these down to four categories based on what matters most.

def simplify_employment(emp):
    if pd.isna(emp):
        return 'Unknown'
    elif 'Employed, full-time' in emp and 'Independent contractor' in emp:
        return 'Full-time + Freelance'
    elif 'Employed, full-time' in emp:
        return 'Full-time'
    elif 'Independent contractor' in emp or 'freelancer' in emp or 'self-employed' in emp:
        return 'Freelance'
    elif 'part-time' in emp:
        return 'Part-time'
    else:
        return 'Other'

df_model['Employment_Simple'] = df_model['Employment'].apply(simplify_employment)
df_model = df_model.drop(columns=['Employment'])

print("\nSimplified Employment:")
print(df_model['Employment_Simple'].value_counts())
Simplified Employment:
Employment_Simple
Full-time                12776
Full-time + Freelance     1462
Freelance                  866
Part-time                  205

One-hot encoding low-cardinality columns

For columns with a small number of unique values and no natural ordering, we use pd.get_dummies(). This creates a new binary column for each category, so a column like RemoteWork with values of Remote, Hybrid, and In-person becomes three separate 0/1 columns.

one_hot_cols = ['MainBranch', 'RemoteWork', 'AISelect', 'Employment_Simple']
for col in one_hot_cols:
    dummies = pd.get_dummies(df_model[col], prefix=col, drop_first=True)
    df_model = pd.concat([df_model, dummies], axis=1)
    df_model = df_model.drop(columns=[col])

industry_dummies = pd.get_dummies(df_model['Industry'], prefix='Industry', drop_first=True)
df_model = pd.concat([df_model, industry_dummies], axis=1)
df_model = df_model.drop(columns=['Industry'])

Coercing years of experience to numeric

YearsCode and YearsCodePro should be numbers, but they're stored as objects because some responses include text like "Less than 1 year". We convert them to numeric and let anything non-numeric become a null, which we'll clean up shortly.

df_model['YearsCode'] = pd.to_numeric(df_model['YearsCode'], errors='coerce')
df_model['YearsCodePro'] = pd.to_numeric(df_model['YearsCodePro'], errors='coerce')
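If you'd rather not lose those rows, one alternative is to map the text responses to numbers before coercing. A sketch, where the 0.5 and 51 stand-ins are assumptions rather than values from the tutorial:

```python
import pandas as pd

# Sketch: map the survey's text responses to numbers before coercing,
# so they don't silently become nulls. 0.5 and 51 are assumed stand-ins.
years = pd.Series(['5', 'Less than 1 year', '12', 'More than 50 years'])

years_numeric = pd.to_numeric(
    years.replace({'Less than 1 year': 0.5, 'More than 50 years': 51}),
    errors='coerce'
)
print(years_numeric.tolist())  # [5.0, 0.5, 12.0, 51.0]
```

The tradeoff is the usual one: keeping the rows preserves data, but the stand-in values are a judgment call you'd want to note.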

Handling high-cardinality columns: DevType and Country

DevType has 33 unique values and Country has 140. One-hot encoding these directly would add a huge number of columns, most of which would have very little data. Instead, we keep the most common values and group everything else into "Other".

# DevType: keep top 10
top_10_devtypes = df_model['DevType'].value_counts().head(10).index
df_model['DevType_Grouped'] = df_model['DevType'].apply(
    lambda x: x if x in top_10_devtypes else 'Other'
)
devtype_dummies = pd.get_dummies(df_model['DevType_Grouped'], prefix='DevType', drop_first=True)
df_model = pd.concat([df_model, devtype_dummies], axis=1)
df_model = df_model.drop(columns=['DevType', 'DevType_Grouped'])

# Country: keep top 15
top_15_countries = df_model['Country'].value_counts().head(15).index
df_model['Country_Grouped'] = df_model['Country'].apply(
    lambda x: x if x in top_15_countries else 'Other'
)
country_dummies = pd.get_dummies(df_model['Country_Grouped'], prefix='Country', drop_first=True)
df_model = pd.concat([df_model, country_dummies], axis=1)
df_model = df_model.drop(columns=['Country', 'Country_Grouped'])

Let's verify our work.

print(f"\nColumn types:")
print(df_model.dtypes.value_counts())
print(f"\nAll numeric? {df_model.select_dtypes(include='object').shape[1] == 0}")

null_counts = df_model.isnull().sum()
print(f"\nRemaining nulls:")
print(null_counts[null_counts > 0])
Column types:
bool       44
int64      27
float64     5

All numeric? True

Remaining nulls:
YearsCode           20
YearsCodePro       236
OrgSize_Ordinal    140

No more object columns. We introduced a small number of nulls during the numeric coercion step, so we'll do one final drop.

df_model = df_model.dropna()
print(f"\nShape after dropping nulls: {df_model.shape}")
print(f"Rows lost: {15309 - len(df_model)}")
print(f"Percentage retained: {len(df_model)/15309*100:.1f}%")

print(f"\n=== FINAL CLEAN DATASET ===")
print(f"Rows: {len(df_model)}")
print(f"Columns: {df_model.shape[1]}")
print(f"Features: {df_model.shape[1] - 1}")
print(f"Target: ConvertedCompYearly")
Shape after dropping nulls: (14924, 76)
Rows lost: 385
Percentage retained: 97.5%

=== FINAL CLEAN DATASET ===
Rows: 14924
Columns: 76
Features: 75
Target: ConvertedCompYearly

We retained 97.5% of our data in this final step. We now have a clean, fully numeric dataset with 14,924 rows and 75 features, ready for modeling.

Learning Insight: We went from 33 columns to 76 columns. That might seem like a lot, but most of those new columns came from one-hot encoding categorical variables. Each binary column is just a yes/no signal, and a well-structured linear model can handle this number of features well. In Part 2, we'll also trim the features that contribute very little signal before training.

What We Built

Let's recap the transformation pipeline we just built:

  1. Dropped 12 columns with >50% missing data or low relevance to salary
  2. Removed rows with any remaining null values, bringing us from 48,019 to 16,460 rows
  3. Filtered salary outliers below $10,000 and above $500,000
  4. Explored count-based features for multi-select skill columns, then replaced them with 25 binary skill indicators
  5. Applied ordinal encoding to Age, EdLevel, and OrgSize
  6. Simplified and one-hot encoded Employment, RemoteWork, AISelect, and Industry
  7. Grouped high-cardinality columns (DevType, Country) and one-hot encoded the results
  8. Coerced years experience columns to numeric and dropped the remaining nulls

That's a substantial pipeline, and every step required judgment calls about tradeoffs between keeping data and maintaining quality. The decisions we made here will directly affect how the model performs in Part 2.

Next Steps

In Part 2, we'll take this clean dataset and build the actual salary prediction model. We'll calculate feature correlations with salary, address multicollinearity among the experience columns, train a linear regression model, and then interpret what the results tell us about what actually drives developer compensation.

In the meantime, head into the Dataquest app to work through the project yourself. The solution notebook is available for reference on GitHub. If you hit any snags, drop a question in the Dataquest Community Forum.

If you're newer to machine learning concepts and want to build a stronger foundation before diving in, check out the Machine Learning in Python course. It covers the core ideas behind supervised learning, linear regression, and model evaluation that we'll be applying in Part 2.



from Dataquest https://ift.tt/ufMUFEv