When you're working with data in Python, for loops are a powerful tool that can speed things up. But they can also be a little bit confusing when you're just starting out. In this tutorial, we're going to dive headfirst into for loops and learn how they can be used to do all sorts of interesting things when you're doing data cleaning or data analysis in Python.
Please check out our Comprehensive Python Guide for a structured learning experience.
What Are for Loops?
In most data science tasks, Python for loops let you "loop through" an iterable object (such as a list, tuple, or set), processing one item at a time. During each iteration (i.e., each pass of the loop), Python updates an iteration variable to represent the current item, which you can then use within the loop's code block. For example, you could loop through a list of numbers, accessing each number individually to perform calculations or other actions.
An iterable object is any Python object that you can iterate over, accessing its items one by one. For example, lists and tuples are iterables that let you access each item in the order they appear. Strings are also iterable, allowing you to loop through each character individually, from start to finish. Dictionaries are iterable too, but looping through them gives you their keys, which you can then use to access corresponding values. Many objects in Python are considered iterable, but in this beginner's tutorial, we'll focus on the most common types.
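As a minimal sketch of the iterables described above, the snippet below loops over a tuple, a string, and a dictionary. The variable names and sample values are made up for illustration and are not part of this tutorial's data set.

```python
# Hypothetical sample data, invented for this illustration
coordinates = (3, 7)
word = 'EV'
car_prices = {'Chevy Bolt': 36620, 'Hyundai Ioniq EV': 30315}

# A tuple yields its items in the order they appear
for value in coordinates:
    print(value)

# A string yields one character at a time, from start to finish
for character in word:
    print(character)

# A dictionary yields its keys; each key can then be used to look up its value
for car in car_prices:
    print(car, car_prices[car])
```

Note that the dictionary loop never touches the values directly: it receives each key and uses it to index back into the dictionary.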
Creating a for Loop
To create a Python for loop, you start by defining an iteration variable and the iterable object you want to loop through. The iteration variable temporarily holds each item from the iterable during each loop. Then, inside the loop, you specify the actions you want to perform using this variable. For example, when looping through a list, you first choose a variable name to represent each item, specify the list itself, and then decide what you'd like to do with each list item.
Let's look at a quick example. Imagine we have a Python list of names, and we want to print each name individually. Below, we'll create our list and use a for loop to iterate through it, printing each item in order.
our_list = ['Lily', 'Brad', 'Fatima', 'Zining']
for name in our_list:
    print(name)
You might notice we didn't explicitly define the name variable before our loop, so where does it come from? This variable, called the iteration variable, is automatically created by the for loop. During each iteration, Python updates it to represent the current item from the iterable. You can actually name it anything you like (Python doesn't mind), but it's best to choose something clear and descriptive.
So, in the example above:
- name refers to 'Lily' during the first iteration of the loop...
- ...then 'Brad' on the second iteration...
- ...then 'Fatima', and so on until the loop ends.
This works exactly the same no matter what we name our variable. If we rewrite our code using x instead of name, we'll get identical results:
for x in our_list:
    print(x)
This looping technique applies to any iterable object. Strings, for example, are also iterable, so you can loop through them to access each character individually:
for letter in 'Hello':
    print(letter)
Using for Loops with Lists of Lists
In actual data analysis work, it's unlikely that we're going to be working with short, simple lists like the one above, though. Generally, we'll have to work with data sets in a table format, with multiple rows and columns. This kind of data can be stored in Python as a list of lists, where each row of a table is stored as a list within the list of lists, and we can use for loops to iterate through these as well.
To learn how to do this, let's take a look at a more realistic scenario and explore this small data table that contains some US prices and US EPA range estimates for several electric cars.
vehicle | range | price |
Tesla Model 3 LR | 310 | 49900 |
Hyundai Ioniq EV | 124 | 30315 |
Chevy Bolt | 238 | 36620 |
We can express this same data set as a list of lists, like so:
ev_data = [['vehicle', 'range', 'price'],
           ['Tesla Model 3 LR', '310', '49900'],
           ['Hyundai Ioniq EV', '124', '30315'],
           ['Chevy Bolt', '238', '36620']]
You may have noticed that in the list above, our range and price values are actually stored as strings rather than integers. It's not uncommon that you'll get data stored in this way, but for analysis, we'd want to convert those strings into integers so we can do some calculations with them. Let's use a for loop to iterate through our list of lists, selecting the range entry in each list and changing it from a string to an integer.
To do that, we need to do a few things. First, we need to skip the first row in our table, since those are the column names and we will get an error if we attempt to convert a non-numerical string like 'range' into an integer. We skip the first row using list slicing to select each row after the first row using ev_data[1:]. (If you need to brush up on this, or any other aspects of lists, check out our interactive course on Python programming fundamentals).
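Here is a small sketch of how that slice behaves. The table below is a stand-in with made-up values (it only mirrors the shape of ev_data), and the names header and data_rows are chosen just for this illustration.

```python
# A stand-in table with hypothetical values, shaped like ev_data
table = [['vehicle', 'range'],
         ['Car A', '100'],
         ['Car B', '250']]

header = table[0]      # the first row holds the column names
data_rows = table[1:]  # every row from index 1 through the end

print(header)
print(data_rows)

# The slice is a new outer list, but it holds the same row objects,
# so changing a row inside a loop over table[1:] also changes table
print(data_rows[0] is table[1])
```

That last point is why assigning back into a row while looping over a slice still updates the original data set.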
Then, we'll loop through the list of lists, and for each iteration we'll select the element in the range column, which is the second column in our table. We'll assign the value found in this column to a variable called ev_range. To do this, we'll use the index number 1 (in Python, the first entry in an iterable is at index 0, the second entry is at index 1, etc.).
Finally, we'll convert the range numbers to integers using Python's built-in int() function, and replace the original strings with these integers in our data set.
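As a quick sketch of what int() does with these values, and of why the header row has to be skipped, the snippet below converts a numeric string and then tries a column name. The try/except wrapper and the variable names are additions for this illustration only, so the failed conversion doesn't stop the script.

```python
# A numeric string converts cleanly and can then be used in arithmetic
converted = int('310')
print(converted + 1)

# A non-numeric string such as a column name raises ValueError
header_converts = True
try:
    int('range')
except ValueError:
    header_converts = False

print(header_converts)
```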
for row in ev_data[1:]:  # loop through each row in ev_data starting with row 2 (index 1)
    ev_range = row[1]  # each car's range is found in column 2 (index 1)
    ev_range = int(ev_range)  # convert each range value from a string to an integer
    row[1] = ev_range  # assign ev_range, which is now an integer, back to index 1 in each row
print(ev_data)
[['vehicle', 'range', 'price'], ['Tesla Model 3 LR', 310, '49900'], ['Hyundai Ioniq EV', 124, '30315'], ['Chevy Bolt', 238, '36620']]
Now that we've got those values stored as integers, we can also use a for loop to do some calculations. Let's say, for example, that we want to figure out the average range of an EV on this list. We'd need to add the range numbers together, and then divide the sum by the total number of cars in our list.
Again, we can use a for loop to select the specific column we need within our data set. We'll start by creating a variable called total_range where we can store the sum of the ranges. Then we'll write another for loop, again skipping the header row, and again identifying the second column (index 1) as the range value.
After that, all we need to do is add this value to total_range within our for loop, and then calculate the average by dividing total_range by the number of cars after the loop has completed.
Note: We'll calculate the number of cars by counting the length of our list, minus the header row, in the code below. With a list as short as ours, we could also simply divide by 3, since the number of cars is very easy to count, but that would break our calculation if additional car data was added to the data set. For that reason, it's better to use len() to calculate the length of our car list in code so that if additional entries are added to our data set in the future, we can re-run this code and it will still produce the correct answer.
total_range = 0  # create a variable to store the sum of range values
for row in ev_data[1:]:  # loop through each row in ev_data starting with row 2 (index 1)
    ev_range = row[1]  # each car's range is found in column 2 (index 1)
    total_range += ev_range  # add this number to the number stored in total_range
number_of_cars = len(ev_data[1:])  # calculate the length of our list, minus the header row
print(total_range / number_of_cars)  # print the average range
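To illustrate why len() is the safer choice, here is a sketch that adds one more row and reruns the same calculation. The 'Example EV' row is entirely hypothetical, invented only to show the effect of extra data, and the table is redefined here with the ranges already converted to integers so that the snippet runs on its own.

```python
ev_data = [['vehicle', 'range', 'price'],
           ['Tesla Model 3 LR', 310, 49900],
           ['Hyundai Ioniq EV', 124, 30315],
           ['Chevy Bolt', 238, 36620]]

# Hypothetical extra row, invented only for this illustration
ev_data.append(['Example EV', 200, 40000])

total_range = 0
for row in ev_data[1:]:      # skip the header row, as before
    total_range += row[1]

number_of_cars = len(ev_data[1:])  # now 4, with no change to the code
average_range = total_range / number_of_cars
print(average_range)               # 218.0
```

Had the divisor been hard-coded as 3, the result would have silently become wrong as soon as the fourth row was added.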
Python for loops are powerful, and you can nest more complex instructions inside of them. To demonstrate this, let's repeat the above two steps for our 'price' column, this time within a single for loop:
total_price = 0  # create a variable to store the sum of price values
for row in ev_data[1:]:  # loop through each row in ev_data starting with row 2 (index 1)
    price = row[2]  # each car's price is found in column 3 (index 2)
    price = int(price)  # convert each price number from a string to an integer
    row[2] = price  # assign price, which is now an integer, back to index 2 in each row
    total_price += price  # add each car's price to total_price
number_of_cars = len(ev_data[1:])  # calculate the length of our list, minus the header row
print(total_price / number_of_cars)  # print the average price
We can also nest other elements, like if or else statements and even other for loops, within our for loops.
As an example, imagine we wanted to find every car with a range of greater than 200 miles in our list. We can start by creating a new empty list to hold our long-range car data. Then, we'll use a for loop to iterate through ev_data, the list of lists containing car data we created earlier, appending a car's row to our long-range list only if its range value is above 200:
long_range_car_list = []  # creating a new list to store our long range car data
for row in ev_data[1:]:  # iterate through ev_data, skipping the header row
    ev_range = row[1]  # assign the range number, which is at index 1 in the row, to the ev_range variable
    if ev_range > 200:  # append the whole row to long-range list if range is higher than 200
        long_range_car_list.append(row)
print(long_range_car_list)
[['Tesla Model 3 LR', 310, 49900], ['Chevy Bolt', 238, 36620]]
These operations would also be simple to perform by hand with such a tiny data set, of course. But these same techniques will work on data sets with thousands and thousands of rows, which can make quick work of cleaning, sorting, and analyzing huge data sets.
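The earlier remark that a for loop can contain another for loop can be sketched with the same kind of list of lists. In this illustrative example (the two-row table is a stand-in, and string_count is a name chosen here), the outer loop visits each row, the inner loop visits each value in that row, and an if statement counts how many values are still stored as strings.

```python
# Stand-in rows: ranges already converted to integers, prices still strings
table = [['Tesla Model 3 LR', 310, '49900'],
         ['Chevy Bolt', 238, '36620']]

string_count = 0
for row in table:          # outer loop: one row at a time
    for value in row:      # inner loop: one value within the current row
        if isinstance(value, str):
            string_count += 1

print(string_count)        # 4: two vehicle names and two price strings
```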
Other Useful Techniques: Range, Break, and Continue
You can get a surprising amount of mileage out of for loops just by mastering the techniques described above, but let's dive even deeper and learn a few other things that may be helpful, even if you use them a bit less frequently in the context of data science work.
for loops can be used in tandem with Python's range() function to iterate through each number in a specified range. For example:
for x in range(5, 9):
    print(x)
Note that Python doesn't include the maximum value of a range in the range count, which is why the number 9 doesn't appear above. If we wanted this code to count from 5 to 9 including 9, we'd need to change range(5, 9) to range(5, 10):
for x in range(5, 10):
    print(x)
If you only specify a single number in your range() function, Python will treat that as the maximum value, and assign a default minimum value of zero:
for x in range(3):
    print(x)
You can even add a third argument to the range() function to specify that you'd like to count in increments of a specific number. As you can see above, the default value is 1, but if you add a third argument of 3, for example, you can use range() with a for loop to count up in threes:
for x in range(0, 9, 3):
    print(x)
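One convenient way to check what a given range() call will produce, without writing a loop, is to pass it to list(). The sketch below does that for the three forms discussed in this section; the variable names are chosen just for this illustration.

```python
two_arguments = list(range(5, 9))       # start at 5, stop before 9
one_argument = list(range(3))           # start defaults to 0, stop before 3
three_arguments = list(range(0, 9, 3))  # count up in steps of 3, stop before 9

print(two_arguments)    # [5, 6, 7, 8]
print(one_argument)     # [0, 1, 2]
print(three_arguments)  # [0, 3, 6]
```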
By default, a Python for loop will loop through each possible iteration of the iterable object you've assigned it. Normally when we're using a for loop, that's fine, because we want to perform the same action on each item in our list (for example).
Sometimes, though, we may want to stop our loop if a certain condition is met. In that circumstance, the break statement is useful. When used with an if statement inside of a for loop, break allows us to break out of that loop before its conclusion.
Let's take a look at a quick example first, using the list of names we created earlier called our_list:
for name in our_list:
    break
    print(name)
When we run this code, nothing is printed. That's because the break statement comes before print(name) in our for loop. When Python sees break, it stops executing the for loop and code that appears after break in the loop doesn't get run.
Let's add an if statement to this loop, so that we break out of the loop when Python gets to the name Zining:
for name in our_list:
    if name == 'Zining':
        break
    print(name)
Here, we can see that the name Zining wasn't printed. Here's what's happening with each loop iteration:
- Python checks to see if the first name is 'Zining'. It isn't, so it continues executing the code below our if statement, and prints the first name.
- Python checks to see if the second name is 'Zining'. It isn't, so it continues executing the code below our if statement, and prints the second name.
- Python checks to see if the third name is 'Zining'. It isn't, so it continues executing the code below our if statement, and prints the third name.
- Python checks to see if the fourth name is 'Zining'. It is, so break is executed and the for loop ends.
Let's return to the code we wrote for collecting long-range EV car data and work through one more example. We'll insert a break statement that stops the loop as soon as it encounters the string 'Tesla':
long_range_car_list = []  # creating our empty long-range car list again
for row in ev_data[1:]:  # iterate through ev_data as before looking for cars with a range > 200
    ev_range = row[1]
    if ev_range > 200:
        long_range_car_list.append(row)
    if 'Tesla' in row[0]:  # but if 'Tesla' appears in the vehicle column, end the loop
        break
print(long_range_car_list)
[['Tesla Model 3 LR', 310, 49900]]
In the code above, we can see that the Tesla was still added to long_range_car_list, because we appended it to that list before the if statement where we used break. The Chevy Bolt was not added to our list, because although it does have a range of more than 200 miles, break ended the loop before Python reached the Chevy Bolt row.
(Remember, for loops execute in sequential order. If the Bolt was listed before the Tesla in our original data set, it would have been included in long_range_car_list.)
When we're looping through an iterable object like a list, we might also encounter situations where we'd like to skip a particular row or rows. For simple situations like skipping a header row, we can use list slicing, but if we want to skip rows based on more complex conditions, this quickly becomes impractical. Instead, we can use the continue statement to skip a single iteration ("loop") of a for loop and move to the next.
When Python sees continue while executing a for loop on a list, for example, it will stop at that point and move on to the next item on the list. Any code that comes below the continue will not be executed.
Let's go back to our list of names (our_list) and use continue with an if statement to end a loop iteration before printing if the name is 'Brad':
for name in our_list:
    if name == 'Brad':
        continue
    print(name)
Above, we can see that Brad's name was skipped, and the rest of the names in our list were printed in sequence. That illustrates the difference between break and continue in a nutshell:
- break ends the loop entirely. When Python executes break, the for loop is over.
- continue ends a specific iteration of the loop and moves to the next item in the list. When Python executes continue, it moves immediately to the next loop iteration, but it does not end the loop entirely.
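To make that contrast concrete, here is a small side-by-side sketch. It runs the same loop twice over the list of names, once with break and once with continue, and collects what each version would have printed. The two result lists are names chosen for this illustration.

```python
names = ['Lily', 'Brad', 'Fatima', 'Zining']

# With break: the loop stops for good as soon as it reaches 'Brad'
kept_with_break = []
for name in names:
    if name == 'Brad':
        break
    kept_with_break.append(name)

# With continue: only the 'Brad' iteration is skipped
kept_with_continue = []
for name in names:
    if name == 'Brad':
        continue
    kept_with_continue.append(name)

print(kept_with_break)     # ['Lily']
print(kept_with_continue)  # ['Lily', 'Fatima', 'Zining']
```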
To get some more practice with continue, let's make a list of short-range EVs, using continue to take a slightly different approach. Instead of identifying the EVs with less than 200 miles of range, we'll write a for loop that adds every EV to our short-range list, but with a continue statement before we append to the new list that runs if the range is greater than 200:
short_range_car_list = []  # creating our empty short-range car list
for row in ev_data[1:]:  # iterate through ev_data as before
    ev_range = row[1]
    if ev_range > 200:  # if the car has a range of > 200
        continue  # end this iteration here; skip the code below and continue to the next row
    short_range_car_list.append(row)  # append the row to our short-range car list
print(short_range_car_list)
[['Hyundai Ioniq EV', 124, 30315]]
That's probably not the most efficient and readable way to create our short-range car list, but it does demonstrate how continue works, so let's walk through precisely what's happening here.
On its first loop, Python is looking at the Tesla row. That car does have an EV range of more than 200 miles, so Python sees the if statement is True, and executes the continue nested inside that if statement, which makes it immediately jump to the next row of ev_data to begin its next loop.
On the second loop, Python is looking at the next row, which is the Hyundai row. That car has a range of under 200 miles, so Python sees that the conditional if statement is not met, and executes the rest of the code in the for loop, appending the Hyundai row to short_range_car_list.
On the third and final loop, Python is looking at the Chevy row. That car has a range of more than 200 miles, which means the conditional if statement is True. Thus, Python once again executes the nested continue, which concludes this iteration and, since there are no more rows of data in our data set, ends the for loop entirely.
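For comparison, a more direct way to build the same list is to flip the condition and append inside the if statement, with no continue at all. This is only a sketch of one alternative: the rows are repeated here (as ev_rows, a name chosen for this example, with the header row left out) so the snippet runs on its own.

```python
ev_rows = [['Tesla Model 3 LR', 310, 49900],
           ['Hyundai Ioniq EV', 124, 30315],
           ['Chevy Bolt', 238, 36620]]

short_range_car_list = []
for row in ev_rows:
    if row[1] <= 200:  # the opposite of the "> 200" test used with continue
        short_range_car_list.append(row)

print(short_range_car_list)  # [['Hyundai Ioniq EV', 124, 30315]]
```

The condition is <= 200 rather than < 200 so that it matches the continue version exactly, which only skips cars with a range strictly greater than 200.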
Additional Resources
Hopefully at this point, you're feeling comfortable with for loops in Python, and you have an idea of how they can be useful for common data science tasks like data cleaning, data preparation, and data analysis.
Ready to take the next step? Here are some additional resources to check out:
- Python tutorials — Our ever-expanding list of Python tutorials for data science.
- Data Science Courses — Take your studies to the next level with fully interactive programming, data science, and stats courses, right in your browser.
- Python's official documentation on For Loops - The official documentation doesn't go into as much depth as this tutorial, but it does review the basics of for loops and explains some related concepts like While Loops.
- Dataquest's Python Fundamentals for Data Science course - Our Python fundamentals course offers a from-scratch introduction to coding in Python for data science. It covers lists, loops, and a whole lot more, and you can code interactively right from within your browser.
- Dataquest's Intermediate Python for Data Science course - When you feel like you've mastered for loops and other core Python concepts, this is another interactive course that'll help you take your Python skills to the next level.
- Free Datasets for Projects - Practice for loops on your own by grabbing a free data set from one of these sources and applying your new skills to large, real-world data sets. The data sets in the first section (for data visualization) should work particularly well for practice projects since they should already be relatively clean.
Best of luck, and happy looping!
from Dataquest
via RiYo Analytics