https://ift.tt/UP4ezXs A cool and useful algorithm explained and extended By Carl M. Kadie and Christopher Meek Source: https://openai.c...
A cool and useful algorithm explained and extended
By Carl M. Kadie and Christopher Meek
If I (Carl) interviewed you for a job at Microsoft, I probably asked you this question:
How would you select a random line from a text file of unknown length?
I asked this question in Microsoft Research, in the original anti-spam team, and in an Office Machine Learning/Data Science group. We love the question because it touches on probability, algorithms, and systems issues.
Even having retired from Microsoft, we still think about the random-line question. For example, we recently learned about bytecount. It is a Rust crate that uses SIMD CPU instructions to speed up, among other things, counting the lines in a file. As we’ll describe later, our learning of the crate led — indirectly — to improving our favorite solution to the random-line question.
We’ve organized this article as a series of questions and hints. If you wish, you can try to answer the questions yourself, before seeing our answers.
We gave interviewees this advice for whiteboard interviews:
- Feel free to ask clarifying questions. (In the context of this article, you can read a little further to see if we offer clarifications.)
- Start with a correct algorithm first, even if it might be suboptimal. If we want a better algorithm, we will prompt you for it with follow-up questions.
- Don’t worry about exact syntax. We don’t, for example, care if you remember the exact method for generating a random number. (In this article, we’ll give answers in Python.)
- If you get stuck, we can provide some hints. (Again, in the context of this article, read a little farther to look for hints.)
Let’s start with the questions:
Q: How would you select a random line from a text file of unknown length?
A: We must first clarify the meaning of a “random line”. We mean that every line in the text file has an equal chance of being returned. In other words, we want the algorithm to select among the lines via a uniform distribution.
This requirement means you cannot use this algorithm, that we’ll call AlgoRandomSeek:
- Ask the filesystem for the length of the file.
- Randomly select a file position and seek to that position.
- Return a line near the position.
Sub Q: What's wrong with AlgoRandomSeek?
Sub A: Although the algorithm can return any line in the file, it returns longer lines with higher probability than shorter lines and, thus, does not select lines via a uniform distribution.
Aside: We like being asked for this clarification. It distinguishes the everyday meaning of “random” from the technical meaning used in statistics, data science, decision theory, and machine learning.
Hint: If you were coding in Python and needed to solve the random line problem quickly (but correctly) what would you do?
We asked GitHub Copilot for a solution, and it gave us this code. Call it AlgoReadAll:
This is obviously correct and fast. As a bonus, it uses with open(file_name) as f: to open the file in a context manager, which is good practice. To earn another bonus, you could mention that the code throws an exception on empty files.
On the downside, this code reads the whole file into memory and, so, won’t work on large files. Also (a minor point), f.read().splitlines() can be replaced with f.readlines().
Aside: You could code this algorithm in one line: random.choice(open(file_name).readlines()). We, however, dislike one-line solutions that merely hurt readability.
Let’s follow up by asking for a solution that works on large files:
Q: Using little memory, how could you select a random line from a text file of unknown length?
A: We must first clarify “little memory”. For this question, assume that you can store any line from the file, or several lines, but not all the lines.
AlgoTwoPass solves the problem by 1. counting the lines in the file, 2. randomly selecting a line index, 3. returning the line with that index. (Throughout this article, “index0” is an index that starts its count at 0. If an index starts counting from 1, we’ll call it “index1”.)
On my machine, this outputs:
100,989 of 146,933: made in 1875 and the number of patents soon rapidly increased;
AlgoTwoPassis correct. I’d also give this code a bonus for:
- using a random number generator with an explicit seed — machine learning and data science often require reproducibility, even from randomness.
- [very minor] using Python format strings, including thousand separators.
- mentioning that it returns None when applied to an empty file.
Not required, but interesting would be this more functional implementation:
Aside 1: Background information: This code uses Python iterators. When next() is applied to a Python iterator, the next value in a sequence is returned or, if the sequence is complete, an exception is thrown. The itertools.islice function can skip ahead in an iterator without returning a value. The expression open(file_name) returns an object that is, among other things, an iterator over the lines in the file. The enumerate()function takes an iterator and returns a new iterator such that each call to next() on the new iterator returns an index starting at 0 and a value from the original iterator.
Aside 2: We would not penalize someone for making an off-by-one error with rng.randint or itertools.islice. We might, however, ask how the code could be tested to check for such errors.
We next follow up by asking for a faster algorithm:
Q: In one pass, can you select a random line from a text file of unknown length?
A: After a hint, we’ll develop an algorithm via a series of sub-questions.
Hint: Think about this recursively.
Sub Q: Put yourself in the position of the program. If we promise you the file contains exactly one line, what should you do?
Sub A: If the file contains exactly one line, just output it.
Sub Q: Whoops, we lied to you. The file may contain exactly one line or two, but no other number of lines. What should you do?
Sub A: With probability 0.5, replace the result we were going to output, with the second line.
Sub Q: Sorry, we lied again! The file may contain exactly one, two, or three lines. What should you do? What is the probability of selection for each line? How can this be generalized?
Sub A: With probability ⅓, replace the result we were going to output, with the third line.
The probability of selection for each line is:
- 1st line: 1 × ½ × ⅔= ⅓
- 2nd line: ½ × ⅔= ⅓
- 3rd line: ⅓= ⅓
So, the probability distribution is uniform. We can generalize this to AlgoOnePass:
Aside: I (Carl) recall one interviewee who started to prove this algorithm’s correctness via induction. I could tell they could easily do it, so I stopped them and moved on. They earned a bonus.
Bonus Q: We never asked this in an interview, but can you write AlgoOnePass recursively?
Bonus A: Here is AlgoOnePassRecurse:
Python’s optional parameters work well with recursion. Sadly, Python crashes if you recurse more than a few thousand times. This code uses the package tail-recursive to avoid that crash. However, even with that, this recursive code runs hundreds of times slower than the iterative code.
Aside: Thinking practically, is one-pass important compared to two-pass? Often not. If you can wait 20 seconds for a random line, you can probably wait 20 to 40 seconds. On the other hand, some data — “streaming data” — can’t be accessed twice, so AlgoOnePass has practical value.
One straightforward generalization to AlgoOnePass, that we won’t cover, is selecting multiple random lines, not just one. Wikipedia describes (more-or-less) AlgoOnePass and this generalization in its article on Reservoir sampling: Simple Algorithm R.
In interviews, this is usually where we stopped with “random line from a file”. However, we recently learned about the Rust crate bytecount. The bytecount crate uses SIMD CPU instructions to speed up, among other things, counting the lines in a file. That got us playing with the problem again. This led to a new approach and an improved algorithm. The new algorithm doesn’t use bytecount. It does, however, in the specific case of returning one random line, outperform the more general Optimal: Algorithm L described in Wikipedia.
Aside: We call this the “new algorithm”, but it may have been discovered earlier. In any case, we hope you’ll find the algorithm and its development interesting.
As before, we’ll present the new algorithm via a series of questions and hints. We have not, however, ever put these questions to an interviewee.
Q: In one pass, can you select a random line from a text file of unknown length such that the random number generator is called much less than the number of lines, n?
A: We must clarify “much less”. We mean that the number of calls to the random number generator is less than O(n) [see Big O notation — Wikipedia]. In other words, reducing the calls by one or by half is not enough. The random number of calls required should grow proportional to, for example, log(n) or sqrt(n).
Hint: First modify AlgoOnePass to print the index of every item that gets assigned to result. Call these the “keep indices”. Run the code on, say, 1,000,000 items.
Outputs:
1 36 41 187 226 403 2,608 5,756 20,162 48,750 912,566 922,409
This says that when we run with random seed 0, the first item (index 1) is kept as a possible result. Then no item is kept until the 36th item, then the 41st. If the iterator contains between 922,409 and 1,000,000 items, then the 922,409th item will be the last item kept and so will be returned.
The keep indices seem to grow roughly exponentially. If that conjecture is true, the number of keep indices is O(log n), where n is the number of items in the iterator.
Sub Q: Can we randomly generate a sequence of keep indices directly?
Sub A: Yes! We’ve detailed our solution in a short, on-line, technical paper [Meek & Kadie, Streaming Random Selection Using the Attenuated Geometric Distribution, 2022]. We call the distribution of keep indices the “Attenuated Geometric Distribution”. We show that if index1is one keep index number then we can generate the next one with:
r = rng.random()
index1 = max(index1+ceil(r * index1 / (1.0 — r)),1)
where rng.random() generates a uniform floating-point value between 0 .0 (inclusive) and 1.0 (exclusive). Bonus: max(…,1) handles the very, very unlikely case of generating 0.0 randomly. Also, recall our convention that “index1” is an index that starting counts from 1 instead of 0.
We can now use this insight to create AlgoOnePassSkip:
We can get some confidence in the code by using it to pick a number between 0 (inclusive) and 100 (exclusive), 100,000 times.
Output plots should look uniform. Which they do:
This algorithm differs from Optimal: Algorithm L (recommended by Wikipedia) in two important ways.
- AlgoOnePassSkip only works for selecting one random item, while Algorithm L can select any specified number of random items.
- When only one random item is desired, AlgoOnePassSkip needs one random number per keep index, while Algorithm L needs two.
Thus, for the special case in which we want only one random item, AlgoOnePassSkip uses half as many random draws as Algorithm L.
Summary
We’ve seen four ways to select an item randomly from a sequence of unknown length. In the context of a text file, the first solution required that the file fit into memory. The next solution used less memory but required two passes through the file. We then used a probability calculation to reduce this to one pass. That one-pass solution required one random number per line. The last solution required many fewer random numbers than lines. It also uses half as many random numbers as the “optimal” (more general) algorithm.
None of our code uses system-level methods such as bytecount to speed up the linear pass through a file. Adding system-level optimizations would be an interesting extension.
Interview Question: Select a Random Line from a File (in Python) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
from Towards Data Science - Medium https://ift.tt/Sd7WHUl
via RiYo Analytics
No comments