Nonlinear Change of Variable in Probability Distributions

https://ift.tt/zhcKGau

See change of variable in probability distributions with Pawan’s games

Relevance in data science

In data science, the probability distribution is an extremely important topic. Probability distributions help data scientists find the patterns present in the data. They help in finding anomalies, generating artificial data, and doing a million more things with data. Feature transformation is just a change of variable in the probability distribution. So, mastering probability distributions goes a long way toward becoming a champion data scientist.

Birthday present

Pawan got two birthday gifts, one from his father and one from his mother. Both parents gave him random number generators, each producing random real numbers to two decimal places. His father’s gift drew a number between 0 and 10, whereas his mother’s gift drew a number between 0 and 100.

Photo by Raychan on Unsplash

Now, Pawan would go to his friends with his father’s gift and ask each of them to choose a range one unit long (for example, 1–2 or 5.5–6.5). The person(s) whose range contained the randomly generated number would be the winner(s). Pawan was happy to host this game and his friends were happy to play it because it was fair: any interval of unit length has the same probability of winning.

Photo by Toa Heftiba on Unsplash

He and his friends played the game a number of times, and it started to get boring. Then, Pawan came up with another game using his mother’s gift. He asked his friends to choose a range as before. Then, he would square the ends of the range and declare the squared range as the new range. Next, he would generate a random number between 0 and 100 using his mother’s gift and declare as the winner the person whose squared range contained the number. The ranges his friends chose lay inside 0–10, but the squared ranges lay inside 0–100, which is how he put his mother’s gift to use.

The second game looked fair to them initially, so they started playing it. After a number of rounds, Pawan and his friends noticed that ranges with higher end values won more often than ranges with lower end values. Ranges like 0–1 and 1–2 won far less often than ranges like 8–9 and 9–10. They had not anticipated this bias, because the same way of choosing ranges had worked well in the first game.
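What the friends observed can be reproduced with a short simulation. This is only a sketch under stated assumptions: it ignores the two-decimal rounding of the generator, considers only the ten ranges with integer endpoints (0–1 up to 9–10), and treats each squared range as half-open so that every draw has exactly one winner. The function name `simulate_second_game` is made up for this illustration.

```python
import random


def simulate_second_game(n_rounds, seed=0):
    """Count how often each range [k, k+1], k = 0..9, wins the second game."""
    rng = random.Random(seed)
    wins = [0] * 10
    for _ in range(n_rounds):
        # Mother's generator (assumed continuous): uniform on [0, 100).
        y = rng.random() * 100
        for k in range(10):
            # The chosen range [k, k+1] is squared to [k^2, (k+1)^2).
            if k * k <= y < (k + 1) * (k + 1):
                wins[k] += 1
                break
    return wins


if __name__ == "__main__":
    for k, count in enumerate(simulate_second_game(100_000)):
        print(f"range {k}-{k + 1}: {count} wins")
```

Running it shows the win counts climbing steadily from the lowest range to the highest, which is the bias the friends noticed.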

Noticing the bias toward the higher values, Pawan and his friends stopped playing the second random number generator game and went to play table tennis instead.

Photo by Ilya Pavlov on Unsplash

Curious Pawan

Pawan could not move on from the unfair second game. After dinner, he went to his room and started thinking about the game. He was staring at the ceiling lying on his bed and a mathematical thought came to his mind — probability distribution. He got up and went to his study table and started working out probability distributions for both games.

Photo by Thought Catalog on Unsplash

I am sure he completed the mathematical analysis of the two games and came up with a proof of why the first game is fair and the second is not. He is such a mathematical fanatic, after all. Let’s work through the analysis ourselves and see if we reach a similar conclusion.

The mathematical viewpoint of the games

First game

In the first game, the random number generator can generate any number between 0 and 10. So, the probability density is zero outside this range and a non-zero constant inside it. To find the constant, we integrate the density, set the result equal to 1, and solve for the constant. Let’s do that first.

The probability density function is:
p(x) = c, where c is some constant and x ∈ [0, 10]; 0 otherwise.

Integrating the density over the interval -∞ to +∞,
∫(-∞ to +∞)p(x)dx = 1 … (I)

The density outside the interval [0, 10] is 0. So, (I) becomes:
∫(0 to 10)p(x)dx = 1

As p(x) is just a constant c in the interval [0, 10],
c∫(0 to 10)dx = 1
or, 10c = 1
So, c = 1/10

So, the probability density function is:
p(x) = 1/10, x ∈ [0, 10]; 0 otherwise. … (II)

This is an example of a continuous uniform distribution.

Now, why would any interval of length 1 in [0, 10] have the same probability?

Suppose the interval is [a, b]. As the interval has length 1, b - a = 1.

So, the probability that the random number falls in the interval [a, b] is:
P(X ∈ [a, b]) = ∫(a to b)p(x)dx = (1/10) * (b - a) = (1/10) * 1 = 1/10

The probability does not depend on the range endpoints if the range length is the same. Hence, no matter what range endpoints you choose, you have the same probability of winning.
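The 1/10 result can also be checked empirically. The sketch below assumes the father's generator is continuous and uniform on [0, 10) (the two-decimal rounding is ignored), and `first_game_win_rate` is a hypothetical helper name used only for this illustration.

```python
import random


def first_game_win_rate(a, n_rounds=100_000, seed=0):
    """Estimate how often the range [a, a + 1] wins the first game."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_rounds):
        # Father's generator (assumed continuous): uniform on [0, 10).
        x = rng.random() * 10
        if a <= x <= a + 1:
            hits += 1
    return hits / n_rounds


if __name__ == "__main__":
    for a in (1, 5.5, 9):
        print(f"range {a}-{a + 1}: win rate {first_game_win_rate(a):.3f}")
```

Whatever starting point a is chosen inside [0, 9], the estimated win rate should hover around 0.1.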

Now we have a proof of why the first game is fair and why everybody liked it.

Second game

In the second game too, the probability distribution is continuous uniform. The only difference is the range.

So, the probability density function is:
p_y(y) = 1/100, y ∈ [0, 100]; 0 otherwise.

So, why is this game biased if it has the same probability density form?

It is because of the variable in which the ranges were chosen. In the first game, Pawan’s friends chose their ranges directly in terms of the generated variable X. In the second game, the generated variable was Y, but the friends did not choose their ranges in terms of Y. They still chose ranges in terms of a variable X on [0, 10], and each range was mapped onto Y by squaring its endpoints.

So, the relationship between the two variables is:
Y = X²

Now, what would be the probability density function in terms of the random variable X? Pawan’s friends only chose X, so we would want to express the probability density function in the second game in terms of X.

This is where the “Change of Variable” of a probability density function comes into play.

Change of variable

A change of variable can be applied to probability distributions too. An existing probability distribution may be expressed in one random variable while we want it in terms of another. In that case, we use this concept. A change of variable can be either linear or nonlinear. The linear case is straightforward; the nonlinear case is a bit different. We will discuss the nonlinear change of variable here and work out the second game mathematically.

The probability density function in the second game is:
p_y(y) = 1/100, y ∈ [0, 100]; 0 otherwise. … (III)

The variable X, in which Pawan’s friends chose their ranges, is related to Y by:
Y = X²

We want to express the density (III) in terms of X so that we can see how the game is biased with respect to X.

Let Δx be the change in the variable x and Δy be the corresponding change in the variable y.

Then, the probability in both the coordinate systems is approximately equal for a small value of Δx.

So, p_x(x)Δx ≈ p_y(y)Δy … (IV)

p_x and p_y represent the probability density functions in terms of the random variables X and Y respectively.

We know p_y but we don’t know p_x and we would like to know p_x so that we can see the probability density function in terms of the random variable X.

(IV) can be rewritten as:
p_x(x) ≈ p_y(y)|Δy/Δx|

We have included the absolute value because a probability density can never be negative, whereas Δy/Δx is negative whenever y decreases as x increases.

Now, as Δx → 0, Δy/Δx → dy/dx and p_x(x) → p_y(y)|dy/dx|. So, the limiting case becomes:
p_x(x) = p_y(y)|dy/dx| … (V)

We have the relationship of y with respect to x given by:
y = f(x) = x²

So, (V) can be written as:
p_x(x) = p_y(f(x))|f’(x)| … (VI)

(VI) gives the general expression for the probability density under a change of variable, provided f is monotonic (one-to-one) over the range of interest, as x² is on [0, 10].
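Formula (VI) translates almost directly into code. The following is a minimal sketch, not a general-purpose implementation: it approximates f’(x) with a central difference, assumes f is monotonic on the range where it is evaluated, and leaves it to the caller to restrict x to that range. The names `transformed_density`, `p_y_uniform`, and `square` are hypothetical and exist only for this illustration.

```python
def transformed_density(p_y, f, x, h=1e-6):
    """Evaluate p_x(x) = p_y(f(x)) * |f'(x)| as in (VI).

    f'(x) is approximated by a central difference with step h.
    Assumes f is monotonic around x; the caller must keep x in a valid range.
    """
    derivative = (f(x + h) - f(x - h)) / (2 * h)
    return p_y(f(x)) * abs(derivative)


def p_y_uniform(y):
    """Density (III): uniform on [0, 100]."""
    return 1 / 100 if 0 <= y <= 100 else 0.0


def square(x):
    """The change of variable used in the second game, y = x^2."""
    return x * x


if __name__ == "__main__":
    for x in (1.0, 5.0, 9.0):
        print(x, transformed_density(p_y_uniform, square, x))
```

For x inside (0, 10) the printed values should match x/50 up to the small error of the numerical derivative.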

Our second example has:
f(x) = x² and f’(x) = 2x.

So, (VI) reduces to:
p_x(x) = p_y(x²)|2x|

p_y is uniform throughout the range [0, 100]. So, if x ∈ [0, 10], then x² ∈ [0, 100] and p_y(x²) = 1/100. Also, |2x| = 2x because x is non-negative.

Hence, the formula reduces further to:
p_x(x) = (1/100) * 2x = x/50, x ∈ [0, 10]; 0 otherwise. … (VII)

(VII) gives the probability density for our second game in terms of the variable in which Pawan’s friends chose their ranges. The uniform density changes drastically when the game is modified this way: the density now grows linearly with x, so the game is clearly biased toward ranges with higher endpoints. No wonder the game didn’t last long, did it?
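As a cross-check, the same density can be reached through the cumulative distribution function. This is an alternative route, not the derivation used above; it relies on X = √Y being non-negative, and the symbol F_X for the cumulative distribution function of X is introduced here only for this check.

```latex
\begin{aligned}
F_X(x) &= P(X \le x) = P(\sqrt{Y} \le x) = P(Y \le x^2)
        = \int_0^{x^2} \frac{1}{100}\,dy = \frac{x^2}{100},
        \quad x \in [0, 10] \\
p_x(x) &= \frac{d}{dx} F_X(x) = \frac{2x}{100} = \frac{x}{50}
\end{aligned}
```

Both routes give x/50 on [0, 10], in agreement with (VII).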

But, wait! Let’s verify that it is a valid probability density. (VII) is non-negative throughout the range [0, 10], so the non-negativity condition is satisfied. Now, let’s check whether it integrates to 1 over the interval -∞ to +∞.

We just need to integrate in the interval [0, 10] as the value of density is 0 elsewhere.

∫(0 to 10) x/50 dx
= 10²/(2 * 50)
= 100/100
= 1

Yes, it integrates to 1. Hence, it is a valid probability distribution.
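The same check can be done numerically. This sketch uses a plain midpoint rule, which is just one convenient choice of integration scheme; `midpoint_integral` and `p_x` are names made up for this illustration.

```python
def midpoint_integral(g, lo, hi, n=10_000):
    """Approximate the integral of g over [lo, hi] with the midpoint rule."""
    width = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * width) for i in range(n)) * width


def p_x(x):
    """Density (VII): x/50 on [0, 10], 0 elsewhere."""
    return x / 50 if 0 <= x <= 10 else 0.0


if __name__ == "__main__":
    print(midpoint_integral(p_x, 0, 10))
```

The printed value should be 1 up to floating-point error, since the midpoint rule is exact for a linear integrand.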

How does the probability change with respect to the range in the second game?

To find the probability of the random number lying in a range [a, b] of length 1, we take the definite integral of the probability density function (VII) from a to b.

So, P(X ∈ [a, b]) = ∫(a to b)p_x(x)dx
= ∫(a to b) x/50 dx
= (1/50) * ∫(a to b) x dx
= (b² - a²)/100
= (b - a)(b + a)/100
= (a + 1 + a)/100
[as b - a = 1, so b = a + 1]
= (2a + 1)/100

So, the winning probability increases linearly with a: it goes from 1/100 for the range 0–1 up to 19/100 for the range 9–10. Hence, ranges with higher end values are more likely to win than ranges with lower end values.
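To see the size of the bias at a glance, the expression (b² - a²)/100 can be tabulated for the ten ranges with integer endpoints. The sketch below uses exact fractions to avoid rounding; `unit_range_probability` is a hypothetical helper name.

```python
from fractions import Fraction


def unit_range_probability(a):
    """Exact winning probability of the range [a, a + 1] in the second game."""
    a = Fraction(a)
    b = a + 1
    return (b * b - a * a) / 100  # equals (2a + 1)/100


# Probabilities for the ranges 0-1, 1-2, ..., 9-10.
table = {a: unit_range_probability(a) for a in range(10)}

if __name__ == "__main__":
    for a, prob in table.items():
        print(f"range {a}-{a + 1}: {prob} = {float(prob):.2f}")
    print("total:", sum(table.values()))
```

The ten probabilities add up to 1, as they must, since these ranges cover [0, 10] without overlap.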

We unveiled the mystery behind the unfair game! Doesn’t that feel satisfying?

Summing up

We explored the change of variable in probability distributions through the example of two games. Change of variable matters in practice because of its relationship with feature transformation: transforming a feature changes its probability distribution in exactly the way shown here. Feature transformation is often the key to building a good model. We showed here the mathematics of “change of variable” with respect to probability distributions.


Nonlinear Change of Variable in Probability Distributions was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.



from Towards Data Science - Medium https://ift.tt/sm7g9XQ
via RiYo Analytics