
Visualizing the true extent of the curse of dimensionality | by Florin Andrei


Using the Monte Carlo method to visualize the behavior of observations with very large numbers of features

Consider a dataset, made of some number of observations, each observation having N features. If you convert all features to a numeric representation, you could say that each observation is a point in an N-dimensional space.

When N is low, the relationships between points are just what you would expect intuitively. But sometimes N grows very large — this could happen, for example, if you're creating many features via one-hot encoding, etc. For very large values of N, observations behave as if they are sparse, or as if the distances between them are somehow bigger than what you would expect.
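To make the one-hot blow-up concrete, here is a small sketch (not from the original article; the column names and data are made up for illustration) showing how a single categorical column turns into one feature per distinct value:

import pandas as pd

# hypothetical data: one categorical column plus one numeric column
df = pd.DataFrame({
    "city": ["Paris", "Tokyo", "Lima", "Oslo", "Cairo"],
    "price": [12.0, 7.5, 3.2, 9.9, 4.1],
})

# one-hot encoding creates a separate 0/1 column for every distinct city
encoded = pd.get_dummies(df, columns=["city"])
print(encoded.shape)  # (5, 6) here; with thousands of categories, N grows into the thousands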

The phenomenon is real. As the number of dimensions N grows, and all else stays the same, the N-volume containing your observations really does increase in a sense (or at least the number of degrees of freedom becomes larger), and the Euclidean distances between observations also increase. The group of points actually does become more sparse. This is the geometric basis for the curse of dimensionality. The behavior of the models and techniques applied to the dataset will be influenced as a consequence of these changes.
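One simple way to quantify the growth is the main diagonal of the unit cube: the distance between opposite corners is sqrt(N), so it keeps increasing with the number of dimensions. A quick check, added here purely for illustration, using NumPy:

import numpy as np

for d in (2, 10, 100, 10_000):
    corner_a = np.zeros(d)  # one corner of the unit d-cube
    corner_b = np.ones(d)   # the opposite corner
    # the Euclidean distance between opposite corners is sqrt(d)
    print(d, np.linalg.norm(corner_b - corner_a))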

Many things can go wrong if the number of features is very large. Having more features than observations is a typical setup for models overfitting in training. Any brute-force search in such a space (e.g. GridSearch) becomes less efficient — you need more trials to cover the same intervals along any axis. A subtle effect impacts any models based on distance or vicinity: as the number of dimensions grows to some very large values, if you consider any point among your observations, all the other points appear to be far away and somehow nearly equidistant — since these models rely on distance to do their job, the leveling out of differences of distance makes their job much harder. E.g. clustering doesn't work as well if all points appear to be nearly equidistant.
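The near-equidistance effect can be observed numerically. The following sketch (an addition for illustration, not part of the original article) compares the nearest and farthest distances from one random point to all the others; as the dimension grows, the ratio drifts toward 1:

import numpy as np

rng = np.random.default_rng(0)
n_points = 1000

for d in (2, 10, 100, 1000):
    points = rng.random((n_points, d))
    # distances from the first point to all the other points
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    print(f"d={d:4d}  nearest/farthest distance ratio: {dists.min() / dists.max():.3f}")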

For all these reasons, and more, techniques such as PCA, LDA, etc. have been created — in an effort to move away from the peculiar geometry of spaces with very many dimensions, and to distill the dataset down to a number of dimensions more compatible with the actual information contained in it.
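As a minimal example of that distillation step, here is a rough sketch (assuming scikit-learn and purely synthetic data, not code from this article) of PCA reducing a wide feature matrix to a handful of components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 300))  # 500 observations, 300 features (synthetic)

pca = PCA(n_components=10)       # keep only the 10 directions of largest variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)           # (500, 10)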

It is hard to grasp intuitively the true magnitude of this phenomenon, and spaces with more than 3 dimensions are extremely challenging to visualize, so let's do some simple 2D visualizations to support our intuition. There is a geometric basis for the reason why dimensionality can become a problem, and this is what we will visualize here. If you have not seen this before, the results might be surprising — the geometry of high-dimensional spaces is far more complex than the typical intuition is likely to suggest.

Consider a square of size 1, centered at the origin. In the square, you inscribe a circle.

a circle inscribed in a square

That's the setup in 2 dimensions. Now think of the general, N-dimensional case. In 3 dimensions, you have a sphere inscribed in a cube. Beyond that, you have an N-sphere inscribed in an N-cube, which is the most general way to put it. For simplicity, we will refer to these objects as "sphere" and "cube", no matter how many dimensions they have.

The volume of the cube is fixed, it's always 1. The question is: as the number of dimensions N varies, what happens to the volume of the sphere?

Let's answer the question experimentally, using the Monte Carlo method. We will generate a very large number of points, distributed uniformly but randomly within the cube. For each point we calculate its distance to the origin — if that distance is less than 0.5 (the radius of the sphere), then the point is inside the sphere.

random points

If we divide the number of points inside the sphere by the total number of points, that will approximate the ratio of the volume of the sphere to the volume of the cube. Since the volume of the cube is 1, the ratio will be equal to the volume of the sphere. The approximation gets better when the total number of points is large.

In other words, the ratio inside_points / total_points will approximate the volume of the sphere.

The code is very simple. Since we need many points, explicit loops must be avoided. We could use NumPy, but it's CPU-only and single-threaded, so it will be slow. Possible alternatives: CuPy (GPU), Jax (CPU or GPU), PyTorch (CPU or GPU), etc. We will use PyTorch — but the NumPy code would look almost identical.

If you follow the nested torch statements, we generate 100 million random points, calculate their distances to the origin, count the points inside the sphere, and divide the count by the total number of points. The ratio array will end up containing the volume of the sphere in different numbers of dimensions.

The tunable parameters are set for a GPU with 24 GB of memory — adjust them if your hardware is different.

import numpy as np
import torch
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# force CPU
# device = 'cpu'

# reduce d_max if too many ratio values are 0.0
d_max = 22
# reduce n if you run out of memory
n = 10**8

ratio = np.zeros(d_max)

for d in tqdm(range(d_max, 0, -1)):
    torch.manual_seed(0)
    # combine large tensor statements for better memory allocation
    ratio[d - 1] = (
        torch.sum(
            torch.sqrt(
                torch.sum(torch.pow(torch.rand((n, d), device=device) - 0.5, 2), dim=1)
            )
            <= 0.5
        ).item()
        / n
    )

    # clean up memory
    torch.cuda.empty_cache()
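As a sanity check on the Monte Carlo estimates (this is an addition, not part of the original code), the volume of an N-ball of radius r has a known closed form, pi^(N/2) * r^N / gamma(N/2 + 1). Assuming SciPy is available, and reusing d_max and ratio from the code above, the comparison could look like this:

import numpy as np
from scipy.special import gamma

def exact_sphere_volume(d, r=0.5):
    # closed-form volume of a d-ball of radius r
    return np.pi ** (d / 2) * r ** d / gamma(d / 2 + 1)

for d in range(1, d_max + 1):
    print(f"d={d:2d}  exact={exact_sphere_volume(d):.6f}  monte_carlo={ratio[d - 1]:.6f}")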

Let's visualize the results:
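The plotting code isn't included in this excerpt; a minimal matplotlib sketch (my assumption about how the figure could be drawn, reusing d_max and ratio from above) would be:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(range(1, d_max + 1), ratio, marker="o")
ax.set_xlabel("number of dimensions N")
ax.set_ylabel("volume of the inscribed sphere")
plt.show()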

