Dynamic Pricing with Multi-Armed Bandit: Learning by Doing | by Massimiliano Costacurta | Aug, 2023

Applying Reinforcement Learning techniques to real-world use cases, especially in dynamic pricing, can reveal many surprises

Photo by Markus Spiske on Unsplash

In the vast world of decision-making problems, one dilemma belongs particularly to Reinforcement Learning techniques: exploration versus exploitation. Imagine walking into a casino with rows of slot machines (also known as "one-armed bandits") where each machine pays out a different, unknown reward. Do you explore and play each machine to discover which one has the highest payout, or do you stick with one machine, hoping it's the jackpot? This metaphorical scenario underpins the concept of the Multi-armed Bandit (MAB) problem. The objective is to find a strategy that maximizes the rewards over a series of plays. While exploration offers new insights, exploitation leverages the information you already possess.

Now, transpose this principle to dynamic pricing in a retail scenario. Suppose you're an e-commerce store owner with a new product. You aren't certain about its optimal selling price. How do you set a price that maximizes your revenue? Should you explore different prices to understand customer willingness to pay, or should you exploit a price that has been performing well historically? Dynamic pricing is essentially a MAB problem in disguise. At each time step, every candidate price point can be seen as an "arm" of a slot machine, and the revenue generated from that price is its "reward." Another way to see this is that the objective of dynamic pricing is to swiftly and accurately measure how a customer base's demand reacts to varying price points. In simpler terms, the aim is to pinpoint the demand curve that best mirrors customer behavior.

In this article, we'll explore four Multi-armed Bandit algorithms to evaluate their efficacy against a well-defined (though not straightforward) demand curve. We'll then dissect the primary strengths and limitations of each algorithm and delve into the key metrics that are instrumental in gauging their performance.

Traditionally, demand curves in economics describe the relationship between the price of a product and the quantity of that product consumers are willing to buy. They generally slope downwards, representing the common observation that as price rises, demand typically falls, and vice-versa. Think of popular products such as smartphones or concert tickets. If prices are lowered, more people tend to buy, but if prices skyrocket, even the most ardent fans might think twice.

Yet in our context, we'll model the demand curve slightly differently: we're putting price against probability. Why? Because in dynamic pricing scenarios, especially for digital goods or services, it's often more meaningful to think in terms of the probability of a sale at a given price than to speculate on exact quantities. In such environments, each pricing attempt can be seen as an exploration of the probability of success (or purchase), which can easily be modeled as a Bernoulli random variable with a probability p depending on a given test price.

Here's where it gets particularly interesting: while intuitively one might think the task of our Multi-armed Bandit algorithms is to unearth the ideal price where the probability of purchase is highest, it's not quite so simple. In fact, our ultimate goal is to maximize the revenue (or the margin). This means we're not searching for the price that gets the most people to click 'buy'; we're searching for the price that, when multiplied by its associated purchase probability, gives the highest expected return. Imagine setting a high price at which fewer people buy, but each sale generates significant revenue. On the flip side, a very low price might attract more buyers, yet the total revenue might still be lower than in the high-price scenario. So, in our context, talking about the 'demand curve' is somewhat unconventional, as our target curve will primarily represent the probability of purchase rather than the demand directly.

Now, getting to the math, let's start by noting that consumer behavior, especially when it comes to price sensitivity, isn't always linear. A linear model might suggest that for every incremental increase in price, there's a constant decrement in demand. In reality, this relationship is often more complex and nonlinear. One way to model this behavior is by using logistic functions, which can capture this nuanced relationship more effectively. Our chosen model for the demand curve is then:

Here, a denotes the maximum achievable probability of purchase, while b modulates the sensitivity of the demand curve to price changes. A higher value of b makes the curve steeper, dropping more rapidly toward lower purchase probabilities as the price increases.
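The formula image doesn't survive in this extract; based on the parameter description above and the numbers used later (a = 2 and b = 0.042 yield p ≈ 0.436 at a price of 30.44), the curve is consistent with a logistic of the form p(x) = a / (1 + e^(b·x)). A minimal sketch under that assumption:

```python
import math

def demand(price, a=2, b=0.042):
    # Logistic demand curve: a scales the purchase probability,
    # b controls how fast it decays as the price grows
    return a / (1 + math.exp(b * price))
```

With the parameters used later in the article, demand(30.44) evaluates to roughly 0.436, matching the optimal probability quoted below.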

Four examples of demand curves with different combinations of the parameters a and b

For any given price point, we'll then be able to obtain an associated purchase probability, p. We can then feed p into a Bernoulli random variable generator to simulate the response of a customer to a particular price proposal. In other words, given a price, we can easily emulate our reward function.

Next, we can multiply this function by the price in order to get the expected revenue for a given price point:

Unsurprisingly, this function doesn't reach its maximum where the probability is highest. Also, the price associated with the maximum doesn't depend on the value of the parameter a, while the maximum expected return does.
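Both claims are easy to verify numerically. A small sketch, assuming the logistic demand model described above (the grid search and all names are mine):

```python
import math

def expected_revenue(price, a=2, b=0.042):
    # Expected revenue = price times purchase probability
    return price * a / (1 + math.exp(b * price))

# Scan a price grid and locate the maximum for two different values of a
grid = [i / 100 for i in range(1, 10001)]
best_a2 = max(grid, key=lambda x: expected_revenue(x, a=2))
best_a1 = max(grid, key=lambda x: expected_revenue(x, a=1))
```

Since a only scales the curve, best_a1 and best_a2 coincide (both near 30.44), while the peak revenue itself grows proportionally with a.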

Expected revenue curves with their respective maxima

With some recollection of calculus, we can also derive the formula for the derivative (you'll need to use a combination of both the product and the chain rule). It's not exactly a soothing exercise, but it's nothing too complicated. Here is the analytical expression of the derivative of the expected revenue:

This derivative allows us to find the exact price that maximizes our expected revenue curve. In other words, by using this specific formula in tandem with some numerical algorithms, we can easily determine the price that sets it to 0. This, in turn, is the price that maximizes the expected revenue.

And this is exactly what we need, since by fixing the values of a and b, we'll immediately know the target price that our bandits will have to find. Coding this in Python is a matter of a few lines of code:
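The original gist isn't included in this extract; here's a sketch of what those few lines might look like, using plain bisection on the derivative (all names are mine, and the logistic model with a = 2, b = 0.042 is assumed):

```python
import math

a, b = 2, 0.042

def revenue_derivative(x):
    # d/dx of x * a / (1 + exp(b*x)), via the product and chain rules
    e = math.exp(b * x)
    return a * (1 + e - b * x * e) / (1 + e) ** 2

# The derivative is positive at low prices and negative at high ones,
# so bisection homes in on the root
lo, hi = 1.0, 100.0
for _ in range(60):
    mid = (lo + hi) / 2
    if revenue_derivative(mid) > 0:
        lo = mid
    else:
        hi = mid

target_price = (lo + hi) / 2
optimal_probability = a / (1 + math.exp(b * target_price))
```

This reproduces the target price of about 30.44 and the optimal probability of about 0.436 quoted below.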

For our use case, we'll set a = 2 and b = 0.042, which gives us a target price of about 30.44, associated with an optimal probability of 0.436 (using the unrounded values, the optimal average reward is 30.44*0.436 ≈ 13.26). This price is obviously unknown in general, and it's exactly the price that our Multi-armed Bandit algorithms will seek.

Now that we've identified our targets, it's time to explore various strategies for testing and analyzing their performance, strengths, and weaknesses. While several algorithms exist in the MAB literature, when it comes to real-world scenarios, four main strategies (along with their variations) predominantly form the backbone. In this section, we'll provide a brief overview of these strategies. We assume the reader has a foundational understanding of them; however, for those interested in a more in-depth exploration, references are provided at the end of the article. After introducing each algorithm, we'll also present its Python implementation. Although each algorithm has its own set of parameters, they all share one key input: the arm_avg_reward vector. This vector denotes the average reward garnered from each arm (or action/price) up to the current time step t. This critical input guides all the algorithms in making informed decisions about the next price setting.

The algorithms I'm going to apply to our dynamic pricing problem are the following:

Greedy: This strategy is like always going back to the machine that gave you the most money the first few times you played. After trying out each machine a bit, it sticks with the one that seemed the best. But there might be a problem. What if that machine was just lucky at the start? The Greedy strategy might miss out on better options. On the bright side, the code implementation is really simple:
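The snippet isn't part of this extract, so here's a plausible reconstruction of the Greedy selection (the function name and the random pick at startup are my assumptions):

```python
import random

def greedy(arm_avg_reward):
    # Initial condition: nothing has paid out yet, so pick at random
    if max(arm_avg_reward) == 0:
        return random.randrange(len(arm_avg_reward))
    # Regular condition: exploit the arm with the highest average reward
    return max(range(len(arm_avg_reward)), key=arm_avg_reward.__getitem__)
```

The explicit initial branch avoids the first-arm bias discussed next.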

It's important to distinguish the initial condition (when all rewards are 0) from the regular one. Often, you'll find only the 'else' part implemented, which indeed works even when all rewards are at 0. Yet, this approach can lead to a bias toward the first element. If you make this oversight, you might end up paying for that bias, particularly if the optimal reward happens to be tied to the first arm (yes, I've been there). The Greedy approach is typically the worst-performing one, and we'll primarily use it as our performance baseline.

ϵ-greedy: The ε-greedy (epsilon-greedy) algorithm is a modification that tackles the main drawback of the greedy approach. It introduces a probability ε (epsilon), typically a small value, of selecting a random arm, promoting exploration. With probability 1−ε, it chooses the arm with the highest estimated reward, favoring exploitation. By balancing random exploration against the exploitation of known rewards, the ε-greedy strategy aims to achieve better long-term returns than purely greedy methods. Again, the implementation is quick: it's simply an additional 'if' on top of the Greedy code.
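A sketch of that additional 'if' (names and the default ε are my choices):

```python
import random

def eps_greedy(arm_avg_reward, epsilon=0.1):
    # With probability epsilon, explore a random arm
    if random.random() < epsilon:
        return random.randrange(len(arm_avg_reward))
    # Otherwise exploit, keeping the random pick for the all-zero start
    if max(arm_avg_reward) == 0:
        return random.randrange(len(arm_avg_reward))
    return max(range(len(arm_avg_reward)), key=arm_avg_reward.__getitem__)
```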

UCB1 (Upper Confidence Bound): The UCB1 strategy is like a curious explorer searching for the best restaurant in a new city. While there's a favorite spot they've enjoyed, the allure of potentially discovering an even better place grows with each passing day. In our context, UCB1 combines the rewards of known price points with the uncertainty of those less explored. Mathematically, this balance is achieved through a formula: the average reward of a price point plus an "uncertainty bonus" based on how long it's been since it was last tried. This bonus is calculated as

and represents the "growing curiosity" about the untried price. The hyperparameter C controls the balance between exploitation and exploration, with higher values of C encouraging more exploration of less-sampled arms. By always selecting the price with the highest combined value of known reward and curiosity bonus, UCB1 ensures a mix of sticking with what's known and venturing into the unknown, aiming to uncover the optimal price point for maximum revenue. I'll start with the by-the-book implementation of this approach, but we'll soon see that we need to tweak it a bit.
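The formula image is missing from this extract; the standard UCB1 bonus is C·sqrt(2·ln t / nᵢ), where nᵢ is the number of times arm i has been played. A by-the-book sketch under that assumption (the function name and the play-each-arm-once bootstrap are mine):

```python
import math

def ucb1(arm_avg_reward, arm_counts, t, C=1.0):
    # Bootstrap: play every arm once so all counts are positive
    for i, n in enumerate(arm_counts):
        if n == 0:
            return i
    # Known reward plus the uncertainty bonus C * sqrt(2 ln t / n_i)
    scores = [
        avg + C * math.sqrt(2 * math.log(t) / n)
        for avg, n in zip(arm_avg_reward, arm_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```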

Thompson Sampling: This Bayesian approach addresses the exploration-exploitation dilemma by probabilistically selecting arms based on their posterior reward distributions. When these rewards follow a Bernoulli distribution, representing binary outcomes like success/failure, Thompson Sampling (TS) employs the Beta distribution as a conjugate prior (see this table for reference). Starting with a non-informative Beta(1,1) prior for every arm, the algorithm updates the distribution's parameters upon observing rewards: a success increases the alpha parameter, while a failure augments the beta. During each play, TS draws from the current Beta distribution of each arm and opts for the one with the highest sampled value. This technique allows TS to dynamically adjust based on gathered rewards, adeptly balancing the exploration of uncertain arms against the exploitation of those known to be rewarding. In our specific scenario, although the foundational reward function follows a Bernoulli distribution (1 for a purchase and 0 for a missed purchase), the actual reward of interest is the product of this basic reward and the current price under test. Hence, our implementation of TS will need a slight modification (which will also introduce some surprises).

The change is actually quite simple: to determine the most promising next arm, the samples drawn from the posterior estimates are multiplied by their respective price points (line 3). This modification ensures decisions are anchored to the expected average revenue, shifting the focus away from the highest purchase probability.
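A sketch of the modified selection step (the function name and the Beta-parameter vectors are my assumptions; the price multiplication corresponds to the 'line 3' mentioned above):

```python
import random

def thompson_sampling(alpha, beta, prices):
    # Draw one sample from each arm's Beta posterior
    samples = [random.betavariate(al, be) for al, be in zip(alpha, beta)]
    # Weight each sampled purchase probability by its price, so the
    # choice targets expected revenue rather than raw probability
    expected = [s * p for s, p in zip(samples, prices)]
    return max(range(len(expected)), key=expected.__getitem__)
```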

At this point, having gathered all the key ingredients to build a simulation comparing the performance of the four algorithms in our dynamic pricing context, we must ask ourselves: what exactly will we be measuring? The metrics we choose are pivotal, as they'll guide us in the process of both evaluating and improving the algorithm implementations. In this endeavor, I'm zeroing in on three key indicators:

  1. Regret: This metric measures the difference between the reward obtained by the chosen action and the reward that would have been obtained by taking the best possible action. Mathematically, the regret at time t is given by: Regret(t)=Optimal Reward(t)−Actual Reward(t). Regret, when accumulated over time, provides insight into how much we've "lost" by not always choosing the best action. It's preferred over cumulative reward because it gives a clearer indication of the algorithm's performance relative to the optimal scenario. Ideally, a regret value close to 0 indicates proximity to optimal decision-making.
  2. Reactivity: This metric gauges the speed at which an algorithm approaches a target average reward. Essentially, it's a measure of the algorithm's adaptability and learning efficiency. The quicker an algorithm reaches the desired average reward, the more reactive it is, implying a swifter adjustment to the optimal price point. In our case the target reward is set at 95% of the optimal average reward, which is 13.26. However, the initial steps can exhibit high variability. For instance, a lucky early choice might result in a success from a low-probability arm associated with a high price, quickly reaching the threshold. Because of such fluctuations, I've opted for a stricter definition of reactivity: the number of steps required to reach 95% of the optimal average reward ten times, excluding the initial 100 steps.
  3. Arms Allocation: This indicates the frequency with which each algorithm uses the available arms. Presented as a percentage, it reveals the algorithm's propensity to select each arm over time. Ideally, for the most efficient pricing strategy, we'd want an algorithm to allocate 100% of its choices to the best-performing arm and 0% to the rest. Such an allocation would inherently lead to a regret value of 0, denoting optimal performance.
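As a quick illustration of metric 1, cumulative regret against a fixed optimum can be tracked like this (the function name and the use of the constant 13.26 optimal average reward are my assumptions):

```python
def cumulative_regret(actual_rewards, optimal_avg_reward=13.26):
    # Regret(t) = Optimal Reward(t) - Actual Reward(t), accumulated over time
    total, curve = 0.0, []
    for r in actual_rewards:
        total += optimal_avg_reward - r
        curve.append(total)
    return curve
```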

Evaluating MAB algorithms poses challenges due to the highly stochastic nature of their outcomes. This means that, because of the inherent randomness in the quantities involved, the results can vary greatly from one run to another. For a robust evaluation, the most effective approach is to execute the target simulation multiple times, accumulate the results and metrics from each simulation, and then compute the average.

The first step involves creating a function to simulate the decision-making process. This function will implement the feedback loop represented in the image below.

Feedback loop implemented in the simulation function

This is the implementation of the simulation loop:
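The gist itself is missing from this extract; here is a hypothetical reconstruction of the loop, assuming the logistic reward model and a strategy callable that receives the running averages, the play counts, and the step index:

```python
import math
import random

def run_simulation(prices, nstep, strategy, a=2, b=0.042):
    narms = len(prices)
    counts = [0] * narms
    arm_avg_reward = [0.0] * narms
    rewards = []
    for t in range(1, nstep + 1):
        # Ask the strategy for the next price to propose
        arm = strategy(arm_avg_reward, counts, t)
        # Simulate the customer: Bernoulli purchase with probability p
        p = a / (1 + math.exp(b * prices[arm]))
        reward = prices[arm] if random.random() < p else 0.0
        # Incrementally update that arm's average reward
        counts[arm] += 1
        arm_avg_reward[arm] += (reward - arm_avg_reward[arm]) / counts[arm]
        rewards.append(reward)
    return rewards, counts
```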

The inputs to this function are:

  • prices: A list of candidate prices we wish to test (essentially our "arms").
  • nstep: The total number of steps in the simulation.
  • strategy: The algorithm we aim to test for making decisions on the next price.

Finally, we need to write the code for the outer loop. For each target strategy, this loop will call run_simulation multiple times, collect and aggregate the results from each execution, and then display the results.

For our analysis, we'll use the following configuration parameters:

  • prices: Our price candidates → [20, 30, 40, 50, 60]
  • nstep: Number of time steps in every simulation → 10000
  • nepoch: Number of simulation executions → 1000

Additionally, by setting our price candidates, we can promptly obtain the associated purchase probabilities, which are (approximately) [0.60, 0.44, 0.31, 0.22, 0.15].
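These probabilities follow directly from the logistic demand curve with a = 2 and b = 0.042 (a quick check, rounding to two decimals):

```python
import math

a, b = 2, 0.042
prices = [20, 30, 40, 50, 60]
# Purchase probability of each candidate price under the logistic model
probs = [round(a / (1 + math.exp(b * p)), 2) for p in prices]
```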

After running the simulation, we're finally able to see some results. Let's start with the plot of the cumulative regret:

From the graph, we can see that TS is the winner in terms of mean cumulative regret, but it takes around 7,500 steps to surpass ε-greedy. On the other hand, we have a clear loser, which is UCB1. In its basic configuration, it essentially performs on par with the greedy approach (we'll get back to this later). Let's try to understand the results better by exploring the other available metrics. In all four cases, the reactivity shows very large standard deviations, so we'll focus on the median values instead of the means, as they're more resistant to outliers.

The first observation from the plots is that while TS surpasses ε-greedy in terms of the mean, it slightly lags behind in terms of the median. However, its standard deviation is smaller. Particularly interesting is the reactivity bar plot, which shows how TS struggles to rapidly achieve a favorable average reward. At first, this was counterintuitive to me, but the mechanism behind TS in this scenario clarified matters. We previously mentioned that TS estimates purchase probabilities. Yet, decisions are made based on the product of these probabilities and the prices. Having knowledge of the true probabilities (which, as mentioned, are [0.60, 0.44, 0.31, 0.22, 0.15]) allows us to calculate the expected rewards TS is actively navigating: [12.06, 13.25, 12.56, 10.90, 8.93]. In essence, although the underlying probabilities differ considerably, the expected revenue values are relatively close from its perspective, especially in proximity to the optimal price. This means TS requires more time to discern the optimal arm. While TS remains the top-performing algorithm (and its median eventually drops below that of ε-greedy if the simulation is prolonged), it demands a longer period to identify the best strategy in this context. Below, the arm allocation pies show how TS and ε-greedy do quite well at identifying the best arm (price=30) and using it most of the time during the simulation.

Now let's get back to UCB1. Regret and reactivity confirm that it's basically acting as a fully exploitative algorithm: quick to reach a good level of average reward but with large regret and high variability of the outcome. If we look at the arm allocations, that's even clearer. UCB1 is only slightly smarter than the Greedy approach because it focuses more on the three arms with higher expected rewards (prices 20, 30, and 40). However, it essentially doesn't explore at all.

Enter hyperparameter tuning. It's clear that we need to determine the optimal value of the weight C that balances exploration and exploitation. The first step is to modify the UCB1 code.

In this updated code, I've included the option to normalize the average reward before adding the "uncertainty bonus", which is weighted by the hyperparameter C. The rationale for this is to allow a consistent search range for the best hyperparameter (say 0.5–1.5). Without this normalization, we could achieve similar results, but the search interval would need adjustments based on the range of values we're dealing with each time. I'll spare you the boredom of finding the best C value; it can easily be determined through a grid search. It turns out that the optimal value is 0.7. Now, let's rerun the simulation and examine the results.
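One possible shape of the normalized variant (dividing the running averages by their current maximum is my assumption about how the normalization was done):

```python
import math

def ucb1_normalized(arm_avg_reward, arm_counts, t, C=0.7):
    # Bootstrap: play every arm once first
    for i, n in enumerate(arm_counts):
        if n == 0:
            return i
    # Normalize average rewards into [0, 1] before adding the bonus,
    # so C can be searched in a fixed range such as 0.5-1.5
    max_avg = max(arm_avg_reward) or 1.0
    scores = [
        avg / max_avg + C * math.sqrt(2 * math.log(t) / n)
        for avg, n in zip(arm_avg_reward, arm_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```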

That's quite the plot twist, isn't it? Now, UCB1 is clearly the best algorithm. Even in terms of reactivity, it has only slightly deteriorated compared to the previous run.

Additionally, from the perspective of arm allocation, UCB1 is now the undisputed leader.

  • Theory vs. Experience: Starting with book-based learning is an essential first step when delving into new topics. However, the sooner you immerse yourself in hands-on experience, the faster you'll transform information into knowledge. The nuances, subtleties, and corner cases you encounter when applying algorithms to real-world use cases will offer insights far beyond any data science book you might read.
  • Know Your Metrics and Benchmarks: If you can't measure what you're doing, you can't improve it. Never begin any implementation without knowing the metrics you intend to use. Had I only considered regret curves, I might have concluded, "UCB1 doesn't work." By evaluating other metrics, especially arm allocation, it became evident that the algorithm simply wasn't exploring sufficiently.
  • No One-Size-Fits-All Solutions: While UCB1 emerged as the best choice in our analysis, that doesn't imply it's the universal solution for your dynamic pricing problem. In this scenario, tuning was straightforward because we knew the optimal value we sought. In real life, situations are never so clear-cut. Do you possess enough domain knowledge or the means to test and adjust your exploration factor for the UCB1 algorithm? Perhaps you'd lean towards a reliably effective option like ε-greedy that promises immediate results. Or you might be managing a bustling e-commerce platform, showcasing a product 10,000 times per hour, and you're willing to be patient, confident that Thompson Sampling will attain the maximum cumulative reward eventually. Yeah, life ain't easy.

Finally, let me say that if this analysis seemed daunting, unfortunately it already represents a very simplified scenario. In real-world dynamic pricing, prices and purchase probabilities don't exist in a vacuum; they exist in ever-changing environments and are influenced by various factors. For example, it's highly improbable that purchase probability remains consistent throughout the year, across all customer demographics and regions. In other words, to optimize pricing decisions, we must consider our customers' contexts. This consideration will be the focal point of my next article, where I'll delve deeper into the problem by integrating customer information and discussing Contextual Bandits. So, stay tuned!
