We consider a problem where customers repeatedly interact with a platform. During each interaction with the platform, the customer is shown an assortment of items and selects among these items according to a Multinomial Logit choice model. The probability that a customer interacts with the platform in the next period depends on the customer's cumulative number of past purchases. The goal of the platform is to maximize the total revenue obtained from each customer over a finite time horizon. We first study a non-learning version of the problem where consumer preferences are completely known. We formulate the problem as a dynamic program and prove structural properties of the optimal policy. Next, we provide a formulation in a contextual episodic reinforcement learning setting, where the parameters governing consumer preferences and return probabilities are unknown and learned over multiple episodes. We develop an algorithm based on the principle of optimism under uncertainty for this contextual reinforcement learning problem and provide a regret bound.