
Challenges I often encounter when working with adtech datasets

Abhishek Pandey

Data lies at the heart of programmatic ad tech. From powering real-time bidding engines to optimizing supply paths, data-driven decision-making has become an integral part of the programmatic campaign lifecycle, helping marketers extract more value from their campaigns.

Anyone who has built or worked on data-oriented projects in practice can tell you that the quality and quantity of the data involved pose a variety of challenges, in some cases even dictating the eventual approach and outcome of a project. Ad tech data is no different.

Programmatic advertising, an early adopter of big data practices, epitomizes the five V's (volume, velocity, variety, veracity and value) in every sense. Trillions of ad impressions are served programmatically every day at extremely low latency, and this volume is growing by the day. With the growing number of connected screens and platforms, the variety and veracity of the data ad tech vendors handle become increasingly challenging. These considerations present significant obstacles when developing, testing, deploying and updating scalable models in real-time scenarios.

While a majority of the use cases in ad tech revolve around structured data fields, researchers have been working on incorporating a wider variety of features (multi-field categorical features, textual features and visual features) to enhance model performance. Accounting for biases within the dataset relative to campaign objectives is another consideration ML/AI practitioners need to remain cognizant of when developing and testing approaches.

While the scale, variety and veracity of the data keep researchers on their toes, these challenges apply to almost every avenue where AI is leveraged for practical purposes, and most experienced practitioners adapt to them over time or find a way around them.

Advancements in computational capacity and ML toolkits have been able to counter most of the challenges listed above. But even then, certain inherent qualities of ad tech datasets are more complex in nature and require a degree of human intervention to resolve.

This blog focuses on five key challenges I often come across when working with these datasets.

1. Data Imbalance — Response events in ad tech datasets are extremely rare; for actual campaigns the response rate can often be under 1%. This induces a severe class imbalance within the dataset that, apart from presenting significant model training challenges, renders many common model evaluation metrics useless.

Source: https://datascience.aero/predicting-improbable-part-1-imbalanced-data-problem/
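To make the metric problem concrete, here is a minimal sketch (using scikit-learn on synthetic data; the dataset shape and parameters are illustrative, not from any real campaign) of how accuracy collapses into noise at a ~1% response rate while a rank-aware metric such as PR-AUC still separates the models:

```python
# A minimal sketch of why accuracy is misleading at ~1% response rates.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic dataset with a ~1% positive (response) rate.
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.99], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# A trivial "never responds" baseline already scores ~99% accuracy...
print("baseline accuracy:", accuracy_score(y_te, np.zeros_like(y_te)))
print("model accuracy:   ", accuracy_score(y_te, model.predict(X_te)))

# ...whereas PR-AUC exposes the gap between the model and the baseline.
probs = model.predict_proba(X_te)[:, 1]
print("model PR-AUC:", average_precision_score(y_te, probs))
print("baseline PR-AUC (positive rate):", y_te.mean())
```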

While balancing techniques like oversampling and undersampling offer ways of mitigating this problem, they come with their own sets of challenges. Undersampling tends to discard credible information from the dataset, which severely impacts the pattern-recognition capability of the models. Oversampling techniques like SMOTE and ADASYN improve performance on imbalanced data at the expense of computational efficiency, as they inflate the size of the training data. Class-weight balancing often offers a viable alternative, but the right choice varies by use case; a rough comparison is sketched below.
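A minimal sketch of the two alternatives, using scikit-learn and the imbalanced-learn package on synthetic data (all sizes and parameters are illustrative assumptions):

```python
# Class-weight balancing vs. SMOTE oversampling on a ~1% positive dataset.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=50_000, weights=[0.99], random_state=0)

# Option 1: reweight the loss instead of resampling -- no growth in data size.
clf_weighted = LogisticRegression(class_weight="balanced", max_iter=1000)
clf_weighted.fit(X, y)

# Option 2: SMOTE synthesizes minority examples, roughly doubling the data.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(f"rows before SMOTE: {len(y)}, after: {len(y_res)}")
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```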

2. Data Sparsity / High Dimensionality — Ad tech datasets are often cursed with high dimensionality across the different categorical features they collect. One-hot encoding these features is frequently impractical and yields extremely sparse datasets that quickly become unwieldy to store and train on.

The inability to dummy-encode features like site domain and device model, whose cardinality can run into the millions (often due to limited computational resources), makes it impossible to use traditional dimensionality-reduction techniques like PCA, DCT or t-SNE for large-scale problems. Embedding the categorical variables instead, via techniques like graph-based embeddings, count-based embeddings or embedding layers in Keras, can successfully resolve this challenge.
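As one illustration, a hashing layer followed by an embedding layer in Keras caps the dimensionality of a feature like site domain. This is a minimal sketch; the bin count and embedding size are assumptions that would need tuning for a real workload:

```python
# Hash a high-cardinality string feature into a fixed number of bins,
# then learn a dense embedding for the bins (requires TensorFlow 2.6+).
import tensorflow as tf

NUM_BINS = 100_000   # hash millions of raw domains into a fixed vocabulary
EMBED_DIM = 16       # dense embedding size per categorical feature

domain_in = tf.keras.Input(shape=(1,), dtype=tf.string, name="site_domain")
hashed = tf.keras.layers.Hashing(num_bins=NUM_BINS)(domain_in)
embedded = tf.keras.layers.Embedding(NUM_BINS, EMBED_DIM)(hashed)
flat = tf.keras.layers.Flatten()(embedded)
out = tf.keras.layers.Dense(1, activation="sigmoid")(flat)

model = tf.keras.Model(domain_in, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```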

3. Low latency — The entire ad-serving process takes place within fractions of a second (a bid response is typically expected within roughly 100 ms), so practical models need to deliver recommendations with high accuracy at extremely low latency.


The low-latency requirement often forces practitioners to trade off model performance against practical efficacy. Understanding the time and space complexity of the different algorithms helps the practitioner make an informed decision when testing variations and tuning hyperparameters. Factoring time and space considerations into model selection, and setting cut-offs based on the response-time limits of the ecosystem, further helps avoid excessively complex algorithms that would slow the decision-making cycle. A simple benchmark along these lines is sketched below.
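One way to operationalize such a cut-off is to benchmark per-request inference latency during model selection. The sketch below compares two scikit-learn models against an assumed 10 ms budget; the models, data and budget are all illustrative:

```python
# Measure p99 single-request inference latency as a model-selection gate.
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, n_features=50, random_state=0)
LATENCY_BUDGET_MS = 10.0  # assumed share of the bid-response window

for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
    model.fit(X, y)
    # Time single-row predictions, as an RTB engine would issue them.
    timings = []
    for row in X[:500]:
        start = time.perf_counter()
        model.predict_proba(row.reshape(1, -1))
        timings.append((time.perf_counter() - start) * 1000)
    p99 = float(np.percentile(timings, 99))
    verdict = "OK" if p99 <= LATENCY_BUDGET_MS else "too slow"
    print(f"{type(model).__name__}: p99 = {p99:.2f} ms -> {verdict}")
```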

4. Online learning — Online learning is a common technique used in areas of machine learning where it is computationally infeasible to train over the entire dataset, necessitating out-of-core algorithms. It is also used when the algorithm must dynamically adapt to new patterns in the data, or when the data itself is generated as a function of time. Most prediction models in ad tech require continuous updating based on user-response feedback.

Source: Wikipedia

This is also extremely applicable where one starts with a general model and adapts it to a specific campaign setup (for which no prior data is available) based on incoming user responses. The inability to learn and adapt continuously is why tree-based ensembles are rarely deployed in real-time scenarios despite offering superior offline performance.
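As a contrast to batch-trained ensembles, a linear model trained through scikit-learn's out-of-core partial_fit API can absorb feedback batch by batch. A minimal sketch with a simulated feedback stream (the data generation is purely illustrative):

```python
# Online updates with scikit-learn's out-of-core partial_fit API.
import numpy as np
from sklearn.linear_model import SGDClassifier

# Logistic regression trained by SGD ("log_loss" needs scikit-learn >= 1.1).
model = SGDClassifier(loss="log_loss")

rng = np.random.default_rng(0)
classes = np.array([0, 1])  # must be declared on the first partial_fit call

for batch in range(100):  # each batch = a window of fresh user feedback
    X_batch = rng.normal(size=(512, 20))
    y_batch = rng.binomial(1, 0.01, size=512)  # ~1% response rate
    model.partial_fit(X_batch, y_batch, classes=classes)
```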

5. Cold Start — Cold-start problems are more common than you would imagine in programmatic campaigns. Onboarding a new line item or updating targeting parameters can cause significant performance issues for a pre-trained model. The same holds when learnings must be applied to campaigns with no historic data available for model training, where the model is likely to encounter many unseen events. Dynamic creative optimization and audience recommenders are two places where this challenge is especially acute.

Proper pre-campaign planning and expectation setting can help establish relevant baselines that avoid downturns in performance, and training and updating the model on historic learnings can provide a reasonable solution to the pre-campaign problem. Using generic vertical-based models as a starting point also reduces the time and test budget needed before the deployed model can begin to offer incremental results, as sketched below.
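One simple way to realize this warm start with a linear model is to initialize the campaign-specific model from the generic model's weights and then keep adapting it online. A minimal sketch, assuming a shared feature space and using synthetic data throughout:

```python
# Warm-start a campaign model from a generic vertical model's weights.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)

# A generic model trained on historical data from the same vertical.
X_vert = rng.normal(size=(20_000, 30))
y_vert = rng.binomial(1, 0.01, size=20_000)
generic = SGDClassifier(loss="log_loss").fit(X_vert, y_vert)

# New campaign with no history: initialize from the generic weights...
X_live = rng.normal(size=(1_000, 30))
y_live = rng.binomial(1, 0.02, size=1_000)
campaign = SGDClassifier(loss="log_loss")
campaign.fit(X_live, y_live,
             coef_init=generic.coef_, intercept_init=generic.intercept_)

# ...then keep adapting as further campaign-specific feedback arrives.
X_next = rng.normal(size=(512, 30))
y_next = rng.binomial(1, 0.02, size=512)
campaign.partial_fit(X_next, y_next)
```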

While these are some of the common issues I come across, they are not the only ones. A proper EDA of the data source is essential to reveal patterns and biases that might adversely impact the outcomes of a data solution. Only when the data is processed effectively can it deliver models with high efficacy.