How we handle messy data

Jan 26, 2024

Saiyam Shah

Because our platform works with various data providers, CDPs, and many companies with very different data lakes and schemas, we encounter messy data more often than anything else.

Here's our approach to cleaning up this data so that it is usable and actionable:

What is "messy" data?

"Messy data" can come in various forms. Here's what we see most commonly:

Empty data values
Missing data points
Unstructured form
Duplicate rows
Wrongly formatted columns
Incorrect data types
and more.
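As a rough illustration of what handling a few of these cases can look like, here is a minimal sketch in plain Python. It is not our production cleaner: the function names, the set of "empty" markers, and the choice to lowercase column names are all assumptions made for the example. It covers only empty values, badly formatted column names, and duplicate rows.

```python
from typing import Any

# Assumed spellings of "empty"; real data sources may use others.
EMPTY_MARKERS = {"", "null", "none", "nan", "n/a"}


def normalize_value(value: Any) -> Any:
    """Map empty-looking strings to None and strip stray whitespace."""
    if isinstance(value, str):
        stripped = value.strip()
        return None if stripped.lower() in EMPTY_MARKERS else stripped
    return value


def clean_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Normalize column names and values, then drop exact duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = {
            key.strip().lower(): normalize_value(value)
            for key, value in row.items()
        }
        # A hashable fingerprint of the row; the first occurrence wins.
        fingerprint = tuple(sorted((k, repr(v)) for k, v in normalized.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned
```

Note that two rows only count as duplicates here after normalization, so " u1 " and "u1" collapse into one row.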

Also, this data can come in any of these formats:

CSV
JSON
Parquet
Plain text (txt)
Data tables
Avro
etc.
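One common way to cope with many input formats is to put a small reader per format behind a single entry point, so everything downstream sees the same list of records. The sketch below is a hypothetical example using only the Python standard library, so it covers CSV and JSON only; Parquet and Avro would need third-party libraries and are left out. The names `READERS` and `read_records` are made up for this illustration.

```python
import csv
import io
import json
from typing import Any, Iterator


def read_csv(text: str) -> Iterator[dict[str, Any]]:
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))


def read_json(text: str) -> Iterator[dict[str, Any]]:
    """Accept either a JSON array of objects or newline-delimited JSON."""
    stripped = text.strip()
    if stripped.startswith("["):
        yield from json.loads(stripped)
    else:
        for line in stripped.splitlines():
            if line.strip():
                yield json.loads(line)


# Registry of format name -> reader; new formats are added here.
READERS = {"csv": read_csv, "json": read_json}


def read_records(text: str, fmt: str) -> list[dict[str, Any]]:
    """Parse text in the given format into a list of plain dicts."""
    try:
        reader = READERS[fmt.lower()]
    except KeyError:
        raise ValueError(f"no reader registered for format {fmt!r}") from None
    return list(reader(text))
```

With this shape, supporting another format means registering one more reader rather than touching the cleaning code.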

How we clean messy data

We're aware of all of these possible scenarios, but we don't have the liberty to drastically change our customers' data formats. Instead, we apply a structured, minimal data model, along with a data cleaner and translator, to all the data we ingest.

This allows us to ingest data of any format and any level of consistency into our system, so customers can get onboarded with minimal effort.

Use as little as possible

The minimum data that we require for our learning infrastructure consists of the following fields:

user_id: A unique identifier for the user. (If it is PII, we hash it.)
event_name: This can be anything like app_opened, added_to_cart, order_completed, etc.
timestamp: The timestamp of the event, converted to UTC.
metadata: Any metadata included with the event, in whatever form it arrives. We convert this to a JSON string, so it can be unstructured and can contain any information present in the customer database.
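The four fields above can be sketched as a small data class plus a translation function. This is an illustrative sketch, not our actual implementation: the names `Event`, `hash_pii`, `to_utc`, and `to_event` are hypothetical, SHA-256 is an assumed choice of hash, timestamps are assumed to arrive as ISO 8601 strings, and timestamps without a timezone are assumed to already be in UTC.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any

CORE_FIELDS = {"user_id", "event_name", "timestamp"}


@dataclass(frozen=True)
class Event:
    user_id: str
    event_name: str
    timestamp: datetime  # timezone-aware, always UTC
    metadata: str  # JSON string holding every non-core field


def hash_pii(value: str) -> str:
    """Hash an identifier (assumption: plain SHA-256, for illustration only)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def to_utc(raw: str) -> datetime:
    """Parse an ISO 8601 timestamp and convert it to UTC."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def to_event(record: dict[str, Any], *, user_id_is_pii: bool = False) -> Event:
    """Translate one customer record into the minimal four-field model."""
    missing = CORE_FIELDS - set(record)
    if missing:
        raise ValueError(f"record is missing required fields: {sorted(missing)}")
    user_id = str(record["user_id"])
    # Everything that is not a core field is kept, unstructured, as metadata.
    extras = {k: v for k, v in record.items() if k not in CORE_FIELDS}
    return Event(
        user_id=hash_pii(user_id) if user_id_is_pii else user_id,
        event_name=str(record["event_name"]),
        timestamp=to_utc(str(record["timestamp"])),
        metadata=json.dumps(extras, sort_keys=True, default=str),
    )
```

For example, a record with an email as `user_id`, a timestamp of `2024-01-26T10:00:00+05:30`, and an extra `sku` column would come out with a hashed ID, a timestamp of 04:30 UTC, and `sku` preserved inside the metadata string.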

Parallel Process

We employ a parallel-compute data processing pipeline when we ingest data from the customer into the system. This pipeline performs all of the necessary data cleaning steps and then the translation into our data model.

This allows us to handle huge volumes of data, as it distributes data translation and cleaning across several compute machines and scales vertically whenever needed. Any other custom use cases for formatting data are also baked into the parallel pipeline.
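The general pattern of splitting records into chunks and cleaning each chunk independently can be sketched on a single machine. The example below is only a stand-in for a distributed pipeline: it uses a thread pool from Python's standard library rather than several compute machines, and the names `chunked`, `process_chunk`, and `run_pipeline` are hypothetical. Each step is a function that returns a cleaned record, or `None` to drop it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Iterator, Optional

Record = dict[str, Any]
Step = Callable[[Record], Optional[Record]]


def chunked(records: list[Record], size: int) -> Iterator[list[Record]]:
    """Split the records into consecutive chunks of at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def process_chunk(chunk: list[Record], steps: list[Step]) -> list[Record]:
    """Run every step over every record; a step returning None drops the record."""
    out = []
    for record in chunk:
        current: Optional[Record] = record
        for step in steps:
            current = step(current)
            if current is None:
                break
        if current is not None:
            out.append(current)
    return out


def run_pipeline(
    records: list[Record],
    steps: list[Step],
    *,
    chunk_size: int = 1000,
    workers: int = 4,
) -> list[Record]:
    """Process chunks concurrently and return results in the original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda chunk: process_chunk(chunk, steps),
            chunked(records, chunk_size),
        )
        return [record for chunk in results for record in chunk]
```

Because chunks do not depend on each other, the same structure carries over when the chunks are handed to separate processes or machines instead of threads.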

Conclusion

By employing a minimal data model, Aampe can work with any format of data the customer gives us and translate it into the Aampe model with the required minimal fields. Any extra information allows us to add use cases, but the minimum data is what our AI model uses to learn each user's timing, copy, frequency, and channel preferences so that it can target efficiently.


Aug 29, 2024