Several years back, we were given several long lists of Korean addresses and contact information to import into our CRM system. The lists were a mess, with addresses in several different formats, unseparated fields, characters that had been converted to multi-character sets, and lots of near-duplicate data. And they were not formatted at all for import into a US-structured CRM system.
After examining the data, we spent days developing a series of scripts to iteratively attack the problems, one at a time. When we were done, almost 95% of the data was in a form that could be imported. The rest required manual intervention.
We were given many more projects. Different types of data. Similar problems. A customized solution for each dataset. Lots of labor hours.
With Generative AI and tools like ChatGPT the game has changed and projects like this can be easy. You can complete them in hours, not days, and with better accuracy than previous methods allowed.
Today we'll explore some easy methods for cleaning data. But first we'll discuss the risks involved in the process and how to mitigate them.
From this...
to this...
Risks and mitigation
Any time you send data to external services for processing you run the risk of that data being intercepted. With many free cloud services you may even inadvertently grant the host the right to use that data for a wide variety of purposes. And there's always the risk of bad actors masquerading as legitimate services to intercept your data. To reduce these risks, be sure to:
Have and enforce company policies for using data in the cloud, approving tools, certifying users, and approving use cases.
Identify and document the sensitivity of your data sets and manage access.
Prohibit the use of "free" tools and restrict your users to only using paid, licensed products. These usually have much more restrictive and controllable terms regarding the provider's use of input prompts and output generated by your users.
Assure that any tools you use effectively encrypt your data, both in transit and while stored.
Use tools that separate your raw data from the language model platform you are using. These are often referred to as Retrieval Augmented Generation (RAG) systems.
For highly critical applications, use self-contained tools that run directly on your device or network, without internet access.
Matrix Potential can help you navigate the risks of exposing your data and build secure tools for your use case. Get in touch with us here.
Turning poorly formatted data into tables
Here are some prompts you can use to clean up data. Depending on the tool you choose, you can use them iteratively and even build specific prompt sets for different types of data.
When collecting data from multiple sources or conducting research, the information likely won't come in a format you can simply import into your internal systems. Here's an example prompt to help you clean up poorly formatted contact lists. It can easily be applied to other types of data.
I'm going to provide a large amount of text. Convert this text into a table with the following columns: First Name, Last Name, Title, Company, Street Address, City, State, Zip Code, Email Address, Phone. Not all columns may be present for each row of data.
If the text contains a middle initial, include it in the First Name field, separated from the first name by a space character and followed by a period.
Zip codes may contain 5 or 9 digits. If they are 9 digits, a dash character should separate the first 5 digits from the next 4 digits. Zip codes should be stored as text, not numbers.
Include any data that doesn't fit these categories in a column labeled Other, with elements separated by commas.
This is an example for a contact list, but you can use a similar prompt for many other types of unformatted or poorly structured data.
Multiple steps
A very complicated prompt may generate errors, but you can break the process down into several steps to avoid confusing the system. In the example above, instead of specifying the columns "First Name" and "Last Name," and requesting special handling for middle initials, request only a "Name" field in the first prompt. In a second prompt, you could state:
Reformat the table above, separating the Name column into Salutation, First Name, Middle Initial, Last Name, and Suffix columns. Not all elements may exist for each name.
Supplementing the original data
You can also have the system analyze the data to create additional fields based on other content. In the example above, here's how you can analyze email addresses:
For the table above, add another column for Email Type. If the email address does not contain a valid URL or is missing, the value for this field should be "Invalid". If the email address contains gmail.com, outlook.com, icloud.com, yahoo.com, aol.com, or msn.com, set the value for this field to "Personal". All other rows should have the value "Business".
Substituting Characters
Sometimes, raw international data may come with multi-character strings representing accented characters or digraphs, or you may want to replace accented characters with their unaccented versions for easy import. (This is common when the imported text was encoded in ISO 8859-1 instead of UTF-8.)
A prompt like this can accomplish this task simply, and is much more efficient than an iterative search and replace. It usually works best to do this before parsing your data into a table.
Replace all instances of the text below as follows, ignoring quotation marks:
"á" with "á"
"é" with "é"
"ó" with "ó"
"Ã-" with "í"
"ñ" with "n"
"Á" with "A"
"ß" with "ss"
Complete all of one substitution before beginning the next on the list. All substitutions are case sensitive.
Preparing for export/import
When complete, you can extract the result for import into another system with a prompt similar to:
Create a tab-delimited text file from the above table.
or
Create a comma-delimited text file from the above table.
If you're going to require a comma-delimited text file to import data into your new system, prior to doing this you should strip out any remaining commas in your table prior to exporting so they will end up in the proper field when imported. Using the method above, a simple prompt like this can assure that your data will import properly.
In the table above, replace any remaining commas with the "|" character.
Do you regularly need to move data from a variety of sources into your database-driven systems? Matrix Potential's data science team can help you with projects of any size. Get in touch with us here.
Comments