Regex Operations in BigQuery

BigQuery is taking the analytics industry by storm. One of the most important use cases is to extract and transform parts of a string. Below are some of the powerful constructs provided by BigQuery.

BigQuery provides the following constructs for regex operations.

ConstructArgumentsUse
REGEXP_CONTAINSREGEXP_CONTAINS(value, regexp)Checks if the string contains the regex
REGEXP_EXTRACTREGEXP_EXTRACT(value, regexp[, position[, occurrence]])Extract the substring that matches the regex.
REGEXP_EXTRACT_ALLREGEXP_EXTRACT_ALL(value, regexp)Return all substrings that match the regex
REGEXP_INSTRREGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]])Return 1-based index for the first occurrence of the regex
REGEXP_REPLACEREGEXP_REPLACE(value, regexp, replacement)Replace all substrings in value

Source: Google Cloud Platform

Table Setup

Let’s create a simple table with some string data to play around with.

-- Create Store
CREATE OR REPLACE TABLE `gcp-tour-276710.thefellowcoder.store`
(
    date DATE,
    details STRING,
    address STRING
)        

-- Add values
INSERT INTO `gcp-tour-276710.thefellowcoder.store`
VALUES 
('2021-07-14', '<h1>Nike Shoes</h1><p>Base: $450 Addon: $20</p><p>Support email: help@nike.com</p>','Los Angeles, California 90001 '),
('2021-06-13', '<h1>Rebook Shoes</h1><p>Base: $150 Addon: $10</p><p>Support email: help@rebook.com</p>','Sacramento, California 94203 '),
('2021-07-11', '<h1>Adidas Shoes</h1><p>Base: $350 Addon: $30</p><p>Support email: help@adidas.com</p>','Beverly Hills, California 90209 '),
('2021-07-11', '<h1>Carnival All Rounder Shoes</h1><p>Base: $250 Addon: $0</p>','Beverly Hills, California 90209 ')

Sample Queries

Let's see each of the constructs in action

-- All stores that have a support email
select * from `gcp-tour-276710.thefellowcoder.store` 
where REGEXP_CONTAINS(details, r'\S+@\S+\.\S+')
-- Extract Product Names
select REGEXP_EXTRACT(details, r'<h1>([^<]+)<\/h1>') 
from `gcp-tour-276710.thefellowcoder.store`
-- Extract all prices
select REGEXP_EXTRACT_ALL(details, r'\$(.\d+)') 
from `gcp-tour-276710.thefellowcoder.store`
-- Get index of email
select details, REGEXP_INSTR(details, r'\S+@\S+\.\S+') 
from `gcp-tour-276710.thefellowcoder.store`
  • REGEXP_REPLACE example
-- Replace all email-ids
select REGEXP_REPLACE(details, r'\S+@\S+\.\S+', 'admin@thefellowcoder.com') from `gcp-tour-276710.thefellowcoder.store`

Common Use-cases:

Extract part of the string with regex

This is one of the most common use cases wherein we must extract a substring that matches a regex.

-- Extract product names from all details
select REGEXP_EXTRACT(details, r'<h1>([^<]+)<\/h1>') 
from `gcp-tour-276710.thefellowcoder.store`

Replace regex with string

Let’s say you want to sanitize the string and remove all special characters from the string.

-- remove special characters
select details,REGEXP_REPLACE(details, r'[^0-9a-zA-Z]+',' ') 
from `gcp-tour-276710.thefellowcoder.store`

Matches any of the regex patterns

check for multiple regex patterns for a string

-- check for pincodes starting with 90/94
select address,REGEXP_EXTRACT(address, r'9[0|4]\d+') as pincode 
from `gcp-tour-276710.thefellowcoder.store`

Useful regex examples for data cleaning.

StringRegexOutput
<h1>Hello World!</h1><h1>([^<]+)<\/h1>Hello World!
<p>Support email: thefellowcoder@gmail.com</p>\w+@\w+\.\w+thefellowcoder@gmail.com
Rebook ShOes(?i)shoesShOes
This,contains$special\characters[^0-9a-zA-Z]+This contains special characters
String needs to be trimmed.^(\s+).*?(\s+)$String needs to be trimmed.

Did you find this article valuable?

Support Smit Thakkar by becoming a sponsor. Any amount is appreciated!