Regex Operations in BigQuery

Home » Blog » GCP » Regex Operations in BigQuery

BigQuery is taking the analytics industry by a storm. One of the most important use-cases is to extract and transform parts of a string. Below are some of the powerful constructs provided by BigQuery.

Regex Constructs

BigQuery provides the following constructs for regex operations

ConstructArgumentsUse
REGEXP_CONTAINSREGEXP_CONTAINS(value, regexp)Checks if the string contains the regex
REGEXP_EXTRACTREGEXP_EXTRACT(value, regexp[, position[, occurrence]])Extract the substring that matches the regex.
REGEXP_EXTRACT_ALLREGEXP_EXTRACT_ALL(value, regexp)Return all substrings that match the regex
REGEXP_INSTRREGEXP_INSTR(source_value, regexp [, position[, occurrence, [occurrence_position]]])Return 1-based index for the first occurrence of the regex
REGEXP_REPLACEREGEXP_REPLACE(value, regexp, replacement)Replace all substrings in value
Source : Google Cloud Platform


Table Setup

Let’s create a simple table with some string data to play around.

-- Create Store
CREATE OR REPLACE TABLE `gcp-tour-276710.thefellowcoder.store`
(
    date DATE,
    details STRING,
    address STRING
)        

-- Add values
INSERT INTO `gcp-tour-276710.thefellowcoder.store`
VALUES 
('2021-07-14', '<h1>Nike Shoes</h1><p>Base: $450 Addon: $20</p><p>Support email: [email protected]</p>','Los Angeles, California 90001 '),
('2021-06-13', '<h1>Rebook Shoes</h1><p>Base: $150 Addon: $10</p><p>Support email: [email protected]</p>','Sacramento, California 94203 '),
('2021-07-11', '<h1>Adidas Shoes</h1><p>Base: $350 Addon: $30</p><p>Support email: [email protected]</p>','Beverly Hills, California 90209 '),
('2021-07-11', '<h1>Carnival All Rounder Shoes</h1><p>Base: $250 Addon: $0</p>','Beverly Hills, California 90209 ')

Sample Queries

Lets see each of the constructs in action

-- All stores that have a support email
select * from `gcp-tour-276710.thefellowcoder.store` where REGEXP_CONTAINS(details, r'\[email protected]\S+\.\S+') 
-- Extract Product Names
select REGEXP_EXTRACT(details, r'<h1>([^<]+)<\/h1>') from `gcp-tour-276710.thefellowcoder.store`
-- Extract all prices
select REGEXP_EXTRACT_ALL(details, r'\$(.\d+)') from `gcp-tour-276710.thefellowcoder.store`
-- Get index of email
select details, REGEXP_INSTR(details, r'\[email protected]\S+\.\S+') from `gcp-tour-276710.thefellowcoder.store`
  • REGEXP_REPLACE example
-- Replace all email-ids
select REGEXP_REPLACE(details, r'\[email protected]\S+\.\S+', '[email protected]') from `gcp-tour-276710.thefellowcoder.store`

Common Use-cases:

Extract part of the string with regex

This is one of the most common use-cases wherein we must extract substring that matches a regex.

-- Extract product names from all details
select REGEXP_EXTRACT(details, r'<h1>([^<]+)<\/h1>') from `gcp-tour-276710.thefellowcoder.store`

Replace regex with string

Let’s say you want to sanitize the string and remove all special characters from the string.

-- remove special characters
select details,REGEXP_REPLACE(details, r'[^0-9a-zA-Z]+',' ') from `gcp-tour-276710.thefellowcoder.store`

Matches any of the regex patterns

check for multiple regex patterns for a string

-- check for pincodes starting with 90/94
select address,REGEXP_EXTRACT(address, r'9[0|4]\d+') as pincode from `gcp-tour-276710.thefellowcoder.store`

Useful regex examples for data cleaning.

StringRegexOutput
<h1>Hello World!</h1><h1>([^<]+)<\/h1>Hello World!
<p>Support email: [email protected]</p>\[email protected]\w+\.\w+[email protected]
Rebook ShOes(?i)shoesShOes
This,contains$special\characters[^0-9a-zA-Z]+This contains special characters
String needs to be trimmed. ^(\s+).*?(\s+)$String needs to be trimmed.
Cheatsheet

Leave a Comment