I am going to show you how to write a python script that will read through a .sql file and return the names of all the tables used in the file. Note: coding starts in the section with the title “Begin Follow-Along-Coding:”.
Why do this?
Well at my current job, we have a lot of .sql scripts in our data pipeline. Many of these scripts were written by people who either no longer work at this company, or we don’t even know who wrote these. We don’t know much off the top of our heads about these scripts. As a result, making changes to the data pipeline can be complex, time-consuming, and require a fair amount of investigation into .sql files to find out where changes need to be made. A current priority of our team is to reduce the complexity of making changes to the data pipeline. As I thought more about where to start making this process easier (I wasn’t sure where to start at first), I thought maybe keeping track of the tables referenced in each .sql file will help.
Why will this help?
I think this will help because when we need to make a change to logic in the pipeline, we can quickly and easily see all the tables in a .sql file and weigh whether or not this file contains logic that that needs updating. In other words, we can see all the table dependencies of a .sql file in one place, and from this, we can know more quickly and easily (in theory) if this is the right place for changes.
Great! Now how do we go about this?
Manually looking through each file and copying and pasting each table name somewhere?
That’s a lot of files. and a lot of tables.
Hmmmmmmmm….

Ah, I got it, time for a google search then? This must be out there already, I mean every IDE changes the color of table names…
‘find all table names in .sql file’… No that's not it
‘find all table names in .sql file python’… hmmm I see some stackoverflow questions about it…
This one is exactly what I am asking about… “How to extract tables names in a SQL script?”… and some people even wrote up algorithms for it…
replace_list = ['\n', '(', ')', '*', '=']
for i in replace_list:
txt = txt.replace(i, ' ')
txt = txt.split()
res = []
for i in range(1, len(txt)):
if txt[i-1] in ['from', 'join'] and txt[i] != 'select':
res.append(txt[i])
print(res)
hmm, I have no idea what that is doing, and I don't want to spend time trying to figure out…
sql_line = re.sub('\s+', ' ', sql)
tbl_re = re.compile(r'(?:\b(?:from)|(?:join)\b\s+)(\w+)\b')
tablenames = tbl_re.findall(sql_line)
print(tablenames)
Wow, I have even less of an idea of what this one is doing…
Come on there’s no like prebuilt python function?? Let me check a different stackoverflow…
hmmm, this one looks similar to what I want… “How to extract table names and column names from sql query?”
The first answer to this one starts with this: “Really, this is no easy task.” 😮

Ok, I’m gonna try writing this myself, so I can be sure it does what I need it to do… Now, where can I get a good SQL query to use for debugging?
Ahhhhh back on that first stackoverflow question:
select *
from (
select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b
on a.col1 = b.col2
left join
test.test_c c
on b.col2 = c.col3
left join
(select
col4
from
test.test_d) d
on c.col3 = d.col4
Perfect — its got subqueriers, weird spacing, everything except comments… Lets fix that:
// comment
-- comment
# commentselect *
from (
select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b
on a.col1 = b.col2
left join
test.test_c c
on b.col2 = c.col3
left join
(select
col4
from
test.test_d) d
on c.col3 = d.col4
👍 👍
Begin Follow-Along-Coding:
So I am going to assume that you have experience running python scripts and that you have python 3 installed on your computer. If that is not the case: fret not, there are a plethora of good tutorials for both of these things. Go get that figured out, then come back and do this. 🙂
Ok, just so we’re all starting with the same setup, you can either download this git repo, or you can make a folder with a file called ‘file.sql’ with the above SQL query in it, and a file called ‘main.py’ with this line in it:
if __name__ == '__main__':
print('Hello World!')
Now, make sure you open up this folder in the IDE of your choice (you can use IDLE that comes with python if you’d like, or you can even edit these files in a plain text editor and run the script ‘main.py’ from your command line).
Once you’ve done that, try running the ‘main.py’ script — You should see it return this output:
Hello World!
If it isn’t, keep trying different things, until you see it output that! 😝
Let's start the coding!! 😎😎 WAIT — first, let’s breakdown the process into smaller and easier steps:
Goal #1: get the .sql file into the easiest possible string format to extract the table names from (table names always come after either the keyword from
or the keyword join
, though not necessarily on the same line or always one space away from the keyword).
Goal #2: extract all table names from the string (and nothing extra!).
Now let’s start with having ‘main.py’ read the contents of ‘file.sql’. Luckily we are using python, so this is super easy! Update ‘main.py’ to look like this:
if __name__ == '__main__':
file = open('file.sql', "r")
for line in file:
print(line)
file.close()
So, the new code we added tells ‘main.py’ to open up ‘file.sql’, store its lines in the variable file
, go through those lines one by one, and print the line (in the for
loop). Once when we finish going through and printing the lines, we close the file with file.close()
(this is just telling python that we are done reading file
and it is good practice to use close()
after open()
).
Why are we going through ‘file.sql’ line by line?
Well, we don’t have to. There are other ways to do it, but this is how my brain thinks and as you’ll see in a bit, it makes it easy to deal with comments (for me at least).
Anyway, hit save, then run ‘main.py’. The output should look like this:
// comment— comment# commentselect *from (select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) aleft join test.test_b bon a.col1 = b.col2left jointest.test_c con b.col2 = c.col3left join(selectcol4fromtest.test_d) don c.col3 = d.col4
Whoa, why is everything double or triple spaced?? I thought we were printing line by line?
I know this isn’t great — but I’m not exactly sure. I know that each line has either a ‘\n’ character or ‘\n\r’ at the end of it, and that's causing there to be a new line after every line from the file. Regardless, we should probably remove that and try again.
There’s a function for that it’s called rstrip()
and it strips any combination of \n
, \r
, and ‘ ‘
off the end of a string. So let's change `main.py` to include that:
if __name__ == '__main__':
file = open('file.sql', "r")
for line in file:
print(line.rstrip())
file.close()
Run that and you should get an output that looks exactly like ‘file.sql’. Nice. 👍 We’re making progress. We can now access each line and do varying things to it to make it easier to extract the table names.
Next, let’s remove the comments because we only want table names that are part of the script logic. I am not exactly sure about all of the comment indicators for .sql files b/c there are so many flavors of SQL, but I know from experience that I have used //
, --
, and #
to stat a comment. This does not include multiline comments, but we will get to that later.
So to deal with comments starting with //
, --
, or #
I am going to remove the rest of the line after one of those indicators (this is why it is nice to go through the .sql file line by line). I am going to do this by splitting the line into parts based on the comment indicators, then taking the first part — index 0 of splitting a string on one of the indicators (Note: if the line does not contain //
, --
, or #
this will just return the whole line string, and if the comment indicator is the first char(s) of the string, it will return an empty string ‘’
).
Here is what I will update ‘main.py’ to look like:
if __name__ == '__main__':
file = open('file.sql', "r")
for line in file:
line = line.split('//')[0]
line = line.split('--')[0]
line = line.split('#')[0]
print(line.rstrip())
file.close()
This is what the output looks like when I run it:
select *
from (
select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b
on a.col1 = b.col2
left join
test.test_c c
on b.col2 = c.col3
left join
(select
col4
from
test.test_d) d
on c.col3 = d.col4
No comments! Wow, that was easy 🙂
Now it’s time for a little reorganization because we’re starting to have logic under if __name__ == ‘__main__’:
, which I am pretty sure is not a good idea. Let’s put the logic in a function and call it from where the code is right now:
def process_sql_file(file_name):
file, string = open(file_name, "r"), ''
for line in file:
line = line.rstrip()
line = line.split('//')[0]
line = line.split('--')[0]
line = line.split('#')[0]
string += ' ' + line
file.close()
return string
if __name__ == '__main__':
print(process_sql_file('file.sql'))
Note: I added a new variable called string
and set it to ‘’
, then appended each line to the end of it with a space (string += ‘ ‘ + line
) and returned string
as the output of our new function called process_sql_file
. Also instead of hardcoding ‘file.sql’ into our logic, I replaced that with a function variable argument called file_name
, so we can use this function for .sql files with other names as well!
Ok, so when we run ‘main.py’ now we get one long string:
select * from ( select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a left join test.test_b b on a.col1 = b.col2 left join test.test_c c on b.col2 = c.col3 left join (select col4 from test.test_d) d on c.col3 = d.col4
Good! Now there are three more things I can think of that will make this string even easier for extracting table names: 1) squish all the places with multiple whitespaces down to just one whitespace, 2) ensure consistency of spacing of parentheses from words — either have '(‘
and ’)’
either touching words or not touching, and I am going to choose not touching, and 3) ensure multiline comments (starting with ‘/*’
and ending with ‘*/’
) are also ignored.
Let’s tackle 1) & 2) first because they seem easier:
def process_sql_file(file_name):
file, string = open(file_name, "r"), ''
for line in file:
line = line.rstrip()
line = line.split('//')[0]
line = line.split('--')[0]
line = line.split('#')[0]
line = line.replace('(', ' ( ')
line = line.replace(')', ' ) ')
string += ' ' + line
file.close()
# remove extra whitespaces
string = ' '.join(string.split())
return string
if __name__ == '__main__':
print(process_sql_file('file.sql'))
So for ensuring the parentheses have a whitespace between them and the words around them, I added line = line.replace(‘(‘, ‘ ( ‘)
and the corresponding one for ‘)’
. Then for removing extra whitespaces I added string = ‘ ‘.join(string.split())
(split()
with no arguments actually splits on whitespace and several whitespaces in a row, as if they are only 1 whitespace — thanks stackoverflow 😉).
Running ‘main.py’ now returns a string with 1 whitespace between each word (or non-whitespace character). Next, it’s on to removing inline comments!
Remember these start with ‘/*’
and end with ‘*/’
. After some thought, I realized the easiest way to do this is by removing anything between pairs of ‘/*’
and ‘*/’
in string
, after string
has all lines appended to it. I am not sure that there should ever be a ‘/*’
without a ‘*/’
in a .sql file, but I am just going to assume that only portions of string
between ‘/*’
and ‘*/’
should be removed.
Ok to do this, I am first going to add some inline comments to `file.sql` to help us debug this logic:
// comment
-- comment
# comment
select *
from (
select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1) a
left join test.test_b b # comment
on a.col1 = b.col2
left join
test.test_c c // comment
/* sdfiujsdf
iuhsidfgoisdf
lkjsdfgoijsdf */
on b.col2 = c.col3 //
left join
(select
col4
from
/* sdfiujsdf
iuhsidfgoisdf
lkjsdfgoijsdf */ test.test_d) d
/* sdfiujsdf
iuhsidfgoisdf
lkjsdfgoijsdf */
on c.col3 = d.col4
Note: even though this is probably never a good way to write SQL code, I left some SQL code on the same line that a multiline comment ends lkjsdfgoijsdf */ test.test_d) d
, just to make sure we can handle that (though with the way we're going about this, I don’t see why that should be an issue).
Now to write the multiline comment removal logic, I am just going to find the next pair of ‘/*’
and ‘*/’
(starting from index 0 and searching to the right), while there is a pair left in string
, and save string
as the portion of string
ending before the starting index of ‘/*’
joined with the portion of string
starting after the ending index of ‘*/’
and to the end of string
— so basically just removing what is between ‘/*’
and ‘*/’
. I am doing this using python string slices.
Is there a better way to do this? Probably, but this makes sense to me, so I’m doing it this way. process_sql_file
looks like this now:
def process_sql_file(file_name):
file, string = open(file_name, "r"), ''
for line in file:
line = line.rstrip()
line = line.split('//')[0]
line = line.split('--')[0]
line = line.split('#')[0]
line = line.replace('(', ' ( ')
line = line.replace(')', ' ) ')
string += ' ' + line
file.close()
# remove multi-line comments:
while string.find('/*') > -1 and string.find('*/') > -1:
l_multi_line = string.find('/*')
r_multi_line = string.find('*/')
string = string[:l_multi_line] + string[r_multi_line + 2:]
# remove extra whitespaces
string = ' '.join(string.split())
return string
Running ‘main.py’, should now output:
select * from ( select col1 from test.test_a join test.test_a1 on a.col1 = a1.col1 ) a left join test.test_b b on a.col1 = b.col2 left join test.test_c c on b.col2 = c.col3 left join ( select col4 from test.test_d ) d on c.col3 = d.col4
OK! now time for goal #2! Wait, what was goal #2 again??
Goal #2: extract all table names from the string (and nothing extra!).
Ok! I am going to reformat the code again, so it is more organized:
def process_sql_file(file_name):
file, string = open(file_name, "r"), ''
for line in file:
line = line.rstrip()
line = line.split('//')[0]
line = line.split('--')[0]
line = line.split('#')[0]
line = line.replace('(', ' ( ')
line = line.replace(')', ' ) ')
string += ' ' + line
file.close()
# remove multi-line comments:
while string.find('/*') > -1 and string.find('*/') > -1:
l_multi_line = string.find('/*')
r_multi_line = string.find('*/')
string = string[:l_multi_line] + string[r_multi_line + 2:]
# remove extra whitespaces
string = ' '.join(string.split())
return string
def find_table_names(string):
return string
def find_table_names_from_sql_file(file_name):
string = process_sql_file(file_name)
table_names = find_table_names(string)
return table_names
if __name__ == '__main__':
print(find_table_names_from_sql_file('file.sql'))
So now we have 3 functions, but really only process_sql_file
and find_table_names
will be doing anything. find_table_names_from_sql_file
is just combining process_sql_file
and find_table_names
into one process. It is sometimes good to break code down into smaller, simpler pieces of code as this allows for easier testing and more reusability (if the code is designed well for that!). For now find_table_names
just returns its string
argument.
Ok, let’s get started on goal #2 then!
The way I go about finding table names is very similar to removing the multiline comment, but with slightly different logic in the while
loop. So given that we have a nicely formatted string of our .sql file, I decided to go through the string and find the word that comes after the next closest ‘from’
or ‘join’
, then chopping off everything in the string through the end of the table name, and repeating this process again. The specific steps written out are:
While there is a ‘from ‘
or ‘join ‘
substring left in string
:
- Find the next closest from
‘from ‘
or‘join ‘
and delete the part of the string up to and including the white space after‘from ‘
or‘join ‘
. - Find the next closest
‘ ‘
instring
, and save it astable_name
. - Remove part of
string
up to the next closest‘ ‘
. - If
table_name
is not‘(‘
— the start of a subquery — add table_name totable_names
Here is the code:
def find_table_names(string):
table_names = []
# find next closest 'from ' or 'join '
# take tablename after it assuming it isn't '('
while string.find('from ') > -1 or string.find('join ') > -1:
# find next closest 'from ' index
next_from_index = float('inf')
if string.find('from ') > -1:
next_from_index = string.find('from ')
# find next closest 'join ' index
next_join_index = float('inf')
if string.find('join ') > -1:
next_join_index = string.find('join ')
# find the next closest keyword in string
# i.e. the one with the smallest index
next_keyword_index = min(next_from_index, next_join_index)
# set remaining string to be slice
# starting after next 'from ' or 'join ':
string = string[next_keyword_index + 5:]
# find index of next whitespace in remaining string:
next_space_index = string.find(' ')
# save table_name as slice from start of
# remaining string until next whitespace
table_name = string[:next_space_index]
# set remaining string as slice starting
# after end of table_name
string = string[next_space_index:]
# if the table_name isn't the start of
# a subquery, save it to table_names
if table_name != '(':
table_names.append(table_name)
return table_names
Run ‘main.py’ now, and you will see that the output is a list of all table names in ‘file.sql’!!
output:
['test.test_a', 'test.test_a1', 'test.test_b', 'test.test_c', 'test.test_d']
Woo Hoo!!!!! 😎😎😎😎
Note: next_from_index
and next_join_index
are initially set to infinity, then only set to the index where ‘from ‘
or ’join ‘
if the index returned by find(‘from ‘)
or find(‘join ‘)
is greater than -1. If ‘from ‘
or ’join ‘
is not found in the remaining string, find
will return -1, and the line min(next_from_index, next_join_index)
will end up choosing -1 as the index we want (and that’s not the index we want!!). There might be more concise ways to do this, but I want this code to be easy to read (for me) in case I come back and edit it months from now.
One final thing: how we have the code right now, it will return table names in the order they show up in the .sql file, but if it will leave repeats in the return list. I am going to change find_table_names
so that it returns only one copy of each table name, and sorts them alphabetically.
Here’s what that looks like:
def find_table_names(string):
table_names = set()
while string.find('from ') > -1 or string.find('join ') > -1:
next_from_index = float('inf')
if string.find('from ') > -1:
next_from_index = string.find('from ')
next_join_index = float('inf')
if string.find('join ') > -1:
next_join_index = string.find('join ')
next_keyword_index = min(next_from_index, next_join_index)
string = string[next_keyword_index + 5:]
next_space_index = string.find(' ')
table_name = string[:next_space_index]
string = string[next_space_index:]
if table_name != '(':
table_names.add(table_name)
return list(sorted(table_names))
It just so happens that ‘file.sql’ has all tables occur in alphabetical order and only 1 time each, but the same cannot be said for all .sql files!
So for anyone who wants the final code, it is here!
So what did we learn? Well, finding table names in a .sql file is a very doable task for 1 thing. It is almost like a leetcode problem. We also learned that there is no built-in python function for this (or I couldn’t find it easily in a google search). I wonder if we learned partly how compilers/interpreters do their thing? Maybe? We definitely learned that programming is fun! 😎
This is my first time doing a follow-along-coding article. Let me know if you enjoyed it, hated it, think it's cool, learned something, didn’t learn something!
Thanks for reading!