
Get File Names when reading data from ADLS into Azure Databricks

Ramesh Nelluri

In many of the interviews I conduct, one of the focus items I evaluate is how much importance a candidate gives to the technical documentation of the technology he or she is working on.


Here is one such sample question: how do you get the "file names" when you read multiple files (all with the same schema: CSV, JSON, ORC, etc.) into Databricks as a data frame or a table?

Flights data: multiple CSV files of the same schema

As shown in the image above, there are flight CSV data files named flights1.csv through flights20.csv.

Each file contains a random number of records.

The task: read all the files into a single data frame / table and get the file name as an additional column in the target table.

The most common solution given in these discussions is to mount ADLS, go through the files one by one to create a corresponding data frame for each, and merge all the data frames at the end.

Let's look at the traditional approach described by many of the programmers in my interviews.

Step 1: Mount ADLS to Databricks. There are various ways to do this: using a SAS token, using secrets from Key Vault, or credential passthrough.

I chose credential passthrough. The following is a snippet of code to mount a storage container in Databricks (run this in a notebook).

Mount an ADLS Gen2 storage container with Databricks
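A minimal sketch of such a credential-passthrough mount, assuming a hypothetical storage account mystorageaccount, container flights, and mount point /mnt/flights (spark, dbutils, and display are provided by the Databricks notebook environment):

```python
# Sketch only: the storage account, container, and mount point names are hypothetical.
# Credential passthrough configuration for ADLS Gen2.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName"),
}

# Mount the container under the given mount point
dbutils.fs.mount(
    source="abfss://flights@mystorageaccount.dfs.core.windows.net/",
    mount_point="/mnt/flights",
    extra_configs=configs,
)

# List all mounts currently defined in the workspace
display(dbutils.fs.mounts())
```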

This code creates the mount with the given name and lists all mounts present in Databricks.

Step 2: Loop through the files in the directory one by one, add an additional column with the file name, and append each data frame to the main data frame.

Loop through files, add column ‘filename’ and union all dataframes into one
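A minimal sketch of this loop-and-union approach, assuming the hypothetical /mnt/flights mount point from the step above and CSV files with headers:

```python
from functools import reduce
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit

# List the CSV files under the mounted directory
csv_files = [f.path for f in dbutils.fs.ls("/mnt/flights/") if f.path.endswith(".csv")]

per_file_dfs = []
for path in csv_files:
    # Read one file and tag every row with the file it came from
    df = (spark.read
          .option("header", "true")
          .csv(path)
          .withColumn("filename", lit(path)))
    per_file_dfs.append(df)

# Union all per-file data frames into the main data frame
flights_df = reduce(DataFrame.unionByName, per_file_dfs)
flights_df.createOrReplaceTempView("flights")
```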

I encountered this solution in the majority of the interviews I have done recently. Out of curiosity, I enquire further: how does this scale when many small files are written to the directory? After all, it runs through a loop over the files, creating a data frame for each and appending it to the main data frame.

Result

The result shows the number of records for each file, with the corresponding file name now associated with the data frame / table.
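A quick way to check this, assuming the combined data frame from the loop above is named flights_df:

```python
# Record count per source file
flights_df.groupBy("filename").count().orderBy("filename").show(truncate=False)
```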

The time and resources needed to run this code are directly proportional to the number of files.

This curiosity sparks further thinking about finding a better solution.

Now let's look at Spark's built-in capabilities to solve the problem above.

Step 1: Mount ADLS to Databricks. As before, there are various ways to do this: using a SAS token, using secrets from Key Vault, or credential passthrough.

Step 2: Create a table / data frame using *.csv (all files should have the same schema) and then use the input_file_name() SQL function.

Create table and use input_file_name() function
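A minimal sketch of this second approach, again assuming the hypothetical /mnt/flights mount point and CSV files with headers. The input_file_name() function returns, for each row, the full path of the file it was read from:

```python
from pyspark.sql.functions import input_file_name

# Read every CSV in the directory in one pass; Spark handles the file listing
flights_df = (spark.read
              .option("header", "true")
              .csv("/mnt/flights/*.csv")
              .withColumn("filename", input_file_name()))

flights_df.createOrReplaceTempView("flights")

# The same per-file record count as before, expressed as SQL against the temporary view
spark.sql("""
    SELECT filename, COUNT(*) AS record_count
    FROM flights
    GROUP BY filename
    ORDER BY filename
""").show(truncate=False)
```

There is no driver-side loop here: Spark lists and reads all the files in a single pass, so the work scales with cluster parallelism rather than with the number of files iterated one by one.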

In this solution, we use a built-in Spark function that exposes the metadata of the underlying files along with the data itself.

When we read through and take help from the technical documentation of a product, we often find better, easier, and more performant options that become available as new versions are released.