In many of my interviews, I make it a point to evaluate how much importance a candidate gives to the technical documentation of the technology they work on.
Here is one such sample question: how do you get the file names when you read multiple files (all with the same schema: CSV, JSON, ORC, etc.) into Databricks as a DataFrame or a table?
As shown in the image on the left, the flight CSV data files are named flights1.csv through flights20.csv. Each file contains a random number of records.
Read all the files into a single DataFrame / table and add the file name as an additional column in the target table.
The most common solution given in the discussion is to mount ADLS, go through the files one by one to create a DataFrame per file, and merge all the DataFrames at the end.
Let's look at the traditional approach described by many of the programmers in my interviews.
Step1: Mount ADLS to Databricks. There are various ways to do this: using a SAS token, using secrets from Key Vault, or credential passthrough.
I chose credential passthrough. The following snippet mounts a storage container in Databricks (run this in a notebook).
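A minimal sketch of the mount, assuming Azure AD credential passthrough on a Databricks cluster; the storage account and container names are placeholders, and `dbutils` / `display` are only available inside a Databricks notebook:

```python
# Hypothetical sketch: mount an ADLS Gen2 container using credential
# passthrough. <container> and <storage-account> are placeholders.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class": spark.conf.get(
        "spark.databricks.passthrough.adls.gen2.tokenProviderClassName"
    ),
}

# dbutils is provided by the Databricks notebook runtime.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/flights",
    extra_configs=configs,
)

# List all mounts to confirm the new one is present.
display(dbutils.fs.mounts())
```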
This code creates the mount with the given name and lists all mounts present in the Databricks workspace.
Step2: Loop through the files in the directory one by one, add an additional column with the file name, and append each DataFrame to the main DataFrame.
I encountered this solution in the majority of my recent interviews. I then curiously enquire further: how does this scale when many small files are written to the directory? After all, this version loops over the files, creating a DataFrame for each and appending it to the main DataFrame.
The number of records in each file, along with the corresponding file name, is associated with the DataFrame / table.
The number of files is directly proportional to the time and resources needed to run this code.
This curiosity sparks further thinking about finding a better solution.
Now let's look at Spark's built-in capabilities to solve the problem above.
Step1: Mount ADLS to Databricks. There are various ways to do this: using a SAS token, using secrets from Key Vault, or credential passthrough.
Step2: Create a table / DataFrame using *.csv (all files must have the same schema) and then use the "input_file_name()" SQL function.
In this solution, we use a Spark built-in function that picks up the metadata of the table and its underlying data.
When we read through and take the help of a product's technical documentation, we find that better, easier, and more performant options become available as new versions are released.