utils#

Utility functions for general-purpose tasks.

This module contains, among others, the following utility functions:

  • is_running_on_databricks: Check if the code is running locally or on Databricks.

  • get_spark_context: Get the Spark context.

  • find_filepath: Find a file or folder in the initial_dir directory or its parent directories.

  • remove_files: Remove files from a directory matching a specified pattern.

  • display_files: Display tables of removed and not removed files in a given directory.

wip.utils.dbutils_glob(pattern: str)[source]#

Perform a glob-like pattern matching for files in ABFSS using dbutils.fs.

Parameters

pattern (str) – The glob pattern to match against file names. Supports ‘*’ and ‘?’ wildcards.

Returns

A list of matched file paths in ABFSS.

Return type

List[str]
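The matching step itself can be done with the standard library's fnmatch; below is an illustrative sketch in which the `match_paths` helper and the `abfss://` paths are examples, not part of the module (on Databricks the candidate paths would come from `dbutils.fs.ls`):

```python
from fnmatch import fnmatch

def match_paths(paths, pattern):
    # Pure-Python core of the glob-like matching: keep only the paths
    # that satisfy the '*'/'?' wildcard pattern.
    return [p for p in paths if fnmatch(p, pattern)]

# Illustrative ABFSS-style paths; dbutils_glob would obtain these by
# listing the directory with dbutils.fs.
paths = [
    "abfss://container@account.dfs.core.windows.net/data/a.csv",
    "abfss://container@account.dfs.core.windows.net/data/b.parquet",
]
csvs = match_paths(paths, "*.csv")
```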

wip.utils.display_files(removed_files: List[str], not_removed_files: List[str])[source]#

Display tables of removed and not removed files in a given directory.

This function creates and displays two tables:

  • One for files that were successfully removed

  • One for files that were not removed from the specified directory.

The tables include file names and directory paths.

Parameters
  • removed_files (List[str]) – List of file paths that were successfully removed.

  • not_removed_files (List[str]) – List of file paths that were not removed.

Notes

This function uses rich.console.Console and rich.table.Table for displaying the tables in a formatted manner. It relies on logger for logging the number of removed and not removed files.

Examples

>>> display_files(["/path/to/dir/removed.txt"], [])
# This will display a table of removed files.
wip.utils.exists(path: str | Path) → bool[source]#

Check if a file or directory exists locally or in Databricks.

Parameters

path (str | Path) – The file path to check if it exists.

Returns

Whether the file or directory exists.

Return type

bool
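Locally the check reduces to pathlib; a minimal sketch under that assumption (`exists_sketch` is a hypothetical stand-in, not the module's code — on Databricks the real function would query `dbutils.fs` instead):

```python
from pathlib import Path

def exists_sketch(path):
    # Local-filesystem branch of the check; a Databricks branch would
    # call dbutils.fs.ls(path) and treat an exception as "missing".
    try:
        return Path(path).exists()
    except OSError:
        return False
```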

wip.utils.find_filepath(filename: str | Path, initial_dir: str | Path | None = None, max_upper_dirs: int = 4) → Path[source]#

Find a file or folder in the initial_dir directory or its parent directories.

Parameters
  • filename (str | Path) – The filename to find.

  • initial_dir (str | Path | None) – The initial directory to start searching from. If None, the current directory is used.

  • max_upper_dirs (int, default 4) – The maximum number of parent directories to search. Note that each additional parent level triggers another recursive search, which can increase search time substantially.

Returns

The path to the file.

Return type

Path

Raises

FileNotFoundError

If one of the following occurs:

  • The file isn’t found.

  • The initial directory doesn’t exist.

  • The initial directory is a file.
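The upward search can be sketched as follows; `find_filepath_sketch` is an illustrative re-implementation under the assumptions above, not the module's code:

```python
from pathlib import Path

def find_filepath_sketch(filename, initial_dir=None, max_upper_dirs=4):
    # Search initial_dir recursively, then climb one parent at a time,
    # up to max_upper_dirs levels above the starting directory.
    current = Path(initial_dir or ".").resolve()
    if not current.exists():
        raise FileNotFoundError(f"{current} does not exist")
    if current.is_file():
        raise FileNotFoundError(f"{current} is a file, not a directory")
    for _ in range(max_upper_dirs + 1):
        matches = sorted(current.rglob(str(filename)))
        if matches:
            return matches[0]
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    raise FileNotFoundError(f"{filename!r} not found")
```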

wip.utils.get_dbutils()[source]#

Get the Databricks dbutils module.

Returns

The Databricks dbutils module, which contains submodules such as fs.

Return type

ModuleType

wip.utils.get_function_kwargs(func: Callable, **kwargs) → Tuple[Dict[str, Any], Dict[str, Any]][source]#

Split keyword arguments into those accepted by a given function and the rest.

Parameters
  • func (Callable) – The function whose keyword arguments are to be retrieved.

  • kwargs (Any) – Keyword arguments to pass to the function.

Returns

A dictionary of keyword arguments accepted by a given function and another dictionary with the remaining keyword arguments.

Return type

Tuple[Dict[str, Any], Dict[str, Any]]
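The split can be done with inspect.signature; in the sketch below, the `split_kwargs` helper and the `train` function are illustrative names, not part of the module:

```python
import inspect

def split_kwargs(func, **kwargs):
    # Partition kwargs into those named in the function's signature
    # and everything else.
    accepted = set(inspect.signature(func).parameters)
    matched = {k: v for k, v in kwargs.items() if k in accepted}
    rest = {k: v for k, v in kwargs.items() if k not in accepted}
    return matched, rest

def train(n_estimators=100, max_depth=None):
    pass

matched, rest = split_kwargs(train, n_estimators=50, verbose=True)
# matched == {"n_estimators": 50}, rest == {"verbose": True}
```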

wip.utils.get_function_parameters(func: Callable) → List[str][source]#

Return a list of parameter names accepted by a given function.

Parameters

func (Callable) – The function whose parameters are to be retrieved.

Returns

A list of parameter names.

Return type

List[str]
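A sketch of the same idea using inspect (`parameter_names` and `example` are illustrative names):

```python
import inspect

def parameter_names(func):
    # Parameter names in declaration order, taken from the signature.
    return list(inspect.signature(func).parameters)

def example(a, b=1, *args, c=2, **kwargs):
    pass

names = parameter_names(example)
# names == ["a", "b", "args", "c", "kwargs"]
```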

wip.utils.get_spark_context()[source]#

Get the Spark context.

Returns

The Spark context.

Return type

pyspark.context.SparkContext

wip.utils.is_running_on_databricks() → bool[source]#

Check if the code is running locally or on Azure Databricks.

This function checks whether the environment variable DATABRICKS_RUNTIME_VERSION is set. If it is, the code is running on Azure Databricks.

Returns

True if running on Azure Databricks, False otherwise.

Return type

bool
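The check itself is a one-line environment lookup; a sketch under that assumption (the runtime-version value in the test is only an example):

```python
import os

def is_running_on_databricks():
    # Databricks sets DATABRICKS_RUNTIME_VERSION in every cluster's
    # environment; locally the variable is normally absent.
    return "DATABRICKS_RUNTIME_VERSION" in os.environ
```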

wip.utils.remove_files(directory: str | Path, pattern: str, verbose: bool = False) → Tuple[List[str], List[str]][source]#

Remove files from a directory matching a specified pattern.

This function attempts to delete files in a specified directory that match a given pattern. It returns lists of both removed and not removed files. If the directory does not exist or is not a directory, it logs an error.

Parameters
  • directory (str | Path) – The directory from which files are to be removed. Accepts either a string path or a Path object.

  • pattern (str) – The pattern used to match files for removal, e.g., ‘*.txt’, or the name of the file to remove.

  • verbose (bool, default False) – If True, displays tables of removed and not removed files.

Returns

A tuple containing two lists:

  • The first list contains paths of files successfully removed

  • The second list contains paths of files that were not removed.

Return type

Tuple[List[str], List[str]]

Raises

Exception – General exceptions are caught and logged if file removal fails.

See also

os.remove

For the removal of individual files.

glob.glob

For a pattern matching of file paths.

Examples

>>> remove_files("/path/to/dir", "*.txt")
(['/path/to/dir/file1.txt', '/path/to/dir/file2.txt'], [])
>>> remove_files("/path/to/dir", "**/*.txt")
(['/path/to/dir/folder1/file1.txt', '/path/to/dir/folder2/file2.txt'], [])
>>> remove_files("/path/to/dir", "file1.txt")
(['/path/to/dir/file1.txt'], [])

Notes

This function logs errors and exceptions using the logger from wip.logging_config. It uses Path from pathlib for path manipulations and checks.

New in version 2.4.0: Added the remove_files_databricks function for removing files from ABFSS paths in Databricks.
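For the local-filesystem case, the behavior can be sketched as follows; `remove_files_sketch` is an illustrative helper, not the module's implementation:

```python
from pathlib import Path

def remove_files_sketch(directory, pattern):
    # Match files with Path.glob (accepts '*.txt', '**/*.txt', or a
    # literal file name) and collect successes and failures separately.
    directory = Path(directory)
    removed, not_removed = [], []
    if not directory.is_dir():
        return removed, not_removed
    for path in sorted(directory.glob(pattern)):
        try:
            path.unlink()
            removed.append(str(path))
        except OSError:
            not_removed.append(str(path))
    return removed, not_removed
```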

wip.utils.remove_files_databricks(directory: str | Path, pattern: str, verbose: bool = True) → Tuple[List[str], List[str]][source]#

Remove files from a Storage Account container path in Databricks.

This function attempts to delete files in a specified directory that match a given pattern. It returns lists of both removed and not removed files. If the directory does not exist or is not a directory, it logs an error.

Parameters
  • directory (str | Path) – The directory from which files are to be removed. Accepts either a string path or a Path object.

  • pattern (str) – The pattern used to match files for removal, e.g., ‘*.txt’, or the name of the file to remove.

  • verbose (bool, default True) – If True, displays tables of removed and not removed files.

Returns

A tuple containing two lists:

  • The first list contains paths of files successfully removed

  • The second list contains paths of files that were not removed.

Return type

Tuple[List[str], List[str]]

Raises

Exception – General exceptions are caught and logged if file removal fails.

Notes

This function assumes it’s running in a Databricks environment. It uses Databricks’ dbutils.fs module to interact with ABFSS paths.

Changed in version 2.8.9: Added a try/except clause to check whether the path being accessed actually exists inside the Azure storage container.