Introduction

OpenCF Core: The File Conversion Framework

The opencf-core package provides a robust framework for handling file conversion tasks in Python. It offers a set of classes and utilities designed to simplify the process of reading from and writing to different file formats efficiently.

Features

  • Modular Input/Output Handlers: Defines abstract base classes for file readers and writers, allowing for easy extension and customization.

  • Support for Various File Formats: Provides built-in support for common file formats such as text, CSV, JSON, XML, Excel, and image files.

  • MIME Type Detection: Includes a MIME type guesser utility to automatically detect the MIME type of files, facilitating seamless conversion based on file content.

  • File Type Enumeration: Defines an enum for representing different file types, enabling easy validation and processing of input and output files.

  • Exception Handling: Implements custom exceptions for handling errors related to unsupported file types, empty suffixes, file not found, and mismatches between file types.

  • Base Converter Class: Offers an abstract base class for implementing specific file converters, providing a standardized interface for file conversion operations.

  • Resolved Input File Representation: Introduces a class for representing input files with resolved file types, ensuring consistency and correctness in conversion tasks.
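To make the file type enumeration and the custom exceptions more concrete, here is a minimal standalone sketch. It does not use the opencf-core API: SketchFileType, EmptySuffixError, UnsupportedFileTypeError, and from_path are hypothetical names chosen for illustration, and the suffix mapping is an assumption.

```python
from enum import Enum
from pathlib import Path


class EmptySuffixError(ValueError):
    """Raised when a path has no suffix to inspect (hypothetical name)."""


class UnsupportedFileTypeError(ValueError):
    """Raised when a suffix maps to no known file type (hypothetical name)."""


class SketchFileType(Enum):
    # Each member holds the suffixes it accepts; this mapping is illustrative only.
    CSV = ("csv",)
    JSON = ("json",)
    TEXT = ("txt",)
    XLSX = ("xlsx",)

    @classmethod
    def from_path(cls, path: Path) -> "SketchFileType":
        """Resolve a file type from the path suffix, case-insensitively."""
        suffix = path.suffix.lower().lstrip(".")
        if not suffix:
            raise EmptySuffixError(f"{path} has no suffix")
        for member in cls:
            if suffix in member.value:
                return member
        raise UnsupportedFileTypeError(f"unsupported suffix: .{suffix}")
```

With this sketch, SketchFileType.from_path(Path("report.CSV")) resolves to SketchFileType.CSV, while a path without a suffix or with an unknown suffix raises one of the two exceptions.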

Conversion Strategies

When using opencf-core, you can adopt different strategies for file conversion based on your specific requirements:

1. Direct Conversion

In this approach, conversion is achieved without utilizing a dedicated writer. The reader module parses the input files into a list of objects. Subsequently, the _convert method orchestrates the writing process into a file or folder. This method is suitable for scenarios where direct manipulation of data structures suffices for conversion.

2. Indirect Conversion

Conversely, indirect conversion employs a converter that supports a dedicated writer. Here, the convert function’s primary role is to transform the parsed list of objects into a format compatible with the writer. The actual conversion process may be executed by the writer, leveraging its capabilities. For instance, converting images to videos involves parsing images into a list of Pillow objects, which are then reformatted into a numpy array. This array, encapsulating frame dimensions and color channels, serves as input for the video writer.
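The difference between the two strategies can be sketched with plain functions. This is a standalone illustration that assumes nothing about the real opencf-core classes: read_lines, convert_direct, write_json, and convert_indirect are hypothetical names.

```python
import json
from pathlib import Path
from typing import Dict, List


def read_lines(path: Path) -> List[str]:
    """Reader step: parse a text file into a list of lines."""
    return path.read_text(encoding="utf-8").splitlines()


def convert_direct(inputs: List[Path], output_file: Path) -> None:
    """Direct strategy: the convert step writes the output itself."""
    parsed = [line for path in inputs for line in read_lines(path)]
    output_file.write_text("\n".join(parsed), encoding="utf-8")


def write_json(content: Dict[str, List[str]], output_file: Path) -> None:
    """Writer step: serialize an object that is already in the right shape."""
    output_file.write_text(json.dumps(content), encoding="utf-8")


def convert_indirect(inputs: List[Path], output_file: Path) -> None:
    """Indirect strategy: convert only reshapes the data, the writer saves it."""
    reshaped = {path.name: read_lines(path) for path in inputs}
    write_json(reshaped, output_file)
```

In the direct case the conversion function owns the output file. In the indirect case it only produces the structure that the writer expects, which keeps the saving logic reusable across converters.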

Component Instances

The file conversion process can be dissected into three distinct instances:

  • Reader: Handles input-output (IO) operations, transforming files into objects. Readers are implementations of the abstract class Reader present in io_handler.py.

  • Converter: Facilitates object-to-object conversion, acting as an intermediary for data transformation. Converters are implementations of the abstract class BaseConverter present in base_converter.py.

  • Writer (Optional): Reverses the IO process, converting objects back into files. Writers are implementations of the abstract class Writer present in io_handler.py.

Modules

  • io_handler.py: Contains classes for reading from and writing to files, including text, CSV, JSON, XML, and image files. It includes abstract classes for Reader and Writer.

  • mimes.py: Provides a MIME type guesser utility for detecting file MIME types based on file content.

  • filetypes.py: Defines enums and classes for representing different file types and handling file type validation.

  • base_converter.py: Implements the base converter class and the resolved input file class for performing file conversion tasks. It includes the BaseConverter abstract class.

Installation

pip install opencf-core

Usage

The opencf-core package can be used independently to build custom file conversion utilities or integrated into larger projects for handling file format transformations efficiently.

from pathlib import Path

from opencf_core.base_converter import BaseConverter, ResolvedInputFile
from opencf_core.filetypes import FileType
from opencf_core.io_handler import CsvToListReader, DictToJsonWriter

class CSVToJSONConverter(BaseConverter):
    file_reader = CsvToListReader()
    file_writer = DictToJsonWriter()

    @classmethod
    def _get_supported_input_type(cls) -> FileType:
        return FileType.CSV

    @classmethod
    def _get_supported_output_type(cls) -> FileType:
        return FileType.JSON

    def _convert(self, input_path: Path, output_file: Path):
        # Implement conversion logic from CSV to JSON
        pass

# Usage
input_file_path = "input.csv"
output_file_path = "output.json"
input_file = ResolvedInputFile(input_file_path, is_dir=False, should_exist=True)
output_file = ResolvedInputFile(output_file_path, is_dir=False, should_exist=False, add_suffix=True)
converter = CSVToJSONConverter(input_file, output_file)
converter.convert()

More Examples

The examples folder in this repository contains practical demonstrations of how to use the opencf-core package for file conversion tasks. Currently, it includes the following examples:

  • simple_converter.py: Demonstrates a basic file converter that converts Excel (XLSX) files to CSV format. It utilizes the XLXSToCSVConverter class defined within the opencf-core package to perform the conversion.

  • cli_app_example.py: Illustrates how to build a command-line interface (CLI) application using the ConverterApp class from the opencf_core.converter_app module. This CLI app allows users to specify input and output files, as well as input and output file types, for performing file conversions.

These examples serve as practical demonstrations of how to leverage the capabilities of the opencf-core package in real-world scenarios. Users can refer to these examples for guidance on building their own file conversion utilities or integrating file conversion functionality into existing projects.

For more practical insight, read the documentation that accompanies the examples.

Todo

Backend Support

  • Introduce the concept of backend labeling for Reader and Writer implementations.

  • Enable multiple file readers/writers to share common backends. For instance, if an ImageOpenCVReader utilizes both numpy and OpenCV, the VideoWriter can leverage the same dependencies.

  • Allow users to specify preferred backend configurations, ensuring that conversion methods accommodate all selected backends seamlessly.

Contributing

Contributions to the opencf-core package are welcome! Feel free to submit bug reports, feature requests, or pull requests via the GitHub repository.

Disclaimer

Please note that while the opencf-core package aims to provide a versatile framework for file conversion tasks, it may not cover every possible use case or handle all edge cases. Users are encouraged to review and customize the code according to their specific requirements.

Usage Examples

Introduction

You can define your own converters, as in simple_converter.py. Then, you can pick some of those converters to build a CLI app, as done in cli_app_example.py. I’ve also added support for passing multiple files as input.

Multiple Files Support

At first, I only wanted to read one file and write another. Then I realized that some conversions (like img to pdf) need multiple files as input. When a converter needs only one file, it simply takes the first element of the list of inputs.

I then extended the functionality to accept lists of inputs, including:

  • Individual Files: You can specify individual files directly.

  • Folders: You can specify a folder, and all files within the folder will be considered.

  • Glob Patterns: You can use glob patterns to match multiple files based on pattern matching.

This enhancement provides greater flexibility and convenience for batch processing and complex file selection scenarios.
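The three input kinds can be normalized into a single list of files. The sketch below uses only the standard library and is not the code used by the CLI app: expand_inputs is a hypothetical helper, and its choices (no recursion into subfolders, duplicates dropped, first-seen order kept) are assumptions.

```python
import glob
from pathlib import Path
from typing import Iterable, List


def expand_inputs(specs: Iterable[str]) -> List[Path]:
    """Expand files, folders, and glob patterns into a de-duplicated file list."""
    found: List[Path] = []
    for spec in specs:
        path = Path(spec)
        if path.is_dir():
            # Folder: take every regular file directly inside it.
            candidates = sorted(p for p in path.iterdir() if p.is_file())
        elif path.is_file():
            candidates = [path]
        else:
            # Anything else is treated as a glob pattern (it may match nothing).
            candidates = sorted(Path(m) for m in glob.glob(spec) if Path(m).is_file())
        for candidate in candidates:
            if candidate not in found:
                found.append(candidate)
    return found
```

Note that most shells expand an unquoted pattern such as examples/data/*.txt before Python sees it, so the glob branch mainly matters when the pattern is quoted or the shell leaves it untouched.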

For example, the script below demonstrates how to convert all .txt files in a directory to a single output file:

python examples/cli_app_example.py examples/data/*.txt -o examples/output.txt

Similarly, you can specify folders or multiple files directly:

python examples/cli_app_example.py examples/data/file1.txt examples/data/file2.txt -o examples/output.txt

Or specify a folder to include all files within it:

python examples/cli_app_example.py examples/data/ -o examples/output.txt

Example Usage of TXTToTXTConverter with Enhanced Support

# Using glob patterns
python examples/cli_app_example.py examples/data/*.txt -o examples/output.txt

# Using a list of files
python examples/cli_app_example.py examples/data/file1.txt examples/data/file2.txt -o examples/output.txt

# Using a folder
python examples/cli_app_example.py examples/data/ -o examples/output.txt

# Combining different types
python examples/cli_app_example.py examples/data/file1.txt examples/data/*.txt -o examples/output.txt

Folder Saving Support

After adding multiple files support, I realized that some conversions (like pdf to img) need to save multiple files. So, I chose to give more flexibility to two options: the output filepath (-o) and the output file type (-ot).

Setting -o as a Folder

You cannot set a folder as the output without also giving a valid filetype, because the output format has to be inferred from something. So, let’s proceed under the assumption that the filetype (-ot) has also been set.

When you set a folder (as output_path) and a filetype, the folder is created and the files are saved in it. How does that work?

  • When the converter has a writer, only the filepath is used for saving.

  • When the converter doesn’t have a writer, the folder is sent along with a default filepath inside the folder. So, in the converter, you can choose any option. Below, for example, I use the output_file for saving instead of the output_folder.

    class TXTToTXTConverter(BaseConverter):
    
        file_reader = TxtToStrReader()
        # no file writer means the converter will handle the saving
    
        @classmethod
        def _get_supported_input_type(cls) -> FileType:
            return FileType.TEXT
    
        @classmethod
        def _get_supported_output_type(cls) -> FileType:
            return FileType.TEXT
    
        def _convert(self, input_contents: List[str], output_file: Path, **kwargs):
            md_content = "\n".join(input_contents)
            output_file.write_text(md_content)
    

For example, the script below will save the file examples/output/opencf-output.md:

python examples/cli_app_example.py examples/data/*.txt -o examples/output -ot md

Setting -o as a Filepath

When you send an output path that has a suffix (like myfile.txt, not myfile), the filepath will be sent to the converter. The output format is inferred from the filetype (-ot) if you set it; otherwise, it is inferred from the filepath suffix. If both the suffix and the output type are valid formats, they must match, or an error will be raised.

For example, the script below will save the file output.f:

python examples/cli_app_example.py examples/data/*.txt -o examples/output.f -ot md
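The rules for -o and -ot described in this section can be summarized in a small function. This is a sketch of the described behavior, not the actual implementation: resolve_output and KNOWN_TYPES are hypothetical, and the set of recognized types is an illustrative subset. The default file name opencf-output comes from the folder example above.

```python
from pathlib import Path
from typing import Optional, Tuple

KNOWN_TYPES = {"txt", "md", "csv", "json"}  # illustrative subset, not the real list


def resolve_output(output: str, output_type: Optional[str] = None) -> Tuple[Path, str]:
    """Return (path to write, output type) following the rules described above."""
    path = Path(output)
    suffix = path.suffix.lstrip(".").lower()
    if output_type is not None and output_type not in KNOWN_TYPES:
        raise ValueError(f"unknown output type: {output_type}")
    if not suffix:
        # No suffix: -o is treated as a folder, so -ot is required.
        if output_type is None:
            raise ValueError("a folder output needs an explicit output type")
        return path / f"opencf-output.{output_type}", output_type
    # A suffix that is not a known format (like .f) is kept but not used for inference.
    suffix_type = suffix if suffix in KNOWN_TYPES else None
    if output_type and suffix_type and output_type != suffix_type:
        raise ValueError(f"suffix .{suffix} does not match output type {output_type}")
    resolved = output_type or suffix_type
    if resolved is None:
        raise ValueError(f"cannot infer an output type from {output}")
    return path, resolved
```

Under these assumptions, resolve_output("examples/output", "md") points at examples/output/opencf-output.md, and resolve_output("examples/output.f", "md") keeps the path examples/output.f while using md as the format.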

Usage Example of TXTToTXTConverter to Merge TXT Files

python examples/cli_app_example.py examples/data/example.txt examples/data/example2.txt -o examples/output.txt -ot txt
# or
find examples -type f -name "*.txt" | xargs python examples/cli_app_example.py -o examples/output.txt -ot txt
# or
python examples/cli_app_example.py examples/data/*.txt -o examples/output.txt

Usage Example of TXTToMDConverter to Merge TXT Files into a MD File

python examples/cli_app_example.py examples/data/*.txt -o examples/output.md

XLSX to CSV Conversion Example

Below is an example of a converter that reads an XLSX file and converts it to a CSV file.

from pathlib import Path
from typing import List

import pandas as pd
from opencf_core.base_converter import BaseConverter, ResolvedInputFile
from opencf_core.filetypes import FileType
from opencf_core.io_handler import Reader, ListToCsvWriter

class SpreadsheetToPandasReader(Reader):
    input_format = pd.DataFrame

    def _check_input_format(self, content: pd.DataFrame):
        return isinstance(content, pd.DataFrame)

    def _read_content(self, input_path: Path) -> pd.DataFrame:
        return pd.read_excel(input_path)

class XLXSToCSVConverter(BaseConverter):
    file_reader = SpreadsheetToPandasReader()
    file_writer = ListToCsvWriter()

    @classmethod
    def _get_supported_input_types(cls) -> List[FileType]:
        return [FileType.XLSX, FileType.XLS]

    @classmethod
    def _get_supported_output_types(cls) -> FileType:
        return FileType.CSV

    def _convert(self, input_contents: List[pd.DataFrame]):
        df = input_contents[0]

        # Convert DataFrame to a list of lists
        data_as_list = df.values.tolist()

        # Insert column names as the first sublist
        data_as_list.insert(0, df.columns.tolist())

        return data_as_list

if __name__ == "__main__":
    input_file_path = "examples/data/example.xlsx"
    output_file_path = "examples/data/example.csv"

    input_file = ResolvedInputFile(input_file_path, is_dir=False, should_exist=True)
    output_file = ResolvedInputFile(
        output_file_path, is_dir=False, should_exist=False, add_suffix=True
    )

    converter = XLXSToCSVConverter(input_file, output_file)
    converter.convert()

For example, to convert an XLSX file to CSV, run the script as follows:

python examples/cli_app_example.py examples/data/example.xlsx -o examples/data/example.csv

Abstract Converter Class

The Converter class provides a structured way to define data converters, including methods to check input and output formats and to perform the conversion. The sections below contrast the previous approach with an example implementation built on this class.

Previous Implementation Approach

Previously, when writing an implementation of WriterBasedConverter, one would typically override the _convert method directly. Here’s a simplified example to illustrate:

class TXTToTXTConverter(BaseConverter):

    file_reader = TxtToStrReader()
    # no file writer means the converter will handle the saving

    @classmethod
    def _get_supported_input_type(cls) -> FileType:
        return FileType.TEXT

    @classmethod
    def _get_supported_output_type(cls) -> FileType:
        return FileType.TEXT

    def _convert(self, input_contents: List[str], output_file: Path, **kwargs):
        md_content = "\n".join(input_contents)
        output_file.write_text(md_content)

In this approach:

  • The _convert method is overridden to implement the conversion logic.

  • The input and output formats are defined within the _convert method itself.

New Implementation Approach with Converter Class

With the new Converter class, the conversion process is broken down into more modular steps:

  1. Checking Input Format: Ensure that the content meets the expected input format.

  2. Checking Output Format: Ensure that the content meets the expected output format.

  3. Performing Conversion: Implement the actual conversion logic.

This structure provides a more robust framework for implementing converters and facilitates better code reuse and readability.
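As a rough illustration of these three steps, the sketch below defines a stand-in base class and one subclass. It is not the actual Converter class from opencf-core: SketchConverter, JoinLines, the public convert method, and the choice to run the output check after the conversion are all assumptions made for this example.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class SketchConverter(ABC):
    """Hypothetical stand-in for the abstract Converter class."""

    @abstractmethod
    def _check_input_format(self, content: Any) -> bool:
        """Return True when the content has the expected input format."""

    @abstractmethod
    def _check_output_format(self, content: Any) -> bool:
        """Return True when the content has the expected output format."""

    @abstractmethod
    def _convert(self, content: Any) -> Any:
        """Turn the input content into the output content."""

    def convert(self, content: Any) -> Any:
        # Validate the input before doing any work.
        if not self._check_input_format(content):
            raise TypeError("unexpected input format")
        # Run the conversion logic itself.
        result = self._convert(content)
        # Validate what the conversion produced before handing it on.
        if not self._check_output_format(result):
            raise TypeError("unexpected output format")
        return result


class JoinLines(SketchConverter):
    """Joins a list of strings into one newline-separated string."""

    def _check_input_format(self, content: Any) -> bool:
        return isinstance(content, list) and all(isinstance(i, str) for i in content)

    def _check_output_format(self, content: Any) -> bool:
        return isinstance(content, str)

    def _convert(self, content: List[str]) -> str:
        return "\n".join(content)
```

Keeping the checks in the base class means a subclass only declares what it accepts and produces, and a malformed input fails early instead of inside the conversion logic.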

Example Implementation: StrToStrConverter

from typing import List

class StrToStrConverter(Converter):
    def _check_input_format(self, content: List[str]) -> bool:
        return isinstance(content, List) and all(
            isinstance(item, str) for item in content
        )

    def _check_output_format(self, content: str) -> bool:
        return isinstance(content, str)

    def _convert(self, content: List[str]) -> str:
        md_content = "\n".join(content)
        return md_content

Example Usage: MDToTXTConverter

The MDToTXTConverter class demonstrates the new approach where an attribute converters is defined:

class MDToTXTConverter(WriterBasedConverter):
    file_reader = TxtToStrReader()
    converters = [StrToStrConverter()]
    file_writer = StrToTxtWriter()

    @classmethod
    def _get_supported_input_types(cls) -> FileType:
        return FileType.MD

    @classmethod
    def _get_supported_output_types(cls) -> FileType:
        return FileType.TEXT

Key Differences and Benefits

Modularity and Reusability

  • Old Way: The conversion logic is embedded directly within the _convert method, making it less modular and harder to reuse.

  • New Way: The Converter class separates the concerns of checking input/output formats and performing the conversion, promoting modularity and reusability.

Clarity and Structure

  • Old Way: The conversion logic can become cluttered, especially when handling complex conversions involving multiple steps or checks.

  • New Way: By defining distinct methods for checking formats and performing conversion, the new approach offers a clearer and more structured way to implement converters.

Attribute converters

  • Old Way: The _convert method must be overridden for each specific converter.

  • New Way: One can define a list of converter instances in the converters attribute, allowing for chaining or combining multiple conversion steps easily.
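Chaining a list of conversion steps can be pictured as feeding each step's output into the next one. The sketch below uses plain callables and is only a guess at the idea behind the converters attribute: run_chain, strip_lines, and join_lines are hypothetical, and how opencf-core actually applies the list is not shown in this document.

```python
from typing import Any, Callable, Iterable, List

Step = Callable[[Any], Any]


def run_chain(steps: Iterable[Step], content: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        content = step(content)
    return content


def strip_lines(lines: List[str]) -> List[str]:
    """Remove leading and trailing whitespace from every line."""
    return [line.strip() for line in lines]


def join_lines(lines: List[str]) -> str:
    """Join the lines into a single newline-separated string."""
    return "\n".join(lines)
```

For example, run_chain([strip_lines, join_lines], [" a ", "b "]) first strips each line and then joins them, and an empty list of steps returns the content unchanged.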

Practical Example

To convert markdown files (.md) to text files (.txt) using the new MDToTXTConverter, you would use the following command:

python examples/cli_app_example.py examples/data/*.md -o examples/output.txt

Summary

The introduction of the abstract Converter class offers a more structured and modular approach to defining data converters. By separating the checking of input/output formats and the conversion logic, it enhances code clarity, reusability, and maintainability. The new approach also allows for defining a chain of converters through the converters attribute, further improving flexibility in handling complex conversion tasks.

Summary

Here’s a recap of the main points:

  • Custom Converters: You can define and use custom converters for various data transformation tasks.

  • Multi-file Support: The application can handle multiple files, folders, and glob patterns as input, providing flexibility for batch processing.

  • Output Options: The application supports saving output to a specified file or folder, with the ability to infer or specify the output format.

  • Abstract Converter Class: A structured way to define data converters, with methods to check input and output formats and perform the conversion.

  • Practical Example: Demonstrated using the MDToTXTConverter and StrToStrConverter classes to convert markdown files to text files.

By incorporating these features, the CLI application becomes a powerful tool for various file conversion tasks, accommodating complex input and output scenarios.