# CDAIG Preliminary Algorithm

## Author & Contact

- **Name:** David Alegre Alarza  
- **Institution:** Universitat Autònoma de Barcelona, Group of Interactive Coding of Images (GICI)  
- **Email:** david.alegre.alarza@uab.cat  
- **ORCID:** [0009-0009-7633-1184](https://orcid.org/0009-0009-7633-1184)

---

## Work in Progress

This software is still under development and may contain bugs.  

It is designed to build the **Compact Directed Acyclic Image Graph (CDAIG)** of two-dimensional data using a preliminary algorithm. This structure is a 2D extension of the original **Compact Directed Acyclic Word Graph (CDAWG)**, an automaton that recognizes the suffixes of a word.  

The CDAIG recognizes the *Isuffixes* of its input. Isuffixes are a way to define suffixes in two-dimensional data.

---

## Status

- Basic implementation of the preliminary algorithm is nearly complete:
  - `SlowFind` function implemented
  - Hopcroft algorithm implemented
  - Isuffix logic and input reading implemented
- Some known issues (see [Known Issues](#known-issues))

---

## Requirements / Installation

- **Language / Compiler:** C, built with Clang 16.0.0 (or an equivalent GCC version)  
- **Additional dependencies:**
  - To visualize the automaton: [Graphviz](https://graphviz.org/)  

    Installation:
    - Linux: `sudo apt-get install graphviz`
    - macOS: `brew install graphviz`
    - Windows: [Download here](https://graphviz.org/download)  

    Example command to render a `.dot` file as a PNG image:  
    ```bash
    dot -Tpng archive.dot -o archive.png
    ```
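
For orientation, a DOT file is plain text that lists the transitions of a directed graph, which is what `dot` renders. The following is a minimal sketch of writing such a file from C. The `dot_edge` struct, the `write_dot` function, and the graph name `automaton` are hypothetical and chosen for illustration only; they are not the project's own data structures or output routine.

```c
#include <stdio.h>

/* Hypothetical transition record; the project's own edge struct may differ. */
typedef struct {
    int from;          /* source state */
    int to;            /* target state */
    const char *label; /* text shown on the arrow */
} dot_edge;

/*
 * Writes n transitions as a directed graph in DOT syntax.
 * Assumes labels contain no double quotes or backslashes,
 * since this sketch does no escaping.
 * Returns 0 on success, -1 on a write error.
 */
int write_dot(FILE *out, const dot_edge *edges, size_t n)
{
    if (fprintf(out, "digraph automaton {\n") < 0)
        return -1;
    for (size_t i = 0; i < n; i++) {
        if (fprintf(out, "  %d -> %d [label=\"%s\"];\n",
                    edges[i].from, edges[i].to, edges[i].label) < 0)
            return -1;
    }
    if (fprintf(out, "}\n") < 0)
        return -1;
    return 0;
}
```

Passing a file opened with `fopen(path, "w")` as `out` would produce a file that the `dot` command above can render.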

---

## Compilation Guide

To compile the source code with `gcc`:

```bash
gcc -Wall -Wextra main.c cdaig.c hopcroft.c -o cdaig
```

---

## Basic Usage

Run the program with:

```bash
./cdaig <INPUT_FILE> <NUMBER_OF_ROWS> <NUMBER_OF_COLUMNS>
```

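- `<INPUT_FILE>` must be a `.raw` file: the rows of the two-dimensional data concatenated in row-major order.
- `<NUMBER_OF_ROWS>` and `<NUMBER_OF_COLUMNS>` give the dimensions of the data stored in `<INPUT_FILE>`.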
- **Output:**
  - On screen:
    1. `V`: number of states of the automaton
    2. `E`: number of transitions of the automaton
    3. `Suff`: number of unique Isuffixes of the automaton
  - A Graphviz `.dot` file named `<INPUT_FILE>.cdaig.dot`
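
To make the row-major layout concrete, here is a minimal sketch of loading a `.raw` file and addressing a symbol by row and column. It assumes one byte per symbol and a file holding at least `rows * cols` bytes. The names `raw_image`, `raw_load`, `raw_at`, and `raw_free` are hypothetical and are not taken from the project's sources.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical container for a rows x cols grid of one-byte symbols. */
typedef struct {
    size_t rows;
    size_t cols;
    unsigned char *data; /* rows * cols bytes, one row after another */
} raw_image;

/* Symbol at row r, column c: row r starts at offset r * cols. */
unsigned char raw_at(const raw_image *img, size_t r, size_t c)
{
    assert(r < img->rows && c < img->cols);
    return img->data[r * img->cols + c];
}

/*
 * Reads the first rows * cols bytes of the file at path.
 * Returns 0 on success, -1 on bad dimensions, I/O error, or short file.
 */
int raw_load(const char *path, size_t rows, size_t cols, raw_image *out)
{
    /* Reject empty dimensions and products that would overflow size_t. */
    if (rows == 0 || cols == 0 || rows > (size_t)-1 / cols)
        return -1;

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    size_t n = rows * cols;
    unsigned char *buf = malloc(n);
    if (buf == NULL) {
        fclose(f);
        return -1;
    }

    size_t got = fread(buf, 1, n, f);
    fclose(f);
    if (got != n) {
        free(buf);
        return -1;
    }

    out->rows = rows;
    out->cols = cols;
    out->data = buf;
    return 0;
}

/* Releases the buffer owned by img. */
void raw_free(raw_image *img)
{
    free(img->data);
    img->data = NULL;
}
```

For a file holding the six bytes `abcdef` loaded with 2 rows and 3 columns, `raw_at(&img, 1, 0)` returns `'d'`, the first symbol of the second row.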

---

## Experiments Directory Structure

The project includes an `Experiments` folder containing two subdirectories, each with specific experiment data and related programs:

- **Aligned Sequences Experiment (`Alignment Experiment/`)**
  - Contains data and scripts related to the alignment of biological sequences.
  - Includes:
    - The experiment execution script (`experiment.sh`).
    - The program that processes FASTA data (`fasta_extractor.c`).
    - The program used to merge two sequences into a single file (`merge.c`).

- **Synthetic Mutations Experiment (`Synthetic Experiment/`)**
  - Contains data and scripts used to generate synthetic mutations.
  - Includes:
    - The experiment execution script (`experiment.sh`).
    - The mutator program (`mutator.c`).
    - The program that processes FASTA data (`fasta_extractor.c`).
    - The program used to merge two sequences into a single file (`merge.c`).

Each subdirectory contains the necessary files to reproduce the experiments independently.

---

## Known Issues

- The `beginpos` field of `struct edge` is computed incorrectly when `<NUMBER_OF_ROWS>` is greater than 2.

---

## Next Steps

- Add filename parsing to extract the parameters automatically.
- Add an option to set the symbol length (the program currently reads byte by byte).
- Add support for other output formats.
- Optimize performance.

---

**Last Modified:** 03/10/2025
