Practical Embedding of Data as a Virtual File System in C/C++ Executables

Note: The source code for this project can be found here. The title image was made by me (generate your own with this gist).

Compiled executables often load resources, such as image files or shader code, directly from disk at runtime.

This has the effect that we require the resources to exist at some predefined absolute or relative path on the local system when the program is executed.

This can complicate the software shipping process and require additional installation steps for the user, such as setting path variables or copying files to intended locations, instead of simply shipping a standalone executable independent of the local file system.

In this article, I will present a practical method for embedding arbitrary data in C/C++ executables as a virtual file system. It requires only a minor modification to the build system and no modifications to the original source code, and it has some other advantages:

Files remain on disk as-is, so the data can still be edited with your software of choice (including full syntax highlighting)
Easy switching between local and embedded file system at build time

The method works by leveraging the C-preprocessor and the linker. A small program which implements this concept, called “c-embed”, can be found on GitHub here.

Note: Here are some links to discussion about this problem.

https://stackoverflow.com/questions/7288279/how-to-embed-a-file-into-an-executable
https://caiorss.github.io/C-Cpp-Notes/resources-executable.html
https://codeplea.com/embedding-files-in-c-programs
https://csl.name/post/embedding-binary-data/

Note: These methods only make sense if the data you wish to use is sufficiently small to be loaded in RAM at all times.

Alternative Methods

Embedding string / char* Literals

The simplest method is to embed your data directly as a string literal or as a char array initializer.

// embedded string / char literals

const unsigned char binary_data[] = { 0x00, 0x01, ... };

const char* shader_source = "\
here is a text file!\n\
note the linebreak and backslash!\
";

The main drawbacks of this approach are manual symbol management, lack of syntax highlighting for text and the conversion of binary data to code.

A common approach to alleviate these issues is to extend the build toolchain to generate a header file which contains the encoded data. This allows the files to remain as they are during building, but still requires corresponding symbol management and toolchain extension.

Note: Even using a program that outputs a header file still requires that you access the data through their symbols and not the names of the files.
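As a rough illustration of the header-generation approach, the sketch below writes a byte buffer as a C array definition, similar in spirit to what `xxd -i` produces. The function name `emit_c_array` and the exact output layout are assumptions made for this example; they are not taken from any existing tool.

```c
#include <stdio.h>
#include <stddef.h>

// Hypothetical helper: write `len` bytes from `data` to `out` as a C array
// definition named `symbol`, followed by a matching length constant.
// Assumes len > 0 (an empty initializer list is not valid in older C standards).
// Returns 0 on success, -1 if writing failed.
int emit_c_array(FILE* out, const char* symbol,
                 const unsigned char* data, size_t len){

  if(fprintf(out, "const unsigned char %s[] = {", symbol) < 0)
    return -1;

  for(size_t i = 0; i < len; i++){
    // Start a new indented line every 12 bytes to keep the header readable
    const char* sep = (i % 12 == 0) ? "\n  " : " ";
    if(fprintf(out, "%s0x%02x,", sep, data[i]) < 0)
      return -1;
  }

  if(fprintf(out, "\n};\nconst unsigned long %s_len = %lu;\n",
             symbol, (unsigned long)len) < 0)
    return -1;

  return 0;
}
```

The generated header still has to be included by name, and the data is still reached through the symbol passed in, so the symbol-management drawback remains.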

ASCII File #include as string literal

Another simple method is to use the #include preprocessor directive in order to embed an ASCII file as a string literal.

// preprocessor include string literal

const char* file_content =
    #include "file.txt"
;

As this uses the C-preprocessor, this requires no toolchain modification and embeds the file at compile time.

This has one sneaky drawback though: it requires that you wrap your text file in an innocuous little C++11 raw string literal, R""(_)""!

R""(
the actual content of the text file is this!
)""

This is not very pleasing, can mess up syntax highlighting for some languages (such as GLSL) in many text editors, and is of course not extensible to binary data.

objcopy, ld and link

The last common approach I will present is objcopy, a binary utility which converts files directly into object files, which you can link to expose symbols for access. An object file can be generated as follows:

objcopy --input-target binary --output-target elf64-x86-64 --binary-architecture i386 data.file data.file.o

Multiple object files can then be easily merged using ld:

ld -relocatable *.o -o merged.o

Note: objcopy builds the object files for a specific output architecture, which needs to be taken into consideration. In 2022, I still couldn’t get this approach to work on macOS.

Linking the object file exposes three symbols per file for data access, which objcopy creates with a specific naming convention: it takes the relative path of the file, converts all characters which are not valid C symbol characters (e.g. ‘.’, ‘/’) to underscores, prepends “_binary_” and appends “_start”, “_end” or “_size”.

extern char _binary_data_file_start;
extern char _binary_data_file_end;
extern char _binary_data_file_size;

To access the full binary data, we simply retrieve the pointer &_binary_data_file_start as our “file starting position” and can begin to iterate.
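As a minimal sketch of that iteration, the functions below walk a byte range given by a start and an end pointer. The names `sum_bytes` and `range_size` are assumptions for illustration; with a linked objcopy object you would pass `&_binary_data_file_start` and `&_binary_data_file_end`.

```c
#include <stddef.h>

// Hypothetical helper: add up all bytes in the half-open range [start, end).
// With an objcopy-generated object file, the call would look like
//   sum_bytes(&_binary_data_file_start, &_binary_data_file_end);
unsigned long sum_bytes(const char* start, const char* end){
  unsigned long sum = 0;
  for(const char* p = start; p != end; ++p)
    sum += (unsigned char)*p;
  return sum;
}

// Length of the same range, computed from the two pointers
size_t range_size(const char* start, const char* end){
  return (size_t)(end - start);
}
```

Computing the length as end minus start avoids relying on the `_size` symbol, whose value is carried by the symbol's address rather than stored in memory.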

The default objcopy naming convention can lead to symbol naming collisions, for instance with the following files:

./data/txt
./data.txt
./data_txt
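The collision can be reproduced with a small sketch of the naming rule described above. The function name `mangle` is an assumption for this example, and the rule is a simplified model of what objcopy does, not its actual source.

```c
#include <ctype.h>
#include <string.h>

// Sketch of the naming rule: every character that is not alphanumeric
// becomes '_', and "_binary_" is prepended.
// `out` must have room for strlen("_binary_") + strlen(path) + 1 bytes.
void mangle(const char* path, char* out){
  strcpy(out, "_binary_");
  char* p = out + strlen(out);
  for(; *path; ++path)
    *p++ = isalnum((unsigned char)*path) ? *path : '_';
  *p = '\0';
}
```

Under this rule, all three paths above map to the same prefix `_binary___data_txt`, so their `_start`, `_end` and `_size` symbols would clash.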

While we can manually circumvent this issue by using the --redefine-sym flag for objcopy…

objcopy --input-target binary --output-target elf64-x86-64 --binary-architecture i386 --redefine-sym _binary_data_txt_start=custom_symbol_start data.txt data.txt.o

…this becomes difficult to automate and we generally can’t guarantee that we avoid naming collisions, even using tricks, because of the reduced character set.

Additionally, each new file adds a set of three new symbols to manage, and we cannot force the program to declare them automatically at compile time based on which object files we link.

While this approach is probably the most elegant presented so far, it is not scalable because the C-preprocessor does not allow for any kind of macro iteration (which could be used for automated symbol declaration and accessing).

Note: Macro iteration is only possible up to a defined maximum iteration number using variadic argument tricks. I tried this previously but found the approach inelegant.
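For reference, a bounded variadic trick of this kind can be sketched as follows. All macro names here (`NARGS`, `FOR_EACH`, `DECLARE_EMBEDDED`) and the two symbol stems are hypothetical, and the hard limit of four arguments is exactly the kind of fixed maximum the note refers to.

```c
// Count between 1 and 4 variadic arguments by shifting a number list
#define NARGS_(_1, _2, _3, _4, N, ...) N
#define NARGS(...) NARGS_(__VA_ARGS__, 4, 3, 2, 1, 0)

// Two-step concatenation so that NARGS(...) is expanded before pasting
#define CAT_(a, b) a##b
#define CAT(a, b) CAT_(a, b)

// One macro per supported argument count: this is the fixed upper bound
#define EACH_1(M, a)      M(a)
#define EACH_2(M, a, ...) M(a) EACH_1(M, __VA_ARGS__)
#define EACH_3(M, a, ...) M(a) EACH_2(M, __VA_ARGS__)
#define EACH_4(M, a, ...) M(a) EACH_3(M, __VA_ARGS__)
#define FOR_EACH(M, ...)  CAT(EACH_, NARGS(__VA_ARGS__))(M, __VA_ARGS__)

// Declare the three objcopy symbols for one mangled file name
#define DECLARE_EMBEDDED(name)        \
  extern char _binary_##name##_start; \
  extern char _binary_##name##_end;   \
  extern char _binary_##name##_size;

// Expands to six extern declarations (hypothetical file names)
FOR_EACH(DECLARE_EMBEDDED, data_txt, shader_glsl)
```

Supporting more files means adding more `EACH_n` macros by hand, and the list of names still has to be written out in the source.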

Embedding Virtual File Systems

My approach for embedding virtual file systems in a C binary was inspired by the objcopy approach.

Having tried to force the C-preprocessor to automatically declare and make available all files as symbols, I realized that the lack of true macro iteration was the main bottleneck.

The solution is to stop defining a set of external symbols for each file and instead define only two sets of external symbols: one for a single binary concatenation of all files, and one for a generated indexing structure.

The symbols can thus have fixed, predefined names and the iteration logic is the responsibility of the indexing structure.

Both the binary file-system concatenation and indexing structure are generated as files, and then the regular objcopy process is applied.

The Indexing Structure

The indexing structure file is a simple concatenation of fixed-size indexing structs (EMAP_S), each containing the data required for finding one block in the virtual file system.

The main goal of the indexing structure is to avoid naming collisions and to allow for fast indexing of the virtual file system. This is easily achieved using a hashing function.

u_int32_t hash(char * key){   // Hash Function: MurmurOAAT32
  u_int32_t h = 3323198485ul;
  for (;*key;++key) {
    h ^= *key;
    h *= 0x5bd1e995;
    h ^= h >> 15;
  }
  return h;
}

struct EMAP_S {     // Map Indexing Struct
  u_int32_t hash;
  u_int32_t pos;
  u_int32_t size;
};
typedef struct EMAP_S EMAP;

When a new file is added to the virtual file system (by concatenating the binary data), we append a new index struct containing the hash of the unsanitized file name, the file’s starting position in the virtual file system and its size in bytes.

To retrieve the position of a file, we can simply iterate over the (short) list of file indices, comparing the file name hash.

Note: this also avoids the “reduced character set” problem.
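The lookup described above can be sketched on an in-memory index. The names `Entry`, `hash_name`, `find_entry` and `has_collision` are assumptions for this example. Since two different names can in principle share a 32-bit hash, the sketch also includes a check that a build tool could run over the finished index.

```c
#include <stddef.h>
#include <stdint.h>

// Same three fields as EMAP_S, written with <stdint.h> types
typedef struct { uint32_t hash, pos, size; } Entry;

// Same mixing steps as the hash function in the article
uint32_t hash_name(const char* key){
  uint32_t h = 3323198485ul;
  for(; *key; ++key){
    h ^= *key;
    h *= 0x5bd1e995;
    h ^= h >> 15;
  }
  return h;
}

// Linear search over `count` entries; returns NULL if no hash matches
const Entry* find_entry(const Entry* index, size_t count, const char* name){
  const uint32_t key = hash_name(name);
  for(size_t i = 0; i < count; i++)
    if(index[i].hash == key)
      return &index[i];
  return NULL;
}

// Returns 1 if two entries share a hash (lookups would be ambiguous), else 0
int has_collision(const Entry* index, size_t count){
  for(size_t i = 0; i < count; i++)
    for(size_t j = i + 1; j < count; j++)
      if(index[i].hash == index[j].hash)
        return 1;
  return 0;
}
```

The three file names that collide under the objcopy naming convention are hashed from their unsanitized form here, so they end up as three distinct entries.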

Binary File-System Concatenation

We can write a simple C program that iterates through a directory structure, concatenating all files into a binary file while building the indexing structure file.

A single file addition can be written in a function:

FILE* ms = NULL;    // Mapping Structure
FILE* fs = NULL;    // Virtual Filesystem
FILE* file = NULL;  // Embed Target File Pointer
u_int32_t pos = 0;  // Current Position

void cembed(char* filename){

  file = fopen(filename, "rb");  // Open the Embed Target File
  if(file == NULL){
    printf("Failed to open file %s.\n", filename);
    return;
  }

  fseek(file, 0, SEEK_END);     // Define Map
  EMAP map = {hash(filename), pos, (u_int32_t)ftell(file)};
  rewind (file);

  // Allocate at least one byte so that empty files are not reported as errors
  char* buf = malloc(map.size ? map.size : 1);
  if(buf == NULL){
    printf("Memory error for file %s.\n", filename);
    fclose(file);
    return;
  }

  u_int32_t result = fread(buf, 1, map.size, file);
  if(result != map.size){
    printf("Read error for file %s.\n", filename);
    free(buf);
    fclose(file);
    return;
  }

  fwrite(&map, sizeof map, 1, ms);  // Write Mapping Structure
  fwrite(buf, 1, map.size, fs);     // Write Virtual Filesystem

  free(buf);        // Free Buffer
  fclose(file);     // Close the File
  file = NULL;      // Reset the Pointer
  pos += map.size;  // Shift the Index Position

}

The directory iteration function can be written as:

#define CEMBED_DIRENT_FILE 8
#define CEMBED_DIRENT_DIR 4
#define CEMBED_MAXPATH 512

void iterdir(char* d){

  char* fullpath = (char*)malloc(CEMBED_MAXPATH*sizeof(char));
  if(fullpath == NULL) return;

  DIR *dir;
  struct dirent *ent;

  if ((dir = opendir(d)) != NULL) {

    while ((ent = readdir(dir)) != NULL) {

      if(strcmp(ent->d_name, ".") == 0) continue;
      if(strcmp(ent->d_name, "..") == 0) continue;

      // Build "<d>/<name>", skipping entries whose path would not fit
      int n = snprintf(fullpath, CEMBED_MAXPATH, "%s/%s", d, ent->d_name);
      if(n < 0 || n >= CEMBED_MAXPATH) continue;

      if(ent->d_type == CEMBED_DIRENT_FILE)
        cembed(fullpath);     // Regular file: embed it

      else if(ent->d_type == CEMBED_DIRENT_DIR)
        iterdir(fullpath);    // Directory: recurse

    }

    closedir(dir);

  }

  else {

    cembed(d);    // Not a directory: treat the argument as a single file

  }

  free(fullpath);

}

Finally, calling this function on the desired root directory or file will create two files, containing the virtual file system and virtual file system index:

int main(int argc, char* argv[]){

  if(argc <= 1)
    return 0;

  // Build the Mapping Structure and Virtual File System

  ms = fopen("virtual_file_system.map", "wb");
  fs = fopen("virtual_file_system.fs", "wb");

  if(ms == NULL || fs == NULL){
    printf("Failed to initialize map and filesystem. Check permissions.\n");
    return 1;
  }

  for(int i = 1; i < argc; i++)
    iterdir(argv[i]);

  fclose(ms);
  fclose(fs);

  return 0;
}

Note: I am aware that the embedding process thus requires an additional data “duplication” operation before the embedding process. If you have ideas how to skip this, let me know.

Accessing the Virtual File System

Having run our virtual file system and its index through objcopy and ld, we can link the resulting object file and access both of them using the external symbols:

extern char cembed_map_start; // Embedded Indexing Structure
extern char cembed_map_end;
extern char cembed_map_size;

extern char cembed_fs_start;  // Embedded Virtual File System
extern char cembed_fs_end;
extern char cembed_fs_size;

We can then write convenient <stdio.h> style accessor functions to manipulate these symbols and extract data!

We begin by creating a virtual file stream struct, which contains the relevant indexing pointers:

struct EFILE_S {    // Virtual File Stream
  char* pos;
  char* end;
  size_t size;
  int err;
};
typedef struct EFILE_S EFILE;

A simple fopen style function for the virtual file system then would look as follows. Note that this is the only function that needs to access the map (i.e. retrieve the pointers).

EFILE* eopen(const char* file, const char* mode){

  EMAP* map = (EMAP*)(&cembed_map_start);
  const char* end = &cembed_map_end;

  if( map == NULL || end == NULL )
    ethrow(EERRCODE_NOMAP);

  const u_int32_t key = hash((char*)file);
  while( ((char*)map != end) && (map->hash != key) )
    map++;

  if((char*)map == end)   // Reached the end of the map: no match, do not read past it
    ethrow(EERRCODE_NOFILE);

  EFILE* e = (EFILE*)malloc(sizeof *e);
  if(e == NULL)
    return NULL;
  e->pos = (&cembed_fs_start + map->pos);
  e->end = (&cembed_fs_start + map->pos + map->size);
  e->size = map->size;
  e->err = 0;

  return e;

}

The analogs of feof and fgets can be written as:

bool eeof(EFILE* e){
  if(e == NULL){
    (eerrcode = (EERRCODE_NULLSTREAM));
    return true;
  }
  if(e->end < e->pos){
    (eerrcode = (EERRCODE_OOBSTREAMPOS));
    return true;
  }
  if((size_t)(e->end - e->pos) > e->size){  // More bytes remaining than the file holds
    (eerrcode = (EERRCODE_OOBSTREAMPOS));
    return true;
  }
  return (e->end == e->pos);
}

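char* egets ( char* str, int num, EFILE* stream ){

  if(num <= 0 || eeof(stream))
    return NULL;

  int i = 0;
  while(i < num - 1 && !eeof(stream)){
    char c = *(stream->pos++);
    str[i++] = c;
    if(c == '\n') break;  // Like fgets, keep the newline and stop after it
  }
  str[i] = '\0';          // Always null-terminate the result

  return str;

}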

The remaining <stdio.h> style functions I wrote for this system can be found in the repository on GitHub.
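To give an idea of what such functions involve, here is a minimal sketch of fgetc and fread analogs over a memory range. It uses a stand-in type `MemStream` and the hypothetical names `mem_getc` and `mem_read`; it is not the implementation from the repository.

```c
#include <stddef.h>
#include <string.h>

// Stand-in for EFILE, reduced to the two cursor pointers
typedef struct { const char* pos; const char* end; } MemStream;

// fgetc analog: return the next byte as an unsigned char value, or -1 at the end
int mem_getc(MemStream* s){
  if(s == NULL || s->pos >= s->end)
    return -1;
  return (unsigned char)*(s->pos++);
}

// fread analog: copy up to `count` items of `size` bytes each into `dst`
// and return the number of whole items copied
size_t mem_read(void* dst, size_t size, size_t count, MemStream* s){
  if(s == NULL || size == 0 || s->pos >= s->end)
    return 0;
  size_t avail = (size_t)(s->end - s->pos) / size;  // whole items remaining
  size_t n = (count < avail) ? count : avail;
  memcpy(dst, s->pos, n * size);
  s->pos += n * size;
  return n;
}
```

Because the data already sits in memory, each function only moves a pointer and copies bytes; no buffering or system calls are involved.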

Zero-Modification Embedding

The nice part about accessing an arbitrary number of files with a fixed number of symbols through <stdio.h> style functions is that the C-preprocessor can be used to make switching between local and virtual file systems very easy.

Simply passing a flag to our compiler to define a macro CEMBED_TRANSLATE allows us to rename all <stdio.h> functions to our embedded access functions, essentially remapping file access from the local to the virtual files.

#ifdef CEMBED_TRANSLATE
#define FILE EFILE
#define fopen eopen
#define fclose eclose
#define feof eeof
#define fgets egets
#define fgetc egetc
#define perror eerror
#define fread eread
#define fseek eseek
#define ftell etell
#endif
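The effect of such a translation layer can be tried out in isolation. In the sketch below, `MFILE`, `mopen`, `mgetc` and `mclose` are hypothetical stand-ins for the embedded access functions, and the single in-memory "file" named demo.txt is an assumption of the example.

```c
#include <stdio.h>
#include <string.h>

// Hypothetical in-memory stream standing in for EFILE
typedef struct { const char* pos; const char* end; } MFILE;

static const char demo_data[] = "hello\n";  // pretend this is the embedded file
static MFILE demo_stream;

MFILE* mopen(const char* name, const char* mode){
  (void)mode;                               // mode is ignored in this sketch
  if(strcmp(name, "demo.txt") != 0)
    return NULL;
  demo_stream.pos = demo_data;
  demo_stream.end = demo_data + sizeof demo_data - 1;
  return &demo_stream;
}

int mgetc(MFILE* f){
  return (f->pos < f->end) ? (unsigned char)*(f->pos++) : EOF;
}

int mclose(MFILE* f){ (void)f; return 0; }

// The translation layer: from here on, stdio names refer to the stand-ins
#define FILE   MFILE
#define fopen  mopen
#define fgetc  mgetc
#define fclose mclose

// "Application" code written purely against the <stdio.h> names
int count_bytes(const char* path){
  FILE* f = fopen(path, "r");
  if(f == NULL) return -1;
  int n = 0;
  while(fgetc(f) != EOF) n++;
  fclose(f);
  return n;
}
```

The body of `count_bytes` never mentions the stand-in functions, which is what allows the same source to be compiled against either file system.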

Finally, by passing additional flags to the compiler, we can include any required headers directly from the command line instead of modifying our code. This means that with a well-defined makefile, we can choose which file system we use entirely at build time without modifying our code at all.

Note: In the following example, the c-embed binary produces the file system and indexing files and runs objcopy and ld. Finally, it prints the name of the produced object file.

# c-embed build system

# data directory to embed
DAT = data

# c-embed configuration flags
CEF = -include /usr/local/include/c-embed.h -DCEMBED_TRANSLATE

# build rules
.PHONY: embedded relative

embedded: CF = $(shell c-embed $(DAT)) $(CEF)
embedded: all

relative: CF =
relative: all

build:
	gcc main.c $(CF) -o main

all: build

Compiling the following program with make relative or make embedded will produce a binary which uses the local and the embedded file systems respectively:

#include <stdio.h>

int main(int argc, char* args[]){

  FILE* eFile = fopen("data/data2/data2.txt", "r");

  char buffer [100] = {' '};

  if (eFile == NULL){
    perror ("Error opening file");
    return 1;   // Do not fall through to fclose(NULL)
  }

  while(!feof(eFile)){
    if( fgets(buffer, 100, eFile) == NULL ) break;
    fputs (buffer , stdout);
  }

  fclose(eFile);

}

Final Words

Possible Extensions

The virtual file system can be made writeable, either by padding the virtual file system file appropriately or by creating a more complex indexing structure (i.e. fragment the files).

It would make sense to include some more metadata in the file structure, for instance the file names or creation times, so that a simple program could print information about the files contained in the object file. These are currently not included as the files are found purely by the hash.
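One possible shape for such an extension is sketched below. The struct `NamedEntry`, the field width `ENTRY_NAME_LEN` and both functions are hypothetical; a fixed-width name field is chosen here only because it keeps the index entries at a fixed size.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ENTRY_NAME_LEN 64   // hypothetical fixed width of the name field

// Index entry extended with the original file name
typedef struct {
  uint32_t hash, pos, size;
  char name[ENTRY_NAME_LEN];   // truncated if necessary, always null-terminated
} NamedEntry;

// Store `name` in the entry, truncating it to fit the fixed-width field
void set_name(NamedEntry* e, const char* name){
  strncpy(e->name, name, ENTRY_NAME_LEN - 1);
  e->name[ENTRY_NAME_LEN - 1] = '\0';
}

// Print one line per entry: size in bytes, then the stored name
void list_entries(FILE* out, const NamedEntry* index, size_t count){
  for(size_t i = 0; i < count; i++)
    fprintf(out, "%10lu  %s\n", (unsigned long)index[i].size, index[i].name);
}
```

Lookups would still compare hashes only; the stored name serves for listing and could also be used to confirm a match.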

Finally, it would be cool if the program was capable of embedding multiple file system object files simultaneously. Currently this is not possible, because only one set of symbols is used. A similar problem arises with naming conflicts and lack of macro iteration, and this warrants more consideration.