Parsing Org Files

Keywords: awk org pkm
Created: 2020-05-23
Extracting information from a large number of Org-mode files as quickly as possible.

Dataset

(as of [2020-05-23 Sat])

  • 2234 files
  • 599764 lines
  • 28M in total size

Extracting Titles

A core component of my Org setup is an Ivy-based selector for quickly opening and linking to files.

This selector should display the #+TITLE: of a file, if available.

In addition, files can have aliases (set with #+ZK_ALIAS:), e.g. to allow linking to a file titled "Polynomial Time Reduction" with a link description of "polynomial time reducible".

AWK

AWK, a programming language for processing text, is my favorite tool for tasks like this.

BEGINFILE {
    i = 0;
    has_title = 0;
    OFS="\t";
}

match($0, /^#\+TITLE:[ \t]+(.*)/, a) {
    print FILENAME, a[1];
    has_title = 1;
    next;
}

match($0, /^#\+ZK_ALIAS:[ \t]+(.*)/, a) {
    print FILENAME, a[1];
    has_title = 1;
    next;
}

{
    i += 1;
    if (i > 10) {
        nextfile;
    }
}

ENDFILE {
    if (!has_title) {
        print FILENAME, FILENAME;
    }
}

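This script relies on GNU Awk: BEGINFILE, ENDFILE, and the three-argument form of match are gawk extensions. It only scans the top of each file, skipping to the next file once it has seen more than 10 lines that are neither a title nor an alias, and prints each title and alias it encounters together with the filename. Files without a title are listed under their filename.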

org-files | xargs awk -f awk/titles.awk

xargs could run awk in parallel (with -P), but that doesn't work well with Emacs' shell-command-to-string.

C

Here's my first try at extracting titles and aliases with C. Files are not processed in parallel but lines are matched using string comparisons instead of regex.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <dirent.h>
#include <string.h>

const char* ORG_SUFFIX = ".org";
const int MAX_LINES = 10;

int is_org_file(char* path) {
  size_t pathlen = strlen(path);
  if (pathlen < 4) {
    return 0;
  }

  return !strncmp(path + pathlen - 4, ORG_SUFFIX, 4);
}

void process_file(char* path) {
  char *line = NULL;
  size_t len = 0;  // Capacity of the getline() buffer, not the line length
  ssize_t read;
  int n_lines = 0;
  char *title = NULL;
  int found_title = 0;

  FILE *fp = fopen(path, "r");
  if (fp == NULL) {
    perror(path);
    return;
  }

  while ((read = getline(&line, &len, fp)) != -1) {
    // Remove the trailing newline, if there is one
    if (read > 0 && line[read - 1] == '\n') {
      line[read - 1] = 0;
    }

    // Match the keyword with a plain prefix comparison
    if (strncmp(line, "#+TITLE:", 8) == 0) {
      title = line + 8;
    } else if (strncmp(line, "#+ZK_ALIAS:", 11) == 0) {
      title = line + 11;
    } else {
      title = NULL;
    }

    if (title) {
      // Skip leading whitespace
      while (title[0] == ' ' || title[0] == '\t') {
        title++;
      }
      printf("%s\t%s\n", path, title);
      found_title = 1;
    }

    // Only look at the first MAX_LINES lines of each file
    if (++n_lines >= MAX_LINES) {
      break;
    }
  }

  if (!found_title) {
    printf("%s\t%s\n", path, path);
  }

  free(line);  // getline() allocates the buffer, the caller frees it
  fclose(fp);
}

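// path is expected to end with a '/'
void process_dir(char* path) {
  DIR *dp = opendir(path);
  struct dirent *ep;

  if (dp != NULL) {
    while ((ep = readdir(dp))) {
      // Skip hidden entries as well as "." and ".."
      if (ep->d_name[0] != '.') {
        // Build the full path with a bounds check; directories get a trailing '/'
        char full_name[4096];
        int n = snprintf(full_name, sizeof full_name, "%s%s%s", path,
                         ep->d_name, ep->d_type == DT_DIR ? "/" : "");
        if (n < 0 || (size_t)n >= sizeof full_name) {
          continue;  // Path too long, skip this entry
        }

        // Note: d_type can be DT_UNKNOWN on some filesystems,
        // such entries are ignored here
        if (ep->d_type == DT_REG) {
          if (is_org_file(ep->d_name)) {
            process_file(full_name);
          }
        } else if (ep->d_type == DT_DIR) {
          process_dir(full_name);
        }
      }
    }

    closedir(dp);
  } else {
    perror("Couldn't open the directory");
  }
}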

int main(int argc, char *argv[]) {
  process_dir("/home/leon/org/");
  return 0;
}

Getting the Data Into Emacs

The next task is to get the results into Emacs as fast as possible.

shell-command-to-string

Just getting a string with the outputs of the programs is pretty fast.

Variant  Time (average)
AWK      112ms
C        41ms

TSV Parsing

(defun org-zk-parse-titles (titles)
  (mapcar
   (lambda (line) (split-string line "\t"))
   (split-string titles "\n")))

To my surprise, splitting the output into lines and the lines into fields adds a few milliseconds.

Variant  Time (average)
AWK      112ms
C        46ms

read

split-string is implemented in Emacs Lisp, while read is a built-in primitive. To improve performance, we can generate valid S-expressions in the AWK and C programs and use read to parse the output.

This is especially useful when the data format is more complex than a two-column TSV file.


Last export: 2020-07-17 Fri 23:16
