Extracting information from large amounts of org-mode files as quickly as possible.
Dataset
(as of )
- 2234 files
- 599764 lines
- 28M in total size
Extracting Titles
A core component of my org setup is an Ivy-based selector for quickly opening and linking to files.
This selector should display the #+TITLE: of a file, if available.
In addition, files can have aliases (set with the #+ZK_ALIAS: keyword),
e.g. to allow linking to a file titled "Polynomial Time Reduction"
with a link description of "polynomial time reducible".
AWK
AWK, a programming language for processing text, is my favorite tool for tasks like this.
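    # Requires GNU awk: BEGINFILE, ENDFILE and the
    # three-argument match() are gawk extensions.
    BEGINFILE {
      has_title = 0;
      OFS = "\t";
    }

    # Only look at the first 10 lines of each file.
    FNR > 10 {
      nextfile;
    }

    match($0, /^#\+TITLE:[ \t]+(.*)/, a) {
      print FILENAME, a[1];
      has_title = 1;
      next;
    }

    match($0, /^#\+ZK_ALIAS:[ \t]+(.*)/, a) {
      print FILENAME, a[1];
      has_title = 1;
      next;
    }

    # Fall back to the filename if no title or alias was found.
    ENDFILE {
      if (!has_title) {
        print FILENAME, FILENAME;
      }
    }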
This piece of code parses at most 10 lines of each file, printing out each title and alias it encounters together with the filename.
org-files | xargs awk -f awk/titles.awk
xargs could run awk in parallel, but that doesn't work well with Emacs' shell-command-to-string.
C
Here's my first try at extracting titles and aliases with C. Files are not processed in parallel but lines are matched using string comparisons instead of regex.
    #define _GNU_SOURCE /* for getline() and dirent's d_type */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <dirent.h>
    #include <string.h>
    #include <limits.h>

    const char* ORG_SUFFIX = ".org";
    const int MAX_LINES = 10;

    int is_org_file(const char* path) {
      size_t pathlen = strlen(path);
      if (pathlen < 4) {
        return 0;
      }
      return !strncmp(path + pathlen - 4, ORG_SUFFIX, 4);
    }

    void process_file(const char* path) {
      char *line = NULL;
      size_t len = 0; // Buffer capacity managed by getline, not the line length
      ssize_t read;
      int n_lines = 0;
      char *title = NULL;
      int found_title = 0;

      FILE *fp = fopen(path, "r");
      if (fp == NULL) {
        perror(path);
        return;
      }

      while ((read = getline(&line, &len, fp)) != -1) {
        // Remove the trailing newline, if there is one
        if (line[read - 1] == '\n') {
          line[--read] = 0;
        }

        if (read > 8 && strncmp(line, "#+TITLE:", 8) == 0) {
          title = line + 8;
          // Skip leading whitespace
          while (title[0] == ' ' || title[0] == '\t') {
            title++;
          }
          printf("%s\t%s\n", path, title);
          found_title = 1;
        } else if (read > 11 && strncmp(line, "#+ZK_ALIAS:", 11) == 0) {
          title = line + 11;
          while (title[0] == ' ' || title[0] == '\t') {
            title++;
          }
          printf("%s\t%s\n", path, title);
          found_title = 1;
        }

        // Stop after the first MAX_LINES lines
        if (++n_lines >= MAX_LINES) {
          break;
        }
      }

      // Fall back to the path if the file has neither title nor alias
      if (!found_title) {
        printf("%s\t%s\n", path, path);
      }

      free(line);
      fclose(fp);
    }

    void process_dir(const char* path) {
      DIR *dp = opendir(path);
      struct dirent *ep;

      if (dp == NULL) {
        perror("Couldn't open the directory");
        return;
      }

      while ((ep = readdir(dp))) {
        // Skip hidden entries as well as "." and ".."
        if (ep->d_name[0] == '.') {
          continue;
        }

        char full_name[PATH_MAX];
        if (ep->d_type == DT_REG) {
          if (is_org_file(ep->d_name)) {
            snprintf(full_name, sizeof full_name, "%s%s", path, ep->d_name);
            process_file(full_name);
          }
        } else if (ep->d_type == DT_DIR) {
          snprintf(full_name, sizeof full_name, "%s%s/", path, ep->d_name);
          process_dir(full_name);
        }
      }

      closedir(dp);
    }

    int main(void) {
      process_dir("/home/leon/org/");
      return 0;
    }
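The two keyword branches above repeat the same prefix check and whitespace trimming. As a rough sketch, that logic could be pulled into a single helper. `keyword_value` is a hypothetical name and is not part of the original program; it also trims all trailing whitespace, which goes slightly beyond what the code above does.

```c
#include <stddef.h>
#include <string.h>

/* If `line` starts with `keyword`, return a pointer to the value that
 * follows it. Leading spaces and tabs are skipped, and trailing
 * whitespace (including the newline) is cut off in place.
 * Returns NULL if the line does not start with `keyword`.
 * Hypothetical helper, not taken from the original program. */
char *keyword_value(char *line, const char *keyword) {
    size_t klen = strlen(keyword);

    /* Plain prefix comparison, no regex involved. */
    if (strncmp(line, keyword, klen) != 0) {
        return NULL;
    }

    /* Skip blanks between the keyword and its value. */
    char *value = line + klen;
    while (*value == ' ' || *value == '\t') {
        value++;
    }

    /* Trim trailing whitespace by moving the terminator left. */
    char *end = value + strlen(value);
    while (end > value && (end[-1] == '\n' || end[-1] == '\r' ||
                           end[-1] == ' ' || end[-1] == '\t')) {
        end--;
    }
    *end = '\0';

    return value;
}
```

With this, each branch in `process_file` would shrink to a call such as `keyword_value(line, "#+TITLE:")` followed by a NULL check.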
Getting the Data Into Emacs
The next task is to get the results into Emacs as fast as possible.
shell-command-to-string
Just getting a string with the outputs of the programs is pretty fast.
Variant | Time (average) |
---|---|
AWK | 112ms |
C | 41ms |
TSV Parsing
    (defun org-zk-parse-titles (titles)
      (mapcar
       (lambda (line) (split-string line "\t"))
       (split-string titles "\n")))
To my surprise, splitting the output into lines and lines into fields adds a few milliseconds.
Variant | Time (average) |
---|---|
AWK | 112ms |
C | 46ms |
read
split-string is implemented in Emacs Lisp.
To improve performance, we can generate valid S-expressions
in the AWK and C programs and use read to parse the output.
This is also useful when the data format is more complex than a two-column TSV file.
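As a minimal sketch of the C side of this idea, the `printf` calls that emit tab-separated lines could be replaced by a function that writes each entry as a Lisp list of two strings. `print_lisp_string` and `print_entry` are hypothetical names, and the `(path title)` layout is an assumption, not the format the original programs use. The only characters that have to be escaped for the Emacs Lisp reader to return the original text are the double quote and the backslash.

```c
#include <stdio.h>

/* Write `s` to `out` as an Emacs Lisp string literal.
 * Double quotes and backslashes are prefixed with a backslash;
 * every other byte is copied unchanged. */
void print_lisp_string(FILE *out, const char *s) {
    fputc('"', out);
    for (; *s != '\0'; s++) {
        if (*s == '"' || *s == '\\') {
            fputc('\\', out);
        }
        fputc(*s, out);
    }
    fputc('"', out);
}

/* Write one entry as a two-element list: ("path" "title").
 * The list layout is an assumption made for this sketch. */
void print_entry(FILE *out, const char *path, const char *title) {
    fputc('(', out);
    print_lisp_string(out, path);
    fputc(' ', out);
    print_lisp_string(out, title);
    fputs(")\n", out);
}
```

If the program prints an opening parenthesis before the first entry and a closing one after the last, the whole output forms a single list, so one call to read on the Emacs side should be enough to get a list of (path title) pairs.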