Tuesday 28 May 2013

Better than system("touch")

I've seen a lot of people use
system("touch");
to make sure a file exists, and/or has a recent access/modification time. For example, see here, here and here.
I'm here to tell you: system() sucks! Why? Take a look at man system:
       int system(const char *command);

DESCRIPTION
       system() executes a command specified in command by calling /bin/sh
       -c command, and returns after the command has been completed.
So it's not just running the touch program. It's starting a shell, then running whatever you passed in that shell. This is:
  1. Slow. First you start a shell, then you start another program from that shell? Seems like a lot of hassle.
  2. A security risk. Say you take the filename from the user, then run something like:
    std::stringstream ss;
    ss << "touch " << filename;
    system(ss.str().c_str());
    
    What happens if I (the malicious user) give input like "fakename ; rm -rf --no-preserve-root /;"? Well it creates(/updates the timestamp of) fakename, then tries to delete everything!
  3. Very platform dependent. The POSIX Standard has this to say:
    [T]he system() function shall pass the string pointed to by command to that command processor to be executed in an implementation-defined manner; this might then cause the program calling system() to behave in a non-conforming manner or to terminate. And that's just system. The utility you are calling may vary significantly. Alright, touch probably won't, but I've seen people use system with, for instance, ls, whose output will vary significantly in format across platforms.
So what should we do instead? Well obviously someone wrote touch, so we should be able to replicate it's behaviour from our own program. The logic surrounding parsing arguments and so on is something we should be pretty familiar with. What we need to know is how touch actually creates and updates a file. It needs to make calls out to the operating system ("system calls"). There is a handy command line tool to see what system calls are being made by a program, called strace (on some systems, truss. I don't know a full list of which to use where, but I do know it's strace on Linux and AIX, truss on Solaris and FreeBSD).
I ran strace touch twice, once to create a file, then once to update it. It was basically the same each time, so I'll just show one. You get a lot of cruft just from a program starting up, obtaining heap memory, etc, but I cut it down to just the relevant bits:
$ strace touch testfile
...
open("testfile", O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK, 0666) = 3
dup2(3, 0)                              = 0
close(3)                                = 0
dup2(0, 0)                              = 0
utimensat(0, NULL, NULL, 0)             = 0
...
The two we care about are open and utimensat. Respectively, these open a file, creating it if necessary (O_CREAT), and update the timestamp. open takes:
  1. const char* pathname
    The path (absolute or relative) of the file (or directory) to be opened.
  2. int flags
    A bitmask of flags, ord together, indicating how to open the path, e.g. O_CREAT to create the file if it doesn't already exist.
  3. mode_t mode
    Only required if O_CREAT is provided, this argument provides the permissions with which to create the file. This will be filtered against your umask: mode^umask.
utimensat takes:
  1. int dirfd
    An open file descripter to a directory from which to interpret a relative path. We will use the special value AT_FDCWD, which just means we interpret relative paths from the working directory of the program.
  2. const char* pathname
    As above.
  3. const struct timespec times[2]
    Two sets of values defining the times to be set. By passing a null pointer for this array, we just get the current time.
  4. int flags
    Another bitmask specifying details of how the call will be carried out. Nothing relevant to us.
So we put these together in a c++ way, and get something like:
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <utime.h>

#include <iostream>
#include <string>

#include <cstdlib>

void touch(const std::string& pathname)
{
    int fd = open(pathname.c_str(),
                  O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK,
                  0666);
    if (fd<0) // Couldn't open that path.
    {
        std::cerr
            << __PRETTY_FUNCTION__
            << ": Couldn't open() path \""
            << pathname
            << "\"\n";
        return;
    }
    int rc = utimensat(AT_FDCWD,
                       pathname.c_str(),
                       nullptr,
                       0);
    if (rc)
    {
        std::cerr
            << __PRETTY_FUNCTION__
            << ": Couldn't utimensat() path \""
            << pathname
            << "\"\n";
        return;
    }
    std::clog
        << __PRETTY_FUNCTION__
        << ": Completed touch() on path \""
        << pathname
        << "\"\n";
}

int main(int argc, char* argv[])
{
    if (argc!=2)
        return EXIT_FAILURE;
    touch (argv[1]);
    return EXIT_SUCCESS;
}
Of course, it would be very easy to rewrite this function in c. Also, if you only want to make sure the file exists, and don't care about the timestamps, you could just create a std::ofstream (remembering to pass app and check is_open()).

Reading Input with std::getline

A lot of beginners seem to have trouble with more complicated input, especially with reading in a loop until the end of a file (or end of input through std::cin). I thought about trying to go through all the various things I've seen people do wrong, but that could get pretty messy. So instead I thought I'd just show some examples building up to doing it right (or at least in a way that works, and that I think is fairly nice. "Right" is open to a lot of interpretation).
Let's start simple. Read a user's name, and greet them:
#include <iostream>
#include <string>

std::string readName()
{
    std::cout << "What's your name? ";
    std::string name;
    std::cin >> name;
    return name;
}

void greet (const std::string& name)
{
    std::cout << "Hello, " << name << '\n';
}

int main()
{
    greet(readName());
}
(By the way, if you didn't already know, it is safe to bind a const T& to a temporary, but not a T&). We run this:
$ ./hello1 
What's your name? Chris Sharpe
Hello, Chris
Pretty close, but reading with std::cin >> someString stops at the first whitespace. Instead we are going to use std::getline. The only change is in readName():
std::string readName()
{
    std::cout << "What's your name? ";
    std::string name;
    std::getline(std::cin, name); // Change here.
    return name;
}
Then:
$ ./hello2 
What's your name? Chris Sharpe
Hello, Chris Sharpe
That's what we want! What about a more complicated, and more useful example. Reading a configuration file, where each option might have a different type of value, for instance a filename (std::string) or switch for some program behavior (bool). It is convenient to use the formatted input we get with >>, but we want to actually read from the input file with getline.
The important functions are parseConfigFile() and parseConfigLine().
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>

// Class to hold the program's configuration options.
class Config
{
    private:
        // Member data
        std::string someFilePath_;
        int         someSize_;
        bool        someSwitch_;

        // Static data: names of options
        static const std::string SOME_FILE_PATH;
        static const std::string SOME_SIZE;
        static const std::string SOME_SWITCH;

        // Private functions
        void parseConfigFile(const std::string&);
        void parseConfigLine(const std::string&);

    public:
        // Constructors
        Config();
        Config(const std::string&);

        // Accessors
        const std::string& someFilePath() const;
        int                someSize()     const;
        bool               someSwitch()   const;

        std::string dumpConfigAsString()  const;
};


int main(int argc, char* argv[])
{
    if (argc>1)
    {
        Config config{argv[1]};
        std::cout << config.dumpConfigAsString();
    }
    else
    {
        Config defaultConfig{};
        std::cout << defaultConfig.dumpConfigAsString();
    }
    return 0;
}


// class Config

const std::string Config::SOME_FILE_PATH {"some_file_path"};
const std::string Config::SOME_SIZE {"some_size"};
const std::string Config::SOME_SWITCH {"some_switch"};

Config::Config()
    // Set the defaults
    : someFilePath_{"default.png"}
    , someSize_{2}
    , someSwitch_{true}
{}

Config::Config(const std::string& configFilePath)
    : Config{} // Set the defaults
{
    parseConfigFile(configFilePath);
}


void Config::parseConfigFile(const std::string& configFilePath)
{
    std::ifstream inpFile{configFilePath};
    if (!inpFile.is_open())
    {
        std::cerr
            << "Could not open configuration file \""
            << configFilePath
            << "\"\n" // Note the string concatenation
               "Using default configuration options.\n";
        return;
    }

    std::string configLine;
    // Read a line at a time.
    // Doing this inside the loop condition means
    // we end correctly at the bottom of the file.
    while ( std::getline(inpFile, configLine) )
    {
        parseConfigLine(configLine);
    }
}

void Config::parseConfigLine(const std::string& configLine)
{
    // Ignore comment or empty lines
    if ('#' == configLine[0] || configLine.empty())
        return;

    // Split the line using >> operations
    std::istringstream iss {configLine};

    std::string configOption;
    iss >> configOption;

    // Compare against the known configurable options
    if ( SOME_FILE_PATH == configOption )
    {
        std::string tmpString;
        if (iss >> tmpString)
        {
            someFilePath_ = std::move(tmpString);
        }
        else // The read to std::string failed
        {
            std::cerr
                << "Failed to read configuration option \""
                << SOME_FILE_PATH
                << "\" as a string.\n";
        }
    }
    else if ( SOME_SIZE == configOption )
    {
        int tmpInt;
        if (iss >> tmpInt)
        {
            someSize_ = tmpInt;
        }
        else // The read to int failed
        {
            std::cerr
                << "Failed to read configuration option \""
                << SOME_SIZE
                << "\" as an integer.\n";
        }
    }
    else if ( SOME_SWITCH == configOption )
    {
        bool tmpBool;
        if (iss >> std::boolalpha >> tmpBool)
        {
            someSwitch_ = tmpBool;
        }
        else // The read to bool failed
        {
            std::cerr
                << "Failed to read configuration option \""
                << SOME_SWITCH
                << "\" as a boolean switch.\n";
        }
    }
    else
    {
        std::cerr
            << "Unrecognised configuration option \""
            << configOption
            << "\"\n";
    }
}

const std::string& Config::someFilePath() const
{
    return someFilePath_;
}

int                Config::someSize()     const
{
    return someSize_;
}

bool               Config::someSwitch()   const
{
    return someSwitch_;
}

std::string Config::dumpConfigAsString()  const
{
    std::ostringstream oss;
    oss
        << SOME_FILE_PATH << ' ' << someFilePath() << '\n'
        << SOME_SIZE      << ' ' << someSize()     << '\n'
        << SOME_SWITCH    << ' ' << std::boolalpha
                                 << someSwitch()   << '\n';
    return oss.str();
}
This is fairly long, just to make it a realistic example, but there are two absolutely key points:
  1. Don't mix use of operator>> and getline on the same stream. operator>> will leave a newline character on the stream. Any following use of getline would immediately hit that and return an empty line. This can really confuse people, especially when the reads happen far apart in code, so there is no obvious connection.
  2. Have the read as the loop condition when reading a whole file. This is partly because you want to check immediately after the read, and partly because you might otherwise be tempted to check stream.eof(), which will only work if you have just tried to read past the end of the file. Not if that last read used up all the characters, and you'll get completely stuck if the last few characters can't be read in the way you want (i.e. you are using formatted input with operator>>). Consider this badly broken example:
    int main()
    {
        int tmpInt;
        while (!std::cin.eof())
        {
            std::cout << "Number: ";
            std::cin >> tmpInt;
            std::cout << "Read " << tmpInt << '\n';
        }
    }
    
    A run of this program might go like this:
    $ ./a.out 
    Number: 123
    Read 123
    Number: abc
    Read 0
    Number: Read 0
    Number: Read 0
    Number: Read 0
    ...
    
    You can see it gets stuck in the loop because it can't actually read any further, but we technically haven't read past the end of the file.
If you've been paying attention, you'll have noticed this is broken if the file name has a space in it. How could you fix that?
As an aside, you might have heard of the GNU Readline Library which allows entry history, line editing, and other cool stuff. Recently I was working with a little test application that would be much easier to use with these features, but I couldn't make the changes to use Readline, so I started writing a little wrapper shell script. After wasting a fair amount of effort for something that worked ok, I found rlwrap. Guess I should have searched harder in the first place. There's always someone who's already solved your problem.