ENH: Construct dataframe from shell command #16846

Closed
timothymillar opened this issue Jul 7, 2017 · 3 comments
Labels
IO Data IO issues that don't fit into a more specific label

Comments

@timothymillar

Problem description

R's data.table library has a function fread that can take a shell command as input to construct a dataframe.

This is highly useful for reading data formats that require a small amount of wrangling (without creating additional files). There are many examples of these formats in bioinformatics (sam, bam, gff, etc.)

As far as I can tell, there is no straightforward equivalent in pandas.

Is there interest in a pull request for an additional function to add this functionality?

Example solution

import io
import pandas
import subprocess

def read_shell(command, shell=False, **kwargs):
    """
    Takes a shell command as a string and reads the result into a pandas DataFrame.
    
    Additional keyword arguments are passed through to pandas.read_csv.
    
    :param command: a shell command that returns tabular data
    :type command: str
    :param shell: passed to subprocess.Popen
    :type shell: bool
    
    :return: a pandas dataframe
    :rtype: :class:`pandas.DataFrame`
    """
    proc = subprocess.Popen(command, 
                            shell=shell,
                            stdout=subprocess.PIPE, 
                            stderr=subprocess.PIPE)
    output, error = proc.communicate()
    
    if proc.returncode == 0:
        with io.StringIO(output.decode()) as buffer:
            return pandas.read_csv(buffer, **kwargs)
    else:
        message = ("Shell command returned non-zero exit status: {0}\n\n"
                   "Command was:\n{1}\n\n"
                   "Standard error was:\n{2}")
        raise IOError(message.format(proc.returncode, command, error.decode()))

Expected usage

command = "samtools view example.bam | head | cut -f 1,2,3,4,5,6,7 -d '\t'"

read_shell(command, shell=True, sep='\t', header=None)  # note options passed to pandas.read_csv
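
A variation on the same idea, for cases where `shell=True` is undesirable: the sketch below runs a command given as an argument list and returns its output as an in-memory text buffer. The helper name `command_to_buffer` is hypothetical and not part of pandas, and the demo command simply uses the current Python interpreter to print two tab-separated rows.

```python
import io
import subprocess
import sys


def command_to_buffer(command, shell=False):
    """Run a command and return its standard output as a text buffer.

    Hypothetical helper, not part of pandas. Raises
    subprocess.CalledProcessError on a non-zero exit status.
    """
    result = subprocess.run(
        command,
        shell=shell,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode stdout/stderr as text
        check=True,               # raise if the command fails
    )
    return io.StringIO(result.stdout)


# Passing the command as a list avoids invoking a shell at all.
buffer = command_to_buffer(
    [sys.executable, "-c", "print('a\\tb'); print('1\\t2')"]
)
first_line = buffer.readline()
```

The returned buffer is a file-like object, so it can be handed to `pandas.read_csv` in the same way `read_shell` does above, for example with `sep='\t'` and `header=None`.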
@TomAugspurger
Contributor

This seems a bit out of scope for pandas to me, but I'll let others chime in.

We would certainly welcome a cookbook example, if it is indeed out of scope.

@TomAugspurger added the "IO Data" label (IO issues that don't fit into a more specific label) on Jul 12, 2017
@jbrockmendel
Member

Closing and adding to tracker issue #30407 for IO format requests; can re-open if interest is expressed.

@alexlenail
Contributor

@timothymillar's code above solved my issue. At first I agreed it may be out of scope for pandas, but having used it with a variety of different CLIs dozens of times over the past few weeks, I'd vote to add it to pandas.
