smart_open: a Python library for efficient streaming of (very) large files

jopen · 10 years ago

A Python library for efficiently streaming (very) large files, compressed or uncompressed, in the cloud or on local disk: S3, HDFS, gzip, bz2, and more.

```python
>>> import smart_open

>>> # stream lines from an S3 object
>>> for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
...     print(line)

>>> # can use context managers too:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print(line)
...     fin.seek(0)  # seek to the beginning
...     print(fin.read(1000))  # read 1000 bytes

>>> # stream from HDFS
>>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
...     print(line)

>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print(line)

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
```
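For local compressed files, the behavior smart_open provides can be sketched with the standard library alone: open the file through a streaming decompressor and iterate line by line, so the whole file never has to fit in memory. This is a minimal stdlib-only sketch of that idea (the temporary file path is an assumption for illustration), not smart_open's actual implementation.

```python
import gzip
import os
import tempfile

# Hypothetical local path for the demo; smart_open would accept it directly.
path = os.path.join(tempfile.mkdtemp(), "foo.txt.gz")

# Write a few lines into a gzip-compressed file (text mode handles encoding).
with gzip.open(path, "wt") as fout:
    for line in ["first line", "second line", "third line"]:
        fout.write(line + "\n")

# Stream the lines back lazily: gzip.open returns a file-like object,
# so iterating over it decompresses one chunk at a time.
lines = []
with gzip.open(path, "rt") as fin:
    for line in fin:
        lines.append(line.rstrip("\n"))

print(lines)
```

The same iterate-over-a-file-like-object pattern is what makes the S3 and HDFS examples above memory-efficient: only the chunk currently being read is held in memory.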

Project page: http://www.open-open.com/lib/view/home/1422349535814