smart_open: a Python library for efficient streaming access to (very) large files
jopen
10 years ago
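A Python library for efficiently streaming (very) large files. It supports compressed and uncompressed files, both in the cloud and on local disk: S3, HDFS, gzip, bz2...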
>>> import smart_open

>>> # stream lines from an S3 object
>>> for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
...     print line

>>> # can use context managers too:
>>> with smart_open.smart_open('s3://mybucket/mykey.txt') as fin:
...     for line in fin:
...         print line
...     fin.seek(0)  # seek to the beginning
...     print fin.read(1000)  # read 1000 bytes

>>> # stream from HDFS
>>> for line in smart_open.smart_open('hdfs://user/hadoop/my_file.txt'):
...     print line

>>> # stream content *into* S3 (write mode):
>>> with smart_open.smart_open('s3://mybucket/mykey.txt', 'wb') as fout:
...     for line in ['first line', 'second line', 'third line']:
...         fout.write(line + '\n')

>>> # stream from/to local compressed files:
>>> for line in smart_open.smart_open('./foo.txt.gz'):
...     print line

>>> with smart_open.smart_open('/home/radim/foo.txt.bz2', 'wb') as fout:
...     fout.write("some content\n")
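To make the "local compressed files" case concrete, here is a minimal Python 3 sketch of the general idea: choose a decompressor from the file extension and read line by line. It uses only the standard library and is not smart_open's actual implementation. The names `open_by_extension` and `demo` are hypothetical helpers made up for this illustration.

```python
import bz2
import gzip
import os
import tempfile

# Hypothetical mapping (not part of smart_open): extension -> stdlib opener.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_by_extension(path, mode="rt", encoding="utf-8"):
    """Open a local file, (de)compressing on the fly based on its extension."""
    ext = os.path.splitext(path)[1]
    opener = _OPENERS.get(ext, open)  # fall back to the built-in open()
    if "b" in mode:
        # Binary modes must not receive an encoding argument.
        return opener(path, mode)
    return opener(path, mode, encoding=encoding)


def demo(suffix=".gz"):
    """Write three lines to a temporary file, then stream them back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "foo.txt" + suffix)
        with open_by_extension(path, "wt") as fout:
            for line in ["first line", "second line", "third line"]:
                fout.write(line + "\n")
        # Iterating the file object yields one line at a time,
        # so the whole file is never held in memory at once.
        with open_by_extension(path) as fin:
            return [line.rstrip("\n") for line in fin]
```

Calling `demo(".bz2")` or `demo("")` exercises the bz2 and uncompressed paths with the same calling code, which is the convenience the examples above rely on.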