Class HadoopDataInputStream

  • All Implemented Interfaces:
    Closeable, AutoCloseable, org.apache.flink.core.fs.ByteBufferReadable

    public final class HadoopDataInputStream
    extends org.apache.flink.core.fs.FSDataInputStream
    implements org.apache.flink.core.fs.ByteBufferReadable
    Concrete implementation of the FSDataInputStream for Hadoop's input streams. This supports all file systems supported by Hadoop, such as HDFS and S3 (S3a/S3n).
    • Field Detail

      • MIN_SKIP_BYTES

        public static final int MIN_SKIP_BYTES
        Minimum amount of bytes to skip forward before we issue a seek instead of discarding read.

        The current value is just a magic number. In the long run, this value could become configurable, but for now it is a conservative, relatively small value that should bring safe improvements for small skips (e.g. in reading meta data), that would hurt the most with frequent seeks.

        The optimal value depends on the DFS implementation and configuration plus the underlying filesystem. For now, this number is chosen "big enough" to provide improvements for smaller seeks, and "small enough" to avoid disadvantages over real seeks. While the minimum should be the page size, a true optimum per system would be the amounts of bytes the can be consumed sequentially within the seektime. Unfortunately, seektime is not constant and devices, OS, and DFS potentially also use read buffers and read-ahead.

        See Also:
        Constant Field Values
    • Constructor Detail

      • HadoopDataInputStream

        public HadoopDataInputStream​(org.apache.hadoop.fs.FSDataInputStream fsDataInputStream)
        Creates a new data input stream from the given Hadoop input stream.
        Parameters:
        fsDataInputStream - The Hadoop input stream
    • Method Detail

      • seek

        public void seek​(long seekPos)
                  throws IOException
        Specified by:
        seek in class org.apache.flink.core.fs.FSDataInputStream
        Throws:
        IOException
      • getPos

        public long getPos()
                    throws IOException
        Specified by:
        getPos in class org.apache.flink.core.fs.FSDataInputStream
        Throws:
        IOException
      • getHadoopInputStream

        public org.apache.hadoop.fs.FSDataInputStream getHadoopInputStream()
        Gets the wrapped Hadoop input stream.
        Returns:
        The wrapped Hadoop input stream.
      • forceSeek

        public void forceSeek​(long seekPos)
                       throws IOException
        Positions the stream to the given location. In contrast to seek(long), this method will always issue a "seek" command to the dfs and may not replace it by skip(long) for small seeks.

        Notice that the underlying DFS implementation can still decide to do skip instead of seek.

        Parameters:
        seekPos - the position to seek to.
        Throws:
        IOException
      • skipFully

        public void skipFully​(long bytes)
                       throws IOException
        Skips over a given amount of bytes in the stream.
        Parameters:
        bytes - the number of bytes to skip.
        Throws:
        IOException
      • read

        public int read​(ByteBuffer byteBuffer)
                 throws IOException
        Specified by:
        read in interface org.apache.flink.core.fs.ByteBufferReadable
        Throws:
        IOException
      • read

        public int read​(long position,
                        ByteBuffer byteBuffer)
                 throws IOException
        Specified by:
        read in interface org.apache.flink.core.fs.ByteBufferReadable
        Throws:
        IOException