Connectors¶
Connectors allow reading and writing data to and from external sources. This concept is similar to Presto Connectors. The TableScanNode operator reads external data via a connector. The TableWriteNode operator writes data externally via a connector. The various connector interfaces in Velox are described below.
Connector Interface¶
Interface Name | Description
---|---
ConnectorSplit | A chunk of data to process. For example, a single file.
DataSource | Provides methods to consume and process a split. A DataSource can optionally consume a dynamic filter during execution to prune some rows from the output vector.
DataSink | Provides methods to write a Velox vector externally.
Connector | Allows creating instances of a DataSource or a DataSink.
ConnectorFactory | Enables creating instances of a particular connector.
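To make the relationships between these interfaces concrete, here is a simplified C++ sketch. The class and method shapes mirror the table above but are illustrative only; they are not the exact Velox declarations.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <string>

// Illustrative stand-ins for Velox types.
struct RowBatch {};         // plays the role of a Velox RowVector
struct ConnectorConfig {};  // plays the role of connector configuration

// A chunk of data to process, e.g. a single file (or a byte range of one).
struct ConnectorSplit {
  virtual ~ConnectorSplit() = default;
};

// Consumes splits and produces output vectors; may also accept dynamic
// filters to prune rows during execution.
struct DataSource {
  virtual void addSplit(std::shared_ptr<ConnectorSplit> split) = 0;
  virtual std::optional<RowBatch> next(uint64_t maxRows) = 0;
  virtual ~DataSource() = default;
};

// Writes Velox vectors to an external system.
struct DataSink {
  virtual void appendData(const RowBatch& data) = 0;
  virtual ~DataSink() = default;
};

// Creates DataSource / DataSink instances for one external system.
struct Connector {
  virtual std::unique_ptr<DataSource> createDataSource() = 0;
  virtual std::unique_ptr<DataSink> createDataSink() = 0;
  virtual ~Connector() = default;
};

// Creates instances of a particular Connector, identified by a connector id.
struct ConnectorFactory {
  virtual std::shared_ptr<Connector> newConnector(
      const std::string& connectorId,
      std::shared_ptr<const ConnectorConfig> config) = 0;
  virtual ~ConnectorFactory() = default;
};
```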
Velox provides Hive and TPC-H Connectors out of the box. The sections below describe in detail how the above connector interfaces are implemented in the Hive Connector.
Hive Connector¶
The Hive Connector is used to read and write data files (Parquet, DWRF) residing on external storage (S3, HDFS, GCS, Linux FS).
HiveConnectorSplit¶
The HiveConnectorSplit describes a data chunk using parameters including file path, file format, start, length, storage format, etc. It is not necessary for the start and length values to align with row boundaries. For example, in a Parquet file, the row groups with offsets in the range [start, start + length) are processed as part of the split. For a given set of files, users or applications are responsible for defining the splits.
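As an illustration, here is a minimal sketch of building two splits that cover the two halves of a single Parquet file. It assumes a HiveConnectorSplit constructor taking roughly (connectorId, filePath, fileFormat, start, length); the actual header path and signature may differ between Velox versions.

```cpp
#include <memory>
#include <string>
#include <vector>

#include "velox/connectors/hive/HiveConnectorSplit.h"

using namespace facebook::velox;

// Split one Parquet file into two byte ranges. Row groups whose starting
// offset falls in [start, start + length) are read by the owning split, so
// the boundaries do not need to line up with rows or row groups.
std::vector<std::shared_ptr<connector::hive::HiveConnectorSplit>> makeSplits(
    const std::string& filePath,
    uint64_t fileSize) {
  const uint64_t half = fileSize / 2;
  return {
      std::make_shared<connector::hive::HiveConnectorSplit>(
          /*connectorId=*/"hive",
          filePath,
          dwio::common::FileFormat::PARQUET,
          /*start=*/0,
          /*length=*/half),
      std::make_shared<connector::hive::HiveConnectorSplit>(
          /*connectorId=*/"hive",
          filePath,
          dwio::common::FileFormat::PARQUET,
          /*start=*/half,
          /*length=*/fileSize - half)};
}
```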
HiveDataSource¶
The HiveDataSource implements the addSplit API, which consumes a HiveConnectorSplit and creates a file reader based on the file format, offset, and length. The supported file formats are DWRF and Parquet. The next API processes the split and returns a batch of rows; users keep calling next until all the rows in the split have been read. HiveDataSource also allows adding a dynamic filter via the addDynamicFilter API, which enables Dynamic Filter Pushdown.
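A minimal sketch of the consume loop is shown below. It assumes a next(batchSize, future) call that returns an optional batch, with a null batch indicating the split has been fully read; exact signatures and return semantics vary across Velox versions.

```cpp
#include <functional>
#include <memory>

#include "velox/connectors/Connector.h"

using namespace facebook::velox;

// Drain one split through a DataSource, handing each batch to `consume`.
void readSplit(
    connector::DataSource& dataSource,
    std::shared_ptr<connector::ConnectorSplit> split,
    const std::function<void(const RowVectorPtr&)>& consume) {
  dataSource.addSplit(std::move(split));
  ContinueFuture future = ContinueFuture::makeEmpty();
  // Keep calling next() until the source signals that the split is done.
  while (auto batch = dataSource.next(/*size=*/1024, future)) {
    if (*batch == nullptr) {
      break;  // All rows in the split have been read.
    }
    consume(*batch);
  }
}
```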
HiveDataSink¶
The HiveDataSink writes vectors to files on disk. The supported file formats are DWRF and Parquet. The parameters to HiveDataSink include column names, sorting, partitioning, and bucketing information. The appendData API instantiates a file writer based on the above parameters and writes a vector to disk.
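Below is a minimal write-side sketch. appendData comes from the text; the finalization step (close/finish and what it returns) differs across Velox versions, so treat that call as an assumption.

```cpp
#include <vector>

#include "velox/connectors/Connector.h"

using namespace facebook::velox;

// Append a sequence of vectors to a DataSink, then finalize the files.
void writeBatches(
    connector::DataSink& sink,
    const std::vector<RowVectorPtr>& batches) {
  for (const auto& batch : batches) {
    sink.appendData(batch);  // Write one RowVector via the underlying file writer.
  }
  sink.close();  // Flush and commit the written files (assumed finalization API).
}
```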
HiveConnector¶
The HiveConnector implements the createDataSource connector API to create instances of HiveDataSource. It also implements the createDataSink connector API to create instances of HiveDataSink. One of the parameters to these APIs is a ConnectorQueryCtx, which supplies the memory pool and connector configuration.
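The sketch below shows the shape of these calls. The handle arguments (table handle, column handles, insert table handle) are placeholders for the Hive-specific handle types, and the exact parameter lists vary by Velox version.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

#include "velox/connectors/Connector.h"

using namespace facebook::velox;

// Obtain a DataSource and a DataSink from an already registered connector.
void createSourceAndSink(
    const std::shared_ptr<connector::Connector>& hiveConnector,
    const RowTypePtr& rowType,
    const std::shared_ptr<connector::ConnectorTableHandle>& tableHandle,
    const std::unordered_map<
        std::string,
        std::shared_ptr<connector::ColumnHandle>>& columnHandles,
    const std::shared_ptr<connector::ConnectorInsertTableHandle>& insertHandle,
    connector::ConnectorQueryCtx* queryCtx) {
  // ConnectorQueryCtx carries the memory pool and connector configuration
  // used by the created DataSource/DataSink.
  auto dataSource = hiveConnector->createDataSource(
      rowType, tableHandle, columnHandles, queryCtx);

  auto dataSink = hiveConnector->createDataSink(
      rowType, insertHandle, queryCtx, connector::CommitStrategy::kNoCommit);

  // Feed splits to dataSource and vectors to dataSink as in the earlier sketches.
}
```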
HiveConnectorFactory¶
The HiveConnectorFactory enables creating instances of the HiveConnector. A connector name, say “hive”, is required to register the HiveConnectorFactory. Multiple instances of the HiveConnector can then be created using the newConnector API by specifying a connectorId and the connector configuration listed here. Multiple instances of a connector are required if you have multiple external sources and each requires a different configuration.
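The registration flow typically looks like the sketch below. kHiveConnectorId and the empty configuration map are example values, and the configuration type accepted by newConnector has changed across Velox versions.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

#include "velox/common/config/Config.h"
#include "velox/connectors/hive/HiveConnector.h"

using namespace facebook::velox;

// Example connector id; any unique string works.
static const std::string kHiveConnectorId = "my-hive";

void registerHiveConnector() {
  // Register the factory under the "hive" connector name.
  connector::registerConnectorFactory(
      std::make_shared<connector::hive::HiveConnectorFactory>());

  // Create a connector instance with an (empty) configuration and register
  // it so that operators can look it up by kHiveConnectorId.
  auto hiveConnector =
      connector::getConnectorFactory(
          connector::hive::HiveConnectorFactory::kHiveConnectorName)
          ->newConnector(
              kHiveConnectorId,
              std::make_shared<config::ConfigBase>(
                  std::unordered_map<std::string, std::string>{}));
  connector::registerConnector(hiveConnector);
}
```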
Storage Adapters¶
The Hive Connector allows reading and writing files from a variety of distributed storage systems. The supported storage systems are S3, HDFS, GCS, and Linux FS.
S3 is supported using the AWS SDK for C++ library. The supported S3 schemes are s3:// (Amazon S3, Minio), s3a:// (Hadoop 3.x), s3n:// (deprecated in Hadoop 3.x), oss:// (Alibaba Cloud storage), and cos://, cosn:// (Tencent Cloud storage).
HDFS is supported using the Apache Hawq libhdfs3 library. The supported HDFS scheme is hdfs://.
GCS is supported using the Google Cloud Platform C++ Client Libraries. The supported GCS scheme is gs://.
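To use these adapters, the corresponding file systems need to be registered before reading or writing, roughly as sketched below. The registration functions are only available when Velox is built with the respective adapters enabled, and the header paths shown are indicative.

```cpp
#include "velox/common/file/FileSystems.h"
// Adapter-specific registration headers live under
// velox/connectors/hive/storage_adapters/; availability depends on the
// build flags (e.g. S3/HDFS/GCS support must be enabled).
#include "velox/connectors/hive/storage_adapters/s3fs/RegisterS3FileSystem.h"

void registerFileSystems() {
  // Local files (Linux FS) and S3-compatible stores (s3://, s3a://, oss://, ...).
  facebook::velox::filesystems::registerLocalFileSystem();
  facebook::velox::filesystems::registerS3FileSystem();
  // HDFS and GCS have analogous register*FileSystem() calls in their adapters.
}
```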