Connectors allow reading and writing data to and from external sources. This concept is similar to Presto Connectors. The TableScanNode operator reads external data via a connector. The TableWriteNode operator writes data externally via a connector. The various connector interfaces in Velox are described below.
A chunk of data to process. For example, a single file.
Provides methods to consume and process a split. A DataSource can optionally consume a dynamic filter during execution to prune some rows from the output vector.
Provides methods to write a Velox vector externally.
Allows creating instances of a DataSource or a DataSink.
Enables creating instances of a particular connector.
Velox provides Hive and TPC-H Connectors out of the box. Let’s see how the above connector interfaces are implemented in the Hive Connector in detail below.
The Hive Connector is used to read and write data files (Parquet, DWRF) residing on an external storage (S3, HDFS, GCS, Linux FS).
The HiveConnectorSplit describes a data chunk using parameters including file-path, file-format, start, length, storage format, etc.. It is not necessary to specify start and length values that align with row boundaries. For example, in a Parquet file, those row groups with offset in the range of [start, length) are processed as part of the split. For a given a set of files, users or applications are responsible for defining the splits.
The HiveDataSource implements the addSplit API that consumes a HiveConnectorSplit. It creates a file reader based on the file format, offset, and length. The supported file formats are DWRF and Parquet. The next API processes the split and returns a batch of rows. Users can continue to call next until all the rows in the split are fully read. HiveDataSource allows adding a dynamic filter using the addDynamicFilter API. This allows supporting Dynamic Filter Pushdown.
The HiveDataSink writes vectors to files on disk. The supported file formats are DWRF and Parquet. The parameters to HiveDataSink also include column names, sorting, partitioning, and bucketing information. The appendData API instantiates a file writer based on the above parameters and writes a vector to disk.
The HiveConnector implements the createDataSource connector API to create instances of HiveDataSource. It also implements the createDataSink connector API to create instances of HiveDataSink. One of the parameters to these APIs is ConnectorQueryCtx, which provides means to specify a memory pool and connector configuration.
The HiveConnectorFactory enabled creating instances of the HiveConnector. A connector name say “hive” is required to register the HiveConnectorFactory. Multiple instances of the HiveConnector can then be created by using the newConnector API by specifying a connectorId and connector configuration listed here. Multiple instances of a connector are required if you have multiple external sources and each require a different configuration.
Hive Connector allows reading and writing files from a variety of distributed storage systems. The supported storage API are S3, HDFS, GCS, Linux FS.
If file is not found when reading, openFileForRead API throws VeloxRuntimeError with error_code::kFileNotFound. This behavior is necessary to support the ignore_missing_files configuration property.
S3 is supported using the AWS SDK for C++ library. S3 supported schemes are s3:// (Amazon S3, Minio), s3a:// (Hadoop 3.x), s3n:// (Deprecated in Hadoop 3.x), oss:// (Alibaba cloud storage), and cos://, cosn:// (Tencent cloud storage).
HDFS is supported using the Apache Hawk libhdfs3 library. HDFS supported schemes are hdfs://.
GCS is supported using the Google Cloud Platform C++ Client Libraries. GCS supported schemes are gs://.
ABS (Azure Blob Storage) is supported using the Azure SDK for C++ library. ABS supported schemes are abfs(s)://.
S3 Storage adapter using a proxy¶
By default, the C++ AWS S3 client does not honor the configuration of the environment variables http_proxy, https_proxy, and no_proxy. The Java AWS S3 client supports this. The environment variables can be specified as lower case, upper case or both. In order to enable the use of a proxy the hive connector configuration variable hive.s3.use-proxy-from-env must be set to true. By default, the value is false.
This is the behavior when the proxy settings are enabled:
http_proxy/HTTP_PROXY, https_proxy/HTTPS_PROXY and no_proxy/NO_PROXY environment variables are read. If lower case and upper case variables are set lower case variables take precendence.
The no_proxy/NO_PROXY content is scanned for exact and suffix matches.
CIDR expressions provided in no_proxy/NO_PROXY are not supported for matching.
IP addresses or domains can be specified.
The no_proxy/NO_PROXY list is comma separated.
Use . or *. to indicate domain suffix matching, e.g. .foobar.com will match test.foobar.com or foo.foobar.com.