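As far as I understand, the standard and efficient way to do this is to use TensorFlow Transform. Using TF Transform does not mean you have to adopt the whole TFX pipeline; TF Transform can also be used standalone.
TensorFlow Transform creates a Beam transformation graph, and it injects the results of these transformations into the TensorFlow graph as constants. Because the transformations are represented as constants in the graph, they are identical in training and in serving. The advantages of this consistency between training and serving are:
- It eliminates training-serving skew.
- It removes the need to include preprocessing code in the serving system, which improves latency.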
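To make the "transformations as constants" idea concrete, here is a minimal plain-Python sketch. It is not TF Transform itself, and the names `analyze_min_max` and `make_scaler` are hypothetical, introduced only for illustration. An analysis pass over the training data computes the min and max once; the returned function closes over those values, so training and serving apply exactly the same scaling.

```python
def analyze_min_max(values):
    """Full pass over the training data (the 'analyze' phase)."""
    return min(values), max(values)


def make_scaler(lo, hi):
    """Return a transform with lo/hi baked in as constants (the 'transform' phase)."""
    span = hi - lo

    def scale(x):
        # Guard against a constant column, where span would be 0.
        return 0.0 if span == 0 else (x - lo) / span

    return scale


training_ages = [20.0, 30.0, 60.0]
scale_age = make_scaler(*analyze_min_max(training_ages))

# Training time: applied to the data the constants were computed from.
train_scaled = [scale_age(a) for a in training_ages]

# Serving time: the same function, with the same constants, on a new value.
served = scale_age(40.0)
```

Because the serving path reuses the constants computed at training time instead of recomputing statistics, there is no room for the two paths to drift apart.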
Sample code for TF Transform is shown below.
Code for importing all the dependencies:
    try:
        import tensorflow_transform as tft
        import apache_beam as beam
    except ImportError:
        print('Installing TensorFlow Transform. This will take a minute, ignore the warnings')
        !pip install -q tensorflow_transform
        print('Installing Apache Beam. This will take a minute, ignore the warnings')
        !pip install -q apache_beam
        import tensorflow_transform as tft
        import apache_beam as beam
    import tensorflow as tf
    import tensorflow_transform.beam as tft_beam
    from tensorflow_transform.tf_metadata import dataset_metadata
    from tensorflow_transform.tf_metadata import dataset_schema
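Below is the preprocessing function, where all the transformations are defined. As of now, TF Transform does not provide a direct API for missing-value imputation, so for that part alone we have to write our own code using the low-level APIs.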
    # NUMERIC_FEATURE_KEYS, OPTIONAL_NUMERIC_FEATURE_KEYS, CATEGORICAL_FEATURE_KEYS
    # and LABEL_KEY are assumed to be defined elsewhere, as in the census tutorial.
    def preprocessing_fn(inputs):
        """Preprocess input columns into transformed columns."""
        # Since we are modifying some features and leaving others unchanged, we
        # start by setting `outputs` to a copy of `inputs`.
        outputs = inputs.copy()

        # Scale numeric columns to have range [0, 1].
        for key in NUMERIC_FEATURE_KEYS:
            outputs[key] = tft.scale_to_0_1(outputs[key])

        for key in OPTIONAL_NUMERIC_FEATURE_KEYS:
            # This is a SparseTensor because it is optional. Here we fill in a default
            # value when it is missing.
            # Note: tf.sparse_to_dense is a TF 1.x API; it is not in the TF 2.x
            # top-level namespace.
            dense = tf.sparse_to_dense(outputs[key].indices,
                                       [outputs[key].dense_shape[0], 1],
                                       outputs[key].values, default_value=0.)
            # Reshaping from a batch of vectors of size 1 to a batch of scalars.
            dense = tf.squeeze(dense, axis=1)
            outputs[key] = tft.scale_to_0_1(dense)

        # For all categorical columns except the label column, we generate a
        # vocabulary but do not modify the feature. This vocabulary is instead
        # used in the trainer, by means of a feature column, to convert the feature
        # from a string to an integer id.
        for key in CATEGORICAL_FEATURE_KEYS:
            tft.vocabulary(inputs[key], vocab_filename=key)

        # For the label column we provide the mapping from string to index.
        # Note: tf.contrib exists only in TF 1.x; it was removed in TF 2.0.
        table = tf.contrib.lookup.index_table_from_tensor(['>50K', '<=50K'])
        outputs[LABEL_KEY] = table.lookup(outputs[LABEL_KEY])

        return outputs
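The missing-value handling in `preprocessing_fn` can be hard to picture from the tensor calls alone. The following plain-Python sketch mimics what the sparse-to-dense step computes; the helper `fill_missing` is hypothetical and is not part of TensorFlow or TF Transform. An optional feature arrives as (row index, value) pairs for the rows that actually have a value, and every other row in the batch receives the default.

```python
def fill_missing(indices, values, batch_size, default_value=0.0):
    """Densify a sparse column: rows absent from `indices` get `default_value`."""
    dense = [default_value] * batch_size
    for row, value in zip(indices, values):
        dense[row] = value
    return dense


# A batch of 4 examples in which only rows 0 and 2 carry the optional feature.
dense = fill_missing(indices=[0, 2], values=[5.0, 7.0], batch_size=4)
```

Filling with a fixed default of 0 is the simplest choice and is what the code above does. A mean-based fill would instead need a full-pass statistic such as `tft.mean`, which is an option to consider rather than something shown in the original code.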
For more details and a TF Transform tutorial, see the links below.
https://www.tensorflow.org/tfx/transform/get_started
https://www.tensorflow.org/tfx/tutorials/transform/census