Out of the box, Flink provides tooling to monitor a directory for new files and read them, via StreamExecutionEnvironment.getExecutionEnvironment().readFile(...)
(see the similar Stack Overflow threads for examples: How to read new added file in a directory in Flink / Monitoring directory for new files with Flink for data streams, etc.).
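For reference, a minimal sketch of that built-in variant (the directory path and scan interval below are made up for the example):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

TextInputFormat format = new TextInputFormat(new Path("/path/to/incoming"));

// re-scan the directory every 1000 ms and emit records from newly appearing files
DataStream<String> lines = env.readFile(
        format,
        "/path/to/incoming",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        1000L);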
Looking at the source code of readFile, it calls the createFileInput() method, which simply instantiates a ContinuousFileMonitoringFunction and a ContinuousFileReaderOperatorFactory and wires up the source:

addSource(monitoringFunction, sourceName, null, boundedness)
        .transform("Split Reader: " + sourceName, typeInfo, factory);
ContinuousFileMonitoringFunction is where most of the logic happens. So, if I were implementing your requirement, I would extend ContinuousFileMonitoringFunction with my own logic for moving processed files into a history folder, and construct the source from that function.
Given that the run method does its work while holding the checkpointLock:

synchronized (checkpointLock) {
    monitorDirAndForwardSplits(fileSystem, context);
}

I would say it is safe, when a checkpoint completes, to move to the history folder those files whose modification time is earlier than globalModificationTime, which is updated in monitorDirAndForwardSplits as the splits are collected.
That is, I would extend the ContinuousFileMonitoringFunction class, implement the CheckpointListener interface, and move the already-processed files to the history folder in notifyCheckpointComplete:
public class ArchivingContinuousFileMonitoringFunction<OUT>
        extends ContinuousFileMonitoringFunction<OUT> implements CheckpointListener {

    ...

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        // NOTE: path, format and shouldIgnore() are private in the parent class, so the
        // subclass needs its own copies of them; fs stands for a FileSystem handle,
        // e.g. obtained via FileSystem.get(new Path(path).toUri())
        Map<Path, FileStatus> eligibleFiles = listEligibleForArchiveFiles(fs, new Path(path));
        // do move logic
    }

    /**
     * Returns the paths of the files already processed.
     *
     * @param fileSystem The filesystem where the monitored directory resides.
     */
    private Map<Path, FileStatus> listEligibleForArchiveFiles(FileSystem fileSystem, Path path) {
        final FileStatus[] statuses;
        try {
            statuses = fileSystem.listStatus(path);
        } catch (IOException e) {
            // we may run into an IOException if files are moved while listing their status;
            // delay the check for eligible files in this case
            return Collections.emptyMap();
        }

        if (statuses == null) {
            LOG.warn("Path does not exist: {}", path);
            return Collections.emptyMap();
        } else {
            Map<Path, FileStatus> files = new HashMap<>();
            for (FileStatus status : statuses) {
                if (!status.isDir()) {
                    Path filePath = status.getPath();
                    long modificationTime = status.getModificationTime();
                    // shouldIgnore() returns true for files the monitoring function has
                    // already forwarded, i.e. files not newer than globalModificationTime
                    if (shouldIgnore(filePath, modificationTime)) {
                        files.put(filePath, status);
                    }
                } else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
                    // recurse into nested directories when the input format enumerates them
                    files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath()));
                }
            }
            return files;
        }
    }
}
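The "do move logic" part is left open above; a minimal sketch of what it could look like with Flink's FileSystem API, assuming the custom function is given a history directory (historyPath below is a made-up field, e.g. passed via the constructor):

private void moveToHistory(FileSystem fileSystem, Map<Path, FileStatus> eligibleFiles) throws IOException {
    // create the history folder on first use
    if (!fileSystem.exists(historyPath)) {
        fileSystem.mkdirs(historyPath);
    }
    for (Path filePath : eligibleFiles.keySet()) {
        Path target = new Path(historyPath, filePath.getName());
        // rename() returns false if the file could not be moved; such files stay in
        // place and are simply retried on the next completed checkpoint
        if (!fileSystem.rename(filePath, target)) {
            LOG.warn("Could not move {} to {}", filePath, target);
        }
    }
}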
Then define the data stream manually, using the custom function:
ContinuousFileMonitoringFunction<OUT> monitoringFunction =
        new ArchivingContinuousFileMonitoringFunction<>(
                inputFormat, monitoringMode, getParallelism(), interval);

ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory =
        new ContinuousFileReaderOperatorFactory<>(inputFormat);

final Boundedness boundedness = Boundedness.CONTINUOUS_UNBOUNDED;

env.addSource(monitoringFunction, sourceName, null, boundedness)
        .transform("Split Reader: " + sourceName, typeInfo, factory);
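For completeness, one hypothetical way to fill in the placeholders above with a concrete text input format (the directory, scan interval and source name are assumptions); note that the monitoring function is designed to run with a parallelism of 1, only the split readers scale out:

TextInputFormat inputFormat = new TextInputFormat(new Path("/path/to/incoming"));
TypeInformation<String> typeInfo = BasicTypeInfo.STRING_TYPE_INFO;
String sourceName = "Archiving file source";

ContinuousFileMonitoringFunction<String> monitoringFunction =
        new ArchivingContinuousFileMonitoringFunction<>(
                inputFormat, FileProcessingMode.PROCESS_CONTINUOUSLY, env.getParallelism(), 1000L);

ContinuousFileReaderOperatorFactory<String, TimestampedFileInputSplit> factory =
        new ContinuousFileReaderOperatorFactory<>(inputFormat);

DataStream<String> lines = env
        .addSource(monitoringFunction, sourceName)
        .transform("Split Reader: " + sourceName, typeInfo, factory);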