python - boto3 Textract start_document_text_detection does not accept folders for input files on S3

Question

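I wrote a Lambda function to extract text from image files stored in S3. The Lambda is triggered by new objects. The images are stored in folders. When I test with a file stored at the root of the S3 bucket, everything works. When I use a folder, it breaks.

When documentLocation looks like this: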

{'S3Object': {'Bucket': 'extractbucket', 'Name': 'img024.jpg'}}

everything works.

When it looks like this:

`{'S3Object': {'Bucket': 'extractbucket', 'Name': 'afold/img024.jpg'}}`

I get an InvalidParameterException.

Steps to reproduce

Here is my Lambda function (Python 3.8, region: us-east-2):

import json
import boto3

def lambda_handler(event, context):

    bucket="extractbucket"
    client = boto3.client('textract')
    jobFile = event['Records'][0]['s3']['object']['key']
    #process using S3 object
    docLoc = {
        "S3Object":{
            "Bucket": bucket,
            "Name": jobFile
        }
    }
    response = client.start_document_text_detection(
        DocumentLocation=docLoc,
        JobTag=jobFile,
        NotificationChannel={
            "RoleArn":"arn:aws:iam::xxxxx:role/Textract_demo_sns",
            "SNSTopicArn": "arn:aws:sns:us-east-2:xxxxx:TxtExtractComplete"
        }
    )
    return {
        'statusCode': 200,
        'body': json.dumps("sent filejobID:" + jobFile + " to queue")
    }

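I tested this with the S3 trigger test event, putting the file name in the object key. When I test with a file in the root, everything works; when I test with a file in a folder, it fails. The failure looks like this:

Debug logs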

InvalidParameterException: An error occurred (InvalidParameterException) when calling the StartDocumentTextDetection operation: Request has invalid parameters
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 18, in lambda_handler
    response = client.start_document_text_detection(
  File "/var/runtime/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/var/runtime/botocore/client.py", line 626, in _make_api_call
    raise error_class(parsed_response, operation_name)

Any help would be greatly appreciated. Thanks for your time.

score 4 · Accepted Answer

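My testing shows that start_document_text_detection() works fine with objects in subdirectories.

I suspect the Key contains URL-encoded characters rather than plain slashes. You can check this by printing the value of jobFile and looking at it in the logs.

Here is code that avoids the problem: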
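The check suggested above can be sketched as a small helper that logs the key exactly as it arrived and reports whether URL-decoding changes it. The function name describe_key and the sample keys are hypothetical, chosen only for illustration:

```python
import urllib.parse


def describe_key(raw_key: str) -> dict:
    """Return the raw key, its URL-decoded form, and whether they differ."""
    decoded = urllib.parse.unquote_plus(raw_key)
    return {"raw": raw_key, "decoded": decoded, "was_encoded": decoded != raw_key}


# A plain key is unchanged by decoding; an encoded one is not.
print(describe_key("afold/img024.jpg"))
print(describe_key("afold/my+scan%281%29.jpg"))
```

If was_encoded comes back False for the failing key, URL encoding is probably not the cause and the other request parameters are worth checking.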

bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

You will also need to import urllib.parse.
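As a minimal, self-contained sketch of the same idea, the helper below builds the DocumentLocation dictionary from an S3 event, taking the bucket from the event and decoding the key. The helper name s3_location_from_event and the sample event are assumptions made for illustration; only the fields used in the snippet above are included:

```python
import urllib.parse


def s3_location_from_event(event: dict) -> dict:
    """Build a Textract DocumentLocation from the first record of an S3 event."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Decode percent-escapes and turn '+' back into a space.
    key = urllib.parse.unquote_plus(s3_info["object"]["key"])
    return {"S3Object": {"Bucket": bucket, "Name": key}}


# Hypothetical, trimmed-down event for demonstration.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "extractbucket"},
                "object": {"key": "afold/my+scan%281%29.jpg"}}}
    ]
}
print(s3_location_from_event(sample_event))
```

Reading the bucket name from the event rather than hard-coding it keeps the function usable if the trigger is later attached to another bucket.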

score 4 · Accepted Answer

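The JobTag identifier does not accept spaces or most symbols, including "/".

So when a file inside a folder is added and its key is assigned to JobTag, the slash in the key makes JobTag invalid, and the call returns the invalid-parameter error.

Fix: use Python's replace to remove or replace the slashes in the job tag.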
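One way to apply this is sketched below. Beyond a plain key.replace("/", "_"), it replaces every character outside an allow-list and truncates the result. The allow-list pattern and the 64-character limit reflect my reading of the Textract API reference and should be treated as assumptions to verify; the name make_job_tag is hypothetical:

```python
import re

# Assumed JobTag constraints: letters, digits, '_', '.', '-', ':' and at most 64 characters.
_DISALLOWED = re.compile(r"[^a-zA-Z0-9_.\-:]")
MAX_JOB_TAG_LENGTH = 64


def make_job_tag(key: str) -> str:
    """Derive a JobTag-safe string from an S3 object key."""
    tag = _DISALLOWED.sub("_", key)  # slashes, spaces, etc. become underscores
    return tag[:MAX_JOB_TAG_LENGTH]


print(make_job_tag("afold/img024.jpg"))
```

In the handler from the question, this would mean passing JobTag=make_job_tag(jobFile) while leaving "Name": jobFile untouched, since the object key itself still needs its slashes.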
