python - Robustly retrieving SHA and line content with Git blame (Python 3)

Question

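I'm contributing to a package (Python >= 3.5) that uses git blame to retrieve information about files. I'm working on replacing the GitPython dependency with custom code that supports only the small subset of functionality we actually need (and provides the data in the form we actually need).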

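I found that git blame -lts comes closest to what I need, namely retrieving the commit SHA and the line content for each line in a file. This gives me output like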

82a3e5021b7131e31fc5b110194a77ebee907955 books/main/docs/index.md  5) Softwareplattform [ILIAS](https://www.ilias.de/), die an zahlreichen

which I have processed with

        line_pattern = re.compile(r'(.*?)\s.*\s*\d\)(\s*.*)')

        for line in cmd.stdout():
            m = line_pattern.match(line)
            if m:
                sha = m.group(1)
                content = m.group(2).strip()

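This works well. However, the maintainer of the package rightly warned: "This might introduce hard-to-debug errors for a very specific group of users. It would probably require a lot of unit testing across multiple OSes and Git versions."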
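One concrete way the regex approach can go wrong, independent of OS or Git version, is a source line that itself contains a digit followed by a closing parenthesis. The sketch below wraps the pattern from the question in a hypothetical helper, parse_blame_line, and feeds it two made-up blame lines (the file name and line contents are illustrative, not taken from a real repository).

```python
import re

# The pattern from the question, written as a raw string.
LINE_PATTERN = re.compile(r'(.*?)\s.*\s*\d\)(\s*.*)')


def parse_blame_line(line):
    """Return (sha, content) for one line of `git blame -lts` output, or None."""
    m = LINE_PATTERN.match(line)
    if m is None:
        return None
    return m.group(1), m.group(2).strip()


sha = '82a3e5021b7131e31fc5b110194a77ebee907955'

# A line shaped like the one in the question parses as intended.
plain = sha + ' docs/index.md  5) Softwareplattform ILIAS'

# Hypothetical source line whose content itself contains "<digit>)".
tricky = sha + ' docs/index.md  6) see footnote 1) for details'

print(parse_blame_line(plain))
print(parse_blame_line(tricky))
```

For the second line the greedy `.*` runs up to the last "<digit>)" in the line, so the content comes back as 'for details' rather than 'see footnote 1) for details'. The SHA is still correct, which is exactly the kind of quiet error that is hard to spot.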

I took my approach because I found the output of git blame --porcelain somewhat tedious to parse.

30ed8daf1c48e4a7302de23b6ed262ab13122d31 1 1 1
author XY
author-mail <XY>
author-time 1580742131
author-tz +0100
committer XY
committer-mail <XY>
committer-time 1580742131
committer-tz +0100
summary Stub-Outline-Dateien
filename home/docs/README.md
        hero: abcdefghijklmnopqrstuvwxyz
82a3e5021b7131e31fc5b110194a77ebee907955 18 18

82a3e5021b7131e31fc5b110194a77ebee907955 19 19
        ---
82a3e5021b7131e31fc5b110194a77ebee907955 20 20

...

I don't like the housekeeping involved in this kind of iteration over a list of strings.

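My questions are:

1) Should I rather use the --porcelain output, since it is explicitly intended for machine consumption?
2) Can I expect this format to be robust across Git versions and operating systems? Specifically, can I assume that a line starting with a TAB character is the content line, that it is the last line output for a given source line, and that everything after the TAB is the original line content?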

score 1 · Accepted Answer

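I don't know whether this is the best solution, but I went ahead and tried it without waiting for an answer here. I'm assuming that the answer to both of my questions is "yes".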
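As a minimal sketch of what those two assumptions amount to: if every TAB-prefixed line is a content line and closes the record for one source line, the housekeeping reduces to remembering the most recent header SHA. The names iter_blame_lines and HEADER are hypothetical, and the 40-hex-digit check assumes a SHA-1 repository.

```python
import re

# Header line: "<sha> <orig line> <final line> [<lines in group>]".
# 40 hex digits assumes a SHA-1 repository.
HEADER = re.compile(r'^([0-9a-f]{40}) \d+ \d+( \d+)?$')


def iter_blame_lines(porcelain_lines):
    """Yield (sha, content) for each source line in porcelain output."""
    current_sha = None
    for line in porcelain_lines:
        if line.startswith('\t'):
            # Content line: everything after the TAB is the source line.
            yield current_sha, line[1:]
        else:
            m = HEADER.match(line)
            if m:
                current_sha = m.group(1)
            # Any other line is metadata (author, summary, ...) and is skipped.


# Abridged version of the porcelain output shown in the question.
sample = [
    '30ed8daf1c48e4a7302de23b6ed262ab13122d31 1 1 1',
    'author XY',
    'summary Stub-Outline-Dateien',
    'filename home/docs/README.md',
    '\thero: abcdefghijklmnopqrstuvwxyz',
    '82a3e5021b7131e31fc5b110194a77ebee907955 18 18',
    '\t',
    '82a3e5021b7131e31fc5b110194a77ebee907955 19 19',
    '\t---',
]

pairs = list(iter_blame_lines(sample))
print(pairs)
```

Checking for the TAB before anything else matters here: a content line may contain arbitrary text, so it must never be tested against the header pattern or the metadata keys.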

The following code can be seen in context here: https://github.com/uliska/mkdocs-git-authors-plugin/blob/6f5822c641452cea3edb82c2bbb9ed63bd254d2e/mkdocs_git_authors_plugin/repo.py#L466-L565

    def _process_git_blame(self):
        """
        Execute git blame and parse the results.

        This retrieves all data we need, also for the Commit object.
        Each line will be associated with a Commit object and counted
        to its author's "account".
        Whether empty lines are counted is determined by the
        count_empty_lines configuration option.

        git blame --porcelain will produce output like the following
        for each line in a file:

        When a commit is first seen in that file:
            30ed8daf1c48e4a7302de23b6ed262ab13122d31 1 2 1
            author John Doe
            author-mail <j.doe@example.com>
            author-time 1580742131
            author-tz +0100
            committer John Doe
            committer-mail <j.doe@example.com>
            committer-time 1580742131
            summary Fancy commit message title
            filename home/docs/README.md
                    line content (indicated by TAB. May be empty after that)

        When a commit has already been seen *in that file*:
            82a3e5021b7131e31fc5b110194a77ebee907955 4 5
                    line content

        In this case the metadata is not repeated, but it is guaranteed that
        a Commit object with that SHA has already been created so we don't
        need that information anymore.

        When a line has not been committed yet:
            0000000000000000000000000000000000000000 1 1 1
            author Not Committed Yet
            author-mail <not.committed.yet>
            author-time 1583342617
            author-tz +0100
            committer Not Committed Yet
            committer-mail <not.committed.yet>
            committer-time 1583342617
            committer-tz +0100
            summary Version of books/main/docs/index.md from books/main/docs/index.md
            previous 1f0c3455841488fe0f010e5f56226026b5c5d0b3 books/main/docs/index.md
            filename books/main/docs/index.md
                    uncommitted line content

        In this case exactly one Commit object with the special SHA and fake
        author will be created and counted.

        Args:
            ---
        Returns:
            --- (this method works through side effects)
        """

        re_sha = re.compile(r'^\w{40}')

        cmd = GitCommand('blame', ['--porcelain', str(self._path)])
        cmd.run()

        commit_data = {}
        for line in cmd.stdout():
            key = line.split(' ')[0]
            m = re_sha.match(key)
            if m:
                commit_data = {
                    'sha': key
                }
            elif key in [
                'author',
                'author-mail',
                'author-time',
                'author-tz',
                'summary'
            ]:
                commit_data[key] = line[len(key)+1:]
            elif line.startswith('\t'):
                # assign the line to a commit
                # and create the Commit object if necessary
                commit = self.repo().get_commit(
                    commit_data.get('sha'),
                    # The following values are guaranteed to be present
                    # when a commit is seen for the first time,
                    # so they can be used for creating a Commit object.
                    author_name=commit_data.get('author'),
                    author_email=commit_data.get('author-mail'),
                    author_time=commit_data.get('author-time'),
                    author_tz=commit_data.get('author-tz'),
                    summary=commit_data.get('summary')
                )
                if len(line) > 1 or self.repo().config('count_empty_lines'):
                    author = commit.author()
                    if author not in self._authors:
                        self._authors.append(author)
                    author.add_lines(self, commit)
                    self.add_total_lines()
                    self.repo().add_total_lines()
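
The method above depends on project classes (GitCommand, the repo and Commit objects), so it cannot be run in isolation. The following is a rough standalone sketch of the same idea, not the plugin's actual implementation: metadata is cached per SHA the first time a commit appears, and lines are then counted per author. All names here (parse_porcelain, lines_per_author, blame_file, META_KEYS, UNCOMMITTED_SHA) are hypothetical, and blame_file assumes that git is on the PATH and that the current directory is inside the repository.

```python
import subprocess
from collections import Counter

UNCOMMITTED_SHA = '0' * 40  # SHA git reports for lines not committed yet
META_KEYS = ('author', 'author-mail', 'author-time', 'author-tz', 'summary')
HEX_DIGITS = set('0123456789abcdef')


def parse_porcelain(text):
    """Parse `git blame --porcelain` output.

    Returns (commits, lines): commits maps sha -> metadata dict,
    lines is a list of (sha, content) tuples in file order.
    """
    commits = {}
    lines = []
    sha = None
    for line in text.split('\n'):
        if line.startswith('\t'):
            # Content line: closes the record for one source line.
            lines.append((sha, line[1:]))
            continue
        key, _, value = line.partition(' ')
        if len(key) == 40 and set(key) <= HEX_DIGITS:
            # Header line; metadata follows only on first occurrence.
            sha = key
            commits.setdefault(sha, {'sha': sha})
        elif key in META_KEYS and sha is not None:
            commits[sha][key] = value
    return commits, lines


def lines_per_author(commits, lines, count_empty_lines=False):
    """Count blamed lines per author name."""
    counts = Counter()
    for sha, content in lines:
        if content or count_empty_lines:
            counts[commits[sha].get('author')] += 1
    return counts


def blame_file(path):
    """Run git blame on `path` and parse the result."""
    result = subprocess.run(
        ['git', 'blame', '--porcelain', '--', path],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=True,
    )
    # Decode explicitly instead of relying on the locale's encoding.
    return parse_porcelain(result.stdout.decode('utf-8', errors='replace'))
```

A typical call would be `commits, lines = blame_file('docs/index.md')` followed by `lines_per_author(commits, lines)`. Uncommitted lines show up under UNCOMMITTED_SHA with the author "Not Committed Yet", as described in the docstring above. Splitting on '\n' only, rather than using str.splitlines(), is a deliberate choice so that other line-boundary characters inside source lines are left untouched.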
