Introducing Python’s Parse: The Final Different to Common Expressions | by Peng Qian

The parse API is much like Python Regular Expressions, primarily consisting of the parse, search, and findall strategies. Fundamental utilization could be discovered from the parse documentation.

Sample format

The parse format is similar to the Python format syntax. You’ll be able to seize matched textual content utilizing {} or {field_name}.

For instance, within the following textual content, if I need to get the profile URL and username, I can write it like this:

content material:
Good day everybody, my Medium profile url is https://qtalen.medium.com,
and my username is @qtalen.parse sample:
Good day everybody, my Medium profile url is {profile},
and my username is {username}.

Otherwise you need to extract a number of telephone numbers. Nonetheless, the telephone numbers have totally different codecs of nation codes in entrance, and the telephone numbers are of a set size of 11 digits. You’ll be able to write it like this:

compiler = Parser("{country_code}{telephone:11.11},")
content material = "0085212345678901, +85212345678902, (852)12345678903,"outcomes = compiler.findall(content material)
for end in outcomes:
print(end result)

Or if it’s worthwhile to course of a chunk of textual content in an HTML tag, however the textual content is preceded and adopted by an indefinite size of whitespace, you’ll be able to write it like this:

content material:
<div>           Good day World               </div>sample:
<div>{:^}</div>

Within the code above, {:11} refers back to the width, which suggests to seize a minimum of 11 characters, equal to the common expression (.{11,})?. {:.11} refers back to the precision, which suggests to seize at most 11 characters, equal to the common expression (.{,11})?. So when mixed, it means (.{11, 11})?. The result’s:

Capture fixed-width characters. — Seize fixed-width characters. Picture by Writer

Probably the most highly effective characteristic of parse is its dealing with of time textual content, which could be straight parsed into Python datetime objects. For instance, if we need to parse the time in an HTTP log:

content material:
[04/Jan/2019:16:06:38 +0800]sample:
[{:th}]

Retrieving outcomes

There are two methods to retrieve the outcomes:

For capturing strategies that use {} and not using a area identify, you’ll be able to straight use end result.fastened to get the end result as a tuple.
For capturing strategies that use {field_name}, you should utilize end result.named to get the end result as a dictionary.

Customized Sort Conversions

Though utilizing {field_name} is already fairly easy, the supply code reveals that {field_name} is internally transformed to (?P<field_name>.+?). So, parse nonetheless makes use of common expressions for matching. .+? represents a number of random characters in non-greedy mode.

The transformation process of parse format to regular expressions — The transformation technique of parse format to common expressions. Picture by Writer

Nevertheless, usually we hope to match extra exactly. For instance, the textual content “my e-mail is [email protected]”, “my e-mail is {e-mail}” can seize the e-mail. Generally we could get soiled information, for instance, “my e-mail is xxxx@xxxx”, and we don’t need to seize it.

Is there a approach to make use of common expressions for extra correct matching?

That’s when the with_pattern decorator turns out to be useful.

For instance, for capturing e-mail addresses, we will write it like this:

@with_pattern(r'b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Z|a-z]{2,}b')
def e-mail(textual content: str) -> str:
return textual contentcompiler = Parser("my e-mail deal with is {e-mail:E mail}", dict(E mail=e-mail))
legal_result = compiler.parse("my e-mail deal with is [email protected]")  # authorized e-mail
illegal_result = compiler.parse("my e-mail deal with is xx@xx")     # unlawful e-mail

Utilizing the with_pattern decorator, we will outline a customized area sort, on this case, E mailwhich is able to match the e-mail deal with within the textual content. We will additionally use this strategy to match different sophisticated patterns.