The parse API is similar to Python regular expressions and consists mainly of the parse, search, and findall methods. Basic usage is covered in the parse documentation.
Pattern format
The parse format is similar to Python’s format syntax. You can capture matched text using {} or {field_name}.
For example, to get the profile URL and the username from the following text, you can write it like this:
content:
Good day everybody, my Medium profile url is https://qtalen.medium.com,
and my username is @qtalen.
parse pattern:
Good day everybody, my Medium profile url is {profile},
and my username is {username}.
Or suppose you want to extract several phone numbers. The numbers are prefixed with country codes in different formats, but each number itself has a fixed length of 11 digits. You can write it like this:
from parse import Parser

compiler = Parser("{country_code}{telephone:11.11},")
content = "0085212345678901, +85212345678902, (852)12345678903,"
results = compiler.findall(content)
for result in results:
    print(result)
Or, if you need to process a piece of text inside an HTML tag, but the text is preceded and followed by an arbitrary amount of whitespace, you can write it like this:
content:
<div>      Good day World      </div>
pattern:
<div>{:^}</div>
In the phone number pattern above, {:11} sets the width, which means capturing at least 11 characters, roughly equivalent to the regular expression .{11,}. {:.11} sets the precision, which means capturing at most 11 characters, roughly equivalent to the regular expression .{,11}. Combined as {:11.11}, they mean exactly 11 characters, that is .{11,11}.
The most powerful feature of parse is its handling of time text, which can be parsed directly into Python datetime objects. For example, to parse the timestamp in an HTTP log:
content:
[04/Jan/2019:16:06:38 +0800]
pattern:
[{:th}]
Retrieving results
There are two ways to retrieve the results:
- For captures that use {} without a field name, use result.fixed to get the results as a tuple.
- For captures that use {field_name}, use result.named to get the results as a dictionary.
Custom Type Conversions
Although {field_name} is already quite simple to use, the source code shows that {field_name} is converted internally to (?P<field_name>.+?). So parse still relies on regular expressions for matching. .+? matches one or more arbitrary characters in non-greedy mode.
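To make the non-greedy behaviour concrete, here is a sketch that uses only the standard re module with a hand-written named group of the same shape. It is an illustration of the idea, not the exact expression parse generates, and the sample sentence is made up:

```python
import re

text = "my username is @qtalen. Nice to meet you."

# Non-greedy: .+? stops at the first place the rest of the pattern fits.
lazy = re.search(r"my username is (?P<username>.+?)\.", text)
print(lazy.group("username"))

# Greedy: .+ runs as far as it can, so it swallows the next sentence too.
greedy = re.search(r"my username is (?P<username>.+)\.", text)
print(greedy.group("username"))
```

The non-greedy form is what keeps a field from running past the literal text that follows it in the pattern.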
However, we often want to match more precisely. For example, given the text “my email is [email protected]”, the pattern “my email is {email}” captures the address. Sometimes the data is dirty, for example “my email is xxxx@xxxx”, and we don’t want to capture it.
Is there a way to use regular expressions for more accurate matching?
That’s where the with_pattern decorator comes in handy.
For example, to capture email addresses, we can write it like this:
from parse import Parser, with_pattern

# Custom field type: only text that matches this regex is captured as Email.
@with_pattern(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
def email(text: str) -> str:
    return text

compiler = Parser("my email address is {email:Email}", dict(Email=email))
legal_result = compiler.parse("my email address is [email protected]")  # legal email
illegal_result = compiler.parse("my email address is xx@xx")  # illegal email
Using the with_pattern decorator, we can define a custom field type, in this case Email, which matches the email address in the text. The same approach works for other complicated patterns.