The Python Road: Regular Expressions


Published: 2019-06-09


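Python 3 regular expressions

Preface:

(1) Processing text is one of a computer's main jobs.

(2) Searching text for content that fits a fixed pattern is a routine part of that work.

(3) Regular expressions were created to make such searches quick and convenient; they have since grown into a technique of their own that many languages support.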

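1. Definition:

A regular expression is a high-level text-matching pattern that supports searching, replacing and similar operations. It is a string made up of ordinary characters and special symbols; the string describes characters and how they repeat, so it can match a whole set of strings that share some feature.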

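2. Goals

(1) Know the regular expression symbols and how to use them.

(2) Be able to read regular expressions correctly and write simple ones for matching.

(3) Be able to work with regular expressions through Python's re module.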

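3. Features of regular expressions:

(1) They make searching and modifying text convenient.

(2) Many languages support them.

(3) They are flexible and can be written in many ways.

(4) Typical uses: text processing, handling a class of strings in MongoDB, URL routing in Django and Tornado, and matching text in web crawlers.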

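Rules and usage of regular expressions

Import the re module first (import re).

findall(pattern, string, flags=0)

Purpose: match a string against a regular expression

Parameters: pattern: the regular expression

     string: the target string

Return value: a list of the matched content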
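The description of findall above can be tried directly. This is a minimal sketch; the sample text is made up for illustration.

```python
import re

text = 'abcdef abc'

# Every non-overlapping match, scanning left to right.
hits = re.findall('ab', text)

# When nothing matches, the result is an empty list rather than None.
misses = re.findall('xyz', text)

print(hits)    # ['ab', 'ab']
print(misses)  # []
```

Because the result is always a list, it can be looped over or measured with len() without first checking for None.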

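Metacharacters (characters that have a special meaning in a regular expression)

* Ordinary characters

Characters: abc

Rule: an ordinary character matches itself

e.g. ab ---> abcdef : ab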

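* Alternation: try several patterns at once

Metacharacter: |

Rule: the pattern on either side of the symbol may match

e.g. ab|fg ---> absrgerfg : ab fg

* Match a single character

Metacharacter: .

Rule: matches any one character except '\n'

e.g. f.o ---> foo fuo fao f@o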

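* Match the start of a string

Metacharacter: ^

Rule: matches the position at the start of a string

e.g. ^Hello ---> Hello world : Hello

* Match the end of a string

Metacharacter: $

Rule: matches the position at the end of a string

e.g. py$ ---> Hello.py : py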

* Match zero or more repetitions

Metacharacter: *

Rule: matches the preceding expression zero or more times

e.g. ab* ---> a ab abbbb

>>> re.findall('ab','absgerewrgabsgre')
['ab', 'ab']
>>> re.findall('ab|fg','gwrgergabrgefg')
['ab', 'fg']
>>> re.findall('f.o','foofaoagref@o')
['foo', 'fao', 'f@o']
>>> re.findall('^H','Hello world')
['H']
>>> re.findall('^Hello','Hello world')
['Hello']
>>> re.findall('py$','hello.py')
['py']
>>> re.findall('py$','python')
[]
>>> re.findall('ab*','absgewraggerweabbbbbgrergbgreeab')
['ab', 'a', 'abbbbb', 'ab']
>>> re.findall('.*py$','hello.py')
['hello.py']
>>> re.findall('.*py$','hellopy')
['hellopy']

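* Match one or more repetitions

Metacharacter: +

Rule: matches the preceding expression at least once

e.g.
>>> re.findall('ab+','absgweweabagwerabbbbb')
['ab', 'ab', 'abbbbb']

* Match zero or one repetition

Metacharacter: ?

Rule: matches the preceding expression zero times or once

e.g.
>>> re.findall('ab?','absgweweabagwerabbbbb')
['ab', 'ab', 'a', 'ab']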

* Match an exact number of repetitions

Metacharacter: {N}

Rule: matches the preceding expression exactly N times

e.g. ab{3} ---> abbb

* Match a range of repetitions

Metacharacter: {M,N} (written without a space after the comma)

Rule: matches the preceding expression at least M and at most N times

e.g.
>>> re.findall('ab{3,5}','abbsgweabbbweabbbbagwerabbbbb')
['abbb', 'abbbb', 'abbbbb']

>>> import re
>>> re.findall('ab*','absgweweabagwerabbbbb')
['ab', 'ab', 'a', 'abbbbb']
>>> re.findall('ab+','absgweweabagwerabbbbb')
['ab', 'ab', 'abbbbb']
>>> re.findall(r'.+\.py$','hello.py')
['hello.py']
>>> re.findall(r'.+\.py$','h.py')
['h.py']
>>> re.findall('ab?','absgweweabagwerabbbbb')
['ab', 'ab', 'a', 'ab']
>>> re.findall('ab{3}','absgweweabagwerabbbbb')
['abbb']
>>> re.findall('.{8}','absgweweabagwerabbbbb')
['absgwewe', 'abagwera']
>>> re.findall('ab{3,5}','abbsgweabbbweabbbbagwerabbbbb')
['abbb', 'abbbb', 'abbbbb']
>>> re.findall('.{4,6}','absgweweabagwerabbbbb')
['absgwe', 'weabag', 'werabb']
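To compare the repetition metacharacters side by side, the sketch below runs each one against the same string. The sample text and the variable names are chosen only for illustration.

```python
import re

text = 'a ab abb abbb abbbb'
patterns = ['ab*', 'ab+', 'ab?', 'ab{3}', 'ab{2,3}']

# Map each pattern to everything it finds in the same text.
results = {p: re.findall(p, text) for p in patterns}

for p in patterns:
    print(p.ljust(8), results[p])
# ab*      ['a', 'ab', 'abb', 'abbb', 'abbbb']
# ab+      ['ab', 'abb', 'abbb', 'abbbb']
# ab?      ['a', 'ab', 'ab', 'ab', 'ab']
# ab{3}    ['abbb', 'abbb']
# ab{2,3}  ['abb', 'abbb', 'abbb']
```

Note that each quantifier applies only to the b directly in front of it, not to the whole of ab.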

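* Match a character set

Metacharacter: [abcd]

Rule: matches any one character from the set in the brackets, or from a range given in the set

e.g. [abcd] ---> a b c d

  [0-9] ---> 1, 2, 3, ... : any one digit

  [A-Z] ---> A, B, C, ... : any one uppercase letter

  [a-z] ---> a, b, c, ... : any one lowercase letter

Several sets and ranges can be written together. A literal - must come first or last in the brackets; written as [+-*/0-9a-g] the part +-* would be read as an invalid range.

[-+*/0-9a-g]

>>> re.findall('^[A-Z][0-9a-z]{5}','Hello1 Join')
['Hello1']
>>> re.findall('^[A-Z][0-9a-z]+','Hello1 Join')
['Hello1']
>>> re.findall('^[A-Z][0-9a-z]+','Hello1Join')
['Hello1']
>>> re.findall('^[A-Z][0-9a-z]+','Hello1join')
['Hello1join']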

* Negated character set

Metacharacter: [^.....]

Rule: matches any one character that is not in the set

e.g. [^abcd] ---> e f & #

>>> re.findall('[^_0-9a-zA-Z]','helo@163.com')
['@', '.']
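A short sketch of sets and negated sets working together. The helper name split_kinds and the sample expression are made up for illustration; the literal - is placed first inside the brackets so that it is not read as a range.

```python
import re

def split_kinds(text):
    """Sort the characters of text into digits, operators and everything else."""
    digits = re.findall(r'[0-9]', text)
    operators = re.findall(r'[-+*/]', text)
    # Negated set: anything that is not an operator, a digit or a space.
    others = re.findall(r'[^-+*/0-9 ]', text)
    return digits, operators, others

print(split_kinds('3 + 4*x - 7/y'))
# (['3', '4', '7'], ['+', '*', '-', '/'], ['x', 'y'])
```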

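* Match any digit (or non-digit) character

Metacharacters: \d (same as [0-9]), \D (same as [^0-9])

Rule: \d matches any one digit; \D matches any one character that is not a digit

e.g.
>>> re.findall(r'1\d{10}','13523538796')
['13523538796']

* Match any word character (or non-word character)

Metacharacters: \w (same as [_0-9a-zA-Z]), \W (same as [^_0-9a-zA-Z])

Rule: \w matches a digit, letter or underscore; \W matches anything else

The bracket equivalents hold for ASCII text. For str patterns in Python 3, \d and \w also match non-ASCII digits and letters unless the re.ASCII flag is given.

e.g.
>>> re.findall(r'[A-Z]\w*','Hello World')
['Hello', 'World']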

* Match any whitespace (or non-whitespace) character

Metacharacters: \s \S

Rule: \s matches any whitespace character, the same as [ \t\n\r\f\v]: space, tab, newline, carriage return, form feed, vertical tab

   \S matches any non-whitespace character

e.g.
>>> re.findall(r'hello\s+\S+','hello l&#y hello lucy helloksge')
['hello l&#y', 'hello lucy']

>>> re.findall(r'1\d{10}','13523538796')
['13523538796']
>>> re.findall(r'\w*','Hello World')
['Hello', '', 'World', '']
>>> re.findall(r'[A-Z]\w*','Hello World')
['Hello', 'World']
>>> re.findall('[a-z]*-[0-9]{2}','wangming-20')
['wangming-20']
>>> re.findall(r'\w+-\d+','wangming-20')
['wangming-20']
>>> re.findall(r'\w+.\d+','wangming-20')
['wangming-20']
>>> re.findall(r'hello \w+','hello lily hello lucy helloksge')
['hello lily', 'hello lucy']
>>> re.findall(r'hello \w+','hello lily hello   lucy helloksge')
['hello lily']
>>> re.findall(r'hello\s+\w+','hello lily hello   lucy helloksge')
['hello lily', 'hello   lucy']
>>> re.findall(r'hello\s+\S','hello l&#y hello   lucy helloksge')
['hello l', 'hello   l']
>>> re.findall(r'hello\s+\S+','hello l&#y hello   lucy helloksge')
['hello l&#y', 'hello   lucy']
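The shorthand classes \d, \w, \s and their uppercase opposites combine naturally. The sketch below pulls different kinds of tokens out of one line; the sample line is invented for illustration.

```python
import re

line = 'user_01\tlogin  2019-06-09\n'

# Runs of non-whitespace: the fields of the line.
fields = re.findall(r'\S+', line)

# Runs of digits anywhere in the line.
numbers = re.findall(r'\d+', line)

# Characters that are neither word characters nor whitespace.
punctuation = re.findall(r'[^\w\s]', line)

print(fields)       # ['user_01', 'login', '2019-06-09']
print(numbers)      # ['01', '2019', '06', '09']
print(punctuation)  # ['-', '-']
```

The underscore counts as a word character, which is why it does not show up in the punctuation list.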

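* Match the start and end of a string

Metacharacters: \A (like ^), \Z (like $)

Rule: \A matches the position at the start of the string

   \Z matches the position at the end of the string

e.g. \Aabc\Z ---> abc

* Match a word boundary (or non-boundary)

Metacharacters: \b \B

Rule: \b matches at a word boundary

   \B matches at a position that is not a word boundary

A word boundary is any place where a digit, letter or underscore meets some other character (or the start or end of the string).

e.g.
>>> re.findall(r'\bis\b','This is a test')
['is']

>>> re.findall(r'\Aabc\Z','abcabc')
[]
>>> re.findall(r'\Aabc\Z','abbc')
[]
>>> re.findall(r'\Aabc\Z','abc')
['abc']
>>> re.findall(r'\Aabc\Z','abc abc')
[]
>>> re.findall(r'\Aabc','abc abc')
['abc']
>>> re.findall(r'abc\Z','abc abc')
['abc']
>>> re.findall(r'abc\Z','abcsgeraqabc')
['abc']
>>> re.findall(r'\Aabc\Z','abcsgaberaqabc')
[]
>>> re.findall('is','This is a test')
['is', 'is']
>>> re.findall('\bis\b','This is a test')
[]
>>> re.findall(r'\bis\b','This is a test')
['is']
>>> re.findall(r'\b86\b','10086 1008612')
[]
>>> re.findall(r'\b10086\b','10086 1008612')
['10086']
>>> re.findall(r'\Bis','This is a test')
['is']

Without the r prefix, '\bis\b' finds nothing because Python turns \b inside an ordinary string literal into a backspace character before the re module sees it.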
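One common use of \b is counting whole words only. The helper below is a sketch, not part of the re module; the name count_word is hypothetical, and re.escape is used so that any special characters in the word are taken literally.

```python
import re

def count_word(word, text):
    """Count occurrences of word in text that stand alone as whole words."""
    pattern = r'\b' + re.escape(word) + r'\b'
    return len(re.findall(pattern, text))

print(count_word('is', 'This is a test, is it?'))   # 2
print(count_word('cat', 'cat concat cats cat.'))    # 2
```

The is inside This and the cat inside concat and cats are skipped because a letter sits directly next to them, so there is no boundary on that side.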

Metacharacter summary:

Ordinary characters: match themselves

Match a single character: . \d \D \w \W \s \S [.....] [^.....]

Match repetitions: * + ? {N} {M,N}

Match a position in the string: ^ $ \A \Z \b \B

Other: |

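Raw strings and escaping

Characters that need escaping: . * ? $ " " ' ' [ ] ( ) { } \

r ---> turns the string literal into a raw string

Python then leaves the backslash escapes in it unprocessed

Two equivalent ways to write the same pattern:

>>> re.findall('\\? \\* \\\\','what? * \\')
['? * \\']
>>> re.findall(r'\? \* \\','what? * \\')
['? * \\']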
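The difference a raw string makes can be seen from the length of the literal itself, and re.escape offers a way to escape metacharacters without typing the backslashes by hand. A small sketch with made-up sample strings:

```python
import re

# In an ordinary literal '\b' is one character (backspace);
# in a raw literal it stays as backslash + b.
plain_len = len('\b')
raw_len = len(r'\b')
print(plain_len, raw_len)   # 1 2

# An unescaped + is a quantifier, so this pattern means "one or more 1, then 1=2".
unescaped = re.findall('1+1=2', 'so 1+1=2 holds')

# re.escape puts a backslash in front of the metacharacters for us.
escaped = re.findall(re.escape('1+1=2'), 'so 1+1=2 holds')

print(unescaped)  # []
print(escaped)    # ['1+1=2']
```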

Greedy and non-greedy matching

This concerns the repetition metacharacters:

* + ? {m,n}

Greedy mode:

  By default a repetition metacharacter (* + ? {m,n}) always matches as much of the following text as it can. This is greedy mode.

>>> re.findall('ab*','abbbbasgerab')
['abbbb', 'a', 'ab']
>>> re.findall('ab+','abbbbasgerab')
['abbbb', 'ab']
>>> re.findall('ab?','abbbbasgerab')
['ab', 'a', 'ab']
>>> re.findall('ab{3,5}','abbbbasgerab')
['abbbb']

Non-greedy mode:

  Matches as little as possible, just enough to satisfy the pattern.

Greedy ---> non-greedy: add a ? after the quantifier: *? ?? +? {m,n}?

>>> re.findall('ab??','abbbbasgerab')
['a', 'a', 'a']
>>> re.findall('ab{3,5}?','abbbbasgerabb')
['abbb']
>>> re.findall('ab+?','abbbbasgerab')
['ab', 'ab']
>>> re.findall('ab*?','abbbbasgerab')
['a', 'a', 'a']
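The practical effect of greedy versus non-greedy matching shows up most clearly when a pattern has a closing delimiter. The sketch below uses a made-up fragment of markup as the sample text.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last > in the string.
greedy = re.findall(r'<.+>', html)

# Non-greedy: .+? stops at the first > it can.
lazy = re.findall(r'<.+?>', html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```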

Grouping in regular expressions

Use () to group part of a regular expression

(ab)cde : makes ab a subgroup

>>> re.match('(ab)cdef','abcdefghig')
<_sre.SRE_Match object; span=(0, 6), match='abcdef'>
>>> re.match('(ab)cdef','abcdefghig').group()
'abcdef'
>>> re.match('(ab)cdef','cabcdefghig').group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

match only succeeds when the pattern matches at the very start of the string; otherwise it returns None.

1. A subgroup is written with (); adding a subgroup does not change what the pattern as a whole matches.

2. A regular expression can contain several subgroups. They are numbered first, second, third, ... from the outside in and from left to right.

((ab)cd(ef)) has 3 subgroups

>>> re.match('(ab)cdef','abcdefghig').group()
'abcdef'
>>> re.match('(ab)cdef','abcdefghig').group(1)
'ab'
>>> re.match('(ab)cd(ef)','abcdefghig').group(1)
'ab'
>>> re.match('(ab)cd(ef)','abcdefghig').group(2)
'ef'
>>> re.match('((ab)cd(ef))','abcdefghig').group(1)
'abcdef'
>>> re.match('((ab)cd(ef))','abcdefghig').group(2)
'ab'
>>> re.match('((ab)cd(ef))','abcdefghig').group(3)
'ef'
>>> re.match('((ab)cd(ef))','abcdefghig').group()
'abcdef'

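3. A subgroup is a unit inside the pattern; many functions can extract the value of a subgroup on its own.

>>> re.match('(ab)cdef','abcdefghig').group(1)
'ab'

4. A subgroup changes what a repetition applies to: the whole subgroup is repeated as a unit.

>>> re.match('(ab)*','ababababab').group()
'ababababab'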
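The numbering rule (outside in, left to right) and the repetition of a whole subgroup can be checked on a slightly more realistic string. This is a sketch; the date string is sample data only.

```python
import re

# Group 1 is the outer group; groups 2 and 3 are nested inside it; group 4 follows.
m = re.match(r'((\d{4})-(\d{2}))-(\d{2})', '2019-06-09')

print(m.group())    # '2019-06-09'  (group 0: the whole match)
print(m.group(1))   # '2019-06'
print(m.group(2))   # '2019'
print(m.group(3))   # '06'
print(m.group(4))   # '09'
print(m.groups())   # ('2019-06', '2019', '06', '09')

# The + applies to the whole group (ab), not just to b.
repeated = re.match(r'(ab)+', 'ababab!').group()
print(repeated)     # 'ababab'
```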

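Named and unnamed groups

Format: (?P<name>regex)

(?P<word>ab)cdef

(1) Some functions can extract a subgroup's content by its name, or build key-value pairs from the names.

  >>> re.match('(?P<word>ab)cdef','abcdefghi').group()
  'abcdef'

(2) A subgroup that has a name can be reused later in the pattern through that name:

(?P=name)

  >>> re.match('(?P<word>ab)cdef(?P=word)','abcdefabghi').group()
  'abcdefab'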
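Both uses of a named group, looking a value up by name and referring back to it with (?P=name), are sketched below. The group names user, host and w and the sample strings are chosen for illustration.

```python
import re

m = re.search(r'(?P<user>\w+)@(?P<host>\w+\.\w+)', 'mail helo@163.com now')

print(m.group('user'))   # 'helo'
print(m.group('host'))   # '163.com'
print(m.groupdict())     # {'user': 'helo', 'host': '163.com'}

# (?P=w) must match exactly the text that group w matched,
# so this pattern finds words that appear twice in a row.
doubled = re.findall(r'\b(?P<w>\w+) (?P=w)\b', 'it is is a a test')
print(doubled)           # ['is', 'a']
```

A back-reference repeats the matched text, not the pattern: (?P<w>\w+) (?P=w) matches "is is" but not "is it".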

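Exercises:

Match a password 8 to 10 characters long that starts with a letter and consists of digits, letters and underscores

^[a-zA-Z]\w{7,9}$

Match an ID card number

\d{17}(\d|x)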
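The two exercise patterns can be wrapped in small validators. This is one possible sketch: the function names are hypothetical, fullmatch is used in place of ^ and $ (with match, $ would also accept a trailing newline), and re.ASCII is an assumption added here so that \w and \d are limited to ASCII letters and digits.

```python
import re

# Same patterns as the exercise, minus the anchors that fullmatch makes unnecessary.
PASSWORD = re.compile(r'[a-zA-Z]\w{7,9}', re.ASCII)
ID_NUMBER = re.compile(r'\d{17}(\d|x)', re.ASCII)

def is_password(s):
    """True if s is 8 to 10 characters, starts with a letter, rest letters/digits/_."""
    return PASSWORD.fullmatch(s) is not None

def is_id_number(s):
    """True if s is 17 digits followed by one more digit or a lowercase x."""
    return ID_NUMBER.fullmatch(s) is not None

print(is_password('abc_1234'))              # True  (8 characters)
print(is_password('abc1234'))               # False (only 7 characters)
print(is_password('1abcdefgh'))             # False (starts with a digit)
print(is_id_number('12345678901234567x'))   # True
print(is_id_number('1234567890123456789'))  # False (19 digits)
```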

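The re module

compile(pattern, flags=0)

Purpose: build a regular expression object

Parameters: pattern: the regular expression

     flags: flag bits that adjust how the regular expression matches

Return value: the corresponding regular expression object

Note: the object returned by compile has methods that partly overlap with the module-level functions of re.

(1) What is the same

  A method does exactly the same job as the module-level function of the same name.

(2) What is different

  The methods of the compiled object take no pattern or flags arguments, because both were fixed when compile created the object; the module-level functions need them passed in.

  The methods of the compiled object accept pos and endpos arguments that limit which part of the target string is searched; the module-level functions do not have these two parameters.

>>> obj = re.compile('abc')
>>> obj.findall('abcdef')
['abc']
>>> re.findall('abc','abcdef')
['abc']
>>> obj.findall('abcdef',pos=0, endpos=20)
['abc']
>>> obj.findall('abcdef',pos=4, endpos=20)
[]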
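The pos and endpos arguments of a compiled object behave like slice bounds on the target string. A minimal sketch with a made-up pattern and text:

```python
import re

word = re.compile(r'[a-z]+')
text = 'abc def ghi'

everything = word.findall(text)         # search the whole string
from_4 = word.findall(text, 4)          # start searching at index 4
window = word.findall(text, 4, 7)       # search only text[4:7]
at_4 = word.match(text, 4).group()      # match must begin at index 4

print(everything)  # ['abc', 'def', 'ghi']
print(from_4)      # ['def', 'ghi']
print(window)      # ['def']
print(at_4)        # 'def'
```

Compiling once and reusing the object also keeps the pattern in one place when the same expression is applied to many strings.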

findall(string, pos, endpos)

Purpose: return everything the regular expression matches as a list

Parameters: string: the target string to match against; pos and endpos optionally limit the region searched

Return value: a list of the matched content

Note: if the regular expression contains subgroups, findall returns the content matched by the subgroups rather than the whole match.

[root@shenzhen re]# python3 regex.py
['hello', 'world']
[root@shenzhen re]# vim regex.py
[root@shenzhen re]# python3 regex.py
['Hello', 'world']
[root@shenzhen re]# vim regex.py
[root@shenzhen re]# python3 regex.py
['Hello_world']
[root@shenzhen re]# vim regex.py
[root@shenzhen re]# python3 regex.py
['_Hello_world']
[root@shenzhen re]# cat regex.py
#!/usr/local/bin/python3
import re

pattern = r'\w+'
obj = re.compile(pattern)
l = obj.findall('_Hello_world')
print(l)

[root@shenzhen re]# python3 regex1.py
[('ab', 'ef'), ('ab', 'ef')]
[root@shenzhen re]# cat regex1.py
#!/usr/local/bin/python3
import re

pattern = r'(ab)cd(ef)'
obj = re.compile(pattern)
l = obj.findall('abcdefaaagabcdef')
print(l)

With pattern = r'((ab)cd(ef))' the result becomes [('abcdef', 'ab', 'ef'), ('abcdef', 'ab', 'ef')]
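How the shape of findall's result depends on the number of subgroups can be summarised in one sketch. The sample text is made up; (?:...) is the non-capturing form of a group, which groups without creating a numbered subgroup.

```python
import re

text = 'x=1, y=22'

no_group = re.findall(r'\w=\d+', text)           # whole matches
one_group = re.findall(r'\w=(\d+)', text)        # only the group's content
two_groups = re.findall(r'(\w)=(\d+)', text)     # one tuple per match
non_capturing = re.findall(r'(?:\w)=(?:\d+)', text)  # back to whole matches

print(no_group)       # ['x=1', 'y=22']
print(one_group)      # ['1', '22']
print(two_groups)     # [('x', '1'), ('y', '22')]
print(non_capturing)  # ['x=1', 'y=22']
```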

split()

Purpose: split a string wherever the regular expression matches

Return value: a list of the pieces

e.g. l1 = re.split(r'\s+','hello world nihao China')

---> ['hello', 'world', 'nihao', 'China']
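A few more variations on split, as a sketch with made-up sample strings: splitting on a character set, keeping the separators by putting them in a subgroup, and limiting the number of splits with the maxsplit argument.

```python
import re

# Split on any run of commas, semicolons or whitespace.
parts = re.split(r'[,;\s]+', 'a, b;c  d')

# A subgroup in the pattern keeps the separators in the result.
tokens = re.split(r'\s*([+-])\s*', '1 + 2 - 3')

# maxsplit limits how many splits are made; the rest stays in one piece.
limited = re.split(r'\s+', 'a b c d', maxsplit=2)

print(parts)    # ['a', 'b', 'c', 'd']
print(tokens)   # ['1', '+', '2', '-', '3']
print(limited)  # ['a', 'b', 'c d']
```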

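sub(pattern, repl, string, count=0)

Purpose: replace the content matched by the regular expression with a replacement string

Parameters: repl: what to replace the matches with

     string: the target string to match against

     count: the maximum number of replacements (0, the default, replaces every match)

Return value: the string after replacement

e.g. s = re.sub(r'[A-Z]','##','Hi,Tom, It is a fine day')

  ---> ##i,##om, ##t is a fine day

  s = re.sub(r'[A-Z]','##','Hi,Tom, It is a fine day', count=2)

  ---> ##i,##om, It is a fine day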
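The replacement passed to sub can also refer to subgroups, or be a function that receives each match object. A minimal sketch; the helper name shout and the sample strings are made up.

```python
import re

# \1, \2, \3 in the replacement refer to the numbered subgroups.
swapped = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3/\2/\1', 'on 2019-06-09')

def shout(match):
    """Replacement function: return the matched text in uppercase."""
    return match.group().upper()

# The function is called once per match; here for the first letter of each word.
titled = re.sub(r'\b[a-z]', shout, 'hi tom, fine day')

# count limits the number of replacements.
once = re.sub(r'\s+', ' ', 'a   b   c', count=1)

print(swapped)  # 'on 09/06/2019'
print(titled)   # 'Hi Tom, Fine Day'
print(once)     # 'a b   c'
```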

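subn()

Purpose: same as sub

Parameters: same as sub

Return value: a tuple that holds the new string and, in addition to what sub returns, the number of replacements actually made

e.g. s = re.subn(r'[A-Z]','##','Hi,Tom, It is a fine day', count=2)

('##i,##om, It is a fine day', 2)

s = re.subn(r'[A-Z]','##','Hi,Tom, It is a fine day')

('##i,##om, ##t is a fine day', 3)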
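Since subn returns a tuple, it is usually unpacked into two names. A short sketch with a made-up string:

```python
import re

# Collapse each run of whitespace into one space and count the runs replaced.
cleaned, n = re.subn(r'\s+', ' ', 'a   b \t c')

print(cleaned)  # 'a b c'
print(n)        # 2
```

The count is handy for checking whether anything changed at all: n is 0 when the pattern never matched.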

groupindex: attribute of a compiled object; a dictionary that maps each group name to its group number

groups: attribute of a compiled object; the total number of subgroups

[root@shenzhen re]# cat regex1.py
#!/usr/local/bin/python3
import re

pattern = r'((?P<word>ab)cd(?P<test>ef))'
obj = re.compile(pattern)
print(obj.groupindex)
print(obj.groups)

Output:
{'word': 2, 'test': 3}
3

finditer()

Purpose: same as findall: find everything the regular expression matches

Parameters: same as findall

Return value: an iterator; each item it yields is a match object
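Because finditer yields match objects rather than strings, the position of each match is available as well. A minimal sketch:

```python
import re

# Collect the matched text together with its start and end index.
positions = [(m.group(), m.start(), m.end())
             for m in re.finditer(r'foo', 'foo,food on the table')]

print(positions)  # [('foo', 0, 3), ('foo', 4, 7)]
```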

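match(pattern, string, flags=0)

Purpose: match at the start of a string

Parameters: string: the target string

Return value: a match object if the pattern matches;

    None if it does not

search(pattern, string, flags=0)

Purpose: like match, but the match may be anywhere in the string; only the first match is returned

Parameters: string: the target string

Return value: a match object if the pattern matches;

    None if it does not

import re

obj = re.compile(r'foo')

iter_obj = obj.finditer('foo,food on the table')
for i in iter_obj:
    print(i.group())
    # print(dir(i))

# match only matches at the start
print("*********************")
try:
    m_obj = obj.match('Foo,food on the table')
    print(m_obj.group())
except AttributeError:
    print("match none")

print("*********************")
try:
    m_obj = obj.search('Foo,food on the table')
    print(m_obj.group())
except AttributeError:
    print("match none")

########################
# python3 regex2.py
<_sre.SRE_Match object; span=(0, 3), match='foo'>
<_sre.SRE_Match object; span=(4, 7), match='foo'>

(The captured output is cut off after these two lines, and it appears to come from a run that printed each match object itself rather than i.group().)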

fullmatch()

The whole target string must be matched by the regular expression.

>>> obj = re.fullmatch(r'\w+','abcd1')

>>> obj.group()

'abcd1'
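The three functions differ only in where the match is allowed to sit. The helper below is a sketch for comparing them; its name how_it_matches is made up.

```python
import re

def how_it_matches(pattern, text):
    """Report which of match, search and fullmatch succeed for pattern on text."""
    return {
        'match': re.match(pattern, text) is not None,          # must start at index 0
        'search': re.search(pattern, text) is not None,        # may start anywhere
        'fullmatch': re.fullmatch(pattern, text) is not None,  # must cover everything
    }

print(how_it_matches(r'foo', 'foo'))
# {'match': True, 'search': True, 'fullmatch': True}
print(how_it_matches(r'foo', 'food'))
# {'match': True, 'search': True, 'fullmatch': False}
print(how_it_matches(r'foo', 'a foo'))
# {'match': False, 'search': True, 'fullmatch': False}
```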

Attributes and methods of match objects

Attributes:

're' # the regular expression that was used

'pos' # the position in the target string where matching started

'endpos' # the position in the target string where matching stopped

'lastgroup' # the name of the last matched group (if it is a named group)

'lastindex' # the number of the last matched group

[root@shenzhen re]# cat regex3.py
#!/usr/local/bin/python3
import re

re_obj = re.compile('(ab)cd(?P<dog>ef)')
match_obj = re_obj.search('hi,abcdefghigk')
print('re:', match_obj.re)
print('pos:', match_obj.pos)
print('endpos:', match_obj.endpos)
print('lastgroup:', match_obj.lastgroup)
print('lastindex:', match_obj.lastindex)
print('*'*50)
print('search : ', match_obj.group())

[root@shenzhen re]# python3 regex3.py
re: re.compile('(ab)cd(?P<dog>ef)')
pos: 0
endpos: 14
lastgroup: dog
lastindex: 2
**************************************************
search :  abcdef

Methods:

'end' # the position in the string where the matched content ends

'start' # the position in the string where the matched content starts

'span' # the start and end positions of the matched content

'group' # the content matched by the match object

Parameter: defaults to 0, which returns the whole match; a value >= 1 returns the content of that subgroup

Return value: the corresponding string

'groups' # the content of all subgroups

'groupdict' # a dictionary built from all named groups

[root@shenzhen re]# cat regex3.py
#!/usr/local/bin/python3
import re

re_obj = re.compile('(ab)cd(?P<dog>ef)')
match_obj = re_obj.search('hi,abcdefghigk')
################
print('start():',match_obj.start())
print('end():',match_obj.end())
print('span():',match_obj.span())
print('group():',match_obj.group())
print('group(1):',match_obj.group(1))
print('group(2):',match_obj.group(2))
print('groups()',match_obj.groups())
print('groupdict():',match_obj.groupdict())
#print('search : ', match_obj.group())

#######################
start(): 3
end(): 9
span(): (3, 9)
group(): abcdef
group(1): ab
group(2): ef
groups() ('ab', 'ef')
groupdict(): {'dog': 'ef'}
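The positions returned by start() and end() are ordinary string indexes, so they can be used for slicing. The helper below is a small sketch; the name highlight is hypothetical.

```python
import re

def highlight(pattern, text):
    """Put square brackets around the first match of pattern in text."""
    m = re.search(pattern, text)
    if m is None:
        return text  # nothing matched: return the text unchanged
    return text[:m.start()] + '[' + m.group() + ']' + text[m.end():]

print(highlight(r'(ab)cd(ef)', 'hi,abcdefghigk'))  # 'hi,[abcdef]ghigk'
print(highlight(r'xyz', 'hi,abcdefghigk'))         # 'hi,abcdefghigk'
```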

flags: most of the matching functions called directly on re take a flags argument. Flags are bits that adjust how the regular expression matches.

The names listed by dir(re) follow these conventions:

Leading and trailing __: magic (special) methods

All uppercase: module-level constants and global variables

Capitalised first letter, lowercase afterwards: classes

All lowercase: functions, variables and methods

I, IGNORECASE # ignore case when matching

S, DOTALL # lets the . metacharacter match newline as well

M, MULTILINE # ^ and $ also match at the start and end of each line

X, VERBOSE # allows whitespace and comments inside the pattern

To use several flags at once:

re.I | re.S

[root@shenzhen re]# cat regex4.py
#!/usr/local/bin/python3
import re

re_obj = re.compile('abcd',re.I)
l = re_obj.findall('hi,abcd,ABcd, ABCD')
print(l)

s = '''hello world
nihao china
'''

l1 = re.findall('.*',s)
print(l1)
l2 = re.findall('.+',s)
print(l2)
l3 = re.findall('.*',s,re.S)
print(l3)

obj = re.search('^hello',s)
print(obj.group())
#objj = re.search('^\snihao',s,re.M).group()
#print(objj)

re_obj = re.compile('''(ab)#This is group1
                    cd
                    (?P<dog>ef)#This is group dog
                    ''',re.X)
print(re_obj.search('abcdefghi').group())

[root@shenzhen re]# python3 regex4.py
['abcd', 'ABcd', 'ABCD']
['hello world', '', 'nihao china', '', '']
['hello world', 'nihao china']
['hello world\nnihao china\n', '']
hello
abcdef
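The MULTILINE flag and the combination of flags with | can be seen in a short sketch on a two-line string (the variable names are chosen for illustration):

```python
import re

s = 'hello world\nnihao china\n'

# Without re.M, ^ matches only at the very start of the string.
first_only = re.findall(r'^\w+', s)

# With re.M, ^ and $ also match at the start and end of every line.
line_starts = re.findall(r'^\w+', s, re.M)
line_ends = re.findall(r'\w+$', s, re.M)

# Flags are combined with |: ignore case and let . cross the newline.
combined = re.findall(r'HELLO.+china', s, re.I | re.S)

print(first_only)   # ['hello']
print(line_starts)  # ['hello', 'nihao']
print(line_ends)    # ['world', 'china']
print(combined)     # ['hello world\nnihao china']
```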

Exercise 1:

import re
import time
import sys

# Match the details inside one block of text
def reg(data, port):
    pattern = r'^\S+'
    re_obj = re.compile(pattern)
    try:
        head_word = re_obj.match(data).group()
    except Exception:
        return None
    if port == head_word:
        pattern = r'address is (\w{4}\.\w{4}\.\w{4})'
        try:
            match_obj = re.search(pattern, data)
            return match_obj.group(1)
        except Exception:
            return None
    else:
        return None

def main(port):
    fd = open('1.txt', 'r')
    # skip the first two lines of the file
    fd.readline()
    fd.readline()
    while True:
        data = ''
        # read one block: lines up to the next blank line
        while True:
            s = fd.readline()
            if s == '\n':
                break
            if s == '':
                print("search over")
                return
            data += s
        # pass each block to the matching function
        result = reg(data, port)
        if result:
            print("address is :", result)
            return

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("argv error")
        sys.exit(1)
    main(sys.argv[1])
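The same idea can be tried without a file. The sketch below is not the author's script: the function name find_address is hypothetical, and the SAMPLE text is invented data, since the contents of 1.txt are not shown in the source. It assumes blocks separated by blank lines, each starting with the port name.

```python
import re

# Invented sample data; the real layout of 1.txt is an assumption.
SAMPLE = """\
Ethernet0 is up
  Hardware address is 00e0.fc12.3456

Ethernet1 is down
  Hardware address is 00e0.fc65.4321
"""

def find_address(text, port):
    """Return the address in the block whose first word is port, else None."""
    # Blocks are separated by one or more blank lines.
    for block in re.split(r'\n\s*\n', text):
        head = re.match(r'\S+', block)
        if head and head.group() == port:
            found = re.search(r'address is (\w{4}\.\w{4}\.\w{4})', block)
            return found.group(1) if found else None
    return None

print(find_address(SAMPLE, 'Ethernet1'))  # '00e0.fc65.4321'
print(find_address(SAMPLE, 'Ethernet9'))  # None
```

Splitting the text into blocks with re.split replaces the nested readline loops, at the cost of holding the whole text in memory.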

Reposted from: https://www.cnblogs.com/weizitianming/p/9362766.html
